F1 score
Also known as: F1 scores, F1-scores
Facts (27)
Sources
KG-RAG: Bridging the Gap Between Knowledge and Creativity - arXiv arxiv.org May 20, 2024 4 facts
claim: F1 Score considers both the precision and recall of predicted answers, providing a balance between the two metrics.
reference: The KG-RAG study uses Exact Match (EM) and F1 Score as standard evaluation metrics for assessing question answering systems, as established by Rajpurkar et al. (2016) in the SQuAD paper.
measurement: On the CWQ dataset, the KG-RAG pipeline achieved an Exact Match (EM) score of 19%, an F1 Score of 25%, an accuracy of 32%, and a hallucination rate of 15%.
measurement: On the CWQ dataset, the Embedding-RAG model achieved an Exact Match (EM) score of 28%, an F1 Score of 37%, an accuracy of 46%, and a hallucination rate of 30%.
Empowering GraphRAG with Knowledge Filtering and Integration arxiv.org Mar 18, 2025 5 facts
measurement: The study evaluates performance using the F1 score on the WebQSP (Yih et al., 2016) and CWQ (Talmor and Berant, 2018) datasets.
measurement: Category C, in which GraphRAG yields wrong predictions for queries the standalone LLM originally answered correctly, accounts for 16.89% of samples when evaluated via F1 score.
reference: The framework evaluates retrieval methods using Hit Rate, which measures the proportion of relevant items successfully retrieved, and F1-score, which balances precision and recall to assess retrieval quality.
measurement: Using logits to filter out low-confidence responses improves performance on both datasets: on WebQSP, the 'LLM with Logits' approach achieved a Hit rate of 84.17 and an F1 score of 76.74, versus 66.15 and 49.97 for the baseline LLM; on CWQ, it achieved a Hit rate of 61.83 and an F1 score of 58.19, versus 40.27 and 34.17.
reference: The authors categorize prediction outcomes of LLMs with versus without GraphRAG into four groups based on F1 scores: Category A (both correct), Category B (GraphRAG more accurate), Category C (LLM-only outperforms GraphRAG), and Category D (both fail).
KG-IRAG: A Knowledge Graph-Based Iterative Retrieval-Augmented ... arxiv.org Mar 18, 2025 4 facts
claim: In the KG-IRAG study, the F1 Score and Hit Rate metrics are excluded for the Q1 dataset because it involves less temporal reasoning than the Q2 and Q3 datasets.
claim: Standard evaluation metrics for Question Answering (QA) systems include Exact Match (EM), F1 Score, and Hit Rate (HR).
formula: F1 Score is calculated from precision and recall, where precision measures the correctness of retrieved information and recall assesses the proportion of target data retrieved.
procedure: In the second stage of experiments, the KG-IRAG framework is compared against Graph-RAG and KG-RAG (Sanmartin, 2024) by evaluating generated answers against true answers using Exact Match, F1 Score, and Hit Rate, while hallucinations are judged from the answers generated by the LLMs under each framework.
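For QA evaluation of this kind, EM and F1 are typically computed per answer over whitespace tokens, with precision taken over the predicted tokens and recall over the gold tokens. A minimal sketch of SQuAD-style scoring (illustrative only, not any of these papers' exact implementations; function names are my own):

```python
from collections import Counter

def token_f1(prediction: str, truth: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # correctness of what was produced
    recall = overlap / len(true_tokens)     # coverage of the gold answer
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(prediction.lower().strip() == truth.lower().strip())
```

For example, `token_f1("barack obama", "barack hussein obama")` gives precision 1.0 and recall 2/3, so F1 = 0.8; official evaluation scripts additionally strip punctuation and articles before tokenizing.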
A survey on augmenting knowledge graphs (KGs) with large ... link.springer.com Nov 4, 2024 3 facts
formula: F1-Score = 2 × (Precision × Recall) / (Precision + Recall).
claim: F1-Score combines precision and recall as their harmonic mean and is used in binary and multi-class classification tasks to provide a balanced evaluation of model performance.
claim: Evaluation metrics for Large Language Models integrated with Knowledge Graphs vary depending on the specific downstream tasks and can include accuracy, F1-score, precision, and recall.
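The harmonic-mean formula translates directly to code. A minimal sketch for the classification setting (function name is my own; the zero-denominator convention is a common choice, not mandated by the formula):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true positives, false
    positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean: F1 = 2PR / (P + R), defined as 0 when both are 0.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

For instance, `precision_recall_f1(6, 2, 4)` yields precision 0.75 and recall 0.60, hence F1 ≈ 0.667; the harmonic mean sits below the arithmetic mean whenever precision and recall differ, which is what makes F1 punish imbalance between the two.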
EdinburghNLP/awesome-hallucination-detection - GitHub github.com 2 facts
measurement: GPT-4 achieves an F1-score of approximately 0.625 in detecting subtle falsehoods on the hardest subset of the MedHallu benchmark.
reference: Evaluation benchmarks for vision-language hallucination detection and mitigation include MHaluBench, MFHaluBench, Object HalBench, AMBER, MMHal-Bench, and POPE, which use metrics such as accuracy, precision, recall, F1-score, CHAIR, Cover, Hal, and Cog.
[2502.14302] MedHallu: A Comprehensive Benchmark for Detecting ... arxiv.org Feb 20, 2025 2 facts
measurement: The best-performing model on the MedHallu benchmark achieved an F1 score as low as 0.625 for detecting 'hard' category hallucinations.
claim: Incorporating domain-specific knowledge and introducing a 'not sure' answer category improves precision and F1 scores by up to 38% relative to baselines on the MedHallu benchmark.
A Comprehensive Benchmark and Evaluation Framework for Multi ... arxiv.org Jan 6, 2026 2 facts
Detecting hallucinations with LLM-as-a-judge: Prompt ... - Datadog datadoghq.com Aug 25, 2025 2 facts
measurement: F1 scores for hallucination detection methods are consistently higher on HaluBench than on RAGTruth, suggesting that RAGTruth is a more difficult benchmark.
claim: The Datadog hallucination detection method showed the smallest drop in F1 scores between HaluBench and RAGTruth, suggesting robustness as hallucinations become harder to detect.
Enterprise AI Requires the Fusion of LLM and Knowledge Graph stardog.com Dec 4, 2024 1 fact
claim: Stardog is improving the quality of auto-mappings by utilizing Large Language Models to enhance F1 scores.
LLM Hallucination Detection and Mitigation: State of the Art in 2026 zylos.ai Jan 27, 2026 1 fact
measurement: Chain-of-Verification (CoVe) improves F1 scores by 23% (from 0.39 to 0.48) and outperforms Zero-Shot, Few-Shot, and Chain-of-Thought methods, though it does not eliminate hallucinations in complex reasoning chains.
Hybrid Fact-Checking that Integrates Knowledge Graphs, Large ... aclanthology.org 1 fact
measurement: The hybrid fact-checking pipeline developed by Kolli et al. achieves an F1 score of 0.93 on the FEVER benchmark for the Supported/Refuted split without requiring task-specific fine-tuning.