concept

Exact Match

Also known as: EM

Facts (24)

Sources
KG-RAG: Bridging the Gap Between Knowledge and Creativity - arXiv (arxiv.org), May 20, 2024 - 6 facts

reference: The KG-RAG study uses Exact Match (EM) and F1 Score as standard evaluation metrics for assessing question answering systems, as established by Rajpurkar et al. (2016) in the SQuAD paper.
measurement: On the CWQ dataset, the KG-RAG pipeline achieved an Exact Match (EM) score of 19%, an F1 Score of 25%, an accuracy of 32%, and a hallucination rate of 15%.
formula: Exact Match (EM) calculates the percentage of predicted answers that exactly match the ground truth answers.
measurement: Human benchmarks on the CWQ dataset achieved an Exact Match (EM) score of 63%.
measurement: On the CWQ dataset, the Embedding-RAG model achieved an Exact Match (EM) score of 28%, an F1 Score of 37%, an accuracy of 46%, and a hallucination rate of 30%.
measurement: On the CWQ dataset, the MHQA-GRN model achieved an Exact Match (EM) score of 33.2%.
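The EM and F1 definitions cited in the facts above (Rajpurkar et al., 2016) can be sketched in Python. This is a minimal sketch, not the official SQuAD evaluation script; the normalization (lowercasing, stripping punctuation and articles) follows the SQuAD conventions, and the example answers are invented for illustration.

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(ground_truth))

def f1(prediction, ground_truth):
    """Token-level F1 between a prediction and a ground-truth answer."""
    pred_tokens = normalize(prediction).split()
    gt_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

# Invented example answers, purely for illustration.
preds = ["Dublin", "the River Liffey", "1922"]
golds = ["dublin", "River Liffey", "1916"]
em = 100.0 * sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(preds)
print(f"EM = {em:.1f}%")  # 2 of 3 exact matches -> EM = 66.7%
```

Note that EM is all-or-nothing per question, which is why the reported EM scores (e.g. 19% for KG-RAG on CWQ) sit below the corresponding F1 scores: F1 grants partial credit for token overlap.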
EdinburghNLP/awesome-hallucination-detection - GitHub (github.com) - 5 facts

reference: SQuAD, Natural Questions, and MuSiQue are datasets that use F1 and Exact Match metrics for classification and token-level evaluation.
reference: The study 'When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories' uses Exact Match and Accuracy as metrics, and draws on QA datasets with long-tail entities including PopQA, EntityQuestions, and NQ.
reference: The paper 'Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources' uses Exact Match as an evaluation metric and evaluates performance on the FEVER and Adversarial HotpotQA datasets.
measurement: The 'Monitoring Decoding' framework uses Exact Match (TriviaQA, NQ-Open), Truth/Info/Truth×Info scores (TruthfulQA), Accuracy (GSM8K), Latency (ms/token), and Throughput (token/s) as evaluation metrics.
reference: Evaluation metrics for the FactCC, Polytope, SummEval, Legal Contracts, and RCT datasets include EM (Exact Match) and Memorisation ratio.
KG-IRAG: A Knowledge Graph-Based Iterative Retrieval-Augmented ... - arXiv (arxiv.org), Mar 18, 2025 - 3 facts

claim: Standard evaluation metrics for Question Answering (QA) systems include Exact Match (EM), F1 Score, and Hit Rate (HR).
procedure: In the second stage of experiments, the KG-IRAG framework is compared against Graph-RAG and KG-RAG (Sanmartin, 2024) by evaluating generated answers against true answers using Exact Match, F1 Score, and Hit Rate, while hallucinations are judged from the answers each framework's LLMs generate.
procedure: The first round of experiments in the KG-IRAG study tests four LLMs on three QA datasets (weatherQA-Irish, weatherQA-Sydney, and trafficQA-TFNSW) using three data formats: raw data (table) format, text data (converted into text descriptions), and triplet format (KG structure). To minimize irrelevant information, input prompts are restricted to the questions and the least amount of necessary data, with final answers compared against correct answers using Exact Match (EM) values.
Large Language Models Meet Knowledge Graphs for Question ... - arXiv (arxiv.org), Sep 22, 2025 - 2 facts

reference: The LPKG method, proposed by Wang et al. in 2024, involves Planning LLM Tuning, Inference, and Execution using GPT-3.5-Turbo, CodeQwen1.5-7B-Chat, and Llama-3-8B-Instruct models with dataset-inherent knowledge graphs (Wikidata) and Wikidata15K for KGQA and multi-hop QA, evaluated using EM, P, and R metrics on the HotpotQA, 2WikiMQA, Bamboogle, MuSiQue, and CLQA-Wiki datasets.
reference: The InteractiveKBQA method, proposed by Xiong et al. in 2024, uses multi-turn interaction for observation and thinking with GPT-4-Turbo, Mistral-7B, and Llama-2-13B models and the Freebase, Wikidata, and Movie KG knowledge graphs for KBQA and domain-specific QA, evaluated using F1, Hits@1, EM, and Acc metrics on the WQSP, CWQ, KQA Pro, and MetaQA datasets.
A survey on augmenting knowledge graphs (KGs) with large ... - Springer (link.springer.com), Nov 4, 2024 - 2 facts

claim: Exact Match (EM) is the proportion of predictions that match the reference exactly, used in tasks requiring precise matching such as closed-book question answering.
formula: Exact Match is calculated as the number of exactly matching predictions (P_EM) divided by the total number of predictions (N_pred).
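Written out with the survey's notation (P_EM for the count of exact matches, N_pred for the total number of predictions, and, as an added convention here, \hat{y}_i and y_i for the i-th prediction and reference), the formula is:

```latex
\mathrm{EM} \;=\; \frac{P_{\mathrm{EM}}}{N_{\mathrm{pred}}}
\;=\; \frac{1}{N_{\mathrm{pred}}} \sum_{i=1}^{N_{\mathrm{pred}}} \mathbf{1}\!\left[\hat{y}_i = y_i\right]
```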