accuracy
Facts (43)
Sources
EdinburghNLP/awesome-hallucination-detection - GitHub github.com 13 facts
Claim: Accuracy is a metric used for evaluating QA, dialogue, and summarization tasks in AI systems.
Claim: Per-topic and average accuracy are metrics used for evaluating AI systems.
Reference: The SAC^3 method for reliable hallucination detection in black-box language models uses accuracy and AUROC as metrics for classification QA and open-domain QA, drawing on datasets including the prime-number and senator-search sets from Snowball Hallucination, HotpotQA, and NQ-Open.
Measurement: ProMaC is evaluated using Accuracy (detection) and ROUGE (correction) metrics on the SummEval, QAGS-C, and QAGS-X datasets.
Reference: The study 'When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories' uses Exact Match and Accuracy as metrics on QA datasets with long-tail entities, including PopQA, EntityQuestions, and NQ.
Measurement: The LARS uncertainty estimation technique is evaluated using Accuracy, Precision, Recall, and AUROC on the TriviaQA, GSM8k, SVAMP, and CommonsenseQA datasets.
Reference: Evaluation metrics for hallucination detection include Accuracy (Acc), G-Mean, BSS, AUC, and Precision, Recall, and F1 scores for both the 'Not Hallucination' and 'Hallucination' classes.
Reference: The ClaimDecomp dataset contains 1,200 complex claims from PolitiFact, each labeled with one of six veracity labels, a justification paragraph from expert fact-checkers, and subquestions annotated by prior work; it is evaluated using accuracy, F1, precision, and recall.
Measurement: Causal Faithfulness (CaF) includes the variants CaF(M), CaF(T), and CaF(L), and is used alongside Accuracy, CC-SHAP, CFF, and Plausibility to evaluate faithfulness in models such as Gemma-2 on the CoS-E, e-SNLI, and ComVE datasets.
Measurement: The 'Monitoring Decoding' framework uses Exact Match (TriviaQA, NQ-Open), Truth/Info/Truth×Info scores (TruthfulQA), Accuracy (GSM8K), latency (ms/token), and throughput (tokens/s) as evaluation metrics.
Reference: The XEnt metric suite evaluates hallucination and factuality in AI systems using Accuracy, F1, ROUGE, the percentage of novel n-grams, and faithfulness metrics including %ENFS, FEQA, and DAE.
Reference: Evaluation benchmarks for vision-language hallucination detection and mitigation include MHaluBench, MFHaluBench, Object HalBench, AMBER, MMHal-Bench, and POPE, which use metrics such as accuracy, precision, recall, F1-score, CHAIR, Cover, Hal, and Cog.
Claim: AUROC, PCC, and accuracy are metrics used for evaluation on the TruthfulQA benchmark.
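Several facts above cite AUROC as a hallucination-detection metric. As an illustrative sketch (not any cited paper's implementation), AUROC can be computed directly as the probability that a randomly chosen positive example receives a higher detector score than a randomly chosen negative one:

```python
def auroc(scores, labels):
    """ROC-AUC as a rank statistic: the probability that a random positive
    example scores higher than a random negative one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical detector scores: higher = more likely a hallucination.
scores = [0.9, 0.7, 0.6, 0.1]
labels = [1, 0, 1, 0]  # 1 = hallucination, 0 = faithful
print(auroc(scores, labels))  # → 0.75
```

This pairwise form makes clear why AUROC is threshold-free: it depends only on the ranking of scores, not on any cutoff.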
KG-RAG: Bridging the Gap Between Knowledge and Creativity - arXiv arxiv.org May 20, 2024 5 facts
Claim: Accuracy in the KG-RAG evaluation framework is defined as the proportion of answers that have any overlapping word with the ground-truth answer, where the overlap function checks for any common word between the predicted and the correct answer.
Measurement: On the CWQ dataset, the KG-RAG pipeline achieved an Exact Match (EM) score of 19%, an F1 score of 25%, an accuracy of 32%, and a hallucination rate of 15%.
Measurement: On the CWQ dataset, the Embedding-RAG model achieved an Exact Match (EM) score of 28%, an F1 score of 37%, an accuracy of 46%, and a hallucination rate of 30%.
Claim: The KG-RAG framework ensures that the next generation of language model applications performs exceptionally across various domains while adhering to high standards of reliability and accuracy.
Claim: To evaluate the KG-RAG approach against vector-RAG and no-RAG baselines, the researchers used a conventional accuracy metric and introduced a modified precision metric designed to quantify the incidence of hallucinations.
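The overlap-based accuracy defined above can be sketched in a few lines. The function names and whitespace tokenization here are assumptions for illustration, not the paper's actual implementation:

```python
def has_overlap(predicted: str, correct: str) -> bool:
    """True if the prediction shares at least one word with the ground truth."""
    return bool(set(predicted.lower().split()) & set(correct.lower().split()))

def overlap_accuracy(predictions, answers):
    """Proportion of predictions that overlap with their ground-truth answer."""
    hits = sum(has_overlap(p, a) for p, a in zip(predictions, answers))
    return hits / len(predictions)

# Toy data (not the CWQ results reported above):
preds = ["the eiffel tower", "london", "1969"]
golds = ["eiffel tower", "paris", "1969"]
print(overlap_accuracy(preds, golds))  # 2 of 3 overlap → 0.666...
```

Note how lenient this metric is: a prediction like "Paris, France" against gold "Paris" counts as correct, which explains why the reported accuracy figures sit well above the corresponding Exact Match scores.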
Track: Poster Session 3 - aistats 2026 virtual.aistats.org 4 facts
claimThe "Accuracy-on-the-line" phenomenon in machine learning describes a positive correlation between a model's in-distribution (ID) and out-of-distribution (OOD) accuracy across different hyperparameters and data configurations.
claimScaling to larger datasets does not mitigate the "Accuracy-on-the-wrong-line" phenomenon and may exacerbate the negative correlation between ID and OOD accuracy.
claimChristina Baek, Aditi Raghunathan, and Zico Kolter proved a lower bound on the residual of the correlation between in-distribution versus out-of-distribution agreement that grows proportionally with the residual of accuracy.
claimThe "Accuracy-on-the-wrong-line" phenomenon occurs when noisy data, nuisance features, or spurious (shortcut) features cause ID and OOD accuracy to become negatively correlated, shattering the standard Accuracy-on-the-line relationship.
Construction of Knowledge Graphs: State and Challenges - arXiv arxiv.org 3 facts
Claim: Correctness in a knowledge graph implies both validity of information (accuracy) and consistency, meaning each entity, concept, relation, and property is canonicalized with a unique identifier and included exactly once.
Claim: Accuracy in a knowledge graph indicates the correctness of facts, including type, value, and relation correctness, and can be separated into syntactic accuracy (assessing wrong value datatype/format) and semantic accuracy (assessing wrong information).
Reference: Wang et al. identified six main quality dimensions for knowledge graph evaluation: accuracy, consistency, timeliness, completeness, trustworthiness, and availability.
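The syntactic/semantic split above can be illustrated with a minimal format check. The property names and patterns here are a hypothetical schema, not from the survey:

```python
import re

# Hypothetical per-property format constraints for literal values.
EXPECTED_FORMAT = {
    "birth_year": re.compile(r"^\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def syntactically_accurate(prop: str, value: str) -> bool:
    """Syntactic accuracy: does the literal match its property's datatype/format?"""
    pattern = EXPECTED_FORMAT.get(prop)
    return bool(pattern and pattern.match(value))

print(syntactically_accurate("birth_year", "1984"))   # → True
print(syntactically_accurate("birth_year", "84 AD"))  # → False
```

A value like "3024" would pass this syntactic check yet still be semantically wrong; semantic accuracy requires verifying the fact against a trusted source, not just its format.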
A survey on augmenting knowledge graphs (KGs) with large ... link.springer.com Nov 4, 2024 2 facts
Formula: Accuracy is a metric used to evaluate large language models integrated with knowledge graphs by measuring the proportion of correctly predicted instances out of the total instances, calculated as Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives.
Claim: Evaluation metrics for large language models integrated with knowledge graphs vary depending on the specific downstream tasks and can include accuracy, F1-score, precision, and recall.
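The accuracy formula above, together with the precision, recall, and F1 metrics mentioned alongside it, all derive from the same four confusion-matrix counts. A minimal sketch with illustrative counts:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that are found."""
    return tp / (tp + fn)

def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Illustrative confusion-matrix counts (100 instances total):
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # → 0.85
print(round(f1(tp=40, fp=5, fn=10), 4))     # → 0.8421
```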
The construction and refined extraction techniques of knowledge ... nature.com Feb 10, 2026 2 facts
Claim: The knowledge graph construction framework utilizes semantic consistency checks and data fusion techniques to explore latent information within data, enhancing the accuracy and comprehensiveness of the graph.
Claim: The KRN decision circuit applies lightweight pruning to reduce latency while preserving accuracy.
Medical Hallucination in Foundation Models and Their ... medrxiv.org Mar 3, 2025 2 facts
Claim: Survey respondents prioritized enhancing accuracy (12 mentions), explainability (10), ethical considerations including bias reduction and privacy (8), integration with existing tools (7), and improving speed and efficiency (3) as future priorities for AI improvement.
Measurement: Survey respondents identified lack of domain-specific knowledge (30 mentions) as the most critical limitation of AI/LLMs, followed by privacy and data security concerns (25), accuracy issues (24), lack of standardization/validation of AI tools (23), difficulty in explaining AI decisions (21), and ethical considerations (20).
Large Language Models Meet Knowledge Graphs for Question ... arxiv.org Sep 22, 2025 1 fact
Reference: The EFSUM method, proposed by Ko et al. in 2024, performs KG fact summarization and uses KG helpfulness and faithfulness filters with GPT-3.5-Turbo, Flan-T5-XL, and Llama-2-7B-Chat models and dataset-inherent knowledge graphs (Freebase, Wikidata) for KGQA and multi-hop QA, evaluated using Accuracy (Acc) on the WQSP and Mintaka datasets.
LLM Hallucination Detection and Mitigation: State of the Art in 2026 zylos.ai Jan 27, 2026 1 fact
Claim: Detection and verification of LLM hallucinations introduce latency, creating a trade-off between accuracy and system performance.
Phare LLM Benchmark: an analysis of hallucination in ... giskard.ai Apr 30, 2025 1 fact
Claim: Giskard researchers observe that large language models prioritize brevity over accuracy when constrained by system instructions to be concise, because effective rebuttals of false information generally require longer explanations.
Practices, opportunities and challenges in the fusion of knowledge ... frontiersin.org 1 fact
Reference: The book 'Improving Data Quality: Consistency and Accuracy' discusses methods and principles for enhancing data quality, specifically focusing on consistency and accuracy.
A Comprehensive Benchmark and Evaluation Framework for Multi ... arxiv.org Jan 6, 2026 1 fact
Claim: Classical metrics, including Precision, Recall, Accuracy, and F1-score, are used to quantify performance in the study.
Knowledge Graphs: Opportunities and Challenges - Springer Nature link.springer.com Apr 3, 2023 1 fact
Claim: Utilizing rich additional information to improve the accuracy of knowledge graph embeddings remains a significant challenge.
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... arxiv.org Feb 23, 2026 1 fact
Claim: Many existing hallucination benchmarks rely on one-dimensional metrics such as Accuracy, Accept/Refusal rates, BLEU, and BERTScore, which limits the interpretability of results and obscures the underlying causes of large language model performance issues.
Construction of intelligent decision support systems through ... - Nature nature.com Oct 10, 2025 1 fact
Measurement: The IKEDS framework achieves up to a 24.3% improvement in accuracy for decisions involving multiple interconnected concepts, due to its ability to reason explicitly about relationships between concepts.