Precision
Facts (33)
Sources
EdinburghNLP/awesome-hallucination-detection - GitHub (github.com, 14 facts)
reference: The paper 'A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation' uses Precision and Recall metrics to detect sentence-level and concept-level hallucinations in ChatGPT-generated paragraphs spanning 150 topics.
reference: The AutoAIS (Attributable to Identified Sources) evaluation framework for AI systems utilizes zero-shot and fine-tuned precision, recall, and F1 metrics, as well as FActScore F1 scores based on reference factuality labels.
reference: The Q² metric evaluates factual consistency in knowledge-grounded dialogues and is compared against F1 token-level overlap, Precision and Recall, Q² w/o NLI, E2E NLI, Overlap, BERTScore, and BLEU using the WoW, Topical-Chat, and Dialogue NLI datasets.
reference: Evaluation metrics for hallucination detection include Precision, Recall, and F1, while metrics for mitigation include the ratio of self-contradiction removed, the ratio of informative facts retained, and perplexity increase.
measurement: The LARS uncertainty estimation technique is evaluated using Accuracy, Precision, Recall, and AUROC metrics on the TriviaQA, GSM8k, SVAMP, and Common-sense QA datasets.
reference: Evaluation metrics for hallucination detection include Accuracy (Acc), G-Mean, BSS, AUC, and Precision, Recall, and F1 scores for both 'Not Hallucination' and 'Hallucination' classifications.
measurement: The Directional Levy/Holt metric uses precision and recall with entity insertions and replacements to evaluate models on the Levy/Holt dataset, which consists of premise-hypothesis pairs.
reference: Evaluation metrics for AI systems include Precision, Recall, and F1 scores calculated under cross-examination strategies such as AYS, IDK, Confidence-Based, and IC-IDK.
reference: A custom fine-grained hallucination detection dataset categorizes factual hallucinations into Entity, Relation, Contradictory, Invented, Subjective, and Unverifiable types, evaluated using Precision, Recall, and F1 metrics.
reference: The ClaimDecomp dataset contains 1200 complex claims from PolitiFact, each labeled with one of six veracity labels, a justification paragraph from expert fact-checkers, and subquestions annotated by prior work, evaluated using accuracy, F1, precision, and recall.
measurement: Evaluation metrics for list-based questions on Wikidata and Wiki-Category List include test precision and the average number of positive and negative hallucination entities; MultiSpanQA uses F1, Precision, and Recall; and longform generation of biographies uses FactScore.
reference: A white-box hallucination detector treats the Large Language Model as a dynamic graph and analyzes structural properties of its internal attention mechanisms. The method extracts spectral features, specifically eigenvalues, from attention maps to predict fabrication: factual retrieval produces stable eigen-structures, while hallucination leads to diffuse, chaotic patterns. The detector operates independently of the generated semantic content and was evaluated across seven QA benchmarks (NQ-Open, TriviaQA, CoQA, SQuADv2, HaluEval-QA, TruthfulQA, GSM8K) using AUROC, Precision, Recall, and Cohen's Kappa metrics.
reference: Evaluation benchmarks for vision-language hallucination detection and mitigation include MHaluBench, MFHaluBench, Object HalBench, AMBER, MMHal-Bench, and POPE, which utilize metrics such as accuracy, precision, recall, F1-score, CHAIR, Cover, Hal, and Cog.
reference: Evaluation metrics for estimating hallucination degree include BLEU, ROUGE-L, FeQA, QuestEval, and EntityCoverage (Precision, Recall, F1).
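One fact above describes a white-box detector that extracts eigenvalues from attention maps as spectral features. A minimal, hypothetical sketch of that idea follows; the function name `attention_spectral_features`, the symmetrization step, and the top-k truncation are assumptions made here for illustration, not the paper's actual pipeline:

```python
import numpy as np

def attention_spectral_features(attn: np.ndarray, k: int = 5) -> np.ndarray:
    """Top-k eigenvalues of a (symmetrized) attention map as detector features.

    attn: (n, n) row-stochastic attention matrix for one head/layer.
    """
    sym = (attn + attn.T) / 2          # symmetrize so the spectrum is real
    eigvals = np.linalg.eigvalsh(sym)  # eigenvalues in ascending order
    return eigvals[::-1][:k]           # largest k eigenvalues as a feature vector
```

A downstream classifier would then be trained on these per-layer feature vectors to separate stable (factual) from diffuse (hallucinated) eigen-structures.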
A survey on augmenting knowledge graphs (KGs) with large ... (link.springer.com, Nov 4, 2024, 5 facts)
formula: F1-Score is calculated as 2 × (Precision × Recall) / (Precision + Recall).
claim: F1-Score combines precision and recall as their harmonic mean and is used in binary and multi-class classification tasks to provide a balanced evaluation of model performance.
formula: Precision is calculated as the ratio of True Positives to the sum of True Positives and False Positives.
claim: Evaluation metrics for Large Language Models integrated with Knowledge Graphs vary depending on the specific downstream tasks and can include accuracy, F1-score, precision, and recall.
claim: Precision is the proportion of true positive predictions among all positive predictions made by a model, measuring how accurately the model identifies relevant instances.
</claim>
KG-RAG: Bridging the Gap Between Knowledge and Creativity - arXiv (arxiv.org, May 20, 2024, 2 facts)
claim: F1 Score considers both the precision and recall of predicted answers, providing a balance between the two metrics.
claim: To evaluate the KG-RAG approach against vector RAG and no-RAG baselines, the researchers incorporated a conventional accuracy metric and introduced a modified precision metric designed to quantify the incidence of hallucinations.
Detect hallucinations for RAG-based systems - AWS (aws.amazon.com, May 16, 2025, 2 facts)
claim: The BERT score has been shown to correlate with human judgment on both sentence-level and system-level evaluation and computes precision, recall, and F1 measures for language generation tasks.
claim: For use cases where precision is the highest priority, the token similarity, LLM prompt-based, and semantic similarity methods are recommended, whereas the BERT stochastic method outperforms other methods for high recall.
Hallucinations in LLMs: Can You Even Measure the Problem? (linkedin.com, Jan 18, 2025, 1 fact)
claim: Hallucination detection methods often utilize metrics such as Recall, Precision, and K-Precision to evaluate the performance of the detector.
Large Language Models Meet Knowledge Graphs for Question ... (arxiv.org, Sep 22, 2025, 1 fact)
reference: Evaluation metrics for synthesizing Large Language Models with Knowledge Graphs for Question Answering are categorized into: (1) Answer Quality, including BERTScore (Peng et al., 2024), answer relevance (AR), hallucination (HAL) (Yang et al., 2025), accuracy matching, and human-verified completeness (Yu and McQuade, 2025); (2) Retrieval Quality, including context relevance (Es et al., 2024), faithfulness score (FS) (Yang et al., 2024), precision, context recall (Yu et al., 2024; Huang et al., 2025), mean reciprocal rank (MRR) (Xu et al., 2024), and normalized discounted cumulative gain (NDCG) (Xu et al., 2024); and (3) Reasoning Quality, including Hop-Acc (Gu et al., 2024) and reasoning accuracy (RA) (Li et al., 2025a).
Combining large language models with enterprise knowledge graphs (frontiersin.org, Aug 26, 2024, 1 fact)
perspective: AI solutions should be accompanied by a high degree of explainability, robustness, and precision to ensure that enrichment systems are transparent and reliable.
The construction and refined extraction techniques of knowledge ... (nature.com, Feb 10, 2026, 1 fact)
procedure: The BERTScore evaluation method proceeds in four steps: (1) map the words of the generated text and reference text to the embedding space to obtain word vectors, (2) calculate the cosine similarity for each word pair between the generated and reference texts to form a similarity matrix, (3) calculate Precision (P) as the average similarity of each word vector in the generated text to the most similar word vector in the reference text, and (4) calculate Recall (R) as the average similarity of each word vector in the reference text to the most similar word vector in the generated text.
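The four-step procedure above can be sketched as follows. This is a toy illustration assuming the word vectors have already been produced (real BERTScore obtains contextual embeddings from a pretrained BERT model; the function name `bertscore_pr` is chosen here):

```python
import numpy as np

def bertscore_pr(gen_vecs: np.ndarray, ref_vecs: np.ndarray):
    """BERTScore-style precision/recall via greedy cosine matching.

    gen_vecs: (m, d) word vectors of the generated text  (step 1)
    ref_vecs: (n, d) word vectors of the reference text   (step 1)
    """
    g = gen_vecs / np.linalg.norm(gen_vecs, axis=1, keepdims=True)
    r = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    sim = g @ r.T                       # (m, n) cosine-similarity matrix (step 2)
    precision = sim.max(axis=1).mean()  # best reference match per generated word (step 3)
    recall = sim.max(axis=0).mean()     # best generated match per reference word (step 4)
    return precision, recall
```

An F1 can then be formed from the two scores as their harmonic mean, as in the standard BERTScore formulation.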
LLM-Powered Knowledge Graphs for Enterprise Intelligence and ... (arxiv.org, Mar 11, 2025, 1 fact)
measurement: The evaluation of the LLM-knowledge graph framework demonstrated high performance in metrics including NDCG, precision, recall, and user satisfaction, with notable improvements in prioritization accuracy and expert identification.
A Comprehensive Benchmark and Evaluation Framework for Multi ... (arxiv.org, Jan 6, 2026, 1 fact)
claim: Classical metrics, including Precision, Recall, Accuracy, and F1-score, are used to quantify performance in the study.
[2502.14302] MedHallu: A Comprehensive Benchmark for Detecting ... (arxiv.org, Feb 20, 2025, 1 fact)
claim: Incorporating domain-specific knowledge and introducing a 'not sure' category as one of the answer categories improves precision and F1 scores by up to 38% relative to baselines in the MedHallu benchmark.
KG-IRAG: A Knowledge Graph-Based Iterative Retrieval-Augmented ... (arxiv.org, Mar 18, 2025, 1 fact)
formula: F1 Score is calculated based on precision and recall, where precision measures the correctness of retrieved information and recall assesses the proportion of target data retrieved.
Benchmarking Hallucination Detection Methods in RAG - Cleanlab (cleanlab.ai, Sep 30, 2024, 1 fact)
claim: Cleanlab evaluates popular hallucination detectors across four public Retrieval-Augmented Generation (RAG) datasets using precision and recall metrics.
Medical Hallucination in Foundation Models and Their ... (medrxiv.org, Mar 3, 2025, 1 fact)
procedure: QuestEval, proposed by Scialom et al. (2021), combines recall and precision by generating questions from both the source and the summary, and introduces a weighting mechanism for key information to improve evaluation robustness.