F1
Facts (17)
Sources
EdinburghNLP/awesome-hallucination-detection - GitHub (github.com), 15 facts
reference: The AutoAIS (Attributable to Identified Sources) evaluation framework for AI systems uses zero-shot and fine-tuned precision, recall, and F1 metrics, as well as FActScore F1 scores based on reference factuality labels.
reference: The Q² metric evaluates factual consistency in knowledge-grounded dialogues and is compared against token-level F1 overlap, Precision and Recall, Q² w/o NLI, E2E NLI, Overlap, BERTScore, and BLEU on the WoW, Topical-Chat, and Dialogue NLI datasets.
reference: Evaluation metrics for hallucination detection include Precision, Recall, and F1, while metrics for mitigation include the ratio of self-contradictions removed, the ratio of informative facts retained, and the increase in perplexity.
claim: Hallucination detection metrics measure either the degree of hallucination in generated responses relative to given knowledge or their overlap with gold faithful responses; examples include Critic, Q² (F1, NLI), BERTScore, F1, BLEU, and ROUGE.
reference: Evaluation metrics for hallucination detection include Accuracy (Acc), G-Mean, BSS, AUC, and Precision, Recall, and F1 scores for both the 'Not Hallucination' and 'Hallucination' classes.
measurement: The Concept-7 dataset is used to evaluate hallucinatory instruction classification with metrics including AUC, ACC, F1, and PEA.
reference: Evaluation metrics for AI systems include Precision, Recall, and F1 scores computed under cross-examination strategies such as AYS, IDK, Confidence-Based, and IC-IDK.
reference: A custom fine-grained hallucination detection dataset categorizes factual hallucinations into Entity, Relation, Contradictory, Invented, Subjective, and Unverifiable types, evaluated with Precision, Recall, and F1.
reference: The ClaimDecomp dataset contains 1,200 complex claims from PolitiFact, each labeled with one of six veracity labels, a justification paragraph from expert fact-checkers, and subquestions annotated by prior work; it is evaluated using accuracy, F1, precision, and recall.
measurement: Evaluation of list-based questions on Wikidata and Wiki-Category List uses test precision and the average number of positive and negative hallucinated entities; MultiSpanQA uses F1, Precision, and Recall; and long-form biography generation uses FactScore.
reference: SCALE is a proposed hallucination detection metric compared against Q², ANLI, SummaC, F1, BLEURT, QuestEval, BARTScore, and BERTScore.
reference: The XEnt metric suite evaluates hallucination and factuality in AI systems using Accuracy, F1, ROUGE, the percentage of novel n-grams, and faithfulness metrics including %ENFS, FEQA, and DAE.
measurement: Evaluation of generation tasks uses Perplexity, Unigram Overlap (F1), BLEU-4, ROUGE-L, Knowledge F1, and Rare F1 as metrics, on datasets including WoW and CMU Document Grounded Conversations (CMU_DoG), with the KILT Wikipedia dump as the knowledge source.
measurement: Evaluation of faithfulness between predicted responses and ground-truth knowledge uses Critic, Q², BERT F1, and F1 as metrics, on datasets including Wizard-of-Wikipedia (WoW), the DSTC9 and DSTC11 extensions of MultiWoZ 2.1, and FaithDial.
reference: Evaluation metrics for estimating the degree of hallucination include BLEU, ROUGE-L, FeQA, QuestEval, and EntityCoverage (Precision, Recall, F1).
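Several of the facts above report Precision, Recall, and F1 over sets of extracted items (e.g., EntityCoverage). A minimal sketch of the set-based computation, assuming entities have already been extracted from the generated and reference texts (the function name and example entities are illustrative, not taken from any cited paper):

```python
def set_prf(predicted: set, gold: set) -> tuple:
    """Precision, recall, and F1 between two sets (e.g., extracted entities)."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    tp = len(predicted & gold)        # items present in both sets
    precision = tp / len(predicted)   # fraction of predicted items that are correct
    recall = tp / len(gold)           # fraction of gold items that were recovered
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Generated text mentions {Paris, Berlin, Rome}; reference mentions {Paris, Rome, Madrid}:
p, r, f = set_prf({"Paris", "Berlin", "Rome"}, {"Paris", "Rome", "Madrid"})
```

Here two of three predicted entities are correct and two of three gold entities are recovered, so precision, recall, and F1 all equal 2/3.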
Evaluating RAG applications with Amazon Bedrock knowledge base ... aws.amazon.com Mar 14, 2025 1 fact
claim: Metrics such as ROUGE and F1 can be misleading because they rely on shallow linguistic similarity (word overlap) between the ground truth and the LLM response, even when the actual meanings differ.
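The shallow-overlap failure mode described in the claim above is easy to reproduce: token-level F1 between a reference and a response stating the opposite can still be high. A minimal sketch using whitespace tokenization only (the sentences are illustrative):

```python
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Token-level unigram-overlap F1 using multiset intersection."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = "the model is factually accurate"
cand = "the model is not factually accurate"
score = token_f1(ref, cand)  # high overlap F1 despite opposite meaning
```

All five reference tokens appear in the candidate (recall 1.0, precision 5/6), giving F1 = 10/11 ≈ 0.91 for a response that contradicts the reference.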
Large Language Models Meet Knowledge Graphs for Question ... (arxiv.org, Sep 22, 2025), 1 fact
reference: The InteractiveKBQA method, proposed by Xiong et al. (2024), uses Multi-turn Interaction for Observation and Thinking with the GPT-4-Turbo, Mistral-7B, and Llama-2-13B models and the Freebase, Wikidata, and Movie KG knowledge graphs for KBQA and domain-specific QA, evaluated with F1, Hits@1, EM, and Acc on the WQSP, CWQ, KQA Pro, and MetaQA datasets.
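The EM and Hits@1 metrics in the last fact are typically computed per question and averaged. A generic sketch under common conventions (the normalization choices here are an assumption, not taken from the InteractiveKBQA paper, and benchmarks vary on this point):

```python
def normalize(answer: str) -> str:
    # Typical minimal normalization: lowercase and strip surrounding whitespace.
    return answer.strip().lower()

def exact_match(prediction: str, gold: str) -> bool:
    """EM: the prediction equals the gold answer after normalization."""
    return normalize(prediction) == normalize(gold)

def hits_at_1(ranked_predictions: list, gold_answers: set) -> bool:
    """Hits@1: the top-ranked prediction appears in the gold answer set."""
    if not ranked_predictions:
        return False
    gold = {normalize(g) for g in gold_answers}
    return normalize(ranked_predictions[0]) in gold

em = exact_match("Barack Obama", "barack obama")      # True: matches after normalization
h1 = hits_at_1(["Biden", "Obama"], {"Barack Obama"})  # False: top answer not in gold set
```

Per-question booleans like these are averaged over the dataset to report the EM and Hits@1 scores.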