concept

factual consistency evaluation

Also known as: factual consistency

Facts (18)

Sources
Survey and analysis of hallucinations in large language models frontiersin.org Frontiers Sep 29, 2025 7 facts
Claim: Grounded pretraining and fine-tuning improves factual consistency by integrating knowledge sources or fact-labeled datasets during the pretraining or fine-tuning stages, as noted by Zhang et al. (2023).
Reference: Kazemi et al. (2023) introduced CoHS, a dataset designed for evaluating the factual consistency of summaries.
Claim: Traditional automatic metrics such as BLEU, ROUGE, and METEOR are inadequate for assessing factual consistency in large language models, according to Maynez et al. (2020).
Reference: Liu et al. (2023) conducted a survey of methods for evaluating the factual consistency of large language models.
Reference: Fabbri et al. (2022) introduced QAFactEval, a QA-based factual consistency evaluation method for summarization, presented at the 2022 Conference of the North American Chapter of the Association for Computational Linguistics.
Claim: Automatic metrics such as BLEU and ROUGE fail to capture factual consistency and reliability in large language models, according to Maynez et al. (2020).
Reference: CoHS (Kazemi et al., 2023) and QAFactEval (Fabbri et al., 2022) are benchmarks that focus on factual consistency in summarization tasks.
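The QA-based evaluation idea behind QAFactEval can be sketched minimally: generate questions from the summary, answer them from both the summary and the source document, and score answer agreement. A real system uses trained question-generation and QA models; in the sketch below both are stubbed, and the answer pairs are purely illustrative assumptions. Only the token-level F1 agreement metric is implemented.

```python
from collections import Counter

def token_f1(a, b):
    """Token-level F1 between two answer strings (a standard QA agreement metric)."""
    ta, tb = a.lower().split(), b.lower().split()
    common = sum((Counter(ta) & Counter(tb)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)

def qa_consistency(answer_pairs):
    """Average agreement over (answer_from_summary, answer_from_source) pairs."""
    return sum(token_f1(a, b) for a, b in answer_pairs) / len(answer_pairs)

# Hypothetical outputs of the (stubbed) QG/QA stage for one summary:
pairs = [
    ("Paris", "Paris"),  # fact supported by the source
    ("2019", "2021"),    # hallucinated date: answers diverge
]
score = qa_consistency(pairs)  # low agreement signals factual inconsistency
```

A faithful summary yields identical answers from both texts and scores near 1.0; each hallucinated fact pulls the average down.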
vectara/hallucination-leaderboard - GitHub github.com Vectara 2 facts
Claim: The Vectara hallucination leaderboard does not evaluate summarization quality; it focuses exclusively on the factual consistency of the summaries the models produce.
Claim: The evaluation protocol used by the Vectara hallucination leaderboard builds on a large body of existing academic work on factual consistency.
EdinburghNLP/awesome-hallucination-detection - GitHub github.com GitHub 2 facts
Reference: The Q² metric evaluates factual consistency in knowledge-grounded dialogues and is compared against token-level F1 overlap, Precision and Recall, Q² without NLI, E2E NLI, Overlap, BERTScore, and BLEU on the WoW, Topical-Chat, and Dialogue NLI datasets.
Measurement: Evaluation of factual consistency in summaries uses BERT-Precision and FactKB as metrics, drawing on the CNN-DM and XSUM datasets for summarization and MemoTrap and NQ-Swap for knowledge conflicts.
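NLI-style metrics such as Q² and E2E NLI treat each summary (or dialogue-response) sentence as a hypothesis to be entailed by the source. A real implementation calls a trained NLI model; the placeholder below is an assumption-laden proxy that counts a hypothesis as entailed only when all of its words appear in the premise, just to make the scoring loop concrete.

```python
def entailed(premise, hypothesis):
    """Toy entailment proxy: every hypothesis word must occur in the premise.
    A real metric would query a trained NLI model here instead."""
    premise_words = set(premise.lower().split())
    return all(w in premise_words for w in hypothesis.lower().split())

def nli_consistency(source, summary_sentences):
    """Fraction of summary sentences entailed by the source document."""
    flags = [entailed(source, s) for s in summary_sentences]
    return sum(flags) / len(flags)

source = "the model was trained on english news articles"
sentences = [
    "the model was trained on news articles",    # supported by the source
    "the model was trained on french articles",  # unsupported (hallucination)
]
score = nli_consistency(source, sentences)  # one of two sentences entailed
```

Swapping the word-overlap stub for an NLI model's entailment probability turns this into the kind of entailment-fraction score these metrics report.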
Re-evaluating Hallucination Detection in LLMs - arXiv arxiv.org arXiv Aug 13, 2025 2 facts
Claim: Sophisticated metrics, including BERTScore, BLEU, and UniEval-fact, show substantial disagreement with judgments from strong LLM-based evaluators, indicating limitations in capturing factual consistency.
Reference: The paper 'TRUE: Re-evaluating Factual Consistency Evaluation' by Honovich et al. (2022) proposes the TRUE benchmark for re-evaluating factual consistency in language models, published as an arXiv preprint.
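The failure mode behind these findings is easy to demonstrate: an overlap metric rewards shared n-grams, so a summary that flips a single critical word retains a near-perfect score. The toy ROUGE-1 recall below (unigram overlap; the example sentences are invented) shows one entity swap barely moving the number.

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """Toy ROUGE-1 recall: fraction of reference unigrams found in the candidate."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())
    return overlap / sum(ref.values())

reference = "the company reported a profit of 3 million dollars in 2020"
hallucinated = "the company reported a loss of 3 million dollars in 2020"

# One word flipped ("profit" -> "loss") reverses the meaning entirely,
# yet 10 of 11 reference unigrams still match, so the score stays high.
score = rouge1_recall(reference, hallucinated)
```

This is why factual consistency needs dedicated evaluators (QA-based, NLI-based, or LLM judges) rather than surface-overlap metrics alone.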
Practices, opportunities and challenges in the fusion of knowledge ... frontiersin.org Frontiers 2 facts
Reference: Luo et al. (2024) evaluated the factual consistency of summarization in the era of large language models in the journal Expert Systems with Applications.
Reference: Honovich et al. (2022) proposed TRUE, a framework for re-evaluating factual consistency evaluation in language models.
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... arxiv.org arXiv Feb 23, 2026 1 fact
Reference: The paper 'Evaluating the factual consistency of abstractive text summarization' proposes methods for assessing factual consistency in summarization models.
A framework to assess clinical safety and hallucination rates of LLMs ... nature.com Nature May 13, 2025 1 fact
Reference: The Vectara Hallucination Leaderboard, maintained by Vectara, Inc. since 2023, compares large language model performance in maintaining factual consistency when summarizing sets of facts.
What Really Causes Hallucinations in LLMs? - AI Exploration Journey aiexpjourney.substack.com AI Innovations and Insights Sep 12, 2025 1 fact
Claim: Large language models may hallucinate because their architecture is incapable of learning certain patterns, such as identifying impossible trigrams, which prevents the model from maintaining factual consistency.