TruthfulQA
Facts (38)
Sources
Survey and analysis of hallucinations in large language models (frontiersin.org, Sep 29, 2025; 16 facts)
claim: LLaMA 2 frequently hallucinates on the TruthfulQA benchmark, for example by incorrectly responding that swallowed chewing gum stays in the stomach for seven years, reflecting popular misconceptions rather than factual grounding.
measurement: GPT-4 demonstrated a hallucination rate reduction of approximately 15% compared to LLaMA 2 on the TruthfulQA benchmark.
measurement: The aggregated hallucination rates for GPT-4 are 14.3% on TruthfulQA, 9.8% on HallucinationEval, and 4.7% on QAFactEval.
procedure: The authors of "Survey and analysis of hallucinations in large language models" conducted controlled experiments on multiple large language models (GPT-4, LLaMA 2, DeepSeek, Qwen) using standardized hallucination evaluation benchmarks, specifically TruthfulQA, HallucinationEval, and RealToxicityPrompts.
reference: Lin et al. (2022) developed TruthfulQA, a benchmark for measuring how language models mimic human falsehoods.
reference: Benchmarks such as TruthfulQA (Lin et al., 2022), HallucinationEval (Wu et al., 2023), and RealToxicityPrompts (Gehman et al., 2020) were introduced to assess hallucination bias across models and tasks.
reference: Existing benchmarks for evaluating hallucinations in large language models include TruthfulQA (Lin et al., 2022), HallucinationEval (Wu et al., 2023), QAFactEval (Fabbri et al., 2022), and CohS (Kazemi et al., 2023).
claim: Models with higher prompt sensitivity (PS) and model variance (MV) metrics generally performed worse on factuality benchmarks such as TruthfulQA (Lin et al., 2022) and HallucinationEval (Wu et al., 2023), while low-MV models such as GPT-4 achieved better TruthfulQA scores.
claim: GPT-4 avoids factual hallucinations on the TruthfulQA benchmark by using nuanced, cautious phrasing, a strategy likely derived from reinforcement learning from human feedback (RLHF).
reference: TruthfulQA (Lin et al., 2022) is a benchmark that evaluates whether large language models produce answers that mimic human false beliefs.
claim: The study "Survey and analysis of hallucinations in large language models" used three primary datasets to analyze hallucination patterns: TruthfulQA, HallucinationEval, and QAFactEval.
measurement: The aggregated hallucination rates for DeepSeek are 22.5% on TruthfulQA, 21.4% on HallucinationEval, and 20.1% on QAFactEval.
measurement: The aggregated hallucination rates for LLaMA 2 are 31.2% on TruthfulQA, 27.6% on HallucinationEval, and 24.8% on QAFactEval.
reference: The study used the TruthfulQA dataset (Lin et al., 2022), a multiple-choice question-answering dataset designed to test whether models reproduce common human misconceptions or produce false information.
claim: Experimental evaluations using benchmarks such as TruthfulQA and HallucinationEval demonstrate performance differences among LLaMA 2, DeepSeek, and GPT-4 in hallucination susceptibility.
claim: The TruthfulQA benchmark evaluates large language models' susceptibility to factual hallucinations by presenting questions designed to provoke common misconceptions.
EdinburghNLP/awesome-hallucination-detection (github.com; 12 facts)
reference: The TruthfulQA multiple-choice task uses MC1, MC2, and MC3 scores, while the TruthfulQA open-ended generation task uses %Truth, %Info, and %Truth*Info metrics.
reference: The MultiHal benchmark is a factual language modeling benchmark that extends previous benchmarks such as Shroom2024, HaluEval, HaluBench, TruthfulQA, Felm, Defan, and SimpleQA by mining relevant knowledge graph paths from Wikidata.
reference: A contrastive decoding method addresses limitations of early-exit strategies by constructing an "amateur" model via dynamic layer pruning rather than simple truncation. Removing specific intermediate reasoning layers produces a better-calibrated contrastive prior with more informative logits, steering generation away from factually incorrect but high-probability tokens while maintaining fluency. The approach achieves consistent factuality improvements with minimal inference overhead, evaluated on the TruthfulQA, FACTOR (News, Wiki), and StrategyQA datasets using TruthfulQA (MC1, MC2, %Truth, %Info), FACTOR, and StrategyQA accuracy metrics.
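As a rough illustration of the expert/amateur contrastive-decoding idea underlying this fact (a generic sketch, not the paper's layer-pruning implementation; the plausibility cutoff and all numbers are assumptions), the next-token score is the gap between the expert's and the amateur's log-probabilities, restricted to tokens the expert already finds plausible:

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax.
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def contrastive_scores(expert_logits, amateur_logits, alpha=0.1):
    """Score tokens by expert-vs-amateur log-prob gap, masked to tokens
    within a factor alpha of the expert's most likely token."""
    expert_logp = log_softmax(expert_logits)
    amateur_logp = log_softmax(amateur_logits)
    cutoff = np.log(alpha) + expert_logp.max()  # plausibility constraint
    return np.where(expert_logp >= cutoff, expert_logp - amateur_logp, -np.inf)

# Toy logits: the amateur (pruned model) strongly prefers token 0,
# so the contrast steers selection toward token 1 instead.
expert = np.array([2.0, 1.9, -3.0])
amateur = np.array([2.5, 0.0, -3.0])
next_token = int(np.argmax(contrastive_scores(expert, amateur)))  # → 1
```

The plausibility mask keeps the contrast from promoting tokens the expert itself considers implausible, which is what preserves fluency.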
reference: Evaluation metrics for HotpotQA, OpenbookQA, StrategyQA, and TruthfulQA include Accuracy, Final Answer Truncation Sensitivity, Final Answer Corruption Sensitivity, and Biased-Context Accuracy Change.
reference: The "Monitoring Decoding" framework is evaluated using the TruthfulQA (817 questions), TriviaQA (1,200 samples), NQ-Open (1,000 samples), and GSM8K (1,319 samples) datasets.
reference: The TruthfulQA benchmark evaluates AI models using MC1, MC2, and MC3 multiple-choice scores and, for open-ended generation, the %Truth, %Info, %Truth*Info, and %Reject metrics.
reference: A white-box hallucination detector treats the large language model as a dynamic graph and analyzes structural properties of its internal attention mechanisms. The method extracts spectral features, specifically eigenvalues, from attention maps to predict fabrication: factual retrieval produces stable eigen-structures, while hallucination leads to diffuse, chaotic patterns. The detector operates independently of the generated semantic content and was evaluated across seven QA benchmarks (NQ-Open, TriviaQA, CoQA, SQuADv2, HaluEval-QA, TruthfulQA, GSM8K) using AUROC, Precision, Recall, and Cohen's Kappa.
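To make the spectral idea concrete, here is a minimal sketch of extracting eigenvalue-based features from an attention map; the feature names, toy matrices, and summary statistics are illustrative assumptions, not the detector's actual feature set:

```python
import numpy as np

def attention_spectral_features(attn):
    """attn: (seq_len, seq_len) attention matrix (rows ~ sum to 1).
    Returns simple summaries of the eigenvalue spectrum."""
    mags = np.sort(np.abs(np.linalg.eigvals(attn)))[::-1]
    weights = mags / mags.sum()
    return {
        # Gap between the two largest eigenvalue magnitudes.
        "spectral_gap": float(mags[0] - mags[1]),
        # Entropy of the normalized spectrum: how spread out the eigenvalues are.
        "spectral_entropy": float(-np.sum(weights * np.log(weights + 1e-12))),
    }

# Two toy patterns: sharply focused (near-identity) vs. uniform attention.
focused = np.eye(4) * 0.97 + 0.01
diffuse = np.full((4, 4), 0.25)
f_feats = attention_spectral_features(focused)
d_feats = attention_spectral_features(diffuse)
```

A classifier would be trained on such per-layer, per-head features; the point of the sketch is only that structurally different attention patterns yield measurably different spectra.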
measurement: The "Monitoring Decoding" framework uses Exact Match (TriviaQA, NQ-Open), Truth/Info/Truth×Info scores (TruthfulQA), Accuracy (GSM8K), Latency (ms/token), and Throughput (tokens/s) as evaluation metrics.
reference: The study "Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback" uses Expected Calibration Error with temperature scaling (ECE-t), accuracy@coverage, and coverage@accuracy as metrics, on QA datasets including TriviaQA, SciQ, and TruthfulQA.
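For reference, Expected Calibration Error bins predictions by stated confidence and averages the per-bin |accuracy − confidence| gap, weighted by bin size; the ECE-t variant first rescales confidences with a fitted temperature. A minimal sketch with toy data (binning scheme is one common choice, not the paper's exact setup):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin by confidence, then sum bin_weight * |bin_accuracy - bin_confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# Overconfident toy model: 95% stated confidence but only 50% accuracy.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0])  # → 0.45
```

A perfectly calibrated model (confidence matching empirical accuracy in every bin) would score 0.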
claim: CoQA, SQuAD, Natural Questions, TriviaQA, and TruthfulQA are datasets used for evaluating AI systems.
measurement: Evaluation metrics for hallucination detection and knowledge consistency include MC1, MC2, and MC3 scores for the TruthfulQA multiple-choice task; %Truth, %Info, and %Truth*Info for the TruthfulQA open-ended generation task; subspan Exact Match for open-domain QA tasks (NQ-Open, NQ-Swap, TriviaQA, PopQA, MuSiQue); accuracy for MemoTrap; and prompt-level and instruction-level accuracies for IFEval.
claim: AUROC, PCC, and accuracy are metrics used for evaluating models on TruthfulQA.
The Hallucinations Leaderboard, an Open Effort to Measure ... (huggingface.co, Jan 29, 2024; 5 facts)
measurement: The Hallucinations Leaderboard normalizes all metrics to a 0–1 scale, so that a score of 0.8 corresponds to 80% accuracy, as in the TruthfulQA MC1 and MC2 tasks.
measurement: Models based on Mistral 7B demonstrate higher accuracy on TriviaQA (8-shot) and TruthfulQA than the other models evaluated on the Hallucinations Leaderboard.
claim: The Hallucinations Leaderboard includes tasks across several categories: Closed-book Open-domain QA (NQ Open, TriviaQA, TruthfulQA), Summarisation (XSum, CNN/DM), Reading Comprehension (RACE, SQuADv2), Instruction Following (MemoTrap, IFEval), Fact-Checking (FEVER), Hallucination Detection (FaithDial, True-False, HaluEval), and Self-Consistency (SelfCheckGPT).
reference: TruthfulQA is a dataset designed to address the challenge of truthfulness and factual accuracy in AI-generated responses.
procedure: In the TruthfulQA task, models are evaluated in a multi-class (MC1) or multi-label (MC2) zero-shot classification setting, where the task is to select the correct answer(s) from the provided options.
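MC1 and MC2 are commonly computed from the model's per-option log-likelihoods: MC1 checks whether the single correct option scores highest, and MC2 is the normalized probability mass placed on the set of true options. A toy sketch (the numbers and option layout are illustrative, not real TruthfulQA items):

```python
import numpy as np

def mc1(logprobs, true_idx):
    """MC1: 1.0 if the single correct option has the highest log-likelihood."""
    return float(int(np.argmax(logprobs)) == true_idx)

def mc2(logprobs, true_mask):
    """MC2: softmax-normalized probability mass assigned to all true options."""
    probs = np.exp(logprobs - np.max(logprobs))
    probs /= probs.sum()
    return float(probs[true_mask].sum())

logprobs = np.array([-1.2, -0.8, -2.5, -3.0])   # one log-likelihood per option
true_mask = np.array([True, False, True, False])  # options 0 and 2 are true
score_mc1 = mc1(logprobs, true_idx=0)  # → 0.0: a false option scored highest
score_mc2 = mc2(logprobs, true_mask)   # mass on the true options, in (0, 1)
```

Averaging these per-question scores over the benchmark yields the leaderboard's MC1/MC2 accuracies.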
Large Language Models Meet Knowledge Graphs for Question ... (arxiv.org, Sep 22, 2025; 1 fact)
reference: The KnowLA method, proposed by Luo et al. (2024), applies knowledgeable adaptation to the Llama2-7B and Alpaca2 language models, incorporating the WordNet, ConceptNet, and Wikidata knowledge graphs to perform MCQA, CBQA, and TruthfulQA tasks on the CSQA, SIQA, BBH, WQSP, and TriviaQA datasets, evaluated with Acc, CE Score, BLEU, and ROUGE metrics.
Unknown source (1 fact)
claim: Shroom2024, HaluEval, HaluBench, TruthfulQA, Felm, Defan, and SimpleQA are identified as past benchmarks for hallucination detection in AI systems.
The Role of Hallucinations in Large Language Models (cloudthat.com, Sep 1, 2025; 1 fact)
claim: Fact-checking tools for large language models include the TruthfulQA benchmark, LLM fact-checker models, and custom LLMs fine-tuned specifically for verification.
LLM Hallucination Detection and Mitigation: State of the Art in 2026 (zylos.ai, Jan 27, 2026; 1 fact)
measurement: Integrative Decoding achieves performance improvements on the following benchmarks: TruthfulQA (+11.2%), Biographies (+15.4%), and LongFact (+8.5%).
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... (arxiv.org, Feb 23, 2026; 1 fact)
reference: The paper "TruthfulQA: Measuring How Models Mimic Human Falsehoods" was published in the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, pp. 3214–3252.