Relations (1)
related (score 0.30) — supported by 3 facts
Large Language Models are evaluated with the TruthfulQA benchmark to assess their tendency to mimic human false beliefs, as described in [1]; the same benchmark is used in controlled experiments analyzing hallucinations in these models, as noted in [2] and [3].
Facts (3)
Sources
Survey and analysis of hallucinations in large language models (frontiersin.org, 2 facts)
procedure: The survey's authors conducted controlled experiments on multiple Large Language Models (GPT-4, LLaMA 2, DeepSeek, Qwen) using standardized hallucination evaluation benchmarks, specifically TruthfulQA, HallucinationEval, and RealToxicityPrompts; a minimal evaluation sketch follows the Sources list below.
reference: TruthfulQA (Lin et al., 2022) is a benchmark that evaluates whether large language models produce answers that mimic human false beliefs.
The Role of Hallucinations in Large Language Models - CloudThat (cloudthat.com, 1 fact)
claim: Fact-checking tools for large language models include TruthfulQA benchmarks, LLM Fact Checker models, and custom fine-tuned LLMs trained specifically for verification.
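To make the evaluation procedure above concrete, here is a minimal sketch of a TruthfulQA multiple-choice (MC1) evaluation loop. It assumes the Hugging Face `datasets` package and a caller-supplied `pick_answer` function (hypothetical; in practice this would rank each choice by the model's log-likelihood). This shows the general shape of such a benchmark run, not the survey authors' exact protocol.

```python
# Minimal TruthfulQA MC1 evaluation sketch, assuming the Hugging Face
# `datasets` package. `pick_answer` is a hypothetical caller-supplied
# function, not part of any library.
import random
from datasets import load_dataset

def evaluate_mc1(pick_answer, limit=None, seed=0):
    """MC1 accuracy: the model must select the single correct choice.

    pick_answer(question: str, choices: list[str]) -> int (chosen index)
    """
    # TruthfulQA ships a single validation split of 817 questions.
    ds = load_dataset("truthful_qa", "multiple_choice", split="validation")
    rng = random.Random(seed)
    correct = total = 0
    for row in (ds.select(range(limit)) if limit else ds):
        choices = row["mc1_targets"]["choices"]
        labels = row["mc1_targets"]["labels"]  # label 1 marks the true answer
        # The raw dataset lists the correct choice first, so shuffle the
        # presentation order to avoid rewarding position-biased models.
        order = list(range(len(choices)))
        rng.shuffle(order)
        shown = [choices[i] for i in order]
        picked = order[pick_answer(row["question"], shown)]
        correct += labels[picked]
        total += 1
    return correct / total

# Usage with a trivial baseline that always picks the first shown choice
# (expected to score roughly at chance level after shuffling):
if __name__ == "__main__":
    print(f"MC1 accuracy: {evaluate_mc1(lambda q, c: 0, limit=50):.3f}")
```

A real run would replace the lambda with model inference (e.g., scoring each choice's tokens under the model and returning the argmax), which is how MC1 is typically computed for the models named in the procedure fact.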