Relations (1)

related 0.30 — supporting 3 facts

Large Language Models are evaluated using the TruthfulQA benchmark to assess their tendency to mimic human false beliefs as described in [1], and this benchmark is explicitly used in controlled experiments to analyze hallucinations in these models as noted in [2] and [3].

Facts (3)

Sources
Survey and analysis of hallucinations in large language models frontiersin.org Frontiers 2 facts
procedureThe authors of the survey "Survey and analysis of hallucinations in large language models" conducted controlled experiments on multiple Large Language Models (GPT-4, LLaMA 2, DeepSeek, Gwen) using standardized hallucination evaluation benchmarks, specifically TruthfulQA, HallucinationEval, and RealToxicityPrompts.
referenceTruthfulQA (Lin et al., 2022) is a benchmark that evaluates whether large language models produce answers that mimic human false beliefs.
The Role of Hallucinations in Large Language Models - CloudThat cloudthat.com CloudThat 1 fact
claimFact-checking tools for large language models include TruthfulQA benchmarks, LLM Fact Checker models, and custom fine-tuned LLMs trained specifically for verification.