concept

HaluEval

Also known as: HaluEval-QA

Facts (15)

Sources
The Hallucinations Leaderboard, an Open Effort to Measure ... (Hugging Face, huggingface.co, Jan 29, 2024; 6 facts)
procedure: In the HaluEval QA task, a model is provided with a question, a knowledge snippet, and an answer. The model must predict, in a zero-shot setting, whether the answer contains hallucinations.
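The three inputs above can be assembled into a judging prompt. A minimal sketch, assuming a generic chat-completion interface; the prompt wording and the Yes/No parsing convention are illustrative, not HaluEval's official instruction template:

```python
def build_halueval_qa_prompt(question: str, knowledge: str, answer: str) -> str:
    """Assemble a zero-shot judging prompt from the three HaluEval QA inputs.

    The wording below is illustrative; HaluEval ships its own templates.
    """
    return (
        "You are judging whether an answer is supported by the given knowledge.\n"
        f"#Knowledge#: {knowledge}\n"
        f"#Question#: {question}\n"
        f"#Answer#: {answer}\n"
        "Does the answer contain hallucinations? Reply with exactly 'Yes' or 'No'."
    )


def parse_judgement(model_reply: str) -> bool:
    """Map the model's free-text reply to a boolean hallucination label."""
    return model_reply.strip().lower().startswith("yes")
```

The returned prompt string would be sent to the model under evaluation, and `parse_judgement` applied to its reply.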
measurement: HaluEval includes 5,000 general user queries with ChatGPT responses and 30,000 task-specific examples across three tasks: question answering (HaluEval QA), knowledge-grounded dialogue (HaluEval Dialogue), and summarisation (HaluEval Summarisation).
reference: FaithDial, True-False, and HaluEval (covering QA, Dialogue, and Summarisation) are datasets specifically designed to target hallucination detection in Large Language Models.
claim: The Hallucinations Leaderboard includes tasks across several categories: Closed-book Open-domain QA (NQ Open, TriviaQA, TruthfulQA), Summarisation (XSum, CNN/DM), Reading Comprehension (RACE, SQuADv2), Instruction Following (MemoTrap, IFEval), Fact-Checking (FEVER), Hallucination Detection (FaithDial, True-False, HaluEval), and Self-Consistency (SelfCheckGPT).
reference: The Hallucinations Leaderboard evaluates hallucination detection using two tasks: SelfCheckGPT, which checks for self-consistency in model answers, and HaluEval, which checks for faithfulness hallucinations in the QA, Dialogue, and Summarisation tasks relative to a knowledge snippet.
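The self-consistency idea can be sketched as scoring how often independently sampled answers agree with a reference answer. This is a toy stand-in, assuming simple token overlap as the agreement measure; the actual SelfCheckGPT method uses stronger comparisons (e.g. NLI- or QA-based scoring):

```python
def self_consistency_score(samples: list[str], reference: str) -> float:
    """Average lexical overlap between sampled answers and a reference.

    Token overlap is purely illustrative; it is not SelfCheckGPT's
    actual scoring function. Higher scores suggest the model answers
    consistently; low scores hint at fabrication.
    """
    ref_tokens = set(reference.lower().split())
    if not ref_tokens or not samples:
        return 0.0
    overlaps = [
        len(ref_tokens & set(s.lower().split())) / len(ref_tokens)
        for s in samples
    ]
    return sum(overlaps) / len(overlaps)
```

In practice the `samples` would be multiple stochastic generations from the same model for the same question.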
claim: On the HaluEval QA, Dialogue, and Summarisation tasks, Mistral- and LLaMA2-based models produce the best results.
EdinburghNLP/awesome-hallucination-detection (GitHub, github.com; 5 facts)
reference: The MultiHal benchmark is a factual language modeling benchmark that extends previous benchmarks such as Shroom2024, HaluEval, HaluBench, TruthfulQA, Felm, Defan, and SimpleQA by mining relevant knowledge-graph paths from Wikidata.
claim: Datasets used in hallucination detection research include HELM (50K Wikipedia articles), MedHALT, LegalBench, HaluEval, and XSum.
claim: HaluEval is a collection of generated and human-annotated hallucinated samples used for evaluating the performance of large language models in recognizing hallucinations.
reference: A white-box hallucination detector approach treats the Large Language Model as a dynamic graph and analyzes structural properties of internal attention mechanisms. This method extracts spectral features, specifically eigenvalues, from attention maps to predict fabrication: factual retrieval produces stable eigen-structures, while hallucination leads to diffuse, chaotic patterns. The detector operates independently of the generated semantic content and was evaluated across seven QA benchmarks (NQ-Open, TriviaQA, CoQA, SQuADv2, HaluEval-QA, TruthfulQA, GSM8K) using AUROC, Precision, Recall, and Cohen's Kappa metrics.
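The eigenvalue-extraction step can be sketched as follows. A minimal sketch, assuming one (seq_len, seq_len) attention matrix viewed as the weighted adjacency of a token graph; using the top-k eigenvalue magnitudes as the feature vector is an illustrative choice, not the cited paper's exact recipe:

```python
import numpy as np

def spectral_features(attn: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the top-k eigenvalue magnitudes of one attention map.

    Attention matrices are not symmetric, so eigenvalues are complex in
    general; we summarise them by magnitude. The feature definition here
    is an assumption for illustration.
    """
    eigvals = np.linalg.eigvals(attn)        # complex in general
    mags = np.sort(np.abs(eigvals))[::-1]    # magnitudes, largest first
    return mags[:k]

# Uniform (diffuse) attention collapses to a single leading eigenvalue;
# a downstream classifier would be trained on such feature vectors.
diffuse = np.full((4, 4), 0.25)
features = spectral_features(diffuse)
```

A real detector would collect these features across layers and heads and feed them to a lightweight classifier.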
measurement: A curriculum learning strategy that transitions training from easier to harder negatives demonstrates up to 24% relative F1 gains on the MedHallu and HaluEval datasets.
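The easy-to-hard transition can be sketched as a staged schedule over the negative examples. A generic sketch, assuming a user-supplied `difficulty` scoring function; the staging scheme is a standard curriculum pattern, not the exact schedule behind the cited result:

```python
def curriculum_schedule(negatives, difficulty, num_stages=3):
    """Split negatives into stages of increasing difficulty.

    `difficulty` maps each negative example to a hardness score (e.g. a
    model's confidence that the example is *not* hallucinated). Training
    consumes stage 0 first, then progressively harder stages.
    """
    ordered = sorted(negatives, key=difficulty)
    stage_size = max(1, len(ordered) // num_stages)
    stages = [
        ordered[i * stage_size:(i + 1) * stage_size]
        for i in range(num_stages - 1)
    ]
    stages.append(ordered[(num_stages - 1) * stage_size:])  # remainder
    return stages
```

Each stage would then be mixed into the training set for a block of epochs, so the detector sees easy negatives before hard ones.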
Re-evaluating Hallucination Detection in LLMs (arXiv, arxiv.org, Aug 13, 2025; 1 fact)
reference: Li et al. (2023) created HaluEval, a large-scale benchmark for evaluating hallucinations in Large Language Models.
Unknown source (1 fact)
claim: Shroom2024, HaluEval, HaluBench, TruthfulQA, Felm, Defan, and SimpleQA are identified as past benchmarks for hallucination detection in AI systems.
Detecting hallucinations with LLM-as-a-judge: Prompt ... (Aritra Biswas and Noé Vernier, Datadog, datadoghq.com, Aug 25, 2025; 1 fact)
reference: HaluBench is a partially synthetic hallucination benchmarking dataset in which negative examples (non-hallucinated answers) are derived from existing question-answering benchmarks, including HaluEval, DROP, CovidQA, FinanceBench, and PubMedQA.
Benchmarking Hallucination Detection Methods in RAG (Cleanlab, cleanlab.ai, Sep 30, 2024; 1 fact)
claim: The Cleanlab researchers excluded the HaluEval and RAGTruth datasets from their benchmark suite because they discovered significant errors in the ground-truth annotations of those datasets.