concept

RAGTruth

Facts (11)

Sources
Detecting hallucinations with LLM-as-a-judge: Prompt ... (datadoghq.com) · Aritra Biswas, Noé Vernier · Datadog · Aug 25, 2025 · 4 facts
claim: RAGTruth ensures that positive and negative examples come from the same distribution by using LLM-generated answers for all samples.
reference: RAGTruth is a human-labeled benchmark for hallucination detection that covers three tasks: question answering, summarization, and data-to-text writing.
measurement: F1 scores for hallucination detection methods are consistently higher on HaluBench than on RAGTruth, suggesting that RAGTruth is the more difficult benchmark.
claim: The Datadog hallucination detection method showed the smallest drop in F1 score between HaluBench and RAGTruth, suggesting robustness as hallucinations become harder to detect.
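The F1 comparisons above combine precision and recall into a single score. As a minimal sketch (assuming binary per-response labels, with 1 marking a hallucinated response; the labels below are hypothetical), F1 can be computed as:

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels (1 = hallucinated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 3 true positives, 1 false negative, 1 false positive
print(f1_score([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))  # 0.75
```

A drop in F1 between two benchmarks, as in the Datadog comparison, means the detector misses more hallucinations and/or flags more correct answers on the harder dataset.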
Benchmarking Hallucination Detection Methods in RAG (cleanlab.ai) · Cleanlab · Sep 30, 2024 · 3 facts
measurement: The RAGAS Faithfulness evaluation framework had failure rates of 58.90% on DROP, 0.70% on RAGTruth, 83.50% on FinanceBench, 0.10% on PubMedQA, and 21.20% on CovidQA, where a failure is defined as the software returning an error instead of a score.
claim: The Cleanlab researchers excluded the HaluEval and RAGTruth datasets from their benchmark suite because they discovered significant errors in those datasets' ground-truth annotations.
measurement: The RAGAS++ evaluation framework had failure rates of 0.10% on DROP and 0.00% on RAGTruth, FinanceBench, PubMedQA, and CovidQA, under the same definition of failure.
EdinburghNLP/awesome-hallucination-detection (github.com) · GitHub · 2 facts
procedure: The Lynx model is trained on 2,400 samples from RAGTruth, DROP, CovidQA, and PubMedQA, incorporating GPT-4o-generated reasoning as part of the training data.
measurement: On the RAGTruth dataset, which covers QA, summarization, and data-to-text tasks, the RL4HS framework improves fine-grained hallucination detection over chain-of-thought and supervised baselines.
vectara/hallucination-leaderboard (github.com) · Vectara · 1 fact
reference: Key academic papers on factual consistency in summarization include: SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization; TRUE: Re-evaluating Factual Consistency Evaluation; TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models; AlignScore: Evaluating Factual Consistency with a Unified Alignment Function; MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents; TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization; RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models; and FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs.
Awesome-Hallucination-Detection-and-Mitigation (github.com) · GitHub · 1 fact
reference: Niu et al. (2024), 'RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models', published in the proceedings of ACL 2024.