Concept: DROP

Facts (11)

Sources
Benchmarking Hallucination Detection Methods in RAG - Cleanlab (cleanlab.ai), Sep 30, 2024 (5 facts)
[measurement] The RAGAS Faithfulness evaluation framework experienced a 58.90% failure rate on the DROP dataset, 0.70% on RAGTruth, 83.50% on FinanceBench, 0.10% on PubMedQA, and 21.20% on CovidQA, where a failure is defined as the software returning an error instead of a score.
[claim] The DROP dataset contains difficult questions, such as asking for the number of touchdown runs of 5 yards or less in a 49ers football game, which require an LLM to read and compare data against a specific requirement.
[claim] The CovidQA dataset consists of Q&A pairs based on scientific articles related to COVID-19 and contains simpler problems than the DROP dataset, typically requiring simple synthesis of information.
[measurement] On the DROP dataset, the Trustworthy Language Model (TLM) exhibited the best hallucination-detection performance, followed by the improved RAGAS metrics and LLM Self-Evaluation.
[measurement] The RAGAS++ evaluation framework experienced a 0.10% failure rate on the DROP dataset and 0.00% on RAGTruth, FinanceBench, PubMedQA, and CovidQA, where a failure is defined as the software returning an error instead of a score.
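The failure-rate measurements above count an evaluation run as a failure when the framework returns an error instead of a score. A minimal sketch of that definition, using hypothetical per-example outcomes (`None` standing in for an error; no real RAGAS results are used):

```python
def failure_rate(results):
    """Percentage of evaluation runs that failed.

    results: list of per-example outcomes, each either a float score
    or None (standing in for the evaluator returning an error).
    """
    failures = sum(1 for r in results if r is None)
    return 100.0 * failures / len(results)

# Toy outcomes: 2 errors out of 10 runs -> 20.00% failure rate
outcomes = [0.9, None, 0.7, 0.8, None, 0.6, 0.95, 0.5, 0.85, 0.75]
print(f"{failure_rate(outcomes):.2f}%")  # 20.00%
```

Under this definition, a low failure rate says nothing about detection quality; it only measures whether the framework produced a score at all.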
EdinburghNLP/awesome-hallucination-detection - GitHub (github.com) (3 facts)
[reference] The HaluBench dataset consists of approximately 500 random samples from CovidQA, PubMedQA, DROP, and FinanceBench, along with a set of perturbations based on the retrieved samples.
[claim] The curriculum learning strategy that transitions training from easier to harder negatives outperforms larger state-of-the-art models on the DROP, CovidQA, and PubMedQA benchmarks.
[procedure] The Lynx model is trained on 2,400 samples from RAGTruth, DROP, CovidQA, and PubMedQA, incorporating GPT-4o-generated reasoning as part of the training data.
Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... - Cleanlab (cleanlab.ai), Apr 7, 2025 (2 facts)
[claim] On the DROP benchmark, the TLM evaluation model detects incorrect AI responses with the highest precision and recall, followed by LLM-as-a-judge; no other evaluation model appears very useful.
[claim] Patronus Lynx was trained on RAG datasets including CovidQA, PubMedQA, DROP, and FinanceBench.
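The precision/recall comparison above can be grounded with a minimal sketch of how those two metrics are computed for a hallucination detector. The labels and flags below are toy data, not results from any of the cited benchmarks:

```python
def precision_recall(labels, flags):
    """Precision and recall for a detector of incorrect responses.

    labels: per-response ground truth (True = response is actually incorrect)
    flags:  detector output (True = detector flagged the response as incorrect)
    """
    tp = sum(1 for y, f in zip(labels, flags) if y and f)        # flagged and incorrect
    fp = sum(1 for y, f in zip(labels, flags) if not y and f)    # flagged but correct
    fn = sum(1 for y, f in zip(labels, flags) if y and not f)    # missed incorrect response
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: 3 incorrect responses; the detector catches 2 and also flags 1 correct one
labels = [True, True, True, False, False, False]
flags  = [True, True, False, True, False, False]
print(precision_recall(labels, flags))  # precision 2/3, recall 2/3
```

High precision means flagged responses are usually truly incorrect; high recall means few incorrect responses slip through, which is why the benchmarks above report both.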
Detecting hallucinations with LLM-as-a-judge: Prompt ... - Datadog (datadoghq.com), Aritra Biswas and Noé Vernier, Aug 25, 2025 (1 fact)
[reference] HaluBench is a partially synthetic hallucination benchmarking dataset in which negative examples (non-hallucinated answers) are derived from existing question-answering benchmarks, including HaluEval, DROP, CovidQA, FinanceBench, and PubMedQA.