concept

CovidQA

Facts (11)

Sources
Benchmarking Hallucination Detection Methods in RAG (cleanlab.ai, Cleanlab, Sep 30, 2024) - 4 facts
measurement: In the CovidQA dataset application, RAGAS Faithfulness performs relatively well for hallucination detection but remains less effective than the Trustworthy Language Model (TLM).
measurement: The RAGAS Faithfulness evaluation framework experienced a 58.90% failure rate on the DROP dataset, 0.70% on RAGTruth, 83.50% on FinanceBench, 0.10% on PubMedQA, and 21.20% on CovidQA, where a failure is defined as the software returning an error instead of a score.
claim: The CovidQA dataset consists of Q&A pairs based on scientific articles related to COVID-19; its problems are simpler than those in the DROP dataset, typically requiring only straightforward synthesis of information.
measurement: The RAGAS++ evaluation framework experienced a 0.10% failure rate on the DROP dataset, 0.00% on RAGTruth, 0.00% on FinanceBench, 0.00% on PubMedQA, and 0.00% on CovidQA, where a failure is defined as the software returning an error instead of a score.
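The failure-rate percentages in the measurements above follow a simple definition: the share of evaluation calls that errored out instead of returning a score. A minimal sketch of that computation (the function name and the `None`-for-error input encoding are hypothetical, not taken from the cited posts):

```python
def failure_rate(scores):
    """Percentage of evaluation calls that raised an error instead of
    returning a score.

    `scores` is a list where each entry is either a float (the returned
    score) or None (the evaluator errored out on that example).
    """
    failures = sum(1 for s in scores if s is None)
    return 100.0 * failures / len(scores)

# Example: 212 errors out of 1000 CovidQA examples gives 21.2%,
# matching the RAGAS Faithfulness figure reported for CovidQA.
print(failure_rate([None] * 212 + [0.5] * 788))
```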
Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... (cleanlab.ai, Cleanlab, Apr 7, 2025) - 3 facts
claim: In the CovidQA benchmark, the TLM evaluation model detects incorrect AI responses with the highest precision and recall, followed by Prometheus and LLM-as-a-judge.
reference: The CovidQA dataset contains scientific articles as retrieved context to help experts answer questions related to the Covid-19 pandemic based on medical research literature.
claim: Patronus Lynx was trained on RAG datasets including CovidQA, PubmedQA, DROP, and FinanceBench.
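The precision and recall figures behind the TLM claim above score each evaluator on how well it flags incorrect responses. A hedged sketch of that scoring, assuming boolean per-example flags (the function and variable names are hypothetical, not from the Cleanlab post):

```python
def precision_recall(predicted_flags, true_flags):
    """Precision and recall for flagging incorrect AI responses.

    predicted_flags[i] is True if the evaluator flagged response i as
    incorrect; true_flags[i] is True if response i was actually incorrect.
    """
    pairs = list(zip(predicted_flags, true_flags))
    tp = sum(1 for p, t in pairs if p and t)       # correctly flagged
    fp = sum(1 for p, t in pairs if p and not t)   # falsely flagged
    fn = sum(1 for p, t in pairs if t and not p)   # missed errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Under this framing, an evaluator like TLM "leads" on CovidQA when it attains both a higher precision and a higher recall than the alternatives on the same labeled examples.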
EdinburghNLP/awesome-hallucination-detection (github.com, GitHub) - 3 facts
reference: The HaluBench dataset consists of approximately 500 random samples from CovidQA, PubMedQA, DROP, and FinanceBench, along with a set of perturbations based on the retrieved samples.
claim: The curriculum learning strategy that transitions training from easier to harder negatives outperforms larger state-of-the-art models on the DROP, CovidQA, and PubMedQA benchmarks.
procedure: The Lynx model is trained on 2400 samples from RAGTruth, DROP, CovidQA, and PubMedQA, incorporating GPT-4o generated reasoning as part of the training data.
Detecting hallucinations with LLM-as-a-judge: Prompt ... (datadoghq.com, Aritra Biswas, Noé Vernier, Datadog, Aug 25, 2025) - 1 fact
reference: HaluBench is a partially synthetic hallucination benchmarking dataset where negative examples (non-hallucinated answers) are derived from existing question answering benchmarks including HaluEval, DROP, CovidQA, FinanceBench, and PubMedQA.