PubmedQA
Also known as: Pubmed QA
Facts (16)
Sources
Benchmarking Hallucination Detection Methods in RAG - Cleanlab cleanlab.ai Sep 30, 2024 4 facts
reference: Pubmed QA is a biomedical Q&A dataset based on PubMed abstracts, where each instance contains a passage from a medical publication, a question derived from that passage, and an LLM-generated answer.
measurement: The RAGAS Faithfulness evaluation framework experienced a 58.90% failure rate on the DROP dataset, 0.70% on RAGTruth, 83.50% on FinanceBench, 0.10% on PubMedQA, and 21.20% on CovidQA, where a failure is defined as the software returning an error instead of a score.
claim: For the Pubmed QA application, the TLM method is the most effective technique for detecting hallucinations, followed by the DeepEval Hallucination metric, RAGAS Faithfulness, and LLM Self-Evaluation.
measurement: The RAGAS++ evaluation framework experienced a 0.10% failure rate on the DROP dataset, 0.00% on RAGTruth, 0.00% on FinanceBench, 0.00% on PubMedQA, and 0.00% on CovidQA, where a failure is defined as the software returning an error instead of a score.
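The failure rates above have a simple definition: the fraction of examples for which the evaluation framework raised an error instead of returning a score. A minimal sketch of that bookkeeping (the scorer here is a hypothetical stand-in, not any of the frameworks named above):

```python
def failure_rate(score_fn, examples):
    """Percentage of examples where score_fn raised an error instead of scoring."""
    failures = 0
    for ex in examples:
        try:
            score_fn(ex)
        except Exception:
            failures += 1
    return 100.0 * failures / len(examples)

# Toy scorer that errors on empty answers (stand-in for a real framework).
def toy_scorer(ex):
    if not ex["answer"]:
        raise ValueError("no answer to score")
    return 1.0

examples = [{"answer": "yes"}, {"answer": ""}, {"answer": "no"}, {"answer": "maybe"}]
print(failure_rate(toy_scorer, examples))  # → 25.0
```

A framework that crashes on 83.50% of FinanceBench items, as RAGAS Faithfulness reportedly did, effectively scores only the remaining sixth of the benchmark.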
EdinburghNLP/awesome-hallucination-detection - GitHub github.com 4 facts
reference: The HaluBench dataset consists of approximately 500 random samples from CovidQA, PubMedQA, DROP, and FinanceBench, along with a set of perturbations based on the retrieved samples.
reference: The MedHallu benchmark, derived from PubMedQA, contains 10,000 question-answer pairs with deliberately planted plausible hallucinations to evaluate medical hallucination detection.
claim: The curriculum learning strategy that transitions training from easier to harder negatives outperforms larger state-of-the-art models on the DROP, CovidQA, and PubMedQA benchmarks.
procedure: The Lynx model is trained on 2400 samples from RAGTruth, DROP, CovidQA, and PubMedQA, incorporating GPT-4o generated reasoning as part of the training data.
Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... cleanlab.ai Apr 7, 2025 3 facts
reference: The PubmedQA dataset uses PubMed medical research publication abstracts as context for LLMs to answer biomedical questions.
claim: In the PubmedQA benchmark, the Prometheus and TLM evaluation models detect incorrect AI responses with the highest precision and recall, effectively catching hallucinations.
claim: Patronus Lynx was trained on RAG datasets including CovidQA, PubmedQA, DROP, and FinanceBench.
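The precision/recall framing used to compare evaluation models treats each detector as a binary classifier: it flags responses as hallucinated, and the flags are scored against ground-truth incorrectness labels. A minimal sketch (the data is illustrative, not from the benchmark):

```python
def precision_recall(flagged, incorrect):
    """Score a hallucination detector's flags against ground-truth labels."""
    tp = sum(f and i for f, i in zip(flagged, incorrect))          # true alarms
    fp = sum(f and not i for f, i in zip(flagged, incorrect))      # false alarms
    fn = sum(not f and i for f, i in zip(flagged, incorrect))      # missed hallucinations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 2 of 3 flags are true hallucinations; 1 hallucination is missed.
flagged = [True, True, True, False, False]
incorrect = [True, True, False, True, False]
print(precision_recall(flagged, incorrect))  # → (0.666..., 0.666...)
```

High precision means few false alarms on correct responses; high recall means few hallucinations slip through, which is the trade-off the Prometheus and TLM comparison above is measuring.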
Large Language Models Meet Knowledge Graphs for Question ... arxiv.org Sep 22, 2025 1 fact
reference: The InfuserKI method, proposed by Wang et al. in 2024, utilizes knowledge-based fine-tuning with the Llama-2-7B language model, incorporating UMLS and Movie KG (MetaQA) knowledge graphs to perform KGQA tasks on the PubMedQA and MetaQA-1HopQA datasets, evaluated using NR, RR, and F1 metrics.
Medical Hallucination in Foundation Models and Their ... medrxiv.org Mar 3, 2025 1 fact
claim: Google's Med-PaLM and Med-PaLM 2 demonstrate strong performance on medical benchmarks such as MedQA (Jin et al., 2021), MedMCQA (Pal et al., 2022), and PubMedQA (Jin et al., 2019) by integrating biomedical texts into their training regimes, as reported by Singhal et al. (2022).
A Comprehensive Benchmark for Detecting Medical Hallucinations ... aclanthology.org 1 fact
claim: MedHallu is a benchmark designed for detecting medical hallucinations in large language models, consisting of 10,000 high-quality question-answer pairs derived from PubMedQA.
Detecting hallucinations with LLM-as-a-judge: Prompt ... - Datadog datadoghq.com Aug 25, 2025 1 fact
reference: HaluBench is a partially synthetic hallucination benchmarking dataset where negative examples (non-hallucinated answers) are derived from existing question answering benchmarks including HaluEval, DROP, CovidQA, FinanceBench, and PubMedQA.
MedHallu - GitHub github.com 1 fact
measurement: The MedHallu dataset consists of 10,000 high-quality question-answering pairs derived from PubMedQA, which include systematically generated hallucinated answers.
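A MedHallu-style detection instance pairs a PubMedQA-derived question with either the ground-truth answer or a planted hallucination, and the detector must label which it was given. The sketch below is illustrative only: the field names and content are assumptions for exposition, not MedHallu's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DetectionInstance:
    # Hypothetical fields, not MedHallu's real column names.
    question: str          # question derived from a PubMed abstract
    context: str           # the abstract the question came from
    answer: str            # either the ground-truth or a planted answer
    is_hallucinated: bool  # label the detector is evaluated against

example = DetectionInstance(
    question="Does drug X reduce symptom Y?",  # invented content
    context="Abstract text ...",
    answer="Yes, trials showed a significant reduction.",
    is_hallucinated=True,
)
print(example.is_hallucinated)  # → True
```

Framing the task this way is what makes the precision/recall comparisons across detectors possible: every instance carries a binary ground-truth label.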