MedHallu
Facts (33)
Sources
MedHallu - GitHub (github.com) · 10 facts
claim: Harder-to-detect hallucinations in the MedHallu benchmark are semantically closer to the ground truth.
reference: MedHallu classifies hallucinations into four medical-specific categories: Misinterpretation of Question, Incomplete Information, Mechanism and Pathway Misattribution, and Methodological and Evidence Fabrication.
code: The MedHallu software stack requires Python 3.8+, PyTorch, Transformers, vLLM, and Sentence-Transformers.
measurement: State-of-the-art large language models, including GPT-4o, Llama-3.1, and UltraMedical, struggle with the hard hallucination categories in the MedHallu benchmark, with the best model achieving an F1 score of 0.625.
claim: The MedHallu project is licensed under the MIT License.
claim: According to the MedHallu benchmark study, general-purpose large language models outperform medically fine-tuned models when provided with domain knowledge.
measurement: Adding a 'not sure' response option improves large language models' hallucination detection precision by up to 38% on the MedHallu benchmark.
reference: The MedHallu repository includes a dataset generation pipeline, detection evaluation scripts, bidirectional entailment checking tools, and medical category (MeSH) analysis utilities.
procedure: The MedHallu benchmark uses a multi-level difficulty classification (easy, medium, hard) based on the subtlety of the hallucinations.
measurement: The MedHallu dataset consists of 10,000 high-quality question-answer pairs derived from PubMedQA, each with a systematically generated hallucinated answer.
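The 38% precision gain from the 'not sure' option has a simple mechanism: an abstaining detector is no longer forced to guess on borderline cases, so fewer low-confidence guesses end up as false positives. A minimal sketch of that metric, with invented label names ("hallucinated" / "faithful" / "not_sure") rather than anything from the MedHallu codebase:

```python
def precision_with_abstention(predictions, labels):
    """Precision over confident predictions only; 'not_sure' answers abstain.

    predictions: list of 'hallucinated' / 'faithful' / 'not_sure'
    labels:      list of 'hallucinated' / 'faithful'
    """
    tp = fp = 0
    for pred, gold in zip(predictions, labels):
        if pred == "not_sure":
            continue  # abstentions are excluded from the precision count
        if pred == "hallucinated":
            if gold == "hallucinated":
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if (tp + fp) else 0.0

# Forced binary answers: the two uncertain cases become false positives.
forced  = ["hallucinated", "hallucinated", "hallucinated", "faithful"]
labels  = ["hallucinated", "faithful",     "faithful",     "faithful"]
# With abstention, the same uncertain cases map to 'not_sure' instead.
abstain = ["hallucinated", "not_sure",     "not_sure",     "faithful"]

print(precision_with_abstention(forced, labels))   # 1/3
print(precision_with_abstention(abstain, labels))  # 1.0
```

The trade-off, not shown here, is recall: every abstention on a true hallucination is a miss, so the paper's reported improvement concerns precision specifically.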
[Literature Review] MedHallu: A Comprehensive Benchmark for ... (themoonlight.io) · 6 facts
claim: The MedHallu benchmark provides a framework for evaluating hallucination prevalence and detection capabilities in medical applications of large language models, emphasizing the need for human oversight for patient safety.
claim: The MedHallu dataset is stratified into three levels of difficulty (easy, medium, and hard) based on the subtlety of the hallucinations present in the data.
claim: The MedHallu benchmark defines hallucination in large language models as instances where a model produces information that is plausible but factually incorrect.
claim: The MedHallu study observes that detection difficulty varies by hallucination type, with 'Incomplete Information' identified as a particularly challenging category for large language models.
claim: General-purpose large language models often outperform specialized medical models in hallucination detection tasks, according to experiments conducted for the MedHallu benchmark.
claim: The MedHallu benchmark evaluates the effectiveness of general-purpose large language models, such as GPT-4o, Qwen, and Gemma, alongside medically fine-tuned models in detecting hallucinations.
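The easy/medium/hard stratification by subtlety can be sketched as a similarity-based labeling rule: the closer a hallucinated answer sits to the ground truth, the harder it is to flag. Everything below is illustrative, not the authors' method: token-overlap similarity stands in for the embedding- and entailment-based measures the paper uses, and the thresholds are invented.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for an embedding score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def difficulty(hallucinated: str, ground_truth: str) -> str:
    """Label a hallucinated answer by its closeness to the ground truth.

    Thresholds are hypothetical; the principle is that higher semantic
    similarity means a subtler, harder-to-detect hallucination.
    """
    sim = jaccard(hallucinated, ground_truth)
    if sim >= 0.7:
        return "hard"    # nearly indistinguishable from the real answer
    if sim >= 0.4:
        return "medium"
    return "easy"        # obviously off-topic or contradictory

truth = "aspirin inhibits cox enzymes"
print(difficulty("aspirin inhibits cox enzymes irreversibly", truth))  # hard
print(difficulty("aspirin activates cox enzymes", truth))              # medium
print(difficulty("the sky is blue", truth))                            # easy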
MedHallu: Benchmark for Medical LLM Hallucination Detection (emergentmind.com, Feb 20, 2025) · 4 facts
procedure: The MedHallu generation pipeline employs a bidirectional entailment mechanism to assess semantic proximity, ensuring that generated hallucinations remain semantically close to genuine responses and therefore pose harder detection challenges.
claim: The MedHallu benchmark exposes current limitations in large language model hallucination detection.
claim: The MedHallu study's findings indicate that LLM training methodologies should incorporate external knowledge systems and probabilistic modeling to better handle the nuanced semantic differences characteristic of medical knowledge.
procedure: The MedHallu generation pipeline produces hallucinated answers by prompting an LLM with a question, its context, and the ground-truth answer, then applying qualitative checks and semantic analyses to ensure the hallucination is convincingly close to the verified answer.
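The bidirectional entailment step in the pipeline above can be sketched in a few lines. The idea: if a candidate "hallucination" and the ground truth each entail the other, they are paraphrases rather than a genuine factual divergence. This is only a structural sketch with an invented placeholder — `entails` here is a toy token-coverage check, whereas the real pipeline would query an NLI model at that point.

```python
def entails(premise: str, hypothesis: str) -> bool:
    """Toy entailment check: every hypothesis token appears in the premise.

    Placeholder only; an actual pipeline would call a natural language
    inference model here instead.
    """
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return h <= p

def is_paraphrase(candidate: str, ground_truth: str) -> bool:
    """Bidirectional entailment: equivalent only if each entails the other.

    A candidate hallucination that passes this check is too close to the
    truth to count as a factual error and would be regenerated/discarded.
    """
    return entails(candidate, ground_truth) and entails(ground_truth, candidate)

truth = "aspirin irreversibly inhibits cox enzymes"
print(is_paraphrase("aspirin inhibits cox enzymes irreversibly", truth))  # True
print(is_paraphrase("aspirin reversibly inhibits cox enzymes", truth))    # False
```

One-directional entailment is not enough: "aspirin inhibits cox enzymes" is entailed by the truth above but omits "irreversibly", which is exactly the kind of incomplete-information hallucination the benchmark wants to keep.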
EdinburghNLP/awesome-hallucination-detection - GitHub (github.com) · 3 facts
measurement: GPT-4 achieves an F1 score of approximately 0.625 when detecting subtle falsehoods on the hardest subset of the MedHallu benchmark.
reference: The MedHallu benchmark, derived from PubMedQA, contains 10,000 question-answer pairs with deliberately planted plausible hallucinations to evaluate medical hallucination detection.
measurement: A curriculum learning strategy that transitions training from easier to harder negatives yields up to 24% relative F1 gains on the MedHallu and HaluEval datasets.
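The easy-to-hard curriculum above reduces, at its core, to ordering the negative examples by a difficulty score before training. A minimal sketch, with an invented token-overlap score standing in for whatever difficulty signal (e.g. semantic similarity to the ground truth) an actual training run would use:

```python
def overlap(a: str, b: str) -> float:
    """Illustrative difficulty score: token overlap with the ground truth.

    Higher overlap means the negative is closer to the truth, hence harder.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def curriculum_order(samples, score):
    """Order (hallucinated_answer, ground_truth) pairs easiest-first."""
    return sorted(samples, key=lambda pair: score(*pair))

pairs = [
    ("the drug has no known mechanism",          "aspirin inhibits cox enzymes"),
    ("aspirin inhibits cox enzymes reversibly",  "aspirin inhibits cox enzymes"),
    ("aspirin activates cox enzymes",            "aspirin inhibits cox enzymes"),
]
for hallucination, _ in curriculum_order(pairs, overlap):
    print(hallucination)  # prints easiest negative first, hardest last
```

A trainer would then feed early batches from the front of this ordering and progressively mix in the harder tail, rather than sampling negatives uniformly from the start.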
A Comprehensive Benchmark for Detecting Medical Hallucinations ... (aclanthology.org) · 3 facts
claim: MedHallu is a benchmark designed for detecting medical hallucinations in large language models, consisting of 10,000 high-quality question-answer pairs derived from PubMedQA.
measurement: State-of-the-art large language models, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with the binary hallucination detection task in MedHallu, with the best model achieving an F1 score as low as 0.625 when detecting 'hard' category hallucinations.
procedure: The MedHallu benchmark generates hallucinated answers through a controlled pipeline to create a dataset for binary hallucination detection.
A Comprehensive Benchmark for Detecting Medical Hallucinations ... (researchgate.net) · 2 facts
claim: MedHallu is the first benchmark specifically designed for medical hallucination detection in large language models.
measurement: MedHallu comprises 10,000 samples for evaluating medical hallucination detection.
[PDF] MedHallu: A Comprehensive Benchmark for Detecting Medical ... (aclanthology.org, Nov 4, 2025) · 2 facts
MedHallu: A Comprehensive Benchmark for Detecting Medical ... (researchgate.net, Dec 5, 2025) · 1 fact
reference: The MedHallu research paper includes the prompt templates used for the hallucination detection experiments in Sections 2.5 and 4.4.
Unknown source · 1 fact
claim: General-purpose large language models outperform fine-tuned medical models in medical hallucination detection tasks, according to the evaluation conducted by the authors of the MedHallu benchmark.
[2502.14302] MedHallu: A Comprehensive Benchmark for Detecting ... (arxiv.org, Feb 20, 2025) · 1 fact
claim: Using bidirectional entailment clustering, the authors of the MedHallu paper demonstrated that harder-to-detect hallucinations are semantically closer to the ground truth.