Relations (1)
related (strength 12.00): strongly supported by 12 facts
MedHallu is a benchmark specifically designed to evaluate how well Large Language Models detect medical hallucinations, as established in [1] and [2]. The benchmark uses these models as test subjects, comparing the performance of general-purpose and medically fine-tuned variants across hallucination detection tasks [3], [4], [5].
Facts (12)
Sources
[Literature Review] MedHallu: A Comprehensive Benchmark for ... (themoonlight.io, 5 facts)
claim: The MedHallu benchmark provides a framework for evaluating hallucination prevalence and detection capabilities in medical applications of large language models, emphasizing the need for human oversight for patient safety.
claim: The MedHallu benchmark defines hallucination in large language models as instances where a model produces information that is plausible but factually incorrect.
claim: The MedHallu study observes that detection difficulty varies by hallucination type, with 'Incomplete Information' identified as a particularly challenging category for large language models.
claim: General-purpose large language models often outperform specialized medical models in hallucination detection tasks, according to experiments conducted for the MedHallu benchmark.
claim: The MedHallu benchmark evaluates the effectiveness of general-purpose large language models, such as GPT-4o, Qwen, and Gemma, alongside medically fine-tuned models in detecting hallucinations.
MedHallu - GitHub (github.com, 3 facts)
measurement: State-of-the-art Large Language Models, including GPT-4o, Llama-3.1, and UltraMedical, struggle with hard hallucination categories in the MedHallu benchmark, achieving a best F1 score of 0.625.
claim: General-purpose Large Language Models outperform medically fine-tuned Large Language Models when provided with domain knowledge, according to findings from the MedHallu benchmark study.
measurement: Adding a 'not sure' response option for Large Language Models improves hallucination detection precision by up to 38% in the MedHallu benchmark.
A Comprehensive Benchmark for Detecting Medical Hallucinations ... (aclanthology.org, 2 facts)
claim: MedHallu is a benchmark designed for detecting medical hallucinations in large language models, consisting of 10,000 high-quality question-answer pairs derived from PubMedQA.
measurement: State-of-the-art large language models, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with the binary hallucination detection task in MedHallu, with the best model achieving an F1 score as low as 0.625 for detecting 'hard' category hallucinations.
A Comprehensive Benchmark for Detecting Medical Hallucinations ... (researchgate.net, 1 fact)
claim: MedHallu is the first benchmark specifically designed for medical hallucination detection in large language models.
Unknown source (1 fact)
claim: General-purpose Large Language Models outperform fine-tuned medical Large Language Models in medical hallucination detection tasks, according to the evaluation conducted by the authors of the MedHallu benchmark.
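The measurements above refer to F1 scores on a binary hallucination detection task and to precision gains from a 'not sure' abstention option. As a minimal sketch (not the MedHallu evaluation code; labels, inputs, and the treatment of abstentions are illustrative assumptions), these metrics can be computed as follows, counting 'not sure' responses as unanswered rather than as errors:

```python
def precision_recall_f1(labels, preds):
    """Precision/recall/F1 for the positive (hallucinated) class.

    labels: 1 = hallucinated answer, 0 = factual answer.
    preds:  1, 0, or None, where None means the model answered 'not sure'.
    Abstentions are excluded from scoring, so abstaining on hard cases
    can raise precision over the answered subset.
    """
    answered = [(y, p) for y, p in zip(labels, preds) if p is not None]
    tp = sum(1 for y, p in answered if y == 1 and p == 1)
    fp = sum(1 for y, p in answered if y == 0 and p == 1)
    fn = sum(1 for y, p in answered if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy data: six question-answer pairs, one abstention.
labels = [1, 1, 0, 0, 1, 0]
preds  = [1, None, 1, 0, 1, 0]
print(precision_recall_f1(labels, preds))
```

Because only answered examples are scored, a model that abstains on its least confident predictions trades coverage for precision, which is one plausible mechanism behind the reported precision improvement from the 'not sure' option.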