Relations (1)
related 0.70 — strongly supported by 7 facts
MedHallu is a benchmark designed to evaluate hallucination in large language models, defined as the production of plausible but factually incorrect information [1]. The benchmark categorizes and measures these hallucinations across varying levels of difficulty to assess detection capabilities in medical applications {fact:2, fact:4, fact:7}.
Facts (7)
Sources
[Literature Review] MedHallu: A Comprehensive Benchmark for ... — themoonlight.io — 4 facts
claim: The MedHallu benchmark provides a framework for evaluating hallucination prevalence and detection capabilities in medical applications of large language models, emphasizing the need for human oversight to protect patient safety.
claim: The MedHallu dataset is stratified into three difficulty levels—easy, medium, and hard—based on the subtlety of the hallucinations it contains.
claim: The MedHallu benchmark defines hallucination in large language models as instances where a model produces information that is plausible but factually incorrect.
claim: The MedHallu study observes that detection difficulty varies by hallucination type, with 'Incomplete Information' identified as a particularly challenging category for large language models.
MedHallu - GitHub — github.com — 2 facts
[2502.14302] MedHallu: A Comprehensive Benchmark for Detecting ... — arxiv.org — 1 fact
claim: Using bidirectional entailment clustering, the authors of the MedHallu paper demonstrated that harder-to-detect hallucinations are semantically closer to the ground truth.
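The bidirectional entailment clustering named in the last claim can be sketched as follows: two answers land in the same cluster only when each entails the other. In the paper this judgment would come from an NLI model; the `entails` function below is a toy word-overlap stand-in used purely for illustration, not the authors' implementation.

```python
def entails(a: str, b: str) -> bool:
    # Toy stand-in for an NLI entailment check (assumption, not MedHallu's
    # method): a entails b if every word of b appears in a.
    return set(b.lower().split()) <= set(a.lower().split())

def bidirectional_entailment_clusters(answers: list[str]) -> list[list[str]]:
    # Greedy clustering: an answer joins an existing cluster only if it and
    # the cluster's representative entail each other; otherwise it starts
    # a new cluster.
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters
```

With a real entailment model in place of the heuristic, cluster membership captures mutual semantic equivalence, which is how closeness to the ground-truth answer can be assessed.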