Fact — measurement — Knowledge Tree

State-of-the-art large language models, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with the binary hallucination detection task in MedHallu: even the best model reaches an F1 score of only 0.625 when detecting 'hard' category hallucinations.
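For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score for a binary detection task can be computed. It assumes the hallucinated class is treated as the positive label (1); the source does not state which class or averaging scheme was used. The function name `binary_f1` and the toy labels are hypothetical and are not MedHallu data.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only (1 = hallucinated, 0 = faithful).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
score = binary_f1(y_true, y_pred)
print(round(score, 3))
```

In this toy case the detector finds two of four hallucinations and raises one false alarm, so precision is 2/3 and recall is 1/2. A score near 0.6 therefore means that a substantial share of hallucinations is missed, that faithful answers are wrongly flagged, or both.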

Authors

Person: Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding
Organization: ACL Anthology
A Comprehensive Benchmark for Detecting Medical Hallucinations ...

Person: Not available
Organization: arXiv
[2502.14302] MedHallu: A Comprehensive Benchmark for Detecting ...

Sources

A Comprehensive Benchmark for Detecting Medical Hallucinations ... · aclanthology.org · Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding · ACL Anthology
[2502.14302] MedHallu: A Comprehensive Benchmark for Detecting ... · arxiv.org · arXiv

Referenced by nodes (4)

Large Language Models concept
GPT-4 concept
LLaMA concept
MedHallu concept