measurement
State-of-the-art large language models, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with the binary hallucination detection task in MedHallu: even the best-performing model reaches an F1 score of only 0.625 when detecting 'hard' category hallucinations.
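For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score is computed for a binary detection task. The labels are invented for illustration and are not MedHallu data; the function name `f1_score` and the convention that 1 means "hallucinated" (the positive class) are assumptions of this sketch, not details confirmed by the source.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = hallucinated answer, 0 = faithful answer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# tp=2, fp=1, fn=2 gives precision 2/3, recall 1/2, F1 = 4/7 (about 0.571).
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a detector that misses many hallucinations (low recall) scores poorly even when most of its positive predictions are correct.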
Authors
Sources
- A Comprehensive Benchmark for Detecting Medical Hallucinations ... aclanthology.org via serper
- [2502.14302] MedHallu: A Comprehensive Benchmark for Detecting ... arxiv.org via serper
Referenced by nodes (4)
- Large Language Models concept
- GPT-4 concept
- LLaMA concept
- MedHallu concept