measurement
The best-performing model on the MedHallu benchmark achieved an F1 score as low as 0.625 when detecting 'hard'-category hallucinations.
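For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score is computed for a binary detection task, treating "hallucinated" as the positive class. The function name `f1_score` and the counts below are hypothetical and chosen only for illustration; they are not taken from MedHallu and do not reproduce its evaluation code.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive ("hallucinated") class from confusion counts.

    tp: hallucinated answers correctly flagged
    fp: faithful answers wrongly flagged as hallucinated
    fn: hallucinated answers that were missed
    """
    # Precision: share of flagged answers that really were hallucinated.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of hallucinated answers that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (assumption, not MedHallu data):
# 5 hits, 3 false alarms, 3 misses.
example = f1_score(tp=5, fp=3, fn=3)
print(example)  # 0.625
```

With these made-up counts, precision and recall are both 5/8, so the harmonic mean is also 0.625. A score at this level means that a substantial share of predictions are either false alarms or misses.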
Sources
- [2502.14302] MedHallu: A Comprehensive Benchmark for Detecting ... arxiv.org
- [Literature Review] MedHallu: A Comprehensive Benchmark for ... www.themoonlight.io
Referenced by nodes (2)
- hallucination concept
- F1 score concept