claim
The MedHallu benchmark evaluates how effectively general-purpose large language models (such as GPT-4o, Qwen, and Gemma) and medically fine-tuned models detect medical hallucinations.
Authors
Sources
- [Literature Review] MedHallu: A Comprehensive Benchmark for ... www.themoonlight.io via serper
Referenced by nodes (3)
- Large Language Models concept
- GPT-4 concept
- MedHallu concept