measurement
State-of-the-art large language models, including GPT-4o, Llama-3.1, and UltraMedical, struggle to detect hallucinations in the "hard" category of the MedHallu benchmark: the best model reaches an F1 score of only 0.625.
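For context on the reported metric, here is a minimal sketch of how an F1 score is computed for a binary detection task, assuming label 1 means "hallucinated" and 0 means "faithful". The function name `f1_score` and the toy labels are illustrative assumptions, not taken from the MedHallu code or data.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: define F1 as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up toy labels (hypothetical, not benchmark data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.3f}")
```

Because F1 is a harmonic mean, a detector that misses many hallucinated answers scores low even when most of its positive predictions are correct.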
Authors
Sources
- MedHallu - GitHub (github.com)
Referenced by nodes (4)
- Large Language Models concept
- GPT-4 concept
- LLaMA concept
- MedHallu concept