Fact — measurement — Knowledge Tree

State-of-the-art large language models, including GPT-4o, Llama-3.1, and UltraMedical, struggle to detect the "hard" category of hallucinations in the MedHallu benchmark: the best-performing model reaches an F1 score of only 0.625.

Authors

Person: Not available
Organization: GitHub
MedHallu - GitHub

Sources

MedHallu - GitHub (github.com)

Referenced by nodes (4)

Large Language Models (concept)
GPT-4 (concept)
LLaMA (concept)
MedHallu (concept)