claim
The MedHallu benchmark evaluates how well general-purpose large language models (such as GPT-4o, Qwen, and Gemma) and medically fine-tuned models detect hallucinations.
