Relations (1)

related (score 2.32) — strongly supported by 4 facts

GPT-4 is evaluated on the MedHallu benchmark to assess its ability to detect medical hallucinations, as evidenced by its reported performance metrics [1] and by its inclusion in studies comparing general-purpose models on the benchmark [2], [3], [4].
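The facts below all report a best F1 score of 0.625 on MedHallu's binary hallucination-detection task. As a reminder of what that number means, here is a minimal sketch of the F1 computation for binary labels (the label convention 1 = "hallucinated", 0 = "faithful" is an assumption for illustration, not taken from MedHallu):

```python
def f1_score(y_true, y_pred):
    """F1 for binary detection: harmonic mean of precision and recall.

    Labels (assumed convention): 1 = hallucinated, 0 = faithful.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 2 hallucinated and 2 faithful answers, detector gets one of each wrong.
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))  # → 0.5
```

An F1 of 0.625 on the "hard" subset thus means the best models still miss or mislabel a substantial share of subtle hallucinations.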

Facts (4)

Sources
EdinburghNLP/awesome-hallucination-detection — GitHub (github.com) · 1 fact
measurement: GPT-4 achieves an F1-score of approximately 0.625 in detecting subtle falsehoods on the hardest subset of the MedHallu benchmark.
MedHallu — GitHub (github.com) · 1 fact
measurement: State-of-the-art large language models, including GPT-4o, Llama-3.1, and UltraMedical, struggle with hard hallucination categories in the MedHallu benchmark, achieving a best F1 score of 0.625.
A Comprehensive Benchmark for Detecting Medical Hallucinations ... — Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding · ACL Anthology (aclanthology.org) · 1 fact
measurement: State-of-the-art large language models, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with the binary hallucination-detection task in MedHallu, with the best model achieving an F1 score as low as 0.625 for detecting 'hard' category hallucinations.
[Literature Review] MedHallu: A Comprehensive Benchmark for ... — The Moonlight (themoonlight.io) · 1 fact
claim: The MedHallu benchmark evaluates the effectiveness of general-purpose large language models, such as GPT-4o, Qwen, and Gemma, alongside medically fine-tuned models in detecting hallucinations.