Sources
EdinburghNLP/awesome-hallucination-detection (GitHub, github.com), 1 fact
Measurement: GPT-4 achieves an F1 score of approximately 0.625 in detecting subtle falsehoods on the hardest subset of the MedHallu benchmark.
MedHallu (GitHub, github.com), 1 fact
Measurement: State-of-the-art large language models, including GPT-4o, Llama-3.1, and UltraMedical, struggle with hard hallucination categories in the MedHallu benchmark, achieving a best F1 score of 0.625.
A Comprehensive Benchmark for Detecting Medical Hallucinations ... (aclanthology.org), 1 fact
Measurement: State-of-the-art large language models, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with the binary hallucination detection task in MedHallu, with the best model achieving an F1 score as low as 0.625 for detecting 'hard' category hallucinations.
[Literature Review] MedHallu: A Comprehensive Benchmark for ... (themoonlight.io), 1 fact
Claim: The MedHallu benchmark evaluates the effectiveness of general-purpose large language models, such as GPT-4o, Qwen, and Gemma, alongside medically fine-tuned models in detecting hallucinations.