Med-HALT
Also known as: MedHALT
Facts (16)
Sources
Medical Hallucination in Foundation Models and Their ... (medrxiv.org, Mar 3, 2025, 8 facts)
claim: Med-HALT is a framework designed to evaluate the multifaceted nature of medical hallucinations in Large Language Models by assessing both reasoning and memory-related inaccuracies.
procedure: The 'Base' evaluation method queries Large Language Models directly with Med-HALT benchmark questions, without additional context or instructions, to assess their inherent hallucination tendencies in a zero-shot setting.
claim: The Similarity Score in Med-HALT assesses the semantic similarity between a model's generated response and the ground truth answer, as well as between the response and the original question, using UMLSBERT and cosine similarity.
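The two cosine comparisons could be sketched as follows, assuming embeddings have already been produced by UMLSBERT (or any sentence encoder); the equal-weight averaging of the two similarities is an illustrative assumption, not a detail stated in the fact above:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_score(response_emb: np.ndarray,
                     answer_emb: np.ndarray,
                     question_emb: np.ndarray) -> float:
    # Med-HALT-style similarity: response vs. ground-truth answer and
    # response vs. original question. Averaging the two is an assumption.
    return 0.5 * (cosine_similarity(response_emb, answer_emb)
                  + cosine_similarity(response_emb, question_emb))
```

In practice the three embeddings would come from encoding the response, the reference answer, and the question with the same UMLSBERT model.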
reference: The Med-HALT benchmark categorizes its hallucination tests into Reasoning Hallucination Tests (RHTs) and Memory Hallucination Tests (MHTs); RHTs evaluate a Large Language Model's ability to reason accurately with medical information and generate logically sound, factually correct outputs without fabrication.
claim: Reasoning Hallucination Tests (RHTs) within the Med-HALT framework are divided into three categories: the False Confidence Test (FCT), the None of the Above (NOTA) Test, and the Fake Questions Test (FQT).
procedure: To ground Large Language Model responses in validated medical information, the authors used MedRAG to retrieve relevant medical knowledge from a knowledge graph for each Med-HALT question and concatenated this knowledge with the original question as input to the Large Language Model.
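The concatenation step could look like the sketch below. The function name and prompt template are hypothetical; the actual MedRAG retrieval step is omitted and its output is represented as a plain list of strings:

```python
def build_grounded_prompt(question: str, retrieved_facts: list[str]) -> str:
    # Concatenate retrieved medical knowledge with the original question,
    # mirroring the knowledge-grounding procedure described above.
    knowledge = "\n".join(f"- {fact}" for fact in retrieved_facts)
    return (
        "Relevant medical knowledge:\n"
        f"{knowledge}\n\n"
        f"Question: {question}"
    )
```

The grounded prompt is then sent to the model in place of the bare Med-HALT question.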
measurement: The Pointwise Score used in Med-HALT (Pal et al., 2023) evaluates model performance by calculating the average score across samples, where each correct prediction is awarded a positive score (Pc = +1) and each incorrect prediction incurs a negative penalty (Pw = −0.25).
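This scoring rule is simple enough to state directly in code; a minimal sketch, with Pc and Pw as defined above:

```python
def pointwise_score(predictions, ground_truth,
                    p_correct: float = 1.0, p_wrong: float = -0.25) -> float:
    """Average per-sample score: +1 (Pc) for a correct prediction,
    -0.25 (Pw) as a penalty for an incorrect one (Pal et al., 2023)."""
    scores = [p_correct if pred == gold else p_wrong
              for pred, gold in zip(predictions, ground_truth)]
    return sum(scores) / len(scores)
```

For example, 3 correct answers out of 4 yield (1 + 1 + 1 − 0.25) / 4 = 0.6875, so the penalty term pulls the score below plain accuracy (0.75) and makes random guessing costly.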
reference: The Med-HALT benchmark (Pal et al., 2023) is used to evaluate the effectiveness of various hallucination mitigation techniques on Large Language Models.
Medical Hallucination in Foundation Models and Their Impact on ... (medrxiv.org, Nov 2, 2025, 4 facts)
formula: The Hallucination Pointwise Score used in the Med-HALT benchmark is calculated as the average score across samples, where each correct prediction (Pc) is awarded a positive score of +1 and each incorrect prediction (Pw) incurs a negative penalty of −0.25.
claim: Quantitative metrics in the Med-HALT benchmark are complemented by qualitative analysis from physician annotators, who specifically assess the clinical risk associated with each hallucination.
procedure: The authors evaluated the effectiveness of hallucination mitigation techniques on Large Language Models using the Med-HALT benchmark by sampling 50 examples from each of seven medical reasoning tasks, totaling 350 cases.
claim: The Pointwise and Similarity Scores in the Med-HALT benchmark do not directly capture clinical safety or potential for patient harm, as an output could be semantically similar to the reference yet clinically inappropriate or missing critical warnings.
EdinburghNLP/awesome-hallucination-detection (github.com, 2 facts)
A framework to assess clinical safety and hallucination rates of LLMs ... (nature.com, May 13, 2025, 2 facts)
claim: The MedHALT benchmark is limited to assessing the reasoning capabilities of Large Language Models over the medical domain in a Question Answering (QA) format.
reference: Med-HALT is a medical domain hallucination test designed for large language models, introduced by Pal, Umapathi, and Sankarasubbu in 2023.