concept

MediHall Score

Facts (18)

Sources
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models (arxiv.org, arXiv, Jun 14, 2024), 17 facts
claim: The MediHall Score assigns hallucination scores based on six categories: Catastrophic Hallucinations, Critical Hallucinations, Attribute Hallucinations, Prompt-induced Hallucinations, Minor Hallucinations, and Correct Statements.
procedure: In Med-VQA tasks, the MediHall Score assesses the entire answer produced by a Large Vision-Language Model (LVLM), determines its hallucination category, and calculates a score.
measurement: Under conventional and confidence-weakening questions, an LVLM response that aligns perfectly with the facts receives a MediHall Score of 1.
measurement: The Med-HallMark benchmark evaluates AI models on hallucination detection using the MediHall Score alongside traditional metrics, including BertScore, METEOR, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU.
claim: The authors of the paper 'Detecting and Evaluating Medical Hallucinations in Large Vision Language Models' presented the MediHall Score, a new hallucination evaluation metric, and demonstrated its effectiveness relative to traditional metrics through qualitative and quantitative analysis.
measurement: MiniGPT4 exhibited the highest MediHall Score, 0.88.
formula: In coarse-grained IRG scenarios, the MediHall Score for an individual report is the average of the hallucination scores of all sentences in that report: Score = (1/N) * sum(S_i), where N is the number of sentences and S_i is the hallucination score of sentence i.
procedure: In IRG tasks, the MediHall Score evaluates hallucinations at the sentence level and aggregates the individual sentence scores into an overall score for the entire response.
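The sentence-level aggregation described for IRG tasks can be sketched in a few lines of Python. This is a minimal illustration of the stated formula Score = (1/N) * sum(S_i); the function name `medihall_report_score` and the example sentence scores are assumptions for illustration, not the paper's implementation:

```python
def medihall_report_score(sentence_scores):
    """Aggregate per-sentence hallucination scores S_i into a
    report-level MediHall Score: Score = (1/N) * sum(S_i)."""
    if not sentence_scores:
        raise ValueError("report must contain at least one scored sentence")
    return sum(sentence_scores) / len(sentence_scores)

# Illustrative report of three sentences: two correct statements (1.0)
# and one prompt-induced hallucination (0.6), per the scores given in
# the facts above.
report_score = medihall_report_score([1.0, 1.0, 0.6])
```

For the example above, the report-level score is (1.0 + 1.0 + 0.6) / 3 ≈ 0.867.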
claim: The authors developed the MediHall Score, an evaluation metric for the medical domain that computes a hallucination score for Large Vision-Language Model outputs through hierarchical categorization, providing a numerical measure of the rationality of medical texts.
claim: The LLaVA-Med series, BLIP2, and RadFM cannot produce a computable MediHall Score on the IRG task because their generation formats are unsuited to report generation scenarios with contextual reasoning properties.
claim: Experimental evaluations indicate that the MediHall Score provides a more nuanced understanding of hallucination impacts than traditional metrics.
measurement: The LLaVA-Med series achieved MediHall Scores of 0.61, 0.57, and 0.69 on the Med-VQA task, indicating more correct content and fewer hallucinations.
claim: The MediHall Score is a metric for evaluating medical text hallucinations that operates across two evaluation scenarios: Med-VQA tasks and IRG tasks.
reference: The MediHall Score is a medical evaluation metric designed to assess Large Vision-Language Models' hallucinations through a hierarchical scoring system that weighs the severity and type of hallucination, enabling granular assessment of clinical impact.
procedure: In multi-dimensional IRG scenarios, the MediHall Score evaluates sentence-level hallucinations and aggregates these scores to derive the final score for the response.
procedure: The MediHall Score is computed from hallucination detection models that classify hallucination levels against image facts and textual annotations, with the calculation method varying by scenario.
measurement: For counterfactual questions, an LVLM response that is not comprehensive due to the question's inherent confusion is categorized as a prompt-induced hallucination and receives a MediHall Score of 0.6.
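The category-based scoring in the facts above can be sketched as a lookup table. Only two values are given by the source (Correct Statement = 1.0, Prompt-induced Hallucination = 0.6); every other value below is a hypothetical placeholder chosen only to reflect the severity ordering implied by the category names, and the dictionary keys are assumed names:

```python
# Scores marked "source" come from the facts above; scores marked
# "hypothetical" are illustrative placeholders, NOT the paper's weights.
MEDIHALL_CATEGORY_SCORES = {
    "correct_statement": 1.0,              # source
    "prompt_induced_hallucination": 0.6,   # source
    "minor_hallucination": 0.8,            # hypothetical
    "attribute_hallucination": 0.4,        # hypothetical
    "critical_hallucination": 0.2,         # hypothetical
    "catastrophic_hallucination": 0.0,     # hypothetical
}

def score_sentence(category: str) -> float:
    """Map a detected hallucination category to its sentence score S_i."""
    return MEDIHALL_CATEGORY_SCORES[category]
```

In the Med-VQA setting the whole answer receives one such category score; in IRG settings each sentence is scored this way and the results are averaged into the report-level score.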
Medical Hallucination in Foundation Models and Their ... (medrxiv.org, medRxiv, Mar 3, 2025), 1 fact
measurement: The Chain-of-Medical-Thought (CoMT) approach reduced catastrophic hallucinations by 38% compared to conventional report generation methods in chest X-ray and CT scan interpretation, as measured by the MediHall Score metric.