Large Vision-Language Models
Also known as: LVLMs, Large multimodal vision-language models, Large Vision-Language Model
Facts (54)
Sources
Detecting and Evaluating Medical Hallucinations in Large Vision ... arxiv.org Jun 14, 2024 36 facts
claim: Med-HallMark includes multifaceted hallucination data across three dimensions: ground truth (GT) standards, Large Vision-Language Model (LVLM) outputs for prompts, and fine-grained annotations of LVLM-generated content detailing both the type of hallucination and its correctness.
reference: MediHallDetector is a medical Large Vision-Language Model engineered for precise hallucination detection, trained with a multitask objective.
reference: The paper 'Visual instruction tuning' by Haotian Liu, Chunyuan Li, and colleagues, published as an arXiv preprint in 2023, introduces the concept of visual instruction tuning for large vision-language models.
claim: The medical domain currently lacks specific methods and benchmarks for detecting hallucinations in Large Vision-Language Models (LVLMs), which hinders the development of medical capabilities in these models.
measurement: Even with current state-of-the-art Large Vision-Language Models, at least 30% of generated text is hallucinatory, appearing as nonexistent objects, unfaithful descriptions, and inaccurate relationships.
procedure: In coarse-grained multi-dimension Image-Report Generation (IRG) scenarios, Large Vision-Language Model (LVLM) outputs are segmented into sentences and annotated at the sentence level.
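The sentence-level annotation procedure above can be sketched as a small pipeline. This is a minimal illustration only: the regex sentence splitter, the category names, and the `annotate_report` helper are assumptions, not Med-HallMark's actual tooling.

```python
import re

# Hypothetical label set (illustrative; not Med-HallMark's exact categories)
CATEGORIES = {"correct", "object_hallucination",
              "attribute_hallucination", "relation_hallucination"}

def split_sentences(report: str) -> list[str]:
    """Naively segment an LVLM-generated report into sentences."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", report.strip()) if s.strip()]

def annotate_report(report: str, label_fn) -> list[tuple[str, str]]:
    """Label each sentence of a generated report with one hallucination category."""
    annotations = []
    for sentence in split_sentences(report):
        label = label_fn(sentence)
        assert label in CATEGORIES, f"unknown category: {label}"
        annotations.append((sentence, label))
    return annotations

# A stub annotator standing in for a human annotator or a detector model
report = "The heart size is normal. A mass is seen in the left lung."
annotated = annotate_report(report, lambda s: "correct")
```

In practice `label_fn` would be a trained detector (or a human annotator) rather than a constant stub.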
claim: Hallucination in Large Vision-Language Models (LVLMs) is defined as the generation of descriptions that are inconsistent with relevant images and user instructions, containing incorrect objects, attributes, and relationships related to the visual input.
claim: Large Vision-Language Models (LVLMs) are increasingly used in healthcare applications, such as medical visual question answering and imaging report generation.
procedure: In Med-VQA tasks, the MediHall Score assesses the entire answer provided by a Large Vision-Language Model (LVLM) to determine the hallucination category and calculate a score.
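A categorize-then-score procedure of this kind can be sketched as follows. The category names and numeric weights here are illustrative assumptions, not the hierarchy or values defined in the paper.

```python
# Hypothetical severity-ordered weights: milder hallucinations keep more credit.
# Names and values are invented for this sketch.
CATEGORY_SCORES = {
    "correct": 1.0,
    "minor_hallucination": 0.5,
    "prompt_induced_hallucination": 0.25,
    "catastrophic_hallucination": 0.0,
}

def medihall_style_score(labels: list[str]) -> float:
    """Average per-answer (or per-sentence) category weights into one scalar metric."""
    if not labels:
        raise ValueError("no labels to score")
    return sum(CATEGORY_SCORES[label] for label in labels) / len(labels)
```

Averaging severity-weighted categories yields a single number per model that, unlike raw accuracy, distinguishes a mild slip from a catastrophic fabrication.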
procedure: In fine-grained single-dimension Visual Question Answering (VQA) scenarios, each Large Vision-Language Model (LVLM) response is labeled with a single hallucination category.
claim: Large Vision-Language Models (LVLMs) inherit susceptibility to hallucinations from Large Language Models (LLMs), which poses significant risks in high-stakes medical contexts.
reference: The paper 'Evaluating object hallucination in large vision-language models' by Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen, published as an arXiv preprint in 2023, focuses on the evaluation of object hallucination within large vision-language models.
measurement: In ID1 and ID2 scenarios where Large Vision-Language Model (LVLM) answers are entirely correct, BERTScore values are 66.73% and 46.11% respectively, indicating a significant and unwarranted disparity.
reference: Yiyang Zhou et al. published 'Analyzing and mitigating object hallucination in large vision-language models' as an arXiv preprint in 2023.
claim: The traditional medical image-text task data used to train MediHallDetector is sourced from the SLAKE, VQA-RAD, MIMIC-Test, and OpenI datasets, which helps adapt general Large Vision-Language Models (LVLMs) to the medical domain.
claim: Traditional Natural Language Processing (NLP) metrics like METEOR and BLEU fail to reflect the factual correctness of Large Vision-Language Model outputs because they only measure shallow similarities to the ground truth.
claim: Large Vision-Language Models (LVLMs) show insignificant differences in attribute hallucinations and have similar error boundaries, indicating difficulty in correctly judging or describing the size, shape, or number of organs and pathologies.
claim: The METEOR metric fails to directly reflect whether a Large Vision-Language Model's answer aligns with the ground truth, regardless of whether the answer is correct or hallucinatory.
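The failure mode of shallow-overlap metrics can be demonstrated with a toy token-overlap score, a stand-in for BLEU/METEOR rather than their real implementations: a hallucinatory answer that reuses the reference's wording outscores a correct but differently worded one.

```python
def token_overlap(candidate: str, reference: str) -> float:
    """Toy shallow-similarity metric: fraction of reference tokens found in the candidate."""
    cand = set(candidate.lower().split())
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return sum(tok in cand for tok in ref) / len(ref)

reference = "there is no mass in the left lung"
correct_answer = "the left lung appears clear"      # factually right, low overlap
hallucinated = "there is a mass in the left lung"   # factually wrong, high overlap

assert token_overlap(hallucinated, reference) > token_overlap(correct_answer, reference)
```

The hallucinated answer shares 7 of 8 reference tokens (0.875) while the correct answer shares only 3 of 8 (0.375), which is exactly why surface-similarity metrics cannot certify factual correctness.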
reference: The paper 'Evaluation and analysis of hallucination in large vision-language models' by Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, and colleagues provides an evaluation and analysis of hallucination in large vision-language models.
claim: When evaluating hallucination detection capabilities, GPT-4V and GPT-4o followed instructions well but incorrectly classified hallucination types in Large Vision-Language Model (LVLM) outputs, failing to recognize their errors even when prompted to explain their classifications.
claim: Evaluation of medical capabilities in existing Large Vision-Language Models (LVLMs) is unreliable because it relies on outdated benchmarks that suffer from data leakage during pre-training.
claim: The authors of the paper 'Detecting and Evaluating Medical Hallucinations in Large Vision Language Models' introduced Med-HallMark, the first benchmark dedicated to hallucination detection in the medical domain, and provided baseline performance metrics for various Large Vision-Language Models (LVLMs).
claim: Most Large Vision-Language Models (LVLMs) fail to fully understand medical images from all dimensions, often ignoring information in questions that is irrelevant to the image facts.
claim: Accuracy metrics for Large Vision-Language Models evaluate at a coarse semantic level and cannot distinguish between different degrees of hallucination in the output.
claim: General Large Vision-Language Models (LVLMs) typically categorize hallucinations into three types: object hallucinations, attribute hallucinations, and relational hallucinations.
reference: Xintong Wang et al. published 'Mitigating hallucinations in large vision-language models with instruction contrastive decoding' as an arXiv preprint in 2024.
claim: The authors developed the MediHall Score, an evaluation metric for the medical domain that calculates the hallucination score of Large Vision-Language Model outputs through hierarchical categorization, providing a numerical representation of the rationality of medical texts.
claim: The authors propose solutions for evaluating medical Large Vision-Language Models (LVLMs) across three dimensions: data, evaluation metrics, and detection methods.
claim: In Large Vision-Language Models, the hallucination phenomenon is exacerbated by factors including weak visual feature extraction, misalignment of multimodal features, and the incorporation of extraneous information.
claim: The authors of 'Detecting and Evaluating Medical Hallucinations in Large Vision ...' intend to track open-source contributions and evaluate the latest Large Vision-Language Models (LVLMs) on the Med-HallMark dataset across various metrics.
claim: Traditional medical datasets are difficult to use for evaluating Large Vision-Language Models (LVLMs) because they contain short answers or unstructured image reports, whereas LVLM outputs are typically well-ordered long texts.
reference: Wenyi Xiao et al. published 'Detecting and mitigating hallucination in large vision language models via fine-grained AI feedback' as an arXiv preprint in 2024.
claim: The authors of 'Detecting and Evaluating Medical Hallucinations in Large Vision Language Models' propose a novel benchmark, evaluation metrics, and a detection model specifically designed for the medical domain to address hallucination detection and evaluation challenges in Large Vision-Language Models (LVLMs).
claim: Nearly all models show prompt-induced hallucinations close to or exceeding the number of catastrophic hallucinations when presented with counterfactual questions, indicating that Large Vision-Language Models (LVLMs) are highly vulnerable to such attacks.
reference: The MediHall Score is a medical evaluation metric designed to assess Large Vision-Language Models' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, enabling granular assessment of clinical impact.
reference: Hallucination detection methods for Large Vision-Language Models fall into two groups: approaches based on off-the-shelf tools (closed-source LLMs or visual tools) and training-based models (which learn to detect hallucinations incrementally from feedback).
EdinburghNLP/awesome-hallucination-detection - GitHub github.com 9 facts
procedure: Visual evidence prompting is a method that injects outputs from object-detection and scene-graph models as structured prompts to reduce object and relation hallucinations in large vision-language models.
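Serializing detector outputs into a prompt prefix might look like the sketch below. The prompt template, detection format, and `build_visual_evidence_prompt` helper are assumptions for illustration; the cited method defines its own structured format.

```python
def build_visual_evidence_prompt(question: str,
                                 detections: list[dict],
                                 relations: list[tuple[str, str, str]]) -> str:
    """Serialize object-detection and scene-graph outputs into a textual prompt prefix."""
    object_lines = [f"- {d['label']} (confidence {d['score']:.2f})" for d in detections]
    relation_lines = [f"- {s} {p} {o}" for s, p, o in relations]
    return (
        "Detected objects:\n" + "\n".join(object_lines) + "\n"
        "Detected relations:\n" + "\n".join(relation_lines) + "\n"
        f"Question: {question}\n"
        "Answer using only the evidence above."
    )

prompt = build_visual_evidence_prompt(
    "What is on the table?",
    [{"label": "table", "score": 0.97}, {"label": "cup", "score": 0.88}],
    [("cup", "on", "table")],
)
```

Grounding the prompt in explicit detections gives the model concrete objects and relations to condition on, which is the mechanism by which this family of methods suppresses invented objects and relations.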
reference: HSA-DPO (Hallucination Severity-Aware Direct Preference Optimization) is a method that uses fine-grained AI feedback to label hallucination severity and prioritize critical errors during the training of large vision-language models.
reference: The HallusionBench benchmark evaluates Large Vision-Language Models (LVLMs) such as GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5 by emphasizing nuanced understanding and interpretation of visual data.
measurement: The Pelican framework reduces hallucination rates by approximately 8–32% across various Large Vision-Language Models (LVLMs) and performs 27% better than prior mitigation methods while maintaining or improving factual accuracy in following visual instructions.
claim: Research on VISTA reveals three phenomena during Large Vision-Language Model generation: gradual visual information loss, early excitation of semantically meaningful tokens, and hidden genuine information in vocabulary rankings.
procedure: A training-free, head-level intervention framework for Large Vision-Language Models identifies critical hallucination heads across causal pathways and applies targeted corrections for yes/no, multiple-choice, and open-ended question-answer formats.
claim: VISTA is a training-free inference-time framework that combats hallucination in Large Vision-Language Models (LVLMs) by steering visual information in the activation space.
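As a toy numerical illustration of steering in activation space (not VISTA's actual algorithm; the vectors and the `alpha` scaling factor are invented for this sketch), adding a scaled "visual direction" to a hidden state moves it closer, in cosine terms, to the visual evidence:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def steer_hidden_state(hidden, visual_direction, alpha=0.5):
    """Shift a hidden state toward the unit-normalized visual direction by alpha."""
    n = norm(visual_direction)
    return [h + alpha * v / n for h, v in zip(hidden, visual_direction)]

# Toy 4-dimensional "hidden state" and "visual direction" (invented values)
hidden = [0.2, -1.1, 0.7, 0.4]
visual = [1.0, 0.5, -0.3, 0.8]
steered = steer_hidden_state(hidden, visual, alpha=1.0)
```

For any alpha > 0 this strictly increases the cosine similarity between the hidden state and the visual direction (provided the two are not already parallel), which is the intuition behind inference-time steering approaches.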
claim: Large Vision-Language Model (LVLM) hallucinations originate from three interacting causal pathways: image-to-input-text, image-to-output-text, and text-to-text.
claim: GLSim outperforms competitive baselines across multiple Large Vision-Language Models (LLaVA-1.5, MiniGPT-4, Shikra, InstructBLIP, Qwen2.5-VL) without requiring external supervision or judge models.
Awesome-Hallucination-Detection-and-Mitigation - GitHub github.com 6 facts
reference: The paper 'ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Model' by Wan et al. (2025) proposes a one-layer intervention method for LVLMs.
reference: The paper 'Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing' by Wang et al. (2025) proposes a latent editing method for discrete tokenizer-based LVLMs.
reference: The paper 'Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection' by Yang et al. (2025) proposes a projection method to mitigate object hallucinations in LVLMs.
reference: The paper 'V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization' by Yang et al. (2024) presents a method to mitigate hallucinations in large vision-language models using vision-guided direct preference optimization.
reference: The paper 'Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization' by Wu et al. (2025) proposes an entity-centric optimization method for LVLMs.
reference: The paper 'Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key' by Yang et al. (2025) argues that on-policy data are critical for mitigating hallucinations in large vision-language models when using direct preference optimization.
Understanding LLM Understanding skywritingspress.ca Jun 14, 2024 1 fact
claim: Large multimodal vision-language models, including GPT-4V, struggle with counting objects in images and identifying fine-grained differences between similar images, and lack sufficient visual grounding.
On Hallucinations in Artificial Intelligence–Generated Content ... jnm.snmjournals.org 1 fact
claim: Automatic hallucination detectors trained on benchmark datasets are being explored for large vision-language models to reduce the burden of human evaluation.
Medical Hallucination in Foundation Models and Their ... medrxiv.org Mar 3, 2025 1 fact
claim: Foundation models, including Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), are used in healthcare for clinical decision support, medical research, and improving healthcare quality and safety.