concept

Multimodal Large Language Models

Also known as: MLLM, MLLMs, multi-modal large language models, multi-modal large language model

Facts (15)

Sources
EdinburghNLP/awesome-hallucination-detection (GitHub, github.com) - 9 facts
procedure: The iTaD decoding strategy is a plug-and-play method for multi-modal large language models that uses attention to image tokens to select layers and apply inter-layer contrastive decoding.
claim: MemVR significantly reduces hallucinations while preserving general capabilities across eight benchmarks and multiple MLLM architectures, including LLaVA-1.5, Qwen-VL, and GLM4V.
claim: The iTaD decoding strategy reduces hallucinations across multiple multi-modal large language models and benchmarks by amplifying image grounding when attention to the image drops.
reference: The UniHD framework is a unified system for detecting hallucinations in content produced by Multimodal Large Language Models (MLLMs).
claim: MemVR is a training-free decoding approach for Multimodal Large Language Models that reinjects visual tokens as key-value memory through the Feed-Forward Network when the model exhibits uncertainty during generation.
reference: The MHaluBench benchmark is a meta-evaluation dataset covering diverse hallucination categories and multimodal tasks for Multimodal Large Language Models (MLLMs).
claim: Compared with prompt engineering and supervised fine-tuning, reinforcement learning provides the most robust defense against modality conflict in Multimodal Large Language Models, training the model to prioritize visual evidence over misleading textual cues.
claim: Modality conflict, in which contradictions between visual and textual inputs trap Multimodal Large Language Models in a dilemma, is a primary driver of hallucinations.
reference: DeCo is a model-agnostic decoding method that adaptively mixes in earlier-layer representations to counteract language-prior suppression of visual evidence, reducing object hallucinations across Multimodal Large Language Models (MLLMs) with modest latency overhead.
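Several of the facts above describe inter-layer contrastive decoding (iTaD, DeCo, and related methods), in which the next-token distribution from the final layer is contrasted against an earlier layer's distribution. A minimal sketch of that core log-ratio contrast, assuming a simple two-layer comparison with illustrative toy logits (this is not the published iTaD or DeCo implementation, and iTaD's attention-based layer selection is only noted in a comment):

```python
import numpy as np

def softmax(x):
    # shift by the max for numerical stability
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def contrastive_decode(final_logits, early_logits, alpha=1.0):
    """Inter-layer contrastive decoding sketch: score tokens by how much
    the final layer favors them beyond a selected earlier layer.
    In iTaD, the earlier layer would be chosen via attention to image
    tokens (not modeled here)."""
    log_p_final = np.log(softmax(final_logits))
    log_p_early = np.log(softmax(early_logits))
    return log_p_final - alpha * log_p_early

# toy vocabulary of 3 tokens: the final layer favors token 0,
# the early (language-prior-dominated) layer favors token 2
final = np.array([2.0, 1.0, 0.5])
early = np.array([0.5, 1.0, 2.0])
scores = contrastive_decode(final, early)
print(scores.argmax())  # token 0 stands out after the contrast
```

The contrast amplifies exactly the evidence the final layer adds over the earlier one, which is why these methods can boost image grounding without retraining the model.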
Combining Knowledge Graphs and Large Language Models (arXiv, arxiv.org, Jul 9, 2024) - 3 facts
claim: Multimodal Large Language Models are built on LLM backbones and may inherit the same limitations as standard LLMs, suggesting they could benefit from incorporating knowledge graphs.
claim: Multimodal Large Language Models (MLLMs) have experienced a surge in interest since the start of 2023, with new models released monthly that can process audio, image, or video data alongside text.
claim: Multimodal Large Language Models, such as Google's Gemini and GPT-4 with vision (GPT-4V), possess vision capabilities.
A Survey on the Theory and Mechanism of Large Language Models (arXiv, arxiv.org, Mar 12, 2026) - 1 fact
reference: The paper 'Explainable and interpretable multimodal large language models: a comprehensive survey' is an arXiv preprint (arXiv:2412.02104).
Detecting and Evaluating Medical Hallucinations in Large Vision ... (arXiv, arxiv.org, Jun 14, 2024) - 1 fact
reference: Qinghao Ye et al. published 'mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration' as an arXiv preprint in 2023.
arXiv:2412.18947v4 [cs.CL] (arXiv, arxiv.org, Mar 28, 2025) - 1 fact
claim: MedHallBench is a comprehensive benchmark framework designed for evaluating and mitigating hallucinations in Multimodal Large Language Models (MLLMs).