Model Variability
Also known as: Model Variability (MV)
Facts (14)
Sources
Survey and analysis of hallucinations in large language models — frontiersin.org, Sep 29, 2025 (14 facts)
Claim: Mistral shows balanced behavior across the dimensions of Factuality, Coherence, Prompt Sensitivity, Model Variability, and Usability, indicating a mixed attribution of hallucination sources.
Claim: Mixed-origin models, such as Mistral 7B and OpenChat-3.5, display moderate Prompt Sensitivity (PS) and Model Variability (MV) scores, indicating that prompt and model factors contribute roughly equally to hallucinations.
Reference: Research directions for hallucination evaluation include the development of integrated, multi-task, multilingual benchmarks with unified annotation schemas (Liu et al., 2023) and the use of attribution-aware metrics incorporating Prompt Sensitivity (PS) and Model Variability (MV).
Claim: The radar plot in Figure 4 of 'Survey and analysis of hallucinations in large language models' visualizes the comparative performance of DeepSeek, Mistral, and LLaMA 2 across five behavioral dimensions: Factuality, Coherence, Prompt Sensitivity, Model Variability, and Usability.
Claim: The authors of 'Survey and analysis of hallucinations in large language models' define Prompt Sensitivity (PS) and Model Variability (MV) as metrics that quantify the contribution of prompts versus model-internal factors to hallucinations.
Formula: Model Variability (MV) is defined as the variation in hallucination frequency across different models when they are given the same prompt.
Claim: Model Variability (MV) measures the difference in hallucination rates across different models for a fixed prompt; a high MV indicates that hallucinations are primarily model-intrinsic.
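The MV definition above (variation in hallucination rates across models for one fixed prompt) can be sketched in a few lines. The survey does not specify its spread estimator, so the use of population standard deviation here is an assumption for illustration:

```python
from statistics import pstdev

def model_variability(hallucination_rates):
    """Model Variability (MV) for one fixed prompt: the spread of
    hallucination rates observed across different models.
    NOTE: standard deviation is an assumed estimator of 'variation';
    the survey's exact formula may differ."""
    return pstdev(hallucination_rates)

# Hypothetical per-model hallucination rates for the same prompt.
rates = [0.12, 0.30, 0.18]
mv = model_variability(rates)  # high spread -> model-intrinsic hallucination
```

If every model hallucinates at the same rate on the prompt, MV is zero and the prompt itself is the more likely driver.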
Claim: The hallucination attribution framework provides interpretable quantitative scores, specifically Prompt Sensitivity (PS), Model Variability (MV), and the Joint Attribution Score (JAS), which are used for benchmarking and tracking improvements in large language models.
Procedure: The evaluation framework presented in 'Survey and analysis of hallucinations in large language models' uses QAFactEval and hallucination-rate metrics to compute Prompt Sensitivity (PS) and Model Variability (MV), allowing prompt-induced hallucinations to be distinguished from model-intrinsic ones.
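The PS-versus-MV attribution step can be sketched as follows. The standard-deviation estimator and the simple "whichever score dominates" rule are illustrative assumptions; the paper's actual pipeline computes the underlying rates with QAFactEval and hallucination-rate metrics:

```python
from statistics import pstdev

def prompt_sensitivity(rates_by_prompt):
    """PS: spread of one model's hallucination rates across prompt
    variants (assumed std-dev estimator)."""
    return pstdev(rates_by_prompt)

def model_variability(rates_by_model):
    """MV: spread of hallucination rates across models for one prompt
    (assumed std-dev estimator)."""
    return pstdev(rates_by_model)

def attribute(ps, mv):
    """Illustrative attribution rule: the dominant score names the
    likely hallucination source."""
    if ps > mv:
        return "prompt-induced"
    if mv > ps:
        return "model-intrinsic"
    return "mixed"

# Using the PS/MV scores reported in the survey:
attribute(0.091, 0.045)  # LLaMA 2 (13B): prompt-induced
attribute(0.060, 0.080)  # DeepSeek 67B: model-intrinsic
```

This matches the survey's groupings: LLaMA 2 is prompt-dominant, DeepSeek is model-dominant, and models with near-equal PS and MV (e.g., Mistral 7B) are mixed-origin.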
Claim: Model-dominant models, such as DeepSeek 67B, show low Prompt Sensitivity (PS) but high Model Variability (MV), meaning hallucinations persist regardless of prompt variation, due to internal knowledge limitations or inference biases.
Measurement: LLaMA 2 exhibits high Prompt Sensitivity (PS), while DeepSeek shows high Model Variability (MV).
Measurement: 'Survey and analysis of hallucinations in large language models' reports Prompt Sensitivity (PS) and Model Variability (MV) scores as follows: LLaMA 2 (13B) (PS: 0.091, MV: 0.045), Mistral 7B (PS: 0.078, MV: 0.053), DeepSeek 67B (PS: 0.060, MV: 0.080), OpenChat-3.5 (PS: 0.083, MV: 0.062), and Qwen (PS: 0.079, MV: 0.057).
Procedure: To establish objective thresholds for 'low' versus 'high' Prompt Sensitivity and Model Variability, the authors collect the values for all evaluated models, plot the distributions, and use the median of each distribution as the cutoff.
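The median-cutoff procedure above is straightforward to sketch. The strict "above the median counts as high" tie-breaking is an assumption the paper does not spell out; the PS values are the ones it reports (with 'Gwen' read as Qwen):

```python
from statistics import median

def low_high_labels(scores):
    """Label each model's score 'low' or 'high' using the median of the
    distribution as the cutoff, per the survey's thresholding procedure.
    ASSUMPTION: scores at exactly the median are labeled 'low'."""
    cut = median(scores.values())
    return {name: ("high" if s > cut else "low") for name, s in scores.items()}

# PS scores reported in the survey.
ps = {
    "LLaMA 2 (13B)": 0.091,
    "Mistral 7B": 0.078,
    "DeepSeek 67B": 0.060,
    "OpenChat-3.5": 0.083,
    "Qwen": 0.079,
}
labels = low_high_labels(ps)  # median PS here is 0.079
```

With these numbers, LLaMA 2 and OpenChat-3.5 fall above the PS median (high) while DeepSeek falls below (low), consistent with the prompt-dominant versus model-dominant groupings described above.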
Claim: Model Variability, as a metric for language-model evaluation, captures the variation in hallucination behavior across different models for the same prompt type, representing intrinsic model bias or instability.