Concept

Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior

Also known as: Survey and analysis of hallucinations in large language models

Facts (19)

Sources
Survey and analysis of hallucinations in large language models (Frontiers, frontiersin.org, Sep 29, 2025): 17 facts
Claim: The 'Survey and analysis of hallucinations in large language models' reports that vague prompts produce the highest hallucination rate at 38.3%, whereas Chain-of-Thought (CoT) prompts reduce the hallucination rate to 18.1%, identifying CoT as the most effective prompting strategy among those evaluated.
Claim: The 'Survey and analysis of hallucinations in large language models' provides qualitative examples of model hallucinations, including LLaMA 2 fabricating that 'Marie Curie invented penicillin' under zero-shot prompting, DeepSeek claiming 'Pluto is the largest planet in the solar system' under instruction prompting, and Mistral stating 'The Eiffel Tower is located in Berlin' under vague prompting.
Claim: The authors of 'Survey and analysis of hallucinations in large language models' declare that no generative AI was used in the creation of the manuscript.
Claim: The authors of 'Survey and analysis of hallucinations in large language models' introduce an attribution framework that distinguishes prompt-induced from model-intrinsic hallucinations using controlled prompt manipulation and model comparison.
Claim: The paper 'Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior' was published in Frontiers in Artificial Intelligence on September 30, 2025, by Anh-Hoang D, Tran V, and Nguyen L-M.
Procedure: The authors of the survey 'Survey and analysis of hallucinations in large language models' conducted controlled experiments on multiple large language models (GPT-4, LLaMA 2, DeepSeek, Qwen) using standardized hallucination evaluation benchmarks, specifically TruthfulQA, HallucinationEval, and RealToxicityPrompts.
Claim: The radar plot in Figure 4 of the study 'Survey and analysis of hallucinations in large language models' visualizes the comparative performance of DeepSeek, Mistral, and LLaMA 2 across five behavioral dimensions: Factuality, Coherence, Prompt Sensitivity, Model Variability, and Usability.
Claim: The authors of the 'Survey and analysis of hallucinations in large language models' define Prompt Sensitivity (PS) and Model Variability (MV) as metrics to quantify the respective contributions of prompts and model-internal factors to hallucinations.
Procedure: The evaluation framework presented in 'Survey and analysis of hallucinations in large language models' uses QAFactEval and hallucination-rate metrics to compute Prompt Sensitivity (PS) and Model Variability (MV), allowing prompt-induced hallucinations to be distinguished from model-intrinsic ones (a computational sketch follows this fact list).
Measurement: The research article titled 'Survey and analysis of hallucinations in large language models' was supported by JSPS KAKENHI under grant number JP23K16954.
Claim: The study 'Survey and analysis of hallucinations in large language models' utilized three primary datasets to analyze hallucination patterns: TruthfulQA, HallucinationEval, and QAFactEval.
Claim: The authors of the paper 'Survey and analysis of hallucinations in large language models' propose a probabilistic attribution framework for large language model (LLM) hallucinations that introduces three new metrics, PS, MV, and JAS, to quantify the contributions of prompts versus model behavior.
Measurement: According to the 'Survey and analysis of hallucinations in large language models,' the overall hallucination rates (HR) for the evaluated LLMs are: LLaMA 2 (13B) at 31.3%, Mistral 7B at 25.8%, DeepSeek 67B at 23.2%, OpenChat-3.5 at 28.4%, and Qwen at 26.7%.
Measurement: The 'Survey and analysis of hallucinations in large language models' reports Prompt Sensitivity (PS) and Model Variability (MV) scores for LLMs as follows: LLaMA 2 (13B) (PS: 0.091, MV: 0.045), Mistral 7B (PS: 0.078, MV: 0.053), DeepSeek 67B (PS: 0.060, MV: 0.080), OpenChat-3.5 (PS: 0.083, MV: 0.062), and Qwen (PS: 0.079, MV: 0.057).
Claim: The authors of the paper 'Survey and analysis of hallucinations in large language models' formalize hallucination attribution using a Bayesian hierarchical model, which provides interpretable parameters for prompt-induced and intrinsic error rates (a hypothetical form of such a model is sketched after this fact list).
Claim: The study 'Survey and analysis of hallucinations in large language models' conducted qualitative and quantitative analyses on LLaMA 2, DeepSeek, and GPT-4 to illustrate differences in hallucination patterns.
Procedure: The authors of the paper 'Survey and analysis of hallucinations in large language models' conducted controlled experiments using open-source models and standardized prompts to classify hallucination origins as prompt-dominant, model-dominant, or mixed (see the classification sketch below).
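This digest does not reproduce the paper's formulas for PS and MV, so the following is a minimal sketch of how such metrics could be computed from a grid of hallucination rates, assuming PS is the spread of one model's rate across prompting strategies and MV is the spread across models for a fixed strategy. The function names, the placeholder rates, and the choice of standard deviation as the spread measure are illustrative assumptions, not the paper's definitions.

```python
import statistics

# Hallucination rate (fraction of hallucinated answers) per model and
# prompting strategy. Placeholder values for illustration, not the paper's data.
RATES = {
    "llama2-13b":   {"zero-shot": 0.35, "instruction": 0.30, "cot": 0.20, "vague": 0.42},
    "mistral-7b":   {"zero-shot": 0.28, "instruction": 0.25, "cot": 0.17, "vague": 0.35},
    "deepseek-67b": {"zero-shot": 0.24, "instruction": 0.22, "cot": 0.19, "vague": 0.27},
}

def prompt_sensitivity(model: str) -> float:
    """Spread of one model's hallucination rate across prompting strategies.

    Assumption: the spread is a population standard deviation; the paper
    may define PS differently.
    """
    return statistics.pstdev(RATES[model].values())

def model_variability(strategy: str) -> float:
    """Spread of the hallucination rate across models for one fixed strategy."""
    return statistics.pstdev(rates[strategy] for rates in RATES.values())

for m in RATES:
    print(f"{m}: PS = {prompt_sensitivity(m):.3f}")
for s in ("zero-shot", "cot"):
    print(f"{s}: MV = {model_variability(s):.3f}")
```

A high PS with a low MV would point toward prompt-induced hallucinations, and the reverse toward model-intrinsic ones, matching the attribution logic the facts above describe.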
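The Bayesian hierarchical model is named above but not spelled out in this digest. One conventional form it could take, offered as a hypothetical reconstruction rather than the paper's actual specification, separates a prompt effect from a model effect on the log-odds of a hallucination:

```latex
% Hypothetical hierarchical form; symbols are illustrative, not taken from the paper.
\begin{aligned}
y_{ij} &\sim \operatorname{Bernoulli}(\theta_{ij})
  && \text{hallucination indicator, prompt } i,\ \text{model } j \\
\operatorname{logit}(\theta_{ij}) &= \mu + \alpha_i + \beta_j
  && \alpha_i:\ \text{prompt-induced effect},\quad \beta_j:\ \text{intrinsic model effect} \\
\alpha_i &\sim \mathcal{N}(0, \sigma_\alpha^2), \qquad
\beta_j &\sim \mathcal{N}(0, \sigma_\beta^2)
\end{aligned}
```

In such a model the variance components play roles analogous to PS and MV, apportioning error between prompt-induced and intrinsic sources with interpretable parameters, which is the property the claim above attributes to the paper's formulation.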
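The reported PS and MV scores are enough to illustrate the prompt-dominant / model-dominant / mixed classification. The decision rule below (compare PS against MV with a small tolerance band for 'mixed') is an assumed one chosen for illustration; the paper's actual criterion may differ.

```python
# PS and MV values exactly as reported in the survey's measurements above.
SCORES = {
    "LLaMA 2 (13B)": (0.091, 0.045),
    "Mistral 7B":    (0.078, 0.053),
    "DeepSeek 67B":  (0.060, 0.080),
    "OpenChat-3.5":  (0.083, 0.062),
    "Qwen":          (0.079, 0.057),
}

def classify(ps: float, mv: float, tol: float = 0.01) -> str:
    """Label the dominant hallucination source; the tolerance band is an assumption."""
    if ps - mv > tol:
        return "prompt-dominant"
    if mv - ps > tol:
        return "model-dominant"
    return "mixed"

for model, (ps, mv) in SCORES.items():
    print(f"{model}: {classify(ps, mv)}")
```

Under this assumed rule, DeepSeek 67B is the only model-dominant case, which lines up with it having the lowest PS and the highest MV among the reported scores.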
Unknown source: 2 facts
Claim: The authors of the paper 'Survey and analysis of hallucinations in large language models' introduce a novel framework for determining whether large language models are hallucinating.
Claim: The authors of the paper 'Survey and analysis of hallucinations in large language models' present a comprehensive survey and empirical analysis of hallucination attribution in large language models.