
hallucination rate

Also known as: hallucination rates, hallucination frequency


The hallucination rate (HR) is a key performance metric that quantifies how often Large Language Models (LLMs) generate information unsupported by ground truth or source data, as defined by Frontiers. By measuring the prevalence of factual or logical errors, the metric serves as an essential tool for organizations seeking to mitigate the risks of AI-generated misinformation, according to TTMS.

Methodologically, the calculation of the hallucination rate is not standardized, so its definition varies across evaluation frameworks. Some approaches use formulas based on precision and uncertainty indicators, as in the KG-RAG evaluation framework, while others decompose the rate into "breadth" (entity-level) and "depth" (fact-level) components, as proposed in arXiv research. More advanced frameworks employ statistical methods such as Bayesian hierarchical modeling to distinguish between prompt-induced and model-intrinsic hallucinations, as described by Frontiers.
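To make that contrast concrete, here is a minimal Python sketch of two of these definitions: a simple entity-level (breadth) rate and the fact-level (depth) rate from the arXiv decomposition. The `Response` container and its judgment fields are hypothetical stand-ins for whatever entity-level filter and fact-level checker a given framework actually uses.

```python
from dataclasses import dataclass

@dataclass
class Response:
    # Hypothetical per-response judgments produced by upstream evaluators.
    is_hallucination: bool   # verdict of an entity-level filter (breadth)
    facts_total: int         # atomic facts extracted from the response
    facts_incorrect: int     # facts contradicted by ground truth (depth)

def breadth_rate(responses: list[Response]) -> float:
    """Entity-level rate: share of responses flagged as hallucinations."""
    return sum(r.is_hallucination for r in responses) / len(responses)

def depth_rate(responses: list[Response]) -> float:
    """Fact-level rate: share of extracted facts judged incorrect."""
    total = sum(r.facts_total for r in responses)
    return sum(r.facts_incorrect for r in responses) / total if total else 0.0

sample = [Response(True, 5, 2), Response(False, 4, 0), Response(False, 6, 1)]
print(breadth_rate(sample), depth_rate(sample))  # 1/3 ~ 0.333 and 3/15 = 0.2
```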

Empirical research demonstrates that hallucination rates are heavily influenced by model architecture and size: larger, more proficient proprietary models generally exhibit lower hallucination rates than smaller 8-32B models, as reported in arXiv. As of March 2026, performance benchmarks show a wide spectrum of results, with Vectara's leaderboard reporting rates from 5.8% for xai-org/grok-3 to 17.8% for xai-org/grok-4-1-fast-non-reasoning, and other comparative snapshots from Vectara spanning 6.1% to 15.1%.

The rate is also highly sensitive to operational and environmental variables. Retrieval-Augmented Generation (RAG) architectures significantly affect factual adherence, with Knowledge Graph RAG (KG-RAG) demonstrating lower hallucination rates than Embedding-RAG, per research in arXiv. Furthermore, prompting techniques, including few-shot prompting and Chain-of-Thought reasoning, can provide complementary gains in reducing error frequency, as noted by medRxiv. In highly specialized domains, such as specific Diagnosis Prediction tasks, some models have even achieved a 0% hallucination rate, per medRxiv.

Ultimately, the hallucination rate is a dynamic indicator used to guide system development and deployment. In practical applications such as A/B testing for RAG systems, a relative reduction of 30% or more in the hallucination rate is often treated as the threshold for adopting a new system variant, according to LinkedIn. While some research suggests that the minimum achievable rate is bounded below by the proportion of "singleton" facts in the training data, according to AI Innovations and Insights, developers continue to rely on production monitoring tools, such as those provided by Arize, to track and manage these rates in real time.
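As a sketch of how that adoption threshold might be wired into an A/B evaluation, the snippet below encodes the decision rule described in the LinkedIn test plan (≥30% relative hallucination-rate reduction, ≤200 ms added p95 latency, ≤30% added cost). The metric dictionary layout and the baseline/variant numbers are illustrative assumptions, not part of the source.

```python
def adopt_variant(base: dict, var: dict) -> bool:
    """Adopt the variant only if the hallucination rate drops by >=30%
    relative to baseline while p95 latency grows by <=200 ms and cost
    per query grows by <=30% (decision rule from the A/B test plan)."""
    hr_drop = (base["hallucination_rate"] - var["hallucination_rate"]) / base["hallucination_rate"]
    return (hr_drop >= 0.30
            and var["latency_p95_ms"] - base["latency_p95_ms"] <= 200
            and (var["cost_per_query"] - base["cost_per_query"]) / base["cost_per_query"] <= 0.30)

# Hypothetical two-week measurements for a Standard-RAG baseline vs. a CRAG variant.
baseline = {"hallucination_rate": 0.12, "latency_p95_ms": 850, "cost_per_query": 0.010}
variant  = {"hallucination_rate": 0.08, "latency_p95_ms": 990, "cost_per_query": 0.012}
print(adopt_variant(baseline, variant))  # True: ~33% HR reduction, +140 ms, +20% cost
```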

Model Perspectives (2)
openrouter/google/gemini-3.1-flash-lite-preview (definitive, 100% confidence)
The "hallucination rate" (HR) is a critical metric used to quantify the frequency of factual or logical errors in large language model (LLM) generations defined by Frontiers. Monitoring this rate is essential for organizations seeking to mitigate the risks of AI-generated misinformation according to TTMS. Research indicates that model size and architecture significantly influence hallucination frequency; for instance, larger, more proficient proprietary models exhibit lower hallucination rates (11.91%) compared to smaller 8-32B models (54.75%) as reported in arXiv. Performance varies widely across models and evaluation datasets: empirical measurements from March 2026 show rates ranging from 5.8% for xai-org/grok-3 to 17.8% for xai-org/grok-4-1-fast-non-reasoning documented by Vectara. Furthermore, specific retrieval architectures, such as Knowledge Graph RAG (KG-RAG), have demonstrated a lower hallucination rate (15%) compared to Embedding-RAG (30%), indicating a higher capacity for maintaining factual adherence per research in arXiv. Advanced evaluation frameworks now incorporate complex statistical methods, such as Bayesian hierarchical modeling, to analyze how specific prompts and models interact to produce hallucinations as described by Frontiers. These frameworks, which utilize metrics like QAFactEval, allow researchers to differentiate between prompt-induced and model-intrinsic hallucinations as presented in Frontiers. In practical applications, such as A/B testing for RAG systems, a reduction in hallucination rate of 30% or more is often used as a threshold for adopting new system variants according to LinkedIn.
openrouter/google/gemini-3.1-flash-lite-preview (definitive, 100% confidence)
The "hallucination rate" is a performance metric used to quantify the frequency with which Large Language Models (LLMs) generate information that is not supported by ground truth or source data. As of March 2026, various organizations, such as Vectara, provide comparative benchmarks for this rate across different models, with results ranging from 6.1% for the amazon/nova-lite-v1:0 model to 15.1% for the openai/gpt-5-high-2025-08-07 model. Methodologically, the definition of a hallucination rate varies by framework. Some approaches, such as the KG-RAG evaluation framework, use a formula based on perfect precision and uncertainty indicators, while others, like the metric proposed in arXiv research, split the rate into "breadth" (entity-level) and "depth" (fact-level) components. Research indicates that the minimum possible hallucination rate for a model is constrained by the proportion of "singleton" facts—information that appears only once—within its training data, according to AI Innovations and Insights. Hallucination rates are highly sensitive to operational variables. Evidence from Frontiers and Giskard shows that prompting techniques (like few-shot prompting) and system instructions significantly influence these rates. Specifically, medRxiv notes that systemic prompting and Chain-of-Thought reasoning can provide complementary gains in reducing errors. Furthermore, the task domain impacts performance; for example, medRxiv found that models like Claude-3.5 and o1 achieved a 0% hallucination rate in specific Diagnosis Prediction tasks. Monitoring tools, such as those provided by Arize, allow developers to track these metrics in production environments.

Facts (107)

Sources
vectara/hallucination-leaderboard - GitHub github.com Vectara 65 facts
All measurements below are from the Vectara hallucination leaderboard as of March 20, 2026 (factual consistency = 100% minus hallucination rate).

Model | Hallucination Rate | Factual Consistency | Answer Rate | Avg Summary Length (words)
openai/gpt-5-nano-2025-08-07 | 10.5% | 89.5% | 100.0% | 105.7
moonshotai/Kimi-K2.5 | 14.2% | 85.8% | 92.2% | 112.0
mistralai/ministral-8b-2410 | 7.4% | 92.6% | 99.9% | 196.0
google/gemini-3.1-flash-lite-preview | 8.2% | 91.8% | 99.6% | 62.6
qwen/qwen3.5-27b | 12.1% | 87.9% | 99.8% | 94.4
google/gemini-3-pro-preview | 13.6% | 86.4% | 99.4% | 101.9
anthropic/claude-sonnet-4-5-20250929 | 12.0% | 88.0% | 95.6% | 127.8
xai-org/grok-3 | 5.8% | 94.2% | 93.0% | 95.9
openai/gpt-5-mini-2025-08-07 | 12.9% | 87.1% | 99.9% | 169.7
zai-org/glm-4p7 | 11.7% | 88.3% | 99.8% | 70.6
google/gemini-2.5-pro | 7.0% | 93.0% | 99.1% | 106.4
zai-org/GLM-4.5-AIR-FP8 | 9.3% | 90.7% | 98.1% | 70.6
anthropic/claude-opus-4-1-20250805 | 11.8% | 88.2% | 92.4% | 129.1
xai-org/grok-4-1-fast-non-reasoning | 17.8% | 82.2% | 98.5% | 87.5
meta-llama/Llama-4-Scout-17B-16E-Instruct | 7.7% | 92.3% | 99.0% | 137.3
anthropic/claude-sonnet-4-6 | 10.6% | 89.4% | 99.9% | 114.7
openai/gpt-5.1-high-2025-11-13 | 12.1% | 87.9% | 100.0% | 254.4
qwen/qwen3.5-plus-2026-02-15 | 10.7% | 89.3% | 99.8% | 92.1
google/gemini-3.1-pro-preview | 10.4% | 89.6% | 99.4% | 107.7
qwen/qwen3-32b | 5.9% | 94.1% | 99.9% | 115.8
arcee-ai/trinity-large-preview | 6.9% | 93.1% | 99.0% | 117.3
anthropic/claude-sonnet-4-20250514 | 10.3% | 89.7% | 98.6% | 145.8
deepseek-ai/DeepSeek-V3 | 6.1% | 93.9% | 97.5% | 81.7
openai/gpt-4o-2024-08-06 | 9.6% | 90.4% | 93.8% | 86.6
google/gemma-3-4b-it | 6.4% | 93.6% | 67.3% | 77.4
deepseek-ai/DeepSeek-V3.2 | 6.3% | 93.7% | 92.6% | 62.0
mistralai/mistral-3-large-2512 | 14.5% | 85.5% | 98.8% | 112.7
ai21labs/jamba-large-1.7-2025-07 | 9.7% | 90.3% | 98.9% | 124.8
qwen/qwen3.5-flash-2026-02-23 | 10.5% | 89.5% | 99.8% | 95.0
google/gemma-3-27b-it | 7.4% | 92.6% | 98.8% | 96.4
anthropic/claude-opus-4-6 | 12.2% | 87.8% | 99.8% | 137.6
CohereLabs/c4ai-aya-expanse-8b | 9.5% | 90.5% | 77.5% | 88.2
qwen/qwen3.5-122b-a10b | 11.2% | 88.8% | 99.8% | 86.4
CohereLabs/c4ai-aya-expanse-32b | 10.9% | 89.1% | 99.8% | 112.7
deepseek-ai/DeepSeek-R1 | 11.3% | 88.7% | 97.0% | 93.5
MiniMaxAI/minimax-m2p5 | 9.1% | 90.9% | 98.2% | 137.2
anthropic/claude-opus-4-5-20251101 | 10.9% | 89.1% | 98.7% | 114.5
openai/gpt-oss-120b | 14.2% | 85.8% | 99.9% | 135.2
MiniMaxAI/minimax-m2p1 | 11.8% | 88.2% | 98.5% | 106.9
openai/gpt-5.4-pro-2026-03-05 | 8.3% | 91.7% | 100.0% | 148.5
amazon/nova-lite-v1:0 | 6.1% | 93.9% | 99.9% | 91.8
openai/gpt-5.2-low-2025-12-11 | 8.4% | 91.6% | 100.0% | 126.5
anthropic/claude-haiku-4-5-20251001 | 9.8% | 90.2% | 99.5% | 115.1
qwen/qwen3-next-80b-a3b-thinking | 9.3% | 90.7% | 94.4% | 70.9
nvidia/Nemotron-3-Nano-30B-A3B | 9.6% | 90.4% | 99.6% | 104.2
inceptionlabs/mercury-2 | 12.3% | 87.7% | 100.0% | 149.1
openai/gpt-5-minimal-2025-08-07 | 14.7% | 85.3% | 99.9% | 109.7
mistralai/ministral-3b-2410 | 7.3% | 92.7% | 99.9% | 167.9
anthropic/claude-opus-4-20250514 | 12.0% | 88.0% | 91.0% | 123.2
openai/gpt-5-high-2025-08-07 | 15.1% | 84.9% | 99.9% | 162.7
CohereLabs/command-a-03-2025 | 9.3% | 90.7% | 97.6% | 101.7
openai/gpt-5.4-2026-03-05 | 7.0% | 93.0% | 99.9% | 81.7
google/gemini-2.5-flash | 7.8% | 92.2% | 99.0% | 101.5
zai-org/GLM-4.6 | 9.5% | 90.5% | 94.5% | 77.2
ibm-granite/granite-3.3-8b-instruct | 10.6% | 89.4% | 100.0% | 131.4
zai-org/glm-5 | 10.1% | 89.9% | 99.7% | 74.4
openai/gpt-5.2-high-2025-12-11 | 10.8% | 89.2% | 100.0% | 186.3
google/gemini-3-flash-preview | 13.5% | 86.5% | 99.8% | 90.2
CohereLabs/command-r-plus-08-2024 | 6.9% | 93.1% | 95.0% | 91.5
zai-org/GLM-4.7-flash | 9.3% | 90.7% | 91.6% | 71.8
meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 8.2% | 91.8% | 100.0% | 106.0
qwen/qwen3-235b-a22b | 9.3% | 90.7% | 94.9% | 105.6
qwen/qwen3.5-35b-a3b | 10.5% | 89.5% | 99.8% | 94.9
ai21labs/jamba-mini-1.7-2025-07 | 14.7% | 85.3% | 99.1% | 136.4
openai/gpt-5.1-low-2025-11-13 | 10.9% | 89.1% | 100.0% | 165.5
Survey and analysis of hallucinations in large language models frontiersin.org Frontiers Sep 29, 2025 12 facts
formula: The hallucination rate (HR) is defined as the percentage of model generations that contain factual or logical errors.
claim: The DeepSeek model demonstrates the lowest overall hallucination rate among the models studied but retains internal factual inconsistencies.
measurement: The aggregated hallucination rates (%) for GPT-4 are 14.3 on TruthfulQA, 9.8 on HallucinationEval, and 4.7 on QAFactEval.
formula: The Joint Attribution Score (JAS) is defined as JAS = σP * σM, where σP and σM denote the standard deviations of hallucination rates across prompts and models, respectively.
procedure: The evaluation framework presented in 'Survey and analysis of hallucinations in large language models' utilizes QAFactEval and hallucination rate metrics to compute Prompt Sensitivity (PS) and Model Variability (MV), allowing for the differentiation between prompt-induced and model-intrinsic hallucinations.
formula: Conditional Prompt Sensitivity (CPS) is defined as CPS = (1/N) * Σ |h(Pi, Mj) - h_avg(Mj)|, where h(Pi, Mj) is the hallucination rate for prompt variant Pi under model Mj, and h_avg(Mj) is the average hallucination rate for model Mj.
formula: Bayesian hierarchical modeling (BHM) represents hallucination rates hierarchically with model-specific and prompt-specific parameters drawn from higher-level distributions, defined as Hij = αi + βj + γij, where Hij is the hallucination rate for model i under prompt j, αi and βj represent model-specific and prompt-specific effects, and γij represents interaction effects (Gelman et al., 2013).
measurement: The aggregated hallucination rates (%) for LLaMA 2 are 31.2 on TruthfulQA, 27.6 on HallucinationEval, and 24.8 on QAFactEval.
claim: Few-shot prompting reduces hallucination rates but is dependent on the quality of the demonstrations provided.
measurement: In QAFactEval experiments, GPT-4 achieved a hallucination rate below 5%, while LLaMA 2 and DeepSeek exhibited hallucination rates between 20% and 25%.
formula: Conditional Model Variability (CMV) is defined as CMV = (1/M) * Σ |h(Mj, Pi) - h_avg(Pi)|, where h(Mj, Pi) is the hallucination rate for model Mj given prompt Pi, and h_avg(Pi) is the mean hallucination rate across models for prompt Pi.
claim: Vague or misleading prompts induce high hallucination rates across all models, confirming the risk of prompt underspecification.
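A small worked example may help tie the CPS, CMV, and JAS definitions above together. The sketch below evaluates them on a hypothetical prompt-by-model matrix of hallucination rates; the matrix values are invented, and the JAS line reflects one plausible reading of σP and σM as the spread of per-prompt and per-model mean rates.

```python
import numpy as np

# Hypothetical hallucination rates h(Pi, Mj): rows = prompt variants, cols = models.
H = np.array([[0.10, 0.25, 0.18],
              [0.12, 0.31, 0.22],
              [0.30, 0.45, 0.40]])  # third row: a vague prompt inflating every model's rate

# CPS = (1/N) * sum_i |h(Pi, Mj) - h_avg(Mj)|, computed per model Mj
cps = np.abs(H - H.mean(axis=0)).mean(axis=0)

# CMV = (1/M) * sum_j |h(Mj, Pi) - h_avg(Pi)|, computed per prompt Pi
cmv = np.abs(H - H.mean(axis=1, keepdims=True)).mean(axis=1)

# JAS = sigma_P * sigma_M (assumed here: std of per-prompt means x std of per-model means)
jas = H.mean(axis=1).std() * H.mean(axis=0).std()

print(cps, cmv, jas)
```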
KG-RAG: Bridging the Gap Between Knowledge and Creativity - arXiv arxiv.org arXiv May 20, 2024 4 facts
measurement: On the CWQ dataset, the KG-RAG pipeline achieved an Exact Match (EM) score of 19%, an F1 Score of 25%, an accuracy of 32%, and a hallucination rate of 15%.
claim: The hallucination rate of KG-RAG (15%) is significantly lower than that of Embedding-RAG (30%), suggesting that KG-RAG is more adept at adhering to factual content and reducing the generation of unsupported content.
measurement: On the CWQ dataset, the Embedding-RAG model achieved an Exact Match (EM) score of 28%, an F1 Score of 37%, an accuracy of 46%, and a hallucination rate of 30%.
formula: Hallucination in the KG-RAG evaluation framework is defined as responses containing information not present in the ground truth, and it is calculated using the formula: Hallucination Rate = (1/N) * Σ(1 if predicted answer is not perfect precision AND contains heuristic indicators of uncertainty, else 0), where N is the number of samples.
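A minimal sketch of the KG-RAG hallucination-rate formula above, assuming a toy list of phrases as the "heuristic indicators of uncertainty"; the marker list and sample records are illustrative, not the framework's actual heuristics.

```python
UNCERTAINTY_MARKERS = ("i think", "possibly", "might", "not sure")  # hypothetical heuristics

def kg_rag_hallucination_rate(samples: list[dict]) -> float:
    """HR = (1/N) * sum(1 if answer lacks perfect precision AND shows
    uncertainty indicators, else 0), per the KG-RAG evaluation framework."""
    def flagged(s: dict) -> bool:
        imperfect = s["precision"] < 1.0
        uncertain = any(m in s["answer"].lower() for m in UNCERTAINTY_MARKERS)
        return imperfect and uncertain
    return sum(flagged(s) for s in samples) / len(samples)

samples = [
    {"answer": "Paris, I think.", "precision": 0.5},  # counted: imprecise and hedged
    {"answer": "Paris.",          "precision": 1.0},  # not counted: perfect precision
]
print(kg_rag_hallucination_rate(samples))  # 0.5
```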
Medical Hallucination in Foundation Models and Their ... medrxiv.org medRxiv Mar 3, 2025 5 facts
reference: Pal et al. (2023) introduce an empirical benchmark for quantifying hallucination frequency in real-world scenarios to underscore the prevalence of hallucinations.
claim: Data quality and curation practices influence hallucination rates in AI systems, particularly when generating patient summaries, according to a 2021 study.
measurement: Chronological Ordering tasks showed hallucination rates between 0.25% and 24.6%, while Lab Data Understanding tasks showed rates between 0.25% and 18.7%.
measurement: The study evaluated hallucination rates and clinical risk severity for five Large Language Models: o1, gemini-2.0-flash-exp, gpt-4o, gemini-1.5-flash, and claude-3.5 sonnet.
measurement: Diagnosis Prediction tasks exhibited the lowest hallucination rates across all evaluated models, ranging from 0% to 22%.
A framework to assess clinical safety and hallucination rates of LLMs ... nature.com Nature May 13, 2025 3 facts
claim: D.P., E.A., M.D., N.M., S.K., and J.B. contributed to the concept, design, and execution of the study regarding clinical safety and hallucination rates of LLMs.
reference: The article titled 'A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation' was published in the journal npj Digital Medicine (volume 8, article 274) in 2025, authored by E. Asgari, N. Montaña-Brown, M. Dubois, and others.
claim: The authors propose a framework for assessing clinical safety and hallucination rates in large language models (LLMs) that includes an error taxonomy for classifying outputs, an experimental structure for iterative comparisons in document generation pipelines, a clinical safety framework to evaluate error harms, and a graphical user interface named CREOLA.
Medical Hallucination in Foundation Models and Their Impact on ... medrxiv.org medRxiv Nov 2, 2025 3 facts
measurement: The model o1 demonstrated a hallucination rate of 0.25% in both Chronological Ordering and Lab Data Understanding tasks.
measurement: Claude-3.5 and o1 exhibited the lowest hallucination rates across all tasks and risk categories, including achieving a 0% hallucination rate in the Diagnosis Prediction task.
measurement: System Prompting provides complementary gains to Chain-of-Thought (CoT) reasoning in reducing hallucination rates, as seen in o3-mini (baseline 80.4% to System Prompt 81.4% to CoT 90.7%) and deepseek-r1 (baseline 86.6% to System Prompt 84.5% to CoT 90.7%).
LLM Observability: How to Monitor AI When It Thinks in Tokens | TTMS ttms.com TTMS Feb 10, 2026 3 facts
claim: An effective LLM monitoring setup tracks a combination of performance metrics, including latency, throughput, request rates, token usage, and error rates, alongside quality metrics such as hallucination rate, factual accuracy, relevance, toxicity, and user feedback.
procedure: Organizations can mitigate the risk of unchecked AI misinformation by monitoring correctness through hallucination rates or user feedback loops.
claim: Arize includes out-of-the-box dashboard widgets for monitoring metrics such as hallucination rate, prompt failure rate, and latency distribution.
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... arxiv.org arXiv Feb 23, 2026 2 facts
measurement: The hallucination rate metric decreases from 54.75% in smaller 8-32B models to 11.91% in larger, more proficient proprietary models.
claim: The Hallucination Rate metric is split into two components: breadth of knowledge (percentage of responses classified as hallucinations by an entity-level filter) and depth of knowledge (percentage of incorrect facts judged by a fact-level check).
A Comprehensive Benchmark and Evaluation Framework for Multi ... arxiv.org arXiv Jan 6, 2026 2 facts
claim: The Guidance Injection Loop feedback mechanism in the Patient Agent framework achieves the best performance by attenuating the hallucination rate and boosting relevance.
claim: The Basic setup of the Patient Agent framework, which relies solely on prompt engineering without constraints, exhibits a high hallucination rate and suboptimal behavioral consistency.
EdinburghNLP/awesome-hallucination-detection - GitHub github.com GitHub 2 facts
measurement: The hallucination rate of machine translation systems under perturbation is measured using the Language Pair fraction and rate, evaluated on the Flores-101, WMT, and TICO datasets.
measurement: The hallucination rate (H%) is a metric calculated based on 1000 generated titles.
Hallucination Causes: Why Language Models Fabricate Facts mbrenndoerfer.com M. Brenndoerfer · mbrenndoerfer.com Mar 15, 2026 1 fact
claim: Empirical research on large language model hallucinations has made progress on individual dimensions, including studies on entity frequency, hallucination rates, knowledge cutoff effects, and ablations of decoding strategies.
RAG Hallucinations: Retrieval Success ≠ Generation Accuracy linkedin.com Sumit Umbardand · LinkedIn Feb 6, 2026 1 fact
procedure: A 2-week A/B test plan for RAG systems involves comparing a baseline (Standard RAG) against a variant (CRAG or Adaptive RAG) using 10–20% stratified traffic, tracking hallucination rate, latency (p95), cost per query, and user trust score, with a decision rule to adopt the variant if hallucination rate decreases by ≥30%, latency increase is ≤200ms, and cost increase is ≤30%.
Unknown source 1 fact
measurement: The AI system evaluated by E. Asgari et al. in a 2025 study exhibited a 1.47% hallucination rate and a 3.45% omission rate.
LLM Hallucinations: Causes, Consequences, Prevention - LLMs llmmodels.org llmmodels.org May 10, 2024 1 fact
measurement: In a recent study, ChatGPT exhibited a hallucination rate of up to 31% when generating scientific abstracts.
Phare LLM Benchmark: an analysis of hallucination in ... giskard.ai Giskard Apr 30, 2025 1 fact
claim: Giskard's data indicates that modifying system instructions significantly impacts the hallucination rates of Large Language Models.
What Really Causes Hallucinations in LLMs? - AI Exploration Journey aiexpjourney.substack.com AI Innovations and Insights Sep 12, 2025 1 fact
measurement: The minimum hallucination rate of a large language model is at least as high as the proportion of singletons (facts appearing only once) present in the training data.
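To illustrate the singleton bound in the last fact, here is a toy sketch that counts the fraction of fact occurrences appearing exactly once in a synthetic training corpus; the fact strings and the exact counting convention are illustrative assumptions.

```python
from collections import Counter

def singleton_fraction(training_facts: list[str]) -> float:
    """Fraction of fact occurrences that appear exactly once in the training
    data -- under the singleton argument, a lower bound on the achievable
    hallucination rate."""
    counts = Counter(training_facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(training_facts)

facts = ["a born 1990", "b capital x", "b capital x", "c founded 2001"]
print(singleton_fraction(facts))  # 0.5: two of four fact occurrences are singletons
```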