hallucination resistance
Facts (12)
Sources
Medical Hallucination in Foundation Models and Their Impact on ... (medrxiv.org, Nov 2, 2025, 10 facts)
[measurement] The advanced reasoning model GPT-5 achieves a baseline hallucination resistance of 71.2% and a semantic-similarity score greater than 0.8.
[perspective] Hallucination resistance in specialized medical contexts emerges from sophisticated reasoning capabilities, internal consistency mechanisms, and broad world knowledge developed during large-scale pretraining, rather than from domain-specific fine-tuning.
[measurement] GPT-5 achieves a hallucination resistance of 87.6% with search-augmented generation, a 16.5% improvement over its baseline performance.
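Gains like these can be read either as percentage-point differences or as improvements relative to the baseline, and the two are easy to conflate. A minimal sketch of both calculations, using hypothetical round numbers rather than the study's unrounded scores (the function name is illustrative, not from the paper):

```python
def improvement(baseline: float, augmented: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent).

    Both inputs are resistance scores expressed in percent, e.g. 40.0 for 40%.
    """
    points = augmented - baseline          # percentage-point difference
    relative = 100.0 * points / baseline   # gain relative to the baseline
    return points, relative

# Hypothetical example: a model rising from 40.0% to 60.0% resistance
pts, rel = improvement(40.0, 60.0)
# pts -> 20.0 percentage points; rel -> 50.0% relative improvement
```

Relative improvement is the more flattering figure for low-baseline models, which is worth keeping in mind when comparing the medical-specialized results below against the frontier models.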
[claim] Hallucination resistance in AI models correlates more strongly with depth of conceptual understanding, measured as semantic similarity to ground truth, than with exposure to domain-specific training data.
[measurement] The model o1 achieves a baseline hallucination resistance of 64.0%, while the earlier-generation models GPT-4o and GPT-4o-mini achieve baselines of 54.4% and 48.3%, respectively.
[claim] Based on the convergence of a large effect size, adequate statistical power, and narrow confidence intervals, the authors interpret the difference in hallucination resistance between general-purpose and medical-specialized models as substantial and clinically meaningful rather than merely borderline statistically significant.
[measurement] The medical-specialized model PMC-Llama shows a 46.1% relative improvement in hallucination resistance with search-augmented generation, rising from a 40.8% baseline to 59.9% (p = 1.9 × 10⁻¹², q = 8.8 × 10⁻¹¹ after FDR correction).
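The q-value quoted alongside the p-value comes from false-discovery-rate correction across the study's many model/prompting comparisons. A minimal sketch of the Benjamini–Hochberg step-up procedure, the most common FDR correction (the paper may use a different variant); the p-value list is hypothetical:

```python
def bh_qvalues(pvals: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices sorted by p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q

# Hypothetical p-values from four independent tests
qs = bh_qvalues([0.01, 0.04, 0.03, 0.005])
# qs is approximately [0.02, 0.04, 0.04, 0.02]
```

A q-value of 8.8 × 10⁻¹¹ means the result survives correction by a wide margin: even after accounting for multiple testing, the expected proportion of false discoveries at this threshold is negligible.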
[measurement] Gemini 2.5 Pro achieves a hallucination resistance of 97.9% with Chain-of-Thought (CoT) prompting, versus 87.6% at baseline.
[measurement] General-purpose models show a median baseline hallucination resistance of 76.6%, versus 51.3% for medical-purpose models, an average difference of 25.2% (95% CI: [18.7%, 31.3%]; U = 27.0, p = 0.012, two-tailed Mann–Whitney test; rank-biserial r = −0.64, 95% CI: [−0.86, −0.28]).
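The U statistic and rank-biserial effect size reported above can both be computed directly from the two samples of per-model scores. A minimal sketch under one common convention, r = 2U/(n₁n₂) − 1, where U counts the pairs in which the first sample exceeds the second (ties count one half); the scores here are hypothetical illustrations, not the study's raw data:

```python
def mann_whitney_u(x: list[float], y: list[float]) -> float:
    """U for sample x vs y: pairs with x_i > y_j count 1, ties count 0.5."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

def rank_biserial(x: list[float], y: list[float]) -> float:
    """Rank-biserial effect size r = 2U/(n1*n2) - 1, ranging over [-1, 1].

    Negative r means sample x tends to score below sample y.
    """
    return 2.0 * mann_whitney_u(x, y) / (len(x) * len(y)) - 1.0

# Hypothetical resistance scores: medical-specialized (x) vs general (y)
medical = [40.8, 32.0, 28.6, 52.6]
general = [76.6, 71.2, 64.0, 54.4]
r = rank_biserial(medical, general)
# Every medical score is below every general score, so r == -1.0 here
```

An r of −0.64, as reported in the study, means that in roughly 82% of model pairs the medical-specialized model scores lower than the general-purpose one, which is why the authors treat it as a large effect despite the modest sample of models.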
[measurement] The medical-specialized models PMC-Llama, MedAlpaca, AlpacaRE, and MedGemma show baseline hallucination resistance scores of 40.8%, 32.0%, 28.6%, and 52.6% respectively, less than half the resistance of general-purpose models.
Phare LLM Benchmark: an analysis of hallucination in ... (giskard.ai, Apr 30, 2025, 1 fact)
[measurement] In the most extreme cases observed by Giskard, instructions emphasizing conciseness reduced the hallucination resistance of large language models by 20%.
Medical Hallucination in Foundation Models and Their ... (medrxiv.org, Mar 3, 2025, 1 fact)
[claim] There is a robust correlation between semantic similarity and hallucination resistance in LLMs, suggesting that deeper understanding of medical concepts is a critical factor in minimizing factual errors in generated medical content.