Concept: training data

Facts (51)

Sources
Medical Hallucination in Foundation Models and Their ... · medrxiv.org · medRxiv · Mar 3, 2025 · 11 facts
claim: Restrepo et al. (2024) highlight that the underrepresentation of minority groups in training data can lead to systematic errors in AI predictions.
claim: Survey respondents identified limitations in training data and model architectures as key factors contributing to medical hallucinations in AI/LLM tools.
claim: Inadequate training data coverage creates knowledge gaps that cause large language models to hallucinate when addressing unfamiliar medical topics, according to Lee et al. (2024).
claim: Medical large language models struggle to generalize beyond their training data when faced with rare diseases, novel treatments, or atypical clinical presentations, as noted by Svenstrup et al. (2015) and Hegselmann et al. (2024b).
claim: Medical Large Language Model (LLM) hallucinations are the product of learned statistical correlations in training data, coupled with architectural constraints such as limited causal reasoning, as identified by Jiang et al. (2023) and Glicksberg (2024).
claim: Medical Large Language Models (LLMs) exhibit availability bias, manifesting as a tendency to propose diagnoses or treatments that are disproportionately represented in the model's training data.
measurement: Respondents identified insufficient training data (31 mentions) and biased training data (31 mentions) as the most frequently cited causes of AI hallucinations, followed by limitations in model architecture (30), lack of real-world context (26), overconfidence in AI-generated responses (24), and inadequate transparency of AI decision-making (14).
claim: Enhancing data quality and curation is critical for reducing hallucinations in AI models because inaccuracies or inconsistencies in training data can propagate errors in model outputs.
claim: Perceived causes of AI hallucinations include insufficient training data, biased training data, limitations in AI model architecture, lack of real-world context, overconfidence in AI-generated responses, and inadequate transparency of AI decision-making.
claim: Robust finetuning procedures and retrieval-augmented generation can improve the balance of training data, which helps mitigate availability bias in large language models.
claim: High uncertainty in a Large Language Model's outputs, indicated by low sequence probabilities or high semantic entropy, suggests the model is generating content without strong grounding in its training data, as noted by Asgari et al. (2024) and Vishwanath et al. (2024).
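The last fact above names two concrete signals: low sequence probability and high semantic entropy. The sketch below shows one way each signal could be computed from model outputs; the clustering function for semantic entropy (e.g. a bidirectional-entailment check) is left as a parameter and is an assumption, since the source does not specify an implementation.

```python
import math
from collections import Counter

def mean_token_logprob(token_logprobs):
    """Length-normalized sequence log-probability: low values mean the model
    assigned little probability to its own answer."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)

def semantic_entropy(sampled_answers, cluster_fn):
    """Entropy over clusters of semantically equivalent sampled answers.
    `cluster_fn` maps an answer string to a cluster label (assumed to be
    backed by, e.g., a bidirectional-entailment model)."""
    labels = [cluster_fn(a) for a in sampled_answers]
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Toy usage: three sampled answers that split into two meaning clusters.
answers = ["Paris", "The capital is Paris", "Lyon"]
entropy = semantic_entropy(answers, cluster_fn=lambda a: "paris" if "Paris" in a else "other")
print(round(entropy, 3))  # higher entropy = more disagreement = weaker grounding
```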
Hallucination Causes: Why Language Models Fabricate Facts · mbrenndoerfer.com · M. Brenndoerfer · Mar 15, 2026 · 9 facts
claim: Scaling up large language model size and training data simultaneously tends to reduce hallucinations regarding well-documented facts because larger models have greater capacity to memorize and recall high-frequency information.
measurement: In large language models, entities appearing fewer than approximately 100 times in training data are hallucinated at substantially higher rates than high-frequency entities.
claim: Hallucination in large language models is a structural issue originating from how training data is collected, how the optimization objective is constructed, the limitations of what knowledge the model can represent, and how the generation process converts probability distributions into words.
claim: For the long tail of entities and facts, increasing the volume of training data does not reduce hallucinations if the additional data contains noise levels similar to the existing training corpus.
claim: Automated filtering of training data for large language models can remove low-quality content like boilerplate, spam, and AI-generated text, but it cannot reliably identify factual errors at scale.
claim: Valuable scientific and specialized knowledge is often excluded from large language model training data because it is behind paywalls, in subscription journals, or contained in private databases like electronic health records, legal databases, and proprietary financial data.
claim: Large language models do not have an internal 'confidence score' grounded in the amount of training data that covered a specific topic.
claim: The four major categories of root causes for large language model hallucinations are training data issues, exposure bias during learning, structural knowledge gaps, and generation pressure at inference time.
measurement: The hallucination rate of large language models decreases as entity frequency in training data increases, dropping from 95% at one occurrence to approximately 60% at 50 occurrences.
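The two frequency measurements above suggest a simple corpus-side check: count how often each entity is mentioned and flag anything below roughly 100 occurrences as long-tail and therefore hallucination-prone. The sketch below is an illustration of that idea only; the corpus, entity list, and threshold are assumptions, not details from the source.

```python
from collections import Counter
import re

def entity_frequencies(corpus_docs, entities):
    """Count surface-form mentions of each entity across a corpus.
    `corpus_docs` (iterable of strings) and `entities` (list of strings)
    are placeholders for a real training corpus and entity list."""
    counts = Counter({e: 0 for e in entities})
    for doc in corpus_docs:
        for ent in entities:
            counts[ent] += len(re.findall(re.escape(ent), doc))
    return counts

def long_tail_entities(counts, threshold=100):
    """Entities seen fewer than `threshold` times; per the measurements
    above, these are the ones hallucinated at much higher rates."""
    return [e for e, c in counts.items() if c < threshold]

# Toy usage: both entities fall below the threshold in this tiny corpus.
docs = ["Aspirin is widely documented.", "Aspirin reduces fever.", "Obscuridone is mentioned once."]
freqs = entity_frequencies(docs, ["Aspirin", "Obscuridone"])
print(long_tail_entities(freqs))
```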
LLM Hallucinations: Causes, Consequences, Prevention - LLMs · llmmodels.org · May 10, 2024 · 8 facts
reference: The causes of LLM hallucinations include flawed training data (biases, inaccuracies, or inconsistencies), knowledge gaps (lack of domain-specific knowledge or context understanding), and technical limitations (over-reliance on statistical patterns and vulnerability to manipulation).
procedure: The 'high-quality training data' mitigation strategy for large language model hallucinations consists of using diverse and well-curated training data.
claim: Large language models (LLMs) experience hallucinations due to flawed or biased training data, which may contain inaccuracies or inconsistencies.
claim: Strategies to mitigate hallucinations in large language models include using high-quality training data, employing contrastive learning, implementing human oversight, and utilizing uncertainty estimation.
claim: A significant challenge in developing accurate and reliable large language models is the need for high-quality, diverse, and representative training data.
claim: Large language models can hallucinate because they rely too heavily on statistical patterns in training data rather than understanding the underlying meaning or context of the text.
claim: Factual errors or outdated information in training data lead large language models to generate inaccurate or misleading text.
claim: Biased language in training data causes large language models to reproduce stereotypical or biased language in their generated text.
A Survey on the Theory and Mechanism of Large Language Models · arxiv.org · arXiv · Mar 12, 2026 · 4 facts
claim: According to Setlur et al. (2025), the performance gap between Verifier-Based and Verifier-Free methods widens as test-time compute and training data increase, with Verifier-Based methods achieving superior asymptotic performance.
claim: The design and selection of deep learning model architectures are influenced by both the latent characteristics of the training data and the training paradigm adopted, such as next-token prediction (NTP) or masked language modeling (MLM).
formula: Hoffmann et al. (2022b) established that for compute-optimal training, model size (N) and the amount of training data (D) should be scaled proportionally with the compute budget (C), specifically N ∝ C^0.5 and D ∝ C^0.5 (see the worked sketch after this source's facts).
claim: Chu et al. (2025) provided empirical evidence that Supervised Fine-Tuning (SFT) tends to memorize training data, leading to poor performance on out-of-distribution (OOD) tasks, whereas Reinforcement Learning (RL) demonstrates superior generalization capabilities.
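A quick way to see what the Hoffmann et al. (2022b) rule implies in practice is to solve it for a fixed compute budget using the standard C ≈ 6·N·D FLOPs approximation. The sketch below assumes that approximation and the commonly quoted ~20 tokens-per-parameter ratio; neither number comes from this source.

```python
import math

def compute_optimal_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget C into model size N and training tokens D under
    C ~= 6 * N * D, with N and D both growing as C^0.5 (Hoffmann et al. 2022b).
    The 6*N*D approximation and the ~20 tokens/parameter ratio are assumptions."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 5.76e23 FLOP budget lands near 70B parameters and 1.4T tokens,
# and a 4x larger budget doubles both N and D (each scales as C^0.5).
n, d = compute_optimal_allocation(5.76e23)
print(f"N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```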
Unknown source · 2 facts
claim: Limitations in training data are a root cause of model-intrinsic hallucinations in large language models.
claim: Large language models (LLMs) can experience model-intrinsic hallucinations due to limitations in training data and architectural biases, even when well-organized prompts are used.
Construction of Knowledge Graphs: State and Challenges - arXiv · arxiv.org · arXiv · 2 facts
claim: Most knowledge graph construction approaches integrate supplementary data, specifically mapping rules, training data, or quality constraints such as SHACL shapes.
claim: High-quality data sources can provide a clean type hierarchy and serve as training data to mitigate data-quality issues that are difficult to address when using low-quality sources in isolation.
On Hallucinations in Artificial Intelligence–Generated Content ... · jnm.snmjournals.org · The Journal of Nuclear Medicine · 1 fact
perspective: AI models are inherently probabilistic and rely on pattern recognition and statistical inference from training data without true understanding, making hallucinations an inevitable limitation of data-driven learning systems.
LLM Hallucination Detection and Mitigation: State of the Art in 2026 · zylos.ai · Zylos · Jan 27, 2026 · 1 fact
claim: Future research in hallucination mitigation is focusing on mechanistic interpretability to understand internal processes, adaptive verification strategies based on query complexity and risk, extending detection to cross-modal systems, and causal tracing to link training data and architecture to hallucination propensity.
The construction and refined extraction techniques of knowledge ... · nature.com · Nature · Feb 10, 2026 · 1 fact
claim: Training data for each task is designed based on the specific functional requirements of that task.
Enterprise AI Requires the Fusion of LLM and Knowledge Graph · stardog.com · Stardog · Dec 4, 2024 · 1 fact
claim: Schellaert's team found that 'ultracrepidarianism' (the tendency to give opinions on matters the AI knows nothing about) appeared in LLMs as a consequence of increasing scale and grew linearly with the amount of training data.
Medical Hallucination in Foundation Models and Their Impact on ... · medrxiv.org · medRxiv · Nov 2, 2025 · 1 fact
claim: Survey respondents in the study 'Medical Hallucination in Foundation Models and Their Impact on ...' identified limitations in training data and model architectures as key factors contributing to medical hallucinations.
How Open-Source AI Drives Responsible Innovation - The Atlantic · theatlantic.com · The Atlantic · 1 fact
claim: Open-source AI systems help manage emerging risks such as intentional misuse by bad actors (cyberattacks, disinformation) and unintentional harms (exposure of private user data, entrenched biases in training data).
Policymakers Overlook How Open Source AI Is Reshaping ... · techpolicy.press · Lucie-Aimée Kaffee, Shayne Longpre · Tech Policy Press · Dec 9, 2025 · 1 fact
measurement: The proportion of downloaded AI models that disclosed meaningful information about their training data fell from a majority in 2022 to below 40 percent by 2025.
Survey and analysis of hallucinations in large language models · frontiersin.org · Frontiers · Sep 29, 2025 · 1 fact
claim: Model-intrinsic hallucinations occur due to limitations in training data, architectural biases, or inference-time sampling strategies, even when well-organized prompts are used, as noted by Bang and Madotto (2023), OpenAI (2023a), and Chen et al. (2023).
The Role of Hallucinations in Large Language Models - CloudThat · cloudthat.com · CloudThat · Sep 1, 2025 · 1 fact
claim: Large language model hallucinations occur due to gaps in training data, a lack of grounding, or limitations in how models understand real-world facts.
Reducing hallucinations in large language models with custom ... · aws.amazon.com · Amazon Web Services · Nov 26, 2024 · 1 fact
claim: LLM hallucinations occur when training data lacks necessary information or when the model attempts to generate coherent responses by making logical inferences beyond its actual knowledge.
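One mitigation that the medRxiv survey facts earlier in this list pair with this failure mode is retrieval-augmented generation: supply the missing information at inference time instead of relying on what the training data happened to cover. Below is a minimal sketch of that pattern only; `retriever` and `llm` are assumed callables standing in for a real search index and model API, not a specific library interface.

```python
def answer_with_retrieval(question, retriever, llm, top_k=3):
    """Minimal retrieval-augmented generation sketch. Retrieved passages are
    placed in the prompt so the model can ground its answer in them rather
    than in (possibly missing) training-data knowledge."""
    passages = retriever(question, top_k=top_k)  # assumed: returns list[str]
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)  # assumed: returns str
```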
Understanding LLM Understanding · skywritingspress.ca · Skywritings Press · Jun 14, 2024 · 1 fact
claim: Understanding the behavior of large language models is challenging because their internal structures are complex, their training data is often opaque, and access to their underlying mechanisms is limited.
What Really Causes Hallucinations in LLMs? - AI Exploration Journey · aiexpjourney.substack.com · AI Innovations and Insights · Sep 12, 2025 · 1 fact
measurement: A large language model's hallucination rate is lower-bounded by the proportion of singletons (facts appearing only once) in its training data.
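Written out as a formula, the bound above has the general shape of a Good–Turing estimate: the fraction of facts seen exactly once approximates the probability mass of facts the model has effectively never learned. The exact formalization in the source may differ; the sketch below only restates the relationship as described.

$$
\text{hallucination rate} \;\ge\; \frac{S}{N}, \qquad S = \bigl|\{\, f : \mathrm{count}(f) = 1 \,\}\bigr|,
$$

where $N$ is the total number of fact occurrences in the training data and $S$ is the number of singleton facts.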
A Knowledge-Graph Based LLM Hallucination Evaluation Framework · themoonlight.io · The Moonlight · 1 fact
claim: The authors of the GraphEval framework focus on detecting hallucinations within a defined context rather than identifying discrepancies between LLM responses and broader training data.
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... · arxiv.org · arXiv · Feb 23, 2026 · 1 fact
claim: Well-known entities are more likely to be referenced in Large Language Model training data, which increases the likelihood that the model will accurately recall information about them.
vectara/hallucination-leaderboard - GitHub · github.com · Vectara · 1 fact
perspective: The author of the Vectara hallucination-leaderboard argues that testing models by providing a list of well-known facts is a poor method for detecting hallucinations because the model's training data is unknown, the definition of 'well known' is unclear, and most hallucinations arise from rare or conflicting information rather than common knowledge.