Relations (1)
Facts (17)
Sources
LLM Hallucinations: Causes, Consequences, Prevention - LLMs (llmmodels.org, 6 facts)
claim: Large language models (LLMs) experience hallucinations due to flawed or biased training data, which may contain inaccuracies or inconsistencies.
claim: Strategies to mitigate hallucinations in large language models include using high-quality training data, employing contrastive learning, implementing human oversight, and utilizing uncertainty estimation.
claim: A significant challenge in developing accurate and reliable large language models is the need for high-quality, diverse, and representative training data.
claim: Large language models can hallucinate because they rely too heavily on statistical patterns in training data rather than understanding the underlying meaning or context of the text.
claim: Factual errors or outdated information in training data lead large language models to generate inaccurate or misleading text.
claim: Biased language in training data causes large language models to reproduce stereotypical or biased language in their generated text.
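Of the mitigation strategies listed above, uncertainty estimation is the most directly mechanical: one common proxy is the entropy of the model's next-token probability distribution, with high-entropy steps flagged as likely guesses. A minimal sketch in pure Python, assuming access to per-step probability distributions; the threshold value here is an illustrative assumption, not a published figure.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def flag_uncertain(step_probs, threshold_bits=1.5):
    """Return indices of generation steps whose predictive entropy exceeds
    the threshold. The 1.5-bit cutoff is an assumption for illustration."""
    return [i for i, probs in enumerate(step_probs)
            if entropy(probs) > threshold_bits]

# A peaked distribution (model is confident) vs. a flat one (model is guessing).
confident = [0.97, 0.01, 0.01, 0.01]   # entropy ≈ 0.24 bits
uncertain = [0.25, 0.25, 0.25, 0.25]   # entropy = 2.0 bits

print(flag_uncertain([confident, uncertain]))  # → [1]
```

In practice the distributions would come from a model's output logits; entropy is only one of several uncertainty signals (others include sampling-based self-consistency checks).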
Hallucination Causes: Why Language Models Fabricate Facts (mbrenndoerfer.com, 5 facts)
measurement: In large language models, entities appearing fewer than approximately 100 times in training data are hallucinated at substantially higher rates than high-frequency entities.
claim: Hallucination in large language models is a structural issue originating from how training data is collected, how the optimization objective is constructed, the limitations of what knowledge the model can represent, and how the generation process converts probability distributions into words.
claim: Automated filtering of training data for large language models can remove low-quality content like boilerplate, spam, and AI-generated text, but it cannot reliably identify factual errors at scale.
claim: Large language models do not have an internal 'confidence score' grounded in the amount of training data that covered a specific topic.
measurement: The hallucination rate of large language models decreases as entity frequency in training data increases, dropping from 95% at one occurrence to approximately 60% at 50 occurrences.
Medical Hallucination in Foundation Models and Their ... (medrxiv.org, 2 facts)
claim: Inadequate training data coverage creates knowledge gaps that cause large language models to hallucinate when addressing unfamiliar medical topics, according to Lee et al. (2024).
claim: Robust finetuning procedures and retrieval-augmented generation can improve the balance of training data, which helps mitigate availability bias in large language models.
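The retrieval-augmented generation mentioned in the fact above can be sketched minimally: retrieve the passages most relevant to a query and prepend them to the prompt, so the model answers from supplied evidence rather than from patterns memorized at training time. The word-overlap retriever below is a deliberately naive stand-in for a real one (BM25, dense embeddings); all names and the example corpus are illustrative.

```python
def retrieve(query, corpus, k=2):
    """Rank documents by naive word-overlap with the query (a stand-in
    for a real retriever such as BM25 or dense embeddings)."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, corpus):
    """Prepend retrieved passages so the answer is grounded in the
    supplied context rather than in training-data statistics alone."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return (f"Context:\n{context}\n\n"
            f"Question: {query}\n"
            f"Answer using only the context above.")

corpus = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "The Eiffel Tower is located in Paris.",
    "Insulin therapy is used when oral agents fail in type 2 diabetes.",
]
print(build_prompt("What treatments exist for type 2 diabetes?", corpus))
```

The grounding step is what addresses the knowledge-gap claim above: topics absent from training data can still be answered if they are present in the retrieved context.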
Unknown source (2 facts)
Enterprise AI Requires the Fusion of LLM and Knowledge Graph (stardog.com, 1 fact)
claim: Schellaert's team found that 'ultracrepidarianism'—the tendency to give opinions on matters the AI knows nothing about—appeared in LLMs as a consequence of increasing scale and grew linearly with the amount of training data.
Understanding LLM Understanding (skywritingspress.ca, 1 fact)
claim: Understanding the behavior of large language models is challenging because their internal structures are complex, their training data is often opaque, and access to their underlying mechanisms is limited.