Concept

Synthetic data

Also known as: synthetic dataset, synthetic data generation

Facts (15)

Sources
A Survey on the Theory and Mechanism of Large Language Models (arXiv, Mar 12, 2026; 12 facts)
Reference: The paper 'Synthetic data generation with large language models for text classification: potential and limitations' was published in the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10443–10461.
Claim: Reinforcement learning on incorrect responses helps models identify and unlearn 'spurious correlations' (incorrect intermediate steps that nevertheless lead to correct final answers), improving synthetic-data sample efficiency eight-fold compared with standard fine-tuning on positive examples only.
Claim: Dohmatob et al. (2024) provide a theoretical framework showing that including synthetic, AI-generated data in the training corpus can alter or break traditional scaling laws, potentially leading to performance degradation and model collapse.
Reference: The paper 'DataDreamer: a tool for synthetic data generation and reproducible LLM workflows' (arXiv:2402.10379) introduces a tool for synthetic data generation and reproducible workflows for large language models.
Measurement: Seddik et al. (2024) concluded that, to maintain model stability, the amount of synthetic data used in training must be considerably smaller than the amount of real data in the training mix.
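The constraint reported by Seddik et al. (2024) can be sketched as a data-mixing helper that caps the synthetic share of a training corpus. This is a minimal illustration only: the function name and the 10% cap are invented placeholders, not values from the paper, which argues only that the synthetic portion must stay well below the real portion.

```python
import random


def mix_training_data(real, synthetic, max_synthetic_frac=0.1, seed=0):
    """Build a training mix whose synthetic share is at most max_synthetic_frac.

    The 0.1 default is an illustrative placeholder, not a threshold from
    Seddik et al. (2024); the point is only that synthetic << real.
    """
    # Largest synthetic count s such that s / (len(real) + s) <= frac.
    allowed = int(round(max_synthetic_frac / (1 - max_synthetic_frac) * len(real)))
    rng = random.Random(seed)
    kept = rng.sample(synthetic, min(allowed, len(synthetic)))
    mix = real + kept
    rng.shuffle(mix)
    return mix


real = [f"real_{i}" for i in range(900)]
synth = [f"synth_{i}" for i in range(500)]
mix = mix_training_data(real, synth)
synth_count = sum(1 for x in mix if x.startswith("synth_"))
# With 900 real examples and a 10% cap, at most 100 synthetic examples survive.
```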
Claim: The use of synthetic data for recursive self-improvement, where a model generates data to train its next generation, is a debated frontier in AI research according to Villalobos et al. (2024) and Long et al. (2024).
Reference: The paper 'Collapse or thrive? perils and promises of synthetic data in a self-generating world' is an arXiv preprint (arXiv:2410.16713) cited in section 2.3.1 of 'A Survey on the Theory and Mechanism of Large Language Models'.
Reference: The paper 'Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data' was published at the First Conference on Language Modeling.
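The accumulate-versus-replace distinction in that title can be illustrated with a toy Gaussian recursion: each "model" is a Gaussian fit to its training pool, and each generation trains on samples drawn from the previous model. This is a hedged sketch, not the paper's experiment; all names and parameters here are invented for illustration.

```python
import random
import statistics


def run_generations(n_gen=20, n_samples=200, accumulate=False, seed=1):
    """Toy model-collapse recursion on a 1-D Gaussian.

    Under accumulate=False, each generation trains only on the previous
    generation's synthetic samples (the 'curse of recursion'); under
    accumulate=True, real data stays in the growing pool, anchoring the fit.
    Returns the final fitted standard deviation.
    """
    rng = random.Random(seed)
    pool = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # real data, sigma = 1
    mu, sigma = statistics.fmean(pool), statistics.stdev(pool)
    for _ in range(n_gen):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        pool = pool + synthetic if accumulate else synthetic
        mu, sigma = statistics.fmean(pool), statistics.stdev(pool)
    return sigma


replace_sigma = run_generations(accumulate=False)
accumulate_sigma = run_generations(accumulate=True)
# Accumulating keeps the fitted sigma anchored near the true value of 1;
# replacing lets it drift under repeated refitting and resampling.
```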
Reference: The paper 'Towards a theoretical understanding of synthetic data in LLM post-training: a reverse-bottleneck perspective' provides a theoretical framework for the role of synthetic data in post-training large language models.
Reference: The paper 'Best practices and lessons learned on synthetic data' was published at The Thirteenth International Conference on Learning Representations and is cited in section 2.2.1 of the survey.
Measurement: Li et al. (2023d) found that the performance gap between real and synthetic data is smallest for low-subjectivity tasks such as news classification and significantly larger for high-subjectivity tasks such as humor or sarcasm detection.
Claim: The authors of 'A Survey on the Theory and Mechanism of Large Language Models' identify critical frontier challenges in the field, including the theoretical limits of synthetic-data self-improvement, the mathematical bounds of safety guarantees, and the mechanistic origins of emergent intelligence.
Cybersecurity Trends and Predictions 2025 From Industry Insiders (ITPro Today; 2 facts)
Claim: Businesses are increasingly turning to synthetic data (training data generated by AI models) to maintain safety best practices and avoid the risks of using customer data for AI training.
Claim: Synthetic data creates feedback loops that exacerbate existing biases within datasets.
How NATO can integrate AI to prevail in future algorithmic warfare (Atlantic Council; 1 fact)
Claim: AI models are vulnerable to exploitation of rare battlefield features because they are trained primarily on synthetic data or on datasets from previous conflicts that may not match current war-zone conditions.