In web-scale training data, high-accuracy sources such as Wikipedia (95% accuracy) and academic papers (88% accuracy) make up approximately 7% of the total corpus, while lower-accuracy sources such as SEO content (35% accuracy) and general blogs (65% accuracy) account for nearly 40% of the total token volume.
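To make the imbalance concrete, here is a minimal sketch of a token-weighted average accuracy over the four named source types. The accuracy figures come from the text above. The even split of the 7% and 40% group totals across individual sources is a hypothetical assumption, since only the group totals are given. The remaining share of the corpus is not described, so the average is normalized over the listed sources only. The names `weighted_accuracy` and `mixture` are illustrative.

```python
def weighted_accuracy(sources):
    """Return the token-weighted mean accuracy over (share, accuracy) pairs.

    Shares are normalized by their sum, so they need not add up to 1.
    """
    total_share = sum(share for share, _ in sources.values())
    weighted_sum = sum(share * accuracy for share, accuracy in sources.values())
    return weighted_sum / total_share


# (share of corpus, accuracy). Accuracies are from the text; the per-source
# shares are a HYPOTHETICAL even split of the ~7% and ~40% group totals.
mixture = {
    "wikipedia": (0.035, 0.95),
    "academic_papers": (0.035, 0.88),
    "seo_content": (0.20, 0.35),
    "general_blogs": (0.20, 0.65),
}

print(f"Weighted accuracy over listed sources: {weighted_accuracy(mixture):.3f}")
```

Under these assumed shares the low-accuracy sources dominate the average, because they carry far more of the token volume than the high-accuracy ones. A different split would shift the number, so treat the result as illustrative rather than as a measurement.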
Sources
- Hallucination Causes: Why Language Models Fabricate Facts (mbrenndoerfer.com)