Fact — measurement — Knowledge Tree

In web-scale training data, high-accuracy sources like Wikipedia (95% accuracy) and academic papers (88% accuracy) contribute approximately 7% of the total corpus, while lower-accuracy sources like SEO content (35% accuracy) and general blogs (65% accuracy) constitute nearly 40% of the total token volume.
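
As a rough illustration of what these figures imply, the sketch below bounds the token-weighted accuracy of the portion of the corpus the fact describes. It rests on assumptions not stated in the source: that the 7% and the (nearly) 40% are combined shares for each pair of sources, that "accuracy" can be read as the fraction of accurate content, and that the split within each pair is unknown, so only lower and upper bounds are computed. The function name `weighted_accuracy_bounds` is hypothetical.

```python
# Figures quoted in the fact above. Treating each pair of sources as one
# bucket with a combined share is an assumption, not something the source states.
HIGH_SHARE = 0.07          # Wikipedia + academic papers (approximate)
LOW_SHARE = 0.40           # SEO content + general blogs ("nearly 40%")
HIGH_ACC = (0.88, 0.95)    # academic papers, Wikipedia
LOW_ACC = (0.35, 0.65)     # SEO content, general blogs


def weighted_accuracy_bounds(high_share, low_share, high_acc, low_acc):
    """Return (lower, upper) bounds on share-weighted accuracy.

    The split inside each bucket is unknown, so the lower bound assumes every
    bucket consists entirely of its least accurate source and the upper bound
    assumes its most accurate source.
    """
    covered = high_share + low_share  # portion of the corpus described (0.47)
    lower = (high_share * min(high_acc) + low_share * min(low_acc)) / covered
    upper = (high_share * max(high_acc) + low_share * max(low_acc)) / covered
    return lower, upper


lower, upper = weighted_accuracy_bounds(HIGH_SHARE, LOW_SHARE, HIGH_ACC, LOW_ACC)
print(f"Weighted accuracy of the described 47%: {lower:.1%} to {upper:.1%}")
```

Under these assumptions the described 47% of the corpus averages somewhere between roughly 43% and 69% accuracy, because the lower-accuracy sources outweigh the high-accuracy ones by almost six to one. The remaining 53% of the corpus is not characterized by the fact and is left out of the calculation.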

Authors

Person: M. Brenndoerfer
Organization: mbrenndoerfer.com
Hallucination Causes: Why Language Models Fabricate Facts

Sources

Hallucination Causes: Why Language Models Fabricate Facts. M. Brenndoerfer, mbrenndoerfer.com (via serper)

Referenced by nodes (2)

Wikipedia entity
blog concept