claim
Teacher forcing makes training large language models computationally efficient: because every position's input is the ground-truth preceding token rather than the model's own prediction, next-token predictions for all positions in a sequence can be computed in a single forward pass under a causal attention mask, parallelizing training across the sequence length.
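A minimal NumPy sketch of the idea (all names and shapes here are illustrative, not from the source): a single causal self-attention head processes an entire ground-truth sequence in one pass, and the mask guarantees that each position only attends to earlier positions, so the parallel pass matches what step-by-step decoding would see.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    """One attention head over a full (T, d) sequence.
    The upper-triangular mask blocks position t from
    attending to positions > t."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -1e9
    return softmax(scores, axis=-1) @ v

T, d = 6, 8  # toy sequence length and model width (assumptions)
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal((T, d))  # stand-in for ground-truth token embeddings

# Teacher forcing: the whole ground-truth sequence goes in at once,
# so hidden states for all T positions come from ONE forward pass.
out_full = causal_self_attention(x, Wq, Wk, Wv)

# Perturbing a later token leaves earlier positions unchanged --
# the causal mask is what makes the single parallel pass valid.
x2 = x.copy()
x2[-1] += 1.0
out_perturbed = causal_self_attention(x2, Wq, Wk, Wv)
assert np.allclose(out_full[:-1], out_perturbed[:-1])
```

In autoregressive sampling, by contrast, each token must wait for the previous one, so generation is inherently sequential; the parallelism above is only available at training time, when the full target sequence is known.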
Authors
Sources
- Hallucination Causes: Why Language Models Fabricate Facts (mbrenndoerfer.com)
Referenced by nodes (1)
- Large Language Models concept