claim
Ren et al. (2024b) analyzed the training dynamics of a single-layer Transformer on a synthetic dataset, showing that the optimization process consists of a sample-intensive stage followed by a sample-efficient stage.
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org)
Referenced by nodes (1)
- Transformer concept