claim
Ren et al. (2024b) analyzed the training dynamics of a single-layer Transformer on a synthetic dataset, showing that the optimization process consists of a sample-intensive stage followed by a sample-efficient stage.
