claim
Shen et al. (2025) theoretically analyzed the training dynamics of a single-layer Transformer on in-context classification of Gaussian mixtures, showing that under gradient descent the model converges to the global optimum at a linear rate.
Authors
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org, via serper)
Referenced by nodes (1)
- gradient descent concept
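The claim above concerns linear-rate convergence of gradient descent. As a minimal sketch (not Shen et al.'s Transformer setting, just the generic phenomenon), gradient descent on a strongly convex quadratic objective contracts the distance to the global optimum by a constant factor per step, which is what "linear rate" means:

```python
import numpy as np

# Illustrative only: linear-rate convergence of gradient descent on a
# strongly convex quadratic f(w) = 0.5 w'Hw - b'w (all names here are
# assumptions for the sketch, not from the cited paper).
rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
H = A @ A.T + np.eye(d)            # symmetric positive definite Hessian
b = rng.standard_normal(d)
w_star = np.linalg.solve(H, b)     # global optimum: gradient H w - b = 0

L = np.linalg.eigvalsh(H).max()    # smoothness (largest eigenvalue)
eta = 1.0 / L                      # step size guaranteeing contraction
w = np.zeros(d)
errs = []
for _ in range(200):
    w = w - eta * (H @ w - b)      # gradient descent step
    errs.append(np.linalg.norm(w - w_star))

# Error shrinks geometrically: every step multiplies it by a factor < 1.
ratios = [errs[i + 1] / errs[i] for i in range(50)]
print(errs[-1], max(ratios))
```

Because `H = A A^T + I` has minimum eigenvalue at least 1, the per-step contraction factor is at most `1 - 1/L`, so the error decays exponentially in the iteration count — the defining property of a linear convergence rate.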