claim
Shen et al. (2025) theoretically analyzed the training dynamics of a single-layer Transformer on in-context classification of Gaussian mixtures, showing that gradient descent can drive the model to the global optimum at a linear rate.
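
For reference, "linear rate" is used here in the standard optimization sense: the optimality gap shrinks geometrically in the number of gradient descent steps. A generic statement of that property is sketched below. The symbols (loss $L$, parameters $\theta_t$ after $t$ steps, optimal value $L^{*}$, contraction factor $\rho$) are assumptions chosen for illustration, not the paper's own notation, and the paper's specific constants and conditions are not reproduced.

```latex
% Generic definition of linear (geometric) convergence; illustrative notation only.
\[
  L(\theta_t) - L^{*} \;\le\; (1 - \rho)^{t}\,\bigl(L(\theta_0) - L^{*}\bigr),
  \qquad 0 < \rho \le 1 .
\]
% Consequence: reaching an optimality gap of at most \varepsilon takes
% O\bigl(\log(1/\varepsilon)\bigr) gradient descent steps.
```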
