claim
Shen et al. (2025) theoretically analyzed the training dynamics of a single-layer Transformer on in-context classification of Gaussian mixtures, showing that under gradient descent the model converges to the global optimum at a linear rate.
Authors
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org, via serper)
Referenced by nodes (1)
- gradient descent concept
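The claim above concerns linear-rate convergence of gradient descent. As a minimal sketch (not Shen et al.'s Transformer setting, just the generic phenomenon), gradient descent on a strongly convex quadratic objective contracts the distance to the global optimum by a constant factor per step, which is what "linear rate" means:

```python
import numpy as np

# Illustrative only: linear-rate convergence of gradient descent on a
# strongly convex quadratic f(w) = 0.5 w'Hw - b'w (all names here are
# assumptions for the sketch, not from the cited paper).
rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
H = A @ A.T + np.eye(d)            # symmetric positive definite Hessian
b = rng.standard_normal(d)
w_star = np.linalg.solve(H, b)     # global optimum: gradient H w - b = 0

L = np.linalg.eigvalsh(H).max()    # smoothness (largest eigenvalue)
eta = 1.0 / L                      # step size guaranteeing contraction
w = np.zeros(d)
errs = []
for _ in range(200):
    w = w - eta * (H @ w - b)      # gradient descent step
    errs.append(np.linalg.norm(w - w_star))

# Error shrinks geometrically: every step multiplies it by a factor < 1.
ratios = [errs[i + 1] / errs[i] for i in range(50)]
print(errs[-1], max(ratios))
```

Because `H = A A^T + I` has minimum eigenvalue at least 1, the per-step contraction factor is at most `1 - 1/L`, so the error decays exponentially in the iteration count — the defining property of a linear convergence rate.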