claim
According to a 2023 analysis, transformers with a causal attention mask implement a procedure that behaves like online gradient descent with non-decaying step sizes, and such non-decaying step sizes do not guarantee convergence to the optimal solution.
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org)
Referenced by nodes (1)
- Transformers concept
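
The failure mode named in the claim is a standard property of online gradient descent: with a constant (non-decaying) step size, the iterates settle into a noise floor around the optimum rather than converging to it, whereas a decaying schedule satisfying the Robbins-Monro conditions does converge. The sketch below illustrates this on a toy stochastic least-squares stream; it is a minimal illustration of the general phenomenon, not the construction from the cited 2023 analysis, and the objective, step sizes, and constants are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stochastic least-squares stream: y = <w_star, x> + noise.
# (Illustrative setup, not the model studied in the cited analysis.)
d = 5
w_star = rng.normal(size=d)

def sample():
    x = rng.normal(size=d)
    y = x @ w_star + 0.5 * rng.normal()
    return x, y

def online_gd(step_size_fn, n_steps=20_000):
    """Online gradient descent on squared loss, one fresh sample per step."""
    w = np.zeros(d)
    for t in range(1, n_steps + 1):
        x, y = sample()
        grad = (w @ x - y) * x          # gradient of 0.5 * (w.x - y)^2
        w -= step_size_fn(t) * grad
    return np.linalg.norm(w - w_star)

# Non-decaying (constant) step size: the iterates keep jittering around
# w_star, so the final error stalls at a step-size-dependent noise floor.
err_const = online_gd(lambda t: 0.05)

# Decaying step size (eta_t = 1/t): the Robbins-Monro conditions hold
# (step sizes sum to infinity, squared step sizes are summable), and the
# error keeps shrinking.
err_decay = online_gd(lambda t: 1.0 / t)

print(f"constant step size : ||w - w*|| = {err_const:.4f}")
print(f"decaying step size : ||w - w*|| = {err_decay:.4f}")
```

Running the sketch shows the constant-step error stalling near a noise floor while the 1/t schedule drives it steadily down, which is the sense in which non-decaying step sizes fail to guarantee convergence to the optimal solution.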