Fact — claim — Knowledge Tree

Transformers operating under a causal mask setting execute algorithms that function as online gradient descent with non-decaying step sizes, which fails to guarantee convergence to optimal solutions, according to a 2023 analysis.

Authors

Person: Not available Organization: arXiv
A Survey on the Theory and Mechanism of Large Language Models

Sources

A Survey on the Theory and Mechanism of Large Language Models arxiv.org arXiv via serper

Referenced by nodes (1)

Transformers concept