claim
According to a 2023 analysis, transformers with a causal mask implement algorithms that behave like online gradient descent with non-decaying step sizes, a scheme that is not guaranteed to converge to an optimal solution.
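
The convergence issue can be illustrated with a minimal sketch on a toy problem. This is not the construction from the 2023 analysis: the quadratic loss, the alternating target stream, the step-size values, and the helper name `online_gd` are all assumptions chosen for illustration. It compares online gradient descent with a constant step size against the same procedure with a step size that decays as 1/t.

```python
def online_gd(targets, step_size, w0=0.0):
    """Run online gradient descent on the losses 0.5 * (w - y_t)**2.

    `step_size` maps the 1-based step index t to the step size used at step t.
    Returns the list of iterates after each update.
    (Hypothetical helper for illustration only.)
    """
    w = w0
    iterates = []
    for t, y in enumerate(targets, start=1):
        grad = w - y  # gradient of 0.5 * (w - y)**2 with respect to w
        w = w - step_size(t) * grad
        iterates.append(w)
    return iterates


# Assumed toy stream: targets alternate +1, -1, so the minimizer of the
# average loss is w* = 0.
targets = [1.0 if t % 2 == 0 else -1.0 for t in range(1000)]

# Non-decaying (constant) step size versus a step size decaying as 1/t.
constant = online_gd(targets, lambda t: 0.5)
decaying = online_gd(targets, lambda t: 1.0 / t)

print("constant step, last two iterates:", constant[-2:])
print("decaying step, last iterate:", decaying[-1])
```

In this toy setting the constant-step iterates settle into an oscillation around the optimum (near +1/3 and -1/3) instead of approaching 0, while the 1/t schedule reduces to a running mean of the targets and approaches 0. The sketch only shows that a non-decaying step size can prevent convergence; it says nothing about how a transformer would realize such updates.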
