claim
The effectiveness of the Muon optimizer is attributed to its ability to exploit the low-rank, approximately block-diagonal structure of the Hessian commonly observed in Large Language Models.
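For orientation, here is a minimal sketch of the kind of per-matrix step Muon is generally described as taking: the momentum of each weight matrix is replaced by an approximately orthogonal matrix with the same singular vectors. It does not demonstrate the Hessian claim itself. It only shows how a sharply decaying spectrum gets flattened, one weight matrix at a time. Assumptions: the function name `orthogonalize`, the step count, and the synthetic `momentum` matrix are all hypothetical, and the classic cubic Newton-Schulz iteration stands in for the tuned polynomial that Muon implementations typically use.

```python
import numpy as np

def orthogonalize(m, steps=30):
    """Approximate the polar factor U @ V.T of m, where m = U @ diag(S) @ V.T."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        # Cubic Newton-Schulz step: pushes each singular value toward 1
        # while leaving the singular vectors unchanged.
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x

rng = np.random.default_rng(0)

# Hypothetical 4x6 momentum matrix with a fast-decaying spectrum
# (illustrative data only, not taken from any model).
u, _ = np.linalg.qr(rng.standard_normal((4, 4)))
v, _ = np.linalg.qr(rng.standard_normal((6, 4)))
momentum = u @ np.diag([10.0, 3.0, 1.0, 0.3]) @ v.T

update = orthogonalize(momentum)
singular_values = np.linalg.svd(update, compute_uv=False)
print(singular_values)
```

In this sketch the dominant direction no longer swamps the weaker ones: every singular direction of the update carries roughly equal weight. Applying the step separately to each weight matrix is also what makes a block-wise (per-layer) view of curvature a natural way to reason about the method.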
Authors
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org)
Referenced by nodes (1)
- Large Language Models concept