claim
Muon's strength as an optimizer is attributed to its ability to exploit the low-rank, approximately block-diagonal structure of the Hessian that is commonly observed in large language models.
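For context, Muon works on each weight matrix as a whole: it accumulates momentum and then replaces the momentum matrix with an approximately orthogonal one before taking the step. The sketch below illustrates that matrix-level update only and does not demonstrate the Hessian claim itself. It is a simplification under stated assumptions: the names `orthogonalize` and `muon_like_step`, the hyperparameter values, and the step count are hypothetical, and the classical cubic Newton-Schulz iteration is swapped in for the tuned polynomial that practical Muon implementations use.

```python
import numpy as np

def orthogonalize(G, steps=30):
    """Approximate the polar factor U V^T of G (from G = U S V^T).

    Assumption: uses the classical cubic Newton-Schulz iteration,
    not the tuned polynomial used in practical Muon implementations.
    """
    # Frobenius normalization puts every singular value in (0, 1],
    # which is inside the convergence region of the iteration.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which tends to 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_like_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One hypothetical Muon-style step for a single 2-D weight matrix."""
    momentum_buf = beta * momentum_buf + grad   # plain momentum accumulation
    update = orthogonalize(momentum_buf)        # equalize all singular values
    return W - lr * update, momentum_buf

# Usage on a random 4x6 "gradient" (illustrative data only).
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))
O = orthogonalize(G)
singular_values = np.linalg.svd(O, compute_uv=False)

W = np.zeros((4, 6))
W_new, buf = muon_like_step(W, G, np.zeros((4, 6)))
```

After orthogonalization every singular value of the update is close to 1, so the step has the same magnitude along every direction of the momentum matrix instead of being dominated by its few largest directions.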
