Claim
Zhang et al. (2024c) observed that different parameter blocks in Transformers have heterogeneous Hessian structures, i.e., the local curvature varies substantially from block to block. SGD performs poorly in this setting because it applies a single uniform learning rate to all blocks, whereas Adam's per-parameter adaptive learning rates accommodate this heterogeneity more effectively.
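A minimal toy sketch of the mechanism (my illustration, not code from Zhang et al.): a two-coordinate quadratic stands in for two parameter blocks with very different curvature (1000 vs. 1). SGD's single learning rate must stay below 2/1000 to remain stable on the sharp block, so the flat block barely moves, while Adam's per-coordinate rescaling gives each block a workable step size. The curvature values and hyperparameters here are arbitrary choices for illustration.

```python
import numpy as np

curvatures = np.array([1000.0, 1.0])

def grad(x):
    # Gradient of the toy loss f(x) = 0.5 * sum_i c_i * x_i^2,
    # whose (diagonal) Hessian has one sharp and one flat "block".
    return curvatures * x

def run_sgd(x, lr=0.0015, steps=100):
    # One learning rate shared by both blocks; stability on the
    # sharp block (curvature 1000) forces lr < 2/1000.
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def run_adam(x, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m = np.zeros_like(x)  # first-moment (mean) estimate
    v = np.zeros_like(x)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        # Dividing by sqrt(v_hat) rescales each coordinate individually,
        # so each block effectively receives its own step size.
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

x0 = np.ones(2)
print("SGD :", run_sgd(x0.copy()))   # flat coordinate has barely moved
print("Adam:", run_adam(x0.copy()))  # both coordinates approach 0
```

After 100 steps the flat SGD coordinate remains near its initial value, while both Adam coordinates end up markedly closer to the optimum, mirroring the claimed advantage under block-wise curvature heterogeneity.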
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org, via serper)
Referenced by nodes (1)
- Transformers (concept)