concept

gradient descent

Also known as: gradient descent optimization
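Many of the facts below treat gradient descent as the baseline optimizer. As a point of reference, the update rule is x ← x − η∇f(x); a minimal sketch (the quadratic objective, step size, and iteration count are illustrative choices, not taken from any cited work):

```python
# Minimal gradient descent on f(x) = (x - 3)^2, whose gradient is
# f'(x) = 2 * (x - 3). Step size and iteration count are arbitrary
# illustrative choices.

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly apply x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)  # converges toward 3.0
```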

Facts (23)

Sources
A Survey on the Theory and Mechanism of Large Language Models · arxiv.org · arXiv · Mar 12, 2026 · 14 facts
reference: The paper 'Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models' analyzes why the Adam optimizer outperforms standard gradient descent under heavy-tailed class imbalance in language models.
perspective: Jiang et al. (2024b) argue that, in high-dimensional settings, the formation of linear representations in Large Language Models is naturally compelled by the interplay between the next-token prediction objective and the implicit bias of gradient descent.
reference: The paper 'Why can gpt learn in-context? language models secretly perform gradient descent as meta optimizers' is an arXiv preprint (arXiv:2212.10559).
claim: Researchers are investigating whether input prompts act as latent variables that locate a specific task within a pre-trained distribution, or whether the model architecture implicitly executes meta-optimization algorithms such as gradient descent to adapt to provided examples.
reference: The paper 'Can looped transformers learn to implement multi-step gradient descent for in-context learning?' is an arXiv preprint (arXiv:2410.08292).
claim: Researchers in 2024 proved that Low-Rank Adaptation (LoRA) can eliminate spurious local minima, allowing gradient descent to find a high-performing low-rank solution.
claim: Mahankali et al. (2023) proved that when covariates are sampled from a Gaussian distribution, the pretraining loss of single-layer linear attention is minimized by one-step gradient descent.
claim: Li et al. (2025a) provided a convergence analysis showing how gradient descent enables non-linear Transformers to learn Chain-of-Thought (CoT) reasoning, and quantified the sample complexity required to maintain robustness against noisy context examples.
claim: Akyürek et al. (2022) demonstrated that, under certain constructions, Transformers can implement basic operations such as move, multiply, divide, and affine transformation, which can be composed to perform gradient descent.
claim: Fast weight programmers and online learners are a family of linear models obtained by applying different gradient-descent algorithms in online or streaming settings (Schmidhuber, 1992; Yang et al., 2024b; Liu et al., 2024a; Yang et al., 2024c).
reference: The paper 'Transformers learn in-context by gradient descent' was published in the Proceedings of the International Conference on Machine Learning, pp. 35151–35174.
claim: Oymak et al. (2023) characterize how gradient descent naturally guides prompts to focus on sparse, task-relevant tokens.
claim: Zheng et al. (2024) demonstrated that autoregressively trained Transformers can implement in-context learning by acquiring a meta-optimizer: they learn to perform one-step gradient descent to solve ordinary least squares (OLS) problems under specific initial data distribution conditions.
claim: Shen et al. (2025) theoretically investigated the training dynamics of a single-layer Transformer for in-context classification of Gaussian mixtures, showing that the model converges to the global optimum at a linear rate under gradient descent.
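The one-step-gradient-descent view in the Mahankali et al. (2023) and Zheng et al. (2024) facts can be sketched numerically: starting from w = 0, a single gradient step on the least-squares loss gives w = (η/n)·XᵀY, which for isotropic Gaussian covariates already lands near the OLS solution. This is a toy illustration of that observation, not the construction from either paper; dimensions, seed, and the choice η = 1 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5                      # many samples, few features
X = rng.standard_normal((n, d))       # isotropic Gaussian covariates
w_true = rng.standard_normal(d)
y = X @ w_true                        # noiseless linear targets

# Least-squares loss L(w) = ||X w - y||^2 / (2n) has gradient
# (1/n) X^T (X w - y); at w = 0 this is -(1/n) X^T y, so one
# gradient step with learning rate eta = 1 gives:
w_one_step = (1.0 / n) * X.T @ y

# Exact OLS solution for comparison; since X^T X / n is close to
# the identity for isotropic Gaussians, the single step is already
# a good approximation of it.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```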
Track: Poster Session 3 · AISTATS 2026 · virtual.aistats.org · Samuel Tesfazgi, Leonhard Sprandl, Sandra Hirche · 6 facts
claim: StableMDS is a gradient descent-based method for Weighted Multidimensional Scaling that reduces computational complexity to O(n^2 p) per iteration.
claim: There has been significant recent interest in understanding the implicit bias of gradient descent and its connection to the generalization properties of overparametrized neural networks.
claim: When linear diagonal networks are trained on the square loss for regression, gradient descent converges to special solutions, such as non-negative ones.
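The diagonal-network fact is often illustrated with a quadratic reparameterization: writing each regression weight as w_i = u_i², gradient descent on u can only ever produce non-negative effective weights. A toy sketch under that assumed parameterization (data, initialization, and step size are invented for illustration):

```python
import numpy as np

# Diagonal linear network sketch: parameterize each regression
# weight as w_i = u_i ** 2, so every solution gradient descent can
# reach is non-negative by construction.

rng = np.random.default_rng(1)
n, d = 200, 4
X = rng.standard_normal((n, d))
w_star = np.array([1.0, 0.5, 2.0, 0.0])   # non-negative ground truth
y = X @ w_star

u = np.full(d, 0.1)                        # small positive init
lr = 0.01
for _ in range(5000):
    w = u ** 2                             # effective weights
    grad_w = X.T @ (X @ w - y) / n         # dL/dw for the square loss
    u -= lr * 2 * u * grad_w               # chain rule: dw/du = 2u

w_learned = u ** 2                         # non-negative by construction
```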
claim: Estimating variable importance for algorithms trained with gradient descent or gradient boosting (such as neural networks and gradient-boosted decision trees) is computationally challenging when the number of variables is large, because it requires re-training the model.
claim: Multi-task representation learning outperforms single-task representation learning for over-parameterized two-layer convolutional neural networks trained by gradient descent.
claim: StableMDS achieves computational efficiency by applying gradient descent independently to each point, eliminating the need for the costly matrix operations inherent in Stress Majorization.
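A rough sketch of the per-point update idea attributed to StableMDS above: each embedding point takes its own gradient step on the weighted stress Σ_{i<j} w_ij (‖x_i − x_j‖ − d_ij)², and one full pass costs O(n²p) because each point's gradient sums over the other n − 1 points in p dimensions. The exact update rule, stabilization, and hyperparameters of StableMDS may differ; this only illustrates the general scheme.

```python
import numpy as np

def stress_step(X, D, W, lr=0.01, eps=1e-9):
    """One pass of per-point gradient descent on the weighted MDS
    stress; each point i is updated independently, and the whole
    pass costs O(n^2 p)."""
    n, p = X.shape
    for i in range(n):
        diff = X[i] - X                       # (n, p) offsets to all points
        dist = np.linalg.norm(diff, axis=1)   # current distances
        dist[i] = 1.0                         # dummy value; self-term is zeroed
        coef = W[i] * (dist - D[i]) / (dist + eps)
        coef[i] = 0.0                         # no self-term
        grad = 2.0 * (coef[:, None] * diff).sum(axis=0)
        X[i] -= lr * grad                     # update point i alone
    return X

# Tiny demo: fit a random 2-D embedding to distances taken from a
# random target configuration, and watch the stress decrease.
rng = np.random.default_rng(0)
P = rng.standard_normal((10, 2))
D = np.linalg.norm(P[:, None] - P[None, :], axis=2)  # target distances
W = np.ones((10, 10))                                # uniform weights
X = rng.standard_normal((10, 2))

def stress(X):
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    return np.sum(W * (d - D) ** 2) / 2.0

before = stress(X)
for _ in range(50):
    X = stress_step(X, D, W)
after = stress(X)   # smaller than `before`
```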
Neuro-Symbolic AI: Explainability, Challenges, and Future Trends · arxiv.org · arXiv · Nov 7, 2024 · 2 facts
procedure: The implicit adjustment process in Deep Symbolic Regression (DSR) provides feedback to the RNN to guide expression generation, relying on gradient descent or other optimization algorithms to adjust the RNN weights.
procedure: The LNN-based inductive logic programming method proposed by Sen et al. (2022) operates as follows: (1) input a knowledge base containing facts, relations, and rules describing the target structure; (2) build an LNN network from the template to simulate logical connectives, where each node represents an expression or logical rule; (3) use the facts in the knowledge base as training data, adjusting the logical operations via optimization algorithms such as back-propagation and gradient descent; (4) convert the trained LNN into a set of logical rules that reflect the relationships in the input data.
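Step (3) above, adjusting logical operations by gradient descent, can be illustrated with a toy differentiable connective. The gated product and_w(a, b) = σ(w)·a·b below is a deliberately simplified stand-in, not the LNN formulation of Sen et al. (2022); the facts, loss, and learning rate are all invented for illustration.

```python
import math

# Toy differentiable "AND": and_w(a, b) = sigmoid(w) * a * b, with
# the gate weight w fitted to labeled facts by gradient descent on
# a squared-error loss.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Facts: (a, b, target truth value of "a AND b"), truth values in [0, 1].
facts = [(1.0, 1.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

w, lr = 0.0, 1.0
for _ in range(500):
    grad = 0.0
    for a, b, t in facts:
        s = sigmoid(w)
        pred = s * a * b
        # d/dw [(pred - t)^2] = 2 (pred - t) * a * b * s * (1 - s)
        grad += 2.0 * (pred - t) * a * b * s * (1.0 - s)
    w -= lr * grad / len(facts)

gate = sigmoid(w)   # trained gate approaches 1, recovering a plain AND
```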
The Evidence for AI Consciousness, Today · ai-frontiers.org · AI Frontiers · Dec 8, 2025 · 1 fact
claim: Current AI training methods use gradient descent on petabytes of text to reshape networks with hundreds of billions of parameters, followed by fine-tuning that may suppress accurate self-reports about internal states.