Transformers
Facts (72)
Sources
A Survey on the Theory and Mechanism of Large Language Models arxiv.org Mar 12, 2026 56 facts
reference: The paper 'RNNs are not transformers (yet): the key bottleneck on in-context retrieval' is an arXiv preprint (arXiv:2402.18510).
claim: Several research works relate the optimization objective of Transformers to energy-based principles, including Ramsauer et al. (2020), Hoover et al. (2023), Hu et al. (2023a), Wu et al. (2023), Ren et al. (2025), and Hu et al. (2025).
claim: Jelassi et al. (2024) demonstrated that Transformers can copy sequences of exponential length, whereas fixed-state models are fundamentally limited by their finite memory.
claim: Kim and Suzuki (2024) theoretically showed that for Transformers with both MLP and attention layers, assuming rapid convergence of the attention layers, the infinite-dimensional loss landscape for the MLP parameters exhibits a benign non-convex structure.
reference: Giannou et al. (2023b) proposed treating Transformers as programmable computational units, where a fixed layer is repeatedly applied to execute instructions encoded in the input sequence.
claim: According to Wen et al. (2023), Transformers can learn very different attention patterns that all produce correct outputs on bounded Dyck grammars, so interpretability via local ("myopic") analysis can be provably misleading.
reference: The paper 'Chain of thought empowers transformers to solve inherently serial problems' is available as arXiv preprint arXiv:2402.12875.
claim: Transformers operating under a causal mask execute algorithms that function as online gradient descent with non-decaying step sizes, which fails to guarantee convergence to optimal solutions, according to a 2023 analysis.
claim: Cheng et al. (2023) and Collins et al. (2024) explored the ability of Transformers to learn a wider range of nonlinear functions, extending beyond linear attention settings.
measurement: When the problem is sparse, the prediction error of in-context learning (ICL) in Transformers is comparable to that of the Lasso solution, according to Garg et al. (2022).
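The Lasso baseline behind this comparison is easy to reproduce in NumPy. The sketch below (toy dimensions of my own choosing, with ISTA as a stand-in Lasso solver rather than anything from Garg et al.) fits a sparse linear problem with fewer examples than dimensions and compares prediction error against the minimum-norm least-squares interpolator:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, s = 50, 30, 3          # dimension, in-context examples, sparsity

w_true = np.zeros(d)
w_true[rng.choice(d, s, replace=False)] = 1.0
X = rng.standard_normal((D, d))
y = X @ w_true               # noiseless sparse regression, D < d

# ISTA: proximal gradient descent on the Lasso objective
#   (1/2D) * ||y - Xw||^2 + lam * ||w||_1
lam, eta = 0.05, 0.05
w = np.zeros(d)
for _ in range(2000):
    w -= eta * X.T @ (X @ w - y) / D
    w = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # min-norm interpolator

X_test = rng.standard_normal((200, d))
mse = lambda w_hat: np.mean((X_test @ (w_hat - w_true)) ** 2)
print(mse(w), mse(w_ols))    # Lasso error is far below the dense baseline
```

With 30 examples in 50 dimensions, the sparsity-aware estimate is close to the ground truth while the dense least-squares solution is not, which is the regime the measurement refers to.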
reference: The paper 'Transformers as intrinsic optimizers: forward inference through the energy principle' is an arXiv preprint (arXiv:2511.00907) cited in section 4.2.1 of 'A Survey on the Theory and Mechanism of Large Language Models'.
reference: The paper 'Grokking of implicit reasoning in transformers: a mechanistic journey to the edge of generalization' explores implicit reasoning in transformer models.
reference: The paper 'Are transformers with one layer self-attention using low-rank weight matrices universal approximators?' is an arXiv preprint (arXiv:2307.14023) cited in section 3.2.1 of 'A Survey on the Theory and Mechanism of Large Language Models'.
reference: The paper 'Transformers learn shortcuts to automata' is available as arXiv preprint arXiv:2210.10749.
reference: The paper 'Statistically meaningful approximation: a case study on approximating Turing machines with transformers' examines the ability of transformers to approximate Turing machines.
measurement: Sanford et al. (2023) introduced the 'sparse averaging' task and demonstrated that Transformers can solve it with only logarithmic communication complexity, whereas RNNs and feed-forward networks require polynomial communication complexity.
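The task itself is simple to state, and a single softmax attention head with saturated logits solves it. The NumPy sketch below is my own toy rendering of the idea, not Sanford et al.'s construction: each query token must average the values of a small designated subset of positions, and putting a large logit on exactly those keys makes the attention weights nearly uniform over the subset.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, c = 8, 3, 30.0          # tokens, subset size, logit scale

x = rng.standard_normal((n, 2))               # per-token values
subsets = [rng.choice(n, q, replace=False) for _ in range(n)]

# Keys are one-hot position ids; each query puts weight c on its subset.
K = np.eye(n)
Q = np.zeros((n, n))
for i, S in enumerate(subsets):
    Q[i, S] = c

logits = Q @ K.T                               # c on subset members, 0 elsewhere
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)              # softmax rows ~ uniform on S_i
attn_out = A @ x

exact = np.stack([x[S].mean(axis=0) for S in subsets])
print(np.abs(attn_out - exact).max())          # tiny: attention = sparse average
```

Non-members receive weight on the order of e^(-c), so the attention output matches the exact subset average up to negligible error.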
reference: The paper 'What can transformers learn in-context? a case study of simple function classes' was published in Advances in Neural Information Processing Systems 35, pp. 30583–30598.
claim: Merrill et al. (2024) showed that linear RNNs with diagonal transition matrices have expressive power comparable to that of Transformers, while allowing data-dependent non-diagonal transitions enables linear RNNs to surpass that class.
claim: Test-time training (regression) is a model architecture design framework utilized to address the quadratic complexity of Transformers with respect to sequence length, as discussed by Sun et al. (2024), Yang et al. (2023c), von Oswald et al. (2025), Wang et al. (2025a), and Behrouz et al. (2024; 2025).
claim: Relative positional encodings introduce a distance attenuation effect that competes and balances with the deviation caused by the causal mask in multi-layer Transformers, according to Wu et al. (2025c).
claim: Transformers with O(1) depth and log-precision can only solve problems within the complexity class TC^0, as shown by Merrill and Sabharwal (2023b).
claim: The self-attention mechanism of Transformers (Vaswani et al., 2017a) incurs quadratic computational cost in sequence length, which acts as an obstacle to their broad deployment in real-world settings.
reference: The paper 'Disentangling feature structure: A mathematically provable two-stage training dynamics in transformers' is an arXiv preprint, identified as arXiv:2502.20681.
reference: The paper 'Transformers as algorithms: generalization and stability in in-context learning' is available as arXiv preprint arXiv:2301.07067.
perspective: There is growing research interest in understanding the theoretical limits of Transformers under realistic constraints, specifically finite precision, width, and depth, rather than just idealized theoretical analyses.
reference: Li et al. (2025b) theoretically analyzed 'Task Arithmetic' and proved that, under suitable assumptions, linear operations such as addition and negation can successfully edit knowledge in nonlinear Transformers and generalize to out-of-domain tasks.
reference: Garg et al. (2022) found that Transformers trained on well-defined linear tasks can achieve predictive performance comparable to the least squares algorithm.
claim: Dai et al. (2022) assert that Transformers implicitly fine-tune during in-context learning inference, building upon the dual form of the attention mechanism originally proposed by Aiserman et al. (1964) and Irie et al. (2022).
reference: The paper 'Transformers are SSMs: generalized models and efficient algorithms through structured state space duality' is an arXiv preprint (arXiv:2405.21060).
reference: Wu et al. (2025c) propose a graph theory framework to analyze position bias in multi-layer Transformers, revealing that causal masking inherently biases attention towards earlier positions because tokens in deep layers continuously aggregate context information from earlier tokens.
reference: The paper 'In-context learning with transformers: softmax attention adapts to function Lipschitzness' is an arXiv preprint (arXiv:2402.11639) regarding in-context learning.
reference: The paper 'Revisiting transformers through the lens of low entropy and dynamic sparsity' is an arXiv preprint (arXiv:2504.18929) cited in section 3.2.3 of 'A Survey on the Theory and Mechanism of Large Language Models'.
reference: The paper 'Uncovering mesa-optimization algorithms in transformers' is an arXiv preprint, arXiv:2309.05858.
claim: Ren and Liu (2025) reveal that Transformers have an inherent bias toward learning distributions with lower entropy than the true target, a bias primarily driven by the feed-forward (FFN) modules.
claim: Feng et al. (2023a) used circuit complexity theory to prove that finite-depth Transformers can execute tasks by extending their effective depth linearly with the number of generated reasoning steps.
claim: Akyürek et al. (2022) demonstrated that under certain constructions, Transformers can implement basic operations such as move, multiply, divide, and affine transformations, which can be combined to perform gradient descent.
reference: The paper 'Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars' was published in Advances in Neural Information Processing Systems 36, pages 38723–38766.
claim: Transformers with constant precision are limited to solving problems in the complexity class AC^0, as shown by Li et al. (2024b).
claim: Zhang et al. (2024c) observed that different parameter blocks in Transformers possess heterogeneous Hessian structures, causing SGD to perform poorly because it applies a uniform learning rate, whereas Adam's adaptive learning rate handles this heterogeneity more effectively.
reference: The paper 'Transformers are RNNs: fast autoregressive transformers with linear attention' was published in the International Conference on Machine Learning, pages 5156–5165, and is cited in section 3.2.3 of 'A Survey on the Theory and Mechanism of Large Language Models'.
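The equivalence in that paper's title can be checked numerically: causal linear attention computed in its parallel (matrix) form matches a token-by-token RNN recurrence over a constant-size outer-product state. A NumPy sketch with the elu+1 feature map (dimensions and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 6, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

phi = lambda u: np.where(u > 0, u + 1.0, np.exp(u))   # elu(u) + 1 feature map
Qf, Kf = phi(Q), phi(K)

# Parallel form: causally masked kernel attention, row-normalized.
scores = np.tril(Qf @ Kf.T)
parallel = (scores @ V) / scores.sum(axis=1, keepdims=True)

# Recurrent form: O(1)-size state (S, z) updated once per token.
S, z = np.zeros((d, d)), np.zeros(d)
recurrent = np.zeros((T, d))
for t in range(T):
    S += np.outer(Kf[t], V[t])         # accumulate phi(k_t) v_t^T
    z += Kf[t]                         # accumulate phi(k_t) for normalization
    recurrent[t] = (Qf[t] @ S) / (Qf[t] @ z)

print(np.allclose(parallel, recurrent))   # True: both forms agree
```

The recurrent form is what makes autoregressive generation O(1) per token, while the parallel form is what gets used at training time.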
reference: Tenney et al. (2019) utilized edge-probing tasks to measure the distribution of various linguistic phenomena across layers in contextual encoders and transformers.
reference: Garg et al. (2022) demonstrated that Transformers could effectively learn and generalize on complex function classes, including two-layer neural networks and four-layer decision trees.
reference: The paper 'Transformers learn nonlinear features in context: nonconvex mean-field dynamics on the attention landscape' was published in the Forty-first International Conference on Machine Learning and is cited in section 3.2.2 of 'A Survey on the Theory and Mechanism of Large Language Models'.
reference: The paper 'In-context convergence of transformers' was published in the Forty-first International Conference on Machine Learning.
reference: The paper 'Selective induction heads: how transformers select causal structures in context' was presented at The Thirteenth International Conference on Learning Representations.
reference: The paper 'Transformers learn in-context by gradient descent' was published in the International Conference on Machine Learning, pp. 35151–35174.
reference: The paper 'How do transformers learn topic structure: towards a mechanistic understanding' was published in the International Conference on Machine Learning, pp. 19689–19729.
claim: Transformers and LSTMs both possess the ability to learn in-context, and this capability improves with the length and quantity of demonstrations.
claim: Transformers can statistically approximate Turing machines running in time T with sample complexity polynomial in the alphabet size, state-space size, and T, as demonstrated by Wei et al. (2022a).
claim: Incorporating the Delta Rule into Transformers has been explored as a method to strengthen the expressive power of these models.
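The intuition behind the Delta Rule is visible on a fast-weight memory: a plain additive (linear-attention-style) write accumulates every value ever stored under a key, while the delta update S ← S + β(v − Sk)kᵀ overwrites the stored association. A NumPy sketch (β = 1, unit-norm key, toy values of my own choosing):

```python
import numpy as np

k = np.array([1.0, 0.0, 0.0])        # unit-norm key
v1 = np.array([1.0, 2.0])
v2 = np.array([5.0, -3.0])           # second value written under the same key

def delta_update(S, k, v, beta=1.0):
    # Delta Rule: correct the current retrieval S @ k toward v.
    return S + beta * np.outer(v - S @ k, k)

def additive_update(S, k, v):
    # Plain linear-attention write: just add the outer product.
    return S + np.outer(v, k)

S_delta = delta_update(delta_update(np.zeros((2, 3)), k, v1), k, v2)
S_add = additive_update(additive_update(np.zeros((2, 3)), k, v1), k, v2)

print(S_delta @ k)   # [ 5. -3.]  -> delta rule retrieves the latest value v2
print(S_add @ k)     # [ 6. -1.]  -> additive memory returns v1 + v2
```

The ability to erase and replace, rather than only accumulate, is the expressivity gain the claim refers to.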
reference: The paper 'Transformers implement functional gradient descent to learn non-linear functions in context' is an arXiv preprint, identified as arXiv:2312.06528.
claim: Zheng et al. (2024) demonstrated that autoregressively trained Transformers can implement in-context learning by learning a meta-optimizer, specifically learning to perform one-step gradient descent to solve ordinary least squares (OLS) problems under specific initial data distribution conditions.
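The "specific initial data distribution conditions" matter: when the empirical input covariance is exactly the identity, one gradient step from zero with unit step size already coincides with the OLS solution, so one-step GD is not a crude approximation in that regime. A NumPy illustration (the whitened design is constructed by hand; this is a sanity check, not Zheng et al.'s setting):

```python
import numpy as np

rng = np.random.default_rng(3)
D, d = 40, 5

# Build a design matrix with X^T X / D = I exactly (whitened inputs).
Qmat, _ = np.linalg.qr(rng.standard_normal((D, d)))
X = np.sqrt(D) * Qmat
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(D)

# One gradient step on (1/2D)||y - Xw||^2 from w = 0 with step size 1:
#   w1 = 0 + X^T y / D
w_one_step = X.T @ y / D

# Closed-form OLS solution.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(w_one_step, w_ols))   # True when X^T X / D = I
```

For non-whitened inputs the two differ, which is why the result is stated under distributional conditions.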
reference: The paper 'Repeat after me: transformers are better than state space models at copying' is an arXiv preprint, identified as arXiv:2402.01032.
reference: The paper 'Unveiling induction heads: provable training dynamics and feature learning in transformers' was published in Advances in Neural Information Processing Systems.
claim: Hybrid architectures, such as those combining Mamba with Transformers, can achieve high efficiency while maintaining performance comparable to standard models.
claim: Pérez et al. (2021) proved that Transformers are Turing complete under the assumption of infinite precision.
Track: Poster Session 3 - aistats 2026 virtual.aistats.org 2 facts
claim: Yingqian Cui, Jie Ren, Pengfei He, Hui Liu, Jiliang Tang, and Yue Xing present a theoretical analysis comparing the exact convergence of single-head and multi-head attention in transformers for in-context learning with linear regression tasks.
formula: As the number of in-context examples D increases, the prediction loss for both single-head and multi-head attention in transformers decays as O(1/D), but the loss for multi-head attention has a smaller multiplicative constant.
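The O(1/D) rate is easy to sanity-check for the plain OLS predictor that in-context linear regression is usually compared against: with Gaussian inputs and noise, the average prediction loss shrinks roughly in proportion to 1/D. A Monte Carlo sketch (toy dimensions and trial counts of my own choosing, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(4)
d, sigma, trials = 5, 1.0, 500

def ols_pred_loss(D):
    # Average squared prediction error of OLS on a fresh input, over trials.
    losses = []
    for _ in range(trials):
        w = rng.standard_normal(d)
        X = rng.standard_normal((D, d))
        y = X @ w + sigma * rng.standard_normal(D)
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        x_new = rng.standard_normal(d)
        losses.append((x_new @ (w_hat - w)) ** 2)
    return np.mean(losses)

for D in (50, 100, 200, 400):
    print(D, ols_pred_loss(D))   # loss roughly halves each time D doubles
```

The theoretical mean for this estimator is sigma^2 * d / (D - d - 1), which is O(1/D); the attention comparison in the fact above concerns the constant in front of that rate.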
Neuro-Symbolic AI: Explainability, Challenges & Future Trends linkedin.com Dec 15, 2025 2 facts
claim: Generative adversarial networks (GANs), transformers, and graph neural networks (GNNs) demonstrate strong capabilities in modeling complex spatial-temporal dependencies and achieving accurate motion reconstruction within the AI domain.
claim: Knowledge of Generative AI architectures, such as Large Language Models (LLMs), Generative Adversarial Networks (GANs), and Transformers, is critical for driving innovation, enhancing productivity, and personalizing experiences in industries like marketing, software development, and design.
Understanding LLM Understanding skywritingspress.ca Jun 14, 2024 1 fact
claim: Modern AI models, such as transformers, implement high-order Markov chains at their core.
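Read charitably, the claim is that a next-token predictor with a bounded context window defines a high-order Markov chain: the distribution of the next token depends only on the last k tokens. A toy order-2 chain estimated by counting, on a hypothetical corpus of my own:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate on the mat".split()
k = 2  # Markov order: next token depends only on the last k tokens

counts = defaultdict(Counter)
for i in range(len(corpus) - k):
    counts[tuple(corpus[i:i + k])][corpus[i + k]] += 1

def predict(context):
    # Most likely next token given only the last k tokens.
    return counts[tuple(context[-k:])].most_common(1)[0][0]

print(predict(["on", "the"]))   # 'mat'
```

A transformer with a finite context length is, formally, a (very high-order, parametrically compressed) version of this conditional table.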
The Synergy of Symbolic and Connectionist AI in LLM ... arxiv.org 1 fact
claim: Large Language Models are trained on large-scale transformers comprising billions of learnable parameters to support abilities including perception, reasoning, planning, and action.
MedHallu - GitHub github.com 1 fact
code: The MedHallu software stack requires Python 3.8+, PyTorch, Transformers, vLLM, and Sentence-Transformers.
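A hedged environment sketch based on that list (these are the usual PyPI package names; pin exact versions per the MedHallu repository's own instructions, which this does not reproduce):

```shell
# Python 3.8+ assumed; package names are the standard PyPI identifiers.
pip install torch transformers vllm sentence-transformers
```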
What is Open Source Software? - HotWax Systems hotwaxsystems.com Aug 11, 2025 1 fact
claim: Open source AI toolkits, including LangChain, Haystack, Transformers, llama.cpp, ONNX, and OpenVINO, enable the development of AI systems that are transparent, customizable, and performant across different environments.
Combining Knowledge Graphs and Large Language Models - arXiv arxiv.org Jul 9, 2024 1 fact
reference: Micaela E. Consens, Cameron Dufault, Michael Wainberg, Duncan Forster, Mehran Karimzadeh, Hani Goodarzi, Fabian J. Theis, Alan Moses, and Bo Wang authored the 2023 paper 'To transformers and beyond: Large language models for the genome' (arXiv:2311.07621).
Neuro-Symbolic AI: Explainability, Challenges, and Future Trends arxiv.org Nov 7, 2024 1 fact
reference: Vishal Pallagani, Bharath Muppasani, Keerthiram Murugesan, Francesca Rossi, Lior Horesh, Biplav Srivastava, Francesco Fabiano, and Andrea Loreggia developed Plansformer, a method for generating symbolic plans using transformers, in 2022.
Practices, opportunities and challenges in the fusion of knowledge ... frontiersin.org 1 fact
reference: Knowledge-neurons (Dai et al., 2021) identify and activate neurons corresponding to specific facts, exploring the storage of factual knowledge in pre-trained Transformers and the editing and updating of internal knowledge.
A survey on augmenting knowledge graphs (KGs) with large ... link.springer.com Nov 4, 2024 1 fact
claim: The architecture of large language models, utilizing attention and transformers, allows them to identify important words in sentences, enabling them to handle a wide range of NLP tasks.
The Synergy of Symbolic and Connectionist AI in LLM-Empowered ... arxiv.org Jul 11, 2024 1 fact
claim: Large Language Models (LLMs) are trained on large-scale transformers comprising billions of learnable parameters to support agent abilities such as perception, reasoning, planning, and action.
Bridging the Gap Between LLMs and Evolving Medical Knowledge arxiv.org Jun 29, 2025 1 fact
reference: Jacob Devlin published 'BERT: Pre-training of deep bidirectional transformers for language understanding' in 2018.
Applying Large Language Models in Knowledge Graph-based ... arxiv.org Jan 7, 2025 1 fact
reference: Xu, P., Zhu, X., and Clifton, D.A. published the paper 'Multimodal learning with transformers: A survey' in the IEEE Transactions on Pattern Analysis and Machine Intelligence in 2023.
Building Trustworthy NeuroSymbolic AI Systems - arXiv arxiv.org 1 fact
claim: Large Language Models (LLMs) are successors to foundational language models like BERT (Bidirectional Encoder Representations from Transformers) and represent a combination of feedforward neural networks and transformers.