Relations (1)

related (score 11.00): strongly supporting, 11 facts


Facts (11)

Sources
A Survey on the Theory and Mechanism of Large Language Models · arxiv.org (arXiv) · 9 facts
measurement: When the problem is sparse, the prediction error of in-context learning (ICL) in Transformers is comparable to the solution of the Lasso problem (Garg et al., 2022).
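The Lasso comparison can be illustrated with a small simulation (an illustrative sketch, not Garg et al.'s experimental setup: the dimensions, noise level, regularization strength, and the ISTA solver below are all assumed choices). On a sparse linear problem with fewer examples than dimensions, a Lasso estimate recovers the signal far better than plain least squares, which is the baseline ICL performance is said to match:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 50, 3, 30                     # dimension, sparsity, number of in-context examples
w = np.zeros(d)
w[:k] = rng.normal(size=k)              # k-sparse ground-truth weight vector
X = rng.normal(size=(n, d))
y = X @ w + 0.01 * rng.normal(size=n)   # noisy linear observations

def lasso_ista(X, y, lam=0.05, iters=2000):
    """Lasso via iterative soft-thresholding (proximal gradient descent)."""
    n, d = X.shape
    lr = n / np.linalg.norm(X, 2) ** 2  # step size 1/L for the (1/2n)||Xw-y||^2 loss
    w = np.zeros(d)
    for _ in range(iters):
        g = X.T @ (X @ w - y) / n       # gradient of the smooth part
        w = w - lr * g
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

w_lasso = lasso_ista(X, y)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # minimum-norm least squares (n < d)
print(np.linalg.norm(w_lasso - w), np.linalg.norm(w_ols - w))
```

Because n < d, least squares merely interpolates, while the L1 penalty exploits sparsity; the measurement above says trained Transformers' ICL error tracks the Lasso-like solution in this regime.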
reference: The paper 'What can transformers learn in-context? A case study of simple function classes' was published in Advances in Neural Information Processing Systems 35, pp. 30583–30598.
reference: The paper 'Transformers as algorithms: generalization and stability in in-context learning' is available as arXiv preprint arXiv:2301.07067.
reference: The paper 'In-context learning with transformers: softmax attention adapts to function Lipschitzness' is an arXiv preprint (arXiv:2402.11639) on in-context learning.
reference: The paper 'Selective induction heads: how transformers select causal structures in context' was presented at the Thirteenth International Conference on Learning Representations.
reference: The paper 'Transformers learn in-context by gradient descent' was published at the International Conference on Machine Learning, pp. 35151–35174.
claim: Both Transformers and LSTMs can learn in-context, and this capability improves with the length and number of demonstrations.
reference: The paper 'Transformers implement functional gradient descent to learn non-linear functions in context' is an arXiv preprint, arXiv:2312.06528.
claim: Zheng et al. (2024) showed that autoregressively trained Transformers implement in-context learning by acting as a meta-optimizer: under specific initial data distribution conditions, they learn to perform one step of gradient descent that solves ordinary least squares (OLS) problems.
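The mechanism in this claim can be sketched in closed form (a minimal illustration, not Zheng et al.'s exact setting: the orthogonal design below is an assumed stand-in for their data distribution condition). When the covariates satisfy X^T X = n·I, a single gradient-descent step from zero on the squared loss lands exactly on the OLS solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n = d = 8
# Orthogonal design so that X^T X = n * I (illustrative assumption standing in
# for the "specific initial data distribution" condition in the claim).
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = np.sqrt(n) * Q
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# One gradient-descent step from w0 = 0 on the loss (1/2n)||Xw - y||^2,
# with step size 1: w1 = w0 - grad = X^T y / n.
w_one_step = X.T @ y / n

# Closed-form OLS solution for comparison: (X^T X)^{-1} X^T y.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(w_one_step, w_ols))  # the single step coincides with OLS
```

With X^T X = n·I the inverse in the OLS formula collapses to 1/n, which is exactly the scaling a one-step gradient update supplies; this is the sense in which a learned one-step optimizer can solve OLS under the right data distribution.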
Track: Poster Session 3 · AISTATS 2026 · virtual.aistats.org · Samuel Tesfazgi, Leonhard Sprandl, Sandra Hirche · AISTATS · 2 facts
claim: Yingqian Cui, Jie Ren, Pengfei He, Hui Liu, Jiliang Tang, and Yue Xing present a theoretical analysis comparing the exact convergence of single-head and multi-head attention in Transformers for in-context learning on linear regression tasks.
formula: As the number of in-context examples D grows, the prediction loss of both single-head and multi-head attention decays as O(1/D), but multi-head attention achieves a smaller multiplicative constant.
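The O(1/D) decay can be checked empirically with a plain OLS predictor (an illustrative stand-in, not the trained attention layers the formula describes; the dimension, noise level, and trial count are assumed). Doubling the number of in-context examples D roughly halves the average prediction loss:

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, trials = 5, 0.5, 2000

def avg_pred_loss(D):
    """Average squared prediction error at a fresh query point,
    after fitting OLS on D in-context examples."""
    total = 0.0
    for _ in range(trials):
        w = rng.normal(size=d)                        # task weight vector
        X = rng.normal(size=(D, d))
        y = X @ w + sigma * rng.normal(size=D)        # D noisy demonstrations
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # in-context estimate
        x_q = rng.normal(size=d)                      # held-out query
        total += (x_q @ (w_hat - w)) ** 2
    return total / trials

l40, l80 = avg_pred_loss(40), avg_pred_loss(80)
print(l40, l80, l40 / l80)  # ratio near 2, consistent with O(1/D) decay
```

Under this baseline the excess risk behaves like sigma^2 · d / D, so the loss ratio between D = 40 and D = 80 sits near 2; the cited analysis shows the same 1/D rate for attention, with the head count affecting only the constant in front.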