Formula
As the number of in-context examples D grows, the prediction loss of both single-head and multi-head attention in transformers decays as O(1/D); multi-head attention achieves the same rate with a smaller multiplicative constant.
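A minimal numerical sketch of the O(1/D) rate, under assumptions not taken from the source: instead of a trained transformer, it uses ordinary least squares on D noisy in-context examples as a stand-in predictor, since its excess prediction risk also decays as O(1/D). The dimensions, noise level, and example counts are illustrative choices.

```python
import numpy as np

# Hedged illustration (not the paper's transformer experiment): for least
# squares fit on D noisy in-context examples, the excess prediction risk on a
# fresh query decays as O(1/D), mirroring the rate stated in the formula.
rng = np.random.default_rng(0)
d, noise, trials = 5, 0.5, 2000  # feature dim, label noise std, Monte Carlo runs

def excess_risk(D):
    """Average squared prediction error of OLS fitted on D examples."""
    errs = []
    for _ in range(trials):
        w = rng.normal(size=d)                   # ground-truth task vector
        X = rng.normal(size=(D, d))              # D in-context inputs
        y = X @ w + noise * rng.normal(size=D)   # noisy labels
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        x_q = rng.normal(size=d)                 # held-out query input
        errs.append((x_q @ (w_hat - w)) ** 2)    # excess squared error
    return float(np.mean(errs))

r32, r128 = excess_risk(32), excess_risk(128)
# Quadrupling D should shrink the loss by roughly 4x if the rate is O(1/D).
print(f"D=32: {r32:.4f}  D=128: {r128:.4f}  ratio: {r32 / r128:.2f}")
```

The ratio comes out somewhat above 4 at these small D because the exact OLS risk is proportional to 1/(D - d - 1) rather than 1/D; the two agree asymptotically.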
Authors
Sources
- Track: Poster Session 3 - AISTATS 2026, virtual.aistats.org (via Serper)
Referenced by nodes (2)
- Transformers concept
- In-Context Learning concept