claim
Zhang et al. (2024b) studied the training dynamics of a Transformer with a single linear attention layer during in-context learning for linear regression tasks and showed that the model can find the global minimum of the objective function.
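
To make the setting concrete, here is a minimal sketch of a single linear (softmax-free) attention layer applied to an in-context linear regression prompt. It assumes the commonly used merged parametrization with a value-projection matrix and a key-query matrix. The names `lsa_predict`, `one_step_gd_weights`, `W_pv`, and `W_kq` are hypothetical, and the weights are set by hand rather than trained, so the sketch illustrates the model class only and does not reproduce the training-dynamics analysis.

```python
import numpy as np


def lsa_predict(X, y, x_query, W_pv, W_kq):
    """Predict the query label with one linear self-attention layer (no softmax)."""
    n, d = X.shape
    # Prompt embedding: column i holds (x_i, y_i); the last column holds (x_query, 0).
    E = np.zeros((d + 1, n + 1))
    E[:d, :n] = X.T
    E[d, :n] = y
    E[:d, n] = x_query
    # Residual linear attention update: E + W_pv E (E^T W_kq E) / n.
    out = E + W_pv @ E @ (E.T @ W_kq @ E) / n
    # The prediction is read from the label slot of the query column.
    return out[d, n]


def one_step_gd_weights(d):
    """Hand-picked (not trained) weights, chosen here only for illustration."""
    W_pv = np.zeros((d + 1, d + 1))
    W_pv[d, d] = 1.0           # copy only the label row
    W_kq = np.zeros((d + 1, d + 1))
    W_kq[:d, :d] = np.eye(d)   # compare covariates with an identity metric
    return W_pv, W_kq


rng = np.random.default_rng(0)
d, n = 5, 2000
w_true = rng.normal(size=d)        # task vector for this prompt
X = rng.normal(size=(n, d))        # isotropic Gaussian covariates (an assumption)
y = X @ w_true                     # noiseless labels
x_query = rng.normal(size=d)

W_pv, W_kq = one_step_gd_weights(d)
y_hat = lsa_predict(X, y, x_query, W_pv, W_kq)
print(y_hat, x_query @ w_true)     # prediction vs. true query label
```

With these weights the layer output reduces to `x_query @ (X.T @ y) / n`, which is the prediction after one gradient-descent step on the in-context least-squares loss starting from zero with unit step size. For isotropic covariates and a long prompt, this should land close to the true label.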
