claim
Chen et al. (2024e) used gradient flow to analyze how a simplified Transformer with two attention layers performs in-context learning, showing how its components work together.
