reference
Wu et al. (2025c) propose a graph-theoretic framework for analyzing position bias in multi-layer Transformers, showing that causal masking inherently biases attention toward earlier positions: tokens in deeper layers repeatedly aggregate contextual information that was itself drawn from earlier tokens.
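A minimal sketch of the claimed effect, not the authors' actual framework: assuming uniform attention under a causal mask, composing attention across layers (here approximated by multiplying row-stochastic attention matrices) concentrates the effective attention of late tokens on the earliest positions.

```python
import numpy as np

def uniform_causal_attention(n: int) -> np.ndarray:
    """Row-stochastic causal attention: token i attends uniformly to positions 0..i."""
    mask = np.tril(np.ones((n, n)))          # causal mask (lower triangular)
    return mask / mask.sum(axis=1, keepdims=True)

n_tokens, n_layers = 8, 6                     # toy sizes, chosen for illustration
A = uniform_causal_attention(n_tokens)

# Stacking layers ~ repeated aggregation: the last row of A^L shows how much
# the final token effectively draws from each position after L layers.
effective = np.linalg.matrix_power(A, n_layers)
print(np.round(effective[-1], 3))
# As n_layers grows, the mass shifts toward position 0, i.e. earlier tokens
# dominate the final token's aggregated context.
```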
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org)
Referenced by nodes (2)
- Transformers concept
- attention concept