reference
Wu et al. (2025c) propose a graph-theoretic framework for analyzing position bias in multi-layer Transformers, showing that causal masking inherently biases attention toward earlier positions: because each token can attend only to itself and preceding tokens, representations in deep layers repeatedly aggregate context information that originates from earlier positions.
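The compounding effect can be illustrated with a toy calculation (a minimal sketch, not the paper's graph-theoretic framework): under uniform causal attention, composing layers corresponds to taking powers of a lower-triangular attention matrix, and the aggregate influence of each source position on the final representations concentrates on early positions.

```python
import numpy as np

n, layers = 6, 3

# Uniform causal attention: token i attends equally to positions 0..i,
# and to nothing after itself (the causal mask zeroes the upper triangle).
A = np.tril(np.ones((n, n)))
A /= A.sum(axis=1, keepdims=True)

# Composing attention across layers: the effective token-to-token
# influence after `layers` rounds of aggregation is the matrix power A^L.
effective = np.linalg.matrix_power(A, layers)

# Total influence each source position exerts on all final representations.
# This is largest at position 0 and decays toward later positions.
influence = effective.sum(axis=0)
print(influence)
```

Even though each individual layer weights its visible context uniformly, stacking layers skews the effective influence toward the start of the sequence, which is the bias the summary above describes.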
