reference
Wu et al. (2025c) propose a graph-theoretic framework for analyzing position bias in multi-layer Transformers, showing that causal masking inherently biases attention toward earlier positions: tokens in deeper layers repeatedly aggregate contextual information that was itself drawn from earlier tokens.
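A minimal sketch of the claimed effect, not the authors' actual framework: assuming uniform attention under a causal mask, composing attention across layers (here approximated by multiplying row-stochastic attention matrices) concentrates the effective attention of late tokens on the earliest positions.

```python
import numpy as np

def uniform_causal_attention(n: int) -> np.ndarray:
    """Row-stochastic causal attention: token i attends uniformly to positions 0..i."""
    mask = np.tril(np.ones((n, n)))          # causal mask (lower triangular)
    return mask / mask.sum(axis=1, keepdims=True)

n_tokens, n_layers = 8, 6                     # toy sizes, chosen for illustration
A = uniform_causal_attention(n_tokens)

# Stacking layers ~ repeated aggregation: the last row of A^L shows how much
# the final token effectively draws from each position after L layers.
effective = np.linalg.matrix_power(A, n_layers)
print(np.round(effective[-1], 3))
# As n_layers grows, the mass shifts toward position 0, i.e. earlier tokens
# dominate the final token's aggregated context.
```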
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org)
Referenced by nodes (2)
- Transformers concept
- attention concept