claim
Ren and Liu (2025) report that Transformers have an inherent bias toward learning distributions with lower entropy than the true target distribution, and that this bias is driven primarily by the feed-forward network (FFN) modules.
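
To make "lower entropy than the true target" concrete, the following is a minimal sketch that compares the Shannon entropy of two discrete distributions. The probability values and the names `target`, `learned`, and `shannon_entropy` are hypothetical and chosen only for illustration; they are not results or code from Ren and Liu (2025).

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    # Zero-probability outcomes contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token distributions over a 4-token vocabulary.
# These numbers are illustrative assumptions, not measurements.
target = [0.4, 0.3, 0.2, 0.1]      # stand-in for the true target distribution
learned = [0.7, 0.2, 0.07, 0.03]   # stand-in for a sharper learned distribution

h_target = shannon_entropy(target)
h_learned = shannon_entropy(learned)

# A positive gap means the learned distribution has lower entropy than the target.
entropy_gap = h_target - h_learned
print(f"H(target)={h_target:.4f}  H(learned)={h_learned:.4f}  gap={entropy_gap:.4f}")
```

In this toy case the learned distribution concentrates more mass on the most likely token, so its entropy is lower than the target's. That is the direction of the bias the claim describes; the size of the gap here carries no empirical meaning.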
