Why are Sensitive Functions Hard for Transformers?

Michael Hahn, Mark Rofin


Abstract
Empirical studies have identified a range of learnability biases and limitations of transformers, including a persistent difficulty in learning to compute simple formal languages such as PARITY, and a bias towards low-degree functions. However, theoretical understanding remains limited, with existing expressiveness theory either overpredicting or underpredicting realistic learning abilities. We prove that, under the transformer architecture, the loss landscape is constrained by the input-space sensitivity: transformers whose output is sensitive to many parts of the input string inhabit isolated points in parameter space, leading to a low-sensitivity bias in generalization. We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers, such as their generalization bias towards low sensitivity and low degree, and their difficulty in length generalization for PARITY. This shows that understanding transformers’ inductive biases requires studying not just their in-principle expressivity, but also their loss landscape.
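To make "sensitive to many parts of the input string" concrete, the following is a minimal sketch that assumes the standard notion of average sensitivity from Boolean function analysis: the expected number of single-bit flips that change a function's output, taken over uniformly random inputs. The paper's formal definition may differ in detail, and the names `average_sensitivity`, `parity`, `first_bit`, and `majority` are illustrative, not taken from the paper.

```python
from itertools import product


def average_sensitivity(f, n):
    """Average, over all 2^n bit strings, of the number of positions
    whose flip changes the value of f (brute force, small n only)."""
    total = 0
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            # Flip bit i and check whether the output changes.
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(flipped) != fx:
                total += 1
    return total / 2 ** n


def parity(x):
    # PARITY: 1 if the number of ones is odd, else 0.
    return sum(x) % 2


def first_bit(x):
    # Depends on a single input position only.
    return x[0]


def majority(x):
    # 1 if strictly more than half of the bits are ones.
    return int(sum(x) > len(x) / 2)


n = 8  # assumed toy input length
as_parity = average_sensitivity(parity, n)
as_first = average_sensitivity(first_bit, n)
as_majority = average_sensitivity(majority, n)
print(as_parity, as_first, as_majority)
```

Under this definition, flipping any bit always changes PARITY, so its average sensitivity equals the input length, whereas a function that reads only one position has average sensitivity 1. This is the sense in which PARITY is a maximally sensitive function.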
Anthology ID:
2024.acl-long.800
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
14973–15008
URL:
https://aclanthology.org/2024.acl-long.800
Cite (ACL):
Michael Hahn and Mark Rofin. 2024. Why are Sensitive Functions Hard for Transformers? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14973–15008, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Why are Sensitive Functions Hard for Transformers? (Hahn & Rofin, ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.800.pdf