Manifold-Preserving Transformers are Effective for Short-Long Range Encoding

Ayan Sengupta, Md Akhtar, Tanmoy Chakraborty


Abstract
Multi-head self-attention-based Transformers have shown promise in different learning tasks. Although these models exhibit significant improvement in understanding short-term and long-term contexts from sequences, encoders of Transformers and their variants fail to preserve layer-wise contextual information. Transformers usually project tokens onto sparse manifolds and fail to preserve mathematical equivalence among the token representations. In this work, we propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens. We propose a simple alternative to dot-product attention to ensure Lipschitz continuity. This allows TransJect to learn injective mappings that transform token representations onto different manifolds with similar topology and preserve the Euclidean distance between every pair of tokens in subsequent layers. Evaluations across multiple benchmark short- and long-sequence classification tasks show maximum improvements of 6.8% and 5.9%, respectively, over variants of Transformers. Additionally, TransJect displays 79% better performance than the Transformer on the language modeling task. We further highlight the shortcomings of multi-head self-attention from a statistical physics viewpoint. Although multi-head self-attention was conceived to learn different levels of abstraction within the network, our empirical analyses suggest that different attention heads learn in a random and disorderly fashion. In contrast, TransJect adopts a mixture of experts for regularization; these experts are more orderly and balanced and learn different sparse representations from the input sequences. TransJect exhibits very low entropy and can be efficiently scaled to larger depths.
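
The abstract's central idea is that layer-wise token maps should be injective and Lipschitz-continuous so that pairwise Euclidean distances are preserved from one layer to the next. The paper's actual construction is more involved; the following is only a minimal PyTorch sketch of that distance-preservation property, using an orthogonally-parametrized projection. The class name OrthogonalTokenMixer is hypothetical and not taken from the paper.

```python
# Illustrative sketch only: shows how an orthogonal (hence injective,
# Lipschitz-1) projection preserves pairwise Euclidean distances between
# token representations. Not the paper's actual TransJect layer.

import torch
import torch.nn as nn


class OrthogonalTokenMixer(nn.Module):
    """Projects tokens with an orthogonal matrix, so ||Wx - Wy|| = ||x - y||."""

    def __init__(self, d_model: int):
        super().__init__()
        linear = nn.Linear(d_model, d_model, bias=False)
        # Parametrize the weight so it stays orthogonal during training.
        self.proj = nn.utils.parametrizations.orthogonal(linear)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        return self.proj(x)


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = OrthogonalTokenMixer(d_model=64)
    x = torch.randn(2, 10, 64)
    y = layer(x)
    # Pairwise token distances are (numerically) unchanged by the projection.
    d_in = torch.cdist(x, x)
    d_out = torch.cdist(y, y)
    print(torch.allclose(d_in, d_out, atol=1e-4))  # True
```

Because every orthogonal map is an isometry of Euclidean space, stacking such layers keeps the pairwise distance matrix intact at any depth, which is the topological guarantee the abstract refers to; the actual TransJect attention replacement is described in the paper itself.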
Anthology ID:
2023.findings-emnlp.228
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3533–3549
URL:
https://aclanthology.org/2023.findings-emnlp.228
DOI:
10.18653/v1/2023.findings-emnlp.228
Cite (ACL):
Ayan Sengupta, Md Akhtar, and Tanmoy Chakraborty. 2023. Manifold-Preserving Transformers are Effective for Short-Long Range Encoding. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3533–3549, Singapore. Association for Computational Linguistics.
Cite (Informal):
Manifold-Preserving Transformers are Effective for Short-Long Range Encoding (Sengupta et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.228.pdf