Length Generalization of Causal Transformers without Position Encoding

Jie Wang; Tao Ji; Yuanbin Wu; Hang Yan (航 颜); Tao Gui; Qi Zhang; Xuan-Jing Huang (黄萱菁); Xiaoling Wang

doi:10.18653/v1/2024.findings-acl.834

Abstract

Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms that manipulate explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE’s generalization and the distraction of attention distributions. We propose a parameter-efficient tuning method that searches for each attention head’s best temperature hyper-parameter, which substantially expands NoPE’s context size. Experiments on long sequence language modeling, the synthetic passkey retrieval task, and real-world long context tasks show that NoPE can perform competitively with state-of-the-art length generalization algorithms. The source code is publicly accessible.
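The role of a per-head temperature can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `causal_attention` and `mean_entropy`, the parameter `head_scale`, and the choice to apply the temperature as a multiplicative factor on the scaled dot-product logits are all assumptions made for illustration. The sketch only shows that, in causal attention with no position encoding, a larger per-head factor concentrates the attention distribution (lower entropy).

```python
import numpy as np


def causal_attention(q, k, v, head_scale):
    """Causal attention without position encoding.

    q, k, v: arrays of shape (heads, seq, dim).
    head_scale: array of shape (heads,), a hypothetical per-head factor
    that multiplies the logits (an inverse temperature).
    """
    _, n, d = q.shape
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (heads, seq, seq)
    logits = logits * head_scale[:, None, None]      # per-head temperature
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)       # mask future positions
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def mean_entropy(weights):
    """Average Shannon entropy of the attention rows (0 * log 0 treated as 0)."""
    safe = np.where(weights > 0, weights, 1.0)
    return float(-(weights * np.log(safe)).sum(axis=-1).mean())


# Random inputs stand in for real query/key/value activations.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 16, 8)) for _ in range(3))
_, w_base = causal_attention(q, k, v, np.array([1.0, 1.0]))
_, w_sharp = causal_attention(q, k, v, np.array([2.0, 2.0]))
print(mean_entropy(w_base), mean_entropy(w_sharp))  # second value is smaller
```

In this sketch the only quantity one would tune is `head_scale`, one scalar per head, which is why a search over such factors touches very few parameters.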

Anthology ID: 2024.findings-acl.834
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 14024–14040
URL: https://aclanthology.org/2024.findings-acl.834/
DOI: 10.18653/v1/2024.findings-acl.834
Cite (ACL): Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, and Xiaoling Wang. 2024. Length Generalization of Causal Transformers without Position Encoding. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14024–14040, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Length Generalization of Causal Transformers without Position Encoding (Wang et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.834.pdf
