CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending

Shiyi Zhu, Jing Ye, Wei Jiang, Siqiao Xue, Qi Zhang, Yifan Wu, Jianguo Li


Abstract
Self-attention and position embedding are two crucial modules in transformer-based Large Language Models (LLMs). However, the potential relationship between them is far from well studied, especially for extending long context windows. In fact, anomalous behaviors that hinder long-context extrapolation exist between Rotary Position Embedding (RoPE) and vanilla self-attention: incorrect initial angles between Q and K can cause misestimation when modeling the rotary position embedding of the closest tokens. To address this issue, we propose the Collinear Constrained Attention mechanism, namely CoCA. Specifically, we enforce a collinear constraint between Q and K to seamlessly integrate RoPE and self-attention. While adding only minimal computational and spatial complexity, this integration significantly enhances long context window extrapolation ability. We provide an optimized implementation, making it a drop-in replacement for any existing transformer-based model. Extensive experiments demonstrate that CoCA excels at extending context windows. A CoCA-based GPT model, trained with a context length of 512, can extend the context window up to 32K (60×) without any fine-tuning. Additionally, incorporating CoCA into LLaMA-7B achieves extrapolation up to 32K with a training length of only 2K. Our code is publicly available at: https://github.com/codefuse-ai/Collinear-Constrained-Attention
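
To make the collinearity idea concrete, below is a minimal, illustrative NumPy sketch, not the paper's exact parameterization. It assumes an interleaved RoPE layout and a hypothetical per-pair non-negative scale (t_pair) that forms each key as a scaled copy of the query, so every 2-D rotary pair of K starts at the same angle as the matching pair of Q and the attention logit depends only on the relative position.

    import numpy as np

    def rope_rotate(x, pos, base=10000.0):
        # Apply Rotary Position Embedding: consecutive pairs (x[2m], x[2m+1])
        # are rotated by pos * theta_m, with theta_m = base^(-2m/d).
        half = x.shape[-1] // 2
        theta = base ** (-np.arange(half) / half)
        ang = pos * theta
        cos, sin = np.cos(ang), np.sin(ang)
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = np.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    rng = np.random.default_rng(0)
    d_head = 8
    q = rng.standard_normal(d_head)

    # Hypothetical collinear constraint (illustrative, not the paper's exact form):
    # instead of an independent key projection, each key is the query scaled by
    # one non-negative factor per rotary pair, so every 2-D pair of K is collinear
    # with the matching pair of Q before RoPE is applied (initial angle = 0).
    t_pair = np.maximum(rng.standard_normal(d_head // 2), 0.0)
    k = q * np.repeat(t_pair, 2)

    i, j = 7, 3
    score = rope_rotate(q, i) @ rope_rotate(k, j)
    # Each pair contributes |q_m|^2 * t_m * cos((i - j) * theta_m), so the
    # logit depends only on the relative position i - j, with no spurious
    # contribution from an initial Q-K angle.
    print(score)

In this toy setting the constraint is what removes the "incorrect initial angles" mentioned in the abstract; the actual CoCA layer learns how the keys are produced under the collinear constraint rather than using fixed random scales.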
Anthology ID:
2024.acl-long.233
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
4247–4262
URL:
https://aclanthology.org/2024.acl-long.233
Cite (ACL):
Shiyi Zhu, Jing Ye, Wei Jiang, Siqiao Xue, Qi Zhang, Yifan Wu, and Jianguo Li. 2024. CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4247–4262, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
CoCA: Fusing Position Embedding with Collinear Constrained Attention in Transformers for Long Context Window Extending (Zhu et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.233.pdf