Long-range Sequence Modeling with Predictable Sparse Attention

Yimeng Zhuang, Jing Zhang, Mei Tu


Abstract
The self-attention mechanism has been shown to be an effective approach for capturing global context dependencies in sequence modeling, but it suffers from quadratic complexity in time and memory. Because the attention matrix is sparse, much of this computation is redundant. In this paper, we therefore design an efficient Transformer architecture, named Fourier Sparse Attention for Transformer (FSAT), for fast long-range sequence modeling. We provide a new perspective on constructing the sparse attention matrix, i.e., making the sparse attention matrix predictable. The two core sub-modules are: (1) a fast Fourier transform based hidden state cross module, which captures and pools L² semantic combinations in 𝒪(L log L) time; and (2) a sparse attention matrix estimation module, which predicts the dominant elements of the attention matrix based on the output of the hidden state cross module. By reparameterization and gradient truncation, FSAT learns the indices of the dominant elements. The overall complexity with respect to the sequence length is reduced from 𝒪(L²) to 𝒪(L log L). Extensive experiments on natural language, vision, and math tasks show that FSAT markedly outperforms standard multi-head attention and its variants on various long-sequence tasks at low computational cost, and achieves new state-of-the-art results on the Long Range Arena benchmark.
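As an illustration of the first sub-module, the sketch below (PyTorch; the function name, tensor shapes, and self-correlation pooling are assumptions for illustration, not the FSAT implementation) relies on the convolution theorem: a circular cross-correlation computed through the FFT touches all L² position pairs of a length-L sequence while costing only 𝒪(L log L), so no L×L interaction tensor is ever materialized.

```python
import torch

def fft_hidden_state_cross(h: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of an FFT-based hidden-state cross module.

    By the convolution theorem, an element-wise product in the frequency
    domain equals a circular cross-correlation over sequence positions,
    so all L x L position pairs are covered in O(L log L) without ever
    building an L x L tensor. Shapes and pooling are illustrative only.
    """
    # h: (batch, L, d) real-valued hidden states
    L = h.size(1)
    H = torch.fft.rfft(h, dim=1)                     # (batch, L//2 + 1, d)
    # conj(H) * H back-transforms to the circular autocorrelation of h,
    # i.e. a pooled summary of every pairwise positional combination.
    cross = torch.fft.irfft(torch.conj(H) * H, n=L, dim=1)
    return cross                                     # (batch, L, d)

# A length-4096 sequence is processed without forming a 4096 x 4096 matrix.
pooled = fft_hidden_state_cross(torch.randn(2, 4096, 64))
print(pooled.shape)                                  # torch.Size([2, 4096, 64])
```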
Anthology ID:
2022.acl-long.19
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
234–243
URL:
https://aclanthology.org/2022.acl-long.19
DOI:
10.18653/v1/2022.acl-long.19
Cite (ACL):
Yimeng Zhuang, Jing Zhang, and Mei Tu. 2022. Long-range Sequence Modeling with Predictable Sparse Attention. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 234–243, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Long-range Sequence Modeling with Predictable Sparse Attention (Zhuang et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.19.pdf
Data
LRA