Sparse Attention with Linear Units

Biao Zhang, Ivan Titov, Rico Sennrich


Abstract
Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing the softmax function in the attention with its sparse variants. In this work, we introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU, and show that sparsity naturally emerges from such a formulation. Training stability is achieved with layer normalization, using either a specialized initialization or an additional gating function. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. We apply ReLA to the Transformer and conduct experiments on five machine translation tasks. ReLA achieves translation performance comparable to several strong baselines, with training and decoding speed similar to that of the vanilla attention. Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment than recent sparsified softmax-based models. Intriguingly, ReLA heads also learn to attend to nothing (i.e. ‘switch off’) for some queries, which is not possible with sparsified softmax alternatives.
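A minimal sketch of the mechanism described in the abstract, assuming a NumPy implementation: scaled dot-product scores, ReLU in place of softmax (so some attention weights become exactly zero), and an RMS-style layer normalization of the aggregated values for stability. The function names, shapes, and exact placement of the normalization are illustrative assumptions, not the authors' reference implementation (which also covers a gating variant and a specialized initialization omitted here).

    # Hypothetical sketch of Rectified Linear Attention (ReLA); not the authors' code.
    import numpy as np

    def rms_norm(x, eps=1e-6):
        """RMS-style layer normalization over the last dimension (learnable gain omitted)."""
        return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

    def rela_attention(q, k, v):
        """q: (n_q, d); k, v: (n_kv, d). Returns the output and the sparse weights."""
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)        # scaled dot-product scores
        weights = np.maximum(scores, 0.0)    # ReLU instead of softmax -> exact zeros
        out = rms_norm(weights @ v)          # normalization stabilizes the unbounded sum
        return out, weights

    rng = np.random.default_rng(0)
    q = rng.normal(size=(2, 8))
    k = rng.normal(size=(5, 8))
    v = rng.normal(size=(5, 8))
    out, w = rela_attention(q, k, v)
    print("sparsity rate:", np.mean(w == 0))  # fraction of zeroed weights; a fully zero row means the query attends to nothing

Because the weights are not forced to sum to one, an entire row of weights can be zero, which is the 'switch off' behaviour mentioned in the abstract and is not attainable with softmax-based sparsification.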
Anthology ID:
2021.emnlp-main.523
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
6507–6520
URL:
https://aclanthology.org/2021.emnlp-main.523
DOI:
10.18653/v1/2021.emnlp-main.523
Cite (ACL):
Biao Zhang, Ivan Titov, and Rico Sennrich. 2021. Sparse Attention with Linear Units. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6507–6520, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Sparse Attention with Linear Units (Zhang et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.523.pdf
Video:
https://aclanthology.org/2021.emnlp-main.523.mp4
Code
bzhangGo/zero + additional community code
Data
WMT 2014, WMT 2016