Cascaded Head-colliding Attention

Lin Zheng; Zhiyong Wu; Lingpeng Kong

doi:10.18653/v1/2021.acl-long.45

Abstract

Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks. The cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism, which models pairwise interactions between the elements of a sequence. Despite its massive success, the current framework ignores interactions among different heads, so many heads are redundant in practice, which wastes much of the model's capacity. To improve parameter efficiency, we reformulate MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA), which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the Transformer baseline by 0.6 perplexity on WikiText-103 in language modeling and by 0.6 BLEU on WMT14 EN-DE in machine translation, owing to its improved parameter efficiency.
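
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of standard multi-head attention, not of CODA itself. It assumes self-attention with the learned query, key, value, and output projections omitted (each head simply attends over its own slice of the features); the function names `softmax`, `attention_head`, and `multi_head_attention` are illustrative and do not come from the paper or its code release.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries, keys, values):
    """Scaled dot-product attention for a single head.

    Each argument is a list of per-token vectors (lists of floats).
    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    dim = len(keys[0])
    weights = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


def multi_head_attention(tokens, num_heads):
    """Simplified self-attention with several heads.

    Assumption: learned projections are omitted, so head h uses the
    h-th slice of each token vector as its query, key, and value.
    """
    model_dim = len(tokens[0])
    assert model_dim % num_heads == 0, "model_dim must divide evenly into heads"
    head_dim = model_dim // num_heads

    per_head_outputs, per_head_weights = [], []
    for h in range(num_heads):
        start, stop = h * head_dim, (h + 1) * head_dim
        chunk = [t[start:stop] for t in tokens]
        # Each head is computed from its own slice only; no head sees another.
        out, w = attention_head(chunk, chunk, chunk)
        per_head_outputs.append(out)
        per_head_weights.append(w)

    # Concatenate the head outputs token by token.
    merged = [
        [x for out in per_head_outputs for x in out[i]]
        for i in range(len(tokens))
    ]
    return merged, per_head_weights
```

In this sketch the attention weights of one head are a function of that head's inputs alone, so nothing couples the heads to each other. That independence is the property the abstract identifies as a source of redundancy; how CODA couples the heads through a hierarchical variational distribution is described in the paper and is not reproduced here.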

Anthology ID: 2021.acl-long.45
Volume: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month: August
Year: 2021
Address: Online
Editors: Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues: ACL | IJCNLP
Publisher: Association for Computational Linguistics
Pages: 536–549
URL: https://aclanthology.org/2021.acl-long.45/
DOI: 10.18653/v1/2021.acl-long.45
Cite (ACL): Lin Zheng, Zhiyong Wu, and Lingpeng Kong. 2021. Cascaded Head-colliding Attention. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 536–549, Online. Association for Computational Linguistics.
Cite (Informal): Cascaded Head-colliding Attention (Zheng et al., ACL-IJCNLP 2021)
PDF: https://aclanthology.org/2021.acl-long.45.pdf
Video: https://aclanthology.org/2021.acl-long.45.mp4
Code: LZhengisme/CODA
Data: WMT 2014, WikiText-103, WikiText-2
