A Mixture of h - 1 Heads is Better than h Heads

Hao Peng, Roy Schwartz, Dianqi Li, Noah A. Smith


Abstract
Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead “reallocate” them: the model learns to activate different heads on different inputs. Drawing connections between multi-head attention and mixture of experts, we propose the mixture of attentive experts model (MAE). MAE is trained using a block coordinate descent algorithm that alternates between updating (1) the responsibilities of the experts and (2) their parameters. Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks. In particular, on the WMT14 English-to-German translation dataset, MAE improves over “transformer-base” by 0.8 BLEU with a comparable number of parameters. Our analysis shows that our model learns to specialize different experts to different inputs.
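
As a rough illustration only (this is not the authors' released code): the abstract's reading is that an h-head attention layer can be viewed as a mixture of h experts, where expert i is the layer with head i dropped and the remaining h−1 heads rescaled, and MAE replaces uniform weighting with a learned, input-dependent gate. The PyTorch sketch below assumes that reading; all names (MixtureOfAttentiveExperts, gate, etc.) are illustrative, and it computes the expected mixture output rather than sampling one expert per input as the paper's training procedure may do.

```python
# Rough sketch only (not the authors' code). It treats an h-head attention
# layer as a mixture of h experts, where expert i drops head i and rescales
# the remaining h-1 heads by h/(h-1); a learned gate produces
# input-dependent responsibilities over the experts.
import torch
import torch.nn as nn


class MixtureOfAttentiveExperts(nn.Module):   # illustrative name
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Gate over the h experts, computed from a pooled input representation.
        self.gate = nn.Linear(d_model, num_heads)

    def forward(self, x):                      # x: (batch, seq, d_model)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each of q, k, v to (batch, heads, seq, d_head).
        q, k, v = (z.view(b, t, self.h, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                       # per-head outputs

        # Responsibilities over experts; expert i zeroes head i, so head j's
        # total weight in the expected mixture is (1 - resp_j) * h / (h - 1).
        resp = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)    # (batch, h)
        head_weight = (1.0 - resp) * self.h / (self.h - 1)
        mixed = heads * head_weight[:, :, None, None]
        mixed = mixed.transpose(1, 2).reshape(b, t, d)
        return self.out(mixed)


# The block coordinate descent described in the abstract could then alternate
# between a gradient step on the gate (responsibilities) and a step on the
# remaining expert (attention) parameters, e.g. with two optimizers:
model = MixtureOfAttentiveExperts(d_model=512, num_heads=8)
gate_opt = torch.optim.Adam(model.gate.parameters(), lr=1e-4)
expert_opt = torch.optim.Adam(
    [p for n, p in model.named_parameters() if not n.startswith("gate")], lr=1e-4)
```

Computing the expected mixture output is valid here because the output projection is affine and the responsibilities sum to one, so weighting the heads directly is equivalent to averaging the experts' outputs.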
Anthology ID:
2020.acl-main.587
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
6566–6577
URL:
https://aclanthology.org/2020.acl-main.587
DOI:
10.18653/v1/2020.acl-main.587
Cite (ACL):
Hao Peng, Roy Schwartz, Dianqi Li, and Noah A. Smith. 2020. A Mixture of h - 1 Heads is Better than h Heads. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6566–6577, Online. Association for Computational Linguistics.
Cite (Informal):
A Mixture of h - 1 Heads is Better than h Heads (Peng et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-main.587.pdf
Video:
http://slideslive.com/38929434