Pruning Neural Machine Translation for Speed Using Group Lasso

Maximiliana Behnke, Kenneth Heafield


Abstract
Unlike most work on pruning neural networks, we make inference faster. Group lasso regularisation enables pruning entire rows, columns or blocks of parameters that result in a smaller dense network. Because the network is still dense, efficient matrix multiply routines are still used and only minimal software changes are required to support variable layer sizes. Moreover, pruning is applied during training so there is no separate pruning step. Experiments on top of English->German models, which already have state-of-the-art speed and size, show that two-thirds of feedforward connections can be removed with 0.2 BLEU loss. With 6 decoder layers, the pruned model is 34% faster; with 2 tied decoder layers, the pruned model is 14% faster. Pruning entire heads and feedforward connections in a 12–1 encoder-decoder architecture gains an additional 51% speed-up. These push the Pareto frontier with respect to the trade-off between time and quality compared to strong baselines. In the WMT 2021 Efficiency Task, our pruned and quantised models are 1.9–2.7x faster at the cost 0.9–1.7 BLEU in comparison to the unoptimised baselines. Across language pairs, we see similar sparsity patterns: an ascending or U-shaped distribution in encoder feedforward and attention layers and an ascending distribution in the decoder.
Anthology ID:
2021.wmt-1.116
Volume:
Proceedings of the Sixth Conference on Machine Translation
Month:
November
Year:
2021
Address:
Online
Editors:
Loic Barrault, Ondrej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussa, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Tom Kocmi, Andre Martins, Makoto Morishita, Christof Monz
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Note:
Pages:
1074–1086
Language:
URL:
https://aclanthology.org/2021.wmt-1.116
DOI:
Bibkey:
Cite (ACL):
Maximiliana Behnke and Kenneth Heafield. 2021. Pruning Neural Machine Translation for Speed Using Group Lasso. In Proceedings of the Sixth Conference on Machine Translation, pages 1074–1086, Online. Association for Computational Linguistics.
Cite (Informal):
Pruning Neural Machine Translation for Speed Using Group Lasso (Behnke & Heafield, WMT 2021)
Copy Citation:
PDF:
https://aclanthology.org/2021.wmt-1.116.pdf
Video:
 https://aclanthology.org/2021.wmt-1.116.mp4