Differentiable Subset Pruning of Transformer Heads

Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan


Abstract
Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer’s multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably to or better than previous methods while offering precise control of the sparsity level.
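To make the idea concrete, below is a minimal PyTorch sketch of one way a learnable head gate with a hard budget of k unpruned heads could look. The module name, the Gumbel-perturbed relaxed top-k construction, and all hyperparameters are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Sketch of differentiable subset pruning of attention heads (illustrative only).
# Assumption: head gates come from a relaxed top-k selection over learned
# per-head importance logits, built from repeated Gumbel-softmax draws with a
# soft "without replacement" penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentiableHeadGate(nn.Module):
    """Learns per-head importance logits and emits a soft mask whose total
    mass is (approximately) k, approximating a hard choice of k heads."""

    def __init__(self, n_heads: int, k: int, temperature: float = 1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_heads))  # per-head importance
        self.k = k
        self.temperature = temperature

    def forward(self) -> torch.Tensor:
        # Perturb the logits with Gumbel noise for stochastic exploration.
        u = torch.rand_like(self.logits).clamp_(min=1e-9)
        scores = self.logits - torch.log(-torch.log(u))
        # Relaxed top-k: select k heads one at a time with softmax weights,
        # down-weighting already-selected heads in log space.
        mask = torch.zeros_like(scores)
        penalty = torch.zeros_like(scores)
        for _ in range(self.k):
            probs = F.softmax((scores + penalty) / self.temperature, dim=-1)
            mask = mask + probs
            penalty = penalty + torch.log1p(-probs.clamp(max=1 - 1e-6))
        return mask.clamp(max=1.0)  # soft gate in [0, 1] per head


if __name__ == "__main__":
    gate = DifferentiableHeadGate(n_heads=12, k=4, temperature=0.5)
    g = gate()            # differentiable head mask
    print(g, g.sum())     # roughly k units of mass spread over the heads
    # In a Transformer layer, head h's output would be scaled by g[h] before
    # the output projection; the logits are trained jointly with the task
    # loss via stochastic gradient descent, and at test time the top-k heads
    # by learned importance are kept.
```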
Anthology ID:
2021.tacl-1.86
Volume:
Transactions of the Association for Computational Linguistics, Volume 9
Year:
2021
Address:
Cambridge, MA
Editors:
Brian Roark, Ani Nenkova
Venue:
TACL
Publisher:
MIT Press
Pages:
1442–1459
URL:
https://aclanthology.org/2021.tacl-1.86
DOI:
10.1162/tacl_a_00436
Cite (ACL):
Jiaoda Li, Ryan Cotterell, and Mrinmaya Sachan. 2021. Differentiable Subset Pruning of Transformer Heads. Transactions of the Association for Computational Linguistics, 9:1442–1459.
Cite (Informal):
Differentiable Subset Pruning of Transformer Heads (Li et al., TACL 2021)
PDF:
https://aclanthology.org/2021.tacl-1.86.pdf
Video:
https://aclanthology.org/2021.tacl-1.86.mp4