Efficient Transformer Knowledge Distillation: A Performance Review

Nathan Brown, Ashton Williamson, Tahj Anderson, Logan Lawrence


Abstract
As pretrained transformer language models continue to achieve state-of-the-art performance, the Natural Language Processing community has pushed for advances in model compression and efficient attention mechanisms to address high computational requirements and limited input sequence lengths. Despite these separate lines of work, no investigation has examined their intersection. In this work, we provide an evaluation of model compression via knowledge distillation on efficient attention transformers. We provide cost-performance trade-offs for compressing state-of-the-art efficient attention architectures and the performance gains relative to their full attention counterparts. Furthermore, we introduce a new long-context Named Entity Recognition dataset, GONERD, to train and test the performance of NER models on long sequences. We find that distilled efficient attention transformers can preserve a significant amount of original model performance, retaining up to 98.6% across short-context tasks (GLUE, SQuAD, CoNLL-2003), up to 94.6% across long-context question-answering tasks (HotpotQA, TriviaQA), and up to 98.8% on long-context Named Entity Recognition (GONERD), while decreasing inference times by up to 57.8%. We find that, for most models on most tasks, knowledge distillation is an effective method for obtaining high-performing efficient attention models at low cost.
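For readers unfamiliar with the distillation setup the abstract refers to, the sketch below shows a minimal, generic knowledge distillation objective in the Hinton style: a temperature-softened KL term between teacher and student output distributions blended with the usual hard-label cross-entropy. This is an illustrative assumption, not the paper's exact training recipe; the function name, temperature, and mixing weight alpha are hypothetical choices, and the random logits merely stand in for outputs of a full attention teacher and an efficient attention student.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic soft-target distillation loss (illustrative, not the paper's recipe)."""
    # Soft targets: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 so gradients stay comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage: random logits stand in for teacher/student model outputs.
student_logits = torch.randn(8, 3)   # hypothetical efficient-attention student
teacher_logits = torch.randn(8, 3)   # hypothetical full-attention teacher
labels = torch.randint(0, 3, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)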
Anthology ID:
2023.emnlp-industry.6
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
December
Year:
2023
Address:
Singapore
Editors:
Mingxuan Wang, Imed Zitouni
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
54–65
URL:
https://aclanthology.org/2023.emnlp-industry.6
DOI:
10.18653/v1/2023.emnlp-industry.6
Cite (ACL):
Nathan Brown, Ashton Williamson, Tahj Anderson, and Logan Lawrence. 2023. Efficient Transformer Knowledge Distillation: A Performance Review. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 54–65, Singapore. Association for Computational Linguistics.
Cite (Informal):
Efficient Transformer Knowledge Distillation: A Performance Review (Brown et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-industry.6.pdf
Video:
https://aclanthology.org/2023.emnlp-industry.6.mp4