A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models

Takuma Udagawa, Aashka Trivedi, Michele Merler, Bishwaranjan Bhattacharjee


Abstract
Large language models have become a vital component in modern NLP, achieving state-of-the-art performance in a variety of tasks. However, they are often inefficient for real-world deployment due to their expensive inference costs. Knowledge distillation is a promising technique to improve their efficiency while retaining most of their effectiveness. In this paper, we reproduce, compare, and analyze several representative methods for task-agnostic (general-purpose) distillation of Transformer language models. Our targets of study include Output Distribution (OD) transfer, Hidden State (HS) transfer with various layer mapping strategies, and Multi-Head Attention (MHA) transfer based on MiniLMv2. Through extensive experiments, we study the effectiveness of each method for various student architectures in both monolingual (English) and multilingual settings. Overall, we show that MHA transfer based on MiniLMv2 is generally the best option for distillation and explain the potential reasons behind its success. Moreover, we show that HS transfer remains a competitive baseline, especially under a sophisticated layer mapping strategy, while OD transfer consistently lags behind other approaches. Findings from this study helped us deploy efficient yet effective student models for latency-critical applications.
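The abstract names the three distillation objectives only at a high level. The following PyTorch sketch is our own illustration of one plausible form of each loss, not code from the paper; the function names, the layer_map and proj helpers, and the number of relation heads are assumptions, and MiniLMv2 in particular distills more self-attention relations (e.g., value-value) than shown here.

    # Hypothetical PyTorch sketch of the three task-agnostic distillation
    # objectives discussed in the abstract (illustration only, not the paper's code).
    import torch.nn.functional as F

    def od_transfer_loss(student_logits, teacher_logits, temperature=1.0):
        # Output Distribution (OD) transfer: KL divergence between the teacher's
        # and student's temperature-softened output distributions.
        t = temperature
        s_log_probs = F.log_softmax(student_logits / t, dim=-1)
        t_probs = F.softmax(teacher_logits / t, dim=-1)
        return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * (t ** 2)

    def hs_transfer_loss(student_hidden, teacher_hidden, layer_map, proj):
        # Hidden State (HS) transfer: MSE between projected student hidden states
        # and teacher hidden states under a chosen layer mapping strategy.
        # `layer_map` maps student layer index -> teacher layer index, and `proj`
        # is a learned linear projection aligning hidden sizes (both assumed here).
        loss = 0.0
        for s_layer, t_layer in layer_map.items():
            loss = loss + F.mse_loss(proj(student_hidden[s_layer]),
                                     teacher_hidden[t_layer])
        return loss / len(layer_map)

    def _self_relations(x, num_relation_heads):
        # Split hidden vectors into relation heads and compute scaled dot-product
        # self-relation scores of shape (batch, heads, seq_len, seq_len).
        b, n, h = x.shape
        d = h // num_relation_heads
        x = x.view(b, n, num_relation_heads, d).transpose(1, 2)
        return x @ x.transpose(-1, -2) / d ** 0.5

    def mha_transfer_loss(student_q, student_k, teacher_q, teacher_k,
                          num_relation_heads=48):
        # MiniLMv2-style Multi-Head Attention (MHA) transfer: match self-attention
        # relation distributions (only Q-Q and K-K shown, for brevity) between one
        # student layer and one teacher layer via KL divergence.
        loss = 0.0
        for s, t in ((student_q, teacher_q), (student_k, teacher_k)):
            s_rel = F.log_softmax(_self_relations(s, num_relation_heads), dim=-1)
            t_rel = F.softmax(_self_relations(t, num_relation_heads), dim=-1)
            loss = loss + F.kl_div(s_rel, t_rel, reduction="batchmean")
        return loss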
Anthology ID:
2023.emnlp-industry.3
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
December
Year:
2023
Address:
Singapore
Editors:
Mingxuan Wang, Imed Zitouni
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
20–31
URL:
https://aclanthology.org/2023.emnlp-industry.3
DOI:
10.18653/v1/2023.emnlp-industry.3
Cite (ACL):
Takuma Udagawa, Aashka Trivedi, Michele Merler, and Bishwaranjan Bhattacharjee. 2023. A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 20–31, Singapore. Association for Computational Linguistics.
Cite (Informal):
A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models (Udagawa et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-industry.3.pdf
Video:
https://aclanthology.org/2023.emnlp-industry.3.mp4