Minimal Distillation Schedule for Extreme Language Model Compression

Chen Zhang, Yang Yang, Qifan Wang, Jiahao Liu, Jingang Wang, Wei Wu, Dawei Song


Abstract
Recent studies have revealed that language model distillation can become less effective when there is a significant capacity gap between the teacher and the student models. In order to bridge the gap, teacher assistant-based distillation has been introduced, in which the selection of the teacher assistant plays a crucial role in transferring knowledge from the teacher to the student. However, existing approaches for teacher assistant-based distillation require numerous trials to find the optimal teacher assistant. In this paper, we propose a novel approach called Minimal Distillation Schedule (MiniDisc), which enables the scheduling of an optimal teacher assistant in just one trial for extreme model compression (e.g., to 5% scale). In particular, we empirically show that the performance of the student is positively correlated with the scale-performance tradeoff of the teacher assistant. We then introduce a new 𝜆-tradeoff metric that quantifies the optimality of the teacher assistant without the need for trial distillation to the student. By employing a sandwich framework, MiniDisc can select the optimal teacher assistant with the best 𝜆-tradeoff. We extensively evaluate MiniDisc through a series of experiments on the GLUE benchmark. The results demonstrate that our approach achieves improved efficiency compared to various state-of-the-art baselines. Furthermore, we showcase the scalability of MiniDisc by applying it to a language model with billions of parameters.
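
To make the selection idea concrete, here is a minimal Python sketch of a 𝜆-tradeoff-style teacher-assistant choice. The scoring formula (a weighted sum of a candidate's dev-set performance and its parameter-count reduction) and the names Candidate and select_teacher_assistant are illustrative assumptions, not the paper's exact definition; in MiniDisc the candidates and their scores would come from the sandwich-trained model in a single trial.

# Sketch of λ-tradeoff-based teacher-assistant selection as described in the
# abstract. The exact λ-tradeoff formula and the candidate pool below are
# assumptions (a weighted sum of performance and compression with toy numbers),
# not the authors' definition; see the paper for the real formulation.
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A candidate teacher assistant obtained from the sandwich-trained model."""
    name: str
    scale: float      # fraction of teacher parameters kept, e.g. 0.25
    dev_score: float  # validation metric after distillation from the teacher


def lambda_tradeoff(c: Candidate, lam: float = 0.5) -> float:
    """Hypothetical λ-tradeoff: balance dev performance against model scale.

    Larger is better: a high dev score and a small scale both raise the score.
    """
    return lam * c.dev_score + (1.0 - lam) * (1.0 - c.scale)


def select_teacher_assistant(candidates: List[Candidate], lam: float = 0.5) -> Candidate:
    """Pick the candidate with the best λ-tradeoff in a single pass (one trial)."""
    return max(candidates, key=lambda c: lambda_tradeoff(c, lam))


if __name__ == "__main__":
    # Toy numbers only; real scores come from evaluating each candidate
    # teacher assistant on a dev set within the sandwich framework.
    pool = [
        Candidate("ta-50%", scale=0.50, dev_score=0.86),
        Candidate("ta-25%", scale=0.25, dev_score=0.84),
        Candidate("ta-10%", scale=0.10, dev_score=0.78),
    ]
    best = select_teacher_assistant(pool, lam=0.5)
    print(f"Selected TA: {best.name} (λ-tradeoff={lambda_tradeoff(best):.3f})")
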
Anthology ID:
2024.findings-eacl.93
Volume:
Findings of the Association for Computational Linguistics: EACL 2024
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1378–1394
URL:
https://aclanthology.org/2024.findings-eacl.93
Cite (ACL):
Chen Zhang, Yang Yang, Qifan Wang, Jiahao Liu, Jingang Wang, Wei Wu, and Dawei Song. 2024. Minimal Distillation Schedule for Extreme Language Model Compression. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1378–1394, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Minimal Distillation Schedule for Extreme Language Model Compression (Zhang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-eacl.93.pdf