Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models

Qinyuan Ye, Madian Khabsa, Mike Lewis, Sinong Wang, Xiang Ren, Aaron Jaech


Abstract
Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time. The student models are typically compact transformers with fewer parameters, while expensive operations such as self-attention persist. Therefore, the improved inference speed may still be unsatisfactory for real-time or high-volume use cases. In this paper, we aim to further push the limit of inference speed by distilling teacher models into bigger, sparser student models – bigger in that they scale up to billions of parameters; sparser in that most of the model parameters are n-gram embeddings. Our experiments on six single-sentence text classification tasks show that these student models retain 97% of the RoBERTa-Large teacher performance on average, and meanwhile achieve up to 600x speed-up on both GPUs and CPUs at inference time. Further investigation reveals that our pipeline is also helpful for sentence-pair classification tasks, and in domain generalization settings.
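The student architecture described in the abstract can be illustrated with a minimal sketch. The abstract states only that most student parameters are n-gram embeddings and that the student is distilled from a teacher; everything else here is an assumption made for illustration: the hashing of n-grams into buckets, the mean pooling, the single linear output layer, the KL-divergence loss, and all names (`NgramStudent`, `extract_ngrams`, `bucket`, `distillation_loss`). It is not the authors' implementation. The point it shows is why such a model can be large yet fast: each input touches only the few embedding rows matching its n-grams.

```python
import hashlib
import math
import random


def extract_ngrams(tokens, max_n=2):
    """Return all contiguous n-grams of length 1..max_n, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def bucket(ngram, num_buckets):
    """Map an n-gram to an embedding row via a stable hash (illustrative assumption)."""
    digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class NgramStudent:
    """Hypothetical student: a large embedding table, mean pooling, one linear layer."""

    def __init__(self, num_buckets=1000, dim=8, num_classes=2, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        self.dim = dim
        # Almost all parameters live in this table; only a few rows are read per input.
        self.emb = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_buckets)]
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_classes)]
        self.b = [0.0] * num_classes

    def active_rows(self, text):
        """Embedding rows looked up for this input (one per n-gram occurrence)."""
        tokens = text.lower().split()
        return [bucket(g, self.num_buckets) for g in extract_ngrams(tokens)]

    def predict_proba(self, text):
        rows = self.active_rows(text)
        if rows:
            pooled = [sum(self.emb[r][d] for r in rows) / len(rows) for d in range(self.dim)]
        else:
            pooled = [0.0] * self.dim
        logits = [sum(wd * pd for wd, pd in zip(w, pooled)) + b
                  for w, b in zip(self.w, self.b)]
        return softmax(logits)


def distillation_loss(teacher_probs, student_probs):
    """KL(teacher || student), a common distillation objective (assumed here)."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)


student = NgramStudent()
text = "a surprisingly good movie"
teacher_probs = [0.1, 0.9]  # hypothetical teacher output, not a real RoBERTa prediction
print(len(student.active_rows(text)), "of", student.num_buckets, "rows used")
print(distillation_loss(teacher_probs, student.predict_proba(text)))
```

Under these assumptions, the cost of a prediction grows with the number of n-grams in the input rather than with the size of the embedding table, which is the intuition behind scaling the student up without slowing inference.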
Anthology ID:
2022.naacl-main.169
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2361–2375
URL:
https://aclanthology.org/2022.naacl-main.169
DOI:
10.18653/v1/2022.naacl-main.169
Cite (ACL):
Qinyuan Ye, Madian Khabsa, Mike Lewis, Sinong Wang, Xiang Ren, and Aaron Jaech. 2022. Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2361–2375, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models (Ye et al., NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.169.pdf
Video:
https://aclanthology.org/2022.naacl-main.169.mp4
Code:
ink-usc/sparse-distillation
Data:
AG News, Civil Comments, IMDb Movie Reviews, PAQ, SST, SST-2