Universal-KD: Attention-based Output-Grounded Intermediate Layer Knowledge Distillation

Yimeng Wu; Mehdi Rezagholizadeh; Abbas Ghaddar; Md. Akmal Haidar; Ali Ghodsi

doi:10.18653/v1/2021.emnlp-main.603

Abstract

Intermediate layer matching has been shown to be an effective approach for improving knowledge distillation (KD). However, this technique matches representations in the hidden spaces of two different networks (i.e., student and teacher), which lacks clear interpretability. Moreover, intermediate layer KD cannot easily handle other problems such as layer mapping search and architecture mismatch (i.e., it requires the teacher and student to be of the same model type). To tackle these problems together, we propose Universal-KD, which matches intermediate layers of the teacher and the student in the output space (by adding pseudo classifiers on intermediate layers) via attention-based layer projection. Our unified approach has three merits: (i) it can be flexibly combined with current intermediate layer distillation techniques to improve their results; (ii) the pseudo classifiers of the teacher can be deployed instead of expensive extra teacher assistant networks to address the capacity gap problem in KD, a common issue when the size gap between the teacher and student networks becomes too large; and (iii) it can be used in cross-architecture intermediate layer KD. We conducted comprehensive experiments distilling BERT-base into BERT-4, RoBERTa-large into DistilRoBERTa, and BERT-base into CNN and LSTM-based models. Results on the GLUE tasks show that our approach outperforms other KD techniques.
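The idea of matching layers in the output space can be illustrated with a small sketch. It is not the authors' implementation: the abstract only states that pseudo classifiers are attached to intermediate layers and that an attention-based projection combines them. The sketch assumes each pseudo classifier has already produced class logits for one input, and it assumes dot-product scoring between output distributions and a KL-divergence loss. All function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_weights(student_probs, teacher_probs_per_layer):
    """Weights over teacher layers for one student layer.

    Assumption: the score is the dot product between the student layer's
    output distribution and each teacher layer's output distribution.
    """
    scores = [
        sum(s * t for s, t in zip(student_probs, teacher_probs))
        for teacher_probs in teacher_probs_per_layer
    ]
    return softmax(scores)


def output_grounded_layer_loss(student_layer_logits, teacher_layer_logits):
    """Average KL between each student layer's pseudo-classifier output and
    an attention-weighted mixture of the teacher layers' outputs."""
    teacher_probs = [softmax(logits) for logits in teacher_layer_logits]
    total = 0.0
    for logits in student_layer_logits:
        student_probs = softmax(logits)
        weights = attention_weights(student_probs, teacher_probs)
        # Convex combination of teacher distributions, so it still sums to 1.
        target = [
            sum(w * probs[c] for w, probs in zip(weights, teacher_probs))
            for c in range(len(student_probs))
        ]
        total += kl_divergence(target, student_probs)
    return total / len(student_layer_logits)


# Toy logits: three teacher layers, two student layers, three classes.
teacher_logits = [[2.0, 0.5, -1.0], [1.0, 1.0, 0.0], [3.0, -2.0, 0.5]]
student_logits = [[0.1, 0.2, 0.3], [2.5, -1.0, 0.0]]
print(output_grounded_layer_loss(student_logits, teacher_logits))
```

Because every quantity lives in the shared class-probability space, the sketch never compares hidden states directly. This is consistent with the abstract's claim that the approach applies when teacher and student have different architectures, and it needs no fixed one-to-one layer mapping.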

Anthology ID: 2021.emnlp-main.603
Volume: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2021
Address: Online and Punta Cana, Dominican Republic
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 7649–7661
URL: https://aclanthology.org/2021.emnlp-main.603/
DOI: 10.18653/v1/2021.emnlp-main.603
Cite (ACL): Yimeng Wu, Mehdi Rezagholizadeh, Abbas Ghaddar, Md Akmal Haidar, and Ali Ghodsi. 2021. Universal-KD: Attention-based Output-Grounded Intermediate Layer Knowledge Distillation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7649–7661, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): Universal-KD: Attention-based Output-Grounded Intermediate Layer Knowledge Distillation (Wu et al., EMNLP 2021)
PDF: https://aclanthology.org/2021.emnlp-main.603.pdf
Video: https://aclanthology.org/2021.emnlp-main.603.mp4
