RW-KD: Sample-wise Loss Terms Re-Weighting for Knowledge Distillation

Peng Lu, Abbas Ghaddar, Ahmad Rashid, Mehdi Rezagholizadeh, Ali Ghodsi, Philippe Langlais


Abstract
Knowledge Distillation (KD) is extensively used in Natural Language Processing to compress the pre-training and task-specific fine-tuning phases of large neural language models. A student model is trained to minimize a convex combination of the prediction loss over the labels and another over the teacher output. However, most existing works either fix the interpolating weight between the two losses a priori or vary the weight using heuristics. In this work, we propose a novel sample-wise loss weighting method, RW-KD. A meta-learner, simultaneously trained with the student, adaptively re-weights the two losses for each sample. We demonstrate, on 7 datasets of the GLUE benchmark, that RW-KD outperforms other loss re-weighting methods for KD.
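For reference, the convex combination mentioned in the abstract is the standard per-sample KD objective; the notation below is ours and not taken from the paper. With gold label y_i, student prediction p_i^S, teacher prediction p_i^T, and an interpolating weight lambda_i, it can be written as:

\mathcal{L}_i = \lambda_i \, \mathrm{CE}\!\left(y_i, p_i^S\right) + (1 - \lambda_i) \, \mathrm{KL}\!\left(p_i^T \,\|\, p_i^S\right)

Standard KD fixes \lambda_i to a single constant \lambda shared by all samples; RW-KD instead has a meta-learner, trained jointly with the student, predict \lambda_i separately for each sample i.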
Anthology ID: 2021.findings-emnlp.270
Volume: Findings of the Association for Computational Linguistics: EMNLP 2021
Month: November
Year: 2021
Address: Punta Cana, Dominican Republic
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue: Findings
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 3145–3152
URL: https://aclanthology.org/2021.findings-emnlp.270
DOI: 10.18653/v1/2021.findings-emnlp.270
Cite (ACL): Peng Lu, Abbas Ghaddar, Ahmad Rashid, Mehdi Rezagholizadeh, Ali Ghodsi, and Philippe Langlais. 2021. RW-KD: Sample-wise Loss Terms Re-Weighting for Knowledge Distillation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3145–3152, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): RW-KD: Sample-wise Loss Terms Re-Weighting for Knowledge Distillation (Lu et al., Findings 2021)
PDF: https://aclanthology.org/2021.findings-emnlp.270.pdf
Video: https://aclanthology.org/2021.findings-emnlp.270.mp4
Data: GLUE, QNLI