%0 Conference Proceedings
%T RW-KD: Sample-wise Loss Terms Re-Weighting for Knowledge Distillation
%A Lu, Peng
%A Ghaddar, Abbas
%A Rashid, Ahmad
%A Rezagholizadeh, Mehdi
%A Ghodsi, Ali
%A Langlais, Philippe
%Y Moens, Marie-Francine
%Y Huang, Xuanjing
%Y Specia, Lucia
%Y Yih, Scott Wen-tau
%S Findings of the Association for Computational Linguistics: EMNLP 2021
%D 2021
%8 November
%I Association for Computational Linguistics
%C Punta Cana, Dominican Republic
%F lu-etal-2021-rw-kd
%X Knowledge Distillation (KD) is extensively used in Natural Language Processing to compress the pre-training and task-specific fine-tuning phases of large neural language models. A student model is trained to minimize a convex combination of the prediction loss over the labels and another over the teacher output. However, most existing works either fix the interpolating weight between the two losses a priori or vary the weight using heuristics. In this work, we propose a novel sample-wise loss weighting method, RW-KD. A meta-learner, simultaneously trained with the student, adaptively re-weights the two losses for each sample. We demonstrate, on 7 datasets of the GLUE benchmark, that RW-KD outperforms other loss re-weighting methods for KD.
%R 10.18653/v1/2021.findings-emnlp.270
%U https://aclanthology.org/2021.findings-emnlp.270
%U https://doi.org/10.18653/v1/2021.findings-emnlp.270
%P 3145-3152
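
The abstract describes the standard KD objective as a convex combination of a label loss and a loss over the teacher output, with RW-KD re-weighting the two terms per sample. Below is a minimal sketch of that per-sample weighted objective, assuming a PyTorch setup; it is not the authors' implementation, the meta-learner that predicts the weights in RW-KD is not shown, and the names (`sample_weighted_kd_loss`, `alpha`, `temperature`) are hypothetical.

```python
# Minimal sketch (not the paper's code) of a per-sample weighted KD loss.
# The per-sample weights `alpha` are taken as an input; in RW-KD they
# would be produced by a meta-learner trained alongside the student.
import torch
import torch.nn.functional as F


def sample_weighted_kd_loss(student_logits, teacher_logits, labels,
                            alpha, temperature=2.0):
    """Convex combination of label loss and distillation loss, per sample.

    student_logits: (batch, num_classes) raw student outputs
    teacher_logits: (batch, num_classes) raw teacher outputs
    labels:         (batch,) gold class indices
    alpha:          (batch,) weight in [0, 1] for the distillation term
    """
    # Cross-entropy over the gold labels, kept per sample.
    ce = F.cross_entropy(student_logits, labels, reduction="none")

    # KL divergence between temperature-softened teacher and student
    # distributions, also kept per sample.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
    kd = kd * (temperature ** 2)  # usual scaling for softened targets

    # Per-sample convex combination, then average over the batch.
    loss = (1.0 - alpha) * ce + alpha * kd
    return loss.mean()


if __name__ == "__main__":
    batch, num_classes = 4, 3
    student_logits = torch.randn(batch, num_classes, requires_grad=True)
    teacher_logits = torch.randn(batch, num_classes)
    labels = torch.randint(0, num_classes, (batch,))
    alpha = torch.full((batch,), 0.5)  # fixed here; RW-KD would predict these
    print(sample_weighted_kd_loss(student_logits, teacher_logits, labels, alpha))
```

With `alpha` fixed to a single value for all samples, this reduces to the conventional interpolated KD loss the abstract contrasts against; RW-KD's contribution is making `alpha` sample-dependent and learned.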