%0 Conference Proceedings
%T GAML-BERT: Improving BERT Early Exiting by Gradient Aligned Mutual Learning
%A Zhu, Wei
%A Wang, Xiaoling
%A Ni, Yuan
%A Xie, Guotong
%Y Moens, Marie-Francine
%Y Huang, Xuanjing
%Y Specia, Lucia
%Y Yih, Scott Wen-tau
%S Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
%D 2021
%8 November
%I Association for Computational Linguistics
%C Online and Punta Cana, Dominican Republic
%F zhu-etal-2021-gaml
%X In this work, we propose a novel framework, Gradient Aligned Mutual Learning BERT (GAML-BERT), for improving the early exiting of BERT. GAML-BERT’s contributions are two-fold. First, we conduct a set of pilot experiments, which show that mutual knowledge distillation between a shallow exit and a deep exit leads to better performance for both. Based on this observation, we use mutual learning to improve BERT’s early exiting performance, that is, we ask each exit of a multi-exit BERT to distill knowledge from the others. Second, we propose GA, a novel training method that aligns the gradients from knowledge distillation with those from the cross-entropy losses. Extensive experiments are conducted on the GLUE benchmark, which show that our GAML-BERT significantly outperforms the state-of-the-art (SOTA) BERT early exiting methods.
%R 10.18653/v1/2021.emnlp-main.242
%U https://aclanthology.org/2021.emnlp-main.242
%U https://doi.org/10.18653/v1/2021.emnlp-main.242
%P 3033-3044