GAML-BERT: Improving BERT Early Exiting by Gradient Aligned Mutual Learning

Wei Zhu, Xiaoling Wang, Yuan Ni, Guotong Xie


Abstract
In this work, we propose a novel framework, Gradient Aligned Mutual Learning BERT (GAML-BERT), for improving the early exiting of BERT. GAML-BERT's contributions are two-fold. First, we conduct a set of pilot experiments showing that mutual knowledge distillation between a shallow exit and a deep exit improves the performance of both. Based on this observation, we use mutual learning to improve BERT's early exiting performance, that is, each exit of a multi-exit BERT distills knowledge from the others. Second, we propose GA, a novel training method that aligns the gradients from the knowledge distillation losses with those from the cross-entropy losses. Extensive experiments on the GLUE benchmark show that GAML-BERT significantly outperforms state-of-the-art (SOTA) BERT early exiting methods.
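To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of mutual distillation between two exits of a multi-exit encoder, combined with one plausible gradient-alignment rule (keeping a parameter's KD gradient only when it does not conflict with its CE gradient). All names are hypothetical and the alignment rule is an illustrative assumption, not necessarily the paper's exact GA formulation.

import torch
import torch.nn.functional as F

def mutual_kd_loss(logits_shallow, logits_deep, temperature=1.0):
    """Symmetric KL between two exits' softened predictions; each exit
    treats the other's (detached) distribution as a soft target."""
    t = temperature
    kd_shallow = F.kl_div(F.log_softmax(logits_shallow / t, dim=-1),
                          F.softmax(logits_deep.detach() / t, dim=-1),
                          reduction="batchmean") * (t * t)
    kd_deep = F.kl_div(F.log_softmax(logits_deep / t, dim=-1),
                       F.softmax(logits_shallow.detach() / t, dim=-1),
                       reduction="batchmean") * (t * t)
    return kd_shallow + kd_deep

def gradient_aligned_step(model, optimizer, ce_loss, kd_loss):
    """One possible alignment rule: per parameter, add the KD gradient to the
    CE gradient only when their dot product is non-negative (i.e. aligned)."""
    params = [p for p in model.parameters() if p.requires_grad]
    ce_grads = torch.autograd.grad(ce_loss, params, retain_graph=True, allow_unused=True)
    kd_grads = torch.autograd.grad(kd_loss, params, allow_unused=True)
    optimizer.zero_grad()
    for p, g_ce, g_kd in zip(params, ce_grads, kd_grads):
        grad = g_ce if g_ce is not None else torch.zeros_like(p)
        if g_kd is not None and torch.sum(grad * g_kd) >= 0:
            grad = grad + g_kd  # keep the KD signal only when it agrees with CE
        p.grad = grad
    optimizer.step()

In practice the CE loss would be summed over all exits and the KD loss over all exit pairs; the sketch above shows a single pair for clarity.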
Anthology ID:
2021.emnlp-main.242
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3033–3044
URL:
https://aclanthology.org/2021.emnlp-main.242
DOI:
10.18653/v1/2021.emnlp-main.242
Cite (ACL):
Wei Zhu, Xiaoling Wang, Yuan Ni, and Guotong Xie. 2021. GAML-BERT: Improving BERT Early Exiting by Gradient Aligned Mutual Learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3033–3044, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
GAML-BERT: Improving BERT Early Exiting by Gradient Aligned Mutual Learning (Zhu et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.242.pdf
Data
GLUE