BibTeX
@inproceedings{wen-etal-2023-f,
title = "f-Divergence Minimization for Sequence-Level Knowledge Distillation",
author = "Wen, Yuqiao and
Li, Zichao and
Du, Wenyu and
Mou, Lili",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.605",
doi = "10.18653/v1/2023.acl-long.605",
pages = "10817--10834",
abstract = "Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an FDISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our FDISTILL methods. We further derive step-wise decomposition for our FDISTILL, reducing intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.",
}
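For quick reference, the objective the abstract describes can be written as a standard f-divergence. The sketch below uses assumed notation (teacher distribution p, student q_theta, output sequence y) and follows the textbook definition, not necessarily the paper's exact formulation:

% Sequence-level f-divergence between teacher p and student q_theta
% (notation assumed for illustration, not taken from the paper):
\[
  D_f\bigl(p \,\|\, q_\theta\bigr)
    = \sum_{\mathbf{y}} q_\theta(\mathbf{y})\,
      f\!\left(\frac{p(\mathbf{y})}{q_\theta(\mathbf{y})}\right),
  \qquad f \text{ convex},\; f(1) = 0.
\]

Choosing f(t) = t log t recovers forward KL and f(t) = -log t recovers reverse KL; symmetric choices such as Jensen-Shannon and total variation are also f-divergences, and these are presumably the kinds of "symmetric distilling losses" the abstract refers to.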
MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="wen-etal-2023-f">
<titleInfo>
<title>f-Divergence Minimization for Sequence-Level Knowledge Distillation</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yuqiao</namePart>
<namePart type="family">Wen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zichao</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wenyu</namePart>
<namePart type="family">Du</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lili</namePart>
<namePart type="family">Mou</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2023-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Anna</namePart>
<namePart type="family">Rogers</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jordan</namePart>
<namePart type="family">Boyd-Graber</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Naoaki</namePart>
<namePart type="family">Okazaki</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Toronto, Canada</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an FDISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our FDISTILL methods. We further derive step-wise decomposition for our FDISTILL, reducing intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.</abstract>
<identifier type="citekey">wen-etal-2023-f</identifier>
<identifier type="doi">10.18653/v1/2023.acl-long.605</identifier>
<location>
<url>https://aclanthology.org/2023.acl-long.605</url>
</location>
<part>
<date>2023-07</date>
<extent unit="page">
<start>10817</start>
<end>10834</end>
</extent>
</part>
</mods>
</modsCollection>
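The abstract's step-wise decomposition reduces the intractable sequence-level divergence to per-token losses. Below is a minimal, hypothetical PyTorch sketch of the word-level KL case; the function name and tensor shapes are assumptions for illustration, and it approximates rather than reproduces the paper's FDISTILL losses.

# Illustrative word-level (step-wise) KD loss: sequence-level KL
# decomposes into a sum of per-token KLs between the teacher's and
# student's next-token distributions. A sketch, not the paper's code.
import torch
import torch.nn.functional as F

def wordlevel_kl_loss(teacher_logits, student_logits):
    """Both tensors have shape (batch, seq_len, vocab_size)."""
    log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher log-probs
    log_q = F.log_softmax(student_logits, dim=-1)  # student log-probs
    # KL(p || q) at each position: sum over the vocabulary,
    # then average over all token positions in the batch.
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    return kl.mean()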
Endnote
%0 Conference Proceedings
%T f-Divergence Minimization for Sequence-Level Knowledge Distillation
%A Wen, Yuqiao
%A Li, Zichao
%A Du, Wenyu
%A Mou, Lili
%Y Rogers, Anna
%Y Boyd-Graber, Jordan
%Y Okazaki, Naoaki
%S Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
%D 2023
%8 July
%I Association for Computational Linguistics
%C Toronto, Canada
%F wen-etal-2023-f
%X Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an FDISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our FDISTILL methods. We further derive step-wise decomposition for our FDISTILL, reducing intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.
%R 10.18653/v1/2023.acl-long.605
%U https://aclanthology.org/2023.acl-long.605
%U https://doi.org/10.18653/v1/2023.acl-long.605
%P 10817-10834
Markdown (Informal)
[f-Divergence Minimization for Sequence-Level Knowledge Distillation](https://aclanthology.org/2023.acl-long.605) (Wen et al., ACL 2023)
ACL
Yuqiao Wen, Zichao Li, Wenyu Du, and Lili Mou. 2023. f-Divergence Minimization for Sequence-Level Knowledge Distillation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10817-10834, Toronto, Canada. Association for Computational Linguistics.