@inproceedings{gong-etal-2025-beyond,
title = "Beyond Logits: Aligning Feature Dynamics for Effective Knowledge Distillation",
author = "Gong, Guoqiang and
Wang, Jiaxing and
Xu, Jin and
Xiang, Deping and
Zhang, Zicheng and
Shen, Leqi and
Zhang, Yifeng and
Shu, Junhua and
Xing, Zhaolong and
Chen, Zhen and
Liu, Pengzhang and
Zhang, Ke",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.1125/",
doi = "10.18653/v1/2025.acl-long.1125",
pages = "23067--23077",
ISBN = "979-8-89176-251-0",
abstract = "Knowledge distillation (KD) compresses large language models (LLMs), known as teacher models, into lightweight versions called student models, enabling efficient inference and downstream applications. However, prevailing approaches accomplish this by predominantly focusing on matching the final output distributions of student/teacher models. Drawing on the perspective that transformers can be viewed as discretizing ordinary differential equation (ODEs) on integer time steps (corresponding to layer indices), where intermediate features evolve across layers, we argue that effective KD requires aligning the entire feature dynamics between teacher and student models, which we call feature dynamics distillation (FDD). This alignment involves matching both the feature trajectory and its first-order derivative, rather than just the final states. Our approach extends the original KD objective with two additional loss terms: layer-wise feature KD, which matches discretized feature trajectory, and layer feature delta KD, which matches first-order changes in features across adjacent layers. Extensive experiments on various tasks validate the effectiveness of our distillation method."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="gong-etal-2025-beyond">
<titleInfo>
<title>Beyond Logits: Aligning Feature Dynamics for Effective Knowledge Distillation</title>
</titleInfo>
<name type="personal">
<namePart type="given">Guoqiang</namePart>
<namePart type="family">Gong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jiaxing</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jin</namePart>
<namePart type="family">Xu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Deping</namePart>
<namePart type="family">Xiang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zicheng</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Leqi</namePart>
<namePart type="family">Shen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yifeng</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">JunhuaShu</namePart>
<namePart type="family">JunhuaShu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">ZhaolongXing</namePart>
<namePart type="family">ZhaolongXing</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhen</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pengzhang</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ke</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-251-0</identifier>
</relatedItem>
<abstract>Knowledge distillation (KD) compresses large language models (LLMs), known as teacher models, into lightweight versions called student models, enabling efficient inference and downstream applications. However, prevailing approaches accomplish this by predominantly focusing on matching the final output distributions of the student and teacher models. Drawing on the perspective that transformers can be viewed as discretizing ordinary differential equations (ODEs) at integer time steps (corresponding to layer indices), where intermediate features evolve across layers, we argue that effective KD requires aligning the entire feature dynamics between teacher and student models, which we call feature dynamics distillation (FDD). This alignment involves matching both the feature trajectory and its first-order derivative, rather than just the final states. Our approach extends the original KD objective with two additional loss terms: layer-wise feature KD, which matches the discretized feature trajectory, and layer feature delta KD, which matches first-order changes in features across adjacent layers. Extensive experiments on various tasks validate the effectiveness of our distillation method.</abstract>
<identifier type="citekey">gong-etal-2025-beyond</identifier>
<identifier type="doi">10.18653/v1/2025.acl-long.1125</identifier>
<location>
<url>https://aclanthology.org/2025.acl-long.1125/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>23067</start>
<end>23077</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Beyond Logits: Aligning Feature Dynamics for Effective Knowledge Distillation
%A Gong, Guoqiang
%A Wang, Jiaxing
%A Xu, Jin
%A Xiang, Deping
%A Zhang, Zicheng
%A Shen, Leqi
%A Zhang, Yifeng
%A Shu, Junhua
%A Xing, Zhaolong
%A Chen, Zhen
%A Liu, Pengzhang
%A Zhang, Ke
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-251-0
%F gong-etal-2025-beyond
%X Knowledge distillation (KD) compresses large language models (LLMs), known as teacher models, into lightweight versions called student models, enabling efficient inference and downstream applications. However, prevailing approaches accomplish this by predominantly focusing on matching the final output distributions of the student and teacher models. Drawing on the perspective that transformers can be viewed as discretizing ordinary differential equations (ODEs) at integer time steps (corresponding to layer indices), where intermediate features evolve across layers, we argue that effective KD requires aligning the entire feature dynamics between teacher and student models, which we call feature dynamics distillation (FDD). This alignment involves matching both the feature trajectory and its first-order derivative, rather than just the final states. Our approach extends the original KD objective with two additional loss terms: layer-wise feature KD, which matches the discretized feature trajectory, and layer feature delta KD, which matches first-order changes in features across adjacent layers. Extensive experiments on various tasks validate the effectiveness of our distillation method.
%R 10.18653/v1/2025.acl-long.1125
%U https://aclanthology.org/2025.acl-long.1125/
%U https://doi.org/10.18653/v1/2025.acl-long.1125
%P 23067-23077
Markdown (Informal)
[Beyond Logits: Aligning Feature Dynamics for Effective Knowledge Distillation](https://aclanthology.org/2025.acl-long.1125/) (Gong et al., ACL 2025)
ACL
Guoqiang Gong, Jiaxing Wang, Jin Xu, Deping Xiang, Zicheng Zhang, Leqi Shen, Yifeng Zhang, Junhua Shu, Zhaolong Xing, Zhen Chen, Pengzhang Liu, and Ke Zhang. 2025. Beyond Logits: Aligning Feature Dynamics for Effective Knowledge Distillation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23067–23077, Vienna, Austria. Association for Computational Linguistics.
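
A minimal sketch of the objective described in the abstract, not the authors' released implementation: it combines a softened-logit KD loss with (i) layer-wise feature matching and (ii) matching of first-order feature changes between adjacent layers. The uniform student-to-teacher layer mapping, the MSE formulation of the feature terms, the loss weights, and the assumption of equal hidden dimensions are all illustrative choices, not details from the paper.

```python
# Hypothetical sketch of feature dynamics distillation (FDD) as summarized in the
# abstract: logit KD + layer-wise feature KD + layer feature delta KD.
import torch
import torch.nn.functional as F


def uniform_layer_map(num_student_layers: int, num_teacher_layers: int):
    """Assumed mapping: spread student layers uniformly over teacher layers."""
    return [
        round(s * (num_teacher_layers - 1) / max(num_student_layers - 1, 1))
        for s in range(num_student_layers)
    ]


def fdd_loss(student_logits, teacher_logits,
             student_hiddens, teacher_hiddens,
             temperature=2.0, w_feat=1.0, w_delta=1.0):
    """student_hiddens / teacher_hiddens: lists of [batch, seq, dim] tensors,
    one per layer (equal hidden dims assumed; otherwise a projection is needed)."""
    # 1) Standard logit KD: KL between softened output distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    layer_map = uniform_layer_map(len(student_hiddens), len(teacher_hiddens))

    # 2) Layer-wise feature KD: match the discretized feature trajectory.
    feat = sum(
        F.mse_loss(student_hiddens[s], teacher_hiddens[t])
        for s, t in enumerate(layer_map)
    ) / len(student_hiddens)

    # 3) Layer feature delta KD: match first-order feature changes across
    #    adjacent layers (a discrete analogue of the trajectory's derivative).
    delta = 0.0
    for s in range(1, len(student_hiddens)):
        d_student = student_hiddens[s] - student_hiddens[s - 1]
        d_teacher = teacher_hiddens[layer_map[s]] - teacher_hiddens[layer_map[s - 1]]
        delta = delta + F.mse_loss(d_student, d_teacher)
    delta = delta / max(len(student_hiddens) - 1, 1)

    return kd + w_feat * feat + w_delta * delta
```

In practice the per-layer hidden states could be collected from Hugging Face models with `output_hidden_states=True`, with a learned linear projection on the student features when the two hidden sizes differ.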