Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment

Tianyu Peng; Jiajun Zhang

Abstract

Knowledge distillation (KD) is an effective model compression method that can transfer the internal capabilities of large language models (LLMs) to smaller ones. However, the multi-modal probability distributions predicted by teacher LLMs are difficult for student models to learn. In this paper, we first demonstrate experimentally the importance of multi-modal distribution alignment and then highlight the inefficiency of existing KD approaches in learning multi-modal distributions. To address this problem, we propose Ranking Loss based Knowledge Distillation (RLKD), which encourages consistency between the teacher's and the student's rankings of peak predictions. By incorporating a word-level ranking loss, we ensure compatibility with existing distillation objectives while fully leveraging the fine-grained information between different categories in the peaks of the two predicted distributions. Experimental results demonstrate that our method enables the student model to better learn the multi-modal distributions of the teacher model, leading to significant performance improvements on various downstream tasks.
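
The abstract does not give the exact form of the ranking loss, so the following is only an illustrative sketch of the general idea of penalizing a student whose ordering of the teacher's peak tokens differs from the teacher's. The function name `topk_pairwise_ranking_loss`, the pairwise hinge formulation, and the parameters `k` and `margin` are all assumptions made for illustration; they are not taken from the paper and may differ from RLKD as published.

```python
from itertools import combinations


def topk_pairwise_ranking_loss(teacher_logits, student_logits, k=3, margin=0.0):
    """Hypothetical pairwise hinge loss over the teacher's top-k tokens.

    Assumption: this is NOT the paper's formulation, only a generic
    ranking-consistency penalty for a single token position.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must share a vocabulary")
    # Indices of the teacher's k highest-scoring tokens, best first.
    peaks = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)[:k]
    # Each pair is (higher-ranked, lower-ranked) according to the teacher.
    pairs = list(combinations(peaks, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for hi, lo in pairs:
        # Zero when the student scores `hi` at least `margin` above `lo`.
        total += max(0.0, margin - (student_logits[hi] - student_logits[lo]))
    return total / len(pairs)
```

In this sketch the loss is zero whenever the student orders the teacher's top-k tokens the same way the teacher does, regardless of the exact probabilities. A term of this kind could in principle be added to a standard distillation objective such as a KL divergence, which is one way to read the abstract's claim of compatibility with existing objectives.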

Anthology ID: 2025.coling-main.169
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 2478–2496
URL: https://aclanthology.org/2025.coling-main.169/
Cite (ACL): Tianyu Peng and Jiajun Zhang. 2025. Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2478–2496, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment (Peng & Zhang, COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.169.pdf