BayesKD: Bayesian Knowledge Distillation for Compact LLMs in Constrained Fine-tuning Scenarios

Wei Li; Lujun Li; Mark Lee; Shengjie Sun; Lei Zhang; Wei Xue; Yike Guo

doi:10.18653/v1/2025.findings-acl.7

Abstract

Large language models (LLMs) have revolutionized various domains with their remarkable capabilities, but their massive parameter counts pose significant challenges for fine-tuning and inference, especially in resource-constrained environments. Conventional compression methods often cause substantial performance degradation in LLMs and struggle to restore model quality during fine-tuning. To address this challenge, we present Bayesian Knowledge Distillation (BayesKD), a novel distillation framework designed for compact LLMs in resource-constrained fine-tuning scenarios. Unlike conventional LLM distillation methods, which introduce time-consuming paradigms and fail to generalize to compressed-LLM fine-tuning scenarios, BayesKD combines three components: Logits Dual-Scaling, a Knowledge Alignment Module, and Bayesian Distillation Optimization. In particular, the Logits Dual-Scaling strategy adaptively adjusts the strength of the teacher’s knowledge transfer, while the Knowledge Alignment Module bridges the gap between the teacher and student models by projecting their knowledge representations into a shared interval. Additionally, we employ Logits-Aware Bayesian Optimization to quickly identify optimal settings for these strategies, thereby enhancing model performance. Extensive experiments across diverse tasks demonstrate that BayesKD consistently outperforms baseline methods on various state-of-the-art LLMs, including LLaMA, Qwen2, Bloom, and Vicuna. Notably, BayesKD achieves average accuracy gains of 2.99% and 4.05% over standard KD for the 8B-parameter LLaMA and Qwen2 models, respectively. Code is available in the supplementary materials.
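For context, the "standard KD" baseline named in the abstract is usually understood as temperature-scaled logit distillation: the student is trained to match the teacher's softened output distribution. The following is a minimal sketch of that baseline loss only, not of BayesKD itself, whose scaling, alignment, and optimization details are not given on this page. The function names `log_softmax` and `kd_loss` and the default temperature of 2.0 are assumptions chosen for illustration.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum_exp for z in scaled]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is the common logit-distillation baseline (an assumption about
    what "standard KD" denotes here), not the BayesKD objective.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl

if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [1.5, 1.2, 0.3]
    print(kd_loss(student, teacher))  # small positive value
    print(kd_loss(teacher, teacher))  # zero when student matches teacher
```

The loss is zero when the student's logits equal the teacher's and grows as the two softened distributions diverge. Raising the temperature flattens both distributions, so more of the teacher's relative ranking of unlikely tokens contributes to the loss.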

Anthology ID: 2025.findings-acl.7
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 138–152
URL: https://aclanthology.org/2025.findings-acl.7/
DOI: 10.18653/v1/2025.findings-acl.7
Cite (ACL): Wei Li, Lujun Li, Mark G. Lee, Shengjie Sun, Lei Zhang, Wei Xue, and Yike Guo. 2025. BayesKD: Bayesian Knowledge Distillation for Compact LLMs in Constrained Fine-tuning Scenarios. In Findings of the Association for Computational Linguistics: ACL 2025, pages 138–152, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): BayesKD: Bayesian Knowledge Distillation for Compact LLMs in Constrained Fine-tuning Scenarios (Li et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.7.pdf
