Iterative Structured Knowledge Distillation: Optimizing Language Models Through Layer-by-Layer Distillation

Malthe Have Musaeus, Rob van der Goot


Abstract
Traditional language model compression techniques, such as knowledge distillation, require a fixed student architecture, limiting flexibility, while structured pruning methods often fail to preserve performance. This paper introduces Iterative Structured Knowledge Distillation (ISKD), which integrates knowledge distillation and structured pruning by progressively replacing transformer blocks with smaller, efficient versions during training. This study validates ISKD on two transformer-based language models: GPT-2 and Phi-1. ISKD outperforms L1 pruning and achieves performance comparable to knowledge distillation while offering greater flexibility. ISKD reduces model parameters by 30.68% for GPT-2 and by 30.16% for Phi-1 while maintaining at least four-fifths of the original performance on both language modeling and commonsense reasoning tasks. These findings suggest that this method offers a promising balance between model efficiency and accuracy.
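To make the block-replacement idea in the abstract concrete, the sketch below shows one plausible way to distill a single transformer block into a smaller replacement. This is not the authors' implementation: the module SmallBlock, the shrink_factor bottleneck, the distill_block helper, and the use of a mean-squared-error loss on hidden states are all illustrative assumptions.

```python
# Minimal sketch of iterative block-level distillation (illustrative only).
# Assumptions: teacher_blocks is a list/ModuleList of transformer blocks that
# map hidden states to hidden states, and data_loader yields the hidden-state
# tensors entering the block being replaced.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallBlock(nn.Module):
    """Hypothetical compact replacement for one transformer block."""

    def __init__(self, hidden_dim: int, shrink_factor: int = 2):
        super().__init__()
        bottleneck = hidden_dim // shrink_factor
        self.ff = nn.Sequential(
            nn.Linear(hidden_dim, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, hidden_dim),
        )

    def forward(self, x):
        # Residual connection around the bottlenecked feed-forward sub-layer.
        return x + self.ff(x)


def distill_block(teacher_blocks, block_idx, hidden_dim, data_loader,
                  epochs=1, lr=1e-4, device="cpu"):
    """Train a small block to mimic one frozen teacher block's mapping."""
    teacher = teacher_blocks[block_idx].to(device).eval()
    student = SmallBlock(hidden_dim).to(device)
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)

    for _ in range(epochs):
        for hidden_states in data_loader:
            hidden_states = hidden_states.to(device)
            with torch.no_grad():
                target = teacher(hidden_states)  # teacher output to imitate
            loss = F.mse_loss(student(hidden_states), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```

In the ISKD setting described above, such a step would be applied block by block, progressively swapping trained small blocks into the model; the paper's actual losses, block designs, and training schedule may differ from this sketch.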
Anthology ID:
2025.coling-main.440
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
6601–6606
URL:
https://aclanthology.org/2025.coling-main.440/
Cite (ACL):
Malthe Have Musaeus and Rob van der Goot. 2025. Iterative Structured Knowledge Distillation: Optimizing Language Models Through Layer-by-Layer Distillation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6601–6606, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Iterative Structured Knowledge Distillation: Optimizing Language Models Through Layer-by-Layer Distillation (Musaeus & van der Goot, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.440.pdf