Niclas Doll
2025
Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks
Joyeeta Datta | Niclas Doll | Qusai Ramadan | Zeyd Boukhers
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Large Language Models (LLMs) have shown outstanding performance across a range of NLP tasks, but their computational demands hinder deployment in real-world, resource-constrained environments. This work investigates the extent to which LLMs can be compressed using knowledge distillation (KD) while maintaining strong performance on question answering (QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5 families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot prompting conditions. Results show that student models retain over 90% of their teacher models’ performance while reducing parameter counts by up to 57.1%. Furthermore, one-shot prompting yields additional performance gains over zero-shot setups for both model families. These findings underscore the trade-off between model efficiency and task performance, demonstrating that KD, combined with minimal prompting, can yield compact yet capable QA systems suitable for real-world applications.
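A minimal sketch of a standard soft-target distillation objective of the kind described above, assuming temperature-scaled KL divergence between teacher and student logits; the paper's exact loss formulation and hyperparameters are not given here, and the function and parameter names are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic soft-target KD loss: KL(teacher || student) on temperature-softened
    distributions. Illustrative only; the paper's setup may differ."""
    # Soften both distributions with the same temperature
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

In practice this term is usually combined with the regular cross-entropy loss on the ground-truth labels via a weighting coefficient.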
2024
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali | Michael Fromm | Klaudia Thellmann | Richard Rutmann | Max Lübbering | Johannes Leveling | Katrin Klug | Jan Ebert | Niclas Doll | Jasper Buschhoff | Charvi Jain | Alexander Weber | Lena Jurkschat | Hammam Abdelwahab | Chelsea John | Pedro Ortiz Suarez | Malte Ostendorff | Samuel Weinbach | Rafet Sifa | Stefan Kesselheim | Nicolas Flores-Herr
Findings of the Association for Computational Linguistics: NAACL 2024
The recent success of large language models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes, and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model’s downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics, fertility and parity, are not always predictive of model downstream performance, rendering these metrics a questionable proxy for it. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a threefold increase in vocabulary size in comparison to English. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
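A minimal sketch of how the fertility and parity metrics mentioned above are commonly computed, assuming a Hugging Face tokenizer; the paper's exact definitions and corpora may differ, and the model name in the usage example is only a placeholder:

```python
from transformers import AutoTokenizer  # assumed dependency; any object with .tokenize() works

def fertility(tokenizer, texts):
    """Average number of subword tokens produced per whitespace-delimited word."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def parity(tokenizer, texts_lang_a, texts_lang_b):
    """Ratio of token counts on parallel texts in two languages (1.0 = equal treatment)."""
    tokens_a = sum(len(tokenizer.tokenize(t)) for t in texts_lang_a)
    tokens_b = sum(len(tokenizer.tokenize(t)) for t in texts_lang_b)
    return tokens_a / tokens_b

# Usage (placeholder tokenizer):
# tok = AutoTokenizer.from_pretrained("gpt2")
# print(fertility(tok, ["The quick brown fox jumps over the lazy dog."]))
```

Lower fertility means fewer tokens per word, and a parity close to 1.0 means the tokenizer treats the two languages comparably; the abstract's point is that neither value reliably predicts downstream performance.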