Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Avyav Kumar Singh; Yen-Chen Wu; Alexandru Cioba; Alberto Bernacchia; Davide Buffelli

Abstract

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD), which enables CTD by operating at a common interface across tokenizers: the byte level. Specifically, we convert the teacher’s output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer. They also show that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
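
The abstract does not spell out how the teacher's token distribution is converted to byte-level probabilities. As a rough illustration only, and not the authors' procedure, the sketch below marginalizes a next-token distribution onto the first UTF-8 byte of each token. The function name `next_byte_distribution`, the toy vocabulary, and the probabilities are all hypothetical; a full conversion would also have to handle bytes beyond the first one and token boundaries, which this sketch ignores.

```python
from collections import defaultdict


def next_byte_distribution(token_probs):
    """Map a next-token distribution to a distribution over the next byte.

    token_probs: dict of token string -> probability (hypothetical toy input).
    Returns a dict of byte value (0-255) -> probability.
    Assumption: each token contributes its mass to its first UTF-8 byte only.
    """
    byte_probs = defaultdict(float)
    for token, p in token_probs.items():
        encoded = token.encode("utf-8")
        if not encoded:
            # An empty token emits no byte, so it is skipped.
            continue
        byte_probs[encoded[0]] += p

    total = sum(byte_probs.values())
    if total == 0:
        raise ValueError("no probability mass on non-empty tokens")
    # Renormalize in case empty tokens were dropped.
    return {b: p / total for b, p in byte_probs.items()}


# Made-up teacher distribution over three tokens.
teacher = {"the": 0.5, "then": 0.25, "a": 0.25}
byte_dist = next_byte_distribution(teacher)
```

Here "the" and "then" share the first byte `t`, so their mass is pooled. Two models with different vocabularies can be compared on such byte-level distributions because both are defined over the same 256 possible byte values.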

Anthology ID: 2026.customnlp4u-1.9
Volume: Proceedings of the Second Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Sheshera Mysore, Sachin Kumar, Vidhisha Balachandran, Shirley Anugrah Hayati, Faeze Brahman, Hanane Nour Moussa, Alireza Salemi
Venues: CustomNLP4U | WS
Publisher: Association for Computational Linguistics
Pages: 84–96
URL: https://aclanthology.org/2026.customnlp4u-1.9/
Cite (ACL): Avyav Kumar Singh, Yen-Chen Wu, Alexandru Cioba, Alberto Bernacchia, and Davide Buffelli. 2026. Cross-Tokenizer LLM Distillation through a Byte-Level Interface. In Proceedings of the Second Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U), pages 84–96, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Cross-Tokenizer LLM Distillation through a Byte-Level Interface (Singh et al., CustomNLP4U 2026)
PDF: https://aclanthology.org/2026.customnlp4u-1.9.pdf
