TReX: Tokenizer Regression for Optimal Data Mixture

Inho Won; Hangyeol Yoo; Minkyung Cho; Jungyeul Park; Hoyun Song; KyungTae Lim

Abstract

Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer’s compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TReX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TReX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TReX’s predicted mixtures outperform those trained on mixtures based on LLaMA3 and uniform distributions by up to 12% in both in- and out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.
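The pipeline described in the abstract (sample random mixtures, measure proxy compression, fit a regressor, search for the best predicted mixture) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three-language set, the synthetic `proxy_compression` function standing in for actually training a small proxy tokenizer, the square-root features, the "higher is better" score convention, and the plain least-squares fit. The regression model TReX actually uses may differ.

```python
import math
import random

LANGS = ["en", "ko", "ar"]  # hypothetical language set, not from the paper


def sample_mixture(rng, k):
    """Draw a random point on the probability simplex (flat Dirichlet)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def proxy_compression(mixture, rng):
    """Stand-in for training a small proxy tokenizer and measuring compression.

    Synthetic assumption: each language shows diminishing returns in its
    share of the data, plus small measurement noise. Higher is better.
    """
    gains = [1.0, 2.0, 1.5]
    clean = sum(g * math.sqrt(x) for g, x in zip(gains, mixture))
    return clean + rng.gauss(0.0, 0.01)


def features(mixture):
    """Assumed feature map: square root of each language's share."""
    return [math.sqrt(x) for x in mixture]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(rows, targets):
    """Ordinary least squares via the normal equations (no intercept)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, mixture):
    return sum(w * f for w, f in zip(weights, features(mixture)))


rng = random.Random(0)

# 1) Random mixtures and their (synthetic) proxy compression scores.
train = [sample_mixture(rng, len(LANGS)) for _ in range(200)]
scores = [proxy_compression(m, rng) for m in train]

# 2) Fit the regressor from mixture to compression score.
weights = fit([features(m) for m in train], scores)

# 3) Cheap search: score many candidate mixtures with the regressor only.
candidates = [sample_mixture(rng, len(LANGS)) for _ in range(5000)]
best = max(candidates, key=lambda m: predict(weights, m))

print({lang: round(share, 3) for lang, share in zip(LANGS, best)})
```

The point of the sketch is the cost structure: the expensive step (the proxy measurement) runs only on the 200 training mixtures, while the 5000 candidates are scored by the fitted regressor alone.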

Anthology ID: 2026.eacl-long.298
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 6353–6370
URL: https://aclanthology.org/2026.eacl-long.298/
Cite (ACL): Inho Won, Hangyeol Yoo, Minkyung Cho, Jungyeul Park, Hoyun Song, and KyungTae Lim. 2026. TReX: Tokenizer Regression for Optimal Data Mixture. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6353–6370, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): TReX: Tokenizer Regression for Optimal Data Mixture (Won et al., EACL 2026)
PDF: https://aclanthology.org/2026.eacl-long.298.pdf
Checklist: 2026.eacl-long.298.checklist.pdf
