TokLens: A Multilingual Lens on Tokenizer Quality for LLMs

Guan-Ming Chiu

Abstract

We introduce TokLens, an open-source toolkit for evaluating tokenizer quality across languages using six intrinsic metrics: fertility, characters per token, compression ratio, normalized sequence length, single-token retention rate (STRR), and cross-lingual parity. We evaluate 24 tokenizers from major LLM families across 15 typologically diverse languages and correlate these metrics with downstream performance. Our analysis reveals stark disparities: GPT-2 produces 56x more tokens per word in Japanese than in English, while newer tokenizers such as Qwen2.5 and Gemma-2 reduce this gap to under 4x. No intrinsic metric predicts English benchmark performance after controlling for model size. On multilingual benchmarks (MMLU-ProX), however, linear mixed-effects models show that tokenizer metrics significantly predict per-language performance (STRR: β = +5.7, z = 18.5, p < 0.001). A controlled experiment on the Qwen2.5 family further shows that languages with a higher STRR exhibit steeper scaling slopes (ρ = 0.91, p < 0.001). These results indicate that tokenizer quality is significantly associated with multilingual LLM performance, though the evidence remains correlational and is partially confounded with pretraining data composition.
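To make two of the listed metrics concrete, the following is a minimal sketch of fertility (tokens per word) and characters per token, together with a fertility ratio between two languages of the kind behind the "56x" and "under 4x" figures. It is not the TokLens API: the function names, the whitespace-based word splitting, and the `toy_tokenize` stand-in are all assumptions made for illustration. For languages written without spaces, such as Japanese, a real evaluation would need a proper word segmenter.

```python
from typing import Callable, List

# A tokenizer is modeled as any function from text to a list of token strings.
Tokenizer = Callable[[str], List[str]]


def fertility(text: str, tokenize: Tokenizer) -> float:
    """Average number of tokens per word.

    Assumption: words are whitespace-delimited, which does not hold for
    languages written without spaces.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def chars_per_token(text: str, tokenize: Tokenizer) -> float:
    """Average number of characters covered by one token.

    Assumption: whitespace characters count toward the text length.
    """
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    return len(text) / len(tokens)


def fertility_ratio(text_lang: str, text_ref: str, tokenize: Tokenizer) -> float:
    """Fertility of one language relative to a reference language.

    A value of 4.0 means the tokenizer spends four times as many tokens
    per word on text_lang as on text_ref.
    """
    return fertility(text_lang, tokenize) / fertility(text_ref, tokenize)


def toy_tokenize(text: str, width: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into fixed-width pieces."""
    return [
        word[i:i + width]
        for word in text.split()
        for i in range(0, len(word), width)
    ]


if __name__ == "__main__":
    sample = "tokenizers matter"
    print(fertility(sample, toy_tokenize))        # tokens per word
    print(chars_per_token(sample, toy_tokenize))  # characters per token
```

In practice `toy_tokenize` would be replaced by a wrapper around a real tokenizer, and the two texts passed to `fertility_ratio` would be parallel translations so that the comparison reflects the tokenizer rather than the content.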

Anthology ID: 2026.acl-srw.18
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 188–205
URL: https://aclanthology.org/2026.acl-srw.18/
Cite (ACL): Guan-Ming Chiu. 2026. TokLens: A Multilingual Lens on Tokenizer Quality for LLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 188–205, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): TokLens: A Multilingual Lens on Tokenizer Quality for LLMs (Chiu, ACL 2026)
PDF: https://aclanthology.org/2026.acl-srw.18.pdf
