Evaluating Multilingual Tokenization under Worst-N Parity-Aware BPE

Vani Kanjirangat; David Kletz; Tanja Samardzic; Ljiljana Dolamic; Fabio Rinaldi

Abstract

Improving the fairness of a language model is a goal that applies at every level of the model. In this paper, we evaluate a method targeting a foundational level: tokenization. We present a multilingual evaluation of parity-aware tokenization under worst-N optimization, extending PA-BPE to jointly optimize over the N worst-compressed languages. We evaluate this formulation for N > 1 across vocabulary sizes of 16K and 32K on the languages from the flores+ benchmark, using metrics that capture both efficiency and structural alignment. Our results reveal that the effects of increasing N are inconsistent across metrics and do not lead to major gains. Efficiency-oriented and boundary-level metrics show a modest tendency to improve at higher values of N, while structural alignment metrics (such as AST alignment and boundary crossing) exhibit no clear pattern, suggesting that compression fairness and linguistic structure are mainly orthogonal objectives. Script-level analysis further reveals uneven effects across writing systems, with several non-Latin scripts showing greater sensitivity to increasing N.

Anthology ID: 2026.mellm-1.21
Volume: Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026)
Month: July
Year: 2026
Address: San Diego, United States
Editors: Kaiyu Huang, Fengran Mo, Pinzhen Chen, Meng Jiang
Venues: MeLLM | WS
Publisher: Association for Computational Linguistics
Pages: 221–228
URL: https://aclanthology.org/2026.mellm-1.21/
Cite (ACL): Vani Kanjirangat, David Kletz, Tanja Samardzic, Ljiljana Dolamic, and Fabio Rinaldi. 2026. Evaluating Multilingual Tokenization under Worst-N Parity-Aware BPE. In Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), pages 221–228, San Diego, United States. Association for Computational Linguistics.
Cite (Informal): Evaluating Multilingual Tokenization under Worst-N Parity-Aware BPE (Kanjirangat et al., MeLLM 2026)
PDF: https://aclanthology.org/2026.mellm-1.21.pdf
