The Token Tax: Systematic Bias in Multilingual Tokenization

Jessica M. Lundin; Ada Zhang; Nihal Karim; Hamza Louzan; Guohao Wei; David Ifeoluwa Adelani; Cody Carroll

Abstract

Tokenization inefficiency is associated with structural disadvantages for morphologically complex, low-resource languages, inflating compute costs and reducing accuracy. We evaluate 10 Large Language Models (LLMs) on AfriMMLU (5 subjects; 16 African languages) and show that token fertility reliably predicts accuracy: higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (e.g., DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. Economically, doubling the token count quadruples training cost and time, underscoring the “token tax” faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
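
The arithmetic behind the "token tax" can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes fertility means average tokens per word (the abstract does not define it), that cost grows with the square of the token count (consistent with the abstract's doubling-to-quadrupling claim), and it uses made-up token and word counts. The names `fertility` and `relative_cost` are hypothetical.

```python
def fertility(token_count: int, word_count: int) -> float:
    """Average tokens per word (assumed definition of token fertility)."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return token_count / word_count


def relative_cost(token_ratio: float, exponent: float = 2.0) -> float:
    """Cost multiplier for a given token-count ratio.

    The default exponent of 2 is an assumption matching the abstract's
    claim that doubling tokens quadruples training cost and time.
    """
    return token_ratio ** exponent


# Hypothetical counts for the same 100-word passage in two languages.
baseline = fertility(token_count=120, word_count=100)
inflated = fertility(token_count=240, word_count=100)

# The second language needs twice as many tokens per word.
ratio = inflated / baseline
cost_multiplier = relative_cost(ratio)
print(f"fertility ratio: {ratio:.2f}, cost multiplier: {cost_multiplier:.2f}")
```

Under these assumptions, a language whose text tokenizes into twice as many tokens pays roughly four times the cost for the same content.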

Anthology ID: 2026.africanlp-main.10
Volume: Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Everlyn Asiko Chimoto, Constantine Lignos, Shamsuddeen Muhammad, Idris Abdulmumin, Clemencia Siro, David Ifeoluwa Adelani
Venues: AfricaNLP | WS
Publisher: Association for Computational Linguistics
Pages: 103–112
URL: https://aclanthology.org/2026.africanlp-main.10/
Cite (ACL): Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Guohao Wei, David Ifeoluwa Adelani, and Cody Carroll. 2026. The Token Tax: Systematic Bias in Multilingual Tokenization. In Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026), pages 103–112, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): The Token Tax: Systematic Bias in Multilingual Tokenization (Lundin et al., AfricaNLP 2026)
PDF: https://aclanthology.org/2026.africanlp-main.10.pdf
