Tokenization Cost, Retention, and Orthography Robustness for Ladin and Italian Varieties

Alessio Staffini

Abstract

Tokenizer mismatch is a practical bottleneck for low-resource language varieties: when text is fragmented into disproportionately many subwords or bytes, it wastes context, increases truncation, and can be brittle to orthographic variation. We present a lightweight and reproducible audit centered on Ladin and evaluated on the Identification of Languages and Dialects of Italy benchmark of eleven Italian varieties. Our diagnostic suite combines tokenization cost measures (tokens per word, truncation pressure, bytes per token) with retention indicators (word split rate, continued-token rate, and type-level retention) and fragmentation proxies that reveal splitting patterns beyond fertility. We pair these diagnostics with a conservative orthography robustness protocol (diacritics, casing, punctuation and dash normalization) and assess how diagnostic changes relate to performance drops in lightweight baselines for sentence-level variety identification. We release code and derived statistics to support reproducible tokenizer audits in other low-resource settings.
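To make the diagnostics named in the abstract concrete, the following is a minimal sketch of how a few of them could be computed. It is not the released code: the metric definitions (in particular `continued_token_rate` as the share of non-initial word pieces), the function names, and the `toy_tokenize` stand-in for a real subword tokenizer are all assumptions made for illustration, and the paper's own definitions may differ.

```python
import unicodedata
from typing import Callable, Dict, List


def strip_diacritics(text: str) -> str:
    """Perturbation: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_dashes(text: str) -> str:
    """Perturbation: map common Unicode dashes to ASCII hyphen-minus."""
    dashes = (0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2212)
    return text.translate({code: "-" for code in dashes})


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical tokenizer: splits a word into chunks of up to 3 characters.

    A real audit would plug in an actual subword tokenizer here.
    """
    return [word[i:i + 3] for i in range(0, len(word), 3)]


def audit(text: str,
          tokenize: Callable[[str], List[str]] = toy_tokenize) -> Dict[str, float]:
    """Compute assumed versions of four diagnostics on whitespace-split words."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    pieces = [tokenize(w) for w in words]
    n_tokens = sum(len(p) for p in pieces)
    n_bytes = sum(len(w.encode("utf-8")) for w in words)
    return {
        # Average number of tokens produced per word (fertility).
        "tokens_per_word": n_tokens / len(words),
        # Average UTF-8 bytes covered by one token.
        "bytes_per_token": n_bytes / n_tokens,
        # Share of words split into more than one token.
        "word_split_rate": sum(len(p) > 1 for p in pieces) / len(words),
        # Share of tokens that continue a word (assumed definition).
        "continued_token_rate": (n_tokens - len(words)) / n_tokens,
    }


if __name__ == "__main__":
    sentence = "la città è bella"
    print(audit(sentence))
    print(audit(strip_diacritics(sentence)))
```

Comparing the two printed dictionaries shows the kind of before/after diagnostic shift that an orthography robustness protocol would track; with a real tokenizer, casing and punctuation perturbations could be added in the same way.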

Anthology ID: 2026.loreslm-1.49
Volume: Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Hansi Hettiarachchi, Tharindu Ranasinghe, Alistair Plum, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
Venue: LoResLM
Publisher: Association for Computational Linguistics
Pages: 570–583
URL: https://aclanthology.org/2026.loreslm-1.49/
Cite (ACL): Alessio Staffini. 2026. Tokenization Cost, Retention, and Orthography Robustness for Ladin and Italian Varieties. In Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026), pages 570–583, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Tokenization Cost, Retention, and Orthography Robustness for Ladin and Italian Varieties (Staffini, LoResLM 2026)
PDF: https://aclanthology.org/2026.loreslm-1.49.pdf
