Do Tokenizers Fail on Informal Hindi Expressions? Evidence from Static, Downstream, and Robustness Analyses
Manikandan Ravikiran | Tanmay Tiwari | Vibhu Gupta | Rakesh Prakash | Rohit Saluja | Shayan Mohanty
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
We present, to our knowledge, the first systematic evaluation of tokenization quality for informal Hindi expressions, combining static, downstream, and robustness analyses. Our investigation centers on three questions: (RQ1) how well tokenizers preserve informal expression units using static boundary and integrity metrics, (RQ2) how tokenization choices affect downstream identification of informal expressions, and (RQ3) how robust tokenizers remain under orthographic variation, romanization, and noisy spelling. Across multilingual, Indic-focused, and byte-level tokenizers, we find that Indic-oriented models (e.g., MuRIL, IndicBERT) preserve expression boundaries better and achieve higher downstream F1 on clean text than generic multilingual models (e.g., mBERT, XLM-R). However, all tokenizers exhibit severe degradation under romanization, with phrase integrity rates approaching zero. These findings demonstrate that tokenization constitutes a hidden but critical bottleneck for informal Hindi NLP, particularly in cross-script settings, and motivate the need for tokenization strategies that explicitly account for phrase-level semantics and orthographic variation.
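To make the boundary-preservation idea in RQ1 concrete, the sketch below shows one simplified way such a static check could work: an informal expression counts as preserved if both of its character offsets coincide with token boundaries, i.e. no token straddles an expression edge. This is an illustrative toy version, not the paper's metric; it assumes tokens concatenate back to the original string and ignores subword markers (e.g. `##`) and whitespace normalization that real tokenizers require. The example tokenizations are hypothetical.

```python
def token_boundaries(tokens):
    """Return the set of character offsets at which some token ends
    (plus 0), assuming tokens concatenate back to the original text."""
    boundaries, pos = {0}, 0
    for tok in tokens:
        pos += len(tok)
        boundaries.add(pos)
    return boundaries

def expression_preserved(tokens, expr_start, expr_end):
    """True if both edges of the expression's character span align with
    token boundaries, so no token crosses into or out of the expression."""
    b = token_boundaries(tokens)
    return expr_start in b and expr_end in b

# Romanized example: "jugaad kar lo", expression "jugaad" at chars (0, 6).
# Hypothetical tokenizer A splits the word internally but respects its edges:
print(expression_preserved(["jug", "aad", " kar", " lo"], 0, 6))   # True
# Hypothetical tokenizer B lets a token straddle the expression boundary:
print(expression_preserved(["ju", "gaa", "d ka", "r lo"], 0, 6))   # False
```

A phrase-integrity rate over a corpus would then be the fraction of annotated expression spans for which such a check passes, which is one way a near-zero rate under romanization could be operationalized.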