Studying the Effect of Hindi Tokenizer Performance on Downstream Tasks

Rashi Goel, Fatiha Sadat


Abstract
This paper studies the effect of training data size and tokenizer performance for the Hindi language on downstream model performance and comprehension. Multiple monolingual Hindi tokenizers are trained for large language models such as BERT, and intrinsic and extrinsic evaluations are performed on multiple Hindi datasets. The objective of this study is to understand the precise effects of tokenizer performance on downstream task performance and to gain insight into how to develop better models for low-resource languages.
Anthology ID:
2025.indonlp-1.5
Volume:
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
Month:
January
Year:
2025
Address:
Abu Dhabi
Editors:
Ruvan Weerasinghe, Isuri Anuradha, Deshan Sumanathilaka
Venues:
IndoNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
44–49
URL:
https://aclanthology.org/2025.indonlp-1.5/
Cite (ACL):
Rashi Goel and Fatiha Sadat. 2025. Studying the Effect of Hindi Tokenizer Performance on Downstream Tasks. In Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, pages 44–49, Abu Dhabi. Association for Computational Linguistics.
Cite (Informal):
Studying the Effect of Hindi Tokenizer Performance on Downstream Tasks (Goel & Sadat, IndoNLP 2025)
PDF:
https://aclanthology.org/2025.indonlp-1.5.pdf