AniSan@NLU of Devanagari Script Languages 2025: Optimizing Language Identification with Ensemble Learning

Anik Mahmud Shanto; Mst. Sanjida Jamal Priya; Mohammad Shamsul Arefin

Abstract

Identifying languages written in Devanagari script, including Hindi, Marathi, Nepali, Bhojpuri, and Sanskrit, is essential in multilingual contexts but challenging because these languages overlap heavily. To address this, a shared task on “Devanagari Script Language Identification” was organized, with a dataset available for subtask A to test language identification models. This paper introduces an ensemble-based approach that combines mBERT, XLM-R, and IndicBERT models through majority voting to improve language identification accuracy across these languages. Our ensemble model achieves an accuracy of 99.68%, outperforming individual models by capturing a broader range of language features and reducing model biases that often arise from closely related linguistic patterns. Additionally, we fine-tuned other transformer models as part of a comparative analysis, providing further validation of the ensemble’s effectiveness. The results highlight the ensemble model’s ability to distinguish similar languages within the Devanagari script, offering a promising approach for accurate language identification in complex multilingual contexts.
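As a rough illustration of the majority-voting step mentioned in the abstract, the sketch below combines per-sentence label predictions from three classifiers. It is a minimal sketch, not the authors' implementation: the function name `majority_vote`, the toy prediction lists, and the tie-breaking rule (fall back to the first model's label) are all assumptions, since the abstract does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    predictions: list of lists, one inner list of labels per model,
    all of the same length. Ties are broken in favour of the model
    listed first (an assumption made for this sketch).
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all models must predict the same number of items")

    voted = []
    for labels in zip(*predictions):  # labels for one sentence, in model order
        counts = Counter(labels)
        top = max(counts.values())
        # Scan in model order so the earliest model wins a tie.
        winner = next(label for label in labels if counts[label] == top)
        voted.append(winner)
    return voted


# Hypothetical predictions for four sentences (not data from the paper).
mbert_preds = ["Hindi", "Marathi", "Nepali", "Bhojpuri"]
xlmr_preds = ["Hindi", "Hindi", "Nepali", "Sanskrit"]
indicbert_preds = ["Bhojpuri", "Marathi", "Nepali", "Hindi"]

ensemble_preds = majority_vote([mbert_preds, xlmr_preds, indicbert_preds])
print(ensemble_preds)
```

In the last toy sentence all three models disagree, so the sketch falls back to the first model's label; a real system might instead break ties using model confidence or validation accuracy.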

Anthology ID: 2025.chipsal-1.24
Volume: Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Kengatharaiyer Sarveswaran, Ashwini Vaidya, Bal Krishna Bal, Sana Shams, Surendrabikram Thapa
Venues: CHiPSAL | WS
Publisher: International Committee on Computational Linguistics
Pages: 236–241
URL: https://aclanthology.org/2025.chipsal-1.24/
Cite (ACL): Anik Mahmud Shanto, Mst. Sanjida Jamal Priya, and Mohammad Shamsul Arefin. 2025. AniSan@NLU of Devanagari Script Languages 2025: Optimizing Language Identification with Ensemble Learning. In Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025), pages 236–241, Abu Dhabi, UAE. International Committee on Computational Linguistics.
Cite (Informal): AniSan@NLU of Devanagari Script Languages 2025: Optimizing Language Identification with Ensemble Learning (Shanto et al., CHiPSAL 2025)
PDF: https://aclanthology.org/2025.chipsal-1.24.pdf