Mst. Sanjida Jamal Priya
2025
AniSan@NLU of Devanagari Script Languages 2025: Optimizing Language Identification with Ensemble Learning
Anik Mahmud Shanto | Mst. Sanjida Jamal Priya | Mohammad Shamsul Arefin
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
Identifying languages written in Devanagari script, including Hindi, Marathi, Nepali, Bhojpuri, and Sanskrit, is essential in multilingual contexts but challenging because these languages overlap heavily. To address this, a shared task on "Devanagari Script Language Identification" was organized, with a dataset provided for subtask A to test language identification models. This paper introduces an ensemble-based approach that combines mBERT, XLM-R, and IndicBERT through majority voting to improve language identification accuracy across these languages. Our ensemble model achieves an accuracy of 99.68%, outperforming the individual models by capturing a broader range of language features and reducing the model biases that often arise from closely related linguistic patterns. We also fine-tuned other transformer models for a comparative analysis, which further validates the ensemble's effectiveness. The results highlight the ensemble model's ability to distinguish similar languages within the Devanagari script, offering a promising approach for accurate language identification in complex multilingual contexts.
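The majority-voting step described in the abstract can be illustrated with a minimal sketch. It assumes each of the three fine-tuned models has already produced one predicted label per sample. The function names (`majority_vote`, `ensemble_predict`), the example predictions, and the tie-breaking rule (fall back to the first-listed model when all three disagree) are hypothetical choices made for illustration; the abstract does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by most models for one sample.

    `labels` holds one predicted label per model, in a fixed model order.
    Tie-breaking is an assumption of this sketch: if several labels share
    the top count, the one that appears earliest in `labels` wins.
    """
    counts = Counter(labels)
    top_count = max(counts.values())
    for label in labels:  # earliest-listed model wins a tie
        if counts[label] == top_count:
            return label


def ensemble_predict(per_model_predictions):
    """Combine per-model prediction lists into one ensemble prediction list.

    `per_model_predictions` is a list with one inner list per model;
    all inner lists must cover the same samples in the same order.
    """
    return [majority_vote(sample) for sample in zip(*per_model_predictions)]


# Made-up predictions for four samples (illustrative only, not paper data).
mbert_preds = ["Hindi", "Marathi", "Nepali", "Bhojpuri"]
xlmr_preds = ["Hindi", "Hindi", "Nepali", "Sanskrit"]
indicbert_preds = ["Bhojpuri", "Marathi", "Nepali", "Hindi"]

ensemble_preds = ensemble_predict([mbert_preds, xlmr_preds, indicbert_preds])
print(ensemble_preds)
```

With three voters, a tie can occur only when all three models disagree, as in the fourth sample here. Ordering the models by validation accuracy would make the fallback pick the strongest single model, which is one reasonable design choice among several.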