Towards Resource-Rich Mizo and Khasi in NLP: Resource Development, Synthetic Data Generation and Model Building

Soumyadip Ghosh; Henry Lalsiam; Dorothy Marbaniang; Gracious Mary Temsen; Rahul Mishra; Parameswari Krishnamurthy

doi:10.18653/v1/2025.law-1.18

Abstract

In the rapidly evolving field of Natural Language Processing (NLP), Indian regional languages remain significantly underrepresented due to their limited digital presence and lack of annotated resources. This work presents the first comprehensive effort toward developing high-quality linguistic datasets for two extremely low-resource languages, Mizo and Khasi. We introduce human-annotated, gold-standard datasets for three core NLP tasks: Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and Keyword Identification. To overcome annotation bottlenecks in NER, we further explore a synthetic data generation pipeline involving translation from Hindi and cross-lingual word alignment. For POS tagging, we adopt and subsequently modify the Universal Dependencies (UD) framework to better suit the linguistic characteristics of Mizo and Khasi, while custom annotation guidelines are developed for NER and Keyword Identification. The constructed datasets are evaluated using multilingual language models, demonstrating that structured resource development, coupled with gradual fine-tuning, yields significant improvements in performance. This work represents a critical step toward advancing linguistic resources and computational tools for Mizo and Khasi.

Anthology ID: 2025.law-1.18
Volume: Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Siyao Peng, Ines Rehbein
Venues: LAW | WS
Publisher: Association for Computational Linguistics
Pages: 228–239
URL: https://aclanthology.org/2025.law-1.18/
DOI: 10.18653/v1/2025.law-1.18
Cite (ACL): Soumyadip Ghosh, Henry Lalsiam, Dorothy Marbaniang, Gracious Mary Temsen, Rahul Mishra, and Parameswari Krishnamurthy. 2025. Towards Resource-Rich Mizo and Khasi in NLP: Resource Development, Synthetic Data Generation and Model Building. In Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025), pages 228–239, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Towards Resource-Rich Mizo and Khasi in NLP: Resource Development, Synthetic Data Generation and Model Building (Ghosh et al., LAW 2025)
PDF: https://aclanthology.org/2025.law-1.18.pdf
