2024
Findings of WMT 2024 Shared Task on Low-Resource Indic Languages Translation
Partha Pakray | Santanu Pal | Advaitha Vetagiri | Reddi Krishna | Arnab Kumar Maji | Sandeep Dash | Lenin Laitonjam | Lyngdoh Sarah | Riyanka Manna
Proceedings of the Ninth Conference on Machine Translation
This paper presents the results of the low-resource Indic language translation task, organized in conjunction with the Ninth Conference on Machine Translation (WMT) 2024. In this edition, participants were challenged to develop machine translation models for four distinct language pairs: English-Assamese, English-Mizo, English-Khasi, and English-Manipuri. The task utilized the enriched IndicNE-Corp1.0 dataset, which includes an extensive collection of parallel and monolingual corpora for northeastern Indic languages. The evaluation was conducted through a comprehensive suite of automatic metrics (BLEU, TER, RIBES, METEOR, and ChrF), supplemented by meticulous human assessment to measure the translation systems' performance and accuracy. This initiative aims to drive advancements in low-resource machine translation and make a substantial contribution to the growing body of knowledge in this dynamic field.
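As a concrete illustration of the automatic evaluation, the following minimal sketch scores a system output against its references with the sacreBLEU toolkit's BLEU, chrF, and TER implementations (METEOR and RIBES are not part of sacreBLEU and need separate tooling). The file names and helper function are assumptions for illustration, not part of the shared task release.

from sacrebleu.metrics import BLEU, CHRF, TER

def read_lines(path):
    # One detokenized sentence per line, UTF-8.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

# Hypothetical file names for a system's output and the reference translations.
hypotheses = read_lines("system.en-as.hyp")
references = read_lines("reference.en-as.ref")

# corpus_score takes the hypotheses and a list of reference streams.
for metric in (BLEU(), CHRF(), TER()):
    print(metric.corpus_score(hypotheses, [references]))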
2023
Findings of the WMT 2023 Shared Task on Low-Resource Indic Language Translation
Santanu Pal | Partha Pakray | Sahinur Rahman Laskar | Lenin Laitonjam | Vanlalmuansangi Khenglawt | Sunita Warjri | Pankaj Kundan Dadure | Sandeep Kumar Dash
Proceedings of the Eighth Conference on Machine Translation
This paper presents the results of the low-resource Indic language translation task organized alongside the Eighth Conference on Machine Translation (WMT) 2023. In this task, participants were asked to build machine translation systems for any of four language pairs, namely English-Assamese, English-Mizo, English-Khasi, and English-Manipuri. For this task, the IndicNE-Corp1.0 dataset was released, consisting of parallel and monolingual corpora for northeastern Indic languages such as Assamese, Mizo, Khasi, and Manipuri. The evaluation was carried out using automatic evaluation metrics (BLEU, TER, RIBES, COMET, ChrF) and human evaluation.
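For the neural metric in the list above, a minimal scoring sketch with the open-source unbabel-comet package is shown below; the checkpoint name (Unbabel/wmt22-comet-da) and the toy example triple are assumptions for illustration, not the task's official evaluation configuration.

from comet import download_model, load_from_checkpoint

# Fetch a publicly released COMET checkpoint and load it (assumed checkpoint name).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# COMET scores (source, hypothesis, reference) triples; these strings are placeholders.
data = [{"src": "Hello, how are you?",
         "mt": "Hello, how are you?",
         "ref": "Hello, how are you?"}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)  # corpus-level score
print(output.scores)        # per-segment scores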
2021
Manipuri-English Machine Translation using Comparable Corpus
Lenin Laitonjam | Sanasam Ranbir Singh
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)
Unsupervised Machine Translation (MT) models, which can perform translation without parallel sentences by using comparable corpora, are becoming a promising approach to developing MT for low-resource languages. However, the majority of studies in unsupervised MT have considered resource-rich language pairs with similar linguistic characteristics. In this paper, we investigate the effectiveness of unsupervised MT models on a Manipuri-English comparable corpus. Manipuri is a low-resource language whose linguistic characteristics differ from those of English. This paper focuses on identifying the challenges in building unsupervised MT models over the comparable corpus. From various experimental observations, it is evident that developing MT over a comparable corpus using unsupervised methods is feasible. Further, the paper identifies future directions for developing effective MT for the Manipuri-English language pair under unsupervised scenarios.
2016
Automatic Syllabification for Manipuri language
Loitongbam Gyanendro Singh | Lenin Laitonjam | Sanasam Ranbir Singh
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Developing hand-crafted rules for syllabifying the words of a language is an expensive task. This paper proposes several data-driven methods for the automatic syllabification of words written in the Manipuri language, one of the scheduled Indian languages. First, we propose a language-independent rule-based approach formulated using entropy-based phonotactic segmentation. Second, we cast syllabification as a sequence labeling problem and investigate the effect of various sequence labeling approaches. Third, we combine sequence labeling with the rule-based method and investigate the performance of the hybrid approach. From various experimental observations, it is evident that the proposed methods outperform the baseline rule-based method: entropy-based phonotactic segmentation achieves a word accuracy of 96%, the CRF sequence labeling approach 97%, and the hybrid approach 98%.
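To make the sequence labeling formulation concrete, the sketch below treats syllabification as character-level tagging (B marks a syllable-initial character, I a syllable-internal one) and trains a CRF with the sklearn-crfsuite package. The toy feature set and the English example words are assumptions for illustration only, not the features or data used in the paper.

import sklearn_crfsuite

def char_features(word, i):
    # Minimal character-window features for position i (illustrative only).
    return {"char": word[i],
            "prev": word[i - 1] if i > 0 else "<s>",
            "next": word[i + 1] if i < len(word) - 1 else "</s>"}

def word_to_features(word):
    return [char_features(word, i) for i in range(len(word))]

# Toy training pairs: each word with B/I tags marking syllable boundaries
# (ba-na-na, wa-ter); real data would be syllabified Manipuri words.
train_words = [("banana", ["B", "I", "B", "I", "B", "I"]),
               ("water", ["B", "I", "B", "I", "I"])]

X_train = [word_to_features(w) for w, _ in train_words]
y_train = [tags for _, tags in train_words]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)

# Predicted B/I tags can be mapped back to a syllable segmentation.
print(crf.predict([word_to_features("banana")]))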