2023
pdf
bib
abs
Iterative Back Translation Revisited: An Experimental Investigation for Low-resource English Assamese Neural Machine Translation
Mazida Akhtara Ahmed
|
Kishore Kashyap
|
Kuwali Talukdar
|
Parvez Aziz Boruah
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Back Translation has been an effective strategy to leverage monolingual data both on the source and target sides. Research have opened up several ways to improvise the procedure, one among them is iterative back translation where the monolingual data is repeatedly translated and used for re-training for the model enhancement. Despite its success, iterative back translation remains relatively unexplored in low-resource scenarios, particularly for rich Indic languages. This paper presents a comprehensive investigation into the application of iterative back translation to the low-resource English-Assamese language pair. A simplified version of iterative back translation is presented. This study explores various critical aspects associated with back translation, including the balance between original and synthetic data and the refinement of the target (backward) model through cleaner data retraining. The experimental results demonstrate significant improvements in translation quality. Specifically, the simplistic approach to iterative back translation yields a noteworthy +6.38 BLEU score improvement for the EnglishAssamese translation direction and a +4.38 BLEU score improvement for the AssameseEnglish translation direction. Further enhancements are further noticed when incorporating higher-quality, cleaner data for model retraining highlighting the potential of iterative back translation as a valuable tool for enhancing low-resource neural machine translation (NMT).
pdf
bib
abs
Neural Machine Translation for a Low Resource Language Pair: English-Bodo
Parvez Aziz Boruah
|
Kuwali Talukdar
|
Mazida Akhtara Ahmed
|
Kishore Kashyap
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
This paper represent a work done on Neural Machine Translation for English and Bodo language pair. English is a language spoken around the world whereas, Bodo is a language mostly spoken in North Eastern area of India. This work of machine translation is done on a relatively small size of parallel data as there is less parallel corpus available for english bodo pair. Corpus is generally taken from available source National Platform of Language Technology(NPLT), Data Management Unit(DMU), Mission Bhashini, Ministry of Electronics and Information Technology and also generated internally in-house. Tokenization of raw text is done using IndicNLP library and mosesdecoder for Bodo and English respectively. Subword tokenization is performed by using BPE(Byte Pair Encoder) , Sentencepiece and Wordpiece subword. Experiments have been done on two different vocab size of 8000 and 16000 on a total of around 92410 parallel sentences. Two standard transformer encoder and decoder models with varying number of layers and hidden size are build for training the data using OpenNMT-py framework. The result are evaluated based on the BLEU score on an additional testset for evaluating the performance. The highest BLEU score of 11.01 and 14.62 are achieved on the testset for English to Bodo and Bodo to English translation respectively.
pdf
bib
abs
PoS to UPoS Conversion and Creation of UPoS Tagged Resources for Assamese Language
Kuwali Talukdar
|
Shikhar Kumar Sarma
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
This paper addresses the vital task of transitioning from traditional Part-of-Speech (PoS) tagging to Universal Part-of-Speech (UPoS) tagging within the linguistic context of the Assamese language. The paper outlines a comprehensive methodology for PoS to UPoS conversion and the creation of UPoS tagged resources, bridging the gap between localized linguistic analysis and universal standards. The significance of this work lies in its potential to enhance natural language processing and understanding for the Assamese language, contributing to broader multilingual applications. The paper details the data preparation and creation processes, annotation methods, and evaluation techniques, shedding light on the challenges and opportunities presented in the pursuit of linguistic universality. The contents of this research have implications for improving language technology in the Assamese language and can serve as a model for similar work in other regional languages. Mapping of standard PoS tagset applicable for Indian languages to that of the primary categories of the UPoS tagset is done with respect to the Assamese language lexical behaviour. Conversion of PoS tagged text corpus to UPoS taged corpus using this mapping, and then utilizing a Deep Learning based model trained on such a dataset to create a sizable UPoS tagged corpus, are presented in a structured flow. This paper is a step towards a more standardized, universal understanding of linguistic elements in a diverse and multilingual world.
pdf
bib
abs
A Baseline System for Khasi and Assamese Bidirectional NMT with Zero available Parallel Data: Dataset Creation and System Development
Kishore Kashyap
|
Kuwali Talukdar
|
Mazida Akhtara Ahmed
|
Parvez Aziz Boruah
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
In this work we have tried to build a baseline Neural Machine Translation system for Khasi and Assamese in both directions. Both the languages are considered as low-resourced Indic languages. As per the language family in concerned, Assamese is a language from IndoAryan family and Khasi belongs to the MonKhmer branch of the Austroasiatic language family. No prior work is done which investigate the performance of Neural Machine Translation for these two diverse low-resourced languages. It is also worth mentioning that no parallel corpus and test data is available for these two languages. The main contribution of this work is the creation of Khasi-Assamese parallel corpus and test set. Apart from this, we also created baseline systems in both directions for the said language pair. We got best bilingual evaluation understudy (BLEU) score of 2.78 for Khasi to Assamese translation direction and 5.51 for Assamese to Khasi translation direction. We then applied phrase table injection (phrase augmentation) technique and got new higher BLEU score of 5.01 and 7.28 for Khasi to Assamese and Assamese to Khasi translation direction respectively.
pdf
bib
abs
Parts of Speech (PoS) and Universal Parts of Speech (UPoS) Tagging: A Critical Review with Special Reference to Low Resource Languages
Kuwali Talukdar
|
Shikhar Kumar Sarma
|
Manash Pratim Bhuyan
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Universal Parts of Speech (UPoS) tags are parts of speech annotations used in Universal Dependencies. Universal Dependency (UD) helps in developing cross-linguistically consistent treebank annotations for multiple languages with a common framework and standard. For various Natural Language Processing (NLP) tasks and research such as semantic parsing, syntactic parsing as well as linguistic parsing, UD treebanks are becoming increasingly important resources. A lot of interest has been seen in adopting UD and UPoS standards and resources for integrating with various NLP techniques, including Machine Translations, Question Answering, Sentiment Analysis etc. Consequently, a wide variety of Artificial Intelligence (AI) and NLP tools are being created with UD and UPoS standards on board. Part of Speech (PoS) tagging is one of the fundamental NLP tasks, which labels a specific sentence or set of words in a paragraph with lexical and grammatical annotations, based on the context of the sentence. Contemporary Machine Learning (ML) and Deep Learning (DL) techniques require god quality tagged resources for training potential tagger models. Low resource languages face serious challenges here. This paper discusses about the UPoS in UD and presents a concise yet inclusive piece of literature regarding UPoS, PoS, and various taggers for multiple languages with special reference to various low resource languages. Already adopted approaches and models developed for different low resource languages are included in this review, considering representations from a wide variety of languages. Also, the study offers a comprehensive classification based on the well-known ML and DL techniques used in the development of part-of-speech taggers. This will serve as a ready-reference for understanding nuances of PoS and UPoS tagging.
pdf
bib
abs
Neural Machine Translation for Assamese-Bodo, a Low Resourced Indian Language Pair
Kuwali Talukdar
|
Shikhar Kumar Sarma
|
Farha Naznin
|
Kishore Kashyap
|
Mazida Akhtara Ahmed
|
Parvez Aziz Boruah
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Impressive results have been reported in various works related to low resource languages, using Neural Machine Translation (NMT), where size of parallel dataset is relatively low. This work presents the experiment of Machine Translation in the low resource Indian language pair AssameseBodo, with a relatively low amount of parallel data. Tokenization of raw data is done with IndicNLP tool. NMT model is trained with preprocessed dataset, and model performances have been observed with varying hyper parameters. Experiments have been completed with Vocab Size 8000 and 16000. Significant increase in BLEU score has been observed in doubling the Vocab size. Also data size increase has contributed to enhanced overall performances. BLEU scores have been recorded with training on a data set of 70000 parallel sentences, and the results are compared with another round of training with a data set enhanced with 11500 Wordnet parallel data. A gold standard test data set of 500 sentence size has been used for recording BLEU. First round reported an overall BLEU of 4.0, with vocab size of 8000. With same vocab size, and Wordnet enhanced dataset, BLEU score of 4.33 was recorded. Significant increase of BLEU score (6.94) has been observed with vocab size of 16000. Next round of experiment was done with additional 7000 new data, and filtering the entire dataset. New BLEU recorded was 9.68, with 16000 vocab size. Cross validation has also been designed and performed with an experiment with 8-fold data chunks prepared on 80K total dataset. Impressive BLEU scores of (Fold-1 through fold-8) 18.12, 16.28, 18.90, 19.25, 19.60, 18.43, 16.28, and 7.70 have been recorded. The 8th fold BLEU deviated from the trend, might be because of nonhomogeneous last fold data.
pdf
bib
abs
GUIT-NLP’s Submission to Shared Task: Low Resource Indic Language Translation
Mazida Ahmed
|
Kuwali Talukdar
|
Parvez Boruah
|
Prof. Shikhar Kumar Sarma
|
Kishore Kashyap
Proceedings of the Eighth Conference on Machine Translation
This paper describes the submission of the GUIT-NLP team in the “Shared Task: Low Resource Indic Language Translation” focusing on three low-resource language pairs: English-Mizo, English-Khasi, and English-Assamese. The initial phase involves an in-depth exploration of Neural Machine Translation (NMT) techniques tailored to the available data. Within this investigation, various Subword Tokenization approaches, model configurations (exploring differnt hyper-parameters etc.) of the general NMT pipeline are tested to identify the most effective method. Subsequently, we address the challenge of low-resource languages by leveraging monolingual data through an innovative and systematic application of the Back Translation technique for English-Mizo. During model training, the monolingual data is progressively integrated into the original bilingual dataset, with each iteration yielding higher-quality back translations. This iterative approach significantly enhances the model’s performance, resulting in a notable increase of +3.65 in BLEU scores. Further improvements of +5.59 are achieved through fine-tuning using authentic parallel data.