Thoudam Doren Singh



2021

Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)
Thoudam Doren Singh | Cristina España i Bonet | Sivaji Bandyopadhyay | Josef van Genabith
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)

Multiple Captions Embellished Multilingual Multi-Modal Neural Machine Translation
Salam Michael Singh | Loitongbam Sanayai Meetei | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)

Neural machine translation based on bilingual text with limited training data suffers from limited lexical diversity, which lowers rare word translation accuracy and reduces the generalizability of the translation system. In this work, we utilise the multiple captions from the Multi-30K dataset to increase lexical diversity, aided by the cross-lingual transfer of information among the languages in a multilingual setup. In this multilingual and multimodal setting, the inclusion of visual features boosts the translation quality by a significant margin. An empirical study affirms that our proposed multimodal approach achieves a substantial gain in terms of the automatic score and shows robustness in handling rare word translation in the context of English to/from Hindi and Telugu translation tasks.

Low Resource Multimodal Neural Machine Translation of English-Hindi in News Domain
Loitongbam Sanayai Meetei | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)

Incorporating multiple input modalities in a machine translation (MT) system is gaining popularity among MT researchers. Unlike the publicly available datasets for Multimodal Machine Translation (MMT) tasks, where the captions are short image descriptions, news captions provide a more detailed description of the contents of the images. As a result, numerous named entities relating to specific persons, locations, etc., are found. In this paper, we acquire two monolingual news datasets, reported in English and Hindi and paired with images, to generate a synthetic English-Hindi parallel corpus. The parallel corpus is used to train an English-Hindi Neural Machine Translation (NMT) system and an English-Hindi MMT system, the latter by incorporating the image features paired with the corresponding parallel corpus. We also conduct a systematic analysis to evaluate the English-Hindi MT systems 1) with more synthetic data and 2) with added back-translated data. Our findings show improvements in BLEU score for both the NMT (+8.05) and MMT (+11.03) systems.

2020

NITS-Hinglish-SentiMix at SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text Using an Ensemble Model
Subhra Jyoti Baroi | Nivedita Singh | Ringki Das | Thoudam Doren Singh
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Sentiment analysis refers to the process of interpreting what a sentence emotes and classifying it as positive, negative, or neutral. The widespread popularity of social media has led to the generation of a lot of text data. In the Indian social media scenario specifically, code-mixed Hinglish text, i.e., words of the Hindi language written in the Roman script along with English words, is a common sight. The ability to effectively understand the sentiments in these texts is much needed. This paper proposes a system titled NITS-Hinglish to carry out sentiment analysis of such code-mixed Hinglish text. The system achieves a final F-score of 0.617 on the test data.

The NITS-CNLP System for the Unsupervised MT Task at WMT 2020
Salam Michael Singh | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the Fifth Conference on Machine Translation

We describe NITS-CNLP’s submission to the WMT 2020 unsupervised machine translation shared task for German (de) to Upper Sorbian (hsb) in a constrained setting, i.e., using only the data provided by the organizers. We train our unsupervised model using monolingual data from both languages by jointly pre-training the encoder and decoder, and fine-tune it using a backtranslation loss. The final model uses the source-side (de) monolingual data and the target-side (hsb) synthetic data as pseudo-parallel data to train a pseudo-supervised system, which is tuned using the provided development set (dev set).

Unsupervised Neural Machine Translation for English and Manipuri
Salam Michael Singh | Thoudam Doren Singh
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

The availability of bitext datasets has been a key challenge for conventional machine translation systems, which require large amounts of parallel data. In this work, we devise an unsupervised neural machine translation (UNMT) system consisting of a transformer-based shared encoder and language-specific decoders, using a denoising autoencoder and backtranslation, with additional multiple test references on the Manipuri side. We report our work in a low-resource setting for the English (en) - Manipuri (mni) language pair and attain BLEU scores of 3.1 for en-mni and 2.7 for mni-en. Subjective evaluation of the translated output gives encouraging findings.

English to Manipuri and Mizo Post-Editing Effort and its Impact on Low Resource Machine Translation
Loitongbam Sanayai Meetei | Thoudam Doren Singh | Sivaji Bandyopadhyay | Mihaela Vela | Josef van Genabith
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

We present the first study on the post-editing (PE) effort required to build a parallel dataset for English-Manipuri and English-Mizo, in the context of a project on creating data for machine translation (MT). English source text from a local daily newspaper is machine translated into Manipuri and Mizo using PBSMT systems built in-house. A Computer Assisted Translation (CAT) tool is used to record the time, keystrokes and other indicators to measure PE effort in terms of temporal and technical effort. A positive correlation between the technical effort and the number of function words is seen for English-Manipuri and English-Mizo, but a negative correlation between the technical effort and the number of noun words is seen for English-Mizo. However, average time spent per token in PE English-Mizo text is negatively correlated with the temporal effort. These results are mainly due to (i) English and Mizo using the same script, while Manipuri uses a different script, and (ii) the agglutinative nature of Manipuri. Further, we check the impact of training an MT system in an incremental approach, by including the post-edited dataset as additional training data. The result shows an increase in HBLEU of up to 4.6 for English-Manipuri.

2019

WAT2019: English-Hindi Translation on Hindi Visual Genome Dataset
Loitongbam Sanayai Meetei | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the 6th Workshop on Asian Translation

Multimodal translation is the task of translating a source language to a target language with the help of a parallel text corpus paired with images that represent the contextual details of the text. In this paper, we carried out an extensive comparison to evaluate the benefits of using a multimodal approach for translating English text into a low resource language, Hindi, as part of the WAT2019 shared task. We carried out the translation of English to Hindi in three separate tasks with both the evaluation and challenge datasets: first by using only the parallel text corpora, then through an image caption generation approach, and finally with the multimodal approach. Our experiments show a significant improvement in results with the multimodal approach over the other approaches.

2015

An Empirical Study of Diversity of Word Alignment and its Symmetrization Techniques for System Combination
Thoudam Doren Singh
Proceedings of the 12th International Conference on Natural Language Processing

2013

Taste of Two Different Flavours: Which Manipuri Script works better for English-Manipuri Language pair SMT Systems?
Thoudam Doren Singh
Proceedings of the Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation

2012

Addressing some Issues of Data Sparsity towards Improving English-Manipuri SMT using Morphological Information
Thoudam Doren Singh
Workshop on Monolingual Machine Translation

The performance of an SMT system depends heavily on the availability of large parallel corpora. The unavailability of these resources in the required amount for many language pairs is a challenging issue. The required size of the resource is considerably larger for SMT systems involving a morphologically rich and highly agglutinative language. This paper investigates some of the issues in enriching the resources for this kind of language. Handling the inflectional and derivational morphemes of the morphologically rich target language plays an important role in the enrichment process. Mapping from the source to the target side is carried out for the English-Manipuri SMT task using a factored model. The SMT system developed shows improved performance over the baseline system in terms of both automatic scoring and subjective evaluation.

Bidirectional Bengali Script and Meetei Mayek Transliteration of Web Based Manipuri News Corpus
Thoudam Doren Singh
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

2011

Integration of Reduplicated Multiword Expressions and Named Entities in a Phrase Based Statistical Machine Translation System
Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

Statistical Machine Translation of English-Manipuri using Morpho-syntactic and Semantic Information
Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Student Research Workshop

The English-Manipuri language pair is rarely investigated and has restricted bilingual resources. The development of a factored Statistical Machine Translation (SMT) system between English as the source and Manipuri, a morphologically rich language, as the target is reported. The suffixes and dependency relations on the source side and the case markers on the target side are identified as important translation factors. The morphology and dependency relations play important roles in improving the translation quality. A parallel corpus of 10350 sentences from the news domain is used for training, and the system is tested with 500 sentences. Using the proposed translation factors, the translation quality is improved, as indicated by the BLEU score and subjective evaluation.

Web Based Manipuri Corpus for Multiword NER and Reduplicated MWEs Identification using SVM
Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing

Manipuri-English Bidirectional Statistical Machine Translation Systems using Morphology and Dependency Relations
Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

2009

Named Entity Recognition for Manipuri Using Support Vector Machine
Thoudam Doren Singh | Kishorjit Nongmeikapam | Asif Ekbal | Sivaji Bandyopadhyay
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

2008

Morphology Driven Manipuri POS Tagger
Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages