Rejwanul Haque


2021

pdf bib
Investigating Active Learning in Interactive Neural Machine Translation
Kamal Gupta | Dhanvanth Boppana | Rejwanul Haque | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of Machine Translation Summit XVIII: Research Track

Interactive-predictive translation is a collaborative iterative process and where human translators produce translations with the help of machine translation (MT) systems interactively. Various sampling techniques in active learning (AL) exist to update the neural MT (NMT) model in the interactive-predictive scenario. In this paper and we explore term based (named entity count (NEC)) and quality based (quality estimation (QE) and sentence similarity (Sim)) sampling techniques – which are used to find the ideal candidates from the incoming data – for human supervision and MT model’s weight updation. We carried out experiments with three language pairs and viz. German-English and Spanish-English and Hindi-English. Our proposed sampling technique yields 1.82 and 0.77 and 0.81 BLEU points improvements for German-English and Spanish-English and Hindi-English and respectively and over random sampling based baseline. It also improves the present state-of-the-art by 0.35 and 0.12 BLEU points for German-English and Spanish-English and respectively. Human editing effort in terms of number-of-words-changed also improves by 5 and 4 points for German-English and Spanish-English and respectively and compared to the state-of-the-art.

2020

pdf bib
Arabisc: Context-Sensitive Neural Spelling Checker
Yasmin Moslem | Rejwanul Haque | Andy Way
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications

Traditional statistical approaches to spelling correction usually consist of two consecutive processes — error detection and correction — and they are generally computationally intensive. Current state-of-the-art neural spelling correction models usually attempt to correct spelling errors directly over an entire sentence, which, as a consequence, lacks control of the process, e.g. they are prone to overcorrection. In recent years, recurrent neural networks (RNNs), in particular long short-term memory (LSTM) hidden units, have proven increasingly popular and powerful models for many natural language processing (NLP) problems. Accordingly, we made use of a bidirectional LSTM language model (LM) for our context-sensitive spelling detection and correction model which is shown to have much control over the correction process. While the use of LMs for spelling checking and correction is not new to this line of NLP research, our proposed approach makes better use of the rich neighbouring context, not only from before the word to be corrected, but also after it, via a dual-input deep LSTM network. Although in theory our proposed approach can be applied to any language, we carried out our experiments on Arabic, which we believe adds additional value given the fact that there are limited linguistic resources readily available in Arabic in comparison to many languages. Our experimental results demonstrate that the proposed methods are effective in both improving the quality of correction suggestions and minimising overcorrection.

pdf bib
Investigating Low-resource Machine Translation for English-to-Tamil
Akshai Ramesh | Venkatesh Balavadhani parthasa | Rejwanul Haque | Andy Way
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

Statistical machine translation (SMT) which was the dominant paradigm in machine translation (MT) research for nearly three decades has recently been superseded by the end-to-end deep learning approaches to MT. Although deep neural models produce state-of-the-art results in many translation tasks, they are found to under-perform on resource-poor scenarios. Despite some success, none of the present-day benchmarks that have tried to overcome this problem can be regarded as a universal solution to the problem of translation of many low-resource languages. In this work, we investigate the performance of phrase-based SMT (PB-SMT) and neural MT (NMT) on a rarely-tested low-resource language-pair, English-to-Tamil, taking a specialised data domain (software localisation) into consideration. In particular, we produce rankings of our MT systems via a social media platform-based human evaluation scheme, and demonstrate our findings in the low-resource domain-specific text translation task.

pdf bib
Identifying Complaints from Product Reviews: A Case Study on Hindi
Raghvendra Pratap Singh | Rejwanul Haque | Mohammed Hasanuzzaman | Andy Way
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Automatic recognition of customer complaints on products or services that they purchase can be crucial for the organisations, multinationals and online retailers since they can exploit this information to fulfil their customers’ expectations including managing and resolving the complaints. Recently, researchers have applied supervised learning strategies to automatically identify users’ complaints expressed in English on Twitter. The downside of these approaches is that they require labeled training data for learning, which is expensive to create. This poses a barrier for them being applied to low-resource languages and domains for which task-specific data is not available. Machine translation (MT) can be used as an alternative to the tools that require such task-specific data. In this work, we use state-of-the-art neural MT (NMT) models for translating Hindi reviews into English and investigate performance of the downstream classification task (complaints identification) on their English translations.

pdf bib
Terminology-Aware Sentence Mining for NMT Domain Adaptation: ADAPT’s Submission to the Adap-MT 2020 English-to-Hindi AI Translation Shared Task
Rejwanul Haque | Yasmin Moslem | Andy Way
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task

This paper describes the ADAPT Centre’s submission to the Adap-MT 2020 AI Translation Shared Task for English-to-Hindi. The neural machine translation (NMT) systems that we built to translate AI domain texts are state-of-the-art Transformer models. In order to improve the translation quality of our NMT systems, we made use of both in-domain and out-of-domain data for training and employed different fine-tuning techniques for adapting our NMT systems to this task, e.g. mixed fine-tuning and on-the-fly self-training. For this, we mined parallel sentence pairs and monolingual sentences from large out-of-domain data, and the mining process was facilitated through automatic extraction of terminology from the in-domain data. This paper outlines the experiments we carried out for this task and reports the performance of our NMT systems on the evaluation test set.

pdf bib
Modelling Source- and Target- Language Syntactic Information as Conditional Context in Interactive Neural Machine Translation
Kamal Kumar Gupta | Rejwanul Haque | Asif Ekbal | Pushpak Bhattacharyya | Andy Way
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

In interactive machine translation (MT), human translators correct errors in automatic translations in collaboration with the MT systems, which is seen as an effective way to improve the productivity gain in translation. In this study, we model source-language syntactic constituency parse and target-language syntactic descriptions in the form of supertags as conditional context for interactive prediction in neural MT (NMT). We found that the supertags significantly improve productivity gain in translation in interactive-predictive NMT (INMT), while syntactic parsing somewhat found to be effective in reducing human effort in translation. Furthermore, when we model this source- and target-language syntactic information together as the conditional context, both types complement each other and our fully syntax-informed INMT model statistically significantly reduces human efforts in a French–to–English translation task, achieving 4.30 points absolute (corresponding to 9.18% relative) improvement in terms of word prediction accuracy (WPA) and 4.84 points absolute (corresponding to 9.01% relative) reduction in terms of word stroke ratio (WSR) over the baseline.

pdf bib
The ADAPT System Description for the WMT20 News Translation Task
Venkatesh Parthasarathy | Akshai Ramesh | Rejwanul Haque | Andy Way
Proceedings of the Fifth Conference on Machine Translation

This paper describes the ADAPT Centre’s submissions to the WMT20 News translation shared task for English-to-Tamil and Tamil-to-English. We present our machine translation (MT) systems that were built using the state-of-the-art neural MT (NMT) model, Transformer. We applied various strategies in order to improve our baseline MT systems, e.g. onolin- gual sentence selection for creating synthetic training data, mining monolingual sentences for adapting our MT systems to the task, hyperparameters search for Transformer in lowresource scenarios. Our experiments show that adding the aforementioned techniques to the baseline yields an excellent performance in the English-to-Tamil and Tamil-to-English translation tasks.

pdf bib
The ADAPT’s Submissions to the WMT20 Biomedical Translation Task
Prashant Nayak | Rejwanul Haque | Andy Way
Proceedings of the Fifth Conference on Machine Translation

This paper describes the ADAPT Centre’s submissions to the WMT20 Biomedical Translation Shared Task for English-to-Basque. We present the machine translation (MT) systems that were built to translate scientific abstracts and terms from biomedical terminologies, and using the state-of-the-art neural MT (NMT) model: Transformer. In order to improve our baseline NMT system, we employ a number of methods, e.g. “pseudo” parallel data selection, monolingual data selection for synthetic corpus creation, mining monolingual sentences for adapting our NMT systems to this task, hyperparameters search for Transformer in lowresource scenarios. Our experiments show that systematic addition of the aforementioned techniques to the baseline yields an excellent performance in the English-to-Basque translation task.

pdf bib
The ADAPT Centre’s Participation in WAT 2020 English-to-Odia Translation Task
Prashanth Nayak | Rejwanul Haque | Andy Way
Proceedings of the 7th Workshop on Asian Translation

This paper describes the ADAPT Centre sub-missions to WAT 2020 for the English-to-Odia translation task. We present the approaches that we followed to try to build competitive machine translation (MT) systems for English-to-Odia. Our approaches include monolingual data selection for creating synthetic data and identifying optimal sets of hyperparameters for the Transformer in a low-resource scenario. Our best MT system produces 4.96BLEU points on the evaluation test set in the English-to-Odia translation task.

pdf bib
The ADAPT Centre’s Neural MT Systems for the WAT 2020 Document-Level Translation Task
Wandri Jooste | Rejwanul Haque | Andy Way
Proceedings of the 7th Workshop on Asian Translation

In this paper we describe the ADAPT Centre’s submissions to the WAT 2020 document-level Business Scene Dialogue (BSD) Translation task. We only consider translating from Japanese to English for this task and we use the MarianNMT toolkit to train Transformer models. In order to improve the translation quality, we made use of both in-domain and out-of-domain data for training our Machine Translation (MT) systems, as well as various data augmentation techniques for fine-tuning the model parameters. This paper outlines the experiments we ran to train our systems and report the accuracy achieved through these various experiments.

pdf bib
An Error-based Investigation of Statistical and Neural Machine Translation Performance on Hindi-to-Tamil and English-to-Tamil
Akshai Ramesh | Venkatesh Balavadhani Parthasa | Rejwanul Haque | Andy Way
Proceedings of the 7th Workshop on Asian Translation

Statistical machine translation (SMT) was the state-of-the-art in machine translation (MT) research for more than two decades, but has since been superseded by neural MT (NMT). Despite producing state-of-the-art results in many translation tasks, neural models underperform in resource-poor scenarios. Despite some success, none of the present-day benchmarks that have tried to overcome this problem can be regarded as a universal solution to the problem of translation of many low-resource languages. In this work, we investigate the performance of phrase-based SMT (PB-SMT) and NMT on two rarely-tested low-resource language-pairs, English-to-Tamil and Hindi-to-Tamil, taking a specialised data domain (software localisation) into consideration. This paper demonstrates our findings including the identification of several issues of the current neural approaches to low-resource domain-specific text translation.

pdf bib
The ADAPT System Description for the STAPLE 2020 English-to-Portuguese Translation Task
Rejwanul Haque | Yasmin Moslem | Andy Way
Proceedings of the Fourth Workshop on Neural Generation and Translation

This paper describes the ADAPT Centre’s submission to STAPLE (Simultaneous Translation and Paraphrase for Language Education) 2020, a shared task of the 4th Workshop on Neural Generation and Translation (WNGT), for the English-to-Portuguese translation task. In this shared task, the participants were asked to produce high-coverage sets of plausible translations given English prompts (input source sentences). We present our English-to-Portuguese machine translation (MT) models that were built applying various strategies, e.g. data and sentence selection, monolingual MT for generating alternative translations, and combining multiple n-best translations. Our experiments show that adding the aforementioned techniques to the baseline yields an excellent performance in the English-to-Portuguese translation task.

2019

pdf bib
Investigating Terminology Translation in Statistical and Neural Machine Translation: A Case Study on English-to-Hindi and Hindi-to-English
Rejwanul Haque | Md Hasanuzzaman | Andy Way
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Terminology translation plays a critical role in domain-specific machine translation (MT). In this paper, we conduct a comparative qualitative evaluation on terminology translation in phrase-based statistical MT (PB-SMT) and neural MT (NMT) in two translation directions: English-to-Hindi and Hindi-to-English. For this, we select a test set from a legal domain corpus and create a gold standard for evaluating terminology translation in MT. We also propose an error typology taking the terminology translation errors into consideration. We evaluate the MT systems’ performance on terminology translation, and demonstrate our findings, unraveling strengths, weaknesses, and similarities of PB-SMT and NMT in the area of term translation.

2014

pdf bib
DCU-Lingo24 Participation in WMT 2014 Hindi-English Translation task
Xiaofeng Wu | Rejwanul Haque | Tsuyoshi Okita | Piyush Arora | Andy Way | Qun Liu
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Bilingual Termbank Creation via Log-Likelihood Comparison and Phrase-Based Statistical Machine Translation
Rejwanul Haque | Sergio Penkale | Andy Way
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)

2012

pdf bib
Translating User-Generated Content in the Social Networking Space
Jie Jiang | Andy Way | Rejwanul Haque
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Commercial MT User Program

This paper presents a case-study of work done by Applied Language Solutions (ALS) for a large social networking provider who claim to have built the world’s first multi-language social network, where Internet users from all over the world can communicate in languages that are available in the system. In an initial phase, the social networking provider contracted ALS to build Machine Translation (MT) engines for twelve language-pairs: Russian⇔English, Russian⇔Turkish, Russian⇔Arabic, Turkish⇔English, Turkish⇔Arabic and Arabic⇔English. All of the input data is user-generated content, so we faced a number of problems in building large-scale, robust, high-quality engines. Primarily, much of the source-language data is of ‘poor’ or at least ‘non-standard’ quality. This comes in many forms: (i) content produced by non-native speakers, (ii) content produced by native speakers containing non-deliberate typos, or (iii) content produced by native speakers which deliberately departs from spelling norms to bring about some linguistic effect. Accordingly, in addition to the ‘regular’ pre-processing techniques used in the building of our statistical MT systems, we needed to develop routines to deal with all these scenarios. In this paper, we describe how we handle shortforms, acronyms, typos, punctuation errors, non-dictionary slang, wordplay, censor avoidance and emoticons. We demonstrate automatic evaluation scores on the social network data, together with insights from the the social networking provider regarding some of the typical errors made by the MT engines, and how we managed to correct these in the engines.

pdf bib
Monolingual Data Optimisation for Bootstrapping SMT Engines
Jie Jiang | Andy Way | Nelson Ng | Rejwanul Haque | Mike Dillinger | Jun Lu
Workshop on Monolingual Machine Translation

Content localisation via machine translation (MT) is a sine qua non, especially for international online business. While most applications utilise rule-based solutions due to the lack of suitable in-domain parallel corpora for statistical MT (SMT) training, in this paper we investigate the possibility of applying SMT where huge amounts of monolingual content only are available. We describe a case study where an analysis of a very large amount of monolingual online trading data from eBay is conducted by ALS with a view to reducing this corpus to the most representative sample in order to ensure the widest possible coverage of the total data set. Furthermore, minimal yet optimal sets of sentences/words/terms are selected for generation of initial translation units for future SMT system-building.

2010

pdf bib
MATREX: The DCU MT System for WMT 2010
Sergio Penkale | Rejwanul Haque | Sandipan Dandapat | Pratyush Banerjee | Ankit K. Srivastava | Jinhua Du | Pavel Pecina | Sudip Kumar Naskar | Mikel L. Forcada | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
Supertags as Source Language Context in Hierarchical Phrase-Based SMT
Rejwanul Haque | Sudip Naskar | Antal van den Bosch | Andy Way
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

Statistical machine translation (SMT) models have recently begun to include source context modeling, under the assumption that the proper lexical choice of the translation for an ambiguous word can be determined from the context in which it appears. Various types of lexical and syntactic features have been explored as effective source context to improve phrase selection in SMT. In the present work, we introduce lexico-syntactic descriptions in the form of supertags as source-side context features in the state-of-the-art hierarchical phrase-based SMT (HPB) model. These features enable us to exploit source similarity in addition to target similarity, as modelled by the language model. In our experiments two kinds of supertags are employed: those from lexicalized tree-adjoining grammar (LTAG) and combinatory categorial grammar (CCG). We use a memory-based classification framework that enables the efficient estimation of these features. Despite the differences between the two supertagging approaches, they give similar improvements. We evaluate the performance of our approach on an English-to-Dutch translation task, and report statistically significant improvements of 4.48% and 6.3% BLEU scores in translation quality when adding CCG and LTAG supertags, respectively, as context-informed features.

2009

pdf bib
Using Supertags as Source Language Context in SMT
Rejwanul Haque | Sudip Kumar Naskar | Yanjun Ma | Andy Way
Proceedings of the 13th Annual conference of the European Association for Machine Translation

pdf bib
Dependency Relations as Source Context in Phrase-Based SMT
Rejwanul Haque | Sudip Kumar Naskar | Antal van den Bosch | Andy Way
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

pdf bib
Experiments on Domain Adaptation for English–Hindi SMT
Rejwanul Haque | Sudip Kumar Naskar | Josef van Genabith | Andy Way
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

pdf bib
English-Hindi Transliteration Using Context-Informed PB-SMT: the DCU System for NEWS 2009
Rejwanul Haque | Sandipan Dandapat | Ankit Kumar Srivastava | Sudip Kumar Naskar | Andy Way
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

2008

pdf bib
Named Entity Recognition in Bengali: A Conditional Random Field Approach
Asif Ekbal | Rejwanul Haque | Sivaji Bandyopadhyay
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf bib
Language Independent Named Entity Recognition in Indian Languages
Asif Ekbal | Rejwanul Haque | Amitava Das | Venkateswarlu Poka | Sivaji Bandyopadhyay
Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages

pdf bib
Bengali, Hindi and Telugu to English Ad-hoc Bilingual Task
Sivaji Bandyopadhyay | Tapabrata Mondal | Sudip Kumar Naskar | Asif Ekbal | Rejwanul Haque | Srinivasa Rao Godavarthy
Proceedings of the 2nd workshop on Cross Lingual Information Access (CLIA) Addressing the Information Need of Multilingual Societies