Improved English to Hindi Multimodal Neural Machine Translation

Machine translation performs automatic translation from one natural language to another. Neural machine translation is the state-of-the-art approach to machine translation, but it requires adequate training data, which is a severe problem for low-resource language pair translation. The concept of multimodality is introduced into neural machine translation (NMT) by merging textual features with visual features to improve low-resource pair translation. WAT2021 (Workshop on Asian Translation 2021) organizes a shared task of multimodal translation for English to Hindi. We participated in the same task with team name CNLP-NITS-PP in two submission tracks: multimodal and text-only NMT. This work investigates phrase pair injection via a data augmentation approach and attains improvement over our previous work at WAT2020 on the same task in both text-only and multimodal NMT. We achieved second rank on the challenge test set for English to Hindi multimodal translation, with a Bilingual Evaluation Understudy (BLEU) score of 39.28, a Rank-based Intuitive Bilingual Evaluation Score (RIBES) of 0.792097, and an Adequacy-Fluency Metrics (AMFM) score of 0.830230.

Introduction
Multimodal NMT (MNMT) intends to draw insights from the input data through different modalities like text, image, and audio. Combining information from more than one modality attempts to improve the quality of low-resource language translation. (Shah et al., 2016) show that combining the visual features of images with the corresponding textual features of the input bitext outperforms text-only translation. The encoder-decoder architecture is a widely used technique in the MT community for text-only NMT, as it handles issues such as variable-length phrases using sequence-to-sequence learning and the problem of long-term dependency using Long Short Term Memory (LSTM) (Sutskever et al., 2014). Nevertheless, the basic encoder-decoder architecture cannot encode all the information when it comes to very long sentences. The attention mechanism was proposed to handle this issue by attending to all source words locally and globally (Bahdanau et al., 2015; Luong et al., 2015). Attention-based NMT yields substantial performance for Indian language translation (Laskar et al., 2019a,b, 2020a, 2021b). Moreover, NMT performance can be enhanced by utilizing monolingual data (Sennrich et al., 2016; Zhang and Zong, 2016; Laskar et al., 2020b) and phrase pair injection (Sen et al., 2020), both effective in low-resource language pair translation. This paper targets English to Hindi translation using the multimodal concept, taking advantage of monolingual data and phrase pair injection to improve translation quality at the WAT2021 translation task.
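For reference, at each decoding step t the attention mechanism scores every encoder state h_i against the previous decoder state s_{t-1} and builds a context vector as a weighted sum of encoder states; a standard formulation (following Bahdanau et al., 2015) is:

e_{t,i} = a(s_{t-1}, h_i), \quad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k}\exp(e_{t,k})}, \quad
c_t = \sum_{i}\alpha_{t,i} h_i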

Related Works
For the English-Hindi language pair, the literature survey revealed few existing works on translation using multimodal NMT (Dutta Chowdhury et al., 2018; Sanayai Meetei et al., 2019; Laskar et al., 2019c). (Dutta Chowdhury et al., 2018) use synthetic data, following the multimodal NMT settings of (Calixto and Liu, 2017), and attain a BLEU score of 24.2 for Hindi to English translation. In the WAT2019 multimodal translation task of English to Hindi, we achieved the highest BLEU score of 20.37 on the challenge test set (Laskar et al., 2019c). This score was improved later in the WAT2020 task (Laskar et al., 2020c) to a BLEU score of 33.57 on the challenge test set; that system follows the multimodal NMT settings of (Calixto and Liu, 2017; Calixto et al., 2017) and utilizes pre-trained word embeddings of the monolingual corpus and additional parallel data of IITB. This work attempts to utilize phrase pairs (Sen et al., 2020) to enhance the translational performance of the WAT2021 English to Hindi multimodal translation task.

Dataset Description
We have used the Hindi Visual Genome 1.1 dataset provided by the WAT2021 organizers (Nakazawa et al., 2021; Parida et al., 2019). The dataset consists of a training set of 28,930 segments, a development set of 998 segments, an evaluation test set of 1,595 segments, and a challenge test set of 1,400 segments, where each segment comprises an image, a rectangular region within that image, the English caption describing the region, and its Hindi translation.

System Description
To build the multimodal and text-only NMT models, the OpenNMT-py (Klein et al., 2017) tool is used. There are four operations: data augmentation, preprocessing, training, and testing. Our multimodal NMT draws advantages from both image and textual features, along with phrase pairs and word embeddings.

Data Augmentation
In (Sen et al., 2020), the authors used SMT-based phrase pairs to augment the original parallel data to improve low-resource language pair translation. In SMT, the Giza++ word alignment tool is used to extract phrase pairs. Inspired by that work, we have extracted phrase pairs using Giza++. After removing duplicates and blank lines, the obtained phrase pairs are augmented to the original parallel data, as sketched below. The statistics of the extracted phrase pairs are given in Table 3. Additionally, the IITB parallel data is directly augmented with the original parallel data to expand the training data. The data augmentation pipeline is presented in Figure 1.
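A minimal sketch of this augmentation step, assuming the extracted phrase pairs are available as plain-text files with one phrase per line and source/target lines aligned; all file names are hypothetical:

```python
# Sketch of phrase-pair augmentation: drop blanks and duplicates,
# then append the surviving pairs to the original parallel data.
# File names are hypothetical.

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

def augment(src_parallel, tgt_parallel, src_phrases, tgt_phrases,
            out_src, out_tgt):
    pairs = list(zip(read_lines(src_parallel), read_lines(tgt_parallel)))
    seen = set(pairs)
    # Keep phrase pairs that are non-empty and not already present.
    for s, t in zip(read_lines(src_phrases), read_lines(tgt_phrases)):
        if s and t and (s, t) not in seen:
            seen.add((s, t))
            pairs.append((s, t))
    with open(out_src, "w", encoding="utf-8") as fs, \
         open(out_tgt, "w", encoding="utf-8") as ft:
        for s, t in pairs:
            fs.write(s + "\n")
            ft.write(t + "\n")

augment("train.en", "train.hi", "phrases.en", "phrases.hi",
        "train.aug.en", "train.aug.hi")
```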

Data Preprocessing
To extract visual features from the image data, we have used a publicly available pre-trained VGG19 CNN. The visual features are extracted independently for the train, validation, and test data. To take advantage of monolingual data in both the multimodal and text-only systems, GloVe (Pennington et al., 2014) is used to generate pre-trained word embeddings. For tokenization of the text data, the OpenNMT-py tool is utilized, obtaining a vocabulary size of 50,004 for source-target sentences. We have not used any word-segmentation technique.
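The sketch below illustrates the idea of the feature extraction step with a torchvision VGG19; our pipeline used the publicly available pre-trained model noted above, and the choice of taking the output of the convolutional stack as spatial features is an assumption here:

```python
# Illustrative sketch: spatial visual features from a pre-trained
# VGG19 (torchvision). The layer choice is an assumption.
import torch
from torchvision import models, transforms
from PIL import Image

vgg19 = models.vgg19(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path):
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = vgg19.features(img)         # (1, 512, 7, 7) conv features
    # Flatten the spatial grid: one 512-dim vector per image location,
    # so the decoder can attend over locations like source words.
    return fmap.squeeze(0).flatten(1).t()  # (49, 512)
```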

Training
The multimodal and text-only NMT systems are trained independently. During the multimodal training process, the extracted visual features and pre-trained word vectors are fine-tuned with the augmented parallel data. A bidirectional RNN (BRNN) encoder and a doubly-attentive RNN decoder, following the default settings of (Calixto and Liu, 2017; Calixto et al., 2017), are used for multimodal NMT. The BRNN uses two different RNNs, one for the backward and another for the forward direction, and two distinct attention mechanisms are utilized over source words and image features at a single decoder. The multimodal NMT is trained up to 40 epochs with 0.3 dropout and batch size 32 on a single GPU. During the training process of the text-only NMT, we use only textual data, i.e., pre-trained word vectors are fine-tuned with the augmented parallel data, and the model is trained up to 100,000 steps using a BRNN encoder and RNN decoder following the default settings of OpenNMT-py. The primary difference between our previous work (Laskar et al., 2020c) and this work is that the present work adds phrase pairs to the augmented parallel data.
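Schematically, at each decoding step t the doubly-attentive decoder computes one context vector over the source annotations h_i and another over the spatial image features a_j, and conditions the next hidden state on both (a simplified view of Calixto et al., 2017):

c_t^{src} = \sum_i \alpha_{t,i} h_i, \quad
c_t^{img} = \sum_j \beta_{t,j} a_j, \quad
s_t = f(s_{t-1}, y_{t-1}, c_t^{src}, c_t^{img})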

Testing
The trained NMT models, both multimodal and text-only, are tested independently on both test sets: the evaluation set and the challenge set. During testing, the only difference between text-only and multimodal NMT is that multimodal NMT additionally uses the visual features of the image test data.

Table 3: Statistics of the extracted phrase pairs.
Number of phrase pairs: 158,131
Tokens (in millions): En 0.392966, Hi 0.410696
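A minimal sketch of the decoding step for the text-only model, assuming the legacy OpenNMT-py 1.x command-line interface; the checkpoint and file names are hypothetical:

```python
# Sketch: translating both test sets with a trained text-only model
# via OpenNMT-py 1.x. Checkpoint and file names are hypothetical.
import subprocess

for split in ["evaluation", "challenge"]:
    subprocess.run([
        "python", "translate.py",
        "-model", "textonly_model.pt",
        "-src", f"test.{split}.en",
        "-output", f"pred.{split}.hi",
        "-replace_unk",
    ], check=True)
```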

Result and Analysis
The WAT2021 shared task organizers published the evaluation results of the multimodal translation task for English to Hindi, and our team secured second position in the multimodal submission for the challenge test set. Our team name is CNLP-NITS-PP, and we participated in the multimodal and text-only submission tracks of the same task. In both the multimodal and text-only submission tracks, a total of three teams participated on both the evaluation and challenge test data. The results are evaluated using automatic metrics: BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and AMFM (Banchs et al., 2015). The results of our system are reported in Table 4, and it is noticed that the multimodal NMT scores higher than the text-only NMT, since the combination of textual and visual features outperforms textual features alone. Furthermore, our system's results are improved compared to our previous work on the same task (Laskar et al., 2020c). Sample predictions are shown in Figures 2 and 3, where the image region corresponding to the given caption is highlighted by a red rectangular box.

Conclusion and Future Work
In this work, we have participated in the WAT2021 shared task on multimodal translation from English to Hindi, with submissions to two tracks: multimodal and text-only. This work investigates phrase pair injection through a data augmentation approach in both multimodal and text-only NMT, which shows better performance than our previous work on the same task (Laskar et al., 2020c). In future work, we will investigate a multilingual approach to improve the performance of multimodal NMT.