IITP at WAT 2021: System description for English-Hindi Multimodal Translation Task

Neural Machine Translation (NMT) is the predominant machine translation technology today because of its end-to-end trainable architecture. However, NMT still struggles to translate properly in low-resource settings, especially for distant language pairs. One way to overcome this is to use information from other modalities when available. The idea is that, despite differences in languages, both the source and target language speakers see the same thing, so the visual representation of the source and target is the same, which can positively assist the system. Multimodal information can help an NMT system improve translation by removing ambiguity in certain phrases or words. We participate in the English-Hindi multimodal translation task of the 8th Workshop on Asian Translation (WAT 2021) and achieve 42.47 and 37.50 BLEU points on the Evaluation and Challenge subsets, respectively.


Introduction
Recent progress in neural machine translation (NMT) focuses on translating a source language into a particular target language. Various methods have been proposed for this task and most of them deal with the textual data. There are certain drawbacks while performing machine translation using only textual datasets.
Humans perform translation based on language grounding: our sense of meaning emerges from interacting with the world. NMT methods have no mechanism for language grounding; thus they cannot capture the true meaning of sentences or phrases while translating them into other languages. For example, when a system needs to translate the word "cricket", it can be confused about whether it refers to the game cricket or the insect cricket, but visual information can resolve the ambiguity. Multi-modal translation aims to alleviate this issue by training an NMT model on textual data along with associated images to perform language grounding. This shared task deals with developing multi-modal NMT models for English-Hindi translation. The choice of languages depends on the following issues: i) Hindi is the most spoken language in India and the fourth most spoken language in the world with 600 million speakers. Despite the huge number of speakers, suitable resources in Hindi are limited due to various factors. ii) Automatic translation of text from one language to another is a difficult task, especially when one or both languages are resource-poor and distant from each other. In Multimodal NMT (MNMT), information from other modalities such as audio, image, and video is used along with text to generate the translation. For low-resource languages, this is particularly useful for improving low-quality translations: even though the vocabularies and grammar of two languages differ, their visual representation is the same. Several multi-modal methods have been proposed that exploit the features of the associated image for better translation. State-of-the-art methods might achieve better accuracy than the models we used; our main motivation for using simplistic models is to demonstrate a proof of concept for multi-modal translation among resource-poor language pairs.
We achieved good results on both the Challenge and Evaluation sets across different evaluation metrics, including BLEU, RIBES, and AMFM. In subsequent modifications, we aim to extend our models by incorporating several state-of-the-art features. The following sections describe our process in greater detail.

Related Works
There have been many attempts to use information beyond the source sentence for better translation. Uni-modal systems include document-level NMT (Wang et al., 2017), sentence-level NMT with contextual information (Gain et al., 2021), etc. Among multimodal systems, Huang et al. (2016) used an object detection system and extracted local and global image features, then used those image features as additional inputs to the encoder and decoder. Delbrouck and Dupont (2017) used an attention mechanism over visual inputs for the source hidden states. Other multimodal approaches have also been proposed (Lin et al., 2020). Su et al. (2018) demonstrated an unsupervised method based on a language translation cycle-consistency loss conditioned on the image, in order to learn bidirectional multi-modal translation simultaneously. Moreover, Su et al. (2021) showed that jointly learning text-image interaction with attentional networks is more useful than modeling the modalities separately. This result is in line with several state-of-the-art visual transformer models, such as VisualBERT, UNITER (Chen et al., 2019), etc.
Dataset

The multimodal dataset consists of an image along with a description of a certain rectangular portion of the image. We are given the coordinates of that portion. We aim to translate the description with the help of the image. An example from the multimodal dataset is given in Figure 1.
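Since each training instance pairs a sentence with a rectangular image region given by its coordinates, the core data operation is a simple crop. A minimal sketch, assuming a Visual Genome-style (x, y, width, height) bounding-box convention and representing the image as a nested list of pixel rows (the function name and conventions are illustrative, not from the paper):

```python
def crop_region(image, x, y, width, height):
    """Return the rectangular sub-image covering rows y..y+height-1
    and columns x..x+width-1 of a row-major pixel grid."""
    return [row[x:x + width] for row in image[y:y + height]]

# Toy 4x4 "image" whose pixel at (row r, column c) is r * 4 + c.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
region = crop_region(image, x=1, y=2, width=2, height=2)
```

In practice the same crop would be applied to the raw image file before feature extraction, so that only the described region is encoded.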

Pre-processing
For the text data, we lowercase all the utterances. Then we learn a joint byte-pair encoding (Sennrich et al., 2016) over the combined source and target with a vocabulary size of 10,000. We process the images by cropping the specified rectangular portions; this operation discards the regions that do not contribute much to translation performance. From these cropped images, we use the pre-trained VGG19-bn (Simonyan and Zisserman, 2015) to obtain image representations. We use the OpenNMT-py (Klein et al., 2017) framework to perform this step.
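The joint BPE step repeatedly merges the most frequent adjacent symbol pair in the combined source-target corpus until the desired number of merges is reached. A toy pure-Python sketch of the merge-learning loop (the paper used the standard implementation of Sennrich et al. (2016) with 10,000 merges; this simplified version omits word-boundary markers):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus maps a word (tuple of symbols) to its frequency;
    return the most frequent adjacent symbol pair, or None."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(corpus, pair):
    """Rewrite every word so occurrences of `pair` become one symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Return the ordered list of learned merge operations."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(corpus)
        if pair is None:
            break
        merges.append(pair)
        corpus = merge_pair(corpus, pair)
    return merges
```

Learning the BPE jointly over both languages means frequent shared character sequences (e.g. named entities) get identical subword segmentations on the source and target sides.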

Training
We use OpenNMT-py (Klein et al., 2017) for our NMT systems. We use a bidirectional RNN encoder and a doubly-attentive RNN decoder for our experiments. We train our systems in two ways, viz. with pre-training and without pre-training:
1. With pre-training: We pre-train one of our models on the HindEnCorp dataset. This step does not use any visual features, as the dataset used for pre-training has none. After pre-training, we fine-tune the pre-trained model on the VisualGenome dataset, which contains both textual and visual features.
2. Without pre-training: We do not pre-train the model. We directly train the model on the VisualGenome dataset, which contains both text and associated images; consequently, both textual and visual features are used.
The following steps are taken at inference time: we take the best hypothesis from each of the two models and filter out any hypothesis containing the <unk> token. Then, we pick the remaining hypothesis with the best log-likelihood from generation.
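The selection rule above can be sketched as a small function over (translation, log-likelihood) pairs. The fallback when every hypothesis contains <unk> is our assumption, as the paper does not specify that case:

```python
def select_hypothesis(hypotheses):
    """hypotheses: list of (translation, log_likelihood) pairs, one best
    hypothesis per model. Drop any translation containing the <unk>
    token, then return the one with the highest log-likelihood."""
    valid = [(t, s) for t, s in hypotheses if "<unk>" not in t.split()]
    if not valid:
        # Assumed fallback: if all hypotheses contain <unk>,
        # fall back to the overall best-scoring one.
        valid = hypotheses
    return max(valid, key=lambda ts: ts[1])[0]
```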

Hyper-parameters
We set the word embedding size and the size of the RNN hidden states to 500. We set the batch size to 40 and train for a maximum of 25 epochs. We restrict the maximum source and target sequence lengths to 50. We use the Adam optimizer (Kingma and Ba, 2017) with β1 = 0.9 and β2 = 0.999. During training, we use a dropout rate of 0.3 to avoid over-fitting. During translation generation, we use a beam width of 5.
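For reference, the Adam update with the β1 and β2 values above maintains bias-corrected first and second moment estimates of the gradient. A minimal scalar sketch of a single step (the learning rate shown is Adam's common default, not a value reported in the paper):

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (Kingma and Ba).
    m and v are the running first and second moment estimates;
    t is the 1-based step count. Returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```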

Experimental Results
We obtain impressive results on our submissions. There are two sets designed for evaluating our model: i) the Evaluation set, and ii) the Challenge set. We evaluate our model on both of these test sets and tabulate our results in Table 2. We use different evaluation metrics (BLEU, RIBES, AMFM) to test our model. The results shown in the table are sorted according to the obtained BLEU scores. As can be seen from Table 2, we obtain 42.47 BLEU points and achieve second position in terms of BLEU on the Evaluation set of the multimodal task. Please refer to Figure 2 for an example translation by our system. We obtain 37.50 BLEU points on the Challenge set. Possible reasons for the weaker results on the Challenge set are:

• The challenge test set was created by searching for (particularly) ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity. Hence, it is more difficult to translate than the Evaluation set, which was randomly selected.

• The difference between utterance lengths during training and testing, i.e. while average length
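BLEU, the primary metric above, combines modified n-gram precisions with a brevity penalty. A simplified sentence-level sketch for illustration only (the official WAT evaluation uses corpus-level BLEU with standard tokenization and smoothing, which this toy version omits; sentences shorter than max_n score zero here):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU: geometric mean of modified
    n-gram precisions for n = 1..max_n, times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n])
                           for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n])
                           for i in range(len(ref) - n + 1))
        overlap = sum((c_ngrams & r_ngrams).values())  # clipped counts
        precisions.append(overlap / max(sum(c_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```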

Conclusion
We participated in the WAT 2021 Multimodal Translation Task for English to Hindi. We achieve good results on both the Evaluation and Challenge sets, with 42.47 and 37.50 BLEU points, respectively, ranking second on the Evaluation set and third on the Challenge set. In the future, we would like to extend our work by training with additional monolingual data and by finding better ways to incorporate multimodal features.