TMEKU System for the WAT2021 Multimodal Translation Task

We introduce our TMEKU system submitted to the English-Japanese Multimodal Translation Task for WAT 2021. We participated in the Flickr30kEnt-JP task and the Ambiguous MSCOCO Multimodal task under the constrained condition, using only the officially provided datasets. Our proposed system employs soft word-region alignment for multimodal neural machine translation (MNMT). Experimental results evaluated with the BLEU metric provided by the WAT 2021 evaluation site show that the TMEKU system achieved the best performance among all participating systems. Further analysis in a case study demonstrates that leveraging word-region alignment between the textual and visual modalities is the key to the performance gains of the TMEKU system, as it leads to better use of visual information.


Introduction
Neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015) has achieved state-of-the-art translation performance. However, there remain numerous situations where textual context alone is insufficient for correct translation, such as in the presence of ambiguous words or grammatical gender. Therefore, researchers in this field have established multimodal neural machine translation (MNMT) tasks (Specia et al., 2016; Elliott et al., 2017; Barrault et al., 2018), in which sentences paired with images are translated into a target language.
Due to the lack of multimodal datasets, multimodal tasks on the English→Japanese (En→Ja) language pair had received little attention. Since a multimodal dataset for the En→Ja language pair was made publicly available in 2020, multimodal machine translation (MMT) tasks on En→Ja were held at WAT 2020 (Nakazawa et al., 2020) for the first time. Some studies have since started to focus on incorporating multimodal content, particularly images, to improve translation performance on the En→Ja task.

1 TMEKU is an abbreviation combining Tokyo Metropolitan University, Ehime University, and Kyoto University.
In this study, we apply our system (Zhao et al., 2021), called the TMEKU system, to the MMT task on the En→Ja language pair. This system is designed to translate a source word into a target word while focusing on a relevant image region. To guide the model to translate certain words based on certain image regions, explicit alignment between source words and image regions is needed. We propose to generate a soft word-region alignment based on the cosine similarity between source words and visual concepts. During encoding, the textual and visual modalities are represented interactively by leveraging the word-region alignment, which associates image regions with their respective source words.
The contributions of this study are as follows: 1. Our TMEKU system outperforms the baselines and achieves first place, as evaluated by the BLEU metric, among all systems submitted to the multimodal translation task of WAT 2021 2 (Nakazawa et al., 2021) on En→Ja.
2. Further analysis demonstrates that our TMEKU system utilizes visual information effectively by relating the textual to the visual information.

Figure 1: The soft alignment of word-region.

Word-Region Alignment
As shown in Figure 1, we propose to create an alignment between semantically relevant source words and image regions. For the regions, we follow prior work in detecting object-level image regions in each image, denoted by bounding boxes in the figure. In particular, each bounding box is detected along with a visual concept consisting of an attribute class followed by an object class, rather than the object class alone. We take these visual concepts to represent the image regions; each image is labeled with 36 visual concepts, which are space-separated phrases. For the words, we lowercase and tokenize the source English sentences with the Moses toolkit. 3 The soft alignment is a similarity matrix filled with the cosine similarity between source words and visual concepts. To avoid unknown words, we convert the words and concepts into subword units using the byte pair encoding (BPE) model (Sennrich et al., 2016). Subsequently, we utilize fastText (Bojanowski et al., 2017) to learn subword embeddings, using a pre-trained model 4 containing two million word vectors trained with subword information on Common Crawl (600B tokens). The source subword embeddings can be generated directly, whereas each visual concept embedding is the average of the embeddings of its constituent subwords, because the concepts are phrases. As shown in Figure 1, the source subwords are represented by W = {w_1, w_2, w_3, ..., w_n} and the visual concepts by C = {c_1, c_2, c_3, ..., c_36}. These embeddings provide a mapping from a subword to a 300-dim vector in which semantically similar subwords are embedded close to each other. Finally, we calculate the cosine similarity matrix of the word-region pairs as a soft alignment A_soft.
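As a concrete illustration, the soft alignment can be sketched in a few lines of NumPy: each visual-concept phrase is embedded as the average of its constituent subword vectors, and the cosine similarities to the source subword vectors fill the alignment matrix. This is a minimal sketch with random stand-in vectors; in the actual system the 300-dim embeddings come from the pre-trained fastText model.

```python
import numpy as np

def soft_alignment(word_vecs, concept_subword_vecs):
    """Cosine-similarity soft alignment between source subwords
    and visual concepts."""
    # Each visual concept is a phrase: average its constituent subword vectors.
    concept_vecs = np.stack([v.mean(axis=0) for v in concept_subword_vecs])
    # L2-normalise both sides so the dot product equals cosine similarity.
    w = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    c = concept_vecs / np.linalg.norm(concept_vecs, axis=1, keepdims=True)
    return w @ c.T  # shape: (num_subwords, num_concepts)

# Toy inputs: 4 source subwords and 2 concept phrases of 2-3 subwords each,
# all as random 300-dim stand-in vectors.
rng = np.random.default_rng(0)
words = rng.normal(size=(4, 300))
concepts = [rng.normal(size=(2, 300)), rng.normal(size=(3, 300))]
A_soft = soft_alignment(words, concepts)
```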

Representing Textual Input
In Figure 2, the textual encoder is a bi-directional RNN. Given a source sentence of n words, the encoder generates forward and backward annotation vectors for each word; their concatenation forms the textual input representation H.

Representing Visual Input
We follow prior work in extracting the region-of-interest (RoI) features of the detected image regions in each image. There are 36 object-level image region features per image, each represented as a 2,048-dim vector r; all features of an image are denoted as R = (r_1, r_2, r_3, ..., r_36).

Representations with Word-Region Alignment
As shown in Figure 2, we represent the textual annotation of n source words as A^txt = (a_1^txt, a_2^txt, a_3^txt, ..., a_n^txt) and the visual annotation of 36 regions as A^img = (a_1^img, a_2^img, a_3^img, ..., a_36^img). We represent the visual annotation A^img by concatenating R with the aligned textual features H^align, and use the textual input representation H directly as the textual annotation A^txt.
The visual annotation A^img is computed as

H^align = A_soft^T H,
A^img = CONCAT(R; H^align),

where |H| = n is the number of source words, |R| = 36 is the number of image regions, and CONCAT is a concatenation operator.
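The construction above can be sketched as follows, assuming the aligned textual features H_align are obtained by aggregating, for each region, the textual features of its aligned source subwords via the soft alignment matrix (the per-region normalisation here is an illustrative assumption; the dimensions are those used in this paper).

```python
import numpy as np

n, n_regions = 5, 36        # number of source subwords and image regions
d_txt, d_img = 400, 2048    # textual hidden size and RoI feature size

rng = np.random.default_rng(0)
H = rng.normal(size=(n, d_txt))            # textual input representation
R = rng.normal(size=(n_regions, d_img))    # RoI features of the 36 regions
A_soft = rng.random(size=(n, n_regions))   # word-region soft alignment

# For every region, aggregate the textual features of its aligned words,
# weighting by the column-normalised alignment scores.
weights = A_soft / A_soft.sum(axis=0, keepdims=True)
H_align = weights.T @ H                    # (36, d_txt)

# Visual annotation: RoI feature concatenated with aligned textual feature.
A_img = np.concatenate([R, H_align], axis=1)   # (36, d_img + d_txt)
A_txt = H                                       # textual annotation, used directly
```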

Decoder
To generate the target word y_t at time step t, a hidden state proposal s_t^(1) is computed in the first cell of deepGRU (Delbrouck and Dupont, 2018) (GRU(1)) by the function f_gru1(y_{t-1}, s_{t-1}), which considers the previously emitted target word y_{t-1} and the previous hidden state s_{t-1}:

ξ_t = σ(W_ξ E_Y[y_{t-1}] + U_ξ s_{t-1}),
γ_t = σ(W_γ E_Y[y_{t-1}] + U_γ s_{t-1}),
s̃_t = tanh(W E_Y[y_{t-1}] + γ_t ⊙ (U s_{t-1})),
s_t^(1) = (1 − ξ_t) ⊙ s̃_t + ξ_t ⊙ s_{t-1},

where W_ξ, U_ξ, W_γ, U_γ, W, and U are training parameters, E_Y is the target word embedding, σ is the logistic sigmoid, and ⊙ denotes the elementwise product.

Text-Attention
At time step t, the text-attention focuses on every textual annotation a_i^txt in A^txt and assigns an attention weight. The textual context vector z_t is generated as

e_t,i^text = (V_text)^T tanh(U_text s_t^(1) + W_text a_i^txt),
α_t,i^text = exp(e_t,i^text) / Σ_k exp(e_t,k^text),
z_t = Σ_i α_t,i^text a_i^txt,

where V_text, U_text, and W_text are the training parameters; e_t,i^text is the attention energy; and α_t,i^text is the attention weight.
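A minimal sketch of one step of this additive (Bahdanau-style) attention, with small random tensors standing in for the decoder state and the textual annotations:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s, A, U, W, v):
    """One decoding step of additive attention over annotations A,
    conditioned on the decoder state s."""
    # Attention energy for each annotation a_i.
    energies = np.array([v @ np.tanh(U @ s + W @ a) for a in A])
    alpha = softmax(energies)        # attention weights, sum to 1
    context = alpha @ A              # weighted sum of annotations
    return context, alpha

rng = np.random.default_rng(0)
d_s, d_a, d_att, n = 8, 8, 6, 5      # toy dimensions
s1 = rng.normal(size=d_s)            # decoder hidden state proposal
A_txt = rng.normal(size=(n, d_a))    # textual annotations
U = rng.normal(size=(d_att, d_s))
W = rng.normal(size=(d_att, d_a))
v = rng.normal(size=d_att)
z_t, alpha = additive_attention(s1, A_txt, U, W, v)
```

The image-attention in the next subsection has the same form, with the visual annotations in place of A_txt.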

Image-Attention
Similarly, the visual context vector c_t is generated as

e_t,j^img = (V_img)^T tanh(U_img s_t^(1) + W_img a_j^img),
α_t,j^img = exp(e_t,j^img) / Σ_k exp(e_t,k^img),
c_t = Σ_j α_t,j^img a_j^img,

where V_img, U_img, and W_img are the training parameters; α_t,j^img is the attention weight of each a_j^img; and e_t,j^img is the attention energy.

DeepGRU
As shown in Figure 2, deepGRU consists of three layers of GRU cells, which are variants of the conditional gated recurrent unit (cGRU). 5 The hidden state s_t is computed in GRU(3) by the function f_gru3. Because the calculations of f_gru2 and f_gru3 are similar to the function f_gru1, they are not included in this paper.
We use a gated hyperbolic tangent activation instead of tanh. This nonlinear layer implements a function f_ght : x ∈ R^m → y ∈ R^n defined as

ỹ = tanh(K x + b),
g = σ(K′ x + b′),
y = ỹ ⊙ g,

where K, K′ ∈ R^{n×m} and b, b′ ∈ R^n are the training parameters.
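This activation is straightforward to sketch: a tanh projection modulated elementwise by a sigmoid gate (K2 and b2 below stand in for the primed parameters K′ and b′):

```python
import numpy as np

def gated_tanh(x, K, b, K2, b2):
    """Gated hyperbolic tangent: tanh projection scaled elementwise
    by a sigmoid gate in (0, 1)."""
    y_tilde = np.tanh(K @ x + b)                  # candidate output
    g = 1.0 / (1.0 + np.exp(-(K2 @ x + b2)))      # sigmoid gate
    return y_tilde * g

rng = np.random.default_rng(0)
m, n_out = 10, 6
x = rng.normal(size=m)
y = gated_tanh(x,
               rng.normal(size=(n_out, m)), rng.normal(size=n_out),
               rng.normal(size=(n_out, m)), rng.normal(size=n_out))
```

Because both the tanh and the gate are bounded, every output component lies strictly inside (−1, 1).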
To ensure that both representations have their own projections for computing the candidate probabilities, we use a textual GRU block and a visual GRU block (Delbrouck and Dupont, 2018).

Dataset
Firstly, we conducted experiments for the En→Ja task using the official Flickr30kEnt-JP dataset (Nakayama et al., 2020), which was extended from the Flickr30k (Young et al., 2014) and Flickr30k Entities (Plummer et al., 2017) datasets, where manual Japanese translations were newly added.
For training and validation, we used the Flickr30kEnt-JP dataset 6 for the Japanese sentences, the Flickr30k Entities dataset 7 for the English sentences, and the Flickr30k dataset 8 for the images. These share the same training and validation splits as defined in Flickr30k Entities. For the test data, we used the officially provided data of the Flickr30kEnt-JP task, whose corresponding images are in the Flickr30k dataset.
Note that the Japanese training data originally contains 148,915 sentences, but five sentences are missing; thus, we used 148,910 sentences for training. In summary, we used 148,910 pairs for training, 5k pairs for validation, and 1k monolingual English sentences as the test input to be translated.

Preprocessing
For the English sentences, we applied lowercasing, punctuation normalization, and the tokenizer from the Moses toolkit. We then converted the space-separated tokens into subword units using a BPE model with 10k merge operations. For the Japanese sentences, we used MeCab 11 for word segmentation with the IPA dictionary. The resulting vocabulary sizes for En→Ja were 9,578→22,274 tokens.
For the image regions, we used Faster R-CNN (Ren et al., 2015), following prior work, to detect up to 36 salient visual objects per image, and extracted their corresponding 2,048-dim image region features and attribute-object combined concepts.

Settings
(i) NMT: the baseline NMT system (Bahdanau et al., 2015). This architecture comprises a 2-layer bidirectional GRU encoder and a 2-layer cGRU decoder with an attention mechanism, and encodes only the source sentence as input. (ii) MNMT: the baseline MNMT system without word-region alignment (Zhao et al., 2020). This architecture comprises a 2-layer bidirectional GRU encoder and a 2-layer cGRU decoder with double attention to integrate visual and textual features. (iii) TMEKU system: our proposed MNMT system with word-region alignment.

Parameters
We ensured that the parameters were consistent across all settings. We set the encoder and decoder hidden states to 400-dim; word embeddings to 200-dim; batch size to 32; beam size to 12; text dropout to 0.3; image region dropout to 0.5; and dropout of the source RNN hidden states to 0.5. Dropout was also applied to the blocks b_t. One validation evaluation was performed after every epoch.

Ensembling Models
For the Flickr30kEnt-JP task on En→Ja, each experiment was repeated with 12 different seeds to mitigate the variance of BLEU. We then chose the top 10 trained models, ranked by BLEU score on the validation set, for ensembling. For the Ambiguous MSCOCO task on En→Ja, each experiment was repeated with 8 different seeds, and all 8 trained models were ensembled for the final test.
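The checkpoint selection step can be sketched as follows (the model names and scores below are hypothetical):

```python
def top_k_models(val_bleu_by_model, k):
    """Pick the k checkpoints with the highest validation BLEU."""
    ranked = sorted(val_bleu_by_model.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical validation BLEU scores for four seeds.
scores = {"seed01": 42.1, "seed02": 41.7, "seed03": 42.6, "seed04": 40.9}
best = top_k_models(scores, 2)
```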

Evaluation
We evaluated the quality of the translation results using the official evaluation system provided by WAT 2021. We submitted the final translation results in Japanese, translated from the official test data in English. On the WAT 2021 evaluation site, an automatic evaluation server was provided, and BLEU was the main metric for evaluating the submitted translation results.

Results
Table 1 presents the results of the baselines and our TMEKU system on the Flickr30kEnt-JP task. We compared all results based on BLEU scores evaluated by the WAT 2021 evaluation site. The TMEKU system outperformed the NMT baseline by 0.86 BLEU points and the MNMT baseline by 0.69 BLEU points on the official test set, achieving significant improvements over both baselines. Moreover, the ensemble of the top 10 models achieved first place in the ranking of this task.
We also participated in the Ambiguous MSCOCO task on En→Ja translation using our TMEKU system. The BLEU scores are shown in Table 2, and the ensemble of 8 models ranked first among all submissions to this task.

Human Evaluation
To further validate the translation performance, a human evaluation was conducted by the organizers.
Two native speakers of Japanese rated the translation results on a scale of 1 to 5 (1 being the worst and 5 the best) and were instructed to focus more on semantic meaning than on grammatical correctness. For each of the Flickr30kEnt-JP and Ambiguous MSCOCO tasks on the En→Ja language pair, 200 randomly selected examples were evaluated.
The human evaluation scores provided by the organizers are added to Table 1 and Table 2; our system achieved the best scores among the participating systems in both tasks.

Case Study
We show two cases in Figure 3; improvements are highlighted in green.
We perform two types of visualization for each case: (1) we visualize the source-target word alignment of the text-attention; (2) we visualize the region-target alignment of the image-attention at the time step that generates a certain target word, together with the most heavily weighted image region feature.
In the case shown on the left, our TMEKU system translates "entering" as "entrant," whereas the baselines under-translate it. The visualization shows that the text-attention and image-attention assign the highest weights to the word and region that are semantically relevant at the time step of generating "entrant." This example shows that the improvement in translation quality is due to simultaneous attention to semantically related image regions and words.
In the case shown on the right, our TMEKU system correctly translates "backyard" into the compound noun "arrière-cour," whereas the baselines mistranslate it as "cour," which means "yard" in English. Through visualization, we find that the text-attention and image-attention focus on the features that are semantically relevant at that time step. This example shows that an image region feature associated with its semantically relevant textual feature can overcome the deficiency that an object attribute cannot be specifically represented by the image region feature alone.

Figure 3 (left): English: a man in a red shirt entering an establishment. Reference: un homme en t-shirt rouge entrant dans un établissement.
Figure 3 (right): English: a man is grilling out in his backyard. Reference: un homme fait un barbecue dans son arrière-cour. NMT Baseline: un homme fait griller quelque chose dans sa cour (yard). MNMT Baseline: un homme fait griller quelque chose dans sa cour (yard). TMEKU System: un homme fait griller quelque chose dans sa arrière-cour (backyard).

Conclusion
We presented our TMEKU system for the English→Japanese MMT tasks at WAT 2021, which is designed to simultaneously consider relevant textual and visual features during translation. By integrating the explicit word-region alignment, the object-level regional features can be further specified with their respective source textual features. This helps the two attention mechanisms capture the semantic relationships between textual objects and visual concepts.
Experimental results show that our TMEKU system exceeded the baselines by a large margin and achieved the best performance among all participating systems. We also performed a case study analysis to demonstrate the specific improvements resulting from relating the modalities.
In the future, we plan to develop a more efficient integration method to make the modalities interact with each other.