Good for Misconceived Reasons: An Empirical Revisiting on the Need for Visual Context in Multimodal Machine Translation

A neural multimodal machine translation (MMT) system aims to perform better translation by extending a conventional text-only translation model with multimodal information. Many recent studies report improvements when equipping their models with a multimodal module, despite controversy over whether such improvements indeed come from the multimodal part. We revisit the contribution of multimodal information in MMT by devising two interpretable MMT models. To our surprise, although our models replicate gains similar to those achieved by recently developed multimodal-integrated systems, they learn to ignore the multimodal information. Upon further investigation, we discover that the improvements achieved by the multimodal models over text-only counterparts are in fact the result of a regularization effect. We report empirical findings that highlight the importance of MMT models' interpretability, and discuss how our findings will benefit future research.


Introduction
Multimodal Machine Translation (MMT) aims at designing better translation systems by extending conventional text-only translation systems to take into account multimodal information, especially from the visual modality (Specia et al., 2016; Wang et al., 2019).* Despite many previous successes in MMT that report improvements when models are equipped with visual information (Calixto et al., 2017; Ive et al., 2019; Lin et al., 2020), there have been continuing debates on the need for visual context in MMT. In particular, Specia et al. (2016), Elliott et al. (2017), and Barrault et al. (2018) argue that visual context does not seem to help translation reliably, at least as measured by automatic metrics. Elliott (2018) and Grönroos et al. (2018a) provide further evidence by showing that MMT models are, in fact, insensitive to visual input and can translate without significant performance losses even in the presence of features derived from unrelated images. A more recent study (Caglayan et al., 2019), however, shows that under limited textual context (e.g., when noun words are masked), models can leverage visual input to generate better translations. It remains unclear, though, where the gains of MMT methods come from when the textual context is complete.

* The majority of this work was done while the first author was interning at Tencent AI Lab.
The main tool utilized in prior discussion is adversarial model comparison: explaining the behavior of complex, black-box MMT models by comparing performance changes when the model is given adversarial input (e.g., random images). Although such an opaque tool is an acceptable starting point for investigating the need for visual context in MMT, it provides rather indirect evidence (Hessel and Lee, 2020). This is because performance differences can often be attributed to factors unrelated to visual input, such as regularization (Kukačka et al., 2017), data bias (Jabri et al., 2016), and others (Dodge et al., 2019).
From these perspectives, we revisit the need for visual context in MMT by designing two interpretable models. Instead of directly infusing visual features into the model, we design learnable components that allow the model to voluntarily decide the usefulness of the visual features and reinforce their effects when they are helpful. To our surprise, while our models are shown to be effective on the Multi30k (Elliott et al., 2016) and VaTex (Wang et al., 2019) datasets, they learn to ignore the multimodal information. Our further analysis suggests that under sufficient textual context, the improvements come from a regularization effect similar to random noise injection (Bishop, 1995) and weight decay (Hanson and Pratt, 1989). The additional visual information is treated as a noise signal that can enhance model training and lead to a more robust network with lower generalization error (Salamon and Bello, 2017). Repeating the evaluation under limited textual context further substantiates our findings and complements previous analysis (Caglayan et al., 2019).
Our contributions are twofold. First, we revisit the need for visual context in the popular task of multimodal machine translation and find that: (1) under sufficient textual context, the MMT models' improvements over text-only counterparts result from the regularization effect (Section 5.2). (2) under limited textual context, MMT models can leverage visual context to help translation (Section 5.3). Our findings highlight the importance of MMT models' interpretability and the need for a new benchmark to advance the community.
Second, for the MMT task, we provide a strong text-only baseline implementation and two models with interpretable components that replicate gains similar to those reported in previous works. Different from adversarial model comparison methods, our models are interpretable due to their specifically designed model structure and can serve as standard baselines for future interpretable MMT studies. Our code is available at https://github.com/LividWo/Revisit-MMT.

Background
One can broadly categorize MMT systems into two types: (1) conventional MMT, where there is a gold alignment between the source-target sentence pair and a relevant image, and (2) retrieval-based MMT, where systems retrieve relevant images from an image corpus as additional clues to assist translation.
Conventional MMT Most MMT systems require datasets consisting of images with bilingual annotations for both training and inference. Many early attempts use a pre-trained model (e.g., ResNet (He et al., 2016)) to encode images into feature vectors. This visual representation can be used to initialize the encoder/decoder's hidden vectors (Elliott et al., 2015; Libovický and Helcl, 2017; Calixto et al., 2016). It can also be appended/prepended to word embeddings as additional input tokens (Huang et al., 2016; Calixto and Liu, 2017). Recent works (Libovický et al., 2018; Zhou et al., 2018; Ive et al., 2019; Lin et al., 2020) employ an attention mechanism to generate a visual-aware representation for the decoder. For instance, Doubly-ATT (Calixto et al., 2017; Arslan et al., 2018) inserts an extra visual attention sub-layer between the decoder's source-target attention sub-layer and feed-forward sub-layer. While there are more works on engineering decoders, encoder-based approaches are relatively less explored. To this end, Yao and Wan (2020), among others, replace the vanilla Transformer encoder with a multi-modal encoder.
Besides the exploration of network structure, researchers also propose to leverage the benefits of multi-tasking to improve MMT (Elliott and Kádár, 2017; Zhou et al., 2018). The Imagination architecture (Elliott and Kádár, 2017) decomposes multimodal translation into two subtasks: a translation task and an auxiliary visual reconstruction task, which encourages the model to learn a visually grounded source sentence representation.

Retrieval-based MMT The effectiveness of conventional MMT heavily relies on the availability of images with bilingual annotations, which restricts its wide applicability. To address this issue, Zhang et al. (2020) propose UVR-NMT, which integrates a retrieval component into MMT. They use TF-IDF to build a token-to-image lookup table, based on which images sharing similar topics with a source sentence are retrieved as relevant images. This creates image-bilingual-annotation instances for training. Retrieval-based models have been shown to improve performance across a variety of NLP tasks besides MMT, such as question answering (Guu et al., 2020), dialogue (Weston et al., 2018), language modeling (Khandelwal et al., 2019), question generation (Lewis et al., 2020), and translation (Gu et al., 2018).

Method
In this section we introduce two interpretable MMT models: (1) Gated Fusion for conventional MMT and (2) Dense-Retrieval-augmented MMT (RMMT) for retrieval-based MMT. Our design philosophy is that models should learn, in an interpretable manner, to what degree multimodal information is used. Following this principle, we focus on the component that integrates multimodal information. In particular, we use a gating matrix Λ (Zhang et al., 2020) to control the amount of visual information blended into the textual representation. Such a matrix facilitates interpreting the fusion process: a larger gating value Λ_ij ∈ [0, 1] indicates that the model exploits more visual context in translation, and vice versa.

Gated Fusion MMT
Given a source sentence x of length T and an associated image z, we compute the probability of generating the target sentence y of length N by:

p(y | x, z) = ∏_{i=1}^{N} p_θ(y_i | x, z, y_{<i}),    (1)

where p_θ(y_i | x, z, y_{<i}) is implemented with a Transformer-based (Vaswani et al., 2017) network. Specifically, we first feed x into a vanilla Transformer encoder to obtain a textual representation H_text ∈ R^{T×d}, which is then fused with the visual representation Embed_image(z) before being fed into the Transformer decoder. For each image z, we use a pre-trained ResNet-50 CNN (He et al., 2016) to extract a 2048-dimensional average-pooled visual representation, which is then projected to the same dimension as H_text:

Embed_image(z) = W_z · ResNet_pool(z).    (2)

We next generate a gating matrix Λ ∈ [0, 1]^{T×d} to control the fusion of H_text and Embed_image(z):

Λ = sigmoid(W_Λ H_text + U_Λ Embed_image(z)),    (3)

where W_Λ and U_Λ are model parameters. Note that this gating mechanism has been a building block for many recent MMT systems (Zhang et al., 2020; Lin et al., 2020). We are, however, the first to focus on its interpretability. Finally, we generate the output vector H by:

H = H_text + Λ ⊙ Embed_image(z),    (4)

where ⊙ denotes element-wise multiplication. H is then fed directly into the decoder for translation, as in the vanilla Transformer.
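The fusion step just described can be sketched in a few lines of NumPy; the parameter names (`W_z`, `W_lam`, `U_lam`) and the random values standing in for trained weights are our own illustrative choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(H_text, img_feat, W_z, W_lam, U_lam):
    """Fuse a (T, d) textual representation with a 2048-dim pooled
    ResNet feature via a learned elementwise gate in [0, 1]."""
    emb_img = img_feat @ W_z                 # project image feature to (d,)
    # the gate depends on both modalities; the image term broadcasts over T
    gate = sigmoid(H_text @ W_lam + emb_img @ U_lam)
    H = H_text + gate * emb_img              # gated residual fusion
    return H, gate

rng = np.random.default_rng(0)
T, d = 5, 8
H_text = rng.normal(size=(T, d))
img_feat = rng.normal(size=2048)             # stand-in for ResNet pooling
W_z = rng.normal(size=(2048, d)) * 0.01
W_lam = rng.normal(size=(d, d))
U_lam = rng.normal(size=(d, d))
H, gate = gated_fusion(H_text, img_feat, W_z, W_lam, U_lam)
```

A gate that collapses toward all zeros during training corresponds to the model ignoring the image entirely, which is exactly the behavior analyzed in Section 5.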

Retrieval-Augmented MMT (RMMT)
RMMT consists of two sequential components: (1) an image retriever p(z|x) that takes x as input and returns Top-K most relevant images from an image database; (2) a multi-modal translator p(y|x, Z) = N i p θ (y i | x, Z, y <i ) that generates each y i conditioned on the input sentence x, the image set Z returned by the retriever, and the previously generated tokens y <i .
Image Retriever The TF-IDF-based search in existing retrieval-based MMT (Zhang et al., 2020) ignores the context information of a given query, which can lead to poor performance. To improve the recall of our image retriever, we compute the similarity between a sentence x and an image z with the inner product:

sim(x, z) = Embed_text(x)^T Embed_image(z),

where Embed_text(x) and Embed_image(z) are d-dimensional representations of x and z, respectively. We then retrieve the top-K images that are closest to x. For Embed_image(z), we compute it by Eq. 2. For Embed_text(x), we implement it using BERT (Devlin et al., 2019):

Embed_text(x) = W_text · BERT_CLS(x).

Following standard practice, we use a pre-trained BERT model to obtain the "pooled" representation of the sequence (denoted as BERT_CLS(x)). Here, W_text is a projection matrix.
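Top-K retrieval by inner product reduces to a matrix-vector product followed by a sort; a minimal sketch with toy embeddings in place of the BERT and ResNet encoders:

```python
import numpy as np

def retrieve_top_k(query_vec, image_matrix, k=5):
    """Return indices of the k images whose embeddings have the highest
    inner product with the query sentence embedding."""
    scores = image_matrix @ query_vec        # (num_images,)
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(1)
images = rng.normal(size=(100, 16))          # stand-ins for Embed_image(z)
images /= np.linalg.norm(images, axis=1, keepdims=True)
query = images[42].copy()                    # a query identical to image 42
top = retrieve_top_k(query, images, k=5)
```

In practice the image embeddings are precomputed once, so each query costs a single matrix-vector product (or an approximate-nearest-neighbor lookup at larger scale).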
Multimodal Translator Different from Gated Fusion, p(y|x, Z) is now conditioned on a set of images rather than a single image. For each z in Z, we represent it using Embed_image(z) ∈ R^d as in Eq. 2. The image set Z then forms a feature matrix Embed_image(Z) ∈ R^{K×d}, where K = |Z| and each row corresponds to the feature vector of an image. We use a transformation layer f_θ(·), a max-pooling layer with window size K × 1, to extract salient features from Embed_image(Z) and obtain a compressed representation f_θ(Embed_image(Z)) ∈ R^d of Z. After the transformation, ideally, we can implement p(y|x, Z) using any existing MMT model. For interpretability, we follow the Gated Fusion model and fuse the textual and visual representations with a learnable gating matrix Λ:

Λ = sigmoid(W_Λ H_text + U_Λ f_θ(Embed_image(Z))),    (5)

H = H_text + Λ ⊙ f_θ(Embed_image(Z)).
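The pooling transformation f_θ is a column-wise max over the K retrieved image features, collapsing a (K, d) matrix to a single d-dimensional vector:

```python
import numpy as np

def pool_image_set(embed_Z):
    """Max-pool a (K, d) matrix of image features into one d-dim vector."""
    return embed_Z.max(axis=0)

Z = np.array([[1.0, 5.0, 2.0],
              [3.0, 0.0, 4.0]])              # K=2 images, d=3
pooled = pool_image_set(Z)                   # [3.0, 5.0, 4.0]
```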

Experiment
In this section, we evaluate our models on the Multi30k and VaTex benchmarks.

Dataset
We perform experiments on the widely-used MMT dataset Multi30k. We follow the standard split of 29,000 instances for training, 1,014 for validation, and 1,000 for testing (Test2016). We also report results on the 2017 test set (Test2017), with 1,000 additional instances, and the MSCOCO test set, which includes 461 more challenging out-of-domain instances with ambiguous verbs. We merge the source and target sentences in the officially preprocessed version of Multi30k (https://github.com/multi30k/dataset) to build a joint vocabulary. We then apply the byte pair encoding (BPE) algorithm (Sennrich et al., 2016) with 10,000 merge operations to segment words into subwords, which generates a vocabulary of 9,712 (9,544) tokens for En-De (En-Fr). Retriever pre-training. We pre-train the retriever on a subset of the Flickr30k dataset (Plummer et al., 2015) with instances overlapping Multi30k removed. We use Multi30k's validation set to evaluate the retriever. We measure performance by recall-at-K (R@K), defined as the fraction of queries whose K closest retrieved images contain the correct image. The pre-trained retriever achieves an R@1 of 22.8% and an R@5 of 39.6%.
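Recall-at-K as defined above can be computed directly from ranked retrieval lists; a small self-contained sketch with made-up ranks:

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of queries whose top-k retrieved image ids contain
    the gold image id."""
    hits = sum(1 for ranks, g in zip(retrieved, gold) if g in ranks[:k])
    return hits / len(gold)

# ranked image ids per query (toy data), and the correct image per query
retrieved = [[3, 7, 1], [2, 9, 4], [8, 0, 5], [6, 2, 9]]
gold = [7, 4, 1, 6]
r1 = recall_at_k(retrieved, gold, 1)   # 0.25: only the last query hits
r3 = recall_at_k(retrieved, gold, 3)   # 0.75: the third query still misses
```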

Setup
We experiment with different model sizes (Base, Small, and Tiny, see Appendix A for details). Base is a widely-used model configuration for Transformer in both text-only translation (Vaswani et al., 2017) and MMT (Grönroos et al., 2018b;Ive et al., 2019). However, for small datasets like Multi30k, training such a large model (about 50 million parameters) could cause overfitting. In our preliminary study, we found that even a Small configuration, which is commonly used for low-resourced translation (Zhu et al., 2019), can still overfit on Multi30k. We therefore perform grid search on the En→De validation set in Multi30k and obtain a Tiny configuration that works surprisingly well.
We use Adam with β1 = 0.9, β2 = 0.98 for model optimization. We start training with a warmup phase (2,000 steps), during which we linearly increase the learning rate from 10^-7 to 0.005. Thereafter we decay the learning rate proportionally to the inverse square root of the number of updates. Each training batch contains at most 4,096 source/target tokens. We set the label smoothing weight to 0.1 and dropout to 0.3. We follow Zhang et al. (2020) and early-stop training if the validation loss does not improve for ten epochs. We average the last ten checkpoints for inference, as in Vaswani et al. (2017) and Wu et al. (2018). We perform beam search with beam size 5. We report 4-gram BLEU and METEOR scores for all test sets. All models are trained and evaluated on a single machine with two P100 GPUs.
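The schedule described above (linear warmup over 2,000 steps to a peak of 0.005, then decay with the inverse square root of the update count, as in FairSeq's inverse_sqrt scheduler) can be written as:

```python
def inverse_sqrt_lr(step, warmup=2000, peak=0.005, init=1e-7):
    """Linear warmup from init to peak, then 1/sqrt(step) decay that
    matches the peak learning rate at the end of warmup."""
    if step <= warmup:
        return init + (peak - init) * step / warmup
    return peak * (warmup ** 0.5) / (step ** 0.5)

lr_start = inverse_sqrt_lr(0)        # 1e-7
lr_peak = inverse_sqrt_lr(2000)      # 0.005
lr_later = inverse_sqrt_lr(8000)     # 0.0025: halves when steps quadruple
```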

Baselines
Our baselines can be categorized into three types:
• the text-only Transformer;
• conventional MMT models: Doubly-ATT and Imagination;
• retrieval-based MMT models: UVR-NMT.
Details of these methods can be found in Section 2. For fairness, all baselines are implemented by ourselves based on FairSeq (Ott et al., 2019). We use the top-5 retrieved images for both UVR-NMT and our RMMT. We also consider two more recent state-of-the-art conventional methods for reference: GMNMT and DCCN (Lin et al., 2020), whose results we report as given in their papers.
Note that most MMT methods are difficult (or even impossible) to interpret. While there exist some interpretable methods (e.g., UVR-NMT) that contain gated fusion layers similar to ours, they perform sophisticated transformations on the visual representation before fusion, which lowers the interpretability of the gating matrix. For example, in the gated fusion layer of UVR-NMT, we observe that the visual vector is orders of magnitude smaller than the textual vector. As a result, interpreting the gating weights is meaningless because the visual vector has negligible influence on the fused vector.

Table 1 shows the BLEU scores of these methods on the Multi30k dataset. From the table, we see that although we can replicate BLEU scores for Transformer-Base similar to those reported in Grönroos et al. (2018b) and Ive et al. (2019), these scores (Row 1) are significantly outperformed by Transformer-Small and Transformer-Tiny, which have fewer parameters. This shows that Transformer-Base likely overfits the Multi30k dataset. Transformer-Tiny, with about 20 times fewer parameters than Transformer-Base, is more robust and efficient in our test cases. We therefore use it as the base model for all our MMT systems in the following discussion.

Results
Based on the Transformer-Tiny model, both our proposed models (Gated Fusion and RMMT) and the baseline MMT models (Doubly-ATT, Imagination, and UVR-NMT) significantly outperform the state-of-the-art models (GMNMT and DCCN) on En→De translation. However, the improvement of all these methods (Rows 4-10) over the base Transformer-Tiny model (Row 3) is marginal. This suggests that visual context might not be as important for translation as expected, at least on the datasets we explored.
We further evaluate all methods using METEOR scores (see Appendix C). We also run experiments on the VaTex dataset (see Appendix B). The results are similar to those in Table 1. Although various MMT systems have been proposed recently, a well-tuned text-only model remains competitive. This motivates us to revisit the importance of visual context for translation in MMT models.

Model Analysis
Taking a closer look at the results in the previous section, we are surprised to observe that our models learn to ignore visual context when translating (Section 5.1). This motivates us to revisit the contribution of visual context in MMT systems (Section 5.2). Our adversarial evaluation shows that adding model regularization achieves results comparable to incorporating visual context. Finally, we discuss when visual context is needed (Section 5.3) and how these findings could benefit future research.

Probe the need for visual context in MMT
To explore the need for visual context in our models, we focus on the interpretable component: the gated fusion layer (see Equations 3 and 5). Intuitively, a larger gating weight Λ_ij indicates that the model learns to depend more on visual context to perform better translation. We quantify the degree to which visual context is used by the micro-averaged gating weight

Λ = Σ_{m=1}^{M} sum(Λ_m) / (d × V),

where M and V are the total number of sentences and words in the corpus, respectively, and sum(·) adds up all elements in a given matrix. Λ is a scalar value ranging from 0 to 1; a larger Λ implies more usage of the visual context.
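Both statistics used in this section (the micro-averaged gating weight and the fraction of gating entries above a threshold) are simple aggregates over the per-sentence gating matrices; a sketch with toy matrices in place of real model outputs:

```python
import numpy as np

def gating_stats(gating_matrices, eps=1e-10):
    """Micro-averaged gating weight and the fraction of entries above
    eps, over a list of (T_m, d) gating matrices (one per sentence)."""
    count = sum(g.size for g in gating_matrices)
    avg = sum(float(g.sum()) for g in gating_matrices) / count
    frac = sum(int((g > eps).sum()) for g in gating_matrices) / count
    return avg, frac

# two sentences: one with uniform gates of 0.5, one with all-zero gates
gs = [np.full((2, 4), 0.5), np.zeros((3, 4))]
avg, frac = gating_stats(gs)     # avg = 4.0 / 20 = 0.2, frac = 8 / 20 = 0.4
```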
We first study the models' behavior after convergence. From Table 2, we observe that Λ is negligibly small, suggesting that both models learn to discard visual context. In other words, visual context may not be as important for translation as previously thought. Since Λ is insensitive to outliers (e.g., large gating weights at a few dimensions), we further compute p(Λ_ij > 1e-10): the percentage of gating weight entries in Λ that are larger than 1e-10. Unsurprisingly, we find that p(Λ_ij > 1e-10) is always zero on all test splits, which again shows that visual input is not used by the model at inference.
The training dynamics of Λ shed some light on how the model accommodates the visual information during training. Figure 1 (a) and (b) show how Λ changes during training, from the first epoch. We find that Gated Fusion starts with a relatively high Λ (>0.5) but quickly decreases to ≈0.48 after the first epoch. As training continues, Λ gradually decreases to roughly zero. In the early stages, the model relies heavily on images, possibly because they provide meaningful features extracted from a pre-trained ResNet-50 CNN while the textual encoder is still randomly initialized. Compared with text-only NMT, utilizing visual features lowers the MMT model's trust in the hidden representations generated by the textual encoder. As training continues, the textual encoder learns to represent the source text better and the importance of visual context gradually decreases. In the end, the textual encoder carries sufficient context for translation and supersedes the contributions of the visual features. Nevertheless, this does not explain the superior performance of the multimodal systems (Table 1). We speculate that visual context acts as regularization that helps model training in the early stages. We explore this hypothesis in the next section.

Revisit need for visual context in MMT
In the previous section, we hypothesized that the gains of MMT systems come from a regularization effect. To verify this hypothesis, we conduct experiments with two widely used regularization techniques: random noise injection (Bishop, 1995) and weight decay (Hanson and Pratt, 1989). The former simulates the effect of presumably uninformative visual representations, and the latter is a more principled form of regularization that has not received enough attention during hyperparameter tuning. Inspecting the results, we find that applying these regularization techniques achieves gains over the text-only baseline similar to those from incorporating multimodal information.
For random noise injection, we keep all hyperparameters unchanged but replace the visual features extracted with ResNet by randomly initialized vectors drawn from a standard Gaussian distribution. An MMT model equipped with ResNet features is denoted as a ResNet-based model, while the same model with random vectors is denoted as a noise-based model. We run each experiment three times and report the averaged results. Note that values in parentheses indicate the performance gap between a ResNet-based model and its noise-based adversary. Table 3 shows BLEU scores on the Multi30k dataset, where each column corresponds to one test set. From the table, we observe that among the 18 (3 methods × 3 test sets × 2 tasks) contests with the Transformer model (Row 1), noise-based models (Rows 2-4) achieve better performance 13 times, while ResNet-based models win 14 times. This shows that noise-based models perform comparably with ResNet-based models. A further comparison between noise-based and ResNet-based models shows that they are comparable over the 18 contests, in which the former wins 8 times and the latter 10 times.
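The adversarial substitution is literally a one-line change at the input: the 2048-dimensional ResNet feature is swapped for a vector drawn from a standard Gaussian, with everything else held fixed. A sketch:

```python
import numpy as np

def visual_input(resnet_feat, use_noise, rng):
    """Return either the real pooled ResNet feature (ResNet-based model)
    or same-shaped standard-Gaussian noise (noise-based model)."""
    if use_noise:
        return rng.standard_normal(resnet_feat.shape)
    return resnet_feat

rng = np.random.default_rng(0)
feat = np.ones(2048)                         # stand-in for a real feature
noisy = visual_input(feat, use_noise=True, rng=rng)
```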
We observe similar results when repeating the above evaluation using METEOR (Table 9) and on VaTex (Table 7). These observations suggest that random noise can function as visual context does. In MMT systems, adding random noise or visual context can help reduce overfitting (Bishop, 1995) when translating sentences in Multi30k, which are short and repetitive (Caglayan et al., 2019). Moreover, we find that the ℓ2 norms of the model weights in ResNet-based Gated Fusion and noise-based Gated Fusion are only 97.7% and 95.2% of that in the Transformer on En→De, respectively. This further supports our speculation that, like random noise injection (An, 1996), visual context can help smooth the weights and improve model generalization.
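The norm comparison above is a global ℓ2 norm over all parameter arrays of each model; with toy weight matrices, a 95% ratio looks like this (the actual 97.7%/95.2% figures come from the trained models):

```python
import numpy as np

def global_l2_norm(param_arrays):
    """l2 norm over all parameters of a model, flattened together."""
    return float(np.sqrt(sum((w ** 2).sum() for w in param_arrays)))

text_only = [np.full((2, 2), 2.0)]           # norm = sqrt(4 * 4) = 4.0
gated = [np.full((2, 2), 1.9)]               # uniformly smaller weights
ratio = global_l2_norm(gated) / global_l2_norm(text_only)   # 0.95
```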
Further, we regularize the models with weight decay. We consider three models: the text-only Transformer, the representative existing MMT method Doubly-ATT, and our Gated Fusion method. Figures 2 and 3 (in Appendix C) show the BLEU and METEOR scores of these methods on En→De translation as the weight decay rate changes. We see that the best results of the text-only Transformer with a fine-tuned weight decay are comparable to or even better than those of the MMT models Doubly-ATT and Gated Fusion, which utilize visual context. This again shows that visual context is not as useful as expected and essentially plays the role of regularization.

When is visual context needed in MMT
Despite the limited importance of visual information shown in the previous sections, there also exist works that support its usefulness. For example, Caglayan et al. (2019) experimentally show that, with limited textual context (e.g., some input tokens masked), MMT models will utilize the visual input for translation. This motivates us to investigate when visual context is needed in MMT models. We conduct experiments with a new masking strategy that does not require the entity-linking annotations used in Caglayan et al. (2019). Specifically, we follow Tan and Bansal (2020) to collect a list of visually grounded tokens. A visually grounded token is one that has more than 30 occurrences in the Multi30k dataset, with stop words removed.
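Building the grounded-token list and masking with it requires only token counts; a sketch with a toy corpus and a tiny illustrative stop-word list (the real experiment counts over Multi30k):

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "in", "is", "with"}   # illustrative only

def grounded_tokens(corpus, min_count=30):
    """Tokens with more than min_count occurrences, stop words removed."""
    counts = Counter(tok for sent in corpus for tok in sent.split()
                     if tok not in STOP_WORDS)
    return {tok for tok, c in counts.items() if c > min_count}

def mask_sentence(sent, grounded, mask="[v]"):
    """Replace every visually grounded token with a mask symbol."""
    return " ".join(mask if tok in grounded else tok for tok in sent.split())

corpus = ["a man in a hat"] * 31 + ["the dog is small"]
vocab = grounded_tokens(corpus)                       # {"man", "hat"}
masked = mask_sentence("a man with the hat", vocab)   # "a [v] with the [v]"
```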
Masking all visually grounded tokens affects around 45% of the tokens in Multi30k. Table 4 shows the adversarial study with visually grounded tokens masked. In particular, we select Transformer, Gated Fusion, and RMMT as representative methods. From the table, we see that random noise injection (Rows 5-6) and weight decay (Row 2) bring only marginal improvements over the text-only Transformer model. However, ResNet-based models that utilize visual context significantly improve the translation results. For example, RMMT achieves an almost 50% gain over the Transformer in BLEU score. Moreover, both Gated Fusion and RMMT with ResNet features yield a larger Λ value than when the textual context is sufficient, as shown in Table 2. These results further suggest that visual context is needed when textual context is insufficient. In addition to token masking, sentences with incorrect, ambiguous, and gender-neutral words (Frank et al., 2018) might also need visual context to help translation. Therefore, to fully exert the power of MMT systems, we emphasize the need for a new MMT benchmark in which visual context is deemed necessary for generating correct translations.
Interestingly, even with ResNet features, we observe a significant drop in both BLEU and METEOR scores compared with those in Tables 1 and 8, similar to that reported in Chowdhury and Elliott (2019). The reason could be two-fold. On the one hand, many words cannot be visualized. For example, in Table 5 (a), although Gated Fusion successfully identifies the main objects in the image ("little boys pose with a puppy"), it fails to generate the more abstract concept "family picture". On the other hand, when translating different words, it is difficult to attend to the correct regions in images. For example, in Table 5 (b), we see that Gated Fusion incorrectly generates the word "frauen" (women) because it attends to the woman at the top-right corner of the image. In that example, for the masked source "two men sitting in a restaurant", the candidate outputs include "zwei kinder spielen in einem springbrunnen" (two children are playing in a fountain) and the incorrect "zwei frauen sitzen in einem restaurant" (two women are sitting in a restaurant), versus the correct reference "zwei männer sitzen in einem restaurant" (two men are sitting in a restaurant). Table 5: Case studies under limited textual input. We use underlining to denote masked tokens, and strikethrough (bold) font to denote incorrect (correct) lexical choices. We use Gated Fusion for analysis.

Discussion
Finally, we discuss how our findings might benefit future MMT research. First, a benchmark that requires more visual information than Multi30k is desired. As shown in Section 5.2, sentences in Multi30k are rather simple and easy to understand. Thus, textual context alone can provide sufficient information for correct translation, making visual modules relatively redundant in these systems. While the MSCOCO test set in Multi30k contains ambiguous verbs and encourages models to use image sources for disambiguation, we still lack a corresponding training set. Second, our methods can serve as a verification tool to investigate whether visual grounding is needed for translation on a new benchmark.
Third, we find that visual feature selection is also critical for MMT performance. While most methods employ an attention mechanism to learn to attend to relevant regions in an image, the shortage of annotated data can impair the attention module (see Table 5 (b)). Some recent efforts (Lin et al., 2020; Caglayan et al., 2020) address this issue by feeding models with pre-extracted visual objects instead of the whole image. However, these methods are easily affected by the quality of the extracted objects. Therefore, a more effective end-to-end visual feature selection technique is needed, which can be further integrated into MMT systems to improve performance.

Conclusion
In this paper we devise two interpretable models that exhibit state-of-the-art performance on widely adopted MMT datasets: Multi30k and the new video-based dataset VaTex. Our analysis of the proposed models, as well as of other existing MMT systems, suggests that under sufficient textual context, visual context helps MMT in a similar vein as regularization methods (e.g., weight decay). These empirical findings, however, should not be understood as downplaying the importance of existing datasets and models; we believe that sophisticated MMT models are necessary for effective grounding of visual context into translation. Our goal, rather, is to (1) provide additional clarity on the remaining shortcomings of current datasets and stress the need for new datasets to move the field forward; and (2) emphasize the importance of interpretability in MMT research.

Acknowledgement
Zhiyong Wu is partially supported by a research grant from the HKU-TCL Joint Research Centre for Artificial Intelligence.
Hasan Sait Arslan, Mark Fishel, and Gholamreza Anbarjafari. 2018. Doubly attentive transformer machine translation. arXiv preprint arXiv:1807.11605.

A Training Settings

B Results on VaTex

The results are shown in Table 7. We observe that although most MMT systems show improvements over the Transformer baseline, the gains are quite marginal. This indicates that although image-based MMT models can be directly applied to video-based MMT, there is still room for improvement due to the challenge of video understanding. We also note that (a) regularizing the text-only Transformer with weight decay demonstrates gains similar to injecting video information into the models, and (b) replacing video features with random noise replicates comparable performance, which further supports our findings in Section 5.2.

C Results on METEOR
We also report results based on METEOR (Banerjee and Lavie, 2005), which has consistently demonstrated higher correlation with human judgments than BLEU in independent evaluations such as the EMNLP WMT 2011 shared task. From Table 8, we can see that on En-Fr translation, MMT systems demonstrate improvements over text-only baselines in METEOR similar to those in BLEU (see Table 1). On En-De translation, however, MMT systems are mostly on par with Transformer-Tiny in METEOR and do not show the consistent gains seen in BLEU. We hypothesize that this is because the En-De sets were created in an image-blind fashion, in which crowd-sourcing workers produced translations without seeing the images (Frank et al., 2018), so the source sentence alone already provides sufficient context for translation. When creating the En-Fr corpus, the image-blind issue was fixed (Elliott et al., 2017); thus, images are perceived as "needed" in the translation for whatever reason. Although BLEU is unable to elicit this difference, evaluation based on METEOR captures it and confirms previous research. We also compute METEOR scores for our experiments that regularize models with random noise (see Table 9) and weight decay (see Figure 3). The results are consistent with those evaluated using BLEU and further complement our earlier findings.

D Results on IWSLT'14
We also evaluate the retrieval-based model RMMT on a text-only corpus, IWSLT'14, which contains 160k bilingual sentence pairs for the En-De translation task. Following common practice, we lowercase all words, split 7k sentence pairs from the training data for validation, and concatenate dev2010, dev2012, tst2010, tst2011, and tst2012 as the test set. The number of BPE operations is set to 20,000. We use the Small configuration in all these experiments. The dropout and label smoothing rates are set to 0.3 and 0.1, respectively. Since there are no images associated with IWSLT, we follow Zhang et al. (2020) and retrieve the top-5 images from the Multi30k corpus. From Table 10, we see that the Transformer without weight decay is marginally outperformed by RMMT but achieves slightly higher BLEU scores when trained with a weight decay of 0.0001. Our discussion in Section 5.2 sheds light on why visual context is helpful on non-grounded, low-resourced datasets like IWSLT'14: injecting visual context helps regularize model training and avoid overfitting.