Visual Cues and Error Correction for Translation Robustness

Neural Machine Translation models are sensitive to noise in the input texts, such as misspelled words and ungrammatical constructions. Existing robustness techniques generally fail when faced with unseen types of noise and their performance degrades on clean texts. In this paper, we focus on three types of realistic noise that are commonly generated by humans and introduce the idea of visual context to improve translation robustness for noisy texts. In addition, we describe a novel error correction training regime that can be used as an auxiliary task to further improve translation robustness. Experiments on English-French and English-German translation show that both multimodal and error correction components improve model robustness to noisy texts, while still retaining translation quality on clean texts.


Introduction
Neural Machine Translation (NMT) has been shown to be very sensitive to noise (Belinkov and Bisk, 2018; Michel and Neubig, 2018; Ebrahimi et al., 2018), with even small perturbations in the inputs often leading to mistranslations. To improve the robustness of NMT models, current research mostly focuses on adapting the model to noisy texts via methods such as fine-tuning (Michel and Neubig, 2018; Alam and Anastasopoulos, 2020), noise injection (Belinkov and Bisk, 2018; Cheng et al., 2018; Karpukhin et al., 2019), and data augmentation through back-translation (Berard et al., 2019; Vaibhav et al., 2019; Li and Specia, 2019). In these approaches, the translation model is trained or fine-tuned on the noisy data so that it can learn from the noise. However, methods using extra context to help translate noisy texts have not been investigated.
Studies in Multimodal Machine Translation (MMT) have shown that visual information improves translation quality when the textual context is incomplete (Caglayan et al., 2019; Imankulova et al., 2020; Caglayan et al., 2020). However, as exemplified by Caglayan et al. (2019) (Figure 1), an MMT model trained on clean data was not able to handle noise. When the word "son" was misspelled as "song", the model disregarded the visual information and used the literal translation "chanson". The MMT model attended to the relevant region in the image and generated the intended translation "enfant" only when the noise was masked by a placeholder in the input, imitating an out-of-vocabulary (OOV) example.
Figure 1 (after Caglayan et al., 2019): Multimodality can help translate unknown words, but fails when there is noise in the input. The misspelled word "song" is correctly translated as "enfant" (child) when it is replaced with an unknown token, but translated literally as "chanson" (song) otherwise.
Given that the visual modality has been shown to help predict unknown words, we investigate whether adding multimodal information to adaptation-based methods would further improve translation robustness. To answer this question, we build MMT models in conjunction with noise injection techniques and investigate their behaviour during training and inference on both noisy and clean data. To further improve robustness, we extend the current adversarial training method (i.e., training NMT models on noisy texts) and propose an error correction training method. In addition to training the model with noise-injected source sentences and their clean translation counterparts, we introduce error correction as an auxiliary task and add a separate decoder to the model, which is used to denoise the source sentence.
Our main contributions can be summarized as:
• To the best of our knowledge, this is the first work combining adversarial training with multimodal NMT to improve translation robustness. We evaluate robustness on three types of noise that mimic errors commonly introduced by humans. Systematic experiments reveal that multimodality can improve model performance on both known and unseen noise.
• We propose an error correction training method for translation by introducing denoising as an auxiliary task. We show that the robustness of both NMT and MMT models is improved with this method.
• We demonstrate that the model using visual features also learns to correct grammatical errors more accurately, indicating the potential for multimodal monolingual error correction.
The paper is organised as follows: In Section 2, we present the background and related work. In Section 3, we introduce the types of noise injected and the error correction training method. In Section 4, we describe our experiment settings, with experiment results in Section 5, and further analysis in Section 6.

Background and Related Work
Robust NMT. Although NMT models can achieve high performance on clean data, they are very brittle to non-standard inputs, such as noisy texts (Belinkov and Bisk, 2018). Different types of noisy data have been proposed to test translation robustness, e.g. synthetic word perturbations (Belinkov and Bisk, 2018), grammatical errors (Anastasopoulos et al., 2019), and user-generated texts from social platforms (Michel and Neubig, 2018). The most common approach to improve translation robustness is to train the model on noisy data, which is referred to as adversarial training. Since parallel data with noisy source sentences and clean translations is difficult to obtain, the clean training data is often injected with different types of artificial noise, e.g. random word perturbations like character insertion/deletion/substitution (Belinkov and Bisk, 2018; Karpukhin et al., 2019; Passban et al., 2020; Xu et al., 2021), noise generated via back-translation (Berard et al., 2019; Vaibhav et al., 2019; Li and Specia, 2019), and adversarial examples generated by a white-box generator model (Cheng et al., 2018, 2019, 2020). Even though this method has been shown to improve NMT performance on noisy data, the types of noise used thus far are not common in real data. For example, it would be highly unlikely for human authors to misspell the word "robust" as "zobust", but such random transformations are used when synthesizing noisy training data for MT. In addition, back-translation paraphrases the texts to introduce noise; however, such noise is less realistic than human-generated errors, which include misspellings and grammatical errors. In adversarial approaches for other NLP tasks, Ribeiro et al. (2020) and Ma (2019) introduce various methods to inject both artificial and realistic noise. Inspired by this work, we focus on three types of noise that are commonly generated by humans in real texts and experiment with these for the translation task.
MMT. Multimodal machine translation extends the framework of NMT by incorporating extra modalities, e.g. image (Specia et al., 2016a) or audio (Sulubacak et al., 2020). In our case, the extra modality is given as visual features from an image network to complement the textual context. In standard MMT, these features can be fused with the textual representation by simple operations such as concatenation (Caglayan et al., 2016), hidden state initialization (Calixto and Liu, 2017), or via attention mechanisms (Libovický and Helcl, 2017; Calixto et al., 2016, 2017; Yao and Wan, 2020) and latent variables (Calixto et al., 2019).
Recent research has shown that the extra modality helps translation, especially when the input is incomplete (Caglayan et al., 2019, 2020; Imankulova et al., 2020) or ambiguous (Ive et al., 2019; Wu et al., 2019b). Wu et al. (2019a) hinted at the possibility of multimodality helping NMT in dealing with natural noise stemming from the speech recognition system used as a first step in their pipeline approach to speech translation from videos. Their results, however, were inconclusive. Salesky et al. (2021) investigate the robustness of open-vocabulary translation by representing texts as images followed by optical character recognition to cover some cases of noise such as misspellings. This is an interesting but orthogonal area of research since no external visual information is used. Therefore, it remains an open question whether MMT can perform better than NMT on noisy texts, and whether multimodality can be complementary rather than redundant to previous text-based robustness techniques. The work by Caglayan et al. (2019) is the closest to our approach; however, they focused mainly on identifying when the visual information is helpful. As such, they only performed experiments comparing NMT and MMT in the presence of unknown words consisting of placeholders used to mask out words in the source sentence. In contrast, we focus on multimodal models for realistic noise that includes in- and out-of-vocabulary words, such as misspellings or correctly-spelled words used in an incorrect context.
Table 1: Examples of the three types of noise.
clean: a pink flower is starting to bloom .
edit-distance: a pink flower is staring to loom .
homophone: a pink flour is starting to bloom .
keyboard: a pink flower is starring to bloom .

Methods
In this section, we introduce our methods to improve and evaluate the robustness of NMT and MMT models. In Section 3.1, we describe three techniques to inject realistic noise into training and test data. In Section 3.2, we introduce our error correction training method.

Noise Injection
In previous work on noise injection, the perturbations are often arbitrary, which results in unrealistic noise. To simulate the natural noise in real situations, we add constraints to the random perturbations. We select three constrained noise injection methods that can be applied to both training and test data, with each method simulating one type of human-generated error:
Edit distance. A word is randomly replaced with another word in the vocabulary where the edit distance between the two words is less than two characters. The edit-distance noise simulates the occurrence of confusable spellings (e.g. sat vs seat) and also some grammatical errors (e.g. horse vs horses).
Homophones. A word is randomly replaced with another word that shares the same pronunciation. We use the CMU Pronouncing Dictionary to transform words into phonemes and find noisy substitutes with the same pronunciation. This simulates errors made by applications such as automatic speech recognition, or by non-native speakers.
Keyboard (Belinkov and Bisk, 2018). A character in a word is randomly replaced with an adjacent key on the standard QWERTY keyboard. The keyboard noise simulates real-life typos when users accidentally press wrong keys while typing.
Table 1 shows examples of the three types of noise we experimented with. The edit distance and homophone noise types are applied on the word level, while the keyboard noise is on the character level. Word-level noise is more likely to break the sentence context even though the noisy substitutes are correctly spelled words. In contrast, character-level noise is likely to introduce misspelled words and increase the out-of-vocabulary (OOV) rate.
When constructing the noisy training or test sets, we sample from the three types of noise following a uniform distribution, applying only one type of noise to each sentence. To avoid substituting words that do not carry much contextual information (e.g. articles and punctuation), we only perturb words with more than two characters. The noise level is controlled by the hyperparameter n, which defines the maximum number of words replaced with noisy counterparts per sentence. The noise injection procedure can be characterized as follows: given a source sentence $x = [x_1, x_2, \dots, x_M]$ and a target translation $y = [y_1, y_2, \dots, y_N]$, noise is injected into the clean source sentence $x$ to obtain its noisy variant $x' = [x_1, \dots, x'_{a_i}, \dots, x_M]$, where $a_i$ is the position of the $i$-th noisy substitute ($i \in \{1, 2, \dots, n\}$).
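The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the adjacency table `QWERTY_NEIGHBOURS` is partial and hand-written, the `homophones` lookup is a hypothetical stand-in for a pronunciation-dictionary search, and all function names are invented for this example.

```python
import random

# Hypothetical, partial QWERTY adjacency table (illustration only).
QWERTY_NEIGHBOURS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "a": "qsz", "s": "awdx", "d": "sefc", "f": "drgv", "g": "fthb",
    "n": "bmhj", "m": "njk", "o": "ipl", "i": "uok", "l": "kop",
}


def edit_distance_is_one(a, b):
    """True if a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(ca != cb for ca, cb in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def edit_distance_noise(word, vocab, rng):
    """Replace word with a random vocabulary word at edit distance one, if any."""
    candidates = [v for v in vocab if edit_distance_is_one(word, v)]
    return rng.choice(candidates) if candidates else word


def keyboard_noise(word, rng):
    """Replace one character with a neighbouring key from the (partial) table."""
    positions = [i for i, ch in enumerate(word) if ch in QWERTY_NEIGHBOURS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + rng.choice(QWERTY_NEIGHBOURS[word[i]]) + word[i + 1:]


def inject_noise(tokens, vocab, homophones, n, rng):
    """Apply one uniformly sampled noise type to at most n tokens longer than two characters."""
    noise_type = rng.choice(["edit-distance", "homophone", "keyboard"])
    eligible = [i for i, tok in enumerate(tokens) if len(tok) > 2]
    positions = rng.sample(eligible, min(n, len(eligible)))
    noisy = list(tokens)
    for i in positions:
        word = tokens[i]
        if noise_type == "edit-distance":
            noisy[i] = edit_distance_noise(word, vocab, rng)
        elif noise_type == "homophone":
            # Assumed lookup: word -> list of same-sounding words.
            noisy[i] = rng.choice(homophones.get(word, [word]))
        else:
            noisy[i] = keyboard_noise(word, rng)
    return noisy
```

A word with no valid substitute is left unchanged in this sketch, so fewer than n words may be perturbed, which matches the reading of n as a maximum.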

Error Correction Training
We introduce error correction (Ng et al., 2014; Yuan and Briscoe, 2016) as an auxiliary task to help improve the robustness against noisy inputs. For that, we add a second decoder to the MT architecture, which is only used for the error correction task. During training, the noisy sentence $x'$ is encoded by the encoder, which is shared between the translation and correction tasks, into hidden states $h'$. The hidden state representation is then fed to both decoders. The translation decoder aims to generate a correct translation $y$ while the correction decoder aims to recover the original source sentence $x$. This method is also compatible with the MMT model, where the error correction decoder will use both visual and textual hidden states to recover the clean source sentences. Figure 2 gives an illustration of the model architecture.
Compared to the standard MT model, the version with error correction training (which we refer to as NMT-cor and MMT-cor hereinafter) maximizes both the probability of generating correct translations $P(y \mid x'; \theta_{mt})$ and the probability of recovering the clean source sentences $P(x \mid x'; \theta_{cor})$.
Here $\theta_{mt}$ represents the parameters of the translation component and $\theta_{cor}$ the parameters of the error correction component, with $\theta_{mt} = \{\theta_{enc}, \theta_{mt\_dec}\}$ and $\theta_{cor} = \{\theta_{enc}, \theta_{cor\_dec}\}$. Our hypothesis is that the auxiliary task of error correction may help the encoder learn a noise-invariant representation, which would indirectly improve the translation of noisy sentences. During training, we jointly optimize the sum of the translation loss and the error correction loss (Equation 2), where $\lambda \geq 0$ is the factor that controls the weight of the error correction loss, and $D$ represents the noise-injected data consisting of triples of the form $(x, x', y)$.
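The joint objective described above can be written compactly. The following is a sketch consistent with the surrounding definitions, assuming both losses are negative log-likelihoods (cross-entropy) summed over the triples in $D$; the symbol $\mathcal{L}$ and the use of a sum rather than an average are assumptions.

```latex
\mathcal{L}(\theta_{mt}, \theta_{cor}) =
  \sum_{(x, x', y) \in D}
  \Big[ -\log P(y \mid x'; \theta_{mt}) \;-\; \lambda \, \log P(x \mid x'; \theta_{cor}) \Big]
```

With $\lambda = 0$ this reduces to ordinary training on noise-injected data, while larger values of $\lambda$ put more weight on recovering the clean source sentence.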

Datasets
We experiment with the Multi30K dataset (Elliott et al., 2016), using both the En-Fr and En-De language pairs. This is the standard dataset for MMT and has been used in all open challenges on the topic (Specia et al., 2016b; Elliott et al., 2017a; Barrault et al., 2018). Following Caglayan et al. (2019), we use both the train and valid splits as our training set. The test2016-flickr set is used as our development set for checkpoint selection. For evaluation, we test the models on both the test2017-flickr and test2017-mscoco sets (Elliott et al., 2017b). We use a word-level vocabulary and build vocabularies for the original source and target languages, as well as a vocabulary for the noisy source texts. We use the pre-processed data in Multi30K, which is lowercased, normalized, and tokenized with Moses (Koehn et al., 2007). We also performed experiments using a subword-level vocabulary (BPE), which led to further improvements, but the trend in the results is the same (see Appendix A).
Following Caglayan et al. (2020), we use the "bottom-up-top-down" (BUTD) features (Anderson et al., 2018) extracted from a pre-trained Faster R-CNN ResNet-101 object detector. Each image is represented as 36 pooled feature vectors $V \in \mathbb{R}^{36 \times 2048}$, with each vector representing a local object region.

Models
NMT and MMT Models. Our baseline NMT model is the standard Transformer model (Vaswani et al., 2017), with 6 layers for both the encoder and the decoder. The hidden state size is 512 while the feed-forward dimension is 1024. The number of attention heads is set to 4. Dropout (0.3) is applied to both self/cross-attention and the position-wise feed-forward layer, and Pre-norm (Nguyen and Salazar, 2019) is applied to boost convergence. Our baseline MMT model follows the same architecture and hyperparameters as the baseline NMT model, except for the multimodal components. We use the serial multimodal cross-attention (Libovický et al., 2018), where an extra cross-attention sublayer is appended in the decoder layer to perform attention over the visual features. We also experiment with GRU models (Cho et al., 2014), following the hyperparameter settings of Caglayan et al. (2019). Due to space restrictions, we include the detailed results with GRU models in Appendix C. The GRU results display the same trend as the experimental results using Transformer models.
Error Correction Models. The error correction NMT/MMT models adopt the same encoder and decoder as the baseline NMT/MMT models, except for a second decoder added for error correction training. During training, we compute the cross-entropy loss for translation, as well as for error correction in the correction-based models. In these models, the two losses are summed and optimized jointly on the same batch. We selected the best λ for each noise level during hyperparameter tuning: λ = 0.2, 0.2, 0.4, 0.4, and 0.8 for n = 1, 2, 4, 6, and 10 noisy words, respectively. See Appendix B for more details.
Training and Evaluation. We use Adam (Kingma and Ba, 2015) as the optimizer and adopt the Noam learning rate scheduler (Vaswani et al., 2017) with a warm-up of 8000 steps. The training batch size is 64. Models are evaluated using the METEOR score (Denkowski and Lavie, 2014), which is the main metric for multimodal machine translation (Barrault et al., 2018). For the evaluation of error correction, we use ERRANT (Bryant et al., 2017) to compute the F0.5 score. During evaluation, we select the checkpoint with the best performance on the development set and generate the translation and correction using beam search with a beam size of 12. All models are implemented using nmtpytorch and pysimt. Each model is run with three random seeds and the average results are reported. Each run takes approximately 2 hours to train on an RTX 2080 Ti GPU.
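For reference, the Noam schedule of Vaswani et al. (2017) can be sketched as below, using the hidden size (512) and warm-up length (8000) reported above. The function name and the `factor` scaling parameter are hypothetical, and the toolkits used in the paper may scale the rate differently.

```python
def noam_lr(step, d_model=512, warmup=8000, factor=1.0):
    """Learning rate at a given step: linear warm-up, then inverse square-root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The rate grows linearly for the first `warmup` steps, peaks at `step == warmup`, and then decays proportionally to the inverse square root of the step number.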

Testing for Robustness to Noise
We first evaluate the robustness of standard NMT and MMT models trained on clean data by testing on the noise-injected data. This setting represents regular models that are not specifically adapted to noise. Figure 3 presents the change in METEOR (∆METEOR) between standard MMT and NMT models tested on data with different noise levels. The ∆METEOR is consistently above 0 for both test sets in the two language pairs. As the noise level increases, the difference between NMT and MMT models is larger, showing that the visual information in the MMT model leads to predictions that are more robust to noise.

Training for Robustness to Noise
To test models for their ability to adapt to noisy data, we train models on data with added noise, sampling from the three types of noise in Section 3.1 and test them on noisy test data, with noise added in the same fashion. METEOR score results are shown in Table 2.
The training on noisy data is equivalent to the "adversarial training" experiments in previous studies (Belinkov and Bisk, 2018; Karpukhin et al., 2019). In this setting, a text-only NMT model still suffers from significant performance degradation as the number of noisy words grows, for example dropping from 70.6 METEOR on clean test data to 49.4 under the noisiest setting for En-Fr on Flickr2017. A drop is also observed for the MMT model; however, it is smaller for both language pairs and test sets. As n becomes larger, the gain from the visual context is more obvious, showing that additional context in the form of image features is increasingly important for translation when the quality of the textual input is degraded.
With the addition of the error correction training, both NMT and MMT models further improve their performance, with NMT-cor even outperforming the base MMT model. The MMT-cor model performs better than both NMT-cor and base MMT models, demonstrating that the improvements from error correction and visual cues are complementary. Similar to the benefit from visual features, the difference between models with and without error correction training becomes larger when the noise level increases.
In addition to the performance on noisy texts, another important aspect when measuring robustness is to evaluate whether the performance of the models on clean data is harmed when the model is adapted to the noisy data. Following Karpukhin et al. (2019), we train models on a mixture of noisy and clean data (0.5/0.5) and test them on clean (original) data. The trend is the same for models on the other datasets/language pairs: the larger the proportion of noise in the training data, the higher the performance drop on the clean test set. However, the largest drop in METEOR is only 2.4, showing that mixing clean and noisy training data is a good strategy (see footnote 6). Both MMT and MMT-cor show a similar performance drop to the base NMT model, which indicates that the use of visual context and error correction training does not harm performance on clean texts.
The corresponding results for Tables 2 and 3 with GRU models can be found in Appendix C, showing a similar benefit when using multimodal information and error correction training.

Analysis
Robustness on Unseen Noise. Since in realistic applications the noise distribution at test time is unknown, we evaluate models using different noise proportions and types at training and test time. For the former, we test the same model (n=4) on various test sets created with different values of n. For the latter, we test the same model (n=4) on the test set where words are randomly replaced with unknown tokens (i.e. "[UNK]") to simulate unseen noise (noisy words from different corpora or domains, e.g. new emojis). Table 4 shows results for both cases.
Table 4: Performance of NMT and MMT models trained on noisy data with n=4 but tested on data with different noise proportions and noise types. All models are tested on Flickr2017 En-Fr.
The overall trend is similar to the case when the train/test noise are the same: models with visual information and error correction training achieve better performance. The METEOR score under mismatched train/test noise proportions is close to the score in Table 2 for the same test noise proportion, showing that the models are robust to unknown noise distributions. As for the evaluation on unknown noise types, the MMT model outperforms the NMT [...]
Footnote 6: In additional experiments, we found that models trained on entirely noisy data show much more severe performance drops as n becomes larger (see Appendix D).
Visual Sensitivity. To further probe the effect of the visual information on MMT and MMT-cor models, we apply the incongruent decoding evaluation approach (Elliott, 2018; Caglayan et al., 2019) by feeding the multimodal models with incorrect visual features at test time, i.e. features taken from a different test sample. The expectation is that the multimodal model will suffer due to the incorrect visual context, performing worse compared to using the correct visual features. Figure 4 shows the performance gap between congruent decoding and incongruent decoding.
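One simple way to build the incongruent inputs described above is to rotate the list of per-sample visual features so that every sentence is paired with the features of another test sample. The sketch below assumes this rotation scheme; the function name and the `shift` parameter are hypothetical, and the source does not specify how the mismatched features are chosen.

```python
def incongruent_features(features, shift=1):
    """Pair each sample with the visual features of a different sample by rotation."""
    n = len(features)
    if n < 2:
        raise ValueError("need at least two samples")
    if shift % n == 0:
        raise ValueError("shift must not be a multiple of the number of samples")
    return [features[(i + shift) % n] for i in range(n)]
```

Rotation guarantees that no sample keeps its own features, which a random shuffle does not. The congruent and incongruent METEOR scores can then be compared to obtain the gap plotted in Figure 4.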
The ∆METEOR is always positive for both MMT and MMT-cor models, and this difference is amplified with a larger noise ratio in the test data, reaching up to 7.2 METEOR points when n=10. We note that the ∆METEOR for the MMT-cor model is similar to that of the MMT model, but slightly lower, indicating that the error correction training helps the model recover from incorrect image features to a small extent on noisier data.

Error Correction Quality
To understand whether visual information can also benefit error correction, we compute the span-based correction F0.5 score as commonly used in the Grammatical Error Correction task (Dahlmeier and Ng, 2012). The <noisy, corrected> and <noisy, clean> pairs are first transformed into two lists of edits, where adding/replacing/deleting a word at any position counts as one edit. The evaluation is then performed by calculating the precision/recall/F0.5 between these edit sets.
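The edit-set comparison described above can be illustrated with a small sketch. It is not ERRANT: edits are extracted here with Python's `difflib.SequenceMatcher` as a simple stand-in for a proper alignment, and the function names are invented for this example.

```python
from difflib import SequenceMatcher


def extract_edits(source, target):
    """Return the set of word-level edits (start, end, replacement) turning source into target."""
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    return {
        (i1, i2, tuple(target[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def f_beta(hyp_edits, ref_edits, beta=0.5):
    """Precision, recall, and F-beta between hypothesis and reference edit sets."""
    true_positives = len(hyp_edits & ref_edits)
    precision = true_positives / len(hyp_edits) if hyp_edits else 1.0
    recall = true_positives / len(ref_edits) if ref_edits else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f
```

Hypothesis edits come from the <noisy, corrected> pair and reference edits from the <noisy, clean> pair. With beta = 0.5, precision is weighted more heavily than recall, so a system is penalized more for making a wrong correction than for missing one.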
We report the results in Table 6 for both NMT-cor and MMT-cor models trained on different values of n. The MMT-cor model outperforms the NMT-cor model, with an improvement of up to +1.7 and +2.6 F0.5 on the two test sets. This improvement indicates that visual features can also be beneficial for error correction performance, showing a potential for the task of multimodal error correction, which has yet to be explored.
Qualitative examples are shown in Table 5 (see Appendix F for more examples). In the first example, the source sentence is injected with the "edit-distance" noise, with "are" and "orange" replaced with "art" and "strange" respectively. Both NMT and NMT-cor models fail to include "orange" in the translation, as it is difficult to recover from this error without visual information, while the MMT-cor model is able to generate the correct output. The source sentence in the second example is injected with the "keyboard" noise, with "man" replaced with "kan" and "bicycle" replaced with "bicycld". Although the training data is injected with the same types of noise, the NMT model fails to translate correctly. The reason might be that "bicycle" has multiple noisy variants, such as "bicycld", "bocycle", etc., so the NMT model can hardly learn a strong relationship between "bicycld" and "vélo" (translation of "bicycle"). However, the NMT-cor model could relate "bicycld" with "bicycle", which helps to predict the correct translation "vélo".
Example 1
SRC: women are playing lacrosse with an orange ball .
[...] (women are playing lacrosse with a strange ball.)
MMTcor: des femmes jouent à la lacrosse avec une balle orange .
REF: des femmes jouent à la crosse avec une balle orange . (women are playing lacrosse with an orange ball .)
COR-NMT: women are playing lacrosse with an old ball .
COR-MMT: women are playing lacrosse with an orange ball .
Example 2
SRC: a man with his bicycle selling his products on a street
NSY: a [kan] with his [bicycld] selling his products on a street
NMT: un homme avec son casque vendant ses produits dans une rue (a man with his helmet selling his products on a street)
NMTcor: un homme avec son vélo vendant ses produits dans une rue
MMTcor: un homme avec son vélo vendant ses produits dans une rue
REF: un homme avec son vélo vendant ses produits dans une rue (a man with his bicycle selling his products on a street)
COR-NMT: a man with his bicycle selling his products on a street
COR-MMT: a man with his bicycle selling his products on a street
Table 5: Qualitative examples for both translation and error correction, where noise is indicated by the words in square brackets. Underlined and bold words highlight the bad and good lexical choices, respectively. NSY: noisy sentence. COR-*: corrected sentence (output from the error correction decoder).
In Figure 5, we also present the attention map of the MMT-cor system when generating the translation. The input is injected with noise by substituting "sit" with "sheet", and "wine" with "wire". When generating "sont assises" (are sitting), although the attention on the input text still mainly focuses on the noisy word "sheet" (with a small proportion focusing on the preposition "at"), the visual attention is able to focus on the people in the image; therefore, the model obtains the correct information from the visual input and is able to generate the correct translation. Similarly, the model generates "vin" (wine) by attending to the glasses in the images and is not distracted by the noisy input word "wire". The attention map for the example when generating the error correction output can be found in Figure 7 in the Appendix.

Conclusions
In this paper we propose to explore visual cues in order to improve model robustness to noise in machine translation. We combine adversarial training on artificially generated noisy examples with visually-informed multimodal machine translation. By training multimodal models on noisy data, we show that the extra visual context can improve translation robustness on both known and unseen noise. We also propose a novel error correction training method, jointly optimizing the translation model with an auxiliary objective for correcting input errors, which we show can further improve the robustness of both text-only and multimodal translation models. Future work in this area could investigate the integration of further modalities, such as audio in the speech translation setting. In addition to translation, we found that the model using visual features can also help correct errors in the source language. This opens up a promising direction for multimodal monolingual error correction, a task not yet explored.
In Table 7, we present the results for NMT and MMT models using word-level and subword-level vocabularies. Models using byte-pair encoding (BPE) perform better than models with a word-level vocabulary. Nevertheless, MMT models outperform NMT models when using BPE. Likewise, the MMT-cor models are consistently better than the MMT model when a subword-level vocabulary is applied. The results show that the benefit from both multimodality and error correction training still holds for models with a subword-level vocabulary.
Table 7: Results for word- and subword-level models trained and tested on noisy data. The word-level (w2w) results are used for comparison and are the same as in Table 2.

B Effect of λ
The value of λ controls the weight of the error correction training for NMT-cor and MMT-cor models. This is thus an important hyper-parameter. We show the performance on translation and error correction tasks for different values of λ in Figure 6. In terms of translation, the performance for both NMT-cor and MMT-cor models follows the same trend: the METEOR score first increases and then drops as λ increases. This is reasonable since error correction is an auxiliary task, and a large weight for the error correction task might harm the models' ability to translate well. Nevertheless, the optimal λ value is different for different levels of noise. Higher values of λ help translate noisier texts. Regarding error correction, increasing λ always leads to better performance.

C Results with GRU Models
In Table 8, we present the results for GRU models trained and tested on the noisy data. Similar to Transformer models, GRU models also benefit from multimodality and error correction training, and the improvement is larger on noisier data.
In Table 9, the performance drop for GRU models on clean data is presented. Both MMT and MMT-cor show a smaller drop than the NMT baseline, confirming that the improved robustness on noisy data does not sacrifice the ability to translate clean data. These results with GRU models further confirm that both multimodality and error correction training improve translation robustness and can generalise to different models.

D Performance Drop on Clean Texts (Trained on Fully Noisy Data)
In Table 10, we present the performance drop on clean texts for models trained on fully noisy data. The drop on clean texts is not obvious for models trained with smaller n, while as n becomes large, all three models suffer from significant performance degradation. The results indicate that the proportion of noise in the training data is an important factor for robustness. However, to a lesser extent, the benefit from visual context and error correction training still holds on the clean test set, which indicates that the two methods do not simply trade off the performance on clean and noisy texts.

E Semantic Similarity
To study the effect of error correction training on the shared encoder, we conduct a semantic similarity evaluation for models with and without error correction training. For that, we extract the hidden states from the last encoder layer for each sentence and measure the average cosine similarity over all words between noisy sentences and their clean counterparts. The similarity is computed as:
$$\mathrm{sim}(x', x) = \frac{1}{k} \sum_{i=1}^{k} \frac{h'_i \cdot h_i}{\lVert h'_i \rVert \, \lVert h_i \rVert}$$
where $x' = [x'_1, x'_2, \dots, x'_k]$ represents the noisy sentence, $x = [x_1, x_2, \dots, x_k]$ represents the clean sentence, and $h'_i$ and $h_i$ represent the hidden state vectors for the $i$-th word in the noisy and clean sentences respectively. Results are presented in Table 11. Models trained with error correction achieve higher similarity between the clean and noisy hidden representations, suggesting that the error correction task helps learn a noise-invariant encoder representation. It is also interesting that visual features can slightly improve the similarity. The reason might be that the model learns alignments for both (image, clean text) and (image, noisy words). Therefore, the image might act as a bridge connecting the clean and noisy texts.
Table 11: Cosine similarity between the hidden representations for noisy and clean sentences. All models are trained with n=4 and tested on Flickr2017 En-Fr.
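The averaged cosine similarity described above can be sketched in plain Python as follows. The function names are hypothetical, hidden states are represented as lists of floats for illustration, and the noisy and clean sentences are assumed to have the same number of words, as in the definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_similarity(noisy_states, clean_states):
    """Mean position-wise cosine similarity between noisy and clean encoder states."""
    if not noisy_states or len(noisy_states) != len(clean_states):
        raise ValueError("sentences must be non-empty and of equal length")
    total = sum(cosine(h_noisy, h_clean)
                for h_noisy, h_clean in zip(noisy_states, clean_states))
    return total / len(noisy_states)
```

A value close to 1 means the encoder maps the noisy sentence to nearly the same representation as its clean counterpart, which is the sense of "noise-invariant" used here.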

F More Qualitative Examples
In the appendix we provide some qualitative examples of translation (Table 12)