Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models

Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available. However, recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise, which suggests that the visual context might not be exploited by the model at all. We hypothesize that this might be caused by the nature of the commonly used evaluation benchmark, Multi30K, where the translations of image captions were prepared without actually showing the images to human translators. In this paper, we present a qualitative study that examines the role of datasets in encouraging models to leverage the visual modality, and we propose methods to highlight the importance of visual signals in the datasets, which increase the reliance of models on the source images. Our findings suggest that research on effective MMT architectures is currently impaired by the lack of suitable datasets and that careful consideration must be taken in the creation of future MMT datasets, for which we also provide useful insights.


Introduction
Multimodal machine translation (MMT) aims to improve machine translation by resolving certain contextual ambiguities with the aid of other modalities such as vision, and has shown promising integration in conventional neural machine translation (NMT) models. On the other hand, recent studies reported some conflicting results regarding how the additional visual information is exploited by the models for generating higher-quality translations. A number of MMT models (Calixto et al., 2017; Helcl et al., 2018; Ive et al., 2019; Lin et al., 2020) have been proposed which showed improvements over text-only models, whereas Barrault et al. (2018) and Raunak et al. (2019) observed that the multimodal integration did not make a big difference quantitatively or qualitatively. Subsequent experimental work showed that replacing the images in image-caption pairs with incongruent images (Elliott, 2018) or even random noise (Wu et al., 2021) might still result in similar performance of multimodal models. In light of these results, Wu et al. (2021) suggested that gains in quality might merely be due to a regularization effect and the images may not actually be exploited by models during the translation task.
In this paper, we investigate the role of the evaluation benchmark in model performance and whether the models' tendency to ignore visual information in the input could be a consequence of the nature of the dataset. The most widely used dataset for MMT is Multi30K (Elliott et al., 2017; Barrault et al., 2018), which extends the Flickr30K dataset (Young et al., 2014) with German, French, and Czech translations. Captions were translated without access to images, and it is posited that this heavily biases MMT models towards only relying on textual input (Elliott, 2018). MMT models may well be capable of using visual signals, but will only learn to do so if the visual context provides information beyond the text. For instance, the English word "wall" can be translated into German as either "Wand" (wall inside of a building) or "Mauer" (wall outside of a building), but we find that reference translations in Multi30K are not always congruent with images.
A number of efforts have been put into creating datasets where correct translations are only possible in the presence of images. Caglayan et al. (2019) degrade the Multi30K dataset to hide away crucial information in the source sentence, including color, head nouns, and suffixes. Similarly, Wu et al. (2021) mask high-frequency words in Multi30K. Multisense (Gella et al., 2019) collects sentences whose verbs have cross-lingual sense ambiguities. However, due to the high cost of data collection, datasets of this kind are often limited in size. MultiSubs (Wang et al., 2021) is another related dataset, which is primarily used for lexical translation because the images are retrieved to align with text fragments rather than whole sentences.
In this work, we propose two methods to necessitate the visual context: back-translation from a gender-neutral language (e.g. Turkish) and word dropout in the source sentence. They are simple and cheap to implement, allowing them to be applied to much larger datasets. We test the methods on two MMT architectures and find that they indeed make the models more reliant on the images.

Method
In this section, we elaborate on two methods to conceal important information in the source textual inputs that can be recovered with the aid of visual inputs.
Back-Translation. Rather than trying to create reference translations that make use of visual signals for disambiguation, we treat original image captions as the target side and automatically produce ambiguous source sentences. While such back-translations are generally used for data augmentation (Sennrich et al., 2016), we rely fully on this data for training and testing. We focus on gender ambiguity, which can be easily created by translating from a language with natural gender (English) into a gender-neutral language (Turkish). In Turkish, there is no distinction between gender pronouns (e.g. "he" and "she" are both translated into "o"). We use a commercial translation system (Google Translate) to translate the image description in English to Turkish. The task is then to translate from Turkish back into English. An example is shown in Fig. 1.
Word Dropout. Inspired by Caglayan et al. (2019), we degrade the textual inputs to eliminate crucial information. We use a simplified approach that requires no manual annotation, randomly replacing tokens in the source sentence with a special UNK token, subject to a dropout probability p (Bowman et al., 2016).
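The word dropout step can be sketched in a few lines. This is a minimal illustration, assuming whitespace-tokenized input and an UNK token spelled "<unk>"; the function name, the token spelling, and the example sentence are hypothetical and not taken from the experiments.

```python
import random

UNK = "<unk>"  # assumed spelling of the special UNK token


def word_dropout(tokens, p, rng=None):
    """Replace each source token with UNK independently with probability p."""
    rng = rng or random.Random()
    return [UNK if rng.random() < p else tok for tok in tokens]


# Illustrative Turkish source sentence ("he/she is reading a letter").
source = "o bir mektup okuyor".split()
print(word_dropout(source, p=0.3, rng=random.Random(0)))
```

Because each token is dropped independently, the sentence length is preserved and only the lexical content is degraded, which is what pushes the model towards the image for the missing information.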
Figure 1: An example for back-translation. The image caption is translated into Turkish using a text-only translation system. Then an MMT model is trained to translate it back into English. When an incongruent image is fed into the model, the gender pronoun "his" is mistranslated. English sentences shown in the figure: "Film character sitting at his chair and reading a letter with fireplace and Christmas tree in the background."; "Film character sitting in his chair and reading a letter with fireplace and Christmas tree in the background."; "Film character sitting in her chair and reading a letter with fireplace and Christmas tree in the background."

Data Collection
As our starting point, we use Conceptual Captions (Sharma et al., 2018), which contains 3.3M images with captions. The captions in the dataset have already been processed to replace named entities with hypernyms such as 'person' or profession names such as 'actor'. In order to create a gender-ambiguous dataset, we further filter out any sentences containing nouns with information about the gender of the entity (e.g. woman/man, lady/gentleman, king/queen, etc.) and also remove sentences with professions which are only used in a single gender-specific context (e.g. 'football player', which is always used with the male pronoun in the dataset). We then automatically translate the captions of the resulting dataset into Turkish and use this pseudo-parallel data for training our Turkish-English MMT models. For validation and testing we randomly sample 1000 sentences and use the remainder for training. We refer to this processed dataset as Ambiguous Captions (AmbigCaps).
For comparison, we also create a Turkish→English version of Multi30K by back-translating the English side. Tab. 1 summarizes the characteristics of the two corpora.
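The filtering step might look roughly like the sketch below. The word lists contain only the examples named above and would need to be extended in practice (plural forms, for instance, are not covered); the function name and the simple regular-expression tokenization are assumptions made for illustration.

```python
import re

# Only the examples mentioned in the text; the complete lists are not given there.
GENDERED_NOUNS = {"woman", "man", "lady", "gentleman", "king", "queen"}
GENDER_SPECIFIC_PROFESSIONS = {"football player"}


def keep_caption(caption):
    """Return True if the caption contains no explicitly gendered noun or profession."""
    text = caption.lower()
    tokens = set(re.findall(r"[a-z]+", text))
    if tokens & GENDERED_NOUNS:
        return False
    for phrase in GENDER_SPECIFIC_PROFESSIONS:
        # Match the multi-word profession as a whole phrase.
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            return False
    return True


captions = [
    "actor sitting in his chair and reading a letter",
    "a woman reading a letter",
    "football player celebrates his goal",
]
print([c for c in captions if keep_caption(c)])
```

Pronouns are deliberately left untouched by this filter: they are the signal that disappears in the Turkish source and has to be recovered from the image.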

Models
In our experiments, we consider one NMT model and two MMT models. We follow the model and configuration of Wu et al. (2021) to isolate the cause of the negative results they obtained. We decide not to use the retrieval-based system because it samples images that are not described by the text. We also implement another simple model to demonstrate the applicability of our approaches.
Transformer. For the text-only baseline, we use a variant of the Transformer that has 4 encoder layers, 4 decoder layers, and 4 attention heads in each layer. The dimensions of input/output layers and inner feed-forward layers are also reduced to 128 and 256 respectively. This configuration has been shown to be effective on the Multi30K dataset (Wu et al., 2021). The MMT models below follow the same configuration.
Gated Fusion. The gated fusion model (Wu et al., 2021) learns a gate vector λ and uses it to combine the textual and image representations into a fused representation H. Here, H_text is the output of the Transformer encoder, H_avg is the average-pooled visual features after projection and broadcasting, and ⊙ denotes the Hadamard product. H is then fed into the Transformer decoder as in NMT.
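To make the data flow of gated fusion concrete, here is a small NumPy sketch. The parametrization is an assumption, not taken from the text: the gate is computed as a sigmoid over a linear map of the concatenated text and image representations, and the output is a convex combination of the two. The weight names (w_proj, w_gate) and all dimensions in the demo are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(h_text, img_feat, w_proj, w_gate):
    """Fuse encoder states with pooled image features through a learned gate.

    h_text:   (seq_len, d_model) Transformer encoder output
    img_feat: (num_regions, d_img) visual features
    w_proj:   (d_img, d_model) image projection (assumed)
    w_gate:   (2 * d_model, d_model) gate weights (assumed)
    """
    h_avg = img_feat.mean(axis=0) @ w_proj        # average pooling, then projection
    h_avg = np.broadcast_to(h_avg, h_text.shape)  # broadcast to every source position
    gate = sigmoid(np.concatenate([h_text, h_avg], axis=-1) @ w_gate)
    # Element-wise (Hadamard) gating between the two modalities.
    return (1.0 - gate) * h_text + gate * h_avg


rng = np.random.default_rng(0)
h_text = rng.normal(size=(5, 8))
img_feat = rng.normal(size=(49, 16))
w_proj = rng.normal(size=(16, 8))
w_gate = rng.normal(size=(16, 8))
print(gated_fusion(h_text, img_feat, w_proj, w_gate).shape)
```

Whatever the exact form of the gate, the fused representation keeps the shape of H_text, so the decoder can consume it unchanged.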

Concatenation. We implement a different approach to combine textual and visual features. The flattened and projected 'res4_relu' features H_res4_relu are directly concatenated with the Transformer encoder representations H_text to form the fused representation H. This preserves more fine-grained features in the original image and avoids confounding the two modalities.
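The concatenation model can be sketched in the same style. The sketch assumes that the feature map is flattened over its spatial grid and that the concatenation is along the sequence axis, so each spatial location becomes an extra "position" the decoder can attend to; the axis, the weight name w_proj, and the toy dimensions are assumptions.

```python
import numpy as np


def concat_fusion(h_text, res4_feat, w_proj):
    """Append projected spatial image features to the encoder states.

    h_text:    (seq_len, d_model) Transformer encoder output
    res4_feat: (channels, height, width) feature map
    w_proj:    (channels, d_model) projection (assumed)
    """
    channels, height, width = res4_feat.shape
    flat = res4_feat.reshape(channels, height * width).T  # (height * width, channels)
    h_img = flat @ w_proj                                 # (height * width, d_model)
    # Concatenate along the sequence axis (an assumption of this sketch).
    return np.concatenate([h_text, h_img], axis=0)


rng = np.random.default_rng(0)
h_text = rng.normal(size=(5, 8))
res4_feat = rng.normal(size=(16, 3, 3))
w_proj = rng.normal(size=(16, 8))
print(concat_fusion(h_text, res4_feat, w_proj).shape)
```

Unlike the pooled vector used in gated fusion, the text states pass through untouched here and the spatial layout of the image is retained as separate positions.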

Implementation Details
We follow (Wu et al., 2021)

Metrics
BLEU. We compute the cumulative 4-gram BLEU scores (Papineni et al., 2002) to evaluate the overall quality of translation.
Gender Accuracy. Since we are most concerned with the gender ambiguity in the texts, we introduce gender accuracy as an additional metric. We first extract gender pronouns from the sentence. If the sentence contains at least one of the male pronouns ['he', 'him', 'his', 'himself'], it is classified as 'male'; if it contains at least one of the female pronouns ['she', 'her', 'hers', 'herself'], it is classified as 'female'; if it contains both male and female pronouns or neither, it is classified as 'undetermined'. We only consider the first two categories, and compute gender accuracy by comparing the results of references and hypotheses.
Image Awareness. To examine models' reliance on the visual modality, we calculate the performance degradation when randomly sampled images are fed. This is also termed image awareness (Elliott, 2018).
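The classification rule above translates directly into code. In the sketch below, two details are assumptions: sentences are selected by the label of the reference (references labelled 'undetermined' are skipped), and a hypothesis labelled 'undetermined' counts as an error. Tokenization is a simple lowercase regular expression, and the function names are illustrative.

```python
import re

MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}


def classify_gender(sentence):
    """Label a sentence 'male', 'female', or 'undetermined' from its pronouns."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    has_male = bool(tokens & MALE)
    has_female = bool(tokens & FEMALE)
    if has_male and not has_female:
        return "male"
    if has_female and not has_male:
        return "female"
    return "undetermined"  # both genders present, or no gendered pronoun


def gender_accuracy(references, hypotheses):
    """Share of male/female references whose hypothesis receives the same label."""
    correct = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_label = classify_gender(ref)
        if ref_label == "undetermined":
            continue  # assumption: only references with a clear gender are scored
        total += 1
        correct += classify_gender(hyp) == ref_label
    return correct / total if total else float("nan")


refs = ["she reads her book", "he sits in his chair", "a dog runs"]
hyps = ["she reads her book", "she sits in her chair", "a dog runs"]
print(gender_accuracy(refs, hyps))
```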
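A minimal way to compute such a degradation is sketched below, treating it as the plain difference between a score obtained with congruent images and the same score obtained with incongruent ones. The helper that guarantees no sentence keeps its own image is a design choice of this sketch (the text only states that images are randomly sampled), and the function names are hypothetical.

```python
import random


def incongruent_assignment(n, rng):
    """Assign to each of n sentences the index of a different sentence's image.

    Built by shuffling the indices and pairing each one with its successor in
    the shuffled cycle, so no sentence keeps its own image (requires n >= 2).
    """
    order = list(range(n))
    rng.shuffle(order)
    assignment = [0] * n
    for pos, idx in enumerate(order):
        assignment[idx] = order[(pos + 1) % n]
    return assignment


def image_awareness(score_congruent, score_incongruent):
    """Performance degradation when incongruent images are fed to the model."""
    return score_congruent - score_incongruent


print(incongruent_assignment(5, random.Random(0)))
# Gender accuracy of the gated fusion model on Ambiguous Captions, as reported
# in the results: 80.9 with congruent images, 64.4 with incongruent ones.
print(image_awareness(80.9, 64.4))
```

The same difference can be taken over BLEU or gender accuracy; a value near zero indicates that the model's output barely depends on which image it receives.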

Results
The results of our experiments are shown in Tab. 2.

Multi30K EN→DE
Test2016. We found our MMT models provide little to no improvement over the text-only Transformer. Moreover, the impact of feeding MMT systems with incongruent images is negligible. Our observations are consistent with previous work (Barrault et al., 2018; Wu et al., 2021), namely that visual signals are not utilized.
Multisense. We also evaluate models trained on Multi30K on the Multisense test set (Gella et al., 2019). Similarly, no substantial difference is observed whether congruent or incongruent images are used. This suggests that it is not just a matter of the Test2016 test set containing too little textual ambiguity, but that the model has not learned to incorporate the visual information necessary for Multisense.

Multi30K TR→EN
Our experiments on the TR→EN version of Multi30K that we created do not show any substantial improvements in image awareness, which we attribute to the relative sparsity of gender ambiguity at training and test time (see Tab. 1).

Ambiguous Captions
Training the same multimodal models on the Ambiguous Captions dataset results in substantial improvements in terms of both BLEU scores and gender accuracy compared to our text-only baseline. This suggests that the high level of textual ambiguity in this dataset encourages MMT models to exploit visual information. We further test this hypothesis by repeating the experiment with shuffled images, and observe that the models' performance substantially deteriorates, especially their ability to infer the correct gender pronouns. For instance, the gated fusion model has an impressive gender accuracy of 80.9% compared to 73.9% for the text-only Transformer, while it drops to 64.4% when incongruent images are used. We find that the gated fusion and concatenation models behave similarly, indicating that the choice of dataset has a bigger effect on the success of multimodal modeling than the specific architecture.

Effect of Word Dropout
We found word dropout tends to increase image awareness for the concatenation model. This is most evident for Multi30K (TR→EN), where image awareness increases by ≈ 300%. For the gated fusion model, although word dropout leads to more differences in translations between congruent and incongruent image-text alignments (e.g. on Multi30K (TR→EN), 20 differences without dropout, 192 with dropout), it is not well reflected by the image awareness metric. The reason remains to be further inspected.
Despite having the desired effect of increasing image awareness on the concatenation model, we observe some deterioration of BLEU and gender accuracy compared to the model trained without word dropout; still, we hope that our results serve as a proof-of-concept to motivate future research on regularization schemes that aim to (re)balance visual and textual signal. We note the success of work done in parallel to ours that applied word dropout to increase context usage in context-aware machine translation (Fernandes et al., 2021).

Conclusion
Our experiments explain recent failures in MMT, and show that the models we examine successfully learn to rely more on images when textual ambiguity is high (as in our back-translated Turkish-English dataset) or when textual information is dropped out. Our results suggest that simple MMT models have some capacity to integrate visual and textual information, but their effectiveness is hidden when training on datasets where the visual signal provides little information. In the long term, we hope to identify real-world applications where multimodal context naturally provides a strong disambiguation signal. For the near future, we release our dataset and encourage researchers to utilize it to validate future research on multimodal translation models. For example, we are interested in the conditions under which multimodal models learn to exploit the visual signal: does the absolute frequency of examples with textual ambiguity matter more, or their proportion?

Broader Impact Statement
Our dataset inherits biases from the Conceptual Captions dataset. We cannot rule out gender bias in the dataset similar to the one described by Zhao et al. (2017), with males and females showing different distributions, and we only studied a subset of captions with unambiguously male or female pronouns. Despite potential issues with our dataset (which we consider unsuitable for use in production because of aggressive filtering), we believe our work on improving MMT has a positive effect on gender fairness, since multimodal systems with audiovisual clues have the potential to reduce gender bias compared to systems that only rely on textual co-occurrence frequencies.