Cross-lingual Visual Pre-training for Multimodal Machine Translation

Pre-trained language models have been shown to substantially improve performance in many natural language tasks. Although the early focus of such models was single language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lingual representations. Specifically, we extend the translation language modelling objective (Lample and Conneau, 2019) with masked region classification and perform pre-training with three-way parallel vision & language corpora. We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance. We also provide qualitative insights into the usefulness of the learned grounded representations.


Introduction
Pre-trained language models (Peters et al., 2018; Devlin et al., 2019) have proven to be valuable tools for contextual representation extraction. Many studies have shown their effectiveness in discovering linguistic structures (Tenney et al., 2019), which is useful for a wide variety of NLP tasks (Talmor et al., 2019; Kondratyuk and Straka, 2019; Petroni et al., 2019). These positive results led to further exploration of (i) cross-lingual pre-training (Lample and Conneau, 2019; Conneau et al., 2020; Wang et al., 2020) through the use of multiple monolingual and parallel resources, and (ii) visual pre-training, where large-scale image captioning corpora are used to induce grounded vision & language representations (Lu et al., 2019; Tan and Bansal, 2019; Li et al., 2020a; Li et al., 2020b). The latter is usually achieved by extending the masked language modelling (MLM) objective (Devlin et al., 2019) with auxiliary vision & language tasks such as masked region classification and image-sentence matching.
In this paper, we present the first attempt to bring together cross-lingual and visual pre-training. Our visual translation language modelling (VTLM) objective combines translation language modelling (TLM) (Lample and Conneau, 2019) with masked region classification (MRC) (Chen et al., 2020) to learn grounded cross-lingual representations. Unlike most of the prior work that uses classification or retrieval based downstream evaluation, we focus on the generative task of multimodal machine translation (MMT), where images accompany captions during translation (Sulubacak et al., 2020). Once pre-trained, we transfer the VTLM encoder to a Transformer-based (Vaswani et al., 2017) MMT and fine-tune it for the MMT task. To our knowledge, this is also the first attempt at pre-training & fine-tuning for MMT, where the current state of the art mostly relies on training multimodal sequence-to-sequence systems from scratch (Calixto et al., 2016; Caglayan et al., 2016; Libovický and Helcl, 2017; Elliott and Kádár, 2017; Caglayan et al., 2017; Yin et al., 2020).
Our findings highlight the effectiveness of cross-lingual visual pre-training: when fine-tuned on the English→German direction of the Multi30k dataset, our MMT model surpasses our constrained MMT baseline by about 10 BLEU and 8 METEOR points. The rest of the paper is organised as follows: §2 describes our pre-training and fine-tuning protocol, §3 presents our quantitative and qualitative analyses, and §4 concludes the paper with pointers for future work.

Method
We propose the Visual Translation Language Modelling (VTLM) objective to learn multimodal cross-lingual representations. In what follows, we first describe the TLM objective (Lample and Conneau, 2019) and then introduce the modifications required to extend it to VTLM.

Translation language modelling
The TLM objective is based on Transformer networks and assumes the availability of parallel corpora during training. It defines the input $x$ as the concatenation of an $m$-length source language sentence $s^{(1)}_{1:m}$ and an $n$-length target language sentence $s^{(2)}_{1:n}$:

$$x = \left[ s^{(1)}_1, \cdots, s^{(1)}_m, s^{(2)}_1, \cdots, s^{(2)}_n \right]$$

For a given input, TLM follows Devlin et al. (2019) and selects a random set of input tokens $y = \{s^{(l)}_1, \ldots, s^{(l)}_k\}$ for masking. Let us denote the masked input sequence with $\hat{x}$, and the ground-truth targets for masked positions with $\hat{y}$. TLM employs the masked language modelling (MLM) objective to maximise the log-probability of the correct labels $\hat{y}$, conditioned on the masked input $\hat{x}$:

$$\mathcal{L}_{\mathrm{TLM}}(\theta) = \log P_\theta\left(\hat{y} \mid \hat{x}\right)$$

where $\theta$ are the model parameters. We keep the standard hyper-parameters for masking, i.e. 15% of inputs are randomly selected for masking, of which 80% are replaced with the [MASK] token, 10% are replaced with random tokens from the vocabulary, and 10% are left intact.
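For concreteness, the following is a minimal sketch of this standard 80/10/10 masking scheme applied to the concatenated source-target token sequence. The function and argument names (mask_tokens, mask_id) are illustrative and do not correspond to the authors' implementation.

import torch

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Return (masked_inputs, targets) for TLM-style masked prediction."""
    targets = token_ids.clone()
    selected = torch.rand(token_ids.shape) < mask_prob        # ~15% of positions
    targets[~selected] = -100                                  # ignored by the loss
    masked_inputs = token_ids.clone()
    decision = torch.rand(token_ids.shape)
    # 80% of selected positions -> [MASK]
    masked_inputs[selected & (decision < 0.8)] = mask_id
    # 10% of selected positions -> a random vocabulary token
    random_tokens = torch.randint(0, vocab_size, token_ids.shape)
    replace_random = selected & (decision >= 0.8) & (decision < 0.9)
    masked_inputs[replace_random] = random_tokens[replace_random]
    # the remaining 10% of selected positions are left intact
    return masked_inputs, targets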

Visual translation language modelling
VTLM extends the TLM by adding the visual modality alongside the translation pairs (Figure 1). Therefore, we assume the availability of sentence pair & image triplets and redefine the input as:

$$x = \left[ s^{(1)}_1, \cdots, s^{(1)}_m, s^{(2)}_1, \cdots, s^{(2)}_n, v_1, \cdots, v_o \right]$$

where $\{v_1, \cdots, v_o\}$ are features extracted from a Faster R-CNN model (Ren et al., 2015) pre-trained on the Open Images dataset (Kuznetsova et al., 2018). Specifically, we extract convolutional feature maps from the $o = 36$ most confident regions, and average pool each of them to obtain a region-specific feature vector $v_i \in \mathbb{R}^{1536}$. Each region $i$ is also associated with a detection label $\hat{v}_i$ provided by the extractor. Before encoding, the feature vectors and their bounding box coordinates are projected into the language embedding space. The final model processes translation pairs and projected region features in a single-stream fashion (Li et al., 2020a), and combines the TLM loss with the masked region classification (MRC) loss as follows:

$$\mathcal{L}_{\mathrm{VTLM}}(\theta) = \log P_\theta\left(\hat{y}, \hat{v} \mid \hat{x}\right)$$

Masking. A 15% random masking ratio is applied separately to both the language and visual streams, and $\hat{v}$ above now denotes the correct region labels for the masked feature positions. Different from previous work that zeroes out masked regions (Tan and Bansal, 2019), VTLM replaces their projected feature vectors with the [MASK] token embedding. Similar to textual masking, 10% of the random masking amounts to using regional features randomly sampled from all images in the batch, and the remaining 10% of regions are left intact.
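The sketch below illustrates how the regional features could be projected into the embedding space and masked before being concatenated with the token embeddings. Module and variable names (RegionProjector, mask_regions) and the model dimension are assumptions, and the 10% random / 10% intact splits for visual masking are omitted for brevity.

import torch
import torch.nn as nn

class RegionProjector(nn.Module):
    """Project Faster R-CNN region features and bounding-box coordinates into the
    language embedding space, as described above (dimensions are illustrative)."""
    def __init__(self, d_model=512, feat_dim=1536, box_dim=4):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)
        self.box_proj = nn.Linear(box_dim, d_model)

    def forward(self, feats, boxes):
        # feats: (batch, 36, 1536), boxes: (batch, 36, 4) -> (batch, 36, d_model)
        return self.feat_proj(feats) + self.box_proj(boxes)

def mask_regions(region_embs, mask_embedding, mask_prob=0.15):
    """Replace ~15% of projected region vectors with the [MASK] token embedding."""
    selected = torch.rand(region_embs.shape[:2]) < mask_prob
    masked = region_embs.clone()
    masked[selected] = mask_embedding      # same embedding as the textual [MASK] token
    return masked, selected                # `selected` marks targets for the MRC loss

# Single-stream input to the Transformer encoder:
#   x_hat = torch.cat([masked_token_embeddings, masked_region_embeddings], dim=1)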

Pre-training
VTLM requires a three-way parallel multimodal corpus, which does not exist at large scale. To address this, we extend the Conceptual Captions (CC) dataset (Sharma et al., 2018) with German translations. CC is a large-scale collection of ∼3.3M images retrieved from the Internet, with noisy alt-text captions in English. The translation of the English captions into German was performed automatically using an existing NMT model provided in the Fairseq toolkit. Since some of the images are no longer accessible, the final corpus size is reduced to ∼3.1M triplets. We used byte pair encoding (BPE) (Sennrich et al., 2016) to learn a joint 50k BPE model on the CC dataset.
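The corpus construction step can be illustrated with a pre-trained English→German model loaded through torch.hub, as sketched below. The specific checkpoint used by the authors is not reproduced here; 'transformer.wmt19.en-de.single_model' is only an example of a publicly released Fairseq model.

import torch

# Load a publicly released English->German Fairseq model via torch.hub.
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model',
                       tokenizer='moses', bpe='fastbpe')
en2de.eval()

english_captions = ['a dog runs across a grassy field']
german_captions = [en2de.translate(caption) for caption in english_captions]
print(german_captions)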

Settings. The pre-training is done for 1.5M steps using a single RTX2080-Ti GPU, and the best checkpoints are selected with respect to validation accuracy.

Baseline MT models and fine-tuning
Our experimental protocol consists of initialising the encoder and the decoder of Transformer-based NMT and MMT models with weights from TLM/VTLM, and fine-tuning them with a smaller learning rate. The architectural difference between the NMT and the MMT models is that the latter encodes the 36 regional visual features as part of the source sequence, similar to the VTLM (§2.2). As a natural baseline, we train constrained models (trained only on the MT dataset) without transferring weights from the pre-trained TLM/VTLM models. We refer to these models as from-scratch. For the fine-tuning experiments, we train three runs with different seeds. For evaluation, we use the models with the lowest validation set perplexity to decode translations with a beam size of 8.

Settings. For fine-tuning, we use the same hyper-parameters as the pre-training phase, apart from decreasing the learning rate to 1e−5. For the MT models that are trained from scratch, we increase the dropout rate to 0.4 and linearly warm up the learning rate from 1e−7 to 1e−4 during the first 4,000 iterations. Inverse square-root annealing is applied after 4,000 iterations.
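The learning-rate schedule for the from-scratch models can be written as a small standalone function, as sketched below; this is not the toolkit's own scheduler, only an illustration of the warm-up and annealing described above.

def learning_rate(step, lr_min=1e-7, lr_max=1e-4, warmup=4000):
    """Linear warm-up followed by inverse square-root annealing."""
    if step < warmup:
        # linear warm-up from lr_min to lr_max over the first `warmup` steps
        return lr_min + (lr_max - lr_min) * step / warmup
    # inverse square-root decay afterwards
    return lr_max * (warmup / step) ** 0.5

# learning_rate(0) ~ 1e-7, learning_rate(4000) = 1e-4, learning_rate(16000) = 5e-5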

Machine translation
We see that the best performances are obtained when models are first pre-trained on the three-way parallel Conceptual Captions (CC) dataset. To validate this further, we train a baseline NMT on the concatenation of Multi30k and CC (NMT+CC) and an MMT that uses only Multi30k for both pre-training and fine-tuning. The results clearly show that these systems lag behind the ones pre-trained on CC.

We also experimented with an alternative pre-training strategy where we do not mask visual regions. Interestingly, this alternative MMT in Table 1 reveals that not masking visual regions during pre-training yields slightly better results overall. This is equivalent to letting the model predict the object labels from a multimodal input where words are stochastically masked but regional features are kept intact. Overall, MMT fine-tuning on VTLM sets a new state of the art across all Multi30k test sets. We leave the exploration of visual region masking for the MRC task as future work and proceed with the alternative variant in the following experiments.

Table 1: Quantitative comparison of experiments: when the mean and the standard deviation are reported, the single numbers appearing above denote the maximum across three different runs.
Encoder attention parameters. When fine-tuning the TLM for MT, the default XLM implementation randomly initialises the decoder's missing encoder attention parameters. In our experiments, we noticed that copying those parameters from the TLM self-attention layers substantially improves the results, by up to 2.2 BLEU.
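A minimal sketch of this initialisation is given below, assuming each decoder layer exposes its cross-attention module under an attribute such as encoder_attn; the attribute and function names are assumptions rather than the XLM implementation's actual API.

import copy

def init_cross_attention(decoder_layers, pretrained_self_attentions):
    """Initialise each decoder layer's encoder-attention (cross-attention) module
    with the weights of the corresponding pre-trained TLM/VTLM self-attention
    module, instead of the default random initialisation."""
    for layer, self_attn in zip(decoder_layers, pretrained_self_attentions):
        layer.encoder_attn.load_state_dict(copy.deepcopy(self_attn.state_dict()))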

Explicit masking
Here, we evaluate the extent to which the visual information is taken into account (i) when TLM/VTLM predicts masked tokens, and (ii) when the fine-tuned NMT and MMT models are forced to translate source sentences with missing visual entities. For the latter, we use Flickr30k Entities (Plummer et al., 2015) to mask head nouns in the 2016 test set sentences, similar to previous work.
Last-word masking. In this experiment, we measure the target word prediction accuracy when the last tokens of the input caption pairs are systematically masked during evaluation. The results show that the visual information is much more helpful (i.e. up to 6% accuracy improvement) when the last tokens are masked in both the English and German captions. However, if one caption is available, it provides enough context for cross-lingual prediction. Finally, when we shuffle (+shuf) the test set features to introduce incongruence (Elliott, 2018), the VTLM model deteriorates substantially. This confirms that the accuracy improvements are not due to side-effects of experimentation noise, such as regularisation or random seed related effects.
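The two probing manipulations above can be expressed as simple list operations, as in the sketch below; tokenisation and feature handling are simplified assumptions.

import random

def mask_last_token(tokens, mask_symbol='[MASK]'):
    # e.g. ['a', 'dog', 'runs'] -> ['a', 'dog', '[MASK]']
    return tokens[:-1] + [mask_symbol]

def shuffle_features(image_features, seed=0):
    """Shuffle image features across the test set so captions are paired with
    mismatched images (incongruent input)."""
    shuffled = list(image_features)
    random.Random(seed).shuffle(shuffled)
    return shuffled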
Entity masking in MT. We devise two ways of masking entities: we either replace them with the [MASK] token or remove them entirely so that the masking phenomenon is not known to the model. The results in Table 3 show that MMT models can recover the missing source context to some extent, but only when they are pre-trained using the proposed VTLM objective. In other words, the grounding ability can only be acquired when the visual modality is present for both pre-training and fine-tuning. The gap between MASK and REMOVE also seems to highlight the importance of reserving a source position even if it is corrupted/masked.
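The difference between the two degradation schemes is illustrated below on a toy sentence where the head noun "dog" is the masked entity; the helper name degrade is purely illustrative.

def degrade(tokens, entity_positions, mode='mask', mask_symbol='[MASK]'):
    """Apply one of the two entity degradation schemes described above."""
    if mode == 'mask':
        # keep the source position but hide the entity
        return [mask_symbol if i in entity_positions else t for i, t in enumerate(tokens)]
    # mode == 'remove': drop the position entirely, so the corruption is invisible
    return [t for i, t in enumerate(tokens) if i not in entity_positions]

tokens = ['a', 'dog', 'runs', 'across', 'a', 'field']
print(degrade(tokens, {1}, mode='mask'))    # ['a', '[MASK]', 'runs', 'across', 'a', 'field']
print(degrade(tokens, {1}, mode='remove'))  # ['a', 'runs', 'across', 'a', 'field']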

Visual attention in MMT
Here we take the MMT decoder's cross-attention layers and measure the attention mass they attribute to the regional features in the input embeddings. Although the encoder's self-attention layers produce increasingly mixed contextual embeddings as we move towards the top layers, Brunner et al. (2020) show that the final layer states still encode the corresponding input embeddings to some extent. With this assumption at hand, Figure 2 shows the average attention mass attributed to the first 36 (visual) top-layer encoding states by each cross-attention layer in the decoder. We find these results to be in agreement with the quantitative metrics (Table 1), with VTLM-MMT assigning substantially more attention to these positions compared to TLM-MMT and the from-scratch MMT.
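The measurement can be sketched as follows, assuming the decoder exposes per-layer cross-attention weight tensors; the function name and tensor layout are assumptions for illustration.

import torch

def visual_attention_mass(cross_attention_weights, num_regions=36):
    """Average attention mass that each decoder cross-attention layer places on the
    first `num_regions` (visual) encoder positions.

    `cross_attention_weights` is assumed to be a list with one tensor per decoder
    layer, each of shape (batch, heads, target_len, source_len)."""
    per_layer = []
    for layer_attn in cross_attention_weights:
        # sum over the visual source positions, then average over batch, heads, steps
        mass = layer_attn[..., :num_regions].sum(dim=-1).mean()
        per_layer.append(mass.item())
    return per_layer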

Conclusions
We proposed a novel cross-lingual visual pre-training approach and tested its efficacy for multimodal machine translation. Our pre-training approach extends the TLM framework (Lample and Conneau, 2019) with regional features and performs masked language modelling and masked region classification on a three-way parallel corpus. We show that this leads to substantial improvements compared to multimodal machine translation with cross-lingual pre-training only or without pre-training at all. As future work, we plan to explore more informed masking strategies for visual regions and to investigate the impact of the visual masking probability in the MRC pre-training task on downstream MMT performance.