Neural Machine Translation with Phrase-Level Universal Visual Representations

Multimodal machine translation (MMT) aims to improve neural machine translation (NMT) with additional visual information, but most existing MMT methods require paired input of source sentence and image, which makes them suffer from a shortage of sentence-image pairs. In this paper, we propose a phrase-level retrieval-based method for MMT that obtains visual information for the source input from existing sentence-image datasets, so that MMT can break the limitation of paired sentence-image input. Our method performs retrieval at the phrase level and hence learns visual information from pairs of source phrases and grounded regions, which mitigates data sparsity. Furthermore, our method employs a conditional variational auto-encoder to learn visual representations that filter out redundant visual information and retain only the information related to the phrase. Experiments show that the proposed method significantly outperforms strong baselines on multiple MMT datasets, especially when the textual context is limited.


Introduction
Multimodal machine translation (MMT) introduces visual information into neural machine translation (NMT), under the assumption that the additional visual modality can improve NMT by grounding the language in a visual space. However, most existing MMT methods require additional input images that match the source sentence to provide visual representations. Unfortunately, in practice it is difficult to obtain this kind of paired input of text and images, which hinders the application of MMT. What is worse, training an MMT model requires the target sentence in addition to the source sentence and the image, which is costly to collect. As a result, MMT models are usually trained on the small Multi30K dataset, which limits their performance. Therefore, it is necessary to utilize separate image datasets to obtain visual representations and break the constraint of paired input.
Towards this end, some researchers propose to integrate a retrieval module into NMT, which retrieves images related to the source sentence from existing sentence-image pairs as complementary input, and then uses a pre-trained convolutional neural network (CNN) to encode the images. However, such sentence-level retrieval usually suffers from sparsity, as it is difficult to find images that properly match the source sentence. Besides, the visual features output by the CNN contain richer information (e.g., color, size, shape, texture, and background) than the source text, so encoding them in a bundle without any filtering introduces noise into the model.
To solve these problems, we propose a novel retrieval-based method for MMT that learns phrase-level visual representations for the source sentence, which mitigates the aforementioned problems of sparse retrieval and redundant visual representations. For the sparsity problem, our method retrieves images at the phrase level and only refers to the grounded region in the image related to the phrase. For the redundancy problem, our method employs a conditional variational auto-encoder to force the learned representations to properly reconstruct the source phrase, so that they retain only the information related to the source phrase. Experiments on Multi30K show that the proposed method gains significant improvements over strong baselines. When the textual context is limited, it achieves up to an 85% gain in BLEU over the text-only baseline. Further analysis demonstrates that the proposed method can obtain visual information that is more relevant to translation quality.

Phrase-Guided Visual Representation
In this section, we introduce our proposed phrase-guided visual representation, which we use to improve NMT. We first build a phrase-level image set, and then introduce a latent-variable model to learn a phrase-guided visual representation for each image region.

Phrase-Level Image Set
Our phrase-level image set is built from the training set of Multi30K, which contains about 29K bilingual sentence-image pairs. We only use the images e and source descriptions x from them, denoted as D = {(x_i, e_i)}. We extract <noun phrase, image region> pairs from the <sentence, image> pairs in D to build our phrase-level image set, which is denoted as D_p.
For each sentence x_i, we use the open-source library spaCy (https://spacy.io) to identify the noun phrases, denoted as P_i = (p_i1, p_i2, ..., p_it_i), where t_i is the number of noun phrases in x_i. For each noun phrase p_ij, we detect the corresponding region r_ij from the paired image e_i using a visual grounding toolkit. Then (p_ij, r_ij) is added to our phrase-level image set D_p. Figure 1 illustrates an example.
Finally, we obtain the phrase-level image set D_p, which contains about 102K pairs in total.
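The construction above can be sketched in a few lines. Here `find_noun_phrases` and `ground` are stand-ins for spaCy's noun-chunk detection and the visual grounding toolkit, neither of which is reproduced here:

```python
def build_phrase_image_set(dataset, find_noun_phrases, ground):
    """Build the phrase-level image set D_p from <sentence, image> pairs.

    dataset           -- iterable of (sentence, image) pairs (the set D)
    find_noun_phrases -- callable returning the noun phrases of a sentence
                         (spaCy's doc.noun_chunks in the paper; stubbed here)
    ground            -- callable mapping (phrase, image) to the grounded
                         image region (the visual grounding toolkit)
    """
    d_p = []
    for sentence, image in dataset:
        for phrase in find_noun_phrases(sentence):
            region = ground(phrase, image)  # grounded region for this phrase
            d_p.append((phrase, region))
    return d_p
```

The callables are injected so the bookkeeping can be tested independently of the heavyweight NLP and vision components.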

Latent-Variable Model
[Figure 1 example sentence: "a black dog jumping to catch a rope toy"]

For an image region r, we can obtain the visual features v with a pre-trained ResNet-101 Faster R-CNN (He et al., 2016; Ren et al., 2015), which contain rich visual information (e.g., color, size, shape, texture, and background). However, the model should not attend to visual information that is not mentioned in the corresponding phrase, as it introduces too much noise and can even be harmful to NMT. Therefore, we further introduce a continuous latent variable to explicitly model the semantic information of image regions under the guidance of phrases. We adopt the framework of the conditional variational auto-encoder (CVAE) (Kingma and Welling, 2014; Sohn et al., 2015) to maximize the conditional marginal log-likelihood

log p(p|v) = log ∫_z p(p|z, v) p(z|v) dz

by maximizing the evidence lower bound (ELBO):

ELBO = E_{q_φ(z|p,v)} [log p_θ(p|z, v)] − KL(q_φ(z|p, v) || p_ω(z|v)),   (1)

where p_ω(z|v) is the prior, q_φ(z|p, v) is an approximate posterior, and p_θ(p|z, v) is the decoder. The prior p_ω is modeled as a Gaussian distribution whose mean and variance vectors are computed by linear transformations of v, where Linear(·) denotes a linear transformation.
The approximate posterior q_φ(z|p, v) is also modeled as a Gaussian distribution, where RNN(·) denotes a single-layer unidirectional recurrent neural network (RNN); the final hidden state of the RNN is used to compute the mean and variance vectors through linear transformations.
To be able to update the parameters using backpropagation, we use the reparameterization trick (Kingma and Welling, 2014) to sample z from q_φ:

z = μ + σ ⊙ ε,  ε ~ N(0, I).

The decoder p_θ(p|z, v) is also implemented by a single-layer unidirectional RNN. The initial hidden state s of the decoder RNN is computed from the latent variable z and the visual features v, and the decoder then reconstructs the phrase p based on s. We refer to s as the phrase-guided visual representation, since it pays more attention to the semantic information mentioned in the phrase and filters out irrelevant information. We will describe how to incorporate it into NMT in the next section.
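A minimal NumPy sketch of the reparameterization step and the KL term of the ELBO, assuming diagonal Gaussians as above (function names are illustrative, not the paper's):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so gradients can
    flow through mu and log_var (Kingma and Welling, 2014)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_diag_gaussians(mu_q, lv_q, mu_p, lv_p):
    """KL(q || p) between diagonal Gaussians (the ELBO regularizer):
    0.5 * sum(log(s_p^2 / s_q^2) + (s_q^2 + (mu_q - mu_p)^2) / s_p^2 - 1),
    with lv_* denoting log-variance vectors."""
    return 0.5 * np.sum(
        lv_p - lv_q + (np.exp(lv_q) + (mu_q - mu_p) ** 2) / np.exp(lv_p) - 1.0)
```

In training, the KL term is what pushes the posterior toward the visual prior p_ω(z|v), while the reconstruction term forces z to retain the phrase-relevant information.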

NMT with Phrase-Level Universal Visual Representation
In this section, we introduce our retrieval-based MMT method. Specifically, we obtain visual context through our proposed phrase-level visual retrieval, and then learn a universal visual representation for each phrase in the source sentence, which is used to improve NMT. Figure 2 shows the overview of our proposed method, which is composed of four modules: a source encoder, a phrase-level visual retrieval module, a multimodal aggregation module, and a target decoder. The source encoder and target decoder are the same as the encoder and decoder of the conventional text-only Transformer (Vaswani et al., 2017). Therefore, we describe the phrase-level visual retrieval module and the multimodal aggregation module in detail in the rest of this section. We denote the input source sentence as x = (x_1, x_2, ..., x_n), the ground-truth target sentence as y* = (y*_1, y*_2, ..., y*_m), and the generated translation as y = (y_1, y_2, ..., y_m). The input source sentence x is encoded by the source encoder to obtain the source sentence representation H = (h_1, h_2, ..., h_n).

Phrase-Level Visual Retrieval Module
To obtain the visual context of the source sentence without paired input images, we design a phrase-level visual retrieval module. Specifically, for the input sentence x = (x_1, x_2, ..., x_n), we identify the noun phrases P̂ = (p̂_1, p̂_2, ..., p̂_t) in x. Each phrase p̂_i = (x_{l_i}, x_{l_i+1}, ..., x_{l_i+d_i−1}) is a contiguous span of tokens, where l_i is the index of the first token and d_i is the length of p̂_i. For each noun phrase p̂_i, we retrieve several relevant <noun phrase, image region> pairs from our phrase-level image set D_p according to the semantic similarity between phrases, and then use the image regions as visual context. We design a phrase encoder to compute phrase embeddings, which are used to measure the semantic similarity between phrases.
Phrase Encoder. Our phrase encoder Enc_p(·) is based on a pre-trained BERT (Devlin et al., 2019). For a phrase p = (p_1, p_2, ..., p_l), we first use BERT to encode it into contextual embeddings; the phrase embedding is then the average embedding of all tokens.

Visual Retrieval. For a given phrase p̂, we retrieve the top-K relevant <noun phrase, image region> pairs from D_p. For (p_i, r_i) ∈ D_p, the relevance score with the given phrase p̂ is defined as the cosine similarity between their phrase embeddings, and we retrieve the top-K relevant pairs for p̂ accordingly.

Universal Visual Representation. For each retrieved pair (p_{i_k}, r_{i_k}), we obtain the phrase-guided visual representation s_{i_k} through our latent-variable model as described in Section 2.2. Finally, the phrase-level universal visual representation u of p̂ is defined as the weighted sum of all {s_{i_k}}. Our universal visual representation considers multi-view visual information from several image regions, which avoids the bias caused by a single image region. Finally, for all phrases P̂ = (p̂_1, p̂_2, ..., p̂_t) in x, we obtain the corresponding universal visual representations U = (u_1, u_2, ..., u_t).
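The retrieval and mixing steps can be sketched as follows. The softmax weighting over the relevance scores is an assumption: the paper specifies a weighted sum, but the exact weights are not shown here.

```python
import numpy as np

def relevance(query_emb, cand_embs):
    """Cosine similarity between one phrase embedding and all candidates."""
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    return c @ q

def universal_visual_representation(query_emb, cand_embs, cand_s, k=5):
    """Retrieve the top-K pairs by relevance and combine their phrase-guided
    visual representations s into a single vector u (softmax weighting over
    the top-K relevance scores is an assumed choice)."""
    scores = relevance(query_emb, cand_embs)
    top = np.argsort(scores)[::-1][:k]   # indices of the top-K pairs
    w = np.exp(scores[top])
    w /= w.sum()                         # normalized weights
    return cand_s[top].T @ w             # u = sum_k w_k * s_{i_k}
```

In practice the candidate embeddings would be precomputed for all of D_p, so retrieval reduces to a single matrix-vector product per phrase (or an approximate nearest-neighbor index for larger sets).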

Multimodal Aggregation Module
Inspired by the recent success of modality fusion in multimodal machine translation (Fang et al., 2022), we design a simple multimodal aggregation module to fuse the source sentence representation H and the phrase-level universal visual representations U. First, we perform a phrase-level aggregation: for each phrase p̂_i, we fuse the universal visual representation u_i and the textual representations of the corresponding tokens with an element-wise product (denoted ⊙), obtaining the multimodal phrase representation M = (m_1, m_2, ..., m_t). Afterwards, we apply a multi-head attention mechanism to append M to the source sentence representation, yielding S̃. We then fuse S̃ and H with a gate mechanism to obtain S. Finally, S is fed into our target decoder for predicting the translation. The translation model is trained with a cross-entropy loss:

L = − Σ_{j=1}^{m} log p(y*_j | y*_{<j}, x).
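A sketch of the gate mechanism, assuming the common parameterization g = sigmoid(Linear([S̃; H])); this is an illustrative choice rather than the exact one used in the model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(s_tilde, h, w, b):
    """Gate mechanism fusing the attended multimodal representation S~ with
    the textual representation H:  S = g * S~ + (1 - g) * H, with the gate
    g = sigmoid(Linear([S~; H])) (an assumed parameterization)."""
    g = sigmoid(np.concatenate([s_tilde, h], axis=-1) @ w + b)
    return g * s_tilde + (1.0 - g) * h
```

Because g is computed per dimension, the model can lean on the visual signal for grounded tokens while falling back to the purely textual representation elsewhere.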

Experiments
We conduct experiments on the following datasets.

Table 1: BLEU scores on the Multi30K dataset. * and ** mean the improvement over the Transformer (Vaswani et al., 2017) baseline is statistically significant (p < 0.05 and p < 0.01, respectively).
WMT16 EN-RO. The WMT16 EN-RO dataset contains about 0.6M sentence pairs. We choose newsdev2016 for validation and newstest2016 for testing. For all of the above datasets, all sentences are tokenized and segmented into subword units using byte-pair encoding (BPE) (Sennrich et al., 2016). The vocabulary is shared between the source and target languages, with a size of 10K for Multi30K, and 40K for WMT16 EN-DE and WMT16 EN-RO.

System Settings
Model Implementation. For the latent-variable model, the image region is encoded with a pre-trained ResNet-101 Faster R-CNN (He et al., 2016; Ren et al., 2015). Both the phrase encoder and decoder are implemented as single-layer unidirectional RNNs with 512 hidden units. The size of the latent variable is set to 64. The batch size is 1024, and the learning rate is 5e-5. We train the model for up to 200 epochs with the Adam optimizer (Kingma and Ba, 2015). We adopt the KL cost annealing and word dropout tricks to alleviate the posterior collapse problem, following Bowman et al. (2016). The annealing step is set to 20000 and the word dropout rate is set to 0.1. Note that the phrases are segmented using the same BPE vocabulary as that of the corresponding source language.
For the translation model, we use the Transformer (Vaswani et al., 2017) as our baseline. Both the encoder and decoder contain 6 layers. The number of attention heads is set to 4. The dropout is set to 0.3, and the label smoothing value is set to 0.1. For the visual retrieval module, we retrieve the top-5 image regions for each phrase. We use the Adam optimizer (Kingma and Ba, 2015) to tune the parameters. The learning rate is varied under a warm-up strategy with 2,000 steps. We train the model for up to 8,000, 20,000, and 250,000 steps for Multi30K, WMT16 EN-RO, and WMT16 EN-DE, respectively. We average the checkpoints of the last 5 epochs for evaluation. We use beam search with a beam size of 4. Different from previous work, we use sacreBLEU (Post, 2018) to compute BLEU (Papineni et al., 2002) scores and assess the statistical significance of translation results with paired bootstrap resampling (Koehn, 2004), to facilitate standard comparison across papers. Specifically, we measure case-insensitive detokenized BLEU for Multi30K (sacreBLEU signature: nrefs:1 | bs:1000 | seed:12345 | case:lc | eff:no | tok:13a | smooth:exp | version:2.0.0) and case-sensitive detokenized BLEU for the WMT datasets (sacreBLEU signature: nrefs:1 | bs:1000 | seed:12345 | case:mixed | eff:no | tok:13a | smooth:exp | version:2.0.0).
All models are trained and evaluated on two RTX 3090 GPUs. We implement the translation model based on fairseq (Ott et al., 2019). The latent-variable model and the translation model are trained separately.

Baseline Systems
Our baseline is the text-only Transformer (Vaswani et al., 2017). Besides, we implement Imagination (Elliott and Kádár, 2017) and UVR-NMT based on the Transformer, and compare our method with them. The details of these methods can be found in Section 6. We use the same configuration for all baseline systems as for our model.

Table 1 shows the results on Multi30K. Our proposed method significantly outperforms the Transformer (Vaswani et al., 2017) baseline, demonstrating that our proposed phrase-level universal visual representation is helpful to NMT. Our method also surpasses Imagination (Elliott and Kádár, 2017) and UVR-NMT. We attribute this mainly to the following reasons. First, our phrase-level visual retrieval obtains strongly correlated image regions instead of weakly correlated whole images. Second, our phrase-level universal visual representation considers visual information from multiple image regions and pays more attention to the semantic information mentioned in the phrases. Last, our phrase-level aggregation module makes it easier for the translation model to exploit the visual information.

Effects of Latent-Variable Model
In Section 2.2, we introduced a latent-variable model to learn a phrase-guided visual representation for each image region. To understand how it improves the model performance compared with the original visual features, we visualize the representations by reducing their dimension with Principal Component Analysis (PCA). Specifically, for all <noun phrase, image region> pairs in D_p, we cluster the image regions by the head (https://en.wikipedia.org/wiki/Head_(linguistics)) of the noun phrases. We select the top-8 clusters according to their size, and randomly sample 1000 image regions for each cluster. As shown in Figure 3, the original visual features of different clusters are mixed together, indicating that they contain too much irrelevant information. In contrast, our proposed phrase-guided visual representations, which pay more attention to the semantic information, form several clusters according to their heads.
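The PCA projection used for this kind of visualization can be sketched without any plotting library; this is an SVD-based sketch and does not reproduce the paper's exact preprocessing:

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project feature vectors onto their top principal components:
    center the data, then project onto the leading right singular
    vectors of the centered matrix."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

The resulting 2-D points can then be scattered per cluster; well-separated clusters in the projection suggest the representations encode the head-noun semantics.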
Combined with our visual retrieval module, we find that as the number of retrieved image regions K increases, the BLEU score keeps decreasing when we use the original visual features, while it increases when we use our proposed phrase-guided visual representations, as shown in Figure 4. We believe the decrease in BLEU score is due to the irrelevant information in the original visual features: directly summing them introduces too much noise. Our method filters out this irrelevant information, and multiple image regions help avoid the bias caused by a single one, which leads to the increase in BLEU score. However, we do not observe further improvements when using more image regions.

Source-Degradation Setting
We further conduct experiments under the source-degradation setting to verify the effectiveness of our method when the source textual context is limited. Following previous work, we mask the visually grounded tokens in the source sentence, which affects around 43% of the tokens in Multi30K. As shown in Table 2, our method achieves almost 85% improvement over the text-only Transformer baseline, which means our proposed phrase-level universal visual representation can effectively fill in the missing information.

Phrase-Level vs. Sentence-Level Retrieval
To prove the effectiveness of phrase-level retrieval, we implement a sentence-level variant of our method, in which we switch the latent-variable model, the retrieval module, and the aggregation module from the phrase level to the sentence level. In this way, we retrieve several images as visual context to help the translation.

[Table caption fragment: (Mask) indicates the source-degradation setting. * and ** mean the improvement over the Transformer (Vaswani et al., 2017) baseline is statistically significant (p < 0.05 and p < 0.01, respectively).]

As shown in Table 3, the sentence-level variant Ours-sentence performs worse than Ours, especially under the source-degradation setting. We believe this is because phrase-level retrieval obtains more relevant image regions as visual context, which contain less noise and can be integrated into the textual representations more precisely. In contrast, sentence-level retrieval yields images with much irrelevant information and makes it difficult for the model to capture the fine-grained semantic correspondences between images and descriptions. To illustrate this difference more intuitively, we give an example in Figure 5. For the input sentence, phrase-level retrieval obtains closely related image regions for the noun phrases a person and a black car, while the results of sentence-level retrieval are only weakly related to the input sentence.

Results on WMT News Datasets
Finally, we conduct experiments on the WMT16 EN-DE and WMT16 EN-RO datasets. As shown in Table 4, both the multimodal baseline methods and our method achieve only marginal improvements over the text-only Transformer baseline. We consider that there are two main reasons. On the one hand, most tokens in such news text are not naturally related to specific visual content. We found that the percentage of visually grounded tokens in the training set of WMT16 EN-DE is only 7% (vs. 43% in Multi30K), so the contribution of visual information is indeed limited. On the other hand, news text is far from the descriptive text in Multi30K, so the retrieved image regions are actually only weakly correlated with the source phrases. We performed some analysis to verify these hypotheses. As described in Section 3.1, we retrieve the top-K pairs for each phrase according to the relevance scores. We define the average relevance score (ARS) as the mean relevance score of the top-K retrieved pairs, averaged over all phrases in the validation set. As shown in Figure 6, the ARS on the WMT news datasets is much lower than that on Multi30K, which confirms that the gap between news text and descriptive text does exist.
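Under the definition above, ARS reduces to a simple double average; this sketch assumes the relevance scores of the top-K retrieved pairs have already been computed per phrase:

```python
def average_relevance_score(retrieved_scores):
    """ARS: the mean of the top-K relevance scores per phrase, averaged
    over all phrases. retrieved_scores holds one list of K scores for
    each phrase in the validation set."""
    per_phrase = [sum(scores) / len(scores) for scores in retrieved_scores]
    return sum(per_phrase) / len(per_phrase)
```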

Related Work
Multimodal machine translation (MMT) aims to enhance NMT (Vaswani et al., 2017; Zhang et al., 2019) with additional visual context. Since the release of the Multi30K dataset, researchers have proposed many MMT methods. Early methods (Huang et al., 2016; Caglayan et al., 2016; Calixto et al., 2016; Caglayan et al., 2017; Libovický and Helcl, 2017; Delbrouck and Dupont, 2017b,a; Zhou et al., 2018; Helcl et al., 2018; Caglayan et al., 2018) are mainly based on the RNN encoder-decoder architecture with attention (Bahdanau et al., 2015). Recent methods based on the Transformer (Vaswani et al., 2017) achieve better performance; for example, Yao and Wan (2020) adapt the Transformer to the multimodal setting, and other works use deliberation networks (Xia et al., 2017) or capsule networks (Sabour et al., 2017) to better utilize visual information during decoding. Other work proposes a cross-lingual visual pre-training method fine-tuned for MMT. It is worth noting that some previous works (Ive et al., 2019; Lin et al., 2020; Wang and Xiong, 2021; Nishihara et al., 2020) adopt regional visual information as we do, which shows effectiveness compared with global visual features. The major difference between our method and theirs is that ours is a retrieval-based method, which breaks the reliance on bilingual sentence-image pairs. Therefore, our method remains applicable when the input is text only (without paired images), which is unfortunately not the case for those previous methods.
In addition to focusing on model design, Nishihara et al. (2020) and Wang and Xiong (2021) propose auxiliary losses that allow the model to make better use of visual information. All of the above methods require a specific image as input to provide visual context, which heavily restricts their applicability. To break this bottleneck, Hitschler et al. (2016) propose target-side image retrieval to help the translation. Elliott and Kádár (2017) propose a multitask learning framework, Imagination, which decomposes multimodal translation into learning to translate and learning visually grounded representations. Calixto et al. (2019) introduce a latent variable and estimate a joint distribution over translations and images. Long et al. (2020) predict the translation with visual representations generated by a generative adversarial network (GAN) (Goodfellow et al., 2014). The most closely related work to our method is UVR-NMT, which also breaks the reliance on bilingual sentence-image pairs. Like some retrieval-enhanced MT methods (Feng et al., 2017; Gu et al., 2017), they build a topic-image lookup table from Multi30K, and then retrieve images related to the source sentence as visual context based on the topic words. The central differences between UVR-NMT and our method are as follows: • First, their method depends on the weak correlation between words and images, which leads to much noise in the retrieved images, while our approach relies on the strong correlation between noun phrases and image regions.
• Second, our phrase-level retrieval can obtain more related visual context than their sentence-level retrieval (Section 5.4).
• Last, their method directly uses visual features extracted by ResNet (He et al., 2016), which may introduce too much noise. We adopt a latent-variable model to filter out irrelevant information and obtain a better representation.

Conclusion
In this paper, we propose a retrieval-based MMT method, which learns a phrase-level universal visual representation to improve NMT. Our method not only outperforms the baseline systems and most existing MMT systems, but also breaks the restrictions on input that hinder the development of MMT in recent years. Experiments and analysis demonstrate the effectiveness of our proposed method.
In the future, we will explore how to apply our method to other tasks.