LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation

Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features, and has attracted considerable attention from both the natural language processing and computer vision communities. However, recent methods still train a separate model for each language pair, which is costly and becomes unaffordable as the number of languages grows in the real world. In other words, the multilingual multimodal machine translation (Multilingual MMT) task, which addresses this issue by providing a shared semantic space for multiple languages, has not been investigated. Moreover, the image modality has no language boundary, which makes it well suited to bridging the semantic gap between languages. To this end, we first propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages. We then propose LVP-M3, an effective baseline that uses visual prompts to support translation between different languages and comprises three stages: token encoding, language-aware visual prompt generation, and language translation. Extensive experimental results on our constructed benchmark datasets demonstrate the effectiveness of the LVP-M3 method for Multilingual MMT.


Introduction
Multimodal Machine Translation (MMT) extends conventional text-based machine translation by taking corresponding images as additional inputs (Lin et al., 2020; Li et al., 2022) to mitigate the problems of data sparsity and ambiguity (Ive et al., 2019; Yang et al., 2022) that arise in purely text-based machine translation. Similar to other multimodal tasks (e.g., visual question answering (Antol et al., 2015; Shih et al., 2016), image captioning (Vinyals et al., 2015; Jia et al., 2015), and video-text retrieval (Liu et al., 2022d)), MMT aims to exploit visual information for the machine translation task.
Moreover, MMT has broad applications (Zhou et al., 2018), such as multimedia news and movie subtitles in different languages.
However, as shown in Fig. 1(a), previous MMT models (e.g., DCCN (Lin et al., 2020)) handle a single translation pair (e.g., English → German, English → French) well, but training a separate model for each language pair is unaffordable considering that there are thousands of languages in the world. A straightforward way to reduce the computational cost is to use one model to handle translations for multiple languages, as shown in Fig. 1(b). Meanwhile, multilingual machine translation has been investigated for many years (Conneau et al., 2020), but these existing methods only take text as input and ignore the visual context. Therefore, in this work, we first propose the Multilingual Multimodal Machine Translation (Multilingual MMT) task, which supports translation across multiple languages with one single model.
To address the above limitations, we propose a simple and effective method, LVP-M3, comprising Token Encoding, Language-aware Visual Prompt Generation (LVPG), and Language Translation. Specifically, in the token encoding stage, we use a pre-trained vision encoder to extract visual tokens, and follow Johnson et al. (2017) in using a Transformer to encode the textual tokens. In LVPG, inspired by Yang et al. (2019) and Tian et al. (2020), a controller network (Fig. 3) dynamically generates the parameters of a mapping network conditioned on the target language, and the mapping network then outputs the language-aware visual prompts. Finally, during language translation, following works such as ViLBERT (Lu et al., 2019), we use a co-Transformer to generate vision-guided language tokens, and a Transformer decoder predicts the translation results.
Extensive experiments are conducted on our proposed benchmark datasets for LVP-M3. Results show that our model achieves state-of-the-art performance in all translation directions, outperforming the text-only multilingual model by 4.3 BLEU scores on average.
The contributions of this work are summarized as follows:
• We first propose Multilingual Multimodal Machine Translation (Multilingual MMT) to handle translation for multiple language pairs, which investigates the effect of the vision modality on multilingual translation and reduces the computational cost of existing MMT methods for multiple languages.
• For Multilingual MMT, we propose an effective language-aware visual prompt generation strategy that produces different visual prompts for different target languages based on the vision modality and the type of the target language.
• We establish two Multilingual MMT benchmark datasets to foster further research on Multilingual MMT, and extensive experiments on these datasets demonstrate the effectiveness of our proposed LVP-M3 method.

Related Works
Multimodal Machine Translation. The multimodal context plays a key role in Multimodal Machine Translation (MMT). Recent MMT methods can be divided into three categories: (1) using global visual features directly (Calixto and Liu, 2017); for instance, Huang et al. (2016) propose to concatenate global and regional visual features with source sequences; (2) exploiting visual features via attention (Libovický and Helcl, 2017; Helcl et al., 2018); Calixto et al. (2017) introduce visual features into the MMT model through an independent attention module;
(3) combining other vision tasks with the translation task via multi-task learning (Calixto et al., 2019; Yin et al., 2020); Elliott and Kádár (2017) decompose multimodal translation into two sub-tasks (i.e., translation and visual grounding). Recently, Huang et al. (2020) focus on the unsupervised setting for MMT, using pseudo visual pivoting and visual content to improve cross-lingual alignment in the latent space. In contrast, LVP-M3 considers the fully supervised multilingual setting, mapping vision embeddings into different feature spaces so that one MT model can handle translations for multiple languages. Besides, reducing computational cost is vital for many tasks (Liu et al., 2021, 2022c,a), and we address the Multilingual MMT task with one single model for efficiency.
Multilingual Language Models. Pre-trained multilingual Transformer-based language models (e.g., mBERT (Kenton and Toutanova, 2019) and XLM-R (Conneau et al., 2020)) use the same pre-training strategies as their monolingual counterparts (e.g., BERT (Kenton and Toutanova, 2019) and RoBERTa (Liu et al., 2019)), and are pre-trained with the masked language modeling (MLM) objective. A further study (2020) provides a comprehensive analysis of the contribution of different components of mBERT to its cross-lingual ability. Rust et al. (2021) show that monolingually adapted tokenizers can robustly improve the monolingual performance of multilingual models. Overall, compared with these methods, we focus on the multilingual setting for MMT, which has not been investigated before.
Vision-Language Models. The success of vision-language models can be credited to three important factors: Transformers (Liu et al., 2022b; Vaswani et al., 2017), contrastive representation learning (Radford et al., 2021; Li et al., 2020), and large-scale training datasets (Sharma et al., 2018; Miech et al., 2019). Previous Transformer-based multimodal models (Tan and Bansal, 2019; Chen et al., 2020; Gan et al., 2020; Bugliarello et al., 2021) jointly encode text tokens and image region features, preprocessing images with object detection models. The image region features are projected into the joint embedding space of the multimodal Transformer, and multi-head attention then attends to all text and image inputs to learn a joint representation of both modalities. Kamath et al. (2021) avoid using an object detector as a black box for pre-extracting region features and instead train the detector end-to-end with the multimodal Transformer for greater flexibility and representation capacity. Recently, the representative CLIP approach (Radford et al., 2021) trains two neural network-based encoders with a contrastive loss to match pairs of images and texts. After consuming 400 million data pairs, CLIP demonstrates remarkable zero-shot image recognition capability and has been applied to many downstream tasks. For example, Shen et al. (2022) leverage CLIP for different vision-language models across various tasks (e.g., image captioning, visual question answering). In our work, we aim to investigate the effectiveness of multimodal information for Multilingual MMT.

Datasets
We introduce two Multilingual MMT benchmark datasets (i.e., M3-Multi30K and M3-AmbigCaps) built on Multi30K (Elliott et al., 2016) and AmbigCaps (Li et al., 2021). Here, we describe the details of M3-Multi30K and M3-AmbigCaps.
Data Construction. The widely used Multi30K dataset for MMT is based on the Flickr30K Entities dataset (Plummer et al., 2017). For each image in Multi30K, one of the English (En) descriptions from Flickr30K Entities is selected. Currently, the English description of each image has been translated into German (De), French (Fr), and Czech (Cs) (Elliott et al., 2017; Barrault et al., 2018). To support more languages from different language families and various language distributions for Multilingual MMT, we extend the existing Multi30K dataset with three additional languages as shown in Table 1; one sample of the M3-Multi30K dataset is shown in Fig. 2. Specifically, in the annotation process, based on the recent state-of-the-art multilingual machine translation model XLM-R (Conneau et al., 2020), we first translate the English description of each image in Multi30K into Hindi (Hi), Turkish (Tr), and Latvian (Lv). Then, we hire independent native speakers to verify and improve the quality of the translations for the different languages. In addition, since the original AmbigCaps (Li et al., 2021) dataset contains only two languages, we extend AmbigCaps with five additional languages in the same way to obtain M3-AmbigCaps.
Data Splits. In M3-Multi30K, the numbers of image-translation pairs for training and testing are 29000 and 1000, respectively. In M3-AmbigCaps, the numbers of image-translation pairs for training and testing are 89600 and 1000, respectively. We will release these datasets.

Multilingual MMT
Suppose we have M languages {L_m}_{m=1}^{M} and N bilingual corpora {D_n}_{n=1}^{N} under the multilingual setting, where the dataset D_n consists of K parallel sentences {(x^k_{L_i}, x^k_{L_j})}_{k=1}^{K} between languages L_i and L_j, K is the number of training instances, and each instance has a corresponding image z^k. Given the corpora, we can train a Multilingual MMT model that enables translation among different languages with the help of the image modality. The training objective of Multilingual MMT combines the losses over the different language pairs:

L = - Σ_{n=1}^{N} Σ_{k=1}^{K} log P(x^k_{L_j} | x^k_{L_i}, z^k; Θ),    (1)

where the Multilingual MMT model uses a completely shared model for all translation directions. In this work, we adopt the Transformer as the backbone model for language encoding and the pre-trained vision branch of the CLIP model (Radford et al., 2021) for the vision modality. A target language token t_{L_j} is prefixed to each source sentence to indicate the translation direction (Johnson et al., 2017).
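As a minimal sketch (not the authors' released code), the shared-parameter objective above can be computed by summing token-level negative log-likelihoods over every language-pair corpus. The batch layout and helper names below are hypothetical, standing in for a real Transformer's output probabilities:

```python
import math

def pair_nll(token_probs):
    """Negative log-likelihood of one target sentence, given the
    probabilities the model assigned to the reference tokens."""
    return -sum(math.log(p) for p in token_probs)

def multilingual_loss(batches):
    """Sum the NLL over all language-pair corpora D_n, as in Eq. 1.
    `batches` maps a direction like ('En', 'De') to a list of
    per-sentence reference-token probability lists (toy layout)."""
    total = 0.0
    for sentences in batches.values():
        total += sum(pair_nll(probs) for probs in sentences)
    return total

# Toy example: two translation directions sharing one loss.
batches = {
    ("En", "De"): [[0.9, 0.8], [0.7]],
    ("En", "Fr"): [[0.6, 0.5]],
}
loss = multilingual_loss(batches)
```

Because all directions contribute to one scalar loss, a single shared model is updated by every language pair, which is the point of the multilingual setup.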

LVP-M 3
As shown in Fig. 3, our proposed LVP-M3 method includes three stages: token encoding, language-aware visual prompt generation, and language translation. Specifically, during training, given each image z^k, the parallel sentences (x^k_{L_i}, x^k_{L_j}) from source language L_i and target language L_j, and the target language token embedding t_{L_j}, the token encoding stage first uses the vision encoder to extract the visual token features {v_m}_{m=1}^{M} from z^k, where M is the number of visual tokens. We then use the Transformer encoder to extract the source language tokens {s_f}_{f=1}^{F}, where F is the number of language tokens. In the language-aware visual prompt generation (LVPG) stage, we map {v_m}_{m=1}^{M} into the language-aware visual prompts {p_m}_{m=1}^{M} conditioned on t_{L_j}, producing different visual prompts for different target languages; here, a controller network dynamically generates the parameters of the mapping network. In the language translation stage, we adopt a co-attention strategy to generate vision-guided language tokens, feed them to the Transformer decoder to predict the translation, and compute the loss in Eq. 1 using the predicted translation and the ground-truth target sentence x^k_{L_j}.

Token Encoding
For each image z^k, we directly use a vision backbone (e.g., the pre-trained vision branch of the widely used CLIP model (Radford et al., 2021)) as the vision encoder to extract the visual tokens of z^k:

{v_m}_{m=1}^{M} = H(z^k),    (2)

where H denotes the vision encoder and M is the number of visual tokens.
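For illustration only, the stand-in "vision encoder" below mimics the shape contract of a ViT-style backbone by splitting an image into non-overlapping patches, one visual token per patch; a real system would use the pre-trained CLIP vision branch instead of this toy function:

```python
def vision_encode(image, patch: int = 2):
    """Toy H(z): image is a 2D list (H x W of floats). Returns M visual
    tokens, one per non-overlapping patch x patch region, each token
    being the flattened patch values."""
    h, w = len(image), len(image[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append([image[i + di][j + dj]
                           for di in range(patch) for dj in range(patch)])
    return tokens  # M = (h // patch) * (w // patch) tokens

# A 4x4 "image" yields M = 4 tokens of length 4.
image = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
tokens = vision_encode(image)
```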
Similarly, given the source sentence x^k_{L_i}, the source language tokens {s_f}_{f=1}^{F} are extracted by the Transformer encoder E:

{s_f}_{f=1}^{F} = E(x^k_{L_i}),    (3)

where F is the number of source language tokens.

Language-aware Visual Prompt Generation
In the language-aware visual prompt generation stage of Fig. 3, we are motivated by recent works based on conditional convolutions (e.g., dynamic filter networks (Jia et al., 2016) and CondConv (Yang et al., 2019)), where the convolution filters are conditioned on the input and dynamically generated by another network to improve model capacity. We extend this idea to generate visual prompts conditioned on the target language type t_{L_j} (e.g., German), mapping the extracted visual tokens into different embedding spaces for different target languages. Specifically, in Fig. 3, based on the embedding of the target language token t_{L_j}, we use a controller network C, implemented as two fully connected layers with ReLU (Nair and Hinton, 2010) activations, to generate the parameters θ of the mapping network M:

θ = C(t_{L_j}).    (4)

After that, we generate the language-aware visual prompts {p_m}_{m=1}^{M} as follows:

{p_m}_{m=1}^{M} = M({v_m}_{m=1}^{M}; θ),    (5)

where θ denotes the parameters generated in Eq. 4, which are assigned to the mapping network M. In this way, when translating the source language into different target languages, θ is generated according to the target language token, and the visual tokens {v_m}_{m=1}^{M} are mapped into different visual prompts according to the type of the target language.
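The controller-plus-mapping-network idea can be sketched as a tiny hypernetwork. The `controller` below is a toy stand-in (the paper uses two fully connected layers with ReLU), but it shows the mechanism: the same visual tokens yield different prompts once the target-language embedding changes:

```python
def matvec(W, x):
    """Apply a weight matrix W (list of rows) to a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def controller(t_lang, d_tok: int):
    """Toy controller C: expands the target-language embedding into a
    d_tok x d_tok weight matrix theta for the mapping network."""
    return [[t_lang[i % len(t_lang)] * (1.0 if i == j else 0.1)
             for j in range(d_tok)] for i in range(d_tok)]

def lvpg(visual_tokens, t_lang):
    theta = controller(t_lang, len(visual_tokens[0]))  # Eq. 4: theta = C(t_Lj)
    return [matvec(theta, v) for v in visual_tokens]   # Eq. 5: p_m = M(v_m; theta)

v = [[1.0, 0.0], [0.0, 1.0]]   # two toy visual tokens
p_de = lvpg(v, t_lang=[2.0])   # prompts conditioned on a "German" embedding
p_fr = lvpg(v, t_lang=[3.0])   # a different language embedding gives different prompts
```

The design choice here is that only the small controller carries language-specific parameters, while the visual tokens and the rest of the model stay fully shared.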

Language Translation
In Fig. 3, based on the source language tokens {s_f}_{f=1}^{F} and the language-aware visual prompts {p_m}_{m=1}^{M}, we first generate vision-guided source language tokens through a co-attention strategy, which is widely used for fusing information from another modality in vision-language models (Lu et al., 2019). Then, we predict the translation results using the Transformer decoder.
Specifically, we use self-attention Transformer modules to fuse information across tokens within each modality for {s_f}_{f=1}^{F} and {p_m}_{m=1}^{M}, and denote the updated source language tokens and visual prompts by S and P, respectively. Then, we take S as the query, and P as the key and value, in the co-attention module to generate the vision-guided source language tokens {q_f}_{f=1}^{F}:

{q_f}_{f=1}^{F} = ∥_{h=1}^{H} SF( φ^h_Q(S) φ^h_K(P)^T / √C ) φ^h_V(P),    (6)

where ∥_{h=1}^{H} denotes the concatenation of the H attentive features along the channel dimension, SF represents the softmax operation, φ^h_Q(·), φ^h_K(·), and φ^h_V(·) are the linear projections of the h-th head for the query, key, and value, respectively, and C denotes the number of feature channels. After the operation in Eq. 6, the remaining operations of the standard attention scheme (Vaswani et al., 2017) (e.g., FFN and layer normalization (Ba et al., 2016)) are applied.
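A single-head, projection-free sketch of this co-attention step may help fix the data flow (the real module has H heads, the linear projections φ_Q/φ_K/φ_V, an FFN, and layer normalization, all omitted here):

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def co_attention(S, P):
    """S: F x C language tokens (queries); P: M x C visual prompts
    (keys and values). Returns F vision-guided language tokens q_f."""
    scale = math.sqrt(len(P[0]))
    out = []
    for s in S:
        scores = [sum(a * b for a, b in zip(s, p)) / scale for p in P]
        w = softmax(scores)           # attention weights over visual prompts
        out.append([sum(wi * p[c] for wi, p in zip(w, P))
                    for c in range(len(P[0]))])
    return out

S = [[1.0, 0.0]]                  # one language token
P = [[1.0, 0.0], [0.0, 1.0]]      # two visual prompts
q = co_attention(S, P)            # a mix of P, biased toward the aligned prompt
```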
Finally, at inference, based on {q_f}_{f=1}^{F}, the Transformer decoder predicts the target language sequence in our LVP-M3.

Experiments
We evaluate our proposed LVP-M3 method on the multilingual datasets covering 7 languages and 6 translation directions. In all experiments, English (En) is treated as the pivot language for the Multilingual MMT setting.

Experimental Setting
Implementation Details. Our implementation is based on the Fairseq (Ott et al., 2019) toolbox, and we use the SentencePiece tokenizer. The model in Fig. 3 consists of 6 Transformer encoder/decoder layers, and the number of attention heads in all Transformer layers is set to 12. For training, we use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9 and β2 = 0.98. The learning rate warms up from 1e-7 to 1e-4 over 2000 steps and then decays according to the inverse square root of the update number. The maximum number of tokens in each mini-batch is 4096. The dropout and label-smoothing rates are set to 0.3 and 0.1, respectively. For the vision encoder, by default, we adopt the vision branch of CLIP based on the ViT-L/14 model; the effect of different vision backbones is discussed in our ablation study. All models are trained for 30 epochs and evaluated on a single Linux machine with 8 NVIDIA A100 (80G) GPUs.
Evaluation. We compute cumulative 4-gram BLEU scores to evaluate translation quality.
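As a rough illustration of the metric (real evaluations should use standard tooling such as sacrebleu rather than this toy single-sentence version), cumulative 4-gram BLEU multiplies a brevity penalty by the geometric mean of the 1- to 4-gram precisions:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(hyp, ref):
    """Toy cumulative 4-gram BLEU for one sentence pair: uniform weights,
    floor-smoothed precisions, and the standard brevity penalty."""
    precisions = []
    for n in range(1, 5):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        total = max(sum(h.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "ein mann sitzt im gras".split()
score = bleu4(ref, ref)   # identical hypothesis and reference
```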
During inference, beam search is performed with a beam size of 5 for target sentence generation, and the length penalty is set to 1.0.
Baseline Methods. As ours is the first multilingual method in this area, we reproduce Text Transformer (Fan et al., 2021), Vision Matters (Gated fusion) (Li et al., 2021), and Vision Matters (Concatenation) (Li et al., 2021) in the multilingual translation setting for a fair comparison. Besides, we also report results for LVP-M3 (w/o LVPG), where we directly adopt the co-attention strategy of Lu et al. (2019) to generate the vision-guided language tokens from the source tokens and the visual features.
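Beam search with a fixed beam size can be sketched as follows; `toy_model` is a hypothetical stand-in for the Transformer decoder's next-token distribution, and the length penalty of 1.0 means no rescaling of scores:

```python
import math

def beam_search(step_probs, beam_size=5, max_len=3, eos="</s>"):
    """step_probs(prefix) -> {token: prob}. Keeps the beam_size best
    partial hypotheses by total log-probability and returns the best one."""
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:        # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for tok, p in step_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

def toy_model(prefix):
    # Fixed toy distribution: prefers "ein", then "mann", then EOS.
    if not prefix:
        return {"ein": 0.7, "der": 0.3}
    if prefix[-1] == "ein":
        return {"mann": 0.9, "ball": 0.1}
    return {"</s>": 1.0}

best = beam_search(toy_model)
```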

Results on M 3 -Multi30K
To demonstrate the effectiveness of LVP-M3, we compare our method with the baselines on M3-Multi30K under the multilingual MMT setting in Table 2. Note that Vision Matters (Gated fusion) and Vision Matters (Concatenation) were originally proposed in the bilingual setting, and we reproduce them in the multilingual setting for a fair comparison. In Table 2, our LVP-M3 achieves the best BLEU scores in all translation directions. Specifically, first, compared with the text Transformer using only text information, LVP-M3 improves by +2.2 BLEU scores on average, which demonstrates the effectiveness of visual context for Multilingual MMT. Second, compared with the baseline LVP-M3 (w/o LVPG), LVP-M3 also achieves better performance in all settings, which verifies the effectiveness of our proposed language-aware prompt generation module for Multilingual MMT. Among all translation directions, En→De improves the most; because English and German are from the same language family, the two languages can share similar semantic knowledge through cross-lingual transfer.

Results on M 3 -AmbigCaps
Results on M3-AmbigCaps are presented in Table 3. Compared with the other baselines, our proposed LVP-M3 method again achieves significant improvements in all translation directions. In Table 3, we observe that LVP-M3 gains +4.8 BLEU scores on average when the visual modality is used, which is larger than the corresponding gain on M3-Multi30K.

Ablation Study
In this section, we conduct a comprehensive ablation study on the M3-Multi30K test set to demonstrate the effectiveness of the different components of our proposed LVP-M3 method.

Effect of LVPG.
In Table 2 and Table 3, we observe that our language-aware visual prompt generation (LVPG) brings significant improvements for Multilingual MMT. To further demonstrate its effectiveness, we compare two alternative prompt generation methods (i.e., LVP-M3 (Static) and LVP-M3 (CoCoOp)) in Table 4. Specifically, in LVP-M3 (Static), we generate visual prompts by mapping the visual tokens {v_m}_{m=1}^{M} through a mapping network whose parameters are static after training and not conditioned on the target language token embedding t_{L_j}. In Table 4, our LVP-M3 outperforms these alternatives by a large margin, indicating that language-conditioned prompts better guide the visual clues to bridge the semantic gap between languages.
Effect of Different Vision Backbones. In Table 5, we compare LVP-M3 using visual tokens extracted by different vision backbones (He et al., 2016; Dosovitskiy et al., 2020) within CLIP. LVP-M3 achieves its best results with ViT-L/14 as the vision encoder, so we use ViT-L/14 by default. Moreover, performance improves as the capacity of the vision backbone increases, which is reasonable because more powerful backbones produce higher-quality visual tokens.

Further Analysis
Visualization of Different Masking Ratios. As shown in Fig. 4, we compare LVP-M3 with the alternative LVP-M3 (w/o vision) to analyze the effectiveness of visual context under different masking ratios on the source language. Specifically, LVP-M3 (w/o vision) only uses the Transformer encoder to process the source language together with the target language embedding and then applies the Transformer decoder for multilingual MT; the vision encoder and LVPG are both removed. In Fig. 4, we report results for translating English (En) to French (Fr) and Turkish (Tr). First, as the masking ratio increases, BLEU scores drop whether or not the visual content is added, and LVP-M3 still clearly outperforms LVP-M3 (w/o vision). Second, the gap between LVP-M3 and LVP-M3 (w/o vision) is largest when the masking ratio is between 20% and 40%, which shows that visual information improves the robustness of the translation model. Third, at larger masking ratios the results of both methods degrade in all settings; when the masking ratio reaches 80%, LVP-M3 (w/o vision) approaches LVP-M3, which is reasonable because most tokens of each source sentence are masked and translation becomes difficult for both methods in such extreme scenarios.
Qualitative Analysis. To further explore the necessity of the visual modality for machine translation, we compare the predicted translations (i.e., De and Fr) of a sample source sentence (i.e., En) with the ground truth of these target languages in Fig. 5. Specifically, the tokens "man" (noun), "pink" (adjective), and "ball" (noun) are masked; these masked tokens describe salient regions of the corresponding image. We make the following observations. First, even though "man" is masked, the German and French predictions for this token are still correct, which shows that the visual modality is complementary rather than redundant when the text is insufficient. Second, our model translates some tokens into synonyms in the target language. For example, although the translated word "rosa" in German is scored as a bad translation for the masked English token "pink", it has the same meaning as the German word "pinken". Likewise, "la balle" in French is a synonym of "ball" in English, which further demonstrates the effectiveness of the vision modality.

Discussion on LVP-M 3
In our proposed LVP-M3 method, first, both encoders (vision and text) and the decoder are shared across all language pairs, whereas previous MMT methods usually adopt different models for different language pairs. Second, to generate different visual prompts for different language pairs with minimal additional parameters, we simply use a controller network to generate the parameters of the mapping network that maps the vision embeddings. Third, different translation directions are mixed during training, with the target language token prefixed to each source sentence to indicate the translation direction. Finally, training separate models would incur huge training costs compared with a multilingual model, as discussed in many multilingual methods.

Conclusion
In this work, we first propose the Multilingual MMT task to support multimodal machine translation between multiple language pairs using one single model. Then, we propose LVP-M3, an effective baseline for the Multilingual MMT task, in which a language-aware prompt generation module dynamically generates visual prompts for different target languages. Comprehensive experimental results on our established Multilingual MMT benchmark datasets demonstrate the effectiveness of our proposed LVP-M3 method for Multilingual MMT.

Limitations
Although our proposed LVP-M3 method achieves substantial improvements for Multilingual MMT, some hyper-parameters (e.g., the number of encoder and decoder layers) still need to be tuned for the best results, which can be time-consuming. Besides, our established datasets currently support seven languages, and we will extend them to more languages and more translation directions for Multilingual MMT in future work.

Figure 1: Comparison of MMT and Multilingual MMT. (a) For MMT, we need to train different MMT models to support translations between different language pairs (e.g., "En-De" denotes translating English to German). (b) For Multilingual MMT, we only need one single model to translate the source language into different target languages.

Figure 2: Example of an image with its descriptions of seven different languages.

Figure 3: The overall framework of our proposed LVP-M3 method for the Multilingual MMT task, which includes three stages (i.e., token encoding, language-aware visual prompt generation (LVPG), and language translation). Here, we take translating English (En) to German (De) as an example. "TRM" and "Co-TRM" denote the Transformer and co-Transformer models, respectively.

Figure 4: Translation results of LVP-M3 under different masking ratios on the source language. Results are evaluated on the M3-Multi30K test set by translating English (En) to other languages (i.e., Fr and Tr).
SRC (En): A man in a pink shirt is sitting in the grass and a ball is in the air.
SRC (En) with MASK: A [MASK] in a [MASK] shirt is sitting in the grass and a [MASK] is in the air.
TGT (De): Ein Mann in einem pinken Hemd sitzt auf dem Gras und ein Ball ist in der Luft.
PRE (De): Ein Mann in einem rosa Hemd sitzt im Gras und ein Ball liegt in der Luft.
TGT (Fr): Un homme en polo rose est assis dans l'herbe et un ballon est en l'air.
PRE (Fr): Un homme en chemise rose est assis dans l'herbe et la balle est en l'air.

Figure 5: A qualitative example of translating English (En) to German (De) and French (Fr) with the help of the vision modality. Tokens in red denote correct translations. Tokens in blue denote good synonyms, which have a similar meaning to the ground truth in the target language. SRC denotes the source sentence, MASK the masked content in the source sentence, and PRE and TGT the predicted translation and the ground truth of the target language, respectively.

Table 2: The BLEU scores of different methods on the M3-Multi30K test set. Five multilingual baselines are compared. The bottom part shows the results of the multilingual models trained with text and vision modalities. The best results are highlighted.

Table 3: The BLEU scores of different methods on the M3-AmbigCaps test set. Five multilingual baselines are compared. The bottom part shows the results of the multilingual models trained with text and vision modalities. The best results are highlighted.

Table 4: Comparison of different visual prompt generation methods with BLEU scores.

Table 5: Comparison of different visual backbones with BLEU scores.