Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation

Multimodal machine translation (MMT) aims to improve translation quality by incorporating information from other modalities, such as vision. Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets. These studies face two challenges. First, they can only utilize triple data (bilingual texts with images), which is scarce; second, current benchmarks are relatively restricted and do not correspond to realistic scenarios. Therefore, this paper establishes new methods and a new dataset for MMT. First, we propose a framework, 2/3-Triplet, with two new approaches to enhance MMT by utilizing large-scale non-triple data: monolingual image-text data and parallel text-only data. Second, we construct an English-Chinese e-commercial multimodal translation dataset (including training and test sets), named EMMT, whose test set is carefully selected to contain ambiguous words that are likely to be mistranslated without the help of images. Experiments show that our method is better suited to real-world scenarios and can significantly improve translation performance by using more non-triple data. In addition, our model also rivals various SOTA models on conventional multimodal translation benchmarks.


Introduction
Multimodal Machine Translation (MMT) is a machine translation task that utilizes data from other modalities, such as images. Previous studies have proposed various methods to improve translation quality by incorporating visual information and have shown promising results (Lin et al., 2020; Caglayan et al., 2021; Li et al., 2022a; Jia et al., 2021). However, manual image annotation is relatively expensive; at this stage, most MMT work is applied to a small and specific dataset, Multi30K (Elliott et al., 2016). The current performance of MMT systems still lags behind large-scale text-only Neural Machine Translation (NMT) systems, which hinders the real-world applicability of MMT.
We summarize the limitations of current MMT in two aspects. The first limitation is the size of the training data. Usually, the performance of MMT heavily relies on triple training data, i.e., parallel text data with corresponding images. Triplets are much rarer to collect and much more costly to annotate than monolingual image-text and parallel text data, as shown in Figure 1. Considering that current MT systems are driven by massive amounts of data (Aharoni et al., 2019), the sparsity of multimodal data hinders the large-scale application of these systems. Some researchers have proposed retrieval-based approaches (Fang and Feng, 2022), aiming to construct pseudo-multimodal data through text retrieval. However, the constructed pseudo-data face problems such as visual-textual mismatches and sparse retrieval. Besides, these models still cannot take advantage of monolingual image-text pairs.

1 Codes and data are available at github.com/Yaoming95/23Triplet and huggingface.co/datasets/Yaoming95/EMMT

Figure 1: Triple data, although widely utilized in multimodal machine translation, is quite scarce. We emphasize the importance of two other kinds of data: parallel text and image captions. The numbers represent the sizes of commonly used datasets for the corresponding data types.
The second limitation is the shortage of proper benchmarks. Although several researchers have examined the benefit of visual context when textual information is degraded (Caglayan et al., 2019; Wang and Xiong, 2021), the improvements remain questionable. Wu et al. (2021) argue that vision contributes little in previous MMT systems, and that the images in previous benchmark datasets provide limited additional information. In many cases, the translation of a sentence relies on textual rather than visual information: the texts contain complete contexts and are unambiguous, leaving the usefulness of images doubtful. Therefore, a benchmark in which sentences cannot be easily translated without visual information is much needed.
To address these limitations, we propose models that make the most of training data and build a challenging, real-world benchmark to push the real-world application of MMT research. First, we propose a new framework, named 2/3-Triplet, which can use both parallel text and image-text data. It provides two different ways of exploiting these data, based on continuous vision features and discrete prompt tokens, respectively. The two approaches are not mutually exclusive and can be used jointly to improve performance within the same framework. It is also worth mentioning that the prompt approach is easy to deploy without modifying the model architecture.
In addition, we present a new real-world dataset named EMMT. We collect parallel text-image data from several publicly available e-commerce websites and have the translations labeled by 20 language experts. To build a challenging test set, we carefully select ambiguous sentences that cannot be easily translated without images. This high-quality dataset contains 22K triplets for training and 1,000 test examples, along with extra image-text and parallel text data.
Comprehensive experiments show that 2/3-Triplet rivals or surpasses text-only and other MMT competitors on EMMT, as well as on previous benchmarks. In particular, 2/3-Triplet consistently improves over the strong text-only baseline by more than 3 BLEU in various settings, showing the importance of visual information.

Related Work
Researchers have applied multimodal information to enhance machine translation systems since the statistical machine translation era (Hitschler et al., 2016; Afli et al., 2016). With the rise of neural networks in machine translation, researchers have focused on utilizing image information more effectively. Early work used image features as initialization for neural MT systems (Libovický and Helcl, 2017). More recent studies proposed multimodal attention mechanisms (Calixto et al., 2017; Yao and Wan, 2020), enhanced text-image representations using graph neural networks (Lin et al., 2020), latent variable models, or capsule networks, and used object-level visual grounding information to align text and image (Wang and Xiong, 2021). Li et al. (2022a) found that a stronger vision model matters more than a complex architecture for multimodal translation.
As discussed earlier, these methods are limited to bilingual captions with image data, which are scarce. Therefore, some researchers (Fang and Feng, 2022) also design retrieval-based MMT methods that retrieve images with similar topics for image-free sentences. Alternatively, Elliott and Kádár (2017) proposed visual "imagination" by sharing visual and textual encoders.
Recently, Wu et al. (2021) have questioned whether the most common benchmark, Multi30K (Elliott et al., 2016), is suitable for multimodal translation, since they found that images contribute little to translation. Song et al. (2021) contributed a new dataset in the e-commercial product domain. However, we find their dataset still has similar drawbacks.
Several relevant studies on translation and multimodality are noteworthy. Huang et al. (2020) used visual content as a pivot to improve unsupervised MT. Wang et al. (2022b) proposed a pre-training model that uses modality embeddings as prefixes for weakly supervised tasks. Li et al. (2022c) introduced VALHALLA, which translates under the guidance of hallucinated visual representations.

Approach
In the fully supervised condition of MMT, we have triplets {(x, y, i)}, where x is the source text, y is the target text, and i is the associated image. Since triplets are rare, we attempt to utilize partially parallel data such as {(y, i)} and {(x, y)}, referred to as monolingual image-text data and parallel text data in this paper.
In this section, we propose a new training framework, 2/3-Triplet, with two approaches that utilize triple and non-triple data at the same time. We name these two approaches FUSION-BASED and PROMPT-BASED, as shown in Figure 2.
For each approach, the model conducts mixed training with three kinds of data: ((x, i) → y), ((x) → y), and ((y*, i) → y), where y* denotes the masked target text.
The FUSION-BASED approach resembles conventional models where the encoded vision information is taken as model input and the model is trained in an end-to-end manner; our design makes it possible to utilize bilingual corpora and image-text pairs in addition to multimodal triplets. The PROMPT-BASED approach is inspired by recent prompt-based NLP research (Gao et al., 2021; Li and Liang, 2021; Wang et al., 2022a; Sun et al., 2022), where we directly use the image caption as a prompt to enhance the translation model without any modification to the model.
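To make the mixed training concrete, the routing of the three data types can be sketched as follows (a minimal Python sketch under our own naming; the field names `src`, `tgt`, `img` and the masking probability are illustrative assumptions, not the paper's implementation):

```python
import random

def make_example(sample, mask_prob=0.3, rng=random):
    """Route one raw sample into ((x, i) -> y), ((x) -> y), or ((y*, i) -> y).

    `sample` is a dict with optional keys "src" and "img"; "tgt" is always
    present. The keys and masking rate are our own illustrative choices.
    """
    if "src" in sample and "img" in sample:        # triple data
        return {"inp": sample["src"], "img": sample["img"], "out": sample["tgt"]}
    if "src" in sample:                            # parallel text, image absent
        return {"inp": sample["src"], "img": None, "out": sample["tgt"]}
    # monolingual image-text: mask target tokens, denoise with image guidance
    toks = sample["tgt"].split()
    masked = [t if rng.random() > mask_prob else "[MASK]" for t in toks]
    return {"inp": " ".join(masked), "img": sample["img"], "out": sample["tgt"]}
```

Batches drawn from the three pools can then be mixed freely, since every example reduces to the same (input, image, output) form.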

FUSION-BASED
The common practice for utilizing image information is to extract vision features and use them as inputs to the multimodal MT system. Typically, vision and textual features are combined to obtain a multimodal fused representation, where the textual features are the output states of the Transformer encoder and the vision feature is extracted via a pre-trained vision model.
We incorporate the textual features and the image feature by simple concatenation:

H_fused = W_f [H_text; h_img]  (1)

where H_text denotes the textual features output by the Transformer encoder, and h_img is the visual representation of the [CLS] token broadcast to the length of the text sequence.
Then, we employ a gate matrix Λ to regulate the blend of visual and textual information:

Λ = tanh(W_g H_fused)  (2)

Finally, we add the gated fused information to the original textual features to obtain the final multimodal representation:

H_out = H_text + 1(img) · Λ ⊙ H_fused  (3)

It is worth noting that in Eq. 2 we employ the hyperbolic tangent (tanh) gate instead of the traditional sigmoid gate (Wu et al., 2021; Li et al., 2022a) used in the multimodal translation scenario. This choice has two major advantages: (a) the output of tanh can take both positive and negative values, enabling the model to modulate the fused features H_fused in accordance with the text H_text; (b) the tanh function is zero-centered, so when the fused feature is close to zero, the output of the gate is also minimal, which naturally aligns with the scenario where the image is absent (i.e., tanh(0) = 1(no img) = 0).
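As an illustration, the gated fusion above can be sketched in NumPy as follows (a minimal sketch; the projection matrices `W_f`, `W_g` and their shapes are our assumptions about the parameterization, not the paper's exact one):

```python
import numpy as np

def fuse(H_text, h_img, W_f, W_g, has_img=True):
    """Tanh-gated fusion of text states with a broadcast image vector.

    H_text: (T, d) encoder states; h_img: (d_img,) [CLS] image feature.
    W_f: (d, d + d_img) projection of the concatenated features (Eq. 1).
    W_g: (d, d) gate projection (Eq. 2). Shapes are illustrative assumptions.
    """
    T, d = H_text.shape
    # Broadcast the [CLS] image vector to every text position, then concatenate.
    H_img = np.tile(h_img, (T, 1))
    H_fused = np.concatenate([H_text, H_img], axis=-1) @ W_f.T
    # Zero-centered tanh gate: tanh(0) = 0 matches the image-absent case.
    Lam = np.tanh(H_fused @ W_g.T)
    # Residual combination with the indicator 1(img) (Eq. 3).
    ind = 1.0 if has_img else 0.0
    return H_text + ind * Lam * H_fused
```

With `has_img=False` the output reduces exactly to H_text, which is the property the text-only training flow relies on.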
The next paragraphs illustrate how we utilize each of the three types of data.
Using Triple Data ((x, i) → y) Figure 2a 1 : Based on the basic architecture, we take in the source text for the text encoder and the image for the image encoder. By setting 1(img) = 1, we naturally leverage vision context for translation. The inference procedure also follows this flow.

Using Parallel Text ((x) → y) Figure 2a 2 : We utilize the same architecture as in the triple-data setting. By setting 1(img) = 0, we adapt to the text-only condition. For image-free bilingual data, the fused term is absent and the final representation H out reduces to the textual features only, consistent with learning on a unimodal corpus.

Using Monolingual Caption ((y*, i) → y) Figure 2a 3 : Following Siddhant et al. (2020)'s strategy for leveraging monolingual data in translation, we adapt the masked de-noising task to utilize monolingual image-text pairs. In a nutshell, we randomly mask some tokens in the text and force the model to predict the complete caption text from the masked text and the image.

PROMPT-BASED
As prompt-based methods have achieved great success in NLP tasks (Gao et al., 2021; Li et al., 2022b; Wang et al., 2022a; Sun et al., 2022), we also consider whether the image information can be converted into prompt signals to guide sentence generation.
The general idea is quite straightforward: our translation system accepts a source-language sentence along with some target-language keywords, and translates the source sentence into the target language under the instruction of the target keywords. The keywords can be any description of the image that helps disambiguate the translation.
Using Triple Data ((x, i) → y) Figure 2b 1 : First, we generate the prompt from the image with a pre-trained caption model (introduced later). The original source sentence and the prompt are then concatenated to compose the training source, with a special token [SEP] as a separator between the two.
Using Parallel Text ((x) → y) Figure 2b 2 : Since the PROMPT-BASED approach adopts a standard Transformer and involves no modification to the architecture, it is natural to train on a unimodal parallel corpus. We use the parallel data to strengthen the model's ability to take advantage of the prompt. Without any image, we randomly select several words from the target sentence as a pseudo vision prompt. For translation training, we append the keyword prompt to the end of the original sentence with a special token as a separator (Li et al., 2022b).
At inference time, we extract the translation result by splitting on the separator token.
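The prompt-side preprocessing and postprocessing amount to simple string operations, sketched below (names such as `build_prompt_source` and the exact `[SEP]` handling are our own illustrative choices, not the released implementation):

```python
import random

SEP = "[SEP]"

def build_prompt_source(src, keywords):
    """Append target-language keyword prompts to the source with a separator."""
    return f"{src} {SEP} {' '.join(keywords)}" if keywords else src

def pseudo_prompt(tgt_tokens, k=2, rng=random):
    """For image-free parallel text: sample target words as a pseudo vision prompt."""
    k = min(k, len(tgt_tokens))
    return rng.sample(tgt_tokens, k)

def strip_prompt(hypothesis):
    """After inference, keep only the translation before the separator."""
    return hypothesis.split(SEP)[0].strip()
```

At training time, triple data uses captions as keywords and parallel text uses `pseudo_prompt`; at test time, `strip_prompt` recovers the clean hypothesis.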
Using Monolingual Caption ((y * , i) → y) Figure 2b 3 : As in the FUSION-BASED approach, we use the de-noising auto-encoder task. By randomly masking some tokens and using the caption generated from the image as the prompt, we make the model learn to predict the original target text.
Training Caption Model ((i) → keywords(y)) Meanwhile, we train a caption model to generate the guiding prompt from images for translation. We construct image-text pairs from both the triple data and the target-side monolingual captions. The input and output of the model are the image and the extracted keywords of the corresponding target sentence, respectively.

Comparison and Combination of FUSION-BASED and PROMPT-BASED
Under the same training framework 2/3-Triplet, we propose two approaches, FUSION-BASED and PROMPT-BASED, for utilizing non-triple data. The FUSION-BASED approach preserves the complete visual context, providing more information via model fusion. In contrast, the PROMPT-BASED approach has the advantage of not requiring any modifications to the model architecture: all visual information is introduced by the prompt model, making deployment more straightforward. The two approaches are not mutually exclusive, and we can utilize them jointly. Specifically, the model simultaneously uses the fused feature in Eq. 3 as the encoder representation and the prompt-concatenated source as the text input. The combination enables the model to benefit from our framework in the most comprehensive way, and as a result, performance improves significantly.

Dataset
As mentioned before, in previous test sets many sentences can be easily translated without the image context, because all information is conveyed in the text without ambiguity. To properly evaluate the usage of visual information, we propose a multimodal-specific dataset.
We collect the data from real-world e-commercial pages crawled from TikTok Shop and Shopee, taking the product image and title, where the title may be in English or Chinese. We filter out redundant and duplicate samples, as well as those with serious syntax errors. On this basis, we conduct manual annotation with a hired team of 20 professional translators, all native Chinese speakers majoring in English. In addition, another translator independently samples the annotated corpus for quality control. We asked the annotators to set aside samples specifically for the test set that they found difficult to translate, or confusing, without the images. In total, 22,500 triples were annotated; 1,000 carefully selected samples form the test set, another 500 randomly sampled ones form the dev set, and the remainder serves as the training set.
Besides the annotated triplets, we clean the rest of the crawled data and open-source it as the monolingual caption part of the dataset. Since our approach centers on utilizing bilingual data to enhance multimodal translation, we sample 750K English-Chinese sentence pairs from CCAligned (El-Kishky et al., 2020) as the parallel text. We chose this corpus for its diversity of sources and domains and its greater relevance to real-world scenarios compared to other corpora. The sampled data scale is decided based on both the model architecture and the principles of the neural scaling law (Kaplan et al., 2020; Gordon et al., 2021). We also encourage future researchers to explore the use of additional non-triple data to further enhance performance, as detailed in the appendix. We summarize the dataset statistics in Table 1. We discuss ethical and copyright issues of the data in the appendix.

Datasets
We conduct experiments on three benchmark datasets: Multi30K (Elliott et al., 2016), Fashion-MMT (Clean) (Song et al., 2021), and our EMMT. Multi30K is the most common benchmark for MMT tasks, annotated from Flickr, where we focus on English-German translation. To validate the effectiveness of parallel text, we add 1M English-German sentence pairs from CCAligned and image captions from COCO (Lin et al., 2014; Biswas et al., 2021). Fashion-MMT is built on the fashion captions of FACAD.

Baselines
We compare our proposed 2/3-Triplet with the following SOTA MT and MMT systems: Transformer (Vaswani et al., 2017) is the current de facto standard for text-based MT.

Table 1: Statistics of EMMT.

         Triplet   PT     MC
Train    22K       750K   103K
Test     1000      -      -
Dev      500       -      -

Phrase Retrieval (Fang and Feng, 2022) is an improved retrieval-based MMT model that retrieves images at the phrase level.
In addition, we report the results of Google Translate, which helps check whether translating the test set actually requires images. All reported baselines use the same number of layers, hidden units, and vocabulary as 2/3-Triplet for a fair comparison.
We mainly refer to BLEU (Papineni et al., 2002) as the major metric, since it is the most commonly used evaluation standard in previous multimodal MT studies.

Setups
To compare with previous SOTAs, we use different model scales on Multi30K and the other two datasets. We follow Li et al. (2022a)'s setting on Multi30K, where the model has 4 encoder layers, 4 decoder layers, 4 attention heads, and hidden and filter sizes of 128 and 256, respectively. On the other two datasets, the model has 6 encoder layers, 6 decoder layers, 8 attention heads, and hidden and filter sizes of 256 and 512, respectively (i.e., the Transformer-base setting). We apply BPE (Sennrich et al., 2016) jointly on tokenized English and Chinese sentences to obtain vocabularies with 11k merge operations. We use Zeng et al. (2022)'s method to build the caption model. The vocabularies, tokenized sentences, and caption models will be released for reproduction. Our code is based on Fairseq (Ott et al., 2019).
When training models on various domains (+PT and +MC in Tab. 2), we upsample the small-scale data (i.e., the e-commercial triplets) because of the massive scale gap.

Notes on Table 2: UPOC² results follow Song et al. (2021); we add all MC and PT data for its pre-training in the +PT+MC column for a fair comparison on data. The complete UPOC² also utilizes product attributes besides images, which we removed from our replication. ♣ Multi30K results are copied from Fang and Feng (2022). Phrase Retrieval is not reported on EMMT since its phrase extraction scripts have not been released. We conduct retrieval for all parallel sentences with the top 5 images as candidates in the +PT column of UVR-NMT.
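The upsampling of the small-scale triplets against the much larger PT/MC corpora can be sketched as follows (an illustrative sketch; the mixing ratio is a hypothetical knob, not a value taken from the paper):

```python
def upsample(small, large, ratio=0.5):
    """Repeat `small` until it accounts for roughly `ratio` of the mixture.

    `small` and `large` are lists of examples; returns the concatenated
    training pool. The target `ratio` is a hypothetical tuning choice.
    """
    if not small:
        return list(large)
    # Choose repeats so that len(small)*r / (len(small)*r + len(large)) ≈ ratio.
    repeats = max(1, round(ratio * len(large) / ((1 - ratio) * len(small))))
    return small * repeats + list(large)
```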

Main Results
We list the main results in Tab. 2 and draw three major findings: 1. In traditional multimodal MT settings (i.e., Triplet-only and Multi30K), where training and inference both use triple data, 2/3-Triplet rivals or even surpasses the previous SOTAs.
2. Parallel text and monolingual captions significantly boost the performance of multimodal translation models. With these additional data, even the plain Transformer model outperforms SOTA multimodal baselines. Given the scarcity of multimodal data, we argue that the use of extra data, especially the parallel text, is more crucial for multimodal translation than the use of multimodal information.
3. FUSION and PROMPT generally achieve the best performance when used together, suggesting that the two approaches are complementary.
We also list results on Multi30K for dataset comparison. Google Translate achieves the best results, while all other models are close in performance with no statistically significant differences. This indicates that images in Multi30K are less essential and a strong text translation model is sufficient to handle the majority of cases. Moreover, we find that incorporating image-free parallel text improves performance significantly and narrows the gap between plain Transformer models and MMT ones. Hence, parallel text rather than images may be more essential for improving performance on Multi30K. In contrast, 2/3-Triplet surpasses Google Translate on EMMT with visual information, providing evidence that EMMT serves as a suitable benchmark.
We also report the results of 2/3-Triplet and the baselines on Fashion-MMT in the Appendix, along with BLEURT (Sellam et al., 2020) and word accuracy as supplementary metrics. The results show that 2/3-Triplet also rivals SOTA MMT systems across various benchmarks and metrics.

Performance on Triplet-unavailable Setting
In many scenarios, annotated triple data is rather scarce or even unavailable, i.e., only bilingual translations or monolingual image captions are available in the training data, while we still wish the model to translate in a multimodal manner. Since 2/3-Triplet utilizes more than triplets, we examine whether our model can perform inference on the multimodal test set when trained only on non-triple data. In this experiment, we discard all images of EMMT's triples during the training stage, while the trained model is still evaluated on the multimodal test set. We compare the triplet-unavailable results to the triplet-only and full-data settings in Figure 3. We can see that 2/3-Triplet still preserves a relatively high performance and even clearly beats the triplet-only setting. This illustrates that involving parallel text and monolingual captions is extremely important for MMT.

Discussion
As plenty of previous studies have discussed, current multimodal MT benchmarks are biased; hence the quality gains of previous work might not actually derive from image information, but from a better training schema or a regularization effect (Dodge et al., 2019; Hessel and Lee, 2020). This section gives a comprehensive analysis and sanity check of our proposed 2/3-Triplet and EMMT: we carefully examine whether and how our model utilizes images, and whether the test set of EMMT is sufficiently reliable.

Visual Ablation Study: Images Matter
We first conduct ablation studies on images to determine how multimodal information contributes to performance. Most studies use adversarial input (e.g., shuffled images) to inspect the importance of visual information. However, the effects of adversarial input might be opaque. Hence, we also introduce absent input to examine whether 2/3-Triplet can handle source sentences without an image, by simply zeroing the image feature for FUSION or stripping the prompt for PROMPT. We list the results of the vision ablation study for both the adversarial and absent settings in Figure 4, where we select the FUSION-BASED and PROMPT-BASED approaches trained with full data (last columns in Table 2) for comparison. In the absent setting, both FUSION and PROMPT degrade to the baseline, confirming the reliance of 2/3-Triplet on image information. In the adversarial setting, PROMPT performs worse than the baseline, in line with the expectation that incorrect visual contexts lead to poor results. However, while FUSION also exhibits a decline in performance, it still surpasses the baseline. This aligns with the observations of Elliott (2018) and Wu et al. (2021) that the visual signal not only provides multimodal information but also acts as a regularization term. We further discuss this issue in Section 7.

How Visual Modality Works
We further investigate how the visual signal influences the model.

FUSION-BASED
We verify how much influence the visual signal imposes upon the model. Inspired by Wu et al. (2021), we quantify the modality contribution via the L2-norm ratio (Λ ⊙ H fused for vision over H text for text, in Eq. 3). We visualize the whole training process along with BLEU as a reference in Figure 5. In Wu et al. (2021)'s systems, the final ratio converges to zero. Our method shows a different characteristic: as BLEU becomes stable, the ratio of the visual to the textual signal still remains at around 0.15, showing the effectiveness of the visual modality.
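The modality-contribution ratio tracked above can be computed per batch as the L2-norm of the gated visual term over that of the textual term (a NumPy sketch; the tensor shapes are our assumption):

```python
import numpy as np

def modality_ratio(gated_visual, H_text):
    """||Λ ⊙ H_fused|| / ||H_text||, averaged over a batch of (T, d) states.

    Both inputs are (batch, T, d) arrays; norms are Frobenius per example.
    """
    num = np.linalg.norm(gated_visual, axis=(-2, -1))
    den = np.linalg.norm(H_text, axis=(-2, -1))
    return float(np.mean(num / den))
```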
PROMPT-BASED We also look into the influence of the prompts. We sample an ambiguous sentence: "chengwei kf94 fish mouth medical mask, 10 pieces of one box". The keyword "mask" can be translated into "口罩" ("face mask") or "面膜" ("facial mask") without any context. We visualize the attention distribution when our PROMPT-BASED model translates "mask" in Figure 6. We can see that high attention is allocated to the caption prompt; therefore, our method correctly translates the word. We also visualize the detailed attention heatmaps for the source, prompts, and generated sentences in the Appendix.

Qualitative Case Study
We also compare several cases from the EMMT test set to discuss how multimodal information and external bilingual data help translation performance. Meanwhile, we regard the case study as a spot check for the multimodal translation test set itself. We choose the plain Transformer, our methods trained on triplets only and on all data, as well as the human reference for comparison. Table 3 presents the qualitative cases, and the major conclusions are as follows: 1) Visual information plays a vital role in disambiguating polysemous words or vague descriptions. 2) Non-triple data improves translation accuracy, particularly for jargon, and enhances fluency given the general lack of multimodal data. 3) Our test set is representative of real-world scenarios, as it includes product titles that are confusing and require images, in contrast to previous case studies on Multi30K where researchers artificially masked key words (Caglayan et al., 2019; Wu et al., 2021; Wang and Xiong, 2021; Li et al., 2022a).

Conclusion
This paper devises a new framework, 2/3-Triplet, for multimodal machine translation and introduces two approaches to utilize image information. The new methods are effective and highly interpretable. Considering that current multimodal benchmarks are limited and biased, we introduce a new dataset, EMMT, in the e-commercial domain. To better validate multimodal translation systems, its test set is carefully selected so that images are crucial for translation accuracy. Experimental results and comprehensive analysis show that 2/3-Triplet makes a strong baseline and EMMT can be a promising benchmark for further research.

Limitation
First, there are studies (Wu et al., 2021) claiming that visual information only serves as regularization. In our ablation study, we find that the adversarial setting of the FUSION-BASED approach outperforms the plain Transformer. Combined with observations from previous studies, we suggest that fusion-based architectures may use some image information as a regularization term, yet further quantitative analysis is needed to confirm this phenomenon.
Second, though our test set is carefully selected to ensure textual ambiguity without image data, we encountered difficulties in designing a suitable metric for quantifying the degree to which models resolve the ambiguity. Specifically, we find that conventional metrics, such as word-level entity translation accuracy, exhibit significant fluctuations and do not effectively quantify the extent to which a model resolves ambiguity. We discuss this metric in more detail in the Appendix, and offer a glossary of ambiguous words used in the test set. We acknowledge that the evaluation of multimodal ambiguity remains an open problem and an area for future research.
In addition, there are some details regarding the dataset that we need to clarify: the dataset was collected after the outbreak of COVID-19, so some commodities are associated with the pandemic. We collected data by category in order to cover various products and reduce the impact of the epidemic on product types.

Data Copyright
In our study, we present a new dataset of public e-commercial products from Shopee and TikTok Shop. To address copyright concerns, we provide a detailed description of how we collect the data and ensure that our usage complies with all relevant policies and guidelines.
For the Shopee dataset, we obtain the data from their Open Platform API 2 . We carefully review their Data Protection Policy 3 and Privacy Policy guidelines 4 , which provide clear instructions for using data through the Shopee Open Platform. We strictly follow their requirements and limitations, ensuring that we do not access any personal data and only use open information provided by the API. We also adhere to their robot guidelines 5 , avoiding full-site scraping.
For the TikTok Shop dataset, we access the data using robots, as scraping is allowed according to their robots.txt file 6 . We also review TikTok Shop Privacy Policy and TikTok for Business Privacy Policy 7 to ensure that we only collect data from merchants under their policy.
It is important to note that all data we publish is publicly available on the Internet and only pertains to public e-commercial products. We do not access or publish any user information, and we take all necessary steps to respect the intellectual property and privacy rights of the original authors and corresponding websites. If any authors or publishers express a desire for their documents not to be included in our dataset, we will promptly remove that portion from the dataset. Additionally, we certify that our use of any part of the datasets is limited to non-infringing or fair use under copyright law. Finally, we affirm that we will never violate anyone's right to privacy, act in any way that might give rise to civil or criminal liability, collect or store personal data about any author, or infringe any copyright, trademark, patent, or other proprietary rights of any person.

Results on Fashion-MMT
We list the test set performance on Fashion-MMT in Table 4. Fashion-MMT is divided into two subsets according to the source of the Chinese translation: the "Large" subset for the machine-translated part and the "Clean" subset for the manually annotated part. As its authors also found the Large subset noisier and different from the human-annotated data, our experiments focus on the Clean subset (i.e., Fashion-MMT(C)).
We compare model performance under the Triplet-Only setting and with Parallel Text added. As the original dataset does not provide a parallel corpus without pictures, we use the Parallel Text from EMMT for these experiments.
Note that the UPOC² model relies on three sub-methods, namely MTLM, ISM, and ATTP. ATTP requires the use of commodity attributes, whereas our model does not use such information. Hence, we also list the results of UPOC² without ATTP in the table.
The results show that our model rivals UPOC² in the triplet-only setting. Moreover, by using parallel text, ours gains further improvement, even though the parallel text does not match the domain of the original data. The results demonstrate the potential of our training strategy across multiple domains.

Evaluation with various metrics
Recent studies have indicated that sole reliance on BLEU as an evaluation metric may be biased (Marie et al., 2021). We hence also evaluate models with the learned metric BLEURT (Sellam et al., 2020) in Table 5. Previous multimodal works often replace entity nouns in the original sentence with [mask] to quantify a model's ability to translate the masked items with images (Wang and Xiong, 2021; Li et al., 2022a; Fang and Feng, 2022). While such experiments can measure the effectiveness of multimodal information, text with [mask] is not natural, and the setting makes less sense in the real world. Inspired by these settings, we have developed a set of commonly encountered English-Chinese translation ambiguities by mining frequent product entities and annotating them manually. We define a word-level accuracy metric based on the potentially ambiguous words in Table 7: if a given English word appears in the source sentence, the model's translation must be consistent with the human reference's corresponding entity translation to be considered correct, and we calculate word-level accuracy accordingly.
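The word-level accuracy described above can be sketched as follows (our own simplified formulation; the glossary mapping each ambiguous English word to its candidate Chinese renderings is hypothetical in form):

```python
def word_accuracy(src_sents, hyp_sents, ref_sents, glossary):
    """Word-level accuracy over ambiguous glossary entries.

    glossary: {english_word: set of candidate Chinese renderings}.
    A hit counts when the hypothesis contains the same glossary rendering
    that appears in the human reference for that ambiguous word.
    """
    hits = total = 0
    for src, hyp, ref in zip(src_sents, hyp_sents, ref_sents):
        for word, renderings in glossary.items():
            if word not in src.lower().split():
                continue
            ref_forms = [r for r in renderings if r in ref]
            if not ref_forms:        # reference uses none of the listed forms
                continue
            total += 1
            hits += any(r in hyp for r in ref_forms)
    return hits / total if total else 0.0
```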
The BLEURT results generally align with BLEU, indicating the effectiveness of 2/3-Triplet. One exception is the Google Translate system, whose score is the highest among all systems. We attribute this deviation to the use of back-translated pseudo corpora in the pre-training of the BLEURT model.
Multimodal models consistently outperform plain Transformer models in word-level accuracy. Additionally, Google Translate obtains the lowest word-level accuracy, indicating that BLEURT may not distinguish ambiguous words in multimodal scenarios. However, the differences among the multimodal models are not significant. We attribute this to the difficulty of quantifying the semantic differences between synonyms, as we will demonstrate in the case study details. Furthermore, given the significant human effort required to mine and annotate the ambiguous word list, and given that the list is highly specific to the test-set domain, we suggest that developing new metrics for evaluating multimodal translation ambiguity is a valuable topic for future research.

Translation Details of Case Study
Here we give some detailed explanations of the case study translations. In the first case, the plain Transformer fails to recognize whether the word "grains" means a cereal crop (谷物) or grain-sized pieces of cheese (奶酪粒). Triplet-Only 2/3-Triplet translates "grains" into 颗粒, which is acceptable, though the word is not commonly used to describe food in Chinese; moreover, the model does not translate "only" grammatically.
In the second case, the plain Transformer translates "mask" into 面具, which in Chinese more commonly refers to an opera mask. Both the plain Transformer and Triplet-Only 2/3-Triplet fail to understand "pcs" (件、个、片) and "ply" (层), and copy them directly into the target. The two methods also fail to translate "surgical" (手术、外科) correctly, as it is a rare word in the Triplet Only setting.
In comparison, the translation of 2/3-Triplet is more consistent with the images and more appropriate in terms of grammar and wording.

Table 6: Results of 2/3-Triplet and the plain Transformer on EMMT with parallel text and excessive parallel text (5M).

Attention Visualization
We visualize the attention heatmaps of one PROMPT-BASED case in Figure 8, which shows the attention alignment between the original source (y-axis) and the prompted source (x-axis) in the text encoder, as well as between the generated sentence (y-axis) and the prompted source (x-axis) in the text decoder. From the heatmaps we see that the prompt attends to the most relevant ambiguous words and supports the model's translation, both when encoding the source sentence and when decoding the inference. Specifically, in our case, "口罩" (face mask) in the prompt has high attention with every occurrence of "masks" on the source side, and with every generation of "口罩" on the decoder side. In contrast, the word "防护" (protective) is less prominent in the attention heatmap, as it is less ambiguous.

Details on Data Selection and Mixing
As discussed in Section 5.3, we upsample the e-commerce triplet data due to the significant disparity in data quantity across domains. Following Wang and Neubig (2019) and Arivazhagan et al. (2019), we use a temperature-based sampling method, where the i-th data split is assigned a sampling weight proportional to D_i^(1/T), with D_i the number of sentences in the i-th data split and T the temperature hyper-parameter. In our implementation, to guarantee the completeness and homogeneity of data in each training iteration, we directly upsample the triplet data and monolingual captions by integer factors and then shuffle them randomly with the parallel text to construct the training dataset. The upsampling rate for the triplet data is rounded to 15 and the upsampling rate for the parallel text is rounded to 4, resulting in an actual sampling temperature of 5.11.
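The conversion from temperature-based weights to integer upsampling rates can be sketched as follows. The normalization (keeping the largest split at rate 1) and the split sizes in the example are illustrative; the paper's exact counts and rounding may differ:

```python
# Sketch of temperature-based upsampling: sampling weights proportional
# to D_i**(1/T), converted into integer upsampling rates relative to
# the largest data split.

def upsampling_rates(sizes, T):
    """sizes: number of sentences D_i per data split.
    Returns how many copies of each split to concatenate so that the
    effective distribution is proportional to D_i**(1/T), with the
    largest split kept at rate 1."""
    d_max = max(sizes)
    # rate_i = (D_i**(1/T) / D_i) normalized by the largest split,
    # which simplifies to (D_i / d_max)**(1/T - 1):
    return [round((d / d_max) ** (1.0 / T - 1.0)) for d in sizes]

# A split 100x smaller than the largest, with T = 2, is upsampled
# by a factor of 0.01**(-0.5) = 10.
```

T = 1 reproduces proportional sampling (all rates 1), while larger T flattens the distribution by duplicating the smaller splits more aggressively.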

Model Performance with Excessive Data
Based on the data distribution and scaling laws, we sample 750k parallel sentences and 103k monolingual captions as non-triple data to validate our methods.
To further explore the potential of models with excessive non-triple data, we increase the parallel text corpus to 5M sentence pairs, also sampled from the CCAlign corpus. We list the results in Table 6. However, we find that the excessive parallel text does not further improve model performance on the current test sets. We suggest that the lack of improvement may be due to the domain mismatch between the general-domain text and the e-commerce domain. As we will release the parallel text corpus used in our experiments, in addition to conducting fair comparisons based on our data, we also encourage future researchers to use more unconstrained external data and techniques to further improve performance.