Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs

This paper presents an in-depth study of multimodal machine translation (MMT), examining the prevailing understanding that MMT systems exhibit decreased sensitivity to visual information when text inputs are complete. We attribute this phenomenon to insufficient cross-modal interaction, rather than to image information redundancy. A novel approach is proposed to generate parallel Visual Question-Answering (VQA) style pairs from the source text, fostering more robust cross-modal interaction. Using Large Language Models (LLMs), we explicitly model the probing signal in MMT by converting it into VQA-style data, creating the Multi30K-VQA dataset. An MMT-VQA multi-task learning framework is introduced to incorporate explicit probing signals from the dataset into the MMT training process. Experimental results on two widely-used benchmarks demonstrate the effectiveness of this novel approach. Our code and data are available at: \url{https://github.com/libeineu/MMT-VQA}.


Introduction
Broadening the scope of traditional machine translation, multimodal machine translation (MMT) enhances the quality of text translation by factoring in an auxiliary visual modality (Specia et al., 2016). The key challenge of MMT is finding an effective way to integrate images into the translation model, thereby maximizing the utilization of visual information (Caglayan et al., 2016; Libovický and Helcl, 2017; Calixto and Liu, 2017; Qian et al., 2018). Meanwhile, it operates on the premise that relevant visual context can help clarify or fill in gaps when the source sentence is unclear or incomplete.
However, researchers soon discovered that MMT systems do not always perform as expected: the visual component often contributes little to the translation process when the text is complete. Surprisingly, MMT systems can still behave well even when the image input is unrelated to the text (Grönroos et al., 2018; Lala et al., 2018; Elliott, 2018). This raises questions about the actual role of visual input in the MMT framework. Addressing this, Caglayan et al. (2019) pointed out that vision features are helpful when the source sentence misses important patterns. Along this line, Li et al. (2022a) designed more specific probing tasks to evaluate how MMT behaves in a limited textual context. But a severe problem still remains: MMT systems exhibit less sensitivity to the information conveyed by the vision modality when the text is complete.
Our hypothesis posits that this may not result from the redundancy of visual data but rather from the lack of effective interaction between the text and visual modalities. It is a reasonable progression to consider the probing task proposed by Li et al. (2022a), which explicitly measures cross-modal ability. This task involves masking critical entities and evaluating whether MMT can produce accurate translations by attending to the image.
However, this process falls short in providing any actionable feedback to MMT, as it serves only as an evaluation metric. Consequently, it is worth investigating whether we can incorporate the probing signals into the training process, instead of using them solely for evaluation.
In this study, we focus on augmenting MMT systems by incorporating probing signals during training to fortify the interplay between text and visual data. This actively stimulates the textual representation to engage with visual data. To achieve this goal, we leverage Large Language Models (LLMs) to identify intricate contextual words and generate questions, thereby transforming the original text into Visual Question-Answering (VQA) style pairs. This results in a new Multi30K-VQA dataset. On this basis, we introduce an MMT-VQA multi-task learning framework, prompting the MMT model to actively "probe" visual data guided by the textual context.
Our main contributions can be summarized as follows:

• We first demonstrate that utilizing advanced pre-trained vision features, such as MAE and BLIP, yields consistent performance improvements across various scenarios.

• We release the Multi30K-VQA dataset, built via an LLM API, which consists of question-answering pairs derived from the source sentences.

• We propose an MMT-VQA multi-task learning framework that explicitly models the probing signal in MMT to strengthen the interactions between the text and visual modalities.

• Experimental results on the Multi30K En-De and En-Fr tasks demonstrate the effectiveness of our method, both in terms of BLEU and METEOR scores and on targeted test sets.

An Empirical Study on Advanced Vision Features
The use of stronger vision models as image encoders in MMT has gained widespread attention. Li et al. (2022a) demonstrated that the application of more robust pre-trained vision models, such as the Vision Transformer (ViT) (Dosovitskiy et al., 2021), considerably enhances MMT systems, as evidenced by the substantial improvements in the accuracy metrics of the proposed probing tasks.
Continuing along this research trajectory, we question whether more recent pre-trained vision models can offer additional advantages to MMT systems, a topic yet to be fully explored. In this context, we chose MAE (He et al., 2022) due to its notable impact in the field of computer vision, as well as two vision-language pre-trained models, CLIP (Radford et al., 2021) and BLIP (Li et al., 2022b), for their potential as vision encoders in an MMT system. Table 1 shows comparisons on the Test2016 test set of the Multi30K En-De task, based on the selective attention model (Li et al., 2022a). We see that the MAE-based MMT system delivers the best performance. This outcome was unexpected, as we had anticipated that models like CLIP, endowed with strong cross-modal modeling capabilities, would excel. Indeed, in probing tasks with incomplete text (Table 2), CLIP showed a significant performance boost owing to its innate cross-modal modeling knowledge. It thus appears that the cross-modal knowledge of the pre-trained model does not come into play when the text is complete. This leads us to a fundamental question: now that we have better image representations, how can we further enhance the cross-modal interaction ability of the model within the MMT context?
MMT-VQA Multi-Task Learning Framework

Probing Signal Modeling: Probing to VQA

A compelling proposition is to compel the MMT system to consider image representations even in the absence of ambiguity. The probing task proposed by Li et al. (2022a) serves as an effective evaluation metric, determining the usefulness of image features in a limited textual context. Figure 1 shows how the probing task implicitly guides the model to fill the masked position: "probing" the image information. Li et al. (2022a) demonstrated that the probing task efficiently models a cross-modal interaction scenario. An intriguing question is whether this task can be used for training: the probing task can simultaneously spur and assess the cross-modal interaction capabilities of a model, making it an effective method for capability enhancement. However, on small datasets like Multi30K, the reduction of contextual information and the loss of text coherence caused by masking during supervised training can significantly hurt performance. We therefore need to transform the probing task into another form, and exploring how to extract the probing signal for training becomes necessary.
We propose to model the probing signals explicitly. Figure 1 shows a case of VQA data. The idea is to formulate questions based on the source, ensuring that all question content originates from the source and that the answer is the word masked in the probing task (such as "yellow" in Figure 1).

Source:
A child in a yellow hoodie playing in puddles.

Probing Task:
A child in a [MASK] hoodie playing in puddles.

VQA:
Query: What color hoodie do child wear to play in puddles? Answer: Yellow.

We thus construct a VQA-style question-answering pair, transforming the original source (which would otherwise require masking) into a parallel QA pair. In this way, we complete the explicit extraction and representation of the probing signal.
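To make the transformation concrete, the sketch below turns a probing-task sample into a VQA-style pair. Note that the actual pipeline is LLM-driven (Section 3.2); the `probing_to_vqa` helper and its simple question templates are hypothetical simplifications for illustration only.

```python
# Sketch: turning a probing-task sample into a VQA-style pair.
# The real pipeline uses an LLM (Section 3.2); this template rule is a
# hypothetical simplification for illustration only.

def probing_to_vqa(source: str, answer: str, answer_type: str) -> dict:
    """Build a QA pair whose answer is the word that the probing
    task would have masked in the source sentence."""
    masked = source.replace(answer, "[MASK]", 1)  # probing-task view
    templates = {  # hypothetical one-template-per-type rules
        "Color": "What color is mentioned in the passage?",
        "Numbers": "What number is mentioned in the passage?",
        "Nouns": "What object is mentioned in the passage?",
        "Characters": "Who is mentioned in the passage?",
    }
    return {
        "probing": masked,
        "question": templates[answer_type],
        "answer": answer,
    }

pair = probing_to_vqa(
    "A child in a yellow hoodie playing in puddles.", "yellow", "Color"
)
print(pair["probing"])  # A child in a [MASK] hoodie playing in puddles.
print(pair["answer"])   # yellow
```

The key property preserved by the transformation is that the answer word is exactly the token the probing task would mask, so supervising on the QA pair carries the same signal without corrupting the source sentence.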

Multi30K-VQA: LLM-Driven Data Generation
To explicitly strengthen the correlations with the help of probing signals, we need to solve the challenge of generating high-quality question-answering pairs. The crux of the issue lies in selecting appropriate tokens for questioning, a task that proved both time-consuming and difficult to do manually. Due to the limitations of supervised training with small-scale data, we turned our attention to large language models (LLMs). Recently, there has been a surge in the capabilities of LLMs such as GPT-3.5 (Ouyang et al., 2022): one can design specific prompts to guide the generation by providing several task-specific instructions and demonstrations. Here, we summarize three important aspects of constructing high-quality QA pairs: determining the answer words, comprehending the source, and posing questions based on the source.
Figure 2 illustrates the prompt we employ; we now explain our design approach step by step. To guide the LLM through this task, we constructed a reading-comprehension scenario to describe the requirements. With the source of each sample in the dataset acting as the reading-comprehension passage, the LLM formulates questions based on the passage and generates answers to construct QA pairs.

Task Description This part constructs the reading-comprehension scenario and provides a comprehensive description of the problem. We assign the LLM the role of an experienced author of professional English test papers, specifically focusing on reading-comprehension questions. We then designate four types of probing questions to guide the LLM in ascertaining the correct answer. In the final step, we charge the LLM with generating a reading-comprehension question and supplying an answer rooted in the content of the source sentence.

Generation Requirements During this phase, our principal focus was the enhancement of prompts through human feedback. We implemented a set of generative guidelines designed to augment data quality. Notably, we hand-crafted several strategies (illustrated in the blue segment of Figure 2) for identifying and addressing translation pitfalls, thereby enhancing the overall quality of the generated content. The full prompt, shown in Figure 2, reads:

"Please be the senior author of the professional English test paper, mainly responsible for reading comprehension questions. I will give you a sentence as a passage of a reading comprehension topic, and then ask you to come up with a reading comprehension question according to the content of the passage to test the student's mastery of the content of the passage, and finally you will give the answer. I need you to make a 'what' type of question about the topic's numbers, nouns, colors, and characters. The characters type of questions needs to be answered by: 'man', 'woman', 'people', 'men', 'girl' and 'boy'. Note: Don't ask questions that have no answer or the answer is unknown, make sure that the answer can be obtained from the passage!!! Give priority to asking questions for the parts that most need to view relevant pictures to assist in judgment!!! Give priority to asking questions about the parts that are most likely to be ambiguous and confusing in understanding the text, such as the part of polysemy. Questions can only be asked for one type, and the types can only be: Numbers, Nouns, Colors, and Characters. The content of questions and answers must be limited to the passage, and do not ask questions about external knowledge. The answer needs to use the expression in the original passage, does not contain extra words other than the question, and only contains the word phrase corresponding to the content in the original passage!! The answer should be no more than two words. Do not comment on the reason for the answer, and try not to use complete sentences!! Below I will provide you with two examples to learn. Please refer to the format in the examples when replying to me!

Passage: two young, white males are outside near many bushes. Type: Color Question: What color is a woman? Answer: White.

Passage: several men in hard hats are operating a giant pulley system. Type: Characters Question: Who are operating the giant pulley system? Answer: Men.

After understanding what I said, please respond me according to this sentence: {}."
In-Context Learning In the final phase, we selected two carefully designed examples as demonstrations to foster in-context learning in the LLM (text-davinci-003). This process can be regarded as a 2-shot prompting strategy. With the temperature set to 0, we consequently obtained 29,000 data entries in a Type-Question-Answer format (several samples are shown in Table 3), which together constitute the Multi30K-VQA dataset, provided in our supplementary material.
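The 2-shot prompt assembly can be sketched as follows. The instruction text is abbreviated (see Figure 2 for the full prompt), and the function name and string layout are our own; the sketch only shows how the instruction, the two demonstrations, and the query sentence are concatenated before being sent to the LLM API with temperature 0.

```python
# Sketch of the 2-shot prompt assembly used to query the LLM
# (text-davinci-003, temperature 0). The instruction text is abbreviated;
# see Figure 2 for the full prompt.

INSTRUCTION = (
    "Please be the senior author of the professional English test paper... "
    "Questions can only be of the types: Numbers, Nouns, Colors, Characters."
)

DEMONSTRATIONS = [
    ("two young, white males are outside near many bushes.",
     "Color", "What color is a woman?", "White."),
    ("several men in hard hats are operating a giant pulley system.",
     "Characters", "Who are operating the giant pulley system?", "Men."),
]

def build_prompt(source: str) -> str:
    parts = [INSTRUCTION]
    for passage, qtype, question, answer in DEMONSTRATIONS:
        parts.append(
            f"Passage: {passage}\nType: {qtype}\n"
            f"Question: {question}\nAnswer: {answer}"
        )
    parts.append(
        "After understanding what I said, please respond me "
        f"according to this sentence: {source}"
    )
    return "\n\n".join(parts)

prompt = build_prompt("A child in a yellow hoodie playing in puddles.")
# The assembled prompt is then sent to the LLM API; the response is
# parsed into a Type-Question-Answer record.
```

Setting the temperature to 0 makes the completion effectively deterministic, which helps keep the generated Type-Question-Answer records consistent in format across the 29,000 samples.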

MMT-VQA Multi-Task Learning Model
After obtaining the Multi30K-VQA dataset, our central design goal is to enable the MMT model to "ask questions to the image" via a multi-task learning framework. MMT remains the main task, while VQA serves as a secondary, supporting task. Upon aligning the data of the two tasks, we work with three inputs: Source, Question and Image; the corresponding target labels for MMT and VQA are Target and Answer, respectively.
We adopt the selective attention of Li et al. (2022a) because of its strong performance. Our model thus combines a Transformer encoder with an advanced vision model as the encoders for the text and visual modalities, respectively. By sharing the parameters of the Text Encoder between the two tasks, we encourage an interplay of information that enhances MMT while catering to the needs of VQA for modeling the question.
Given the differences in how the two tasks fuse information and the variance in their output languages, we anticipated that discrepancies between the two language representation spaces could significantly affect model performance. We therefore independently initialize the parameters of the Selective Attention layer and the Decoder.

Model Architecture The proposed overall model architecture is depicted in Figure 3. The training process is initiated by feeding two distinct textual inputs, Source (X_src) and Question (X_qsn), into the Text Encoder. This yields two corresponding text representations, H_src and H_qsn. Simultaneously, we apply a pre-trained vision model to the associated image, thereby obtaining H_img.
H_img = W · PTM(X_img),

where W is a projection matrix that converts the shape of PTM(X_img) into that of X_qsn and X_src, and PTM(·) denotes MAE or another advanced vision model.
After obtaining the three representations, the text is associated with the image patches using a single-head attention network, where the queries derive from H_src and H_qsn, respectively, and the keys and values are both H_img. Formally, these two processes can be expressed as:

H_src^att = Softmax(H_src H_img^T / √d_k) H_img,
H_qsn^att = Softmax(H_qsn H_img^T / √d_k) H_img,

where d_k equals the dimension of X_src and X_qsn, since a single head is used.
We then compute a gate and the fused output for each of the two text streams:

λ_text = Sigmoid(H_text U + H_text^att V),
H_text^Out = H_text + λ_text · H_text^att,

where H_text is H_src or H_qsn, U and V are trainable variables, and λ_text controls the mixing ratio of visual information. The fusion vectors H_src^Out and H_qsn^Out are then sent to the Target Decoder and the Answer Decoder to predict the target sentence and the answer, respectively.

Training Objective To facilitate the learning of MMT with the assistance of VQA, we regard MMT as the primary task and VQA as an auxiliary task. The training objective for MMT is the standard Maximum Likelihood Estimation (MLE) loss, optimized jointly with the VQA objective:

L_Total = L_MMT + λ · L_VQA,

where λ is a hyperparameter controlling the balance between the MMT loss and the VQA loss. During training, the objective is to minimize the total loss L_Total.
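To make the fusion step concrete, here is a minimal pure-Python numeric sketch of the single-head selective attention and gated fusion described above. It follows our reading of the selective attention of Li et al. (2022a) and is an illustration, not the authors' implementation; the toy inputs and the identity matrices standing in for the trainable U and V are arbitrary.

```python
# Minimal numeric sketch of selective attention with gated fusion.
# Shapes: H_text is (n_text, d), H_img is (n_patch, d); U, V are (d, d).
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def selective_attention(H_text, H_img, U, V):
    d_k = len(H_text[0])
    # Single-head attention: queries from the text, keys/values from the image.
    scores = [[sum(q * k for q, k in zip(qrow, krow)) / math.sqrt(d_k)
               for krow in H_img] for qrow in H_text]
    attn = [softmax(row) for row in scores]
    H_att = matmul(attn, H_img)
    # Gate lambda = sigmoid(H_text @ U + H_att @ V), then residual fusion.
    gate_in = [[a + b for a, b in zip(r1, r2)]
               for r1, r2 in zip(matmul(H_text, U), matmul(H_att, V))]
    lam = [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in gate_in]
    H_out = [[t + g * a for t, g, a in zip(tr, gr, ar)]
             for tr, gr, ar in zip(H_text, lam, H_att)]
    return H_out

# Toy example: 2 text positions, 3 image patches, d = 4.
H_text = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]
H_img = [[0.2, 0.1, 0.0, 0.3], [0.4, 0.4, 0.1, 0.0], [0.0, 0.2, 0.5, 0.1]]
I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
H_out = selective_attention(H_text, H_img, I, I)
assert len(H_out) == 2 and len(H_out[0]) == 4
```

In the full model this routine runs twice with shared image features, once with H_src and once with H_qsn as the queries, producing the two fusion vectors consumed by the Target Decoder and the Answer Decoder.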

Data
Multi30K We evaluate our methods on two standard benchmarks: Multi30K English-German (En-De) and English-French (En-Fr) (Elliott et al., 2016). Multi30K is a widely used MMT dataset containing 31,014 images, each paired with one English description and manual translations in German and French. The training and validation sets consist of 29,000 and 1,014 instances, respectively. We report results on the Test2016, Test2017, Test2018 and MSCOCO test sets (Elliott et al., 2017), which include 1,000, 1,000, 1,071 and 461 instances, respectively.
Multi30K-VQA As detailed in Section 3.2, we introduce the Multi30K-VQA dataset, featuring 29,000 rigorously curated QA pairs (the same size as the training set) for text-image tasks. To ensure quality and minimize hallucination, the LLM-generated QA pairs underwent iterative reviews and refinements on 500 randomly selected samples. Despite these efforts, 1,012 pairs still fell short of our expectations. We employed hand-crafted rules to refine 605 of these pairs and re-annotated the remaining 407 at the sentence level with three annotators. Our final prompt, refined through 5 iterations, yielded the best output among the variants we tested. Answer-type distributions are shown in Table 6.

Model Settings
Considering the small size of the Multi30K dataset, we follow prior work in using Transformer-Tiny as the base configuration (Wu et al., 2021). Our implementation is based on the fairseq library (Ott et al., 2019), supplemented by pre-trained vision models from Huggingface. Optimization uses Adam, with β1 and β2 set to 0.9 and 0.98 respectively, and ϵ set to 10^-8. The learning rate is set to 0.005 with a warmup phase of 2,000 update steps.
The model uses a dropout rate of 0.3 and label smoothing of 0.1. Training is conducted on 2 GeForce RTX 3090 GPUs, with each training batch comprising 4,096 tokens per GPU. To ensure a fair comparison with previous work, we adopt an early-stopping strategy, triggered if there is no performance improvement on the validation set over 10 epochs. For inference, we average the last 10 checkpoints for robust performance. Note that λ is set to 0.3, which achieved the best results in our preliminary experiments.
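As an illustration of the optimization setup, the sketch below pairs the stated hyperparameters with a linear-warmup, inverse-square-root decay schedule (the scheduler commonly used with this Adam configuration in fairseq; that the paper uses exactly this scheduler is our assumption) and the combined training objective with λ = 0.3.

```python
# Sketch of the assumed warmup + inverse-sqrt learning-rate schedule and
# the combined MMT-VQA objective with lambda = 0.3.
import math

BASE_LR = 0.005
WARMUP = 2000

def learning_rate(step: int) -> float:
    if step < WARMUP:
        return BASE_LR * step / WARMUP          # linear warmup
    return BASE_LR * math.sqrt(WARMUP / step)   # inverse-sqrt decay

def total_loss(l_mmt: float, l_vqa: float, lam: float = 0.3) -> float:
    # L_Total = L_MMT + lambda * L_VQA, with lambda = 0.3 as in the paper.
    return l_mmt + lam * l_vqa

assert learning_rate(2000) == BASE_LR          # peak LR at end of warmup
assert abs(total_loss(2.0, 1.0) - 2.3) < 1e-9  # 2.0 + 0.3 * 1.0
```

The λ weighting keeps VQA strictly auxiliary: gradients from the answer-prediction loss are scaled down so they steer the shared Text Encoder without dominating the translation objective.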

Results
Table 4 presents the results of various MMT systems evaluated on the Multi30K dataset. All models are evaluated in terms of BLEU and METEOR on six separate test sets for the En-De and En-Fr tasks. Compared with the most recent and strongest prior work, MMT-VQA outperforms most competing MMT systems on both metrics, demonstrating the effectiveness of strengthening cross-modal interactions. For a more convincing comparison, we further conducted experiments on Test2018 (Table 5), where a similar trend can be observed.

Experimental results in Table 7 demonstrate the general applicability of the MMT-VQA multi-task learning framework. MMT-VQA remains beneficial when switching to other advanced vision features, encompassing both pre-trained vision models and vision-language pre-trained models such as CLIP and BLIP. The method yields improvements across the various test sets, with gains of varying magnitude; the largest observed increment is 1.21 BLEU, reinforcing the general applicability and robustness of our proposed methodology.

Analysis
Here, we provide a more in-depth analysis of the importance of MMT-VQA. Unless otherwise noted, all experiments were conducted on the Multi30K En-De task.

Illustrating Training Loss
First, we would like to validate whether the improvements come from the strengthened interaction between modalities or merely from an additional regularization signal. Figure 4 plots the two losses, L_MMT (in blue) and L_VQA (in red), during training. Given the minimal changes observed in the later stages of training, we present only the trends from the first 25 epochs. Both losses converge stably to a certain level, indicating that the auxiliary loss does not act as mere regularization. We also observe rapid convergence on the VQA task. This can be attributed to our strategic choice of short words as answers during data generation, which significantly reduced the learning difficulty for the model.

Incongruent Question-Answering Pairs
We further explored the effects of training with incongruent question-answering pairs to substantiate our claims. If VQA merely acted as a noise signal during training, then randomizing the Question and Answer fields in the training data should not significantly impact the results. To scrutinize this, we conducted additional experiments, employing the BLIP model for visual feature extraction due to its notable performance improvement. The experimental results, depicted in Figure 5, reveal a significant drop in the performance of MMT-VQA-Wrong-Answer and MMT-VQA-Wrong-Question across all three test sets. Some metrics even regressed below the performance of Selective Attention, which does not employ the MMT-VQA method. We attribute these decreases to the model's failure to correctly learn the correspondences and probing signals among the Question, Answer and Image triad. This further emphasizes the importance of correct alignment and interaction in improving the efficacy of MMT systems.

Image Perspective We evaluated visual context sensitivity using adversarial methods, following the inconsistent decoding setup of prior studies (Caglayan et al., 2019, 2021). By replacing random proportions of image samples with irrelevant images, we performed ablation experiments and compared the BLEU performance changes of the MMT baseline (Selective Attention) and MMT-VQA. As seen in Figure 6, without our strengthened modality interactions, the baseline shows higher sensitivity and more overall fluctuation without a clear downward trend, suggesting inadequate use of image information. MMT-VQA, despite similar fluctuation levels, maintains a downward trend, indicating a stronger dependence on the accuracy of visual information and improved visual-modality interaction.
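The adversarial image-replacement probe can be sketched as follows: a given proportion of image ids is swapped with ids drawn from other samples, so the affected examples receive unrelated images, and the BLEU drop of the resulting translations is then measured. The function name and replacement strategy here are ours, offered as one plausible implementation of the setup.

```python
# Sketch of the adversarial image-replacement probe: a given proportion of
# images is swapped with images from other samples, and translation BLEU
# is re-measured. The replacement strategy shown is one plausible choice.
import random

def corrupt_images(image_ids, proportion, seed=0):
    """Replace `proportion` of the image ids with ids shuffled among the
    selected positions, so replaced samples get images from other samples."""
    rng = random.Random(seed)
    n = len(image_ids)
    k = int(n * proportion)
    idx = rng.sample(range(n), k)      # positions to corrupt
    pool = [image_ids[i] for i in idx]  # their images, to be permuted
    rng.shuffle(pool)
    out = list(image_ids)
    for i, new_id in zip(idx, pool):
        out[i] = new_id
    return out

ids = list(range(10))
corrupted = corrupt_images(ids, 0.5)
assert sum(a != b for a, b in zip(ids, corrupted)) <= 5
```

A model that genuinely relies on visual information should degrade monotonically as the corruption proportion grows, which is the downward trend MMT-VQA exhibits in Figure 6.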

Test2016-con: LLM-Driven Model Evaluation
We used LLMs to identify confusing words within samples of Test2016 and formed a subset, Test2016-con, consisting of the samples containing these words. This subset prioritizes image disambiguation, necessitating the interplay of the two modalities. Table 9 shows that our MMT-VQA method notably exceeds the performance of the baseline system with both MAE and BLIP vision features, suggesting a stronger integration with the image modality for disambiguation. We will also open-source this test set in our codebase.
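The subset construction can be sketched as follows: keep only the Test2016 samples that contain at least one word flagged as confusing. The helper and the toy word list are hypothetical; the actual flagging was LLM-driven.

```python
# Sketch of assembling Test2016-con: retain samples containing any word
# flagged as confusing (ambiguous). The word list here is a toy stand-in
# for the LLM-identified vocabulary.

def build_con_subset(samples, confusing_words):
    confusing = set(w.lower() for w in confusing_words)
    return [s for s in samples
            if confusing & set(s.lower().split())]

samples = [
    "a man holds a bat near the plate .",
    "two dogs run across the field .",
]
subset = build_con_subset(samples, ["bat", "plate"])
assert subset == ["a man holds a bat near the plate ."]
```

Because every retained sample contains a word whose sense is hard to pin down from text alone, performance on this subset directly reflects how well a system exploits the image for disambiguation.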

Case Study
We also present a qualitative case study, shown in Table 10. In the first case, the MMT system, despite employing a selective attention mechanism and Vision Transformer (ViT) features, faltered in translating the word "up" into German, omitting it entirely. In contrast, our MMT-VQA system produced the accurate German translation "hochgelegten", which effectively captures the notion of "elevated" or "up". Another noteworthy observation is that conventional MMT systems struggle with accurate color translation when a sentence contains multiple color descriptors. This limitation was effectively mitigated in our MMT-VQA system by the incorporation of probing signals: those covering color question-answering pairs, in particular, significantly enhanced the system's ability to resolve such translation ambiguities. Our analysis suggests that the failures of the selective-attention MMT system on seemingly simple sentences stem primarily from its inability to disambiguate certain words effectively.

Related Work
The task of MMT forms an intersection between NLP and CV, with the goal of enhancing translation outcomes through the integration of image-based context. Earlier approaches focused primarily on the effective amalgamation of these dual-modal representations (Calixto and Liu, 2017; Elliott and Kádár, 2017; Delbrouck and Dupont, 2017). Subsequent research, however, began to question the significance of the visual modality, finding that irrelevant images did not substantially influence translation quality; indeed, the absence of images did not lead to a substantial drop in BLEU scores (Elliott, 2018). Caglayan et al. (2019) provided further insight into this phenomenon by demonstrating that the visual modality offers utility in contexts with limited linguistic data, but that its influence is less pronounced when full sentences are available.
In recent studies, Wu et al. (2021) attributed improvements to the effects of regularization during training, and further showed that even a tiny configuration could outperform earlier models in terms of BLEU score. Concurrently, additional efforts have emerged, including hallucination representation augmentation (Li et al., 2022c), Noise-Input masking of additional noise to guide the model toward meaningful interactions (Ye et al., 2022), and dual-level interactive mixup for feature alignment (Ye and Guo, 2022). However, existing MMT systems still face difficulties when dealing with complete textual contexts, a concern that also forms the crux of the present study. Distinctively, we focus on fortifying text-visual interactions through the deployment of probing signals during training. Our approach shares thematic similarity with Ji et al. (2022), who explored this issue from a mutual-information standpoint.

Conclusions
In this work, we offer a novel perspective on multimodal machine translation (MMT), attributing the reduced sensitivity of systems to visual information under complete text inputs to insufficient cross-modal interaction rather than to image information redundancy. We propose a unique approach that generates Visual Question-Answering (VQA) style pairs from the source text via LLMs to model the probing signal in MMT explicitly. On this basis, we release the Multi30K-VQA dataset and design an MMT-VQA multi-task learning framework. Our methodology is validated by experimental results on two representative benchmarks, demonstrating its effectiveness and advancing MMT research.

Limitations
The performance of our proposed model depends on the quality of the generated dataset, which in turn depends mainly on the prompts we prepare for the LLM. Therefore, although we make good use of the capabilities of LLMs, how to design better prompts and reduce the influence of human factors on prompt quality remains worth exploring in the future. In addition, we have not explored better model architectures in depth, and the model we propose may well not be the most suitable for this task.

Figure 1: An illustration of the probing task and the VQA task given the source-image pair.

Figure 3: The overall architecture of the MMT-VQA multi-task learning framework. SA indicates the Selective Attention proposed by Li et al. (2022a). The fusion vectors are sent to the Target Decoder and the Answer Decoder to predict the target sentence and the answer, respectively.

Figure 4: The curves of the two losses during training of the multi-task learning model.

Figure 5: BLEU scores under different manipulations of the Question and Answer data.


Table 2: Accuracy on the Multi30K En-De probing tasks proposed by Li et al. (2022a), with Selective Attention applied.

Table 3: Four types of cases in the Multi30K-VQA dataset.

Table 4: BLEU (B) and METEOR (M) scores of the Multi30K En-De and En-Fr tasks. Some of the results are from Li et al.

Table 5: BLEU (B) and METEOR (M) scores of the Multi30K En-De and En-Fr tasks on the Test2018 test set.

Table 7: BLEU scores of the Multi30K En-De task for the MMT-VQA model compared with the MMT (Selective Attention) model under different pre-trained vision models.

Table 8: BLEU scores exploring whether the improvement comes from the introduction of Questions in the training phase or from other information.

Table 9: BLEU scores of the Multi30K En-De task on the Test2016-con test set.

Table 10: Qualitative examples demonstrating enhanced cross-modal capability. Strikethrough and bold words indicate incorrect and good lexical choices, respectively.