Distill The Image to Nowhere: Inversion Knowledge Distillation for Multimodal Machine Translation

Past works on multimodal machine translation (MMT) elevate the bilingual setup by incorporating additional aligned visual information. However, the image-must requirement of multimodal datasets largely hinders MMT's development, namely the demand for aligned triplets of [image, source text, target text]. This limitation is especially troublesome during the inference phase, when an aligned image is not provided, as in the normal NMT setup. Thus, in this work, we introduce IKD-MMT, a novel MMT framework that supports an image-free inference phase via an inversion knowledge distillation scheme. In particular, a multimodal feature generator is executed with a knowledge distillation module, which directly generates the multimodal feature from (only) the source text as input. While a few prior works have entertained the possibility of supporting image-free inference for machine translation, their performance has yet to rival image-must translation. In our experiments, we identify our method as the first image-free approach to comprehensively rival or even surpass (almost) all image-must frameworks, achieving the state-of-the-art result on the widely used Multi30k benchmark. Our code and data are available at: https://github.com/pengr/IKD-mmt/tree/master.


Introduction
Multimodal machine translation (MMT) is a worthy task that elevates text-only translation by introducing an additional image modality (Specia et al., 2016; Elliott et al., 2017; Barrault et al., 2018). Existing works mostly focus on the fusion and alignment of images and texts to improve MMT (Calixto et al., 2017; Ive et al., 2019; Yin et al., 2020), and they have managed to prove the effectiveness of the aligned visual information. Nevertheless, the strict triplet form of the data, in both the training and inference phases, prevents MMT models from generalizing further. In particular, if we consider using an MMT model to perform normal bilingual text translation as in the NMT setup, aligned images must be provided during inference. Unfortunately, this is often not feasible. This general comparison between image-free and image-must schemes is illustrated in Figure 1(a). In hindsight, the quantity and quality of the attached images have become a bottleneck for the development of MMT, as acquiring such resources is scarce and expensive (e.g. Multi30K (Elliott et al., 2016)).
Indeed, there have been a few attempts to resolve the image-must limitation. For instance, Elliott and Kádár (2017) present a multi-task learning model for MMT that relies on an auxiliary visual grounding task to obtain the visual feature. Zhang et al. (2020) introduce an image retrieval paradigm to find topic-related images in a small-scale dataset. Further, Long et al. (2021) attempt to utilize a set of generative adversarial networks to obtain an imaginary vision feature. We may posit that a (nearly) common ground for such image-free frameworks is to learn, and at inference obtain, a generated visual feature representation without the actual image data. However, none of the aforementioned works has managed to consistently reach the performance of their image-must counterparts. In this work, we hypothesise that this may be caused by inferior learned representations, insufficient coverage of the visual distribution, an improper multimodal fusion stage (Caglayan et al., 2017; Arslan et al., 2018; Helcl et al., 2018; Calixto and Liu, 2017), and/or a lack of training stability.
In this work, we intend to explore this line thoroughly. As shown in Figure 1(b), unlike prior works that solely target visual feature generation and/or rely on later stages of fusion, our approach directly generates a multimodal feature using only the source text input. We enable this by proposing an inverse knowledge distillation mechanism employing pre-trained convolutional neural networks (CNNs).
From our experiments, we find that this architectural choice notably enhances training stability as well as the final representation quality. To this end, we introduce the IKD-MMT framework, an image-free framework that systematically rivals or outperforms image-must frameworks. To set up the inverse knowledge distillation flow, we incorporate dual CNNs with an inverted data feeding flow. Of the two, the teacher network receives the pre-trained weights, while the student CNN is trained from scratch, aiming to provide a high-quality multimodal feature space by incorporating both inter-modal and intra-modal distillations.
Our contributions are summarized as follows: i. The IKD-MMT framework is the first method that systematically rivals or even outperforms existing image-must frameworks, which fully demonstrates the feasibility of the image-free concept; ii. We pioneer the exploration of knowledge distillation combined with pre-trained models in the regime of MMT, as well as multimodal feature generation. We posit that these techniques shed some light on the representation learning and training stability of MMT.

Multi-modal Machine Translation
As an intersection of multimedia and neural machine translation (NMT), MMT has drawn great attention in the research community. Technically, existing methods mainly focus on how to better integrate visual information into the NMT framework. 1) Calixto et al. (2017) propose a doubly-attentive decoder that incorporates two separate attention mechanisms over the source words and visual features. 2) Ive et al. (2019) propose a translate-and-refine approach that refines draft translations with visual features. 3) Yao and Wan (2020) propose the multimodal Transformer to induce image representations from the text under the guidance of image-aware attention. 4) Yin et al. (2020) employ a unified multimodal graph to capture various semantic interactions between multimodal semantic units.
However, the quantity and quality of the annotated images, which are scarce and expensive, limit the development of this task. In this work, we aim to perform MMT in an image-free manner, which breaks these data constraints.

Knowledge Distillation
Knowledge distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) aims to use a knowledge-rich teacher network to guide the parameter learning of a student network. KD has been investigated in a wide range of fields. Romero et al. (2014) transfer knowledge through an intermediate hidden layer to extend KD. Yim et al. (2017) define the distilled knowledge to be transferred in terms of the flow between layers, which is calculated as the inner product between features from two layers. In the multimedia field, Gupta et al. (2016) first introduce a technique that transfers supervision between images from different modalities. Yuan and Peng (2018) propose symmetric distillation networks for the text-to-image synthesis task.
Inspired by these pioneering efforts, our IKD-MMT framework intends to take full advantage of KD to generate a multimodal feature that overcomes the triplet data constraint.


Image-Free MMT Backbone
Given a source sentence X = (x_1, ..., x_I), each token x_i is mapped into a word embedding vector E_{x_i} ∈ R^{d_w} through the textual embedding with position encoding (Gehring et al., 2017), where d_w is the word embedding dimension and t = (E_{x_1}, ..., E_{x_I}) is the textual feature. Then, we feed the text feature t together with the multimodal feature m (detailed in Section 3.2.1) into the multimodal Transformer encoder (Yao and Wan, 2020). In the multimodal encoder layer, we cascade the multimodal feature m and the text feature t to reorganize a new multimodal feature x as the query vector, where I is the length of the source sentence and P is the size of the multimodal feature. We can understand this modal fusion from the perspective of nodes and graphs: if we treat each source token as a node, each region of the multimodal feature can also be regarded as a pseudo-token added to the source-token graph for modal fusion. The key and value vectors are preserved as the text feature t, and the multimodal encoder layer is then computed with attention over these queries, keys, and values. In this paper, we directly adopt the Transformer decoder 2 (Vaswani et al., 2017) for translation.
2 For details, please refer to the original paper.
Given a target sentence Y = (y_1, ..., y_J), our framework outputs the predicted probability of the target word y_j as follows: p(y_j | y_{<j}, X) = softmax(W_h H^L_j + b_h), where H^L_j is the top decoder output at the j-th decoding time step, and W_h and b_h are learnable projection parameters.
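The cascaded-query fusion described above can be sketched as follows. This is a minimal single-head, single-layer illustration with toy sizes and random features; all names (`multimodal_encoder_layer`, the toy dimensions) are our own, not the paper's, and real implementations would add projections, multi-head splitting, and residual connections.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_encoder_layer(t, m):
    """Fuse text feature t (I x d) and multimodal feature m (P x d).

    The query cascades (concatenates) m onto t, treating each of the
    P multimodal regions as a pseudo-token; the keys and values stay
    purely textual, as in the description above.
    """
    x = np.concatenate([t, m], axis=0)      # query: (I + P) x d
    d = t.shape[1]
    scores = x @ t.T / np.sqrt(d)           # (I + P) x I attention scores
    attn = softmax(scores, axis=-1)         # each row sums to 1
    return attn @ t                         # fused output: (I + P) x d

I, P, d = 5, 3, 8                           # toy sizes
rng = np.random.default_rng(0)
t = rng.normal(size=(I, d))                 # text feature
m = rng.normal(size=(P, d))                 # generated multimodal feature
out = multimodal_encoder_layer(t, m)
print(out.shape)                            # (8, 8) = (I + P, d)
```

Note how the output keeps one row per source token plus one per multimodal region, matching the pseudo-token view of modal fusion.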

Preliminaries
In this part, we introduce the framework, symbol definitions, and task goal of multimodal feature generation.
The framework is composed of a multimodal feature generator F, a visual teacher model T, and a multimodal student model S. The detailed architecture of each module is shown in Table 7 of the appendix. The parameters of S are denoted θ^s. When the global text feature t is fed into S, the hidden representation produced by its l-th layer is denoted φ^S_l(t, θ^s_l). F outputs a multimodal feature m, and S produces an inverse feature I_s after the S-conv1 layer. The real image and the inverse feature satisfy I_s, I_r ∈ R^{m×n×3}. Given a feature I as input, the hidden representation produced by the l-th layer of T is denoted φ^T_l(I). Our goal is to generate multimodal features from the source text so as to break the image-must restriction at test time. The visual perception of this multimodal feature is extracted via the visual distillation of the teacher-student pair, while its textual semantics are derived from the text translation of the input text.

Multimodal Feature Generator
First, we simply adopt average pooling to transform all word embedding vectors into a global textual feature, which is proven to carry the overall word senses (Zhang et al., 2010): t = avgpool(E_{x_1}, ..., E_{x_I}). (5) Then, the global text feature t is passed into the multimodal feature generator to compute a multimodal feature m: m = unpool(W_t t). (6) Here, the FC layer W_t projects the global text feature t into the image space, and the following average unpooling expands the low-dimensional latent vector into a high-dimensional multimodal feature map. The dimension of m ∈ R^{P×2048} matches that of the last convolutional activation of the teacher model. Notably, the textual semantics of the multimodal feature are modelled from the global textual context under the supervision of the text translation.
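Eqs. (5)-(6) can be sketched in a few lines. This is an illustrative toy, not the paper's code: the function name and sizes are ours, and we approximate average unpooling by broadcasting the latent vector over the P regions (a faithful unpooling layer would be learned or interpolated).

```python
import numpy as np

def multimodal_feature_generator(E, W_t, P=49):
    """Toy sketch of Eqs. (5)-(6): average-pool word embeddings into a
    global text feature t, project it into image space with FC layer
    W_t, then average-unpool (here: broadcast) into a P x 2048 map."""
    t = E.mean(axis=0)              # Eq. (5): global text feature, (d_w,)
    latent = W_t @ t                # FC projection into image space, (2048,)
    m = np.tile(latent, (P, 1))     # average unpooling to (P, 2048)
    return m

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 128))       # 6 word embeddings, d_w = 128
W_t = rng.normal(size=(2048, 128)) * 0.01
m = multimodal_feature_generator(E, W_t)
print(m.shape)                      # (49, 2048)
```

The output shape matches the last convolutional activation of a ResNet50-style teacher (P regions, 2048 channels).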

Inversion Knowledge Distillation
The inversion knowledge distillation transfers visual perception from the teacher model T to the student model S, and interacts in depth with the textual semantics in the multimodal feature generator. To synthesize an information-rich multimodal feature, we formulate a novel dual distillation paradigm consisting of inter-modal (IrM-KD) and intra-modal (IaM-KD) knowledge distillation.
IrM-KD: The IrM-KD directs the student model S to extract the vital visual information from the source text, thereby bridging the inter-modal semantics of the text and the real image. Specifically, given the real image I_r, the teacher model T generates a visual representation φ^T_l(I_r) at each layer l. Meanwhile, S produces an inverse hidden representation φ^S_{l+1}(t, θ^s_{l+1}) at the next layer l + 1. The paired representations φ^S_l(t, θ^s_l) and φ^T_l(I_r), which have identical dimensions, entail the same-level latent concepts. We define the IrM-KD loss as the discrepancy between these two representations plus an auxiliary regularization term: L_IrM = Σ_l ||φ^S_l(t, θ^s_l) - φ^T_l(I_r)||_2 + ||I_r - I_s||_2, where the L2 norm || · ||_2 is used to measure the similarity of two vectors. The regularization term ||I_r - I_s||_2 is the image-space loss, the fundamental constraint for S to learn the distribution of the real image.
IaM-KD: The IaM-KD constrains the student model S to learn the visual perception of images via the inverse feature, thereby relieving the intra-modal gap between the inverse feature and the real image. Specifically, we feed the inverse feature I_s into the teacher model T to obtain the teacher's cognition of it, a pseudo visual representation φ^T_l(I_s). Then, to encourage the student model to learn the distribution of images more profoundly, we narrow the divergence between φ^T_l(I_s) and its coupled visual representation φ^T_l(I_r). The IaM-KD loss is thus defined as the combination of this divergence and the image-space loss: L_IaM = Σ_l ||φ^T_l(I_s) - φ^T_l(I_r)||_2 + ||I_r - I_s||_2. Compared with T2I synthesis works (Reed et al., 2016; Zhang et al., 2017; Xu et al., 2018), we are dedicated to aiding text translation through inter-modal and intra-modal bi-visual distillation. By doing so, our generated multimodal feature focuses more on text-image alignment and fusion, not merely on the authenticity of the image.
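The dual distillation can be sketched as below. This is a simplified illustration under our own assumptions: layer features are flattened toy arrays, the per-layer pairing is a plain zip, and the exact layer alignment of the real networks is abstracted away.

```python
import numpy as np

def l2(a, b):
    """L2 distance between two same-shaped feature arrays."""
    return np.sqrt(((a - b) ** 2).sum())

def dual_distillation_loss(teacher_feats_real, student_feats,
                           teacher_feats_inverse, I_r, I_s):
    """Sketch of the dual distillation:
    IrM-KD pairs student features phi^S_l(t) with teacher features
    phi^T_l(I_r); IaM-KD pairs teacher features on the inverse feature
    phi^T_l(I_s) with those on the real image phi^T_l(I_r). Both add
    the image-space regularizer ||I_r - I_s||_2."""
    reg = l2(I_r, I_s)
    irm = sum(l2(s, t) for s, t in zip(student_feats, teacher_feats_real)) + reg
    iam = sum(l2(p, t) for p, t in zip(teacher_feats_inverse, teacher_feats_real)) + reg
    return irm, iam

rng = np.random.default_rng(0)
layer_shapes = [(64,), (32,)]                   # two toy layer shapes
T_real = [rng.normal(size=s) for s in layer_shapes]   # phi^T_l(I_r)
S_feats = [rng.normal(size=s) for s in layer_shapes]  # phi^S_l(t)
T_inv = [rng.normal(size=s) for s in layer_shapes]    # phi^T_l(I_s)
I_r = rng.normal(size=(4, 4, 3))                # real image
I_s = rng.normal(size=(4, 4, 3))                # inverse feature from S
irm, iam = dual_distillation_loss(T_real, S_feats, T_inv, I_r, I_s)
print(irm >= 0 and iam >= 0)                    # True
```

Both losses vanish exactly when the student reproduces the teacher's features and the inverse feature matches the real image, which is the intended training target.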

Objective function
During the training phase, we optimize the proposed IKD-MMT model end-to-end with the text translation loss and the inversion distillation losses: L = L_T + λ_1 L_IrM + λ_2 L_IaM, where λ_1 and λ_2 are balancing hyperparameters. The translation loss over the training dataset D, L_T = -Σ_{(X,Y)∈D} Σ_j log p(y_j | y_{<j}, X), not only bridges the relevance of the source and target texts, but also models the textual semantics of the multimodal feature. In the testing phase, the trained multimodal feature generator is capable of generating rich features to embed into the MMT backbone, thus getting rid of the image-must constraint.
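The overall objective can be made concrete with a toy computation. The weighted combination below is our reading of the objective (the exact weighting scheme is an assumption), and the log-probabilities are random stand-ins for a decoder's output.

```python
import numpy as np

def translation_loss(log_probs, target_ids):
    """Negative log-likelihood of the reference tokens (cross-entropy),
    i.e. the L_T term summed over decoding steps."""
    return -sum(lp[y] for lp, y in zip(log_probs, target_ids))

def total_loss(trans, irm, iam, lam=1.0):
    """Hypothetical combination of the translation loss with the two
    distillation losses, weighted by a balancing hyperparameter lam."""
    return trans + lam * (irm + iam)

# toy per-step log-probabilities over a 5-word vocabulary (3 steps)
logits = np.random.default_rng(0).normal(size=(3, 5))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
nll = translation_loss(log_probs, [1, 0, 4])
loss = total_loss(nll, irm=0.5, iam=0.3)
print(loss > 0)   # True: NLL is positive, distillation terms non-negative
```

At test time only `translation_loss`'s model would run; the distillation terms exist purely to shape the generator during training.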

Setup
Datasets We conduct experiments on the Multi30K benchmark (Elliott et al., 2016). The training and validation sets contain 29,000 and 1,014 instances, respectively. We report results on the Test2016, Test2017, and ambiguous MSCOCO test sets. We directly use the preprocessed sentences 3 and apply BPE (Sennrich et al., 2016) with 10K merge operations to segment words into sub-words, which builds shared vocabularies of 9,712 and 9,544 tokens for the EN-DE and EN-FR translation tasks.
Settings We follow all model settings of Wu et al. (2021), such as the Transformer-Tiny configuration to counter overfitting on small datasets. 4-gram case-insensitive BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014) are used as evaluation metrics. All models are run three times and we report the average results.

First, IKD-MMT significantly surpasses all image-free MMT systems on the five test sets. These improvements demonstrate that a) our model can effectively embed multimodal semantics during training and guide the translation via multimodal features in the image-free testing phase, and b) benefiting from the informative richness and stable generation of multimodal features, our method is a more robust way to break the data constraints.

3 https://github.com/multi30k/dataset

EN-DE Translation Task
Second, the image-must MMT systems generally exceed their image-free counterparts, showing the efficacy of additional images for translation.
Finally, and encouragingly, our image-free MMT model not only outperforms almost all image-must MMT systems, but even rivals the SOTA image-must MMT. We speculate that these noticeable gains stem from IKD-MMT's strong ability to fuse textual semantics with visual perception and to generate text-related visual representations under the dual supervision of text translation and visual distillation.
EN-FR Translation Task We also conduct experiments on the EN-FR task. Our IKD-MMT still outperforms the compared baselines in Table 1. This verifies the robustness and generality of our model across language scenarios.

Ablation Studies
Table 2 illustrates ablation experiments on the EN-DE task exploring the impact of different combinations of distillation modules.
Similarity Function First, we explore the effect of varied similarity functions for measuring the divergence between hidden representations in our distillation module. As shown in row (A), the L2 norm is the best option, followed by KL divergence (Kullback and Leibler, 1951).

Distillation Granularity Second, in row (B), we analyze which distillation granularity is optimal for translation performance. Specifically, "Layer", "Block" and "Model" mean that we employ, respectively, the representations of each layer, of each block, or only of the last convolutional layer and the image in the teacher-student models to compute the distillation loss. Based on the evaluation results, we conclude that "Model" is optimal, and "Block" is consistent with "Layer" in the METEOR score but slightly inferior in the BLEU score. This phenomenon reflects that the initial and terminal representations in our knowledge distillation are sufficient to teach the student model to generate information-rich features. This breaks the stereotype that KD must transmit all knowledge.
CNN Backbone Third, in row (C), we devise three variants with diverse CNN backbones to investigate their impact on translation. ResNet50 wins this round, since the deep residual network derives the strongest visual representation. VGG19 performs worse, given the absence of residual connections and of sufficient training samples for model convergence. Unsurprisingly, the lightweight AlexNet incurs the worst translation degradation. This implies that the feature-extraction capability of a small model may be insufficient to undertake the heavy task of multi-supervised learning.
Distillation Loss Finally, we discuss the translation performance of different distillation loss strategies in row (D). Unsurprisingly, removing both losses (w/o IrM-KD+IaM-KD) suffers the severest performance degradation. Removing visual distillation leads to the absence of visual perception in the multimodal feature, which degenerates into a perturbed feature obtained by passing the global text feature through the FC & Avg Unpool layers. Further, w/o IrM-KD outperforms w/o IaM-KD, indicating that the IaM-KD's ability to establish the text-image relevance critical for multimodal feature synthesis is stronger than the IrM-KD's. We attribute this to the fact that the IaM-KD covers the propagation path of the IrM-KD. Compared with the image-space loss alone, the improvement of our method reveals that the intermediate hidden states of the teacher model play a vital role in teaching the student model to comprehend text-image correlations, as also verified in preceding KD work (Romero et al., 2014; Yim et al., 2017). Overall, each distillation loss considerably improves translation.
Ablation Studies on the Development Set Table 3 reports all validation ablation results, corroborating that each distillation hyperparameter also contributes decent gains to model convergence rather than merely to generalization. Drawing from the tabular results, all hyperparameters can be tuned freely on the dev set. We further notice that the performance on the dev set aligns well with that on the test set in terms of tendency.

Analysis
In this section, we will investigate our IKD-MMT model from multiple perspectives.

Does IKD-MMT really generate multimodal features?
To explore the multimodal features generated by our distillation strategy, we test their informative richness from three aspects.

Image Retrieval The image retrieval task analyzes the relationship between our generated multimodal feature and the visual feature. Specifically, we generate the multimodal feature from each source sentence. We then find the K closest visual features for each multimodal feature based on cosine similarity, and measure the R@K score, i.e. the recall rate of the current sample's own visual feature among these top-K nearest neighbors. The results in Table 4 show that regardless of K or the data set, the R@K scores are extremely low. These retrieval scores confirm that our model is not trying to generate the visual feature of the current image.
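The R@K probe can be sketched directly. The function name and toy random features below are ours; the point is only the mechanics of the metric: cosine similarity, top-K retrieval, and recall of the aligned sample.

```python
import numpy as np

def recall_at_k(multimodal, visual, k):
    """R@K: for each generated multimodal feature, retrieve the K
    visual features with highest cosine similarity and check whether
    the aligned sample's own visual feature is among them."""
    m = multimodal / np.linalg.norm(multimodal, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    sims = m @ v.T                              # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of K nearest visuals
    hits = [i in topk[i] for i in range(len(m))]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
mm = rng.normal(size=(100, 64))    # generated multimodal features
vis = rng.normal(size=(100, 64))   # visual features of the aligned images
print(recall_at_k(mm, vis, k=5))   # unrelated random features give a low R@5
```

A feature space that simply reproduced the aligned image's visual feature would score near 1.0 here; the paper's low observed scores argue the generated feature is genuinely multimodal rather than a visual copy.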
Cluster Visualization In Figure 3, we visualize the related pictures retrieved by the multimodal feature on the cluster map. Here, points of different colors fall into different clusters, and the distance between points is specified by the cosine similarity between multimodal features and visual features. In cluster case (a), the other images share points of parity with the original image, namely objects, backgrounds, and actions (girl, sand, walking). Likewise, in cluster case (b), the other images match the thematic content (person, rock, climbing) of the original image. These related pictures also conform to the original text's description of the scene. The multimodal features are thus confirmed to have learned commonalities between images.
Attention Weights In Figure 4, we visualize the attention weights for the fusion of multimodal and text features. 2) The rest of the multimodal features attend to the words equally. We conjecture that these regions are non-object parts and thus tend to exert a consistent impact on text translation.
3) The attention weights of the first three attention heads are flat, presenting linearity in the bottom part, while those of the 4th attention head fluctuate, presenting dispersion in the bottom part. This means that the first three attention heads capture the entire sentence semantics in a "global attention" form, while the 4th attention head acts like "local attention" and emphasizes understanding the keywords of the sentence. These findings demonstrate that the multimodal features have embedded the textual semantics.
To summarize, the above experiments fully prove that our IKD-MMT reliably generates an information-rich multimodal feature.

Can multimodal features be directly used for translation?
Our IKD-MMT model synthesizes a multimodal feature equipped with textual and image knowledge through the multimodal generator. A natural question to ask is: can the multimodal feature be fed into the encoder alone, rather than being cascaded with the textual feature, for translation? To this end, we compare a model with the text feature removed against the original benchmark in Table 5. We notice that w/o Text Feat. suffers a cliff-like performance drop, which is explainable. In the multimodal encoding layer, the dot product of the query and key vectors marks the importance of each token with respect to the other tokens in the sentence, i.e. the attention score. If we treat the multimodal feature as the query, its fixed P regions can be regarded as a set of pseudo-tokens. Since this token set carries limited semantics and destroys the word alignment, it is difficult to obtain a usable attention score on its own. In addition, most studies suggest that textual semantics are more important than visual perception in the MMT task (Grönroos et al., 2018; Lala et al., 2018).

Can multimodal features recover the missing text?
In Table 6, we adopt two degradation strategies (Ive et al., 2019; Caglayan et al., 2019) for the source sentence and feed the degraded input into the Transformer, the Multimodal baseline, and our IKD-MMT, to probe whether multimodal features can recover the missing text. Test2017 METEOR scores are used for evaluation.
Color Deprivation We mask the source tokens that refer to colors with a special token [U], which covers 3.19% and 3.16% of the words in the training and test sets, respectively. As shown in column D_C, after color deprivation the text-only Transformer fails to align the source and target tokens and suffers the worst performance drop, whereas our IKD-MMT and the image-must Multimodal model can synthesize color information to compensate for the missing colors.
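The color-deprivation degradation is straightforward to sketch. The color list below is a hypothetical stand-in (the paper does not give its exact token list); only the masking mechanism is the point.

```python
# Hypothetical color vocabulary; the paper's exact token list is not given.
COLORS = {"red", "blue", "green", "yellow", "white", "black", "brown", "orange"}

def deprive_colors(sentence, mask="[U]"):
    """Replace every color-referring token with the special token [U]."""
    return " ".join(mask if tok in COLORS else tok for tok in sentence.split())

src = "a small boy in a red shirt plays with a blue wheel"
print(deprive_colors(src))
# a small boy in a [U] shirt plays with a [U] wheel
```

Feeding such masked sentences to each system then tests whether the model's visual (or generated multimodal) signal can restore the deleted color semantics in the output translation.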

Case Study
Figure 5 depicts the 1-best translations of two test cases generated by the various systems. The other systems mistranslate and over-translate text in case (a), and distort the semantics by mistakenly translating "geht" (walking) as "läuft" (running) in case (b). Our IKD-MMT relies on rich multimodal semantics to keep the translation faithful.

Conclusion
In this work, we propose the IKD-MMT framework to address the image-must issue in multimodal machine translation (MMT) via a knowledge distillation paradigm. Under this image-free MMT system, there are three key contributions: 1) an information-rich multimodal feature is generated under the dual constraints of visual distillation and text translation to support the image-free testing stage; 2) the knowledge distillation module is flexible and pioneers the use of pre-trained models to guide translation; 3) both quantitative and qualitative results validate the feasibility of the proposed IKD-MMT, which can be deemed the first framework that rivals or even surpasses most (if not all) image-must frameworks.

A Appendix
Table 7: Architecture of each module in multimodal feature generation. The multimodal student model has an inverted data flow relative to the visual teacher model. These architectures can easily be replaced, with reference to ResNet50 (He et al., 2016), by any CNN variant (e.g. VGG19 (Simonyan and Zisserman, 2014), AlexNet (Krizhevsky et al., 2012)).

Figure 1 :
Figure 1: Examples of image-must MMT (a) and our image-free MMT (b). During testing, our IKD-MMT does not require an image as input.

Figure 2 :
Figure 2: The framework of our IKD-MMT model. The multimodal feature generator, multimodal student network, and visual teacher network are the most critical modules, which help break the image-must dataset constraint.

(a) a little girl is walking barefoot on the sand. (b) young woman climbing rock face.

Figure 3 :
Figure 3: Cluster analysis of the learned multimodal feature, where the two colored boxes mark representative images in the two cluster cases. The arrow points to the original image that belongs to the current multimodal feature (i.e., the cluster center).

Figure 4 :
Figure 4: Visualization of attention weights for the fusion of multimodal features and text features. The weight values decrease as the color becomes lighter.
Source text: … blue twirls . a little girl is walking barefoot on the sand . Target text: eine ballerina in blau wirbelt herum . ein kleines mädchen geht barfuß im sand .

Figure 5 :
Figure 5: Translation cases of different models. Red and blue highlight erroneous and correct translations, respectively.

Table 1 :
BLEU ("B") and METEOR ("M") scores on the EN-DE and EN-FR tasks. Encouragingly, our IKD-MMT, an image-free MMT model, outperforms almost all MMT systems and even rivals the SOTA image-must systems. ‡/† mark statistically significant variations in BLEU (p-value < 0.01/0.05) compared to the Transformer.
Table 1 reports the performance of all MMT baselines on the EN-DE task. Comparing all systems, we draw the following conclusions:

Table 2 :
Ablation results for diverse distillation variants on the EN-DE task. The base row denotes the IKD-MMT in Table 1, and "-" means retaining the setting of the base row. Avg.B and Avg.M indicate the average BLEU and METEOR scores over the three test sets.

Table 3 :
Validation ablation results for diverse distillation variants on the EN-DE task. The base row denotes the IKD-MMT on the Multi30K development set, and "-" means retaining the setting of the base row. Dev.B and Dev.M indicate the BLEU and METEOR scores on the development set.

Table 4 :
Image retrieval tasks on the Multi30K dataset.