Towards Zero-shot Learning for End-to-end Cross-modal Translation Models



Introduction
Speech translation (ST) requires knowledge transfer across different modalities, and models more often than not perform worse on cross-modal tasks. In real-world applications, the ST model is usually a cascade approach that first uses an automatic speech recognition (ASR) system to transcribe the speech into text and then applies a text machine translation (MT) model. Recent end-to-end (e2e) ST models remove the need for an explicit ASR step, with several practical advantages over cascade models such as reduced latency, reduced error propagation, and a shorter pipeline.
However, e2e ST models are less competitive than cascade models in practice (Zhang et al., 2019; Sperber and Paulik, 2020; Dinh, 2021) because end-to-end data are an order of magnitude scarcer than those for ASR or MT, especially for low-resource language pairs. Several solutions have been proposed to combat this data problem. In (Liu et al., 2020; Xu et al., 2021), an adapter with additional parameters is used during fine-tuning to combine the two pre-trained models of different modalities. The new module, however, only learns from ST data, which is of greatly reduced quantity. Alignment in building cross-modal representations is also a popular topic. Zhang et al. (2023) simply concatenate the representations of different modalities and let the self-attention learn the cross-modal features. Some solutions deal with this problem by mapping features into fixed-size representations (Reimers and Gurevych, 2019; Feng et al., 2020; Han et al., 2021; Duquenne et al., 2022). The squared error is generally used as the optimization objective (Pham et al., 2019; Dinh et al., 2022). These methods may suffer from information loss when representations are compressed or constrained beforehand.
To overcome both the data and the length-mismatch problems, we propose a pre-trainable adapter that connects two pre-trained modules. Specifically, we adopt a popular cross-modal ST architecture that generalizes many existing works. For the alignment adapter, we employ as loss the Word Rotator's Distance (WRD) (Yokoi et al., 2020; Monge, 1781; Kantorovich, 1960; Peyré et al., 2019), allowing the adapter to promote cross-modal representations that match in the space of the semantic encoder. Unlike previous works, this strategy allows us to pre-train the adapter. Meanwhile, instead of mapping to a fixed length, the CTC enables us to adjust the length of the source-modality representation dynamically. This step guarantees that the cross-modal representations become features of similar, but not exactly equal, lengths, which our proposed WRD objective with an optimal transport (OT) solver can then align properly.
Besides speech translation, our approach can be naturally adapted to image translation. Unlike image-assisted translation (Su et al., 2019; Yawei and Fan, 2021), we attempt to translate the text within the image. Our goal is likewise to align the cross-modal representations between images and texts (Li et al., 2022). Related discussion and experiments are given in Appendix A.4.
The contributions of this paper are as follows: (1) We adopt the WRD loss together with the shrink mechanism to measure the distance between two feature sequences of different lengths, enabling adapter pre-training.
(2) The pre-trained adapter gives the end-to-end model a zero-shot ST ability similar to cascade models.
(3) Experiments on MuST-C demonstrate that our end-to-end zero-shot model can match or slightly outperform the CTC-based cascade model (without intermediate post-processing). The results of our end-to-end training also match recent supervised ST baselines.

Main Method
We adopt a popular framework, shown in Figure 1(a), consisting of a cross-modal encoder with a shrink adapter and a semantic encoder/decoder pack.

Semantic encoder-decoder training
A machine translation model is first pre-trained as illustrated in Figure 1(c). Given a machine translation corpus $D_{mt} = \{(x_t, y_t)\}$, our aim is to obtain a semantic encoder $\mathrm{Enc}_t(E_t x_t) = h_t$ and a semantic decoder $\mathrm{Dec}_t(h_t) = P(y_t \mid h_t)$, where $E_t$ is the source embedding matrix. The objective of this task is the cross-entropy loss $\mathcal{L}_{mt}$.
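As a concrete illustration, the following is a minimal sketch of this pre-training objective, assuming a standard encoder-decoder transformer trained with teacher forcing; the function and attribute names are illustrative, not the authors' actual code.

```python
import torch.nn.functional as F

def mt_loss(model, x_t, y_t, pad_idx=1):
    """Cross-entropy objective L_mt for the semantic encoder/decoder."""
    h_t = model.encoder(x_t)                  # Enc_t(E_t x_t) = h_t
    logits = model.decoder(y_t[:, :-1], h_t)  # teacher forcing on shifted targets
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*(L-1), V)
        y_t[:, 1:].reshape(-1),               # next-token targets
        ignore_index=pad_idx,                 # skip padding positions
    )
```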

Zero-shot Translation Training
In this phase, we obtain a zero-shot translation model by training the cross-modal encoder alone, as shown in Figure 1(d). Following common practice, we apply the recognition task with ASR data $D_{cm} = \{(z_s, x_s)\}$, adopt a classical ASR architecture with a CTC classifier, and optimize the CTC loss $\mathcal{L}_{ctc}$.
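For reference, a minimal sketch of this recognition objective with PyTorch's built-in CTC loss might look as follows; the encoder call, the projection matrix $W_c$, and the blank index are assumptions for illustration.

```python
import torch

ctc_criterion = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_objective(encoder, W_c, z_s, x_s, input_lens, target_lens):
    """z_s: speech features; x_s: transcription token ids."""
    h_s = encoder(z_s)                           # (T, B, D) acoustic states
    log_probs = (h_s @ W_c).log_softmax(dim=-1)  # (T, B, V) CTC distribution
    return ctc_criterion(log_probs, x_s, input_lens, target_lens)
```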
Besides the regular recognition task, we use the Word Rotator's Distance (WRD) (Yokoi et al., 2020) to supervise the encoder to generate encoding results with less discrepancy across modalities. We aim to align the different modalities in the space of the semantic encoder, allowing a seamless transition between the cross-modal encoder and the semantic decoder. To be precise, denote the acoustic encoder output as $h_s$ and the CTC distribution as $d_c = \mathrm{softmax}(W_c h_s)$; a lightweight adapter shrinks and integrates them. The shrink mechanism (Yi et al., 2019; Tian et al., 2020; Gaido et al., 2021) is widely employed to remove the representations of blank and repeated tokens. We therefore use the CTC path, obtained via an efficient arg max, as guidance to remove the blank tokens and average the representations of consecutively duplicated tokens, as shown in Figure 1(a).
Denoting the shrunk hidden state and CTC distribution as $\bar{h}_s$ and $\bar{d}_c$, the adapter output $h_a$ is computed from $\bar{h}_s$ and $\bar{d}_c$ with trainable parameters $W_a$; more details can be found in the implementation. Since ASR performance greatly affects the quality of CTC paths (Fan et al., 2020), our shrink method differs from previous approaches in that the adapter merges the representations from both before and after the CTC module to reduce error propagation. $h_a$ can be regarded as the final audio representation, ready to be fed into the semantic encoder. To alleviate the cross-modal mismatch, we optimize the WRD loss.
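A minimal sketch of the CTC-guided shrink described above might look as follows, assuming blank index 0 and greedy (arg max) CTC decoding; the function name and tensor layout are illustrative.

```python
import torch

def ctc_shrink(h_s, d_c, blank=0):
    """Collapse frame-level states along the greedy CTC path.

    h_s: (T, D) acoustic states; d_c: (T, V) CTC distribution.
    Removes blank frames and averages consecutive duplicate tokens.
    """
    path = d_c.argmax(dim=-1).tolist()  # greedy CTC path of length T
    h_out, d_out = [], []
    buf_h, buf_d, prev = [], [], blank
    for t, tok in enumerate(path):
        if tok != prev and buf_h:       # token boundary: flush the buffer
            h_out.append(torch.stack(buf_h).mean(0))
            d_out.append(torch.stack(buf_d).mean(0))
            buf_h, buf_d = [], []
        if tok != blank:                # drop blank frames
            buf_h.append(h_s[t])
            buf_d.append(d_c[t])
        prev = tok
    if buf_h:                           # flush the last segment
        h_out.append(torch.stack(buf_h).mean(0))
        d_out.append(torch.stack(buf_d).mean(0))
    return torch.stack(h_out), torch.stack(d_out)
```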
The WRD loss is defined as
$$\mathcal{L}_{wrd} = \langle T^*, C \rangle = \sum_{i=1}^{m} \sum_{j=1}^{n} T^*_{ij} C_{ij}, \qquad C_{ij} = 1 - \cos(h_{a,i}, h_{t,j}),$$
where $\langle \cdot, \cdot \rangle$ denotes the dot-product and $\cos(\cdot, \cdot)$ is the cosine similarity. $T^*$ is the optimal transport (OT) plan from the following problem:
$$T^* = \operatorname*{arg\,min}_{T \in \Pi(\mu, \nu)} \langle T, C \rangle,$$
where the marginals $\mu \in \mathbb{R}^m$ and $\nu \in \mathbb{R}^n$ are given by the normalized vector norms of the two sequences (Yokoi et al., 2020).
WRD captures the semantic similarity between two sequences better than the Euclidean distance. The common solution is to implement the Inexact Proximal point method for Optimal Transport (IPOT) algorithm (Xie et al., 2020), as shown in Appendix A.1. Because of CTC prediction errors, it is possible that $m \neq n$, but the loss $\mathcal{L}_{wrd}$ can circumvent this length discrepancy during alignment. The final loss of the cross-modal training is
$$\mathcal{L}_{asr} = \lambda_{ctc} \mathcal{L}_{ctc} + \lambda_{wrd} \mathcal{L}_{wrd},$$
where $\lambda_{ctc}$ and $\lambda_{wrd}$ are hyper-parameters. To keep the semantic encoding intact, the semantic encoder, including the embedding matrix, is frozen, so a zero-shot translation system arises naturally from the ASR training.
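To make the computation concrete, here is a minimal sketch of the WRD loss with an unrolled IPOT solver, assuming the masses are the normalized vector norms as in Yokoi et al. (2020); the step size beta and iteration count are illustrative, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def wrd_loss(h_a, h_t, beta=0.5, n_iter=50, eps=1e-8):
    """h_a: (m, D) shrunk audio states; h_t: (n, D) text states."""
    # Mass of each position is its normalized vector norm.
    mu = h_a.norm(dim=-1); mu = mu / (mu.sum() + eps)        # (m,)
    nu = h_t.norm(dim=-1); nu = nu / (nu.sum() + eps)        # (n,)
    # Cost matrix: one minus cosine similarity.
    C = 1.0 - F.cosine_similarity(h_a.unsqueeze(1),
                                  h_t.unsqueeze(0), dim=-1)  # (m, n)
    # IPOT: proximal point iterations, unrolled so autograd can
    # back-propagate through them like an unrolled RNN.
    m, n = C.shape
    T = torch.ones_like(C) / (m * n)
    b = torch.ones(n, device=C.device) / n
    G = torch.exp(-C / beta)
    for _ in range(n_iter):
        Q = G * T
        a = mu / (Q @ b + eps)
        b = nu / (Q.t() @ a + eps)
        T = a.unsqueeze(1) * Q * b.unsqueeze(0)
    return (T * C).sum()                                     # <T*, C>
```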

End-to-End Translation Training
Once we have a triplet supervision dataset $D_{tri} = \{(z, x, y)\}$ of speech, transcription, and translation, we can proceed to the fine-tuning phase shown in Figure 1(e). Since the zero-shot training loss $\mathcal{L}_{asr}$ is still valid in this phase, we integrate it into the final end-to-end ST training objective $\mathcal{L} = \mathcal{L}_{st}(z, y) + \mathcal{L}_{asr}$.
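Schematically, the final objective could be assembled as below; the helper functions are hypothetical stand-ins for the losses sketched earlier, and the weights shown are those reported for the zero-shot stage.

```python
def e2e_st_loss(model, z, x, y, lambda_ctc=1.0, lambda_wrd=10.0):
    # The three terms correspond to L_st, L_ctc and L_wrd above; the
    # helper functions are hypothetical stand-ins for those losses.
    l_st = st_cross_entropy(model, z, y)  # speech -> translation
    l_ctc = ctc_term(model, z, x)         # speech -> transcription (CTC)
    l_wrd = wrd_term(model, z, x)         # cross-modal WRD alignment
    return l_st + lambda_ctc * l_ctc + lambda_wrd * l_wrd
```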
Settings

ASR The 960h LibriSpeech English ASR dataset (Panayotov et al., 2015) is mainly used for pre-training the ASR in the zero-shot training stage.
MT For En-De and En-Fr, we collect the WMT 2014 data with about 4.5M and 36M parallel sentences respectively, as in Vaswani et al. (2017). For En-Es, we collect the WMT 2013 data of size 28M.

ST We conduct our experiments on the three popular language pairs in MuST-C (Cattoni et al., 2021).
Model Details The audio inputs are preprocessed as 80-channel log Mel filterbank coefficients following fairseq (https://github.com/facebookresearch/fairseq/tree/main/examples/speech_to_text). The cross-modal encoder contains two 1D convolutional subsampler layers (Synnaeve et al., 2019) and 12 transformer encoder layers with hidden dimension 512. The MT model is a standard transformer-based architecture (Vaswani et al., 2017). The individual vocabulary of 10K sub-word units is learned by SentencePiece (Kudo and Richardson, 2018). All hyper-parameters such as λ are tuned on En-De V2 and applied directly to the other datasets. Additional details are given in Appendix A.2.
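For illustration, the filterbank features could be extracted with torchaudio's Kaldi-compliant frontend, which the fairseq speech-to-text recipes build on; the per-utterance normalization is an assumption of typical practice, not a detail stated in the paper.

```python
import torchaudio

def extract_fbank(wav_path):
    waveform, sample_rate = torchaudio.load(wav_path)
    feats = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, sample_frequency=sample_rate)
    # Per-utterance mean/variance normalization, as in common recipes.
    return (feats - feats.mean(dim=0)) / (feats.std(dim=0) + 1e-8)
```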

Main Results
Zero-shot ST Recent works (Dinh, 2021; Escolano et al., 2021) indicate that when large amounts of ASR and MT data dominate the training, cascaded ST is better than direct end-to-end ST. In our proposed second phase, the required ASR training readily yields a zero-shot ST model. The MT model is pre-trained with the WMT data alone, preventing the model from accessing the in-domain data of MuST-C. For the ASR training, we combine the LibriSpeech data and the speech-transcription pairs in MuST-C to obtain an amount of ASR data comparable to the practical cascade system. In particular, we set $\lambda_{ctc} = 1$ and $\lambda_{wrd} = 10$.
In Table 1, we list the BLEU scores of the pre-trained MT models from the first training stage, in both zero-shot and pseudo zero-shot settings. Our main zero-shot ST results are shown in Table 2. We compare our model with the pioneering zero-shot ST method MultiSLT (Escolano et al., 2021), which also achieves zero-shot translation via ASR training with an adapter. We further compare with another cross-modal alignment method, Chimera (Han et al., 2021), which was initially designed for supervised ST training but is also suitable for zero-shot ST. Clearly, our system achieves the smallest gap between the cascade and the end-to-end setups in the zero-shot scenario, and our end-to-end zero-shot ST on average performs +0.79 BLEU higher than the cascade system.
Following Tight Integrated (Dalmia et al., 2021; Bahar et al., 2021), we also conduct a pseudo zero-shot ST experiment. In this case, even though no training phase directly consumes speech-translation pairs, the overlapping MuST-C transcription data can be seen by both the ASR and MT models. The gap between cascade and end-to-end remains essentially unchanged (+0.79 → +0.78) for our model, indicating the stability of our approach in bridging the modality gap.

Supervised ST In this experiment, we evaluate the performance of our third training phase. We compare our approach only in the unconstrained scenario with recent end-to-end ST methods that use similar datasets; the results are summarized in Table 3.

Analysis
For the zero-shot ST setting, the average gap is 0.79 BLEU, and the per-dataset breakdown is visualized in the middle panel of Figure 2; the cascade system only has more sentences falling in the BLEU interval [0, 10]. We also plot the relation between BLEU and WRD for each sentence in the tst-COMMON set of En-De v2 (left panel of Figure 2). The overall trend indicates that the BLEU score decreases as the word rotator's distance increases.
Achieving zero-shot speech translation requires both losses in the ASR training. The ablation study in the right panel of Figure 2 therefore explores the effect of each loss on the final end-to-end supervised training. All models are fine-tuned from the zero-shot ST model with BLEU 24.00 in Table 2. The CTC loss cannot simply be removed, since the WRD depends on a reasonable CTC path; we therefore optimize the supervised loss without the CTC loss by freezing the acoustic encoder and the CTC layer.
In Table 4, we present another ablation study on which model parameters are trainable during the supervised training phase. The result becomes much worse if the semantic encoder/decoder is frozen. We hypothesize that the main reason is that freezing the NMT teacher prevents the in-domain MT data from being used, so the NMT decoder has difficulty adapting to the supervised ST data, i.e., the decoder is not a good language model for this domain.

Conclusion
In this paper, we present a zero-shot architecture that better exploits the advantages of cascade models, bridging the gap between cascade and end-to-end translation models. By leveraging a differentiable shrink adapter and the WRD loss, our approach yields a direct end-to-end ST model that, in the zero-shot setup, matches the cascade system without additional post-processing, e.g., rescoring via an additional language model. Our method also achieves results comparable to recent supervised ST models.

Limitations
The accuracy of IPOT depends on the number of loops. Since we unroll the IPOT iterations similarly to RNN training and apply automatic differentiation to the IPOT algorithm in the back-propagation stage, the iterative process consumes more computing resources than directly calculating the Jacobian matrix of the OT plan. Besides, although our model works on different translation tasks such as image translation and speech translation, the hyper-parameters, especially the weights of the WRD loss and the CTC loss, vary across tasks. The CTC loss and WRD loss are sometimes in conflict, which also requires us to set a proper pair of weights via extensive hyper-parameter search.

A.1 IPOT Algorithm
We rely on automatic differentiation packages (e.g., PyTorch) to back-propagate the gradients like an unrolled RNN. The corresponding implementation can be found in the submitted software.

A.2 Additional Training Details
As shown in the model design in Figure 1, the embedding layer in the adapter shares its weights with the semantic source embedding. The beam size is 5 during inference, and we use SacreBLEU in fairseq as the evaluation metric.
For the third phase, supervised ST training, we have multiple tasks in the final objective. For the ST task $\mathcal{L}_{st}$, some previous works leverage the MT model and the LibriSpeech transcriptions to construct pseudo translation sentences; however, we only use the audio-translation pairs from MuST-C. For the ASR task $\mathcal{L}_{ctc}$, we only use the audio and transcriptions from MuST-C. For the MT task $\mathcal{L}_{mt}$, we optimize on both the MuST-C parallel corpus and the WMT data, making the decoder a better language model. En-De WMT has only 4.5M sentence pairs, so the entire training remains manageable. However, for En-Fr/Es, optimizing the large end-to-end ST model with a huge number of trainable parameters becomes cumbersome because the size of the WMT data overwhelmingly slows down the training. Therefore, we randomly sample a 10M-sentence corpus from the original WMT En-Fr/Es data to train the final supervised loss.

A.3 Additional Experimental Results of ST
In Figure 3, we plot the WER scores of the second training stage, in both zero-shot and pseudo zero-shot settings. The ASR of the cascade system (i.e., trained with the CTC loss only and without the semantic encoder) has a clearly higher WER than our proposed ASR training with the additional WRD loss. However, the in-domain MuST-C data do not appear to make a significant difference, as indicated by the orange and green bars in Figure 3.

A.4 Generalization and Visualization
We also conduct an experiment on zero-shot image translation, using only OCR data and NMT data, to further test the effectiveness of our framework; image data are also convenient to visualize. The NMT model (i.e., the semantic encoder and decoder) is pre-trained on the WMT 2018 Zh-En data (20M parallel sentences) in the news domain. We crop 2M text-line images from Chinese OCR data. The test set has 2,000 images with Chinese transcriptions and English translations. The BLEU on the test set for the pre-trained NMT is 15.87, which is not high due to the domain shift.
In particular, we set different weights $\lambda_{wrd} \in \{0, 1, 5, 10, 20, 50\}$ to investigate the effectiveness of the WRD loss, where the model with $\lambda_{wrd} = 0$ reduces to a cascade model. The results of the zero-shot image translation are shown in Figure 4, which illustrates the intuition of how to tune the importance of the WRD loss. In Figure 5, we visualize the transport plan $T^*$ and the cost matrix $C$ for some examples. For the cost matrix, the smaller elements are mainly distributed in the diagonal block regions, possibly because incorrect shrinking segments sometimes align with more than one character. For the transport plan matrix, the larger elements are mainly distributed on the diagonal. In this way, their products remain small.
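A visualization like Figure 5 could be produced along the following lines, assuming T_star and C are the tensors obtained during the WRD computation; the plotting choices are illustrative.

```python
import matplotlib.pyplot as plt

def plot_transport(T_star, C):
    """Heatmaps of the cost matrix C and the OT plan T*."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
    im1 = ax1.imshow(C.detach().cpu(), aspect="auto")
    ax1.set_title("Cost matrix $C$")
    im2 = ax2.imshow(T_star.detach().cpu(), aspect="auto")
    ax2.set_title("Transport plan $T^*$")
    fig.colorbar(im1, ax=ax1); fig.colorbar(im2, ax=ax2)
    plt.tight_layout(); plt.show()
```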

Figure 1: Overview of our proposed framework. (a) The overall model architecture. (b) Data usage in the zero-shot setting and the possible supervised setting. (c) NMT pre-training. (d) Only the cross-modal encoder is trained with ASR data, while the semantic encoder is frozen. (e) Fine-tuning when end-to-end data is available.

Figure 4: The performance of OCR and zero-shot image translation over different weights $\lambda_{wrd}$.
Figure 5: Visualization of the transport plan $T^*$ and the cost matrix $C$.

Table 1: Performance of MT on the MuST-C test set (BLEU↑). The input is the ground truth of the source transcription.
† Tight Integrated extends our ASR data to 2300 hours, and it uses 27M En-De and 48M En-Es MT data.

Table 3: Supervised ST on MuST-C (BLEU↑ with beam=5) with additional LibriSpeech and WMT data. † MTL uses hidden dimension 256. # JT-S-MT only uses WMT data. * FT from our models marked with * in Table 2.

Table 4: Ablation study on the model parameters for the supervised training phase. All models are fine-tuned from the zero-shot ST model with BLEU 24.00 in Table 2. † The acoustic encoder includes the CTC layer.