CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation

End-to-end speech translation (ST) is the task of translating speech signals in the source language into text in the target language. As a cross-modal task, end-to-end ST is difficult to train with limited data. Existing methods often try to transfer knowledge from machine translation (MT), but their performances are restricted by the modality gap between speech and text. In this paper, we propose Cross-modal Mixup via Optimal Transport (CMOT) to overcome the modality gap. We find the alignment between speech and text sequences via optimal transport and then mix up the sequences from different modalities at a token level using the alignment. Experiments on the MuST-C ST benchmark demonstrate that CMOT achieves an average BLEU of 30.0 in 8 translation directions, outperforming previous methods. Further analysis shows CMOT can adaptively find the alignment between modalities, which helps alleviate the modality gap between speech and text.


Introduction
Speech translation (ST) is the task of translating speech signals in one language into text in another. Conventional ST systems work in a cascaded mode (Stentiford and Steer, 1988; Waibel et al., 1991), combining automatic speech recognition (ASR) and machine translation (MT), which may suffer from error propagation and high latency. To solve these problems, end-to-end ST has been proposed (Bérard et al., 2016; Duong et al., 2016). End-to-end ST generates the target text directly from the source audio without an intermediate transcript, and has drawn increasing attention in recent years (Vila et al., 2018; Salesky et al., 2019; Di Gangi et al., 2019b,c; Inaguma et al., 2020; Wang et al., 2020a; Zhao et al., 2021a; Dinh et al., 2022; Duquenne et al., 2022a).
End-to-end ST has to perform cross-lingual and cross-modal transformation simultaneously, which brings more challenges than ASR or MT. Unfortunately, parallel speech-to-text pairs are harder to collect than text-to-text pairs, so ST suffers from under-fitting. To address the data shortage, some researchers introduce text-to-text parallel data from MT into ST, including pretraining (Alinejad and Sarkar, 2020; Zheng et al., 2021; Xu et al., 2021; Zhang et al., 2022b), multi-task learning (Le et al., 2020; Vydana et al., 2021; Tang et al., 2021), knowledge distillation (Liu et al., 2019; Gaido et al., 2020; Inaguma et al., 2021), etc. In this way, the knowledge of MT can be transferred to ST, helping the model learn a better cross-lingual transformation.
Although these transfer learning methods have improved translation quality, ST jointly learns cross-lingual and cross-modal transformations, and the effectiveness of transfer learning rests on the assumption that the two modalities are projected into a common representation space. The mixup strategy has therefore been employed to further improve ST by reducing the gap between the speech and text representation spaces. Previous methods with the mixup strategy usually perform word-level mixup between aligned tokens in different modalities (Fang et al., 2022), and hence need the help of external alignment tools. However, such alignment tools are not always available and require large amounts of additional annotated data for training, so obtaining these alignments remains a problem for most languages.
Inspired by recent applications of optimal transport theory to finding alignments between languages or modalities (Chen et al., 2020; Gu and Feng, 2022), we propose Cross-modal Mixup via Optimal Transport (CMOT). We use optimal transport to adaptively find the alignment between speech and text, and use this alignment to achieve token-level mixup. The mixup sequence serves as a medium between the speech sequence and the text sequence to realize cross-modal knowledge transfer. Experiments on the MuST-C dataset show that CMOT achieves an average BLEU of 30.0. In addition, we show that optimal transport can help find cross-modal alignments and that CMOT helps alleviate the modality gap in ST.

Method
In this section, we will describe our proposed Cross-modal Mixup via Optimal Transport (CMOT). We find the alignment between speech and text sequences via OT, and mix up the two unimodal sequences to get a cross-modal mixed sequence. We predict the translation with all these sequences and regularize their outputs. Figure 1 illustrates the overview of our proposed method.

Problem Formulation
Generally, a speech translation corpus contains triplets of source speech s, its transcript x and translation y, which can be denoted as D = {(s, x, y)}. Given the corpus, an end-to-end ST system directly converts speech signals s into text translation y without generating the intermediate transcript x.
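For readers who prefer code, one training example in the corpus D = {(s, x, y)} can be pictured as the following minimal sketch; the class and field names here are purely illustrative.

```python
# A minimal sketch of one training example in the corpus D = {(s, x, y)}.
# Field names are illustrative, not part of the released code.
from dataclasses import dataclass
import torch

@dataclass
class STExample:
    speech: torch.Tensor   # s: raw 16 kHz waveform
    transcript: str        # x: source-language transcript (available only for training)
    translation: str       # y: target-language text
```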

Model Architecture
Our model consists of four modules: the speech encoder, the text embedding layer, the translation encoder and the translation decoder. We first obtain parallel speech and text sequences with the speech encoder and the text embedding layer, and feed them into the translation encoder separately. Then we mix up the two encoder outputs with CMOT, and feed both the unimodal sequences and the cross-modal mixed sequence into the decoder to predict the translation.

Speech Encoder The speech encoder extracts low-level features from the raw speech input. We use a HuBERT (Hsu et al., 2021) model stacked with a sub-sampler as the speech encoder. HuBERT is a self-supervised pretrained speech model and has been shown to improve the performance of ST systems in recent work (Zhang et al., 2022a). It consists of a convolutional feature extractor and a BERT (Devlin et al., 2019) style encoder. The sub-sampler consists of two convolution layers and is designed to reduce the length of the speech sequence.

Text Embedding The text embedding layer embeds tokenized text into a sequence of embeddings.

Translation Encoder The joint translation encoder receives either speech or text embeddings and learns further semantic information. It is composed of $N_e$ transformer (Vaswani et al., 2017) encoder layers.

Translation Decoder The joint translation decoder receives either the unimodal sequences of speech and text or the cross-modal mixed sequence and predicts the translation y. It is composed of $N_d$ transformer decoder layers.
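Putting the pieces together, a minimal PyTorch-style sketch of these four modules might look as follows. Module names, dimensions, and the HuBERT wrapper are illustrative assumptions on our part; the actual system is implemented on top of fairseq.

```python
# A hedged sketch of the model architecture: speech encoder + sub-sampler,
# text embedding, and a joint transformer encoder-decoder.
import torch
import torch.nn as nn

class Subsampler(nn.Module):
    """Two strided 1-D convolutions that shrink the speech sequence length."""
    def __init__(self, in_dim=768, hid_dim=1024, out_dim=512):
        super().__init__()
        self.conv1 = nn.Conv1d(in_dim, hid_dim, kernel_size=5, stride=2, padding=2)
        self.conv2 = nn.Conv1d(hid_dim, out_dim, kernel_size=5, stride=2, padding=2)

    def forward(self, x):                      # x: (batch, frames, in_dim)
        x = x.transpose(1, 2)                  # -> (batch, in_dim, frames)
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        return x.transpose(1, 2)               # -> (batch, frames / 4, out_dim)

class CMOTModel(nn.Module):
    def __init__(self, speech_encoder, vocab_size, d_model=512, n_enc=6, n_dec=6):
        super().__init__()
        self.speech_encoder = speech_encoder   # assumed: a pretrained HuBERT returning frame features
        self.subsampler = Subsampler(out_dim=d_model)
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        self.translation = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=n_enc,
            num_decoder_layers=n_dec, dim_feedforward=2048, batch_first=True)

    def encode_speech(self, audio):
        return self.subsampler(self.speech_encoder(audio))   # H^s

    def encode_text(self, tokens):
        return self.text_embedding(tokens)                    # H^x

    def forward(self, enc_seq, prev_output_tokens):
        # enc_seq: H^s, H^x, or the mixed sequence H'; prev_output_tokens: (batch, tgt_len)
        tgt = self.text_embedding(prev_output_tokens)
        return self.translation(enc_seq, tgt)
```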

Cross-modal Mixup via Optimal Transport (CMOT)
We mix up the speech sequence and the text sequence with CMOT. For a speech sequence $H^s$ and the corresponding text sequence $H^x$, we first obtain the alignment $A$ between them via optimal transport (OT), and then mix up $H^s$ and $H^x$ with the help of $A$ to get a mixed sequence $H'$.

Optimal Transport The optimal transport (OT) problem is a classical mathematical problem and is nowadays used as a geometric tool to compare probability distributions (Peyré et al., 2019). It seeks the scheme that transfers one distribution to another with minimum cost under a given transfer cost. Consider the following discrete transport problem: we are given two independent distributions $P = \{(w_i, m_i)\}_{i=1}^{n}$ and $Q = \{(\hat{w}_j, \hat{m}_j)\}_{j=1}^{\hat{n}}$, where each point $w_i \in \mathbb{R}^d$ carries a weight $m_i \in [0, \infty)$. Given a transfer cost function $c(w_i, \hat{w}_j)$ and denoting the mass transported from $w_i$ to $\hat{w}_j$ by $T_{ij}$, the transport cost is defined as:
$$D(P, Q) = \min_{T \ge 0} \sum_{i=1}^{n} \sum_{j=1}^{\hat{n}} T_{ij} \, c(w_i, \hat{w}_j), \quad (1)$$
$$\text{s.t.} \quad \sum_{j=1}^{\hat{n}} T_{ij} = m_i, \ \forall i \in \{1, \dots, n\}; \qquad \sum_{i=1}^{n} T_{ij} = \hat{m}_j, \ \forall j \in \{1, \dots, \hat{n}\}. \quad (2)$$

Relaxed OT with Window Strategy If we regard the speech sequence and the text sequence as two independent distributions, we can use OT to measure the distance between them. For a speech sequence $H^s = (h^s_1, \dots, h^s_n)$, a text sequence $H^x = (h^x_1, \dots, h^x_{\hat{n}})$, and a cost function $c$, we define the transport problem as:
$$D(H^s, H^x) = \min_{T \ge 0} \sum_{i=1}^{n} \sum_{j=1}^{\hat{n}} T_{ij} \, c(h^s_i, h^x_j), \quad \text{s.t.} \ \sum_{j=1}^{\hat{n}} T_{ij} = m^s_i, \ \forall i; \quad \sum_{i=1}^{n} T_{ij} = m^x_j, \ \forall j. \quad (3)$$
Here we use the Euclidean distance as the cost function $c$, and the token norms as the masses, i.e., $m^s_i = \|h^s_i\|$ and $m^x_j = \|h^x_j\|$, inspired by the discovery of Schakel and Wilson (2015) and Yokoi et al. (2020) that important tokens have larger norms. Common solvers for the OT problem, such as the Sinkhorn algorithm (Cuturi, 2013) and IPOT (Xie et al., 2020), bring considerable time complexity. Kusner et al. (2015) proposed a relaxed word mover's distance
that removes one of the two constraints and yields a lower bound of the exact solution. Following this work, we remove the second constraint, and our transport problem becomes:
$$D^*(H^s, H^x) = \min_{T \ge 0} \sum_{i=1}^{n} \sum_{j=1}^{\hat{n}} T_{ij} \, c(h^s_i, h^x_j), \quad \text{s.t.} \ \sum_{j=1}^{\hat{n}} T_{ij} = m^s_i, \ \forall i \in \{1, \dots, n\}. \quad (4)$$
This relaxed problem yields a lower bound of the original problem. We find that the relaxed OT improves the training speed without degrading performance, as shown in Section 4.1. The optimal solution is now for each speech token $h^s_i$ to move all of its mass to the closest text token, so the transport matrix becomes:
$$T^*_{ij} = \begin{cases} m^s_i, & \text{if } j = \arg\min_{k} c(h^s_i, h^x_k) \\ 0, & \text{otherwise.} \end{cases} \quad (5)$$
This transport matrix $T^*$ implies the alignment between $H^s$ and $H^x$. We define the cross-modal alignment as $A = (a_1, \dots, a_n)$, where
$$a_i = \arg\min_{j} c(h^s_i, h^x_j). \quad (6)$$
Considering the monotonicity and locality of the alignment from the speech sequence to the text sequence, we apply a window strategy to this alignment: we constrain the index of the aligned target to a neighborhood of the diagonal, with a constant window size $W$ on both the left and the right:
$$a_i = \arg\min_{j:\, |j - \lambda i| \le W} c(h^s_i, h^x_j), \quad \text{where } \lambda = \hat{n} / n. \quad (7)$$
Here $\lambda$ rescales the speech position $i$ to the length of the text sequence.
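To make Equations (4)-(7) concrete, the following is a small PyTorch sketch of the relaxed OT alignment with the window strategy. Function and variable names are ours, not the released implementation.

```python
# A hedged sketch of the relaxed OT alignment with the window strategy (Eqs. (4)-(7)).
import torch

def ot_alignment(H_s, H_x, W=10):
    """H_s: (n, d) speech states, H_x: (n_hat, d) text states.
    Returns the alignment a (length n) and the relaxed transport cost D*."""
    n, n_hat = H_s.size(0), H_x.size(0)
    cost = torch.cdist(H_s, H_x, p=2)               # Euclidean cost c(h^s_i, h^x_j)

    # Window strategy: only allow targets near the length-normalized diagonal.
    lam = n_hat / n
    i_idx = torch.arange(n).unsqueeze(1)             # (n, 1)
    j_idx = torch.arange(n_hat).unsqueeze(0)         # (1, n_hat)
    outside = (j_idx - lam * i_idx).abs() > W
    cost = cost.masked_fill(outside, float("inf"))

    # Relaxed OT: each speech token sends all of its mass m^s_i = ||h^s_i||
    # to its single closest admissible text token.
    a = cost.argmin(dim=1)                           # alignment indices a_i
    mass_s = H_s.norm(dim=-1)                        # token importance as norms
    D_star = (mass_s * cost.gather(1, a.unsqueeze(1)).squeeze(1)).sum()
    return a, D_star
```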
Figure 2 shows how the relaxed OT with window strategy works.

Figure 2: An example of OT from the speech sequence to the text sequence with the window strategy. The x-axis is the speech sequence and the y-axis is the text sequence. White circles represent transports that cannot be selected, pink circles represent candidate transports, and red circles represent the selected optimal transports. The dashed box represents the window; $W = 2$ is used as an example here.

Through the above steps, we obtain the alignment $A$ in an adaptive manner.

Cross-modal Mixup via OT With the help of the alignment $A$, we are able to mix up the speech and text sequences at the token level. For a mixup probability $p^*$, we generate the mixup sequence $H' = (h'_1, \dots, h'_n)$ with
$$h'_i = \begin{cases} h^x_{a_i}, & \text{if } p < p^* \\ h^s_i, & \text{otherwise,} \end{cases} \quad (8)$$
where $p$ is sampled from a uniform distribution $U(0, 1)$ independently for each position. We feed $H^s$, $H^x$, and $H'$ into the joint transformer to predict the translation. Note that the positions of OT and mixup can be different. Our scheme calculates OT on the inputs of the translation encoder and performs mixup on the encoder outputs, which brings better performance in our experiments. See Section 4.3 for more details.
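Given the alignment from the previous sketch, the token-level mixup of Equation (8) is only a few lines; again, the names below are illustrative.

```python
# A hedged sketch of the token-level cross-modal mixup in Equation (8),
# using the alignment a returned by ot_alignment above.
import torch

def cmot_mixup(H_s, H_x, a, p_star=0.2):
    """Replace each speech state with its aligned text state with probability p*."""
    p = torch.rand(H_s.size(0), device=H_s.device)   # p ~ U(0, 1), one draw per position
    use_text = (p < p_star).unsqueeze(-1)             # (n, 1) selection mask
    H_mix = torch.where(use_text, H_x[a], H_s)        # h'_i = h^x_{a_i} or h^s_i
    return H_mix
```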

Training Strategy
We follow the pretraining-finetuning paradigm to train our model.

Pretraining We pretrain the text embedding and the joint transformer with an MT task. The training objective during this phase is the cross entropy loss:
$$\mathcal{L}_{\mathrm{MT}} = -\sum_{t} \log P(y_t \mid y_{<t}, x). \quad (9)$$

Multitask Finetuning During finetuning, we train the model with multiple tasks. We add the cross entropy losses of ST and MT to the final loss:
$$\mathcal{L}_{\mathrm{ST}} = -\sum_{t} \log P(y_t \mid y_{<t}, s), \quad (10)$$
$$\mathcal{L}_{\mathrm{MT}} = -\sum_{t} \log P(y_t \mid y_{<t}, x). \quad (11)$$
Meanwhile, we regularize the predictions of the mixup and speech sequences by minimizing the Kullback-Leibler (KL) divergence between their token-level probability distributions, and we do the same for the predictions of the mixup and text sequences. For a speech prediction distribution $S$, a text prediction distribution $T$ and a mixup prediction distribution $M$, we have:
$$\mathcal{L}_{\mathrm{KL}} = \mathrm{KL}(M \,\|\, S) + \mathrm{KL}(M \,\|\, T). \quad (12)$$
Hence the final training objective is:
$$\mathcal{L} = \mathcal{L}_{\mathrm{ST}} + \mathcal{L}_{\mathrm{MT}} + \lambda \mathcal{L}_{\mathrm{KL}}, \quad (13)$$
where $\lambda$ is the weight that controls $\mathcal{L}_{\mathrm{KL}}$.
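A hedged sketch of the finetuning objective in Equations (10)-(13) is shown below. The function and argument names are ours, and the KL direction in the code is one reasonable reading of the description above rather than a confirmed detail of the released system.

```python
# A sketch of the multitask finetuning loss: cross entropy for ST and MT plus
# KL regularization between the mixup outputs and the speech / text outputs.
import torch
import torch.nn.functional as F

def cmot_loss(logits_speech, logits_text, logits_mixup, target, kl_weight=2.0, pad_id=1):
    """All logits: (batch, tgt_len, vocab); target: (batch, tgt_len)."""
    def ce(logits):
        return F.cross_entropy(logits.transpose(1, 2), target,
                               ignore_index=pad_id, label_smoothing=0.1)

    loss_st, loss_mt = ce(logits_speech), ce(logits_text)

    # F.kl_div(log_q, p) computes KL(p || q), so here the mixup prediction is the
    # "approximating" distribution for both the speech and text predictions.
    log_m = F.log_softmax(logits_mixup, dim=-1)
    kl_s = F.kl_div(log_m, F.softmax(logits_speech, dim=-1), reduction="batchmean")
    kl_t = F.kl_div(log_m, F.softmax(logits_text, dim=-1), reduction="batchmean")

    return loss_st + loss_mt + kl_weight * (kl_s + kl_t)
```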

Experiments
We conduct experiments under two settings, the base setting and the expanded setting. The former only uses ST data, while the latter additionally uses external MT data. We describe the datasets, our experimental setups and our main results in this section.

ST Datasets We conduct experiments on the MuST-C ST dataset, which covers translation from English into 8 languages. For each direction, it comprises at least 385 hours of audio recorded from TED Talks. We use the dev set for validation and the tst-COMMON set for evaluation.

External MT Datasets We incorporate external MT data to pretrain the text embedding and the joint transformer. For the directions En-De, En-Fr, En-Ru, En-Es and En-Ro we use data from WMT (Bojar et al., 2016). For En-It, En-Pt and En-Nl we use data from OPUS100 (Zhang et al., 2020). Table 2 shows the detailed statistics of the MuST-C, WMT and OPUS100 datasets we use.

Experimental Setups
Model Configurations For the speech encoder, we use the HuBERT (Hsu et al., 2021) base model, which is pretrained on LibriSpeech (Panayotov et al., 2015) without finetuning and is one of the state-of-the-art pretrained audio models. Following the usual practice for ST, we stack two convolution layers with kernel size 5, stride 2, padding 2 and hidden size 1024 after the HuBERT model as a sub-sampler. The text embedding is an embedding layer of dimension 512. For the joint transformer, we use $N_e = 6$ encoder layers and $N_d = 6$ decoder layers with 512 hidden units, 2048 feed-forward hidden units and 8 attention heads. We use fairseq (Ott et al., 2019) for implementation.

Pre-processing For speech input, we use 16-bit 16kHz mono-channel raw audio, and we filter out samples with more than 480k or fewer than 1k frames to ensure training efficiency. For transcripts and translations, we tokenize them using a unigram SentencePiece (Kudo and Richardson, 2018; https://github.com/google/sentencepiece) model with a vocabulary size of 10k, shared by the source language and the target language.
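As a small illustration of this pre-processing step, a shared unigram SentencePiece model could be trained with the Python API roughly as follows; the file paths are placeholders.

```python
# A hedged sketch of training the shared 10k-vocabulary unigram SentencePiece model.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.en,train.de",        # source transcripts + target translations (placeholder paths)
    model_prefix="spm_unigram_10k",
    model_type="unigram",
    vocab_size=10000,
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram_10k.model")
print(sp.encode("He pointed at the algae", out_type=str))
```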
Training Our training process follows the pretraining-finetuning paradigm. We pretrain the text embedding layer, the translation encoder and the decoder with an MT task. For the base setting, we only use the transcript-translation pairs, with a learning rate of 2e-3 and 8k warm-up steps. For the expanded setting, we first pretrain them on external MT data with a learning rate of 7e-4 and 8k warm-up steps, and then pretrain them on transcript-translation pairs with a learning rate of 1e-4 and 4k warm-up steps. For both settings, we allow at most 33k input tokens per batch. We then finetune the whole model with a learning rate of 1e-4 and 10k warm-up steps, training for 60k updates with a batch of 16M audio frames. We use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, dropout of 0.1 and a label smoothing value of 0.1. For CMOT, we set the weight λ of the KL loss to 2.0 and the mixup probability p* to 0.2 (see Section 4.3 for both), and we use the window strategy for OT by default with window size W = 10. All models are trained on 4 Nvidia GeForce RTX 3090 GPUs.

Inference We average the checkpoints of the last 10 epochs for evaluation and use beam search with a beam size of 8. We report case-sensitive detokenized BLEU on the MuST-C tst-COMMON set using sacreBLEU (Post, 2018), which we also use for statistical significance tests.

Baselines We compare our method with the Fairseq ST (Wang et al., 2020a) baseline implemented with the same framework (Ott et al., 2019). We also implement HuBERT-Transformer as a strong baseline, which has the same architecture as our method; the only difference is that it takes speech as input and performs only the ST task during finetuning. We train this model for 40k steps, except for the En-Ru direction, where we use 100k training steps due to specific language complexities.
We also compare our method with several strong end-to-end ST systems: Chimera (Han et al., 2021), which learns a shared semantic space for speech and text; XSTNet (Ye et al., 2021), which uses a progressive multi-task training strategy; STEMM (Fang et al., 2022), which applies a word-level mixup strategy; and ConST, which applies a contrastive learning strategy. These baselines all adopt the multi-task learning mode and have similar model architectures.

Table 1 shows the main results. HuBERT-Transformer, the baseline model we implement, achieves relatively high BLEU scores. For the base setting without external MT data, CMOT outperforms HuBERT-Transformer by 1.1 BLEU on average over the 8 directions. For the expanded setting, CMOT outperforms HuBERT-Transformer by 0.9 BLEU on average. CMOT also outperforms the other strong baselines.

Effect of OT
We use CMOT to learn the alignment between speech and text and to improve the performance of the ST system. In this section, we compare CMOT with models without OT or with other types of OT. All the models have the same architecture. MTL denotes the multi-task learning (ST and MT) model. Relaxed OT denotes the model trained with the same process as CMOT but without the window strategy in Section 2.3. IPOT denotes the model trained with the same process as CMOT but using IPOT (Xie et al., 2020) for the OT calculation. The A-score is defined as the accuracy of our alignment $A = (a_1, \dots, a_n)$ compared to the reference alignment $A^* = (a^*_1, \dots, a^*_n)$:
$$\text{A-score} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[a_i = a^*_i],$$
where $A^*$ is obtained with the Montreal Forced Aligner (MFA) (McAuliffe et al., 2017). As shown in Table 3, the result of relaxed OT is slightly lower than that of IPOT, but with the window strategy its performance exceeds that of IPOT at a faster speed. CMOT achieves the best BLEU score, along with the best A-score, which shows that CMOT not only takes advantage of this alignment but also optimizes it.
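For clarity, the A-score as defined above amounts to a simple fraction of matching alignment indices; a tiny sketch (names are illustrative):

```python
# A small sketch of the A-score above: the fraction of speech positions whose
# predicted aligned index matches the MFA-derived reference index.
def a_score(a, a_ref):
    assert len(a) == len(a_ref)
    return sum(int(ai == ri) for ai, ri in zip(a, a_ref)) / len(a)
```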
In addition, we inspect the cost matrices of OT to verify that the cost matrix can be used for cross-modal alignment. We compare the cost matrices of MTL and CMOT and analyze a typical example, as shown in Figure 3. Redder grids indicate that the distance between the speech token and the text token is smaller, while bluer grids indicate the opposite. To better observe the word-level alignment, we use MFA to cut the speech and text sentences into words, and the white solid lines denote the segmentation. For CMOT, the closer speech-text pairs basically lie on the diagonal, which is consistent with the alignment of the two sequences. But for MTL, without the help of OT, the speech tokens are not as close to their corresponding words. Comparing the two green boxes in the figure, we can see CMOT has better alignment on the word "pointed".

Effect of Cross-modal Mixup
To examine the effectiveness of the mixup strategy in bridging the modality gap, we compare the modality gap of CMOT and the baseline models. We calculate the mean sentence-level distances and the mean word-level distances between modalities for these models, focusing on the gap at the encoder input and output; the results are shown in Figure 4. Compared with the baseline model, CMOT has a smaller distance between modalities, which means our method effectively reduces the modality gap and thus improves the performance of ST.
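As an illustration, the sentence-level gap described above could be measured roughly as follows: the mean Euclidean distance between mean-pooled speech and text representations of paired sentences. This is a sketch of the measurement, not the authors' exact analysis script.

```python
# A hedged sketch of a sentence-level modality-gap measurement.
import torch

def sentence_gap(speech_states, text_states):
    """speech_states / text_states: lists of (len_i, d) tensors for paired sentences."""
    dists = [torch.dist(s.mean(dim=0), t.mean(dim=0))      # Euclidean distance of pooled states
             for s, t in zip(speech_states, text_states)]
    return torch.stack(dists).mean()
```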

Ablation Study
Training Objectives As a multi-task learning framework, our ST system's performance can be affected by the training objectives. Table 4 shows the impact of the training objectives on ST performance. Comparing Exp. I and II, we can empirically conclude that the multi-task learning strategy alone helps little, while Exp. III suggests that regularizing the output distributions is effective. Exp. IV, V, and VI show that calculating the KL divergence both between the mixup and speech outputs and between the mixup and text outputs brings the greatest improvement. In Exp. VII, we add the translation loss of the mixup sequence:
$$\mathcal{L}_{\mathrm{Mixup}} = -\sum_{t} \log P(y_t \mid y_{<t}, H'),$$
where $H'$ is the mixup sequence calculated as in Equation (8). Comparing Exp. VI and VII, we can see that $\mathcal{L}_{\mathrm{Mixup}}$ does not help. We also inspect the effect of an OT loss, which takes the value of $D^*$ in Equation (4), in Exp. VIII, and see a slight performance decline. Here OT is only used to help find the alignment, so the OT loss is not indispensable during finetuning, and we conjecture that adding the OT loss may make it harder for the model to converge to good parameters, as the training objective becomes too complex.

Figure 3: An example of how CMOT learns better alignment between speech and text ("He pointed at the algae"). We draw the cost matrices in the OT calculation for the MTL and CMOT models.

Figure 4: Modality gaps of different models. The data are calculated on the MuST-C En-De dev set with models of the expanded setting. "in" / "out" denotes the encoder input / output sequences, while "sentence" / "word" denotes sentence-level / word-level Euclidean distance.

Table 4: ST performances with different training objectives on the MuST-C En-De tst-COMMON set. These experiments are conducted under the expanded setting. $\mathcal{L}_{\mathrm{Mixup}}$ denotes the translation loss of the mixup sequence. We set the weights λ = 2 and µ = 0.1.

OT \ Mixup      encoder in   encoder out
encoder in          28.6          29.0
encoder out         28.7          28.7

Table 5: BLEU scores for different positions of OT and mixup on the MuST-C En-De tst-COMMON set (rows: OT position; columns: mixup position). These experiments are conducted under the expanded setting.

Positions of OT and Mixup
We investigate the impact of the positions of OT and mixup on our system. As shown in Table 5, the best scheme is to calculate OT for the encoder inputs and perform the mixup on the encoder outputs. We believe there is more original alignment information in the lower layers of the model, which is better for the OT calculation, while the hidden states of higher layers are more suitable for the mixup, as they have passed through several layers of the joint transformer and are closer in the representation space.

Mixup Probability We explore the impact of the mixup probability on ST performance, and the results are shown in Figure 5. Here p* denotes the proportion of text tokens in the mixup sequence. Previous works find that a slightly larger probability is better, while we find that p* = 0.2 improves ST performance the most. We think this is because CMOT needs to find the alignment between speech and text adaptively, so a larger proportion of text tokens in the mixup sequence might bring some noise.

Speech Encoder We investigate the impact of different speech encoders on our ST system. HuBERT (Hsu et al., 2021) and wav2vec 2.0 (Baevski et al., 2020) are both commonly used pretrained speech models. Table 6 indicates that using HuBERT as the speech encoder achieves higher performance than using wav2vec 2.0. In addition, to provide a fair comparison with other works using wav2vec 2.0, we conduct experiments on CMOT with wav2vec 2.0 as the speech encoder in more directions. As shown in Table 7, CMOT using wav2vec 2.0 still performs well, surpassing previous works in these three language pairs.

Speech Encoder                       Baseline   CMOT
wav2vec 2.0 (Baevski et al., 2020)     26.9     28.5
HuBERT (Hsu et al., 2021)              27.5     29.0

Table 6: BLEU scores of the baseline and CMOT with different speech encoders.

Table 7: BLEU scores of different models using wav2vec 2.0 on the MuST-C tst-COMMON set (En-De, En-Fr, En-Es). "CMOT-W2V2" refers to our CMOT model with wav2vec 2.0 as the speech encoder. "W2V2-Transformer" refers to the model with the same structure as CMOT-W2V2 but without mixup during training, for which we directly report the results given in Fang et al. (2022). The experiments of W2V2-Transformer and CMOT-W2V2 are both conducted under the expanded setting.

Weight of KL Loss Our finetuning process follows a multi-task learning approach, where the weight of each training objective can affect the overall performance. For the KL loss in Equation (12), we try several weight values λ ranging from 1.0 to 3.0, and find that the performance is best when λ = 2.0 or λ = 2.5, as shown in Figure 6. When the weight of the KL loss is too small, the regularization effect of the KL objective becomes small, while a large weight can degrade the performance of the main ST and MT tasks. Hence, we suggest setting the weight to a moderate value, and we choose λ = 2.0 for our system.
Window Size for OT We examine the impact of window size W on our ST system. As shown in Figure 7, a small window size may restrict the model's ability to identify appropriate alignments, while a large window size may diminish its ability to capture local dependencies.
Related Work

(2022b) projected speech and text into a common representation space. Chen et al. (2022) proposed the method of modality matching. ConST applied the contrastive learning strategy. Fang and Feng (2023b) proposed cross-modal regularization with scheduled sampling.
Mixup Our work draws on the mixup strategy. Mixup was first proposed by Zhang et al. (2018) as a form of data augmentation that improves generalization and increases the robustness of neural networks by creating "convex combinations of pairs of examples and their labels". Verma et al. (2019) proposed manifold mixup, going a step further than surface-level mixup. In recent years, mixup has been used to reduce the gap between representation spaces, such as in cross-modal transfer (So et al., 2022; Ye and Guo, 2022) and cross-lingual transfer (Yang et al., 2021). Fang et al. (2022) proposed STEMM, applying mixup to end-to-end ST to overcome the modality gap. Although both STEMM and our work use the mixup strategy, STEMM needs external alignment tools, while CMOT finds the alignment adaptively.
Optimal Transport Optimal transport is a classical mathematical problem. It is now commonly used to describe the transfer cost between two distributions. Villani (2009) systematically demonstrates the theories and proofs of optimal transport. Peyré et al. (2019) outlines the main theoretical insights that support the practical effectiveness of OT. There are some approximation algorithms for the OT problem, like the Sinkhorn algorithm (Cuturi, 2013) and IPOT (Xie et al., 2020) algorithm. Kusner et al. (2015) proposed a relaxed form of OT in order to measure the similarity between documents, which reduced the computational cost. Chen et al. (2020) used OT in image-text pre-training to "explicitly encourage fine-grained alignment between words and image regions". Gu and Feng (2022) applied the relaxed form of OT to machine translation and managed to "bridge the gap between the semantic-equivalent representations of different languages".

Conclusion
We propose Cross-modal Mixup via Optimal Transport (CMOT) to adaptively find the alignment between speech and text sequences, and to mix up the sequences of different modalities at the token level. Experiments on the MuST-C benchmark demonstrate the effectiveness of our method, which helps overcome the modality gap and thus improves the performance of end-to-end ST systems. We will continue to explore the application of this adaptive cross-modal alignment method in the future.

Limitations
The CMOT method for speech translation is based on the idea of cross-modal knowledge transfer and the paradigm of multi-task learning, so it requires the transcripts of speech during training, which may not be applicable to languages without transcribed text. Besides, this work mainly focuses on the finetuning of speech translation; applications to the pretraining phase or to other tasks remain to be explored in future work.

Ethics Statement
Our proposed CMOT can help build a strong end-to-end ST system. It has prospects for application in many scenarios requiring speech translation, helping people understand speech in a foreign language. Nevertheless, the results generated by end-to-end ST systems are not always perfect, so users should not fully rely on them for the time being.