Mutual-Learning Improves End-to-End Speech Translation

A currently popular research area in end-to-end speech translation is the use of knowledge distillation from a machine translation (MT) task to improve the speech translation (ST) task. However, such a scenario only allows one-way transfer, which is limited by the performance of the teacher model. We therefore hypothesise that knowledge distillation-based approaches are sub-optimal. In this paper, we propose an alternative: a trainable mutual-learning scenario, where the MT and the ST models are collaboratively trained and are considered as peers, rather than teacher and student. This allows us to improve the performance of end-to-end ST more effectively than with a teacher-student paradigm. As a side benefit, the performance of the MT model also improves. Experimental results show that in our mutual-learning scenario, models can effectively utilise the auxiliary information from peer models and achieve compelling results on the MuST-C dataset.


Introduction
Speech translation (ST) aims to translate speech signals into a foreign language. It is a multi-modality task, closely related to automatic speech recognition (ASR) and machine translation (MT). ST has a wide range of applications, such as video subtitling (Saboo and Baumann, 2019), real-time lecture translation (Müller et al., 2016), and protection of endangered languages (Bansal et al., 2017).
Despite the recent success of end-to-end (E2E) models, such systems still face the issue of insufficient labelled training data (Sperber and Paulik, 2020). A recent popular advance in E2E ST is the use of knowledge distillation (KD), which can provide an effective paradigm for transferring knowledge from rich-resource to low-resource tasks (Gaido et al., 2020). Under such a paradigm, the MT model is considered a teacher that guides the ST model, which is considered a student learning from the teacher. We posit that one-way knowledge transfer in a strict teacher-student relationship may be sub-optimal for the following reasons: 1. Since the MT model is frozen in this one-way knowledge transfer scenario, the success of knowledge transfer, and hence the performance of the ST task, is constrained by the performance of the pre-trained MT model; 2. There is a modality gap between the speech and text inputs of the two models, with the speech input also containing inherent speaker variability.
Motivated to address the issues mentioned above, we set out to improve the ST and MT tasks by training them jointly. Instead of freezing the teacher model, we introduce a mutual-learning paradigm, which regards the ST and MT models as peers that learn collaboratively, aiming to iteratively learn and share knowledge between the two models. Originally, mutual-learning was proposed to leverage information from multiple models and allow effective dual knowledge transfer in image processing tasks (Zhang et al., 2018; Zhao et al., 2021). We leverage this idea and adapt it to sequence tasks. Our main contributions are: 1. We propose a jointly-trainable mutual-learning paradigm, which improves on the distillation method by training the two models together. The search spaces of MT and ST are both enlarged, providing the potential for more robust local optima. 2. We further improve our mutual-learning method by integrating a cyclical annealing schedule, which alleviates the KL vanishing problem suffered by many time-series tasks (Fu et al., 2019; Bowman et al., 2016; Higgins et al., 2017). 3. We carry out extensive experiments on the MuST-C En-Fr and En-Es datasets and illustrate the advantage of our model by empirically comparing it with a cascaded model, a knowledge distillation (KD) model and a multi-task learning (MTL) model. The experimental results show that our model can effectively leverage the transcript and the auxiliary MT task, and we provide competitive results in all experiments. In addition, as a side benefit, the performance of the MT model also improves.

Model Description

End-to-End Speech Translation
E2E ST learns a single model, which directly maps features extracted from the speech signal to a target language text sequence (Duong et al., 2016; Weiss et al., 2017). More concretely, given a sample pair (x, y) from the training set D, where x denotes the speech signal features and y the translated target sentence, the ST model is trained by minimising the negative log likelihood (NLL) loss L of y given x. E2E models consist of an encoder that encodes the speech input as an intermediate representation, and a decoder that decodes this intermediate representation to a probability distribution over the target text feature space. In the past, the encoder and decoder were based on recurrent neural network architectures, but most recent work utilises Transformer-based architectures (Berard et al., 2016; Weiss et al., 2017; Di Gangi et al., 2019b; Vila et al., 2018).
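As a sketch of the NLL objective described above, one standard way to write it is given below. The symbols p(y | x; θ) for the model's probability of the target sentence and θ for the model parameters are assumptions introduced here for illustration, as is the choice to sum rather than average over the training set.

```latex
\begin{equation}
\mathcal{L}(\theta) = -\sum_{(x, y) \in D} \log p(y \mid x; \theta)
\end{equation}
```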

Mutual-Learning
Model definition: Given a parallel data sample (x_i, s_i, y_i) drawn from the input speech features X, the input-language text features S and the target text features Y, the ST model M_st and the MT model M_mt each produce an output probability distribution over the target vocabulary. Our training loss has two components: a traditional supervised reconstruction loss and a mimicry loss that aligns the output posterior distributions of the two models. We adopt Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) as the mimicry loss, aiming to reduce the distance between the outputs of the ST and MT systems, effectively encouraging them to mimic each other. Since KL divergence is asymmetric, we calculate it in both directions, giving the two terms KL_1 and KL_2 (Eqs. 4-5), where N denotes the length of the output sentence over which they are computed. We adopt NLL loss as the reconstruction loss, denoted by LC_st for ST and LC_mt for MT, where y_i denotes the i-th token in the target sentence. The overall mutual-learning training loss is a combination of the weighted mimicry loss and the reconstruction losses, as described by Eq. 8. The proposed mutual-learning scenario is illustrated in Figure 1.
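One way to write out the losses described in this section is sketched below. It is a plausible reading, not a verbatim restatement: the names p_st and p_mt, the vocabulary symbol V, the token index n, the assignment of directions to KL_1 and KL_2, and the way the weight β enters the combined loss are all assumptions made for illustration.

```latex
\begin{align}
% Output distributions of the two peers (p_st, p_mt are assumed names)
p_{st} &= M_{st}(x_i), \qquad p_{mt} = M_{mt}(s_i) \\
% Mimicry loss in both directions, accumulated over the N output positions
KL_1 &= \sum_{n=1}^{N} \sum_{v \in V} p_{mt}^{(n)}(v) \, \log \frac{p_{mt}^{(n)}(v)}{p_{st}^{(n)}(v)} \\
KL_2 &= \sum_{n=1}^{N} \sum_{v \in V} p_{st}^{(n)}(v) \, \log \frac{p_{st}^{(n)}(v)}{p_{mt}^{(n)}(v)} \\
% Reconstruction (NLL) losses of each model against the reference tokens
LC_{st} &= -\sum_{n=1}^{N} \log p_{st}^{(n)}(y_n), \qquad
LC_{mt} = -\sum_{n=1}^{N} \log p_{mt}^{(n)}(y_n) \\
% Combined objective: reconstruction plus weighted mimicry (assumed form)
\mathcal{L} &= LC_{st} + LC_{mt} + \beta \, (KL_1 + KL_2)
\end{align}
```

Here p^{(n)}(v) stands for the probability a model assigns to vocabulary item v at output position n.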

Training Strategy
The training process is described by Algorithm 1. We propose to train the ST and MT models iteratively until convergence. In each iteration, there are two steps: 1. The MT model is frozen and the parameters of the ST model are updated; 2. The ST model is frozen and the parameters of the MT model are updated.
KL vanishing issue: Leveraging KL divergence for the mimicry loss in our mutual-learning strategy can suffer from the KL vanishing issue, which has been observed in other applications, for example in variational auto-encoders (Fu et al., 2019). We mitigate this by adopting a cyclical annealing schedule for β, which has been proposed for this purpose in the context of variational auto-encoders (Fu et al., 2019). More concretely, β in Eq. 8 changes periodically over the training iterations, as described by Eq. 11, where t represents the current training iteration and r tracks the position of t within the current cycle. The training process is effectively split into many cycles, with each cycle lasting C iterations. In each cycle, β_t progressively increases from 0 to 1 during the first RC iterations and then stays at 1 for the remaining (1 − R)C iterations. With R = 0.5 and C = 5000, we are able to mitigate the KL vanishing issue and train the models.
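The alternating update can be illustrated on a toy problem. The sketch below is not the training code used in the experiments: each "model" is reduced to a vector of logits over a four-word vocabulary for a single output token, gradients are written out by hand, a fixed iteration count stands in for "until convergence", and the function names, learning rate and initial values are all hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def nll(p, target):
    """Negative log likelihood of the reference token."""
    return -math.log(p[target])

def grad_wrt_own_logits(own_logits, peer_probs, target, beta):
    """Gradient of NLL + beta * (KL(peer||own) + KL(own||peer)) w.r.t. own logits.

    The peer distribution is treated as a constant, i.e. the peer is frozen.
    """
    p = softmax(own_logits)
    kl_own_peer = kl(p, peer_probs)
    grad = []
    for k, (pk, qk) in enumerate(zip(p, peer_probs)):
        g_nll = pk - (1.0 if k == target else 0.0)
        g_kl_peer_own = pk - qk
        g_kl_own_peer = pk * (math.log(pk / qk) - kl_own_peer)
        grad.append(g_nll + beta * (g_kl_peer_own + g_kl_own_peer))
    return grad

def mutual_learning(st_logits, mt_logits, target, beta=1.0, lr=0.5, iterations=200):
    st, mt = list(st_logits), list(mt_logits)
    for _ in range(iterations):
        # Step 1: the MT model is frozen, the ST parameters are updated.
        g = grad_wrt_own_logits(st, softmax(mt), target, beta)
        st = [z - lr * gz for z, gz in zip(st, g)]
        # Step 2: the ST model is frozen, the MT parameters are updated.
        g = grad_wrt_own_logits(mt, softmax(st), target, beta)
        mt = [z - lr * gz for z, gz in zip(mt, g)]
    return st, mt

ST_INIT = [0.5, -0.2, 0.1, 0.0]   # hypothetical initial ST logits
MT_INIT = [0.0, 0.3, 1.0, -0.5]   # hypothetical initial MT logits
TARGET = 2                        # index of the reference token
st_final, mt_final = mutual_learning(ST_INIT, MT_INIT, TARGET)
```

In a real system the two gradient steps would be optimiser updates on full Transformer models, with the frozen peer run in inference mode to supply its output distribution.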
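The schedule described above can be sketched as a small function. This is a minimal illustration under two assumptions that the text does not state: the ramp from 0 to 1 is linear, and iterations are counted from t = 0. The function and parameter names are hypothetical.

```python
def cyclical_beta(t, cycle_length=5000, ramp_fraction=0.5):
    """Cyclical annealing weight for iteration t (0-based, assumed).

    cycle_length corresponds to C and ramp_fraction to R in the text.
    """
    # Relative position of t inside its cycle, in [0, 1).
    r = (t % cycle_length) / cycle_length
    if r < ramp_fraction:
        # Assumed linear ramp over the first R * C iterations of the cycle.
        return r / ramp_fraction
    # Hold at 1 for the remaining (1 - R) * C iterations.
    return 1.0
```

With R = 0.5 and C = 5000, the weight climbs during iterations 0 to 2499 of each cycle, stays at 1 until iteration 4999, and restarts from 0 at the beginning of the next cycle.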

Dataset
We evaluate the proposed framework on the popular MuST-C multilingual speech translation corpus (Di Gangi et al., 2019a), using the two most-used language pairs: English-to-French (En-Fr) and English-to-Spanish (En-Es). The En-Fr dataset contains 500 hours of speech and 280k sentences. The En-Es dataset contains 504 hours of speech and 270k sentences.
Pre-processing: We implement the same data pre-processing steps as described in the fairseq speech-to-text framework (Wang et al., 2020). Specifically, we extract 80-channel log Mel-filterbank features. Training samples that are longer than 3000 frames are removed. For both input and target texts, we employ the subword regularisation method (Kudo, 2018) to build a vocabulary with a size of 8000. We also experiment with a jointly-trained shared vocabulary of size 8000.
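The length filter can be sketched as follows. The manifest layout (a list of dictionaries with `id` and `n_frames` fields) and the sample values are hypothetical and only serve to show the rule that samples longer than 3000 frames are dropped.

```python
MAX_FRAMES = 3000  # length threshold stated in the text

def filter_by_length(samples, max_frames=MAX_FRAMES):
    """Keep only samples whose feature sequence has at most max_frames frames."""
    return [s for s in samples if s["n_frames"] <= max_frames]

# Hypothetical manifest entries for illustration.
manifest = [
    {"id": "utt_a", "n_frames": 812},
    {"id": "utt_b", "n_frames": 3000},
    {"id": "utt_c", "n_frames": 3001},
]
kept = filter_by_length(manifest)
```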

Architecture and Evaluation Details
For the ST task we use a stack of 2 1D convolutional layers (kernel size 5, stride 2), followed by 12 Transformer layers of size 2048 as the encoder. We use 6 stacked Transformer layers of size 512 as the decoder. For the MT task we use 12 stacked Transformer layers of size 2048 as the encoder and 6 stacked Transformer layers of size 2048 as the decoder. Evaluation is based on the standard implementation of the BLEU score, SACREBLEU (Post, 2018), with a beam size of 5. The maximum number of tokens in each batch is set to 40000.
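To give a sense of how much the two convolutional layers shorten the input, the sketch below applies the usual 1D convolution output-length formula twice. The padding of kernel_size // 2 is an assumption (the text gives only kernel size and stride), and the function names are hypothetical.

```python
def conv1d_out_len(length, kernel_size=5, stride=2, padding=None):
    """Output length of a 1D convolution (dilation 1) for a given input length."""
    if padding is None:
        padding = kernel_size // 2  # assumed padding, not stated in the text
    return (length + 2 * padding - kernel_size) // stride + 1

def subsampled_len(length, n_layers=2):
    """Sequence length after n_layers stacked stride-2 convolutions."""
    for _ in range(n_layers):
        length = conv1d_out_len(length)
    return length
```

Under these assumptions, an utterance at the 3000-frame limit reaches the Transformer encoder as a sequence of 750 positions, a fourfold reduction.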

Comparison with a Cascaded Model
To form a cascaded model, we first train a Transformer-based E2E ASR model using speech inputs and English transcripts. We then train an MT model using English transcripts and target sentences. In inference mode, we first use ASR to generate intermediate text representation, then we pass this to the MT system and calculate the output probabilities on the target language vocabulary.
As shown in Table 1, our mutual-learning-based ST model provides competitive results compared to a cascaded model. Our model achieves an improvement of 0.6 and 0.5 BLEU on the En-Fr and En-Es datasets, respectively. The results illustrate that our mutual-learning paradigm provides an effective method for leveraging the additional information available via the transcript.

Table 1: A comparison of ST task evaluation results for different approaches: cascaded model, vanilla end-to-end, end-to-end with multi-task learning, end-to-end with knowledge distillation, and end-to-end with mutual-learning (ML). " " denotes training with a joint vocabulary.

Comparison with a Knowledge Distillation Model
Knowledge distillation (KD) is a conceptually similar approach to the proposed framework. KD provides a one-way transfer from a trained teacher model to a student model. We provide a focused comparison with this method: we pre-train an MT model using input language and target language sentences, freeze it and use it to guide an ST model by minimising Eq. 13, in which KL_1 and KL_2 are described by Eqs. 4-5 and LC is the reconstruction loss (Eq. 6). The main difference between KD and our approach is that in KD the MT model is pre-trained, frozen and used in inference mode only to guide the ST model training, which is performed separately from the MT model training. From Table 1 it can be seen that the proposed mutual-learning approach outperforms the one-way knowledge distillation strategy (E2E+KD) by 1.0 and 0.6 BLEU on the En-Fr and En-Es datasets, respectively.
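A possible form of the KD objective, built only from the three terms named above, is sketched below. The plain unweighted sum is an assumption: any weighting between the mimicry terms and the reconstruction loss is not specified in this section and is therefore omitted. The subscript on LC_st is added here to stress that only the ST student is updated.

```latex
\begin{equation}
\mathcal{L}_{KD} = LC_{st} + KL_1 + KL_2
\end{equation}
```

The contrast with mutual-learning is that the MT distribution entering KL_1 and KL_2 comes from a fixed, pre-trained model, so gradients of this loss only ever reach the ST parameters.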

Comparison with a Multi-Task Learning Model
Multi-Task Learning (MTL) is also a collaborative learning strategy. In contrast to the proposed mutual-learning scenario, in MTL we train all tasks in parallel: the ST model and the MT model are trained separately, using the average of the NLL losses from the MT and ST models. The evaluation of the ST task after training with the MTL strategy is shown in Table 1 (E2E + MTL). These results show that our mutual-learning strategy is a more effective way of joint learning, gaining a 0.7 and 0.3 BLEU score increase over MTL in the ST task.
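Written out, the averaged objective described above would take the following form, reusing the reconstruction losses LC_st and LC_mt from the mutual-learning section. The label L_MTL is an assumed name.

```latex
\begin{equation}
\mathcal{L}_{MTL} = \frac{1}{2} \left( LC_{st} + LC_{mt} \right)
\end{equation}
```

Unlike the mutual-learning loss, this objective contains no mimicry term, so the two models never see each other's output distributions.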

Joint vocabulary training
The vanilla E2E ST model uses separate vocabularies for the source and target languages. We also utilised a jointly-trained byte pair encoding (BPE) to build the vocabulary and achieved a surprising improvement on what was already a state-of-the-art result (see the last row of Table 1).

Evaluation of the MT task
In addition, we evaluate the performance of the MT task and compare our proposed mutual-learning scenario with an independently trained MT model and also with the multi-task learning scenario. The architecture of the MT model and the hyperparameter values used in each training scenario are identical, as described in Sec. 3.2.
From the results in Table 2 we can conclude that mutual-learning also improves the MT model's performance. Our system gains 0.7 and 0.3 BLEU on the En-Fr and En-Es datasets, respectively, compared to the independently trained MT system. Our system also exceeds a typical MTL approach by 0.2 and 0.4 BLEU in the MT task. These results suggest that our mutual-learning leads to more robust minima than the MTL paradigm.

Conclusion
We proposed a mutual-learning paradigm for end-to-end speech translation to effectively transfer knowledge between ST and MT models. Experimental results demonstrate that our proposed approach outperforms knowledge distillation, the typical one-way transfer paradigm, as well as multi-task learning, a typical dual knowledge transfer paradigm. We also provide a competitive result compared to a cascaded model, which has thus far been outperforming E2E ST models.