THE IWSLT 2021 BUT SPEECH TRANSLATION SYSTEMS

The paper describes BUT’s English-to-German offline speech translation (ST) systems developed for IWSLT 2021. They are based on jointly trained Automatic Speech Recognition-Machine Translation models. Their performance is evaluated on the MustC-Common test set. In this work, we study their efficiency from the perspective of having a large amount of separate ASR training data and MT training data, and a smaller amount of speech-translation training data. Large amounts of ASR and MT training data are utilized for pre-training the ASR and MT models. Speech-translation data is used to jointly optimize ASR-MT models by defining an end-to-end differentiable path from speech to translations. For this purpose, we use the internal continuous representations from the ASR-decoder as the input to the MT module. We show that speech translation can be further improved by training the ASR-decoder jointly with the MT-module using a large amount of text-only MT training data. We also show significant improvements by training an ASR module capable of generating punctuated text, rather than leaving the punctuation task to the MT module.


Introduction
Speech Translation (ST) systems are intended to generate text in the target language from audio in the source language. Conventional ST systems are cascade systems that include, in their most popular form, three blocks: an ASR system, a punctuation/segmentation module and an MT model (Ngoc-Quan Pham, 2019; Pham et al., 2020b; Ansari et al., 2020). The Automatic Speech Recognition (ASR) and Machine Translation (MT) models are trained independently, and the MT model processes the ASR output text (ASR hypotheses) to generate translations. In a cascade system, advancements in ASR and MT can be directly extended to ST. These models can also leverage the availability of large ASR and MT datasets, and some of the state-of-the-art ST systems are still cascade systems.
Recently, End-to-End ST systems have become widely popular. An End-to-End ST system can directly generate text in the target language from audio in the source language. These models are simpler in structure and are more suitable for operating in a streaming fashion. Most End-to-End speech translation systems are variants of the encoder-decoder architecture with attention (Bahdanau et al., 2015; Di Gangi et al., 2019; Zhao et al., 2020). This category includes the popular Transformer models, which were adapted for training End-to-End ST in (Di Gangi et al., 2019). In (Inaguma et al., 2020), better ST performance was achieved by initializing the encoder and decoder modules from pre-trained ASR and MT systems, respectively. Very deep transformer models have been trained with stochastic depth for End-to-End ST in (Pham et al., 2019). The use of relative positional embeddings has also improved the performance of the transformer (Pham et al., 2020a).
One major drawback of End-to-End ST is data availability, i.e., paired speech-to-translation data is scarce compared to ASR or MT data. Data augmentation and the use of synthetic data have been explored in (Bahar et al., 2019, 2020) to mitigate the issue. Unlike for End-to-End ST systems, the data for training cascade systems is easily available and less costly. A brief survey of existing approaches and their principal limitations is given in (Sperber and Paulik, 2020). Despite multiple advantages, cascade systems suffer from a major drawback: erroneous early decisions are propagated into the MT model, which degrades translation performance. To mitigate this degradation, rather than passing a single ASR output sequence to the MT model, other forms such as lattices, n-best hypotheses and continuous representations have been explored in (Anastasopoulos and Chiang, 2018; Zhang et al., 2019; Sperber et al., 2019; Vydana et al., 2021; Dong et al., 2020).
In this work, we use our jointly trained Automatic Speech Recognition-Machine Translation (Joint-ASR-MT) model previously described in (Vydana et al., 2021). The Joint-ASR-MT model is a cascade system, but it has a differentiable path between the ASR and MT modules. To create such a differentiable path, the continuous hidden representations (corresponding to each output token) from the ASR decoder are passed to the MT model. The continuous representations corresponding to each output token are the attention-weighted value vectors in the last layer of the transformer decoder. We refer to these continuous representations as "context vectors", as proposed in (Sperber et al., 2019).
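As a rough illustration of what an attention-weighted value vector is, the following minimal sketch computes one context vector with single-head scaled dot-product attention. It is a simplification under stated assumptions: the actual models use multi-head attention with learned projections, and the function names `softmax` and `context_vector` are hypothetical, not taken from our implementation.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(query: List[float],
                   keys: List[List[float]],
                   values: List[List[float]]) -> List[float]:
    """Single-head attention: weight each value vector by softmax(q.k / sqrt(d)).

    Assumption: one head, no learned projections; this only illustrates
    the 'attention-weighted value vector' idea for one output token.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

One such vector is produced per ASR output token, so the MT module receives a sequence of continuous vectors with the same length as the token sequence.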
Existing large separate ASR training data and MT training data can be used to pre-train these modules; then, the pre-trained modules are jointly optimized using a small amount of speech translation data. The joint optimization mitigates the degradation in performance due to erroneous early decisions.
In this paper, we generate German translations from English speech, and we focus on two main contributions: (1) We train different MT models that can translate normalized text or punctuated text. It is known that MT models translating punctuated text provide superior performance; therefore, we propose to train an ASR system that can generate punctuated text. We confirm that such an ASR system provides superior performance in the ASR-MT pipeline. (2) We use the internal continuous representations from the ASR-decoder as the input to the MT module. In section 6, we show that speech translation can be further improved by adapting the ASR-decoder to the MT module. This is achieved by training the ASR-decoder jointly with the MT-module using a large amount of text-only MT training data.

Datasets and Pre-processing
The datasets used for training the various models are described in Table 1. ASR-Train-set and MT-Train-set are used for pre-training the ASR and MT models, respectively. The pre-trained models are fine-tuned using ASR-MT-Train-set. All models are evaluated using the MustC-Common test set.

Pre-processing and Feature Extraction
From the audio data, 80-dimensional Mel-filter bank energies along with pitch features are extracted. The Moses toolkit is used for text tokenization and other standard text pre-processing. The umlauts in the German text are replaced by special tokens. All non-ASCII characters are removed from the text data. Repeated sentences are removed from the corpora. We cleaned up the MT training data by identifying and manually removing the sentences where successive words were erroneously concatenated into very long words. Sentence-piece models (Kudo and Richardson, 2018)

Pruning Noisy ASR corpus
Some of the utterances in the ASR-MT-Train-set (MustC, IWSLT and Europarl) are erroneous due to a shift in the alignment between audio and text. Training an End-to-End ASR on this data directly did not lead to convergence. To remove erroneous transcripts, a hybrid TDNN-LFMMI ASR system based on KALDI (Povey et al., 2011, 2016) was trained and used to decode the ASR-MT-Train-set. The Word Error Rate (WER) is computed for each sentence, and the sentences with more than 50% WER are deleted from the ASR-MT-Train-set (Potapczyk et al., 2019). Even with this cleaning, training the ASR systems only on ASR-MT-Train-set did not lead to convergence. Pre-training the ASR models on ASR-Train-set turned out to be crucial for convergence, as described in section 3.
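The pruning step above can be sketched as follows. This is a minimal illustration, assuming whitespace-tokenized transcripts and a plain word-level Levenshtein distance; the function names and the triple layout `(utt_id, reference, hypothesis)` are hypothetical and do not come from the KALDI recipe we used.

```python
from typing import List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sentence_wer(reference: str, hypothesis: str) -> float:
    """Sentence-level WER in percent (can exceed 100)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    return 100.0 * word_edit_distance(ref, hyp) / len(ref)


def prune_noisy(triples: List[Tuple[str, str, str]],
                max_wer: float = 50.0) -> List[Tuple[str, str]]:
    """Keep (utt_id, reference) pairs whose WER against the decoded
    hypothesis does not exceed max_wer; the rest are treated as misaligned."""
    return [(utt, ref) for utt, ref, hyp in triples
            if sentence_wer(ref, hyp) <= max_wer]
```

The 50% threshold is the one stated above; the hypothesis text is assumed to come from decoding each utterance with the hybrid system.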

Automatic Speech Recognition (ASR)
The ASR systems trained in this work are built on Transformer ASR models (Dong et al., 2018; Karita et al., 2019; Vydana et al., 2021; Vaswani et al., 2017). The ASR models have 12 encoder and 6 decoder layers with 4096 feed-forward units and an attention dimension of 1024 with 16 heads. Models are initially trained with ASR-Train-set and are later fine-tuned with ASR-MT-Train-set. A thresholding mechanism is used for pruning away the noisy end-of-sequence (EOS) tokens from beam search (Kahn et al., 2019). Models are trained with 30K warm-up updates and a checkpoint is saved after every 8K updates. The training is stopped with an early stopping criterion. The 8 best checkpoints are averaged and the averaged weights are used for decoding the hypotheses. Vectorized beam search (Seki et al., 2019) was used for decoding the ASR hypotheses with a beam size of 10. In the rest of this paper, the ASR models described in this section are referred to as Ext.ASR models (externally trained ASR models). Two different ASR systems were trained for generating normalized text (Norm-ASR) and punctuated text (Punc-ASR), and their performances are reported in Table 2. The WER of Punc-ASR is higher than that of Norm-ASR. Punc-ASR is obviously a more difficult task than Norm-ASR: the punctuation tokens are considered as extra words and each error in those words contributes to the WER.
ASR-LM: A Transformer language model was trained on English text (Irie et al., 2019). The model has 6 layers, with 4096 feed-forward units and an attention dimension of 1024 with 8 heads. The model is initially pre-trained on the Librispeech LM corpus and is later fine-tuned on English text from MT-Train-set and ASR-MT-Train-set. An improvement in performance is observed with shallow fusion of the ASR and the language model (ASR-LM). Performances of these language models are presented in column 2 of Table 5.
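Checkpoint averaging, as used above, can be sketched in a few lines. This is an illustration only: parameters are represented as flat lists of floats, and ranking checkpoints by a development-set loss is an assumption (the criterion defining the "best" checkpoints is not specified here). The names `select_best` and `average_checkpoints` are hypothetical.

```python
from typing import Dict, List, Tuple

Checkpoint = Dict[str, List[float]]


def select_best(scored: List[Tuple[float, Checkpoint]], k: int = 8) -> List[Checkpoint]:
    """Return the k checkpoints with the lowest score.

    Assumption: the score is a development-set loss (lower is better).
    """
    ranked = sorted(scored, key=lambda pair: pair[0])
    return [ckpt for _, ckpt in ranked[:k]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Element-wise mean of every parameter across the given checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(vals) / n for vals in columns]
    return averaged
```

The averaged weights are then loaded into a single model, so decoding costs the same as with one checkpoint.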
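Shallow fusion combines the two models only at decoding time by adding a weighted language-model log-probability to the ASR log-probability of each candidate token. The sketch below shows a single decoding step; the interpolation weight `lm_weight` and its default value are illustrative assumptions, not the values tuned in our experiments.

```python
import math
from typing import Dict


def shallow_fusion(asr_logp: Dict[str, float],
                   lm_logp: Dict[str, float],
                   lm_weight: float = 0.3) -> Dict[str, float]:
    """Fused next-token scores: log P_asr(tok) + lm_weight * log P_lm(tok).

    Assumption: both models share the same token vocabulary.
    """
    return {tok: asr_logp[tok] + lm_weight * lm_logp[tok] for tok in asr_logp}


# Toy step: the ASR slightly prefers "there", the LM strongly prefers "their".
asr = {"their": math.log(0.45), "there": math.log(0.55)}
lm = {"their": math.log(0.90), "there": math.log(0.10)}
fused = shallow_fusion(asr, lm)
best_token = max(fused, key=fused.get)
```

In beam search, the fused scores replace the ASR-only scores when ranking partial hypotheses.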

Machine Translation Systems (MT)
Transformer models (Vaswani et al., 2017) are also at the core of the MT systems. They have 6 encoder and 6 decoder layers with 4096 feed-forward units and an attention dimension of 1024 with 16 heads. The models are optimized with 30K warm-up updates and a checkpoint is saved every 8K updates. Training is stopped using an early stopping criterion. The 8 best checkpoints are averaged and the averaged weights are used for decoding the hypotheses. The noisy EOS tokens are pruned out using the thresholding mechanism of (Kahn et al., 2019). Vectorized beam search (Seki et al., 2019) is used for decoding the hypotheses with a beam size of 8. A large variance in performance is observed with respect to decoding hyper-parameters such as the maximum target sequence length and the length bonus. The maximum length of the target sequence is computed by multiplying the input sequence length by a length ratio; a ratio of 1.2 was found to be optimal on the development set. To control the length of the output sequence, the log-likelihood scores of the hypotheses are penalized by additive token insertion penalties. The optimal value for this penalty is tuned as a hyper-parameter on the development set. The hypothesis text is de-tokenized and the BLEU score is evaluated using the Moses toolkit. All the BLEU scores reported in this paper are computed on the de-tokenized, punctuated German text, with punctuated text as the reference, using multi-bleu-detok.perl. The performances of the MT systems are reported in Table 3.
In Table 3, Norm-MT and Punc-MT are MT models trained to predict punctuated German text from normalized and punctuated English text, respectively.
MT-LM: A transformer language model was trained on German text from MT-Train-set and ASR-MT-Train-set. This LM is also used while decoding with the MT model (Irie et al., 2019). The architecture of the model is the same as that of the ASR-LM described in section 3. A shallow fusion between the MT model and the MT-LM language model is performed. As shown in Table 3 and column 2 of Table 5, the additional language model (MT-LM) did not improve the performance significantly.
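The two length-related decoding controls described above can be sketched as follows. Rounding the maximum length up and the sign convention of the additive per-token term are assumptions made for illustration; the function names are hypothetical.

```python
import math


def max_target_length(input_len: int, length_ratio: float = 1.2) -> int:
    """Upper bound on the number of target tokens for one input sequence.

    Assumption: the product is rounded up to the next integer.
    """
    return math.ceil(input_len * length_ratio)


def rescore(log_likelihood: float, num_tokens: int, insertion_penalty: float) -> float:
    """Hypothesis score with an additive per-token term.

    A positive value favours longer outputs, a negative value shorter ones;
    the value itself would be tuned on the development set.
    """
    return log_likelihood + insertion_penalty * num_tokens


# Toy comparison of a short and a long hypothesis.
short = (-4.0, 3)   # (log-likelihood, number of tokens)
long_ = (-5.0, 6)
without = (rescore(*short, 0.0), rescore(*long_, 0.0))
with_bonus = (rescore(*short, 0.5), rescore(*long_, 0.5))
```

Without the per-token term the shorter hypothesis wins because every extra token lowers the log-likelihood; with a bonus of 0.5 per token the longer one is preferred.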

Jointly Trained ASR-MT Systems
The model has two modules, ASR and MT; their architecture is the same as described in sections 3 and 4, respectively (see the block diagram in Figure 1 and the full description of the model in (Vydana et al., 2021)). The context vectors from the final layer of the ASR-decoder are used as inputs to the MT module. Passing context vectors from ASR to MT models while training has also been explored in (Sperber et al., 2019). Both models are jointly optimized using a multi-task cross-entropy loss (ASR cross-entropy and MT cross-entropy); both losses are also shown in Figure 1. During inference, beam search is used to obtain the ASR hypotheses, and the corresponding context vectors obtained from the ASR model are used by the MT module to generate the translation. The decoded hypothesis is selected as

$$y^{*} = \operatorname*{arg\,max}_{y,\; z \in \hat{Z}} \big( \log P(y \mid z) + \log P(z \mid x) \big) \qquad (1)$$

where $x$ is the speech, and $z$ and $y$ are the source and target sequences, respectively. $\hat{Z}$ is the set of n-best source sequences and $y^{*}$ is the most likely decoded hypothesis. In this equation, $y^{*}$ is always a discrete sequence, while $z$ is a discrete sequence when we are using Ext.MT and a continuous one when using Joint-MT. Note that a similar coupled search was used in (Tu et al., 2017), where the back-translation likelihoods are used for re-scoring the hypotheses of the MT system.
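The coupled search over ASR and MT scores can be illustrated with a small sketch that adds the ASR log-probability of each n-best source hypothesis to the MT log-probability of each of its translations and keeps the best pair. The data layout and the function name `coupled_search` are hypothetical; in the actual system the candidates come from beam search rather than from precomputed lists.

```python
from typing import List, Optional, Tuple

# One entry per n-best ASR hypothesis z:
# (z, log P(z|x), [(y, log P(y|z)), ...])
Candidate = Tuple[str, float, List[Tuple[str, float]]]


def coupled_search(candidates: List[Candidate]) -> Optional[Tuple[float, str, str]]:
    """Return (score, y, z) maximizing log P(y|z) + log P(z|x), or None if empty."""
    best: Optional[Tuple[float, str, str]] = None
    for z, logp_z, translations in candidates:
        for y, logp_y in translations:
            score = logp_y + logp_z
            if best is None or score > best[0]:
                best = (score, y, z)
    return best


# Toy example: the second ASR hypothesis is less likely on its own,
# but its translation is scored much higher by the MT model.
nbest = [
    ("z1", -1.0, [("ya", -3.0)]),
    ("z2", -1.5, [("yb", -1.0)]),
]
result = coupled_search(nbest)
```

Here the search selects the translation of the second hypothesis, showing how the MT score can override the 1-best ASR decision.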

Adapting ASR decoder to the MT module
Joint-ASR-MT models are jointly optimized by having an end-to-end differentiable path from speech to translations. The internal continuous representations from the ASR-decoder are used as the input to the MT module. Speech translation can be further improved by adapting the ASR-decoder to the MT module. This is achieved by training the ASR-decoder jointly with the MT-module using a large amount of text-only MT training data. The weights of the model are initialized from a trained Joint-ASR-MT model. Speech translation data (ASR-MT-Train-set) is used to fine-tune the Joint-ASR-MT model using a multi-task loss. In addition, the data from the MT-Train-set is used to jointly train the ASR-decoder and the MT-module of the Joint-ASR-MT model. We alternately update the model using the multi-task loss described in section 5 and the adaptation loss described in this section. A block diagram describing this training is presented in Figure 2. The input text sequence is given to the ASR-decoder, and a sequence of zeros is used as the encoder output sequence of the ASR model (i.e., H_ASR in Figure 2). The context vectors computed from these two sequences are used for training the MT-module. Note that a similar method has been adopted in (Potapczyk et al., 2019) for improving the performance of an ASR system using only text data. This training further improves the performance, as will be shown in section 7.
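The alternating training scheme can be sketched as a simple batch scheduler. This is only an outline under assumptions: batches are treated as opaque objects, the schedule stops when either stream is exhausted, and the length and dimensionality of the all-zero encoder output are free parameters here. The names `zero_encoder_output` and `alternating_schedule` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def zero_encoder_output(num_frames: int, d_model: int) -> List[List[float]]:
    """All-zero stand-in for the ASR encoder output used with text-only data."""
    return [[0.0] * d_model for _ in range(num_frames)]


def alternating_schedule(st_batches: Iterable[object],
                         text_batches: Iterable[object]) -> Iterator[Tuple[str, object]]:
    """Alternate one multi-task update (speech translation data) with one
    adaptation update (text-only MT data).

    Assumption: the schedule ends as soon as either stream runs out.
    """
    for st_batch, text_batch in zip(st_batches, text_batches):
        yield ("multi_task", st_batch)
        yield ("adaptation", text_batch)
```

A training loop would dispatch on the first element of each pair: for `"multi_task"` it would compute the ASR and MT cross-entropies from speech, and for `"adaptation"` it would feed the source text to the ASR-decoder together with the zero encoder output and compute the MT loss.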

Speech Translation Results
Results for the various configurations of speech translation systems are given in Table 4. First, we focus on column A, where the Joint-ASR-MT models are trained using ASR-MT-Train-set (only speech translation data) with a multi-task loss as described in section 5. Note, however, that the Ext.ASR and Ext.MT systems are trained on large amounts of data and fine-tuned on ASR-MT-Train-set as described in sections 3 and 4, respectively. For the systems in column A, normalized (unpunctuated) text is passed from the ASR to the MT model. Row 1 corresponds to the conventional cascade system, where the Ext.ASR system generates the n-best hypotheses of discrete token sequences and Ext.MT uses these token sequences for generating the translations as described in Eq. 1. We consider this system, which achieves a BLEU score of 23.20, as the baseline.
Usually, a transformer ASR decoder takes the partial output hypothesis and extends it by a new token with every autoregressive decoding step. For the system in row 2, Ext.ASR generates the complete hypothesis and the ASR module from Joint-ASR-MT is "asked" to extend it by one more token. As a byproduct, "context vectors" (the continuous representations) are generated for the whole sequence; these are then passed to the MT-module in the Joint-ASR-MT model to generate the translation. Compared to row 1 of column A, we see a degradation in performance. This can be attributed to having only a small amount of speech translation training data, which is not sufficient for robustly training the Joint-ASR-MT systems.
For the systems in row 3, Ext.ASR generates the ASR hypotheses, which are used by Ext.MT as in the system described in row 1; the hypotheses from Ext.ASR are also used by Joint-MT as in the system described in row 2. To generate the translation, the outputs of both MT models are ensembled as follows: for each output token, a weighted average of the log-softmax outputs from the two MT models is computed. This weighted average is used in the beam search to compute the n-best partial hypotheses. These partial hypotheses are further extended by both models to generate the log-softmax outputs for the next tokens. We can see that this ensemble system achieves a BLEU score of 24.02 and outperforms the cascaded baseline.
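One step of the token-level ensembling described above can be sketched as follows. The logits, the vocabulary size and the helper names are illustrative assumptions; the sketch only shows how the weighted average of log-softmax outputs is formed and how the top candidates for the beam are picked from it.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def ensemble_step(logp_a: List[float], logp_b: List[float],
                  weight_a: float = 0.5) -> List[float]:
    """Weighted average of two models' log-softmax outputs for the next token."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(logp_a, logp_b)]


def top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring tokens, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy step with a 3-token vocabulary and equal weights for both models.
ext_mt = log_softmax([2.0, 1.0, 0.0])    # model A prefers token 0
joint_mt = log_softmax([0.0, 3.0, 0.0])  # model B strongly prefers token 1
combined = ensemble_step(ext_mt, joint_mt)
beam = top_k(combined, k=2)
```

In a full decoder, each surviving partial hypothesis is fed back to both models, and the same averaging is applied at the next step.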
The systems in rows 4-6 are essentially the same as the ones in rows 1-3, respectively, except that now the ASR module from the Joint-ASR-MT system is directly used to produce the n-best ASR hypotheses and the corresponding context vectors. Rows 4-6 show the same trend as rows 1-3 with slightly improved performance; these improvements are mainly due to the better-performing ASR system. As described in Section 2.2, training ASR systems only on ASR-MT-Train-set (data from MustC, IWSLT and Europarl with erroneous transcriptions) did not lead to convergence. In contrast, the ASR module of the Joint-ASR-MT system, trained only on ASR-MT-Train-set, achieves better ASR performance (WER 16.14%) than Ext.ASR (WER 18.20%), which is pre-trained on ASR-Train-set (approx. 2000 hrs) and fine-tuned on the erroneous ASR-MT-Train-set. A similar trend is observed with the systems in columns B, C and D.
The systems described in rows 7-9 are similar to those from rows 1-3, except that the ASR hypotheses are obtained by ensembling Ext.ASR and the ASR module of the Joint-ASR-MT model. The ensembling is performed in a similar way as described for the MT system (row 3). All the ensemble systems in rows 3, 6, and 7-9 are ensembled giving equal weight to both systems, except for row 10, where the ensemble weights are tuned on the development set. For all these systems, we can see that the ensembling consistently improves the performance.
The systems in column B are similar to the ones in column A, but for the Joint-ASR-MT model, the weights of the ASR and MT modules are initialized from Ext.ASR and Ext.MT. Only then is the Joint-ASR-MT model fine-tuned using ASR-MT-Train-set. Comparing column A and column B, we can see that such pre-training significantly improves the performance.
We also see that the MT system using continuous representations (Joint-MT) (row 5; BLEU 23.97) outperforms the system with Ext.MT (row 4; BLEU 23.86), and a similar trend can be seen in columns C and D. This is in contrast to the systems in column A, where we did not use enough data for training the Joint-ASR-MT model; now, with the pre-training, the Joint-ASR-MT model is effectively trained on the same amount of data as the Ext.MT systems.
The systems in column C are similar to the ones in column B, but the ASR and MT modules used here are Punc-ASR (ASR systems which can generate punctuated text) and Punc-MT (MT systems which can process punctuated text as input), respectively. We can see that the systems from column C perform significantly and consistently better than the corresponding ones in column B. This shows that it is more effective to train an ASR module to generate punctuated text rather than leaving the punctuation task to the MT module. Note that the ASR performance reported in columns C and D is computed including the punctuation symbols, which results in higher WERs.
Finally, the systems in column D are the same as the ones in column C, except that we additionally use the ASR decoder adaptation scheme described in section 6. The consistent improvements observed in column D as compared to column C show the effectiveness of this adaptation scheme. The scheme makes use of the large amount of text-only MT training data to also train the ASR decoder, which tightens the coupling between the ASR-decoder and the MT-module. Apart from improving the MT-module, this adaptation has also improved the performance of the ASR-decoder on its own. This can be observed by comparing the WERs of row 4 in columns C and D.
The results of passing the n-best hypotheses from the ASR to the MT models are presented in Table 5. Passing the n-best hypotheses from the ASR to the MT module gives better performance, but not significantly so. This result is not in line with our previous study (Vydana et al., 2021), where we saw significant gains from switching from 1-best to n-best.

Conclusion
In this work, we have explored joint training of ASR-MT models for speech translation. Initializing these models from pre-trained ASR and MT models has led to better optimization. The joint training has improved the performance of the ASR module significantly, as the additional MT module has provided better (light) supervision in the context of erroneous ASR transcripts. Adding the punctuation information into the input text improves the performance of the MT model greatly.
In line with this observation, the use of an ASR system generating punctuated text also improves the MT performance significantly in a cascade pipeline. Using the text-only MT data to adapt the ASR decoder to the MT module in the Joint-ASR-MT model further improves the performance of these systems. The systems trained in this work are offline models, and their performance needs to be studied from the perspective of online or streaming models.