The HW-TSC’s Offline Speech Translation System for IWSLT 2022 Evaluation

This paper describes the design of HW-TSC’s Offline Speech Translation System submitted to the IWSLT 2022 Evaluation. We explored both cascade and end-to-end systems on three language tracks (en-de, en-zh and en-ja), and chose the cascade one as our primary submission. For the automatic speech recognition (ASR) part of the cascade system, three ASR models (Conformer, S2T-Transformer and U2) are trained on a mixture of five datasets. During inference, transcripts are generated with the help of a domain controlled generation strategy. Context-aware reranking and an ensemble based anti-interference strategy are proposed to produce better ASR outputs. For the machine translation part, we pretrained three translation models on the WMT21 dataset and fine-tuned them on in-domain corpora. Our cascade system shows competitive performance compared with known offline systems from industry and academia.


Introduction
In recent years, end-to-end systems and cascade systems have been the two fundamental pipelines for speech translation tasks. A traditional cascade system comprises two consecutive parts: an automatic speech recognition (ASR) model responsible for generating transcripts from audio, and a machine translation (MT) model that translates the ASR outputs from the source language into the target language. Obviously, the ASR part and the MT part of such a system are independent to some extent. Therefore, this paradigm enables people to utilise state-of-the-art ASR models and MT models and to conduct experiments with different permutations and combinations, which can help find the best pairing of ASR and MT models. ASR models like Conformer (Gulati et al., 2020) and S2T-Transformer (Synnaeve et al., 2019) are commonly used, and MT models like Transformer (Vaswani et al., 2017) can be considered a standard configuration.
On the other hand, cascade systems also have a disadvantage. The main issue is that some important information, such as the intonation and emphasis of speakers, cannot be explicitly expressed in the transcripts. This "missing information" might be the key to distinguishing the gender of the speaker, or the sarcasm and symbolism behind the text. In other words, there is a risk of losing important information in a cascade system. Correspondingly, an end-to-end system retains the competitive edge of learning this "missing information", because it is trained directly on the speech-to-text dataset without any intermediate step. Due to this property, end-to-end systems have attracted attention in research, with encouraging progress; for instance, Conformer (Gulati et al., 2020) can also be used in this task. However, end-to-end systems have their own disadvantages. Firstly, due to the lack of large-scale, high-quality bilingual speech translation datasets, training a productive end-to-end ST model can be nontrivial. Secondly, the mapping from the speech space to the target language space is far more difficult than the mapping to the source language space, leading to a greater demand on the scale of the training set. This paper presents our work in the IWSLT 2022 (Anastasopoulos et al., 2022) offline speech translation track. The main contributions of this paper can be summarized as follows: 1) We tested various combinations of ASR models, and found that ensembling Conformer and S2T-Transformer and filtering by U2 can improve ASR fluency and sentence expression.
2) Context-aware LM reranking can effectively improve the likelihood of choosing the best candidate in beam search.

Data Preparation and Preprocessing
Five different datasets are used for training our ASR and ST models: MuST-C V2 (Cattoni et al., 2021), LibriSpeech (Panayotov et al., 2015), TED-LIUM 3 (Hernandez et al., 2018), CoVoST and IWSLT, as described in the left sub-plot of Figure 1. For the training data, we first extracted 80-dimensional filter bank features from the raw waveform. Then, the dataset was cleaned in a fine-grained process: the training set was filtered on the criteria of absolute frame count (within 50 to 3000), number of tokens (within 1 to 150) and speed of the speech (within µ(τ) ± 4 × σ(τ), where τ = #frames / #tokens). Detailed attributes such as the number of utterances and the duration of the training datasets are shown in Table 1. For the test set, each TED talk was segmented into utterances (no more than 20 seconds each) with the officially provided segmentation tool (LIUM_SpkDiarization.jar).
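The filtering criteria above can be sketched as follows. This is a minimal illustration, not the paper's actual tooling: the function name and the utterance representation (dicts with frame and token counts) are our own assumptions.

```python
import statistics

def filter_utterances(utts, min_frames=50, max_frames=3000,
                      min_tokens=1, max_tokens=150, n_sigma=4):
    """Keep utterances within the frame/token limits and whose speed
    ratio tau = #frames / #tokens lies within mu +/- n_sigma * sigma."""
    ratios = [u["frames"] / u["tokens"] for u in utts]
    mu, sigma = statistics.mean(ratios), statistics.pstdev(ratios)
    kept = []
    for u, tau in zip(utts, ratios):
        if not (min_frames <= u["frames"] <= max_frames):
            continue  # absolute frame-count filter
        if not (min_tokens <= u["tokens"] <= max_tokens):
            continue  # token-count filter
        if not (mu - n_sigma * sigma <= tau <= mu + n_sigma * sigma):
            continue  # speech-speed outlier filter
        kept.append(u)
    return kept
```

Note that µ and σ are computed over the whole corpus before any item is dropped, so a single pass over the ratios suffices.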
We use exactly the same corpora to train our MT models, following the configuration of Wei et al. (2021), with the scale of the datasets shown in Table 2.

Automatic Speech Recognition
Three types of basic ASR models are used to recognize the speech and obtain transcripts: Conformer (Gulati et al., 2020), S2T-Transformer (Synnaeve et al., 2019) and U2 (Zhang et al., 2020). The first two are standard autoregressive ASR models built upon the Transformer architecture (Vaswani et al., 2017). The last one is a unified model that can perform both streaming and non-streaming ASR, supported by a dynamic chunking training strategy (Zhang et al., 2020). During training and decoding, we use three important strategies to generate the ASR results of these models, as follows.
Domain controlled training and decoding By observing the corpora in the training set, we find that the style of the text and the domain of the speech can differ between datasets. Although the model is able to learn such differences implicitly, some confusing patterns, like case sensitivity and the presence of punctuation, cannot be easily learned. Therefore, we add a domain tag as the prefix token, acting as a known condition that guides the model to generate text in the required domain and style; that is, the model can learn the pattern given more prior knowledge. For example, the tag "<MC>" instructs the model to generate text in the MuST-C style, while "<LS>" makes the model generate LibriSpeech-like transcripts. This strategy also had a positive effect in our offline task submission to IWSLT 2021. For Conformer and S2T-Transformer, since they are autoregressive generative models, we simply use the domain tag as the prefix token. However, this is not feasible for U2 with its CTC decoder. Therefore, we propose to first encode the domain tag with the input embedding of the attention-based decoder of U2, and then add the encoded tag to the down-sampled features element-wise before feeding them together into the attention layers of the encoder.
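As a minimal sketch of the prefix-tag mechanism for the autoregressive models, assuming a token-list representation: only the tag strings "<MC>" and "<LS>" come from the paper, while the helper names and the dataset keys are hypothetical.

```python
# Domain tags as prefix tokens (tag strings from the paper; keys are ours).
DOMAIN_TAGS = {"mustc": "<MC>", "librispeech": "<LS>"}

def add_domain_tag(target_tokens, domain):
    """Training time: prepend the domain tag to the target token sequence."""
    return [DOMAIN_TAGS[domain]] + target_tokens

def seed_decoder(domain, bos="<s>"):
    """Decoding time: force the tag right after BOS, so autoregressive
    generation is conditioned on the desired domain/style."""
    return [bos, DOMAIN_TAGS[domain]]
```

The other datasets would get analogous tags; for U2, the tag is instead embedded and added to the down-sampled encoder features as described above.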
Context-aware LM reranking In order to benefit from both Conformer and S2T-Transformer, which have different model architectures, we ensemble them by averaging the predicted probabilities during generation. However, the ensemble does not solve a key problem arising from the independence assumption on each utterance. In other words, we translate each utterance in a TED talk independently, without considering context information, which often causes inconsistent predictions on named entities such as person names. To this end, we adopt a language model (LM) to rerank beam candidates conditioned on a fixed-length window of previously generated contexts.
Specifically, a Transformer-LM was trained on the WMT21 monolingual English dataset, providing the perplexity score of each ASR beam candidate from the ensembled models by taking the N previously generated sentences into account (N = 3 obtains the best result). This method is commonly used to optimize document-level translation. A detailed description is presented in Algorithm 1 (Context-aware LM reranking; inputs: ASR model ϕ, LM Q, context length N, beam size k, utterance list U) and in the right sub-plot of Figure 1; it effectively performs context-aware greedy search at the sentence level. Besides the PPL (converted to a log probability) estimated by the LM, we also take into account the log probability of each beam candidate output by the ASR models, combining them with a weighted sum (best combination found in our experiments: w_LM = 0.6, w_ASR = 0.4).
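The reranking loop can be sketched as below. This is a simplified illustration rather than the authors' implementation: `rerank`, `lm_logprob` and the beam representation are hypothetical names, and the default weights are the best values reported (w_LM = 0.6, w_ASR = 0.4).

```python
def rerank(utterance_beams, lm_logprob, n_ctx=3, w_lm=0.6, w_asr=0.4):
    """Context-aware greedy selection over beam candidates.

    utterance_beams: per utterance, a list of (text, asr_logprob) candidates.
    lm_logprob(context, text): LM log-probability of text given the list of
        previously selected sentences (i.e. log PPL converted to log-prob).
    """
    context, selected = [], []
    for beams in utterance_beams:
        # Score each candidate as a weighted sum of LM and ASR log-probs,
        # conditioning the LM on the last n_ctx selected sentences.
        best_text, _ = max(
            beams,
            key=lambda c: w_lm * lm_logprob(context[-n_ctx:], c[0])
                          + w_asr * c[1],
        )
        selected.append(best_text)
        context.append(best_text)  # the winner becomes new context
    return selected
```

Because each utterance's winner immediately joins the context for the next one, this behaves like greedy search at the sentence level, as the paper notes.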
Ensemble based robustness enhancement strategy Comparing the ASR results generated by the different ASR models, an interesting pattern can be found: U2 tends to predict blank lines when faced with hard samples. Hard samples, such as laughter and applause, often confuse S2T-Transformer and Conformer, making them more likely to produce incorrect output. For instance, S2T-Transformer always outputs "Thank you very much indeed." and Conformer generates "There's many a slip, twixt cup and the lip." when the input audio contains only the applause of the audience. This phenomenon can be explained by U2 being more robust to interference than S2T-Transformer and Conformer. Consequently, U2 can be utilised to filter the noise from the ASR results of Conformer and S2T-Transformer: we use the blank lines in U2's predictions as the standard to correct the results of the other two models. This process makes our system more robust to non-speech segments and background noise.
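A minimal sketch of this filtering step, assuming the systems' outputs are aligned lists of per-segment hypotheses (the function name is ours):

```python
def filter_with_u2(ensemble_hyps, u2_hyps):
    """Blank out ensemble hypotheses wherever U2 predicts an empty line,
    treating those segments as non-speech (applause, laughter, music)."""
    return ["" if u2.strip() == "" else hyp
            for hyp, u2 in zip(ensemble_hyps, u2_hyps)]
```
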

Machine Translation
In a cascade system, the input to the machine translation (MT) model is the ASR output. To obtain the translation results, we use the WMT21 news corpora to train an individual MT model for each language pair (En-De, En-Zh, En-Ja). These MT models are then fine-tuned on the combination of the MuST-C and IWSLT datasets. The final results, also called hypotheses, are obtained by applying the MT models to the ensembled ASR results described above.

Multilingual E2E-ST
In the end-to-end system, separate ASR and machine translation models trained on bilingual corpora are no longer constituents of the system; the E2E model can be trained directly on bilingual/multilingual speech corpora. However, only MuST-C and CoVoST provide translations for some language pairs, which might not be enough. Therefore, we propose to use the MT models to generate translations in each target language for all ASR training corpora, and then combine them together with the ASR (English) text, tagged with domain and language abbreviations like "<MC_en>", "<LS_zh>", etc. This is commonly considered sequence-level knowledge distillation (KD) (Kim and Rush, 2016). Next, a multilingual speech translation (ST) model is trained on these corpora, which can then perform both ASR and translation in an end-to-end paradigm given the required language and domain tag.
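The KD-based data construction could be sketched as follows; `make_multilingual_st_corpus` and the corpus representation are hypothetical, and `mt_translate` stands in for the fine-tuned MT models.

```python
def make_multilingual_st_corpus(asr_corpus, mt_translate,
                                langs=("en", "de", "zh", "ja")):
    """For each (audio, transcript, domain) triple, emit one training example
    per target language: the English transcript for "en" (the ASR task) and
    an MT-distilled translation otherwise, tagged like <MC_en> or <LS_zh>."""
    examples = []
    for audio, text, domain in asr_corpus:
        for lang in langs:
            target = text if lang == "en" else mt_translate(text, lang)
            examples.append((audio, f"<{domain}_{lang}> {target}"))
    return examples
```

At inference time, choosing the tag selects whether the model transcribes or translates, and into which domain style.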

Settings
Model Configurations Sentencepiece (Kudo and Richardson, 2018) is utilised for tokenization of the ASR texts, with a learned vocabulary restricted to 20,000 sub-tokens. The ASR models are configured as follows: n_encoder_layers = 16, n_decoder_layers = 6, n_heads = 16, d_hidden = 1024, d_FFN = 4096 for Conformer; n_encoder_layers = 12, n_decoder_layers = 6, n_heads = 16, d_hidden = 1024, d_FFN = 4096 for S2T-Transformer; and n_encoder_layers = 12, n_decoder_layers = 6, n_heads = ... for U2.

Figure 1: The training of our ASR models (left) and the inference of our cascade system (right). In the inference example, input features and domain tags are fed into the ASR models, decoded by the ensemble of Conformer and S2T-Transformer and cleaned by U2. Then, beam candidates (k = 3 here) are scored together with contexts (6 to 8) by the language model. Finally, the optimal candidate is selected according to the modulated scores and becomes the new context.

During the training of the ASR models, we set the batch size to a maximum of 20,000 frames per card. Inverse sqrt is used for lr scheduling, with warm-up steps set to 10,000 and peak lr set to 5e-4. Adam is used as the optimizer. All ASR models are trained on 8 V100 GPUs for 50 epochs, and the parameters of the last 5 epochs are averaged. Audio features are normalized with utterance-level CMVN for Conformer and S2T-Transformer, and with global CMVN for U2. All audio inputs are augmented with spectral augmentation (Park et al., 2019).
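The inverse-sqrt schedule with warm-up described above can be written as follows (a standard formulation, as used e.g. in fairseq; the function name is ours):

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup=10000):
    """Linear warm-up to peak_lr over `warmup` steps, then decay
    proportional to 1/sqrt(step)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup ** 0.5) / (step ** 0.5)
```
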
We followed the work of Wei et al. (2021) for the pretraining of all NMT models. All of them are fine-tuned on in-domain corpora for 10 steps.

Results
Comparison of ASR models on each individual dataset We tested the three ASR models (Conformer, U2 and S2T-Transformer) on four individual test sets: CoVoST, MuST-C, TED-LIUM and LibriSpeech. In Table 3, Conformer shows the best result in each column, with WERs of 11.27, 6.31, 5.33 and 4.39 on the respective datasets, a significant advantage over the other two models. However, after manually evaluating some samples, we find that Conformer is more prone to over-fitting the training corpora. Therefore, we decided to ensemble it with S2T-Transformer during inference.

Comparison of our approach on past years' test sets
Comparing Table 4 with past years' reports, we find that the strategies used this year provide significant improvements on most of the datasets, demonstrating their effectiveness.
To illustrate the differences between the ASR results of Conformer, S2T-Transformer and U2, we present some representative cases in Table 5. Case 1 shows the sentences generated by the three ASR models given an audio segment that contains only background music and applause. Conformer and S2T-Transformer both output wrong sentences, because nothing should be generated in the decoding process; in contrast, U2 outputs a blank line, which indicates the robustness of the model. Case 2 shows transcripts where Conformer and S2T-Transformer output the correct results, while U2 makes mistakes on uppercase letters and punctuation marks even though the content is generally correct, showing that U2 is not sensitive to case or punctuation. This is actually caused by the multi-modality problem (Gu et al., 2018) faced by all non-autoregressive generation models: since the prediction of each token is modeled independently in U2 (the conditional independence assumption of the CTC decoder), tokens with one-to-many mappings (typically capitalization or the presence of punctuation) are difficult to learn without visible contexts, compared to autoregressive models. Case 3 shows that the results of Conformer and S2T-Transformer contain different errors: Conformer misheard "an ex-boyfriend" as "a next boyfriend", and S2T-Transformer made a mistake on "cuss words". By fixing these different mistakes, we obtain the correct sentence in the ensemble result.
The higher w_LM is, the more the LM contributes to the scoring. The ablation study shows that a context length of 3 is the best choice for reranking, since context lengths of 4 or 5 both yield lower BLEU scores. We suspect that longer contexts often mislead the scoring process due to the unstable estimation of PPL on the beam candidates of the current utterance, resulting in unconvincing reranked results. Meanwhile, we find that the best combination of weights on the LM and ASR is 0.6 and 0.4, indicating that scoring with the LM alone cannot always produce a reliable estimate of sentence quality.
Table 5: Representative ASR cases.

Case 1 (audio contains only background music and applause)
Conformer: There's many a slip, twixt cup and the lip.
S2T-Transformer: Thank you very much indeed.
U2: (blank)

Case 2
Conformer: And I predict that in 10 years, we will lose our bees.
S2T-Transformer: And I predict that in 10 years, we will lose our bees.
U2: and i predict that in ten years we will lose our bees
Ensemble: And I predict that in 10 years, we will lose our bees.

Case 3
Conformer: ... the language that a next boyfriend taught you, where you learned all the cuss words ...
S2T-Transformer: ... the language that an ex-boyfriend taught you, where you learned all the cusp words ...
U2: ... the language that an ex-boy taught you or you learned all the cus words ...
Ensemble: ... the language that an ex-boyfriend taught you, where you learned all the cuss words ...

Performance of Translation models We used the ASR results generated by Conformer on the MuST-C tst-COMMON dataset to measure the performance of two text MT models and an end-to-end ST model, i.e. the MT model pretrained on the WMT news corpora, the in-domain fine-tuned MT model, and our multilingual ST model. The in-domain fine-tuned MT model was trained on the combination of the MuST-C and IWSLT text corpora and provides the best BLEU scores of the three models. This result demonstrates that in-domain fine-tuning is effective for generating reasonable translation hypotheses. On the other hand, the end-to-end multilingual ST model proves competitive, since its results are relatively close to those of the baseline pretrained MT model. More importantly, the E2E ST model was trained only once on the combination of all language pairs, without further fine-tuning on any of them.

Table 6: BLEU scores of the translation models on En-De, En-Zh and En-Ja.

Conclusion
This paper presents our offline speech translation systems for the IWSLT 2022 evaluation. We explored different strategies in the pipelines for building the cascade and end-to-end systems. In data preprocessing, we adopted efficient cleansing approaches to build the training set collected from different data sources. Domain controlled generation was used in the training and decoding of the ASR models to fit the requirements of the evaluation test set. We also investigated the positive effect of context-aware LM reranking, aiming at improving the quality and consistency of ASR outputs. Finally, we demonstrated that the cascade system, consisting of the reranked ASR system and the MT model, performs better than the end-to-end system. In future work, we would like to investigate more strategies for improving the consistency of ASR outputs beyond reranking, as well as better training and data augmentation strategies for end-to-end models.