The NiuTrans End-to-End Speech Translation System for IWSLT 2021 Offline Task

This paper describes the submission of the NiuTrans end-to-end speech translation system for the IWSLT 2021 offline task, which translates English audio directly into German text without intermediate transcription. We use a Transformer-based model architecture and enhance it with Conformer blocks, relative position encoding, and stacked acoustic and textual encoding. To augment the training data, we translate the English transcriptions of the ASR and ST corpora into German. Finally, we employ ensemble decoding to integrate the predictions of several models trained on different datasets. Combining these techniques, we achieve 33.84 BLEU points on the MuST-C En-De test set, which shows the enormous potential of the end-to-end model.


Introduction
Speech translation (ST) aims to learn models that, given speech in the source language, predict the translation in the target language. End-to-end (E2E) approaches have become popular recently for their ability to free designers from cascading different systems and to shorten the translation pipeline (Duong et al., 2016; Berard et al., 2016; Weiss et al., 2017). This paper describes the submission of the NiuTrans E2E ST system for the IWSLT 2021 offline task (Anastasopoulos et al., 2021), which translates English audio directly into German text without intermediate transcription.
Our baseline model is a DLCL Transformer (Vaswani et al., 2017) with a Connectionist Temporal Classification (CTC) loss (Graves et al., 2006) on the encoder (Bahar et al., 2019). We enhance it with the Conformer architecture (Gulati et al., 2020), relative position encoding (RPE) (Shaw et al., 2018), and stacked acoustic and textual encoding (SATE) (Xu et al., 2021). To augment the training data, the English transcriptions of the automatic speech recognition (ASR) and speech translation corpora are translated into German. Finally, we employ ensemble decoding to integrate the predictions of multiple models (Wang et al., 2018) trained on different datasets.
This paper is structured as follows. The training data are summarized in Section 2; we then describe the model architecture in Section 3 and data augmentation in Section 4. We present the ensemble decoding method in Section 5. The experimental settings and final results are given in Section 6.

Training Data
Our system is built under the constrained condition. The training data can be divided into three categories: ASR, MT, and ST corpora.

ASR corpora. ASR corpora are used to generate synthetic speech translation data. We only use the Common Voice (Ardila et al., 2020) and LibriSpeech (Panayotov et al., 2015) corpora. Furthermore, we filter the noisy training data in the Common Voice corpus by forced decoding and keep 1 million utterances.

MT corpora. Machine translation (MT) corpora are used to translate the English transcriptions. We use the allowed English-German translation data from WMT 2020 (Barrault et al., 2020) and OpenSubtitles2018 (Lison and Tiedemann, 2016). We filter the bilingual training data with standard rules, including length ratio, language detection, and so on.
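For illustration, a minimal filtering rule in this spirit might look as follows; the thresholds and the langid detector are our assumptions, not the exact tooling used for the submission.

```python
import langid  # one possible language detector; the exact tool is not specified


def keep_pair(src, tgt, max_ratio=1.5):
    """Illustrative bilingual filtering: a length-ratio test plus
    language detection. Thresholds are assumptions."""
    if not src.strip() or not tgt.strip():
        return False
    # Reject pairs whose token-length ratio is implausible.
    ratio = len(src.split()) / max(len(tgt.split()), 1)
    if ratio > max_ratio or ratio < 1.0 / max_ratio:
        return False
    # Keep only pairs detected as English source and German target.
    return langid.classify(src)[0] == "en" and langid.classify(tgt)[0] == "de"
```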
The statistics of the final training data are shown in Table 1. We augment the ST training data by translating the English transcriptions (the details are given in Section 4).

Model Architecture
In this section, we describe the baseline model and our architecture improvements. We then present experimental results to demonstrate their effectiveness.

Baseline Model
Our system is based on the deep Transformer (Vaswani et al., 2017) implemented in the fairseq toolkit (Ott et al., 2019). Furthermore, the dynamic linear combination of layers (DLCL) method is employed to train the deep model effectively (Li et al., 2020a,b). The input acoustic features are processed by convolutional layers that downsample the sequence by a factor of 4 (Weiss et al., 2017). For strong systems, we use Connectionist Temporal Classification (CTC) (Graves et al., 2006) as an auxiliary loss on the encoder (Watanabe et al., 2017; Karita et al., 2019; Bahar et al., 2019). The weight α of the CTC objective is set to 0.3 for all ASR and ST models. The model architecture is shown in Figure 1.
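As a sketch of this multi-task objective, the translation cross-entropy and the encoder-side CTC loss can be interpolated with α = 0.3; the tensor shapes and helper below are illustrative assumptions, not the fairseq implementation.

```python
import torch.nn.functional as F


def joint_loss(decoder_logits, target_tokens, encoder_logits,
               transcript_tokens, input_lengths, transcript_lengths,
               alpha=0.3, blank_id=0):
    """Interpolate the translation cross-entropy with a CTC loss on
    the encoder states (alpha = 0.3, as in our models)."""
    # Cross-entropy over the decoder's next-token predictions.
    ce = F.cross_entropy(
        decoder_logits.view(-1, decoder_logits.size(-1)),
        target_tokens.view(-1))
    # CTC over the encoder states, aligned to the source transcription;
    # encoder_logits has shape (T, B, V).
    ctc = F.ctc_loss(
        F.log_softmax(encoder_logits, dim=-1), transcript_tokens,
        input_lengths, transcript_lengths, blank=blank_id)
    return (1.0 - alpha) * ce + alpha * ctc
```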

Conformer
Conformer (Gulati et al., 2020) models both local and global dependencies by combining convolutional neural networks and Transformers. It has achieved promising results on ASR tasks.
We replace the Transformer blocks in the encoder with Conformer blocks, each of which includes two macaron-like feed-forward networks, a multi-head self-attention module, and a convolution module. Note that we use the RPE proposed by Shaw et al. (2018) rather than that of Transformer-XL (Dai et al., 2019).
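The sketch below shows this block structure (half-step macaron feed-forward modules around self-attention and a depthwise convolution module). For brevity it uses PyTorch's standard absolute self-attention, whereas our encoder uses the RPE of Shaw et al. (2018); the sizes are the baseline values from Section 3.

```python
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, d_ff), nn.SiLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)


class ConvModule(nn.Module):
    """Pointwise GLU -> depthwise convolution -> pointwise projection."""
    def __init__(self, d_model, kernel=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.dw = nn.Conv1d(d_model, d_model, kernel,
                            padding=kernel // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.pw2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                        # x: (B, T, D)
        y = self.norm(x).transpose(1, 2)         # (B, D, T) for Conv1d
        y = F.glu(self.pw1(y), dim=1)            # gated linear unit
        y = F.silu(self.bn(self.dw(y)))
        return x + self.pw2(y).transpose(1, 2)   # residual connection


class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=2048, kernel=31):
        super().__init__()
        self.ffn1 = FeedForward(d_model, d_ff)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model, kernel)
        self.ffn2 = FeedForward(d_model, d_ff)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (B, T, D)
        x = x + 0.5 * self.ffn1(x)               # first macaron half-step
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = self.conv(x)                         # residual added inside
        x = x + 0.5 * self.ffn2(x)               # second macaron half-step
        return self.final_norm(x)
```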

Relative Position Encoding
Because the original self-attention module is non-sequential, the vanilla Transformer employs position embeddings computed by a deterministic sinusoidal function to indicate the absolute position of each input element (Vaswani et al., 2017). However, this scheme is far from ideal for acoustic modeling (Pham et al., 2020). Recent work (Pham et al., 2020; Gulati et al., 2020) points out that relative position encoding enables the model to generalize better to unseen sequence lengths, yielding significant improvements on acoustic modeling tasks. We reimplement the relative position encoding scheme of Shaw et al. (2018). The maximum relative position is set to 100 for the encoder and 20 for the decoder. We use absolute and relative positional representations simultaneously.
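A minimal sketch of the Shaw et al. (2018) scheme we reimplement: each query-key offset, clipped to the maximum relative position, indexes a learned embedding whose dot product with the query is added to the attention logits. The bias-only formulation and shapes here are simplifying assumptions.

```python
import torch
import torch.nn as nn


class RelativePositionBias(nn.Module):
    """Learned relative-position keys with clipped offsets,
    following Shaw et al. (2018)."""
    def __init__(self, d_head, max_rel=100):   # 100 encoder, 20 decoder
        super().__init__()
        self.max_rel = max_rel
        self.emb = nn.Embedding(2 * max_rel + 1, d_head)

    def forward(self, q):                       # q: (B, H, T, d_head)
        T = q.size(2)
        pos = torch.arange(T, device=q.device)
        rel = pos[None, :] - pos[:, None]       # offset k - i per (i, k)
        rel = rel.clamp(-self.max_rel, self.max_rel) + self.max_rel
        r = self.emb(rel)                       # (T, T, d_head)
        # Relative term added to QK^T before scaling and softmax.
        return torch.einsum('bhid,ijd->bhij', q, r)
```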

Stacked Acoustic and Textual Encoding
Previous work (Bahar et al., 2019) places the CTC loss on the top layer of the encoder, which forces the encoder to learn soft alignments between speech and transcription. However, the CTC loss shows a strong preference for locally attentive models, which is inconsistent with what the ST model needs (Xu et al., 2021).
In our systems, we use the stacked acoustic and textual encoding (SATE) method (Xu et al., 2021) to encode the speech features. It computes the CTC loss on the hidden states of an intermediate layer rather than the top layer. The layers below the CTC loss extract acoustic representations like an ASR encoder, while the upper layers encode global representations for translation. An adaptor layer is introduced to bridge the acoustic and textual encoding.
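A structural sketch of this encoder, assuming a generic layer constructor; the adaptor here is a bare linear layer, which simplifies the actual design of Xu et al. (2021).

```python
import torch.nn as nn


class SATEEncoder(nn.Module):
    """Stacked acoustic-and-textual encoding: CTC on the acoustic
    stack's output, an adaptor bridging the two stacks, and a
    textual stack encoding for translation."""
    def __init__(self, layer_fn, n_acoustic=8, n_textual=4,
                 d_model=256, vocab_size=10000):
        super().__init__()
        self.acoustic = nn.ModuleList(layer_fn() for _ in range(n_acoustic))
        self.textual = nn.ModuleList(layer_fn() for _ in range(n_textual))
        self.ctc_proj = nn.Linear(d_model, vocab_size)  # CTC head
        self.adaptor = nn.Linear(d_model, d_model)      # simplified bridge

    def forward(self, x):
        for layer in self.acoustic:
            x = layer(x)
        ctc_logits = self.ctc_proj(x)   # CTC loss on intermediate states
        h = self.adaptor(x)             # bridge acoustic -> textual
        for layer in self.textual:
            h = layer(h)
        return h, ctc_logits
```

Here `layer_fn` could be, for instance, `lambda: ConformerBlock(d_model=256)`.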

Experimental Results
We use the architecture described in Section 3.1 as the baseline model. The encoder consists of 12 layers and the decoder of 6 layers. Each layer has 256 hidden units, 4 attention heads, and a feed-forward size of 2048. The encoder of SATE includes an acoustic encoder of 8 layers and a textual encoder of 4 layers. The model is trained on the MuST-C English-German dataset, and we evaluate on the tst-COMMON set with SacreBLEU. The other experimental details are given in Section 6. We report the results after applying each architecture improvement in Table 2. Benefiting from the power of the deep Transformer, our baseline model achieves 23.98 BLEU points. The Conformer and RPE methods strengthen the encoding and yield improvements of 0.45 and 0.26 BLEU points, respectively. SATE achieves a remarkable improvement by encoding the acoustic and textual representations separately. We will explore better architecture designs in the future.

Data Augmentation
A large amount of training data is necessary for a strong neural model. However, unlike in the ASR and MT tasks, annotated speech-to-translation data is scarce, which makes it hard to train ST models well. This is the main reason why cascaded systems remain the dominant approach in industrial scenarios. In this section, we describe our data augmentation method.
We train a deep DLCL Transformer with 25 encoder layers on all available MT data. To keep the domain consistent with the original ST data, we fine-tune the MT model on the MuST-C dataset. The model achieves a SacreBLEU score of 35.89 on the MuST-C tst-COMMON test set. For the case-insensitive LibriSpeech dataset, we train a similar MT model, except that the source text is lowercased and stripped of punctuation during training.
Then, we generate German translations of the English transcriptions in the LibriSpeech and Common Voice ASR datasets. Furthermore, sequence-level knowledge distillation (Kim and Rush, 2016) is applied to augment the training data: we also generate translations for the MuST-C and Speech Translation TED (ST TED) datasets, which are more related to the target domain.
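The generation loop is conceptually simple; in the sketch below, `mt_model.translate` is an assumed interface standing in for beam-search decoding with the fine-tuned MT model.

```python
def synthesize_st_data(asr_corpus, mt_model, beam=5):
    """Turn an ASR corpus of (audio, English transcript) pairs into
    synthetic ST triples by translating the transcripts with the
    fine-tuned MT model (mt_model.translate is an assumed interface)."""
    synthetic = []
    for audio, transcript in asr_corpus:
        german = mt_model.translate(transcript, beam=beam)
        synthetic.append((audio, transcript, german))
    return synthetic
```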
Corrupting the acoustic features is another data augmentation approach, including SpecAugment, speed perturbation, and so on. SpecAugment (Park et al., 2019) is a simple augmentation method applied to the input acoustic features; we apply both time masking and frequency masking in our systems. Speed perturbation transforms the audio by a speed rate, which changes the duration of the audio signal. Limited by GPU resources, we do not use this method; compared with perturbed data, we believe the synthetic samples improve robustness more effectively. All available ST corpora are shown in Table 3.
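A minimal sketch of the two masking operations on a (frames x mel-bins) feature matrix; the mask counts and widths below are illustrative defaults rather than our tuned values.

```python
import torch


def spec_augment(features, n_freq_masks=2, freq_width=27,
                 n_time_masks=2, time_width=100):
    """SpecAugment sketch (Park et al., 2019): zero out random
    frequency bands and time spans of a (frames, mel-bins) tensor."""
    n_frames, n_bins = features.shape
    x = features.clone()
    for _ in range(n_freq_masks):        # frequency masking
        w = int(torch.randint(0, freq_width + 1, (1,)))
        f0 = int(torch.randint(0, max(n_bins - w, 1), (1,)))
        x[:, f0:f0 + w] = 0.0
    for _ in range(n_time_masks):        # time masking
        w = int(torch.randint(0, min(time_width, n_frames) + 1, (1,)))
        t0 = int(torch.randint(0, max(n_frames - w, 1), (1,)))
        x[t0:t0 + w, :] = 0.0
    return x
```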

Ensemble Decoding
Ensemble decoding is an effective method to improve performance by integrating the predictions of multiple models, as repeatedly proven in WMT competitions (Wang et al., 2018). In our systems, we train multiple ST models on different training data for diverse ensemble decoding. The models are chosen based on their performance on the development set. This leads to a significant improvement over a single model.
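At each decoding step, the ensemble averages the per-model next-token distributions before beam search picks candidates. The sketch below assumes a generic decoder interface returning per-step logits; it is not the fairseq ensemble code.

```python
import math

import torch


def ensemble_step(models, encoder_outs, prev_tokens):
    """One step of a log-probability ensemble: average the per-model
    next-token distributions in probability space, return log-probs."""
    log_probs = []
    for model, enc in zip(models, encoder_outs):
        logits = model.decoder(prev_tokens, enc)   # (B, V), assumed API
        log_probs.append(torch.log_softmax(logits, dim=-1))
    # log of the mean probability: logsumexp over models minus log N.
    stacked = torch.stack(log_probs)               # (N, B, V)
    return torch.logsumexp(stacked, dim=0) - math.log(len(models))
```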

Preprocessing
We remove utterances with more than 3,000 frames or fewer than 5 frames. The 80-channel log-mel filterbank features are extracted from the audio files with the torchaudio library. We use lowercased transcriptions without punctuation for the CTC loss computation. We learn a SentencePiece subword segmentation with a vocabulary size of 10,000, shared between source and target, for all datasets.
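A sketch of this pipeline with torchaudio and sentencepiece; the file names are placeholders, and the unigram model type is an assumption.

```python
import sentencepiece as spm
import torchaudio

# 80-dim log-mel filterbank features, as in our preprocessing.
waveform, sample_rate = torchaudio.load("utterance.wav")
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, sample_frequency=sample_rate)
# Frame-count filter from the paper: keep 5 <= frames <= 3000.
assert 5 <= fbank.size(0) <= 3000

# Shared source/target SentencePiece model with a 10k vocabulary;
# "train_text.txt" stands in for the concatenated training text.
spm.SentencePieceTrainer.train(
    input="train_text.txt", model_prefix="spm10000",
    vocab_size=10000, model_type="unigram")
```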

Model Settings
All experiments are implemented with the fairseq toolkit. We use the Adam optimizer and adopt the default learning rate schedule in fairseq. We apply dropout with a rate of 0.1 and label smoothing of 0.1 for regularization. We also set the activation dropout and the attention dropout to 0.1, which strengthens regularization and mitigates overfitting.
We use the best model architecture, which combines all the improvements described in Section 3. The encoder includes an acoustic encoder of 12 Conformer layers and a textual encoder of 6 Transformer layers. The decoder consists of 6 Transformer layers. Each layer has 512 hidden units, 8 attention heads, and a feed-forward size of 2048. Pre-norm is applied to train the deep model. The weight α of the CTC objective for multi-task learning is set to 0.3 for all models. All models are trained for 50 epochs on one machine with 8 NVIDIA 2080Ti GPUs.
During inference, we average the model parameters over the final 10 checkpoints. We use beam search with a beam size of 5 for all models. The coefficient of length normalization is tuned on the development set. We report case-sensitive SacreBLEU (Post, 2018) on the MuST-C tst-COMMON set and the IWSLT tst2019 and tst2020 test sets.
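Checkpoint averaging is a simple parameter-space mean over the saved models; the sketch below assumes plain state_dicts (fairseq checkpoints additionally nest the weights under a "model" key).

```python
import torch


def average_checkpoints(paths):
    """Average model parameters over the given checkpoints
    (we average the final 10)."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}


# e.g. the last 10 of 50 epoch checkpoints:
# averaged = average_checkpoints([f"checkpoint{n}.pt" for n in range(41, 51)])
```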
The organizers provide a segmentation of the test sets and also allow the participants to use their own segmentation. We simply use the segmentation produced by the WebRTC VAD toolkit.
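WebRTC VAD classifies short fixed-size frames of 16-bit mono PCM as speech or non-speech; a segmenter then merges consecutive speech frames into utterances. The sketch below shows only the frame-labeling step; the merging logic is omitted.

```python
import webrtcvad


def speech_frames(pcm_bytes, sample_rate=16000, frame_ms=30,
                  aggressiveness=3):
    """Label fixed 30 ms frames of 16-bit mono PCM as speech or not.
    The aggressiveness setting is an illustrative choice."""
    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 16-bit samples
    labels = []
    for start in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        frame = pcm_bytes[start:start + frame_bytes]
        labels.append(vad.is_speech(frame, sample_rate))
    return labels
```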

Experimental Results
First, we train the model on all training corpora, including real and synthetic speech-to-translation paired data. As shown in Table 4, we achieve a high BLEU score on the tst-COMMON test set, but low performance on the tst2019 test set compared with previous work (Gaido et al., 2020). A possible reason is that the data distributions of the IWSLT test sets and the synthetic data differ.
Table 4: BLEU of the model trained on all corpora: tst-COMMON 32.65, tst2019 14.16.

To verify this assumption, we pick subsets of the available datasets for training: MuST-C and ST TED from the real corpora, and MuST-C and LibriSpeech from the synthetic corpora. We present the results in Table 5. Although the performance on the tst-COMMON test set drops by 0.8 BLEU points, the model achieves reasonable performance on the tst2019 test set. Furthermore, we fine-tune the model on the MuST-C dataset with a small learning rate, which yields a slight improvement.

We train multiple models with different training data for diverse ensemble decoding. We randomly select parts of the synthetic corpora and mix them with all of the real training data. Finally, we use ensemble decoding with 6 models for the final results and achieve a substantial improvement over a single model. As shown in Table 6, we reach 33.84 BLEU points on the MuST-C En-De tst-COMMON set.
The best end-to-end system of last year achieved 20.1 BLEU points on the tst2019 test set and 21.49 BLEU points on the tst2020 test set with the given segmentation. We improve on these results by 2.58 and 0.31 BLEU points, respectively, which demonstrates the strength of our systems.
Two references are available for the tst2021 test set. The TED reference is the original one from the TED website. Since new regulations for the official TED translations lead to translations that are much shorter, the organizers created a second reference translation, called the IWSLT reference. The final results are based on both references. We achieve better performance with our own segmentation on the TED reference, which is consistent with the results on the previous test sets. However, the results with our own segmentation are worse on the IWSLT reference. A possible reason is that we did not optimize the segmentation tool for the IWSLT test sets. We will explore better segmentation methods in the future.

Conclusion
This paper describes the submission of the NiuTrans E2E ST systems for the IWSLT 2021 offline task, which translate English audio directly into German text without intermediate transcription. We build our final submissions along two main lines:

• Model architecture improvements for the speech translation task.

• Data augmentation by translating the English transcriptions into German.
We also find that the distribution of the training data has a great impact on performance, and we alleviate this issue by ensemble decoding. Using the given segmentation, we achieve remarkable improvements over the best end-to-end system of last year.