The Volctrans Neural Speech Translation System for IWSLT 2021

This paper describes the systems submitted to IWSLT 2021 by the Volctrans team. We participate in the offline speech translation and text-to-text simultaneous translation tracks. For offline speech translation, our best end-to-end model improves over the benchmark by 7.9 BLEU on the MuST-C test set and approaches the results of a strong cascade solution. For text-to-text simultaneous translation, we explore the best practice for optimizing the wait-k model. As a result, our final submitted systems exceed the benchmark by around 7 BLEU in the same latency regime. We release our code and models to facilitate both future research and industrial applications.


Introduction
This paper describes the neural speech translation systems submitted to IWSLT 2021 by the Volctrans team (also known as the team from ByteDance AI Lab), including cascade and end-to-end speech translation (ST) systems for the offline ST track and a simultaneous neural machine translation (NMT) system. We aim at finding the best practice for these two tracks.
For offline ST, the cascaded system often outperforms the fully end-to-end approach. Recent studies on fully end-to-end approaches obtain promising results and attract a lot of interest. Last year's results showed that an end-to-end model can achieve even better performance (Ansari et al., 2020) than its cascaded competitors. However, those end-to-end models benefit from pre-training (Bansal et al., 2019; Stoian et al., 2020; Alinejad and Sarkar, 2020) and data augmentation techniques (Jia et al., 2019), while the cascaded systems they are compared with are not strong enough. Hence, in this paper, we optimize the speech translation model in two aspects. First, we build a strong cascade competitor that adopts the best practice from WMT evaluation campaigns, such as back translation (Sennrich et al., 2016a) and ensembling. Second, we explore various self-supervised learning methods and introduce as much semi-supervised data as possible to find the best practice for training end-to-end ST models. In our settings, ASR data, MT data, and monolingual text data are all used in a progressive training framework. The results are very promising: the final performance on the MuST-C test set surpasses the end-to-end baseline by 7.9 BLEU, while still lagging behind our cascade model by 1.5 BLEU. This is not surprising, since some well-optimized methods for MT, such as back translation, cannot be easily applied to ST. However, our experience shows that external data can effectively close the gap between end-to-end models and cascade models.
1 Code and models are available at https://github.com/bytedance/neurst/tree/master/examples/iwslt21
In parallel, we also participate in the simultaneous NMT track, which requires translating in real time. Our system is based on an efficient wait-k model (Elbayad et al., 2020). We investigate large-scale knowledge distillation (Kim and Rush, 2016; Freitag et al., 2017) and back translation methods. Specifically, we develop a multi-path training strategy, which enables a unified model to serve different wait-k paths. Our target is to obtain the best translation quality at different latency levels.
The remainder of the paper proceeds as follows. Sections 2 and 3 describe our cascade and end-to-end systems, respectively. Section 4 presents the implementation of the simultaneous NMT models. Each section starts with the training sources and how we synthesize large-scale data, and then gives details about the model structure and the techniques for training and inference. We con-
2 Cascaded Speech Translation

Automatic Speech Recognition
The ASR model is transformer-like and is trained on paired speech and transcript data.

Datasets and Preprocessing
We divide the allowed ASR datasets into two parts, clean and noisy. We consider MuST-C, LibriSpeech (Panayotov et al., 2015), and Mozilla Common Voice as the clean datasets and use them to train an ASR system that filters the noisy part, i.e., iwslt-corpus and TED-LIUM 3 (Hernandez et al., 2018). We remove the training samples for which the word error rate (WER) between the ASR output and the English transcript exceeds 75%. The statistics of the ASR datasets are shown in Table 1.
For model training, we extract 80-channel log Mel-filterbank coefficients from the audio input with windows of 25 ms and steps of 10 ms. The transcripts are lowercased and all punctuation marks are removed. Then, we apply the Moses tokenizer and byte pair encoding (BPE) (Sennrich et al., 2016b) to the transcripts with 8,000 merge operations.
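The WER-based filtering step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names (`edit_distance`, `wer`, `filter_noisy`) are hypothetical, and it assumes whitespace-tokenized, already normalized transcripts and the 75% threshold quoted above.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of a hypothesis against a reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    return edit_distance(ref, hyp) / len(ref)


def filter_noisy(samples, threshold=0.75):
    """Keep (transcript, asr_output) pairs whose WER does not exceed the threshold."""
    return [(t, h) for t, h in samples if wer(t, h) <= threshold]
```

In practice the ASR output would come from the model trained on the clean datasets; here it is simply passed in as a string.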

End-to-End ASR Model
We refer to the recent progress of transformer-based ASR (Dong et al., 2018; Karita et al., 2019) and implement the speech transformer model, as illustrated in Figure 1 a)  2, each of which is followed by a layer normalization and ReLU activation. The major architecture is the same as the transformer model, including 12 layers for the encoder and 6 layers for the decoder. The model width is 768, and the hidden size of the feed-forward layer is 3,072. The number of attention heads is set to 12 for both self-attention and cross-attention. To train the model, we use the Adam optimizer (Kingma and Ba, 2015) and set the warmup steps to 25,000. Empirically, we scale up the learning rate by 5.0 to accelerate convergence. The ASR model is trained on 8 NVIDIA Tesla V100 GPUs with 320,000 frames per batch. For GPU memory efficiency, we truncate the audio frames to 3,000 and remove training samples whose transcript length exceeds 120. To further improve the performance, we apply the SpecAugment technique (Park et al., 2019) with frequency masking (mF = 2, F = 27) and time masking (mT = 2, T = 70, p = 0.2).
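As an illustration of the SpecAugment settings quoted above, the following sketch applies frequency and time masking to a feature matrix stored as plain Python lists. It is not the authors' implementation: the function name `spec_augment` is hypothetical, masked values are assumed to be set to zero, and p is interpreted, as in Park et al. (2019), as an upper bound on the time-mask width relative to the number of frames.

```python
import random


def spec_augment(features, rng, m_f=2, F=27, m_t=2, T=70, p=0.2):
    """Return a masked copy of `features` (a list of frames, each a list of channels)."""
    num_frames = len(features)
    num_channels = len(features[0]) if features else 0
    out = [list(frame) for frame in features]

    # Frequency masking: zero out m_f bands of at most F consecutive channels.
    for _ in range(m_f):
        width = rng.randint(0, min(F, num_channels))
        start = rng.randint(0, num_channels - width)
        for frame in out:
            for c in range(start, start + width):
                frame[c] = 0.0

    # Time masking: zero out m_t spans of at most min(T, p * num_frames) frames.
    max_width = min(T, int(p * num_frames))
    for _ in range(m_t):
        width = rng.randint(0, max_width)
        start = rng.randint(0, num_frames - width)
        for i in range(start, start + width):
            out[i] = [0.0] * num_channels
    return out
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.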

Neural Machine Translation
All MT models are based on transformer (Vaswani et al., 2017). We employ data augmentation and model ensemble techniques to improve the final performance.

Datasets and Preprocessing
We utilize English-German (EN-DE) parallel sentences from WMT 2020 6, OpenSubtitles 2018, MuST-C, and iwslt-corpus for training. We filter the parallel corpora following the rules listed in , with a much stricter constraint on word alignment. Additionally, we randomly select 10% of the sentences separately from both sides of the original WMT and OpenSubtitles corpora for data augmentation (see below), along with the transcripts in the ASR datasets described in Section 2.1.
As for text preprocessing, we apply Moses tokenizer and BPE with 32,000 merge operations on each side.

Tagged Back-Translation
Back-translation (Sennrich et al., 2016a) is an effective way to improve the translation quality by leveraging a large amount of monolingual data and has been widely used in WMT evaluation campaigns. In our setting, we add a "<BT>" tag to the source side of back-translated data to prevent overfitting on the synthetic data, which is also known as tagged back-translation (Caswell et al., 2019; Marie et al., 2020).
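A minimal sketch of how tagged back-translated pairs might be mixed with genuine parallel data is shown below. The helper names are hypothetical, and placing the tag as the first source token is an assumption; the text only states that the tag is added to the source side.

```python
BT_TAG = "<BT>"  # tag string taken from the text


def tag_source(source, tag):
    """Prepend a tag token to a source sentence (tag position is an assumption)."""
    return f"{tag} {source}"


def build_training_pairs(parallel, back_translated):
    """Combine genuine pairs with tagged synthetic pairs.

    Both arguments are lists of (source, target) tuples; for back-translated
    pairs the source is machine-generated and the target is genuine text.
    """
    pairs = list(parallel)
    pairs.extend((tag_source(src, BT_TAG), tgt) for src, tgt in back_translated)
    return pairs
```

The same helper could be reused with a different tag string for other kinds of synthetic or irregular input.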

Knowledge Distillation
Sequence-level knowledge distillation (Kim and Rush, 2016; Freitag et al., 2017) is another useful technique to improve performance. In this way, we enlarge the training data by translating English sentences to German using a good teacher model.

ASR Output Adaptation
Traditionally, the output of ASR systems is lowercased with no punctuation marks, while MT systems receive natural texts. In our system, we attempt to make the MT systems robust to these irregular texts. A simple way to do so is to apply the same normalization to the source side of the MT training set. However, our empirical study shows that this degrades performance. Inspired by the tagged back-translation method, we instead enhance the regular MT models with transcripts from both the ASR systems and the ASR datasets, as illustrated in Figure 1 b). An extra tag "<ASR>" indicates the irregular input. Note that the usual way to bridge the gap between the ASR output and the MT input involves additional sub-systems, such as case and punctuation restoration. In our cascade system, we prefer to use fewer sub-systems, and we leave a detailed comparison to future work.

Data Combination and Sampling Strategy
We train transformer models with different combinations of datasets because increasing the models' diversity can benefit the model ensemble. The detailed setups are listed in Table 2. We over-sample the in-domain datasets (i.e., MuST-C/iwslt-corpus-related portions) to improve the in-domain performance. Specifically, to control the ratio of samples from different data sources, we sample a fixed number of sentences proportional to $\left(N_s / \sum_{s'} N_{s'}\right)^{1/T}$, where $N_s$ is the number of sentences from data source $s$, and the sampling temperature $T$ is set to 5. Note that MT#1 is trained on lowercased source texts without punctuation marks, while MT#2-5 use the tagged transcripts.
6 http://www.statmt.org/wmt20/translation-task.html, including Common Crawl,
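The temperature-based sampling rule, with weights proportional to (N_s / sum of all N_s) raised to the power 1/T, can be sketched as follows. The function name and the example corpus sizes are hypothetical, and renormalizing the weights so that they sum to one is an assumption about how "proportional to" is turned into sampling ratios.

```python
def sampling_proportions(sizes, temperature=5.0):
    """Map each data source to its share of sampled sentences.

    sizes: dict from source name to number of sentences N_s.
    Weights are (N_s / sum N_s) ** (1 / T), renormalized to sum to 1.
    """
    total = sum(sizes.values())
    weights = {s: (n / total) ** (1.0 / temperature) for s, n in sizes.items()}
    norm = sum(weights.values())
    return {s: w / norm for s, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical sizes: a small in-domain set and a large out-of-domain set.
    sizes = {"in_domain": 1_000, "out_of_domain": 1_000_000}
    print(sampling_proportions(sizes, temperature=1.0))
    print(sampling_proportions(sizes, temperature=5.0))
```

With T = 1 the shares equal the raw size ratios; a larger T flattens the distribution, so small in-domain sources are over-sampled relative to their size.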

Model Setups
We follow the transformer big setting, except that
• we deepen the encoder to 16 layers;
• the dropout rate is set to 0.15;
• the model width is changed to 768, the hidden size of the feed-forward layer is 3,072, and the number of attention heads is 12 for MT#5 only.
We use the Adam optimizer with the same learning rate schedule as Vaswani et al. (2017). All models are trained with a global batch size of 65,536.

Inference
We average the latest 10 checkpoints of a single training process for all the above experiments. During inference, the "<ASR>" tag is added to the front of the ASR output. The beam width is set to 10 for both ASR and MT tasks.
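Checkpoint averaging can be illustrated with the following sketch, which takes the element-wise mean of parameters across checkpoints. It is a simplified stand-in rather than the actual NeurST routine: each checkpoint is assumed to be a dict from a parameter name to a flat list of floats, and the function name is hypothetical.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameters over a list of checkpoints.

    Each checkpoint is a dict: parameter name -> flat list of floats.
    All checkpoints are assumed to share the same names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged
```

Averaging the latest 10 checkpoints then amounts to calling `average_checkpoints(all_checkpoints[-10:])` on a chronologically ordered list.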

End-to-End Speech Translation
Recent studies show that the fully end-to-end solution achieves promising performance when compared with the cascaded models (Ansari et al., 2020). This section will introduce how we build our end-to-end models for the offline ST task.

Training Data
The end-to-end model is trained on paired speech and translation data. We collect MuST-C and iwslt-corpus (after the filtering described in Section 2), with a total of only 681 hours of transcribed and translated speech. To address the data scarcity problem, we use knowledge distillation to augment the data by leveraging ASR datasets and MT models, also known as pseudo labeling. In detail, we distill from four MT models: MT#1, MT#2, an ensemble of MT#3-5, and MT#3-R2L, which is trained with the same setting as MT#3 but generates the target translations from right to left. We filter out augmented samples with bad alignment scores in the same way as in the MT data filtering. The statistics of the training data are shown in Table 3. Moreover, two additional copies of the original and the augmented training data are created by changing the speed to 110% and 90% of the original rate, which yields a 3-fold training set.
Table 2: The statistics of MT datasets after data filtering and the detailed combination modes of datasets for different MT models (MT#1-5). The MT#1 setting is used for training both DE→EN and EN→DE directions. "P" denotes the parallel corpus. "BT" is the back-translated data using MT#1 (DE→EN). "SR" indicates the irregular data from both ASR datasets and the ASR model. "KD" is the synthetic data generated by MT#2.

Speech Transformer for End-to-End ST
As a baseline system, the model architecture and training configurations are the same as those of the end-to-end ASR model in our cascade system, except for the learning rate, which is scaled up by 3.0 for ST. We initialize the feature extractor and encoder from the corresponding components of the ASR model.
We keep the cases and punctuation marks on the target side and apply Moses tokenizer and BPE to the translations with 32,000 merge operations.

Progressive Multi-task Learning
Inspired by the multi-task learning framework for ST and the progressive training strategy (Ye et al., 2021), we introduce PMTL-ST, a progressive multi-task learning framework for speech translation, which can leverage additional ASR and MT data for training. As illustrated in Figure 2 a), the encoder accepts both audio and text inputs. We add a modality embedding to the representation to indicate whether the input is audio or text before passing it to the shared transformer encoder. For decoding, we use "<EN>" and "<DE>" tokens to make the decoder compatible with both ASR and translation (MT/ST) tasks, as shown in Figure 2 b)/c).
For progressive training, we separately train an ASR model and an MT model via different branches in Figure 2. Then, we initialize the feature extractor and the audio modality embedding from the ASR model, and the rest of the model parameters are initialized by the MT model. The final model is trained jointly with ASR, MT, and ST.
All other training configurations, such as batch size and learning rate, are the same as the corresponding single task described before. Additionally, for the PMTL-ST models, we jointly learn the

Fbank2vec
Inspired by the recent progress in speech representation learning, such as wav2vec 2.0 (Baevski et al., 2020), we introduce an fbank2vec network to learn contextualized audio representations from log Mel-filterbank features, as shown in Figure 3.

Convolutional Feature Encoder
The encoder consists of two blocks containing a convolution followed by layer normalization and a GELU activation (Hendrycks and Gimpel, 2016). The convolution in each block has 512 channels with 3×3 kernel and stride size 2.

Relative Positional Encoding
We use a group convolution layer to model the relative positional embeddings, as Baevski et al. (2020) do. The kernel size is 128, and the number of groups is 16.

Contextualized Encoder
The final contextualized audio representations are generated by several transformer encoder blocks. In our setting, we stack 6 post-norm transformer layers, and the inner activation function of the feed-forward layers is GELU. In turn, the number of shared encoder layers in Figure 2 is changed to 6. We insert the fbank2vec network in front of the feature extractor. The feature extractor further reduces the dimension of the audio representations with one convolution layer with 5×5 kernel and stride size 2. The number of channels is kept the same as the dimension of the fbank2vec output.
We experiment with two setups, fbank2vec-768 and fbank2vec-512. The fbank2vec-768 setup means that
• the dimension of the fbank2vec output is 768;
• inside the contextualized encoder, the hidden size of the feed-forward layers is 3,072, and the number of self-attention heads is 12.
For fbank2vec-512, the numbers are 512, 2,048, and 8, respectively. Note that the fbank2vec module is pretrained on an ASR task and the overall model follows the progressive multi-task learning framework, so the configurations of the word embeddings, the shared encoder, and the decoder vary accordingly.

Simultaneous Translation
This section describes our submissions to the text-to-text simultaneous speech translation track for English to German (EN2DE) and English to Japanese (EN2JA). For versatility, we adopt identical methods for these two language pairs.

Training Data
The training data for EN→DE is from the MuST-C, OpenSubtitles 2018, and WMT 2020 datasets. For EN→JA, we use the parallel and monolingual data from the WMT 2020 news task.

Data Preprocessing
We follow the data filtering process proposed in WMT works, including language detection, length ratio filtering, dictionary alignment, and so on. For pre-processing, we first apply the MeCab tokenizer to the Japanese sentences. Then, words are segmented into subword units using the sentencepiece toolkit for both language pairs. We learn a joint vocabulary of 10,000 tokens over the source and target sides.

Data Augmentation
Similar to Section 2.2, we utilize tagged back-translation (BT) and knowledge distillation (KD) strategies to improve the performance of simultaneous NMT. We experiment with both LightConv (Wu et al., 2018) and transformer models. The model with the best BLEU score on the development set is chosen for data augmentation. The statistics of all training data and the model settings are presented in Table 4 and Table 5, respectively.

Efficient wait-k Model
Our simultaneous NMT systems are based on transformer wait-k models, which first read k source tokens and then alternate between reading and writing (translating). Formally, when decoding the sentence x, the number of visible source tokens at decoding step t is limited to min(k + t − 1, |x|), where k is the hyper-parameter controlling the latency. Furthermore, to avoid recomputing the encoder hidden states each time a token is read, we implement incremental unidirectional encoders (Elbayad et al., 2020). Multi-path training is also applied to leverage more possible wait-k paths: the hyper-parameter k ∈ [3, 9] is randomly selected for each batch during training. Models are trained with a batch size of 32,000 tokens on Tesla V100 GPUs. We average the last 6 checkpoints once the model converges.
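The wait-k read/write schedule and the per-batch sampling of k used in multi-path training can be sketched as follows. The function names are hypothetical, decoding steps are 1-indexed as in the formula min(k + t − 1, |x|), and treating k ∈ [3, 9] as a uniform draw over the integers 3 to 9 inclusive is an assumption.

```python
import random


def visible_source_tokens(k, t, source_len):
    """Number of source tokens visible at 1-indexed decoding step t."""
    return min(k + t - 1, source_len)


def wait_k_schedule(k, source_len, target_len):
    """Visible source prefix length for every target position."""
    return [visible_source_tokens(k, t, source_len)
            for t in range(1, target_len + 1)]


def sample_k(rng, low=3, high=9):
    """Draw the wait-k value for one training batch (multi-path training)."""
    return rng.randint(low, high)


if __name__ == "__main__":
    # With k = 3, a 5-token source and a 6-token target:
    print(wait_k_schedule(3, 5, 6))  # [3, 4, 5, 5, 5, 5]
    print(sample_k(random.Random(0)))
```

In a transformer implementation, this schedule would typically be turned into a cross-attention mask, so that target position t only attends to the first min(k + t − 1, |x|) encoder states.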

Inference
We explore a look-ahead beam search strategy for inference. Specifically, at each decoding step we apply beam search to generate M (M > 1) tokens and keep only the first token of the decoding path with the highest log-probability. The look-ahead beam search yields consistent improvements when k_eval is small, while its improvement is insignificant with a large k_eval. This search method is excluded from our final submissions due to its higher latency, and we choose greedy search instead.
Additionally, we split the source sentences into sub-sentences once the end-of-sentence punctuation is recognized. Though it may result in a slight performance drop due to the lack of context, we can obtain a much lower latency.
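The sentence segmentation described above can be sketched as a generator over the incoming token stream. This is an illustrative assumption rather than the submitted implementation: the punctuation set and the function name are hypothetical, and tokens that merely end in a period (e.g. abbreviations) would also trigger a split.

```python
END_OF_SENTENCE = {".", "?", "!"}  # assumed punctuation set


def split_stream(tokens):
    """Yield sub-sentences as soon as an end-of-sentence token is read."""
    segment = []
    for token in tokens:
        segment.append(token)
        if token and token[-1] in END_OF_SENTENCE:
            yield segment
            segment = []
    if segment:  # flush a trailing segment without final punctuation
        yield segment


if __name__ == "__main__":
    stream = "Hello world . How are you ? fine".split()
    for sub_sentence in split_stream(stream):
        print(sub_sentence)
```

Each yielded segment can be translated independently, which lowers latency at the cost of cross-sentence context.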
For the final submissions, we use ensemble models. We train several models with different k_train values and disjoint subsets of training data for data diversity. Each model produces different latency-quality trade-offs.

Experimental Results
We conduct all our experiments using NeurST (Zhao et al., 2020) and report results for the submitted speech translation tasks in this section. It is worth noting that all transcripts and translations in the test sets are removed from the training data.
When evaluating the offline ST models, tags such as applause and laughing are removed from both hypothesis and reference. We use word error rate (WER) to evaluate the ASR model and report case-sensitive detokenized BLEU for MT. No other data segmentation techniques are applied to the dev/test sets. Results on MuST-C dev and tst-COMMON, as well as dev(v1) and tst-COMMON(v1) from MuST-C v1 (Gangi et al., 2019), are listed together and serve as strong baselines for comparison in the end-to-end speech translation field.
When evaluating the simultaneous translation, we use the official SimulEval toolkit and report case-sensitive detokenized BLEU (Post, 2018) and Average Lagging (Ma et al., 2019) on MuST-C tst-COMMON (EN2DE) and the IWSLT21 dev set (EN2JA).

Offline Speech Translation
The overall performance of the offline ST systems and of the ASR component used in the cascade system is listed in Table 6 and Table 7, respectively. In Table 6, lines 1-4 show the performance of our pure MT systems, which translate both the lowercased ground-truth transcripts without punctuation marks and the natural texts. As seen, the "<ASR>" tag may bring no essential improvement on the irregular input (a gap of up to 2 BLEU on the single model), which suggests that text restoration has the potential to narrow the gap. Lines 6-7 present the results of translating the ASR output; our cascaded approach surpasses last year's best cascade system (line 5) by 2.6 BLEU. However, there is still a significant loss of up to 3 BLEU compared with lines 1/3 due to ASR errors.
The results of our end-to-end solutions are presented in lines 8-20, where line 8 is a benchmark model (Zhao et al., 2020) trained on the MuST-C dataset only. By increasing the model capacity (256d→768d) and applying data augmentation, we obtain a 6 BLEU improvement on tst-COMMON over the benchmark (line 8). Increasing the size of the augmented data then brings a slight further improvement, as seen when comparing line 9 with lines 10/11 (+0.3∼0.5 BLEU). Lines 13-16 show the results of our proposed fbank2vec. As shown in line 15, we achieve 31.1 BLEU on tst-COMMON with the best single model, which combines fbank2vec, progressive multi-task learning, and speed perturbation. We obtain 31.8 BLEU (line 20) with the final ensemble model, which surpasses the end-to-end benchmark by 7.9 BLEU and approaches the cascade system, with a gap of nearly 1.5 BLEU.
Lastly, for submission, our primary cascade system is line 7 and our primary end-to-end system is line 20, which achieves higher performance via model ensembling.
Figure 4: Latency-quality trade-offs of the simultaneous NMT. k7/9 means k_train = 7/9. MT#X indicates the aforementioned training datasets and model settings in Tables 4 and 5. beam refers to our look-ahead beam search strategy. seg means that the sentences are pre-split during inference. multipath means that k is randomly selected during training.

Simultaneous Translation
We evaluate the simultaneous NMT systems with different combinations of strategies and present our results in Figure 4. We then report the performance in different latency regimes in Table 8. As shown in Figure 4, we obtain remarkable BLEU improvements by training with only the knowledge-distilled data (black) compared with the filtered parallel data (green) and back-translated data (magenta): on average 1.0 BLEU on EN→DE and 0.5 BLEU on EN→JA. The possible reasons may be: 1) noise in the original data, such as non-parallel sentence pairs, is mitigated; 2) complex sentences with diverging word order are excluded, and the machine-translated texts, i.e., translationese, sometimes have simpler expressions.
We can see that the proposed look-ahead beam search (red) is competitive when k_eval is relatively small but only comparable with greedy search when k_eval is large. Taking translation latency into account, we therefore use greedy search for our final submissions. As for multi-path training, it achieves only limited BLEU improvement in our experiments.
For our final EN→DE submission, we use an ensemble of three transformer models trained on different dataset combinations, with k_train = 7. For EN→JA, the submitted model is an ensemble of two transformer models, one with k_train = ∞ (trained on full sentences) and one with multi-path training. As presented in Figure 4, the model ensemble technique leads to at least 0.5 BLEU improvement on average (yellow). Additionally, with sentence segmentation (blue), the average lagging is significantly reduced. As a result, our final submitted systems exceed the baseline system by around 7 BLEU in the same latency regime.
Table 9 lists the final results of the IWSLT 2021 offline ST track. Surprisingly, we find that our end-to-end models significantly surpass the cascade systems, which differs from our conclusions on the MuST-C test sets. We think this may be caused by the reference of tst2021. Since ref1 of tst2021 is the original one from the TED website, the translations could be much shorter for subtitling, and our end-to-end models may fit it well. Table 10 shows the official evaluation of our simultaneous NMT systems.

Conclusion
This paper summarizes the results of the Volctrans team in the IWSLT 2021 shared tasks. We investigate the performance of end-to-end solutions with data augmentation and a progressive training framework for the offline ST task. Our end-to-end approach surpasses last year's best cascaded system by 1 BLEU, but it still lags behind our cascade model by 1.5 BLEU on the MuST-C test sets. However, our end-to-end solutions achieve promising performance on tst2020 and tst2021. We also develop an efficient wait-k model with multi-path training, large-scale knowledge distillation, and back translation. The final submitted systems exceed the baseline systems by around 7 BLEU in the same latency regime. We find that data augmentation plays the most important role in these tasks. In the future, we would like to explore a more extensive data condition in both modality and quantity. We hope our practice can facilitate both research work and industrial applications.