The USYD-JD Speech Translation System for IWSLT2021

This paper describes the University of Sydney & JD's joint submission to the IWSLT 2021 low-resource speech translation task. We participated in the Swahili→English direction and obtained the best sacreBLEU score (25.3) among all participants. Our constrained system is based on a pipeline framework, i.e. ASR followed by NMT. We trained our models with the officially provided ASR and MT datasets. The ASR system is based on the open-source toolkit Kaldi, and this work mainly explores how to make the most of the NMT models. To reduce the punctuation errors generated by the ASR model, we employ our previous work SlotRefine to train a punctuation correction model. To achieve better translation performance, we explored recent effective strategies, including back translation, knowledge distillation, multi-feature reranking, and transductive finetuning. For model structure, we tried both autoregressive and non-autoregressive models. In addition, we proposed two novel pretraining approaches, i.e. denoising training and bidirectional training, to fully exploit the data. Extensive experiments show that adding the above techniques consistently improves the BLEU scores, and the final submission system outperforms the baseline (a Transformer ensemble model trained with the original parallel data) by approximately 10.8 BLEU, achieving SOTA performance.


Introduction
Recent years have seen a surge of interest in the speech translation (ST; Ney, 1999) task, which translates source-side speech into target-side text directly. The ST task contains two major components, Automatic Speech Recognition (ASR; Jelinek, 1997) and Machine Translation (MT; Koehn, 2009). In this year's IWSLT low-resource speech translation task, our USYD-JD translation team participated in the Swahili to English track. We break the speech translation task into an "ASR→NMT" pipeline, and mainly focus on the NMT component.
For model frameworks, we tried autoregressive neural machine translation, including Transformer-BASE and -BIG (Vaswani et al., 2017), and non-autoregressive translation models (Gu et al., 2018). We also employ our previous work SlotRefine (Wu et al., 2020a) to tackle the case and punctuation problems after ASR. To make the most of the parallel and monolingual data, we proposed two pretraining strategies, i.e. BIDIRECTIONAL PRETRAINING (§2.2) and DENOISING PRETRAINING (§2.3), and employed two data augmentation strategies, i.e. BIDIRECTIONAL SELF-TRAINING (§2.5) and TAGGED BACK TRANSLATION (§2.7). The data used for tagged back translation are carefully selected with our proposed multi-feature in-domain selection approach (§2.6). For post-finetuning and post-processing, we employed TRANSDUCTIVE FINE-TUNING (§2.8) and a simple post-processing approach (§2.10). This paper is structured as follows: Section 2 describes the major approaches we used. We present the data descriptions in Section 3. The experimental settings and main results are shown in Section 4. Finally, we conclude our work in Section 5.
* Work was done when Di Wu was visiting JD.

Autoregressive Translation
Given a source sentence $x$, an NMT model generates each target word $y_t$ conditioned on the previously generated ones $y_{<t}$. Accordingly, the probability of generating $y$ is computed as:

$$p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid x, y_{<t}; \theta) \tag{1}$$

where $T$ is the length of the target sequence and the parameters $\theta$ are trained to maximize the likelihood of a set of training examples according to $\mathcal{L}(\theta) = \arg\max_{\theta} \log p(y \mid x; \theta)$. We choose the Transformer (Vaswani et al., 2017) for its SOTA performance. The training examples can be formally defined as follows:

$$\overrightarrow{B} = \{(x_i, y_i)\}_{i=1}^{N}$$

where $N$ is the total number of sentence pairs in the training data. Note that in standard MT training, $x$ is fed to the encoder and $y_{<t}$ to the decoder to perform the conditional estimation of $y_t$; the utilization of $\overrightarrow{B}$ is thus directional, i.e. $x_i \rightarrow y_i$. In the preliminary experiments, we utilized the autoregressive translation (AT) model for the translation, case correction and punctuation generation tasks because of its powerful modelling ability and generation accuracy.
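The autoregressive factorization, in which the sentence probability is the product of per-token conditionals, can be illustrated with a minimal Python sketch. The function name `sequence_log_prob` and the toy probabilities are assumptions made for illustration and are not taken from our system.

```python
import math
from typing import Sequence


def sequence_log_prob(step_probs: Sequence[float]) -> float:
    """Sum log p(y_t | x, y_<t) over all target positions t = 1..T.

    step_probs[t] is assumed to be the probability that a model assigned
    to the reference token at step t, given the source and the prefix.
    """
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each step probability must lie in (0, 1]")
    return sum(math.log(p) for p in step_probs)


# Toy example: a three-token target sentence (hypothetical probabilities).
toy_probs = [0.5, 0.25, 0.8]
log_likelihood = sequence_log_prob(toy_probs)
```

Working in log space keeps the product of many small probabilities numerically stable, which is why training objectives are usually stated as a sum of log-probabilities.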

Bidirectional Pretraining
Motivation When humans learn foreign languages from translation examples, e.g. $x_i$ and $y_i$, both directions of an example, i.e. $x_i \rightarrow y_i$ and $y_i \rightarrow x_i$, may help them master the bilingual knowledge more easily. Motivated by this, Levinboim et al. (2015) and Liang et al. (2007) propose to model the invertibility between two languages. Cohn et al. (2016) introduce an extra bidirectional prior regularization to achieve symmetric training from the viewpoint of the training objective. He et al. (2018) and Zheng et al. (2019) enhance the coordination of the bidirectional corpus with model-level modifications. Different from the above methods, we model both directions of a given training example with a simple data manipulation strategy.
Our Implementation Many studies have shown that pretraining can transfer knowledge and data distributions, hence improving generalization (Hendrycks et al., 2019; Mathis et al., 2021). Here we want to transfer the bidirectional knowledge within the corpus. Specifically, we propose to first pretrain MT models on a bidirectional corpus, which can be defined as follows:

$$\overleftrightarrow{B} = \{(x_i, y_i) \cup (y_i, x_i)\}_{i=1}^{N}$$

such that the $\theta$ in Equation 1 can be updated by both directions. The bidirectional pretraining (BiPT) objective can then be formulated as:

$$\overleftrightarrow{\mathcal{L}}(\theta) = \underbrace{\arg\max_{\theta} \log p(y \mid x; \theta)}_{\text{Forward}:\ \overrightarrow{\mathcal{L}}_{\theta}} + \underbrace{\arg\max_{\theta} \log p(x \mid y; \theta)}_{\text{Backward}:\ \overleftarrow{\mathcal{L}}_{\theta}}$$

where the forward $\overrightarrow{\mathcal{L}}_{\theta}$ and backward $\overleftarrow{\mathcal{L}}_{\theta}$ are optimized iteratively. From the data perspective, we achieve the bidirectional updating as follows: 1) swapping the source and target sentences of a parallel corpus, and 2) appending the swapped version to the original. The training data is thus doubled, making fuller use of the costly bilingual corpus. The pretraining can acquire general knowledge from bidirectional data, which may help the model learn subsequent tasks better and faster. We therefore stop bidirectional training early, at 1/3 of the total steps. To ensure the proper training direction, we further train the pretrained model on the required direction $\overrightarrow{B}$ for the remaining 2/3 of the training steps. Considering the effectiveness of pretraining (Mathis et al., 2021) and clean finetuning (Wu et al., 2019), we introduce a combined pipeline, $\overleftrightarrow{B} \rightarrow \overrightarrow{B}$, as our best training strategy.
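The data manipulation behind bidirectional pretraining (swap, then append) and the 1/3 : 2/3 step split can be sketched as follows. This is a minimal illustration; the helper names and the toy sentence pairs are hypothetical, and the actual training loop is omitted.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def make_bidirectional_corpus(pairs: List[Pair]) -> List[Pair]:
    """Append the source/target-swapped copy of every pair to the original."""
    swapped = [(tgt, src) for src, tgt in pairs]
    return pairs + swapped


def split_training_steps(total_steps: int) -> Tuple[int, int]:
    """Return (pretraining steps, finetuning steps) for a 1/3 : 2/3 split."""
    pretrain_steps = total_steps // 3
    return pretrain_steps, total_steps - pretrain_steps


# Toy forward corpus (hypothetical sentence pairs).
forward_corpus = [("asante", "thank you"), ("maji", "water")]
bidirectional_corpus = make_bidirectional_corpus(forward_corpus)

# Pretrain on the doubled corpus first, then finetune on the forward corpus.
pretrain_steps, finetune_steps = split_training_steps(30_000)
```

Because the change is confined to the data, the same model and training code can be reused for both stages without modification.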

Denoising Pretraining
Motivation When humans learn a language, one of the best practices for language acquisition is to correct sentence errors, e.g. noised($x_i$)→$x_i$ and noised($y_i$)→$y_i$. Motivated by this, Lewis et al. (2020) propose several noise-adding approaches and learn to denoise them with end-to-end pretraining. Follow-up work introduces this idea to multilingual scenarios. Different from the above monolingual denoising pretraining approaches, we propose a simpler noise function and apply it to each side of the parallel data.
Our Implementation Here we want the model to understand both the source-side and target-side languages well. For the noise function noised(·), we apply the common noise-injection practice, i.e. removing, replacing, or swapping with a neighbour one randomly chosen word (sampled uniformly) in a sentence (Edunov et al., 2018). The size of the original parallel data is then doubled as follows:

$$S_{src} = \{(\text{noised}(x_i), x_i)\}_{i=1}^{N}, \qquad S_{tgt} = \{(\text{noised}(y_i), y_i)\}_{i=1}^{N}$$

where $S_{src}$ and $S_{tgt}$ can be combined to update the end-to-end model to achieve denoising pretraining, such that the $\theta$ in Equation 1 can be updated by denoising both the source and target data. The denoising pretraining (DPT) objective can then be formulated as:

$$\mathcal{L}_{DPT}(\theta) = \underbrace{\arg\max_{\theta} \log p(x \mid \text{noised}(x); \theta)}_{\text{Source Denoising}:\ \mathcal{L}^{S}_{\theta}} + \underbrace{\arg\max_{\theta} \log p(y \mid \text{noised}(y); \theta)}_{\text{Target Denoising}:\ \mathcal{L}^{T}_{\theta}}$$

where the source denoising $\mathcal{L}^{S}_{\theta}$ and target denoising $\mathcal{L}^{T}_{\theta}$ are optimized iteratively. The pretraining can store knowledge of the source and target languages in the shared model parameters, which may help the model learn subsequent tasks better and faster. Similar to bidirectional pretraining in §2.2, we stop denoising training early, at 1/3 of the total steps, and tune the model normally for the remaining 2/3 of the training steps. This process can be formally denoted as the pipeline $S_{src} \cup S_{tgt} \rightarrow \overrightarrow{B}$.

Note that Bidirectional Pretraining (BiPT) and Denoising Pretraining (DPT) can be combined to further enhance the model performance (their complementarity is shown in Table 7). In particular, the combination order of BiPT and DPT is empirically inspired by human learning behaviour, where a good interpreter first masters at least one language (usually the mother tongue), and then learns other languages to achieve bilingual translation. Thus, the combined pretraining process follows DPT → BiPT. In the combined pretraining setting, we train longer until the model converges completely.
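The noise function used for denoising pretraining (remove, replace, or swap one randomly chosen word with a neighbour) can be sketched as below. This is a minimal illustration under our own assumptions: the three operations are picked with equal probability, the replacement word is drawn from a caller-supplied vocabulary, and the names `add_noise` and `make_denoising_pairs` are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def add_noise(sentence: str, vocab: List[str],
              rng: Optional[random.Random] = None) -> str:
    """Apply one edit (remove, replace, or neighbour swap) to one random word."""
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return sentence
    operation = rng.choice(["remove", "replace", "swap"])
    i = rng.randrange(len(words))
    if operation == "remove" and len(words) > 1:
        del words[i]
    elif operation == "replace":
        words[i] = rng.choice(vocab)
    elif operation == "swap" and len(words) > 1:
        # Swap with the right neighbour, or the left one at the sentence end.
        j = i + 1 if i + 1 < len(words) else i - 1
        words[i], words[j] = words[j], words[i]
    # One-word sentences are left unchanged by "remove" and "swap".
    return " ".join(words)


def make_denoising_pairs(sentences: List[str], vocab: List[str],
                         rng: random.Random) -> List[Tuple[str, str]]:
    """Build (noised sentence, clean sentence) training pairs."""
    return [(add_noise(s, vocab, rng), s) for s in sentences]


rng = random.Random(0)
source_pairs = make_denoising_pairs(["a b c d"], ["z"], rng)
target_pairs = make_denoising_pairs(["e f g h"], ["z"], rng)
denoising_corpus = source_pairs + target_pairs  # doubles the data size
```

Applying the function to each side of the parallel data yields two sets of N pairs, which matches the doubling of the data described above.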

Nonautoregressive Translation
Different from autoregressive translation (AT; Bahdanau et al., 2015; Vaswani et al., 2017) models, which generate each target word conditioned on previously generated ones, non-autoregressive translation (NAT; Gu et al., 2018) models break the autoregressive factorization and produce the target words in parallel. Given a source sentence $x$, the probability of generating its target sentence $y$ with length $T$ is defined by NAT as:

$$p(y \mid x) = p_L(T \mid x; \theta) \prod_{t=1}^{T} p(y_t \mid x; \theta)$$

where $p_L(\cdot)$ is a separate conditional distribution that predicts the length of the target sequence. Typically, most NAT models are implemented upon the Transformer framework (Vaswani et al., 2017). In the preliminary experiments, we utilized NAT for the translation, case correction and punctuation generation tasks, as NAT avoids the error accumulation and exposure bias problems during generation. We also employ several advanced structures (Levenshtein with source local context modelling; Gu et al., 2019; Ding et al., 2020b) and our proposed training strategies (Ding et al., 2021a,b,c) as default settings.
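The non-autoregressive scoring rule, a length prediction term plus independently scored target positions, can be illustrated with a short sketch. The function name and the toy probabilities are hypothetical and serve only to make the factorization concrete.

```python
import math
from typing import Sequence


def nat_log_prob(length_prob: float, token_probs: Sequence[float]) -> float:
    """Return log p_L(T | x) + sum_t log p(y_t | x).

    Every token probability depends only on the source, so all positions
    can be scored (and generated) in parallel.
    """
    return math.log(length_prob) + sum(math.log(p) for p in token_probs)


# Toy example: predicted length T = 2 with probability 0.5.
toy_score = nat_log_prob(0.5, [0.5, 0.4])
```

Unlike the autoregressive case, no term conditions on a previously generated token, which is what removes the left-to-right dependency at decoding time.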

Bidirectional Self-Training
Besides improving NMT at the model level, many researchers turn to the data perspective, exploiting both parallel and monolingual data. The most representative approaches include: a) Back Translation (BT; Sennrich et al., 2016), which combines parallel data with synthetic data generated from target-side monolingual data; b) Knowledge Distillation (KD; Kim and Rush, 2016), which trains the model with sequence-level distilled parallel data; and c) Data Diversification (DD; Nguyen et al., 2020), which diversifies the data by applying KD and BT to the parallel data. Clearly, self-training is at the core of the above approaches: they generate synthetic data either from source to target or in reverse, with either monolingual or bilingual data.
To this end, we propose a bidirectional self-training approach for both parallel and monolingual data (including source and target, respectively). Specifically, the base teacher models are trained with the original parallel data in the first iteration (Round 1 in Table 6). Based on these forward and backward teachers, all available Swahili and English sentences can be used to generate the corresponding synthetic English and Swahili sentences. After balanced sampling between synthetic and authentic data, the concatenated data are used to train the second-iteration teachers (Round 2 in Table 6).
To reveal why our approach works, we show the results in Table 8 from the viewpoint of data complexity (Zhou et al., 2020). Self-training reduces the data complexity, thus making the model more deterministic and in turn enhancing the model performance.
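One round of bidirectional self-training can be sketched as follows. This is a simplified illustration, not our training code: the teachers are hypothetical callables standing in for trained NMT models, and "balanced sampling" is assumed here to mean drawing authentic pairs with replacement until they match the number of synthetic pairs.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def build_self_training_data(
    authentic: List[Pair],
    source_sentences: List[str],
    target_sentences: List[str],
    forward_teacher: Callable[[str], str],
    backward_teacher: Callable[[str], str],
    rng: random.Random,
) -> List[Pair]:
    """Mix authentic pairs with synthetic pairs from both teachers."""
    # Forward teacher: real source sentence -> synthetic target sentence.
    forward_synth = [(s, forward_teacher(s)) for s in source_sentences]
    # Backward teacher: real target sentence -> synthetic source sentence.
    backward_synth = [(backward_teacher(t), t) for t in target_sentences]
    synthetic = forward_synth + backward_synth
    if authentic and synthetic:
        # Assumed 1:1 balance between authentic and synthetic data.
        balanced = [rng.choice(authentic) for _ in range(len(synthetic))]
    else:
        balanced = list(authentic)
    return balanced + synthetic


# Toy teachers: upper/lower casing stands in for translation.
mixed = build_self_training_data(
    authentic=[("a", "A")],
    source_sentences=["b", "c"],
    target_sentences=["D"],
    forward_teacher=str.upper,
    backward_teacher=str.lower,
    rng=random.Random(0),
)
```

Training new teachers on `mixed` and repeating the procedure gives the iterative variant.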

| Feature type | Feature |
| LM features | BERT LM (Devlin et al., 2019) |
| | Transformer LM (Bei et al.) |
| | N-gram LM (Stolcke, 2002) |
| In-domain features | Moore-Lewis (Moore and Lewis, 2010) |
| Rule-based features | Illegal characters (Bei et al.) |
| Count features | Word count |

[...] (Vaswani et al., 2017). We score all sentences in a non-autoregressive fashion to utilize contextualized information. According to our observations, using the above multiple data selection filters significantly reduces issues such as illegal characters, disfluent sentences and domain-mismatched sentences. The data statistics for the back translation monolingual data can be found in Table 5.
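As an illustration of the in-domain feature, the Moore-Lewis criterion scores a sentence by the difference between its cross-entropy under an in-domain language model and under a general-domain one, with lower scores indicating more in-domain text. The sketch below uses add-one smoothed unigram models on toy corpora purely for illustration; the language models, corpora and function names are assumptions and differ from the models used in our system.

```python
import math
from collections import Counter
from typing import List


def train_unigram(corpus: List[str]) -> Counter:
    """Count word occurrences over a list of whitespace-tokenized sentences."""
    return Counter(word for line in corpus for word in line.split())


def cross_entropy(sentence: str, counts: Counter, vocab_size: int) -> float:
    """Per-word cross-entropy under an add-one smoothed unigram model."""
    words = sentence.split()
    if not words:
        raise ValueError("cannot score an empty sentence")
    total = sum(counts.values())
    log_probs = [math.log((counts[w] + 1) / (total + vocab_size)) for w in words]
    return -sum(log_probs) / len(words)


def moore_lewis_score(sentence: str, in_domain: Counter, general: Counter,
                      vocab_size: int) -> float:
    """H_in(s) - H_general(s); lower means closer to the in-domain data."""
    return (cross_entropy(sentence, in_domain, vocab_size)
            - cross_entropy(sentence, general, vocab_size))


in_domain_corpus = ["the talk was great", "thank you for the talk"]
general_corpus = ["stock prices fell", "the market closed"]
vocab = {w for line in in_domain_corpus + general_corpus for w in line.split()}

in_lm = train_unigram(in_domain_corpus)
general_lm = train_unigram(general_corpus)
score_close = moore_lewis_score("the talk", in_lm, general_lm, len(vocab))
score_far = moore_lewis_score("stock prices", in_lm, general_lm, len(vocab))
```

Sentences can then be ranked by this score and the lowest-scoring portion kept, optionally in combination with the other features in the table.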

Tagged Back Translation
Back-translation (Sennrich et al., 2016; Bojar et al., 2018), which translates a large-scale monolingual corpus with a pretrained target-to-source model to generate synthetic parallel data, has been widely utilized to improve translation quality. However, recent studies find that, from the perspective of translationese, back translation increases performance on target-original test sets rather than source-original ones (Zhang and Toral, 2019). To address this concern, we leverage tagged back translation (Caswell et al., 2019) to improve the source-original testing performance. The implementation is straightforward: a simple tag is added at the beginning of each source-side synthetic sentence. A detailed explanation of why this trick works can be found in Marie et al. (2020). To ensure tagged back translation works well for our task, we carefully selected the target-side in-domain monolingual data (§2.6). The final results in Table 7 show the effectiveness of tagged back translation (#9) against the competitive model #8 (+1.9 BLEU).
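Tagging the synthetic source side can be sketched in a few lines. The tag string `<BT>` and the helper name are hypothetical choices for illustration; any reserved token that never occurs in authentic source text would serve the same purpose.

```python
from typing import List, Tuple

Pair = Tuple[str, str]

BT_TAG = "<BT>"  # hypothetical reserved token marking synthetic sources


def tag_back_translated(synthetic_pairs: List[Pair]) -> List[Pair]:
    """Prepend the tag to the source side of every back-translated pair."""
    return [(f"{BT_TAG} {src}", tgt) for src, tgt in synthetic_pairs]


authentic_pairs = [("asante", "thank you")]
synthetic_pairs = [("maji safi", "clean water")]  # toy back-translated pair
training_pairs = authentic_pairs + tag_back_translated(synthetic_pairs)
```

Authentic pairs are left untagged, so the model can distinguish the two data sources during training, and test inputs are translated without the tag.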

Transductive Fine-Tuning
The key idea of transductive finetuning is that the source input sentences from the validation and test sets are first translated into the target language with the best-performing NMT model, which results in a pretranslated synthetic dataset. The models are then finetuned on this synthetic dataset. We borrow this concept from previous systems. We empirically show that transductive finetuning (#10-11 in Table 7) indeed improves the official validation performance but harms the performance on our sampled valid and test sets, which share the distribution of the training set. Note that we randomly sampled 5K/5K sentences from the training set as valid and test sets, respectively, to avoid the sub-optimal problem caused by the distribution gap. Experimental details can be found in §3 and §4.

Reranking N-best Hypotheses
As NMT decoding generally proceeds from left to right, it suffers from the label bias problem (Lafferty et al., 2001). To alleviate this problem, besides using NAT (§2.4), we rerank the n-best hypotheses by training a k-best batch MIRA ranker (Cherry and Foster, 2012) with multiple features on the validation set. The feature pool we integrated includes an R2L (right-to-left) translation model, a T2S (target-to-source) translation model, a language model and the IBM Model 2 alignment score. After multi-feature reranking, the best hypothesis is retained.

Right-to-Left NMT Model
The R2L NMT model is trained on the same training data but with inverted target sentences (i.e., the target side is reversed, "a b c d"→"d c b a"). The hypotheses in the n-best list are then inverted as well, so that the R2L model can assign them perplexity scores.
Target-to-Source NMT Model The T2S model was initially trained for back-translation. We also employ this model to assess translation adequacy by adding the T2S score to the reranking feature pool.
Language Model Besides the above features, we employ language models as an auxiliary feature that gives fluent sentences better scores, so that the results are easier for humans to understand.
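The final selection step of multi-feature reranking can be sketched as a weighted sum over feature scores. In this illustration the weights are fixed by hand, whereas in our system they are learned by the MIRA ranker; the feature names, scores and helper functions are hypothetical, and the reversal is assumed to operate on whitespace-separated tokens.

```python
from typing import Dict, List, Tuple

Hypothesis = Tuple[str, Dict[str, float]]


def reverse_tokens(sentence: str) -> str:
    """Invert the token order so a hypothesis can be scored by an R2L model."""
    return " ".join(reversed(sentence.split()))


def rerank(nbest: List[Hypothesis], weights: Dict[str, float]) -> str:
    """Return the hypothesis with the highest weighted feature sum."""
    def total(features: Dict[str, float]) -> float:
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())

    best_text, _ = max(nbest, key=lambda hyp: total(hyp[1]))
    return best_text


# Toy 2-best list with log-probability style feature scores.
nbest = [
    ("hyp one", {"l2r": -1.0, "r2l": -3.0, "t2s": -2.0, "lm": -2.0}),
    ("hyp two", {"l2r": -1.2, "r2l": -1.5, "t2s": -1.9, "lm": -1.8}),
]
weights = {"l2r": 1.0, "r2l": 1.0, "t2s": 1.0, "lm": 1.0}
best = rerank(nbest, weights)
```

Here the second hypothesis wins even though the left-to-right model prefers the first, which is the kind of correction the additional features are meant to provide.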

Post Processing
Besides general post-processing (i.e., de-BPE, detokenization and de-truecasing), we also used a post-processing algorithm (Wang et al., 2018) to fix inconsistent number and date translations. For example, "2006-07" might be segmented as "2006 -@@ 07" by BPE, resulting in the wrong translation "2006 at 07". Our post-processing algorithm searches for the best-matching number string in the source sentence to replace these types of errors; see Table 2.
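The idea of restoring number strings from the source can be sketched with regular expressions. This is a simplified illustration under our own assumptions and not the algorithm of Wang et al. (2018): it only handles digit groups joined by a single separator and tolerates at most one spurious token between consecutive groups in the hypothesis.

```python
import re

# Digit groups joined by common separators, e.g. "2006-07" or "12:30".
NUMBER_PATTERN = re.compile(r"\d+(?:[-/.:,]\d+)+")


def restore_numbers(source: str, hypothesis: str) -> str:
    """Replace broken number spans in the hypothesis with source strings."""
    for number in NUMBER_PATTERN.findall(source):
        if number in hypothesis:
            continue  # already translated consistently
        parts = re.split(r"[-/.:,]", number)
        # Allow at most one intervening token between digit groups.
        broken = r"\s+(?:\S+\s+)?".join(re.escape(p) for p in parts)
        hypothesis = re.sub(
            r"\b" + broken + r"\b",
            lambda _match, fixed=number: fixed,
            hypothesis,
            count=1,
        )
    return hypothesis


fixed = restore_numbers("mwaka 2006-07", "in the year 2006 at 07")
```

With this toy input, the span "2006 at 07" is replaced by the source string "2006-07"; hypotheses that already contain the source number are returned unchanged.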

Data Preparation
For the ASR task, we downloaded all available Swahili speech-to-text data, such as openslr and IARPA Babel, as the training corpus, and employed all default settings in Kaldi to preprocess the data and train the models. To simplify the ASR task, we lowercased all Swahili sentences and removed punctuation. To restore the case and punctuation information, we design two pipeline tasks after ASR: case correction and punctuation generation. It is also worth noting that we design some rules to perform "voice activity detection" on the official speech test set. Taking the piece of speech in Figure 1 as an example, only the part of the speech inside the red box is kept as valid input.
For the NMT task, the parallel datasets we utilized are described in Table 3, including CCAligned (El-Kishky et al., 2020), Tanzil (Tiedemann, 2012), ParaCrawl, WikiMatrix (Schwenk et al., 2019), GlobalVoices (Tiedemann, 2012), TED2020 (Reimers and Gurevych, 2020), Wiki-Media (Tiedemann, 2012) and Gamayun. The monolingual data we utilized are described in Table 4 and Table 5, where the monolingual data in Table 4 are used to train systems #1-8 in Table 7, and the data in Table 5 are used to train systems #9-11 in Table 7. Table 6 shows how the data are used and generated by iterative bidirectional self-training (§2.5). The total data size after two rounds of bidirectional self-training is 50.4M, and after tagged back translation, the final data volume is 60.4M.
To avoid the sub-optimal problem caused by the distribution gap between the official validation data and the training data, we randomly sampled 5K/5K sentences from the training set as valid and test sets, respectively. The randomly sampled valid sentences are used to optimize the hyper-parameters.

| Mono. Corpus for Tagged BT | | #Sent. |
| Totally collected corpus | commoncrawl English | 30,513,498 |
| Cleaned corpus with criteria in §2.6 | in-domain English | 10,000,000 |

Experiments
Settings For the case correction and punctuation generation tasks mentioned in §3, we tried the autoregressive Transformer-BASE (AT, §2.1), the non-autoregressive model (NAT, §2.4) and our previously designed SLOTREFINE (Wu et al., 2020a). In our preliminary experiments, NAT and SlotRefine work better on the case correction and punctuation generation tasks, respectively, and are therefore used as the default components in our final speech translation pipeline.
For the NMT task, we tried the autoregressive Transformer-BIG (AT, §2.1) and the non-autoregressive model (NAT, §2.4) in preliminary experiments, and found that AT performs robustly in all settings. Thus we employ Transformer-BIG for all MT systems. Inspired by He et al. (2019), we empirically adopt the large-batch strategy (Edunov et al., 2018) (i.e. 458K tokens/batch) to optimize the performance. The learning rate warms up to $1 \times 10^{-7}$ for 10K steps, and then decays with the cosine schedule for 30K steps (data volumes ranging from 2M to 10M) or 50K steps (data volumes larger than 10M). For regularization, we tune the dropout rate over [0.1, 0.2, 0.3] based on validation performance, and apply weight decay of 0.01 and label smoothing with $\epsilon = 0.1$. We use the Adam optimizer (Kingma and Ba, 2015) to train the models. We evaluate the performance on an ensemble of the last 10 checkpoints to avoid stochasticity.
For fair comparison, the metric we employed is sacreBLEU (Post, 2018). We apply tokenization and truecasing with Moses scripts (Koehn et al., 2007). To limit the vocabulary size of the NMT models, we adopted byte pair encoding (BPE) (Sennrich et al., 2016) with 32k operations. A larger beam size may worsen translation quality (Koehn and Knowles, 2017), thus we set the beam size to 10 when performing n-best reranking (§2.9). All models were trained on four 16GB NVIDIA V100 GPUs.
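The shape of a warmup-then-cosine-decay schedule can be sketched as follows. This is an illustration of the general technique only: the linear form of the warmup, the `peak_lr` and `min_lr` parameters and the function name are assumptions, and the values used below are placeholders rather than the settings of our systems.

```python
import math


def learning_rate(step: int, peak_lr: float, warmup_steps: int = 10_000,
                  decay_steps: int = 30_000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Assumed linear warmup; reaches peak_lr at the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Placeholder peak value of 1.0 so the curve can be read as a fraction.
schedule = [learning_rate(s, peak_lr=1.0) for s in (0, 9_999, 25_000, 40_000)]
```

Setting `decay_steps` to 30K or 50K switches between the two data-volume regimes described above.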

Main results
Our main results are shown in Table 7. Our baseline system is developed with the original parallel corpus and the last-10 ensemble strategy. Unsurprisingly, the baseline system performs the worst. The proposed Bidirectional Pretraining (§2.2) and Denoising Pretraining (§2.3) consistently and significantly improve the model performance, showing their effectiveness in low-resource scenarios (Zhang and Tao, 2020). Combining the two achieves better results (+1.7 BLEU on average), indicating their complementarity.

Table 7: SacreBLEU of Sw→En on our randomly sampled "Valid/Test" sets and the official validation set "Off. Valid", where "∆" represents the performance gains compared with baseline #1. The submitted system is #13.

| Data | Compl. | BLEU |
| Baseline | 7.87 | 47.1 |
| Bi. Self-Training | 5.34 | 49.4 |
| Iterative Bi. Self-Training | 4.89 | 49.7 |

Table 8: Explanation of why Bidirectional Self-Training works. The data complexity "Compl." is measured on the corresponding training sets, and the alignment information is trained with fast-align (Dyer et al., 2013). The BLEU scores are reported on our sampled validation set.
As shown in #5 and #7, the proposed Bidirectional Self-Training and its refined iterative version consistently enhance the model. To explore why self-training improves model performance, we discuss it from the viewpoint of data complexity. As shown in Table 8, as Bidirectional Self-Training progresses iteratively, the data complexity becomes lower, leading to better BLEU scores. Notably, the combination of our two proposed pretraining approaches pushes the SOTA performance even higher. We believe that the effect of our two proposed pretraining strategies is still under-investigated, which we leave as future work. Overall, with strategies #2-8, the model performance on the official validation set improves by +6.4 BLEU.
Tagged Back Translation (§2.7) with in-domain monolingual data significantly improves the performance on both our sampled test set and the official validation set, by +5.0 and +8.3 BLEU against the baseline, respectively.
We empirically show that Transductive Fine-Tuning (§2.8) indeed improves the official validation performance but harms the performance on our sampled valid and test sets, which share the distribution of the training set. This indicates that transductive learning is an effective practice for transferring a well-trained model across domains.
The last two strategies, Reranking (§2.9) and Post Processing (§2.10), further improve the official validation BLEU score from 41.9 to 42.6, which substantially outperforms the baseline by +10.8 BLEU.
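To make the notion of data complexity concrete, the sketch below computes an average conditional entropy of target words given each source word from word-aligned pairs, in the spirit of Zhou et al. (2020). The exact formula is our assumption for illustration, the aligned pairs would in practice come from an aligner such as fast-align, and the toy pairs and function name are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def data_complexity(aligned_pairs: List[Tuple[str, str]]) -> float:
    """Average over source words of the entropy H(y | x) of aligned targets."""
    by_source: Dict[str, Counter] = defaultdict(Counter)
    for src_word, tgt_word in aligned_pairs:
        by_source[src_word][tgt_word] += 1
    if not by_source:
        return 0.0
    entropies = []
    for counts in by_source.values():
        total = sum(counts.values())
        entropies.append(
            -sum((c / total) * math.log(c / total) for c in counts.values())
        )
    return sum(entropies) / len(entropies)


# A source word aligned to two different targets raises the complexity.
varied = [("nyumba", "house"), ("nyumba", "home"), ("maji", "water")]
consistent = [("nyumba", "house"), ("nyumba", "house"), ("maji", "water")]
complexity_varied = data_complexity(varied)
complexity_consistent = data_complexity(consistent)
```

When every source word is always aligned to the same target word, the measure drops to zero, which mirrors how synthetic data produced by a teacher model tends to be more deterministic than authentic data.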

Conclusion and Future Work
This paper presents the University of Sydney & JD's speech translation system for the IWSLT2021 Swahili→English task. The whole system is a pipeline containing ASR, case correction, punctuation generation and NMT tasks, and we mainly focused on the NMT task.
We leveraged multi-dimensional strategies and frameworks to improve the translation quality, which achieves an improvement of +10.8 BLEU over the baseline and ranks first among all participants. We find that our proposed BIDIRECTIONAL PRETRAINING (§2.2) and DENOISING PRETRAINING (§2.3) consistently improve over competitive baselines. We also employ BIDIRECTIONAL SELF-TRAINING (§2.5) and TAGGED BT (§2.7) to make the most of the existing parallel and monolingual data.
In the future, we would like to polish the other components in the pipeline to achieve better performance. It is also worth trying an end-to-end approach with cross-modal structures to incorporate audio and vision knowledge (Xu et al., 2021). For robust model training and data utilization, we would explore better strategies, e.g. adversarial training (Wu et al., 2021) and curriculum learning (Liu et al., 2020a; Zhou et al., 2021).