RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer

End-to-end simultaneous speech translation (SST), which directly translates speech in one language into text in another language in real time, is useful in many scenarios but has not been fully investigated. In this work, we propose RealTranS, an end-to-end model for SST. To bridge the modality gap between speech and text, RealTranS gradually downsamples the input speech with interleaved convolution and unidirectional Transformer layers for acoustic modeling, and then maps the speech features into text space with a weighted-shrinking operation and a semantic encoder. In addition, to improve model performance in simultaneous scenarios, we propose a blank penalty to enhance the shrinking quality and a Wait-K-Stride-N strategy that allows local reranking during decoding. Experiments on public and widely-used datasets show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models as well as cascaded models in diverse latency settings.


Introduction
Simultaneous speech translation (SST) (Fügen et al., 2007; Oda et al., 2014; Ren et al., 2020) aims to translate speech in one language into text in another language concurrently. It is useful in many scenarios, such as synchronous interpretation at international conferences and automatic captioning for live videos. However, prior studies either focus on full-sentence speech translation (ST) (Berard et al., 2016; Weiss et al., 2017) or on simultaneous text-to-text machine translation (STT) (Cho and Esipova, 2016; Gu et al., 2017; Dalvi et al., 2018), which takes the segmented output of an automatic speech recognition (ASR) system as input. Such two-stage models (i.e., cascaded models) inevitably introduce error propagation and also increase translation latency (see Figure 1). Ren et al. (2020) propose an end-to-end SST system

Figure 1: An example for a cascaded model and our RealTranS ("My mom was the ...." → "Mi madre era la ...."). The ASR part in the cascaded model wrongly recognizes "My mom was" as "I'm almost". The error is propagated to NMT and leads to a wrong translation. RealTranS avoids such errors and translates accurately.
called SimulSpeech, but they ignore the modality gap between speech and text, which is important for improving translation quality.
In this paper, we propose the RealTranS model for SST. To relieve the burden of our encoder (Wang et al., 2020c), we decouple it into three parts: an acoustic encoder, a weighted-shrinking operation, and a semantic encoder. We apply the Conv-Transformer as our acoustic encoder, which gradually downsamples the input speech and learns acoustic information with interleaved convolution and Transformer layers. The weighted-shrinking operation bridges the length gap between speech and text by taking a weighted sum of the frames in each detected segment, based on the posterior probabilities generated by a CTC module (Graves et al., 2006). Finally, we use a semantic encoder to extract semantic features and deliver them to the decoder for translation.
To enable simultaneous decoding, a unidirectional Transformer is applied in our encoder. This inevitably affects the performance of the CTC module and the subsequent shrinking operation. To alleviate this, we introduce a blank-limited CTC loss, which adds a blank penalty to the traditional CTC loss to encourage the model to produce non-blank labels, motivated by the observation that CTC tends to produce peaky distributions, over-predicting blank labels as a kind of overfitting (Liu et al., 2018). Accordingly, the shrinking quality can be improved. Furthermore, we propose a new simultaneous strategy, Wait-K-Stride-N, which allows local reranking during decoding. This strategy resolves an inherent drawback of the conventional Wait-K strategy (Ma et al., 2019), which cannot apply vanilla beam search efficiently (Zheng et al., 2019c).
Experiments on the Augmented LibriSpeech En-Fr and MUST-C En-Es and En-De datasets demonstrate the effectiveness of the Wait-K-Stride-N strategy, and show that RealTranS achieves better performance than the prior end-to-end model SimulSpeech (Ren et al., 2020) as well as cascaded models. Further analysis and an ablation study reveal the effects of our proposed modules in RealTranS. We also compare RealTranS with other methods on full-sentence ST; RealTranS achieves competitive or even better results, indicating its superiority.
In summary, the contributions of this work include the following aspects:
• We propose RealTranS for SST, which gradually bridges the modality gap between speech and text through gradual downsampling and weighted shrinking.
• We introduce a blank penalty and the Wait-K-Stride-N strategy to improve performance in simultaneous translation scenarios.
• Extensive experiments on public and widely-used datasets show the superiority of our RealTranS model and our Wait-K-Stride-N strategy in diverse latency settings.

Related Work
Speech Translation. Speech translation (ST) has recently attracted intensive attention from the AI community. Earlier works are mostly based on cascaded models, which perform NMT on the outputs of ASR systems (Ney, 1999; Mathias and Byrne, 2006; Sperber et al., 2017; Bahar et al., 2021). Cascaded models inevitably introduce error propagation from ASR (Weiss et al., 2017). To avoid this problem and for better efficiency, end-to-end ST models have been proposed and have become popular in recent years (Berard et al., 2016, 2018; Bansal et al., 2018). To alleviate the data scarcity problem of end-to-end ST models, various techniques are utilized, including pre-training (Bansal et al., 2019), multi-task learning (Anastasopoulos and Chiang, 2018), knowledge distillation (Ren et al., 2020), data synthesis (Jia et al., 2019), self-supervised learning, and speech augmentation techniques like SpecAugment (Bahar et al., 2019) or speed perturbation (Stoian et al., 2020). Some studies focus on how to bridge the gap between different modalities (speech and text) or different modules (acoustic and semantic modeling). Wang et al. (2020b,c) propose a TCEN model and a curriculum pre-training technique, respectively, to make sure the modules learn the desired information. Salesky and Black (2020) propose adaptive feature selection, while Dong et al. (2020a) exploit the CTC-based (Graves et al., 2006) shrinking mechanism. Nevertheless, none of them explore simultaneous scenarios, where encoding quality inevitably suffers from the lack of future information in unidirectional encoders.
Simultaneous Translation. Previous studies on simultaneous translation focus on text-to-text scenarios (STT) (Cho and Esipova, 2016; Gu et al., 2017; Dalvi et al., 2018), where fixed policies (Ma et al., 2019) and adaptive policies (Arivazhagan et al., 2019; Zheng et al., 2019a,b) are proposed to decide when to read and write tokens. Ma et al. (2019) propose a simple yet effective strategy, Wait-K, based on a prefix-to-prefix framework: it first waits for the first k source tokens, and then generates target tokens concurrently with the source stream. It achieves competitive performance in simultaneous translation (Zheng et al., 2019a). Traditional simultaneous speech-to-text translation (SST) mainly depends on ASR segmentation, performing NMT on the streaming segmented chunks (Oda et al., 2014; Iranzo-Sánchez et al., 2020). There has been little attention on end-to-end SST. Ren et al. (2020), to our knowledge, first propose an end-to-end model called SimulSpeech with multi-task learning and knowledge distillation, and apply the Wait-K strategy to perform simultaneous translation. Other work explores how to define a "token" in source speech and then adapts methods from STT to SST, and Ma et al. (2020b) introduce a memory-augmented Transformer to handle the streaming speech input. However, none of them investigate the modality gap between speech and text.

The RealTranS Model
Our RealTranS follows the sequence-to-sequence architecture, which consists of an ST encoder and an ST decoder. The ST encoder is decoupled into three parts, an acoustic encoder, a weighted shrinking operation, and a semantic encoder, to gradually map speech inputs into semantic representation space of text. Figure 2 shows the architecture.

Problem Formulation
Speech translation corpora usually contain triples of speech, transcription, and translation, denoted as D_ST = {(x, z, y)}. Specifically, x = (x_1, x_2, ..., x_{T_x}) is a sequence of speech features extracted from speech signals, e.g., filterbanks. z = (z_1, z_2, ..., z_{T_z}) and y = (y_1, y_2, ..., y_{T_y}) are the corresponding transcription in the source language and translation in the target language. T_x, T_z, and T_y are the lengths of the speech, transcription, and translation, respectively, where usually T_x ≫ T_z and T_x ≫ T_y. A typical end-to-end model only makes use of x and y, while z can be used in multi-task training for auxiliary objectives, like the CTC loss.

Acoustic Encoder
The acoustic encoder encodes the speech features x into a hidden space to learn acoustic knowledge. We apply the Conv-Transformer to extract the desired features. It contains three blocks, each composed of three convolution layers followed by unidirectional Transformer layers (see the lower left of Figure 2), to prevent the encoder from leveraging future context in SST. Following prior work, we make the model aware of a limited number of future frames with a look-ahead window in the convolution layers, which helps improve acoustic modeling. At the same time, we gradually downsample the long speech features by setting the stride size to 2 in the second convolution layer of each block. In this way, the speech features are gradually reduced and approach the length of the corresponding text.
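To make the downsampling schedule concrete, the following sketch (our own illustration, not the authors' code; the kernel size and padding are assumed values) traces how the frame count shrinks through three blocks whose second convolution uses stride 2, giving an overall 8x reduction on top of the 10 ms frame shift, i.e. roughly 80 ms per encoder state:

```python
def conv_out_len(t, kernel=3, stride=1, pad=1):
    """Standard 1-D convolution output length."""
    return (t + 2 * pad - kernel) // stride + 1

def acoustic_encoder_len(t_frames, num_blocks=3):
    """Frame count after the gradual downsampling of the acoustic encoder."""
    for _ in range(num_blocks):
        t_frames = conv_out_len(t_frames, stride=1)  # conv 1: keep length
        t_frames = conv_out_len(t_frames, stride=2)  # conv 2: halve length
        t_frames = conv_out_len(t_frames, stride=1)  # conv 3: keep length
    return t_frames

# 10 s of speech -> 1000 frames (10 ms step) -> 125 encoder states (~80 ms each)
print(acoustic_encoder_len(1000))  # → 125
```

Each block halving the length once is what keeps the intermediate Transformer layers working on progressively shorter, text-like sequences instead of downsampling all at once.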
To predict the word boundaries in the speech input and further improve the learned acoustic features, we adopt a Connectionist Temporal Classification (CTC) (Graves et al., 2006) module on top of the acoustic encoder. This module contains a Multilayer Perceptron (MLP) followed by a Softmax operator. CTC predicts a path π = (π_1, π_2, ..., π_{T'_x}), where T'_x is the length of the hidden states after the Conv-Transformer, and π_t ∈ V ∪ {φ} can be either a token in the source vocabulary V or the blank symbol φ. CTC paths have a many-to-one mapping to output sequences, denoted as the operation B, which removes blank symbols and consecutively repeated labels. The CTC loss is therefore defined as follows:

L_CTC = −log P(z|x) = −log Σ_{π ∈ B⁻¹(z)} P(π|x),   (1)

where B⁻¹(z) denotes all possible CTC paths that can be mapped to the transcription z. With the CTC module, we can define a word boundary between two frames where the first frame has a non-blank label and the second frame has a different label from the first one. Figure 3 shows an example of word boundaries. Due to the data scarcity problem in ST and CTC's inherent characteristics (Liu et al., 2018), the detected word boundaries are usually not accurate enough. The module overly predicts the occurrence of blank labels (as a kind of overfitting), resulting in a large gap between the number of detected boundaries and the number of tokens in the transcription, especially when a unidirectional Transformer is applied (see Table 3). To alleviate the problem, we add a blank penalty to encourage the module to produce non-blank labels. The resulting blank-limited CTC loss is defined as follows:

L_BL-CTC = L_CTC + λ · Σ_t 1[π̂_t(x) = φ],   (2)

where π̂(x) denotes the argmax results of the CTC softmax outputs, 1[·] is the indicator function, and λ controls the effect of the blank penalty.
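The boundary rule described above can be illustrated with a small sketch (hypothetical labels of our own; `None` stands in for the blank symbol φ): a boundary lies between frames t and t+1 whenever frame t is non-blank and frame t+1 carries a different label.

```python
BLANK = None  # stand-in for the CTC blank symbol φ

def word_boundaries(path):
    """Indices t such that a word boundary lies between frames t and t+1:
    frame t is non-blank and frame t+1 has a different label."""
    return [t for t in range(len(path) - 1)
            if path[t] is not BLANK and path[t + 1] != path[t]]

# A toy CTC path for "my mom": repeats collapse and blanks are removed by B
path = ["my", "my", BLANK, BLANK, "mom", "mom", BLANK]
print(word_boundaries(path))  # → [1, 5]
```

Over-predicting blanks shortens the non-blank runs in such paths, which is exactly why too few boundaries are detected without the blank penalty.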

Weighted-Shrinking Operation
The length gap between the acoustic features and the corresponding transcription and translation is still large after the gradual downsampling in Section 3.2. Inspired by previous studies (Dong et al., 2020a), we adopt a shrinking operation to bridge the gap based on the CTC predictions. Prior works usually either remove blank frames and then average repeated frames (Dong et al., 2020a) or select a single representative frame in each detected segment (i.e., the frames between two word boundaries). However, there might be useful information in the discarded blank or repeated frames, especially when the boundary detection is not accurate enough (see Section 3.2).
We propose a weighted-shrinking mechanism to tackle the problem. We assume that the probability of a frame being labeled as "blank" represents the confidence that the model "thinks" it is not important. Therefore, for the frames in one segment, their weights are decided by their probabilities of being blank. The representation of the segment is the weighted average of the corresponding frames:

h̄_S = Σ_{t∈S} α_t h_t,  with α_t = exp(μ · (1 − p_t^b)) / Σ_{t'∈S} exp(μ · (1 − p_{t'}^b)),   (3)

where S is the set of frames in a detected segment, p_t^b denotes the probability of frame t being blank, and h_t represents the hidden state of frame t in our acoustic encoder. μ ≥ 0 controls the temperature of the distribution (i.e., the Softmax function). When μ = 0, the frames are simply averaged; when μ → ∞, it degenerates to the general shrinking mechanism where only the representative frame with the highest confidence is selected.
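As one possible reading of this operation, the sketch below computes a softmax-weighted segment representation; the exact parameterization of the weights is our assumption, chosen to match the described limiting behaviors at μ = 0 (plain average) and μ → ∞ (pick the least-blank frame):

```python
import math

def shrink_segment(hidden, p_blank, mu=1.0):
    """Weighted average of the frames in one detected segment.

    hidden:  list of frame vectors (lists of floats)
    p_blank: per-frame blank probabilities in [0, 1]
    mu:      softmax temperature over mu * (1 - p_blank)
    """
    scores = [math.exp(mu * (1.0 - p)) for p in p_blank]
    z = sum(scores)
    weights = [s / z for s in scores]  # softmax weights over the segment
    dim = len(hidden[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]

frames = [[1.0, 0.0], [0.0, 1.0]]
# mu = 0: uniform weights, i.e. a plain average of the two frames
print(shrink_segment(frames, p_blank=[0.9, 0.1], mu=0.0))  # → [0.5, 0.5]
```

With a large μ (e.g. 50), nearly all of the weight lands on the frame with the lowest blank probability, recovering the single-representative-frame shrinking of prior work.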

Semantic Encoder
The shrinking operation only bridges the length gap between speech and text; the shrunk representations still lack semantic information. Therefore, we apply a semantic encoder on top of the shrunk representations. It first applies a positional embedding layer, followed by several Transformer layers (also unidirectional, to mask future context), to extract semantic representations.

ST Decoder
A decoder similar to the basic Transformer architecture in NMT is adopted, where several Transformer decoder layers are stacked on top of the target embeddings. To simulate simultaneous translation, we follow the prefix-to-prefix framework (Ma et al., 2019) and mask certain future context in the cross attention, ensuring that the model predicts the current token based on only part of the input from the ST encoder. How much context the model can see depends on the simultaneous strategy that is applied.

Wait-K-Stride-N Strategy
The conventional Wait-K strategy cannot apply vanilla beam search except in the long-tail scenario (Ma et al., 2019; Zheng et al., 2019c), though beam search has been proven very effective in improving translation quality. Based on the prefix-to-prefix framework and the STATIC-RW strategy proposed in Dalvi et al. (2018), we propose the Wait-K-Stride-N strategy to allow using beam search for local reranking during simultaneous decoding. Similar to the Wait-K strategy, our strategy first reads k input units (tokens in MT or segments in ST). Then, the model repeatedly performs n write and n read operations until the end of the sentence (see Figure 4). In this way, the translation latency is close to Wait-K, but we can perform beam search over the n write operations. The objective with such a strategy is hence defined as follows:

L_ST = −Σ_{t=1}^{T_y} log P(y_t | y_{<t}, x_{≤g(t)}),  g(t) = k + n · ⌊(t−1)/n⌋,   (4)

where y_{<t} represents the target tokens before y_t, T_y is the length of the target sentence, and x_{≤g(t)} represents the first g(t) detected source segments.
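The read/write schedule implied by Wait-K-Stride-N can be sketched as follows (our own illustration; `wait_k_stride_n_schedule` is a hypothetical helper, and we assume the number of visible segments for token t is k + n·⌊(t−1)/n⌋, capped at the source length):

```python
def wait_k_stride_n_schedule(k, n, num_src, num_tgt):
    """Number of source segments visible when emitting each target token:
    read k segments first, then alternate writing n tokens / reading n segments."""
    visible = []
    for t in range(1, num_tgt + 1):
        g = k + n * ((t - 1) // n)       # segments read before emitting token t
        visible.append(min(g, num_src))  # cannot read past the end of the speech
    return visible

# Wait-3-Stride-2: tokens 1-2 see 3 segments, tokens 3-4 see 5, tokens 5-6 see 7
print(wait_k_stride_n_schedule(k=3, n=2, num_src=8, num_tgt=6))  # → [3, 3, 5, 5, 7, 7]
```

Because n consecutive tokens share the same visible prefix, beam search can rerank hypotheses within each group of n write operations without changing the overall latency profile.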

Training Procedure
The total objective of our model is the sum of the CTC part and the ST part:

L = L_ST + α · L_CTC,   (5)

where L_CTC is the (blank-limited) CTC loss from Section 3.2 and α controls the influence of the CTC part.
To enhance the CTC quality, we also apply a pre-training procedure (Stoian et al., 2020): we use only the CTC loss to pre-train the acoustic encoder. In this way, we can avoid wasted training and focus on improving the alignment results, which are essential for the shrinking operation (see Table 3). The whole model is then fine-tuned on the whole ST corpus.

Datasets
We conduct experiments on the Augmented LibriSpeech En-Fr and MUST-C En-Es and En-De datasets. For Augmented LibriSpeech, we follow prior work (Wang et al., 2020c) and use the 100-hour clean training set with aligned references and the provided Google translations, which doubles the number of training pairs. For the MUST-C datasets, we use the official data splits for training and development, and the tst-COMMON set for testing. The statistics of these three datasets are listed in Table 1.

Experimental Settings
We use 80-dimensional log-mel filterbanks as acoustic features, computed with a 25 ms window size and a 10 ms step size and normalized by utterance-level Cepstral Mean and Variance Normalization (CMVN). For transcriptions and translations, SentencePiece (Kudo and Richardson, 2018) is used to generate subword vocabularies with sizes of 4k and 8k, respectively. We remove the punctuation in transcriptions. Our acoustic encoder follows the settings of the original Conv-Transformer, except that the channel number in the convolution layers and the hidden size and head number in the Transformer layers are half of theirs, giving the acoustic encoder an output dimension of 256. We use 6 Transformer layers in the semantic encoder and 4 in the ST decoder. The hyper-parameters λ (Eq. 2), μ (Eq. 3), and α (Eq. 5) are set to 0.5, 1.0, and 1.0, respectively. Our model is trained on 8 NVIDIA Tesla V100 GPUs with batches of approximately 40000 frames of features. We use the Adam optimizer (Kingma and Ba, 2015) with a 0.002 learning rate and 10000 warm-up steps, followed by the inverse square root scheduler. Dropout is used with a rate of 0.1. We save checkpoints every epoch and average the last 10 checkpoints for evaluation, with a beam size of 5. For simplicity, we use the same K and N values for inference as for training. We implement our model based on Fairseq S2T.
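Utterance-level CMVN as mentioned above can be sketched as follows (a simplified illustration of our own; production pipelines typically rely on toolkit implementations such as Kaldi's or torchaudio's):

```python
def cmvn(features):
    """Utterance-level CMVN: normalize each feature dimension of one
    utterance to zero mean and unit variance.

    features: list of frames, each a list of floats (e.g. 80-dim filterbanks)
    """
    n, dims = len(features), len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in features) / n
                 for d in range(dims)]
    # Guard against constant dimensions (zero variance)
    stds = [v ** 0.5 if v > 0 else 1.0 for v in variances]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in features]

feats = [[1.0, 10.0], [3.0, 10.0]]
print(cmvn(feats))  # → [[-1.0, 0.0], [1.0, 0.0]]
```

Normalizing per utterance (rather than globally) makes the features robust to per-recording channel and loudness differences, which matters for heterogeneous ST corpora.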

Evaluation Metrics
We apply SacreBLEU for translation quality evaluation unless otherwise stated. For latency metrics, we adapt Average Proportion (AP) (Cho and Esipova, 2016) and Average Lagging (AL) (Ma et al., 2019) to ST settings, following previous studies (Ren et al., 2020).
Average Proportion. AP calculates the mean absolute latency cost incurred by each target token, where we replace the number of source tokens read with the time spent listening. It can be calculated as follows:

AP = 1 / (|y| · D) · Σ_{i=1}^{|y|} d(y_i),

where d(y_i) is the speech duration that has been listened to when producing the target token y_i, |y| is the length of the generated translation, and D is the total duration of the source speech.
Average Lagging. AL evaluates the degree to which the user is out of sync with the speaker, originally measured in numbers of source tokens (Ma et al., 2019). We also extend it to be based on time duration rather than source tokens, defined as follows:

AL = 1 / τ(|x|) · Σ_{i=1}^{τ(|x|)} ( d(y_i) − (i − 1) · |x| · T_s / |y*| ),

where |y*| is the length of the reference translation, τ(|x|) denotes the index of the target token being produced when our model has just read the entire source speech, and |x| · T_s is the total speech duration: the speech features arrive every T_s ms (determined by the step size in feature extraction and the downsampling rates in the convolution layers), which is 80 ms in our model. Moreover, our Conv-Transformer module introduces a 140 ms look-ahead window, so we add 140 ms to the final AL scores to compare fairly with other models.
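Under this reading of the two metrics, a minimal sketch (our own illustration with hypothetical helper names; it omits the 140 ms look-ahead correction) is:

```python
def average_proportion(d, total):
    """AP: mean fraction of the utterance already heard per emitted token.
    d[i] is the speech duration (ms) heard when token i+1 is produced;
    total is the full utterance duration (ms)."""
    return sum(d) / (len(d) * total)

def average_lagging(d, total, ref_len):
    """Time-based AL: average lag behind an ideal evenly-paced translator
    that would emit ref_len tokens uniformly over the utterance."""
    # tau: first (1-based) token index at which the full source has been read
    tau = next(i for i, di in enumerate(d, start=1) if di >= total)
    ideal = total / ref_len  # ms of speech per token for the ideal translator
    return sum(d[i - 1] - (i - 1) * ideal for i in range(1, tau + 1)) / tau

d = [1000, 2000, 3000, 3000]  # ms heard when emitting tokens 1..4
print(average_proportion(d, total=3000))          # → 0.75
print(average_lagging(d, total=3000, ref_len=4))  # → 1250.0
```

Truncating the AL sum at τ(|x|) keeps the long tail of tokens emitted after the source has finished from diluting the measured lag.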

Experimental Results
This section displays our experimental results. To explicitly show the performance trend of models in different scenarios, we use line charts to display most of the compared results. Their corresponding numeric results can be found in Appendix A.

Translation Quality vs. Latency
We first evaluate our RealTranS model with our Wait-K-Stride-N simultaneous strategy on the three datasets. We compare three values of N: 1, 2, and 3 (with N=1 it reduces to conventional Wait-K). The results are displayed in Figure 5. RealTranS achieves higher BLEU scores as the value of K increases, at the cost of higher translation delay, consistent with prior works (Ren et al., 2020). Compared to conventional Wait-K (N=1), our model with N=2 achieves better BLEU scores under the same latency requirements, which demonstrates the effectiveness of our proposed Wait-K-Stride-N strategy. With N=3, the latency becomes higher, while the BLEU gains over N=2 are only marginal on the MUST-C En-Es and En-De datasets. Therefore, we use N=2 as our simultaneous strategy in later experiments unless otherwise stated.

Comparison with SimulSpeech
We compare with SimulSpeech (Ren et al., 2020), the state-of-the-art end-to-end model for SST. Figure 6 shows the performance comparison on the MUST-C En-Es dataset (we report tokenized case-sensitive BLEU scores following their settings).
Since they only report segment-based AL, we convert it to our time-based AL proportionally, based on the latency when K=inf. We find that our RealTranS model outperforms SimulSpeech in almost all latency settings, with an average of about 3 BLEU higher. Although SimulSpeech achieves relatively lower latency (e.g., less than 1000 ms AL when K=1), its performance inevitably suffers. Moreover, SimulSpeech leverages multi-task learning and knowledge distillation to enhance its performance, techniques which could also be applied to further improve our RealTranS model.

Comparison with Cascaded Model
We implement a cascaded model to compare with RealTranS under the same latency. Specifically, we combine our acoustic encoder (Section 3.2) and a Transformer decoder as the ASR model, and use the conventional Transformer encoder-decoder architecture as the NMT model. Their configuration is similar to RealTranS (e.g., the same hidden dimension and the same number of Transformer layers), and we train the ASR and NMT models on the same corpus. The conventional Wait-K strategy is used in the ASR model, since the alignment between speech and transcription is monotonic, while Wait-K-Stride-N is applied in the NMT model. Since several combinations of the ASR and NMT models may fall under the same latency, we report the best BLEU score among them. Table 2 shows the comparison results on the MUST-C En-Es dataset. We have the following observations: 1) our RealTranS model outperforms the cascaded model in all latency settings, which demonstrates the superiority of RealTranS; 2) the improvement over the cascaded model becomes larger when the value of K is smaller. This observation is consistent with Ren et al. (2020). We attribute this to the advantage of end-to-end models over cascaded models, where the impact of error propagation in cascaded models may be amplified when the latency is low.

Table 3: Shrinking quality of RealTranS. "CTC PT" indicates that the acoustic encoder is pre-trained with CTC. "BP" represents the blank penalty, and "Bi-Enc" means using bidirectional Transformer encoders. "Diff ≤ n" means that the difference between the length of the shrunk representations and that of the ground-truth transcription is less than or equal to n. We report the percentage of such cases on the MUST-C En-Es test set. The BLEU scores displayed are the results when K=inf.

Effects of Blank Penalty and Weighted Shrinking Operation
In this subsection, we examine the effects of our proposed methods, including blank penalty (Eq. 2) and weighted-shrinking operation (Eq. 3).
Blank Penalty. We propose a blank penalty to alleviate the inaccuracy of alignments between speech features and transcriptions when applying unidirectional encoders for SST (Section 3.2). To examine the effect, we evaluate the shrinking quality, i.e., the differences between the length of representations after the shrinking operation and that of the ground-truth transcription, on MUST-C En-Es dataset, and display the statistics in Table 3. We can see that the performance, as well as the shrinking quality, drops when removing CTC pre-training. It further decreases when removing the blank penalty, which shows its effectiveness. Also, we can see that the performance loss partly comes from using unidirectional encoders rather than bidirectional for simultaneous purposes (the 4th row), which can be compensated by the blank penalty.
Weighted-Shrinking. To validate our weighted-shrinking operation, we first investigate the effects of various values of the shrinking temperature μ in Eq. 3 and display the results in Figure 7(a). The results show that our weighted-shrinking mechanism (μ = 1.0) performs better than both simply averaging all the frames (μ = 0) and dropping blank frames ("DB").
We also try to replace our weighted-shrinking module with another Conv-Transformer block (see Section 3.2), resulting in a model with a 240 ms total downsampling rate (denoted as "4 blocks w/o shrink"). Figure 7(b) shows the results, together with a model with only 2 blocks (40 ms downsampling rate, denoted as "2 blocks with shrink"). We find that downsampling only with convolution layers performs worse than RealTranS, while a lower downsampling rate also hurts performance. This implies that there is an upper bound on downsampling with convolution layers alone while maintaining performance, and that our weighted-shrinking operation can be used in addition to further improve performance.

Ablation Study
We evaluate the contributions of different modules in RealTranS. Each module is evaluated in four latency settings: Wait-2-Stride-2, Wait-6-Stride-2, Wait-10-Stride-2, and Wait-Inf (full-sentence translation). The results are shown in Figure 8, where "-CTC PT" means we do not pre-train the encoder with the CTC loss, and "-BP" indicates that we further remove the blank penalty. "-Shrink" means removing the weighted-shrinking operation and the semantic encoder, while "-GD" denotes disabling the gradual downsampling by moving the Transformer layers in blocks 1 and 2 to block 3 (see Figure 2), so that downsampling happens in the first layers of the acoustic encoder. Finally, "-CTC" indicates that the CTC objective is removed. Figure 8 shows that all modules play a role in RealTranS. Specifically, we have the following observations: 1) The CTC module is important for improving translation quality, and it can be further improved by pre-training. 2) The blank penalty is useful in reducing latency while maintaining translation quality. 3) The blank penalty is essential for the shrinking operation, since on full-sentence translation the shrinking operation degrades performance without the blank penalty ("-BP" vs. "-Shrink"). 4) Gradual downsampling also contributes to the performance, as directly downsampling at a large rate may make it difficult to learn acoustic features.

Comparison in Full Sentence Translation
Although focusing on SST, RealTranS can also be applied to full-sentence ST. For a fair comparison, we replace the unidirectional Transformer layers in RealTranS with bidirectional ones, and report both case-insensitive and case-sensitive BLEU scores following prior works. Table 4 displays the BLEU scores compared with existing methods (we only compare with end-to-end models trained on the same data) on the Augmented LibriSpeech En-Fr and MUST-C En-De datasets. RealTranS yields competitive (En-Fr) or even better (En-De) results, even though most of these prior methods depend on extra techniques like pre-training decoders or using SpecAugment (Park et al., 2019). This validates the superiority of our proposed architecture.

Table 4: Full-sentence ST results (BLEU).
En-Fr: Transformer+KD 17.02; TCEN-LSTM 17.05; Curriculum PT (Wang et al., 2020c) 17.66; LUT (Dong et al., 2020b) 17.75; STAST 17.81; COSTT (Dong et al., 2020a) 17.83; Transformer+AFS.
En-De: (Inaguma et al., 2020) 22.91; STAST 23.06; RealTranS (ours) 23.53 / 22.99.

Conclusion
This work proposes a new end-to-end model, RealTranS, and a new strategy, Wait-K-Stride-N, for SST. RealTranS gradually bridges the modality gap between speech and text, and achieves new state-of-the-art results for SST. Empirical studies have shown that the proposed blank penalty for the CTC loss improves the alignment with the transcription, which reduces latency while maintaining translation quality. Our weighted-shrinking operation, as well as the Wait-K-Stride-N simultaneous strategy, further improves the performance. We also compare RealTranS with other methods on full-sentence translation, where RealTranS still exhibits competitive results, showing its superiority.