AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation

In end-to-end speech translation, acoustic representations learned by the encoder are usually fixed and static from the perspective of the decoder, which is not desirable for dealing with the cross-modal and cross-lingual challenge in speech translation. In this paper, we show the benefits of varying acoustic states according to decoder hidden states and propose an adaptive speech-to-text translation model that is able to dynamically adapt acoustic states in the decoder. We concatenate the acoustic state and target word embedding sequence and feed the concatenated sequence into subsequent blocks in the decoder. In order to model the deep interaction between acoustic states and target hidden states, a speech-text mixed attention sub-layer is introduced to replace the conventional cross-attention network. Experiment results on two widely-used datasets show that the proposed method significantly outperforms state-of-the-art neural speech translation models.


Introduction
Speech-to-text translation (ST) aims at translating source language speech into target language text. Approaches to ST can be roughly divided into two categories: end-to-end ST and cascaded ST. Early research on ST primarily used a cascaded model that combines an automatic speech recognition (ASR) module with a machine translation (MT) component, both usually trained independently on speech and parallel corpora (Ney, 1999; Matusov et al., 2005). In contrast, end-to-end ST, which directly translates the speech of the source language into text of the target language (Berard et al., 2016), not only avoids error propagation in the ASR-MT pipeline, but also greatly reduces inference latency.
However, despite these advantages, end-to-end ST is confronted with its own challenging problems: performing cross-modal translation and cross-lingual conversion in one shot. On the one hand, compared with text-to-text translation, end-to-end ST has to deal with acoustic inputs which are typically longer than their corresponding text inputs. This makes the cross-modal source-target dependencies more complicated. On the other hand, compared with monotonic ASR, end-to-end speech translation usually handles non-monotonic cross-lingual conversion.
Generally, end-to-end ST uses the seq2seq encoder-decoder framework (Sutskever et al., 2014) as the backbone for training and inference, where the encoder computes hidden states layer by layer according to speech inputs. The decoder yields target translations word by word by attending to the encoder hidden states, which stay fixed once computed. Since the hidden states of the encoder are static, information only flows in one direction: from the encoder to the decoder. Given the cross-modal and cross-lingual challenge in end-to-end ST, we argue that more sophisticated interaction between the encoder and decoder would be desirable.
In this paper, we propose an adaptive ST (AdaST) model that incorporates acoustic states into the decoder for modeling the deep interaction between the encoder and decoder for end-to-end ST. We enable AdaST to dynamically adapt encoder states in the decoder when target hidden states are updated layer by layer. It also learns to represent speech and text in one shared space in the decoder for mitigating the cross-modal issue.
Our contributions can be summarized as follows:
• We present AdaST, a new architecture for end-to-end ST, which learns representations of two modalities (textual and audio) in one shared space in the decoder.
• We conduct experiments to validate the effectiveness of AdaST. Our experiments and analyses disclose that dynamically adaptive acoustic representations are more desirable than static acoustic states for end-to-end ST.

... (Wang et al., 2020b), and two-pass decoding (Sung et al., 2019), have also been studied in end-to-end speech translation. To solve the cross-modal and cross-lingual challenges of end-to-end speech translation, Wang et al. (2020a) and Dong et al. (2020) propose to use submodules to separately analyze cross-modal and cross-lingual problems in end-to-end ST. Each module introduced solves one problem. Unfortunately, they introduce a large number of extra parameters and rely on a large amount of external data to pre-train each submodule. In contrast, we do not introduce any additional submodules and therefore we do not need external data for pretraining.

The AdaST Model
In this section, we first introduce the widely-used CNN + Transformer structure as the strong baseline for end-to-end ST. After that, we elaborate on the proposed AdaST model.

Baseline ST Model
The CNN + Transformer end-to-end ST model consists of a speech encoder and a translation decoder. The basic building unit of Transformer (Vaswani et al., 2017) is the self-attention mechanism, which can be formulated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$ and $V$ are the query, key and value matrices and $d_k$ is the dimension of the keys. The speech encoder is composed of $N_c$ CNN layers for encoding acoustic signals and $N_e$ Transformer encoder layers stacked over the CNN layers. The translation decoder consists of $N_d$ Transformer decoder layers.
The CNN module subsamples acoustic features to fit them into the subsequent Transformer encoder layers. The Transformer encoder layers then learn encoder states from the output of the CNN module, which are fixed during decoding. That is to say, the Transformer decoder layers attend to static Transformer encoder hidden states for yielding target words.
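To make the "static encoder states" point concrete, the following is a minimal NumPy sketch of the baseline cross-attention step. It is an illustration under simplifying assumptions, not the authors' implementation: it uses a single head, omits the learned query/key/value projections, and all names (`scaled_dot_product_attention`, `enc_states`, `dec_states`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (n_q, d), k: (n_k, d), v: (n_k, d_v); mask is additive, (n_q, n_k)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores + mask
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(6, 8))  # encoder output: computed once, then fixed
dec_states = rng.normal(size=(3, 8))  # decoder hidden states for 3 target tokens

# Baseline cross-attention: decoder states are the queries, the static
# encoder states serve as keys and values and are never updated here.
context, weights = scaled_dot_product_attention(dec_states, enc_states, enc_states)
print(context.shape, weights.sum(axis=-1))
```

In this baseline, `enc_states` is read at every decoder layer but never rewritten, which is the one-directional information flow the paper argues against.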

AdaST
As shown in Figure 1, our proposed AdaST uses the same speech encoder as the baseline ST model. The significant difference lies in the decoder. In order to make acoustic states dynamically adaptive to decoder states in each layer, we concatenate the hidden acoustic state sequence generated from the last layer of the speech encoder with the target word embedding sequence and feed the concatenated sequence into the subsequent decoder blocks. The concatenated input sequence is combined with positional encoding, similar to the vanilla Transformer decoder. In addition to positional encoding, we also adopt modality embeddings, which are defined in an embedding matrix of size 2 × c (c is the dimension of attention) and are added to the input sequence to distinguish the target textual tokens from the source acoustic features. Modality embeddings have also been used in other cross-modal tasks, e.g., ViLBERT for vision-text multimodal pretraining. Our experiments show that using modality embeddings in our model can slightly improve translation quality.
In the decoder, each block consists of a multi-head speech-text mixed attention sublayer and a feedforward sublayer. The multi-head speech-text mixed attention (STMA) is calculated over the concatenated sequence of src and tgt under a mask matrix Mask, where src and tgt represent the sequence of acoustic hidden states and target word embeddings respectively, and Mask is a predefined matrix which serves as an indicator controlling which positions of the acoustic and target sequence are visible to attention heads, similar to the look-ahead mask matrix used in Transformer to prevent the decoder from attending to future tokens.
In each decoder layer of the proposed AdaST, we divide the Mask matrix into four parts:

$$\mathrm{Mask} = \begin{bmatrix} M_{SS} & M_{ST} \\ M_{TS} & M_{TT} \end{bmatrix}$$

$M_{SS}$ represents the self-attention mask matrix of the acoustic states, which is the same as the one used in the encoder. $M_{ST}$ is the mask matrix for the attention from acoustic states to target hidden states. During parallel training, target hidden states must not be visible to source acoustic states, so we set all values of $M_{ST}$ to minus infinity to forbid such attention. $M_{TS}$ denotes the mask matrix for attention from target hidden states to acoustic states. Values in $M_{TS}$ are the same as in the mask matrix used for the cross-attention in the baseline ST. $M_{TT}$ is the mask matrix for self-attention over target hidden states, which is the same as the mask matrix used for self-attention in the baseline ST decoder.
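The input construction described above can be sketched as follows. This is a hedged illustration, not the authors' code: sinusoidal positional encoding and a single position index running over the whole concatenated sequence are assumptions (the text only says positional encoding is applied "similar to the vanilla Transformer decoder"), and the names `build_decoder_input` and `modality_emb` are hypothetical.

```python
import numpy as np

def sinusoidal_positions(length, dim):
    # Standard sinusoidal positional encoding (Vaswani et al., 2017).
    pos = np.arange(length)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def build_decoder_input(acoustic, tgt_emb, modality_emb):
    """acoustic: (n_src, c) states from the last encoder layer,
    tgt_emb: (n_tgt, c) target word embeddings,
    modality_emb: (2, c) matrix; row 0 = speech, row 1 = text."""
    n_src, c = acoustic.shape
    n_tgt = tgt_emb.shape[0]
    x = np.concatenate([acoustic, tgt_emb], axis=0)      # (n_src + n_tgt, c)
    modality_ids = np.array([0] * n_src + [1] * n_tgt)   # which modality each row is
    x = x + sinusoidal_positions(n_src + n_tgt, c) + modality_emb[modality_ids]
    return x, modality_ids

rng = np.random.default_rng(0)
c = 256                                   # attention dimension used in the paper
acoustic = rng.normal(size=(50, c))
tgt_emb = rng.normal(size=(7, c))
modality_emb = rng.normal(size=(2, c))    # the 2 x c modality embedding matrix
x, modality_ids = build_decoder_input(acoustic, tgt_emb, modality_emb)
print(x.shape)
```

The resulting sequence is what every decoder block would consume, so both the acoustic rows and the text rows are rewritten layer by layer.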
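For orientation, one plausible form of the STMA computation for a single head is given below. It is reconstructed from the surrounding description (concatenated input, additive mask, one softmax over both modalities) and is an assumption rather than the equation from the source; the projection matrices $W^Q$, $W^K$, $W^V$ and the symbol $H$ are introduced here for illustration only.

```latex
\begin{align}
H &= [\mathrm{src};\, \mathrm{tgt}], \qquad
Q = H W^{Q},\quad K = H W^{K},\quad V = H W^{V} \\
\mathrm{STMA}(\mathrm{src}, \mathrm{tgt}) &=
  \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + \mathrm{Mask}\right) V
\end{align}
```

Under this reading, the multi-head variant would concatenate the per-head outputs as in the standard Transformer.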
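The four-block mask and the single softmax over both modalities can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: no padding (so the acoustic self-attention and target-to-acoustic blocks are all zeros), a single head without learned projections, and a large negative constant standing in for minus infinity. The function names are hypothetical.

```python
import numpy as np

NEG_INF = -1e9  # finite stand-in for minus infinity

def build_mask(n_src, n_tgt):
    m_ss = np.zeros((n_src, n_src))                        # acoustic -> acoustic: visible
    m_st = np.full((n_src, n_tgt), NEG_INF)                # acoustic -> target: forbidden
    m_ts = np.zeros((n_tgt, n_src))                        # target -> acoustic: visible
    m_tt = np.triu(np.full((n_tgt, n_tgt), NEG_INF), k=1)  # target -> target: causal
    return np.block([[m_ss, m_st], [m_ts, m_tt]])

def mixed_attention(h, mask):
    """h: (n_src + n_tgt, d) concatenated acoustic and target states."""
    scores = h @ h.T / np.sqrt(h.shape[-1]) + mask
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # one softmax per row
    return weights @ h, weights

n_src, n_tgt, d = 5, 3, 8
rng = np.random.default_rng(0)
h = rng.normal(size=(n_src + n_tgt, d))
mask = build_mask(n_src, n_tgt)
out, weights = mixed_attention(h, mask)
print(out.shape)
```

Each target row normalizes its attention jointly over all acoustic positions and the target positions up to and including itself, while each acoustic row only distributes weight over acoustic positions.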
The proposed AdaST benefits from the following features. First, the acoustic states and decoder hidden states are unified into a shared semantic space. Second, the acoustic states at each decoder layer change accordingly after the calculations at the current layer are performed. Third, instead of calculating softmax for self-attention and then calculating softmax for cross-attention in the baseline ST, the neural representations in the AdaST decoder are updated by calculating a single softmax over both acoustic states and hidden states for target words. With these changes, we hope to mitigate the cross-modal and cross-lingual challenges in end-to-end ST.

Experiments
We conducted experiments to examine the proposed AdaST model.

Datasets
We used two datasets that are widely adopted to evaluate end-to-end ST: IWSLT18 En-De and Augmented Librispeech En-Fr (Berard et al., 2018).
Augmented Librispeech English-French. The corpus provides triples for each instance: English speech signal, English transcription, and French text translation from the aligned e-books. Following Wang et al. (2020b), we only used the 100 hours of clean data for training, with 2 hours of data as the development set and 4 hours as the test set, which corresponds to 47,271, 1,071 and 2,048 utterances respectively. To be consistent with their settings, we also doubled the training data by concatenating the aligned references with pseudo translations produced by Google Translate.
IWSLT18 English-German. The IWSLT18 speech translation dataset is from TED Talks and contains 271 hours of speech with 171K corresponding English transcripts and German translations. As there is no validation set in this dataset, we randomly sampled 2,000 samples from the training data as our validation set. Following Wang et al. (2020b), we used tst2013 as the test set.

Settings
We built our model based on the ESPnet toolkit (Inaguma et al., 2020). On the two datasets, we extracted 80-dimensional Fbank features from audio files, setting the step size to 10ms and the window size to 25ms. We removed utterances with more than 3000 frames and utterances with poor alignments. Following Wang et al. (2020a), we adopted speed perturbation with factors 0.9 and 1.1. To further reduce overfitting, we used the SpecAugment strategy (Bahar et al., 2019). On Librispeech, we used subword-level decoding, which was performed via SentencePiece with a vocabulary size of 1K tokens. On IWSLT18, we performed character-level decoding. As the tst2013 set of IWSLT18 is not aligned, we employed ESPnet's default LIUM SpkDiarization tool to segment each audio sequence. We used the RWTH toolkit (Bender et al., 2004) to calculate BLEU scores (Papineni et al., 2002).
A two-layer CNN was used in the speech encoder. The step size was set to 2. The size of the convolution kernel was 2 × 2. The dimension of the attention was set to 256. We used a 12-layer encoder. The number of decoder layers in both the baseline and AdaST was set to 10. We used the Adam optimizer (Kingma and Ba, 2015) and ran our models on four P100 GPUs.
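As a rough guide to what the 3000-frame cutoff means in seconds, the frame count for a given duration can be worked out from the 25ms window and 10ms step. The sketch below assumes no padding at the utterance edges (a frame is only emitted when a full window fits), which may differ from the actual feature extractor; `num_frames` is a hypothetical helper.

```python
def num_frames(duration_s, window_ms=25, step_ms=10):
    """Number of Fbank frames for an utterance, assuming no edge padding."""
    duration_ms = int(round(duration_s * 1000))
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // step_ms

for seconds in (10.0, 30.0, 30.03):
    print(seconds, num_frames(seconds))
```

Under this assumption, a 30-second utterance yields 2998 frames, so the 3000-frame limit corresponds to utterances of roughly 30 seconds.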
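The effect of the two-layer CNN on sequence length can be estimated with the usual convolution output-size formula. The sketch below is an illustration under assumptions the text does not state: no padding, and the stride of 2 applied to both the time and frequency axes. The helper names are hypothetical.

```python
def conv_out_len(n, kernel=2, stride=2, padding=0):
    # Standard output-length formula for a strided convolution.
    return (n + 2 * padding - kernel) // stride + 1

def subsampled_shape(n_frames, n_freq=80, layers=2):
    """Time and frequency sizes after stacking `layers` 2x2, stride-2 convolutions."""
    t, f = n_frames, n_freq
    for _ in range(layers):
        t, f = conv_out_len(t), conv_out_len(f)
    return t, f

print(subsampled_shape(3000))
```

With these assumptions, a 3000-frame input is reduced to 750 positions along the time axis, a factor of four, before reaching the Transformer encoder layers.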

Main Results
In order for each layer of the decoder to interact with acoustic states, our model requires additional computational overhead. However, the conventional source-to-target attention network in Transformer is subsumed in the decoder, which helps AdaST use fewer parameters than Transformer, hence partially offsetting the additional cost. Overall, the number of parameters in AdaST is 0.65 million fewer than that of the standard CNN+Transformer structure. On the augmented dataset, AdaST increased the training time by 11.7% and the inference time by 15.7%. We compared our work against previous state-of-the-art models and the ASR pretraining + MT fine-tuning method. Table 1 shows the results on the two datasets. We observe that the proposed AdaST is able to achieve improvements of +0.83 BLEU and +1.18 BLEU over the best baseline results on En-Fr and En-De translation, respectively. This demonstrates that our proposed method benefits end-to-end ST at both the character and subword level.

Table 1: BLEU scores on En-Fr and En-De.

Method                                              BLEU
En-Fr
LSTM ST (Berard et al., 2018)                       12.90
Transformer+ASR pre-train (Inaguma et al., 2020)    15.53
Transformer+ASR pre-train                           16.27
AdaST                                               17.10
En-De
Transformer+ASR pre-train (Inaguma et al., 2020)    13.12
Transformer+ASR pre-train (Wang et al., 2020b)      15.35
Transformer+ASR pre-train                           15.21
AdaST                                               16.39

We have also carried out experiments to compare against a standard CNN+Transformer model with a deeper encoder and decoder. Experiment results show that simply deepening either the encoder or the decoder of the standard structure is not helpful for speech-to-text translation.
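The improvements quoted for Table 1 can be re-derived from the table entries. The short check below uses only the BLEU values listed in the table; the dictionary keys are hypothetical labels, and "uncited" marks the Transformer+ASR pre-train rows that carry no citation in the table.

```python
table1 = {
    "En-Fr": {"LSTM ST (Berard 2018)": 12.90, "Transformer+ASR (Inaguma 2020)": 15.53,
              "Transformer+ASR (uncited)": 16.27, "AdaST": 17.10},
    "En-De": {"Transformer+ASR (Inaguma 2020)": 13.12, "Transformer+ASR (Wang 2020b)": 15.35,
              "Transformer+ASR (uncited)": 15.21, "AdaST": 16.39},
}

def gain(pair, baseline):
    # BLEU difference between AdaST and the named baseline row.
    return round(table1[pair]["AdaST"] - table1[pair][baseline], 2)

print(gain("En-Fr", "Transformer+ASR (uncited)"))     # 0.83
print(gain("En-De", "Transformer+ASR (uncited)"))     # 1.18
print(gain("En-De", "Transformer+ASR (Wang 2020b)"))  # 1.04
```

The quoted +0.83 and +1.18 figures match the gaps to the uncited Transformer+ASR pre-train rows; measured against the 15.35 BLEU reported by Wang et al. (2020b), the En-De gap is 1.04.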

Analysis
We conducted further experiments and analyses to take a deep look into our proposed method.

Only Cross-modal or Cross-lingual Challenge
In order to investigate whether our proposed architecture is helpful for a task with only the cross-modal or only the cross-lingual challenge, we also conducted experiments on automatic speech recognition (ASR) and machine translation (MT) tasks with the proposed method on the Augmented Librispeech dataset. Experimental results in Table 2 show that the performance on the ASR task drops, while the performance on the MT task improves slightly. This suggests that the proposed architecture is more appropriate for dealing with cross-lingual and cross-modal challenges at the same time.

Adaptive vs. Static Acoustic States
We assume that dynamically adapting representations of acoustic states in accord with hidden decoder states at each decoder layer will be of great help to end-to-end ST. In order to examine this hypothesis, we add an additional self-attention at each encoder layer in the baseline ST, which forces acoustic states at the corresponding encoder layer to adapt to decoder hidden states. The results on the IWSLT18 dataset, as displayed in Table 3, validate this assumption. However, the additional self-attention substantially increases the number of parameters at each layer. By contrast, our AdaST does not introduce additional parameters at each layer to learn adaptive acoustic states, and it also achieves better performance.

Probing the Speech Encoder
We further compared the trained speech encoder of our AdaST against that of the baseline ST by evaluating speaker verification accuracy on the Fluent Speech Commands dataset (Lugosch et al., 2019) to investigate the change in the semantic information learned by the encoder. Generally, the more semantic information the encoder contains, the less audio information it learns and hence the lower classification accuracy it will obtain. We froze the parameters of these two speech encoders and added a linear classification layer on top of each encoder. Only the added classification layer is trained on the dataset mentioned above. Table 4 shows the classification accuracy results, where the baseline encoder achieves 74.2% while our encoder achieves 96.7%, substantially higher than the baseline encoder. This indicates that our encoder focuses on modeling the audio modality and passes the major task of modeling semantic information in speech inputs to the decoder. In contrast, the baseline encoder has to model both semantic and modality information of speech inputs, which weakens its modeling capacity for modality and therefore leads to much lower performance on speaker verification.
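The probing protocol (frozen encoder, trainable linear classifier on top) can be illustrated with a small self-contained NumPy sketch. Everything here is a stand-in: the "encoder" is a fixed random projection, the data are synthetic, and mean pooling over time before the linear layer is an assumption that the text does not specify. It only shows the mechanics of training the probe while leaving the encoder untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, n_utts, n_frames, feat_dim, enc_dim = 4, 80, 20, 16, 32

# Stand-in for a trained speech encoder; its weights are never updated below.
W_enc = rng.normal(size=(feat_dim, enc_dim)) / np.sqrt(feat_dim)
W_enc_before = W_enc.copy()

def frozen_encoder(x):
    return np.tanh(x @ W_enc)

# Synthetic utterances: each speaker has its own mean feature vector.
speaker_means = rng.normal(scale=2.0, size=(n_speakers, feat_dim))
labels = rng.integers(0, n_speakers, size=n_utts)
utts = speaker_means[labels][:, None, :] + rng.normal(size=(n_utts, n_frames, feat_dim))

# Mean-pool encoder states over time (assumption), then train only the probe.
feats = frozen_encoder(utts).mean(axis=1)        # (n_utts, enc_dim)
W = np.zeros((enc_dim, n_speakers))
b = np.zeros(n_speakers)
onehot = np.eye(n_speakers)[labels]
for _ in range(1000):
    logits = feats @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    grad = (probs - onehot) / n_utts             # gradient of mean cross-entropy
    W -= 0.1 * feats.T @ grad
    b -= 0.1 * grad.sum(axis=0)

accuracy = float(np.mean((feats @ W + b).argmax(axis=1) == labels))
print(f"probe training accuracy: {accuracy:.3f}")
```

Because only `W` and `b` receive gradient updates, any accuracy difference between two encoders probed this way reflects what their frozen representations already contain.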

Conclusions
In this paper, we have presented AdaST, a neural model dynamically adapting acoustic states in the decoder, which is able to mitigate the cross-lingual and cross-modal challenge for end-to-end speech translation. Experiments demonstrate that AdaST achieves an improvement of 1.18 BLEU points over state-of-the-art neural speech translation models.