Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation

We study the possibility of building a non-autoregressive speech-to-text translation model using connectionist temporal classification (CTC), and use CTC-based automatic speech recognition as an auxiliary task to improve performance. CTC's success on translation is counter-intuitive due to its monotonicity assumption, so we analyze its reordering capability. Kendall's tau distance is introduced as a quantitative metric, and gradient-based visualization provides an intuitive way to take a closer look into the model. Our analysis shows that transformer encoders have the ability to change the word order, and we point out future research directions on non-autoregressive speech translation that are worth exploring further.


Introduction
Recently, a growing body of work has focused on end-to-end speech translation (ST) (Bérard et al., 2016; Weiss et al., 2017; Bérard et al., 2018; Vila et al., 2018; Di Gangi et al., 2019; Ran et al., 2019; Chuang et al., 2020). Instead of cascading a machine translation (MT) model with an automatic speech recognition (ASR) system, end-to-end models avoid the error bottleneck caused by ASR and are more computationally efficient. However, at inference time, an autoregressive (AR) decoder is needed to decode the output sequence, causing a latency issue.
In MT, non-autoregressive (NAR) models have been heavily explored recently (Gu et al., 2018; Lee et al., 2018; Ghazvininejad et al., 2019; Stern et al., 2019; Gu et al., 2019; Saharia et al., 2020) by leveraging the parallel nature of the transformer (Vaswani et al., 2017). In contrast, such models are rarely explored in the field of speech translation, except for a concurrent work (Inaguma et al., 2020). In this work, we use connectionist temporal classification (CTC) (Graves et al., 2006) to train NAR models for ST, without an explicit decoder module. Our entire model is merely a transformer encoder. Multitask learning (Anastasopoulos and Chiang, 2018; Kano et al., 2021) on ASR, which is often used in speech translation, can also be applied in our transformer encoder architecture to further improve performance. We achieve initial results on NAR speech translation using a single speech encoder; the source code is available (see Appendix A). CTC's success on the translation task is counter-intuitive because of its monotonicity assumption. Previous works directly adopt the CTC loss for NAR translation without further verification of the reordering capability of CTC (Libovickỳ and Helcl, 2018; Saharia et al., 2020; Inaguma et al., 2020). To better understand why a CTC-based model can perform the ST task, we analyze the reordering capabilities of ST models by leveraging Kendall's tau distance (Birch and Osborne, 2011; Kendall, 1938), and introduce a gradient-based visualization to provide additional evidence. To the best of our knowledge, this is the first work to examine reordering capabilities on the ST task.
We find that after applying multitask training, our model has a greater tendency to re-arrange target words into better positions that are not monotonically aligned with the audio input. We highlight that our contributions are 1) taking the first step toward translating a pure speech signal into target-language text in a NAR end-to-end manner, and 2) taking a closer look at why a NAR model with CTC loss can achieve non-monotonic mapping.

CTC-based NAR-ST Model
We adopt the transformer architecture for non-autoregressive speech-to-text translation (NAR-ST). The NAR-ST model consists of convolutional layers and self-attention layers. The audio sequence X is downsampled by the convolutional layers, and the self-attention layers generate the final translation token sequence Y based on the downsampled acoustic features. We use the CTC loss as the objective function to optimize the NAR-ST model. The CTC decoding algorithm allows the model to generate the translation in a single step.
CTC predicts an alignment between the input audio sequence X and the output target sequence Y by considering the probability distribution marginalized over all possible alignments.
The CTC loss function is defined as:

$\mathcal{L}_{CTC} = -\sum_{(x,y) \in D} \log P_{CTC}(y \mid x)$

where $x$ is the audio frame sequence, $y$ is the target sequence, and $D$ is the training set. CTC uses dynamic programming to marginalize out the latent alignments to compute the log-likelihood:

$\log P_{CTC}(y \mid x) = \log \sum_{a \in \beta(y)} \prod_{t=0}^{T-1} P(a_t \mid x)$

where $a = (a_t)_{t=0}^{T-1}$ is an alignment between $x$ and $y$ and is allowed to include a special "blank" token that is removed when converting $a$ to the target sequence $y$. Here $\beta(y)$ denotes the set of all alignments that collapse to $y$, and $\beta^{-1}(a)$ is the collapsing function such that $\beta^{-1}(a) = y$ if $a \in \beta(y)$.
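As a concrete illustration of the collapsing function $\beta^{-1}$ described above, the sketch below implements CTC greedy decoding: take the most probable token at each frame, merge consecutive repeats, and drop blanks. The blank-id convention (id 0) is an assumption for illustration, not taken from the paper's code.

```python
BLANK = 0  # assumed CTC "blank" token id

def ctc_collapse(alignment, blank=BLANK):
    """Map a frame-level alignment a to the target sequence y = beta^{-1}(a):
    merge consecutive repeated tokens, then remove blanks."""
    collapsed = []
    prev = None
    for token in alignment:
        if token != prev and token != blank:
            collapsed.append(token)
        prev = token
    return collapsed

def ctc_greedy_decode(frame_probs, blank=BLANK):
    """frame_probs: list of per-frame probability distributions over the vocabulary."""
    alignment = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    return ctc_collapse(alignment, blank)
```

Note that a blank between two identical tokens keeps them distinct, e.g. `ctc_collapse([3, 3, 0, 3])` yields `[3, 3]`, while `[3, 3, 3]` collapses to `[3]`.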
CTC makes strong conditional independence and monotonicity assumptions: the tokens in Y are generated independently, and there exists a monotonic alignment between X and Y. The monotonicity property is suitable for tasks such as ASR. However, in translation, there is no guarantee that the output sequence follows this assumption, as word order differs across languages. In this work, we examine whether the powerful self-attention-based transformer model can overcome this problem to some degree.

CTC-based Multitask NAR-ST Model
Multitask learning improves data efficiency and performance across various tasks (Zhang and Yang, 2017). In AR end-to-end ST, multitask learning is often applied with ASR as an auxiliary task (Anastasopoulos and Chiang, 2018; Sperber and Paulik, 2020). It requires an ASR decoder in addition to the ST decoder to learn to predict transcriptions while sharing the encoder. To perform multitask learning on the NAR-ST model, we propose to apply CTC-based ASR on a single, M-th layer of the model, as illustrated in Figure 1. This helps the NAR-ST model capture more information with a single CTC layer in an end-to-end manner, and the ASR output is not involved in the translation decoding process.
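The multitask setup above can be sketched structurally as follows. This is not the authors' ESPnet implementation: the per-layer functions, the placeholder `ctc_loss`, and the interpolation weight `alpha` are illustrative assumptions; only the idea of attaching an ASR CTC head at layer M while the final layer carries the ST CTC objective comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, M, D = 12, 8, 16          # encoder depth, ASR branch layer, hidden dim
W = [rng.standard_normal((D, D)) * 0.1 for _ in range(NUM_LAYERS)]

def encoder_layer(h, w):
    return np.tanh(h @ w)             # stand-in for a transformer encoder layer

def ctc_loss(h, targets):
    # Placeholder scalar loss; a real system would call a CTC routine here.
    return float(np.mean(h ** 2)) + 0.0 * len(targets)

def forward_multitask(x, asr_targets, st_targets, alpha=0.3):
    """Single forward pass: ASR CTC loss taken at layer M, ST CTC at the top."""
    h = x
    asr_loss = None
    for m, w in enumerate(W, start=1):
        h = encoder_layer(h, w)
        if m == M:                    # ASR CTC branch on the M-th layer only
            asr_loss = ctc_loss(h, asr_targets)
    st_loss = ctc_loss(h, st_targets) # ST CTC on the final layer
    return (1 - alpha) * st_loss + alpha * asr_loss
```

At inference, only the top (ST) head is decoded; the ASR branch exists purely as a training signal, matching the statement that the ASR output is not involved in translation decoding.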

Reordering Evaluation -Kendall's Tau Distance
We measure the degree of reordering by Kendall's tau distance (Kendall, 1938). LRscore (Birch and Osborne, 2011) also uses this distance, combined with lexical correctness; different from LRscore, we purely analyze the reordering capability rather than lexical correctness in this work. Given a sentence triplet (T, H, Y), where T = t_1, ..., t_{|T|} is the audio transcription, and H = h_1, ..., h_{|H|} and Y = y_1, ..., y_{|Y|} are the hypothesis and reference translation, respectively, an external aligner provides two alignments: π = π(1), ..., π(|T|) and σ = σ(1), ..., σ(|T|). π maps each source token t_k to a reference token y_{π(k)}, and σ maps t_k to a hypothesis token h_{σ(k)}. We follow the simplifications proposed in LRscore to reduce the alignments to a bijective relationship. The proportion of disagreements (discordant index pairs) between π and σ is:

$R(\pi, \sigma) = \frac{\left|\{(i,j) : i<j,\ (\pi(i)-\pi(j))(\sigma(i)-\sigma(j)) < 0\}\right|}{|T|(|T|-1)/2}$

Then, we define the reordering correctness R_acc by introducing the brevity penalty (BP):

$R_{acc} = BP \cdot \big(1 - R(\pi, \sigma)\big)$

Table 1: BLEU on the Fisher Spanish dataset and CALLHOME (CH) dataset, including autoregressive and non-autoregressive models. The abbreviation b stands for the beam size in beam-search decoding. Multitask learning (MTL) denotes using ASR as an auxiliary task trained jointly with ST. For autoregressive models, the auxiliary loss is always applied on the final encoder output in MTL; for NAR models, we apply it on different layers.
The higher the value, the more similar the two given alignments. Ideally, a well-trained model could handle the reordering problem by making σ close to π, resulting in R_acc = 1.
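A minimal sketch of the metric above: `kendall_tau_distance` counts the proportion of discordant index pairs between two alignments, and the correctness multiplies (1 − distance) by a brevity penalty. The BLEU-style exponential form of BP is our assumption for illustration.

```python
from math import exp

def kendall_tau_distance(pi, sigma):
    """Proportion of index pairs on which alignments pi and sigma disagree."""
    n = len(pi)
    assert len(sigma) == n and n > 1
    discordant = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )
    return discordant / (n * (n - 1) / 2)

def brevity_penalty(hyp_len, ref_len):
    # BLEU-style brevity penalty (assumed form).
    return 1.0 if hyp_len >= ref_len else exp(1 - ref_len / hyp_len)

def reordering_correctness(pi, sigma, hyp_len, ref_len):
    """R_acc = BP * (1 - distance); equals 1 for identical alignments and no penalty."""
    return brevity_penalty(hyp_len, ref_len) * (1 - kendall_tau_distance(pi, sigma))
```

For example, identical alignments give a distance of 0 (correctness 1 when lengths match), while a fully reversed alignment gives a distance of 1.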

Experimental Setup
We use the ESPnet toolkit (Watanabe et al., 2018) for experiments. We perform Spanish speech to English text translation with the Fisher Spanish corpus. The test sets of the CALLHOME corpus are also included for evaluation. The dataset details and download links are listed in Appendix B.
The NAR-ST model consists of two convolutional layers followed by twelve transformer encoder layers. Knowledge distillation (Kim and Rush, 2016) is also applied. More training details and parameter settings can be found in Appendix C.

Translation Quality and Speed
We use BLEU (Papineni et al., 2002) to evaluate translation quality, as shown in Table 1. Beam-search decoding with beam size b is used for the AR models in these experiments. Greedy decoding is always used for the NAR models.
In the results for AR models (part (A)), multitask learning (MTL) yields better performance than the model trained without an auxiliary task (row (b) vs. (a)). Further improvement comes from using a pre-trained ASR encoder as the weight initialization (row (c) vs. (b)). This shows that using ASR data for MTL and for initialization are essential steps to achieve strong performance. The performance drops as the beam size decreases, showing a trade-off between decoding speed and performance (row (c) vs. (d)(e)).
NAR-ST provides a great solution for shorter decoding time (part (B)). NAR-ST models are 28.9× faster than the AR model with beam size 10 (part (B) vs. rows (a)-(c)) and 3.4× faster than the AR model with greedy decoding (part (B) vs. row (e)). We initialize the NAR-ST models with weights pre-trained on the ASR task and apply the proposed MTL approach on different intermediate layers (rows (g)-(j)). As the results in part (B) show, applying MTL on higher layers improves the performance (rows (i)(j) vs. (f)). This indicates that the speech signal needs more layers to model its complexity, and that selecting the right intermediate layer on which to apply MTL is essential. We also evaluate the ASR results of the MTL models in Appendix D.
Some text-based refinement approaches could further improve the translation quality (Libovickỳ and Helcl, 2018; Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Chan et al., 2020). We leave these as future work and focus on analyzing the reordering capability of the CTC-based model.

Word Order Analysis
In this section, we discuss the word ordering problem in the translation task. We use R_acc, defined in Section 2.3, to systematically evaluate reordering correctness across the corpora. In addition, we examine the gradient norms in the models to visualize the reordering process.
Quantitative Analysis. We use SimAlign (Sabet et al., 2020) to align the transcriptions and translations, with details in Appendix E. We report R_acc evaluated on the ST models, including the correctness of a random permutation as a baseline. The AR models obtain high R_acc scores (rows (b)(c)), showing that AR models can handle complex word order. The NAR models also have the ability to rearrange words (rows (d)(e) vs. (a)) but are weaker than AR models due to the independence assumption introduced by CTC. An interesting observation is that applying MTL tends to improve R_acc (rows (c)(e) vs. (b)(d)). We conclude that the monotonic nature of ASR improves the stability of ST training (Sperber et al., 2019).
To investigate the relation between model performance and reordering difficulty, we measure the reordering difficulty by R_π = R(π, m), where m = 1, ..., |T| is a dummy monotonic alignment. We split all the testing data (dev/dev2/test) into smaller equal-size groups by the reference R_π. The BLEU scores for these groups are plotted in Figure 2. AR models are clearly more robust to higher reordering difficulty. Nonetheless, we observe that when MTL is applied at layer 8, the CTC model is more robust to reordering difficulty and in some cases (R_π < 0.07) even comes close to the AR model without ASR pretraining.

Gradient-based Visualization
We consider the gradient norm as an approximate indicator of reordering in our model. For each output token h_i, we concatenate the relative influence on it across all layers, which yields a matrix O_i ∈ R^{|X|×L}, where each row is a frame and each column is a layer. We refer to this as the reordering matrix for token h_i. We leave the computational details to Appendix H. Figure 3 shows a reordering matrix for the token _thing. We can observe that the single-task CTC model (Figure 3, right) tends to keep focusing on

Concluding Remarks
We propose a CTC-based NAR-ST model with an auxiliary CTC-based ASR task and are the first to study the reordering capability of CTC-based NAR-ST models. R_acc is adopted to analyze reordering in the ST task, and gradient-based visualizations reveal the internal workings of the models. Besides trying to improve BLEU scores, we encourage future research on NAR models to also evaluate whether NAR models have inferior reordering capabilities, in order to close the gap between AR and NAR models.

Broader Impact and Ethical Considerations
We believe that our work can help researchers in the NLP community understand more about the non-autoregressive speech translation models, and we envision that the model proposed in this paper will equip the researchers with a new technique to perform better and faster speech translation. We do not see ourselves violating the code of ethics of ACL-IJCNLP 2021.

A Source Code
Please download our code at https://github.com/voidism/NAR-ST and follow the instructions written in README.md to reproduce the results.

B.1 Statistics
The data statistics are listed in Table 3.

B.2 Preprocessing
We use ESPnet to preprocess our data. For text, we use Byte Pair Encoding (BPE) (Sennrich et al., 2016) with a vocabulary size of 8,000. We convert all text to lowercase and remove punctuation. For audio, we convert all audio files into WAV files with a sampling frequency of 16,000 Hz, and extract 80-dimensional filterbank features without deltas. We use SpecAugment (Park et al., 2019) to augment our data. More details can be found in our source code (see Appendix A).
C Training Details

C.1 Computing Infrastructure and Runtime
We use a single NVIDIA TITAN RTX (24G) for each experiment. The average runtime of experiments in Table 1 is 2-3 days for both autoregressive and non-autoregressive models.

C.2 Hyperparameters
Our training hyperparameters are listed in Table 4. We do not conduct a hyperparameter search but follow the best autoregressive ST setting in the ESPnet toolkit (Watanabe et al., 2018), and use the same hyperparameters for our non-autoregressive models. Due to the limited budget, we run each experiment once. At the inference stage of the CTC-based models, we simply use greedy decoding to produce the output sequences.

C.3 Knowledge Distillation
To perform sequence-level knowledge distillation (Seq-KD) (Kim and Rush, 2016) to improve the performance of NAR models, we first trained an autoregressive transformer-based MT model on the transcriptions and translations of the same training set with ESPnet. We then used the trained model to produce hypotheses with a beam size of 1 for the whole training set, and swapped the ground-truth sequences with these hypotheses for all NAR ST model training. We show ablation results on knowledge distillation in Table 5. We also tried using the autoregressive ST model to produce the hypotheses for Seq-KD, but the results are not as good as using an MT model, as shown in the second row of Table 5. The download links to the MT/ST decoding results for conducting Seq-KD are also provided in the README.md file in our source code (see Appendix A).

C.4 Model Selection
When evaluating the models, we average the model checkpoints of the final 10 epochs to obtain the final model for the NAR experiments. For the AR experiments, we follow the original ESPnet setting and average the 5 best-validated model checkpoints to obtain the final model.
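Checkpoint averaging as described above can be sketched as follows; the final model's parameters are the element-wise mean over the selected checkpoints. Checkpoints are represented here as plain name-to-list dicts for simplicity, rather than real framework state dicts.

```python
def average_checkpoints(checkpoints):
    """Average a list of {param_name: [values]} dicts element-wise.

    All checkpoints are assumed to share the same parameter names and shapes,
    as is the case for snapshots of one model taken at different epochs."""
    assert checkpoints, "need at least one checkpoint"
    averaged = {}
    for name in checkpoints[0]:
        stacked = [ckpt[name] for ckpt in checkpoints]
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*stacked)]
    return averaged
```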

C.5 Model Size
The number of parameters of our CTC model is 18.2M. The number of parameters of the autoregressive model is 27.9M.

D ASR Results

We compute the Word Error Rate (WER) for the ASR output obtained from the intermediate ASR branch of our proposed models. The results are shown in Table 6. We observe that when multitask learning is applied at higher layers, the WER becomes lower, indicating that ASR needs more layers to perform better. However, the best ST scores are achieved by CTC+MTL@8 instead of CTC+MTL@10. This may be because only two transformer encoder layers remain for CTC+MTL@10 to perform ST: converting information from the source language to the target language in two encoder layers may be too difficult for the model, even though the lower WER indicates that useful information is provided for ST.
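The WER reported in Table 6 is the word-level Levenshtein (edit) distance between hypothesis and reference, normalized by the reference length. A standard implementation is sketched below for reference; this is an illustrative version, not the ESPnet scoring script.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```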

E SimAlign Setup
We use the released code 2 by the authors of SimAlign as the external aligner to obtain word alignments used for calculating reordering metrics. SimAlign uses contextualized embeddings from pretrained language models, and there are several proposed algorithms to do word alignments. We use XLM-R (Conneau et al., 2019) as the underlying contextualized word embeddings with the itermax matching algorithm.

F Reordering Difficulty
We provide the reordering difficulty measured on all en-xx language pairs in the CoVoST2 dataset in Table 7.

G Details on Figure 2

In Figure 2, the primary goal is to view the relation between reordering difficulty and model performance. We describe the method from (Birch and Osborne, 2011) used to represent the reordering difficulty as follows: for each example in the Fisher dev and test sets, we calculate Kendall's tau distance between 1) its reference alignment (the alignment between the source transcription and the reference translation) and 2) a dummy monotonic alignment, which is simply the sequence 1...m. Intuitively, this shows how much the reference alignment disagrees with the monotonic alignment, and hence the reordering difficulty. Next, we divide all examples into 10 bins, where each bin contains examples with similar reordering difficulty and all bins have an equal number of examples. Finally, we calculate the BLEU score of the hypotheses in each bin. The result is plotted in Figure 2.
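The binning procedure above can be sketched as follows, assuming a list of per-example difficulty scores; scoring each bin (e.g. with BLEU) would then be done per bin. The function name and data layout are illustrative.

```python
def equal_size_bins(examples, difficulty, num_bins=10):
    """Sort examples by difficulty and split them into num_bins equal-size
    bins of increasing reordering difficulty (leftover examples dropped)."""
    assert len(examples) == len(difficulty)
    order = sorted(range(len(examples)), key=lambda i: difficulty[i])
    size = len(examples) // num_bins
    return [
        [examples[i] for i in order[b * size:(b + 1) * size]]
        for b in range(num_bins)
    ]
```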

H Gradient-based Visualization
We first obtain a saliency matrix J_M ∈ R^{|X|×|X|} for the M-th transformer layer by computing the gradient norm of the output logits w.r.t. the latent representation of each timestep in that layer. An example is shown in Figure 4. Then, we normalize J_M across the dimension corresponding to the source audio sequence. Intuitively, the i-th column of J_M can be interpreted as the relative influence of the representation at each position on the i-th output token. Consequently, we re-arrange J_M as follows: for each output token h_i, we concatenate the relative influence on it across all layers, which yields the reordering matrix for token h_i, denoted O_i ∈ R^{|X|×L}.
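A toy sketch of the saliency computation described above, under loud assumptions: a tiny two-layer stand-in model with a crude mean-pooling "mixing" step in place of self-attention, with gradients estimated by finite differences rather than backpropagation. All shapes, weights, and the model itself are illustrative, not the actual transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V = 5, 4, 3                       # source frames, hidden dim, vocab size
W1 = rng.standard_normal((D, D)) * 0.5
W_out = rng.standard_normal((D, V)) * 0.5

def from_layer(h):
    """Run from a layer's representations to per-frame output logits."""
    h2 = np.tanh(h @ W1)
    h2 = h2 + h2.mean(axis=0)           # crude stand-in for self-attention mixing
    return h2 @ W_out

def saliency_matrix(h, token_index, eps=1e-4):
    """J[t, i]: gradient norm of the logit for token_index at output frame i
    w.r.t. the representation of input frame t, via finite differences."""
    base = from_layer(h)[:, token_index]
    J_sq = np.zeros((T, T))
    for t in range(T):
        for d in range(D):
            bumped = h.copy()
            bumped[t, d] += eps
            partial = (from_layer(bumped)[:, token_index] - base) / eps
            J_sq[t] += partial ** 2     # accumulate squared partials per output frame
    return np.sqrt(J_sq)

h = rng.standard_normal((T, D))
J = saliency_matrix(h, token_index=0)
J = J / J.sum(axis=0, keepdims=True)    # normalize over the source-frame axis
```

After normalization, each column of `J` gives the relative influence of every source frame on one output position, which is the quantity stacked across layers to form the reordering matrix.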

I Reordering Matrix
We provide some additional examples of visible reordering in our CTC-based models. In Figure 5, "_we" is heavily influenced by the position of the audio signal "amos", even though the ASR output is incorrectly predicted as "as". In Figure 6, "_you" is also influenced by the audio signal "usted". It is interesting to observe that in some cases the pure CTC-based model appears more capable of reordering, while in others it does not.

J Higher Reordering Difficulty
We address instances of higher difficulty by analyzing Figure 7. In the figure, the horizontal axis corresponds to the reordering difficulty, and the vertical axis corresponds to the reordering correctness. There is a very consistent decrease in reordering correctness as reordering difficulty increases, and the rate of decrease is very similar between NAR and AR models. This observation suggests that when evaluated on distant language pairs, where the reordering difficulty is large, the gap between NAR and AR will probably remain roughly the same. We will conduct experiments on different language pairs to verify this claim in future work.