CTC-based Compression for Direct Speech Translation

Previous studies demonstrated that a dynamic phone-informed compression of the input audio is beneficial for speech translation (ST). However, they required a dedicated model for phone recognition and did not test this solution for direct ST, in which a single model translates the input audio into the target language without intermediate representations. In this work, we propose the first method able to perform a dynamic compression of the input in direct ST models. In particular, we exploit Connectionist Temporal Classification (CTC) to compress the input sequence according to its phonetic characteristics. Our experiments demonstrate that our solution brings a 1.3-1.5 BLEU improvement over a strong baseline on two language pairs (English-Italian and English-German), while also reducing the memory footprint by more than 10%.


Introduction
Speech translation (ST) is the task of converting utterances in one language into text in another language. Traditional approaches to ST consist of separate modules, each dedicated to an easier subtask, which are eventually integrated in a so-called cascade architecture (Stentiford and Steer, 1988; Waibel et al., 1991). Usually, its main components are an automatic speech recognition (ASR) model, which generates the transcripts from the audio, and a machine translation (MT) model, which translates the transcripts into the target language. A newer approach is direct ST, in which a single model performs the whole task without intermediate representations (Bérard et al., 2016; Weiss et al., 2017). The main advantages of direct ST systems are: i) access to information not present in the text (e.g. prosody, vocal characteristics of the speaker) during the translation phase, ii) reduced latency, iii) a simpler, easier-to-manage architecture (only one model has to be maintained), which iv) avoids error propagation across components.
In both paradigms (cascade and direct), the audio is commonly represented as a sequence of vectors obtained with a Mel filter bank. These vectors are collected at a high frequency, typically one every 10 ms. The resulting sequences are much longer than the corresponding textual ones (usually by a factor of ~10). The sequence length is problematic both for RNN (Elman, 1990) and Transformer (Vaswani et al., 2017) architectures. Indeed, RNNs fail to represent long-range dependencies (Bengio et al., 1993), and the Transformer has a memory complexity that is quadratic in the input sequence length, which makes training on long sequences prohibitive. For this reason, architectures proposed for direct ST/ASR reduce the input length either with convolutional layers (Bérard et al., 2018; Di Gangi et al., 2019) or by stacking and downsampling consecutive samples (Sak et al., 2015). However, these fixed-length reductions of the input sequence assume that all samples carry the same amount of information. This does not necessarily hold true, as phonetic features vary at different speeds in time and frequency in the audio signal.
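As a concrete illustration of the fixed-ratio alternative, the stacking-and-downsampling reduction can be sketched as follows (a minimal NumPy sketch in the spirit of Sak et al. (2015); the function name and the handling of trailing frames are our own illustrative choices, not taken from the cited implementations):

```python
import numpy as np

def stack_and_downsample(frames: np.ndarray, k: int = 4) -> np.ndarray:
    """Stack k consecutive filter-bank vectors into one wider vector,
    shortening the sequence by a fixed factor of k."""
    T, F = frames.shape
    T_trim = (T // k) * k  # drop trailing frames that do not fill a group
    return frames[:T_trim].reshape(T_trim // k, k * F)

# 10 s of audio at one 40-channel vector every 10 ms -> 1,000 frames
feats = np.random.rand(1000, 40)
print(stack_and_downsample(feats, k=4).shape)  # (250, 160)
```

Note that the reduction factor k is fixed in advance, regardless of the audio content; this is exactly the assumption that dynamic, content-based compression removes.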
Consequently, researchers have studied how to reduce the input length according to dynamic criteria based on the audio content. Salesky et al. (2019) demonstrated that a phoneme-based compression of the input frames yields significant gains compared to fixed length reduction. Phone-based and linguistically-informed compression also proved to be useful in the context of visually grounded speech (Havard et al., 2020). However, Salesky and Black (2020) questioned the approach, claiming that the addition of phone features without segmentation and compression of the input is more effective.
None of these works is a direct ST solution, as they all require a separate model for phone recognition and intermediate representations. So, they: i) are affected by error propagation (Salesky and Black (2020) indeed show that lower quality in phone recognition significantly degrades final ST performance), ii) have higher latency, and iii) have a more complex architecture. A direct model with phone-based multi-task training was introduced by Jia et al. (2019) for speech-to-speech translation, but they neither compared with a training using transcripts nor investigated dynamic compression.
In this paper, we explore the use of phones and dynamic content-based input compression for direct ST (and ASR). Our goal is an input reduction that, by limiting the amount of redundant/useless information, yields better performance and lower memory consumption at the same time. To this aim, we propose to exploit Connectionist Temporal Classification (CTC) (Graves et al., 2006) to add phone prediction as an auxiliary task in a multi-task training and to compress the sequence accordingly. To disentangle the contribution of the introduction of phone recognition from that of the compression based on it, we compare against similar trainings that leverage transcripts instead of phones. Our results show that phone-based multi-task training with sequence compression improves over a strong baseline by up to 1.5 BLEU points on two language pairs (English-German and English-Italian), with a memory footprint reduction of at least 10%.

CTC-based Sequence Compression
The CTC algorithm is usually employed for training a model to predict an output sequence of variable length that is shorter than the input one. This is the case of speech/phone recognition, as the input is a long sequence of audio samples, while the output is the sequence of uttered symbols (e.g. phones, sub-words), which is significantly shorter. In particular, for each time step, the CTC produces a probability distribution over the possible target labels augmented with a dedicated <blank> symbol representing the absence of a target value. These distributions are then exploited to compute the probabilities of different sequences, in which consecutive equal predictions are collapsed and <blank> symbols are removed. Finally, the resulting sequences are compared with the target sequence.
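The collapse rule can be sketched in a few lines (an illustrative toy implementation; the `ctc_collapse` name and the `<blank>` string are our own choices):

```python
def ctc_collapse(frame_labels, blank="<blank>"):
    """CTC collapse rule: merge runs of identical consecutive labels,
    then remove <blank> symbols."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:  # keep only the first label of each run
            out.append(lab)
        prev = lab
    return out

# A <blank> between the two l's preserves the double letter:
print(ctc_collapse(["h", "h", "e", "<blank>", "l", "l", "<blank>", "l", "o"]))
# -> ['h', 'e', 'l', 'l', 'o']
```

Note that the <blank> symbol separating two runs of the same label is what allows CTC to emit repeated symbols (as in "ll" above).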
Adding an auxiliary CTC loss to the training of direct ST and ASR models has been shown to improve performance (Kim et al., 2017; Bahar et al., 2019). In these works, the CTC loss is computed against the transcripts on the encoder output to favour model convergence. Generally, the CTC loss can be added to the output of any encoder layer, as in Figure 1, where the hyper-parameter N_CTC indicates the number of the layer at which the CTC is computed. Formally, the final loss function is:

L = CTC(E_{N_CTC}) + CE(D_{N_D})

where E_x is the output of the x-th encoder layer, D_{N_D} is the decoder output, CTC is the CTC loss function, and CE is the label-smoothed cross entropy. If N_CTC is equal to the number of encoder layers (N_E), the CTC input is the encoder output. We consider this solution as our baseline and we also test it with phones as target. As shown in Figure 1, our model is a Transformer whose encoder layers are preceded by two 2D convolutional layers that reduce the input size by a factor of 4. Therefore, the CTC produces a prediction every 4 input time frames. This sequence length reduction is necessary both to make training possible (otherwise out-of-memory errors would occur) and to ensure a fair comparison with modern state-of-the-art models. A logarithmic distance penalty (Di Gangi et al., 2019) is added to all the Transformer encoder layers.
Our proposed architecture is represented in Figure 2. The difference with the baseline is the introduction of an additional block (Collapse same predictions) that exploits the CTC predictions to compress the input elements (vectors). Hence, in this case the CTC does not only help model convergence: it also defines variable-length segments representing the same content. In this way, dense audio portions can be given more importance, while redundant/uninformative vectors can be compressed. This allows the following encoder layers and the decoder to attend to useful information without being "distracted" by noisy elements. The architecture is a direct ST solution, as there is a single model whose parameters are optimized together without intermediate representations. At inference time, the only input is the audio and the model produces the translation into the target language (while also generating the transcripts/phones with the CTC).
We compare three techniques to compress the consecutive vectors with the same CTC prediction:
• Average. The vectors to be collapsed together are averaged. As there is only a linear layer between the CTC inputs and its predictions, the vectors in each group are likely to be similar, so the compression should not remove much information.
• Weighted. The vectors are averaged but the weight of each vector depends on the confidence (i.e. the predicted probability) of the CTC prediction. This solution is meant to give less importance to vectors whose phone/transcript is not certain.
• Softmax. In this case, the weight of each vector is obtained by computing the softmax of the CTC predicted probabilities. The idea is to propagate information (nearly) only through a single input vector (the most confident one) for each group.
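The three policies can be sketched in a single function (an illustrative NumPy reimplementation under our own naming; the paper's actual implementation operates on batched tensors inside the encoder):

```python
import numpy as np

def compress(vectors, preds, confs, mode="avg"):
    """Collapse consecutive vectors sharing the same CTC prediction into one.
    vectors: (T, d) states entering the CTC layer; preds: T argmax labels;
    confs: T predicted probabilities of those labels (hypothetical inputs)."""
    out, start = [], 0
    for t in range(1, len(preds) + 1):
        # close the current group when the prediction changes (or at the end)
        if t == len(preds) or preds[t] != preds[start]:
            g, p = vectors[start:t], confs[start:t]
            if mode == "avg":            # uniform mean of the group
                w = np.full(len(g), 1.0 / len(g))
            elif mode == "weighted":     # confidence-normalized mean
                w = p / p.sum()
            else:                        # "softmax": (nearly) keep only the most confident vector
                e = np.exp(p - p.max())
                w = e / e.sum()
            out.append(w @ g)
            start = t
    return np.stack(out)

v = np.arange(8, dtype=float).reshape(4, 2)  # 4 vectors of size 2
print(compress(v, ["a", "a", "b", "b"], np.array([0.9, 0.1, 0.5, 0.5])).shape)  # (2, 2)
```

With "avg", each group simply becomes its centroid; "weighted" and "softmax" shift the result toward the frames the CTC is most certain about.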

Data
We experiment with MuST-C (Cattoni et al., 2021), a multilingual ST corpus built from TED talks. We focus on the English-Italian (465 hours) and English-German (408 hours) sections. For each set (train, validation, test), the corpus contains the audio files, the transcripts, the translations and a YAML file with the start time and duration of the segments. In addition, we extract the phones using Gentle. 1 Besides aligning the transcripts with the audio, Gentle returns the start and end time of each recognized word, together with the corresponding phones. For the words not recognized in the audio, Gentle does not provide the phones, so we look up their phonetic transcription in the VoxForge 2 dictionary. For each sample in the corpus, we rely on the YAML file and the alignments generated by Gentle to get all the words (and phones) belonging to it. The phones have a suffix indicating their position in a word (at the end, at the beginning, in the middle, or standalone). We also generated a version without the suffix (referred to as PH W/O POS in the rest of the paper). The resulting dictionaries contain respectively 144 and 48 symbols.

Experimental Settings
Our Transformer layers have 8 attention heads, 512 features for the attention and 2,048 hidden units in the FFN. We set a 0.2 dropout and include SpecAugment (Park et al., 2019) in our training. We optimize label-smoothed cross entropy (Szegedy et al., 2016) with a 0.1 smoothing factor using Adam (Kingma and Ba, 2015) (betas (0.9, 0.98)). The learning rate increases linearly from 3e-4 to 5e-3 for 4,000 updates, then decays with the inverse square root of the number of updates. As we train on 8 GPUs with mini-batches of 8 sentences and update the model every 8 steps, the resulting batch size is 512. The audio is pre-processed by performing speaker normalization and extracting 40-channel Mel filter-bank features per frame. The text is tokenized into subwords with 1,000 BPE merge rules (Sennrich et al., 2016).
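The learning-rate schedule described above can be sketched as follows (a hedged reimplementation of the recipe, not the exact Fairseq scheduler, which may differ in details such as the behaviour at step 0):

```python
def learning_rate(step, warmup=4000, lr_init=3e-4, lr_peak=5e-3):
    """Linear warm-up from lr_init to lr_peak over `warmup` updates,
    then inverse-square-root decay from the peak."""
    if step <= warmup:
        return lr_init + (lr_peak - lr_init) * step / warmup
    return lr_peak * (warmup / step) ** 0.5

print(learning_rate(0))      # 0.0003  (initial rate)
print(learning_rate(4000))   # 0.005   (peak at the end of warm-up)
print(learning_rate(16000))  # 0.0025  (peak / sqrt(16000/4000))
```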
As having more encoder layers than decoder layers has been shown to be beneficial (Potapczyk and Przybysz, 2020), we use 8 Transformer encoder layers and 6 decoder layers for ASR, and 11 encoder and 4 decoder layers for ST, unless stated otherwise. We train until the model does not improve on the validation set for 5 epochs, and we average the last 5 checkpoints. Training was performed on K80 GPUs and lasted 48 hours (~50 minutes per epoch). Our implementation 3 is based on Fairseq (Ott et al., 2019).

ASR
We first tested whether ASR benefits from the use of phones and sequence compression. Table 1 shows that using phones instead of English transcripts (Baseline - 8L EN) as target of the CTC loss (8L PH), without compression, is beneficial. When compressing the sequence, there is little difference according to the target used (8L PH AVG, 8L PH W/O POS. AVG, 8L EN AVG). However, the compression causes a 0.3-0.5 WER degradation while saving 5-12% of RAM.
Moving the compression to earlier layers (4L PH AVG, 2L PH AVG) further decreases both the output quality and the RAM usage. We can conclude that compressing the input sequence harms ASR performance, but it might be useful when RAM usage is critical and can be traded off against performance.

Direct ST
In early experiments, we pre-trained the first 8 layers of the ST encoder with those of the ASR model, adding three adapter layers (Bahar et al., 2019). We realized that ASR pre-training was not useful (probably because the ASR and ST data are the same), so we report results without pre-training. As we want to ensure that our results are not biased by a poor baseline, we compare with , which uses the same framework and similar settings. 6 As shown in Table 2, our strong baseline (8L EN) outperforms  by 2 BLEU on en-it and 1.3 BLEU on en-de.
As in ASR, replacing the transcripts with phones as target for the CTC loss (8L PH) further improves the results by 0.5 and 1.2 BLEU, respectively. We first explore the introduction of the compression at different layers. Adding it to the 8th layer (8L PH AVG) enhances the translation quality by 0.6 (en-it) and 0.2 (en-de) BLEU, with the improvement on en-it being statistically significant over the version without CTC compression. Moving it to earlier layers (4L PH AVG, 2L PH AVG) causes performance drops, suggesting that many layers are needed to extract useful phonetic information.
Then, we compare the different compression policies: AVG outperforms (or matches) WEIGHTED and SOFTMAX on both languages. Indeed, the small weight these two methods assign to some vectors likely causes an information loss and prevents a proper gradient propagation for the corresponding input elements.
Finally, we experiment with different CTC targets, but both the phones without the position suffix (8L PH W/O POS. AVG) and the transcripts (8L EN AVG) lead to lower scores.
The different results between ASR and ST can be explained by the nature of the two tasks: extracting content knowledge is critical for ST but not for ASR, and compression can hide details that are not relevant to extract the meaning but are needed to generate precise transcripts. The RAM savings are higher than in ASR, as there are 3 more encoder layers. With compression on the 8th layer, they range from 11% to 23% for en-it and from 16% to 22% for en-de. By moving the compression to earlier layers, we can trade performance for RAM requirements, saving up to 50% of the memory.
We also tested whether the saved RAM can be used to add more layers and improve the translation quality. We added 3 encoder and 2 decoder layers: this (8L PH AVG (14+6L)) results in small gains (0.2 on en-it and 0.1 on en-de), but the additional memory required is also small (the RAM usage is still 10-16% lower than the baseline). The improvements are statistically significant with respect to the models without compression (8L PH) on both language pairs. When training on more data, though, the benefit of having deeper networks might be higher, and this solution allows increasing the number of layers without a prohibitive memory footprint. We leave this investigation for future work, as experiments on larger training corpora are out of the scope of this paper.

Table 2: Results using the CTC loss with transcripts and phones as target. AVG, WEIGHTED and SOFTMAX indicate the compression method. If none is specified, no compression is performed. The symbol "*" indicates improvements that are statistically significant with respect to the baseline; "†" indicates statistically significant gains with respect to 8L PH. Statistical significance is computed according to Koehn (2004) with α = 0.05. Scores in italics indicate the best models among those with an equal number of layers.