On Knowledge Distillation for Translating Erroneous Speech Transcriptions

Recent studies argue that knowledge distillation is promising for speech translation (ST) using end-to-end models. In this work, we investigate the effect of knowledge distillation on a cascade ST that uses automatic speech recognition (ASR) and machine translation (MT) models. We distill knowledge from a teacher model based on human transcripts to a student model based on erroneous transcriptions. Our experimental results demonstrate that knowledge distillation is beneficial for a cascade ST. Further investigation combining knowledge distillation and fine-tuning revealed that the combination brings consistent improvements for two language pairs: English-Italian and Spanish-English.


Introduction
Speech translation (ST) converts utterances in a source language into text in another language. Conventional ST systems, called cascade or pipeline ST, consist of two components: automatic speech recognition (ASR) and machine translation (MT). In a cascade ST, error propagation from the ASR to the MT seriously degrades the ST performance. On the other hand, a newer type of ST system, called end-to-end or direct ST, uses a single model to directly translate the source language speech into target language text (Bérard et al., 2016). Such an end-to-end approach is a new paradigm in ST and is attracting much research attention. However, a naive end-to-end ST without additional training signals, such as ASR subtasks, remains inferior to a cascade ST (Liu et al., 2018; Salesky and Black, 2020). Additionally, it requires parallel data of source language speech and target language text, which cannot be obtained easily in practice.
Recent ST studies have incorporated the techniques of cascade ST into end-to-end STs. Multitask training with an ASR subtask has been used successfully in end-to-end ST (Weiss et al., 2017; Anastasopoulos and Chiang, 2018; Sperber et al., 2019). Initializing an end-to-end ST with a pretrained ASR or MT has also become a common approach (Bérard et al., 2018; Bansal et al., 2019; Inaguma et al., 2020; Wang et al., 2020; Bahar et al., 2021).
In this work, we focus on the cascade approach due to its performance advantage over end-to-end STs. Another reason is that cascade ST models can be incorporated into end-to-end STs, as shown in the studies above.
During the training of the MT model in a cascade ST, we can use clean human transcripts of the source language speech as input. However, since the MT in a cascade ST always receives ASR output at inference time, ASR errors propagate to the MT model and cause translation errors. What if we instead train on the erroneous speech transcriptions produced by the ASR? The MT model would then be trained to translate erroneous transcriptions into correct text, which is not appropriate in general. One possible solution is to use both types of input (clean and erroneous transcriptions) for training, rather than just one. The question is how to use them: what is the proper training strategy for cascade STs? This is what we want to learn.
In this work, we address this problem by applying knowledge distillation to cascade STs. We distill the knowledge of a teacher model based on clean transcriptions to a student model based on erroneous transcriptions. We also investigate the joint use of knowledge distillation and fine-tuning. Experimental results revealed that knowledge distillation improved the robustness against ASR errors and that applying knowledge distillation after fine-tuning provided a more significant improvement.

Related work
Some ST studies have tackled the problem of ASR error propagation. N-best hypotheses (Zhang et al., 2004; Quan et al., 2005), confusion networks (Bertoldi and Federico, 2005; Bertoldi et al., 2007), and lattices (Matusov and Ney, 2010; Sperber et al., 2017a) were used to include ASR ambiguity in the ST process. Osamura et al. (2018) used the weighted sum of embedding vectors for ASR word hypotheses based on their posterior probabilities. Sperber et al. (2017b) and Xue et al. (2020) showed that translation accuracy on erroneous speech transcriptions can be improved by introducing pseudo ASR errors into the MT training data.
Knowledge distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) is a method of transferring knowledge from a teacher model to a student model. Typically, the student model is trained by minimizing the KL-divergence (Kullback and Leibler, 1951) between the output probability distributions of the teacher and student models (word-level KD). Sequence-level knowledge distillation (sequence-level KD) (Kim and Rush, 2016a) instead targets the token sequence generated by the teacher model using beam search. In our experiments, sequence-level KD outperformed word-level KD, and Kim and Rush (2016b) reported similar trends. Therefore, we refer to sequence-level KD simply as KD in the rest of this paper.
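As an illustration, word-level KD amounts to minimizing a KL-divergence between the teacher's and the student's per-step output distributions. The toy sketch below is ours, not from the paper: the distributions are hand-written stand-ins for real model outputs.

```python
import math

def word_level_kd_loss(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student), summed over time steps.

    teacher_probs, student_probs: lists of per-step probability
    distributions over a shared target vocabulary (toy values here;
    a real system would use softmax outputs of neural models).
    """
    loss = 0.0
    for p_t, p_s in zip(teacher_probs, student_probs):
        for p, q in zip(p_t, p_s):
            if p > 0:
                loss += p * math.log(p / max(q, eps))
    return loss

# Toy 3-word vocabulary, 2 time steps.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student_good = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
student_bad = [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
```

A student whose distributions are closer to the teacher's incurs a smaller loss, which is exactly the training signal word-level KD provides.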
The KD technique is prevalent in many applications of machine learning, including MT (non-autoregressive machine translation (Gu et al., 2017), simultaneous translation (Ren et al., 2020), etc.). Typically, it is used to distill knowledge from a larger teacher model to a smaller or faster student model. Recent works (Furlanello et al., 2018; Yang et al., 2018) have shown that the student model's accuracy can exceed that of the teacher model, even when the two models have the same size. KD has also been applied to ST. Gaido et al. (2020) applied KD to an end-to-end ST, using an MT model based on clean transcriptions as the teacher of the end-to-end ST model. Our work focuses on the application of KD to a cascade ST, using a teacher model based on clean transcripts for a student model that takes erroneous inputs.
Dakwale and Monz (2019) proposed distillation as a remedy for the effective use of noisy parallel data in machine translation. They first trained the teacher model only on high-quality, clean data. Then they fed the source side of the noisy parallel data into the teacher model and trained the student model to translate from the noisy source to the teacher's output. The main difference between their work and ours is that we have loosely equivalent source sentences (clean and erroneous transcriptions of the same utterance), which can be paired with the same target sentence. Therefore, the student model can be trained with more reliable objectives, obtained by feeding the clean transcriptions to the teacher model.

Cascade ST
Let W, X, and Y = (y_1, ..., y_L) (1 ≤ l ≤ L) denote sequences of the speech features in a source language, the corresponding transcribed source language tokens, and the translated target language tokens, respectively.
In a cascade ST, the ASR model is first trained on the (W, X) pairs. Then the MT model is trained to translate from X to Y. The loss function of the MT model, L_MT, is defined using cross entropy:

L_MT = - Σ_{l=1}^{L} Σ_{v∈V} 1(y_l = v) log P(y_l = v),   (1)

where P(y_l = v) is the posterior probability of candidate v in the target language vocabulary V at time l in Y, and 1(·) is the indicator function.
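The cross-entropy loss of Eq. 1 can be sketched in a few lines of plain Python. This is a toy illustration of the formula, not the paper's implementation:

```python
import math

def mt_cross_entropy(probs, target_ids):
    """L_MT of Eq. 1: negative log-likelihood of the reference
    tokens Y under the model's per-step posteriors.

    probs: list of per-step distributions over the vocabulary V.
    target_ids: reference token ids y_1 .. y_L.
    """
    return -sum(math.log(p_l[y_l]) for p_l, y_l in zip(probs, target_ids))

# Toy example: vocabulary of size 3, reference Y = (0, 2).
probs = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]
loss = mt_cross_entropy(probs, [0, 2])  # -(log 0.8 + log 0.6) ≈ 0.73
```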

Proposed method
When training the MT model, we can also use X̂, the output of the ASR model, instead of X. We call the model trained with the clean input X MT_clean (Fig. 1(a)) and the one trained with the ASR-based input X̂ MT_asr (Fig. 1(b)).

Joint use of KD and FT
To most effectively exploit both the clean input X and the ASR-based input X̂, we introduce two training techniques: KD and fine-tuning. In KD, the student model is trained using X̂ by minimizing the loss L_KD, whose target is the output sequence Ŷ = (ŷ_1, ..., ŷ_M) generated by the teacher model (Fig. 1). We use sequence-level KD, so L_KD is calculated by replacing L with M and l with m in Eq. 1. On the other hand, fine-tuning (FT) has been widely used for domain adaptation in MT (Sennrich et al., 2016a). Di Gangi et al. (2019c) showed that a model fine-tuned with ASR-based input becomes robust to erroneous ASR input while maintaining high performance on clean input. Following this finding, we employ FT for MT training. In FT, the student model with X̂, which inherits the parameters of the teacher model with X, is trained by minimizing L_MT (Fig. 1).
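The data flow of sequence-level KD for a cascade ST can be sketched as follows. This is a hypothetical, minimal sketch of the procedure only: ToyTeacher and ToyStudent are stand-ins (a lookup table and a memorizer, not real MT models), and all names are ours, not the paper's.

```python
def train_with_sequence_kd(teacher, student, clean_inputs, asr_inputs):
    """Sequence-level KD for a cascade ST: the teacher translates the
    clean transcripts X, and the student is trained to map the erroneous
    ASR transcripts X-hat to the teacher outputs Y-hat (minimizing L_KD)."""
    pseudo_targets = [teacher.translate(x) for x in clean_inputs]
    for x_hat, y_hat in zip(asr_inputs, pseudo_targets):
        student.train_step(x_hat, y_hat)
    return student

class ToyTeacher:
    """Stand-in for an MT model trained on clean transcripts."""
    def __init__(self, table):
        self.table = table
    def translate(self, src):
        return self.table[src]

class ToyStudent:
    """Stand-in student that simply memorizes its training pairs."""
    def __init__(self):
        self.memory = {}
    def train_step(self, src, tgt):
        self.memory[src] = tgt

teacher = ToyTeacher({"hello world": "ciao mondo"})
student = ToyStudent()
train_with_sequence_kd(teacher, student,
                       clean_inputs=["hello world"],
                       asr_inputs=["hello word"])  # simulated ASR error
```

The key point the sketch captures is that the student never sees the (possibly low-fidelity) human reference: its target is the teacher's translation of the clean transcript, paired with the erroneous input.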
In addition to the independent use of KD and FT, we examined their possible combinations:
• FT+KD. Apply the two techniques at the same time. Unlike regular FT, we use the loss L_KD instead of L_MT. Specifically, (1) the teacher model is trained with the clean input X and the loss L_MT; then (2) the student model is trained with the ASR-based input X̂ and the loss L_KD, inheriting the parameters of the teacher model.
• KD → FT. Apply FT after KD.
• FT → KD. Apply KD after FT.


Experimental setup
Our experiments used a dataset (Post et al., 2013; Salesky et al., 2018) with roughly 140K segments of multi-way parallel data. For the sake of reproducibility, we used (2) or (3) as the clean or noisy input included in the dataset. We preprocessed the text data with Byte Pair Encoding (BPE) (Sennrich et al., 2016b) to split the sentences into subwords. The vocabulary size was set to 8,000 for all the languages. For the English audio, we extracted 80-channel log mel filterbank features (25-ms window size and 10-ms shift) and applied utterance-level CMVN.
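Utterance-level CMVN normalizes each feature dimension to zero mean and unit variance over the frames of a single utterance. The following pure-Python sketch is ours, for illustration; a real pipeline would run it on the 80-channel log mel features:

```python
import math

def utterance_cmvn(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance
    over all frames of one utterance.

    frames: list of per-frame feature vectors (toy 2-dim vectors below;
    in practice these would be 80-dim log mel filterbank features).
    """
    dims = len(frames[0])
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dims)]
    return [[(f[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
            for f in frames]

# Toy utterance: 3 frames of 2-dim features.
normed = utterance_cmvn([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```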
To evaluate the performance, we calculated case-sensitive BLEU with sacreBLEU. We measured BLEU on both the ASR-based and the clean input, to evaluate the robustness against ASR errors and the topline performance in an ideal situation without ASR errors.
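For intuition, BLEU combines clipped n-gram precisions with a brevity penalty. The toy sentence-level implementation below is a simplification we wrote for illustration; the actual evaluation uses sacreBLEU, which standardizes tokenization and computes corpus-level statistics:

```python
import math
from collections import Counter

def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_ngrams.values())
        if total == 0:
            return 0.0
        clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / total) / max_n

    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0, and a partial overlap scores somewhere in between, which is why BLEU differences of a point or two are meaningful in the tables below.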

Model
We used the Transformer (Vaswani et al., 2017) implementation of Fairseq to construct both the ASR and the MT models. The hyper-parameters generally follow the Transformer Base settings (Vaswani et al., 2017). The encoder and the decoder each consist of 6 layers. We set the word embedding dimensions, the hidden state dimensions, and the feed-forward dimensions to 512, 512, and 2,048, respectively. We applied dropout to each sub-layer with a probability of 0.1 and employed 8 attention heads in both the encoder and the decoder. The models were trained with the Adam optimizer.

Table 1: End-to-end (above) and cascade ST (below) systems using Fairseq's Transformer Base model, which resembles our conditions.
For English-Italian, we also built several end-to-end ST variants using Fairseq for comparison with the cascade models. All the settings are identical to those of the MT: we used the Transformer described above, trained with a label-smoothed cross entropy loss. Table 1 shows the BLEU results for English-to-Italian translation. Among the end-to-end systems, a naive model (ST) without any additional technique, such as an ASR subtask, performed significantly worse than the others and was significantly improved by pretraining the ASR encoder (ST + ASR-PT).

English-Italian
The cascade methods worked better than the end-to-end methods. Among the cascade STs, the system trained using only ASR input (MT_asr) was worse than the one trained on clean input (MT_clean): a 0.3-BLEU drop on the ASR-based test data and a 2.5-BLEU drop on the clean test data. The ASR-based training data contained erroneous transcriptions with a WER of 14.49, which led to this degradation. On the other hand, some systems trained using both ASR input and clean input were better than MT_clean when translating clean input. This indicates that training with ASR errors may help regularize the model, which yields improvements.
FT on the ASR-based input (MT_asr + FT) showed an improvement on the ASR-based test data (+1.1 BLEU). Compared to FT, KD (MT_asr + KD) produced a smaller improvement on the ASR-based input (+0.4 BLEU). For KD, the teacher model obtained a BLEU score of 41.6 against the references of the training data.
With respect to the joint use of FT and KD, simultaneously applying the two techniques (MT_asr + FT + KD) showed only slight improvements (+0.2 BLEU on the ASR-based test data and +0.1 BLEU on the clean test data) compared to FT only (MT_asr + FT). Applying FT after KD (MT_asr + KD → FT) was inferior to the other combinations, especially on clean data, probably because the MT was never trained with clean input. Distilling knowledge after FT (MT_asr + FT → KD) gave the best scores for both the ASR-based and the clean test data. FT enables the student model to learn good parameter values, and KD provides the student model with its upper bound from the teacher model.

Spanish-English
Table 2 shows the overall results for the Spanish-to-English cascade ST. They are similar to those for English-to-Italian: FT and KD improved BLEU, and combining them yielded more significant improvements. However, the gap on the clean test data between the system trained only on the ASR-based input (MT_asr) and the one trained only on the clean input (MT_clean) was larger. The ASR-based training data contained many erroneous transcriptions, with a WER of 36.5, causing more serious degradation. Another difference from the English-to-Italian experiments is that KD (MT_asr + KD) was superior to FT (MT_asr + FT) on the ASR-based test data when used alone. In KD, the BLEU of the teacher outputs used as training data was 48.0, higher than the 41.6 for English-Italian; one possible reason is this higher upper bound that can be reached through KD. In addition, given the large gap between the clean and ASR-based inputs caused by the WER of 36.5, parameter initialization by FT may not be very helpful.
Despite the differences between the two experiments, we achieved consistent improvements by combining FT and KD.

Discussion
We analyzed the results of the Spanish-to-English models to discuss how erroneous transcriptions affect the translation results and how KD and FT work.
Erroneous transcription The example below shows the problem of error propagation:
• (Clean input) uno super, super nuevo que salio
• (ASR output) en un sur super nuevo que salio
• (Reference) One super new that came out
• (MT_asr with ASR-based input) In the South, it came out
• (MT_asr + KD with ASR-based input) In a super new one that came out.
Here the Spanish word super was misrecognized as sur by the ASR. This error was propagated to the MT, and MT_asr translated it as South. Although the translation of sur into South is not itself wrong, it is not what we wanted. The model trained by KD ignored this error and generated a more appropriate sentence. We found such ASR error correction phenomena in the results, even though KD and FT do not directly address this issue.
Effect of Knowledge Distillation Spoken language parallel data contain translations of colloquial spoken utterances, which increase the difficulty of MT training. For instance:
• (Clean input) le ayuda si si, no es, no es interesante pero entonces, a ba-entonces ya despues cuando eso termino, tiene que escribir varios asi, ensayos, hacer un analisis
• (Reference) You have to write some essays like that, to make an analysis
• (KD teacher) It helps her yes, it's not interesting but then, when I finish, you have to write several, you have to make an analysis
Here the human translator ignored many disfluent parts of the original utterance, resulting in a low-fidelity reference.
Here are some other examples:
• Inconsistent translations: "De Venezuela" was translated as "From Venezuela" at one time and as "Venezuela?" at another.
• All-caps: "donde hay problemas" was capitalized and translated into "WHEN TROUBLE ARISES."
• Omission of a part of speech: "Porque, tengo el, el bodysuit, pero" was translated into "I have the bodysuit.." The conjunction "pero (but)" was removed for fluency.
The MT model can be confused by such translations. KD forces the student model to mimic the teacher's literal translations, which may include some errors, instead of reproducing such free translations of colloquial spoken utterances.
Effect of Fine-tuning Sometimes the fine-tuned MT model corrected ASR errors:
• (Clean input) Eh, para mi pues, eh, tengo como diez mil canciones en, en el, en la Ipod
• (ASR output) eh para mi pues eh tengo como diez mil canciones en en la epod
• (Reference) I have ten thousand songs in the Ipod.
• (MT_clean with ASR-based input) To me, I have about ten thousand songs in the ethics
• (MT_asr + FT with ASR-based input) I have about ten thousand songs in the Ipod

The ASR misrecognized "Ipod" as "epod," and the model before FT, which was trained only with clean inputs, incorrectly translated it as "ethics." After FT with ASR-based inputs, the model successfully translated it as "Ipod." FT on the erroneous ASR outputs may have provided robustness against common errors.

Conclusion
We presented and discussed the benefits of two machine learning techniques for cascade ST: knowledge distillation and fine-tuning. Our experimental results showed the advantages of the proposed method under two different conditions. Our results also suggest that combining knowledge distillation and fine-tuning is more beneficial than using either one alone, because they play different roles.
In future work, we will incorporate our findings into an end-to-end ST to further improve speech translation.