KIT’s Multilingual Speech Translation System for IWSLT 2023

Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions of real-life use-cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which focuses on the translation of scientific conference talks. The test condition features accented input speech and terminology-dense contents. The task requires translation into 10 languages of varying amounts of resources. In the absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation, and show that this matches the performance of re-training. We observe that cascaded systems are more easily adaptable towards specific target domains, due to their separate modules. Our cascaded speech system outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.


Introduction
This paper summarizes Karlsruhe Institute of Technology's speech translation system for the multilingual track of IWSLT 2023 (Agarwal et al., 2023). In this track, the task is to translate scientific talks in English into 10 languages: Arabic (ar), Chinese (zh), Dutch (nl), French (fr), German (de), Japanese (ja), Persian/Farsi (fa), Portuguese (pt), Russian (ru), Turkish (tr). The talks are from presentations in the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022).
Translating scientific talks presents several challenges. On the source side, most speakers are non-native, and the recording conditions vary, which requires acoustic robustness to accents and noise. On the target side, domain-specific terminology is used frequently, calling for accurate translation of words that rarely occur in the training data. The style of the talks, e.g., formality, also differs from other domains. As no training data from the same domain is provided, effective few-shot or zero-shot adaptation is crucial.
As the task focuses on one-to-many translation, it is also an interesting testbed for whether multilinguality improves speech translation quality. For text-to-text translation, the gain from multilinguality is mostly concentrated in many-to-one translation (Aharoni et al., 2019), i.e., multilinguality on the source side. In contrast, for X-to-many translation, it remains unclear whether incorporating more target languages improves translation quality.
In this system description paper, we present cascaded and end-to-end systems for the English-to-many speech translation task.
We leverage pretrained models, including WavLM (Chen et al., 2022), mBART50 (Tang et al., 2020), and DeltaLM. The systems do not use additional data beyond the allowed corpora, and therefore fall under the constrained data condition. For the cascaded system, to handle the unique style of scientific talks, we use kNN-MT (Khandelwal et al., 2021) to bias the output generation towards the target domain. Moreover, as no target monolingual data is provided, we use data diversification (Nguyen et al., 2020) to enrich the existing parallel data. We also use adapters (Rebuffi et al., 2017) as a lightweight approach for incremental learning and language adaptation. For the ASR model, we improve over last year's performance by using a more recent audio encoder (Chen et al., 2022) and adding a dedicated decoder. To adapt the ASR system to the target domain, we use n-gram re-weighting and synthesized data for the target domain. For the end-to-end system, we use our machine translation model for knowledge distillation. We also ensemble models trained with and without synthesized speech data.
Our main findings are as follows:
• For cascaded ST systems, we can effectively adapt the model towards a target domain/style using kNN-MT (Khandelwal et al., 2021). A datastore as small as a few hundred sentence pairs was sufficient for achieving consistent gains (avg. +0.8 BLEU over 10 languages).
• Besides the common use-case of adding language-specific capacity, adapters are also an effective method for subsequently adding training data. Empirically, we show that this matches the performance of re-training on all new data.
• For ASR, lexical constraints for domain adaptation are more easily integrated into CTC models. For encoder-decoder models, similar control can be achieved with TTS-synthesized source speech, but this requires more careful tuning.

Development and Test Data
In the multilingual track, the testing condition is scientific conference talks. Therefore, we primarily rely on the ACL development (dev) set for validation. It consists of English transcripts of the talks and translations into the 10 target languages. The systems are then evaluated on a blind test set. The dev and test sets consist of 5 talks each. The paper abstracts for all talks are available in English. The talks are pre-segmented, and in all experiments we use the given segmentation. We also report performance on tst-COMMON of MuST-C (Di Gangi et al., 2019), and on tst2019 and tst2020 from previous years' evaluations (Anastasopoulos et al., 2021, 2022).
An overview of the development and test data is in Table 1.

Synthesized Speech Data
To adapt the ASR model to the ACL talks, we add synthesized speech created by a text-to-speech (TTS) model. Specifically, from the English side of the MT bitext (Table 3), we select sentences close to the ACL domain based on their similarity to the provided ACL dev bitext and abstracts. Inspired by data selection strategies for MT (Eck et al., 2005; Koneru et al., 2022), we use n-gram overlap as the similarity metric. 4.7M sentences are selected and then synthesized to speech by a VITS (Kim et al., 2021) model trained on MuST-C. The synthesized data amount is shown in the last row of Table 2.
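The selection step can be sketched as follows. This is a minimal illustration of n-gram-overlap scoring; the function names, the bigram order, and the toy sentences are our own assumptions, not the paper's implementation:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_domain_ngrams(in_domain_sentences, n=2):
    """Collect the set of n-grams observed in the in-domain data."""
    grams = set()
    for sent in in_domain_sentences:
        grams.update(ngrams(sent.lower().split(), n))
    return grams

def overlap_score(sentence, domain_ngrams, n=2):
    """Fraction of the sentence's n-grams that also occur in-domain."""
    grams = ngrams(sentence.lower().split(), n)
    if not grams:
        return 0.0
    return sum(g in domain_ngrams for g in grams) / len(grams)

# Toy example: rank candidate bitext sentences by domain similarity.
in_domain = ["we fine tune the pretrained language model",
             "the model is evaluated on the test set"]
candidates = ["we fine tune the model on new data",
              "the weather was sunny all week"]
domain = build_domain_ngrams(in_domain, n=2)
ranked = sorted(candidates, key=lambda s: overlap_score(s, domain, n=2), reverse=True)
```

In the actual system, scores of this kind would be computed against the ACL dev bitext and abstracts, and the top-scoring sentences kept for TTS synthesis.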
As preprocessing, we perform truecasing, deduplication, length ratio filtering, and histogram filtering using statistics from prior work. Then we perform subword segmentation using Sentencepiece (Kudo and Richardson, 2018) based on the vocabulary of mBART50 (Tang et al., 2020).
Data Diversification Different from previous years' shared tasks (Anastasopoulos et al., 2021, 2022), no monolingual (non-English) data is provided. This means conventional data augmentation techniques like backward translation are not directly applicable. On the other hand, forward translation from existing English monolingual data may introduce undesirable errors into the translation targets, especially for lower-resource languages. In this light, we use data diversification (Nguyen et al., 2020), a data augmentation method that enriches existing parallel data by forward- and backward-translating the training bitext. As the model has seen the parallel data in training, the synthetic translations are expected to have relatively high quality. Moreover, either the source or target side of each synthetic pair comes from the original bitext. The diversified data amount after deduplication is shown in Table 3. Here we perform one round of forward and backward translation, as Nguyen et al. (2020) have empirically shown that further rounds do not lead to substantial gains.
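A single round of data diversification can be sketched as below, with `fwd_translate` and `bwd_translate` standing in for the trained English→X and X→English models (hypothetical placeholders, not actual APIs):

```python
def diversify(bitext, fwd_translate, bwd_translate):
    """One round of data diversification (Nguyen et al., 2020): augment the
    original bitext with forward translations of its source side and backward
    translations of its target side, so that one side of every synthetic pair
    is always genuine text. Duplicates are removed at the end."""
    augmented = list(bitext)
    augmented += [(src, fwd_translate(src)) for src, _ in bitext]  # synthetic targets
    augmented += [(bwd_translate(tgt), tgt) for _, tgt in bitext]  # synthetic sources
    return list(dict.fromkeys(augmented))  # order-preserving deduplication

# Toy usage with trivial stand-in "translators":
bitext = [("hello", "HALLO")]
diversified = diversify(bitext, str.upper, str.lower)
```

The deduplicated union of original and synthetic pairs is what Table 3 reports.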

Casing/Punctuation Restoration Data
The ASR outputs are lower-cased and unpunctuated, while the MT model expects cased and punctuated inputs. We randomly sample 1.5 million English sentences from the MT training data (Table 3), and remove the casing and punctuation marks as training source data. We then train a model to restore the casing and punctuation marks.
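The pair construction can be sketched as follows. This is a minimal sketch; the exact normalization rules of our pipeline may differ, and `string.punctuation` covers only ASCII punctuation:

```python
import string

def make_restoration_pair(sentence):
    """Build a (source, target) pair for casing/punctuation restoration:
    the target is the original sentence; the source is a lower-cased,
    punctuation-free version that mimics raw ASR output."""
    stripped = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    source = " ".join(stripped.split())  # collapse whitespace left by removal
    return source, sentence

src, tgt = make_restoration_pair("Hello, my name is Ramachandra.")
# src == "hello my name is ramachandra"
```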

Speech Translation Data
The speech translation data are shown in Table 4. We additionally use our trained MT model to create forward translations based on the following transcript-only datasets: Common Voice, TEDLIUM, and VoxPopuli. The TTS data described in §2.2 is also used.

Automatic Speech Recognition Module
Baseline Models The first baseline is our ASR model for last year's offline track (Pham et al., 2022). It is a Wav2vec 2.0 model (Baevski et al., 2020) in the LARGE configuration, pretrained on 960 hours of Librispeech data. This year, after seeing initial favourable results compared to Wav2vec, we opt for WavLM (Chen et al., 2022) as audio encoder. We use the LARGE configuration with 24 layers. We use the mBART50 (Tang et al., 2020) decoder along with the WavLM encoder. As the ASR model only needs to transcribe English, we trim the mBART50 vocabulary from 256k down to 62k tokens by removing all non-alphabetic tokens.
In-Domain TTS Data We also use the synthesized TTS data. Compared to the same model without TTS data, the word error rate (WER) improves from 11.6% to 10.7% on ACL dev, but degrades from 8.4% to 9.0% on the TEDLIUM test set. There are two potential explanations: First, the noisy TTS speech may be helpful for handling the non-native utterances prominent in the ACL dev set. Second, the target side of the TTS data is more relevant to the ACL domain, as we selected them based on n-gram overlap with ACL data. This in turn improves ASR performance on the ACL dev set. As shown in Table 5, compared to last year's submission, this year's ASR model achieves consistent gains across domains on ACL dev, tst-COMMON, and tst2020.
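The WER values reported here are the standard word-level edit distance; for reference, a minimal implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance (substitutions +
    insertions + deletions) divided by the number of reference words.
    Assumes a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)
```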
Language Model (LM) Adaptation Aside from using TTS data, we also investigate other methods to adapt towards the ACL domain using the provided paper abstracts. In preliminary experiments with Connectionist Temporal Classification (CTC) + n-gram LM models, we integrate 5-gram statistics from the ACL abstracts into the language model. As shown in the upper section of Table 6, this improves on ACL dev (WER 13.8% → 13.0%) while preserving the performance on TED talks (tst-COMMON WER stays at 7.6%). As our final system is an encoder-decoder model (WavLM + mBART50), adapting the LM alone is less straightforward. Instead, we create pseudo ASR training data with ACL data on the transcript side. Specifically, we use our TTS model to synthesize speech from the ACL dev and test abstracts. As the amount of ACL abstract data is very limited (less than 100 sentences in total), we heavily upsample it so that it constitutes 60% of the training data. As shown in the lower section of Table 6, this leads to a minor WER improvement on ACL dev. However, the gain does not carry over to ST performance when later cascading with our MT model. Therefore, our final ASR system does not use the abstracts. The lack of improvement could be related to the low amount of ACL abstract data, which requires heavy upsampling of the TTS data and, as a result, hinders the ability to transcribe real speech.
The contrast between the two sets of experiments may be related to diminishing gains as WER improves, i.e., for the Wav2vec + CTC + LM model, improving from a WER of 13.8% is easier than starting from a WER of 10.7%. Another interpretation of the difference is that adding specific constraints to "end-to-end" ASR models is more challenging than for counterparts with separate LMs.

Model                              ACL dev   tst-COMMON
Wav2vec + CTC + 5-gram                13.8      7.6
  + ACL abstract 5-gram               13.0      7.6
WavLM + mBART50                       10.7      3.9
  + ACL abstract TTS (upsampled)      10.5      4.3
Table 6: WER (↓) on ACL dev and tst-COMMON for the LM adaptation experiments.

Casing/Punctuation Restoration We take a sequence-to-sequence approach to the casing and punctuation restoration problem. Specifically, we train a punctuation model initialized from DeltaLM-base to restore the casing and punctuation information, using the training data described in §2.4.

MT Baseline The pretrained DeltaLM model has 24 encoder and 12 decoder Transformer layers. It uses postnorm layer normalization. It is a fully multilingual model where all parameters are shared across languages. The target language tokens are prepended to the source sentences. We use temperature-based sampling (Arivazhagan et al., 2019) with τ = 5.0 to counteract the data imbalance between languages. When training, we use a relatively large effective batch size of 128k, as preliminary experiments with smaller batch sizes showed more instability in training. This might be a side effect of the postnorm layer normalization (Nguyen and Salazar, 2019). The results of the baseline are shown in Row (1) of Table 7, with an average score of 37.4 BLEU on ACL dev.
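The temperature-based sampling mentioned above sets each language's sampling probability proportional to its data share raised to 1/τ; a sketch (the corpus sizes below are made up for illustration):

```python
def sampling_probs(sizes, tau=5.0):
    """Temperature-based sampling (Arivazhagan et al., 2019):
    p_l is proportional to (n_l / N) ** (1 / tau). tau = 1 gives
    proportional sampling; larger tau flattens the distribution
    towards uniform, upweighting low-resource languages."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / tau) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Illustrative (made-up) corpus sizes: a 40:1 imbalance shrinks to about 2:1.
probs = sampling_probs({"de": 40_000_000, "fa": 1_000_000}, tau=5.0)
```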

Machine Translation
Data Diversification As motivated in §2.3, we use data diversification as an alternative data augmentation method in the absence of monolingual target data for backtranslation. As data diversification needs forward and backward translations of the training data, we additionally train a 10-to-English model to create the backward translations. Row (2) of Table 7 shows the results after data diversification on all language pairs. On average, this data augmentation approach improves MT quality by 2.1 BLEU (37.4 → 39.5), and ST quality by 1.6 BLEU (31.1 → 32.7).
Adapters for Incremental Data Retraining on the new training data after diversification (Row (2) of Table 7) is time-consuming and costly.
To adapt the initial model (Row (1) of Table 7) rapidly towards the augmented data, we use adapters (Philip et al., 2020). In this case, the adapters are target-language-specific. The adapters are inserted after each encoder and decoder layer. We initialize from the trained baseline (Row (1) in Table 7), freeze the trained parameters, and update only the adapters. We use the efficient implementation from Baziotis et al. (2022). As shown in Row (3) of Table 7, training only the adapters on the new diversified training data performs on par with the re-training setup in Row (2) (39.6 on MT and 32.7 on ST on average for ACL dev). These results demonstrate that adapters are suitable for fast and effective incremental learning when additional training data emerges later.
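A language-specific adapter can be sketched as below. This follows the common bottleneck design (down-projection, nonlinearity, up-projection, residual); the actual dimensions and initialization in our system may differ:

```python
import numpy as np

class Adapter:
    """Bottleneck adapter: down-project, ReLU, up-project, residual add.
    Only these weights would be trained; the base model stays frozen.
    Zero-initializing the up-projection makes the adapter an identity
    function at the start of training."""
    def __init__(self, d_model, d_bottleneck, rng):
        self.w_down = rng.normal(0.0, 0.02, size=(d_model, d_bottleneck))
        self.w_up = np.zeros((d_bottleneck, d_model))

    def __call__(self, hidden):
        h = np.maximum(hidden @ self.w_down, 0.0)  # ReLU bottleneck
        return hidden + h @ self.w_up              # residual connection

# One adapter per target language, applied after every encoder/decoder layer.
rng = np.random.default_rng(0)
adapters = {lang: Adapter(d_model=8, d_bottleneck=2, rng=rng) for lang in ("de", "fa")}
x = rng.normal(size=(3, 8))  # toy layer output: (sequence length, d_model)
y = adapters["de"](x)
```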
To our surprise, adding adapters to the model trained with full data diversification (Row (2) of Table 7) does not bring further gains. A similar observation was reported by Pires et al. (2023), who opted for training the full network from scratch along with adapters instead. In our case, it would therefore be interesting to see the impact of training with adapters on the diversified data from scratch.

Multilingual vs Bilingual
To investigate the impact of interference from multiple target languages, in preliminary experiments, we also compare the multilingual and bilingual translation performance for selected language pairs. As shown in Table 8, compared to bilingual models, the multilingual model lags behind especially on higher-resource languages. Adding the adapters partly closes this gap. Note that the score difference to the main result table (Table 7) is because the preliminary experiments did not fully use diversified data for all languages. Table 8: Comparison of bilingual vs. multilingual translation performance in BLEU (↑) on German (de), Russian (ru), and Farsi (fa), which are high-, mid-, and low-resource in the training data (Table 3). The multilingual system falls behind the bilingual systems, while adapters partly close the gap. Note the score difference to the main result table (Table 7) is because the experiments here did not fully use diversification.
Ensemble Although the models in Row (2) and (3) in Table 7 are trained on the same data and share the same base architecture, we expect their representations to be sufficiently different, as (3) additionally uses adapters. We therefore ensemble these two models. The results are in Row (4) of Table 7. On MT and ST, for ACL, ensembling shows an improvement of 0.6 and 0.5 BLEU respectively over the single models in Row (2) and (3). On TED, however, ensembling does not seem to impact the scores compared to the single models. One explanation is that the adapter model from Row (3) performs worse than its non-adapter counterpart (Row (2)) on TED, which limits the overall effectiveness of ensembling.
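At decoding time, ensembling combines the models' per-step output distributions; a sketch assuming simple probability averaging (the paper does not state the exact combination rule):

```python
import numpy as np

def ensemble_step(distributions):
    """Average the next-token probability distributions of several models.
    Averaging probabilities is one common ensembling choice; averaging
    log-probabilities is another."""
    return np.stack(distributions).mean(axis=0)

p1 = np.array([0.7, 0.2, 0.1])  # model without adapters
p2 = np.array([0.5, 0.4, 0.1])  # adapter variant
p = ensemble_step([p1, p2])
```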
kNN-MT We also adapt the MT model to the target domain of scientific talks. A challenge is that we do not have sufficient training data to fully finetune the MT model towards the desired domain or style. In this case, we use kNN-MT (Khandelwal et al., 2021) to adapt the model at inference time.
In kNN-MT, bitexts are passed through a trained MT model. For each target token, its decoder hidden state is stored in a datastore. At inference time, based on the current decoder hidden state, k candidate target tokens are retrieved from the datastore using a nearest-neighbor lookup. The retrieved token distribution is then interpolated with the MT target distribution, which in turn generates the output tokens. Hyperparameters for kNN-MT include the number of retrieved neighbors k, the temperature T for smoothing the kNN distribution, and the interpolation weight w.

Source (ASR output): ... in a zero shot evaluation setup, meaning that pre trained word embedding models are applied out of the box without any additional fine tuning
w/o kNN-MT (Table 7 row (4)): ... in einer Null-Shot-Bewertungs-Setup (zero-shot evaluation setup), was bedeutet, dass vorgebildete (pre-educated) Wort-Einbettungsmodelle ohne zusätzliche Feinabstimmung direkt angewendet werden.
w/ kNN-MT (Table 7 row (5)): ... in einer Null-Shot-Bewertung (zero-shot evaluation), was bedeutet, dass vortrainierte (pretrained) Wort-Einbettungsmodelle ohne zusätzliche Feinabstimmung direkt angewendet werden.

Source (ASR output): Hello. My name is Ramachandra, and I will present our paper.
w/o kNN-MT (Table 7 row (4)): 你好 (Hello; addressing a single person)，我叫拉玛钱德拉，我要发表 (publish) 我们的论文
w/ kNN-MT (Table 7 row (5)): 大家好 (Hi all; addressing a group of audience)，我叫拉玛钱德拉，我要介绍 (introduce) 我们的论文。

Table 9: Examples of kNN-MT improving translation quality for en→de (upper) and en→zh (lower). kNN-MT produces more accurate terminology translations ("pre trained" for en→de) and more context-appropriate translations ("Hello" for en→zh).
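The retrieval-and-interpolation step can be sketched as follows, assuming L2 distance and a softmax over negative distances (the specific distance metric and kernel are our assumptions, not a statement of the exact implementation):

```python
import numpy as np

def knn_mt_step(query, keys, values, model_probs, k=4, T=10.0, w=0.5):
    """One kNN-MT decoding step: retrieve the k datastore entries nearest to
    the current decoder hidden state, build a token distribution via a
    temperature softmax over negative L2 distances, and interpolate it with
    the model's own distribution using weight w."""
    dists = np.linalg.norm(keys - query, axis=1)  # L2 distance to every key
    nn = np.argsort(dists)[:k]                    # indices of k nearest entries
    logits = -dists[nn] / T
    p = np.exp(logits - logits.max())
    p /= p.sum()
    knn_probs = np.zeros_like(model_probs)
    for token_id, prob in zip(values[nn], p):     # aggregate mass per token id
        knn_probs[token_id] += prob
    return w * knn_probs + (1 - w) * model_probs

# Toy datastore: the two entries near the query both point to token 2.
keys = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]])
values = np.array([2, 2, 0])                      # target token ids
model_probs = np.array([0.5, 0.3, 0.2])
out = knn_mt_step(np.array([0.0, 0.05]), keys, values, model_probs, k=2, T=1.0, w=0.5)
```

With w = 0 the model distribution is recovered unchanged; k, T, and w are tuned on held-out data.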
In our experiments, we use systems (2) and (3) from Table 7 for creating the datastores. As different models' hidden states (which serve as keys in the datastore) differ substantially, the datastore is MT-model-dependent. To use kNN-MT when ensembling systems (2) and (3), we therefore need two datastores, one for each system. The kNN-MT candidate tokens are interpolated with the output vocabulary distribution before the ensembling operation.
Naively using the entire ACL dev bitext as datastore would lead the model to copy the oracle targets. To simulate the scenario on the blind test set, when translating the i-th talk, we use the bitext of the other talks j ≠ i, j ∈ [n], as datastore, where n is the total number of talks.
As shown in Row (5) of Table 7, kNN-MT brings an additional gain of 1.3 BLEU on MT and 0.8 BLEU on ST. These results show that a datastore as small as a few hundred sentence pairs can be effectively used for inference-time domain adaptation. Table 9 shows two examples of kNN-MT improving translation quality: apart from generic improvements in fluency and accuracy, in these examples kNN-MT also helps generate correct terminology and context-appropriate greetings.

End-to-End System
For the end-to-end system, similar to our ASR model, after seeing initial favourable results of WavLM over Wav2vec, we choose WavLM as the audio encoder. Following last year's submission (Pham et al., 2022), we use the mBART50 decoder. The results are shown in Row (6) of Table 7. Contrasting Rows (6) and (7) reveals that adding the TTS data does not substantially change ST performance. However, ensembling the two models trained with and without TTS data (Row (8)) improves over the single models (on average +0.7 for ACL, +0.4 for TED), despite their identical architecture.
Compared to the strongest cascaded system (Row (5)), the end-to-end system falls behind by 2.6 BLEU on ACL dev. On TED, however, it appears to slightly outperform the cascaded system. One explanation is that the MT model of the cascaded system has not been separately adapted to TED texts (although parts of the full training data do cover TED data), which was shown to be essential for improving performance on TED test sets (Zhang et al., 2022; Pham et al., 2022). The end-to-end system, on the other hand, has seen a larger proportion of TED data in training (Table 4).
Similar to the previous year (Polák et al., 2022), we also adapt our end-to-end offline model for the simultaneous track (Polák et al., 2023).

Conclusion
In this paper, we described our systems for the multilingual speech translation track of IWSLT 2023, which translates English speech into 10 target languages. To tackle the task of translating scientific conference talks, which feature non-native input speech and terminology-dense contents, our systems have several novelties. Lacking suitable training data for the target domain, we used kNN-MT for inference-time adaptation and showed an improvement of +0.8 BLEU for the cascaded speech translation system. We also used adapters to integrate incremental data from augmentation, and achieved performance on par with re-training on all data. In our experiments, we observed that cascaded systems are more easily adaptable towards desired target domains due to their separate modules. Our cascaded speech system outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks. For future work, we are interested in the feasibility of applying the adaptation approaches shown to be effective on MT to end-to-end ST.