2023
KIT’s Multilingual Speech Translation System for IWSLT 2023
Danni Liu | Thai Binh Nguyen | Sai Koneru | Enes Yavuz Ugan | Ngoc-Quan Pham | Tuan Nam Nguyen | Tu Anh Dinh | Carlos Mullov | Alexander Waibel | Jan Niehues
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions in real-life use cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which focuses on the translation of scientific conference talks. The test condition features accented input speech and terminology-dense content. The task requires translation into 10 languages with varying amounts of resources. In the absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation, and show that this matches the performance of re-training. We observe that cascaded systems are more easily adaptable to specific target domains, due to their separate modules. Our cascaded speech system outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.
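The abstract names kNN-MT without describing it. As a rough, hedged illustration of the general technique (not KIT's actual implementation), the sketch below retrieves the k nearest entries from a datastore of (decoder hidden state, target token) pairs, turns their distances into a distribution with a temperature-scaled softmax, and interpolates it with the base model's next-token distribution. The function name, the toy datastore, and the values of `k`, `temperature`, and `lam` are all hypothetical choices for illustration.

```python
import math
from collections import defaultdict


def knn_mt_distribution(query, datastore, model_probs, k=2, temperature=10.0, lam=0.5):
    """Hypothetical sketch: interpolate an NMT next-token distribution with a kNN one.

    query: decoder hidden state (list of floats) at the current decoding step.
    datastore: list of (key_vector, target_token) pairs built from in-domain text.
    model_probs: dict mapping token -> probability from the base NMT model.
    """

    def sq_dist(a, b):
        # Squared L2 distance between two vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Brute-force k-nearest-neighbour retrieval (real systems use an ANN index).
    neighbours = sorted(datastore, key=lambda kv: sq_dist(query, kv[0]))[:k]

    # Softmax over negative distances scaled by the temperature (max-shifted for stability).
    logits = [-sq_dist(query, key) / temperature for key, _ in neighbours]
    shift = max(logits)
    weights = [math.exp(logit - shift) for logit in logits]
    total = sum(weights)

    # Aggregate neighbour weights by their target token to get p_kNN.
    knn_probs = defaultdict(float)
    for (_, token), w in zip(neighbours, weights):
        knn_probs[token] += w / total

    # Final distribution: lam * p_kNN + (1 - lam) * p_MT over the union of tokens.
    vocab = set(model_probs) | set(knn_probs)
    return {t: lam * knn_probs.get(t, 0.0) + (1 - lam) * model_probs.get(t, 0.0) for t in vocab}


# Toy usage: the hypothetical in-domain datastore prefers the capitalised term.
datastore = [([0.0, 0.0], "Transformer"), ([0.1, 0.0], "Transformer"), ([5.0, 5.0], "transformer")]
model_probs = {"transformer": 0.7, "Transformer": 0.3}
p = knn_mt_distribution([0.0, 0.0], datastore, model_probs, k=2, temperature=10.0, lam=0.5)
```

Here both retrieved neighbours vote for "Transformer", so the interpolated distribution shifts probability mass toward the in-domain spelling without retraining the base model, which is why retrieval is attractive when no target-domain training data exists.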
2022
Effective combination of pretrained models - KIT@IWSLT2022
Ngoc-Quan Pham | Tuan Nam Nguyen | Thai-Binh Nguyen | Danni Liu | Carlos Mullov | Jan Niehues | Alexander Waibel
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)
Pretrained models in acoustic and textual modalities can potentially improve speech translation for both Cascade and End-to-end approaches. In this evaluation, we investigate this empirically by using the wav2vec, mBART50 and DeltaLM models to improve text and speech translation models. The experiments showed that combining these models with an advanced audio segmentation method improves over the previous end-to-end system by up to 7 BLEU points. More importantly, the experiments showed that, given enough data and modeling capacity to overcome the training difficulty, we can outperform even very competitive Cascade systems. In our experiments, this margin can be as large as 2.0 BLEU points, the same margin by which the Cascade often led over the years.
CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022
Peter Polák | Ngoc-Quan Pham | Tuan Nam Nguyen | Danni Liu | Carlos Mullov | Jan Niehues | Ondřej Bojar | Alexander Waibel
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)
In this paper, we describe our submission to the Simultaneous Speech Translation task at IWSLT 2022. We explore strategies to use an offline model in a simultaneous setting without the need to modify the original model. In our experiments, we show that our onlinization algorithm is almost on par with the offline setting while being 3x faster than the offline model in terms of latency on the test set. We also show that the onlinized offline model outperforms the best IWSLT 2021 simultaneous system in the medium and high latency regimes and is almost on par in the low latency regime. We make our system publicly available.
2021
KIT’s IWSLT 2021 Offline Speech Translation System
Tuan Nam Nguyen | Thai Son Nguyen | Christian Huber | Ngoc-Quan Pham | Thanh-Le Ha | Felix Schneider | Sebastian Stüker
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
This paper describes KIT’s submission to the IWSLT 2021 Offline Speech Translation Task. We describe systems in both the cascaded and the end-to-end condition. In the cascaded condition, we investigated different end-to-end architectures for the speech recognition module. For the text segmentation module, we trained a small Transformer-based model on high-quality monolingual data. For the translation module, we reused our neural machine translation model from last year. In the end-to-end condition, we improved our Speech Relative Transformer architecture to reach or even surpass the results of the cascaded system.
Multilingual Speech Translation KIT @ IWSLT2021
Ngoc-Quan Pham | Tuan Nam Nguyen | Thanh-Le Ha | Sebastian Stüker | Alexander Waibel | Dan He
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
This paper describes the submission of the Karlsruhe Institute of Technology (KIT) to the multilingual TEDx translation task in the IWSLT 2021 evaluation campaign. Our main approach is to develop both cascade and end-to-end systems and eventually combine them to achieve the best possible results in this extremely low-resource setting. The report also confirms that certain architectural improvements to the Transformer are consistently beneficial across all tasks: translation, transcription and speech translation.
2020
KIT’s IWSLT 2020 SLT Translation System
Ngoc-Quan Pham | Felix Schneider | Tuan-Nam Nguyen | Thanh-Le Ha | Thai Son Nguyen | Maximilian Awiszus | Sebastian Stüker | Alexander Waibel
Proceedings of the 17th International Conference on Spoken Language Translation
This paper describes KIT’s submissions to the IWSLT 2020 Speech Translation evaluation campaign. We first participate in the simultaneous translation task, where our simultaneous models are Transformer-based and can be efficiently trained to obtain low latency with minimal loss in quality. For the offline speech translation task, we applied our new Speech Transformer architecture to end-to-end speech translation. The resulting model provides translation quality that is competitive with a complex cascade. The cascade still has the upper hand, thanks to its ability to transparently access the transcription and to resegment the inputs to avoid fragmentation.
Supervised Adaptation of Sequence-to-Sequence Speech Recognition Systems using Batch-Weighting
Christian Huber | Juan Hussain | Tuan-Nam Nguyen | Kaihang Song | Sebastian Stüker | Alexander Waibel
Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems
When training speech recognition systems, one often faces the situation that sufficient amounts of training data are available for the language in question, but only small amounts of data for the domain in question. This problem is even bigger for end-to-end speech recognition systems, which only accept transcribed speech as training data, and transcribed speech is harder and more expensive to obtain than text data. In this paper, we present experiments in adapting end-to-end speech recognition systems with a method called batch-weighting, which we contrast with regular fine-tuning, i.e., continuing to train existing neural speech recognition models on adaptation data. We perform experiments using these techniques to adapt to topic, accent and vocabulary, showing that batch-weighting consistently outperforms fine-tuning. To show the generalization capabilities of batch-weighting, we perform experiments in several languages, namely Arabic, English and German. Due to its relatively small computational requirements, batch-weighting is a suitable technique for supervised life-long learning during the lifetime of a speech recognition system, e.g., from user corrections.