2023
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
Christian Huber | Tu Anh Dinh | Carlos Mullov | Ngoc-Quan Pham | Thai Binh Nguyen | Fabian Retkowski | Stefan Constantin | Enes Ugan | Danni Liu | Zhaolin Li | Sai Koneru | Jan Niehues | Alexander Waibel
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated, and often it is not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion. This includes the segmentation of the audio as well as the run-time of the different components. Secondly, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded as well as end-to-end systems. Finally, the framework allows automatic evaluation of translation quality as well as latency, and it also provides a web interface to show the low-latency model outputs to the user.
2021
KIT’s IWSLT 2021 Offline Speech Translation System
Tuan Nam Nguyen | Thai Son Nguyen | Christian Huber | Ngoc-Quan Pham | Thanh-Le Ha | Felix Schneider | Sebastian Stüker
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
This paper describes KIT’s submission to the IWSLT 2021 Offline Speech Translation Task. We describe a system for both the cascaded condition and the end-to-end condition. In the cascaded condition, we investigated different end-to-end architectures for the speech recognition module. For the text segmentation module, we trained a small transformer-based model on high-quality monolingual data. For the translation module, we reused our neural machine translation model from last year. In the end-to-end condition, we improved our Speech Relative Transformer architecture to reach or even surpass the result of the cascade system.
2020
Supervised Adaptation of Sequence-to-Sequence Speech Recognition Systems using Batch-Weighting
Christian Huber | Juan Hussain | Tuan-Nam Nguyen | Kaihang Song | Sebastian Stüker | Alexander Waibel
Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems
When training speech recognition systems, one often faces the situation that sufficient amounts of training data are available for the language in question, but only small amounts of data for the domain in question. This problem is even bigger for end-to-end speech recognition systems that only accept transcribed speech as training data, which is harder and more expensive to obtain than text data. In this paper, we present experiments in adapting end-to-end speech recognition systems with a method called batch-weighting, which we contrast with regular fine-tuning, i.e., continuing to train existing neural speech recognition models on adaptation data. We perform experiments using these techniques in adapting to topic, accent and vocabulary, showing that batch-weighting consistently outperforms fine-tuning. In order to show the generalization capabilities of batch-weighting, we perform experiments in several languages, i.e., Arabic, English and German. Due to its relatively small computational requirements, batch-weighting is a suitable technique for supervised life-long learning during the life-time of a speech recognition system, e.g., from user corrections.