2023
BUT Systems for IWSLT 2023 Marathi - Hindi Low Resource Speech Translation Task
Santosh Kesiraju | Karel Beneš | Maksim Tikhonov | Jan Černocký
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This paper describes the systems submitted for the Marathi to Hindi low-resource speech translation task. Our primary submission is based on an end-to-end direct speech translation system, whereas the contrastive one is a cascaded system. The backbone of both systems is a Hindi-Marathi bilingual ASR system trained on 2790 hours of imperfectly transcribed speech. The end-to-end speech translation system was directly initialized from the ASR and then fine-tuned for direct speech translation with an auxiliary CTC loss for translation. The MT model for the cascaded system is initialized from a cross-lingual language model, which was then fine-tuned using 1.6 M parallel sentences. All our systems were trained from scratch on publicly available datasets. Finally, we use a language model to re-score the n-best hypotheses. Our primary submission achieved 30.5 and 39.6 BLEU, whereas the contrastive system obtained 21.7 and 28.6 BLEU on the official dev and test sets, respectively. The paper also presents an analysis of several experiments that were conducted and outlines strategies for improving speech translation in low-resource scenarios.
2022
Legal and Ethical Challenges in Recording Air Traffic Control Speech
Mickaël Rigault | Claudia Cevenini | Khalid Choukri | Martin Kocour | Karel Veselý | Igor Szoke | Petr Motlicek | Juan Pablo Zuluaga-Gomez | Alexander Blatt | Dietrich Klakow | Allan Tart | Pavel Kolčárek | Jan Černocký
Proceedings of the Workshop on Ethical and Legal Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Data In Language Resources within the 13th Language Resources and Evaluation Conference
In this paper the authors detail the various legal and ethical issues faced during the ATCO2 project. This project aims to develop tools to automatically collect and transcribe air traffic conversations, especially conversations between pilots and air traffic control towers. The authors discuss issues related to intellectual property, public data, privacy, and general ethics that arise from the collection of air-traffic control speech.
2021
The IWSLT 2021 BUT Speech Translation Systems
Hari Krishna Vydana | Martin Karafiat | Lukas Burget | Jan Černocký
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
The paper describes BUT’s English to German offline speech translation (ST) systems developed for IWSLT2021. They are based on jointly trained Automatic Speech Recognition-Machine Translation models. Their performance is evaluated on the MustC-Common test set. In this work, we study their efficiency from the perspective of having a large amount of separate ASR training data and MT training data, and a smaller amount of speech-translation training data. Large amounts of ASR and MT training data are utilized for pre-training the ASR and MT models. Speech-translation data is used to jointly optimize ASR-MT models by defining an end-to-end differentiable path from speech to translations. For this purpose, we use the internal continuous representations from the ASR-decoder as the input to the MT module. We show that speech translation can be further improved by training the ASR-decoder jointly with the MT-module using a large amount of text-only MT training data. We also show significant improvements by training an ASR module capable of generating punctuated text, rather than leaving the punctuation task to the MT module.
2004
Orthographic and Phonetic Annotation of Very Large Czech Corpora with Quality Assessment
Petr Pollák | Jan Černocký
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)