Joan Albert Silvestre-Cerdà

Also published as: Joan Albert Silvestre-Cerda

2022

MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks
Javier Iranzo-Sánchez | Javier Jorge Cano | Alejandro Pérez-González-de-Martos | Adrián Giménez Pastor | Gonçal V. Garcés Díaz-Munío | Pau Baquero-Arnal | Joan Albert Silvestre-Cerdà | Jorge Civera Saiz | Albert Sanchis | Alfons Juan
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

This work describes the participation of the MLLP-VRAIN research group in the two shared tasks of the IWSLT 2022 conference: Simultaneous Speech Translation and Speech-to-Speech Translation. We present our streaming-ready ASR, MT and TTS systems for Speech Translation and Synthesis from English into German. Our submission combines these systems by means of a cascade approach, paying special attention to data preparation and decoding for streaming inference.

2020

Direct Segmentation Models for Streaming Speech Translation
Javier Iranzo-Sánchez | Adrià Giménez Pastor | Joan Albert Silvestre-Cerdà | Pau Baquero-Arnal | Jorge Civera Saiz | Alfons Juan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual, but also acoustic information to decide when the ASR output is split into a chunk. An extensive and thorough experimental setup is carried out on the Europarl-ST dataset to prove the contribution of acoustic information to the performance of the segmentation model in terms of BLEU score in a streaming ST scenario. Finally, comparative results with previous work also show the superiority of the segmentation models proposed in this work.

2018

This work describes AppTek’s speech translation pipeline that includes strong state-of-the-art automatic speech recognition (ASR) and neural machine translation (NMT) components. We show how these components can be tightly coupled by encoding ASR confusion networks, as well as ASR-like noise adaptation, vocabulary normalization, and implicit punctuation prediction during translation. In another experimental setup, we propose a direct speech translation approach that can be scaled to translation tasks with large amounts of text-only parallel training data but a limited number of hours of recorded and human-translated speech.