Streaming Models for Joint Speech Recognition and Translation

Orion Weller, Matthias Sperber, Christian Gollan, Joris Kluivers


Abstract
Using end-to-end models for speech translation (ST) has increasingly been the focus of the ST community. These models condense the previously cascaded systems by directly converting sound waves into translated text. However, cascaded models have the advantage of including automatic speech recognition (ASR) output, which is useful for the many practical ST systems that display transcripts to the user alongside the translations. To bridge this gap, recent work has made initial progress on the feasibility of end-to-end models producing both of these outputs. However, all previous work has looked at this problem only from the consecutive perspective, leaving it uncertain whether these approaches are effective in the more challenging streaming setting. We develop an end-to-end streaming ST model based on a re-translation approach and compare it against standard cascading approaches. We also introduce a novel inference method for the joint case, interleaving both transcript and translation in generation and removing the need to use separate decoders. Our evaluation across a range of metrics capturing accuracy, latency, and consistency shows that our end-to-end models are statistically similar to cascading models, while having half the number of parameters. We also find that both systems provide strong translation quality at low latency, keeping 99% of consecutive quality at a lag of just under a second.
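The re-translation idea mentioned in the abstract can be illustrated with a minimal sketch: each time new input arrives, the whole input prefix is decoded again and the displayed output is replaced. The abstract gives no implementation details, so everything below is an assumption made for illustration: the names `retranslate_stream`, `common_prefix_len`, and `toy_decode` are hypothetical, the "audio chunks" are plain strings, and the erasure count is just one simple way to quantify how much displayed text changes between updates, not necessarily a metric used in the paper.

```python
from typing import Callable, List, Sequence, Tuple


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Return the number of leading tokens shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def retranslate_stream(
    chunks: Sequence[str],
    decode: Callable[[List[str]], List[str]],
) -> Tuple[List[str], int]:
    """Re-decode the full input prefix after every new chunk.

    Returns the final output and the total number of erased tokens,
    i.e. previously displayed tokens that a later update replaced.
    """
    prefix: List[str] = []
    shown: List[str] = []
    erased = 0
    for chunk in chunks:
        prefix.append(chunk)
        new = decode(list(prefix))  # decode from scratch on the whole prefix
        erased += len(shown) - common_prefix_len(shown, new)
        shown = new
    return shown, erased


def toy_decode(prefix: List[str]) -> List[str]:
    """Hypothetical stand-in for a model: the newest chunk is emitted as a
    tentative lowercase token and is revised to uppercase once more input
    has arrived."""
    return [c.upper() for c in prefix[:-1]] + [c.lower() for c in prefix[-1:]]


final_output, total_erased = retranslate_stream(["a", "b", "c"], toy_decode)
```

In this toy run each tentative token is revised exactly once, so the display changes after the second and third chunks. A real system would replace `toy_decode` with a model call and would typically also track latency alongside this kind of stability measure.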
Anthology ID:
2021.eacl-main.216
Volume:
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Month:
April
Year:
2021
Address:
Online
Editors:
Paola Merlo, Jörg Tiedemann, Reut Tsarfaty
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
2533–2539
URL:
https://aclanthology.org/2021.eacl-main.216
DOI:
10.18653/v1/2021.eacl-main.216
Cite (ACL):
Orion Weller, Matthias Sperber, Christian Gollan, and Joris Kluivers. 2021. Streaming Models for Joint Speech Recognition and Translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2533–2539, Online. Association for Computational Linguistics.
Cite (Informal):
Streaming Models for Joint Speech Recognition and Translation (Weller et al., EACL 2021)
PDF:
https://aclanthology.org/2021.eacl-main.216.pdf
Data
MuST-C