Tutorial Proposal: End-to-End Speech Translation


Abstract
Speech translation is the translation of speech in one language, typically to text in another, traditionally accomplished through a combination of automatic speech recognition and machine translation. Speech translation has attracted interest for many years, but the recent successful applications of deep learning to both individual tasks have enabled new opportunities through joint modeling, in what we today call 'end-to-end speech translation.' In this tutorial we will introduce the techniques used in cutting-edge research on speech translation. Starting from the traditional cascaded approach, we will give an overview of data sources and model architectures to achieve state-of-the-art performance with end-to-end speech translation for both high- and low-resource languages. In addition, we will discuss methods to evaluate and analyze the proposed solutions, as well as the challenges faced when applying speech translation models in real-world applications.

Description
Machine translation (MT) and automatic speech recognition (ASR) have been mainstays of the speech and natural language processing communities for decades. Speech translation (ST), the combination of both tasks to translate from speech in one language, typically to text in another, has existed for nearly as long as either of these (Waibel et al., 1991), attracting interest from both academia and industry. Until very recently, however, research in this area involved a cascade of separately trained speech recognition and machine translation models, with the main questions pertaining to the intermediate representations and processing steps that best connect these models. The successful application of deep learning methods to speech and language processing has not only significantly improved the quality of models for both tasks (Sennrich et al., 2016; Hinton et al., 2012), but has also enabled new opportunities through joint modeling of speech and translation in what is today referred to as end-to-end speech translation (Bérard et al., 2016; Weiss et al., 2017). By integrating ideas from machine translation and speech recognition, this research topic is at the intersection of speech and language processing, traditionally two separate communities.
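To make the cascaded/end-to-end contrast concrete, here is a minimal sketch of the two approaches; the `asr_model`, `mt_model`, and `st_model` interfaces are hypothetical stand-ins, not the API of any particular toolkit:

```python
# Cascaded ST: two separately trained models, connected by
# source-language text. Errors in the transcript propagate to MT.
def cascaded_st(audio, asr_model, mt_model):
    transcript = asr_model.transcribe(audio)  # speech -> source text
    return mt_model.translate(transcript)     # source text -> target text

# End-to-end ST: a single jointly trained model maps source speech
# directly to target-language text, with no intermediate transcript.
def end_to_end_st(audio, st_model):
    return st_model.translate(audio)          # speech -> target text
```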
The paradigm switch to neural, end-to-end models has brought a significant increase in research interest and data resources for ST. The yearly evaluation campaign organized by IWSLT has seen large increases in participation in recent years (Ansari et al., 2020), and this year brought the creation of a joint special interest group (SIGSLT) spanning the ACL and ISCA communities. "Simpler" sequence-to-sequence architectures have lowered the barrier to entry; where previously researchers wishing to work in this area typically needed to either have significant knowledge of both ASR and MT or work in large collaborations, this is no longer the case. However, it remains the case that the best-performing models do draw on insights from both of these fields, and so we think that the time is ripe for a tutorial to better introduce the techniques to do cutting-edge research in ST.
This tutorial will summarize recent developments in end-to-end speech translation. We will start with a discussion of the term 'end-to-end' as well as a comparison to the traditional cascaded approach. In the subsequent sections, we will summarize ideas leveraged from automatic speech recognition (e.g. Chan et al. (2016)) and machine translation (e.g. Vaswani et al. (2017)) that are part of current state-of-the-art models, which are currently demonstrated through evaluation campaigns like IWSLT. A particular focus of the tutorial will be the current data landscape, as well as techniques to exploit different resources (Kano et al., 2020; Sperber et al., 2019) to enable speech translation not just for the few high-resource languages for which multi-parallel speech, transcripts, and translations exist.
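As one concrete instance of such data-exploitation techniques, the sketch below shows multi-task training with an auxiliary ASR objective on a shared speech encoder. The `encoder`/`decoder` interfaces and the 0.3 weight are illustrative assumptions, not a reference implementation from the cited work:

```python
def multitask_step(batch, encoder, st_decoder, asr_decoder, asr_weight=0.3):
    # One shared speech encoder feeds two decoders: one producing
    # translations (the ST task) and one producing transcripts (ASR).
    states = encoder(batch["speech"])
    st_loss = st_decoder.loss(states, batch["translation"])
    asr_loss = asr_decoder.loss(states, batch["transcript"])
    # The auxiliary ASR loss regularizes the shared encoder and lets
    # ASR-only corpora contribute to ST training.
    return st_loss + asr_weight * asr_loss
```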
After the survey of current state-of-the-art methods, we will present evaluation and analysis methods, as well as the challenges of bringing these models from the lab to real-world environments. For example, one challenge of end-to-end models is their 'opaqueness': with one joint system, it is more difficult to isolate the causes of particular model behaviors and perhaps intervene, to avoid situations where key terms are translated in unexpected ways. Further, most training data consists of fixed, pre-segmented input with parallel sentences, while in most practical applications the audio is not segmented. This brings additional challenges in both processing and scoring. Finally, there are aspects of speech, such as speaker gender, accent, and prosody, to which the MT model in a cascaded system does not have access. We will touch on the impacts of some of these aspects, and provide greater detail on the specific example of gender bias mitigation (Bentivogli et al., 2020).
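On the scoring side, here is a minimal example with the sacrebleu library; the toy hypothesis and reference strings are made up, and real evaluation of unsegmented system output would first require re-aligning it to the reference segmentation:

```python
import sacrebleu

# Sentence-aligned system outputs and references (toy data).
hypotheses = ["the cat sat on the mat", "he read the book"]
references = ["the cat sat on the mat", "she read the book"]

# Corpus-level BLEU; sacrebleu expects a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```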
During the tutorial, we will highlight the present successes and challenges in end-to-end speech translation using examples from current state-of-the-art systems. Resources and teaching materials will be made available at https://st-tutorial.github.io.

Tutorial Type
This tutorial will cover cutting-edge research in the emerging field of end-to-end speech translation, and the aspects of speech and MT needed for this interdisciplinary research. The topic has not been previously covered in *CL tutorials.

Outline
  - Available data for end-to-end ST
  - Different ways to leverage data sources:
    * Multi-task learning
    * Transfer learning and pretraining
    * Alternate data representations (e.g. phonemes; see the sketch below)
• Evaluation/Analysis (20 min)
  - Automatic metrics
  - Utterance segmentation for automatic scoring
  - Mitigating errors due to speaker variation (gender, accent, etc.)
• Advanced topics (30 min)
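As a small illustration of the 'alternate data representations' point in the outline above, one way to obtain phoneme sequences is an off-the-shelf grapheme-to-phoneme tool. This sketch uses the phonemizer package with its espeak backend, which is one option among several and not necessarily the setup used in the cited work:

```python
from phonemizer import phonemize

# Convert source-language text to a phoneme sequence; phoneme-level
# representations have been explored as inputs or auxiliary targets
# for speech translation models.
text = "speech translation"
phones = phonemize(text, language="en-us", backend="espeak")
print(phones)  # e.g. something like "spiːtʃ tɹænsleɪʃən"
```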

Prerequisites
We assume basic knowledge of machine learning and of sequence-to-sequence models for machine translation, as covered in most introductory NLP courses. Any programming examples will be shown in Python.

Reading list
• Survey paper (Sperber and Paulik, 2020)
• The first papers on end-to-end ST (Bérard et al., 2016; Weiss et al., 2017)
• Data for end-to-end ST (Di Gangi et al., 2019b)
• Integrating additional data (Bansal et al., 2019; Jia et al., 2019; Sperber et al., 2019)
• Data representation (Salesky and Black, 2020)
• Adapting the Transformer for ST (Di Gangi et al., 2019a)
• Multilingual models (Inaguma et al., 2019)

Presenters
His research interests are in the field of computational linguistics, particularly machine translation, spoken language translation, textual entailment and question answering. He worked in several EU projects (QT21, CRACKER, MMT, MateCat, CoSyne, QALL-ME) and co-organised conferences, workshops and evaluation campaigns in NLP and MT-related areas (including the Conference on Machine Translation, the International Workshop on Spoken Language Translation and SemEval shared tasks). Together with Marco Turchi, he was the recipient of an Amazon AWS ML Research Award on "End-to-end Spoken Language Translation in Rich Data Conditions."