Long-form Simultaneous Speech Translation: Thesis Proposal

Simultaneous speech translation (SST) aims to provide real-time translation of spoken language, even before the speaker finishes their sentence. Traditionally, SST has been addressed primarily by cascaded systems that decompose the task into subtasks, including speech recognition, segmentation, and machine translation. However, the advent of deep learning has sparked significant interest in end-to-end (E2E) systems. Nevertheless, a major limitation of most approaches to E2E SST reported in the current literature is that they assume that the source speech is pre-segmented into sentences, which is a significant obstacle for practical, real-world applications. This thesis proposal addresses end-to-end simultaneous speech translation, particularly in the long-form setting, i.e., without pre-segmentation. We present a survey of the latest advancements in E2E SST, assess the primary obstacles in SST and its relevance to long-form scenarios, and suggest approaches to tackle these challenges.


Introduction
In today's highly globalized world, communication among individuals speaking different languages is gaining importance. International conferences and multinational organizations like the European Parliament often rely on human interpreters. However, in many scenarios, employing human interpreters can be impractical and costly. In such cases, simultaneous speech translation (SST) offers a viable solution by enabling real-time translation before the speaker completes their sentence.
Traditionally, both offline speech translation (ST) and simultaneous speech translation (SST) have relied predominantly on cascaded systems that decompose the task into multiple subtasks, including speech recognition, speech segmentation, and machine translation (Osterholtz et al., 1992; Fügen et al., 2007; Bojar et al., 2021). However, recent advancements in deep learning and the availability of abundant data (Tan and Lim, 2018; Sperber and Paulik, 2020) have led to a significant paradigm shift towards end-to-end (E2E) models. While the cascaded approach continues to dominate offline ST, the opposite is true for SST (Anastasopoulos et al., 2022; Agarwal et al., 2023).
Despite the recent popularity of end-to-end SST, the vast majority of research focuses on the "short-form" setting, which assumes that the speech input is already pre-segmented into sentences. Critically, this assumption poses an obstacle to deployment in the wild. Therefore, we aim to achieve "true" long-form simultaneous speech translation in our thesis. We break down our efforts into three steps:
Quality-latency tradeoff in SST. The first step of our research concentrates on enhancing the quality-latency tradeoff, mainly in the traditional "short-form" regime. We will evaluate different approaches and architectures.
Towards the long-form SST. In the next step, we will explore the feasibility of long-form simultaneous speech translation by adopting segmented inference.
True long-form SST. The final goal of our work is to explore the potential of end-to-end modeling for true long-form SST. We will focus on identifying an appropriate model architecture and effective training procedures to achieve seamless and reliable long-form simultaneous speech translation.
The next section introduces some important aspects of simultaneous speech translation.

arXiv:2310.11141v1 [cs.CL] 17 Oct 2023
The ultimate goal of SST is to enable real-time communication between people speaking different languages. To achieve this goal, SST systems must meet two important criteria. First, they must be computationally efficient to ensure timely translation during ongoing speech. Second, SST systems must be capable of handling unfinished sentences. Working with unfinished sentences allows for more timely translations, particularly when waiting for sentences to be completed is impractical, such as when the translation must stay in step with slides or the presenter's gestures. However, translating unfinished sentences increases the risk of translation errors, since translation usually requires reordering that benefits from a more complete sentence context. Thus, there exists a quality-latency tradeoff: given a certain latency constraint, we want the model to produce the best possible translation. Ideally, we want the model to "predict" the future context without the risk of an incorrect translation. The quality-latency tradeoff is one of the main topics of our research.

Re-Translation vs. Incremental SST
SST can be classified as either re-translation or incremental. Re-translation SST (Niehues et al., 2016, 2018) can revise the hypothesis or re-rank the set of hypotheses as more speech input is read. Revising the translation allows the re-translation SST to have comparable final translation quality with the offline speech translation (Arivazhagan et al., 2020). This design approach arguably introduces challenges for the user in processing the translation and makes it impossible to use in real-time speech-to-speech translation. Additionally, it also complicates the latency evaluation.
In fact, several SST latency metrics (Ma et al., 2020) were originally developed specifically for incremental translation scenarios. Incremental SST (Cho and Esipova, 2016; Dalvi et al., 2018) differs from the re-translation system in that it prunes all hypotheses to a common prefix, which is then shown to the user. For the user, the translation changes only by incrementally getting longer; none of the previously displayed outputs are ever modified. In our work, we focus on incremental SST.
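To make the pruning step concrete, the following is a minimal sketch of how an incremental system might commit output. It assumes that the beam holds tokenized continuations of the already displayed text; the function names `common_prefix` and `commit` are illustrative and do not come from any of the cited systems.

```python
from typing import List, Sequence


def common_prefix(hypotheses: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest token prefix shared by all hypotheses."""
    if not hypotheses:
        return []
    prefix: List[str] = []
    # zip stops at the shortest hypothesis, so we never index out of range.
    for column in zip(*hypotheses):
        if all(token == column[0] for token in column):
            prefix.append(column[0])
        else:
            break
    return prefix


def commit(committed: Sequence[str], beam: Sequence[Sequence[str]]) -> List[str]:
    """Append the stable part of the beam to the output shown to the user.

    Assumption: every hypothesis in `beam` is a continuation of `committed`,
    so the displayed text only ever grows and is never revised.
    """
    return list(committed) + common_prefix(beam)


if __name__ == "__main__":
    beam = [
        ["wir", "sehen", "uns", "morgen"],
        ["wir", "sehen", "uns", "bald"],
        ["wir", "sehen", "das"],
    ]
    print(commit(["Hallo", ","], beam))
```

With the toy beam above, only "wir sehen" is agreed upon by all hypotheses, so only these two tokens are appended to the displayed output; the disputed continuations wait for more speech.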

Cascaded vs. End-to-End
Traditionally, offline speech translation and SST were achieved as a cascade of multiple systems: automatic speech recognition (ASR), inverse transcript normalization, which includes punctuation prediction and true casing, and machine translation (MT; Osterholtz et al., 1992; Fügen et al., 2007; Bojar et al., 2021). The advantage of the cascade approach is that we can optimize models for each subtask independently. Also, ASR and MT tasks typically have access to larger and more diverse corpora than direct speech translation.
However, using a cascade system introduces several challenges (Sperber and Paulik, 2020). The most important among them is error propagation (Ruiz and Federico, 2014). Further, MT models might suffer from mismatched domains when trained on written language. Furthermore, as the source is transformed into a textual form, it loses crucial information about prosody, i.e., the rhythm, intonation, and emphasis in speech (Bentivogli et al., 2021). Finally, many languages, especially endangered ones, have no written form, which makes the cascade approach impractical or impossible for such languages (Harrison, 2007; Duong et al., 2016).
According to the latest findings, the current state-of-the-art for offline speech translation continues to be based on a cascaded approach (Anastasopoulos et al., 2022; Agarwal et al., 2023). In simultaneous speech translation, however, both approaches yield competitive performance. The advantage of the end-to-end models in SST may be that they avoid the extra delay caused by ASR-MT collaboration in the cascade (Wang et al., 2022).
In our work, we focus on end-to-end models.

Long-form Simultaneous Speech Translation
Most of the contemporary research on SST assumes speech pre-segmented into short utterances with segmentation following the sentence boundaries. However, in any real application, there is no such segmentation available. This section places long-form SST within the broader context of long-form ASR, MT, and offline ST. Subsequently, we explore the current literature on long-form SST.

Long-Form ASR
In terms of input and output modalities, long-form ASR and ST face similar issues. There are two types of strategies for long-form processing: (1) the segmented approach, which divides the input into smaller chunks, and (2) the true long-form approach, which handles the entire long-form input as a single unit.
Most of the literature focuses on the segmented approach. A typical solution involves pre-segmenting the audio using voice activity detection (VAD). However, VAD segmentation may not be optimal for real-world speech since it might fail to handle hesitations or pauses in sentences that must be treated as undivided units. More sophisticated approaches leverage latent alignments obtained from CTC (Graves et al., 2006) and RNN-T (Graves, 2012) for better segmentation (Yoshimura et al., 2020; Huang et al., 2022). Alternatively, segmentation into fixed segments is also popular (Chiu et al., 2019, 2021). To reduce low-quality transcripts close to the segment boundaries, these methods typically perform overlapped inference and use latent alignments to merge the transcripts correctly. The chunking approach is also adopted by the attentional model Whisper in the offline (Radford et al., 2023) and simultaneous regime (Macháček et al., 2023).
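As an illustration of fixed-length segmentation with overlap, the sketch below computes the chunk boundaries only. The function name `chunk_with_overlap` and its parameters are our own assumptions; the merging of overlapping transcripts via latent alignments, which the cited work performs, is deliberately left out.

```python
from typing import List, Tuple


def chunk_with_overlap(num_samples: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Split [0, num_samples) into fixed-size chunks that share `overlap` samples.

    Returns half-open (start, end) spans; the last chunk may be shorter.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    spans: List[Tuple[int, int]] = []
    start = 0
    while start < num_samples:
        end = min(start + chunk_size, num_samples)
        spans.append((start, end))
        if end == num_samples:  # the input is fully covered
            break
        start += step
    return spans


if __name__ == "__main__":
    # Ten samples, chunks of four, one sample shared between neighbours.
    print(chunk_with_overlap(10, 4, 1))
```

For the toy call above, the spans are (0, 4), (3, 7), and (6, 10): each boundary region is seen by two chunks, which is what gives the merging step a chance to discard the unreliable transcript edges.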
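Another line of work focused on long-form modeling directly. For example, Chiu et al. (2019) conducted a comprehensive study comparing different architectures, including RNN-T and attention-based models. The findings indicate that only RNN-T and CTC architectures can generalize to unseen lengths. To further improve the true long-form ASR, Narayanan et al. (2019) suggest simulation of long-form training by LSTM state passing. While the previously mentioned research was predominantly based on RNNs, more recent work has transitioned to utilizing Transformer models. Zhang et al. (2023) compared a chunk-wise attention encoder, which involves an encoder with a limited attention span, in combination with the attention-based decoder (AD) and CTC. We note that while the encoder has a limited attention span, the attention-based decoder sees the entire encoder representation. The model employing AD could not function without chunking, whereas the CTC model processed the entire speech at once and still outperformed the AD model.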

Long-Form MT
The primary objective of long-form MT is to enhance textual coherence, as conventional MT systems assume sentence independence. Early work explored a concatenation of previous (Tiedemann and Scherrer, 2017; Donato et al., 2021) and future sentences (Agrawal et al., 2018). These works showed that MT models benefit from the extra context and better handle the inter-sentential discourse phenomena. However, the benefits diminish if the context grows beyond a few sentences (Agrawal et al., 2018; Kim et al., 2019; Fernandes et al., 2021). This can be attributed to the limitations of attention mechanisms, where an extensive volume of irrelevant information can lead to confusion.
Another body of work tries to model very long sequences directly. Dai et al. (2019) introduced a recurrence mechanism and an improved positional encoding scheme in the Transformer. Later work proposed an explicit compressed memory realized by a few dense vectors (Feng et al., 2022).

Long-Form Offline ST
Unlike written input text in long-form MT, speech input in the ST task lacks explicit information about segmentation. Therefore, the research in the area of long-form offline speech translation concentrates on two separate issues: (1) improving segmentation into sentences, and (2) enhancing robustness through the use of larger context.
In the traditional cascaded approach with separate speech recognition and machine translation models, the work focused on segmentation strategies for the ASR transcripts. The methods are usually based on re-introducing punctuation to the transcript (Lu and Ng, 2010; Rangarajan Sridhar et al., 2013; Cho et al., 2015, 2017). However, these approaches suffer from ASR error propagation and disregard the source audio's acoustic information. This was addressed by Iranzo-Sánchez et al. (2020a); however, the approach still requires an intermediate ASR transcript that is unavailable in E2E models.
An alternative approach involves source-speech-based segmentation. The early work focused on VAD segmentation. This is usually sub-optimal as speakers place pauses inside sentences, not necessarily between them (e.g., hesitations before words with high information content; Goldman-Eisler, 1958). To this end, researchers tried considering not only the presence of speech but also its length (Potapczyk and Przybysz, 2020; Inaguma et al., 2021; Gaido et al., 2021). Later studies tried to avoid VAD and focused on more linguistically motivated approaches, e.g., ASR CTC to predict voiced regions (Gállego et al., 2021) or directly modeling the sentence segmentation (Tsiamas et al., 2022b; Fukuda et al., 2022).
To address the problem of inadequate segmentation, Gaido et al. (2020) showed that context-aware ST is less prone to segmentation errors. In an extensive study of context-aware ST, Zhang et al. (2021) observed that context improves quality, but this holds only for a limited number of utterances.

Long-Form Simultaneous ST
Research focusing on direct long-form simultaneous speech translation remains relatively scarce. The closest works are in long-form simultaneous MT. Schneider and Waibel (2020) proposed a streaming MT model capable of translating unsegmented text input. This model could be theoretically adapted for speech input. However, it was later shown that this model exhibits huge latency (Iranzo Sanchez et al., 2022). Another work (Iranzo Sanchez et al., 2022) explored the extended context and confirmed the findings from long-form MT and offline ST, demonstrating that using the previous context significantly enhances performance. They also confirmed that a too-long context leads to decreased translation quality.
Finally, the only direct SST model that claims to work on a possibly unbounded input is Ma et al. (2021). The model utilizes a Transformer encoder with a restriction on self-attention, allowing it to attend solely to a memory bank and a small segment. Unfortunately, based on the reported experiments, whether the model was specifically evaluated in the long-form setting remains unclear.

Evaluation
Evaluation of SST is a complex problem as we have to consider not only the translation quality but also the latency. Additionally, in the long-form regime, segmentation becomes another obstacle.
The most commonly used metric for translation quality in speech translation is BLEU (Papineni et al., 2002; Post, 2018). Other metrics such as chrF++ (Popović, 2017) and a neural-based metric COMET (Rei et al., 2020) can be applied, too.
The other important property of an SST system is latency. There are two main types of latencies: computation-unaware (CU) and computation-aware (CA) latency. The computation-unaware latency measures the delay in emitting a translation token relative to the source, regardless of the actual computation time. Hence, CU latency allows for a fair comparison regardless of the hardware infrastructure. However, CU latency cannot penalize the evaluated system for extensive computation; hence, CA latency can offer a more realistic assessment.
Measuring latency relative to the source or reference in SST is quite difficult because of the reordering present in translation. Historically, latency metrics were first developed for simultaneous machine translation (i.e., the source is text rather than speech). The most common are average lagging (AL; Ma et al., 2019) and differentiable average lagging (DAL; Cherry and Foster, 2019). Broadly speaking, they measure "how much of the source was read by the system to translate a word". The latency unit is typically a word. The speech community quickly adopted these metrics. Unfortunately, these metrics assume a uniform distribution of words and uniform length of these words in the speech source. Alternatively, Ansari et al. (2021) proposed to use a statistical word alignment of the candidate translation with the corresponding source transcript. This theoretically allows for more precise latency evaluation, but it is unclear how the alignment errors impact the reliability.
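For intuition, the text-based average lagging can be sketched as follows. The code follows our reading of the definition in Ma et al. (2019): AL averages g(i) - (i - 1)/γ over the target words up to the first one emitted after the whole source was read, where g(i) is the number of source words read before emitting target word i and γ is the target-to-source length ratio. The function name and argument layout are assumptions made for this illustration.

```python
from typing import Optional, Sequence


def average_lagging(delays: Sequence[int], source_length: int,
                    target_length: Optional[int] = None) -> float:
    """Text-based average lagging, measured in source words.

    delays[i] is the number of source words read when target word i + 1
    was emitted (the function g in the usual notation).
    """
    if not delays or source_length <= 0:
        raise ValueError("need at least one delay and a non-empty source")
    if target_length is None:
        target_length = len(delays)
    gamma = target_length / source_length  # target-to-source length ratio

    # tau: 1-based index of the first target word emitted after the
    # entire source has been read; later words do not add any lag.
    tau = len(delays)
    for i, read in enumerate(delays, start=1):
        if read >= source_length:
            tau = i
            break

    total = sum(delays[i - 1] - (i - 1) / gamma for i in range(1, tau + 1))
    return total / tau


if __name__ == "__main__":
    # A wait-2 policy on a 4-word source and a 4-word target.
    print(average_lagging([2, 3, 4, 4], source_length=4))
    # An offline system reads the whole source before emitting anything.
    print(average_lagging([4, 4, 4, 4], source_length=4))
```

In this toy setting, the wait-2 policy lags 2.0 words behind the source on average, while the offline system lags by the full source length of 4.0 words.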
In the unsegmented long-form setting, additional issues arise. In a typical "short-form" segmented setup, the SST model does inference on a pre-segmented input. However, the candidate and reference segmentation into sentences might differ in the long-form unsegmented regime. Traditionally, this issue was addressed by re-segmenting the hypothesis based on the reference (Matusov et al., 2005). After the re-segmentation, a standard sentence-level evaluation of translation quality and latency is done. It should be noted that the commonly used latency metrics (AL, DAL) cannot be used in the long-form regime (Iranzo-Sánchez et al., 2021) without the re-segmentation. Yet, recent work observed that the re-segmentation introduces errors (Amrhein and Haddow, 2022). This poses a risk of incorrect translation and quality assessment and remains an open research question.

Thesis Goals
The goal of our thesis is to achieve "true" long-form simultaneous speech translation. This section outlines the steps we will take to accomplish this goal.

Data and Evaluation
In our future research, we will mainly use a setup similar to the IWSLT shared tasks (Ansari et al., 2020; Anastasopoulos et al., 2021, 2022), i.e., mostly single-speaker data. As in IWSLT, we will treat the TED data as the in-domain setting. We will consider domains such as parliamentary speeches (e.g., Europarl-ST; Iranzo-Sánchez et al., 2020b) for the out-of-domain setting. As for the languages, we will include a diverse set of language pairs. A good inspiration might again be IWSLT, i.e., English-to-{German, Japanese, Chinese}. The long-form setting will be challenging because, to the best of our knowledge, none of the available data is strictly long-form. Our preliminary review found that the original TED talks can be reconstructed from the MuST-C (Cattoni et al., 2021) development and test set available for English-to-{German, Japanese, Chinese} language pairs.
As highlighted in the literature review in Section 3.5, evaluating the long-form SST remains an open problem. The quality and latency evaluation metrics currently used are designed for sentence-level evaluation. To use these metrics in the long-form regime, we must re-segment the long hypotheses into sentences based on their word alignment with the provided references. Unfortunately, the re-segmentation introduces errors, which poses a risk to the evaluation reliability. To tackle this, we will investigate alternative evaluation strategies. One potential approach for reducing the alignment error could be to move the alignment to the sentence level rather than the word level and allow an m-to-n mapping between the reference and proposed sentences, similar to the Gale-Church alignment algorithm (Gale et al., 1994), with a reasonably small m and n (e.g., 0 ≤ m, n ≤ 2). To verify the effectiveness of this method, we need to compare its correlation with human evaluations.

Quality-latency tradeoff in SST
The first step of our research concentrates on enhancing the quality-latency tradeoff, mainly in the traditional "short-form" simultaneous speech translation. We hope the insights and improvements from the short-form regime will translate into the long-form regime.
In the research done so far, we have already reviewed the possibility of "onlinizing" state-of-the-art offline speech translation models in Polák et al. (2022). Our observations indicated that the attention-based encoder-decoder (AED) models tend to over-generate. This not only affects the resulting quality but also negatively impacts the reliability of the AL latency evaluation. Therefore, we proposed an improved version of the AL metric, which was later independently proposed under the name length-adaptive average lagging (LAAL; Papi et al., 2022). To remedy the over-generation problem, we proposed an improved version of the beam search algorithm in Polák et al. (2023b). While this led to significant improvements in the quality-latency tradeoff, the decoding still relied on label-synchronous decoding. In Polák et al. (2023a), we proposed a novel SST policy dubbed "CTC policy" that uses the output of an auxiliary CTC layer to guide the decoding. The proposed CTC policy led to even greater improvements in quality and reduced the real-time factor to 50%.
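The effect of over-generation on AL can be illustrated with a small sketch. It assumes the speech variant of AL, in which the ideal delay of target word i is (i - 1) · |X| / |Y*| with |X| the source duration and |Y*| the reference length, and our understanding of LAAL, which replaces |Y*| by max(|Y*|, |Y|) with |Y| the hypothesis length. The function name `lagging` and its arguments are illustrative, not taken from any toolkit.

```python
from typing import Sequence


def lagging(delays_ms: Sequence[float], source_duration_ms: float,
            reference_length: int, length_adaptive: bool = False) -> float:
    """Speech-based AL, or LAAL when `length_adaptive` is True (in ms).

    delays_ms[i] is the amount of source speech consumed when hypothesis
    word i + 1 was emitted.
    """
    if not delays_ms or reference_length <= 0:
        raise ValueError("need a non-empty hypothesis and reference")
    hyp_length = len(delays_ms)
    # LAAL: an over-long hypothesis also stretches the ideal policy.
    denom = max(reference_length, hyp_length) if length_adaptive else reference_length
    ideal_step = source_duration_ms / denom  # ideal delay between target words

    # tau: first hypothesis word emitted after the whole source was consumed.
    tau = hyp_length
    for i, delay in enumerate(delays_ms, start=1):
        if delay >= source_duration_ms:
            tau = i
            break

    total = sum(delays_ms[i - 1] - (i - 1) * ideal_step for i in range(1, tau + 1))
    return total / tau


if __name__ == "__main__":
    # 4 s of speech, a 4-word reference, and an over-generating system
    # that emits 8 words, one every 500 ms.
    delays = [500.0 * i for i in range(1, 9)]
    print(lagging(delays, 4000.0, reference_length=4))
    print(lagging(delays, 4000.0, reference_length=4, length_adaptive=True))
```

In this constructed example, AL comes out negative (-1250 ms), suggesting the system translates before hearing the speech, whereas the length-adaptive variant reports 500 ms, which matches the actual emission pattern.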
Thus far, our research has focused primarily on the AED architecture. Nonetheless, recent findings (Anastasopoulos et al., 2022; Agarwal et al., 2023) suggest that other approaches, such as transducers (Graves, 2012), yield competitive results. Nevertheless, it remains unclear which approach is the most advantageous for SST. Our goal will be to compare these architectures for SST. We will put a particular emphasis on architectures with latent alignments (e.g., transducers). Generally, the latent alignment models make a strong monotonic assumption on the mapping between the source and the target, which might be problematic for the translation, typically involving word reordering. Therefore, we will assess the alignment quality and potential applications (such as segmentation).

Towards the Long-Form SST via On-the-Fly Segmentation
In the second stage, we will concentrate on the long-form SST by utilizing on-the-fly segmentation and short-form models from the previous stage.
Drawing inspiration from offline long-form ST, which primarily emphasizes segmentation, we consider direct segmentation modeling the most promising approach (Tsiamas et al., 2022a; Fukuda et al., 2022). The limitation of these approaches is that they do not allow out-of-the-box simultaneous inference. However, we believe their adaptation to the simultaneous regime should be relatively straightforward (e.g., using a unidirectional encoder and a custom decoding strategy). The main challenge here will be integrating this segmentation with existing models, especially considering the quality-latency tradeoff.
Our hopes go even further: Can we train a model to translate and predict the segmentation at the same time? The translation already contains punctuation marks (full stop, exclamation, and question marks), so if we knew the alignment between the translation and the source speech, we could use this information to segment the utterances directly. Therefore, we will experiment with various alignment approaches and assess their applicability to the segmentation. The results of our initial investigation on on-the-fly separation with CTC outputs are available in Polák and Bojar (2023).
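The idea of segmenting by punctuation can be sketched under the assumption that some alignment already gives, for each target token, the source time at which it ends. The function below is a hypothetical illustration of that final step only; how to obtain reliable speech-to-translation alignments is precisely the open question.

```python
from typing import List, Sequence, Tuple

SENTENCE_FINAL = {".", "!", "?"}


def segment_by_punctuation(tokens: Sequence[str],
                           end_times: Sequence[float]) -> List[Tuple[float, float, str]]:
    """Cut the source speech wherever the translation ends a sentence.

    tokens[i] is a target token and end_times[i] is the source time (in
    seconds) aligned to its end; both are assumed to be given.
    Returns (start, end, text) triples.
    """
    if len(tokens) != len(end_times):
        raise ValueError("tokens and end_times must have the same length")
    segments: List[Tuple[float, float, str]] = []
    start = 0.0
    current: List[str] = []
    for token, time in zip(tokens, end_times):
        current.append(token)
        if token in SENTENCE_FINAL:
            segments.append((start, time, " ".join(current)))
            start = time  # the next segment begins where this one ended
            current = []
    if current:  # trailing words without final punctuation
        segments.append((start, end_times[-1], " ".join(current)))
    return segments


if __name__ == "__main__":
    tokens = ["Hello", ".", "How", "are", "you", "?", "Fine"]
    times = [0.5, 0.6, 1.0, 1.2, 1.5, 1.6, 2.0]
    for segment in segment_by_punctuation(tokens, times):
        print(segment)
```

For the toy input, the speech is cut at 0.6 s and 1.6 s, and the unfinished remainder "Fine" is kept as an open segment until more input arrives.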
However, we see another valuable use of direct speech-to-translation alignments: dataset creation. Today, ST datasets are created using the cascaded approach (Iranzo-Sánchez et al., 2020b; Cattoni et al., 2021; Salesky et al., 2021). The source transcript is first force-aligned to the speech, then the transcript is word-aligned to the translations, and finally, these two alignments are used to segment the source speech into sentences based on the punctuation in the translation. This approach has a critical drawback: it virtually eliminates all data without a source transcript, preventing the research community from utilizing potentially valuable data sources. It is also worth noting that some languages do not have a writing system, which makes the direct speech-to-translation alignment even more attractive. Therefore, if the alignments show promising results, we will explore the feasibility of E2E speech-to-translation dataset creation.
An additional question is how to accommodate long context in the simultaneous regime. As pointed out in Sections 3.2 to 3.4, the performance usually drops with a context longer than a few sentences. Some solutions have been suggested (Kim et al., 2019; Feng et al., 2022), but it remains unclear how to adapt these approaches for SST with the specifics of SST in mind (e.g., computational constraints, speech input).

True Long-Form SST
The ultimate goal of our work is to achieve true long-form simultaneous speech translation. In other words, we aim to develop an architecture capable of processing a potentially infinite stream of speech input without any segmentation or special inference algorithm, translating the speech directly into the target language in real time. Admittedly, this is a very ambitious goal. However, there is plenty of evidence that it is feasible. For example, in long-form ASR, related work has already observed that the RNN-T and CTC architectures are capable of operating in the long-form regime (Chiu et al., 2019; Narayanan et al., 2019; Lu et al., 2021; Zhang et al., 2023; Rekesh et al., 2023). Arguably, speech recognition is simpler than speech translation because it monotonically transcribes speech without reordering. However, the literature also shows that an architecture like RNN-T can be used in the "short-form" offline and simultaneous ST (Yan et al., 2023).
Therefore, based on the previous work in speech recognition and translation, we will propose a novel architecture that will allow simultaneous speech translation of a possibly infinite stream of speech. We will take inspiration from the existing architectures but revise them for the specific needs of simultaneous ST. This will require a particular focus on speech-to-translation alignment so that the source speech and target translation do not get out of sync. This architecture will also contain a "forgetting" mechanism that will allow the storage of essential bits of context while preventing memory issues. Finally, we will address the train-test mismatch because current hardware and training methods do not permit models to fit long inputs.

Conclusion
In conclusion, this thesis proposal presents an overview of the challenges involved in simultaneous speech translation (SST). The literature review highlighted the limited research on long-form speech translation. Our research sets out three main goals with an emphasis on long-form speech translation. These include improving the general quality-latency tradeoff in SST, exploring long-form SST through segmented inference, and ultimately achieving true long-form SST modeling. We placed these goals in the context of related work and outlined a clear strategy for achieving them.