End-to-End Evaluation for Low-Latency Simultaneous Speech Translation

The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated, and it is often not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion. This includes the segmentation of the audio as well as the run-time of the different components. Secondly, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded and end-to-end systems. Finally, the framework allows us to automatically evaluate the translation quality as well as the latency, and also provides a web interface to show the low-latency model outputs to the user.


Introduction
In many application scenarios for speech translation, the quality of the translations is not the only important metric; it is also essential to provide the translation with low latency. This is, for example, the case when translating presentations or meetings. Therefore, we observe an increasing interest in the field of low-latency speech translation, as shown by numerous published techniques and the organization of a dedicated shared task as part of the International Conference on Spoken Language Translation (IWSLT) (Agrawal et al., 2023).
In order to enable further progress in the field, as well as a wide adoption of the technique, a framework to evaluate different approaches is essential. However, current evaluations only consider a limited number of aspects or techniques. In contrast, an overall evaluation of different architectures (end-to-end and cascaded) and presentation styles (revision and fixed) requires a general evaluation framework. This framework should also consider the computational latency as well as the ability to process several sessions in parallel.
Motivated by this, we present a new framework to apply and evaluate low-latency, simultaneous speech translation. We focus on a framework that can evaluate the different approaches under conditions that are as realistic as possible. The system is able to simulate different load conditions as well as compare systems using different design choices. Finally, we also provide a web interface to present the low-latency model outputs to the user.
The main contributions of our paper are:
• A framework for low-latency speech translation with dynamic latency adjustment
• An evaluation setup that allows for assessing the quality and latency of a low-latency scenario in an end-to-end fashion
• A comprehensive evaluation of different translation approaches and streaming algorithms
In the next section, we describe the overall architecture of the framework. The two following sections explain the streaming algorithms for the speech and text processing components. After that, we illustrate how we evaluate our framework and what the experimental setup looks like. In Section 7 we present the results. Then, we review the related work. At the end, we describe the limitations and conclude our work.

Dynamic Framework for Low-Latency Speech Translation
Motivated by previous work (Cho et al., 2013), we use a central mediator that coordinates the interaction of the different components (see Figure 1). The user sends data to an API component, which then sends the data to the mediator. The mediator forwards all arriving data to the corresponding component(s), e.g., the audio signal from the user to the speech processing component, the resulting transcripts to the text processing component, and the output (through the API) to the user. In order to allow flexible processing, a graph dynamically defines for each session how the data is sent to the different components. We process the different requests at each component using the existing streaming framework Kafka. Each component consists of a middleware and a backend, with the processing separated into three steps:
1) Input processing: The middleware implements the streaming algorithms and can be run on the CPU. It uses the state of the current session to generate requests to the backend. Other approaches (Niehues et al., 2018) repeatedly send requests to the backend for all input messages. This can result in increasing latency if the backend is not able to keep up in high-load situations. In order to minimize this, we enable the middleware to skip intermediate processing steps. This is done by combining multiple input messages, i.e., by concatenating audio or text. Several middleware workers can be run in parallel. We achieve locality of the state by sticky queues, where messages from the same session are always sent to the same middleware worker.
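The skipping of intermediate processing steps can be sketched as follows. This is a minimal illustration of the idea, not the framework's actual API; the class and method names are our own:

```python
class AudioMiddleware:
    """Per-session buffering in a middleware worker. Sticky queues
    guarantee that all messages of a session reach the same worker,
    so the per-session state needs no locking."""

    def __init__(self, backend):
        self.backend = backend
        self.pending = {}  # session id -> list of buffered audio chunks

    def on_message(self, session, audio_chunk):
        # Buffer instead of issuing one backend request per input message.
        self.pending.setdefault(session, []).append(audio_chunk)

    def step(self, session):
        # When the backend is free, combine everything buffered so far
        # into a single request; under high load this effectively skips
        # the intermediate hypotheses and keeps the latency bounded.
        chunks = self.pending.pop(session, [])
        if not chunks:
            return None
        combined = b"".join(chunks)  # concatenate the raw audio
        return self.backend.transcribe(session, combined)
```

Under low load, `step` runs after every message and behaves like per-message requests; under high load, several buffered messages collapse into one backend request.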
2) Backend request: The backend contains the hosted models. It processes the requests without maintaining any state. Because of this division into a stateful middleware and a stateless backend, we are able to share the backend and use batching of the requests.
3) Output processing: The output of a backend request is used to send information to the next component(s). Furthermore, the state of the corresponding session is updated.
Our framework supports two modes for low-latency speech translation. First, a revision mode (Niehues et al., 2018), where the component (automatic speech recognition (ASR) or machine translation (MT)) can send stable and unstable outputs. Given more context at a later time step, the component can revise the unstable outputs. Second, a fixed mode (Liu et al., 2020a; Polák et al., 2022), where the component is only allowed to send stable output. For the fixed mode (and the revision mode of the ASR component), the component needs to perform stability detection (see Sections 3 and 4 and Figure 2), i.e., determine which parts of the output should be considered stable. Note that for our streaming algorithms the backend models need to support prefix decoding, i.e., one can send a prefix which is then forced in the output.
Our framework is easily extendable by deploying additional backend models for different languages, adding new streaming algorithms in the middleware or adding custom components (e.g., speaker diarization as a preprocessing step before the ASR) and including them in the session graph.

Low-latency Speech Processing
The speech processing component receives a stream of audio packets and sends chunks of text (transcript or translation) to the mediator. For this, two steps are run.
Input processing: First, a voice activity detection generates a speech segment that can be extended when new packets of audio arrive. For this we use the WebRTC Voice Activity Detector (Wiseman, 2016). Each audio frame (30 ms) is classified as containing speech or not. Then a moving average is calculated. If it exceeds a certain threshold, a new segment is started. New audio is added to this segment until the moving average falls below a certain threshold and the segment ends. Second, the backend model (ASR or speech translation (ST)) is run. Speech segments that have already ended are processed only once and the output is sent as stable text; all other segments are processed repeatedly until they end.
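The moving-average segmentation can be sketched as follows. The window size and thresholds are illustrative values of our own choosing, not the ones used in the paper, and `is_speech` stands in for the per-frame WebRTC VAD classifier:

```python
from collections import deque

def segment_frames(frames, is_speech, window=10, start_thr=0.7, end_thr=0.3):
    """Group 30 ms frames into speech segments via a moving average
    over the binary speech/non-speech decisions per frame."""
    recent = deque(maxlen=window)
    segments, current = [], None
    for frame in frames:
        recent.append(1 if is_speech(frame) else 0)
        avg = sum(recent) / len(recent)
        if current is None and avg > start_thr:
            current = [frame]          # moving average exceeded: new segment
        elif current is not None:
            current.append(frame)      # extend the open segment
            if avg < end_thr:          # average fell below: segment ends
                segments.append(current)
                current = None
    if current is not None:
        segments.append(current)       # stream ended mid-segment
    return segments
```

Using two thresholds (hysteresis) avoids rapidly toggling segments on and off around a single threshold value.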

Stability detection and output processing:
We use the local agreement two (LA2) method from Polák et al. (2022). The intuition is that if the prefix of the output stays the same when more audio is added, the prefix can be considered stable. Let C denote the chunk size hyperparameter (LA2_chunk_size). The fixed mode works as follows (see Figure 2): It waits until the segment contains (at least) C seconds of audio (denoted by M_1) and then runs the model, but does not output any stable text. Let us denote this first model output by H_1. After the segment contains (at least) C more seconds of audio (denoted by M_2), the model is run again with all the audio and outputs H_2. Then the component outputs the common prefix of H_1 and H_2 as stable output S_2. After the segment again contains (at least) C more seconds of audio (denoted by M_3), the model is run again with all the audio. However, now S_2 is forced as a prefix in the ASR/ST model decoding. The model outputs H_3, and the common prefix of H_2 and H_3 is the next stable output S_3. This procedure continues until the speech segment ends.
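The fixed-mode loop above can be sketched as follows. The `model` function is a stand-in for the ASR/ST backend with prefix decoding; how it produces hypotheses is not part of the sketch:

```python
def common_prefix(a, b):
    """Longest common prefix of two token lists."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

def la2_fixed(chunks, model):
    """Local agreement (LA2) in fixed mode. `chunks` are the successive
    C-second audio chunks of one segment; `model(audio, prefix)` stands
    in for ASR/ST decoding with `prefix` forced in the output. Yields
    the newly stabilized tokens after each chunk."""
    audio, prev_hyp, stable = [], None, []
    for chunk in chunks:
        audio.append(chunk)
        hyp = model(audio, prefix=stable)
        if prev_hyp is not None:
            new_stable = common_prefix(prev_hyp, hyp)
            # the agreed prefix can only grow; emit only the new part
            yield new_stable[len(stable):]
            stable = new_stable
        prev_hyp = hyp
```

Because the stable output is forced as a prefix in later decoding steps, the agreed prefix never shrinks, so only the newly agreed tokens need to be emitted.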
Note that the ASR/ST model has a certain maximum input size due to latency, memory, and compute constraints. Therefore, if this limit is reached, the input audio to the model as well as the corresponding forced prefix is cut away.
The revision mode differs from the fixed mode in that the last hypothesis, except for the common prefix, is sent as unstable output. Furthermore, in the time period until the speech segment again contains C more seconds of audio, the currently available audio is run through the model and the hypothesis, except for the last stable output, is sent as unstable output.

Low-latency Text Processing
The text processing component receives a stream of (potentially revisable) text messages and sends chunks of text (translation) to the mediator.
Input processing: First, all input text that has arrived is split into sentences by punctuation. Then, the backend model (MT) is run.
Stability detection and output processing: All sentences containing only stable text are processed once, and the output is sent as stable text. For the other sentences, which contain unstable text, the behavior depends on the mode. Whether text is stable or not is determined by the speech processing component.
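The sentence splitting and the stable/unstable partitioning can be sketched as follows; the punctuation set and the tuple representation of tokens are simplifying assumptions of ours:

```python
import re

def split_sentences(tokens):
    """Split a stream of (word, is_stable) tokens into sentences at
    punctuation marks (a simple stand-in for the component's
    sentence splitter)."""
    sentences, current = [], []
    for tok, stable in tokens:
        current.append((tok, stable))
        if re.fullmatch(r"[.!?]", tok):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)  # trailing, still-open sentence
    return sentences

def partition(sentences):
    """Sentences made only of stable tokens are translated once;
    the remaining ones are handled according to the mode."""
    stable = [s for s in sentences if all(flag for _, flag in s)]
    unstable = [s for s in sentences if not all(flag for _, flag in s)]
    return stable, unstable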
The revision mode works as follows: All sentences containing unstable text are processed by the backend model, and the output text is sent as unstable text. A similar approach is not possible in the speech processing revision mode (see Section 3), since speech segments are not limited in size but the model input size is.
For the fixed mode, we use the local agreement method from Liu et al. (2020a). The processing is similar to that of the speech processing component. The difference is that the backend model is run when at least one new word is available, instead of at least C seconds of audio. In preliminary experiments, we also tried waiting for up to at least five words, but the results were basically identical, since the input is extended by only a few words most of the time. Furthermore, only the stable part of the sentences containing unstable text is used as input. This restriction is not necessary in the speech processing component, since there is no unstable audio input.

Evaluation Framework
We evaluate our system in an end-to-end fashion. That is, given an input audio, we send it to the system and evaluate the final returned transcript and translation. We provide an evaluation framework that assesses the system in different aspects and logs the results to categorized experiments on a UI dashboard using MLflow (Zaharia et al., 2018). We consider the following evaluation metrics.
BLEU: In order to assess the translation quality, we use the case-sensitive BLEU score, calculated using sacreBLEU (Post, 2018). We extract the final stable translation and align it sentence-wise with the gold reference using mwerSegmenter (Matusov et al., 2005) before calculating the BLEU score.

WER:
In order to assess the transcription quality of the ASR component in the cascaded setting, we use the case-sensitive word error rate (WER), calculated using JiWER. As before, we extract the final stable transcription and align it sentence-wise with the gold reference using mwerSegmenter (Matusov et al., 2005) before calculating the WER.
Latency: We define the total latency of the system as the average time (in seconds) from when an utterance is spoken until its first-unchanged translation is returned by the system. Note that the first-unchanged translation is not necessarily already marked as "stable" by the system.
For each message returned by the system, we have the stable/unstable flag along with three timestamps t_s, t_e, and t_r. The timestamps t_s and t_e are the start and end times of the audio segment that aligns to the message. The timestamp t_r is the time when the message was received. We collect the first-unchanged messages as follows: We split the received messages into blocks of messages marked from "unstable" to "stable". In each unstable-to-stable block, starting from the last stable message, we backtrack through the previously received unstable messages to find the first one that has a prefix overlap with the final stable message. This is illustrated in Figure 3.
Once we have collected the first-unchanged messages, we can calculate the latency. We use the same definition of delay as Niehues et al. (2016), where the average delay of the i-th message is the time between when its words were spoken and when it was received:

delay_i = t_r,i - (t_s,i + t_e,i) / 2

Then we calculate the latency as the weighted average of the delays of all m first-unchanged messages based on their length |w_i| in words:

latency = (sum_{i=1}^{m} |w_i| * delay_i) / (sum_{i=1}^{m} |w_i|)

Note that the timestamps t_s and t_e in our latency formula are provided by the used streaming algorithm. Therefore, we also tried another, model-independent latency metric that only uses t_r. This metric approximates the segment-message alignment by assuming that each word output by the system has a duration of 0.3 seconds in the audio. Due to this strong assumption, the metric does not represent the perceived latency well. We only use it to verify our main model-dependent latency metric.
We find that the model-independent latency metric and our model-dependent metric provide the same relative ranking of the systems. This indicates that the timestamps t_s and t_e provided by the model itself are reliable for measuring latency.
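The weighted-average latency can be sketched as follows. We take the delay of a message as the time from the midpoint of its audio segment to its receipt, which is one reasonable reading of the per-word average delay; the tuple layout of the messages is our own assumption:

```python
def average_latency(messages):
    """Weighted average delay over the first-unchanged messages.
    Each message is (text, t_s, t_e, t_r): the aligned segment's
    start/end times and the receive time, all in seconds. Delays
    are weighted by the word count of each message."""
    total, weight = 0.0, 0
    for text, t_s, t_e, t_r in messages:
        n_words = len(text.split())
        delay = t_r - (t_s + t_e) / 2.0  # midpoint-to-receipt delay
        total += n_words * delay
        weight += n_words
    return total / weight if weight else 0.0
```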
Flickering rate: The flickering rate is the average number of flickers per reference word. We count the number of flickers by looking at every pair of consecutive messages in a message block. If two words at the same position in the two messages differ, this is counted as a flicker (see Figure 4). The flickering rate is then the total number of flickers divided by the total number of words in the reference.
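A minimal sketch of this count, assuming messages are whitespace-tokenized strings (words that only exist in the longer of the two messages are not compared):

```python
def count_flickers(block):
    """Count flickers in one unstable-to-stable message block: for each
    pair of consecutive messages, a differing word at the same position
    counts as one flicker."""
    flickers = 0
    for prev, curr in zip(block, block[1:]):
        p, c = prev.split(), curr.split()
        flickers += sum(1 for a, b in zip(p, c) if a != b)
    return flickers

def flickering_rate(blocks, reference):
    # total number of flickers divided by the reference word count
    total = sum(count_flickers(b) for b in blocks)
    return total / len(reference.split())
```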
Experimental setup

Evaluation data
We test our system using datasets from different language pairs. The test datasets include:
• Test data from the IWSLT shared task (tst19, tst20) (Anastasopoulos et al., 2021, 2022), where the domain is TED talks.
• The test split of the Multilingual TEDx Corpus (mTEDx) (Salesky et al., 2021), where the domain is also TED talks.
• Lecture data (LT), which we collected internally at our university. This test set includes a CS variant with lectures from the Computer Science domain and a non-CS variant with lectures outside of the Computer Science domain.
The detailed statistics of the test data are shown in Table 1.

Transcription and translation models
The English ASR models are built based on pretrained WavLM (Chen et al., 2022) and BART (Lewis et al., 2019), while for multilingual ASR we utilize the XLS-R models (Babu et al., 2021) for the encoder and the mBART-50 model (Liu et al., 2020b) for the decoder, following Pham et al. (2022). The translation models, on the other hand, are based on the pretrained DeltaLM (Ma et al., 2021). For the en→X directions, the models are fine-tuned for ACL talks based on Liu et al. (2023). For the other directions, DeltaLM is fine-tuned on a combination of commonly available datasets. Finally, for the end-to-end ST system, we use the language-agnostic model from Huber et al. (2022) that can decode en-de ST and de ASR.

Quality vs Latency trade-off
In the first experiment, we assess the trade-off between translation quality and latency by varying the LA2_chunk_size parameter. The results are shown in Table 2. As can be seen, as we increase the chunk size, the translation quality improves while the latency gets worse, both for cascaded ST and end-to-end ST. This is expected: a higher chunk size means longer input is given to the model at each step, so the output quality improves due to the additional context, while the latency increases due to the longer waiting time for collecting the input.

Revision mode vs fixed mode
Second, we report the results of comparing the revision mode to the fixed mode with different LA2_chunk_size values when performing cascaded translation on the en-de ACL dev set. As can be seen in Figure 5, in general, the revision mode achieves a better BLEU score but worse latency than the fixed mode. This is expected: in the revision mode, when more input audio becomes available, the system can correct its previous output, which yields better translation quality but worse latency due to the additional re-translation overhead.

Cascaded vs End-to-End
Third, we report the results of comparing the cascaded setting to the end-to-end setting when performing online translation with revision mode on the ACL dev set. As can be seen in Figure 6, in general, cascaded ST achieves a better BLEU score but worse latency than end-to-end ST. Cascaded ST has worse latency since it consists of two components, each of which has to do computation. However, we observe that, at a similar latency of around seven seconds, cascaded ST still obtains a better BLEU score. On the other hand, end-to-end ST achieves a lower minimum latency (around four seconds lower than the cascaded system).

Load balancing
In order to assess the system's capability to balance load, we conduct experiments running multiple sessions simultaneously on the same hosted model, with and without scaling the number of middleware workers. For speech processing, we test parallel sessions on the ACL dev en-de set using the end-to-end ST model. For text processing, we test one cascaded ST session on the ACL dev set, where the number of parallel sessions equals the number of requested MT target languages. In all experiments, we set LA2_chunk_size = 2 and report only the en-de results.
The results are shown in Table 3. As expected, the latency gets worse as the number of parallel sessions increases. Using multiple middleware workers counteracts this to some extent by making sure that the backend model is always busy and never waiting for the next request. Furthermore, we see that the flickering rate decreases when the number of parallel sessions increases. This is because, under higher load, fewer requests are sent to the backend, and we therefore observe less flickering. Here, our automatic load balancing can be seen in action.

Related work
SimulEval (Ma et al., 2020) provides an evaluation framework for low-latency simultaneous speech translation with a decoupled client-server architecture that allows plugging in translation models and stability detection policies. The main differences are that we leave the audio segmentation up to the model, whereas Ma et al. (2020) rely on a pre-segmentation of the audio, and that we factor in the computational latency in addition to the model latency and explore the scaling behavior in multi-session scenarios, both of which make for a more realistic deployment scenario. Similar to this work, Franceschini et al. (2020) implement a low-latency speech translation pipeline; however, their architecture does not scale well to multiple sessions and is not well suited for end-to-end evaluation.

Limitations and Conclusion
Since we run and evaluate the experiments in a realistic real-world scenario, it is difficult to reproduce the results exactly. The experiments are non-deterministic, e.g., because of network latencies. Furthermore, the results depend on the speed of the hardware used, especially the hardware for the backend models. Additionally, we expect each implemented streaming algorithm to return start and end timestamps. This may not be the case for all streaming algorithms one could want to compare.
In conclusion, this paper presented a framework for running and evaluating low-latency speech translation under realistic conditions. It opens up new possibilities for advancing low-latency translation systems and serves as a resource for researchers seeking to improve the latency and quality of real-time speech translation applications by enabling a proper evaluation of different models and streaming algorithms.

A Detailed results
We report the overall performance of our system on different test data and language pairs with different settings in Table 4. In this experiment, we use the cascaded setting with LA2_chunk_size = 2. As can be seen, the BLEU scores drop by around one point when we move from the offline to the online setting (in fixed mode a little more), depending on the language direction.

B Additional information
A video demonstrating the system can be found here: Video link
Figure 1: Framework overview
Figure 2: Stability detection

Figure 4: Example of flickers (denoted by red arrows) in an unstable-to-stable message block.

Figure 5: Latency vs. quality (for the cascaded model) in revision mode or fixed mode. C: LA2_chunk_size (s).

Figure 6: Latency vs. quality (in revision mode) for the cascaded ST or end-to-end ST model. C: LA2_chunk_size (s).

Table 1: Statistics of the test data. *Test data containing en audio with translations into de, ja, zh, ar, nl, fr, fa, pt, ru and tr.

Table 4: Overall performance of our cascaded system with LA2_chunk_size set to 2 seconds: quality, latency, and flickering rate. ∆BLEU: difference compared to the corresponding offline setting.