Attention as a Guide for Simultaneous Speech Translation

In simultaneous speech translation (SimulST), effective policies that determine when to write partial translations are crucial to reach high output quality with low latency. Towards this objective, we propose EDAtt (Encoder-Decoder Attention), an adaptive policy that exploits the attention patterns between audio source and target textual translation to guide an offline-trained ST model during simultaneous inference. EDAtt uses the attention scores modeling the audio-translation relation to decide whether to emit a partial hypothesis or wait for more audio input. The underlying assumption is that, if attention is focused on the most recently received speech segments, the information they provide may be insufficient to generate the hypothesis, so the system has to wait for additional audio input. Results on en→{de, es} show that EDAtt outperforms the SimulST state of the art, with gains of up to 7 and 4 BLEU points for the two languages respectively, and with a reduction in computational-aware latency of up to 1.4s and 0.7s compared to existing SimulST policies applied to offline-trained models.


Introduction
Simultaneous speech translation (SimulST) is the task in which the model has to generate incremental translations while it continues to receive audio input. This characteristic makes SimulST a very challenging task, since the need to generate high-quality outputs has to be balanced with the need to minimize their latency, i.e. the time elapsed between when a word is uttered and when it is actually translated by the system. SimulST is commonly addressed with end-to-end (or direct) models (Bérard et al., 2016), which exhibit lower latency and better performance than traditional cascade architectures consisting of separate automatic speech recognition (ASR) and machine translation (MT) components (Ansari et al., 2020; Anastasopoulos et al., 2021, 2022). Most of the current solutions simulate the simultaneous test conditions by providing partial input during training (Ren et al., 2020; Zeng et al., 2021; Liu et al., 2021b). In most cases, this makes it necessary to train different models to reach different quality-latency trade-offs (Ren et al., 2020; Ma et al., 2020b; Han et al., 2020; Zeng et al., 2021; Ma et al., 2021; Zaidi et al., 2021; Liu et al., 2021a; Zaidi et al., 2022), which dramatically increases the computational costs of training and maintenance.
To avoid this problem, Chen et al. (2021), Nguyen et al. (2021), Polák et al. (2022), and Papi et al. (2022a) explored the use of offline ST models, obtaining competitive or even better results than those trained in the simultaneous regime in the last Simultaneous Speech Translation task at the IWSLT evaluation campaign (Anastasopoulos et al., 2022). The advantages of offline models include the reduced need for computational resources, since only one model has to be trained and maintained to perform both offline and simultaneous ST, and the possibility to use already available and up-to-date systems without ad-hoc adaptations to the SimulST scenario. However, these approaches are characterized by i) the need for external ASR models to inform the so-called decision policy, which is in charge of deciding whether to wait for more audio input or to emit partial translations, as in (Chen et al., 2021), ii) lower performance than the current state of the art in SimulST (Liu et al., 2021b), especially at lower latency (Nguyen et al., 2021; Papi et al., 2022a), or iii) high computational latency (Polák et al., 2022), which is a critical aspect for application in real-world scenarios.
In order to define a reliable and computationally efficient SimulST policy for offline ST systems, we propose EDATT (Encoder-Decoder Attention), an attention-based policy that exploits the encoder-decoder attention patterns of offline ST models to decide whether to emit partial translations or not. The use of attention to inform the decision policy is motivated by related works on MT and language modeling proving that attention scores can encode syntactic dependencies (Raganato and Tiedemann, 2018; Htut et al., 2019), predict language representations in the brain (Lamarre et al., 2022), and align source and target tokens (Tang et al., 2018; Garg et al., 2019; Chen et al., 2020). We posit that this encoder-decoder attention relation between source audio and target tokens also exists in ST models and can be used to guide our offline model during simultaneous inference. In a nutshell, our idea is that each word of the partial hypothesis generated at each time step is emitted only if it does not attend to the most recent audio frames, meaning that the information received up to that time step is sufficient to emit that word.
Building on this idea, in this paper we show that:
• Encoder-decoder attention scores are representative of source frames-target tokens relations that can be exploited to guide the ST model policy in the simultaneous scenario;
• The EDATT policy outperforms the other SimulST policies applied to offline ST models on MuST-C (Cattoni et al., 2021) en→{de, es} for almost all latency regimes, with a reduction of up to 1.4s for German and 0.7s for Spanish in computational-aware latency;
• The EDATT policy also outperforms the SimulST state of the art, especially in terms of computational latency, with gains up to 7.0 BLEU for German and 4.0 BLEU for Spanish.

Background

Transformer-based architecture
In terms of architectural choices, Transformer (Vaswani et al., 2017) and its derivatives (Gulati et al., 2020; Chang et al., 2020; Burchi and Vielzeuf, 2021; Andrusenko et al., 2022) are the de-facto standard both in offline and simultaneous ST (Ansari et al., 2020; Anastasopoulos et al., 2021, 2022). A generic Transformer model is composed of an encoder, whose role is to map the input speech sequence $x = (x_1, ..., x_n)$ into an internal representation $z = (z_1, ..., z_n)$, and a decoder, whose role is to generate the output textual sequence $y = (y_1, ..., y_n)$ by exploiting $z$ in an auto-regressive manner (Graves, 2013), that is by consuming the previously generated output as additional input when generating the next one.
The encoder and the decoder are composed of a stack of identical blocks, whose components may vary depending on the particular Transformer-based architecture, but their main common idea is the dot-product attention mechanism (Chan et al., 2016). In general, the attention is a function that maps a query matrix Q and a set of key-value matrices (K, V) to an output matrix (Bahdanau et al., 2016). The output is obtained as a weighted sum of V, whose weights are computed through a compatibility function between Q and K that, in the case of scaled dot-product attention used in the Transformer formulation, is:

$$A(Q, K) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$$

where $d_k$ is the dimension of the keys. In the encoder layers, Q, K, and V are computed from the same speech input sequence, realizing self-attention. In the decoder layers, instead, two types of attention are computed one after the other: self-attention and encoder-decoder attention. In the encoder-decoder attention, Q comes from the previous decoder layer (or directly from the previously generated output, in the case of the first decoder layer) while K and V come from the output of the encoder. In this paper, we exploit this encoder-decoder attention matrix to guide the model during simultaneous inference.
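As an illustration of the scaled dot-product attention described above, the following is a minimal sketch in plain Python. It assumes that matrices are represented as lists of rows; the function names `softmax` and `scaled_dot_product_attention` are illustrative and are not taken from any specific toolkit. In encoder-decoder attention, each row of `Q` would correspond to a target token and each row of `K` and `V` to an encoder frame.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for Q (m x d_k), K (n x d_k), V (n x d_v).

    weights[j][i] is the attention of query j over key i; each row sums to 1.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Compatibility scores between one query and every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is the attention-weighted sum of the value rows.
    output = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return output, weights
```

The returned `weights` matrix is the quantity that an attention-based policy would inspect; the `output` is what the decoder layer consumes.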

Simultaneous Speech Translation
In SimulST, the source speech context x available for the translation y is incrementally increased over time t. Hence, the input x(t) has length n(t), which in turn depends on time t. The time is discrete and its span is defined by a pre-defined segment size $T_s$, obtaining $t = [T_s, 2 \cdot T_s, 3 \cdot T_s, ...]$. Thus, the simultaneous output y(k) at a certain time k depends on the partial audio received until that time, x(k), and on the previous output tokens y(t < k).
Every time a new speech segment is added to the input context, the decision on whether to wait for more input or to emit the partial hypothesis y can be taken through adaptive (i.e. learned by the model, as in (Liu et al., 2021b)) or fixed (i.e. based on heuristics, as in (Ma et al., 2019)) decision policies. In the last IWSLT campaigns (Anastasopoulos et al., 2021, 2022), the former policies have been proven to outperform the latter ones. In our case, even if our ST models have not been trained in simultaneous mode, we define our policy as adaptive since it is based on parameters (attention scores) learned by our model during (offline) training.

EDATT policy
We propose to exploit the information contained in the encoder-decoder attention matrix of an offline ST model during inference to determine whether to wait for more audio input or write a partial hypothesis. Our approach builds on the following hypothesis (see Figure 1): at each time step, if the attention is focused towards the end of the input sequence, the system will probably need more information to correctly produce the current output candidate. In fact, if the encoder-decoder attention of the predicted token points to the most recent speech information, i.e. attention scores are higher towards the last audio frames received, this information could be incomplete and therefore still insufficient to generate that token. This means that, if the model has produced a partial hypothesis, it emits the tokens of that hypothesis in order and stops at the first token whose encoder-decoder attention scores focus towards the end of the available speech segment.
Figure 1: (1) When the first speech segment is received, the partial hypothesis "Ich werde" is emitted since the attention is not concentrated towards the end of the segment, while "reden." is not emitted since the attention is all concentrated on the last frames. (2) When the second speech segment is received, the new partial hypothesis "über Klima sprechen." is emitted since the attention is not concentrated towards the end of the segment.
More formally, given an input sequence x of n audio frames at time k, we verify for each token j whether the sum of the encoder-decoder attention A over the last λ frames is below a certain threshold α:

$$\sum_{i=n-\lambda+1}^{n} A_{i,j} < \alpha \qquad (1)$$

where α ∈ [0, 1] since attention scores sum up to 1. As long as this condition holds true, we continue to emit the tokens; otherwise, we stop the emission. This means that, if the model at time k has produced a partial hypothesis y made of w tokens and Equation 1 holds true up to and including the j-th token, the output will be:

$$y_1, \ldots, y_j$$

During simultaneous inference, applying this decision policy means: i) computing the partial hypothesis y, and consequently the encoder-decoder attention matrix A, at every time step in which a new speech segment is received as input, ii) verifying up to which token j the attention over the last λ frames does not exceed α, and iii) emitting the partial hypothesis y up to that token. Under this formulation, while λ is a parameter to be determined, α controls the latency of the model, i.e. larger values of α will decrease the latency and vice versa.
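The emission rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name `edatt_emit` is hypothetical, the attention matrix is stored token-major (`attention[j][i]` is the weight of token j on frame i, i.e. the transpose of the A indexing used in the text), each row is assumed to sum to 1, and the attention values in the usage example are invented for illustration.

```python
def edatt_emit(hypothesis, attention, lam, alpha):
    """Return the prefix of `hypothesis` that the policy would emit.

    hypothesis: list of tokens of the current partial hypothesis.
    attention:  attention[j][i] = weight of token j on audio frame i.
    lam:        number of most recent frames to inspect (lambda >= 1).
    alpha:      threshold in [0, 1] on the attention mass over those frames.
    """
    if lam < 1:
        raise ValueError("lam must be at least 1")
    emitted = []
    for token, row in zip(hypothesis, attention):
        # Stop at the first token that attends too much to the last frames.
        if sum(row[-lam:]) >= alpha:
            break
        emitted.append(token)
    return emitted


# Toy example with invented attention values over 4 frames.
hyp = ["Ich", "werde", "reden."]
att = [
    [0.60, 0.30, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.00, 0.10, 0.30, 0.60],
]
print(edatt_emit(hyp, att, lam=2, alpha=0.5))
```

In this toy case the third token places most of its attention on the last two frames, so only the first two tokens are emitted; raising `alpha` lets more tokens through, which lowers latency.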

Data
To be comparable with previous works (Ren et al., 2020; Ma et al., 2020b; Zeng et al., 2021; Liu et al., 2021b; Papi et al., 2022a), we train two separate models on MuST-C en→{de, es} (Cattoni et al., 2021). The choice of the two target languages is also motivated by their different word ordering: Subject-Object-Verb (SOV) for German and Subject-Verb-Object (SVO) for Spanish. This opens the possibility of validating our approach on target-language word orderings that are respectively different from and similar to that of the English (i.e. SVO) source audio. We also perform data augmentation by applying sequence-level knowledge distillation (Kim and Rush, 2016) as in (Liu et al., 2021b), for which the transcripts of MuST-C en→{de, es} are translated using an MT model (more details can be found in Appendix B) and used together with the gold reference during training. Data statistics are given in Appendix A.

Architecture and Training Setup
The offline ST model is made of 12 Conformer (Gulati et al., 2020) encoder layers and 6 Transformer decoder layers, with a total of ∼115M parameters. Each encoder/decoder layer has 8 attention heads. The input is represented as 80 audio features extracted every 10ms with a sample window of 25ms and processed by two 1D convolutional layers with stride 2 to reduce its length by a factor of 4 (Wang et al., 2020). Utterance-level Cepstral Mean and Variance Normalization (CMVN) and SpecAugment (Park et al., 2019) are applied during training. Detailed settings are described in Appendix B.

Inference and Evaluation
We use the SimulEval tool (Ma et al., 2020a) to simulate simultaneous conditions and evaluate all the models. For our policy, we use speech segments of 800ms, i.e. $T_s$ = 800ms. During inference, the features are computed on the fly and CMVN is based on the global mean and variance estimated on the MuST-C training set. All inferences are performed on a single NVIDIA TESLA K80 GPU with 12GB of RAM as in the IWSLT Simultaneous evaluation campaigns.
We use sacreBLEU (Post, 2018) to evaluate translation quality, and Average Lagging (AL) (Ma et al., 2019) to evaluate latency, as in the default SimulEval evaluation setup. As suggested by (Ma et al., 2020b), we also report the computational-aware version of Average Lagging for our comparisons with other approaches. Computational-aware Average Lagging (AL_CA) is computed by considering the real elapsed time instead of the ideal one considered by AL, giving a more realistic latency measure when the system operates in real time. Its computation is also provided by SimulEval.
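To make the latency measure concrete, the following is a minimal sketch of Average Lagging for speech input, following our reading of the commonly used definition: each emitted token is compared with an ideal translator that emits tokens uniformly over the source duration, and the average stops at the first token emitted once the whole source has been received. The function name `average_lagging` and its arguments are illustrative assumptions; SimulEval's own implementation may handle edge cases differently. For the computational-aware variant, the delays would be measured in elapsed wall-clock time rather than in ideal source time.

```python
def average_lagging(delays, source_duration, reference_length):
    """Average Lagging in the same time unit as `delays` (e.g. milliseconds).

    delays:           delays[i] = amount of source consumed when token i is emitted.
    source_duration:  total duration of the source audio.
    reference_length: number of tokens in the reference translation.
    """
    if not delays:
        return 0.0
    # Time an ideal, perfectly synchronous system would spend per target token.
    ideal_step = source_duration / reference_length
    total, tau = 0.0, 0
    for i, delay in enumerate(delays):
        total += delay - i * ideal_step
        tau = i + 1
        # Tokens emitted after the full source is available are not counted further.
        if delay >= source_duration:
            break
    return total / tau
```

For example, with 800ms segments, a 1600ms utterance, a 4-token reference, and delays `[800, 800, 1600, 1600]`, only the first three tokens contribute to the average.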

Terms of Comparison
We conduct experimental comparisons with:
Cross Attention Augmented Transformer (CAAT): the state of the art in SimulST (Liu et al., 2021b), and the winning system at IWSLT 2021 (Anastasopoulos et al., 2021). Inspired by the Recurrent Neural Network Transducer (Graves, 2012), it is made of three Transformer stacks: the encoder, the predictor, and the joiner. These three elements are jointly trained in simultaneous mode to optimize translation quality while keeping latency under control. We train and evaluate the CAAT model using the code provided by the authors on the same data used for our offline ST model.
Common prefix: the decision policy used by the winning system at IWSLT 2022 (Anastasopoulos et al., 2022). It consists in generating a partial hypothesis from scratch every time a new speech segment is added, and emitting it, or part of it, if it coincides with the hypotheses generated at the previous steps. Since considering only the most recent previously generated tokens as memory was empirically found to work better, in this work we adopt the same strategy to directly apply this policy to our offline ST models.
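The agreement check behind a common-prefix policy can be sketched as below. This is a simplified illustration that compares only the two most recent hypotheses; the function names are hypothetical, and details such as how many previous hypotheses are compared or how the memory of generated tokens is handled are assumptions not specified here.

```python
def longest_common_prefix(a, b):
    """Longest shared prefix of two token lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


def common_prefix_step(previous_hyp, current_hyp, num_already_emitted):
    """Return the new tokens to emit after receiving one more speech segment.

    Tokens are emitted only if the hypothesis generated at the previous step
    and the one generated now agree on them.
    """
    agreed = longest_common_prefix(previous_hyp, current_hyp)
    return agreed[num_already_emitted:]
```

Because the hypothesis is regenerated from scratch at every step and tokens are released only after two steps agree, this kind of policy tends to pay a higher computational cost per emitted token.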
Wait-k: the most popular decision policy in SimulST (Ren et al., 2020; Ma et al., 2020b; Zeng et al., 2021). This decision policy consists in waiting for a fixed number of words (k) before starting to emit the translation. Following the findings of (Papi et al., 2022a), we directly apply this policy to our offline ST models by adopting adaptive word detection, i.e. the number of words k is detected through the CTC outputs obtained from the input audio, since it proved to achieve better performance. In addition, we employ beam search to be comparable with the other two baselines.
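The wait-k decision rule itself is simple and can be sketched as follows. The sketch assumes that the number of source words detected so far is given as input (in the setting above it would be estimated from the CTC outputs, which is not modeled here), and it omits the final flush of the remaining translation once the source ends. The function names are illustrative.

```python
def wait_k_can_write(num_source_words, num_target_words, k):
    """True if the next target word may be written under wait-k."""
    return num_source_words - num_target_words >= k


def simulate_wait_k(source_words_per_segment, k):
    """Return the cumulative number of target words written after each segment."""
    read, written, history = 0, 0, []
    for new_words in source_words_per_segment:
        read += new_words
        # Keep writing while the translation lags at least k words behind the source.
        while wait_k_can_write(read, written, k):
            written += 1
        history.append(written)
    return history


print(simulate_wait_k([2, 1, 2], k=3))  # -> [0, 1, 3]
```

The starting latency of such a policy is bounded from below by the time needed to detect the first k source words, regardless of how confident the model already is.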

Experiments
In this section, we first motivate the choice of attention as a guide for SimulST, proving that a relationship between source audio frames and target translation texts actually exists and can be used as a proxy to guide our simultaneous adaptive policy (Section 5.1). Then, we report the experimental results obtained by our proposed EDATT policy compared with state-of-the-art approaches in SimulST (Section 5.2).

Attention Analysis
To validate our hypothesis and study the feasibility of our method, we start by exploring the encoder-decoder attention matrices of the offline-trained models. We proceed as follows: first, by visualizing the attention weights, we check for the existence of patterns that could potentially be exploited during simultaneous inference. Then, by computing alignment errors, we analyse the informativeness of attention scores, looking for the best candidate layer from which to start our analysis. Lastly, we apply the EDATT policy to discover which is the best configuration to adopt, considering the value of λ, the decoder layer, and the attention head from which to extract the attention scores.
Do attention patterns exist also in ST? One of the simplest methods to discover patterns of attention is to visualize its weights. To this end, we analyse the encoder-decoder matrices obtained on the MuST-C dev set and discover a common phenomenon across the two languages (de, es). We observe that the attention weights concentrate on the last frame, independently from the input length. This problem has already been observed in prior works on attention analysis showing that attention often concentrates on the initial/final token (Clark et al., 2019;Kovaleva et al., 2019;Kobayashi et al., 2020;Ferrando et al., 2022), accumulating up to 97% of the attention weights (Vig and Belinkov, 2019) and hindering the visualization of the attention patterns. Thus, we filter out the last frame from the attention matrix and then re-normalize it. In this way, we obtain a clear pseudo-diagonal pattern compared to the previous unfiltered representation, as shown in Figure 2. The patterns depicted by the encoder-decoder attention scores after the last frame removal suggest that there exists a relationship between source audio frames and target translation texts that can be further explored.
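The last-frame filtering and re-normalization step can be sketched as follows, assuming a token-major matrix in which `attention[j][i]` is the weight of token j on frame i. The function name `drop_last_frame` and the guard for rows whose remaining mass is zero are our own assumptions.

```python
def drop_last_frame(attention):
    """Remove the last frame from each row and re-normalize the row to sum to 1."""
    filtered = []
    for row in attention:
        kept = row[:-1]
        total = sum(kept)
        # Leave the row untouched if no attention mass remains after filtering.
        filtered.append([w / total for w in kept] if total > 0 else list(kept))
    return filtered
```

For instance, a row such as `[0.01, 0.01, 0.01, 0.97]`, where almost all the mass sits on the last frame, becomes a uniform distribution over the first three frames, which makes the remaining pattern visible.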
From which layer to start? To automatically analyse attention patterns and choose the best layer from which to start our analysis, we evaluate the alignment quality produced by the attention matrix for each layer as in prior MT works (Garg et al., 2019; Chen et al., 2020), since alignment and translation quality are strongly related (Chen et al., 2016; Alkhouli and Ney, 2017). While in MT this is done by computing the alignment error rate (AER) (Och and Ney, 2000) between the alignments produced by the attention matrices and those obtained by textual aligners (e.g. GIZA++), no tool exists to date to force-align translations with their corresponding audio segments. For this reason, we resort to the CTC (greedy) prediction associated with each input frame produced by our offline ST model to compute the alignments instead of the frame itself. Hence, we compute the alignment between the predicted translation and the transcript as the one-hot representation of the encoder-decoder attention matrix, i.e. the target (translated) token is associated with the source token (obtained by the CTC prediction) having the highest attention score. Then, we generate the silver alignments between transcripts and translations using a neural model such as SimAlign (Jalili Sabet et al., 2020), since it proved to produce better alignments than the most popular statistical models, fastAlign and GIZA++, and compare these alignments with the ones produced through the attention scores using the AER. Table 1 shows the average AER computed on the MuST-C dev set for both languages (de, es). Despite the fact that the alignments are estimated from CTC predictions, we clearly observe that the last 3 layers are more representative than the first 3 and that Layer 5 is the best layer from which to extract the alignment information, consistent with the finding of (Garg et al., 2019). We hence select Layer 5 to start our analysis for selecting the best configuration for the EDATT policy.
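The two steps of this evaluation, turning attention into hard alignments and scoring them with AER, can be sketched as below. The sketch assumes a token-major attention matrix (`attention[j][i]` for target token j and source token i) and uses the standard AER definition with sure and possible links; when only silver alignments are available, they are treated as both sure and possible. Function names are illustrative.

```python
def attention_to_alignment(attention):
    """Link each target token to the source token with the highest attention."""
    return {
        (j, max(range(len(row)), key=row.__getitem__))
        for j, row in enumerate(attention)
    }


def alignment_error_rate(predicted, sure, possible=None):
    """AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), with S a subset of P."""
    possible = sure if possible is None else (possible | sure)
    if not predicted and not sure:
        return 0.0
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))


# Toy example with invented values: 2 target tokens, 3 source tokens.
att = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
pred = attention_to_alignment(att)            # {(0, 0), (1, 2)}
silver = {(0, 0), (1, 1)}
print(alignment_error_rate(pred, silver))     # 0.5
```

Lower AER means that the argmax of the attention agrees more often with the reference aligner, which is the sense in which a layer is considered more representative.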
What is the best λ? To set the best working point for the EDATT policy, we aim at finding the best value of λ, that is the number of frames on which to apply Equation 1. (We also experimented with different ways to determine λ, such as using a percentage instead of a fixed number, but this yielded negligible differences.) Following the previous finding, we start from the 5th decoder layer to perform our search using the MuST-C dev set for both language pairs (de, es). In Figure 3, we show the quality-latency curves obtained by varying the value of λ between 2 and 8. (We also performed experiments with λ = 1 but found that it consistently degrades performance.) As we can see, as the value of λ increases, the curves move to the right, meaning that their latency increases. This also means that considering too many frames towards the end (λ ≥ 6) affects latency while quality remains almost the same, and this holds for both languages. To select the best value between 2 and 4, we compute the area under the curve, calculated through the Trapezoidal Rule (Atkinson, 1989). With λ = 2 we obtain a higher value than with λ = 4 (44.1 vs. 41.0 for German and 55.6 vs. 54.3 for Spanish). Hence, we select λ = 2 for both German and Spanish. This result is interesting since, although the two languages have different word orderings, the experiments on German and Spanish yielded the same value of λ. This suggests that the λ parameter depends more on the source language, which is always English in our case, than on the target one. Future work can be devoted to exploring this aspect and identifying whether this trend also holds for different source languages.
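The area under a quality-latency curve via the trapezoidal rule can be computed as in the sketch below. The function name is illustrative, and the text does not state the integration range or any normalization, so the values produced here are not expected to match the reported areas.

```python
def area_under_curve(latencies, bleus):
    """Trapezoidal-rule area under a BLEU-vs-latency curve."""
    points = sorted(zip(latencies, bleus))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Area of the trapezoid between two consecutive operating points.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy curve with invented operating points (latency in seconds, BLEU).
print(area_under_curve([1.0, 2.0, 3.0], [20.0, 24.0, 26.0]))  # 47.0
```

When comparing two curves this way, both should be integrated over the same latency interval; otherwise a curve that simply spans a wider range would obtain a larger area.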
What is the best layer for EDATT? Once the λ parameter has been determined, we proceed by analysing the EDATT performance by varying the decoder layer from which to extract the encoder-decoder attention. We conduct this study on the MuST-C en→{de, es} dev set by using λ = 2, as we previously found it to be the best value for both languages. We present in Figure 4 the simultaneous curves for each decoder layer. As we can see, Layers 1 and 2 perform consistently worse than the other layers, and this holds for both languages. Also, Layer 3 achieves inferior quality compared to Layers ≥ 4, especially at medium-high latency (AL ≥ 1.5s), even if it is better than Layers 1 and 2. This is in agreement with the outcomes of the alignment quality evaluation of Table 1, which indicates worse performance for the first 3 layers, suggesting that a correlation between alignment and translation quality also exists in the simultaneous scenario. Concerning Layer 6, both graphs show that the curves cannot achieve lower latency, with a starting point at around 1.5s of AL. This phenomenon also holds for Layer 5 compared to Layer 4, although it is much less pronounced. Since Layers 5 and 6 never reach low latency (AL never approaches 1s), we can conclude that the optimal choice for the simultaneous scenario is Layer 4, in accordance with (Lamarre et al., 2022), which indicates the middle layers as the best choice to provide accurate predictions for language representations. As a consequence, we will use Layer 4 for the following experiments.
Does a single head encode more useful information? Motivated by previous works on attention heads (Jo and Myaeng, 2020; Behnke and Heafield, 2020) proving the usefulness of selecting a single head or a set of heads to perform the ST task (Gong et al., 2021), we also study the behavior of the EDATT policy by varying the attention head from which to extract the encoder-decoder attention matrix. We present in Table 2 the results obtained from each attention head extracted from the 4th decoder layer, since it turned out to be the best layer for our simultaneous policy. First, we observe that many heads cannot reach low latency, especially in Spanish. In addition, there is no agreement on the best head among languages nor at different latencies (e.g. Head 6 is the best in Spanish at 1.6s while it does not reach lower latency). However, we notice that the average across the heads (last row) has an overall better performance compared to the encoder-decoder matrix extracted from each head, and this holds for both languages. As a result, we choose to compute the average over the attention heads for our following experiments in order to achieve a better quality-latency trade-off for SimulST.
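Averaging the encoder-decoder attention across heads can be sketched as follows, assuming the per-head matrices of one decoder layer are stored as `heads[h][j][i]` (head h, target token j, frame i). The function name is illustrative.

```python
def average_heads(heads):
    """Element-wise mean of per-head attention matrices of identical shape."""
    num_heads = len(heads)
    num_tokens = len(heads[0])
    return [
        [
            sum(head[j][i] for head in heads) / num_heads
            for i in range(len(heads[0][j]))
        ]
        for j in range(num_tokens)
    ]
```

Since every per-head row sums to 1, each averaged row also sums to 1, so the same threshold α ∈ [0, 1] remains meaningful after averaging.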

Main Results
For the comparison of EDATT with the SimulST systems described in Section 4.4, we report in Figure 5 both AL (solid curves) and AL_CA (dashed curves) as latency measures to give a more reliable evaluation of the performance of the systems in real time, as suggested in (Ma et al., 2020b). Results with other metrics are reported in Appendix C. For our policy, we extract the encoder-decoder attention matrix from Layer 4, average the weights across heads, and set λ = 2, since this is the best setting on the MuST-C dev set for both languages (de, es), as already discussed in Section 5.1.
Table 2: BLEU scores on the MuST-C dev set en→{de, es} for each attention head of Layer 4. Latency is reported in seconds. "-" means that the BLEU value is neither available nor calculable. The last row represents the numerical values of the Layer 4 curves of Figure 4, obtained by taking the average across heads. We choose to present the results in a tabular format instead of AL-BLEU curves since many parts of the curves are indistinguishable from each other. Since obtaining a specific latency in seconds is not possible with this method, we linearly interpolate the previous and the successive points to obtain the BLEU value, when possible.
Quality-latency curves for the two language pairs show similar trends. The EDATT policy achieves better overall results compared to the common prefix and wait-k policies applied to offline ST models. EDATT always outperforms the wait-k policy, with gains ranging from 1 to 2.5 BLEU for German and from 1 to 3 BLEU for Spanish, considering both ideal (AL) and computationally aware latency (AL_CA), and is also capable of reaching lower latency, as the starting point of wait-k is always around 1.5s against 1s for our policy. Compared to the common prefix, we observe an AL_CA reduction of up to 1.4s for German and 0.7s for Spanish. Moreover, the computational overhead of EDATT is consistently lower: 0.9s on average across languages, against 1.3s for the common prefix. Thus, the computational cost of our policy is reduced by more than 30% compared to the common prefix policy. In addition, EDATT outperforms the common prefix at almost every latency, with gains up to 2 BLEU for German and 3 for Spanish.
Regarding the performance of EDATT compared with the current SimulST state-of-the-art CAAT system, we observe that our policy achieves higher quality at medium-high latency (AL ≥ 1.2s) when ideal latency is considered (solid curves), with gains of up to 5 BLEU for German and 2 for Spanish, while it has a drop of 1.5 to 4 BLEU in German and 1 to 2.5 BLEU in Spanish when AL < 1.2s. However, with a more reliable measure of the simultaneous latency in real time such as AL_CA (dashed curves), we observe that the EDATT curves always stand to the left of those of the CAAT system, meaning that our policy always outperforms the current state of the art. BLEU gains reach 6 points for German and 2 for Spanish, and the latency obtained by applying our policy is always lower than that of the CAAT system.
Overall, we can conclude that EDATT achieves new state-of-the-art results when considering computational-aware metrics, while also being superior at medium-high latency if the ideal, yet less reliable, latency measure is considered. Moreover, additional experiments presented in Appendix D show that our policy is superior in computational efficiency also when highly accelerated GPUs are used, further encouraging the community to adopt computationally aware metrics for future and more reliable analyses.

Related Works
The first policy for SimulST was proposed by Ren et al. (2020) and is based on the wait-k policy (Ma et al., 2019), developed for simultaneous text-to-text translation (SimulMT). Subsequent works (Ma et al., 2020b; Han et al., 2020; Zeng et al., 2021; Karakanta et al., 2021; Nguyen et al., 2021; Zeng et al., 2022; Fukuda et al., 2022; Papi et al., 2022a) also exploit the wait-k policy, even adopting auxiliary ASR models to determine the number of words (Chen et al., 2021). In parallel, several strategies have been developed to directly learn the best policy during training by adopting ad-hoc architectures (Ma et al., 2021; Liu et al., 2021a,b; Chang and yi Lee, 2022) and training procedures aimed at reducing latency (Liu et al., 2021a,b; Zaidi et al., 2021, 2022; Chang and yi Lee, 2022; Zhang and Feng, 2022; Omachi et al., 2022). The latter are adaptive policies and obtain better performance according to the most recent results observed by Anastasopoulos et al. (2021) and Anastasopoulos et al. (2022). We define our policy as adaptive since it exploits attention weights learned by the offline ST model during training, even if the model has been neither trained nor optimized for the SimulST task.
Among the works adopting adaptive policies, attention has been exploited during SimulST training in (Zaidi et al., 2021, 2022; Chang and yi Lee, 2022). These works make use of the monotonic multi-head attention (Ma et al., 2020c) developed for SimulMT, which adapts the monotonic attention (Raffel et al., 2017) to multi-headed Transformer architectures, and integrates the infinite lookback (Arivazhagan et al., 2019) to improve translation quality. Also, Zhang and Feng (2022) exploit attention during training by imposing latency constraints on the weights of the encoder-decoder matrix. However, no work to date either analyses the attention in relation to the SimulST task or directly exploits its patterns to guide an offline ST model, as we do in this work.

Conclusions
In this paper, after investigating the encoder-decoder attention behavior of offline speech translation (ST) models, we presented EDATT, a decision policy for simultaneous ST (SimulST) that guides an offline ST model to wait or to emit a partial hypothesis by looking at its encoder-decoder attention weights. Comparisons with state-of-the-art SimulST architectures and decision policies reveal that the EDATT policy not only outperforms the others at almost every latency, with translation quality gains of up to 7 BLEU for German and 4 BLEU for Spanish, but is also capable of achieving a latency of less than 2s in real time, with a reduction of 0.7-1.4s compared to existing decision policies applied to the same offline ST system. We publicly release code, offline ST models, and outputs obtained at different latencies to help the usability and reproducibility of our work.

Limitations
Although applicable to any offline ST model, the EDATT policy and its behavior have been analysed on models applying CTC compression. Thus, the audio input undergoes a transformation that not only reduces its dimension but also compresses it into more meaningful units, similar to words or subwords. As a consequence, the hyper-parameter regarding the number of frames to which the policy is applied (λ) can vary and depends on the specific ST model. This requires having a validation set on which to search for the best value of λ before directly testing. Moreover, the EDATT policy has been tested on Western European languages and, even if there is no reason suggesting that it cannot be applied (after a proper hyper-parameter search) to other languages, its usage on non-Western European languages has not been verified in this work.

A Data Statistics
MuST-C training data (train set) has been filtered: samples containing audio longer than 30s are discarded to reduce GPU computational requirements. The total number of samples used during our trainings is shown in Table 3.

B Training Settings
The implementation of all our models is based on Fairseq-ST (Wang et al., 2020). We use 512 as embedding size and 2,048 hidden neurons in the feed-forward layers both in the encoder and in the decoder. We set dropout at 0.1 for feed-forward, attention, and convolution layers. Also, in the convolution layer, we set 31 as kernel size for the point- and depth-wise convolutions. The vocabularies are based on SentencePiece (Sennrich et al., 2016) with a dimension of 8,000 (Di Gangi et al., 2020) for the target side (de, es) and of 5,000 (Wang et al., 2020) for the source side (en). We optimize with Adam (Kingma and Ba, 2015) by using the label-smoothed cross-entropy loss with 0.1 as smoothing factor (Szegedy et al., 2016). We employ Connectionist Temporal Classification (CTC) (Graves et al., 2006) as auxiliary loss to avoid pre-training and also to compress the input audio, reducing RAM consumption and speeding up inference (Gaido et al., 2021). The learning rate is set to $5 \cdot 10^{-3}$ with Noam scheduler (Vaswani et al., 2017) and 20k warm-up steps. We stop the training after 15 epochs without loss decrease on the dev set and average 7 checkpoints around the best (best, three preceding, and three succeeding). Trainings are performed on 4 NVIDIA A40 GPUs with 48GB RAM. We set 40k as the maximum number of tokens per mini-batch, 2 as update frequency, and 100,000 as maximum updates (∼23 hours). The MT models used for knowledge distillation are trained on OPUS (Tiedemann, 2016) and have ∼212M parameters. We achieve 32.1 and 35.8 BLEU on, respectively, MuST-C tst-COMMON German and Spanish.
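As an illustration of the learning-rate schedule, the sketch below assumes the common inverse-square-root form of the Noam schedule, with a linear warm-up from 0 to the peak value followed by a decay proportional to the inverse square root of the step. The exact parameterization used in training (e.g. the initial warm-up value) is an assumption, and the function name is hypothetical.

```python
def inverse_sqrt_lr(step, peak_lr=5e-3, warmup_steps=20_000):
    """Learning rate at a given update step under warm-up + inverse-sqrt decay."""
    if step <= 0:
        return 0.0
    if step < warmup_steps:
        # Linear warm-up towards the peak learning rate.
        return peak_lr * step / warmup_steps
    # After warm-up, decay proportionally to 1 / sqrt(step).
    return peak_lr * (warmup_steps / step) ** 0.5
```

Under these assumptions the learning rate peaks at update 20k and has halved by update 80k, close to the 100,000-update budget mentioned above.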

C Main Results with Different Latency Metrics
Apart from AL, two other metrics can be adopted to measure latency in simultaneous translation. The first is Differentiable Average Lagging, or DAL (Cherry and Foster, 2019), a differentiable version of AL; the second is Length-Adaptive Average Lagging, or LAAL (Papi et al., 2022b), a modified version of AL that also accounts for the case in which the prediction is longer than the reference. Figures 6 and 7 show the results of the systems of Figure 5 using, respectively, DAL and LAAL, considering both computationally aware (CA) and unaware metrics for German and Spanish.
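The difference between AL and LAAL can be illustrated with a short sketch. It follows the usual formulation in which each word's delay is compared with that of an ideal system emitting words at a constant rate, and the average stops at the first word emitted once the whole source has been received. The function name and the toy numbers are hypothetical; delays and source length are assumed to be in milliseconds.

```python
def average_lagging(delays, source_length, reference_length,
                    length_adaptive=False):
    """AL, or LAAL when length_adaptive=True.

    delays: amount of source (ms) received when each output word was emitted.
    source_length: total source duration (ms).
    reference_length: number of words in the reference translation.
    """
    if not delays:
        raise ValueError("delays must not be empty")
    target_length = reference_length
    if length_adaptive:
        # LAAL: use the longer of prediction and reference.
        target_length = max(len(delays), reference_length)
    ideal_step = source_length / target_length  # ideal delay per word
    total, tau = 0.0, 0
    for i, delay in enumerate(delays):
        total += delay - i * ideal_step
        tau = i + 1
        if delay >= source_length:
            break  # stop at the first word emitted after the full source
    return total / tau


# Toy over-generation case: 6 predicted words against a 3-word reference.
delays = [500, 1000, 1500, 2000, 2500, 3000]
print(average_lagging(delays, 3000, 3))                        # -750.0
print(average_lagging(delays, 3000, 3, length_adaptive=True))  # 500.0
```

In this toy case the over-long prediction drives AL below zero, whereas LAAL stays positive because the ideal rate is computed on the longer of the two lengths.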
As we can see, the results of Figures 6 and 7 confirm the phenomena found in Section 5, indicating EDATT as the best system across languages and latency values. We also observe that DAL reports higher latency for all systems (it spans from 3s up to 7.5s for German and up to 5.5s for Spanish), with a counter-intuitive curve for the common-prefix method in its computationally aware version. However, we acknowledge that DAL is less suited than AL/LAAL to evaluate current SimulST systems: in its computation, DAL imposes a minimum delay for each emitted word, while all the systems considered in our analysis can emit more than one word at once and are consequently improperly penalized in the evaluation.
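The effect of the minimum delay in DAL can be seen in a small sketch. It follows the standard recurrence in which each word's delay is at least the previous adjusted delay plus a fixed gap equal to the source length divided by the target length. Using the hypothesis length as the target length is an assumption (evaluation toolkits may use a different choice), and the `enforce_min_gap` switch is a hypothetical addition that exists only to expose the penalty.

```python
def differentiable_average_lagging(delays, source_length,
                                   enforce_min_gap=True):
    """DAL over all emitted words.

    delays: amount of source (ms) received when each word was emitted.
    source_length: total source duration (ms).
    enforce_min_gap: illustration-only switch; False skips the
        minimum-delay adjustment to show the unpenalized average.
    """
    if not delays:
        raise ValueError("delays must not be empty")
    target_length = len(delays)  # assumption: hypothesis length
    min_gap = source_length / target_length
    total, previous = 0.0, None
    for i, delay in enumerate(delays):
        adjusted = delay
        if enforce_min_gap and previous is not None:
            # Each word is delayed at least min_gap after the previous one.
            adjusted = max(delay, previous + min_gap)
        total += adjusted - i * min_gap
        previous = adjusted
    return total / target_length


# Toy system emitting three words at once, twice.
chunked = [1000, 1000, 1000, 3000, 3000, 3000]
print(differentiable_average_lagging(chunked, 3000))                         # 1250.0
print(differentiable_average_lagging(chunked, 3000, enforce_min_gap=False))  # 750.0
```

In this toy example, words emitted together are treated as if they had been emitted 500 ms apart, which raises the measured latency from 750 ms to 1250 ms.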

D Main Results with Accelerated Hardware
To further explore the computational efficiency of the EDATT policy, we test all the systems of Section 5 using a highly accelerated GPU during simultaneous inference. We select an NVIDIA A40 with 48GB of GDDR6 memory, 696 GB/s of memory bandwidth, and 10,752 cores. All the technical information is taken from https://www.nvidia.com/. Figure 8 reports the results. Looking at the quality-latency curves and comparing them with the computationally aware curves of Figure 5 (dashed lines), the common prefix policy seems to benefit the most from the use of more expensive accelerated hardware, with a latency reduction of 0.5-1s, although this is not enough to let the policy reach less than 2s of latency. As for the other systems, both the wait-k and CAAT curves are shifted by less than 0.5s, in line with EDATT. We can conclude that the outcomes found and discussed in Section 5.2 also hold when more expensive, accelerated hardware is used, indicating our policy as the best performing one and further strengthening our findings. Moreover, these results indicate no substantial differences in the comparison of the systems between less and more accelerated GPU hardware, and advocate for more widespread use of computationally aware metrics.