Does Simultaneous Speech Translation need Simultaneous Models?

In simultaneous speech translation (SimulST), finding the best trade-off between high translation quality and low latency is a challenging task. To meet the latency constraints posed by the different application scenarios, multiple dedicated SimulST models are usually trained and maintained, generating high computational costs. In this paper, motivated by the increased social and environmental impact caused by these costs, we investigate whether a single model trained offline can serve not only the offline but also the simultaneous task without the need for any additional training or adaptation. Experiments on en→{de, es} indicate that, aside from facilitating the adoption of well-established offline techniques and architectures without affecting latency, the offline solution achieves similar or better translation quality compared to the same model trained in simultaneous settings, as well as being competitive with the SimulST state of the art.


Introduction
Many application contexts, such as conferences and lectures, require automatic speech translation (ST) to be performed in real time. To meet this requirement, Simultaneous ST (SimulST) systems strive not only for high output quality but also for low latency, i.e. the elapsed time between the speaker's utterance of a word and the generation of its translation in the target language. Balancing quality and latency is extremely complex, as the two objectives conflict: in general, the longer a system waits (which implies higher latency), the better it translates, thanks to the larger context it can rely on.
SimulST models manage the quality-latency trade-off by means of a decision policy: the rule that determines whether the system has to wait for more input or to emit one or more target words. The most popular decision policy is the wait-k, a straightforward heuristic that prescribes waiting for a predefined number of source words before starting to generate the translation. Initially proposed by Ma et al. (2020b) for simultaneous machine translation (SimulMT), the wait-k is now widely adopted in SimulST (Ma et al., 2020b; Ren et al., 2020; Han et al., 2020; Chen et al., 2021; Zeng et al., 2021; Ma et al., 2021) thanks to its simplicity. Apart from the wait-k, other attempts have been made to develop decision policies learned by the SimulST system itself (Ma et al., 2019; Zaidi et al., 2021; Liu et al., 2021a,b), all resulting in computationally expensive models with limited adoption.
Regardless of the decision policy, SimulST systems are usually trained by simulating the conditions faced at inference time, that is, with only a partial input available (Ren et al., 2020; Ma et al., 2020b; Han et al., 2020; Zeng et al., 2021; Ma et al., 2021; Zaidi et al., 2021; Liu et al., 2021a). Since the size of the partial input, and consequently of the context that the SimulST system can exploit to translate, varies according to the latency requirements imposed by real-world applications, several models must be trained and maintained to accommodate different quality-latency trade-offs. This results in high computational costs that contrast with the rising awareness of the need to reduce energy consumption (Strubell et al., 2019) towards more sustainable AI (Vinuesa et al., 2020; Schwartz et al., 2020).
So far, the benefits of training systems on partial inputs have been taken for granted and, although works employing models trained in offline mode are documented in the literature (Nguyen et al., 2021; Ma et al., 2021), the indispensability of simultaneous training in SimulST has never been demonstrated. With an eye to the burden and environmental impact of training multiple dedicated models for different tasks (offline, simultaneous) and latency regimes, in this work we address the following question: Does simultaneous speech translation actually need models trained in simultaneous mode? To this end, we experiment with a single, easy-to-maintain offline model, which can effectively serve both the simultaneous and offline tasks. Specifically, we explore the application of the widely adopted wait-k policy to the offline-trained ST system only at inference time, without any additional training either to adapt the model to the simultaneous scenario or to accommodate different latency requirements. Through experiments on two language directions (en→{de, es}), having respectively different and similar word ordering with respect to the source, we show that:
• In terms of sustainability, offline training yields considerable reductions, by a factor of 9 in our evaluation setting, in carbon emissions and electricity consumption (Section 4);
• The offline-trained model outperforms or is on par with the same model trained in simultaneous mode within the wait-k policy framework (Section 5);
• Recent advancements in offline architectures and training strategies further improve output quality without affecting latency (Section 6);
• The effectiveness of offline training also emerges in comparison with the state of the art in SimulST (Liu et al., 2021b): except for the lowest latency regime, our system is superior in the 2s-4s latency interval (ear-voice span), with gains of up to 4.0 BLEU (Section 7).

wait-k
The wait-k policy requires waiting for a predefined number of words before starting to translate. For instance, a system using a wait-3 policy generates the 1st target word when it receives the 4th source word, the 2nd target word when it receives the 5th source word, and so on. The number of words to wait for is controlled by the parameter k. SimulST systems based on the wait-k policy are usually trained considering the same k used for testing (Ren et al., 2020; Ma et al., 2020b; Zeng et al., 2021) while, in theory, its value can differ between the training and testing phases. A parameter k_train can indeed be used to mask words at training time, while a parameter k_test can be used to directly control the latency of the system at inference time, according to the requirements posed by the target application scenario. Since many values of k_train can be used to train SimulST systems, even for identical values of k_test, the standard approach involves performing several training runs to obtain the best translation quality while satisfying different latency requirements. In SimulMT, Elbayad et al. (2020) tried to avoid this large number of experiments by exposing the model to different values of k_train sampled at each iteration. Surprisingly, they achieved the best performance on several k_test values using a single value of k for training (k_train = 7). However, it is not clear whether such a rule applies to SimulST, leaving the problem of performing a large number of trainings still unsolved.
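As a concrete illustration, the read/write decision prescribed by the policy can be sketched in a few lines of Python (a minimal sketch for a text input stream; names are ours and do not come from any specific library):

```python
def wait_k_action(k, n_source_words, n_target_words, source_finished):
    # wait-k: the i-th target word may be emitted only once k + i - 1
    # source words have been received (here i = n_target_words + 1)
    if source_finished or n_source_words >= k + n_target_words:
        return "WRITE"  # emit one target word
    return "READ"       # wait for more source input
```

With k = 3, the first target word is written as soon as 3 source words are available, i.e., while the 4th is being uttered, matching the example above.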

Word detection for wait-k in SimulST
Since SimulMT operates on a stream of words, applying the wait-k policy is straightforward: the number of received words is explicit in the input. Conversely, its application to SimulST is complicated by the fact that the input is an audio stream, so the number of received words has to be inferred by means of a so-called word detection strategy.
Two main categories of word detection strategies are currently employed by the community: fixed (Ma et al., 2020b) and adaptive (Ma et al., 2020b; Ren et al., 2020; Zeng et al., 2021; Chen et al., 2021). The fixed strategy is the easiest approach, as it assumes that a fixed amount of time is required to pronounce every word, disregarding the information actually contained in the audio. In contrast, adaptive word detection determines the number of uttered words by looking at the content of the audio. This can be done, every time a speech chunk is received by the system, either by means of an Automatic Speech Recognition (ASR) decoder (Chen et al., 2021) or by means of a Connectionist Temporal Classification (CTC; Graves et al., 2006) module (Ren et al., 2020; Zeng et al., 2021).
For all its simplicity, the fixed strategy does not account for several properties of the input speech, such as varying speech rates, word durations, pauses, and silences. For instance, if no words are present in the speech (e.g., during pauses or silences), the fixed strategy forces the system to output something even though it cannot rely on sufficient context. In the opposite case, in which more than one word is pronounced within a speech chunk, the fixed strategy forces the emission of only one word, consequently accumulating delay. By trying to estimate the actual number of words contained in a speech chunk, the adaptive strategy is in principle more faithful to these audio phenomena. However, conflicting results are reported in the literature, with some works supporting the adaptive strategy (Zeng et al., 2021) and others showing no advantage from its application (Ma et al., 2020b).
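To make the difference concrete, the two strategies can be sketched as follows (an illustrative sketch: the 280ms average word duration comes from Appendix A.1, while the ends_word helper is a hypothetical predicate over the CTC vocabulary):

```python
def detect_words_fixed(audio_ms, avg_word_ms=280):
    # fixed: assume every source word takes the same time to utter,
    # regardless of speech rate, pauses, or silences
    return audio_ms // avg_word_ms

def detect_words_adaptive(ctc_frame_labels, blank_id, ends_word):
    # adaptive: count words from the greedy CTC labels predicted over the
    # encoder states of the received audio (one label per frame)
    n_words, prev = 0, blank_id
    for label in ctc_frame_labels:
        # standard CTC collapsing: skip blanks and repeated labels, then
        # count only the labels that close a word
        if label != blank_id and label != prev and ends_word(label):
            n_words += 1
        prev = label
    return n_words
```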

Do we need Simultaneous training?
While at training time the SimulST system has the entire audio available, at inference time it receives a partial, incremental input. This mismatch between offline training and simultaneous testing makes the system vulnerable to exposure bias (Ranzato et al., 2016). To mitigate this potential problem, SimulST models are trained under simulated simultaneous conditions. In an attention-based model, this simultaneous training is realized by masking future audio frames when computing the encoder-decoder attention. For a wait-k SimulST system, the choice of the audio frames to be masked depends on two factors: the value of k_train and the word detection strategy. The k_train value determines the number of source words to mask (e.g., in the case of wait-3, the first target word is generated by looking at the first three source words, and so on). The word detection strategy identifies the source words from the audio by detecting the number of frames each one corresponds to. Thus, the encoder-decoder attention is computed by limiting each target word to only attend to the audio frames that correspond to the previous k_train source words identified by the word detection strategy. As a result, testing different word detection strategies requires training several systems, which in turn are trained with different values of k_train to obtain different latencies.
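The masking described above can be sketched as follows (a simplified sketch, not the actual Fairseq implementation, assuming the word detection strategy has already mapped each source word to its last encoder frame):

```python
import torch

def wait_k_training_mask(n_target, word_end_frame, k_train):
    # word_end_frame[j] = index of the last encoder frame of the j-th
    # source word, as identified by the word detection strategy
    n_frames = word_end_frame[-1] + 1
    mask = torch.full((n_target, n_frames), float("-inf"))
    for i in range(n_target):
        # the (i+1)-th target word may attend only to the frames of the
        # first k_train + i source words
        last_visible = min(k_train + i, len(word_end_frame)) - 1
        mask[i, : word_end_frame[last_visible] + 1] = 0.0
    return mask  # added to the encoder-decoder attention logits
```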
In this paper, we question the need for all these experiments by investigating whether the simultaneous training of ST systems is indispensable to obtain a good quality-latency trade-off. Within the framework of the wait-k policy, we explore the real-time translation ability of an offline-trained system that is neither trained for nor adapted to the simultaneous scenario. To obtain a simultaneous prediction from the offline system, we add a pre-decision module after the encoder at inference time. Its role is to incorporate the logic of the word detection strategy and decide whether to wait or to emit words when a new speech chunk is received, according to the selected k_test. In particular, it takes as input the encoder states representing the received audio and applies the word detection strategy (either fixed or adaptive) to obtain the number of source words present in the input. If this number equals or exceeds k_test, the module activates the decoding part of the model and a word is emitted; otherwise, the system keeps reading the source speech.
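A minimal sketch of this inference loop is the following (encode and decode_next_token are hypothetical placeholders for the actual model interfaces):

```python
def pre_decision_step(model, received_audio, k_test, detect_words, hypothesis):
    # pre-decision module: applied after the encoder, at inference time only
    enc_states = model.encode(received_audio)  # re-encode all audio received so far
    n_words = detect_words(enc_states)         # fixed or adaptive word detection
    if n_words >= k_test + len(hypothesis):    # enough source context under wait-k
        return model.decode_next_token(enc_states, hypothesis)  # WRITE one token
    return None                                # READ: wait for the next speech chunk
```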
Since the offline system is not trained for the simultaneous task, the choice of k_test and of the word detection strategy is not constrained to those used during training, as in the native SimulST case. Indeed, an offline model is trained by always attending to the entire source input. Differently from the simultaneous training mode, the encoder-decoder attention is computed without masking, that is, by considering past, current, and future information. Although this avoids multiple trainings for each k_train and word detection strategy, it also forces the model to operate in conditions different from its training setup, as it is never exposed to partial inputs.
To check whether the exposure bias caused by this mismatch between training and testing conditions constitutes a real limitation, we conduct a systematic analysis of the quality-latency performance of the offline-trained system in the simultaneous scenario. To this aim, we compare the offline-trained system with the same model trained in simultaneous mode, varying the value of k_train and the word detection strategy. Latency is measured with the Length Adaptive Average Lagging (LAAL), a version of Average Lagging for speech (Ma et al., 2020b) that correctly evaluates both shorter and longer predictions with respect to the reference. We report the simultaneous results in LAAL-BLEU graphs, where each curve corresponds to a system trained using a different value of k_train and each point to a different k_test. The set of k values used both for training the simultaneous models and for testing all the models is k = {3, 5, 7, 9, 11}. We also report the results of offline generation using greedy search and beam search with the beam_size = 5 commonly used in offline ST.
Carbon Footprint. Each training contributed an estimated 70.3 kg of CO2eq to the atmosphere and used 184.7 kWh of electricity. This estimate assumes 116 hours of runtime, a carbon intensity of 380.539 gCO2eq per kWh, 4 NVIDIA Tesla K80 GPUs (utilization 93%), and an Intel Xeon CPU E5-2683 v4 (utilization 100%). This means that training a single offline model, instead of a model for each value of k_train (in our case, 5 models) and for each word detection strategy, reduces carbon emissions and electricity consumption by a factor of 9 in our evaluation setting.

Results
Fixed Word Detection. The results of the wait-k models with fixed word detection are shown in Figure 1. The LAAL-BLEU curves indicate that the latency of all the systems lies between 1700ms and 3000ms, staying in a medium-high latency regime for both language pairs. Translation quality is lower for en-de, for which it ranges from 11 to 19 BLEU, while for en-es it ranges from 14 to 25 BLEU. The difference in performance between the two language pairs is coherent with the results of the offline generations (both greedy and beam-5) and justified by the different levels of difficulty when translating into the two target languages (having respectively different and similar word ordering with respect to English). The curves of the simultaneous-trained systems also show a tendency: as k_train increases, both quality and latency improve (e.g., on en-de, the k=11 curve lies higher, indicating better quality, and more leftward, indicating lower latency, than the others). Interestingly, the offline-trained models (in solid black) outperform the systems trained in simultaneous mode at every latency regime, with gains from 1 to 7 BLEU for en-de and from 1 to 6 BLEU for en-es. This indicates that, to achieve the best performance and independently of the k_test used, the offline-trained model represents the best choice, at least for the fixed strategy.
Adaptive Word Detection. The results of the wait-k models with adaptive word detection are shown in Figure 2. The systems' latency lies between 1700ms and 3500ms and, as with the fixed strategy, quality is higher for en-es (from 15 to 26 BLEU) than for en-de (from 14 to 20 BLEU). Looking at Figures 1 and 2, we observe that the overall translation quality yielded by the adaptive strategy is higher than that of the fixed one. Moreover, the fixed-strategy curves are far from their offline greedy values (dashed lines), while the adaptive-strategy curves almost reach them at higher latency. However, the models with fixed word detection perform better at lower latency, with gains of 1 BLEU for en-de and 2 BLEU for en-es. In light of these results, there is no clear winner between the two word detection strategies. From Figure 2, we also notice that the adaptive curves are very close to each other, in contrast with the fixed case. This phenomenon indicates that, with the adaptive strategy, changing k_train does not significantly influence model performance. This suggests that the offline-trained model (comparable to a model trained with k_train = ∞) should be on par with the simultaneous-trained ones, a consideration corroborated by the offline-trained system curves (in solid black), which are always above or on par with those of the simultaneous-trained systems.
All in all, we can conclude that, when using the wait-k policy, the offline-trained model achieves similar or even better results compared to the same models trained in simultaneous mode. Based on this finding, in the next section we explore the actual potential of offline training for SimulST by adopting the most promising offline architectures and training techniques to improve the quality-latency balancing of our systems.

Leveraging Offline Solutions
Offline training brings considerable advantages in terms of reducing the computational costs of SimulST technology. First, a single model can be trained and maintained to serve both the offline and simultaneous tasks without performance degradation. Second, contrary to the simultaneous training mode, the choice of the word detection strategy at run-time does not depend on the strategy used during training. Rather, it can be made according to the specific use case, making the offline-trained model more flexible. This also means that other decision policies can be applied to the offline-trained system without the need to re-train it from scratch.
Using a single offline-trained model not only speeds up development but also opens up the possibility of directly adopting powerful offline architectures and techniques without performing any additional training or adaptation to the simultaneous scenario. In the following, we test this hypothesis to find out whether recent architectural improvements (Section 6.1) and data augmentation techniques (Section 6.2) designed for offline ST also have a positive impact in SimulST.
In recent years, many architectures have been proposed to address the offline ST task (Wang et al., 2020; Inaguma et al., 2020; Le et al., 2020; Papi et al., 2021). Among them, the Conformer (Gulati et al., 2020) has recently shown impressive results both in speech recognition, for which it was initially proposed, and in speech translation (Inaguma et al., 2021). The main aspects characterizing this encoder-decoder architecture concern the encoder. Inspired by the Macaron-Net (Lu et al., 2019), the Conformer encoder is built with a sandwich structure (two half-step feed-forward layers enclosing the self-attention and convolution modules) and integrates the relative sinusoidal positional encoding scheme (Dai et al., 2019).
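For reference, the block structure just described can be sketched in PyTorch as follows (a minimal sketch: relative positional encodings and the batch normalization of the original convolution module are omitted, and this is not the exact implementation we trained):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlockSketch(nn.Module):
    # Macaron-style "sandwich": half-step FFN -> self-attention -> convolution -> half-step FFN
    def __init__(self, d=256, heads=4, kernel=31):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                  nn.SiLU(), nn.Linear(4 * d, d))
        self.norm_att = nn.LayerNorm(d)
        self.att = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d)
        self.pointwise1 = nn.Conv1d(d, 2 * d, kernel_size=1)
        self.depthwise = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.pointwise2 = nn.Conv1d(d, d, kernel_size=1)
        self.ffn2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                  nn.SiLU(), nn.Linear(4 * d, d))
        self.final_norm = nn.LayerNorm(d)

    def forward(self, x):                      # x: (batch, time, d)
        x = x + 0.5 * self.ffn1(x)             # first half of the FFN sandwich
        a = self.norm_att(x)
        x = x + self.att(a, a, a, need_weights=False)[0]
        c = self.norm_conv(x).transpose(1, 2)  # (batch, d, time) for Conv1d
        c = self.pointwise2(F.silu(self.depthwise(F.glu(self.pointwise1(c), dim=1))))
        x = x + c.transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)             # second half of the sandwich
        return self.final_norm(x)
```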
Given the promising results it achieved in the offline scenario, we choose to test whether this architecture also brings quality and latency gains in SimulST. Since we found in Section 5 that the fixed and adaptive word detection strategies have their own use cases (their best results are observed at different latency regimes, respectively low for fixed and medium-high for adaptive), we compare Conformer- and Transformer-based architectures using both strategies. For the offline training of the Conformer, we follow the same procedure used for the Transformer. Details about the model hyper-parameters are presented in Appendix A.2.

Scaling Architecture
The offline results of both architectures are presented in Table 1, while their simultaneous curves are shown in Figure 3. As previously noticed by Inaguma et al. (2021), the Conformer outperforms the Transformer in offline generation. The improvements, of at least 2.4 BLEU points, hold for both greedy and beam search. From Figure 3, we can see that the Conformer outperforms the Transformer also in the simultaneous setting. This holds both for fixed and adaptive word detection, with larger BLEU gains at higher latency regimes. As far as word detection strategies are concerned, we also notice a similar trend between Conformer and Transformer: the fixed one performs better or on par at lower latency, while being outperformed by the adaptive one as latency increases. In light of the better results obtained by the Conformer, we conclude that improving the architecture of the offline system also has a positive impact on its simultaneous performance, enhancing translation quality without affecting latency.

Scaling Data
Data augmentation is a common practice used to improve system performance. One approach to data augmentation is knowledge distillation (KD), which was introduced to transfer knowledge from large to small models (Hinton et al., 2015). Among the possible methods, sequence-level KD (Kim and Rush, 2016) is one of the most popular in ST thanks to its simplicity of application and the consistent improvements observed (Potapczyk and Przybysz, 2020; Xu et al., 2021; Gaido et al., 2022a). Sequence-level KD consists of replacing the target references of a given parallel training corpus with the sequences predicted by a teacher model (usually, an MT model), from which we want to distill the knowledge into a student model.
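In practice, building the distilled corpus amounts to a single pass of the teacher over the source side of the training data (a sketch; teacher_mt.translate is a hypothetical interface standing in for the actual MT decoding):

```python
def build_kd_corpus(teacher_mt, st_corpus):
    # sequence-level KD: replace each gold translation with the teacher's output
    kd_corpus = []
    for audio, transcript, _gold_translation in st_corpus:
        synthetic = teacher_mt.translate(transcript)  # e.g., with beam search
        kd_corpus.append((audio, transcript, synthetic))
    return kd_corpus
```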
To investigate the effects of such knowledge transfer on quality and latency, we apply sequence-level KD to our offline-trained SimulST system. To this end, we translate the transcripts present in the en→{de, es} sections of MuST-C with an MT model (more details are provided in Appendix A.3) and substitute the gold translations with the MT-generated ones to build new data. As in Liu et al. (2021b), we train the models on the concatenation of gold and synthetic data. Since the performance of the Conformer model scales with data (Gaido et al., 2022b) and is better than that of the Transformer (Section 6.1), we adopt the Conformer for the following study. We extend our analysis to the simultaneous-trained systems to verify whether the offline-trained one continues to perform at least on par with them, and we report the best simultaneous-trained system curve for each word detection strategy.
The effects of the additional KD data are shown in Figure 4. Compared to Figure 3, we notice a performance improvement that comes without sacrificing latency. On en-de, the quality of the offline-trained Conformer with KD ranges from 18 to 25 BLEU, against the previous 15 to 22 BLEU. On en-es, it ranges from 19 to 30 BLEU, against the previous 18 to 29 BLEU. Moreover, the offline-trained system (solid curves) is still better than or at least comparable to the simultaneous-trained ones (dotted curves) for both language pairs. From Figure 4, we also notice that adaptive word detection (blue curves) shows overall better results than the fixed one (pink curves), even at lower latency. This suggests that, when comparing the two strategies on models with higher translation quality, adaptive word detection is superior at any latency regime.
In light of these results, we conclude that data augmentation improves the quality of the offline-trained system without affecting latency. To better assess these performance gains in the simultaneous framework, in the next section we present a detailed comparison of our offline-trained Conformer with the state-of-the-art SimulST architecture.

Comparison with the state of the art
So far, we have found that scaling to better-performing architectures and more data further improves the simultaneous results of offline-trained models. But how good is their performance compared to the state of the art in SimulST? To answer this question, we compare our best system, the offline-trained Conformer with adaptive word detection, with the Cross Attention Augmented Transducer (CAAT) by Liu et al. (2021b), used by the winning submissions at IWSLT 2021 (Anastasopoulos et al., 2021) and 2022 (Anastasopoulos et al., 2022). Inspired by the Recurrent Neural Network Transducer by Graves (2012), CAAT is made of three Transformer stacks: the encoder, the predictor, and the joiner. These three elements are jointly trained in simultaneous mode to optimize translation quality while keeping latency under control.
For training and testing the CAAT architecture, we use the code published by the authors and adopt the same hyper-parameters as in their paper. As the performance of the CAAT model is sensitive to sequence-level KD (Liu et al. 2021b show a 2 BLEU degradation without it), we compare it with the offline-trained Conformer model using the same data settings (see Section 6.2). We report the CAAT results obtained by adopting both the greedy search used in our SimulST settings and the beam search used by Liu et al. (2021b). As suggested by Ma et al. (2020b), we also compute the computationally aware (CA) version of the LAAL metric (LAAL_CA), which is based on the time elapsed from the beginning of the generation process to the prediction of the partial target. Since LAAL_CA represents the real wall-clock elapsed time experienced by the user, it gives a more reliable evaluation of SimulST performance in a real-time scenario. For the sake of completeness, we also report the results of Average Lagging (Ma et al., 2020a) in Appendix C.
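For reference, the two variants can be written as follows (our own formulation, reconstructed from the definitions above; the notation may differ slightly from the original papers):

\[
\mathrm{LAAL} = \frac{1}{\tau}\sum_{i=1}^{\tau}\left[d_i - (i-1)\,\frac{|X|}{\max\left(|Y|,\,|Y^{*}|\right)}\right]
\]

where d_i is the amount of source speech (in ms) read before emitting the i-th target token, |X| is the total source duration, |Y| and |Y*| are the hypothesis and reference lengths, and τ is the index of the first token emitted after the entire source has been read. LAAL_CA is obtained by replacing each d_i with the wall-clock time elapsed from the beginning of the generation process to the emission of the i-th token, thus accounting for computation.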
We present the comparison in Figure 5. From the LAAL-BLEU curves, we see that, at the low latency regime, the CAAT model (in solid red) outperforms our offline-trained Conformer model (in solid blue) by 2 BLEU on en-de and 4 BLEU on en-es. However, moving to the medium-high latency regime, the Conformer significantly outperforms CAAT, reaching gains of 4 BLEU on en-de and 2 BLEU on en-es. We can also notice a degradation of CAAT's en-de translation quality, caused by an under-generation problem at higher latency, which we detail in Appendix B.
When it comes to LAAL_CA-BLEU, the scenario changes, bringing the CAAT curves much closer to those of the Conformer. The state of the art still outperforms the Conformer at lower latency but, in this case, by waiting about 100-200ms more, the Conformer performance starts to improve consistently.
Comparing the LAAL- and LAAL_CA-BLEU curves, we see that our offline-trained system is more coherent between computationally and non-computationally aware metrics: while the Conformer has a computational overhead of 400-500ms, CAAT requires 1400-1500ms more than its ideal LAAL. The CAAT greedy curves (dotted red) show only a slight latency improvement compared to beam search (solid red), suggesting that its higher computational cost does not depend on the generation strategy but on other factors, such as its more complex and computationally expensive architecture.
All in all, we can say that, compared to the state of the art in SimulST, the lower performance of our offline-trained Conformer at low latency regime is balanced by consistently higher BLEU scores at medium and high latency.

Conclusions
To reduce the potentially large number of experiments usually performed to build SimulST models, we explored the use of a single offline-trained model to serve both the offline and simultaneous tasks. Through comparison with native SimulST systems, we showed that our offline-trained model can be successfully used in real time, achieving comparable or even better results. To further enhance its performance, we investigated the adoption of consolidated techniques and emerging architectures from offline research, showing consistent improvements also in the simultaneous scenario. The benefits of offline training indicate the potential of applying this method without the need for any additional training or adaptation. Besides facilitating system deployment, another important advantage of building and reusing one single model to rule both tasks is the drastic reduction of the carbon footprint of ST training (by a factor of 9 in our evaluation setting). This represents an important step in response to rising concerns about AI energy consumption and environmental impact, toward more sustainable development.
As regards SimulST evaluation, the differences between results computed with non-computationally and computationally aware latency metrics suggest that including computational time in the measurements heavily influences the outcomes of system comparisons. In our particular case, the differences in latency between the offline-trained models and the state of the art observed in terms of the non-computationally aware LAAL metric become smaller when considering its computationally aware version. Although lower latency is theoretically reached by the state-of-the-art CAAT model, this comes at the cost of a more complex and computationally expensive architecture that shows its limitations at inference time. We therefore invite the SimulST community to use computationally aware metrics for more sound evaluations, referring to ideal metrics only in the absence of similar testing assets, such as machines with comparable computational power.

Limitations
Although it relies on a simpler architecture and generation strategy than the state of the art in SimulST, our offline-trained model exhibits high translation quality in real time, which allows it to achieve better results at medium and high latency regimes. However, a performance gap of 2-3 BLEU points is still observed at the low latency regime. This can be attributed to the use of a simple policy such as the wait-k. As it is the most popular and most widely adopted policy, we chose to focus on it for our analysis. Nevertheless, investigating better-performing solutions to boost performance at low latency and close the gap is still necessary, and is definitely among our future work priorities.
Also, the experiments presented in the paper are limited to two target languages, which were selected as representatives of those having similar and different word ordering with respect to the English source speech.Although this choice allowed us to reliably test our hypotheses in diverse conditions, verifying our findings on a wider set of languages is another natural evolution of this research.
A Models Architecture

A.1 Transformer
The models used in Section 5 are based on the Transformer (Vaswani et al., 2017) architecture with 12 encoder layers and 6 decoder layers. The embedding dimension is set to 256, the number of attention heads to 4, and the feed-forward embedding dimension to 2048, both in the encoder and in the decoder. The number of parameters is ~32.4M. We use the Fairseq (Ott et al., 2019) library for all the trainings. The wait-k with fixed word detection strategy was already available in the Fairseq library, while we implemented the adaptive one. We use the hyper-parameters of Ma et al. (2020b) for all the trainings of the Transformer-based model. We use a unigram SentencePiece model (Kudo and Richardson, 2018) for the target-language vocabulary of size 8,000 (Di Gangi et al., 2020). For the source-language vocabulary of size 5,000 we use a BPE SubwordNMT model (Sennrich et al., 2016) with the Moses tokenizer (Koehn et al., 2007). The reason for using SubwordNMT instead of SentencePiece lies in the strategy used to determine the end of a word, which is crucial for simultaneous inference. While SentencePiece prepends the character "_" to the first token of a new word, SubwordNMT appends "@@" to any token that does not end a word. Thus, SentencePiece units require generating the first token of the next word to determine whether the current word is over, while SubwordNMT units do not. For instance, the sentence "this is a phrase" is encoded into SentencePiece units as "_th is _is _a _ph rase". As such, to determine whether "_th is" is a complete word, we have to wait for the next token with the "_" character at the beginning, that is "_is". Instead, with SubwordNMT we have "th@@ is is a ph@@ rase", and we do not need to receive "is" to determine that "th@@ is" is finished.
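This look-ahead difference can be made explicit with two small predicates (an illustrative sketch; we follow the paper's "_" notation, although actual SentencePiece output uses the "▁" character):

```python
def sp_word_complete(generated_tokens):
    # SentencePiece: "_" marks the START of a word, so the previous word is
    # known to be complete only once a token starting with "_" arrives
    return len(generated_tokens) >= 2 and generated_tokens[-1].startswith("_")

def swnmt_word_complete(generated_tokens):
    # SubwordNMT: "@@" marks an UNFINISHED word, so any token without that
    # suffix closes the current word immediately, with no look-ahead needed
    return len(generated_tokens) >= 1 and not generated_tokens[-1].endswith("@@")
```

For the example above, sp_word_complete(["_th", "is"]) is False until "_is" is generated, while swnmt_word_complete(["th@@", "is"]) is already True.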
We select the best checkpoint based on the validation loss and stop training early if the loss does not improve for 10 epochs, training for at most 100 epochs. At the end of training, we average the 7 checkpoints around the best one.
For inference, we use the SimulEval tool (Ma et al., 2020a), as in Ma et al. (2020b), with the additional force_finish flag, which forces the model to generate text until the source speech has been completely ingested, i.e. to ignore the end-of-sentence token if it is predicted before the end of an utterance. In the case of wait-k with adaptive word detection, we also force the model to predict the next most probable token when the end of sentence is predicted (a mechanism we call avoid_eos_while_reading), while for the fixed strategy we found that this degrades performance. Word detection is performed every average word duration, that is every 280ms, as estimated by Ma et al. (2020b) on the MuST-C dataset.
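The EOS handling implied by these two flags can be sketched as follows (our own simplification, not the actual SimulEval code):

```python
def next_token(log_probs, eos_id, source_finished, avoid_eos_while_reading):
    # rank candidate tokens by decreasing log-probability
    ranked = sorted(range(len(log_probs)), key=lambda t: -log_probs[t])
    if ranked[0] == eos_id and not source_finished:
        if avoid_eos_while_reading:
            return ranked[1]  # emit the next most probable token instead of EOS
        return None           # force_finish: suppress EOS and keep reading speech
    return ranked[0]
```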

A.2 Conformer
For the Conformer model, we build an architecture similar to that of Inaguma et al. (2021), with 12 Conformer encoder layers and 6 Transformer decoder layers. The number of parameters is ~35.7M. We use the same embedding dimension as in our Transformer-based architecture, with 4 encoder attention heads and 8 decoder attention heads. For the Conformer feed-forward, attention, and convolution layers, we use a dropout of 0.1. We use a kernel size of 31 for the point- and depth-wise convolutions of the convolution layer. The vocabularies are the same as for the Transformer-based model, as is the checkpoint selection. At inference time, the force_finish flag is used together with avoid_eos_while_reading for both word detection strategies.

A.3 Machine Translation
The MT model used to generate the targets for KD was trained on the OPUS datasets (Tiedemann, 2016). It is a plain Transformer with 16 attention heads and 1024 features in the encoder/decoder embeddings, resulting in 212M parameters. The English→German model scores 32.1 BLEU and the English→Spanish model scores 35.8 BLEU on MuST-C tst-COMMON.

B Under-generation Statistics
In Section 7, while discussing the en-de curves of Figure 5, we highlighted a performance degradation of CAAT at higher latency regimes. Indeed, during our experiments we observed that CAAT tends to generate shorter sentences as the value of k increases. This behaviour becomes apparent in Table 2, where we report the word-length difference between the generated hypotheses and the corresponding references. For en-de, CAAT exhibits a strong tendency to under-generate (indicated by negative values) at high latency, which is presumably the reason for the BLEU drop we observed.

C Average Lagging
Figure 6: AL/AL_CA-BLEU curves of our offline-trained Conformer and CAAT models.

Figure 1: LAAL-BLEU curves of wait-k with the fixed word detection strategy.

Figure 2: LAAL-BLEU curves of wait-k with the adaptive word detection strategy.

Figure 5: LAAL/LAAL_CA-BLEU curves of our offline-trained Conformer and the state-of-the-art (CAAT) models.

Table 1: BLEU results of the offline generation.

Table 2: Average word-length difference w.r.t. the reference. Positive values indicate exceeding words, negative values indicate missing words.