Adapting Offline Speech Translation Models for Streaming with Future-Aware Distillation and Inference

A popular approach to streaming speech translation is to employ a single offline model with a wait-k policy to support different latency requirements, which is simpler than training multiple online models with different latency constraints. However, there is a mismatch problem in using a model trained with complete utterances for streaming inference with partial input. We demonstrate that speech representations extracted at the end of a streaming input are significantly different from those extracted from a complete utterance. To address this issue, we propose a new approach called Future-Aware Streaming Translation (FAST) that adapts an offline ST model for streaming input. FAST includes a Future-Aware Inference (FAI) strategy that incorporates future context through a trainable masked embedding, and a Future-Aware Distillation (FAD) framework that transfers future context from an approximation of full speech to streaming input. Our experiments on the MuST-C EnDe, EnEs, and EnFr benchmarks show that FAST achieves better trade-offs between translation quality and latency than strong baselines. Extensive analyses suggest that our methods effectively alleviate the aforementioned mismatch problem between offline training and online inference.


Introduction
Streaming speech translation (ST) systems generate real-time translations by incrementally processing audio frames, unlike their offline counterparts that have access to complete utterances before translating. Typically, streaming ST models use unidirectional encoders (Zhang et al., 2019; Ren et al., 2020; Ma et al., 2020b; Zeng et al., 2021; Zhang et al., 2023) and are trained with a read/write policy that determines whether to wait for more speech frames or emit target tokens. However, it can be expensive to maintain multiple models to satisfy different latency requirements (Zhang and Feng, 2021; Liu et al., 2021a) in real-world applications. Recently, some works (Papi et al., 2022; Dong et al., 2022) have shown that a single offline model with bidirectional encoders (such as Wav2Vec2.0 (Baevski et al., 2020)) can be adapted to streaming scenarios with a wait-k policy (Ma et al., 2019) to meet different latency requirements and achieve comparable or better performance. However, there is an inherent mismatch in applying a model bidirectionally trained on complete utterances to incomplete streaming speech during online inference.
Intuitively, speech representations extracted from streaming inputs (Figure 1(b)) are less informative than those from full speech encoding (Figure 1(a)) due to limited future context, especially toward the end of the streaming inputs, and this can be exacerbated by the aforementioned mismatch problem. This raises a natural question: how much do the speech representations differ between the two inference modes? We analyze the gap in speech representations, measured by cosine similarity, at different positions in the streaming input compared to using the full speech (Section 3). We observe a significantly greater gap for representations closer to the end of a streaming segment, with an average similarity score as low as 0.2 for the last frame, and the gap quickly narrows for earlier frames (Figure 2). Additionally, we observe more degradation in translation quality for utterances with the greatest gap in speech representations between online and offline inference (see Appendix B.2).
We conjecture that the lack of future context at the end of streaming inputs can be detrimental to streaming speech translation when using an offline model. To this end, we propose a novel Future-Aware Inference (FAI) strategy. This approach is inspired by the ability of masked language models (Baevski et al., 2020) to construct representations for masked tokens from their context. Specifically, we append a few mask embeddings to the end of the streaming input and leverage the ability of the acoustic encoder (Wav2Vec2.0) to implicitly construct representations for future contexts, which can lead to more accurate representations for the other frames in the streaming input.
Furthermore, we propose a Future-Aware Distillation (FAD) framework that adapts the offline model to extract representations from streaming inputs that more closely resemble those from full speech encoding. We expand the original streaming input with two types of future contexts: one with m oracle speech tokens for the teacher model, and another with m mask tokens for the student model, which is initialized from the teacher model. We minimize several distillation losses between the outputs of the teacher and student models. By incorporating additional oracle future contexts, the speech representations for the frames in the original streaming input extracted by the teacher model resemble those when the full speech is available. FAD aims to adjust the offline model to extract similar representations for streaming input as it would for full speech. In combination with FAI, we improve the model's ability to extract quality representations during both training and inference, alleviating the aforementioned mismatch problem. We refer to our approach as FAST, which stands for Future-Aware Streaming Translation.
We conducted experiments on the MuST-C EnDe, EnEs, and EnFr benchmarks. The results show that our methods outperform several strong baselines in terms of the trade-off between translation quality and latency. In particular, in the lower latency range (when AL is less than 1000ms), our approach achieves BLEU improvements of 12 in EnDe, 16 in EnEs, and 14 in EnFr over the baseline. Extensive analyses demonstrate that our future-aware approach significantly reduces the representation gap between partial streaming encoding and full speech encoding.

Background and Related Work
Speech translation systems can be roughly categorized into non-streaming (offline) and streaming (online) systems depending on the inference mode. Regardless of the inference mode, speech translation models typically employ the encoder-decoder architecture and are trained on an ST corpus $D = \{(x, z, y)\}$, where $x = (x_1, \ldots, x_T)$ denotes an audio sequence, and $z = (z_1, \ldots, z_I)$ and $y = (y_1, \ldots, y_J)$ denote the corresponding source transcription and target translation, respectively.
Non-Streaming Speech Translation For the non-streaming ST task, the encoder maps the entire input audio $x$ to the speech representations $h$, and the decoder generates the $j$-th target token $y_j$ conditioned on the full representations $h$ and the previously generated tokens $y_{<j}$. The decoding process of non-streaming ST is defined as $p(y \mid x) = \prod_{j=1}^{J} p(y_j \mid x, y_{<j})$. A significant amount of work has focused on non-streaming ST, including pre-training (Wang et al., 2020; Dong et al., 2021a; Tang et al., 2022; Ao et al., 2022), multi-task learning (Liu et al., 2020; Indurthi et al., 2020, 2021), data augmentation (Pino et al., 2019; Di Gangi et al., 2019b; McCarthy et al., 2020), knowledge distillation (Dong et al., 2021b; Zhao et al., 2021; Du et al., 2022), and cross-modality representation learning (Tang et al., 2021; Fang et al., 2022; Ye et al., 2022).
Streaming Speech Translation A streaming ST model generates the $j$-th target token $y_j$ based on the streaming audio prefix $x_{\le g(j)}$ and the previous tokens $y_{<j}$, where $g(j)$ is a monotonic non-decreasing function representing the ending timestamp of the audio prefix that must be consumed before generating the $j$-th word. The decoding probability is calculated as $p(y \mid x) = \prod_{j=1}^{J} p(y_j \mid x_{\le g(j)}, y_{<j})$. Thus, a streaming ST model requires a policy to determine whether to wait for more source speech or emit new target tokens. Recent studies (Ma et al., 2020b; Ren et al., 2020; Zeng et al., 2021; Dong et al., 2022) make read/write decisions based on a variant of the wait-k policy that was initially proposed for streaming text translation, which alternates write and read operations after reading the first k source tokens. Because there are no explicit word boundaries in streaming audio, several works attempt to detect word boundaries in the audio sequence by fixed length (Ma et al., 2020b), Connectionist Temporal Classification (Ren et al., 2020; Zeng et al., 2021; Papi et al., 2022), ASR outputs (Chen et al., 2021), or continuous integrate-and-fire (Dong et al., 2022; Chang and Lee, 2022). Moreover, some studies (Arivazhagan et al., 2019; Ma et al., 2020c; Zhang et al., 2020b; Schneider and Waibel, 2020; Miao et al., 2021; Zhang and Feng, 2022a,c; Zhang et al., 2022; Liu et al., 2021b; Zhang and Feng, 2022b; Lin et al., 2023) explore adaptive policies to dynamically decide when to read or write for streaming text and/or streaming speech translation. Zhang and Feng (2022d) fill future source positions with positional encoding as future information during training for simultaneous machine translation (MT) within the prefix-to-prefix framework. In this paper, we focus on a matter less attended to: how to alleviate the mismatch between offline training and online inference.
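The wait-k schedule referenced above can be written down concretely. Below is a minimal sketch (the function name is ours, not from any paper or library), treating source units abstractly: read the first k units, then alternate one write with one read.

```python
def wait_k_schedule(k, num_source, num_target):
    """Return g(j) for j = 1..num_target under a wait-k policy.

    g(j) is the number of source units consumed before emitting target
    token j (1-based, as in the paper's notation). After the first k
    units are read, each new target token allows one more read, capped
    at the total source length.
    """
    return [min(k + j - 1, num_source) for j in range(1, num_target + 1)]
```

For example, with k = 3, five source units, and four target tokens, the schedule consumes 3, 4, 5, 5 units before the successive target tokens, illustrating the monotonic non-decreasing property of g.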
Knowledge Distillation for Streaming Translation Existing studies on streaming text and/or speech translation usually introduce future information by distilling sequence-level knowledge from offline MT (Ren et al., 2020; Zhang et al., 2021; Liu et al., 2021b; Zhu et al., 2022; Deng et al., 2023; Wang et al., 2023) or online MT (Zaidi et al., 2021). Moreover, Ren et al. (2020) leverage the knowledge from the multiplication of the attention weights of streaming ASR and MT models to supervise the attention of the streaming ST model. In contrast, our FAD aims to reduce the representation gap between full speech and streaming speech.

Preliminary Analysis
In this section, we examine the mismatch problem between offline training and online decoding in the Transformer-based (Vaswani et al., 2017) ST architecture. In offline full-sentence ST, the speech representation of each frame is obtained by attending to all frames, including future frames, in the Transformer encoder layers. Recently, a common approach in speech translation is to stack a pre-trained Wav2Vec2.0 (Baevski et al., 2020) as the acoustic encoder with a semantic MT encoder-decoder, resulting in state-of-the-art performance on the ST task (Han et al., 2021; Dong et al., 2022; Fang et al., 2022; Ye et al., 2022). This approach leverages the ability of Wav2Vec2.0 pre-training to learn better speech representations.
When applying an offline model to streaming inference, the lack of future frames causes an apparent mismatch problem, which can lead to a deterioration in the extracted speech representations. To quantify this effect, we examine three offline ST models trained on the MuST-C EnDe dataset using the Chimera (Han et al., 2021), STEMM (Fang et al., 2022), and MoSST (Dong et al., 2022) architectures, each with a trainable acoustic encoder initialized from Wav2Vec2.0. We conduct the analysis on the subset of the tst-COMMON set with durations between 2s and 10s, removing outliers and noisy data, which results in 1,829 examples.
For an input sequence of audio frames $x = (x_1, \ldots, x_T)$, the convolutional subsampler of Wav2Vec2.0 shrinks the length of the raw audio by a factor of 320 and outputs the full speech representation sequence $a$. For readability, we uniformly use the notation $T$ to denote the sequence length of $a = (a_1, \ldots, a_T)$. This simplified notation does not undermine any of our conclusions while making the equations more readable. For a streaming input $\hat{x}_t = (x_1, \ldots, x_t)$ with $t \le T$, Wav2Vec2.0 outputs the representations $\hat{a}_t = (\hat{a}_{t,1}, \ldots, \hat{a}_{t,t})$.
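As a side note on where the factor of 320 comes from: in the Wav2Vec2.0 base configuration (an assumption about the released model, not a detail stated in this paper), the feature extractor consists of seven temporal convolutions whose strides multiply out to the overall subsampling factor. A quick illustrative check:

```python
# Strides of the seven convolutions in Wav2Vec 2.0's feature extractor
# (base configuration; this is our assumption about the released model).
strides = (5, 2, 2, 2, 2, 2, 2)

factor = 1
for s in strides:
    factor *= s  # 5 * 2^6 = 320

def num_frames(num_samples, factor=320):
    """Approximate output length after subsampling.

    Illustrative only: ignores kernel sizes, padding, and edge effects.
    At 16 kHz, one output frame covers 320 samples, i.e. 20 ms.
    """
    return num_samples // factor
```

So one second of 16 kHz audio (16,000 samples) yields roughly 50 speech tokens, which is why a one-frame offset in the token sequence corresponds to about 20ms of audio.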
To quantify the difference in speech representations between offline and online inputs, we compute the cosine similarity $s_{t,t'} = \cos(\hat{a}_{t,t'}, a_{t'})$ between the speech representation at the $t'$-th position ($t' \le t$) in the streaming audio input $\hat{x}_t$ and the representation at the same position under full-sentence encoding. We then compute the statistic $\bar{s}_{-\tau}$ by averaging the cosine similarity over both the test set $B$ and the time dimension, where the reverse index $-\tau$ corresponds to the position $\tau - 1$ frames before the end of the streaming input:

$$\bar{s}_{-\tau} = \frac{1}{|B|} \sum_{x \in B} \frac{1}{|x| - \tau + 1} \sum_{t=\tau}^{|x|} s_{t,\, t-\tau+1}$$
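The per-utterance part of this statistic can be computed directly from encoder outputs. Below is a minimal NumPy sketch, assuming `full_reps` and the per-prefix `streaming_reps` have already been extracted from the model (the names and data layout are ours, not the authors' implementation):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def avg_reverse_similarity(full_reps, streaming_reps, tau):
    """Average s_{t, t-tau+1} over all prefixes of one utterance.

    full_reps: (T, d) matrix from full-utterance encoding.
    streaming_reps: list whose (t-1)-th entry is the (t, d) matrix obtained
    by encoding the length-t prefix. tau=1 picks the last frame of each
    prefix, tau=2 the second-to-last, and so on.
    """
    sims = []
    for t, prefix in enumerate(streaming_reps, start=1):
        if t < tau:
            continue  # prefix too short to have a position tau-1 from its end
        pos = t - tau  # 0-based index of that position
        sims.append(cosine(prefix[pos], full_reps[pos]))
    return sum(sims) / len(sims)
```

Averaging this quantity over the test set gives the curve in Figure 2; a sanity check is that if streaming and full encodings were identical, the statistic would be 1.0 for every tau.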
Figure 2 displays the $\bar{s}_{-\tau}$ curve for the last 100 positions of the streaming inputs. For $\tau > 10$, the averaged cosine similarity $\bar{s}_{-\tau}$ is greater than 0.8, indicating that the representations at those positions in a streaming input are similar to those obtained with the full speech. However, the curve shows a sharp decline in $\bar{s}_{-\tau}$ for the ending positions, particularly for the last one ($\tau = 1$), suggesting that the mismatch problem can significantly degrade the quality of the speech representations at these positions. We provide additional analysis in Appendix B.

Method
To address the mismatch problem between offline training and online inference, we propose a novel methodology called Future-Aware Streaming Translation (FAST). This approach adapts an offline ST model for streaming scenarios by using a Future-Aware Inference (FAI) strategy during inference and a Future-Aware Distillation (FAD) strategy during training. An overview of our proposed method is depicted in Figure 3.

Model Architecture
Unlike previous works (Ren et al., 2020; Ma et al., 2020b; Zeng et al., 2021; Liu et al., 2021a) that require training multiple streaming models for different latency requirements, our goal is to train a single offline model that meets all of them. The overall architecture, depicted in Figure 3(a), consists of an acoustic encoder, an acoustic boundary detector, a semantic encoder, and a translation decoder.

Acoustic encoder: The pre-trained Wav2Vec2.0 is adopted as the acoustic encoder to learn better speech representations (Ye et al., 2021, 2022).

Acoustic boundary detector: To enable the offline ST model to perform chunk-wise streaming inference, we use a Continuous Integrate-and-Fire (CIF) module (Dong and Xu, 2020) as the acoustic boundary detector to dynamically locate the acoustic boundaries of speech segments, following (Yi et al., 2021; Dong et al., 2022). The CIF module generates an integration weight $\alpha_t$ for each acoustic representation $a_t$ produced by Wav2Vec2.0. CIF then accumulates the $\alpha_t$ step by step. When the accumulation reaches a certain threshold (e.g., 1.0), the acoustic representations corresponding to these weights are integrated into a single hidden representation $h_j$ by weighted averaging, indicating a detected token boundary. The shrunk representations $h$ are fed into the semantic encoder. To learn correct acoustic boundaries, we use the source text length $J$ as a weakly supervised signal.
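The accumulate-and-fire behavior described above can be sketched as follows. This simplified version only returns boundary spans; it ignores the weight-splitting at boundary frames and the weighted-average integration performed by the full CIF module (Dong and Xu, 2020), so it should be read as an illustration of the thresholding mechanism, not the authors' implementation.

```python
def cif_segment(alphas, threshold=1.0):
    """Detect acoustic boundaries by accumulating CIF weights (simplified).

    alphas: per-frame integration weights predicted by the model.
    Whenever the running sum crosses `threshold`, a boundary fires and a
    new segment starts at the next frame. Returns (start, end) frame
    spans, one per detected acoustic unit.
    """
    boundaries = []
    acc = 0.0
    start = 0
    for t, a in enumerate(alphas):
        acc += a
        if acc >= threshold:
            boundaries.append((start, t))
            acc -= threshold  # carry the residual weight into the next unit
            start = t + 1
    return boundaries
```

Because the number of fired boundaries is roughly the sum of all weights, supervising that sum with the source text length J pushes the module toward one boundary per source token.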
There are two benefits of using CIF as a boundary detector. For the offline ST model, it addresses the length gap between speech and text. It also provides the acoustic boundaries needed to perform read/write policies during streaming inference. Similar to word alignment in NMT (Li et al., 2022, 2023), it can align the source audio with the source text tokens.

Semantic encoder and translation decoder: A standard Transformer (Vaswani et al., 2017) composed of $L_e$ encoder layers and $L_d$ decoder layers is used. The translation loss is defined as

$$\mathcal{L}_{\mathrm{ST}} = -\sum_{j=1}^{J} \log p(y_j \mid x, y_{<j}),$$

and the CIF module is trained with a quantity loss that encourages the number of detected boundaries to match the source text length,

$$\mathcal{L}_{\mathrm{QUA}} = \Big| \sum_{t=1}^{T} \alpha_t - J \Big|.$$

The offline ST model is trained with the following objective function:

$$\mathcal{L}_{\mathrm{offline}} = \mathcal{L}_{\mathrm{ST}} + \lambda \mathcal{L}_{\mathrm{QUA}},$$

where $\lambda$ is a hyper-parameter to balance the two losses.
Future-Aware Inference

Based on the analysis in Section 3, we find that the offline ST model only needs to be aware of a short future during streaming encoding. Thus, we first propose a Future-Aware Inference (FAI) strategy to enhance the representations of streaming speech (Figure 3(b)).
In this strategy, streaming inference is performed directly on the offline ST model without fine-tuning. In particular, we use the mask token of Wav2Vec2.0 as pseudo future context and append it to the speech tokens generated from the already consumed speech frames. Because the mask token embedding is trainable during Wav2Vec2.0 pre-training, and the contrastive loss identifies the quantized latent audio representation of masked regions based on the unmasked context, the intuition is that mask tokens can encode future context. In addition, the masking strategy during pre-training results in approximately 49% of all time steps being masked with a mean span length of 300ms, which also guarantees that Wav2Vec2.0 can extract good speech representations even in the presence of a large number of mask tokens. Wav2Vec2.0 consists of a multi-layer convolutional subsampler $f_c$ and a Transformer encoder $f_e$. During online inference, for each audio prefix $\hat{x}_t = (x_1, \ldots, x_t)$, $f_c$ first outputs the streaming speech tokens $\hat{c}_t = (c_1, \ldots, c_\tau)$, where $\hat{c} \in \mathbb{R}^{\tau \times d}$, $d$ is the model dimension, and $\tau$ is the sequence length after convolutional subsampling. Then, we concatenate the streaming speech tokens $\hat{c}$ and $m$ mask token embeddings $e \in \mathbb{R}^d$ along the time dimension, resulting in a longer sequence of speech tokens in $\mathbb{R}^{(\tau+m) \times d}$. The new speech tokens are fed into the Transformer encoder $f_e$, but only the first $\tau$ encoder outputs (i.e., speech features) are kept for the CIF module because, as discussed in Section 3, the last $m$ speech features are of poor quality and adversely affect translation quality. Then, if an acoustic boundary is detected by the CIF module, the decoder emits new words based on the wait-k policy; otherwise, the model continues to read the streaming speech. The FAI strategy is outlined in Algorithm 1 in the Appendix.
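Putting the FAI steps together, a hedged NumPy sketch might look like the following, with `transformer_encode` standing in for the Transformer encoder f_e and all names being illustrative rather than the authors' implementation:

```python
import numpy as np

def future_aware_inference(conv_tokens, mask_embedding, transformer_encode, m=50):
    """Future-Aware Inference (FAI), sketched under stated assumptions.

    conv_tokens: (tau, d) output of the convolutional subsampler f_c for
    the audio consumed so far.
    mask_embedding: (d,) trainable mask token from Wav2Vec 2.0 pre-training.
    transformer_encode: callable standing in for the encoder f_e.
    Appends m mask embeddings as pseudo future context, encodes the
    extended sequence, and keeps only the first tau (non-mask) outputs.
    """
    tau = conv_tokens.shape[0]
    masks = np.tile(mask_embedding, (m, 1))             # (m, d) pseudo future
    extended = np.concatenate([conv_tokens, masks], 0)  # (tau + m, d)
    encoded = transformer_encode(extended)
    return encoded[:tau]                                # discard mask positions
```

The key design choice is that the mask positions influence the first tau outputs through self-attention but are themselves discarded, since Section 3 shows the trailing features are the unreliable ones.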

Future-Aware Distillation
Although FAI uses mask tokens as pseudo future context, it would still be preferable to leverage the future oracle speech tokens, which are unavailable during inference. Therefore, we take one step further by proposing a fine-tuning method, Future-Aware Distillation (FAD). It aims to distill the knowledge from a teacher with oracle future context into a student with pseudo future context.
The teacher model is the offline ST model obtained by optimizing Eq. (5) and is frozen. The student model has exactly the same architecture as the teacher and is initialized from it. Its semantic encoder and translation decoder are frozen to retain the offline-trained ST performance.
Training A naive solution is to distill knowledge from the full speech into every possible streaming prefix of each audio. However, since the number of speech tokens is typically very large, e.g., 300 on average, this is computationally prohibitive. To this end, we propose a simple and efficient implementation via random sampling.
Given a full audio waveform $x$, $f_c$ outputs the speech tokens $c \in \mathbb{R}^{T \times d}$. We randomly sample an integer $t \in [1, T]$ to construct the streaming speech tokens $c_{\le t}$. Then, we define the teacher input of $f_e$ with oracle future context as

$$c^T = (c_1, \ldots, c_t, c_{t+1}, \ldots, c_{t+m}),$$

where $m$ is a hyper-parameter denoting the number of future-context tokens. The most straightforward approach would be to use the full speech as the teacher input. However, due to the bidirectional acoustic encoder, the streaming speech representation of the same position constantly changes as new frames are consumed.
To maintain consistency with the inference method FAI, we use mask tokens as the pseudo future context and append them to the sampled speech tokens to construct the student input:

$$c^S = (c_1, \ldots, c_t, \underbrace{e, \ldots, e}_{m}),$$

where $e \in \mathbb{R}^d$ is the mask embedding.
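The construction of the two encoder inputs can be sketched as follows, assuming the speech tokens have already been produced by f_c (function and variable names are illustrative):

```python
import numpy as np

def build_distillation_inputs(speech_tokens, t, m, mask_embedding):
    """Construct teacher and student encoder inputs for FAD (sketch).

    speech_tokens: (T, d) full-utterance token sequence from f_c.
    t: sampled prefix length; m: number of future-context tokens.
    mask_embedding: (d,) trainable mask token.
    The teacher sees m oracle future tokens after the prefix; the student
    sees m copies of the mask embedding instead.
    """
    prefix = speech_tokens[:t]
    teacher_in = speech_tokens[: min(t + m, len(speech_tokens))]
    masks = np.tile(mask_embedding, (m, 1))
    student_in = np.concatenate([prefix, masks], axis=0)
    return teacher_in, student_in
```

Both inputs share the same first t tokens, so any difference in their first t encoder outputs is attributable to the oracle-versus-mask future context, which is exactly what the distillation losses penalize.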
We obtain the streaming speech representations from the teacher encoder $f_e^T$ and the student encoder $f_e^S$:

$$\hat{a}^T = f_e^T(c^T), \qquad \hat{a}^S = f_e^S(c^S).$$

Then the first $t$ speech representations are fed into the CIF module to derive the teacher and student weight sequences:

$$\alpha^T = \mathrm{CIF}(\hat{a}^T_{\le t}), \qquad \alpha^S = \mathrm{CIF}(\hat{a}^S_{\le t}).$$

Eventually, two distillation losses are proposed to reduce the speech representation gap.
The first loss directly minimizes the gap between the streaming speech representations via cosine similarity:

$$\mathcal{L}_{\mathrm{KD}}^{\mathrm{W2V}} = \frac{1}{t} \sum_{i=1}^{t} \left(1 - \cos(\hat{a}^T_i, \hat{a}^S_i)\right).$$
The second loss encourages the student to learn correct acoustic boundaries for online inference by computing the KL divergence between the two weight distributions:

$$\mathcal{L}_{\mathrm{KD}}^{\mathrm{CIF}} = \mathrm{KL}\left(\alpha^T \,\|\, \alpha^S\right).$$

Note that, according to the analysis in Section 3, the representations of the first $t$ speech tokens produced by $f_e^T$ are of high quality when $m > 10$, so only the first $t$ speech representations are taken into account in the loss calculation.
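A possible NumPy rendering of the two losses is given below. The exact reductions (mean over the first t positions for the cosine term, normalizing the CIF weights into distributions before the KL term) are our assumptions consistent with the description above, not the authors' code.

```python
import numpy as np

def fad_losses(teacher_reps, student_reps, teacher_alphas, student_alphas, t):
    """Illustrative FAD distillation losses on the first t positions.

    L_KD^W2V pulls the student's streaming representations toward the
    teacher's via cosine similarity; L_KD^CIF matches the (normalized)
    CIF weight distributions with KL divergence.
    """
    T, S = teacher_reps[:t], student_reps[:t]
    cos = np.sum(T * S, axis=1) / (
        np.linalg.norm(T, axis=1) * np.linalg.norm(S, axis=1) + 1e-8)
    l_w2v = float(np.mean(1.0 - cos))

    # Normalize weights into distributions before the KL term (assumption).
    p = teacher_alphas[:t] / (teacher_alphas[:t].sum() + 1e-8)
    q = student_alphas[:t] / (student_alphas[:t].sum() + 1e-8)
    l_cif = float(np.sum(p * np.log((p + 1e-8) / (q + 1e-8))))
    return l_w2v, l_cif
```

As a sanity check, feeding identical teacher and student outputs drives both losses to zero, which is the fixed point the fine-tuning pulls the student toward.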
Optimization The total training objective of FAD can be written as $\mathcal{L} = \mathcal{L}_{\mathrm{KD}}^{\mathrm{W2V}} + \mathcal{L}_{\mathrm{KD}}^{\mathrm{CIF}}$. The overall training procedure of the proposed method is shown in Figure 3(c).

Experimental Settings
Datasets We evaluate our approach on the MuST-C V1 English-German (EnDe), English-Spanish (EnEs), and English-French (EnFr) datasets (Di Gangi et al., 2019a); few previous works have reported BLEU-latency curves for streaming ST on EnFr. All the corpora contain source audio, source transcriptions, and target translations, and all reported results are obtained on the corresponding tst-COMMON set. Detailed statistics of the different language pairs are given in Appendix A.
For speech data, we normalize the raw audio waveform to the range [−1, 1). For text data, we keep punctuation, remove non-printing characters, and remain case-sensitive. For each translation direction, a unigram SentencePiece model (Kudo and Richardson, 2018) is used to learn a shared vocabulary of size 10k.

Model Configuration For the acoustic encoder, we use Wav2Vec2.0 (Baevski et al., 2020) with the base configuration. We construct the acoustic boundary detector by applying CIF (Yi et al., 2021) on the last dimension of the speech representations. We use 8 and 6 layers for the semantic encoder and the translation decoder, respectively, with 4 attention heads and 768 hidden units.
Training The detailed training schedule of the offline ST model is given in Appendix C. We set the length m of future-context tokens to 50 for both FAD and FAI. All hyper-parameters are tuned on the EnDe dev set and applied to the other language pairs. We train all models with 3.2 million frames per batch on 8 Nvidia Tesla V100 GPUs. We implement our models with Fairseq (Ott et al., 2019).

Inference We average the checkpoints of the best 10 epochs on the development set for evaluation. We perform streaming testing with the wait-k policy, where k is counted in terms of the acoustic units detected by the CIF module. Following the convention in simultaneous translation (Zeng et al., 2021; Dong et al., 2022), we do not rewrite tokens that have already been generated.

Evaluation Metrics We use SacreBLEU for translation quality. Latency is evaluated with Average Lagging (AL) (Ma et al., 2019), Average Proportion (AP) (Cho and Esipova, 2016), and Differentiable Average Lagging (DAL) (Cherry and Foster, 2019) in SimulEval (Ma et al., 2020a).

System Settings We compare our method with several strong end-to-end streaming ST approaches. (i) SimulSpeech (Ren et al., 2020) and RealTranS (Zeng et al., 2021) use a unidirectional encoder rather than a bidirectional one. (ii) MoSST (Dong et al., 2022) applies an offline-trained model with a monotonic segmentation module for streaming testing and achieves competitive performance. (iii) MMA-SLM (Indurthi et al., 2022) enhances monotonic attention to make better read/write decisions by integrating future information from language models. (iv) ITST (Zhang and Feng, 2022b) learns an adaptive read/write policy by quantifying the information weight transported from source tokens to target tokens. (v) MU-ST (Zhang et al., 2022) learns an adaptive segmentation policy to detect meaningful units, which makes read/write decisions. (vi) Baseline is our offline-trained ST model (B for abbreviation). For fair comparison, it has the same structure as MoSST.
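For reference, Average Lagging as defined by Ma et al. (2019) can be sketched as below. Here source and target are measured in abstract units (for speech ST, g and the result are typically in milliseconds), and we assume the schedule eventually consumes the full source; the function name is ours.

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (AL), sketched from Ma et al. (2019).

    g[j-1] is the number of source units read before emitting target
    token j. tau is the index of the first target token emitted after the
    full source was consumed; gamma = tgt_len / src_len rescales target
    positions so an ideal fully-synchronous system gets AL close to 0.
    """
    gamma = tgt_len / src_len
    tau = next(j for j, gj in enumerate(g, start=1) if gj >= src_len)
    return sum(g[j - 1] - (j - 1) / gamma for j in range(1, tau + 1)) / tau
```

An offline system that reads everything before writing (g = [src_len, ...]) lags by the full source length, while a wait-1 schedule over equal-length sequences lags by one unit.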

Main Results
We present the main results in Figure 4. Compared with the online models SimulSpeech, RealTranS, and ITST, our offline model (baseline) achieves higher translation quality at high latency, as it encodes bidirectional context information during training; in the low-latency region, however, it performs poorly due to the input mismatch between offline training and online decoding.

B + FAI With its ability to reduce this mismatch, FAI can be applied directly to our offline (baseline) model and achieves higher BLEU in all latency regions. In particular, it outperforms our most comparable baseline B by large margins in the low-latency region (when AL is less than 1000ms), with improvements of over 6 BLEU in both EnDe and EnEs and 10 BLEU in EnFr.
FAST (FAD + FAI) Furthermore, our FAST achieves the best trade-off between translation quality and latency, especially in the extremely low latency region (AL of about 200ms, k = 1), with improvements of 6 BLEU in EnDe, 10 BLEU in EnEs, and 4 BLEU in EnFr over B + FAI. This indicates that FAST can effectively mitigate the input mismatch between offline training and online decoding. In addition, our method achieves translation quality comparable to full-speech translation at middle latency (AL around 3000ms), especially for EnEs.

Ablation Study
In this section, we study the effectiveness of our methods. All ablation results are obtained on the MuST-C EnDe tst-COMMON set and shown in Figure 5.

(1) w/o $\mathcal{L}_{\mathrm{KD}}^{\mathrm{W2V}}$: If we remove $\mathcal{L}_{\mathrm{KD}}^{\mathrm{W2V}}$, the translation quality drops by 1-2 BLEU in all latency regions, including the high-latency region. This demonstrates that optimizing $\mathcal{L}_{\mathrm{KD}}^{\mathrm{W2V}}$ helps preserve full-speech translation performance and reduces the mismatch between full speech and streaming speech.
(2) w/o $\mathcal{L}_{\mathrm{KD}}^{\mathrm{CIF}}$: If we remove $\mathcal{L}_{\mathrm{KD}}^{\mathrm{CIF}}$, the translation quality is slightly degraded. However, we observe that the distances between two consecutive acoustic boundaries become larger. For example, the AL of this variant at wait-1 is greater than 750, while the AL of the other variants at wait-1 is approximately 150. As expected, optimizing $\mathcal{L}_{\mathrm{KD}}^{\mathrm{CIF}}$ helps ensure correct acoustic boundaries.
(3) w/o FAI: In this variant, we use the student model trained with FAD under the vanilla wait-k policy for streaming inference (i.e., inference without mask tokens). However, FAD training provides mask tokens as student input, so this mismatch leads to significant performance degradation in the low- and middle-latency regions. This indicates that FAD and FAI should be used together to achieve the best streaming performance.
(4) w/o mask embeddings: During training and inference, our model appends m mask tokens to the streaming speech tokens as pseudo future context. In this variant, we remove the mask tokens during both training and inference. Even though there is no mismatch, we still observe a significant drop in translation quality, especially at high latency. This result indicates that the pseudo future context can enhance the streaming speech representations.

How much future context is needed?
To answer this question, we evaluate FAST (FAD + FAI) with different lengths of future context. Figure 6 shows the overall results; m = 0 denotes the offline system without distillation. The offline system inherits the mismatch problem, but our method gradually improves performance as m increases from 0 to 20. Since only the representations of the last 10 positions are poor (Section 3), FAST obtains similar BLEU-AL curves when m is significantly larger than 10, e.g., 20-100.
After FAD training, we investigate the representation of the last position (before the mask tokens) via $\bar{s}_{-1}$ in Eq. (2) as a function of m. The results are shown in Figure 7. We observe that 1) as m increases, the streaming speech representation of the last position becomes better, and 2) the cosine similarity curves flatten once m > 10. This is consistent with the trend in Figure 6.

Analysis on The Representation Gap
Figure 8 plots the average cosine similarity $\bar{s}_{-\tau}$ in Eq. (2) for the last 40 positions (before the mask tokens) of the streaming speech after applying FAI or FAST (FAD + FAI). They achieve at least 0.6 and 0.8 cosine similarity at the last position, respectively. The baseline reaches less than 0.6 cosine similarity for the last 4 positions and only 0.2 for the last position. This indicates that the representations with FAI are closer to those of the full speech, especially at the ending positions, and that FAD training further closes this gap.

What examples are improved?
For tst-COMMON on MuST-C EnDe, we use awesome-align (Dou and Neubig, 2021) to identify the token-level alignment between source transcription and target translation, following Zhang and Feng (2022d). First, we define the source-to-target alignment position shift as $\max\{0, i - j\}$, where the $i$-th source token is aligned to the $j$-th target token. If $i - j$ is large, then in order to translate the $j$-th target token, the model may need to keep reading until it sees the $i$-th source token. We then calculate the monotonic level of each example as the alignment position shift averaged over the aligned tokens:

$$M = \frac{1}{|A|} \sum_{(i,j) \in A} \max\{0, i - j\},$$

where $M$ denotes the monotonic level and $A$ represents the set of aligned pairs. We evenly divide the test set into three groups (Easy, Medium, and Hard).
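The monotonic level defined above is straightforward to compute; a small sketch (the helper name is ours):

```python
def monotonic_level(aligned_pairs):
    """Monotonic level M of one example, per the definition above.

    aligned_pairs: set of (i, j) pairs meaning the i-th source token is
    aligned to the j-th target token (1-based). M averages the
    source-to-target alignment position shift max(0, i - j) over the
    aligned pairs; larger M means more reordering, i.e. a harder example
    for streaming translation.
    """
    if not aligned_pairs:
        return 0.0
    return sum(max(0, i - j) for i, j in aligned_pairs) / len(aligned_pairs)
```

For instance, an example whose only long-range link aligns source token 3 to target token 1 (shift 2) among three aligned pairs gets M = 2/3.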

Conclusion
In this paper, we examine streaming speech translation from a new perspective. We investigate the effects of the input mismatch between offline training and online decoding. We find that the representations at the ending positions of the streaming input are particularly poor, directly impacting translation quality. We propose FAST, which introduces future contexts to improve these representations during training and testing via FAD and FAI, respectively. Experiments and analyses demonstrate their effectiveness in bridging the representation gap between full speech encoding and partial streaming encoding. Furthermore, our methods can be generally beneficial to streaming speech translation models that are based on Wav2Vec2.0.
In the future, we will explore related methods that do not depend on Wav2Vec2.0.

Limitations
Our proposed method is built upon the Wav2Vec2.0 base model, whose superior representation power has been shown to enhance the performance of offline ST models. Nevertheless, its parameter count is considerable, approximately 95M, which may lead to increased computational costs during training and inference. Extending the model to long-context audio (similar to document-level machine translation (Zhang et al., 2020a)) is left to the future work mentioned in our conclusion.
The CIF module for detecting acoustic boundaries is optimized from a weakly supervised signal, the total number of text tokens. In streaming inference, the boundary detector is not guaranteed to predict accurate boundaries; in other words, it is not guaranteed that each text token aligns with a detected boundary during online inference. However, given the good overall translation quality, we hypothesize that these boundaries may represent meaningful acoustic (or phrase-like) units. Investigating their underlying meaning is another direction for future work.

A Data Statistics
We evaluate our model on the MuST-C V1 English-German (EnDe), English-Spanish (EnEs), and English-French (EnFr) datasets (Di Gangi et al., 2019a). For the training set, we follow Dong et al. (2022) and filter out short speech of fewer than 1,000 frames (62.5ms) and long speech of more than 480,000 frames (30s). The statistics of the different language pairs are shown in Table 2.

B.1 Is only the representation at the end positions of streaming speech poor?

To further verify that only the representation of the end positions in streaming speech is poor, we calculate the cosine similarity $s_{t,t'}$ between the speech representation at the $t'$-th position ($t' \le t$) in the streaming audio input $\hat{x}_t$ and the speech representation at the same position under full encoding.
Then we average the cosine similarities over the sentences in the dataset $B$ to obtain robust statistics. For each encoding step $t$,

$$\bar{s}_{t,t'} = \frac{1}{|B_t|} \sum_{x \in B_t} s_{t,t'}(x),$$

where $B_t = \{x : |x| \ge t\}$ contains the audio inputs with length no shorter than $t$.
We empirically compare the averaged cosine similarity at the beginning, middle, and end positions of the speech representations. Figure 9 shows $\bar{s}_{t,t'}$ for the first three ($t' = 1, 2, 3$), middle three ($t' = \lfloor\frac{1+t}{2}\rfloor - 1, \lfloor\frac{1+t}{2}\rfloor, \lfloor\frac{1+t}{2}\rfloor + 1$), and last three ($t' = t-2, t-1, t$) positions at each encoding step $t$. At the beginning and middle positions, the averaged cosine similarity $\bar{s}_{t,t'}$ is greater than 0.8 except for $t' = 1$, indicating that the representations at these positions in the partial streaming input are close to those in the full speech. Note that the slightly lower similarity at $t' = 1$ hardly hurts performance, because in practice it is almost impossible to apply the wait-1 policy (reading only 20ms of speech input) in streaming ST. However, $\bar{s}_{t,t'}$ declines significantly for the end positions, especially for the last one. In addition, we observe that as $t$ becomes larger, the streaming input gradually approximates the full speech input, and the gap in speech representations between the offline and online inputs becomes smaller. We conclude that the representations at the end positions of the streaming speech are particularly inferior.
B.2 Does the poor representation at the last positions of streaming speech affect streaming ST performance?
To answer this question, we calculate the average cosine similarity at the last position only for each sample.
$\bar{s}_{-1}(x)$ reflects the degree of deterioration of the representation at the last position of the streaming speech. We sort the dataset by this degree and divide it evenly into 5 groups to ensure enough samples in each group. The translation quality of each group is shown in Figure 10. The performance of streaming ST drops by close to 10 points as the representation at the last position of the streaming speech worsens, while full-sentence ST fluctuates by less than 4 points. In addition, the performance gap between streaming ST and full-sentence ST widens as the last-position representation degrades. In the worst group, streaming ST is 12.41 points below full-sentence ST. We therefore conclude that the poor representation at the end position of streaming speech strongly affects translation quality.
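The grouping step can be sketched as below. The similarity values are made up for illustration; in the paper, $\bar{s}_{-1}(x)$ is computed per sample and BLEU is then measured separately on each bucket.

```python
import numpy as np

def split_by_degradation(last_pos_sims, n_groups=5):
    """Sort samples by their last-position similarity s_{-1}(x) and
    split them into n_groups equally sized buckets (worst first)."""
    order = np.argsort(last_pos_sims)   # ascending: most degraded first
    return np.array_split(order, n_groups)

# Hypothetical per-sample similarities at the final streaming position.
sims = np.array([0.91, 0.12, 0.55, 0.78, 0.30, 0.66, 0.45, 0.83, 0.20, 0.99])
groups = split_by_degradation(sims)
# Translation quality would then be evaluated per group, as in Figure 10.
print([sims[g].round(2).tolist() for g in groups])
```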

C Details of Offline Training
We use the Adam optimizer with a learning rate of $10^{-4}$ and 10k warmup steps, and decay the learning rate with an inverse square root schedule.
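The schedule can be written out as a small function. This is a generic sketch of the standard warmup-then-inverse-square-root rule (as popularized by fairseq/Transformer training), not the paper's exact code; the peak learning rate and warmup steps match the values stated above.

```python
def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_steps=10_000):
    """Linear warmup to peak_lr over warmup_steps, then decay the
    learning rate proportionally to 1/sqrt(step)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps ** 0.5) / (step ** 0.5)

# At 4x the warmup steps, the learning rate has halved.
print(inverse_sqrt_lr(10_000), inverse_sqrt_lr(40_000))
```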
The offline ST model is first trained with multitask learning over ASR and ST tasks. A language identity tag is prepended to the target sentence to indicate which task is being learned. In this stage, the CIF module, which is used to detect acoustic boundaries, is deactivated; in other words, the CIF module is not trained. The main purpose

D.2 How important is Wav2Vec2.0?
As mentioned in the main text, the special audio token "mask" in Wav2Vec2.0 is pre-trained on the Librispeech dataset to reconstruct the corresponding feature conditioned on the unmasked context via a contrastive task. In our experiments, we did not include contrastive learning as an auxiliary task in the downstream ST training, and in our FAI inference we directly use the mask embeddings as future context by appending them to the streaming input. Nevertheless, we found that the speech representations become even better after ST training. Specifically, we calculate the cosine similarity between every predicted future representation and the full speech representation at the same position; the results are illustrated in Figure 11. On both the Librispeech and the MuST-C audio test sets, the fine-tuned Wav2Vec2.0 produces better speech representations from the masked inputs.
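The input construction in FAI can be sketched as follows. This is a simplified illustration under stated assumptions: the feature values and dimension are made up, and `mask_embedding` stands in for Wav2Vec2.0's trained mask token embedding.

```python
import numpy as np

def future_aware_input(streaming_feats, mask_embedding, m=50):
    """FAI input construction: append m copies of the trained mask
    embedding to the streaming features so the bidirectional encoder
    sees a pseudo future context. The m trailing encoder outputs are
    discarded after encoding."""
    future = np.tile(mask_embedding, (m, 1))
    return np.concatenate([streaming_feats, future], axis=0)

dim = 4
feats = np.zeros((7, dim))        # 7 streaming frames (toy values)
mask_emb = np.ones((1, dim))      # stand-in for the mask token embedding
extended = future_aware_input(feats, mask_emb, m=50)
print(extended.shape)             # encoder now sees 7 + 50 positions
```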

D.3 Why m > 10?
Based on the analysis in Section 3, we observed that the representations at the last 10 positions of streaming speech are poor. For example, the speech representations $\hat{a}_{t-10:t}$ for streaming speech $c_{1:t}$ of length $t$ are poor. Similarly, in FKD, for a teacher's streaming speech input $c_{1:t+m}$ of length $t+m$, the speech representations $\hat{a}_{t+m-10:t+m}$ are always suboptimal. Hence, not all $t+m$ teacher representations can be used; only the first $t$ are taken into account in the loss calculation. If $m < 10$, then $t+m-10 < t$, and the representations $\hat{a}_{t+m-10:t}$ will also be of inferior quality, making them poor teachers. Thus, $m$ needs to be greater than 10 to obtain high-quality teachers.
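The counting argument can be made concrete with a toy sketch. This is not the paper's loss code; it only shows which teacher positions remain usable as distillation targets, assuming the last 10 positions of any streaming encoding are unreliable.

```python
import numpy as np

def fkd_targets(teacher_reprs, t, m, tail=10):
    """Of the t+m teacher representations, the last `tail` positions
    are unreliable. A position is a valid distillation target only if
    it is among the first t positions AND before position t+m-tail."""
    usable = min(t, t + m - tail)
    return teacher_reprs[:usable]

reprs = np.arange(30)                           # toy teacher outputs, t+m = 30
print(len(fkd_targets(reprs, t=20, m=10)))      # m >= tail: all t targets usable
print(len(fkd_targets(reprs, t=25, m=5)))       # m < tail: fewer than t targets
```

With $m \ge 10$ the unreliable tail lies entirely past the first $t$ positions, so all $t$ targets survive; with $m < 10$ some of the first $t$ positions are themselves degraded.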
D.4 Why are all predicted features discarded?
In the FAI strategy, all output representations corresponding to the $m = 50$ masking tokens are discarded, because we have demonstrated that the representations at the ending positions are inferior. However, as shown in Figure 11, the first 10 predicted representations are not as bad as the next 40. Therefore, on the EnDe test set, we also conduct another streaming ST inference by appending different numbers of predicted context representations to the original speech representations. We use the discard rate $p$ to measure the number of appended features; when $p = 1.0$, all predicted features are discarded and the method reduces to standard FAI inference. In Figure 12, we compare the streaming translation quality of regular FAI and this variant. We conclude that the predicted future context is too noisy and harms performance.
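The discard-rate variant can be sketched as below. This is an illustrative truncation rule, not the paper's code; the output values are toy placeholders, and `p` matches the discard rate defined above.

```python
import numpy as np

def keep_predicted_context(outputs, t, m=50, p=1.0):
    """Keep the t real positions plus the first (1 - p) * m predicted
    (mask-position) outputs; discard the rest. p = 1.0 recovers the
    standard FAI inference, which drops all m predictions."""
    keep = int(round((1.0 - p) * m))
    return outputs[: t + keep]

outs = np.arange(60)                                    # t = 10 real + m = 50 predicted
print(len(keep_predicted_context(outs, t=10, p=1.0)))   # standard FAI: 10 positions
print(len(keep_predicted_context(outs, t=10, p=0.8)))   # keep 10 predictions: 20 positions
```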

Figure 1: The input mismatch between offline training and streaming testing.

Figure 2: The average cosine similarity $\bar{s}_{-\tau}$ of the last 100 positions in the streaming speech.

Figure 3: Illustration of the offline ST model and the proposed FAI and FAD methods.

Figure 4: Translation quality (BLEU) against the latency metric (AL) on the tst-COMMON sets of the MuST-C EnDe, EnEs, and EnFr datasets. † denotes results obtained from the corresponding papers. offline is the offline performance of the teacher model (offline-trained ST) with greedy search. The curve labeled B is the online performance of the teacher model using the vanilla wait-k policy. The curve labeled B + FAI is the online performance of the teacher model with our FAI strategy. The curve labeled FAST is the online performance of our student model with the FAI strategy, i.e., FAD + FAI.

Figure 6: Effect of different lengths of future context.

Figure 11: Accuracy of the predicted context, measured as the cosine similarity between every predicted future representation and the full speech representation at the same position.

Figure 12: BLEU vs. AL for different discard rates $p$.

Table 2: Number of samples in each split of the MuST-C datasets.

Table 4: Numeric results for the ablation study (Figure 5).