Turning Whisper into Real-Time Transcription System

Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models; however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation for Whisper-like models. Whisper-Streaming uses a local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3-second latency on an unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in a live transcription service at a multilingual conference.


Introduction
Whisper (Radford et al., 2022) is a recent state-of-the-art system for automatic speech recognition (ASR) for 97 languages and for translation from 96 languages into English. Whisper models are publicly available under the MIT license. However, the current public implementations of Whisper inference usually allow only offline processing of audio documents that are completely available at the time of processing, without any processing time constraints.
Real-time streaming mode is useful in certain situations, e.g. for live captioning. It means that the source speech audio has to be processed at the time when it is being recorded. The transcripts or translations have to be delivered with a short additive latency, e.g. within 2 seconds. There are some implementations of Whisper for streaming, but their approach is rather naive: they e.g. first record a 30-second audio segment and then process it. The latency of these methods is large, and the quality at the segment boundaries is low because simple content-unaware segmentation can split a word in the middle.
In this work, we implement, evaluate and demonstrate Whisper in simultaneous streaming mode using the simple but effective LocalAgreement (Liu et al., 2020) algorithm. LocalAgreement is one particular streaming policy that can be used to convert any full-sequence-to-full-sequence model to operate in simultaneous streaming mode. It was used by the winning system CUNI-KIT at the IWSLT 2022 simultaneous speech translation shared task (Polák et al., 2022). We call our implementation Whisper-Streaming, although it is applicable to any model with an API similar to Whisper's. According to our evaluation, it achieves 3.3 seconds latency on average for English ASR on the European Parliament speech test set ESIC (Macháček et al., 2021), when running on an NVIDIA A40 GPU, a fast hardware processing unit. We also test it on German and Czech ASR and present the results and suggestions for the optimal parameters.
The contribution of this work is the implementation, evaluation and demonstration of Whisper-Streaming. Given that Whisper-Streaming can be quickly and easily packaged into a product, we want to ensure that the most recent scientific results, such as the algorithm for simultaneous mode, are accessible to and used by industrial researchers and engineers. Furthermore, we want to reliably evaluate the performance of our implementation and share the results with the research community, to further drive research and development of real-time transcription solutions which have real-life use cases. We expect that our results can serve as strong baselines for future comparison.
We make Whisper-Streaming publicly available1 along with a demonstration video.2

Background
In this section, we describe the background for the back-end components of our work.
Whisper (Radford et al., 2022) is a Transformer model for speech-to-text transcription and translation trained on a massive amount of multilingual data. We use the "large-v2"3 model because it achieves the highest quality of all Whisper model size options. Since the original release of the Whisper backend is rather slow, we use the faster-whisper4 reimplementation of Whisper inference using CTranslate2, a fast inference engine for Transformer models. It is approximately four times faster than the standard implementation (as reported by the authors). We use it with 16-bit float precision.
Although we primarily use Whisper, the underlying model in our implementation can be easily replaced by any other speech-to-text transcription or translation model (e.g. MMS, Pratap et al., 2023) if it produces word-level timestamps and punctuation.
Streaming Let us assume a model M that processes a source sequence c_1, ..., c_n into a target sequence t_1, ..., t_m, given a previous target s that can be used for inter-sentence coherence. Streaming involves receiving the source sequence consecutively, one chunk at a time, and producing the target simultaneously. A streaming policy P predicts a target segment t_T at time T as t_T := P_M(c_{i<T} | s, t_{j<T}). It operates the model M on the available source chunks c_{i<T}, the previous target sequence s, and the previous target segments t_{j<T}. The policy is triggered every time a new source segment is available. An empty target segment can be emitted, e.g. when waiting for context. The policy aims to minimize latency and maximize target quality.
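The policy abstraction can be sketched in code. This is an illustrative sketch, not the actual implementation; the names (Model, streaming_policy) are hypothetical:

```python
from typing import Callable, List

# Hypothetical type: a model maps the audio chunks received so far and a
# textual prompt to a full word-level hypothesis.
Model = Callable[[List[bytes], str], List[str]]

def streaming_policy(model: Model, chunks: List[bytes],
                     prompt: str, confirmed: List[str]) -> List[str]:
    """One policy update t_T := P_M(c_{i<T} | s, t_{j<T}): run the model
    on all source chunks available at time T and emit only the target
    words beyond what was already output; emit an empty segment when
    the new hypothesis contradicts the earlier output."""
    hypothesis = model(chunks, prompt)
    if hypothesis[:len(confirmed)] == confirmed:
        return hypothesis[len(confirmed):]
    return []  # wait for more context
```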
Streaming was originally proposed for simultaneous translation (Ma et al., 2019), but it is applicable to any sequence-to-sequence task, including ASR. Dong et al. (2022) give a summary of streaming speech translation.
LocalAgreement (Liu et al., 2020) is a streaming policy that outputs the longest common prefix of the model outputs on n consecutive source chunks, or an empty segment when fewer than n chunks are available. In the IWSLT 2022 shared task on simultaneous translation, the CUNI-KIT system compared LocalAgreement to other policies (hold-n and wait-k) with different chunk sizes. They found that LocalAgreement with n = 2 was the most effective policy. Therefore, we use LocalAgreement-2 for identifying stabilized target segments.
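LocalAgreement-n reduces to a longest-common-prefix computation over word-level hypotheses. A minimal sketch (function names are ours):

```python
def longest_common_prefix(hyps):
    """Longest common prefix, in words, across a list of hypotheses."""
    prefix = []
    for words in zip(*hyps):
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break
    return prefix

def local_agreement(hyps, n=2):
    """LocalAgreement-n: confirm the longest common prefix of the last
    n hypotheses; emit nothing until n hypotheses are available."""
    if len(hyps) < n:
        return []
    return longest_common_prefix(hyps[-n:])
```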

Whisper-Streaming
We describe the core components and inner workings of Whisper-Streaming. It consists of the update loop, the audio buffer, skipping the confirmed output in the audio buffer, trimming the buffer, joining for inter-sentence context, and optional voice activity detection.

Update Loop
The main part of Whisper-Streaming is a program that runs a loop to receive source audio chunks and trigger streaming policy updates. The parameter MinChunkSize determines the minimal audio duration processed per iteration and thus impacts both latency and quality. If the update computation exceeds MinChunkSize, the next update is performed immediately on all the audio input accumulated in the meantime.
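The loop can be sketched as follows; the chunk interface is a hypothetical simplification (real audio I/O and the Whisper call are omitted):

```python
def update_loop(chunks, process, min_chunk_size=1.0):
    """Sketch of the update loop: `chunks` yields (samples, duration)
    pairs; accumulate at least min_chunk_size seconds of audio, then
    trigger one streaming-policy update on everything accumulated so
    far. If an update takes longer than min_chunk_size, more audio
    simply accumulates and the next update runs immediately on it."""
    pending, pending_dur, updates = [], 0.0, 0
    for chunk in chunks:
        pending.append(chunk)
        pending_dur += chunk[1]  # duration in seconds
        if pending_dur >= min_chunk_size:
            process(pending)     # one policy update
            pending, pending_dur = [], 0.0
            updates += 1
    return updates
```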
Audio buffer Whisper is trained to handle sequences that are up to 30 seconds long and contain one full sentence. It provides punctuation and word-level timestamps.5 The process is illustrated in Figure 1. Each update involves appending the incoming audio to the audio buffer and processing the entire buffer with Whisper. We keep the invariant that the buffer always starts with a new sentence, to maintain the high quality of Whisper. LocalAgreement-2 is applied to the current and previous Whisper output. The timestamp of the last word in the "confirmed output" is saved. In subsequent updates, we always reprocess Whisper from the beginning of the buffer, including the portion preceding the last "confirmed output" timestamp (indicated by the gray background in Figure 1). Changes to the transcription in the confirmed portion are disregarded, as they are often insignificant in terms of meaning alteration.
Skipping the confirmed part When determining the position of transcribed words relative to the last confirmed word from the previous update, we account for potential inaccuracies and updates in Whisper's timestamps caused by new audio chunks. If a word's timestamp falls within a 1-second interval around the last confirmed word, we compare its preceding n-grams (where n ranges from 1 to 5) with the suffix of the last confirmed output. If they match, we skip those words. This rule could be further enhanced in future work, e.g. by setting and fine-tuning a character edit distance threshold, or by trimming punctuation and casing from the n-grams.
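A minimal sketch of this n-gram skipping rule, with hypothetical names and a simplified word representation of (word, timestamp) pairs:

```python
def skip_overlap(new_words, confirmed, confirmed_end,
                 max_ngram=5, window=1.0):
    """new_words: (word, timestamp) pairs from the latest hypothesis;
    confirmed: plain words already emitted; confirmed_end: timestamp
    of the last confirmed word. For leading words whose timestamps lie
    within `window` seconds of confirmed_end, compare the candidate
    n-gram (n = max_ngram down to 1) against the confirmed suffix and
    drop it on a match; otherwise return the hypothesis unchanged."""
    for n in range(min(max_ngram, len(new_words), len(confirmed)), 0, -1):
        candidate = [w for w, _ in new_words[:n]]
        in_window = all(abs(t - confirmed_end) <= window
                        for _, t in new_words[:n])
        if in_window and candidate == confirmed[-n:]:
            return new_words[n:]
    return new_words
```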
Trimming the audio buffer To avoid unacceptably long spikes in latency, the audio buffer is limited to around 30 seconds. When the confirmed output includes a sentence-ending punctuation mark followed by a word starting a new sentence, the buffer is trimmed at the punctuation mark's timestamp. A language-specific sentence segmentation tool (e.g. Koehn et al., 2007) is used for this purpose, ensuring that the buffer always contains a single sentence. Despite this, if the buffer length exceeds 30 seconds, we retain only the last confirmed segment marked by Whisper.
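The sentence-boundary trimming can be sketched with a plain punctuation heuristic (the actual system uses a language-specific sentence segmenter; names are ours):

```python
SENT_END = (".", "!", "?")

def trim_point(confirmed):
    """Return the timestamp at which the audio buffer can be trimmed:
    the end time of the last confirmed word that closes a sentence and
    is followed by at least one word of the next sentence; None if no
    complete sentence has been confirmed yet. `confirmed` is a list of
    (word, end_time) pairs."""
    point = None
    for (word, end), _next in zip(confirmed, confirmed[1:]):
        if word.endswith(SENT_END):
            point = end
    return point
```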
Joining for inter-sentence context The Whisper transcribe function utilizes a "prompt" parameter to maintain consistency within a document (consistent style, terminology, and inter-sentence references). We extract the last 200 words from the confirmed output of previous audio buffers as the "prompt" parameter, as shown in Figure 1 (yellow-backgrounded text).
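Assembling the prompt from the confirmed output can be sketched as follows (a hypothetical helper; the input is the confirmed text already trimmed out of the audio buffer):

```python
def build_prompt(scrolled_out_words, max_words=200):
    """Join the last max_words confirmed words that were already
    trimmed out of the audio buffer into Whisper's `prompt` string."""
    return " ".join(scrolled_out_words[-max_words:])
```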
Voice activity detection A parameter activates or deactivates Whisper's default voice activity detection (VAD) filter, impacting both quality and latency.

Benchmarking Settings
We describe the evaluation dataset, metrics, settings, and hardware used to evaluate our model.
Evaluation Data For latency and quality analysis, we utilize the dev set of the manually transcribed ESIC corpus (Macháček et al., 2021) for English, German, and Czech ASR, containing 179 documents. The corpus contains 5 hours of original English speeches from the European Parliament, including simultaneous interpreting into German and Czech. It provides audio tracks with manual transcripts and word-level timestamps.
WER We use word error rate (WER), computed after removing punctuation and casing, as the standard measure of ASR quality.
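The metric can be reproduced with a short word-level edit distance; the normalization below (lowercasing, stripping punctuation) is a plain sketch of the described preprocessing, not the exact scoring script:

```python
import re

def normalize(text: str) -> list:
    """Lowercase and remove punctuation before scoring."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def wer(ref: list, hyp: list) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the reference length, computed with a one-row DP table."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,         # deletion
                       d[j - 1] + 1,     # insertion
                       prev + (r != h))  # substitution or match
            prev = cur
    return d[-1] / max(len(ref), 1)
```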
Latency For our latency analysis, we implement our own method wherein we use the timestamps provided in the ESIC corpus to align the gold transcripts to the ASR output using edit distance.6 This allows us to determine the edit operations for each gold word. We calculate the ASR latency by measuring the time difference between when the ASR emitted a word and when the corresponding gold word was spoken, excluding words deleted by the ASR. We compute the average latency within each document and, when comparing different setups across multiple documents, we report the average latency along with the standard deviation.
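A sketch of this latency measure, using Python's difflib as a stand-in for the edit-distance alignment (input formats are hypothetical, and only exact word matches are aligned here, whereas the full method also tracks substitutions):

```python
import difflib

def word_latencies(gold, asr):
    """gold: (word, spoken_time) pairs; asr: (word, emit_time) pairs.
    Align gold words to ASR words and return emit - spoken delays for
    aligned words; gold words the ASR deleted contribute nothing."""
    sm = difflib.SequenceMatcher(a=[w for w, _ in gold],
                                 b=[w for w, _ in asr])
    delays = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            for i, j in zip(range(i1, i2), range(j1, j2)):
                delays.append(asr[j][1] - gold[i][1])
    return delays

def average_latency(gold, asr):
    """Per-document average latency in seconds."""
    d = word_latencies(gold, asr)
    return sum(d) / len(d)
```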

GPU VAD WER [%] latency [s]
A40 off 5.8±0.9 2.85±0.45
A40 on  5.2±0.9 3.12±0.36
L40 off 5.1±1.0 3.58±0.62
L40 on  5.0±0.6 3.96±0.81

Hardware For benchmarking, we use NVIDIA A40 and L40 GPUs. We run Whisper on a computer in a cluster that is used by other processes at the same time, which may allocate the same resources and influence the latency. Since it is not always possible to have a dedicated server for a given service, this makes our evaluation realistic. Since there will be variations in the latency metrics, we report means and standard deviations.
Ensuring Reproducibility We simulate real-time processing of long-form transcription and record the times when Whisper emitted the outputs. We run the simulation on computers in a cluster that are not entirely under our control. For our simulation process, we block one GPU and a sufficient number of CPUs and RAM capacity. However, other processes may run at the same time, creating CPU and RAM load that unpredictably slows down our simulation. If MinChunkSize is smaller than the time needed to process an update, then two runs of the same simulation have different segmentations into chunks, leading to different WER and latency. Therefore, we run the simulation of the same setup on one document 10 times, to measure the standard deviation of the latency and quality. The setup is English transcription of the ESIC dev.20080925.013_007 document, which is 3 minutes 36 seconds long, on an NVIDIA A40 or L40 GPU with 48 GB of GPU RAM, 8 blocked CPU cores and 200 GB of CPU RAM, with or without the VAD filter, with MinChunkSize 0.1 seconds.
The results are in Table 1. We observe a small, negligible standard deviation in WER, below or near 1%. The standard deviation in the average latency is much larger, from 0.36 to 0.81 seconds depending on the setup. We conclude that we must be aware of the standard deviation of latency due to uncontrollable computation conditions.

Results
We evaluated Whisper-Streaming with various setups for English, German and Czech ASR. We first show the impact of outliers and voice activity detection (VAD) to determine optimal settings, and then present our main results with these settings.
Outliers After processing many setups, we observed an extraordinarily high WER on English ASR of the document dev2.20101213.015_018_EN_Gallagher. We realized it is due to noise in the ESIC data set. The first half of the mentioned document is in Irish, not English as intended. Only the English part is transcribed in the gold reference, but Whisper transcribed both, so its transcript covers more speech than the reference. Except for the Gallagher document, all the reported setups achieved WER between 0 and 52%, and average latency between 0 and 16.1 seconds.
VAD We studied the effect of the VAD (voice activity detection) filter that is integrated within the Whisper backend. The results are in Table 2 and Figure 2. We found that on the ESIC corpus, it is advisable to deactivate the VAD filter for the English original speech, because it is very fluent, not interleaved with silence, and has no non-voice sounds. Without VAD, the quality remains nearly the same (difference within 0.2% WER), and the average latency is substantially lower, by 0.23 to 0.41 seconds.
For the processing of simultaneous interpreting, we recommend activating the VAD filter. The speech of a simultaneous interpreter contains many pauses, especially when waiting for context. With VAD, the latency was only 0.1 seconds larger, because VAD often filters out silence, which reduces the processing load. The quality with VAD was substantially higher, by 2 to 3% WER with shorter MinChunkSize on German. With large chunk sizes, the quality is nearly the same (0.3% WER difference with 2-second MinChunkSize) because a large chunk size gives the model a large context and thus a low chance of risking uncertain output. Therefore, we activated VAD for German and Czech simultaneous interpreting, and we deactivated it for the English original speech.
For a real-life setup, we recommend starting Whisper-Streaming shortly before the speech actually starts, so that the first words are not missed, along with turning the VAD filter on so that silence and non-voice sounds do not cause Whisper to make mistakes. If reducing the latency is important, an adaptive protocol for switching VAD on and off can be implemented.
Performance Table 3 and Figure 3 summarize the WER and average latency of Whisper-Streaming on the ESIC validation set for the three language tracks. Overall, with 1-second MinChunkSize, the average latency is 3.3 seconds for English, 4.4 seconds for German and 4.8 seconds for Czech, while the WER is 2% higher than in the offline mode for English and German, and 6% higher for Czech. Both WER and latency are lowest on English, followed by German and Czech. This is related to the amount of language-specific data used for training Whisper, as well as to the morphological complexity of these languages. The latency increases with larger uncertainty because more updates are required for an agreement. Moreover, the larger the MinChunkSize, the larger the latency, but the higher the quality, because the system has sufficient context.
Offline mode WER We contrast the results with setups that serve as maximum performance estimates. One of them is offline mode, in which processing of the whole audio document is done after recording, without any limitations on processing time. It is the default and most optimized setup for Whisper. The WER in offline mode and with VAD is lower than in streaming mode because the context size is not restricted. The model can use even the right (future) context that is unavailable or limited in streaming mode. Moreover, the internal segmentation of the long-form speech into processing chunks is optimized in the offline mode.
Computationally unaware latency Another contrastive setup is the computationally unaware simulation. It uses the unrealistic assumption that Whisper's computation on any audio segment is instant, so that the latency caused by computation is not included in the latency measurement. The measurement includes only the latency caused by uncertainty in the language. The gap between the computationally aware and unaware latency can be reduced by optimizing the hardware or the inference algorithm. The computationally unaware latency itself can be reduced by improving the model or the streaming policy.
We observe that the average computationally unaware latency is approximately twice the chunk size. This is expected because we use local agreement of two consecutive updates. However, the processing of English is actually faster, a little less than twice the chunk size. We hypothesize that this could be caused by the anticipation ability of the Whisper model. The second possible reason is the inaccuracy of the gold timestamps in ESIC. The timestamps were computed by automatic forced alignment, and thus they may be less accurate in non-standard situations such as overlapping and non-transcribed speech, e.g. hesitations and foreign language insertions.

System Demonstration
The demonstration video processes live ASR on one ESIC document in three parallel instances for English, German and Czech speech, the original and simultaneous interpreting.
The video shows a contrast with the gold transcripts with original timing, so that the latency can be observed. The video also contains color highlighting of ASR errors.
Integration with ELITR To demonstrate practical usability, we integrate Whisper-Streaming with ELITR (European Live Translator, Bojar et al., 2020), a framework for complex distributed systems for multi-source and multi-target live speech transcription and translation (Bojar et al., 2021a). Within Whisper-Streaming, we implement and release a server that connects as a worker to the Mediator server (Franceschini et al., 2020). Mediator allows a client to request the service of a worker. The client can then further process the text outputs received from the worker, e.g. translate them with another worker and present them on the web view server that delivers real-time captions to event participants during a live multilingual event.
Evaluation event We evaluated Whisper-Streaming as a component of an experimental live speech translation service at a multilingual conference. For this, we built a pipeline that used five parallel Whisper-Streaming workers, three of them for ASR only (English, Czech and Ukrainian), and two for speech translation (Czech-to-English and Ukrainian-to-English). There were three parallel language streams at the conference: Czech, English and Ukrainian. One of the languages was spoken on the main floor, and the others were provided by human simultaneous interpreters.
A human operator (as in Bojar et al., 2021b) controlled the technical setup and the outputs using their language knowledge and had the option to redirect the streams if necessary. The qualitative evaluation at the event showed that Whisper-Streaming is a robust and reliable part of the service, reaching acceptable latency and unexpectedly high quality on English, Czech and Ukrainian long-form speech.
Demonstration at AACL Our system demonstration at the IJCNLP-AACL 2023 conference will use the ELITR framework. We will either simulate a speech source from a recording, or allow participants to speak into a microphone in any of the 97 languages supported by Whisper and observe the real-time outputs.

Conclusion
We implemented, evaluated and demonstrated Whisper-Streaming, a tool that effectively operates the offline ASR model Whisper with 3.3-second average computationally aware latency on the English ESIC corpus. We described and explained the implementation and its underlying components, including the LocalAgreement algorithm for streaming. Lastly, we demonstrated its robustness and practical usability at a real-life multilingual conference.

Limitations
The data collected in the ESIC corpus were created a relatively long time ago. This raises concerns about potential leakage into the Whisper training set, which could compromise our evaluation. Additionally, performance tests on more affordable hardware are pending, highlighting the need for further evaluation in terms of computational cost.
It is worth noting that the reported latency and quality metrics obtained from ESIC may not be fully generalizable to other languages or language variants due to the nature of the corpus.
Furthermore, our focus is on demonstrating the online capabilities of Whisper rather than optimizing the algorithm or implementation.It is important to recognize that the actual latency experienced may fluctuate, and the reported average latency serves as an indicative measure without providing an upper bound.The streaming policy would need certain modifications to guarantee a maximum latency, at a possible loss in quality.
Lastly, we have not conducted comparison tests against other state-of-the-art systems, e.g. from IWSLT, because a common evaluation framework is pending, as is an X-to-English long-form speech test set.

Figure 1: Illustration of processing three consecutive updates. The yellow highlighted text is a "prompt", the previous context to follow. The black-bordered rectangle is an audio buffer, and the text inside is Whisper's transcript generated from that sound segment. The blue vertical line is a timestamp that splits the buffer into two parts: the left is previously confirmed, and the right is unconfirmed. The LocalAgreement-2 policy, i.e. searching for the longest common prefix, is applied to the unconfirmed (right) part of two subsequent updates. The longest common prefix is highlighted in green; the green underline highlights the newly confirmed output, whereas the green dashed underline indicates previously and subsequently confirmed output. The gray underline demonstrates an update in the confirmed part that is disregarded.

Figure 2: Impact of the VAD filter on latency and quality. The striking difference with VAD activated or deactivated for English vs. German is due to German being the speech of an interpreter.

Table 1: Average (±stddev) WER and latency of English ASR over 10 repeated runs on the ESIC dev.20080925.013_007 document, with MinChunkSize 0.1 seconds, with or without the VAD filter, on two GPU types. Bold is the setup that we later use.