ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) – each task is supported with a wide variety of approaches, differentiating ESPnet-ST-v2 from other open source spoken language translation toolkits. This toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models. In this paper, we describe the overall design, example models for each task, and performance benchmarking behind ESPnet-ST-v2, which is publicly available at https://github.com/espnet/espnet.


Introduction
The objective of this project is to contribute to the diversity of the open-source spoken language translation ecosystem. Toward this end, we launched the ESPnet-ST-v2 update in collaboration with researchers working on Fairseq (Ott et al., 2019) and TorchAudio (Yang et al., 2021b). This project focuses on three tasks: offline speech-to-text translation (ST), simultaneous speech-to-text translation (SST), and offline speech-to-speech translation (S2ST). These three spoken language translation tasks have drawn significant interest, as evidenced by rising participation in the IWSLT (International Workshop on Spoken Language Translation) shared tasks.
The ST task can be considered a base form of spoken language translation. Early approaches to ST stemmed from coupling statistical automatic speech recognition (ASR) (Huang et al., 2014) and text-to-text translation (MT) (Al-Onaizan et al., 1999), and this type of cascaded approach is still common in the neural network era (Bentivogli et al., 2021; Zhang et al., 2022). End-to-end differentiable (E2E) approaches have recently emerged as an alternative, offering greater simplicity and superior performance in some cases (Inaguma et al., 2021b); however, E2E approaches still benefit from techniques originating from ASR and MT (Gaido et al., 2021; Inaguma et al., 2021a).
SST modifies ST by imposing an additional streaming requirement: systems are expected to produce textual outputs while incrementally ingesting speech input. Both the aforementioned cascaded and end-to-end approaches to ST have been adapted for SST (Ma et al., 2020b; Iranzo-Sánchez et al., 2021; Chen et al., 2021), although the more direct nature of the latter may be advantageous for latency-sensitive applications. S2ST, on the other hand, extends ST by producing target speech rather than target text. Again, cascaded approaches of ST followed by text-to-speech (TTS) came first (Waibel et al., 1991; Black et al., 2002) and E2E approaches followed (Lee et al., 2022a; Jia et al., 2022a; Inaguma et al., 2022), with the latter offering smaller footprints and greater potential to retain source speech characteristics.
Given the recent swell in E2E ST, SST, and S2ST research, we have revamped ESPnet-ST (Inaguma et al., 2020), which previously only supported E2E ST. In particular, this work:
• Extends the toolkit with new tasks (SST and S2ST) and new modeling approaches, including discrete unit based models for S2ST.
• Benchmarks the ST, SST, and S2ST performance of ESPnet-ST-v2 against top IWSLT shared task systems and other prior works.
With this major update, ESPnet-ST-v2 keeps pace with the interests of the community and offers a variety of unique features, making it a valuable complement to Fairseq (Wang et al., 2020), NeurST (Zhao et al., 2021), and other spoken language translation toolkits.
In Table 1 we compare ESPnet-ST-v2 to Fairseq (Wang et al., 2020) and NeurST (Zhao et al., 2021), two toolkits that also cover multiple types of spoken language translation. Fairseq and NeurST offer cascaded and E2E approaches to ST and SST (some of which are not offered by ESPnet-ST-v2). Meanwhile, ESPnet-ST-v2 focuses on E2E approaches and offers multiple unique core architectures not covered by the other toolkits. For S2ST, Fairseq and ESPnet-ST-v2 both offer a range of approaches. All told, ESPnet-ST-v2 offers the greatest variety across ST, SST, and S2ST; however, we view these toolkits as complementary. The following section elaborates on the unique features of ESPnet-ST-v2.

ESPnet-ST-v2
In this section, we first describe the overall design and then introduce a few key features. Figure 1 illustrates the software architecture of ESPnet-ST-v2. This modular design is an improvement over ESPnet-ST-v1, where monolithic model and task definitions made the toolkit more difficult to extend and modify. We also designed ESPnet-ST-v2 such that modules developed for adjacent tasks (e.g., ASR, TTS, MT) can be readily used for spoken language translation.

Modular Design
In ESPnet-ST-v2, major neural network modules, such as frontends, encoders, decoders, search algorithms, and loss functions, inherit from common abstract classes, making them easy to interchange. These modules, which are detailed further in the next subsection, are used as building blocks in wrapper classes which construct model architectures. The fully constructed models are then fed to task wrappers which prepare data loaders, initialize models, and handle training/validation. For inference, pythonic APIs invoke search algorithms over the trained models and direct outputs to scoring scripts. For instance, the third-party SimulEval tool for evaluating SST latency (Ma et al., 2020a) is integrated via this API layer. We are also integrating with TorchAudio (Yang et al., 2021b) in the same manner. Finally, recipe scripts define experimental pipelines from data preparation to evaluation.
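As a rough illustration of this layering, the following sketch shows how interchangeable modules plug into a model wrapper and how a pythonic inference API can sit on top. All class and method names below are simplified stand-ins for illustration, not the toolkit's actual identifiers.

```python
from abc import ABC, abstractmethod
import torch

# Hypothetical abstract interfaces; the real ESPnet-ST-v2 classes differ in detail.
class AbsEncoder(torch.nn.Module, ABC):
    @abstractmethod
    def forward(self, feats: torch.Tensor, lengths: torch.Tensor):
        """Return encoder states and their lengths."""

class AbsDecoder(torch.nn.Module, ABC):
    @abstractmethod
    def forward(self, enc_states: torch.Tensor, prev_tokens: torch.Tensor):
        """Return next-token logits conditioned on encoder states."""

# Model wrapper: composes interchangeable building blocks into one architecture.
class STModel(torch.nn.Module):
    def __init__(self, frontend, encoder: AbsEncoder, decoder: AbsDecoder, loss_fn):
        super().__init__()
        self.frontend, self.encoder, self.decoder, self.loss_fn = (
            frontend, encoder, decoder, loss_fn)

    def forward(self, speech, speech_lengths, tokens):
        feats, feat_lengths = self.frontend(speech, speech_lengths)
        enc, enc_lengths = self.encoder(feats, feat_lengths)
        logits = self.decoder(enc, tokens[:, :-1])
        return self.loss_fn(logits, tokens[:, 1:])

# Pythonic inference API: pairs a trained model with a search algorithm so that
# external tools (e.g. SimulEval) only need to call translate().
class Speech2Text:
    def __init__(self, model: STModel, search):
        self.model, self.search = model, search

    @torch.no_grad()
    def translate(self, speech: torch.Tensor) -> str:
        lengths = torch.tensor([speech.size(0)])
        feats, feat_lengths = self.model.frontend(speech.unsqueeze(0), lengths)
        enc, _ = self.model.encoder(feats, feat_lengths)
        return self.search(self.model.decoder, enc)
```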

Key Features
Each of the following modeling components features a variety of interchangeable approaches.

Frontends & Targets Spectral features (e.g., FBANK) and features extracted from speech self-supervised learning (SSL) representations are supported, as well as fusions over multiple features (Berrebbi et al., 2022). For speech SSL features, ESPnet-ST-v2 integrates with the S3PRL toolkit (Yang et al., 2021a). These speech SSL representations are also used to generate discrete targets for S2ST (Lee et al., 2022a).

Encoder Architectures Conformer (Gulati et al., 2020; Guo et al., 2021), Branchformer, E-Branchformer (Kim et al., 2023), and Transformer (Vaswani et al., 2017; Karita et al., 2019) encoder architectures are supported for ST and S2ST. For SST, a blockwise scheme is adopted (Tsunoo et al., 2021; Deng et al., 2022) to form contextual block Conformer and Transformer encoders. Intermediate CTC (Lee and Watanabe, 2021) and hierarchical CTC (Sanabria and Metze, 2018) encoding are also supported; these techniques have been shown to stabilize deep encoder optimization (Lee and Watanabe, 2021) and improve representations for sequence tasks involving source-to-target re-ordering (Yan et al., 2023).
Decoder Architectures Attentional Transformer and recurrent neural network decoders are supported (Karita et al., 2019). Multi-decoder schemes, which allow for E2E differentiable decoder cascades via searchable hidden intermediates (Dalmia et al., 2021), are also supported; this technique has been shown to improve sequence modeling for tasks which naturally decompose into sub-tasks. Finally, large language model decoders (e.g., mBART (Liu et al., 2020b)) can be adopted through an integration with HuggingFace (Wolf et al., 2020).
Loss Functions Cross-entropy (for attentional decoders), CTC, and Transducer losses are supported for ST and SST. Multi-objective training with CTC/attention and CTC/transducer, as well as multi-task training (e.g., ASR/MT/ST), is also supported. For S2ST, L1 and mean squared error losses are also supported for spectral models.
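As a minimal sketch of what multi-objective CTC/attention training looks like in PyTorch (the interpolation weight, padding convention, and function name here are illustrative assumptions, not the toolkit's actual configuration):

```python
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_logits, enc_lengths, ctc_targets, ctc_target_lengths,
                              att_logits, att_targets,
                              blank_id=0, pad_id=-1, ctc_weight=0.3):
    """Interpolate a CTC loss on encoder outputs with an attention cross-entropy
    loss on decoder outputs; the 0.3/0.7 weighting is illustrative only."""
    # CTC expects (T, B, V) log-probabilities.
    log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, ctc_targets, enc_lengths, ctc_target_lengths,
                     blank=blank_id, zero_infinity=True)
    # Attention branch: token-level cross-entropy over the shifted targets,
    # ignoring padding positions.
    att = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                          att_targets.reshape(-1), ignore_index=pad_id)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att
```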
Search Algorithms For offline attentional decoder models, label-synchronous beam search is supported, with optional CTC joint decoding for multi-objective models (Watanabe et al., 2017). For offline Transducer models, the original Graves beam search (Graves, 2012) as well as time-synchronous and alignment-synchronous beam searches (Saon et al., 2020) are supported. For SST, both incremental decoding and non-incremental (allowing re-translation) decoding are supported (Liu et al., 2020a). Blockwise attentional decoder models use label-synchronous beam search, or time-synchronous beam search if a CTC branch is available. Blockwise transducer models use time-synchronous beam search.

Synthesis & Post-processing For ST, Minimum Bayes Risk (MBR) ensembling (Fernandes et al., 2022) is supported for leveraging quality metrics (e.g., BLEU) to compare and rank n-best outputs from one or more models. For S2ST, neural vocoders are supported for both spectral and discrete inputs (Hayashi et al., 2020, 2021).
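A simplified sketch of MBR selection over an n-best list is shown below, using sentence-level BLEU from sacreBLEU as the utility metric and uniform hypothesis weights; the function name and these choices are assumptions for illustration.

```python
from sacrebleu.metrics import BLEU

def mbr_select(nbest: list[str]) -> str:
    """Pick the hypothesis with the highest expected utility, where utility is
    BLEU of a candidate scored against every other hypothesis in the n-best list."""
    if len(nbest) == 1:
        return nbest[0]
    bleu = BLEU(effective_order=True)  # sentence-level BLEU

    def expected_utility(i: int) -> float:
        others = [h for j, h in enumerate(nbest) if j != i]
        return sum(bleu.sentence_score(nbest[i], [h]).score for h in others) / len(others)

    return nbest[max(range(len(nbest)), key=expected_utility)]

# Usage: pool n-best outputs from one or more models into a single candidate list.
# best = mbr_select(["the cat sat down", "a cat sat down", "the cat sat"])
```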

Example Models
In this section, we introduce example models which are pre-built in ESPnet-ST-v2 using the neural network components described in the previous section. These examples include state-of-the-art core architectures, as evidenced by prior studies and our performance benchmarking (presented in §5).

ST Models
CTC/Attention (CA) Following Yan et al. (2023), we use Conformer encoders with hierarchical CTC encoding and Transformer decoders. The hierarchical CTC encoding, which aligns the first N layers of the encoder towards ASR targets and the last M layers towards ST targets, regularizes the final encoder representations to be monotonic with respect to the target. CTC/attention models are jointly decoded using either label-synchronous (wherein the attention branch is primary) or time-synchronous (wherein the CTC branch is primary) beam search. For offline tasks, label-synchronous decoding has shown greater performance (Yan et al., 2023).
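A schematic of how such hierarchical CTC encoding can be wired up is sketched below; the layer split, vocabularies, and class name are illustrative assumptions rather than the toolkit's actual implementation (masks and lengths are omitted for brevity).

```python
import torch

class HierarchicalCTCEncoder(torch.nn.Module):
    """The first `n_asr` encoder layers feed an auxiliary ASR CTC head, while the
    remaining layers feed the final ST CTC head, encouraging the upper layers to
    produce target-order (re-ordered) representations."""
    def __init__(self, layers: torch.nn.ModuleList, n_asr: int,
                 d_model: int, asr_vocab: int, st_vocab: int):
        super().__init__()
        self.lower = layers[:n_asr]            # aligned towards source transcripts
        self.upper = layers[n_asr:]            # aligned towards target translations
        self.asr_ctc = torch.nn.Linear(d_model, asr_vocab)
        self.st_ctc = torch.nn.Linear(d_model, st_vocab)

    def forward(self, x: torch.Tensor):
        for layer in self.lower:
            x = layer(x)
        asr_logits = self.asr_ctc(x)           # intermediate CTC branch (ASR)
        for layer in self.upper:
            x = layer(x)
        st_logits = self.st_ctc(x)             # final CTC branch (ST)
        # A CTC loss is applied to each branch and the two are interpolated,
        # analogously to the multi-objective loss sketch above.
        return x, asr_logits, st_logits
```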
Multi-Decoder CTC/Attention (MCA) As shown in Figure 2.a, the Multi-decoder decomposes ST into two sub-tasks, logically corresponding to ASR and MT encoder-decoder models, while maintaining E2E differentiability (Dalmia et al., 2021). This Multi-decoder scheme is also combined with the CTC/attention scheme described above, following Yan et al. (2023).
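Conceptually, the Multi-decoder chains two attentional encoder-decoders through a searchable intermediate, as sketched below; the component names and the way the second stage consumes the first decoder's outputs are simplifying assumptions for illustration.

```python
import torch

class MultiDecoderST(torch.nn.Module):
    """An ASR sub-net produces an intermediate transcript via beam search; an
    MT-like sub-net then encodes that transcript and decodes the translation,
    while the full pipeline remains end-to-end differentiable at training time."""
    def __init__(self, speech_encoder, asr_decoder, text_encoder, st_decoder):
        super().__init__()
        self.speech_encoder = speech_encoder
        self.asr_decoder = asr_decoder    # first-pass decoder (source transcript)
        self.text_encoder = text_encoder  # encodes the searched intermediate
        self.st_decoder = st_decoder      # second-pass decoder (target translation)

    @torch.no_grad()
    def translate(self, speech, asr_search, st_search):
        h_speech = self.speech_encoder(speech)
        # 1) Search the hidden intermediate: best source transcript hypothesis.
        asr_tokens = asr_search(self.asr_decoder, h_speech)
        # 2) Condition the second stage on the searched intermediate.
        h_text = self.text_encoder(asr_tokens)
        return st_search(self.st_decoder, h_text)
```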

SST Models
Time-Synchronous Blockwise CTC/Attention (TBCA) As shown in Figure 2.b, we adapt the aforementioned CTC/attention model for ST (§4.1) to SST by replacing the Conformer encoder with a contextual block Conformer (Tsunoo et al., 2021). During inference, we initially followed Deng et al. (2022) and used the label-synchronous CTC/attention beam search originally proposed for ASR by Tsunoo et al. (2021). However, we found that label-synchrony results in overly conservative boundary block detection for SST. Therefore, we opt instead for the time-synchronous variant, which relies on CTC's more robust end detection (Yan et al., 2023) to control boundary block detection; this change reduces latency without sacrificing quality. To perform incremental decoding without re-translation (as expected by SimulEval), hypotheses are pruned after processing all of the time steps for each encoder block.
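The incremental decoding loop can be sketched roughly as follows; the block-wise encoder interface, pruning policy, and helper names are simplified assumptions (the actual toolkit interfaces with SimulEval through its API layer).

```python
def incremental_blockwise_decode(speech_blocks, encode_block, step_beam, prune_to=1):
    """Process speech block by block: after each encoder block, extend hypotheses
    with the beam search over the new frames, then prune so that already-emitted
    prefixes are never retracted (i.e. no re-translation)."""
    hyps = [([], 0.0)]       # (token prefix, score)
    committed = []           # tokens already written to the output
    enc_cache = None
    for block in speech_blocks:
        enc_cache, new_frames = encode_block(block, enc_cache)
        hyps = step_beam(hyps, new_frames)              # extend over this block
        hyps = sorted(hyps, key=lambda h: -h[1])[:prune_to]
        best_prefix = hyps[0][0]
        newly_committed = best_prefix[len(committed):]  # tokens fixed by pruning
        committed = best_prefix
        yield newly_committed                           # corresponds to a WRITE action
```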
Blockwise Transducer (BT) As demonstrated by Xue et al. (2022), Transducers can be effectively applied to SST despite the monotonic nature of their underlying alignment model. We build Transducers for SST using contextual block Conformer encoders and unidirectional LSTM decoders. We found that the aforementioned hierarchical CTC encoding (§4.1) improves training stability and convergence rate. During inference, we found that the time-synchronous algorithm described by Saon et al. (2020) outperformed both the original Graves decoding (Graves, 2012) and the later-proposed alignment-synchronous algorithm (Saon et al., 2020). We also found that length normalization is required to avoid overly short outputs. Incremental decoding is applied in the same manner as for TBCA.

S2ST Models
Spectral Our spectral example models follow the Translatotron family (Jia et al., 2022a), predicting target spectral features directly from source speech; in particular, the Translatotron 2-style model decomposes S2ST into an ST sub-task followed by a text-to-spectrogram synthesis sub-task while remaining E2E differentiable. The predicted spectral features are converted to waveforms with a neural vocoder.
Discrete Multi-Decoder (UnitY) The UnitY model (Inaguma et al., 2022) is similar to Translatotron 2, but critically predicts discrete units derived from speech SSL representations rather than spectral information in the final stage. In other words, UnitY is a Multi-decoder consisting of an ST sub-task followed by a text-to-unit (T2U) sub-task. We use Transformer-based encoder-decoders for both sub-tasks. During inference, the ST stage is decoded first, followed by the T2U stage; both stages use label-synchronous beam search. The final speech is generated with a unit HiFi-GAN vocoder with FastSpeech-like duration prediction (Polyak et al., 2021; Lee et al., 2022a), which is separately trained in the ParallelWaveGAN toolkit (Hayashi et al., 2020, 2021).
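Discrete unit targets of this kind are typically obtained by clustering frame-level speech SSL features; the sketch below illustrates the general recipe with k-means (the SSL model, layer, number of clusters, and function names are illustrative assumptions, not the toolkit's defaults).

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_unit_codebook(ssl_features: list[np.ndarray], n_units: int = 500) -> KMeans:
    """Fit k-means over pooled frame-level SSL features (e.g. from one HuBERT layer)."""
    frames = np.concatenate(ssl_features, axis=0)   # (total_frames, feat_dim)
    return KMeans(n_clusters=n_units, random_state=0).fit(frames)

def speech_to_units(km: KMeans, feats: np.ndarray) -> list[int]:
    """Map one utterance's SSL features to a discrete unit sequence, collapsing
    consecutive repeats as is common for unit-based S2ST targets."""
    units = km.predict(feats).tolist()
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
```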

Performance Benchmarking
In this section, we 1) compare open-source toolkits, 2) compare our different example models, and 3) compare our models with top IWSLT shared task systems and state-of-the-art prior works.

Experimental Setup
Please refer to §A.1 for reproducibility details. The following is only a summary of our setup. Our base models are trained on ∼400h of single language pair data from a single corpus; we benchmark on MuST-C-v1 (English-to-X) for ST/SST and on CVSS-C (X-to-English) for S2ST. For ST/SST, we also use a "large" setting for benchmarking against IWSLT submissions. Our large models have 150-200M trainable parameters and are trained on ∼1000h of single language pair data from multiple corpora.
Scoring For ST/SST, we evaluate detokenized case-sensitive BLEU (Post, 2018). For SST, we additionally evaluate Average Lagging (AL) (Ma et al., 2020a). For S2ST, we evaluate ASR-BLEU by transcribing the generated speech and then evaluating the BLEU of this transcription.
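A minimal sketch of the ASR-BLEU computation, assuming some off-the-shelf ASR transcription function (the choice of ASR model and text normalization are not standardized; see Limitations):

```python
import sacrebleu

def asr_bleu(generated_wavs, references, transcribe) -> float:
    """Transcribe each generated target-language waveform with an ASR system,
    then score the transcripts against the text references with corpus BLEU.
    (Case/punctuation normalization of transcripts and references is omitted.)"""
    hyps = [transcribe(wav) for wav in generated_wavs]
    return sacrebleu.corpus_bleu(hyps, [references]).score
```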

Toolkit Comparison
ST Among our ST example models, the CTC/attention (CA) and Multi-decoder CTC/attention (MCA) models achieve the strongest performances. In Table 4, we scale these two approaches by training on larger corpora and increasing model capacity; our large MCA model outperforms the best IWSLT 2021 offline track submission on the 2020 test set with given segmentation.

SST Table 5 shows a variety of approaches, of which the blockwise Transducer (BT) and time-synchronous blockwise CTC/attention (TBCA) models have the lowest AL. We choose to scale the TBCA for comparison with IWSLT submissions due to its superior translation quality, but note that the BT has lower computational overhead, primarily because it lacks source-target attention computation; AL is not computation-aware. In Table 6, we fit the TBCA to the 2-second AL latency regime by selecting a block size of 32 and scale it with more data and model capacity; our large TBCA model would have ranked 3rd out of 6 amongst IWSLT 2022 submissions without using any SSL/LLM representations or knowledge distillation.
S2ST Table 7 compares a variety of approaches to prior works with comparable architectures; our S2ST models are generally on par with prior works considered state-of-the-art. In fact, all of our models slightly outperform their respective prior works except for Translatotron 2. Further, in Table 8 we ablate a range of SSL types for both the frontend and discrete units, demonstrating the flexibility of our toolkit.

Conclusion
We presented ESPnet-ST-v2, which now supports offline speech translation, simultaneous speech translation, and offline speech-to-speech translation. ESPnet-ST-v2 will continue to grow to support the community's interests. Future updates may include new tasks, such as simultaneous speech-to-speech translation, and cross-toolkit integrations via TorchAudio.

Limitations
The first set of limitations to be aware of are data-related. Although prior works have shown the feasibility of building E2E systems without source language transcriptions (Lee et al., 2022b; Chen et al., 2022; Zhang et al., 2021), in this work we only investigate cases where triplet data (source speech, source transcript, target translation) is available for ST/SST and where quadruplet data (source speech, source transcript, target translation, target speech) is available for S2ST. The second set of limitations are evaluation-related. For SST, we follow prior works (Ma et al., 2020a; Wang et al., 2020; Anastasopoulos et al., 2022) and evaluate AL, which measures how much the system's output lags behind the amount of input read. Notably, AL considers only this input-to-output lag and not the actual computation time. For S2ST, we follow prior works (Jia et al., 2022a; Inaguma et al., 2022) and evaluate ASR-BLEU. This evaluation is dependent on an ASR system, which is not standardized across prior works. Furthermore, our evaluation of S2ST outputs does not assess naturalness. Finally, in this work we have not conducted any human evaluation of translation outputs.