SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.


Introduction
Starting with ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), substantial work has shown that pre-trained models can bring significant improvements in various tasks, including natural language processing (NLP), image recognition, and speech processing (Radford et al., 2019; Lample and Conneau, 2019; Bao et al., 2021; Baevski et al., 2020). It is becoming a new principle to solve problems by pre-training a shared model with self-supervision tasks on a large amount of unlabeled data to learn universal representations.
Figure 1: An illustration of the proposed SpeechT5 framework, which treats all spoken language processing tasks as a speech/text to speech/text format, including voice conversion (VC), automatic speech recognition (ASR), text to speech (TTS), grapheme to phoneme (G2P), and so on.
Particularly, the "Text-To-Text Transfer Transformer" (T5) (Raffel et al., 2019) leverages a unified text-to-text framework and achieves state-of-the-art results on a wide variety of NLP tasks, including machine translation, question answering, sentiment classification, and so on. The basic idea of T5 is to treat every NLP problem as a "text-to-text" problem, and to employ transfer learning to boost the performance of downstream tasks. Within the same period, self-supervised speech representation learning has also been investigated and has shown promising results, benefiting from richly learned representations (Chung and Glass, 2018; Chuang et al., 2019; Song et al., 2019; Baevski et al., 2020; Wang et al., 2021; Chung et al., 2021a). A prominent line of work improves the acoustic encoder with speech pre-training, such as wav2vec 2.0 (Baevski et al., 2020), APC (Chung and Glass, 2020), MPC (Jiang et al., 2021), and HuBERT.
Another category of methods attempts to enhance a few spoken language understanding tasks by utilizing speech-language pre-training (Chung et al., 2021b; Kim et al., 2021). However, most of these models rely on an encoder-only model similar to BERT, and have task-specific model architectures for different tasks. How to design a unified encoder-decoder model that can take advantage of unlabeled speech and text data to improve various spoken language processing tasks is not well explored. Inspired by the T5 method, we attempt to convert each spoken language processing task, e.g., automatic speech recognition (ASR), text-to-speech (TTS), voice conversion (VC), and speaker identification (SID), into a speech/text to speech/text problem via an encoder-decoder framework, as shown in Figure 1, which enables us to use the same pre-trained model across diverse tasks.
To achieve this, we propose SpeechT5, a unified-modal encoder-decoder pre-training method for spoken language processing tasks. The proposed SpeechT5 contains an encoder-decoder backbone network and modal-specific pre/post-nets. With the pre-nets, the input speech/text is embedded in a shared vector space, and the encoder-decoder backbone network models the sequence-to-sequence conversion, from which the modal-specific post-nets generate the speech/text output. SpeechT5 is pre-trained with a denoising sequence-to-sequence method by leveraging large-scale unlabeled text and speech corpora. To align the textual and acoustic information into a unified semantic space, the proposed SpeechT5 model (1) maps text and speech representations into a shared vector quantization space, and (2) randomly mixes up the quantized latent representations and the contextual representations, which can explicitly guide the quantizer to learn the cross-modal information.
We fine-tune SpeechT5 on a wide variety of downstream spoken language processing tasks, including VC, ASR, TTS, and SID. Extensive results show that the proposed SpeechT5 model achieves a significant improvement on these spoken language processing tasks when compared with strong baselines. Specifically, the proposed SpeechT5 method performs better than the state-of-the-art voice Transformer network (VTN) on the VC task. On the SID task, it achieves a state-of-the-art top-1 accuracy of 90.97%, outperforming SpeechNet (Chen et al., 2021) and the pre-trained models reported in SUPERB. Besides, on the ASR task, SpeechT5 obtains gains of about 10.0% and 6.5% over the encoder-decoder based ASR model (i.e., Fairseq) and a baseline model, respectively. On the TTS task, it obtains significant improvements over the strong Transformer TTS model of 13.4% and 5.8% in terms of the word error rate and the mean opinion score, respectively.
The contributions of this paper are summarized as follows.
• To the best of our knowledge, this is the first work to investigate a unified encoder-decoder framework for various spoken language processing tasks.
• We propose a cross-modal joint pre-training method, which learns potential alignment between acoustic and textual representation with large-scale unlabeled speech and text data.
• Extensive experiments on spoken language processing tasks demonstrate the effectiveness and superiority of the proposed SpeechT5 method.

SpeechT5
In this section, we introduce the proposed SpeechT5 method, a unified-modal framework for learning joint contextual representations of speech and text based on a shared encoder-decoder model. It aims to derive generic representations for spoken and natural language via pre-training on unlabeled speech and text. In the following, we first introduce the overall architecture of SpeechT5 and the details of individual components (Section 2.1), and then present the pre-training method (Section 2.2) and the fine-tuning method (Section 2.3) for spoken language processing tasks.

Model Architecture
All spoken language processing tasks take speech or text as the input or output. Figure 2 shows the model architecture of the proposed SpeechT5 model, which consists of an encoder-decoder module and six modal-specific pre/post networks. The pre-nets convert the input speech/text to a unified space, and the shared encoder-decoder network models the sequence to sequence conversion. Then based on the output of the decoder, the post-nets generate the output in the speech/text modality.

Input/Output Representations
To train a single model on a diverse set of spoken language processing tasks, we cast all the tasks we consider into a speech/text to speech/text format, where the input/output is a speech sequence or text sequence. The 80-dimensional log-Mel filterbank feature extracted from each frame with the librosa tool (https://librosa.org/) is treated as a token. If the output is in the speech modality, we employ a vocoder (Kong et al., 2020) to transform the log-Mel filterbank feature into a waveform. For text, we split the text into a sequence of tokens by using a unigram language model (Kudo, 2018).
Encoder-Decoder This model is similar to the Transformer (Vaswani et al., 2017). The encoder consists of a stack of blocks, each of which comprises two subcomponents: a self-attention layer, followed by a small feed-forward network. Layer normalization (Ba et al., 2016) and residual connection (He et al., 2016) are applied to the input of each subcomponent. The decoder has a similar architecture to the encoder except that it includes a cross-attention mechanism after each self-attention layer that attends to the output of the encoder. Besides, the self-attention mechanism in the decoder uses a form of autoregressive or causal self-attention, which only allows the model to attend to past outputs. We use simple relative position embedding (Shaw et al., 2018) to enhance the model capabilities, in which we only add the relative position embedding to the dot-product weights of the self-attention.
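As an illustration of the relative position embedding and causal masking described above, the following is a minimal pure-Python sketch in the spirit of Shaw et al. (2018). The function name `attention_weights`, the clipping distance `max_dist`, and the scaling by the square root of the dimension are assumptions made for this example; they are not taken from the SpeechT5 implementation.

```python
import math


def attention_weights(queries, keys, rel_emb, max_dist, causal=False):
    """Softmax attention weights with a relative position term.

    queries, keys: lists of n vectors of dimension d.
    rel_emb: list of 2 * max_dist + 1 vectors of dimension d, indexed by
             the clipped relative distance (j - i) + max_dist.
    causal: if True, position i may only attend to positions j <= i.
    """
    n, d = len(queries), len(queries[0])
    weights = []
    for i in range(n):
        scores = []
        for j in range(n):
            if causal and j > i:
                scores.append(float("-inf"))  # future positions are hidden
                continue
            rel = rel_emb[max(-max_dist, min(max_dist, j - i)) + max_dist]
            content = sum(a * b for a, b in zip(queries[i], keys[j]))
            position = sum(a * b for a, b in zip(queries[i], rel))
            # The relative term is added only to the dot-product score.
            scores.append((content + position) / math.sqrt(d))
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With `causal=True`, every row still sums to one, but all weights above the diagonal are exactly zero, which corresponds to the decoder attending only to past outputs.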
Speech Pre/Post Net There are some differences between the speech-encoder pre-net and the speech-decoder pre-net. In the speech-encoder pre-net, we apply two convolutional layers (via convolution strides) to downsample the input features and process local relationships. In the speech-decoder pre-net, the log-Mel filterbank is fed into a neural network composed of three fully connected layers with the ReLU activation. To support multi-speaker TTS and VC, the speaker embedding, which is extracted by the public x-vector (Snyder et al., 2018), is concatenated with the output of the speech-decoder pre-net. Then it is processed by a linear layer with the ReLU activation. For the speech-decoder post-net, we use two different linear projections to predict the log-Mel filterbank and the stop token, respectively, and use five 1-dimensional convolutional layers to produce a residual to refine the reconstruction of the log-Mel filterbank.
Text Pre/Post Net We use shared embeddings as the text-encoder pre-net and the text-decoder pre/post-nets. The pre-net transforms the token index into an embedding vector, and the post-net transforms the hidden state into a probability distribution over tokens, which is normalized by the softmax function. The input text of the decoder is shifted for auto-regressive generation. During inference, the decoder uses its own past predictions to predict the next token.
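The shift of the decoder input and the use of past predictions at inference can be sketched as follows. This is a generic teacher-forcing and greedy-decoding illustration; `shift_right`, `greedy_decode`, `next_token_fn`, and the BOS/EOS identifiers are hypothetical names introduced for this example, not part of the SpeechT5 code.

```python
def shift_right(target_ids, bos_id):
    """Training: the decoder input is the target shifted right by one
    position, with a beginning-of-sequence token prepended."""
    return [bos_id] + list(target_ids[:-1])


def greedy_decode(next_token_fn, bos_id, eos_id, max_len):
    """Inference: the decoder consumes its own past predictions.

    next_token_fn maps the current prefix (list of ids) to the next id;
    in a real model this would be an argmax over the post-net softmax.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        token = next_token_fn(prefix)
        prefix.append(token)
        if token == eos_id:
            break
    return prefix[1:]
```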

Pre-training
With large-scale collections of unlabeled speech and text corpora, we can pre-train the unified-modal model on each modality separately, and further align the textual and acoustic information into a unified semantic space by a joint pre-training method.
Speech Learning The goal of speech learning is to leverage unlabeled speech data to learn general speech representations for both speech understanding and generation tasks. To this end, the SpeechT5 model is trained as a unified encoder-decoder model with two types of spoken language modeling tasks: bidirectional masked prediction and sequence-to-sequence generation. Formally, the input to the speech module is a sequence of 80-dimensional log-Mel filterbank features $X = (x_1, ..., x_n)$. The speech module, which consists of a speech-encoder pre-net and a Transformer encoder, produces hidden representations $S = (s_1, ..., s_n)$. Similar to the masked language modeling in BERT, we follow HuBERT and use an acoustic unit discovery model to provide frame-level targets $Z = (z_1, ..., z_n)$ for the output of the Transformer encoder; the target labels are generated by clustering the MFCC features with the k-means clustering method. Specifically, we use a span mask strategy, where p% of timesteps are randomly selected as start indices, and spans of l steps are masked. The cross-entropy computed over masked timesteps is defined as
$$\mathcal{L}^{s}_{mlm} = -\sum_{t \in M} \log p(z_t \mid \hat{X}, t),$$
where $\hat{X}$ denotes the corrupted version of $X$, and $M$ denotes the set of indices to be masked for the sequence $X$.
Furthermore, we propose to reconstruct the original speech through the speech-decoder pre-net, Transformer decoder, and speech-decoder post-net. The decoder is autoregressive in that the output of the encoder $S = (s_1, ..., s_n)$ and the previously generated features $y_{1:t-1}$ are considered when decoding the current output $y_t$. Inspired by the success of modern seq2seq TTS models, we enforce the corresponding output $y_t$ to be close to the original frame $x_t$ by minimizing their L1 distance as
$$\mathcal{L}^{s}_{1} = \sum_{t=1}^{n} \lVert y_t - x_t \rVert_1.$$
Besides, we also use a binary cross-entropy (BCE) loss $\mathcal{L}^{s}_{bce}$ for the stop token. To address the imbalance problem between stop tokens and normal tokens, we impose a positive weight on the tail positive stop token when calculating the BCE loss.
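The span mask strategy can be sketched as below: a fraction of the timesteps is sampled as start indices, and a span of fixed length is masked from each start, with overlapping spans merged. The function name `span_mask` and the values used in the usage note (8% and 10 frames) are illustrative assumptions only; the text above leaves p and l unspecified.

```python
import random


def span_mask(num_frames, start_fraction, span_len, seed=0):
    """Return the sorted list of masked frame indices.

    start_fraction: fraction of timesteps chosen as span starts (p / 100).
    span_len: number of consecutive frames masked from each start (l).
    Spans may overlap and are truncated at the end of the sequence.
    """
    rng = random.Random(seed)
    num_starts = int(round(num_frames * start_fraction))
    starts = rng.sample(range(num_frames), num_starts)
    masked = set()
    for start in starts:
        masked.update(range(start, min(start + span_len, num_frames)))
    return sorted(masked)
```

For example, `span_mask(100, 0.08, 10)` selects 8 start indices and masks at most 80 of the 100 frames; the masked-prediction loss is then computed only over the returned indices.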
Text Learning The language module aims to offer contextual understanding and generation. Overall, with unlabeled text data, SpeechT5 is trained by (1) corrupting text with an arbitrary noising function with a masked span, and (2) learning to produce a corresponding target that can reconstruct the original text or the masked text. For unsupervised objectives, we can use BART-style (Lewis et al., 2020) or T5-style (Raffel et al., 2019) mask strategies and target sequences. In the BART style, the model aims to reconstruct the original text from the noisy source text. In the T5 style, by contrast, all selected fragments are removed from the text and concatenated as the target sequence, while the remaining parts are concatenated as the source sequence. Formally, this model is trained to generate the target sequence $Y = (y_1, ..., y_m)$ auto-regressively, conditioned on the source sequence $X = (x_1, ..., x_n)$, with maximum likelihood estimation as
$$\mathcal{L}^{t}_{mle} = -\sum_{j=1}^{m} \log p(y_j \mid y_{<j}, X),$$
where the target sequence $Y$ can be the original text or all masked text fragments.
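The difference between the two corruption styles can be made concrete with a small sketch over token lists. The T5-style function follows the description above (removed fragments form the target, the remainder forms the source). For the BART style, replacing each selected span with a single `<mask>` token is an assumption of this example, since the text only states that the original text is reconstructed from a noisy source; the function names are hypothetical.

```python
def bart_style(tokens, spans, mask_token="<mask>"):
    """Source: each selected span replaced by one mask token (assumed
    noising function). Target: the full original text."""
    source, pos = [], 0
    for start, end in spans:  # spans are sorted, non-overlapping [start, end)
        source.extend(tokens[pos:start])
        source.append(mask_token)
        pos = end
    source.extend(tokens[pos:])
    return source, list(tokens)


def t5_style(tokens, spans):
    """Source: the text with the selected spans removed.
    Target: the removed spans, concatenated in order."""
    source, target, pos = [], [], 0
    for start, end in spans:
        source.extend(tokens[pos:start])
        target.extend(tokens[start:end])
        pos = end
    source.extend(tokens[pos:])
    return source, target
```

For the sentence "the quick brown fox jumps over the lazy dog" with spans covering "quick brown" and "the lazy", the T5-style source is "the fox jumps over dog" and the target is "quick brown the lazy", whereas the BART-style target is the whole sentence.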
Joint Pre-training The above pre-training methods can only leverage speech data or text data individually, to model the acoustic information or the language information. However, some spoken language tasks, such as ASR and TTS, need to build a cross-modality mapping between speech and text. Learning the alignment between speech and text in pre-training will therefore be beneficial to downstream tasks. With this motivation, we propose a unified-modal pre-training method to learn representations that capture modality-invariant information with discrete vector quantization. Specifically, our goal is to utilize vector quantized embeddings as a bridge between speech and text, and to align the speech representation and the text representation through a shared codebook, as shown in Figure 3. Inspired by VQ-VAE (Oord et al., 2017) and SemFace (Ren et al., 2021), we first use the quantizer to turn the dense speech or text representations $s_i$ of the encoder output into discrete representations $c_i$ from a fixed-size codebook $C^K$, which contains $K$ learnable embeddings. Formally, the nearest neighbor search is performed between the encoder output and the embedding of the latent code using the L2 distance metric as
$$c_i = \arg\min_{c_j \in C^K} \lVert s_i - c_j \rVert_2,$$
where $c_j$ is the $j$-th quantized vector in the codebook. Note that we do the same operation for the encoder output of speech and text with a shared codebook. Then, we randomly replace a proportion of the contextual representations with quantized latent representations in the corresponding time steps, and calculate the cross-attention upon the mixed representations, which can explicitly guide the quantizer to utilize cross-modal information. To encourage the use of more codebook entries, a diversity loss is applied, which maximizes the entropy of the averaged softmax distribution as
$$\mathcal{L}_{d} = \frac{1}{K} \sum_{k=1}^{K} p_k \log p_k,$$
where $p_k$ is the averaged probability of choosing the $k$-th code in the codebook. The final pre-training loss with unlabeled speech and text data combines the speech losses (masked prediction, L1 reconstruction, and stop-token BCE), the text maximum likelihood loss, and the diversity loss.
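The three operations of joint pre-training (nearest-neighbor quantization, random mix-up, and the diversity loss) can be sketched in pure Python as follows. This is an illustrative sketch only: the function names and the parameter `replace_prob` are assumptions, and the sketch omits how gradients flow through the quantizer and how the averaged code probabilities are obtained in a real model.

```python
import math
import random


def nearest_code(state, codebook):
    """Index of the codebook vector closest to `state` in L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda j: sq_dist(state, codebook[j]))


def mix_up(states, codebook, replace_prob, seed=0):
    """Randomly replace contextual states with their quantized versions.

    Returns the mixed sequence and the chosen code index per time step.
    """
    rng = random.Random(seed)
    mixed, codes = [], []
    for state in states:
        j = nearest_code(state, codebook)
        codes.append(j)
        use_code = rng.random() < replace_prob
        mixed.append(list(codebook[j]) if use_code else list(state))
    return mixed, codes


def diversity_loss(avg_probs):
    """(1/K) * sum_k p_k log p_k, i.e. the negative entropy divided by K.

    Minimizing this value maximizes the entropy of the averaged
    distribution, which pushes the model to use more codes.
    """
    k = len(avg_probs)
    return sum(p * math.log(p) for p in avg_probs if p > 0.0) / k
```

A uniform distribution over four codes gives a loss of about -0.347, while a distribution that always picks one code gives 0, so the uniform usage is preferred under minimization.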

Fine-tuning
After pre-training, we fine-tune the encoder-decoder model with the corresponding loss of each downstream task. Our goal here is to measure general spoken language learning abilities. As such, we study downstream performance on a diverse set of benchmarks, including ASR, TTS, VC, and SID. Each of the four speech processing tasks that we consider can be handled by combining the encoder-decoder model with the corresponding pre-nets and post-nets. For example, the speech-encoder pre-net, encoder-decoder, text-decoder pre-net, and text-decoder post-net constitute the ASR model, and the training loss is the cross-entropy loss.

Dataset and Evaluation Metrics
For unsupervised speech pre-training, we use the full 960 hours of LibriSpeech audio (Panayotov et al., 2015), which is derived from the LibriVox project that contains English recordings of copyright-free audiobooks read by volunteers from the Internet. For unsupervised text pre-training, we use the normalized language model training text of LibriSpeech as unlabeled data, which contains 400M sentences. In supervised fine-tuning, we use the commonly adopted dataset and evaluation metric for each task. We train the ASR model with the LibriSpeech training data, and measure the performance of ASR by the word error rate (WER) on the standard LibriSpeech dev-clean/other and test-clean/other sets. A language model is trained on the same text data as used for pre-training, and is used for shallow fusion (Gulcehre et al., 2015) during ASR inference.
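For reference, the word error rate used throughout the evaluation is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The sketch below is a standard dynamic-programming implementation; it assumes whitespace tokenization and a non-empty reference, and it is not the scoring script used in the experiments.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring "the cat sit on mat" against the reference "the cat sat on the mat" yields one substitution and one deletion over six reference words, i.e. a WER of about 33.3%.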
For TTS, we fine-tune the pre-trained model on the 460-hour LibriTTS clean sets (Zen et al., 2019), a multi-speaker corpus of read English speech from the audiobooks of the LibriVox project. We trim the waveform following the ESPnet recipe (Watanabe et al., 2018). We evaluate the WER using an open-source ASR model, wav2vec 2.0 CTC, on all test sets. Moreover, we randomly select 200 fixed examples with various lengths (no overlap with the training set) from our internal dataset as the evaluation set to evaluate the mean opinion score (MOS).
For VC, we use the CMU Arctic (Kominek and Black, 2004) corpus, which consists of speech recordings of four speakers, namely clb (female), bdl (male), slt (female), and rms (male), reading the same 1,132 phonetically balanced English utterances. We consider a many-to-many setting and use all speakers for training and evaluation. Thus, there are twelve different combinations of source and target speakers. For each speaker, the first 932, the last 100, and the remaining 100 sentences of the 1,132 sentences are used for training, test, and validation, respectively. The average MCD (Mel-Cepstral Distortion) taken along the DTW (dynamic time warping) path between the output and ground-truth mel-cepstra serves as the evaluation metric of VC. A smaller MCD indicates better performance. Besides, we also use WER to evaluate the quality of the generated voice with a public ASR model, HuBERT Large (footnote 5), since the WER of the test set with this ASR model is comparable to that of VTN.
Table 1: Results of VC (speech to speech). BDL, CLB, and SLT denote three different speakers. VTN is the state-of-the-art voice Transformer network model, which is fine-tuned from the pre-trained ASR or TTS model.
Model        WER (BDL to SLT)   WER (CLB to SLT)   MCD (BDL to SLT)   MCD (CLB to SLT)
VTN w/ ASR   11.1%              10.9%              6.50               6.11
VTN w/ TTS   7.6%               9.1%               6.33               6.02
Baseline     21.5%              10.8%              6.26               6.16
SpeechT5     10.1%              7.1%               6.06               5.95
For SID, VoxCeleb1 (Nagrani et al., 2017) is adopted in our experiments, which contains over 100,000 speech recordings uttered by 1,251 celebrities, extracted from videos uploaded to YouTube. We use the official split of VoxCeleb1 for the speaker identification task, where the test set contains 8,251 utterances from these 1,251 celebrities. The capability of identifying speakers is assessed by classifying an utterance into the ground-truth category. The top-1 speaker classification accuracy is used as the evaluation metric of SID.
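The MCD metric averaged along a DTW path can be sketched as follows. The per-frame formula below is the commonly used definition of mel-cepstral distortion in dB; whether the experiments use exactly this variant (for instance, whether the 0th cepstral coefficient is excluded) is not stated in the text, so the sketch should be read as an assumption. The DTW here uses the frame-level MCD as its local cost and the three standard step directions.

```python
import math


def frame_mcd(a, b):
    """Mel-cepstral distortion (dB) between two mel-cepstral frames."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * sq)


def mcd_along_dtw(converted, target):
    """Average frame MCD along the minimum-cost DTW alignment path."""
    n, m = len(converted), len(target)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]   # accumulated cost of best path
    steps = [[0] * m for _ in range(n)]    # number of frame pairs on it
    for i in range(n):
        for j in range(m):
            d = frame_mcd(converted[i], target[j])
            if i == 0 and j == 0:
                cost[i][j], steps[i][j] = d, 1
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], steps[i - 1][j]))
            if j > 0:
                candidates.append((cost[i][j - 1], steps[i][j - 1]))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], steps[i - 1][j - 1]))
            best_cost, best_steps = min(candidates)
            cost[i][j] = best_cost + d
            steps[i][j] = best_steps + 1
    return cost[-1][-1] / steps[-1][-1]
```

Two sequences that differ only in duration (for example, one frame repeated) are aligned perfectly by DTW and obtain an MCD of 0.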

Implementation Details
Pre-training All models in this paper are implemented in Fairseq (https://github.com/pytorch/fairseq). The encoder-decoder model contains 12 Transformer encoder blocks and 6 Transformer decoder blocks, where the model dimension is 768, the inner dimension (FFN) is 3,072, and the number of attention heads is 12. The speech-encoder pre-net consists of two 1-dimensional convolutional layers with strides [2, 2], kernel sizes [5, 5], and channel sizes [1024, 1536], where each layer is followed by a gated linear unit (Dauphin et al., 2017). For the speech-decoder pre-net and post-net, we use the same setting as the pre-net and post-net in (Shen et al., 2018), except that the channel size of the post-net is 256. For the text-encoder/decoder pre/post-nets, a shared embedding table with dimension 768 is used. For the quantization module, we use G=2 codebooks with V=100 entries each for the shared codebook module, resulting in a theoretical maximum of 10k codewords.
Footnote 5: https://huggingface.co/espnet
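Two of the numbers above can be checked with simple arithmetic: the downsampling factor of the two stride-2 convolutions and the 10k codeword limit. In the sketch below, the padding of 2 (half the kernel size, rounded down) is an assumption, since the padding is not stated in the text; the function names are hypothetical.

```python
def conv1d_out_len(length, kernel, stride, padding=0):
    """Output length of a 1-D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def downsampled_len(num_frames, kernels=(5, 5), strides=(2, 2), padding=2):
    """Number of encoder frames after the speech-encoder pre-net."""
    for kernel, stride in zip(kernels, strides):
        num_frames = conv1d_out_len(num_frames, kernel, stride, padding)
    return num_frames


def max_codewords(num_groups, entries_per_group):
    """Each codeword picks one entry from every group: V ** G combinations."""
    return entries_per_group ** num_groups
```

Under these assumptions, 1,000 input frames become 250 encoder frames (a 4x reduction), and G=2 groups of V=100 entries give 100 ** 2 = 10,000 possible codewords.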
The speech feature is a sequence of 80-dimensional log-Mel filterbank features with a 64 millisecond (ms) window and a 16 ms frame shift. It is normalized with utterance-level mean and variance when used as input data. For the text data, we combine the text of LibriSpeech (Panayotov et al., 2015) and LibriTTS (Zen et al., 2019) to get a 10k unigram vocabulary (Kudo, 2018) and segment the text by using SentencePiece. We optimize with Adam (Kingma and Ba, 2014), warming up the learning rate for the first 10% of updates. We pre-train our model on 32 GPUs with 32 GB memory and set the update frequency to 4 for 40k steps.
Fine-tuning and Inference After pre-training, we fine-tune the learned representations on labeled data of downstream tasks. The speech and text data are preprocessed in the same way as pre-training. All fine-tuning experiments are conducted on 8 GPUs.
For ASR, we add an extra linear layer to calculate the CTC loss at the top of the encoder. The loss weight is 0.3 for CTC and 0.7 for cross-entropy. We train our models for 200k steps with a batch size of up to 60,000 tokens per GPU and a learning rate of 0.001. We train a language model for ASR inference, which contains 12 Transformer encoder blocks, with the model dimension set to 1024, the inner dimension to 4096, and the number of attention heads to 16. During inference, the beam size is set to 5 for all experiments, and we can also apply joint CTC and decoder inference and a language model (LM) to further improve the performance.
Table 2: Results of ASR (speech to text). Fairseq and Espnet (Watanabe et al., 2018) are two open-source Transformer-based encoder-decoder ASR models. JD and LM mean joint decoding and language model, respectively.
Models    JD    LM    dev-clean   dev-other   test-clean   test-other
Fairseq   w/o   w/o   3.00        7.50        3.20         7.50
For TTS, we add an additional attention loss (Tachibana et al., 2018) to speed up model convergence, besides the L1 loss and the BCE loss. The model is updated for 300k steps with a learning rate of 0.0004, while each GPU processes up to 45,000 tokens per batch. We utilize HiFi-GAN (Kong et al., 2020), which is capable of both efficient and high-fidelity synthesis, to produce the raw waveforms. We train it in a speaker-independent manner using the training data of LibriTTS.
For VC, we apply the same loss function as used in the fine-tuning of TTS. The model is trained by the Adam optimizer with a batch size of 20,000 tokens per GPU. We assign the learning rate based on an inverse square root schedule with a maximum learning rate of $10^{-4}$ within 60k steps and apply 6k warm-up steps. For the waveform synthesis module, we use Parallel WaveGAN (Yamamoto et al., 2020), which is a non-autoregressive variant of the WaveNet vocoder. We train it in a speaker-dependent manner by conditioning on our acoustic features using the same training split.
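The inverse square root schedule used for VC can be sketched as follows, using the stated maximum learning rate of 1e-4 and 6k warm-up steps. The linear warm-up starting from zero is an assumption of this sketch; the initial learning rate is not given in the text, and `inverse_sqrt_lr` is a hypothetical helper rather than the Fairseq scheduler itself.

```python
def inverse_sqrt_lr(step, max_lr=1e-4, warmup_steps=6000):
    """Linear warm-up to max_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (warmup_steps / step) ** 0.5
```

The learning rate peaks at step 6,000 and is halved every time the step count quadruples, so it is 5e-5 at step 24,000 and about 3.2e-5 at the final step 60,000.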
For SID, we use cross-entropy loss and fine-tune all models by the Adam optimizer with a batch size of 256 segments per GPU and inputs of 150 frames. We assign the learning rate based on one cycle of a triangular cyclical schedule.
ASR The performance of ASR is reported in Table 2. We also list the results of Transformer-based encoder-decoder models from Fairseq and Espnet (Watanabe et al., 2018). As can be seen from the table, our baseline is much stronger than previous models. Furthermore, the proposed SpeechT5 without LM achieves 3.5%, 6.2%, 5.2%, and 4.2% relative WER reduction with respect to the baseline with the same setting on dev-clean, dev-other, test-clean, and test-other, respectively, which demonstrates the effectiveness of our pre-training methods.
TTS Table 3 shows the experimental results of TTS. Our proposed SpeechT5 achieves 1.49% WER and 3.65 MOS, a relative reduction of 13.37% in WER and a gain of 0.2 in MOS with respect to the baseline model. This suggests that the proposed pre-training technique brings a significant improvement.
SID The results of SID are shown in Table 4. We also list the scores reported in SUPERB and SpeechNet (Chen et al., 2021). In the SUPERB leaderboard, wav2vec 2.0 (Baevski et al., 2020) and HuBERT are two state-of-the-art pre-trained models. Experimental results demonstrate that our SpeechT5 significantly outperforms the strong baseline and previous work, and achieves a state-of-the-art accuracy of 90.97% on the SID task.
Table 4: Results of SID (speech to class). SUPERB is a leaderboard to benchmark the performance of a pre-trained model with minimal architecture changes and labeled data. SpeechNet (Chen et al., 2021) is a universal speech model with a multi-task learning framework.
Framework                       Model                                     Top-1 ACC
SUPERB                          wav2vec 2.0 Base (Baevski et al., 2020)   75.18%
SUPERB                          HuBERT Base                               81.42%
SUPERB                          HuBERT Large                              90.33%
SpeechNet (Chen et al., 2021)   Single Task                               86.00%
SpeechNet (Chen et al., 2021)   Multi-Task with TTS                       87.90%
Ours                            Baseline                                  87.72%
Ours                            SpeechT5                                  90.97%
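The relative figures quoted for TTS can be cross-checked with a few lines of arithmetic. The sketch below only rearranges the reported numbers (1.49% WER with a 13.37% relative reduction, and 3.65 MOS with a gain of 0.2); the implied baseline values are derived here and are not read from Table 3, and the helper names are hypothetical.

```python
def baseline_from_relative_reduction(value, relative_reduction):
    """Invert value = baseline * (1 - relative_reduction)."""
    return value / (1.0 - relative_reduction)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`."""
    return (new - old) / old


# Implied TTS baseline WER: 1.49 / (1 - 0.1337), roughly 1.72%.
tts_baseline_wer = baseline_from_relative_reduction(1.49, 0.1337)

# Implied TTS baseline MOS: 3.65 - 0.2 = 3.45, i.e. a relative gain of
# roughly 5.8%, which agrees with the figure quoted in the introduction.
tts_baseline_mos = 3.65 - 0.2
tts_relative_mos_gain = relative_gain(3.65, tts_baseline_mos)
```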

Related Work
Large-scale pre-training has drawn much attention in both the NLP and speech communities, due to its strong capability of generalization and efficient usage of large-scale data. Recent pre-trained models in NLP, such as BERT (Devlin et al., 2019), RoBERTa, XLNet, and BART, have achieved state-of-the-art performance on language understanding and generation tasks. In spoken language processing, pre-trained speech models have also been applied to ASR (Baevski et al., 2020), TTS (Hayashi et al., 2019), speech translation (Li et al., 2020), VC, and so on.
However, the above-mentioned research efforts are geared towards single-modal learning, and hence can only be used in either text or speech modeling. Although some speech-language pre-training work (Chung et al., 2021b; Kim et al., 2021) attempts to improve spoken language understanding tasks, e.g., intent detection, dialog act classification, and spoken sentiment analysis, these methods cannot be used for spoken generation tasks such as TTS or text generation. We consider our work most related to T5 (Raffel et al., 2019). The core idea of the T5 model, a unified framework for a variety of text-based language problems, is to treat every text processing problem as a "text-to-text" problem, i.e., taking text as input and producing new text as output. Unlike T5, SpeechT5 is a cross-modal encoder-decoder framework, whose input and output are speech or text through different pre/post-nets. Besides, we propose a new joint speech-text pre-training method to leverage large-scale unlabeled text and speech datasets and align the textual and phonetic information.
SpeechT5 is also related to Speech Chain (Tjandra et al., 2020), which leverages an ASR model and a TTS model to build a closed-loop machine speech chain and allows training on the concatenation of both labeled and unlabeled data, and to SpeechNet (Chen et al., 2021), which designs a universal modularized model to perform multiple speech processing tasks with multi-task learning. SpeechNet shows that it can simultaneously learn several common and important speech processing tasks. However, there are two major differences between SpeechNet and our SpeechT5. First, SpeechNet has different encoders and decoders for different modalities (e.g., speech and text), whereas SpeechT5 uses only one shared encoder-decoder model for all tasks. Second, SpeechNet aims to verify multi-task learning on several speech tasks, whereas our SpeechT5 attempts to pre-train and improve the universal model with large-scale unlabeled text and speech data.
Another related work is SUPERB, a benchmark to examine the capability of pre-trained models. SUPERB collects various tasks with limited labeled data from speech communities to align with common research interests. The SUPERB work focuses on investigating a simple framework that solves SUPERB tasks with a frozen, shared pre-trained model and lightweight prediction modules fine-tuned for each task. In contrast, the goal of SpeechT5 is to handle all speech tasks by fine-tuning a unified-modal encoder-decoder model, which is pre-trained on unlabeled speech and text corpora.

Conclusion and Future Work
In this paper, we have presented SpeechT5 as a pre-trained encoder-decoder model for various spoken language tasks. We convert all spoken language processing tasks into a speech/text to speech/text format, and propose a novel joint pretraining method to utilize cross-modal information by leveraging the unlabeled speech and text data. Our unified model can support both spoken language understanding and generation tasks, such as speaker identification and voice conversion. Experiments show that SpeechT5 significantly outperforms all baselines in several spoken language processing tasks.
For future work, we plan to investigate more efficient pre-training methods, such as learning representations from waveforms via masked prediction as in HuBERT, and explicitly aligning text tokens and phonemes as in unsupervised ASR (Baevski et al., 2021). Besides, we will pre-train SpeechT5 with a larger model and more unlabeled data, and fine-tune it on more spoken language processing tasks. We are also interested in extending the proposed SpeechT5 framework to address the multilingual spoken language processing problem.