%0 Conference Proceedings
%T SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
%A Ao, Junyi
%A Wang, Rui
%A Zhou, Long
%A Wang, Chengyi
%A Ren, Shuo
%A Wu, Yu
%A Liu, Shujie
%A Ko, Tom
%A Li, Qing
%A Zhang, Yu
%A Wei, Zhihua
%A Qian, Yao
%A Li, Jinyu
%A Wei, Furu
%Y Muresan, Smaranda
%Y Nakov, Preslav
%Y Villavicencio, Aline
%S Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
%D 2022
%8 May
%I Association for Computational Linguistics
%C Dublin, Ireland
%F ao-etal-2022-speecht5
%X Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
%R 10.18653/v1/2022.acl-long.393
%U https://aclanthology.org/2022.acl-long.393
%U https://doi.org/10.18653/v1/2022.acl-long.393
%P 5723-5738