fairseq Sˆ2: A Scalable and Integrable Speech Synthesis Toolkit

Changhan Wang, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Ann Lee, Peng-Jen Chen, Jiatao Gu, Juan Pino


Abstract
This paper presents fairseq Sˆ2, a fairseq extension for speech synthesis. We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants. To enable training speech synthesis models with less curated data, a number of preprocessing tools are built and their importance is shown empirically. To facilitate faster iteration of development and analysis, a suite of automatic metrics is included. Apart from the features added specifically for this extension, fairseq Sˆ2 also benefits from the scalability offered by fairseq and can be easily integrated with other state-of-the-art systems provided in this framework. The code, documentation, and pre-trained models will be made available at https://github.com/pytorch/fairseq/tree/master/examples/speech_synthesis.
Anthology ID:
2021.emnlp-demo.17
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Heike Adel, Shuming Shi
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
143–152
URL:
https://aclanthology.org/2021.emnlp-demo.17
DOI:
10.18653/v1/2021.emnlp-demo.17
Cite (ACL):
Changhan Wang, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Ann Lee, Peng-Jen Chen, Jiatao Gu, and Juan Pino. 2021. fairseq Sˆ2: A Scalable and Integrable Speech Synthesis Toolkit. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 143–152, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
fairseq Sˆ2: A Scalable and Integrable Speech Synthesis Toolkit (Wang et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-demo.17.pdf
Code
 pytorch/fairseq
Data
LJSpeech, VCTK