ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations

Neil Shah, Saiteja Kosgi, Vishal Tambrahalli, Neha S, Anil Nelakanti, Vineet Gandhi


Abstract
We present ParrotTTS, a modularized text-to-speech (TTS) synthesis model that leverages disentangled self-supervised speech representations. A multi-speaker variant can be trained effectively using transcripts from only a single speaker. ParrotTTS adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving speaker-specific characteristics, e.g., synthesizing fluent Hindi speech in a French speaker's voice and accent. We present extensive results in monolingual and multi-lingual scenarios. ParrotTTS outperforms state-of-the-art multi-lingual TTS models while using only a fraction of the paired data the latter require. Speech samples from ParrotTTS and code can be found at https://parrot-tts.github.io/tts/
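The modular design the abstract describes can be pictured as a two-stage pipeline: a text front end that needs transcripts (from a single speaker), and a speech back end trained on untranscribed multi-speaker audio, with disentangled self-supervised units as the interface between them. The following is a minimal sketch under the assumption that those representations are discrete units (e.g., quantized features from a HuBERT-style encoder); all class and function names are hypothetical illustrations, not the authors' code.

```python
# Sketch of a modularized TTS pipeline in the spirit of the abstract:
# stage 1 maps text to discrete self-supervised units, stage 2 renders
# units as speech in a chosen speaker's voice. Names are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class TTSOutput:
    waveform: List[float]  # synthesized audio samples
    sample_rate: int


class TextToUnits:
    """Stage 1 (hypothetical): text -> discrete speech units.

    This is the only module that needs paired text/audio data, and the
    abstract's claim is that transcripts from a single speaker suffice.
    """

    def __call__(self, text: str) -> List[int]:
        # A trained seq2seq model would go here; a dummy mapping keeps
        # the sketch self-contained and runnable.
        return [hash(ch) % 100 for ch in text]


class UnitVocoder:
    """Stage 2 (hypothetical): units + speaker identity -> waveform.

    Trained on untranscribed audio only, so it can cover many speakers
    and languages without any parallel or bilingual examples.
    """

    def __call__(self, units: List[int], speaker_id: int) -> TTSOutput:
        # A neural vocoder conditioned on speaker_id would go here; we
        # emit silence of proportional length as a stand-in.
        return TTSOutput(waveform=[0.0] * (len(units) * 320),
                         sample_rate=16000)


# Because the two stages share only the unit vocabulary, changing the
# speaker_id transfers a voice across languages: e.g., Hindi text
# rendered with a French speaker's voice, as the abstract describes.
front = TextToUnits()
back = UnitVocoder()
audio = back(front("namaste duniya"), speaker_id=7)
print(len(audio.waveform), audio.sample_rate)
```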
Anthology ID: 2024.findings-eacl.6
Volume: Findings of the Association for Computational Linguistics: EACL 2024
Month: March
Year: 2024
Address: St. Julian’s, Malta
Editors: Yvette Graham, Matthew Purver
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 79–91
URL: https://aclanthology.org/2024.findings-eacl.6
Cite (ACL): Neil Shah, Saiteja Kosgi, Vishal Tambrahalli, Neha S, Anil Nelakanti, and Vineet Gandhi. 2024. ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations. In Findings of the Association for Computational Linguistics: EACL 2024, pages 79–91, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal): ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations (Shah et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-eacl.6.pdf