ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations

Neil Shah, Saiteja Kosgi, Vishal Tambrahalli, Neha S, Anil Nelakanti, Vineet Gandhi


Abstract
We present ParrotTTS, a modularized text-to-speech (TTS) synthesis model that leverages disentangled self-supervised speech representations. A multi-speaker variant can be trained effectively using transcripts from only a single speaker. ParrotTTS adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving speaker-specific characteristics, e.g., synthesizing fluent Hindi speech in a French speaker's voice and accent. We present extensive results in monolingual and multi-lingual scenarios. ParrotTTS outperforms state-of-the-art multi-lingual TTS models while using only a fraction of the paired data the latter require. Speech samples from ParrotTTS and code can be found at https://parrot-tts.github.io/tts/
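The modular design the abstract describes can be pictured as a two-stage pipeline: a text front end that needs transcripts (from a single speaker), and a speech back end trained on untranscribed multi-speaker audio, with disentangled self-supervised units as the interface between them. The following is a minimal sketch under the assumption that those representations are discrete units (e.g., quantized features from a HuBERT-style encoder); all class and function names are hypothetical illustrations, not the authors' code.

```python
# Sketch of a modularized TTS pipeline in the spirit of the abstract:
# stage 1 maps text to discrete self-supervised units, stage 2 renders
# units as speech in a chosen speaker's voice. Names are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class TTSOutput:
    waveform: List[float]  # synthesized audio samples
    sample_rate: int


class TextToUnits:
    """Stage 1 (hypothetical): text -> discrete speech units.

    This is the only module that needs paired text/audio data, and the
    abstract's claim is that transcripts from a single speaker suffice.
    """

    def __call__(self, text: str) -> List[int]:
        # A trained seq2seq model would go here; a dummy mapping keeps
        # the sketch self-contained and runnable.
        return [hash(ch) % 100 for ch in text]


class UnitVocoder:
    """Stage 2 (hypothetical): units + speaker identity -> waveform.

    Trained on untranscribed audio only, so it can cover many speakers
    and languages without any parallel or bilingual examples.
    """

    def __call__(self, units: List[int], speaker_id: int) -> TTSOutput:
        # A neural vocoder conditioned on speaker_id would go here; we
        # emit silence of proportional length as a stand-in.
        return TTSOutput(waveform=[0.0] * (len(units) * 320),
                         sample_rate=16000)


# Because the two stages share only the unit vocabulary, changing the
# speaker_id transfers a voice across languages: e.g., Hindi text
# rendered with a French speaker's voice, as the abstract describes.
front = TextToUnits()
back = UnitVocoder()
audio = back(front("namaste duniya"), speaker_id=7)
print(len(audio.waveform), audio.sample_rate)
```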
Anthology ID: 2024.findings-eacl.6
Volume: Findings of the Association for Computational Linguistics: EACL 2024
Month: March
Year: 2024
Address: St. Julian’s, Malta
Editors: Yvette Graham, Matthew Purver
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 79–91
URL: https://aclanthology.org/2024.findings-eacl.6
Cite (ACL): Neil Shah, Saiteja Kosgi, Vishal Tambrahalli, Neha S, Anil Nelakanti, and Vineet Gandhi. 2024. ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations. In Findings of the Association for Computational Linguistics: EACL 2024, pages 79–91, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal): ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations (Shah et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-eacl.6.pdf