Vishal Tambrahalli


2024

ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations
Neil Shah | Saiteja Kosgi | Vishal Tambrahalli | Neha S | Anil Nelakanti | Vineet Gandhi
Findings of the Association for Computational Linguistics: EACL 2024

We present ParrotTTS, a modularized text-to-speech synthesis model leveraging disentangled self-supervised speech representations. It can train a multi-speaker variant effectively using transcripts from a single speaker. ParrotTTS adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker’s voice and accent. We present extensive results in monolingual and multi-lingual scenarios. ParrotTTS outperforms state-of-the-art multi-lingual text-to-speech (TTS) models while using only a fraction of the paired data that the latter require. Speech samples from ParrotTTS and code can be found at https://parrot-tts.github.io/tts/