VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation

Wenrui Liu, Jionghao Bai, Xize Cheng, Jialong Zuo, Ziyue Jiang, Shengpeng Ji, Minghui Fang, Xiaoda Yang, Qian Yang, Zhou Zhao


Abstract
In recent years, speech generation has advanced significantly, primarily due to improvements in large TTS (text-to-speech) systems and scalable TTS datasets. However, large-scale multilingual TTS datasets are still lacking, which limits the development of cross-lingual and multilingual TTS systems. Hence, we refine the Voxpopuli dataset and propose the VoxpopuliTTS dataset. This dataset comprises 30,000 hours of high-quality speech data across 3 languages, with multiple speakers and styles, and is suitable for various speech tasks such as TTS and ASR (automatic speech recognition). To enhance the quality of the speech data from Voxpopuli, we improve the existing processing pipeline by: 1) filtering out low-quality speech-text pairs based on ASR confidence scores, and 2) concatenating short transcripts into long transcripts by checking the completeness of their semantic information. Experimental results demonstrate the effectiveness of the VoxpopuliTTS dataset and the proposed processing pipeline.
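
The two pipeline steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Segment` type, the 0.9 confidence threshold, the gap and duration limits, and the use of sentence-final punctuation as a stand-in for the semantic completeness check are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container for one aligned speech-text pair.
@dataclass
class Segment:
    start: float       # start time in seconds
    end: float         # end time in seconds
    text: str          # ASR transcript of the segment
    confidence: float  # ASR confidence in [0, 1]

# Assumption: sentence-final punctuation approximates "semantically complete".
SENTENCE_END = (".", "?", "!")

def filter_by_confidence(segments: List[Segment], threshold: float = 0.9) -> List[Segment]:
    """Step 1: drop pairs whose ASR confidence is below an assumed threshold."""
    return [s for s in segments if s.confidence >= threshold]

def merge_until_complete(segments: List[Segment],
                         max_duration: float = 30.0,
                         max_gap: float = 0.5) -> List[Segment]:
    """Step 2: join adjacent short segments until the text looks complete."""
    merged: List[Segment] = []
    current = None
    for seg in segments:
        if current is None:
            current = Segment(seg.start, seg.end, seg.text, seg.confidence)
        else:
            contiguous = seg.start - current.end <= max_gap
            fits = seg.end - current.start <= max_duration
            if contiguous and fits:
                # Extend the running segment; keep the weakest confidence.
                current = Segment(current.start, seg.end,
                                  current.text + " " + seg.text,
                                  min(current.confidence, seg.confidence))
            else:
                # Do not merge across a gap (e.g. a filtered-out segment).
                merged.append(current)
                current = Segment(seg.start, seg.end, seg.text, seg.confidence)
        if current.text.rstrip().endswith(SENTENCE_END):
            merged.append(current)
            current = None
    if current is not None:
        merged.append(current)
    return merged

# Toy input: four short segments, one with low ASR confidence.
segments = [
    Segment(0.0, 1.0, "Thank you", 0.95),
    Segment(1.1, 2.5, "Madam President.", 0.97),
    Segment(2.6, 3.5, "garbled txt", 0.40),
    Segment(3.6, 5.0, "I will be brief.", 0.92),
]
kept = filter_by_confidence(segments)
result = merge_until_complete(kept)
```

In this toy run the low-confidence segment is discarded, and the first two segments are joined into one longer utterance because the first does not end a sentence. A real pipeline would likely replace the punctuation heuristic with a stronger completeness check.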
Anthology ID:
2025.coling-main.685
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
10293–10297
URL:
https://aclanthology.org/2025.coling-main.685/
Cite (ACL):
Wenrui Liu, Jionghao Bai, Xize Cheng, Jialong Zuo, Ziyue Jiang, Shengpeng Ji, Minghui Fang, Xiaoda Yang, Qian Yang, and Zhou Zhao. 2025. VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10293–10297, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation (Liu et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.685.pdf