ProsodyFlow: High-fidelity Text-to-Speech through Conditional Flow Matching and Prosody Modeling with Large Speech Language Models

Haoyu Wang, Sizhe Shan, Yinlin Guo, Yuehai Wang


Abstract
Text-to-speech (TTS) has seen significant advances in high-quality, expressive speech synthesis. However, achieving diverse and natural prosody in synthesized speech remains challenging. In this paper, we propose ProsodyFlow, an end-to-end TTS model that integrates large self-supervised speech models and conditional flow matching to model prosodic features effectively. Our approach uses a speech LLM to extract acoustic features, maps these features into a prosody latent space, and then employs conditional flow matching to generate prosodic vectors conditioned on the input text. Experiments on the LJSpeech dataset show that ProsodyFlow improves synthesis quality and efficiency compared to existing models, producing more expressive speech with richer prosody.
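The abstract does not spell out the flow matching objective, so the following is only a minimal sketch of how conditional flow matching is commonly set up, not the authors' implementation. It assumes the widely used straight-line (optimal-transport) probability path; `velocity_fn`, `text_cond`, and `SIGMA_MIN` are hypothetical names introduced for illustration, and prosody vectors are represented as plain lists of floats.

```python
import random

SIGMA_MIN = 1e-4  # assumed small noise floor; not a value taken from the paper


def cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) for the straight-line conditional path.

    x0 is a noise sample, x1 a target prosody vector, t a time in [0, 1].
    x_t is the point on the path at time t; u_t is the regression target.
    """
    k = 1.0 - sigma_min
    x_t = [(1.0 - k * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - k * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(velocity_fn, x1, text_cond, rng):
    """One-sample estimate of the flow matching regression loss.

    velocity_fn(x, t, text_cond) is a hypothetical stand-in for the
    text-conditioned network that predicts a velocity vector.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()
    x_t, u_t = cfm_pair(x0, x1, t)
    v = velocity_fn(x_t, t, text_cond)
    return sum((p - q) ** 2 for p, q in zip(v, u_t)) / len(x1)


def euler_sample(velocity_fn, text_cond, dim, steps, rng):
    """Integrate dx/dt = v(x, t, cond) from noise (t=0) to t=1 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt, text_cond)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

At inference time only `euler_sample` is needed: a prosodic vector is drawn by integrating the learned velocity field from Gaussian noise, with the text representation passed through as the condition. The number of Euler steps trades speed against accuracy.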
Anthology ID:
2025.coling-main.518
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
7748–7753
URL:
https://aclanthology.org/2025.coling-main.518/
Cite (ACL):
Haoyu Wang, Sizhe Shan, Yinlin Guo, and Yuehai Wang. 2025. ProsodyFlow: High-fidelity Text-to-Speech through Conditional Flow Matching and Prosody Modeling with Large Speech Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 7748–7753, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
ProsodyFlow: High-fidelity Text-to-Speech through Conditional Flow Matching and Prosody Modeling with Large Speech Language Models (Wang et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.518.pdf