Yinlin Guo


2025

pdf bib
ProsodyFlow: High-fidelity Text-to-Speech through Conditional Flow Matching and Prosody Modeling with Large Speech Language Models
Haoyu Wang | Sizhe Shan | Yinlin Guo | Yuehai Wang
Proceedings of the 31st International Conference on Computational Linguistics

Text-to-speech (TTS) has seen significant advancements in high-quality, expressive speech synthesis. However, achieving diverse and natural prosody in synthesized speech remains challenging. In this paper, we propose ProsodyFlow, an end-to-end TTS model that integrates large self-supervised speech models and conditional flow matching to model prosodic features effectively. Our approach involves using a speech LLM to extract acoustic features, mapping these features into a prosody latent space, and then employing conditional flow matching to generate prosodic vectors conditioned on the input text. Experiments on the LJSpeech dataset show that ProsodyFlow improves synthesis quality and efficiency compared to existing models, achieving more prosodic and expressive speech synthesizing.