Leveraging the Interplay between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation

Yejin Jeon, Yunsu Kim, Gary Geunbae Lee


Abstract
Contemporary neural speech synthesis models have demonstrated remarkable proficiency in synthetic speech generation, attaining a level of quality comparable to that of human-produced speech. However, these achievements have predominantly been verified for high-resource languages such as English. Furthermore, the Tacotron and FastSpeech variants show substantial pausing errors when applied to Korean, which affects speech perception and naturalness. To address these issues, we propose a novel framework that comprehensively models both the syntactic and acoustic cues associated with pausing patterns. Although it is trained on short audio clips, our framework consistently generates natural speech even for considerably longer and more intricate out-of-domain (OOD) sentences. Architectural design choices are validated through comparisons with baseline models and ablation studies using subjective and objective metrics, which together confirm model performance.
Anthology ID:
2024.lrec-main.910
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
10416–10421
URL:
https://aclanthology.org/2024.lrec-main.910
Cite (ACL):
Yejin Jeon, Yunsu Kim, and Gary Geunbae Lee. 2024. Leveraging the Interplay between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10416–10421, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Leveraging the Interplay between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation (Jeon et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.910.pdf