Incorporating speaker embedding and post-filter network for improving speaker similarity of personalized speech synthesis system

Sheng-Yao Wang, Yi-Chin Huang


Abstract
In recent years, speech synthesis system can generate speech with high speech quality. However, multi-speaker text-to-speech (TTS) system still require large amount of speech data for each target speaker. In this study, we would like to construct a multi-speaker TTS system by incorporating two sub modules into artificial neural network-based speech synthesis system to alleviate this problem. First module is to add speaker embedding into encoding module for generating speech while a large amount of the speech data from target speaker is not necessary. For speaker embedding method, in our study, two main speaker embedding methods, namely speaker verification embedding and voice conversion embedding, are compared to deciding which one is suitable for our personalized TTS system. Second, we substituted the conventional post-net module, which is adopted to enhance the output spectrum sequence, to further improving the speech quality of the generated speech utterance. Here, a post-filter network is used. Finally, experiment results showed that the speaker embedding is useful by adding it into encoding module and the resultant speech utterance indeed perceived as the target speaker. Also, the post-filter network not only improving the speech quality and also enhancing the speaker similarity of the generated speech utterances. The constructed TTS system can generate a speech utterance of the target speaker in fewer than 2 seconds. In the future, we would like to further investigate the controllability of the speaking rate or perceived emotion state of the generated speech.
Anthology ID:
2021.rocling-1.42
Volume:
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)
Month:
October
Year:
2021
Address:
Taoyuan, Taiwan
Editors:
Lung-Hao Lee, Chia-Hui Chang, Kuan-Yu Chen
Venue:
ROCLING
SIG:
Publisher:
The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)
Note:
Pages:
326–332
Language:
URL:
https://aclanthology.org/2021.rocling-1.42
DOI:
Bibkey:
Cite (ACL):
Sheng-Yao Wang and Yi-Chin Huang. 2021. Incorporating speaker embedding and post-filter network for improving speaker similarity of personalized speech synthesis system. In Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021), pages 326–332, Taoyuan, Taiwan. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP).
Cite (Informal):
Incorporating speaker embedding and post-filter network for improving speaker similarity of personalized speech synthesis system (Wang & Huang, ROCLING 2021)
Copy Citation:
PDF:
https://aclanthology.org/2021.rocling-1.42.pdf