Sheng-Yao Wang
2021
整合語者嵌入向量與後置濾波器於提升個人化合成語音之語者相似度 (Incorporating Speaker Embedding and Post-Filter Network for Improving Speaker Similarity of Personalized Speech Synthesis System)
Sheng-Yao Wang
|
Yi-Chin Huang
International Journal of Computational Linguistics & Chinese Language Processing, Volume 26, Number 2, December 2021
Incorporating speaker embedding and post-filter network for improving speaker similarity of personalized speech synthesis system
Sheng-Yao Wang
|
Yi-Chin Huang
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)
In recent years, speech synthesis system can generate speech with high speech quality. However, multi-speaker text-to-speech (TTS) system still require large amount of speech data for each target speaker. In this study, we would like to construct a multi-speaker TTS system by incorporating two sub modules into artificial neural network-based speech synthesis system to alleviate this problem. First module is to add speaker embedding into encoding module for generating speech while a large amount of the speech data from target speaker is not necessary. For speaker embedding method, in our study, two main speaker embedding methods, namely speaker verification embedding and voice conversion embedding, are compared to deciding which one is suitable for our personalized TTS system. Second, we substituted the conventional post-net module, which is adopted to enhance the output spectrum sequence, to further improving the speech quality of the generated speech utterance. Here, a post-filter network is used. Finally, experiment results showed that the speaker embedding is useful by adding it into encoding module and the resultant speech utterance indeed perceived as the target speaker. Also, the post-filter network not only improving the speech quality and also enhancing the speaker similarity of the generated speech utterances. The constructed TTS system can generate a speech utterance of the target speaker in fewer than 2 seconds. In the future, we would like to further investigate the controllability of the speaking rate or perceived emotion state of the generated speech.
Search