DNN-based Speech Synthesis Using Abundant Tags of Spontaneous Speech Corpus

Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, Hiroshi Saruwatari


Abstract
In this paper, we investigate the effectiveness of using rich annotations in deep neural network (DNN)-based statistical speech synthesis. DNN-based frameworks typically take linguistic information as input features, called context, instead of using the text directly. In such frameworks, we can synthesize not only reading-style speech but also speech with paralinguistic and nonlinguistic features by adding that information to the context. However, it is not clear what kind of information is crucial for reproducing paralinguistic and nonlinguistic features. Therefore, we investigate the effectiveness of rich tags in DNN-based speech synthesis using the Corpus of Spontaneous Japanese (CSJ), which has a large number of annotations on paralinguistic features such as prosody, disfluency, and morphological features. Experimental evaluation results show that the reproducibility of paralinguistic features in synthetic speech was enhanced by adding such information as context.
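The idea of adding tag information to the context can be illustrated with a minimal sketch. It is not the authors' implementation: the tag inventories, the function names `one_hot` and `build_context`, and the three-dimensional base vector are hypothetical and chosen only for illustration. The sketch assumes each tag is encoded as a one-hot vector and appended to an existing linguistic context vector before it is fed to the DNN.

```python
# Hypothetical tag inventories for illustration only; the actual CSJ
# tag sets and the paper's feature design are not reproduced here.
DISFLUENCY_TAGS = ["none", "filler", "word_fragment"]
BOUNDARY_TONE_TAGS = ["none", "rising", "falling"]


def one_hot(value, inventory):
    """Return a one-hot list encoding `value` over `inventory`."""
    if value not in inventory:
        raise ValueError(f"unknown tag: {value!r}")
    return [1.0 if item == value else 0.0 for item in inventory]


def build_context(base_context, disfluency, boundary_tone):
    """Append one-hot tag features to an existing context vector."""
    return (
        list(base_context)
        + one_hot(disfluency, DISFLUENCY_TAGS)
        + one_hot(boundary_tone, BOUNDARY_TONE_TAGS)
    )


# Assumed base linguistic features (e.g. phoneme or accent encodings).
base = [0.0, 1.0, 0.5]
context = build_context(base, "filler", "rising")
print(len(context))  # 3 base dimensions + 3 + 3 tag dimensions
```

In a sketch like this, comparing models trained with and without the appended dimensions is one way to probe which tags matter for reproducing paralinguistic features.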
Anthology ID:
2020.lrec-1.792
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6438–6443
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.792
Cite (ACL):
Yuki Yamashita, Tomoki Koriyama, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Ryo Masumura, and Hiroshi Saruwatari. 2020. DNN-based Speech Synthesis Using Abundant Tags of Spontaneous Speech Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6438–6443, Marseille, France. European Language Resources Association.
Cite (Informal):
DNN-based Speech Synthesis Using Abundant Tags of Spontaneous Speech Corpus (Yamashita et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.792.pdf