Transformer-based Nepali Text-to-Speech

Ishan Dongol, Bal Krishna Bal


Abstract
Research on deep learning-based Text-to-Speech (TTS) systems has gained increasing popularity for low-resource languages, as this approach is not only computationally robust but also capable of producing state-of-the-art results. However, these approaches have yet to be significantly explored for the Nepali language, primarily because of the lack of adequately sized datasets and secondarily because of the relatively sophisticated computing resources they demand. This paper explores the FastPitch acoustic model with the HiFi-GAN vocoder for the Nepali language. We trained the acoustic model on two datasets: OpenSLR and a dataset prepared jointly by the Information and Language Processing Research Lab (ILPRL) and the Nepal Association of the Blind (NAB), further referred to as the ILPRL-NAB dataset. We achieved Mean Opinion Scores (MOS) of 3.70 and 3.40, respectively, for the same model with the two datasets. The synthesized speech produced by the model was found to be quite natural and of good quality.
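As a rough illustration of the two-stage pipeline the abstract describes (an acoustic model mapping text to a mel-spectrogram, followed by a neural vocoder mapping the mel-spectrogram to a waveform), the sketch below wires two toy PyTorch modules together. The module names, symbol inventory size, and hop length are assumptions made for illustration; the modules are shape-level stand-ins, not the actual FastPitch transformer or HiFi-GAN generator, and they do not reflect the authors' implementation.

# Minimal sketch of a two-stage neural TTS pipeline:
# acoustic model (symbols -> mel-spectrogram) + vocoder (mel -> waveform).
# Both modules are toy stand-ins for illustration only.
import torch
import torch.nn as nn

N_SYMBOLS = 80      # hypothetical size of the Nepali symbol inventory
N_MELS = 80         # mel-spectrogram channels, a common FastPitch/HiFi-GAN setting
HOP_LENGTH = 256    # waveform samples generated per mel frame (assumed)


class ToyAcousticModel(nn.Module):
    """Stand-in for a FastPitch-style model: symbol IDs -> mel frames."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_SYMBOLS, 256)
        self.to_mel = nn.Linear(256, N_MELS)

    def forward(self, symbols):                # symbols: (batch, seq_len)
        hidden = self.embed(symbols)           # (batch, seq_len, 256)
        mel = self.to_mel(hidden)              # (batch, seq_len, n_mels)
        return mel.transpose(1, 2)             # (batch, n_mels, frames)


class ToyVocoder(nn.Module):
    """Stand-in for a HiFi-GAN-style vocoder: mel frames -> waveform."""
    def __init__(self):
        super().__init__()
        self.upsample = nn.ConvTranspose1d(
            N_MELS, 1, kernel_size=HOP_LENGTH, stride=HOP_LENGTH)

    def forward(self, mel):                    # mel: (batch, n_mels, frames)
        return self.upsample(mel).squeeze(1)   # (batch, frames * HOP_LENGTH)


if __name__ == "__main__":
    acoustic, vocoder = ToyAcousticModel().eval(), ToyVocoder().eval()
    symbol_ids = torch.randint(0, N_SYMBOLS, (1, 32))  # dummy 32-symbol input
    with torch.no_grad():
        mel = acoustic(symbol_ids)
        audio = vocoder(mel)
    print(mel.shape, audio.shape)  # (1, 80, 32) and (1, 8192)

In a real system such as the one described in the paper, the acoustic model would be a trained FastPitch network with explicit duration and pitch prediction, and the vocoder a trained HiFi-GAN generator; only the interface between the two stages is shown here.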
Anthology ID:
2023.icon-1.64
Volume:
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2023
Address:
Goa University, Goa, India
Editors:
Jyoti D. Pawar, Sobha Lalitha Devi
Venue:
ICON
SIG:
SIGLEX
Publisher:
NLP Association of India (NLPAI)
Pages:
651–656
URL:
https://aclanthology.org/2023.icon-1.64
Cite (ACL):
Ishan Dongol and Bal Krishna Bal. 2023. Transformer-based Nepali Text-to-Speech. In Proceedings of the 20th International Conference on Natural Language Processing (ICON), pages 651–656, Goa University, Goa, India. NLP Association of India (NLPAI).
Cite (Informal):
Transformer-based Nepali Text-to-Speech (Dongol & Bal, ICON 2023)
PDF:
https://aclanthology.org/2023.icon-1.64.pdf