Impacts of Vocoder Selection on Tacotron-based Nepali Text-To-Speech Synthesis

Ganesh Dhakal Chhetri; Kiran Chandra Dahal; Prakash Poudyal

Abstract

Text-to-speech (TTS) technology enhances human-computer interaction and increases content accessibility. Tacotron and other deep learning models have improved the naturalness of TTS systems. The vocoder, which transforms mel-spectrograms into audio waveforms, significantly influences voice quality. This study evaluates vocoders paired with Tacotron2 for Nepali text-to-speech synthesis. While English-language vocoders have been thoroughly examined, Nepali-language vocoders remain underexplored. The study uses the WaveNet and MelGAN vocoders to generate speech from mel-spectrograms produced by Tacotron2 for Nepali text. To assess the quality of the synthesized speech, this paper studies the mel-cepstral distortion (MCD) and Mean Opinion Score (MOS) of the speech produced by both vocoders. The comparative investigation of the Tacotron2 + MelGAN and Tacotron2 + WaveNet models, using the Nepali OpenSLR and News male voice datasets, consistently reveals the advantage of Tacotron2 + MelGAN in terms of naturalness and accuracy. The Tacotron2 + MelGAN model achieved an average MOS of 4.245 on the Nepali OpenSLR dataset and 2.885 on the male voice dataset.
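
For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how MCD and MOS are commonly computed. It is not the authors' evaluation code: the abstract does not state the number of mel-cepstral coefficients, the frame alignment method, or the rating protocol, so the function names, the assumption that frames are already time-aligned, and the exclusion of the 0th (energy) coefficient are illustrative assumptions only.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over paired frames.

    Assumptions (not taken from the paper): the two sequences are already
    time-aligned frame by frame, and each frame lists mel-cepstral
    coefficients with the 0th (energy) coefficient already removed.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equally long frame sequences")

    # Standard MCD scaling constant: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(squared_error)
    return total / len(ref_frames)


def mean_opinion_score(ratings):
    """Arithmetic mean of listener ratings (typically on a 1 to 5 scale)."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)
```

A lower MCD indicates synthesized speech that is spectrally closer to the reference, while a higher MOS indicates speech that listeners judged more natural.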

Anthology ID: 2025.chipsal-1.18
Volume: Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Kengatharaiyer Sarveswaran, Ashwini Vaidya, Bal Krishna Bal, Sana Shams, Surendrabikram Thapa
Venues: CHiPSAL | WS
Publisher: International Committee on Computational Linguistics
Pages: 185–192
URL: https://aclanthology.org/2025.chipsal-1.18/
Cite (ACL): Ganesh Dhakal Chhetri, Kiran Chandra Dahal, and Prakash Poudyal. 2025. Impacts of Vocoder Selection on Tacotron-based Nepali Text-To-Speech Synthesis. In Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025), pages 185–192, Abu Dhabi, UAE. International Committee on Computational Linguistics.
Cite (Informal): Impacts of Vocoder Selection on Tacotron-based Nepali Text-To-Speech Synthesis (Chhetri et al., CHiPSAL 2025)
PDF: https://aclanthology.org/2025.chipsal-1.18.pdf
