@inproceedings{hsu-etal-2025-improving,
title = "Improving Low-Resource Speech Recognition with Whisper-{M}o{E} and Synthetic Data Augmentation: A Case Study on {H}akka",
author = "Hsu, Yuan-Chi and
Fang, Liang-Chun and
Dai, Hong-Jie",
editor = "Chang, Kai-Wei and
Lu, Ke-Han and
Yang, Chih-Kai and
Tam, Zhi-Rui and
Chang, Wen-Yu and
Wang, Chung-Che",
booktitle = "Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)",
month = nov,
year = "2025",
address = "National Taiwan University, Taipei City, Taiwan",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.rocling-main.51/",
pages = "446--449",
ISBN = "979-8-89176-379-1",
abstract = "The objective of this study is to improve speech recognition performance for low-resource Hakka, a language spoken by a specific ethnic group. Our team conducted experiments by fine-tuning different base versions of Whisper (e.g., the original model and the Mandarin-focused Belle model). We found that fine-tuning on different bases yielded distinct advantages and varying results in Hakka character and phonetic recognition tasks. To further enhance model accuracy, we experimented with replacing the q, k, and v linear layers in the attention blocks of the Whisper encoder with a mixture-of-experts model combined with RoLA. In addition, we augmented the training data with synthesized speech generated with diverse voice styles and varying speaking rates. The results showed a 0.73{\%} reduction in character error rate for Task 1 and a 0.2{\%} reduction in word error rate for Task 2. These findings confirm that both architectural adjustments to the model and the strategic use of limited synthetic speech data in low-resource dialect corpora can effectively improve recognition performance."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="hsu-etal-2025-improving">
<titleInfo>
<title>Improving Low-Resource Speech Recognition with Whisper-MoE and Synthetic Data Augmentation: A Case Study on Hakka</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yuan-Chi</namePart>
<namePart type="family">Hsu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Liang-Chun</namePart>
<namePart type="family">Fang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hong-Jie</namePart>
<namePart type="family">Dai</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Kai-Wei</namePart>
<namePart type="family">Chang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ke-Han</namePart>
<namePart type="family">Lu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chih-Kai</namePart>
<namePart type="family">Yang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhi-Rui</namePart>
<namePart type="family">Tam</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wen-Yu</namePart>
<namePart type="family">Chang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chung-Che</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">National Taiwan University, Taipei City, Taiwan</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-379-1</identifier>
</relatedItem>
<abstract>The objective of this study is to improve speech recognition performance for low-resource Hakka, a language spoken by a specific ethnic group. Our team conducted experiments by fine-tuning different base versions of Whisper (e.g., the original model and the Mandarin-focused Belle model). We found that fine-tuning on different bases yielded distinct advantages and varying results in Hakka character and phonetic recognition tasks. To further enhance model accuracy, we experimented with replacing the q, k, and v linear layers in the attention blocks of the Whisper encoder with a mixture-of-experts model combined with RoLA. In addition, we augmented the training data with synthesized speech generated with diverse voice styles and varying speaking rates. The results showed a 0.73% reduction in character error rate for Task 1 and a 0.2% reduction in word error rate for Task 2. These findings confirm that both architectural adjustments to the model and the strategic use of limited synthetic speech data in low-resource dialect corpora can effectively improve recognition performance.</abstract>
<identifier type="citekey">hsu-etal-2025-improving</identifier>
<location>
<url>https://aclanthology.org/2025.rocling-main.51/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>446</start>
<end>449</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Improving Low-Resource Speech Recognition with Whisper-MoE and Synthetic Data Augmentation: A Case Study on Hakka
%A Hsu, Yuan-Chi
%A Fang, Liang-Chun
%A Dai, Hong-Jie
%Y Chang, Kai-Wei
%Y Lu, Ke-Han
%Y Yang, Chih-Kai
%Y Tam, Zhi-Rui
%Y Chang, Wen-Yu
%Y Wang, Chung-Che
%S Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
%D 2025
%8 November
%I Association for Computational Linguistics
%C National Taiwan University, Taipei City, Taiwan
%@ 979-8-89176-379-1
%F hsu-etal-2025-improving
%X The objective of this study is to improve speech recognition performance for low-resource Hakka, a language spoken by a specific ethnic group. Our team conducted experiments by fine-tuning different base versions of Whisper (e.g., the original model and the Mandarin-focused Belle model). We found that fine-tuning on different bases yielded distinct advantages and varying results in Hakka character and phonetic recognition tasks. To further enhance model accuracy, we experimented with replacing the q, k, and v linear layers in the attention blocks of the Whisper encoder with a mixture-of-experts model combined with RoLA. In addition, we augmented the training data with synthesized speech generated with diverse voice styles and varying speaking rates. The results showed a 0.73% reduction in character error rate for Task 1 and a 0.2% reduction in word error rate for Task 2. These findings confirm that both architectural adjustments to the model and the strategic use of limited synthetic speech data in low-resource dialect corpora can effectively improve recognition performance.
%U https://aclanthology.org/2025.rocling-main.51/
%P 446-449
Markdown (Informal)
[Improving Low-Resource Speech Recognition with Whisper-MoE and Synthetic Data Augmentation: A Case Study on Hakka](https://aclanthology.org/2025.rocling-main.51/) (Hsu et al., ROCLING 2025)
ACL