Chih-Chung Kuo
2025
Taiwanese Hakka Across Taiwan Corpus and Formosa Speech Recognition Challenge 2025 – Dapu & Zhao’an Accents
Yuan-Fu Liao | Chih-Chung Kuo | Chao-Shih Huang | Yu-Siang Lan | Han-Chun Lai | Wen-Han Hsu
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
To revive the endangered Hakka language in Taiwan, the first large-scale Hakka speech corpus covering all aspects of Taiwanese Hakka, the Hakka Across Taiwan (HAT) corpus, was created. This paper introduces the second part of the HAT corpus: the Dapu and Zhao’an accents. Furthermore, to promote this newly constructed corpus and to evaluate the performance of the most advanced Hakka ASR systems, the 2025 Formosa Speech Recognition Challenge, FSR-2025–Hakka ASR II, was held. Sixteen teams participated in two tracks: speech-to-Hakka-Hanzi and speech-to-Hakka-Pinyin. The best results were a character error rate (CER) of 7.50% on the Hanzi track and a syllable error rate (SER) of 14.81% on the Pinyin track.
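Both tracks are scored with edit-distance error rates: CER over Hakka Hanzi characters and SER over Pinyin syllables. As a minimal, self-contained illustration, such a rate can be computed with a standard Levenshtein alignment; the example strings and tone numbers below are hypothetical, not drawn from the HAT corpus:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences
    (substitutions, deletions, insertions all cost 1)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def error_rate(ref_tokens, hyp_tokens):
    """CER if tokens are characters, SER if tokens are syllables."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

# CER: tokenize into characters (hypothetical Hanzi strings).
print(error_rate(list("客家話真好聽"), list("客家活真好聽")))  # 1 sub / 6 chars ~ 0.167

# SER: tokenize into whitespace-split Pinyin syllables (hypothetical tones).
print(error_rate("hag5 ga24 fa55".split(), "hag5 ga11 fa55".split()))  # 1 sub / 3 syllables ~ 0.333
```

The same function serves both metrics; only the tokenization changes, characters for CER and syllables for SER.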
The NPTU ASR System for FSR2025 Hakka Character/Pinyin Recognition: Whisper with mBART Post-Editing and RNNLM Rescoring
Yi-Chin Huang | Yu-Heng Chen | Jian-Hua Wang | Hsiu-Chi Wu | Chih-Chung Kuo | Chao-Shih Huang | Yuan-Fu Liao
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
This paper presents our system for the FSR-2025 Hakka Automatic Speech Recognition (ASR) Challenge, which consists of two sub-tasks: (i) Hakka Characters and (ii) Hakka Pinyin. We propose a unified architecture built upon Whisper [1], a large weakly supervised ASR model, as the acoustic backbone, with optional LoRA (Low-Rank Adaptation) [2] for parameter-efficient fine-tuning. Data augmentation techniques include the MUSAN [3] corpus (music/speech/noise) and tempo/speed perturbation [4]. For the character task, mBART-50 [5,6], a multilingual sequence-to-sequence model, is applied for text correction, while both tasks employ an RNNLM [7] for N-best rescoring. Under the final evaluation setting of the character task, mBART-driven 10-best text correction combined with RNNLM rescoring achieved a character error rate (CER) of 6.26%, whereas the official leaderboard reported 22.5%. For the Pinyin task, the Medium model proved more suitable than the Large model given the dataset size and accent distribution. With 10-best RNNLM rescoring, it achieved a syllable error rate (SER) of 4.65% on our internal warm-up test set, and the official final score (with tone information) was 14.81%. Additionally, we analyze the contribution of language identification (LID) to accent recognition across different recording and media sources.
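As background on the rescoring step, the sketch below shows the usual log-linear N-best combination: each hypothesis keeps its ASR log-probability, an external language model adds its own score, and a weighted sum selects the winner. The toy character-bigram LM, the example 3-best list, and the weight of 0.3 are illustrative assumptions standing in for the paper's RNNLM, 10-best lists, and tuned hyperparameters:

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Toy character-bigram LM with add-one smoothing,
    standing in for the RNNLM used in the paper."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for line in corpus:
        chars = ["<s>"] + list(line)
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    V = len(vocab)
    def log_prob(text):
        chars = ["<s>"] + list(text)
        return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + V))
                   for a, b in zip(chars, chars[1:]))
    return log_prob

def rescore_nbest(nbest, lm_log_prob, lm_weight=0.3):
    """nbest: list of (hypothesis, asr_log_prob) pairs.
    Returns the hypothesis maximizing asr + lm_weight * lm."""
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_log_prob(h[0]))[0]

# Hypothetical training text and 3-best list (the paper used 10-best).
lm = train_bigram_lm(["客家話真好聽", "客家話愛流傳"])
nbest = [("客家活真好聽", -4.1),   # best ASR score, but a wrong character
         ("客家話真好聽", -4.3),   # correct hypothesis
         ("客家畫真好聽", -5.0)]
print(rescore_nbest(nbest, lm))   # LM evidence flips the ranking to the correct string
```

The weighted sum lets the language model override a small acoustic-score gap when a hypothesis contains an implausible character sequence, which is the mechanism the 10-best rescoring in both sub-tasks relies on.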