Yi-Chin Huang

2025

The NPTU ASR System for FSR2025 Hakka Character/Pinyin Recognition: Whisper with mBART Post-Editing and RNNLM Rescoring
Yi-Chin Huang | Yu-Heng Chen | Jian-Hua Wang | Hsiu-Chi Wu | Chih-Chung Kuo | Chao-Shih Huang | Yuan-Fu Liao
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)

This paper presents our system for the FSR-2025 Hakka Automatic Speech Recognition (ASR) Challenge, which consists of two sub-tasks: (i) Hakka Characters and (ii) Hakka Pinyin. We propose a unified architecture built upon Whisper [1], a large weakly supervised ASR model, as the acoustic backbone, with optional LoRA (Low-Rank Adaptation [2]) for parameter-efficient fine-tuning. Data augmentation techniques include the MUSAN [3] corpus (music/speech/noise) and tempo/speed perturbation [4]. For the character task, mBART-50 [5,6], a multilingual sequence-to-sequence model, is applied for text correction, while both tasks employ an RNNLM [7] for N-best rescoring. Under the final evaluation setting of the character task, mBART-driven 10-best text correction combined with RNNLM rescoring achieved a CER (Character Error Rate) of 6.26%, whereas the official leaderboard reported 22.5%. For the Pinyin task, the Medium model proved more suitable than the Large model given the dataset size and accent distribution. With 10-best RNNLM rescoring, it achieved a SER (Syllable Error Rate) of 4.65% on our internal warm-up test set, and the official final score (with tone information) was 14.81%. Additionally, we analyze the contribution of LID (Language Identification) for accent recognition across different recording and media sources.

2023

pdf bib

Whisper Model Adaptation for FSR-2023 Hakka Speech Recognition Challenge
Yi-Chin Huang | Ji-Qian Tsai
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)

2022

pdf bib abs

A Preliminary Study on Mandarin-Hakka neural machine translation using small-sized data
Yi-Hsiang Hung | Yi-Chin Huang
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

In this study, we implemented a machine translation system using the Convolutional Neural Network with Attention mechanism for translating Mandarin to Sixan-accent Hakka. Specifically, to cope with the different idioms or terms used between Northern and Southern Sixan-accent, we analyzed the corpus differences and lexicon definition, and then separated the various word usages for training exclusive models for each accent. Besides, since the collected Hakka corpora are relatively limited, the unseen words frequently occurred during real-world translation. In our system, we selected suitable thresholds for each model based on the model verification to reject non-suitable translated words. Then, by applying the proposed algorithm, which adopted the forced Hakka idioms/terms segmentation and the common Mandarin word substitution, the resultant translation sentences become more intelligible. Therefore, the proposed system achieved promising results using small-sized data. This system could be used for Hakka language teaching and also the front-end of Mandarin and Hakka code-switching speech synthesis systems.

pdf bib

Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)
Yung-Chun Chang | Yi-Chin Huang
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

2021

pdf bib abs

Incorporating speaker embedding and post-filter network for improving speaker similarity of personalized speech synthesis system
Sheng-Yao Wang | Yi-Chin Huang
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

In recent years, speech synthesis system can generate speech with high speech quality. However, multi-speaker text-to-speech (TTS) system still require large amount of speech data for each target speaker. In this study, we would like to construct a multi-speaker TTS system by incorporating two sub modules into artificial neural network-based speech synthesis system to alleviate this problem. First module is to add speaker embedding into encoding module for generating speech while a large amount of the speech data from target speaker is not necessary. For speaker embedding method, in our study, two main speaker embedding methods, namely speaker verification embedding and voice conversion embedding, are compared to deciding which one is suitable for our personalized TTS system. Second, we substituted the conventional post-net module, which is adopted to enhance the output spectrum sequence, to further improving the speech quality of the generated speech utterance. Here, a post-filter network is used. Finally, experiment results showed that the speaker embedding is useful by adding it into encoding module and the resultant speech utterance indeed perceived as the target speaker. Also, the post-filter network not only improving the speech quality and also enhancing the speaker similarity of the generated speech utterances. The constructed TTS system can generate a speech utterance of the target speaker in fewer than 2 seconds. In the future, we would like to further investigate the controllability of the speaking rate or perceived emotion state of the generated speech.

pdf bib

整合語者嵌入向量與後置濾波器於提升個人化合成語音之語者相似度 (Incorporating Speaker Embedding and Post-Filter Network for Improving Speaker Similarity of Personalized Speech Synthesis System)
Sheng-Yao Wang | Yi-Chin Huang
International Journal of Computational Linguistics & Chinese Language Processing, Volume 26, Number 2, December 2021