2025
Effective Speaker Diarization Leveraging Multi-task Logarithmic Loss Objectives
Jhih-Rong Guo | Tien-Hong Lo | Yu-Sheng Tsao | Pei-Ying Lee | Yung-Chang Hsu | Berlin Chen
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
End-to-End Neural Diarization (EEND) has undergone substantial development, particularly with powerset classification methods that enhance performance but can exacerbate speaker confusion. To address this, we propose a novel training strategy that complements the standard cross entropy loss with an auxiliary ordinal log loss, guided by a distance matrix of speaker combinations. Our experiments reveal that while this approach yields significant relative improvements of 15.8% in false alarm rate and 10.0% in confusion error rate, it also uncovers a critical trade-off with an increased missed error rate. The primary contribution of this work is the identification and analysis of this trade-off, which stems from the model adopting a more conservative prediction strategy. This insight is crucial for designing more balanced and effective loss functions in speaker diarization.
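The auxiliary objective can be sketched in plain Python. This is a hypothetical reading of the loss, not the paper's exact formulation: each non-target class contributes a log-loss term weighted by its distance to the target speaker combination (per the distance matrix), and the result is added to the standard cross-entropy loss.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    # Standard cross-entropy over powerset classes.
    return -math.log(softmax(logits)[target])

def ordinal_log_loss(logits, target, dist):
    # Penalize probability mass on classes far from the target combination:
    # a class at distance 0 contributes nothing, while distant classes
    # contribute -d * log(1 - p_k). One common form of the ordinal log loss.
    p = softmax(logits)
    return sum(dist[target][k] * -math.log(1.0 - p[k] + 1e-12)
               for k in range(len(p)) if k != target)

def multitask_loss(logits, target, dist, lam=0.5):
    # Cross-entropy complemented by the distance-weighted auxiliary term;
    # lam is an illustrative mixing weight.
    return cross_entropy(logits, target) + lam * ordinal_log_loss(logits, target, dist)
```

Intuitively, confusing the target with a "nearby" speaker combination is penalized less than confusing it with a distant one, which is what discourages gross speaker confusion.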
CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition
Hung-Yang Sung | Chien-Chun Wang | Kuan-Tang Huang | Tien-Hong Lo | Yu-Sheng Tsao | Yung-Chang Hsu | Berlin Chen
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
Automatic speech recognition (ASR) for low-resource languages such as Taiwanese Hokkien is difficult due to the scarcity of annotated data. However, direct fine-tuning on Han-character transcriptions often fails to capture detailed phonetic and tonal cues, while training only on romanization lacks lexical and syntactic coverage. In addition, prior studies have rarely explored staged strategies that integrate both annotation types. To address this gap, we present CLiFT-ASR, a cross-lingual fine-tuning framework that builds on Mandarin HuBERT models and progressively adapts them to Taiwanese Hokkien. The framework employs a two-stage process in which it first learns acoustic and tonal representations from phonetic Tai-lo annotations and then captures vocabulary and syntax from Han-character transcriptions. This progressive adaptation enables effective alignment between speech sounds and orthographic structures. Experiments on the TAT-MOE corpus demonstrate that CLiFT-ASR achieves a 24.88% relative reduction in character error rate (CER) compared with strong baselines. The results indicate that CLiFT-ASR provides an effective and parameter-efficient solution for Taiwanese Hokkien ASR and that it has potential to benefit other low-resource language scenarios.
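The two-stage progressive adaptation can be sketched as a simple training schedule. The names (`STAGES`, `run_schedule`, the checkpoint label) are illustrative, not taken from the paper:

```python
# Hypothetical schedule mirroring CLiFT-ASR's progressive adaptation:
# stage 1 fine-tunes on phonetic Tai-lo targets, stage 2 on Han characters.
STAGES = [
    {"name": "stage1_tailo", "targets": "tailo", "epochs": 10},
    {"name": "stage2_han",   "targets": "han",   "epochs": 10},
]

def run_schedule(stages, train_fn):
    # Start from a Mandarin HuBERT checkpoint and adapt it stage by stage;
    # train_fn(state, targets, epochs) returns the updated model state.
    state = "mandarin_hubert_init"
    history = []
    for s in stages:
        state = train_fn(state, s["targets"], s["epochs"])
        history.append(s["name"])
    return state, history
```

The point of the ordering is that acoustic/tonal structure learned from Tai-lo in stage 1 is already in place when stage 2 introduces the lexically richer Han-character targets.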
Exploring Sentence Stress Detection using Whisper-based Speech Models
Ting-An Hung | Yu-Hsuan Hsieh | Tien-Hong Lo | Yung-Chang Hsu | Berlin Chen
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
Sentence stress reflects the relative prominence of words within a sentence. It is fundamental to speech intelligibility and naturalness, and is particularly important in second language (L2) learning. Accurate stress production facilitates effective communication and reduces misinterpretation. In this work, we investigate sentence stress detection (SSD) using Whisper-based transformer speech models under diverse settings, including model scaling, backbone–decoder interactions, architectural and regularization enhancements, and embedding visualization for interpretability. Results show that smaller Whisper variants achieve stronger performance under limited data, while architectural and regularization enhancements improve stability and generalization. Embedding analysis reveals clear separation between stressed and unstressed words. These findings offer practical insights into model selection, architecture design, and interpretability for SSD applications, with implications for L2 learning support tools.
The EZ-AI System for Formosa Speech Recognition Challenge 2025
Yu-Sheng Tsao | Hung-Yang Sung | An-Ci Peng | Jhih-Rong Guo | Tien-Hong Lo
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
This study presents our system for the Hakka Speech Recognition Challenge 2025. We designed and compared different systems for two low-resource dialects: Dapu and Zhaoan. On the Pinyin track, we gain boosts by leveraging cross-lingual transfer learning from related languages and combining it with self-supervised learning (SSL). For the Hanzi track, we employ pretrained Whisper with Low-Rank Adaptation (LoRA) fine-tuning. To alleviate the low-resource issue, two data augmentation methods are experimented with: simulating conversational speech to handle multi-speaker scenarios, and generating additional corpus via text-to-speech (TTS). Results from the pilot test showed that transfer learning significantly improved performance in the Pinyin track, achieving an average character error rate (CER) of 19.57% and ranking third among all teams. In the Hanzi track, meanwhile, the Whisper + LoRA system achieved an average CER of 6.84%, earning first place among all teams. This study demonstrates that transfer learning and data augmentation can effectively improve recognition performance for low-resource languages. However, the domain mismatch seen in the media test set remains a challenge. We plan to explore in-context learning (ICL) and hotword modeling in the future to better address this issue.
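LoRA's core idea is small enough to show directly: the pretrained weight matrix stays frozen, and only a low-rank update A·B is trained. The sketch below uses plain Python lists purely for illustration; in practice the update is applied inside Whisper's attention and projection layers.

```python
def matmul(X, Y):
    # Naive matrix multiply over nested lists (illustration only).
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_forward(x, W, A, B, alpha=1.0):
    # Effective weight: frozen W (d_in x d_out) plus a scaled low-rank
    # update A @ B, where A is d_in x r and B is r x d_out. Only A and B
    # receive gradients during fine-tuning.
    r = len(B)
    delta = matmul(A, B)
    W_eff = [[W[i][j] + (alpha / r) * delta[i][j]
              for j in range(len(W[0]))] for i in range(len(W))]
    return matmul(x, W_eff)
```

With the usual initialization (B all zeros), the layer's output is identical to the frozen pretrained layer at the start of fine-tuning, so adaptation begins from the pretrained behavior.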
2024
An Effective Pronunciation Assessment Approach Leveraging Hierarchical Transformers and Pre-training Strategies
Bi-Cheng Yan | Jiun-Ting Li | Yi-Cheng Wang | Hsin Wei Wang | Tien-Hong Lo | Yung-Chang Hsu | Wei-Cheng Chao | Berlin Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automatic pronunciation assessment (APA) manages to quantify a second language (L2) learner’s pronunciation proficiency in a target language by providing fine-grained feedback with multiple pronunciation aspect scores at various linguistic levels. Most existing efforts on APA typically parallelize the modeling process, namely predicting multiple aspect scores across various linguistic levels simultaneously. This inevitably makes both the hierarchy of linguistic units and the relatedness among the pronunciation aspects sidelined. Recognizing such a limitation, we in this paper first introduce HierTFR, a hierarchical APA method that jointly models the intrinsic structures of an utterance while considering the relatedness among the pronunciation aspects. We also propose a correlation-aware regularizer to strengthen the connection between the estimated scores and the human annotations. Furthermore, novel pre-training strategies tailored for different linguistic levels are put forward so as to facilitate better model initialization. An extensive set of empirical experiments conducted on the speechocean762 benchmark dataset suggests the feasibility and effectiveness of our approach in relation to several competitive baselines.
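A correlation-aware regularizer can be sketched as a Pearson-correlation term. The exact form used in HierTFR is not given in the abstract; this is a minimal sketch, assuming the regularizer rewards estimated scores that co-vary with human annotations:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two score sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy + 1e-12)

def correlation_regularizer(pred_scores, human_scores):
    # 0 at perfect positive correlation, 2 at perfect anti-correlation,
    # so minimizing it pushes predictions to track the human rankings.
    return 1.0 - pearson(pred_scores, human_scores)
```

Unlike a plain squared-error term, such a regularizer is insensitive to the absolute scale of the scores and targets only their relative ordering across learners.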
An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution
Tien-Hong Lo | Fu-An Chao | Tzu-I Wu | Yao-Ting Sung | Berlin Chen
Findings of the Association for Computational Linguistics: NAACL 2024
Automated speaking assessment (ASA) typically involves automatic speech recognition (ASR) and hand-crafted feature extraction from the ASR transcript of a learner’s speech. Recently, self-supervised learning (SSL) has shown stellar performance compared to traditional methods. However, SSL-based ASA systems are faced with at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels and non-uniform score intervals between different CEFR proficiency levels. To address these challenges, we explore the use of two novel modeling strategies: metric-based classification and loss re-weighting, leveraging distinct SSL-based embedding features. Extensive experimental results on the ICNALE benchmark dataset suggest that our approach can outperform existing strong baselines by a sizable margin, achieving a significant improvement of more than 10% in CEFR prediction accuracy.
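Loss re-weighting for an uneven distribution of proficiency levels is typically done with inverse-frequency class weights. The sketch below is one standard scheme, not necessarily the weighting used in the paper:

```python
def class_weights(labels, num_classes, smooth=1.0):
    # Inverse-frequency weights: rare CEFR levels get larger weights so the
    # loss is not dominated by the majority proficiency levels. The smooth
    # term avoids division by zero for unseen classes.
    counts = [smooth] * num_classes
    for y in labels:
        counts[y] += 1
    total = sum(counts)
    return [total / (num_classes * c) for c in counts]

def reweighted_nll(log_probs, target, weights):
    # Negative log-likelihood scaled by the target class's weight.
    return -weights[target] * log_probs[target]
```

With this normalization a perfectly balanced training set yields weights of 1.0 for every class, so the re-weighted loss reduces to the ordinary one.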
2023
Leveraging Dialogue Discourse Parsing in a Two-Stage Framework for Meeting Summarization
Yi-Ping Huang | Tien-Hong Lo | Berlin Chen
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)
Addressing the issue of Data Imbalance in Multi-granularity Pronunciation Assessment
Meng-Shin Lin | Hsin-Wei Wang | Tien-Hong Lo | Berlin Chen | Wei-Cheng Chao
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)
The NTNU ASR System for Formosa Speech Recognition Challenge 2023
Hao-Chien Lu | Chung-Chun Wang | Jhen-Ke Lin | Tien-Hong Lo
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)
2022
A Preliminary Study on Automated Speaking Assessment of English as a Second Language (ESL) Students
Tzu-I Wu | Tien-Hong Lo | Fu-An Chao | Yao-Ting Sung | Berlin Chen
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)
Due to the surge in global demand for English as a second language (ESL), the development of automated methods for grading speaking proficiency has gained considerable attention. This paper presents a computerized regime for grading the spontaneous spoken language of ESL learners. Based on a speech corpus of ESL learners recently collected in Taiwan, we first extract multi-view features (e.g., pronunciation, fluency, and prosody features) from either automatic speech recognition (ASR) transcriptions or audio signals. These extracted features are, in turn, fed into a tree-based classifier to produce a new set of indicative features as the input of the automated assessment system, viz. the grader. Finally, we use different machine learning models to predict ESL learners’ respective speaking proficiency and map the result into the corresponding CEFR level. The experimental results and analysis conducted on the speech corpus of ESL learners in Taiwan show that our approach holds great potential for use in automated speaking assessment, while offering more reliable predictions than human experts.
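The final step, mapping a continuous grader output onto a CEFR level, amounts to binning against a set of cutoffs. The cutoffs below are purely illustrative, not the ones used in the paper:

```python
# Illustrative score bands for a grader that outputs a value in roughly [0, 7];
# real systems would calibrate these cutoffs on rated data.
CUTOFFS = [(1.5, "A1"), (2.5, "A2"), (3.5, "B1"), (4.5, "B2"), (5.5, "C1")]

def to_cefr(score):
    # Return the first band whose upper cutoff exceeds the score;
    # anything above the last cutoff maps to C2.
    for cut, level in CUTOFFS:
        if score < cut:
            return level
    return "C2"
```

Because the bands need not be equally wide, this kind of mapping also accommodates the non-uniform score intervals between CEFR levels noted in the NAACL 2024 paper above.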
2021
The NTNU Taiwanese ASR System for Formosa Speech Recognition Challenge 2020
Fu-An Chao | Tien-Hong Lo | Shi-Yan Weng | Shih-Hsuan Chiu | Yao-Ting Sung | Berlin Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 26, Number 1, June 2021
A Preliminary Study on Environmental Sound Classification Leveraging Large-Scale Pretrained Model and Semi-Supervised Learning
You-Sheng Tsao | Tien-Hong Lo | Jiun-Ting Li | Shi-Yan Weng | Berlin Chen
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)
With the widespread commercialization of smart devices, research on environmental sound classification has gained more and more attention in recent years. In this paper, we set out to make effective use of a large-scale audio pretrained model and a semi-supervised model training paradigm for environmental sound classification. To this end, an environmental sound classification method is first put forward, whose component model is built on top of a large-scale audio pretrained model. Further, to simulate a low-resource sound classification setting where only limited supervised examples are made available, we instantiate the notion of transfer learning with a recently proposed training algorithm (namely, FixMatch) and a data augmentation method (namely, SpecAugment) to achieve the goal of semi-supervised model training. Experiments conducted on the benchmark dataset UrbanSound8K reveal that our classification method can lead to an accuracy improvement of 2.4% in relation to a current baseline method.
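FixMatch's selection rule is compact enough to sketch: an unlabeled example is kept only if the model's prediction on its weakly augmented view is sufficiently confident, and the argmax then serves as the pseudo-label for training on the strongly augmented (e.g., SpecAugment) view. A minimal sketch of that rule, with illustrative names:

```python
def fixmatch_select(weak_probs, threshold=0.95):
    # weak_probs: per-example class-probability lists from the weakly
    # augmented view. Keep only confident examples and pair each kept
    # index with its argmax class as the pseudo-label.
    selected = []
    for i, p in enumerate(weak_probs):
        conf = max(p)
        if conf >= threshold:
            selected.append((i, p.index(conf)))
    return selected
```

The high threshold is what keeps noisy pseudo-labels out early in training; as the model improves, more unlabeled examples clear the bar and contribute to the loss.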
2020
Exploiting Text Prompts for the Development of an End-to-End Computer-Assisted Pronunciation Training System
Yu-Sen Cheng | Tien-Hong Lo | Berlin Chen
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)
2019
探究端對端混合模型架構於華語語音辨識 (An Investigation of Hybrid CTC-Attention Modeling in Mandarin Speech Recognition)
Hsiu-Jui Chang | Wei-Cheng Chao | Tien-Hong Lo | Berlin Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 24, Number 1, June 2019
使用生成對抗網路於強健式自動語音辨識的應用(Exploiting Generative Adversarial Network for Robustness Automatic Speech Recognition)
Ming-Jhang Yang | Fu-An Chao | Tien-Hong Lo | Berlin Chen
Proceedings of the 31st Conference on Computational Linguistics and Speech Processing (ROCLING 2019)
探究端對端語音辨識於發音檢測與診斷(Investigating on Computer-Assisted Pronunciation Training Leveraging End-to-End Speech Recognition Techniques)
Hsiu-Jui Chang | Tien-Hong Lo | Tzu-En Liu | Berlin Chen
Proceedings of the 31st Conference on Computational Linguistics and Speech Processing (ROCLING 2019)
2018
結合鑑別式訓練與模型合併於半監督式語音辨識之研究 (Leveraging Discriminative Training and Model Combination for Semi-supervised Speech Recognition)
Tien-Hong Lo | Berlin Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 23, Number 2, December 2018
結合鑑別式訓練聲學模型之類神經網路架構及優化方法的改進 (Leveraging Discriminative Training and Improved Neural Network Architecture and Optimization Method)
Wei-Cheng Chao | Hsiu-Jui Chang | Tien-Hong Lo | Berlin Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 23, Number 2, December 2018
會議語音辨識使用語者資訊之語言模型調適技術 (On the Use of Speaker-Aware Language Model Adaptation Techniques for Meeting Speech Recognition) [In Chinese]
Ying-Wen Chen | Tien-Hong Lo | Hsiu-Jui Chang | Wei-Cheng Chao | Berlin Chen
Proceedings of the 30th Conference on Computational Linguistics and Speech Processing (ROCLING 2018)
探討聲學模型的合併技術與半監督鑑別式訓練於會議語音辨識之研究 (Investigating acoustic model combination and semi-supervised discriminative training for meeting speech recognition) [In Chinese]
Tien-Hong Lo | Berlin Chen
Proceedings of the 30th Conference on Computational Linguistics and Speech Processing (ROCLING 2018)
探討鑑別式訓練聲學模型之類神經網路架構及優化方法的改進 (Discriminative Training of Acoustic Models Leveraging Improved Neural Network Architecture and Optimization Method) [In Chinese]
Wei-Cheng Chao | Hsiu-Jui Chang | Tien-Hong Lo | Berlin Chen
Proceedings of the 30th Conference on Computational Linguistics and Speech Processing (ROCLING 2018)
2017
使用查詢意向探索與類神經網路於語音文件檢索之研究 (Exploring Query Intent and Neural Network modeling Techniques for Spoken Document Retrieval) [In Chinese]
Tien-Hong Lo | Ying-Wen Chen | Berlin Chen | Kuan-Yu Chen | Hsin-Min Wang
Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017)
語音文件檢索使用類神經網路技術 (On the Use of Neural Network Modeling Techniques for Spoken Document Retrieval) [In Chinese]
Tien-Hong Lo | Ying-Wen Chen | Kuan-Yu Chen | Hsin-Min Wang | Berlin Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 22, Number 2, December 2017-Special Issue on Selected Papers from ROCLING XXIX