Yung-Chang Hsu

2025

HiPPO: Exploring A Novel Hierarchical Pronunciation Assessment Approach for Spoken Languages
Bi-Cheng Yan | Hsin Wei Wang | Fu-An Chao | Tien-Hong Lo | Yung-Chang Hsu | Berlin Chen
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Automatic pronunciation assessment (APA) seeks to quantify a second language (L2) learner’s pronunciation proficiency in a target language by offering timely and fine-grained diagnostic feedback. Most existing efforts on APA have predominantly concentrated on highly constrained reading-aloud tasks (where learners are prompted to read a reference text aloud); however, assessing pronunciation quality in unscripted speech (or free-speaking scenarios) remains relatively underexplored. In light of this, we first propose HiPPO, a hierarchical pronunciation assessment model tailored for spoken languages, which evaluates an L2 learner’s oral proficiency at multiple linguistic levels based solely on the speech uttered by the learner. To improve the overall accuracy of assessment, a contrastive ordinal regularizer and a curriculum learning strategy are introduced for model training. The former aims to generate score-discriminative features by exploiting the ordinal nature of regression targets, while the latter gradually ramps up the training complexity to facilitate the assessment task that takes unscripted speech as input. Experiments conducted on the Speechocean762 benchmark dataset validates the feasibility and superiority of our method in relation to several cutting-edge baselines.

pdf bib abs

CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition
Hung-Yang Sung | Chien-Chun Wang | Kuan-Tang Huang | Tien-Hong Lo | Yu-Sheng Tsao | Yung-Chang Hsu | Berlin Chen
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)

Automatic speech recognition (ASR) for low-resource languages such as Taiwanese Hokkien is difficult due to the scarcity of annotated data. However, direct fine-tuning on Han-character transcriptions often fails to capture detailed phonetic and tonal cues, while training only on romanization lacks lexical and syntactic coverage. In addition, prior studies have rarely explored staged strategies that integrate both annotation types. To address this gap, we present CLiFT-ASR, a cross-lingual fine-tuning framework that builds on Mandarin HuBERT models and progressively adapts them to Taiwanese Hokkien. The framework employs a two-stage process in which it first learns acoustic and tonal representations from phonetic Tai-lo annotations and then captures vocabulary and syntax from Han-character transcriptions. This progressive adaptation enables effective alignment between speech sounds and orthographic structures. Experiments on the TAT-MOE corpus demonstrate that CLiFT-ASR achieves a 24.88% relative reduction in character error rate (CER) compared with strong baselines. The results indicate that CLiFT-ASR provides an effective and parameter-efficient solution for Taiwanese Hokkien ASR and that it has potential to benefit other low-resource language scenarios.

pdf bib abs

Effective Speaker Diarization Leveraging Multi-task Logarithmic Loss Objectives
Jhih-Rong Guo | Tien-Hong Lo | Yu-Sheng Tsao | Pei-Ying Lee | Yung-Chang Hsu | Berlin Chen
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)

End-to-End Neural Diarization (EEND) has undergone substantial development, particularly with powerset classification methods that enhance performance but can exacerbate speaker confusion. To address this, we propose a novel training strategy that complements the standard cross entropy loss with an auxiliary ordinal log loss, guided by a distance matrix of speaker combinations. Our experiments reveal that while this approach yields significant relative improvements of 15.8% in false alarm rate and 10.0% in confusion error rate, it also uncovers a critical trade-off with an increased missed error rate. The primary contribution of this work is the identification and analysis of this trade-off, which stems from the model adopting a more conservative prediction strategy. This insight is crucial for designing more balanced and effective loss functions in speaker diarization.

pdf bib abs

Exploring Sentence Stress Detection using Whisper-based Speech Models
Ting-An Hung | Yu-Hsuan Hsieh | Tien-Hong Lo | Yung-Chang Hsu | Berlin Chen
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)

Sentence stress reflects the relative prominence of words within a sentence. It is fundamental to speech intelligibility and naturalness, and is particularly important in second language (L2) learning. Accurate stress production facilitates effective communication and reduces misinterpretation. In this work, we investigate sentence stress detection (SSD) using Whisper-based transformer speech models under diverse settings, including model scaling, backbone–decoder interactions, architectural and regularization enhancements, and embedding visualization for interpretability. Results show that smaller Whisper variants achieve stronger performance under limited data, while architectural and regularization enhancements improves stability and generalization. Embedding analysis reveal clear separation between stressed and unstressed words. These findings offer practical insights into model selection, architecture design, and interpretability for SSD applications, with implications for L2 learning support tools.

2024

pdf bib abs

An Effective Pronunciation Assessment Approach Leveraging Hierarchical Transformers and Pre-training Strategies
Bi-Cheng Yan | Jiun-Ting Li | Yi-Cheng Wang | Hsin Wei Wang | Tien-Hong Lo | Yung-Chang Hsu | Wei-Cheng Chao | Berlin Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automatic pronunciation assessment (APA) manages to quantify a second language (L2) learner’s pronunciation proficiency in a target language by providing fine-grained feedback with multiple pronunciation aspect scores at various linguistic levels. Most existing efforts on APA typically parallelize the modeling process, namely predicting multiple aspect scores across various linguistic levels simultaneously. This inevitably makes both the hierarchy of linguistic units and the relatedness among the pronunciation aspects sidelined. Recognizing such a limitation, we in this paper first introduce HierTFR, a hierarchal APA method that jointly models the intrinsic structures of an utterance while considering the relatedness among the pronunciation aspects. We also propose a correlation-aware regularizer to strengthen the connection between the estimated scores and the human annotations. Furthermore, novel pre-training strategies tailored for different linguistic levels are put forward so as to facilitate better model initialization. An extensive set of empirical experiments conducted on the speechocean762 benchmark dataset suggest the feasibility and effectiveness of our approach in relation to several competitive baselines.

2022

pdf bib abs

Building an Enhanced Autoregressive Document Retriever Leveraging Supervised Contrastive Learning
Yi-Cheng Wang | Tzu-Ting Yang | Hsin-Wei Wang | Yung-Chang Hsu | Berlin Chen
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

The goal of an information retrieval system is to retrieve documents that are most relevant to a given user query from a huge collection of documents, which usually requires time-consuming multiple comparisons between the query and candidate documents so as to find the most relevant ones. Recently, a novel retrieval modeling approach, dubbed Differentiable Search Index (DSI), has been proposed. DSI dramatically simplifies the whole retrieval process by encoding all information about the document collection into the parameter space of a single Transformer model, on top of which DSI can in turn generate the relevant document identities (IDs) in an autoregressive manner in response to a user query. Although DSI addresses the shortcomings of traditional retrieval systems, previous studies have pointed out that DSI might fail to retrieve relevant documents because DSI uses the document IDs as the pivotal mechanism to establish the relationship between queries and documents, whereas not every document in the document collection has its corresponding relevant and irrelevant queries for the training purpose. In view of this, we put forward to leveraging supervised contrastive learning to better render the relationship between queries and documents in the latent semantic space. Furthermore, an approximate nearest neighbor search strategy is employed at retrieval time to further assist the Transformer model in generating document IDs relevant to a posed query more efficiently. A series of experiments conducted on the Nature Question benchmark dataset confirm the effectiveness and practical feasibility of our approach in relation to some strong baseline systems.

2021

pdf bib abs

Exploring the Integration of E2E ASR and Pronunciation Modeling for English Mispronunciation Detection
Hsin-Wei Wang | Bi-Cheng Yan | Yung-Chang Hsu | Berlin Chen
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

There has been increasing demand to develop effective computer-assisted language training (CAPT) systems, which can provide feedback on mispronunciations and facilitate second-language (L2) learners to improve their speaking proficiency through repeated practice. Due to the shortage of non-native speech for training the automatic speech recognition (ASR) module of a CAPT system, the corresponding mispronunciation detection performance is often affected by imperfect ASR. Recognizing this importance, we in this paper put forward a two-stage mispronunciation detection method. In the first stage, the speech uttered by an L2 learner is processed by an end-to-end ASR module to produce N-best phone sequence hypotheses. In the second stage, these hypotheses are fed into a pronunciation model which seeks to faithfully predict the phone sequence hypothesis that is most likely pronounced by the learner, so as to improve the performance of mispronunciation detection. Empirical experiments conducted a English benchmark dataset seem to confirm the utility of our method.

pdf bib abs

A Study on Contextualized Language Modeling for Machine Reading Comprehension
Chin-Ying Wu | Yung-Chang Hsu | Berlin Chen
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

With the recent breakthrough of deep learning technologies, research on machine reading comprehension (MRC) has attracted much attention and found its versatile applications in many use cases. MRC is an important natural language processing (NLP) task aiming to assess the ability of a machine to understand natural language expressions, which is typically operationalized by first asking questions based on a given text paragraph and then receiving machine-generated answers in accordance with the given context paragraph and questions. In this paper, we leverage two novel pretrained language models built on top of Bidirectional Encoder Representations from Transformers (BERT), namely BERT-wwm and MacBERT, to develop effective MRC methods. In addition, we also seek to investigate whether additional incorporation of the categorical information about a context paragraph can benefit MRC or not, which is achieved based on performing context paragraph clustering on the training dataset. On the other hand, an ensemble learning approach is proposed to harness the synergistic power of the aforementioned two BERT-based models so as to further promote MRC performance.

Yung-Chang Hsu

2025

2024

2022

2021

2020

Co-authors

Venues