2024
An Effective Pronunciation Assessment Approach Leveraging Hierarchical Transformers and Pre-training Strategies
Bi-Cheng Yan | Jiun-Ting Li | Yi-Cheng Wang | Hsin-Wei Wang | Tien-Hong Lo | Yung-Chang Hsu | Wei-Cheng Chao | Berlin Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automatic pronunciation assessment (APA) quantifies a second-language (L2) learner’s pronunciation proficiency in a target language by providing fine-grained feedback with multiple pronunciation aspect scores at various linguistic levels. Most existing efforts on APA parallelize the modeling process, namely predicting multiple aspect scores across various linguistic levels simultaneously, which inevitably sidelines both the hierarchy of linguistic units and the relatedness among the pronunciation aspects. Recognizing this limitation, in this paper we first introduce HierTFR, a hierarchical APA method that jointly models the intrinsic structures of an utterance while considering the relatedness among the pronunciation aspects. We also propose a correlation-aware regularizer to strengthen the connection between the estimated scores and the human annotations. Furthermore, novel pre-training strategies tailored to different linguistic levels are put forward to facilitate better model initialization. An extensive set of empirical experiments conducted on the speechocean762 benchmark dataset suggests the feasibility and effectiveness of our approach in relation to several competitive baselines.
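The abstract does not spell out the form of the correlation-aware regularizer; the sketch below shows one plausible reading, a differentiable (1 - Pearson correlation) penalty added to an ordinary regression loss. The combination with MSE and the weight `lam` are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def correlation_regularizer(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Differentiable (1 - Pearson r) penalty between predicted and human scores."""
    pred_c = pred - pred.mean()
    target_c = target - target.mean()
    # Epsilon guards against zero variance in a degenerate batch.
    r = (pred_c * target_c).sum() / (pred_c.norm() * target_c.norm() + 1e-8)
    return 1.0 - r

def total_loss(pred: torch.Tensor, target: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Ordinary regression loss plus the correlation term, weighted by `lam` (hypothetical)."""
    return F.mse_loss(pred, target) + lam * correlation_regularizer(pred, target)
```

The penalty is zero when predictions correlate perfectly with the annotations and grows as the correlation weakens, which matches the stated goal of tying estimated scores more closely to human judgments.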
DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition
Yi-Cheng Wang | Hsin-Wei Wang | Bi-Cheng Yan | Chi-Han Lin | Berlin Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
End-to-end automatic speech recognition (E2E ASR) systems often suffer from mistranscription of domain-specific phrases, such as named entities, sometimes leading to catastrophic failures in downstream tasks. A family of fast and lightweight named entity correction (NEC) models for ASR has recently been proposed; these models normally build on phonetic-level edit distance algorithms and have shown impressive NEC performance. However, as the named entity (NE) list grows, the problem of phonetic confusion in the NE list is exacerbated; for example, homophone ambiguities increase substantially. In view of this, we propose a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions to provide additional information that helps mitigate phonetic confusion in NEC on ASR transcription. To this end, an efficient entity description augmented masked language model (EDA-MLM), comprised of a dense retrieval model, is introduced, enabling the MLM to adapt swiftly to domain-specific entities for the NEC task. A series of experiments conducted on the AISHELL-1 and Homophone datasets confirm the effectiveness of our modeling approach. DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC), by a relative character error rate (CER) reduction of about 7% on AISHELL-1 for named entities. More notably, when tested on the Homophone dataset, which contains named entities of high phonetic confusion, DANCER offers a more pronounced relative CER reduction of 46% over PED-NEC for named entities. The code is available at https://github.com/Amiannn/Dancer.
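For context, a PED-NEC-style corrector matches ASR output against the NE list by phonetic edit distance. Below is a minimal sketch of that matching step (the baseline, not DANCER itself), assuming a hypothetical grapheme-to-phoneme front end has already turned the suspect ASR span and each entity into phoneme sequences.

```python
from typing import List, Optional, Sequence

def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pa != pb)))   # substitution
        prev = cur
    return prev[-1]

def correct_entity(span_phones: List[str],
                   ne_list: List[List[str]],
                   max_dist: int = 2) -> Optional[int]:
    """Return the index of the phonetically closest NE, or None if none is close enough."""
    best_idx, best_dist = None, max_dist + 1
    for idx, ne_phones in enumerate(ne_list):
        d = edit_distance(span_phones, ne_phones)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx
```

As the NE list grows, many entries land at the same small distance from a misrecognized span, and homophones tie outright; this is exactly the ambiguity the entity descriptions in DANCER are intended to break.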
2023
Enhancing Automated English Speaking Assessment for L2 Speakers with BERT and Wav2vec2.0 Fusion
Wen-Hsuan Peng | Hsin-Wei Wang | Sally Chen | Berlin Chen
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)
Addressing the issue of Data Imbalance in Multi-granularity Pronunciation Assessment
Meng-Shin Lin | Hsin-Wei Wang | Tien-Hong Lo | Berlin Chen | Wei-Cheng Chao
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)
The NTNU Super Monster Team (SPMT) system for the Formosa Speech Recognition Challenge 2023 - Hakka ASR
Tzu-Ting Yang | Hsin-Wei Wang | Meng-Ting Tsai | Berlin Chen
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)
2022
Building an Enhanced Autoregressive Document Retriever Leveraging Supervised Contrastive Learning
Yi-Cheng Wang | Tzu-Ting Yang | Hsin-Wei Wang | Yung-Chang Hsu | Berlin Chen
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)
The goal of an information retrieval system is to retrieve the documents most relevant to a given user query from a huge collection of documents, which usually requires time-consuming comparisons between the query and many candidate documents to find the most relevant ones. Recently, a novel retrieval modeling approach, dubbed Differentiable Search Index (DSI), has been proposed. DSI dramatically simplifies the whole retrieval process by encoding all information about the document collection into the parameter space of a single Transformer model, on top of which DSI can in turn generate the relevant document identifiers (IDs) in an autoregressive manner in response to a user query. Although DSI addresses the shortcomings of traditional retrieval systems, previous studies have pointed out that DSI might fail to retrieve relevant documents because it uses document IDs as the pivotal mechanism for establishing the relationship between queries and documents, whereas not every document in the collection has corresponding relevant and irrelevant queries for training purposes. In view of this, we propose leveraging supervised contrastive learning to better render the relationship between queries and documents in the latent semantic space. Furthermore, an approximate nearest neighbor search strategy is employed at retrieval time to further assist the Transformer model in generating document IDs relevant to a posed query more efficiently. A series of experiments conducted on the Natural Questions benchmark dataset confirms the effectiveness and practical feasibility of our approach in relation to some strong baseline systems.
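The paper’s precise objective is not reproduced in this abstract; the following is a minimal SupCon-style sketch, assuming queries and documents are encoded into a shared embedding space and that a query counts every in-batch document carrying its target document ID as a positive. The temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(query_emb: torch.Tensor,
                                doc_emb: torch.Tensor,
                                doc_ids: torch.Tensor,
                                temperature: float = 0.05) -> torch.Tensor:
    """SupCon-style loss: pull each query toward every document sharing its ID."""
    q = F.normalize(query_emb, dim=-1)                # (B, d)
    d = F.normalize(doc_emb, dim=-1)                  # (B, d)
    logits = q @ d.T / temperature                    # (B, B) similarity matrix
    # Positives: document columns whose ID matches the query's target ID.
    pos = (doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Mean log-likelihood over each query's positives, averaged over the batch.
    return -(pos * log_prob).sum(dim=1).div(pos.sum(dim=1)).mean()
```

Shaping the latent space this way gives documents without annotated training queries a better-structured neighborhood, which is the gap in vanilla DSI that the abstract identifies.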
2021
Exploring the Integration of E2E ASR and Pronunciation Modeling for English Mispronunciation Detection
Hsin-Wei Wang | Bi-Cheng Yan | Yung-Chang Hsu | Berlin Chen
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)
There has been increasing demand for effective computer-assisted pronunciation training (CAPT) systems, which can provide feedback on mispronunciations and help second-language (L2) learners improve their speaking proficiency through repeated practice. Due to the shortage of non-native speech for training the automatic speech recognition (ASR) module of a CAPT system, mispronunciation detection performance is often degraded by imperfect ASR. Recognizing this, we put forward a two-stage mispronunciation detection method. In the first stage, the speech uttered by an L2 learner is processed by an end-to-end ASR module to produce N-best phone sequence hypotheses. In the second stage, these hypotheses are fed into a pronunciation model that seeks to faithfully predict the phone sequence most likely pronounced by the learner, so as to improve mispronunciation detection performance. Empirical experiments conducted on an English benchmark dataset seem to confirm the utility of our method.
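As a rough illustration of the two-stage idea, the sketch below rescores the N-best phone hypotheses with a stand-in pronunciation-model scorer and then diffs the winner against the canonical phone sequence. The `pron_model_score` callable, the additive score combination, and the equal-length zip alignment are all simplifying assumptions, not the paper’s actual formulation.

```python
from typing import Callable, List, Tuple

def detect_mispronunciations(
    nbest: List[Tuple[List[str], float]],             # (phone hypothesis, ASR score)
    pron_model_score: Callable[[List[str]], float],   # hypothetical stage-two scorer
    canonical: List[str],                             # phones the learner should have said
) -> List[Tuple[str, str]]:
    """Pick the hypothesis the pronunciation model favors, then diff it against canonical."""
    # Stage 2: rescore the N-best list with the pronunciation model.
    best, _ = max(nbest, key=lambda h: h[1] + pron_model_score(h[0]))
    # Flag positions where the selected phone deviates from the canonical phone.
    # (A real system would align the two sequences; zip assumes equal length for brevity.)
    return [(c, p) for c, p in zip(canonical, best) if c != p]
```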