A Multi-faceted Statistical Analysis for Logit-based Pronunciation Assessment

Chieh-Ren Liao; Berlin Chen

Abstract

The Goodness of Pronunciation (GOP) score is a key technique for pronunciation quality assessment in computer-assisted language learning. Recent studies have shown that computing GOP scores directly from the acoustic model's raw output logits outperforms traditional softmax-probability-based methods, because logits avoid probability saturation and retain richer discriminative information. However, existing logit-based methods mostly rely on basic statistics such as maxima, means, or variances, which neglect the more complex distributional and temporal characteristics of logit sequences over phoneme durations. To capture the pronunciation details embedded in logit sequences more comprehensively, this study proposes a multi-faceted statistical analysis method. We explore five higher-order statistical indicators that describe different characteristics of logit sequences: (1) moment-generating functions to compute distribution skewness and kurtosis; (2) information theory, using entropy to quantify model uncertainty; (3) Gaussian mixture models (GMMs) to fit multimodal distributions of logits; (4) time-series analysis, computing autocorrelation coefficients to measure logit stability; and (5) extreme value theory, using top-k averaging to obtain more robust peak-confidence estimates. We conduct experiments on the public L2 English speech corpus SpeechOcean762, comparing the proposed statistical indicators with baseline methods from the literature (GOP_MaxLogit, GOP_margin). Preliminary results show that some higher-order statistical indicators, particularly those that describe logit-sequence stability and distribution shape, achieve higher accuracy on pronunciation-error detection tasks and correlate more strongly with human expert ratings. This study demonstrates that deeper statistical modeling of logit sequences is an effective approach to improving the performance of automated pronunciation assessment systems.
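
As a rough illustration of four of the five indicators listed in the abstract, the sketch below computes skewness, excess kurtosis, mean posterior entropy, lag-1 autocorrelation, and a top-k mean for a single phoneme segment. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `logit_sequence_stats`, the (frames, classes) input layout, the choice of lag 1, the default `k=3`, and computing entropy on the per-frame softmax posterior are all assumptions made here, since the abstract does not give exact definitions. The GMM-based indicator is omitted to keep the example dependent on NumPy only.

```python
import numpy as np


def logit_sequence_stats(frame_logits, target, k=3):
    """Summarize the logit sequence of one phoneme segment (hypothetical helper).

    frame_logits: array-like of shape (T, C), raw logits for T frames, C classes.
    target: column index of the canonical (expected) phoneme.
    k: number of largest target logits to average (assumed default).
    """
    frame_logits = np.asarray(frame_logits, dtype=float)
    x = frame_logits[:, target]              # target-phoneme logit per frame
    mu, sigma = x.mean(), x.std()

    # (1) Distribution shape: standardized third and fourth moments.
    if sigma > 0:
        z = (x - mu) / sigma
        skewness = float(np.mean(z ** 3))
        excess_kurtosis = float(np.mean(z ** 4) - 3.0)
    else:                                     # constant sequence: no shape to measure
        skewness, excess_kurtosis = 0.0, 0.0

    # (2) Uncertainty: entropy of each frame's softmax posterior, averaged.
    # Assumption: the abstract does not say which distribution the entropy uses.
    shifted = frame_logits - frame_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    mean_entropy = float(-(np.exp(log_probs) * log_probs).sum(axis=1).mean())

    # (4) Stability: lag-1 autocorrelation of the target logit sequence.
    if len(x) > 1 and sigma > 0:
        d = x - mu
        autocorr_lag1 = float((d[:-1] * d[1:]).sum() / (d ** 2).sum())
    else:
        autocorr_lag1 = 0.0

    # (5) Peak confidence: mean of the k largest target logits.
    kk = min(k, len(x))
    topk_mean = float(np.sort(x)[-kk:].mean())

    return {
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
        "mean_entropy": mean_entropy,
        "autocorr_lag1": autocorr_lag1,
        "topk_mean": topk_mean,
    }
```

With `k=1` the top-k mean reduces to the maximum target logit over the segment, so larger `k` can be read as a smoothed variant of a max-based score. How these features are combined into a final GOP score or classifier input is not specified in the abstract and is left open here.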

Anthology ID: 2025.rocling-main.35
Volume: Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
Month: November
Year: 2025
Address: National Taiwan University, Taipei City, Taiwan
Editors: Kai-Wei Chang, Ke-Han Lu, Chih-Kai Yang, Zhi-Rui Tam, Wen-Yu Chang, Chung-Che Wang
Venue: ROCLING
Publisher: Association for Computational Linguistics
Pages: 326–333
URL: https://aclanthology.org/2025.rocling-main.35/
Cite (ACL): Chieh-Ren Liao and Berlin Chen. 2025. A Multi-faceted Statistical Analysis for Logit-based Pronunciation Assessment. In Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025), pages 326–333, National Taiwan University, Taipei City, Taiwan. Association for Computational Linguistics.
Cite (Informal): A Multi-faceted Statistical Analysis for Logit-based Pronunciation Assessment (Liao & Chen, ROCLING 2025)
PDF: https://aclanthology.org/2025.rocling-main.35.pdf
