Synergizing Semantic Anchors and Ordinal Smoothed Cross-Entropy for Speech Fluency Classification

Mulati Kahaer; Sirajahmat Ruzmamat; XuDong Pang; Subinuer Maimaitituerxun; Zaokere Kadeer; Abudurexiti Reheman; Wenwen Lu; Panpan Zheng; Aishan Wumaier

doi:10.18653/v1/2026.findings-acl.1551

Abstract

Speech fluency is a core indicator of second language proficiency and a critical component of Computer-Assisted Pronunciation Training (CAPT) systems. Accurate assessment requires models to perceive both macroscopic speech flow trends and microscopic local anomalies. However, existing methods struggle to bridge the semantic gap between static expert priors and dynamic temporal representations, while often overlooking the inherent ordinal nature of fluency scores. To address these challenges, we first construct a set of expert features targeting fluency disruptions and rhythmic regularity to provide explicit linguistic priors. Building on this, we propose the Multimodal Multi-Stream Fusion Classification (MMSFC) network. It employs a Mutual Cross-Attention (MCA) mechanism that leverages these expert features as “semantic anchors” to actively guide Whisper’s temporal representations and integrate decoder contexts, achieving deep interaction between global priors and local dynamics. Furthermore, we propose the Ordinal Smoothed Cross-Entropy (OSCE) loss. By constructing distance-aware soft target distributions coupled with confidence-adaptive smoothing and boundary enhancement, OSCE explicitly models ordinal relationships to resolve boundary ambiguity. Experiments on SpeechOcean762 show MMSFC achieves 83.40% accuracy, significantly outperforming strong baselines. Notably, OSCE also demonstrates superior generalization potential in cross-domain CV and NLP tasks. Our code is available at https://github.com/speech26ai/MMSFCCode.
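The abstract describes OSCE as building distance-aware soft target distributions over ordered fluency classes. The following is a minimal sketch of that general idea only, not the authors' implementation: the exponential decay with ordinal distance, the temperature `tau`, the four-class setup, and all function names are assumptions made for illustration. The confidence-adaptive smoothing and boundary enhancement components named in the abstract are not specified there and are omitted.

```python
import math


def ordinal_soft_targets(true_class, num_classes, tau=1.0):
    """Soft targets whose mass decays with ordinal distance from the true class.

    Assumption: exponential decay exp(-|k - y| / tau); `tau` is a hypothetical
    temperature. Smaller values concentrate more mass on the true class.
    """
    weights = [math.exp(-abs(k - true_class) / tau) for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soft_cross_entropy(logits, targets):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


# Illustrative setup: 4 ordered fluency classes, true class index 2.
targets = ordinal_soft_targets(true_class=2, num_classes=4, tau=0.5)

# A confident prediction of the adjacent class (index 1) ...
near_miss = soft_cross_entropy([0.0, 3.0, 0.0, 0.0], targets)
# ... versus an equally confident prediction two classes away (index 0).
far_miss = soft_cross_entropy([3.0, 0.0, 0.0, 0.0], targets)

print(near_miss < far_miss)
```

Under these assumptions, a prediction that lands on a neighbouring class is penalized less than one that lands further away, which is the ordinal behaviour that a one-hot cross-entropy target cannot express.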

Anthology ID: 2026.findings-acl.1551
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 31018–31029
URL: https://aclanthology.org/2026.findings-acl.1551/
DOI: 10.18653/v1/2026.findings-acl.1551
Cite (ACL): Mulati Kahaer, Sirajahmat Ruzmamat, XuDong Pang, Subinuer Maimaitituerxun, Zaokere Kadeer, Abudurexiti Reheman, Wenwen Lu, Panpan Zheng, and Aishan Wumaier. 2026. Synergizing Semantic Anchors and Ordinal Smoothed Cross-Entropy for Speech Fluency Classification. In Findings of the Association for Computational Linguistics: ACL 2026, pages 31018–31029, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Synergizing Semantic Anchors and Ordinal Smoothed Cross-Entropy for Speech Fluency Classification (Kahaer et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1551.pdf
Checklist: 2026.findings-acl.1551.checklist.pdf
