Ting-An Hung
2025
Exploring Sentence Stress Detection using Whisper-based Speech Models
Ting-An Hung | Yu-Hsuan Hsieh | Tien-Hong Lo | Yung-Chang Hsu | Berlin Chen
Proceedings of the 37th Conference on Computational Linguistics and Speech Processing (ROCLING 2025)
Sentence stress reflects the relative prominence of words within a sentence. It is fundamental to speech intelligibility and naturalness, and is particularly important in second language (L2) learning. Accurate stress production facilitates effective communication and reduces misinterpretation. In this work, we investigate sentence stress detection (SSD) using Whisper-based transformer speech models under diverse settings, including model scaling, backbone–decoder interactions, architectural and regularization enhancements, and embedding visualization for interpretability. Results show that smaller Whisper variants achieve stronger performance under limited data, while architectural and regularization enhancements improve stability and generalization. Embedding analysis reveals clear separation between stressed and unstressed words. These findings offer practical insights into model selection, architecture design, and interpretability for SSD applications, with implications for L2 learning support tools.