@inproceedings{khanna-etal-2025-stud,
title = "{S}tu{D}: A Multimodal Approach for Stuttering Detection with {RAG} and Fusion Strategies",
author = "Khanna, Pragya and
Kommagouni, Priyanka and
Narasinga, Vamshi Raghu Simha and
Vuppala, Anil",
editor = "Inui, Kentaro and
Sakti, Sakriani and
Wang, Haofen and
Wong, Derek F. and
Bhattacharyya, Pushpak and
Banerjee, Biplab and
Ekbal, Asif and
Chakraborty, Tanmoy and
Singh, Dhirendra Pratap",
booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
month = dec,
year = "2025",
address = "Mumbai, India",
publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
url = "https://aclanthology.org/2025.ijcnlp-long.39/",
pages = "698--707",
isbn = "979-8-89176-298-5",
abstract = "Stuttering is a complex speech disorder that challenges both ASR systems and clinical assessment. We propose a multimodal stuttering detection and classification model that integrates acoustic and linguistic features through a two-stage fusion mechanism. Fine-tuned Wav2Vec 2.0 and HuBERT extract acoustic embeddings, which are early fused with MFCC features to capture fine-grained spectral and phonetic variations, while Llama-2 embeddings from Whisper ASR transcriptions provide linguistic context. To enhance robustness against out-of-distribution speech patterns, we incorporate Retrieval-Augmented Generation or adaptive classification. Our model achieves state-of-the-art performance on SEP-28k and FluencyBank, demonstrating significant improvements in detecting challenging stuttering events. Additionally, our analysis highlights the complementary nature of acoustic and linguistic modalities, reinforcing the need for multimodal approaches in speech disorder detection."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="khanna-etal-2025-stud">
<titleInfo>
<title>StuD: A Multimodal Approach for Stuttering Detection with RAG and Fusion Strategies</title>
</titleInfo>
<name type="personal">
<namePart type="given">Pragya</namePart>
<namePart type="family">Khanna</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Priyanka</namePart>
<namePart type="family">Kommagouni</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Vamshi</namePart>
<namePart type="given">Raghu</namePart>
<namePart type="given">Simha</namePart>
<namePart type="family">Narasinga</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Anil</namePart>
<namePart type="family">Vuppala</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-12</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics</title>
</titleInfo>
<name type="personal">
<namePart type="given">Kentaro</namePart>
<namePart type="family">Inui</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sakriani</namePart>
<namePart type="family">Sakti</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Haofen</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Derek</namePart>
<namePart type="given">F</namePart>
<namePart type="family">Wong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pushpak</namePart>
<namePart type="family">Bhattacharyya</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Biplab</namePart>
<namePart type="family">Banerjee</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Asif</namePart>
<namePart type="family">Ekbal</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dhirendra</namePart>
<namePart type="given">Pratap</namePart>
<namePart type="family">Singh</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>The Asian Federation of Natural Language Processing and The Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Mumbai, India</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-298-5</identifier>
</relatedItem>
<abstract>Stuttering is a complex speech disorder that challenges both ASR systems and clinical assessment. We propose a multimodal stuttering detection and classification model that integrates acoustic and linguistic features through a two-stage fusion mechanism. Fine-tuned Wav2Vec 2.0 and HuBERT extract acoustic embeddings, which are early fused with MFCC features to capture fine-grained spectral and phonetic variations, while Llama-2 embeddings from Whisper ASR transcriptions provide linguistic context. To enhance robustness against out-of-distribution speech patterns, we incorporate Retrieval-Augmented Generation or adaptive classification. Our model achieves state-of-the-art performance on SEP-28k and FluencyBank, demonstrating significant improvements in detecting challenging stuttering events. Additionally, our analysis highlights the complementary nature of acoustic and linguistic modalities, reinforcing the need for multimodal approaches in speech disorder detection.</abstract>
<identifier type="citekey">khanna-etal-2025-stud</identifier>
<location>
<url>https://aclanthology.org/2025.ijcnlp-long.39/</url>
</location>
<part>
<date>2025-12</date>
<extent unit="page">
<start>698</start>
<end>707</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T StuD: A Multimodal Approach for Stuttering Detection with RAG and Fusion Strategies
%A Khanna, Pragya
%A Kommagouni, Priyanka
%A Narasinga, Vamshi Raghu Simha
%A Vuppala, Anil
%Y Inui, Kentaro
%Y Sakti, Sakriani
%Y Wang, Haofen
%Y Wong, Derek F.
%Y Bhattacharyya, Pushpak
%Y Banerjee, Biplab
%Y Ekbal, Asif
%Y Chakraborty, Tanmoy
%Y Singh, Dhirendra Pratap
%S Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
%D 2025
%8 December
%I The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
%C Mumbai, India
%@ 979-8-89176-298-5
%F khanna-etal-2025-stud
%X Stuttering is a complex speech disorder that challenges both ASR systems and clinical assessment. We propose a multimodal stuttering detection and classification model that integrates acoustic and linguistic features through a two-stage fusion mechanism. Fine-tuned Wav2Vec 2.0 and HuBERT extract acoustic embeddings, which are early fused with MFCC features to capture fine-grained spectral and phonetic variations, while Llama-2 embeddings from Whisper ASR transcriptions provide linguistic context. To enhance robustness against out-of-distribution speech patterns, we incorporate Retrieval-Augmented Generation or adaptive classification. Our model achieves state-of-the-art performance on SEP-28k and FluencyBank, demonstrating significant improvements in detecting challenging stuttering events. Additionally, our analysis highlights the complementary nature of acoustic and linguistic modalities, reinforcing the need for multimodal approaches in speech disorder detection.
%U https://aclanthology.org/2025.ijcnlp-long.39/
%P 698-707
Markdown (Informal)
[StuD: A Multimodal Approach for Stuttering Detection with RAG and Fusion Strategies](https://aclanthology.org/2025.ijcnlp-long.39/) (Khanna et al., IJCNLP-AACL 2025)
ACL
- Pragya Khanna, Priyanka Kommagouni, Vamshi Raghu Simha Narasinga, and Anil Vuppala. 2025. StuD: A Multimodal Approach for Stuttering Detection with RAG and Fusion Strategies. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 698–707, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.