@inproceedings{bellver-soler-etal-2025-cutting,
title = "Cutting Through Overload: Efficient Token Dropping for Speech Emotion Recognition in Multimodal Large Language Models",
author = "Bellver-Soler, Jaime and
Rodriguez-Cantelar, Mario and
C{\'o}rdoba, Ricardo and
D{'}Haro, Luis Fernando",
editor = "Torres, Maria Ines and
Matsuda, Yuki and
Callejas, Zoraida and
del Pozo, Arantza and
D{'}Haro, Luis Fernando",
booktitle = "Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology",
month = may,
year = "2025",
address = "Bilbao, Spain",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.iwsds-1.30/",
pages = "284--289",
ISBN = "979-8-89176-248-0",
abstract = "Recent developments in Multimodal Large Language Models (MLLMs) have provided novel insights into Speech Emotion Recognition (SER). However, combining high-dimensional speech signals with textual tokens can lead to a rapid growth in input tokens, increasing computational costs and inference times. This ``token overload'' also risks shadowing essential textual cues, affecting the reasoning capabilities of the language model and diluting emotional information crucial to accurate SER. In this paper, we explore different token drop methods that mitigate excessive token counts while preserving both emotional nuances and the core linguistic capabilities of the model. Specifically, we compare various pooling approaches to produce a compact representation. Our preliminary findings suggest that these techniques can reduce computational costs without decreasing SER accuracy."
}
MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="bellver-soler-etal-2025-cutting">
    <titleInfo>
      <title>Cutting Through Overload: Efficient Token Dropping for Speech Emotion Recognition in Multimodal Large Language Models</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Jaime</namePart>
      <namePart type="family">Bellver-Soler</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Mario</namePart>
      <namePart type="family">Rodriguez-Cantelar</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Ricardo</namePart>
      <namePart type="family">Córdoba</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Luis</namePart>
      <namePart type="given">Fernando</namePart>
      <namePart type="family">D’Haro</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Maria</namePart>
        <namePart type="given">Ines</namePart>
        <namePart type="family">Torres</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Yuki</namePart>
        <namePart type="family">Matsuda</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Zoraida</namePart>
        <namePart type="family">Callejas</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Arantza</namePart>
        <namePart type="family">del Pozo</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Luis</namePart>
        <namePart type="given">Fernando</namePart>
        <namePart type="family">D’Haro</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Bilbao, Spain</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-248-0</identifier>
    </relatedItem>
    <abstract>Recent developments in Multimodal Large Language Models (MLLMs) have provided novel insights into Speech Emotion Recognition (SER). However, combining high-dimensional speech signals with textual tokens can lead to a rapid growth in input tokens, increasing computational costs and inference times. This “token overload” also risks shadowing essential textual cues, affecting the reasoning capabilities of the language model and diluting emotional information crucial to accurate SER. In this paper, we explore different token drop methods that mitigate excessive token counts while preserving both emotional nuances and the core linguistic capabilities of the model. Specifically, we compare various pooling approaches to produce a compact representation. Our preliminary findings suggest that these techniques can reduce computational costs without decreasing SER accuracy.</abstract>
    <identifier type="citekey">bellver-soler-etal-2025-cutting</identifier>
    <location>
      <url>https://aclanthology.org/2025.iwsds-1.30/</url>
    </location>
    <part>
      <date>2025-05</date>
      <extent unit="page">
        <start>284</start>
        <end>289</end>
      </extent>
    </part>
  </mods>
</modsCollection>
Endnote
%0 Conference Proceedings
%T Cutting Through Overload: Efficient Token Dropping for Speech Emotion Recognition in Multimodal Large Language Models
%A Bellver-Soler, Jaime
%A Rodriguez-Cantelar, Mario
%A Córdoba, Ricardo
%A D’Haro, Luis Fernando
%Y Torres, Maria Ines
%Y Matsuda, Yuki
%Y Callejas, Zoraida
%Y del Pozo, Arantza
%Y D’Haro, Luis Fernando
%S Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology
%D 2025
%8 May
%I Association for Computational Linguistics
%C Bilbao, Spain
%@ 979-8-89176-248-0
%F bellver-soler-etal-2025-cutting
%X Recent developments in Multimodal Large Language Models (MLLMs) have provided novel insights into Speech Emotion Recognition (SER). However, combining high-dimensional speech signals with textual tokens can lead to a rapid growth in input tokens, increasing computational costs and inference times. This “token overload” also risks shadowing essential textual cues, affecting the reasoning capabilities of the language model and diluting emotional information crucial to accurate SER. In this paper, we explore different token drop methods that mitigate excessive token counts while preserving both emotional nuances and the core linguistic capabilities of the model. Specifically, we compare various pooling approaches to produce a compact representation. Our preliminary findings suggest that these techniques can reduce computational costs without decreasing SER accuracy.
%U https://aclanthology.org/2025.iwsds-1.30/
%P 284-289
Markdown (Informal)
[Cutting Through Overload: Efficient Token Dropping for Speech Emotion Recognition in Multimodal Large Language Models](https://aclanthology.org/2025.iwsds-1.30/) (Bellver-Soler et al., IWSDS 2025)
ACL
Jaime Bellver-Soler, Mario Rodriguez-Cantelar, Ricardo Córdoba, and Luis Fernando D’Haro. 2025. Cutting Through Overload: Efficient Token Dropping for Speech Emotion Recognition in Multimodal Large Language Models. In Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, pages 284–289, Bilbao, Spain. Association for Computational Linguistics.
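To make the abstract's phrase "pooling approaches to produce a compact representation" concrete, here is a minimal sketch of one common baseline: fixed-stride average pooling over frame-level speech embeddings, so fewer speech tokens reach the language model. This is not the authors' implementation; the function name `pool_speech_tokens`, the stride value, and the tensor shapes are illustrative assumptions.

```python
# Hedged sketch (not the paper's code): fixed-stride average pooling that
# compresses (batch, T, dim) speech-encoder frames into (batch, ceil(T/stride), dim)
# tokens before they are concatenated with the text tokens of an MLLM.
import torch
import torch.nn.functional as F

def pool_speech_tokens(frames: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """Mean-pool non-overlapping windows of `stride` frames.

    The last window is zero-padded, which slightly biases its mean toward zero.
    """
    batch, t, dim = frames.shape
    pad = (-t) % stride                   # pad T up to a multiple of stride
    if pad:
        frames = F.pad(frames, (0, 0, 0, pad))
    # (batch, T/stride, stride, dim) -> mean over the window axis
    return frames.view(batch, -1, stride, dim).mean(dim=2)

if __name__ == "__main__":
    speech = torch.randn(2, 250, 1024)    # e.g. 5 s of 50 Hz encoder frames
    compact = pool_speech_tokens(speech, stride=4)
    print(compact.shape)                  # torch.Size([2, 63, 1024])
```

With stride 4, a 250-frame utterance is compressed to 63 speech tokens, roughly a fourfold reduction in the speech share of the model's input; how aggressively one can pool before emotional cues are lost is the trade-off the paper's abstract highlights.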