Shuwan Yang

2025

Multimodal emotion recognition in conversation (MERC) aims to identify speakers’ emotional states by utilizing text, audio, and visual modalities. Although recent large language model (LLM)-based methods have demonstrated strong performance, they typically adopt static fusion strategies that integrate all available modalities uniformly. This overlooks the fact that the necessity of multimodal cues can vary significantly across utterances. In this work, we propose an adaptive modality selection framework for MERC. The core of our approach is a modality selection module based on Group Relative Policy Optimization (GRPO), which enables a LoRA-tuned LLM to reason about the necessity of multimodal input via chain-of-thought (CoT) generation. This process does not require manually labeled modality selection data and is trained in a fully unsupervised manner. The selected modality configuration is then provided as input to a downstream emotion classifier, which is also implemented using a LoRA-tuned LLM and trained to predict emotional states. Experimental results on benchmark multimodal dialogue datasets show that our method consistently outperforms strong baselines, demonstrating the effectiveness of adaptive modality selection in improving recognition accuracy. Our code is available at https://github.com/youflyaway/Modality-Selection-Enhanced-LoRA-Tuned-LLMs.


Venues

Findings (1)
