Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders

Weiqiao Shan; Yuang Li; Yuhao Zhang; Yingfeng Luo; Chen Xu; Xiaofeng Zhao; Long Meng; Yunfei Lu; Min Zhang; Hao Yang; Tong Xiao (肖桐); Jingbo Zhu

doi:10.18653/v1/2025.emnlp-main.974

Abstract

Connecting audio encoders to large language models (LLMs) lets the LLM perform a range of audio understanding tasks, such as automatic speech recognition (ASR) and audio captioning (AC). Most research focuses on training an adapter layer that produces a single, unified audio feature for the LLM. However, different tasks may require distinct features that emphasize either semantic or acoustic aspects, which makes task-specific audio features more desirable. In this paper, we propose Prompt-aware Mixture (PaM) to enhance Speech LLMs that use multiple audio encoders. Our approach uses different experts to extract different features according to the prompt, which indicates the task. Experiments demonstrate that, with PaM, a single Speech LLM surpasses the best performance achieved by any single-encoder Speech LLM on ASR, speaker number verification, and AC. PaM also outperforms other feature fusion baselines, such as concatenation and averaging.
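The abstract contrasts prompt-dependent feature mixing with two fusion baselines, concatenation and averaging. The sketch below illustrates that contrast on toy vectors. It is not the paper's architecture: the function names, the linear gate `gate_weights`, the prompt embeddings, and the softmax weighting over encoders are all assumptions chosen for illustration, and each "feature" is a single small vector rather than a frame sequence.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_fusion(features):
    """Baseline: element-wise mean of equally sized encoder features."""
    n = len(features)
    return [sum(vals) / n for vals in zip(*features)]


def concat_fusion(features):
    """Baseline: concatenate encoder features along the feature dimension."""
    return [v for feat in features for v in feat]


def prompt_aware_fusion(features, prompt_embedding, gate_weights):
    """Hypothetical prompt-conditioned mixture (an assumption, not the paper's method).

    One score per encoder is computed as a dot product between a gate row
    and the prompt embedding; the softmax of those scores weights the sum
    of the encoder features.
    """
    scores = [
        sum(w * p for w, p in zip(row, prompt_embedding))
        for row in gate_weights
    ]
    weights = softmax(scores)
    dim = len(features[0])
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return fused, weights


# Toy inputs: a "semantic" and an "acoustic" encoder output (hypothetical values).
semantic = [1.0, 0.0]
acoustic = [0.0, 1.0]
features = [semantic, acoustic]

# Hypothetical prompt embeddings and gate parameters.
asr_prompt = [1.0, 0.0]
caption_prompt = [0.0, 1.0]
gate_weights = [[2.0, -2.0], [-2.0, 2.0]]

asr_fused, asr_weights = prompt_aware_fusion(features, asr_prompt, gate_weights)
cap_fused, cap_weights = prompt_aware_fusion(features, caption_prompt, gate_weights)
```

With these toy values the ASR-style prompt puts most of the weight on the first encoder and the captioning-style prompt on the second, whereas `average_fusion` and `concat_fusion` return the same result regardless of the prompt.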

Anthology ID: 2025.emnlp-main.974
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 19305–19320
URL: https://aclanthology.org/2025.emnlp-main.974/
DOI: 10.18653/v1/2025.emnlp-main.974
Cite (ACL): Weiqiao Shan, Yuang Li, Yuhao Zhang, Yingfeng Luo, Chen Xu, Xiaofeng Zhao, Long Meng, Yunfei Lu, Min Zhang, Hao Yang, Tong Xiao, and JingBo Zhu. 2025. Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19305–19320, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders (Shan et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.974.pdf
Checklist: 2025.emnlp-main.974.checklist.pdf
