BibTeX
@inproceedings{mao-etal-2025-deepresonance,
    title = "{D}eep{R}esonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning",
    author = "Mao, Zhuoyuan and
      Zhao, Mengjie and
      Wu, Qiyu and
      Wakaki, Hiromi and
      Mitsufuji, Yuki",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.653/",
    pages = "12937--12959",
    ISBN = "979-8-89176-332-6",
    abstract = "Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model{'}s ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-LLM fusion Transformer to enhance modality fusion prior to input into text LLMs, tailoring for multi-way instruction tuning. Our model achieves state-of-the-art performances across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We open-source the codes, models and datasets we constructed: https://github.com/sony/DeepResonance."
}

MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="mao-etal-2025-deepresonance">
    <titleInfo>
      <title>DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Zhuoyuan</namePart>
      <namePart type="family">Mao</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Mengjie</namePart>
      <namePart type="family">Zhao</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Qiyu</namePart>
      <namePart type="family">Wu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Hiromi</namePart>
      <namePart type="family">Wakaki</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Yuki</namePart>
      <namePart type="family">Mitsufuji</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Christos</namePart>
        <namePart type="family">Christodoulopoulos</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Tanmoy</namePart>
        <namePart type="family">Chakraborty</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Carolyn</namePart>
        <namePart type="family">Rose</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Violet</namePart>
        <namePart type="family">Peng</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Suzhou, China</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-332-6</identifier>
    </relatedItem>
    <abstract>Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model’s ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-LLM fusion Transformer to enhance modality fusion prior to input into text LLMs, tailoring for multi-way instruction tuning. Our model achieves state-of-the-art performances across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We open-source the codes, models and datasets we constructed: https://github.com/sony/DeepResonance.</abstract>
    <identifier type="citekey">mao-etal-2025-deepresonance</identifier>
    <location>
      <url>https://aclanthology.org/2025.emnlp-main.653/</url>
    </location>
    <part>
      <date>2025-11</date>
      <extent unit="page">
        <start>12937</start>
        <end>12959</end>
      </extent>
    </part>
  </mods>
</modsCollection>

Endnote
%0 Conference Proceedings
%T DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
%A Mao, Zhuoyuan
%A Zhao, Mengjie
%A Wu, Qiyu
%A Wakaki, Hiromi
%A Mitsufuji, Yuki
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-332-6
%F mao-etal-2025-deepresonance
%X Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model’s ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-LLM fusion Transformer to enhance modality fusion prior to input into text LLMs, tailoring for multi-way instruction tuning. Our model achieves state-of-the-art performances across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We open-source the codes, models and datasets we constructed: https://github.com/sony/DeepResonance.
%U https://aclanthology.org/2025.emnlp-main.653/
%P 12937-12959

Markdown (Informal)
[DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning](https://aclanthology.org/2025.emnlp-main.653/) (Mao et al., EMNLP 2025)

ACL
Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Hiromi Wakaki, and Yuki Mitsufuji. 2025. [DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning](https://aclanthology.org/2025.emnlp-main.653/). In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 12937–12959, Suzhou, China. Association for Computational Linguistics.