@inproceedings{tang-etal-2025-nota,
title = "{NOTA}: Multimodal Music Notation Understanding for Visual Large Language Model",
author = "Tang, Mingni and
Li, Jiajia and
Yang, Lu and
Zhang, Zhiqiang and
Tian, Jinhao and
Li, Zuchao and
Zhang, Lefei and
Wang, Ping",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-naacl.399/",
doi = "10.18653/v1/2025.findings-naacl.399",
pages = "7160--7173",
ISBN = "979-8-89176-195-7",
abstract = "Symbolic music is represented in two distinct forms: two-dimensional, visually intuitive score images, and one-dimensional, standardized text annotation sequences. While large language models have shown extraordinary potential in music, current research has primarily focused on unimodal symbol sequence text. Existing general-domain visual language models still lack the ability of music notation understanding. Recognizing this gap, we propose NOTA, the first large-scale comprehensive multimodal music notation dataset. It consists of 1,019,237 records, from 3 regions of the world, and contains 3 tasks. Based on the dataset, we trained NotaGPT, a music notation visual large language model. Specifically, we involve a pre-alignment training phase for cross-modal alignment between the musical notes depicted in music score images and their textual representation in ABC notation. Subsequent training phases focus on foundational music information extraction, followed by training on music score notation analysis. Experimental results demonstrate that our NotaGPT-7B achieves significant improvement on music understanding, showcasing the effectiveness of NOTA and the training pipeline."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="tang-etal-2025-nota">
<titleInfo>
<title>NOTA: Multimodal Music Notation Understanding for Visual Large Language Model</title>
</titleInfo>
<name type="personal">
<namePart type="given">Mingni</namePart>
<namePart type="family">Tang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jiajia</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lu</namePart>
<namePart type="family">Yang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhiqiang</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jinhao</namePart>
<namePart type="family">Tian</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zuchao</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lefei</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ping</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-04</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: NAACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Luis</namePart>
<namePart type="family">Chiruzzo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alan</namePart>
<namePart type="family">Ritter</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lu</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Albuquerque, New Mexico</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-195-7</identifier>
</relatedItem>
<abstract>Symbolic music is represented in two distinct forms: two-dimensional, visually intuitive score images, and one-dimensional, standardized text annotation sequences. While large language models have shown extraordinary potential in music, current research has primarily focused on unimodal symbol-sequence text. Existing general-domain visual language models still lack the ability to understand music notation. Recognizing this gap, we propose NOTA, the first large-scale comprehensive multimodal music notation dataset. It consists of 1,019,237 records from 3 regions of the world and covers 3 tasks. Based on the dataset, we trained NotaGPT, a music notation visual large language model. Specifically, we introduce a pre-alignment training phase for cross-modal alignment between the musical notes depicted in music score images and their textual representation in ABC notation. Subsequent training phases focus on foundational music information extraction, followed by training on music score notation analysis. Experimental results demonstrate that our NotaGPT-7B achieves significant improvements in music understanding, showcasing the effectiveness of NOTA and the training pipeline.</abstract>
<identifier type="citekey">tang-etal-2025-nota</identifier>
<identifier type="doi">10.18653/v1/2025.findings-naacl.399</identifier>
<location>
<url>https://aclanthology.org/2025.findings-naacl.399/</url>
</location>
<part>
<date>2025-04</date>
<extent unit="page">
<start>7160</start>
<end>7173</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
%A Tang, Mingni
%A Li, Jiajia
%A Yang, Lu
%A Zhang, Zhiqiang
%A Tian, Jinhao
%A Li, Zuchao
%A Zhang, Lefei
%A Wang, Ping
%Y Chiruzzo, Luis
%Y Ritter, Alan
%Y Wang, Lu
%S Findings of the Association for Computational Linguistics: NAACL 2025
%D 2025
%8 April
%I Association for Computational Linguistics
%C Albuquerque, New Mexico
%@ 979-8-89176-195-7
%F tang-etal-2025-nota
%X Symbolic music is represented in two distinct forms: two-dimensional, visually intuitive score images, and one-dimensional, standardized text annotation sequences. While large language models have shown extraordinary potential in music, current research has primarily focused on unimodal symbol-sequence text. Existing general-domain visual language models still lack the ability to understand music notation. Recognizing this gap, we propose NOTA, the first large-scale comprehensive multimodal music notation dataset. It consists of 1,019,237 records from 3 regions of the world and covers 3 tasks. Based on the dataset, we trained NotaGPT, a music notation visual large language model. Specifically, we introduce a pre-alignment training phase for cross-modal alignment between the musical notes depicted in music score images and their textual representation in ABC notation. Subsequent training phases focus on foundational music information extraction, followed by training on music score notation analysis. Experimental results demonstrate that our NotaGPT-7B achieves significant improvements in music understanding, showcasing the effectiveness of NOTA and the training pipeline.
%R 10.18653/v1/2025.findings-naacl.399
%U https://aclanthology.org/2025.findings-naacl.399/
%U https://doi.org/10.18653/v1/2025.findings-naacl.399
%P 7160-7173
Markdown (Informal)

[NOTA: Multimodal Music Notation Understanding for Visual Large Language Model](https://aclanthology.org/2025.findings-naacl.399/) (Tang et al., Findings 2025)

ACL

Mingni Tang, Jiajia Li, Lu Yang, Zhiqiang Zhang, Jinhao Tian, Zuchao Li, Lefei Zhang, and Ping Wang. 2025. NOTA: Multimodal Music Notation Understanding for Visual Large Language Model. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 7160–7173, Albuquerque, New Mexico. Association for Computational Linguistics.
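For context on the abstract's reference to ABC notation: below is a minimal, hypothetical ABC fragment (an invented tune, not taken from the paper or the NOTA dataset) illustrating the kind of one-dimensional text representation that the paper aligns with two-dimensional score images.

```
X:1
T:Illustrative Example (hypothetical)
M:4/4
L:1/8
K:C
C2 D2 E2 F2 | G4 c4 | B2 A2 G2 F2 | E8 |]
```

Header fields (X: tune index, T: title, M: meter, L: default note length, K: key) precede the tune body, where letters encode pitch and trailing numbers encode duration as multiples of the default note length.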