The Interpretation Gap in Text-to-Music Generation Models

Yongyi Zang, Yixiao Zhang

Abstract
Large-scale text-to-music generation models have significantly enhanced music creation capabilities, offering unprecedented creative freedom. However, their ability to collaborate effectively with human musicians remains limited. In this paper, we propose a framework to describe the musical interaction process, which includes expression, interpretation, and execution of controls. Following this framework, we argue that the primary gap between existing text-to-music models and musicians lies in the interpretation stage, where models lack the ability to interpret controls from musicians. We also propose two strategies to address this gap and call on the music information retrieval community to tackle the interpretation challenge to improve human-AI musical collaboration.
Anthology ID:
2024.nlp4musa-1.18
Volume:
Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA)
Month:
November
Year:
2024
Address:
Oakland, USA
Editors:
Anna Kruspe, Sergio Oramas, Elena V. Epure, Mohamed Sordo, Benno Weck, SeungHeon Doh, Minz Won, Ilaria Manco, Gabriel Meseguer-Brocal
Venues:
NLP4MusA | WS
Publisher:
Association for Computational Linguistics
Pages:
112–118
URL:
https://aclanthology.org/2024.nlp4musa-1.18/
Cite (ACL):
Yongyi Zang and Yixiao Zhang. 2024. The Interpretation Gap in Text-to-Music Generation Models. In Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA), pages 112–118, Oakland, USA. Association for Computational Linguistics.
Cite (Informal):
The Interpretation Gap in Text-to-Music Generation Models (Zang & Zhang, NLP4MusA 2024)
PDF:
https://aclanthology.org/2024.nlp4musa-1.18.pdf