A Retrieval Augmented Approach for Text-to-Music Generation

Robie Gonzales, Frank Rudzicz


Abstract
Generative text-to-music models such as MusicGen are capable of generating high-fidelity music conditioned on a text prompt. However, expressing the essential features of music in text is challenging. In this paper, we present a retrieval-augmented approach to text-to-music generation. We first pre-compute a dataset of text-music embeddings obtained from a contrastive language-audio pretrained (CLAP) encoder. Then, given an input text prompt, we retrieve the top-k most similar musical aspects and augment the original prompt with them. This approach consistently generates music of higher audio quality as measured by the Fréchet Audio Distance. We analyze the internal representations of MusicGen and find that augmented prompts lead to greater diversity in token distributions and display high text adherence. Our findings show the potential for increased control in text-to-music generation.
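The retrieval step described above can be sketched as follows. This is a minimal illustration, not the authors' released code: it assumes the Hugging Face port of CLAP (ClapModel / ClapProcessor with the laion/clap-htsat-unfused checkpoint), uses a toy list of aspect strings in place of the paper's pre-computed embedding dataset, and joins retrieved aspects to the prompt with simple comma concatenation; the paper's actual corpus, choice of k, and prompt-fusion format may differ.

import torch
from transformers import ClapModel, ClapProcessor

# Assumed checkpoint: any CLAP text encoder with a shared
# text-audio embedding space would serve the same role.
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def clap_text_embed(texts):
    """L2-normalized CLAP text embeddings, shape (len(texts), d)."""
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

# Illustrative stand-in for the paper's pre-computed dataset of
# musical-aspect embeddings (the real corpus is not reproduced here).
aspects = [
    "driving four-on-the-floor drum beat",
    "warm analog synth pads",
    "syncopated funk bassline",
    "lush orchestral strings",
]
aspect_embs = clap_text_embed(aspects)  # pre-compute once, reuse per query

def augment_prompt(prompt, k=2):
    """Retrieve the top-k most similar aspects and append them to the prompt."""
    query = clap_text_embed([prompt])           # (1, d)
    sims = (aspect_embs @ query.T).squeeze(-1)  # cosine similarity (unit vectors)
    top = sims.topk(k).indices.tolist()
    return prompt + ", " + ", ".join(aspects[i] for i in top)

# The augmented string would then be passed to MusicGen in place of the
# raw prompt, e.g. augment_prompt("an upbeat disco track").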
Anthology ID:
2024.nlp4musa-1.6
Volume:
Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA)
Month:
November
Year:
2024
Address:
Oakland, USA
Editors:
Anna Kruspe, Sergio Oramas, Elena V. Epure, Mohamed Sordo, Benno Weck, SeungHeon Doh, Minz Won, Ilaria Manco, Gabriel Meseguer-Brocal
Venues:
NLP4MusA | WS
Publisher:
Association for Computational Linguistics
Pages:
31–36
URL:
https://aclanthology.org/2024.nlp4musa-1.6/
Cite (ACL):
Robie Gonzales and Frank Rudzicz. 2024. A Retrieval Augmented Approach for Text-to-Music Generation. In Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA), pages 31–36, Oakland, USA. Association for Computational Linguistics.
Cite (Informal):
A Retrieval Augmented Approach for Text-to-Music Generation (Gonzales & Rudzicz, NLP4MusA 2024)
PDF:
https://aclanthology.org/2024.nlp4musa-1.6.pdf