Evaluation of Pretrained Language Models on Music Understanding
Yannis Vasilakis | Rachel Bittner | Johan Pauwels
Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA), 2024
Music-text multimodal systems have enabled new approaches to Music Information Research (MIR) applications. Despite their reported success, little effort has been made to evaluate the musical knowledge of Large Language Models (LLMs). We demonstrate that LLMs suffer from prompt sensitivity, an inability to model negation, and sensitivity to specific words. We quantify these properties with a triplet-based accuracy that evaluates a model's ability to capture the relative similarity of labels in a hierarchical ontology. We leverage the AudioSet ontology to generate triplets, each consisting of an anchor, a positive, and a negative label drawn from the genre and instrument sub-trees, and evaluate six general-purpose Transformer-based models. The triplets required filtering, as some were difficult to judge and therefore relatively uninformative for evaluation purposes. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.
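The triplet-based accuracy described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the embedding function and the toy label vectors are hypothetical stand-ins, whereas the paper evaluates six Transformer-based models on triplets drawn from the AudioSet genre and instrument sub-trees.

```python
# Sketch of triplet-based accuracy: the fraction of (anchor, positive,
# negative) label triplets for which the anchor's embedding is closer
# (by cosine similarity) to the positive label than to the negative one.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def triplet_accuracy(triplets, embed):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = 0
    for anchor, positive, negative in triplets:
        a, p, n = embed(anchor), embed(positive), embed(negative)
        if cosine(a, p) > cosine(a, n):
            correct += 1
    return correct / len(triplets)

# Toy embeddings for demonstration only; a real evaluation would use
# label embeddings produced by the language model under test.
toy_vectors = {
    "rock": [1.0, 0.1],
    "heavy metal": [0.9, 0.2],
    "jazz": [0.1, 1.0],
}

# "heavy metal" should be closer to "rock" than "jazz" is, per the ontology.
triplets = [("rock", "heavy metal", "jazz")]
print(triplet_accuracy(triplets, toy_vectors.get))  # 1.0
```

A hierarchical ontology supplies the ground truth here: a positive label shares a nearer common ancestor with the anchor than the negative label does, so a model that captures the ontology's structure should score close to 1.0.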