Workshop on NLP for Music and Audio (2024)


Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA)

Anna Kruspe | Sergio Oramas | Elena V. Epure | Mohamed Sordo | Benno Weck | SeungHeon Doh | Minz Won | Ilaria Manco | Gabriel Meseguer-Brocal

Genre-Conformity in the Topics of Lyrics and Song Popularity
Anna Aljanaki

The genre of a song defines not only its musical (rhythmic, timbral, performative) aspects, but also the themes of its lyrics and the style of writing. The audience has certain expectations as to the emotional and thematic content of the genre they listen to. In this paper we use the Music4All database to investigate whether breaking these expectations influences song popularity. We use topic modeling to divide song lyrics into 36 clusters, and apply tag clustering to separate the songs into 15 musical genres. We observe that in some genres (metal, hip-hop) lyrics are mostly concentrated in specific topics, whereas in other genres they are spread over most topics. In most genres, songs whose lyrics are not representative of the genre are more popular than songs with genre-conforming lyrics.

PIAST: A Multimodal Piano Dataset with Audio, Symbolic and Text
Hayeon Bang | Eunjin Choi | Megan Finch | Seungheon Doh | Seolhee Lee | Gyeong-Hoon Lee | Juhan Nam

While piano music has become a significant area of study in Music Information Retrieval (MIR), there is a notable lack of datasets for piano solo music with text labels. To address this gap, we present PIAST (PIano dataset with Audio, Symbolic, and Text), a piano music dataset. Utilizing a piano-specific taxonomy of semantic tags, we collected 9,673 tracks from YouTube and added human annotations for 2,023 tracks by music experts, resulting in two subsets: PIAST-YT and PIAST-AT. Both include audio, text, tag annotations, and transcribed MIDI utilizing state-of-the-art piano transcription and beat tracking models. Among many possible tasks with the multimodal dataset, we conduct music tagging and retrieval using both audio and MIDI data and report baseline performances to demonstrate its potential as a valuable resource for MIR research.

Lyrics Transcription in Western Classical Music with Whisper: A Case Study on Schubert’s Winterreise
Hans-Ulrich Berendes | Simon Schwär | Meinard Müller

Automatic Lyrics Transcription (ALT) aims to transcribe sung words from music recordings and is closely related to Automatic Speech Recognition (ASR). Although not specifically designed for lyrics transcription, the state-of-the-art ASR model Whisper has recently proven effective for ALT and various related tasks in music information retrieval (MIR). This paper investigates Whisper’s performance on Western classical music, using the “Schubert Winterreise Dataset.” In particular, we found that the average Word Error Rate (WER) with the unmodified Whisper model is 0.56 for this dataset, while the performance varies greatly across songs and versions. In contrast, spoken versions of the song lyrics, which we recorded, are transcribed with a WER of 0.14. Further systematic experiments with source separation and time-scale modification techniques indicate that Whisper’s accuracy in lyrics transcription is less affected by the musical accompaniment and more by the singing style.
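
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Word Error Rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the abstract does not state which normalization or tooling the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: lowercasing and whitespace splitting are assumed
    normalization steps, not necessarily those used in the paper.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted and one dropped word out of four.
example = word_error_rate("a b c d", "a x c")
```

Under this definition, a WER of 0.56 means the number of word-level errors is a little over half the number of words in the reference lyrics.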

Harnessing High-Level Song Descriptors towards Natural Language-Based Music Recommendation
Elena V. Epure | Gabriel Meseguer Brocal | Darius Afchar | Romain Hennequin

Recommender systems relying on Language Models (LMs) have gained popularity in assisting users to navigate large catalogs. LMs often exploit high-level item descriptors, i.e., categories or consumption contexts, from training data or user preferences. This has been proven effective in domains like movies or products. In music, though, understanding of how effectively LMs utilize song descriptors for natural language-based music recommendation remains relatively limited. In this paper, we assess LMs' effectiveness in recommending songs based on user natural language requests and items with descriptors like genres, moods, and listening contexts. We formulate the recommendation as a dense retrieval problem and assess LMs as they become increasingly familiar with data pertinent to the task and domain. Our findings reveal improved performance as LMs are fine-tuned for general language similarity, information retrieval, and mapping longer descriptions to shorter, high-level descriptors in music.

NLP Analysis of Environmental Themes in Phish Lyrics Across Concert Locations and Years
Anna Farzindar | Jason Jarvis

This study applies advanced AI and natural language processing (NLP) techniques to analyze the lyrics of Phish, a renowned American jam band. Focusing on environmental themes within their extensive repertoire, this paper aims to uncover latent topics pertaining to environmental discourse using topic modeling and an environmental classifier. Through meticulous preprocessing, modeling, and interpretation, our findings shed light on the multifaceted portrayal of environmental issues in Phish’s lyrics. Our primary contribution lies in lyrical analysis, as well as visualization and interpretation of the topics their lyrics cover over the forty-plus years the band has existed. Our lyrical visualizations aim to facilitate an understanding of how Phish selects the timing and location for their live performances in relation to the themes present in their music.

A Retrieval Augmented Approach for Text-to-Music Generation
Robie Gonzales | Frank Rudzicz

Generative text-to-music models such as MusicGen are capable of generating high-fidelity music conditioned on a text prompt. However, expressing the essential features of music with text is a challenging task. In this paper, we present a retrieval-augmented approach for text-to-music generation. We first pre-compute a dataset of text-music embeddings obtained from a contrastive language-audio pretrained encoder (CLAP). Then, given an input text prompt, we retrieve the top k most similar musical aspects and augment the original prompt. This approach consistently generates music of higher audio quality as measured by the Fréchet Audio Distance. We analyze the internal representations of MusicGen and find that augmented prompts lead to greater diversity in token distributions and display high text adherence. Our findings show the potential for increased control in text-to-music generation.

Information Extraction of Music Entities in Conversational Music Queries
Simon Hachmeier | Robert Jäschke

The detection of music entities such as songs or performing artists in natural language queries is an important task when designing conversational music recommendation agents. Previous research has observed the applicability of named entity recognition approaches for this task based on pre-trained encoders like BERT. In recent years, large language models (LLMs) have surpassed these encoders in a variety of downstream tasks. In this paper, we validate the use of LLMs for information extraction of music entities in conversational queries by few-shot prompting. We test different numbers of examples and compare two sampling methods to obtain few-shot examples. Our results indicate that LLMs can achieve state-of-the-art performance in the task.

Leveraging User-Generated Metadata of Online Videos for Cover Song Identification
Simon Hachmeier | Robert Jäschke

YouTube is a rich source of cover songs. Since the platform itself is organized in terms of videos rather than songs, the retrieval of covers is not trivial. The field of cover song identification addresses this problem and provides approaches that usually rely on audio content. However, including the user-generated video metadata available on YouTube promises improved identification results. In this paper, we propose a multi-modal approach for cover song identification on online video platforms. We combine entity resolution models with audio-based approaches using a ranking model. Our findings imply that leveraging user-generated metadata can stabilize cover song identification performance on YouTube.

Can Impressions of Music be Extracted from Thumbnail Images?
Takashi Harada | Takehiro Motomitsu | Katsuhiko Hayashi | Yusuke Sakai | Hidetaka Kamigaito

In recent years, there has been a notable increase in research on machine learning models for music retrieval and generation systems that are capable of taking natural language sentences as inputs. However, there is a scarcity of large-scale publicly available datasets consisting of music data and their corresponding natural language descriptions, known as music captions. In particular, non-musical information such as suitable situations for listening to a track and the emotions elicited upon listening is crucial for describing music. This type of information is underrepresented in existing music caption datasets due to the challenges associated with extracting it directly from music data. To address this issue, we propose a method for generating music caption data that incorporates non-musical aspects inferred from music thumbnail images, and validate the effectiveness of our approach through human evaluations.

Raga Space Visualization: Analyzing Melodic Structures in Carnatic and Hindustani Music
Soham Korade | Suswara Pochampally | Saroja TK

The concept of raga in Indian classical music serves as a complex, multifaceted melodic entity that can be approached through various perspectives. Compositions within a raga act as foundational structures, serving as the bedrock for improvisations. Analyzing their textual notations is easier and more objective than analyzing audio samples. Many musical insights can be derived from the discrete swara sequences alone. This paper aims to construct an intuitive visualization of raga space, using swara sequences from raga compositions. Notations from public sources are normalized, and their TF-IDF features are projected into a low-dimensional space. This approach allows for qualitative analysis of both Carnatic and Hindustani ragas, mapping them to known raga theory.

Musical Ethnocentrism in Large Language Models
Anna Kruspe

Large Language Models (LLMs) reflect the biases in their training data and, by extension, those of the people who created this training data. Detecting, analyzing, and mitigating such biases is becoming a focus of research. One type of bias that has been understudied so far is geocultural bias. Such biases can be caused by an imbalance in the representation of different geographic regions and cultures in the training data, but also by value judgments contained therein. In this paper, we make a first step towards analyzing musical biases in LLMs, particularly ChatGPT and Mixtral. We conduct two experiments. In the first, we prompt LLMs to provide lists of the “Top 100” musical contributors of various categories and analyze their countries of origin. In the second experiment, we ask the LLMs to numerically rate various aspects of the musical cultures of different countries. Our results indicate a strong preference of the LLMs for Western music cultures in both experiments.

Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation
Dinh-Viet-Toan Le | Louis Bigo | Mikaela Keller

Byte-Pair Encoding (BPE) is an algorithm commonly used in Natural Language Processing to build a vocabulary of subwords, which has been recently applied to symbolic music. Given that symbolic music can differ significantly from text, particularly with polyphony, we investigate how BPE behaves with different types of musical content. This study provides a qualitative analysis of BPE’s behavior across various instrumentations and evaluates its impact on a musical phrase segmentation task for both monophonic and polyphonic music. Our findings show that the BPE training process is highly dependent on the instrumentation and that BPE “supertokens” succeed in capturing abstract musical content. In a musical phrase segmentation task, BPE notably improves performance in a polyphonic setting, but enhances performance in monophonic tunes only within a specific range of BPE merges.
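
As a rough illustration of the merge operation that BPE repeats, the sketch below counts adjacent token pairs in a small corpus and fuses the most frequent pair into a single "supertoken". The note-name tokens, the `+` joiner, and the helper names are hypothetical; the abstract does not specify the tokenization or implementation the authors used.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most common adjacent token pair, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "+" + seq[i + 1])  # hypothetical supertoken name
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


# Hypothetical monophonic corpus written as note-name tokens.
corpus = [["C4", "E4", "G4", "C4", "E4"], ["C4", "E4", "A4"]]
best = most_frequent_pair(corpus)
merged_corpus = [merge_pair(seq, best) for seq in corpus]
```

Repeating this step a fixed number of times grows the vocabulary by one supertoken per merge, which is the "number of BPE merges" the abstract refers to.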

Lyrics for Success: Embedding Features for Song Popularity Prediction
Giulio Prevedello | Ines Blin | Bernardo Monechi | Enrico Ubaldi

Accurate song success prediction is vital for the music industry, guiding promotion and label decisions. Early, accurate predictions are thus crucial for informed business actions. We investigated the predictive power of lyrics embedding features, alone and in combination with other stylometric features and various Spotify metadata (audio, platform, playlists, reactions). We compiled a dataset of 12,428 Spotify tracks and targeted popularity 15 days post-release. For the embeddings, we used a Large Language Model and compared different configurations. We found that integrating embeddings with other lyrics and audio features improved early-phase predictions, underscoring the importance of a comprehensive approach to success prediction.

The Role of Large Language Models in Musicology: Are We Ready to Trust the Machines?
Pedro Ramoneda | Emila Parada-Cabaleiro | Benno Weck | Xavier Serra

In this work, we explore the use and reliability of Large Language Models (LLMs) in musicology. From a discussion with experts and students, we assess the current acceptance of and concerns regarding this now-ubiquitous technology. We aim to go one step further, proposing a semi-automatic method to create an initial benchmark using retrieval-augmented generation models and multiple-choice question generation, validated by human experts. Our evaluation on 400 human-validated questions shows that current vanilla LLMs are less reliable than retrieval-augmented generation from music dictionaries. This paper suggests that realizing the potential of LLMs in musicology requires musicology-driven research that can specialize LLMs by including accurate and reliable domain knowledge.

“Does it Chug?” Towards a Data-Driven Understanding of Guitar Tone Description
Pratik Sutar | Jason Naradowsky | Yusuke Miyao

Evaluation of Pretrained Language Models on Music Understanding
Yannis Vasilakis | Rachel Bittner | Johan Pauwels

Music-text multimodal systems have enabled new approaches to Music Information Research (MIR) applications. Despite the reported success, there has been little effort in evaluating the musical knowledge of Large Language Models (LLMs). We demonstrate that LLMs suffer from prompt sensitivity, an inability to model negation, and sensitivity towards specific words. We quantify these properties as a triplet-based accuracy, evaluating the ability to model the relative similarity of labels in a hierarchical ontology. We leverage the AudioSet ontology to generate triplets consisting of an anchor, a positive, and a negative label for the genre/instruments sub-tree, and evaluate six general-purpose Transformer-based models. The triplets required filtering, as some were difficult to judge and therefore relatively uninformative for evaluation purposes. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.

FUTGA: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation
Junda Wu | Zachary Novack | Amit Namburi | Jiaheng Dai | Hao-Wen Dong | Zhouhang Xie | Carol Chen | Julian McAuley

We propose FUTGA, a model equipped with fine-grained music understanding capabilities through learning from generative augmentation with temporal compositions. We leverage existing music caption datasets and large language models (LLMs) to synthesize fine-grained music captions with structural descriptions and time boundaries for full-length songs. Augmented by the proposed synthetic dataset, FUTGA can identify the music’s temporal changes at key transition points and their musical functions, as well as generate detailed descriptions for each music segment. We further introduce a full-length music caption dataset generated by FUTGA, as an augmentation of the MusicCaps and Song Describer datasets. The experiments demonstrate the higher quality of the generated captions, which capture the time boundaries of long-form music.

The Interpretation Gap in Text-to-Music Generation Models
Yongyi Zang | Yixiao Zhang

Large-scale text-to-music generation models have significantly enhanced music creation capabilities, offering unprecedented creative freedom. However, their ability to collaborate effectively with human musicians remains limited. In this paper, we propose a framework to describe the musical interaction process, which includes expression, interpretation, and execution of controls. Following this framework, we argue that the primary gap between existing text-to-music models and musicians lies in the interpretation stage, where models lack the ability to interpret controls from musicians. We also propose two strategies to address this gap and call on the music information retrieval community to tackle the interpretation challenge to improve human-AI musical collaboration.