ALCAP: Alignment-Augmented Music Captioner

Music captioning has gained significant attention in the wake of the rising prominence of streaming media platforms. Traditional approaches often prioritize either the audio or the lyrics aspect of the music, inadvertently ignoring the intricate interplay between the two. However, a comprehensive understanding of music necessitates the integration of both these elements. In this study, we delve into this overlooked realm by introducing a method to systematically learn multimodal alignment between audio and lyrics through contrastive learning. This not only recognizes and emphasizes the synergy between audio and lyrics but also paves the way for models to achieve deeper cross-modal coherence, thereby producing high-quality captions. We provide both theoretical and empirical results demonstrating the advantage of the proposed method, which achieves a new state-of-the-art on two music captioning datasets.


Introduction
Learning to interpret music from audio and lyrics has become an increasingly attractive research area in both the music and natural language processing communities [16,31]. The insights gained from this research into multimodal representation learning have a wide range of applications, such as streaming media discovery [24] and music recommendation with detailed, human-like descriptions [2], making the dynamics of search and recommendation engines more explainable. However, captioning music is a challenging task, as the multimodal inputs contain ambiguous and repetitive lyrics, as well as complex audio signals mixed from multiple tracks of information.
Previous works on music captioning have primarily focused on improving individual components of the encoder-decoder architecture, such as developing a music encoder, implementing attention mechanisms, and using beam search. However, little effort has been directed towards leveraging the correspondence between audio and lyrics, which could provide useful information for generating high-quality captions. Some works, such as [31], leverage the multimodal information from both lyrics and music through a cross-modal attention module, but the two modalities are not aligned before fusion. In reality, audio and lyrics are only loosely aligned, making them imperfect sources of data for existing multimodal learning methods that lack multimodal alignment mechanisms [17,30]. For example, it is common for composers and lyricists to work separately in the music industry, resulting in different lyrics fitting the same melody. Additionally, the same words can express diametrically opposite emotions when set to different song patterns and styles. Therefore, we believe that accurate and comprehensive music interpretation should leverage the subtle connections between music and lyrics.
In response to these challenges, we propose to improve music understanding by aligning audio-lyrics pairs with contrastive learning before modality fusion. The idea is that within the same input batch, paired audio and lyrics should be brought close together in the latent space, while non-paired ones should be pushed apart. By adding a contrastive loss, the multimodal input pairs are forced to be more aligned, which in turn guides the model to achieve stronger cross-modal consistency and a more meaningful fused latent space. To this end, we propose the Alignment Augmented Music Captioner (ALCAP), an extension of BART-fusion [31] with an alignment augmentation module. We provide a theoretical explanation, from an information bottleneck perspective, of why the proposed alignment module improves generalization. We conduct extensive experiments on the Song Interpretation Dataset [31] and the NetEase Cloud Music Review Dataset. On the Song Interpretation Dataset, ALCAP improves the state-of-the-art from 24.7 to 27.1 on ROUGE-L and from 22.6 to 27.7 on METEOR; on the NetEase Cloud Music Review Dataset, it achieves margins of 1.7 on ROUGE-L and 0.9 on METEOR over the baseline, demonstrating the effectiveness of our approach. We also observe performance improvements in cross-modal text-music retrieval, a common application scenario in industry, which provides an indirect perspective for evaluating caption quality. Lastly, we explore the effect of the contrastive loss weight on model performance via grid search, and conclude our ablation study by showing, through visualization analysis, that our proposed multimodal alignment module leads to more concentrated attention on language tokens.
Our contributions are summarized as follows:
• To the best of our knowledge, we are the first to propose an alignment augmentation module based on cross-modal contrastive learning between music and lyrics for music captioning. By learning the interactions between the two modalities in an unsupervised manner, the model is guided to learn better cross-modal attention weights for a meaningful fused latent space.
• We provide a theoretical justification for the improved generalization of the proposed multimodal alignment module from an information theory perspective.
• Extensive experiments on two music captioning datasets demonstrate the effectiveness of our proposed alignment augmentation module, and we set a new state-of-the-art on the Song Interpretation Dataset. We also conduct several ablation experiments to study the effect of different contrastive learning weights on model performance.
Related Work

Alignment Aware Representation Learning
Multimodal representation learning has become increasingly important as modern intelligent applications require a comprehensive understanding of vision, language, and speech. To learn meaningful latent spaces, unsupervised alignment between inputs of different modalities has proven effective as an additional layer of structural information about the data. In work on pretraining for speech synthesis [3], aligning the acoustic and phoneme inputs makes the model more capable of learning cross-modal attention weights, thereby improving the quality of acoustic signal reconstruction. ALBEF [15] proposes to align vision and language before modality fusion, purifying the multimodal input pairs and thus yielding more grounded vision and language representations. This approach can be interpreted as maximizing mutual information among different views of the same vision-language pair. µ-VLA [32] introduces image-text-level and region-phrase-level alignment in vision-and-language pretraining so as to make the most of unpaired data. [8] proposes a retrieval process operating on past experiences to provide the agent with contextually relevant information, improving sample efficiency and the representation learning of the policy function. This demonstrates the effectiveness of a retrieval-augmented module in continuous decision making, which also applies to word sequence generation [22,9,29,12].
Comparison between ALCAP and CLIP [20]. While both models employ contrastive learning for multimodal alignment, their objectives and applications are distinct. CLIP is designed for image-text understanding and generation, leveraging a large dataset of images with their corresponding textual captions. In contrast, ALCAP focuses specifically on music captioning by aligning audio and lyrics pairs using cross-modal contrastive learning before modality fusion. This alignment augmentation module in ALCAP is tailored to address the unique challenges posed by the ambiguous and repetitive nature of lyrics, as well as the complexity of audio signals in music. Furthermore, ALCAP extends the BART-fusion architecture with an alignment augmentation module, which is absent from CLIP. This module enables ALCAP to learn better cross-modal attention weights for a more meaningful fused latent space, resulting in improved music captioning performance. We also provide a theoretical explanation for the improved generalization of the proposed multimodal alignment module from an information bottleneck perspective, which has not been addressed in CLIP's domain of image-text understanding.

Multimodal Music Captioning
Music captioning is a challenging task, as it requires the model not only to comprehensively understand both the music and the corresponding lyrics but also to avoid overfitting on the limited music-lyrics pairs available due to copyright restrictions. MusCaps [16] is the first to address music captioning from an audio captioning perspective, using a multimodal input encoder-decoder architecture based on LSTM [11]. While MusCaps achieves a performance boost in caption generation, its predicted word sequences are limited to 20 tokens, which narrows the approach's applicability and makes it unsuitable for our long, human-like language composition scenario. One of the works most relevant to ours is BART-fusion [31], which is built on top of BART [14] by adding a music encoder and a modality fusion module. However, BART-fusion fails to fully mine the relationship between the music and lyrics input data. Inspired by work on retrieval-augmented representation learning, we propose to improve the generalization ability of BART-fusion by introducing music-lyrics alignment before modality fusion.

Methodology
In this section, we introduce the architecture of ALCAP, which is based on BART-fusion [31]. We first state the problem definition and then go through each module of the architecture. The overall framework of ALCAP is shown in Figure 1.

Problem Definition
Given a song represented as a music-lyrics pair x_i, with a music track m_i and its corresponding lyrics t_i, we aim to generate the caption (or interpretation) ŷ_i of the song, consisting of a sequence of word tokens. In a typical captioning setting, an attention-based encoder-decoder architecture is adopted to learn the mapping function from multimodal input to text output, f_θ : {m_i, t_i} → ŷ_i. The model parameters θ are optimized to generate the caption that is most consistent with the human-annotated caption y_i.

Multimodal Encoding
Music Encoder To obtain the representation of the music track, we use a pre-trained music encoder that includes a convolutional front-end and Transformer encoder layers, as described in [28]. The model was originally trained to classify music audio into 50 tags using the Million Song Dataset [6]. These tags cover various musical characteristics, such as genre (e.g., Jazz and Blues), mode, and the presence of specific instruments (e.g., piano or guitar). To perform the classification, the mel-spectrogram of a music track m_i is first passed through a series of CNN layers for local feature aggregation along the time and frequency axes. The intermediate features are then fed into two Transformer encoder layers to model information along the time axis, taking into account that elements of music can appear at different moments within a music clip. In the original paper, the output embedding series from the Transformer layers is further pooled to perform the classification task. In this paper, however, the embedding series h_{m_i} ∈ R^{l_m × d_m} is used directly, where l_m is the length of the music sequence and d_m is the hidden dimension.
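As a concrete illustration, the front-end-plus-Transformer design described above can be sketched as follows. The layer sizes, kernel configurations, and pooling factors here are placeholder assumptions for illustration, not the actual values used in [28].

```python
import torch
import torch.nn as nn

class MusicEncoderSketch(nn.Module):
    """Illustrative stand-in for the pre-trained music encoder: a CNN
    front-end over the mel-spectrogram followed by Transformer encoder
    layers that model the sequence along the time axis."""

    def __init__(self, n_mels=96, d_m=256, n_transformer_layers=2):
        super().__init__()
        # CNN front-end: aggregate local patterns in time and frequency.
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),  # halve both frequency and time axes
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # Project the flattened frequency channels to the hidden size d_m.
        self.proj = nn.Linear(64 * (n_mels // 4), d_m)
        layer = nn.TransformerEncoderLayer(d_model=d_m, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_transformer_layers)

    def forward(self, mel):                          # mel: (batch, n_mels, time)
        x = self.frontend(mel.unsqueeze(1))          # (batch, 64, n_mels//4, time//4)
        x = x.permute(0, 3, 1, 2).flatten(2)         # (batch, time//4, 64 * n_mels//4)
        x = self.proj(x)                             # (batch, l_m, d_m)
        return self.transformer(x)                   # embedding series h_{m_i}
```

Unlike the original tagging model, no pooling head is applied at the end, matching the use of the full embedding series described above.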
Lyrics Encoder The representation of lyrics t_i is obtained with a standard BART encoder [14] and denoted h_{t_i} ∈ R^{l_t × d_t}, where l_t is the length of the lyrics sequence and d_t is the hidden dimension. The encoder consists of six multi-head self-attention layers.

Multimodal Representation Alignment
Music and lyrics are not inherently connected: different lyrics can fit the same melody, and the same lyrics can convey different emotions when paired with dynamic, rhythmic music. To fully represent the interactions between music and lyrics, we propose using contrastive learning before modality fusion to explicitly align the two modalities. This is expected to improve performance through increased interaction between the two modalities, as has previously been shown effective in the vision and language domain [15].
To be specific, given a batch of input music-lyrics pairs {(m_1, t_1), (m_2, t_2), ..., (m_n, t_n)}, we first obtain the music representations {h_{m_1}, h_{m_2}, ..., h_{m_n}} from the music encoder and the lyrics representations {h_{t_1}, h_{t_2}, ..., h_{t_n}} from the lyrics encoder. As both music and lyrics are sequences, we denote by h̄ the mean aggregation of h along the sequence length dimension. Through a linear transform on h̄, we obtain the latent code z and use the InfoNCE loss [18] as the contrastive learning objective in the latent space:

L_CL = -(1/n) Σ_{i=1}^{n} log [ σ(sim(z_{m_i}, z_{t_i})) / Σ_{j=1}^{n} σ(sim(z_{m_i}, z_{t_j})) ],    (1)

where z_{m_i} and z_{t_i} are the latent codes of the music and lyrics respectively, sim(·,·) is a similarity function, and σ(·) is the exponential function. For simplicity, we omit the symmetric version obtained by switching z_{m_i} and z_{t_i} in Equation 1, which is equally applicable for modality alignment. Note that InfoNCE can be interpreted as an estimator of a lower bound of mutual information [5,18,7]. We will build on this to demonstrate the effectiveness of our proposed alignment module both theoretically and empirically, which is non-trivial. We revisit this in § 4 and § 6.
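The batch-wise contrastive objective above can be sketched as follows. The mean pooling, linear projections, and InfoNCE-over-the-batch structure follow the text; the temperature value and the L2 normalization of the latent codes are assumptions commonly paired with this loss, not details specified here.

```python
import torch
import torch.nn.functional as F

def infonce_alignment_loss(h_m, h_t, proj_m, proj_t, tau=0.07):
    """One direction of the InfoNCE alignment loss (music -> lyrics).
    h_m: (batch, l_m, d_m) music embedding series;
    h_t: (batch, l_t, d_t) lyrics embedding series;
    proj_m / proj_t: linear transforms into the shared latent space;
    tau: assumed temperature (not specified in the text)."""
    # Mean-pool along the sequence dimension, then project and normalize.
    z_m = F.normalize(proj_m(h_m.mean(dim=1)), dim=-1)   # (batch, d)
    z_t = F.normalize(proj_t(h_t.mean(dim=1)), dim=-1)   # (batch, d)
    logits = z_m @ z_t.T / tau          # pairwise similarities in the batch
    labels = torch.arange(z_m.size(0))  # positives lie on the diagonal
    # Cross-entropy over the batch equals -log(exp(s_ii) / sum_j exp(s_ij)).
    return F.cross_entropy(logits, labels)
```

The symmetric (lyrics -> music) direction is obtained by transposing the logits, mirroring the symmetric version mentioned in the text.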

Multimodal Representation Fusion and Decoding
Before decoding, the aligned representations of the music track h_{m_i} and the lyrics h_{t_i} are further fused by a cross-attention module, where the lyrics representations are linearly projected as queries and the music representations are projected as keys and values:

h_{f_i} = softmax(Q_i K_i^T / √d) V_i,  where  Q_i = h_{t_i} W_Q,  K_i = h_{m_i} W_K,  V_i = h_{m_i} W_V.

The fused representation contains semantics from both the music track and the lyrics, as the alignment by contrastive learning ensures sufficient interaction between them. While the multimodal encoder fuses the text and music as a whole, the decoding process follows a teacher-forcing fashion to predict each caption word, i.e., the ground-truth word token y_{i,t} of the i-th sample is provided at every step t during training. We use the BART decoder [14] to generate the caption autoregressively and maximize the factorized conditional likelihood. The caption loss is defined as

L_cap = -Σ_i Σ_t log P(y_{i,t} | y_{i,<t}, h_{f_i}),

where y_{i,<t} are the ground-truth word tokens before step t and P denotes the probability of the token at step t conditioned on the previous tokens and the fused multimodal representation.
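A minimal single-head sketch of this fusion step follows, with hypothetical hidden sizes; the actual module inside BART-fusion presumably uses multi-head attention, so this is only meant to show the query/key/value roles of the two modalities.

```python
import torch
import torch.nn as nn

d_m, d_t, d = 256, 768, 768  # hypothetical hidden sizes

# Cross-attention fusion: lyrics act as queries, music as keys/values.
W_q = nn.Linear(d_t, d, bias=False)
W_k = nn.Linear(d_m, d, bias=False)
W_v = nn.Linear(d_m, d, bias=False)

def fuse(h_t, h_m):
    """h_t: (batch, l_t, d_t) lyrics series; h_m: (batch, l_m, d_m) music
    series. Returns a fused representation of lyrics length: (batch, l_t, d)."""
    q, k, v = W_q(h_t), W_k(h_m), W_v(h_m)
    # Scaled dot-product attention over the music sequence for each lyric token.
    attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)  # (batch, l_t, l_m)
    return attn @ v
```

Because the output keeps the lyrics sequence length, it can be passed directly to the text decoder in place of the unimodal lyrics representation.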

Overall Learning Objective
To this end, we define the final loss as the weighted sum of the caption loss and the contrastive learning loss:

L = L_cap + α L_CL,

where α is the weight of the contrastive learning loss, balancing the contributions of captioning and multimodal alignment.

An Information Theoretical Perspective
In this section, we explain the performance improvement of our alignment module based on contrastive learning from a mutual information perspective.
Given an input pair x_i := {m_i, t_i}, the information bottleneck (IB) principle [1] encourages the model to find minimal but sufficient information about the input x_i with respect to the target caption words y_i. In other words, the training objective under IB can be formulated as

max_{p_θ(z|x)}  I(y; z) − β I(x; z),

where I(y; z) is the mutual information between the output and the latent code, I(x; z) is the mutual information between the input and the latent code, β is a trade-off coefficient, and p_θ(z|x) is the conditional distribution of the latent code parameterized by the encoder θ. To optimize the IB objective, an upper bound on I(x; z) is typically minimized to improve the generalization ability of the model [26,1]. From this perspective, we show the following upper bound on the mutual information of (x, z) in our setting.
Proposition 4.1. The mutual information of (x, z) in our setting is upper bounded by

I(x; z) ≤ E_{p(z)}[R(z)] − I(m; t),

where R(z) := E_{p((m,t)|z)} log [ p((m,t)|z) / (p(m) p(t)) ] depends only on z and is independent of x.
In light of the fact that contrastive learning tends to maximize the mutual information between (m, t) pairs, the above upper bound suggests that it can be considered an approximate implementation of the information bottleneck. Furthermore, if the music-lyrics pairs used in contrastive learning are not well aligned, one can actually prove that the learning will fail.
Proposition 4.2.If the music-lyrics pairing in the learning process is random such that the music and lyrics are sampled independently, then the mutual information between the input x and the representation z will be zero, and thus the encoder cannot learn anything useful.
The proof is provided in the Appendix. To sum up, based on the InfoNCE loss [10], the proposed alignment module can be interpreted as maximizing a lower bound on the mutual information between the music m and the corresponding text t, which in turn minimizes the upper bound on the mutual information between the input x and the latent code z, and consequently improves the generalization ability of the model.

Data
In this paper, we experiment on two datasets -the Song Interpretation Dataset [31] and the NetEase Cloud Music Review Dataset.

Song Interpretation Dataset
We use the Song Interpretation (SI) Dataset introduced by [31]. The dataset contains audio excerpts from 27,834 songs in the Music4All Dataset [25] and 490,000 user interpretations of the songs. Each excerpt is 30 seconds long and recorded at 44.1 kHz. Based on user votes on the interpretations, [31] create three variants of the dataset: 1) SI Full: the full dataset after some preprocessing; 2) SI w/voting ≥ 0: the subset containing only interpretations that received non-negative votes; 3) SI w/voting > 0: the subset containing only interpretations that received positive votes. The training splits of the three variants contain 279,283, 265,360, and 49,736 instances respectively. All three variants share the same test split of 800 instances. Please refer to [31] for more details on the dataset.

NetEase Cloud Music Review Dataset
In addition to the Song Interpretation Dataset, where the interpretations were mostly written by people who grew up under the influence of European and American culture, we curate another dataset, the NetEase Cloud Music (NCM) Review Dataset, whose reviews were written by people from China. NCM is a free music streaming service that is immensely popular in China. One of its most prominent features is that users can create their own playlists, write reviews, and share the playlists with other users.
We collect user-created playlists from NCM and keep those consisting of only English songs. Because our model generates captions at the individual song level, from each playlist we keep the single song with the highest popularity, i.e., the song that has been collected into the most playlists. As a result, each playlist yields one song-review pair. For each song, we keep the middle 30-second excerpt and sample it at 22.05 kHz. Since BART [14] is pretrained on English, we translate the Chinese reviews into English using Google Translate.
We collect 22,210 playlists (songs) and their reviews. An example is shown in Figure 2. We randomly split the dataset into train/val/test splits with sizes of 15,547, 3,331, and 3,332.

Experiments

Experimental Setup
We resample each song at 16 kHz and take a 15-second excerpt. The maximum caption length is 512.
The model is implemented in PyTorch [19]. We use the facebook/bart-base BART implementation from Huggingface [27]. We use a batch size of 26 and a learning rate of 5e-5. The weight α of the contrastive learning loss is set to 0.02. For better computational efficiency, we freeze the parameters of the music encoder and precompute the music representations. We train the model for 20 epochs and report results on the test split using the checkpoint with the best evaluation performance. All hyperparameter tuning is based on grid search. All models are trained on a Tesla A100 GPU with 40 GB of memory. The training times for SI Full, SI w/voting ≥ 0, SI w/voting > 0, and NCM Review are 28h, 28h, 5h, and 3h respectively.
We use ROUGE-1, ROUGE-2, ROUGE-L [23] and METEOR [4] as evaluation metrics. ROUGE measures the overlap of n-grams between the reference text and the generated text. On top of ROUGE, METEOR complementarily measures the semantic similarity between the two pieces of text by taking synonyms into account through WordNet. For both metrics, we use the implementation with default parameters from the Huggingface Datasets library. We use three random seeds and report the average performance on the test set.
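For intuition on what the ROUGE-L score measures, here is a from-scratch toy implementation based on the longest common subsequence (LCS); the reported experiments use the Huggingface implementation, and the β parameter here is an assumed recall-weighting default, not a value from the text.

```python
def rouge_l_f1(reference, candidate, beta=1.2):
    """Toy ROUGE-L: an F-measure over the LCS of the reference and
    candidate token sequences (whitespace tokenization for simplicity)."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    # Standard ROUGE-L F-measure: F = (1 + beta^2) * P * R / (R + beta^2 * P).
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```

Because LCS preserves word order without requiring contiguity, ROUGE-L rewards captions that follow the reference's sentence structure, which complements METEOR's synonym-aware matching.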

Experiments I: Music Captioning
The results are presented in Table 1. BART is a model that utilizes only unimodal textual information from the lyrics. The BART-fusion model, on the other hand, fuses representations from music and lyrics, but the two representations are not aligned prior to modality fusion. The results of these two baselines are reported in [31]. We do not compare with [16], which focuses on short music descriptions with a maximum of 22 tokens.
We find that ALCAP outperforms both BART-fusion and BART on all four datasets in terms of all four metrics, thereby setting a new state-of-the-art. In particular, the improvement on METEOR is more pronounced than on the ROUGE metrics, which demonstrates that ALCAP captures the semantics of the song for music captioning rather than merely memorizing syntax. Furthermore, the results on NCM Review for both models are overall worse than those on the SI datasets. We believe this is due to the weaker correspondence between music tracks and reviews in NCM Review, as the reviews were originally created at the playlist level. Despite this, ALCAP is still able to capture such weak correspondence and achieve a significant improvement over the baseline.

One of the most practical applications of music captioning is text-music retrieval, where, given a piece of music description, the goal is to retrieve the most relevant music according to the text. In light of this, we test the retrieval capability of ALCAP and the baseline model. The setting of cross-modal retrieval in this experiment differs from previous works such as [29], where retrieval is performed on two modalities that are directly aligned through contrastive learning.
As proposed in [31], we randomly select one sentence from the human-generated interpretation or review and use it as a query. The queries are used to retrieve the corresponding songs through the captions generated by our models. Specifically, we compute the representations of the queries and generated captions using Sentence-BERT [21]. For each query, we thus obtain a ranked list of retrieved songs based on the cosine similarities between the query representation and the generated caption representations. We compare our proposed ALCAP model to the BART-fusion [31] model and use precision@k and recall@k as evaluation metrics. The results are shown in Table 2.
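The retrieval evaluation described above can be sketched as follows, with embedding matrices standing in for the Sentence-BERT representations of queries and generated captions; each query's ground-truth song is assumed to be the item with the same index.

```python
import numpy as np

def retrieval_metrics(query_emb, caption_emb, k=5):
    """Compute precision@k and recall@k for text-music retrieval.
    query_emb: (n, d) query embeddings; caption_emb: (n, d) generated-caption
    embeddings, where caption i is the ground truth for query i. With one
    relevant item per query, precision@k = recall@k / k."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    sims = q @ c.T                           # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]  # top-k ranked list per query
    hits = np.array([i in topk[i] for i in range(len(q))], dtype=float)
    return hits.mean() / k, hits.mean()      # (precision@k, recall@k)
```

In the actual experiments, the embeddings would come from encoding the query sentences and generated captions with a Sentence-BERT model rather than being precomputed arrays.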
We observe that ALCAP outperforms BART-fusion on most datasets and metrics, apart from a few cases where the two models tie. This indicates the superiority of cross-modal alignment between music tracks and lyrics, which makes the generated captions more semantically aligned with human-written texts.
Compared to the SI datasets, the relatively low performance of both models on NCM Review is due to 1) the weak correspondence between songs and reviews, as mentioned in previous sections, and 2) the much larger retrieval pool (3,332 vs. 800). Nevertheless, ALCAP still outperforms the baseline in this challenging scenario.
Table 2: Results of text-music retrieval on four datasets using BART-fusion (baseline) and ALCAP (ours). The best results are highlighted in bold.
Case Study I: Visualization of Cross-Modal Attention Weights

To better understand the mechanism within the cross-attention module, we plot the attention weights of BART-fusion and ALCAP on five input examples from the training set in Figure 3. Both models are trained on the SI w/voting > 0 dataset.
The attention weights from ALCAP are more focused on specific text tokens, in contrast to those of BART-fusion, which are more evenly distributed across all tokens. This suggests that ALCAP, equipped with the cross-modal alignment module, is more effective at learning the interactions between the music audio and text domains.

Case Study II: Examples of Generated Caption
In this case study we show a representative example of captions generated by ALCAP and BART-fusion for Child In Time by Deep Purple, as in Figure 4. The song is from the test split of SI, and both models are trained on SI w/voting > 0.
From the lyrics and the reference interpretation, we can infer that the song is about war, which is captured by ALCAP. The generated caption contains "shot" and "sniper", indicating that the model has correctly understood the theme of the song. However, BART-fusion fails to interpret the song correctly, instead reading it as a love song. We conjecture that this is because the song's 70s rock style is highly typical and, lacking cross-modal alignment, BART-fusion lets the unimodal information from the audio track dominate and confuse the model. As 70s rock encompasses a wide range of topics, including love, it becomes harder to identify the correct topic of war. The alignment module in ALCAP, however, manages to capture the semantics of the song and provide a more accurate interpretation.

Ablation Study: Effect of Contrastive Learning Weight
To further investigate the effect of multimodal alignment through contrastive learning, we show the performance under different contrastive learning weights α on SI w/voting > 0, for music captioning (Figure 5) and text-music retrieval (Figure 6). We observe that in both figures the scores peak at α = 2e-2 and decrease at higher or lower weights. When the weight is below 2e-2, the model fails to learn sufficient alignment between the two modalities; when the weight is above 2e-2, the overly large contrastive learning loss hurts the optimization of the caption loss, which is most pronounced at α = 20.

Conclusions and Discussions
In this paper, we propose the Alignment Augmented Music Captioner (ALCAP), a high-quality music captioner that leverages an alignment augmentation module with cross-modal contrastive learning. We provide a theoretical analysis of the improved generalization of our model from an information bottleneck perspective. Experiments on two music captioning datasets demonstrate the effectiveness of ALCAP, and we achieve a new state-of-the-art on both of them.
For better computational efficiency, we fixed the parameters of the music encoder in ALCAP. In the future, we will allow these parameters to be trained for more flexible training. In addition, the Song Interpretation Dataset, as the only public music captioning dataset, is still small in scale, leaving room for creating a large-scale dataset. Moreover, user-generated song interpretations and reviews are likely to be biased. How to mitigate such bias while training the model is therefore a promising research direction.

Figure 1: An overview of ALCAP. The encoded representations of music and lyrics are first aligned using contrastive learning; the aligned representations are then fused using cross-attention and decoded by the text decoder. The architecture is based on BART [14].

Figure 3: Illustration of the cross-modal attention weights for five samples (a)∼(e). The first row shows the cross-modal attention weights output by BART-fusion and the second row shows the weights output by ALCAP. The y-axis and x-axis of each sub-graph indicate the text tokens and music segments respectively.

Figure 4: An example of generated captions from ALCAP and BART-fusion on Child In Time by Deep Purple.

Figure 5: Results of music captioning using different contrastive learning weights α on SI w/voting > 0.

Figure 6: Results of text-music retrieval using different contrastive learning weights α on SI w/voting > 0.

Table 1: Results of music captioning on four datasets using BART (baseline), BART-fusion (baseline), and ALCAP (ours). The best results are highlighted in bold.
Lyrics: Sweet child in time; You'll see the line; The line that's drawn between; Good and bad; See the blind man; Shooting at the world; Bullets flying; Oh taking toll; If you've been bad; Oh Lord I bet you have; And you've not been hit; Oh by flying lead; You'd better close your eyes; Oh bow your head; Wait for the ricochet; I wanna hear you sing; Sweet child in time; You'll see the line; The line that's drawn between; Good and bad; See the blind man; Shooting at the world ...

Reference: I don't think this song is about the Vietnamese war, but rather war in general. The blind man is not just a person, or a country; it's mankind. Almost every nation has been in a conflict or war. And yes, some countries are more powerful than others, but in the end, what has war resolved? Every country has sinned, some more than others. Nobody is innocent. Every action has a reaction: you shoot bullets blindly for unjustified reasons, and you are bound to have them ricochet back to you. The lyrics alone make this song great, but what really sends it to one of the all-time best works of genius art is everything else in the song, from the captivating organ intro, to Blackmore's unreal guitar skills, to Gillan's extraordinary vocals, and the passion that brought it all together. Probably the best song ever.

BART-fusion: I think this song is about a guy who is in the midst of a breakup with his girlfriend. He is trying to figure out what to do with his life, but he ends up killing himself. He tries to convince himself that he can do it, and he can't do it. He wants to be able to show her that he still loves her, and is willing to do anything to help her, even if it means killing herself. He can't do it by himself, because he knows she will never be happy, and will never see him again. He will never give up on her, no matter how hard he tries. He knows that she will always love him, and it will be a long time before he does, but she is not willing to give up, and she is going to be with him forever.

ALCAP: I think this song is about a man who was shot in the head by a sniper. He was shot by the sniper. The sniper was shot at the end of the night, and he was killed. He is still alive, but he is in a coma, and the sniper is still there. He has been shot. He's dying. It's a sad song, but it's a beautiful song. I think it's about a guy who is dying. I'm not sure what this song means, but I think this is a good song.