Unsupervised Melody-to-Lyrics Generation

Automatic melody-to-lyric generation is a task in which song lyrics are generated to go with a given melody. It is of significant practical interest and more challenging than unconstrained lyric generation as the music imposes additional constraints onto the lyrics. The training data is limited as most songs are copyrighted, resulting in models that underfit the complicated cross-modal relationship between melody and lyrics. In this work, we propose a method for generating high-quality lyrics without training on any aligned melody-lyric data. Specifically, we design a hierarchical lyric generation framework that first generates a song outline and second the complete lyrics. The framework enables disentanglement of training (based purely on text) from inference (melody-guided text generation) to circumvent the shortage of parallel data. We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints as guidance during inference. The two-step hierarchical design also enables content control via the lyric outline, a much-desired feature for democratizing collaborative song creation. Experimental results show that our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines, for example SongMASS, a SOTA model trained on a parallel dataset, with a 24% relative overall quality improvement based on human ratings. Our code is available at https://github.com/amazon-science/unsupervised-melody-to-lyrics-generation.


Figure 1: An example of the melody and the corresponding lyrics (rows: Music Note, Human, Baseline 1, Baseline 2, LYRA (Ours)), where 'L' denotes a music note with long duration and 'S' stands for short. Our model LYRA generates more coherently than the baselines. Besides, the rhythms of lyrics (i.e., accents and relaxations when spoken) generated by human and LYRA align well with the flows of the melody. On the other hand, existing methods output lyrics that have low singability by either aligning multiple words with one single note (baseline 1) or vice versa (baseline 2) as highlighted in red.

Introduction
Music is ubiquitous and an indispensable part of humanity (Edensor, 2020). Self-serve songwriting has thus become an emerging task and has received interest from the AI community (Sheng et al., 2021; Tan and Li, 2021; Zhang et al., 2022; Guo et al., 2022). However, the task of melody-to-lyric (M2L) generation, in which lyrics are generated based on a given melody, is underdeveloped due to two major challenges. First, there is a limited amount of melody-lyric aligned data. The process of collecting and annotating paired data is not only labor-intensive but also requires strong domain expertise and careful consideration of copyrighted source material. Previous work either manually collects a small number (usually a thousand) of melody-lyrics pairs (Watanabe et al., 2018; Lee et al., 2019) or, as in Sheng et al. (2021), uses the recently publicized data (Yu et al., 2021), in which the lyrics are pre-tokenized at the syllable level, leading to less sensical subwords in the outputs.
Another challenge lies in melody-to-lyric modeling. Compared to unimodal sequence-to-sequence tasks such as machine translation, the latent correlation between lyrics and melody is difficult to learn. For example, Watanabe et al. (2018); Lee et al. (2019); Chen and Lerch (2020); Sheng et al. (2021) apply RNNs, LSTMs, SeqGANs, or Transformers with melody embeddings and cross attention (Vaswani et al., 2017), hoping to capture the melody-lyrics mapping. However, as shown in Figure 1, these methods may generate less singable lyrics when they violate too often a superficial yet crucial alignment: one word in a lyric tends to match one music note in the melody (Nichols et al., 2009). In addition, their outputs are not fluent enough because they are neural models trained from scratch without leveraging large pre-trained language models (PTLMs).
In this paper, we propose LYRA, an unsupervised, hierarchical melody-conditioned LYRics generAtor that can generate high-quality lyrics with content control without training on melody-lyric data. To circumvent the shortage of aligned data, LYRA leverages PTLMs and disentangles training (pure text-based lyric generation) from inference (melody-guided lyric generation). This is motivated by the fact that plain text lyrics under open licenses are much more accessible (Tsaptsinos, 2017; Bejan, 2020; Edmonds and Sedoc, 2021), and prior music theories pointed out that the knowledge about music notes can be compiled into constraints to guide lyric generation. Specifically, Dzhambazov et al. (2017) argue that it is the durations of music notes, not the pitch values, that play a significant role in melody-lyric correlation.
As shown in Figure 1, the segmentation of lyrics should match the segmentation of music phrases for breathability. Oliveira et al. (2007); Nichols et al. (2009) also find that long (short) note durations tend to associate with (un)stressed syllables. However, existing lyric generators, even when equipped with state-of-the-art neural architectures and trained on melody-lyrics aligned data, still fail to capture these simple yet fundamental rules. In contrast, we show that through an inference-time decoding algorithm that considers two melody constraints (segment and rhythm) without training on melody-lyrics aligned data, LYRA achieves better singability than the best data-driven baseline. Without losing flexibility, we also introduce a factor to control the strength of the constraints.
In addition, LYRA adopts the hierarchical text generation framework (i.e., plan-and-write (Fan et al., 2019; Yao et al., 2019)) that both helps with the coherence of the generation and improves the controllability of the model to accommodate user-specified topics or keywords. During training, the input-to-plan model learns to generate a plan of lyrics based on the input title and salient words, then the plan-to-lyrics model generates the complete lyrics. To fit the characteristics of lyrics and melody, we also equip the plan-to-lyrics model with the ability to generate sentences with a predefined count of syllables through multi-task learning.
Our contributions are summarized as follows:

Task Formulation
We aim to generate lyrics that comply with both the provided topic and melody. The input topic is further decomposed into an intended title $T$ and a few salient words $S$ to be included in the generated lyrics (see Figure 2 for an example input). Following the settings of previous work (Chen and Lerch, 2020; Sheng et al., 2021), we assume that the input melody $\mathcal{M}$ is predefined and consists of $M$ music phrases ($\mathcal{M} = \{p_1, p_2, \ldots, p_M\}$), and each music phrase contains $N_i$ music notes ($p_i = \{n_{i1}, n_{i2}, \ldots, n_{iN_i}\}$).
The output is a piece of lyrics $L$ that aligns with the music notes: $L = \{w_{11}, w_{12}, \ldots, w_{MN_M}\}$. Here, for $j \in \{1, 2, \ldots, N_i\}$, $w_{ij}$ is a word or a syllable of a word that aligns with the music note $n_{ij}$.

Lyric Generation Model
We draw inspiration from recent generation models with content planning. These models are shown to achieve increased coherence and relevance over end-to-end generation frameworks in tasks such as story generation (Fan et al., 2018; Yao et al., 2019; Yang et al., 2022). Our lyrics generation model is similarly hierarchical, as shown in Figure 2. Specifically, we finetune two modules in our purely text-based pipeline: 1) an input-to-plan generator that generates a keyword-based intermediate plan, and 2) a plan-to-lyrics generator which is aware of word phonetics and syllable counts.

Input-to-Plan
In real-world scenarios, users will likely have an intended topic (e.g., a title and a few keywords) to write about. We similarly extract a few salient words from the training lyric using the YAKE algorithm (Campos et al., 2020), and feed them to our input-to-plan module to improve topic relevance.

Table 1: Examples of lyrics generated by different models with seven syllable counts as a constraint. Our model with multi-task auxiliary learning is the only system that successfully generates a complete line of lyrics with the desired number of syllables. On the other hand, the supervised models (Chen and Lerch, 2020; Sheng et al., 2021) trained with melody-lyrics paired data still generate dangling or cropped lyrics.

Model | Output: Generated Lyric
Naïve | Cause the Christmas gift was for.
Chen and Lerch (2020) | Hey now that's what you ever.
Sheng et al. (2021) | Believe you like taught me to.
Ours, Multi-task | Night and day my dreams come true.

The input contains the song title, the music genre, and three salient words extracted from ground truth lyrics.Note that we chose 3 as a reasonable number for practical use cases, but our approach works for any arbitrary number of salient keywords.
Our input-to-plan model is then trained to generate a line-by-line keyword plan of the song. Considering that at inference time we may need different numbers of keywords for different expected output lengths, the number of planned keywords is flexible. Specifically, we follow the settings in Tian and Peng (2022) to include a placeholder (the <MASK> token) in the input for every keyword to be generated in the plan. In this way, we have control over how many keywords we would like per line. We finetune BART-large (Lewis et al., 2020) as our input-to-plan generator with format control.
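To make the placeholder mechanism concrete, the following is a minimal sketch of how such an input string might be assembled. The field names, the separators, and the `build_plan_input` helper are illustrative assumptions; the text above only states that one <MASK> token is included for every keyword to be planned.

```python
from typing import List

MASK = "<MASK>"  # placeholder token named in the text


def build_plan_input(title: str, genre: str, salient_words: List[str],
                     keywords_per_line: List[int]) -> str:
    """Assemble a text input with one <MASK> per keyword to be planned.

    The layout (field names, separators) is hypothetical and only
    illustrates how the number of keywords per line can be controlled.
    """
    header = (f"Title: {title}. Genre: {genre}. "
              f"Salient words: {', '.join(salient_words)}.")
    lines = [
        f"Line {i}: " + " ".join([MASK] * n)
        for i, n in enumerate(keywords_per_line, start=1)
    ]
    return header + " " + " ".join(lines)
```

Requesting three keywords for the first line and two for the second would then produce five placeholders in total, which is the kind of per-line control described above.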

Plan-to-Lyrics
Our plan-to-lyrics module takes in the planned keywords as input and generates the lyrics. This module encounters an added challenge: to match the music notes of a given melody at inference time, it should be capable of generating lyrics with a desired syllable count that aligns with the melody.

Task | Sample Data (Input → Output)
T4 | Moon → MUWN; river → RIH_VER; wider → WAY_DER; ...

If we naïvely force the generation to stop once it reaches the desired number of syllables, the outputs are usually cropped abruptly or dangling. For example, if the desired number of syllables is 7, a system unaware of this constraint might generate 'Cause the Christmas gift was for' which is cropped and incomplete. Moreover, two recent lyric generators which are already trained on melody-to-lyrics aligned data also face the same issue (Table 1).
We hence propose to study an under-explored task of syllable planning: generating a line of lyrics that 1) is a self-contained phrase and 2) has the desired number of syllables. To this end, we include both the intermediate plan and the desired syllable count as input. Additionally, we propose to equip the plan-to-lyrics module with the word phonetics information and the ability to count syllables. We then adopt multi-task auxiliary learning to incorporate the aforementioned external knowledge during training, as Liebel and Körner (2018); Guo et al. (2019); Poth et al. (2021); Kung et al. (2021) have shown that related auxiliary tasks help to boost the system performance on the target task. Specifically, we study the collective effect of the following related tasks which could potentially benefit the model to learn the target task:
• T1: Plan to lyrics generation with syllable constraints (the target task)
• T2: Syllable counting: given a sentence, count the number of syllables
• T3: Plan to lyrics generation with granular syllable counting: in the output lyric of T1, append the syllable counts immediately after each word
• T4: Word to phoneme translation
We list the sample data for each task in Table 2. We aggregate training samples from the above tasks, and finetune GPT-2 large (Radford et al., 2019) on different combinations of the four tasks. We show our model's success rate on the target task in Table 3 in Section 6.1.
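The four tasks above can be illustrated with a small sketch that derives one (input, output) training pair per task from a single lyric line. The prompt wording, the "word (count)" annotation format for T3, and the toy lexicon are assumptions made for illustration. Only the "river" phoneme string follows the Table 2 sample, and in practice the syllable and phoneme information would come from a pronunciation dictionary.

```python
from typing import Dict, List, Tuple

# Toy lexicon: word -> (syllable count, phoneme string).
# Only "river" follows a sample shown in Table 2; the rest is made up.
TOY_LEXICON: Dict[str, Tuple[int, str]] = {
    "night": (1, "NAYT"), "and": (1, "AEND"), "day": (1, "DEY"),
    "my": (1, "MAY"), "dreams": (1, "DRIYMZ"), "come": (1, "KAHM"),
    "true": (1, "TRUW"), "river": (2, "RIH_VER"),
}


def _words(line: str) -> List[str]:
    """Split a lyric line into words, dropping surrounding punctuation."""
    return [w.strip(".,;!?") for w in line.split()]


def count_syllables(line: str) -> int:
    """Sum the per-word syllable counts from the toy lexicon."""
    return sum(TOY_LEXICON[w.lower()][0] for w in _words(line))


def make_samples(plan: List[str], line: str) -> Dict[str, Tuple[str, str]]:
    """Build one hypothetical (input, output) pair for each of T1-T4."""
    n = count_syllables(line)
    prompt = f"Keywords: {', '.join(plan)}. Syllables: {n}."
    annotated = " ".join(
        f"{w} ({TOY_LEXICON[w.lower()][0]})" for w in _words(line)
    )
    word = plan[0].lower()
    return {
        "T1": (prompt, line),                  # plan -> lyric with syllable constraint
        "T2": (line, str(n)),                  # sentence -> syllable count
        "T3": (prompt, annotated),             # plan -> lyric with per-word counts
        "T4": (word, TOY_LEXICON[word][1]),    # word -> phonemes
    }
```

Aggregating such pairs across tasks gives a single mixed training set, so different task combinations can be compared simply by filtering on the task key.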

Melody-Guided Inference
In this section, we discuss the procedure to compile a given melody into constraints to guide the decoding at inference time. We start with the most straightforward constraints introduced before: 1) segmentation alignment and 2) rhythm alignment. Note that both melody constraints can be updated without needing to retrain the model.

Segment Alignment Constraints
The segmentation of music phrases should align with the segmentation of lyrics (Watanabe et al., 2018). Given a melody, we first parse the melody into music phrases, then compute the number of music notes within each music phrase. For example, the first music phrase in Figure 2 consists of 13 music notes, which should be equal to the number of syllables in the corresponding lyric chunk. Without loss of generality, we also add variations to this constraint where multiple notes can correspond to one single syllable when we observe such variations in the gold lyrics.
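As a rough sketch of this constraint, the syllable budget of each lyric line can be read off the number of notes in the corresponding music phrase. The melody representation (a list of phrases, each a list of note durations) and the `extra_notes` argument, which models notes that share a syllable with a neighbouring note, are assumptions made for this example.

```python
from typing import List, Optional

# Assumed representation: a melody is a list of music phrases, and each
# phrase is a list of note durations (e.g., in beats).
Melody = List[List[float]]


def syllable_budgets(melody: Melody,
                     extra_notes: Optional[List[int]] = None) -> List[int]:
    """Return the target syllable count for each music phrase.

    By default one note maps to one syllable. `extra_notes[i]` is the
    number of notes in phrase i that are sung on a syllable already
    counted (several notes for one syllable), which lowers the budget.
    """
    if extra_notes is None:
        extra_notes = [0] * len(melody)
    if len(extra_notes) != len(melody):
        raise ValueError("extra_notes must have one entry per phrase")
    budgets = [len(phrase) - extra for phrase, extra in zip(melody, extra_notes)]
    if any(b < 1 for b in budgets):
        raise ValueError("every phrase needs at least one syllable")
    return budgets
```

A phrase with 13 notes therefore yields a budget of 13 syllables unless some notes are marked as sharing a syllable.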

Rhythm Alignment Constraints
According to Nichols et al. (2009), the stress-duration alignment rule hypothesizes that music rhythm should align with lyrics meter. Namely, shorter note durations are more likely to be associated with unstressed syllables. At inference time, we 'translate' a music note to a stressed syllable (denoted by 1) or an unstressed syllable (denoted by 0) by comparing its duration to the average note duration. For example, based on the note durations, the first music phrase in Figure 2 is translated into alternating 1s and 0s, which will be used to guide the inference decoding.
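The translation step can be sketched as follows. Two details are assumptions here, since the text does not state them: the average is taken over the notes of one phrase, and a note whose duration equals the average is treated as unstressed.

```python
from typing import List


def stress_pattern(durations: List[float]) -> List[int]:
    """Map note durations to a target stress pattern.

    A note longer than the average duration becomes 1 (stressed),
    otherwise 0 (unstressed). Averaging per phrase and sending ties
    to 0 are assumptions of this sketch.
    """
    if not durations:
        return []
    average = sum(durations) / len(durations)
    return [1 if d > average else 0 for d in durations]
```

For a phrase that alternates long and short notes, the resulting pattern alternates 1s and 0s, matching the example described above.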

Phoneme-Constrained Decoding
At each decoding step, we ask the plan-to-lyrics model to generate candidate complete words, instead of subwords, which is the default word piece unit for GPT-2 models. This enables us to retrieve the word phonemes from the CMU pronunciation dictionary (Weide et al., 1998) and identify the resulting syllable stresses. For example, since the phoneme of the word 'Spanish' is 'S PAE1 NIH0 SH', we can derive that it consists of 2 syllables that are stressed and unstressed.
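A small sketch of the stress lookup: in CMU-style pronunciations, a vowel carries a trailing digit (0 for no stress, 1 for primary, 2 for secondary), so counting the digit-bearing tokens gives the syllables and their stresses. Treating secondary stress as stressed is an assumption of this sketch, and the pronunciation is passed in as a string rather than read from the dictionary.

```python
from typing import List


def syllable_stresses(pronunciation: str) -> List[int]:
    """Extract one stress value per syllable from a CMU-style pronunciation.

    Tokens ending in a digit mark syllable nuclei. A trailing 0 means
    unstressed; 1 or 2 are both mapped to stressed (an assumption here).
    """
    stresses = []
    for token in pronunciation.split():
        if token[-1].isdigit():
            stresses.append(0 if token[-1] == "0" else 1)
    return stresses
```

Applied to the 'Spanish' example above, this yields two syllables, the first stressed and the second unstressed.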
Next, we check if the candidate words satisfy the stress-duration alignment rule. Given a candidate word $w_i$ and the original logit $p(w_i)$ predicted by the plan-to-lyrics model, we introduce a factor $\alpha$ to control the strength:

$$p'(w_i) = \begin{cases} p(w_i), & \text{if } w_i \text{ satisfies rhythm alignment,} \\ \alpha \, p(w_i), & \text{otherwise.} \end{cases} \tag{1}$$

We can either impose a hard constraint, where we reject all those candidates that do not satisfy the rhythm rules ($\alpha = 0$), or impose a soft constraint, where we would reduce their sampling probabilities ($0 < \alpha < 1$). Finally, we apply diverse beam search (Vijayakumar et al., 2016) to promote the diversity of the generated sequences.
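The rescoring in Equation (1) can be sketched as below. The candidate scores, the per-word stress lists, and the rule that a word running past the end of the phrase counts as a violation are assumptions for illustration. The sketch rescales scores only and leaves renormalization and beam search to the decoder.

```python
from typing import Dict, List


def satisfies_rhythm(word_stress: List[int], target: List[int], pos: int) -> bool:
    """Check a word's stresses against the target pattern starting at `pos`.

    A word that would run past the end of the phrase is treated as a
    violation (an assumption of this sketch).
    """
    return target[pos:pos + len(word_stress)] == word_stress


def rescore(scores: Dict[str, float], stresses: Dict[str, List[int]],
            target: List[int], pos: int, alpha: float) -> Dict[str, float]:
    """Apply Equation (1): keep p(w) if aligned, else scale it by alpha."""
    return {
        word: (p if satisfies_rhythm(stresses[word], target, pos) else alpha * p)
        for word, p in scores.items()
    }
```

Setting `alpha` to 0 reproduces the hard constraint, while a small positive value such as 0.01 keeps violating candidates available at a much lower score.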

Experimental Setup
In this section, we describe the train and test data, baseline models, and evaluation setup. The evaluation results are reported in Section 6.

Dataset
Train data. Our training data consists of the lyrics of 38,000 English songs and their corresponding genres such as Pop, Jazz, and Rock, which we processed from the Genre Classification dataset (Bejan, 2020). The phonetic information needed to construct the auxiliary tasks to facilitate the syllable count control is retrieved from the CMU pronunciation dictionary (Weide et al., 1998).
Automatic test data. The testing setup is the complete diagram shown in Figure 2. Our input contains both the melody (represented in music notes and phrases) and the title, topical, and genre information. Our test melodies come from the lyric-melody aligned dataset (Yu et al., 2021). In total, we gathered 120 songs that do not appear in the training data. Because the provided lyrics are pre-tokenized at the syllable level (e.g., "a lit tle span ish town" instead of "a little spanish town"), we manually reconstructed them back into natural words when necessary.
Two sets of human test data. To facilitate human evaluation, we leverage an online singing voice synthesizer (Hono et al., 2021) to generate the sung audio clips. This synthesizer, however, requires files in the musicXML format that none of the existing datasets provide (including our automatic test data). Therefore, we manually collected 6 copyrighted popular songs and 14 non-copyrighted public songs from the musescore platform that supports the musicXML format.
The first set of pilot eval data are these 20 pieces of melodies that come with ground truth lyrics.In addition, we composed a second, larger set of 80 test data by pairing each existing melody with various other user inputs (titles and salient words).This second eval set, which does not come with ground truth lyrics, is aimed at comparison among all the models.

Baseline Models for Lyrics Generation
We compare the following models.
1. SongMASS (Sheng et al., 2021) is a state-of-the-art (SOTA) song writing system which leverages masked sequence to sequence pre-training and attention based alignment for M2L generation. It requires melody-lyrics aligned training data while our model does not.
2. GPT-2 finetuned on lyrics is a uni-modal, melody-unaware GPT-2 large model that is finetuned end-to-end (i.e., title-to-lyrics). In the automatic evaluation setting, we also compare an extra variation, content-to-lyrics, in which the input contains the title, salient words, and genre. These serve as ablations of the next model LYRA w/o rhythm to test the efficacy of our plan-and-write pipeline without inference-time constraints.
3. LYRA w/o rhythm is our base model consisting of the input-to-plan and plan-to-lyrics modules with segmentation control, but without the rhythm alignment.
4. LYRA w/ soft/hard rhythm is our multi-modal model with music segmentation and soft or hard rhythm constraints. For the soft constraints setting, the strength controlling hyperparameter α = 0.01.
All models except SongMASS are finetuned on the same lyrics training data described in Section 5.1.

Automatic Evaluation Setup
We automatically assess the generated lyrics on two aspects: the quality of text and music alignment. For text quality, we divide it into 3 subaspects: 1) Topic Relevance, measured by input salient word coverage ratio, and sentence- or corpus-level BLEU (Papineni et al., 2002); 2) Diversity, measured by distinct unigrams and bigrams (Li et al., 2016); 3) Fluency, measured by the perplexity computed using Huggingface's pretrained GPT-2. We also compute the ratio of cropped sentences among all sentences to assess how well they fit music phrase segments. For music alignment, we compute the percentage where the stress-duration rule holds.
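Two of these metrics are simple enough to sketch directly. The version of distinct-n below divides unique n-grams by the total number of n-grams, which is one common variant and not necessarily the exact normalization used in the reported results. Whitespace tokenization and lowercasing are likewise assumptions.

```python
from typing import List


def distinct_n(lines: List[str], n: int) -> float:
    """Ratio of unique n-grams to all n-grams, computed within lines."""
    ngrams = []
    for line in lines:
        tokens = line.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def salient_coverage(lyrics: str, salient_words: List[str]) -> float:
    """Fraction of the input salient words that appear in the lyrics."""
    if not salient_words:
        return 0.0
    tokens = {t.strip(".,;!?") for t in lyrics.lower().split()}
    return sum(w.lower() in tokens for w in salient_words) / len(salient_words)
```

Repetitive lyrics lower distinct-n, while lyrics that miss the requested keywords lower the coverage ratio, so the two scores capture different failure modes.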

Human Evaluation Setup
Turker Qualification. We used qualification tasks to recruit 120 qualified annotators who 1) have enough knowledge in song and lyric annotation, and 2) pay sufficient attention on the Mechanical Turk platform. The qualification consisted of two parts accordingly. First, to test the Turkers' domain knowledge, we created an annotation task consisting of the first verse from 5 different songs with gold labels. The 5 songs are carefully selected to avoid ambiguous cases, so that the quality can be clearly identified. We selected those whose scores have a high correlation with gold labels. Second, we adopted attention questions to rule out irresponsible workers. As is shown in the example questionnaire in Appendix A, we provided music sheets for each song in the middle of the questions. We asked all annotators the same question: "Do you think the current location where you click to see the music sheet is ideal?". Responsible answers include "Yes" or "No", and suggesting more ideal locations such as "immediately below the audio clip and above all questions". We ruled out irresponsible Turkers who filled in geographical locations (such as country names) in the provided blank.
Annotation Task. Our annotation is relative, meaning that annotators assess a group of songs generated from different systems with the same melody and title at once. We evaluated all baseline models except for GPT-2 finetuned (content-to-lyrics), as the two GPT-2 variations showed similar performance in automatic evaluation. We thus only included one due to resource constraints of the human study. Each piece of music was annotated by at least three workers, who were asked to evaluate the quality of the lyrics using a 1-5 Likert scale on six dimensions across musicality and text quality. For musicality, we asked them to rate singability (whether the melody's rhythm aligned well with the lyric's rhythm) and intelligibility (whether the lyric content was easy to understand when listened to without looking at the lyrics). For the lyric quality, we asked them to rate coherence, creativeness, and rhyme. Finally, we asked annotators to rate how much they liked the song overall.

Results

Generating a Sequence of Lyrics with the Desired Number of Syllables
Recall that in Section 3.2, we trained the plan-to-lyrics generator on multiple auxiliary tasks in order to equip it with the ability to generate a sentence with a pre-defined number of syllables. A sample output can be found below:
Line 1 (8 syllables): Last Christmas I gave you my gift;
Line 2 (13 syllables): It was some toys and some clothes that I said goodbye to;
Line 3 (11 syllables): But someday the tree is grown with other memories;
Line 4 (7 syllables): Santa can hear us singing; ...
To test this feature, we compute the average success rate on a held-out set from the training data that contains 168 songs with 672 lines of lyrics. For each test sample, we compute its success as a binary indicator where 1 indicates the output sequence contains exactly the same number of syllables as desired, and 0 for all other cases. We experimented with both greedy decoding and sampling, and found that BART (Lewis et al., 2020) could not learn these multi-tasks as well as the GPT-2 family under the same settings. We hence report the best result of finetuning GPT-2 large (Radford et al., 2019) in Table 3.
The first row in Table 3 shows that the model success rate is around 20% without multi-task learning, which is far from ideal. By gradually training with auxiliary tasks such as syllable counts, the success rate increases, reaching over 90% (rows 2, 3, 4). This shows the efficacy of multi-task auxiliary learning. We also notice that the phoneme translation task is not helpful for our goal (row 4), so we disregard the last task and only keep the remaining three tasks in our final implementation (row 3).
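The success rate used in this section is an average of binary indicators and can be sketched as follows. The toy syllable table and the `toy_count` helper are stand-ins introduced for this example. A real evaluation would count syllables with a pronunciation dictionary.

```python
from typing import Callable, List

# Toy syllable table for illustration only.
TOY_SYLLABLES = {"night": 1, "and": 1, "day": 1, "my": 1,
                 "dreams": 1, "come": 1, "true": 1}


def toy_count(line: str) -> int:
    """Count syllables of a line using the toy table above."""
    return sum(TOY_SYLLABLES[w] for w in line.lower().split())


def success_rate(outputs: List[str], targets: List[int],
                 count_fn: Callable[[str], int]) -> float:
    """Average of indicators: 1 if the line has exactly the desired count."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    hits = [1 if count_fn(o) == t else 0 for o, t in zip(outputs, targets)]
    return sum(hits) / len(hits) if hits else 0.0
```

A line that is one syllable short or long counts as a full failure, so the metric rewards exact matches only.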

Automatic Evaluation Results
We report the automatic evaluation results in Table 4. Our LYRA models significantly outperform the baselines and generate the most on-topic and fluent lyrics. In addition, adding rhythm constraints to the base LYRA noticeably increases the music alignment quality without sacrificing too much text quality. It is also noteworthy that humans do not consistently follow stress-duration alignment, meaning that higher is not necessarily better for music alignment percentage. The comparisons between GPT-2 content-to-lyrics and LYRA w/o rhythm support the hypothesis that our hierarchical architecture provides better topic control. Since the baseline model SongMASS has no control over the content, it has the lowest topic relevance scores. Moreover, although the SongMASS baseline seems to achieve the best diversity, it tends to produce non-sensical sentences that consist of a few gibberish words (e.g., 'for hanwn to stay with him when, he got to faney he alone'), partially because its training data are pre-tokenized at the syllable level. Such degeneration is also reflected by the extremely high perplexity and cropped sentence ratio (CSR). Meanwhile, CSR is not applicable to either of the GPT-2 finetuned models because they are melody-unaware and generate lyrics freely without being forced to end at the end of each music segment.

Human Evaluation Results
The results on both evaluation sets are shown in Figures 3a and 3b. Clearly, human-written lyrics greatly outperform all models. For both evaluation sets, we notice the relative rankings of the models remain the same across all metrics except creativeness. This observation is mirrored by paired t-tests where we find that the best machine model differentiates from the second best machine model with statistical significance (p-value < 0.05) for all aspects except creativeness. Both indicate the reliability of our collected results in singability, intelligibility, coherence, rhyme, and overall quality.

Table 5: An example of the generated lyrics with the same input - Title: In a little Spanish Town; Genre: Jazz; Salient words: town, kisses, heart. We highlight the generated keywords in italics.

In a little Spanish town, In a little Spanish town, We'll make you feel good, We'll make you dance, In a little Spanish town, In a little Spanish town . . .

It takes me back to a time and a place in a town, But if you do love me you will give me some kisses, Where my heart and soul are not the same but now, Your heart is in my heart and not in a soul i know.

e - LYRA w/ Soft Rhythm: Take me back to older time in little Spanish town, And all the love and all the kisses that you gave me, I need your heart and your soul and your love too, I need your heart and soul and I need you back again.

f - LYRA w/ Hard Rhythm: Spanish people in the city and all in the town, Love and all the tender kisses between two of us, Is it my heart or my soul in you and me, In the heart and in the soul and in the mind and then.

LYRA with hard or soft rhythm constraints are the best models in terms of singability, intelligibility, rhyme, and overall quality, which demonstrates the efficacy of our plan-and-write pipeline with melody alignment. We regard LYRA with soft rhythm as our best model since it has the highest overall quality. The addition of soft rhythm alignment leads to further improvements in musicality and overall quality, with only a little sacrifice in coherence compared to GPT-2 (title-to-lyrics). On the other hand, imposing hard rhythm constraints sacrifices the coherence and intelligibility of lyrics. Surprisingly, SongMASS performs even worse than the finetuned GPT-2 baseline in terms of musicality. Upon further inspection, we posit that SongMASS too often deviates from common singing habits: it either assigns two or more syllables to one music note, or matches one syllable with three or more consecutive music notes.

Qualitative Analysis
We conduct a case study on an example set of generated lyrics to better understand the advantages of our model over the baselines.In this example, all models generate lyrics given the same title, genre, and salient words, as well as the melody of the original song.We show the music sheet of the first generated segment in Figure 4 and the complete generated lyrics in Table 5.We also provide the song clips with synthesized singing voices and more examples in this demo website.
Musicality. The melody-lyric alignment in Figure 4 is representative in depicting the pros and cons of the compared models. Although SongMASS is supervised on parallel data, it still often assigns too many music notes to one single syllable, which reduces singability and intelligibility. The GPT-2 title-to-lyrics model is not aware of the melody and thus fails to match the segmentation of music phrase with the generated lyrics.
LYRA w/o rhythm successfully matches the segments, yet stressed and long vowels such as in the words 'takes' and 'place' are wrongly mapped to short notes.Humans, as well as our models with both soft and hard rhythm alignment, produce singable lyrics.
Text quality. As shown in Table 5, SongMASS tends to generate simple and incoherent lyrics because it is trained from scratch. The GPT-2 title-to-lyrics model generates coherently and fluently, but is sometimes prone to repetition. All three variations of LYRA benefit from the hierarchical planning stage and generate coherent and more informative lyrics. However, there is always a trade-off between musicality and text quality. Imposing hard rhythm constraints could sometimes sacrifice coherence and creativity and thus hurt the overall quality of lyrics.
Related Work

Melody Constrained Lyrics Generation
End-to-End Models. Most existing works on M2L generation are purely data-driven and suffer from a lack of aligned data. For example, Watanabe et al. (2018); Lee et al. (2019); Chen and Lerch (2020) naively apply SeqGAN (Yu et al., 2017) or RNNs to sentence-level M2L generation. The data collection process is hard to automate and leads to manual collection of only small amounts of samples. Recently, Sheng et al. (2021) propose SongMASS by training two separate transformer-based models for lyric or melody with cross attention. To the best of our knowledge, our model LYRA is the first M2L generator that does not require any paired cross-modal data, and is trained on a readily available uni-modal lyrics dataset.

[...] 2021) use syllable alignments as reward for the lyric generator. However, it only estimates the expected number of syllables from the melody. We not only provide a more efficient solution to syllable planning, but also go one step further to incorporate the melody's rhythm patterns by following music theories (Nichols et al., 2009; Dzhambazov et al., 2017). Concurrently, Xue et al. (2021); Guo et al. (2022) partially share similar ideas with ours and leverage the sound to generate Chinese raps or translate lyrics via alignment constraints. Nevertheless, the phonetics of Chinese characters are very different from English words, and rap generation or translation is unlike M2L generation.

NLG with Hierarchical Planning
Hierarchical generation frameworks are shown to improve consistency over sequence-to-sequence frameworks in other creative writing tasks such as story generation (Fan et al., 2018; Yao et al., 2019). Recently, a similar planning-based scheme was adopted for poetry generation (Tian and Peng, 2022) to circumvent the lack of poetry data. We similarly equip LYRA with the ability to comply with a provided topic via such content planning.

Studies on Melody-Lyrics Correlation
Music information researchers have found that it is the duration of music notes, not the pitch values, that plays a significant role in melody-lyric alignment (Nichols et al., 2009; Dzhambazov et al., 2017). Most intuitively, one music note should not align with two or more syllables, and the segmentation of lyrics should match the segmentation of music phrases for singability and breathability (Watanabe et al., 2018). In addition, Nichols et al. (2009) find that there is a correlation between syllable stresses and note durations for better singing rhythm. Despite the intuitiveness of the aforementioned alignments, our experiments show that existing lyric generators which are already trained on melody-lyrics aligned data still tend to ignore these fundamental rules and generate songs with less singability.

Conclusion and Future Work
Our work explores the potential of lyrics generation without training on lyrics-melody aligned data.To this end, we design a hierarchical plan-and-write framework that disentangles training from inference.At inference time, we compile the given melody into music phrase segments and rhythm constraints.Evaluation results show that our model can generate high-quality lyrics that significantly outperform the baselines.Future directions include investigating more ways to compile melody into constraints such as the beat, tone or pitch variations, and generating longer sequences of lyrics with song structures such as verse, chorus, and bridge.Future works may also take into account different factors in relation to the melody such as mood and theme.

Limitations
We discuss the limitations of our work. First of all, our model LYRA is built upon pre-trained language models (PTLMs) including BART (Lewis et al., 2020) and GPT-2 (Radford et al., 2019). Although our method is much more data friendly than previous methods in that it does not require training on melody-lyric aligned data, our pipeline may not apply to low-resource languages which do not have PTLMs. Second, our current adoption of melody constraints is still simple and based on a strong assumption of syllable stress and note duration. We encourage future investigation of other alignments such as the tone or pitch variations. Lastly, although we already have the music genre as an input feature, it remains an open question how to analyze or evaluate the generated lyrics with respect to a specific music genre.

Ethics Statement
It is known that the results generated by PTLMs could capture the bias reflected in the training data ([…] et al., 2019; Wallace et al., 2019). Our models may potentially generate content that is offensive to certain groups or individuals. We suggest carefully examining the potential biases before deploying the models in real-world applications.

A Survey Form Used In Human Evaluation
We show the original survey with the evaluation instructions and the annotation task in Figures 5 through 9. Figure 5, Figure 6, and Figure 7 provide task instructions, including the definition of each metric (Intelligibility, Singability, Coherence, Creativeness, and Rhyme), and examples of good and bad lyrics in each criterion. Figures 8 and 9 showcase the actual annotation task.
In the actual annotation tasks, we noticed that annotators tended to adjust their rating of Intelligibility (whether the content of the lyrics was easy to understand without looking at the lyrics) after they were prompted to see the lyric texts. We hence explicitly asked them to rate Intelligibility twice, both before and after they saw the generated lyrics and music scores. Annotators were not allowed to modify their rating for the first question after they saw the lyric texts, but could still use the second question to adjust their scores if needed. Such a mechanism helped us reduce the noise introduced by the presentation of lyric texts and music sheets. Namely, we asked the same question twice, but only took into account the first Intelligibility ratings when we computed the results.
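The aggregation rule described above, in which only the first Intelligibility rating counts, can be sketched as follows. This is an illustrative sketch only: the record layout and the field names `intelligibility_before` and `intelligibility_after` are hypothetical and are not taken from the actual survey export.

```python
from statistics import mean
from typing import Dict, List


def mean_first_intelligibility(responses: List[Dict[str, int]]) -> float:
    """Average the Intelligibility ratings given BEFORE the lyrics were shown.

    Each response is assumed to hold two ratings under hypothetical keys:
    'intelligibility_before' and 'intelligibility_after'. The second rating
    is collected but deliberately ignored here.
    """
    if not responses:
        raise ValueError("no responses to aggregate")
    return float(mean(r["intelligibility_before"] for r in responses))
```

Keeping the second rating in the record while excluding it from the average mirrors the survey design: annotators have an outlet for revising their judgment, and the reported score stays unaffected by the lyric text.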

Figure 2: An overview of our approach that disentangles training from inference. Blue represents components used during both training and inference, while brown means inference only. During training, our input-to-plan model learns to predict sentence-level plans (i.e., keywords) given the title, genre, and salient words as input. Then, the plan-to-lyrics model generates the lyrics while being aware of word phonetic information and syllable counts. At inference time, we compile the given melody into 1) music phrase segments and 2) rhythm constraints to guide the generation.

cross-lingual translation by training on monolingual data only. In our case, we achieve "unsupervised" melody-to-lyrics generation by training on text data only and do not require any parallel melody-lyrics aligned data for training.
(a) Human evaluation results on the pilot test set with human as ground truth lyrics.
(b) Human evaluation results on the larger test set without ground truth lyrics.

Figure 3: Average human Likert scores for two lyrics evaluation datasets on singability, intelligibility, coherence, creativeness, rhyme, and overall quality. For each pair of systems in either study, we conduct a paired t-test and observe statistical significance across all dimensions except creativeness (denoted by *).
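The paired t-test mentioned in the caption compares two systems rated on the same items. The following is a minimal standard-library sketch of the test statistic, not the evaluation script used for Figure 3; the function name is hypothetical and any ratings passed to it here are illustrative, not results from the study.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples a and b.

    a[i] and b[i] must be ratings of the same item by the two systems.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

Turning the statistic into a p-value requires the Student t distribution, which the standard library does not provide. If SciPy is available, `scipy.stats.ttest_rel(a, b)` computes the same statistic together with a p-value.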

Figure 4: Music sheets showing the lyrics generated by different systems given the same piece of melody. LYRA with soft and hard rhythm control are the only two models that can generate highly singable lyrics. The singing voices of the complete song can be found in this demo page.
(a) Human: Many skies have turned to gray because we're far apart, Many moons have passed away and still she's in my heart, We made a promise and sealed it with a kiss, In a little Spanish town twas on a night like this.
(b) SongMASS (Sheng et al., 2021): Someone got to go here, Forget that rest of my life, Everybody loves somebody who i, In the middle of the night when the.
(c) GPT-2 Finetuned on Title-to-Lyrics
(d) LYRA w/o Rhythm

External Knowledge. Oliveira et al. (2007); Oliveira (2015) apply rule-based text generation methods with predefined templates and databases for Portuguese. Ma et al. (

Figure 9: Annotation Task Page 2. We explicitly asked the annotators to rate Intelligibility twice, before and after they saw the generated lyrics and provided musicality scores.

• Both automatic and human evaluations show that our unsupervised model LYRA outperforms fully supervised baselines in terms of both text quality and musicality by a significant margin.

2 Background and Problem Setup

Representation of Melody. Melody is a succession of pitches in rhythm consisting of a sequence of music phrases, which can be further decomposed into timed music notes. Each music note is defined by two independent pivots: pitch values and durations. Pitch represents the highness/lowness of a musical tone; duration is the note's length of time. Namely, melody $M$ can be denoted by $M = \{p_1, p_2, \ldots, p_M\}$, where each $p_i$ ($i \in \{1, 2, \ldots, M\}$) is a music phrase. The music phrase can be further decomposed into timed music notes ($p_i = \{n_{i1}, n_{i2}, \ldots, n_{iN_i}\}$), where each music note $n_{ij}$ ($j \in \{1, 2, \ldots, N_i\}$) comes

Table 2: Sample data of the four proposed tasks to facilitate lyric generation with syllable planning.

Table 4: Automatic evaluation results. Human (ground truth) performance is highlighted in a grey background. Among all models, we highlight the best scores in boldface and underline the second best.