TeleMelody: Lyric-to-Melody Generation with a Template-Based Two-Stage Method

Lyric-to-melody generation is an important task in automatic songwriting. Previous lyric-to-melody generation systems usually adopt end-to-end models that directly generate melodies from lyrics, which suffer from several issues: 1) lack of paired lyric-melody training data; 2) lack of control over generated melodies. In this paper, we develop TeleMelody, a two-stage lyric-to-melody generation system with a music template (e.g., tonality, chord progression, rhythm pattern, and cadence) to bridge the gap between lyrics and melodies (i.e., the system consists of a lyric-to-template module and a template-to-melody module). TeleMelody has two advantages. First, it is data efficient. The template-to-melody module is trained in a self-supervised way (i.e., the source template is extracted from the target melody) and thus does not need any paired lyric-melody data. The lyric-to-template module consists of several rules and a lyric-to-rhythm model, which is trained with paired lyric-rhythm data that is easier to obtain than paired lyric-melody data. Second, it is controllable. The design of the template ensures that the generated melodies can be controlled by adjusting the musical elements in the template. Both subjective and objective experimental evaluations demonstrate that TeleMelody generates melodies with higher quality, better controllability, and a lower requirement for paired lyric-melody data than previous generation systems.


Introduction
With the rapid development of artificial intelligence, automatic songwriting has drawn much attention from both academia and industry. Automatic songwriting covers many tasks, such as lyric generation (Malmi et al. 2016; Xue et al. 2021), melody generation (Wu et al. 2020; Choi, Fazekas, and Sandler 2016; Zhu et al. 2018), lyric-to-melody generation (Yu, Srivastava, and Canales 2021; Sheng et al. 2020; Bao et al. 2019; Lee, Fang, and Ma 2019), and melody-to-lyric generation (Sheng et al. 2020; Watanabe et al. 2018; Li et al. 2020). In this paper, we focus on lyric-to-melody generation, since it is one of the most important and common tasks in songwriting, yet is still under-explored.
In recent years, deep learning techniques have been widely used to develop end-to-end lyric-to-melody systems (Yu, Srivastava, and Canales 2021; Sheng et al. 2020; Bao et al. 2019; Lee, Fang, and Ma 2019). However, these systems suffer from the following issues: 1) They require large amounts of paired lyric-melody data to learn the correlation between syllables in lyrics and notes in melodies (Sheng et al. 2020), but collecting such paired data is difficult and costly. Sheng et al. (2020) attempt to alleviate this low-resource challenge with unsupervised pre-training of lyric-to-lyric and melody-to-melody models. However, the unpaired data helps with the understanding and generation of lyrics and melodies individually, while having little effect on learning the correlation between them. 2) They generate melodies directly from lyrics, which prevents end users from controlling musical elements (e.g., tonality and chord progression) during generation. Without controllability, user requirements may be ignored and the application scenarios are limited.
In this paper, we propose TeleMelody (Tele is from TEmpLatE), a two-stage lyric-to-melody generation system with a carefully designed template as a bridge to connect lyrics and melodies. The template contains tonality, chord progression, rhythm pattern, and cadence. This template is effective because: 1) it can be conveniently extracted from melodies and predicted from lyrics, and successfully captures their characteristics; 2) it is easy for users to understand and manipulate. Accordingly, we break down the lyric-to-melody task into a lyric-to-template module and a template-to-melody module, which reduces the task difficulty, improves data efficiency, and enables controllability over generated melodies. The details are as follows. In the template-to-melody module, we train a template-to-melody model with templates extracted from melodies by rules. Generating melodies from templates is much easier than generating them from lyrics, since the correlation between templates and melodies is much stronger than that between lyrics and melodies, and paired template-melody data can be easily obtained in a self-supervised way. In the lyric-to-template module, the rhythm pattern in the template is obtained by a lyric-to-rhythm model, which is trained with paired lyric-rhythm data. This paired data is obtained by extracting rhythm patterns from crawled lyric-audio data with audio processing tools, and is much easier to get than paired lyric-melody data. Cadence is inferred from punctuation mappings. Chord progression and tonality in the template can be acquired with predefined musical knowledge. In this way, the two modules rely on either self-supervised learning or data mining on external lyric-audio data, so they do not require any paired lyric-melody data and are more data-efficient than end-to-end models.
Moreover, benefiting from the template-based framework, end users can control the generated melodies by changing the musical elements in templates. Besides, in the sequence-to-sequence template-to-melody model, we use musical knowledge to guide the learning of attention alignments between template tokens and the corresponding melody tokens, which leads to better controllability.
The main contributions of this work are as follows:
• We propose TeleMelody, a two-stage lyric-to-melody system with a carefully designed template as the bridge. It decomposes the lyric-to-melody task into a lyric-to-template module and a template-to-melody module. This framework helps reduce the task difficulty and improve data efficiency.
• The chord progression, tonality, rhythm pattern, and cadence in the template help control basic musical elements and high-level music structure. We introduce alignment regularization based on musical knowledge to ensure better controllability over generated melodies.
• Experimental results demonstrate that TeleMelody significantly outperforms previous end-to-end lyric-to-melody generation models in both objective and subjective evaluations of generation quality, and is capable of better controlling the generated melodies.

Background
In this section, we introduce the background of the lyric-to-melody generation task and then briefly introduce each musical element in the template for better understanding.

Lyric-to-Melody Generation
Considerable development in lyric-to-melody generation has been seen in recent years, from rule-based or statistical methods to deep learning methods. Traditional rule-based or statistical methods usually need much manual design with abundant domain knowledge in music, and also prevent end users from controlling musical elements. Monteith, Martinez, and Ventura (2012) generate rhythm patterns with rule-based modules and construct an n-gram model to predict note pitch. Long, Wong, and Sze (2013) and Rabin (1963) learn the lyric-note correlation with statistical methods on limited paired data. In these works, the melody is generated either without control of any musical element or considering only one specific musical element. Fukayama et al. (2010) obtain the optimal pitch sequence by maximizing the conditional probability given chords, tonality, accompaniment bass, rhythm, and pitch accent information, but the generated melodies may suffer from poor musical structure without repetition patterns. Moreover, the algorithm cannot be directly applied to lyrics written in stress-accent languages like English.

Figure 1: The song "Twinkle Twinkle Little Star" in "C major" tonality, showing the melody, lyric, and chord progression.
Recently, developing lyric-to-melody systems based on machine learning methods has attracted much attention. Scirea et al. (2015) allocate the same number of notes as syllables in the lyrics using a Markov chain. Ackerman and Loker (2017) leverage random forest models to construct a rhythm model and a melody model separately on paired lyric-audio data. Bao et al. (2019) and Yu, Srivastava, and Canales (2021) use sequence-to-sequence models to generate melodies from lyrics. However, deep learning methods usually require large amounts of paired lyric-melody data to learn the correlation between lyrics and melodies. Sheng et al. (2020) attempt to address the low-resource challenge by pre-training lyric-to-lyric and melody-to-melody modules, and incorporating supervised learning into the pre-training to learn a shared latent space between lyrics and melodies. This is pioneering, but the unpaired data is not sufficiently utilized for learning the correlation between lyrics and melodies. Moreover, these works do not consider controlling specific musical elements of generated melodies. Inspired by template-based methods in language generation (Fabbri et al. 2020; Yang et al. 2020; Li et al. 2020), we propose TeleMelody, a two-stage template-based system that consists of a lyric-to-template module and a template-to-melody module. The two-stage framework helps address the issue of limited paired data, and the designed template together with alignment regularization ensures better controllability over generated melodies.

Music Background Knowledge
In this subsection, we use the song "Twinkle Twinkle Little Star" in Figure 1(a) as an example to introduce musical elements of the template.
• Tonality is composed of a scale and a root note. For a given tonality, the notes are distributed within a specific order and pitch intervals. For example, the tonality of the melody in Figure 1(a) is "C major", since the notes are ordered with pitches in the major scale and end on the root pitch "C".
• Chord progression is an ordered sequence of chords, where a chord is a set of three or more notes with harmonious pitches. In Figure 1(a), the chord progression is "C-F-G-F-G-C". A chord progression, interacting with the melody, should create a sense of harmony.
• Rhythm generally refers to the pattern of occurrence of notes and rests. In Figure 1(a), each note is aligned with a syllable, and notes in green boxes share the same rhythm patterns.
• Cadence occurs at the end of a phrase, gives a sense of ending in the melody, and is often aligned with punctuation marks in lyrics. In Figure 1(a), we assign the half cadence and the authentic cadence to the comma and the period respectively.

Methodology
The system architecture is shown in Figure 2(a). We introduce a template as a bridge between lyrics and melodies, and decompose lyric-to-melody generation into a lyric-to-template module and a template-to-melody module. In this section, we describe each component (i.e., the template, the lyric-to-template module, and the template-to-melody module) in detail.

Template
In this subsection, we introduce the template from several aspects: its definition, design principles, the musical elements it contains and their representation, and its connection with the generated melodies. The template is a well-designed sequence of musical elements that captures the characteristics of both lyrics and melodies. With this template as the connection, we decompose lyric-to-melody generation into a lyric-to-template module and a template-to-melody module, as shown in Figure 2(a). Several high-level principles of the template design are as follows: 1) Templates can be extracted directly from melodies, so that a template-to-melody model can be trained in a self-supervised way. 2) Templates obtained from lyrics should be in accordance with those extracted from melodies. 3) Compared with hidden representations in end-to-end lyric-to-melody models, templates should be easier to manipulate on demand.
Based on these principles, our template consists of tonality, chord progression, rhythm pattern, and cadence. The template representation is shown in Figure 2(c): we use one token at the start of the sequence to represent the tonality of the melody, and three consecutive tokens (chord, rhythm pattern, cadence) to represent the musical elements of each note. For better understanding, we provide an example template in Figure 1(b), which corresponds to the melody in Figure 1(a).
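The token layout described above (one tonality token, then a chord/rhythm/cadence triple per note) can be sketched as follows; the token spellings are illustrative, not the paper's exact vocabulary.

```python
def build_template(tonality, notes):
    """Encode a template as one tonality token followed by a
    (chord, rhythm_pattern, cadence) token triple per note.
    Token spellings here are illustrative, not the paper's exact vocabulary."""
    tokens = [f"TONALITY_{tonality}"]
    for chord, rhythm, cadence in notes:
        tokens += [f"CHORD_{chord}", f"RHYTHM_{rhythm}", f"CADENCE_{cadence}"]
    return tokens

# Example: a two-note template in C major.
template = build_template("C_MAJOR", [("C", 0, "NO"), ("G", 2, "HALF")])
# -> ['TONALITY_C_MAJOR', 'CHORD_C', 'RHYTHM_0', 'CADENCE_NO',
#     'CHORD_G', 'RHYTHM_2', 'CADENCE_HALF']
```

A template for n notes thus has 1 + 3n tokens, which is the source-side length the template-to-melody model consumes.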
The influence of the musical elements in templates on generated melodies is as follows (shown in Figure 2(c)): 1) Tonality controls the pitch distribution. 2) Chord influences the harmony of generated melodies. 3) Rhythm pattern constrains note positions and controls high-level musical structure with repetitive patterns. 4) Cadence guarantees the accordance between punctuation in lyrics and rest notes in melodies.

Template-to-Melody Module
The template-to-melody module generates melodies from given templates. An encoder-attention-decoder Transformer model is adopted, which learns the alignment between source and target sequences via the attention mechanism. In this subsection, we first introduce how musical elements in templates are extracted from melodies for model training, and then introduce alignment regularization, which leverages musical knowledge to improve controllability.
Template Extraction Method We introduce the method for extracting templates (i.e., tonality, chord progression, rhythm pattern, cadence) from melodies for model training:
• Tonality can be inferred from the note pitch distribution of the whole melody, following Liang et al. (2020).
• Chord progression can be inferred from the note pitch distribution using a Viterbi algorithm proposed by Magenta.
• Rhythm pattern can be inferred from the position information of notes in melodies.
• Cadence can be inferred from note pitch, onset, and duration. Specifically, 1) "no cadence" is assigned to a note when its duration is short (e.g., less than 1 beat) and the onset interval between the current note and the next note is small (e.g., less than 1.5 beats). 2) "authentic cadence" is assigned to a note when it is the root note of the tonic chord, or when its inferred chord is the tonic chord. There is also a probability p (e.g., 0.3) of labeling a note with "authentic cadence" when it is a non-root note of the tonic chord and the onset interval is large (e.g., more than 2 beats). 3) "half cadence" is assigned to notes outside the above two situations.
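The cadence extraction rule above can be sketched as a small labeling function, using the example thresholds from the text (1, 1.5, and 2 beats, p = 0.3); the function and argument names are ours.

```python
import random

def label_cadence(duration, onset_interval, is_tonic_root,
                  is_tonic_chord_note, chord_is_tonic, p=0.3, rng=random):
    """Assign a cadence label to a note following the rules in the text.
    duration and onset_interval are in beats; thresholds and p are the
    example values given in the paper."""
    # Rule 1: a short note with a small gap to the next note -> "no cadence".
    if duration < 1 and onset_interval < 1.5:
        return "no cadence"
    # Rule 2: root of the tonic chord, or a note whose inferred chord is tonic.
    if is_tonic_root or chord_is_tonic:
        return "authentic cadence"
    # Other tonic-chord tones with a large gap: authentic with probability p.
    if is_tonic_chord_note and onset_interval > 2 and rng.random() < p:
        return "authentic cadence"
    # Rule 3: everything else.
    return "half cadence"
```

Because rule 2 contains a random component, extraction is stochastic for non-root tonic-chord tones; passing a seeded `rng` makes it reproducible.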
Alignment Regularization With the introduction of templates, we are able to control the melody generation by adjusting musical elements in templates. To further increase model controllability, we introduce musical knowledge to the template-to-melody model through well-designed alignment regularization during training. Each alignment is designed based on musical knowledge and is imposed on the encoder-decoder attention.
Following Garg et al. (2019), alignment regularization is formulated as follows. We denote $m_k$ as the $k$-th note information token in the melody sequence, and $t_j$ as the $j$-th musical element token in the template sequence. $\hat{w}$ denotes a 0-1 matrix such that $\hat{w}_{k,j} = 1$ if $m_k$ is aligned with $t_j$. We normalize the rows of $\hat{w}$ to get a matrix $w$:

$$w_{k,j} = \hat{w}_{k,j} / T, \quad (1)$$

where $T$ is the number of tokens in the template sequence that $m_k$ is aligned to. We expect the encoder-decoder attention weight $A_{k,j}$ between $m_k$ and $t_j$ to be close to $w_{k,j}$. The alignment regularization term is defined as:

$$L_{attn} = -\frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{J} w_{k,j} \log A_{k,j}, \quad (2)$$

where $J$ and $K$ are the numbers of tokens in the source and target sequences respectively. Finally, we train our model to minimize $L_{attn}$ in conjunction with the negative log likelihood loss $L_{nll}$. The overall loss is:

$$L = L_{nll} + \lambda_{attn} L_{attn}, \quad (3)$$

where $\lambda_{attn}$ is a hyperparameter. As shown in Figure 2(c), we give an example to illustrate the alignment between the consecutive $i$-th and $(i{+}1)$-th notes in the melody and the corresponding musical elements in the template. The note information consists of bar index, position in a bar, pitch, and duration. The alignments in the 0-1 matrix $\hat{w}$ are designed as follows:
• We add an alignment between the tonality (at the beginning of the template) and every note pitch in the melody, since tonality controls the pitch distribution of the entire melody.
• We add an alignment between the chord of the $i$-th note in the template and the pitch of the $i$-th note in the melody, since the chord influences the note pitch.
• We add an alignment between the rhythm pattern of the $i$-th note in the template and the position of the $i$-th note in the melody, since the rhythm pattern determines the note position.
• We add an alignment between the cadence of the $i$-th note in the template and the duration of the $i$-th note in the melody, since cadence controls the note duration. Besides, since the onset interval between the $i$-th and $(i{+}1)$-th notes is also closely related to the pause, we add alignments between the cadence of the $i$-th note and the bar index and position of the $(i{+}1)$-th note. We also add an alignment between the cadence of the $i$-th note and the pitch of the $i$-th note, since pitch is an important factor in distinguishing "half cadence" from "authentic cadence".
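The regularization term can be sketched in pure Python as below; this is an illustrative, unbatched version (the real model computes it over attention tensors), and the averaging convention over rows is our reading of the formulation.

```python
import math

def alignment_loss(attention, hard_alignment):
    """Attention regularization in the style of Garg et al. (2019).
    `attention` is the K x J encoder-decoder attention matrix A (rows sum
    to 1); `hard_alignment` is the 0-1 matrix w_hat built from the
    musical-knowledge alignment rules."""
    K = len(hard_alignment)
    J = len(hard_alignment[0])
    loss = 0.0
    for k in range(K):
        T = sum(hard_alignment[k])        # template tokens aligned to m_k
        if T == 0:
            continue
        for j in range(J):
            w = hard_alignment[k][j] / T  # row-normalized target weight
            if w > 0:
                loss -= w * math.log(attention[k][j])
    return loss / K
```

When the attention already matches the hard alignment exactly, the loss is zero; a uniform attention over a one-hot target gives a positive penalty, pushing attention toward the designed alignments.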

Lyric-to-Template Module
In this subsection, we describe the lyric-to-template module, which generates musical elements (i.e., tonality, chord progression, rhythm pattern, and cadence) in templates from lyrics. Tonality and chord progression are weakly correlated with lyrics and can be manipulated on demand. Therefore, we focus on generating rhythm pattern and cadence in the following subsections.
Lyric-to-Rhythm Model Previous works (Monteith, Martinez, and Ventura 2012; Fukayama et al. 2010) usually generate a fixed rhythm pattern based on the syllable count in a bar, which needs much manual design and constrains the diversity of rhythm patterns. Therefore, we introduce a lyric-to-rhythm model to predict the rhythm pattern in the template from given lyrics. We utilize an encoder-attention-decoder Transformer model to generate rhythm patterns auto-regressively, which requires a large amount of paired lyric-rhythm data for training. To collect adequate training data, we extract paired data from crawled lyric-singing audio data with audio processing tools. The data pipeline is similar to Xue et al. (2021) and is described in Appendix A.
Punctuation-Cadence Mapping Since punctuation marks in lyrics are closely related to cadences in melodies, we design a musical-knowledge-based mapping as follows: 1) We align the "authentic cadence" and the "half cadence" with the period and the comma in lyrics respectively, indicating the end of a sentence. 2) We align the "no cadence" label with each other syllable in the lyrics, since it is a voiced part.
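The mapping above can be sketched as a per-syllable labeling pass; the exact syllable representation (here, strings with trailing punctuation attached) is our assumption.

```python
def lyric_to_cadence(syllables):
    """Map each syllable to a cadence label via the punctuation rules:
    a syllable ending with a period gets "authentic cadence", one ending
    with a comma gets "half cadence", and all other (voiced) syllables
    get "no cadence"."""
    labels = []
    for syl in syllables:
        if syl.endswith("."):
            labels.append("authentic cadence")
        elif syl.endswith(","):
            labels.append("half cadence")
        else:
            labels.append("no cadence")
    return labels

labels = lyric_to_cadence(["twin", "kle", "star,", "are."])
# -> ['no cadence', 'no cadence', 'half cadence', 'authentic cadence']
```

Because the mapping is deterministic, the cadence track of the template is fully determined by the lyrics at inference time.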

Experimental Settings
In this section, we introduce the experimental settings, including datasets, model configurations, and evaluation metrics. Our code is available via this link.

Dataset
We conduct experiments on both English (EN) and Chinese (ZH) lyric-to-melody generation tasks to verify the effectiveness of TeleMelody. For lyric-to-template generation, we construct two lyric-rhythm datasets in English and Chinese respectively for training lyric-to-rhythm model. For template-to-melody generation, we only require one template-melody dataset since templates are language independent. The processing details about melody and template are provided in Appendix B.
Lyric-Rhythm Dataset Following the data mining pipeline described in Appendix A, we collect adequate paired lyric-rhythm training data: 9,761 samples in English and 74,328 samples in Chinese. Statistics of the two datasets are shown in Table 4 of Appendix C.
Template-Melody Dataset Since templates are language independent, we only require one template-to-melody model for lyrics in both languages. We use the LMD-matched MIDI dataset (Raffel 2016), which contains 45,129 MIDI files. We pre-process it to obtain high-quality melody data as follows: First, we extract the melody from the track with at least 50 notes that has the highest average pitch among all tracks, and then delete polyphonic notes. Second, we normalize the tonality to "C major" or "A minor", and normalize the pitches to fit the vocal range. Finally, we filter out empty bars in each sample. After these steps, we obtain the processed target melody dataset.
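The tonality normalization step can be sketched as a pitch-class transposition, assuming the key root has already been detected (e.g., following Liang et al. 2020). The helper below is ours and covers the major-key case; minor keys would analogously map their root to A.

```python
def normalize_tonality(pitches, key_root):
    """Transpose MIDI pitches so that a major key's root maps to C.
    `key_root` is the detected root's pitch class (0 = C, ..., 11 = B);
    key detection itself is assumed to be done elsewhere."""
    shift = (-key_root) % 12
    if shift > 6:          # prefer the smaller transposition interval
        shift -= 12
    return [p + shift for p in pitches]

normalize_tonality([62, 66, 69], 2)  # a D-major fragment -> C major
# -> [60, 64, 67]
```

Keeping the transposition interval within half an octave helps the result stay inside a singable vocal range, in line with the pitch normalization described above.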
To construct the paired template-melody dataset, we utilize our proposed knowledge-based rules to extract the corresponding template from each target melody. Statistics of the template-melody dataset are shown in Table 5 of Appendix C.

System Configurations
We choose an encoder-attention-decoder Transformer in our system. The configurations of the two models in TeleMelody are as follows: 1) The lyric-to-rhythm Transformer model consists of 4 encoder layers and 4 decoder layers. The hidden size and the number of attention heads are set to 256 and 4. During training, we apply dropout with a rate of 0.2. We use the Adam optimizer (Kingma and Ba 2014) with a learning rate of 0.0005. 2) The template-to-melody Transformer model has a 4-layer encoder and a 4-layer decoder with 4 attention heads in each layer. We set the hidden size to 256 and the dropout rate to 0.0005. The alignment regularization weight λ_attn is set to 0.05 and the learning rate of the Adam optimizer is set to 0.0005. 3) To ensure the diversity of generated melodies, we use stochastic sampling at inference following Huang et al. (2018). The temperature and top-k parameters are set to 0.5 and 2 in lyric-to-rhythm generation, and to 0.5 and 10 in template-to-melody generation.
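The stochastic sampling used at inference can be sketched as temperature-scaled top-k sampling over the decoder's logits; the function below is a minimal illustration with our own naming, using the lyric-to-rhythm setting (temperature 0.5, k = 2) as defaults.

```python
import math
import random

def sample_top_k(logits, temperature=0.5, k=2, rng=random):
    """Scale logits by the temperature, keep only the top-k tokens,
    renormalize, and draw one token index from the result."""
    scaled = [l / temperature for l in logits]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    m = max(scaled[i] for i in top)                      # for numerical stability
    weights = {i: math.exp(scaled[i] - m) for i in top}
    z = sum(weights.values())
    r, acc = rng.random(), 0.0
    for i in top:
        acc += weights[i] / z
        if r <= acc:
            return i
    return top[-1]
```

A low temperature sharpens the distribution and a small k prunes unlikely tokens, trading diversity for coherence; the chosen index is always one of the k highest-scoring tokens.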

Evaluation Metrics
We conduct both objective and subjective evaluations to validate the performance of our proposed system.

Objective Evaluation Following previous work (Sheng et al. 2020), we use two metrics to measure the similarity between generated and ground-truth melodies: 1) similarity of pitch and duration distributions (PD and DD); 2) melody distance (MD). Besides, we use the accuracy of tonality, chord, rhythm pattern, and cadence (TA, CA, RA, AA) to measure the consistency between the generated melody and the musical elements in the template. The more consistent the generated melody is with the template, the more controllable the model is. The TA, RA, and AA of ground-truth melodies are 100%, while their CA score is less than 100%, since introducing some notes outside the chord is encouraged to avoid monotony. Therefore, the CA score is better when it is closer to that of the ground-truth melodies. Accordingly, we consider a model more controllable the closer its TA, CA, RA, and AA are to those of the ground-truth melodies. The definitions of these metrics are described in Appendix F.
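The exact metric definitions are deferred to Appendix F; as an illustration only, a distribution-similarity score in the spirit of PD/DD can be instantiated as the overlapping area of two normalized histograms (this particular formula is our assumption, not the paper's definition).

```python
def distribution_similarity(hist_a, hist_b):
    """Overlap of two histograms over the same bins, after normalizing
    each to sum to 1. Returns 1.0 for identical distributions and 0.0
    for fully disjoint ones."""
    za, zb = sum(hist_a), sum(hist_b)
    return sum(min(a / za, b / zb) for a, b in zip(hist_a, hist_b))
```

Applied to per-song pitch (or duration) histograms of generated and ground-truth melodies and averaged over the test set, such a score rewards matching the overall pitch/duration usage rather than note-by-note agreement.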

Subjective Evaluation
We invite 10 participants with music knowledge as human annotators to evaluate 10 songs in each language. Each participant rates properties of the melodies on a five-point scale, from 1 (Poor) to 5 (Perfect). The whole evaluation is conducted in blind-review mode. Following previous works (Sheng et al. 2020; Zhu et al. 2018; Watanabe et al. 2018), we use the following subjective metrics to evaluate the generated melodies: 1) Harmony: Is the melody itself harmonious? 2) Rhythm: Does the rhythm of the melody sound natural and suitable for the lyrics? 3) Structure: Does the melody consist of repetitive and impressive segments? 4) Quality: What is the overall quality of the melody?

Experimental Results
In this section, we first compare TeleMelody with baselines to demonstrate its effectiveness. Then, we analyze TeleMelody in terms of controllability and data efficiency. Audio samples of the generated melodies are available via this link.

Main Results
In this subsection, we evaluate the performance of TeleMelody. Two baselines are considered: 1) SongMASS (Sheng et al. 2020), an end-to-end Transformer model that also addresses the low-resource scenario with unsupervised lyric-to-lyric and melody-to-melody pre-training; 2) a Transformer baseline, a Transformer model directly trained with paired lyric-melody data, which is similar to SongMASS without pre-training and the alignment constraint. Our TeleMelody (including the lyric-to-rhythm and template-to-melody models) has a similar number of model parameters to SongMASS and the Transformer baseline for fair comparison. For English, the two baselines use 8,000 paired lyric-melody samples (Yu, Srivastava, and Canales 2021), and SongMASS additionally uses 362,237 unpaired lyrics and 176,581 unpaired melodies for pre-training. For Chinese, the two baselines use 18,000 paired lyric-melody samples (Bao et al. 2019), and SongMASS additionally uses 228,000 unpaired lyrics and 283,000 unpaired melodies crawled from the Web. As shown in Table 1, TeleMelody significantly outperforms the Transformer baseline on all objective metrics (improvement on EN: 19.45% in PD, 5.02% in DD, and 0.29 in MD; improvement on ZH: 38.36% in PD, 13.45% in DD, and 3.95 in MD). Compared with SongMASS, TeleMelody also performs better on all objective metrics (improvement on EN: 9.54% in PD, 4.18% in DD, and 0.21 in MD; improvement on ZH: 23.40% in PD, 1.39% in DD, and 0.49 in MD). Meanwhile, on all subjective metrics, TeleMelody is better than both the Transformer baseline and SongMASS. Specifically for quality, TeleMelody outperforms the Transformer baseline by 1.49 in English and 1.7 in Chinese, and outperforms SongMASS by 1.10 in English and 1.43 in Chinese.
The results show that an end-to-end Transformer model has poor performance, since the available lyric-melody paired data is limited. In SongMASS, the lyric-to-lyric and melody-to-melody unsupervised pre-training can effectively improve the performance of end-to-end models, but it is still insufficient since the unlabeled data is not effectively utilized to improve the correlation learning between lyrics and melodies. TeleMelody performs the best, since it successfully reduces the difficulty and effectively utilizes the unpaired melodies in the two-stage framework. Besides, the improvements in Table 1 are also consistent with the intuition in designing the template: 1) tonality and chord progression in the template can control note pitch and thus help improve PD and harmony; 2) rhythm pattern and cadence in the template can control both onset and duration of notes, and thus help improve DD and rhythm; 3) the repetitive patterns in template can help improve structure.

Method Analyses
In this subsection, we analyze the proposed method from two aspects: 1) verifying the controllability of TeleMelody, as well as the effect of the designed template and the alignment regularization on controllability; 2) verifying data efficiency by testing TeleMelody with varying amounts of paired lyric-rhythm training data. Besides, in Appendix D, we also evaluate the effectiveness of the rule for extracting cadence from melodies.
Controllability Study Previous deep-learning-based lyric-to-melody generation works prevent end users from controlling musical elements in the generated melodies. As shown in Table 2, TeleMelody with the template achieves high RA and AA and is close to the ground truth in CA, which indicates that the melody generated by TeleMelody is highly consistent with the rhythm, chord, and cadence in the template. Meanwhile, TeleMelody can also control tonality with good TA accuracy. Moreover, the proposed alignment regularization (AR) further improves controllability (with TA, CA, RA, and AA closer to the ground truth). To show the effect of alignment regularization intuitively, we visualize the average encoder-decoder attention weights of all heads in the last layer. As shown in Figure 3, after adding alignment regularization (template + AR), the related elements in the template and the generated melody are clearly aligned, according to the alignment rules introduced in Section 3.2.
We also conduct a case study to illustrate the controllability of TeleMelody and how the elements in the template affect the generated melody. The baseline melody is shown in Figure 4(a), with a "C major" tonality and an "Am-F-C-G" chord progression. We evaluate the control performance from the following aspects:
• Tonality determines all the pitches in the melody. As shown in Figure 4(b), when we change the tonality in the template from "C major" to "A minor", most of the pitches change, and the pitch of the ending note changes from the tonic pitch of "C major" to the tonic pitch of "A minor".
Figure 3: Alignment visualization. We denote the x-coordinate "i-feature" as the feature of the i-th note in the melody sequence, and the y-coordinate "i-element" as the musical element of the i-th note in the template sequence.

Figure 4: Case study melodies for the lyric "days like this I want to drive away, days like this I want to drive away." (a) Basic melody: tonality "C major" and chord progression "Am-F-C-G". (b) Adjusting the tonality to "A minor".

• For each note, a chord is provided in the template, which affects the pitch of the note. As shown in Figure 4(c), when we change the chords of the notes in the third bar from "C" to "Am", the pitches in the third bar change correspondingly.
• Rhythm affects the onset positions of the notes. For example, in Figure 4, when we use the same rhythm for the first and third bars, as labeled by green or blue boxes in each melody, the onset positions of the notes in the two bars are the same.
• Cadence affects the pauses in the generated melody. Meanwhile, for a note with "authentic cadence" in the template, the pitch should be the tonic pitch. As shown in Figure 4, a pause exists in the melody for each punctuation mark at the end of the second and fourth bars, as labeled by pink boxes, since a "half cadence" or an "authentic cadence" is added for each punctuation mark. Moreover, the pitch of the ending note of the entire melody is the tonic pitch, as labeled by orange boxes, since an "authentic cadence" is added for each period.
Data Efficiency Study Our proposed TeleMelody is data-efficient: 1) In the template-to-melody module, we train a model with templates extracted from melodies in a self-supervised way. 2) In the lyric-to-template module, only the lyric-to-rhythm model requires paired training data, which is much easier to obtain than paired lyric-melody data. We further study the performance of the lyric-to-rhythm model with varying amounts of paired lyric-rhythm data, to demonstrate that this model is also data-efficient. As shown in Table 3, when reducing the paired lyric-rhythm data to 50%, the performance of TeleMelody declines only slightly, while it still outperforms SongMASS on all metrics. We further reduce the paired lyric-rhythm data to zero, to demonstrate that TeleMelody can be applied in scenarios without any paired data, by replacing the lyric-to-rhythm model with hand-crafted rules. The details of the hand-crafted rules are described in Appendix E. Table 3 shows that when replacing the lyric-to-rhythm model with hand-crafted rules, the performance of TeleMelody degrades significantly, which demonstrates the advantage of the lyric-to-rhythm model. Moreover, Table 3 also shows that, compared to SongMASS trained with a large amount of paired data and large-scale pre-training, TeleMelody achieves comparable performance on all objective metrics in English without any paired data. This promising result illustrates that TeleMelody is also effective in the zero-resource scenario.

Conclusion
In this paper, we proposed TeleMelody, a two-stage lyric-to-melody generation system with a music template (e.g., tonality, chord progression, rhythm pattern, and cadence) to bridge the gap between lyrics and melodies. TeleMelody is data-efficient and can be controlled by end users by adjusting the musical elements in the template. Both subjective and objective experimental evaluations demonstrate that TeleMelody can generate melodies with higher quality than previous lyric-to-melody generation systems. Moreover, experimental studies also verify the data efficiency and controllability of TeleMelody. In future work, we will extend our proposed two-stage framework to other music generation tasks (e.g., melody-to-lyric generation and melody-to-accompaniment generation).

each lyric. To extract tempo information, we perform a direct estimation from the accompaniment audio with an audio information retrieval tool, librosa. Finally, with the tempo information and the timestamp of each lyric, we can infer the beat-level onset, that is, the corresponding rhythm of each lyric.

B.1 Melody
In this paper, we only consider melodies with a constant tempo and a time signature of 4/4. Each note is represented by four consecutive tokens (bar, position, pitch, and duration). We use 256 tokens to represent different bars and 16 tokens to represent different positions in a bar with a granularity of a 1/16 note. We use 128 tokens to represent pitch values following the MIDI format. We use 16 tokens to represent duration values ranging from a 1/16 note to a whole note.
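The four-token note representation above can be sketched as follows. The vocabulary sizes (256 bars, 16 positions, 128 MIDI pitches, 16 durations) follow the paper, while the contiguous token-id layout is an assumption for illustration:

```python
# Vocabulary sizes from the paper; the offset scheme below is assumed.
BAR_SIZE, POS_SIZE, PITCH_SIZE, DUR_SIZE = 256, 16, 128, 16

def encode_note(bar, position, pitch, duration_16ths):
    """Map one note to four consecutive token ids.

    bar: 0..255, position: 0..15 (1/16-note grid within a bar),
    pitch: 0..127 (MIDI note number), duration_16ths: 1..16
    (a 1/16 note up to a whole note).
    """
    assert 0 <= bar < BAR_SIZE and 0 <= position < POS_SIZE
    assert 0 <= pitch < PITCH_SIZE and 1 <= duration_16ths <= DUR_SIZE
    bar_tok = bar
    pos_tok = BAR_SIZE + position
    pitch_tok = BAR_SIZE + POS_SIZE + pitch
    dur_tok = BAR_SIZE + POS_SIZE + PITCH_SIZE + (duration_16ths - 1)
    return [bar_tok, pos_tok, pitch_tok, dur_tok]

# e.g. middle C (MIDI 60) at the start of bar 0, quarter-note duration
print(encode_note(0, 0, 60, 4))  # [0, 256, 332, 403]
```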

B.2 Template
In this paper, the template contains musical elements including tonality, chord progression, rhythm pattern, and cadence. We only consider "C major" and "A minor" tonalities for simplicity, since other tonalities can be transposed to these two tonalities based on their scales. A chord consists of a root note and a chord quality. We consider 12 chord roots (C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B) and 7 chord qualities (major, minor, diminished, augmented, major7, minor7, half-diminished), resulting in 84 possible chords in total. We use 4 tokens ranging from 0 to 3 to represent rhythm patterns, that is, beat-level onset positions in a bar. For cadence, we consider "half cadence", "authentic cadence", and "no cadence", which are aligned with commas, periods, and other syllables in lyrics, respectively.

In the lyric-to-template module, we directly obtain cadence from lyrics in the inference stage through the punctuation-to-cadence mapping. In the template-to-melody module, we extract cadence from melodies in the training stage through the cadence extraction rule. Therefore, a question may arise: is there a gap in cadence between training and inference? To answer this question, we explore the statistics of cadences in the template-melody dataset. The results are shown in Table 6:
• Notes labeled with "no cadence", which are aligned with syllables in lyrics, have short duration and onset interval. Notes labeled with "half cadence", which are aligned with commas in lyrics, have 4.91× longer duration and 4.22× longer onset interval than those labeled with "no cadence". Notes labeled with "authentic cadence", which are aligned with periods in lyrics, have 5.94× longer duration and 5.11× longer onset interval than those labeled with "no cadence". This is consistent with musical knowledge, since punctuation marks are usually aligned with pauses.
• The average number of "no cadence" labels is 5.86× greater than the average number of "half cadence" and "authentic cadence" labels combined.
This ratio is similar to the ratio of syllables to punctuation marks in lyrics, as shown in Table 4.
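The chord vocabulary and the punctuation-to-cadence mapping described above can be sketched as follows; the identifier names and quality abbreviations are illustrative assumptions, not the paper's actual token strings:

```python
# Sketch of the template vocabularies (assumed naming).
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
QUALITIES = ["maj", "min", "dim", "aug", "maj7", "min7", "half_dim"]

# 12 roots x 7 qualities = 84 possible chords
CHORDS = [f"{root}:{quality}" for root in ROOTS for quality in QUALITIES]

def cadence_for(token):
    """Punctuation-to-cadence mapping used at inference time."""
    if token == ",":
        return "half cadence"
    if token == ".":
        return "authentic cadence"
    return "no cadence"  # ordinary syllables

print(len(CHORDS))  # 84
```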

E Lyric-to-Rhythm Hand-Craft Rules
We design several lyric-to-rhythm rules to demonstrate that TeleMelody can be applied in scenarios without any paired data. Specifically, for English lyrics, a note corresponds to a syllable, and we generate the rhythm patterns syllable by syllable: the onset interval between a note and its previous note is 2 beats if its corresponding syllable is the start of a sentence, 1 beat if its corresponding syllable is the start of a word but not the start of a sentence, and 0.5 beat otherwise. For Chinese lyrics, a note corresponds to a character, and we generate the rhythm patterns character by character: the onset interval between a note and its previous note is 2 beats if its corresponding character is the start of a sentence, and 1 beat otherwise.
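A minimal sketch of the English rule above, assuming already-syllabified input (a list of sentences, each a list of words, each a list of syllables); it returns cumulative onset times in beats:

```python
def rhythm_from_lyrics(sentences):
    """Apply the hand-crafted English onset-interval rule.

    sentences: list of sentences; each sentence is a list of words;
    each word is a list of syllable strings. Returns one onset time
    (in beats) per syllable.
    """
    onsets, t = [], 0.0
    for sent in sentences:
        for wi, word in enumerate(sent):
            for si, _syllable in enumerate(word):
                if wi == 0 and si == 0:
                    t += 2.0   # start of a sentence
                elif si == 0:
                    t += 1.0   # start of a word, mid-sentence
                else:
                    t += 0.5   # syllable within a word
                onsets.append(t)
    return onsets

# "hel-lo world" -> sentence start, then within-word, then word start
print(rhythm_from_lyrics([[["hel", "lo"], ["world"]]]))  # [2.0, 2.5, 3.5]
```

The Chinese rule is the same loop with the inner syllable level removed, since each character maps to exactly one note.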
F Definitions of TA, RA, CA, and AA
We evaluate model controllability with the accuracy of tonality, rhythm pattern, chord, and cadence (TA, RA, CA, AA), that is, the proportion of notes consistent with the given template. Specifically, a melody is consistent with the tonality if its inferred tonality is the same as the template tonality; a note is consistent with the rhythm pattern if its position accords with the given rhythm pattern information; a note is consistent with the chord if its pitch is within the chord; a note is consistent with the cadence if both its duration and the onset interval between this note and the next note comply with the extraction rules described in Section 3.2.
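The chord-accuracy (CA) check, for instance, can be sketched as follows: a note is consistent with its chord if its pitch class belongs to the chord's pitch classes. The interval patterns for the 7 qualities are standard music theory; the function names are illustrative assumptions:

```python
# Semitone intervals above the root for each chord quality.
QUALITY_INTERVALS = {
    "maj": (0, 4, 7), "min": (0, 3, 7), "dim": (0, 3, 6),
    "aug": (0, 4, 8), "maj7": (0, 4, 7, 11), "min7": (0, 3, 7, 10),
    "half_dim": (0, 3, 6, 10),
}

def pitch_in_chord(pitch, root, quality):
    """pitch: MIDI note number; root: 0..11 (C=0); quality: key above."""
    return (pitch - root) % 12 in QUALITY_INTERVALS[quality]

def chord_accuracy(pitches, chords):
    """CA: fraction of notes whose pitch lies within the given chord.

    pitches: MIDI note numbers; chords: parallel list of (root, quality).
    """
    hits = sum(pitch_in_chord(p, r, q) for p, (r, q) in zip(pitches, chords))
    return hits / len(pitches)

# C major triad tones vs. one out-of-chord note (C#)
print(chord_accuracy([60, 64, 67, 61], [(0, "maj")] * 4))  # 0.75
```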