SongRewriter: A Chinese Song Rewriting System with Controllable Content and Rhyme Scheme

Although lyrics generation has achieved significant progress in recent years, it has limited practical applications because the generated lyrics cannot be performed without composing compatible melodies. In this work, we bridge this practical gap by proposing a song rewriting system which rewrites the lyrics of an existing song such that the generated lyrics are compatible with the rhythm of the existing melody and thus singable. In particular, we propose SongRewriter, a controllable Chinese lyrics generation and editing system which assists users without prior knowledge of melody composition. The system is trained by a randomized multi-level masking strategy which produces a unified model for generating entirely new lyrics or editing a few fragments. To improve the controllability of the generation process, we further incorporate a keyword prompt to control the lexical choices of the content and propose novel decoding constraints and a vowel modeling task to enable flexible end and internal rhyme schemes. While prior rhyming metrics are mainly for rap lyrics, we propose three novel rhyming evaluation metrics for song lyrics. Both automatic and human evaluations show that the proposed model performs better than the state-of-the-art models in both content and rhyming quality.

Current methods for generating singable lyrics are typically conditioned on a given melody. Accordingly, they treat the generation as a sequence-to-sequence translation task. However, there are two main challenges: 1) the parallel lyrics-melody data available for training is limited, consisting mainly of the 7,998 songs proposed by Yu et al. (2020); 2) melody notes and lyric syllables are loosely correlated, and thus the alignments between them are hard to learn from the limited data. Therefore, prior works simplify the task by assuming a one-to-one mapping between melody notes and lyric syllables (Yu et al., 2020; Ma et al., 2021). However, this assumption may lead to sub-optimal performance, as the mapping is usually many-to-one in real life.
Another important research question in lyrics generation is how to control the generated content and rhyme. In prior works, content control is usually achieved by conditioning the generation on given hint words or sentences (Shen et al., 2019; Zhang et al., 2022). However, they usually ignore the requirement of generating lyrics from a draft, where the model needs to edit some sentences or words. In addition, prior works on rhyme control mainly focus on rhymes at the end of sentences (end rhyme) (Potash et al., 2015; Nikolov et al., 2020; Xue et al., 2021; Liu et al., 2022). To the best of our knowledge, no work has addressed both internal rhyme and end rhyme schemes.
In this work, we develop a user-assisting AI system, SongRewriter, which can generate singable lyrics by rewriting part of or the whole lyrics of a given song (i.e., partial rewriting and full rewriting, respectively) with controllable content and rhyme schemes, as illustrated in Figure 1. To address the difficulty of learning the correlation between melody and lyrics from a limited amount of parallel data, we propose to generate lyrics which have the same number of syllables as the original lyrics of the song. Therefore, the generated lyrics align well with the melody. In addition, this method can learn the generation directly from text data and thus bypasses the need for parallel datasets. Specifically, we adopt a transformer-based sequence-to-sequence auto-regressive model (Vaswani et al., 2017) as our model backbone. The model is trained by masking a few random fragments from the encoder's inputs and predicting the masked fragments with the decoder. The generation process can be controlled in three aspects. 1) To enable rewriting arbitrary parts of the lyrics, we train the model by randomly applying one of three masking strategies (token-level, sentence-level and song-level), corresponding to different levels of rewriting tasks. 2) To control the lexical choice of the content, we allow the generation to condition on given keyword prompts. This is achieved by training on keywords extracted from the masked positions as additional encoder inputs. 3) To enable lyrics generation with arbitrary pre-defined rhyme schemes with specified vowels, we introduce additional vowel inputs and apply a vowel masking strategy during training. This equips the model with the ability to generate tokens with the required vowels at arbitrary positions. Since end rhyme is the most frequently used rhyme type, we specifically incorporate reverse language modelling and propose decoding constraints during inference to improve the vowel consistency and rhyming word diversity.
We evaluate our model on both generation controllability and quality in terms of keyword recall, vowel accuracy, lexical diversity, coherence, perplexity and rhyme quality. Since prior evaluation metrics for rhyme quality are mainly designed for rap lyrics and ignore the problem of identical rhyming words, we propose three new rhyme metrics to measure the local rhyme between adjacent sentences, the global rhyme and the diversity of rhyming words, respectively. Experimental results show that our model outperforms baseline models and state-of-the-art models on both full and partial rewriting tasks.
Our contributions are summarized as follows:
• We propose SongRewriter, which generates melody-aligned lyrics by rewriting the lyrics of songs. It bypasses the difficulties of modelling the melody-lyrics correlation from limited parallel datasets.
• We propose a multi-level randomized masking scheme for training SongRewriter, which allows the model to rewrite arbitrary parts of the inputs according to the bidirectional context.
• We introduce a partial vowel masking strategy into training to enable lyrics generation on any rhyme schemes, and we design a novel decoding strategy to improve the end rhyme consistency and rhyming word diversity.
• We propose novel metrics for rhyme evaluation. We collect data from the internet for training and testing. Experiments show the effectiveness of our proposed model.
Rhyming is an essential element for lyrics and poetry. To model rhyme, current approaches can be mainly divided into three types. The first type is to encourage the model to learn the rhyme structure implicitly during training by adding additional rhyme signals. Potash et al. (2015) append an ⟨endLine⟩ token to each sentence. Li et al. (2020) employ an additional format embedding as input to emphasize the rhyming tokens. Most prior works focus on the rhyme at the end of the sentences (Xue et al., 2021; Liu et al., 2022). In this work, we extend the rhyme control to arbitrary rhyme schemes for both internal rhyme and end rhyme. To the best of our knowledge, this is the first work enabling arbitrary rhyme schemes.

Method
The proposed model is a transformer-based autoregressive sequence-to-sequence model (Vaswani et al., 2017). As shown in Figure 2, given an input of masked lyrics and a keyword prompt to the encoder, the decoder generates output tokens corresponding to the masked tokens of the input, which contain the keywords in the prompt and satisfy the vowel constraints. Such generation also improves decoding efficiency and forces the model to rely more on the source input.
In a basic setting, given the lyrics of a song, the tokens of the input lyrics take the following form: $[B], x_{00}, \ldots, [M], \ldots, [S], \ldots, x_{ij}, \ldots, [S], [E]$, where $x_{ij}$ denotes the $j$th token in the $i$th sentence of the lyrics, [S] is the inter-sentence delimiter, [B] is placed at the beginning of the lyrics and [E] at the end, and the tokens to be rewritten are replaced by [M], which the decoder predicts.
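The input format described above can be illustrated with a minimal Python sketch. The function name `build_encoder_input` and the choice of one [M] per masked character are assumptions made for illustration (the latter keeps the syllable count unchanged); the sketch also ignores the keyword prompt and the reversed character order introduced later.

```python
from typing import List, Set, Tuple

B, E, S, M = "[B]", "[E]", "[S]", "[M]"


def build_encoder_input(
    sentences: List[str], masked: Set[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Return (encoder_tokens, decoder_targets) for character-level lyrics.

    `masked` holds (i, j) pairs: the j-th character of the i-th sentence.
    Hypothetical helper, not the authors' implementation.
    """
    source, target = [B], []
    for i, sentence in enumerate(sentences):
        for j, char in enumerate(sentence):
            if (i, j) in masked:
                source.append(M)      # one [M] per masked character
                target.append(char)   # the decoder must predict this token
            else:
                source.append(char)
        source.append(S)              # delimiter after every sentence
    source.append(E)
    return source, target


src, tgt = build_encoder_input(["后来", "我总算"], {(0, 1), (1, 0)})
```

Because every masked character maps to exactly one [M], the number of decoder targets equals the number of syllables to be rewritten.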
To make the generation process controllable, we further incorporate a keyword prompt into the model to control lexical choices, propose a multi-level masking strategy that lets a single model rewrite arbitrary parts of the input, and propose a rhyme control strategy that injects predefined rhyme schemes and improves rhyming word diversity.

Keyword Prompt
Using keyword prompts to control text generation has been explored in other tasks, such as poetry generation (Zhipeng et al., 2019). In the task of lyrics generation, prior works mainly focus on theme control (Shen et al., 2019). In this work, we introduce keyword prompts into lyrics generation to force the model to generate lyrics containing these keywords.
Specifically, a keyword prompt is prepended to the input lyrics. The keyword prompt is a concatenation of a set of keywords in the following format: $[K], k_{00}, \ldots, [W], \ldots, k_{ij}, \ldots$, where $k_{ij}$ is the $j$th token in the $i$th keyword, [W] is the inter-keyword delimiter, and [K] is the start token of the prompt.
During training, we first use jieba to segment the masked fragments into words and obtain their part-of-speech tags. Then, we use the nouns and verbs to form a keyword database. Last, we sample a random number of keywords (ranging from 0 to 5) from the keyword database to form a keyword prompt. During inference, the keyword prompt is optional and can be provided by users.
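The training-time prompt construction can be sketched as follows. This is an illustrative sketch under stated assumptions: the input is a list of already segmented (word, POS flag) pairs, nouns and verbs are taken to be the words whose flag starts with "n" or "v", and [W] is placed only between keywords. The helper names are hypothetical.

```python
import random
from typing import List, Tuple

K, W = "[K]", "[W]"


def keyword_database(tagged_words: List[Tuple[str, str]]) -> List[str]:
    """Keep unique nouns and verbs, in order of first appearance."""
    seen, kept = set(), []
    for word, flag in tagged_words:
        # Assumption: noun flags start with "n", verb flags with "v".
        if flag.startswith(("n", "v")) and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept


def build_keyword_prompt(
    keywords: List[str], rng: random.Random, max_keywords: int = 5
) -> List[str]:
    """Sample between 0 and `max_keywords` keywords and serialise them."""
    count = rng.randint(0, min(max_keywords, len(keywords)))
    chosen = rng.sample(keywords, count)
    prompt = [K]
    for n, word in enumerate(chosen):
        if n > 0:
            prompt.append(W)   # delimiter between keywords (assumed placement)
        prompt.extend(word)    # character-level tokens
    return prompt
```

Sampling zero keywords yields a prompt containing only [K], which corresponds to the case where the user provides no keywords at inference time.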

Randomized Multi-Level Masking Scheme
As song rewriting requires the number of syllables in the original lyrics and the generated lyrics to be identical, we adopt the framework of MASS (Song et al., 2019), a sequence-to-sequence model pre-training method, which uses the decoder to predict the masked tokens in the encoder. However, MASS trains the model on single-sentence inputs by masking a fragment of continuous tokens (around 50% of the input). Therefore, it is not an optimal strategy for full lyrics generation and for editing arbitrary parts of the input lyrics.
Accordingly, we propose a novel masking strategy which masks the input at three levels (token-level, sentence-level and song-level) to simulate the partial rewriting and full rewriting tasks. During training, for each input lyric, we randomly apply one of the following masking strategies:
• Token-Level Masking: To simulate the task of rewriting phrases in a sentence, for each sentence in the input lyric, we mask a few fragments of the sentence with a random ratio and train the model to reconstruct the masked portions of the sentences.
• Sentence-Level Masking: To enable sentence rewriting, we mask a random ratio of sentences (entire sentences) from the input and train the model to reconstruct the masked sentences.
• Song-Level Masking: We mask all the input tokens to simulate the full rewriting task.
We denote the above three masking schemes as {TOKEN, SENT, ALL}, respectively. During training, for SENT, we sample a masking ratio from a uniform distribution U(0, 1) and randomly select the corresponding ratio of sentences. For TOKEN, we sample a masking ratio from a uniform distribution U(0, 1) for each sentence and then randomly select the corresponding ratio of tokens to mask.
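The sampling procedure for the three schemes can be sketched as below. Two points are assumptions of this sketch rather than details given in the text: the scheme is chosen uniformly among {TOKEN, SENT, ALL}, and TOKEN masks individual characters rather than contiguous fragments. The function name `sample_mask` is hypothetical.

```python
import random
from typing import List, Set, Tuple


def sample_mask(
    sentences: List[str], rng: random.Random
) -> Tuple[str, Set[Tuple[int, int]]]:
    """Pick a masking scheme and return the masked (sentence, char) positions."""
    scheme = rng.choice(["TOKEN", "SENT", "ALL"])  # assumed uniform choice
    masked: Set[Tuple[int, int]] = set()
    if scheme == "ALL":
        # Song-level: every token is masked (full rewriting).
        masked = {(i, j) for i, s in enumerate(sentences) for j in range(len(s))}
    elif scheme == "SENT":
        # Sentence-level: one ratio ~ U(0, 1) for the whole song.
        count = round(rng.random() * len(sentences))
        for i in rng.sample(range(len(sentences)), count):
            masked.update((i, j) for j in range(len(sentences[i])))
    else:
        # Token-level: a fresh ratio ~ U(0, 1) for each sentence.
        for i, s in enumerate(sentences):
            count = round(rng.random() * len(s))
            for j in rng.sample(range(len(s)), count):
                masked.add((i, j))
    return scheme, masked
```

Since the ratios cover the whole interval from 0 to 1, a single model sees everything from near-copy tasks to full rewriting during training.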

Rhyme Modeling and Control
Our method for rhyme modelling and control is divided into two parts: rhyme modelling and control for the final syllables of the lines (end rhyme), and rhyme control for an arbitrary rhyme scheme which defines the required vowels at specific positions in the generated lyrics (internal rhyme).

End Rhyme Modeling and Constraint
End rhyme is the most frequently used rhyme type for lyrics and poems. It occurs when the last words of the sentences rhyme. Inspired by the rhyme modelling for rap lyrics (Xue et al., 2021), we adopt reverse language modelling with two additional position embeddings, a sentence position embedding and a local position embedding, to facilitate the modelling of rhyme features in the lyrics. Specifically, we reverse the order of the characters in each sentence for both inputs and target outputs (while keeping the sentence order unchanged). Therefore, the reversed sentence starts with the potential rhyming character, i.e., the end character of the original sentence. Accordingly, the model can easily learn to identify the rhyming characters from the inputs with the local position $l_0$.
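As a minimal illustration of the reversal, the following Python sketch reverses the characters within each sentence and assigns sentence and local position indices. Special tokens are omitted for brevity and the function name is hypothetical.

```python
from typing import List, Tuple


def reverse_with_positions(
    sentences: List[str],
) -> Tuple[List[str], List[int], List[int]]:
    """Reverse characters inside each sentence; keep the sentence order.

    Returns the tokens together with their sentence position and their
    local position, so that local position 0 marks the original end character.
    """
    tokens, sent_pos, local_pos = [], [], []
    for i, sentence in enumerate(sentences):
        for l, char in enumerate(reversed(sentence)):
            tokens.append(char)
            sent_pos.append(i)
            local_pos.append(l)
    return tokens, sent_pos, local_pos


tokens, sent_pos, local_pos = reverse_with_positions(["abc", "de"])
```

In this toy example the characters "c" and "e", which end the original sentences, receive local position 0 and are generated first.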
However, since rhyming with identical words is considered inferior, we incorporate two control factors at inference time to encourage rhyme consistency and rhyming word diversity. Given the end character set $e_{<t} = \{e_0, \ldots, e_{t-1}\}$ and the corresponding vowel set $v_{<t} = \{v_0, \ldots, v_{t-1}\}$ from the previous $t$ sentences, we define the adjusted probability of the end character of the $(t+1)$th sentence being $x_i$ as
$$\tilde{p}^i_t \propto p^i_t \cdot \Lambda(v_{x_i}, v_{<t}) \cdot \Gamma(x_i, e_{<t}),$$
where $p^i_t$ is the predicted probability of the token $x_i$ in the vocabulary, and $v_{x_i}$ is the vowel of the token $x_i$. The two factors are:
• $\Lambda(\cdot)$ returns $\lambda$ if the vowel of the token $x_i$ appears in the end vowel set $v_{<t}$ of the previous sentences; otherwise, it returns 1.
• $\Gamma(\cdot)$ returns $\gamma$ if $x_i$ appears in the end character set $e_{<t}$ of the previous sentences; otherwise, it returns 1.
$\lambda$ and $\gamma$ are hyper-parameters that control the rhyming effect. While a larger $\lambda$ increases the probability of the model sampling a word whose vowel is in the previous end vowel set (improving rhyme consistency), a smaller $\gamma$ reduces the chance of a previously generated end word being chosen again (increasing rhyming word diversity).
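A small Python sketch of the two factors is given below. It assumes that the factors multiply the predicted probability and that the result is renormalised before sampling; both the function name and the dictionary-based interface are hypothetical. The default values of `lam` and `gamma` follow the settings reported in Appendix C.

```python
from typing import Dict, Set


def adjust_end_char_probs(
    probs: Dict[str, float],
    vowel_of: Dict[str, str],
    prev_end_chars: Set[str],
    lam: float = 1.4,
    gamma: float = 0.3,
) -> Dict[str, float]:
    """Rescale end-character probabilities with the two rhyme factors."""
    prev_vowels = {vowel_of[c] for c in prev_end_chars if c in vowel_of}
    scores = {}
    for tok, p in probs.items():
        rhyme = lam if vowel_of.get(tok) in prev_vowels else 1.0   # Lambda factor
        repeat = gamma if tok in prev_end_chars else 1.0           # Gamma factor
        scores[tok] = p * rhyme * repeat
    total = sum(scores.values())  # renormalisation is an assumption of this sketch
    return {tok: s / total for tok, s in scores.items()}


probs = {"来": 0.5, "爱": 0.3, "你": 0.2}
vowel_of = {"来": "ai", "爱": "ai", "你": "i"}
adjusted = adjust_end_char_probs(probs, vowel_of, prev_end_chars={"来"})
```

In this toy case "来" already ended a previous line, so it is boosted by the shared vowel but penalised for repetition, and "爱", which rhymes without repeating, becomes the most likely end character.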

Internal Rhyme by Vowel Modeling
While rhyme generally refers to end rhyme, where the last words of the lines rhyme with each other, internal rhyme has also been widely used, which usually includes multiple rhyming words either within the same line (usually one in the middle and the other one at the end) or in the middle of multiple lines. As shown in Figure 3, the highlighted characters (with the corresponding pinyin in parentheses) share the same vowel and rhyme. While end rhyme has been widely investigated for both poetry generation (Lau et al., 2018; Li et al., 2020) and rap lyrics generation (Xue et al., 2021; Nikolov et al., 2020), internal rhyme is yet to be explored.
To model internal rhyme as well as other rhyme schemes with specified vowels, we propose a restricted vowel loss to enable direct control of vowels at arbitrary positions. Specifically, as shown in Figure 2, during the training stage, for the masked fragments in the input, we only replace 80% of the vowel inputs with the [M] token. For the remaining 20% of masked input tokens with ground truth vowel inputs, we introduce an additional vowel prediction task. For the $t$th output token with a predicted distribution $p_t \in \mathbb{R}^{N \times 1}$, if the ground truth vowel $v_t$ is provided in the input, the additional training objective is
$$\mathcal{L}_{\text{vowel}} = -\log p(v_t \mid \cdot),$$
where the probability of the vowel $v_t$ at time step $t$ is calculated by
$$p(v_t \mid \cdot) = \sum_{i=1}^{N} p^i_t \cdot \mathbb{1}(V(x_i) = v_t),$$
where $p^i_t$ is the predicted probability of each token $i$ in the vocabulary at decoding time step $t$, and $V(x_i)$ is a mapping function which returns the vowel of the token $x_i$. The function $\mathbb{1}(V(x_i) = v_t)$ returns 1 if the vowel of the token $x_i$ is identical to the ground truth vowel $v_t$, and 0 otherwise. Thus, $p(v_t \mid \cdot)$ sums up the predicted probabilities of all the tokens with the vowel $v_t$.
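The vowel probability described above is a sum over all vocabulary tokens that share the required vowel. The following sketch computes it for a toy distribution and, assuming a negative log-likelihood form for the vowel objective, the corresponding loss. Function names and the dictionary interface are hypothetical.

```python
import math
from typing import Dict


def vowel_probability(
    token_probs: Dict[str, float], vowel_of: Dict[str, str], target_vowel: str
) -> float:
    """Sum the predicted probabilities of all tokens whose vowel matches."""
    return sum(p for tok, p in token_probs.items() if vowel_of.get(tok) == target_vowel)


def vowel_loss(
    token_probs: Dict[str, float],
    vowel_of: Dict[str, str],
    target_vowel: str,
    eps: float = 1e-12,
) -> float:
    """Negative log of the vowel probability (assumed form of the objective)."""
    return -math.log(vowel_probability(token_probs, vowel_of, target_vowel) + eps)


token_probs = {"来": 0.5, "爱": 0.3, "你": 0.2}
vowel_of = {"来": "ai", "爱": "ai", "你": "i"}
p_ai = vowel_probability(token_probs, vowel_of, "ai")
loss_ai = vowel_loss(token_probs, vowel_of, "ai")
```

The loss does not prefer any particular token with the required vowel; it only rewards total probability mass on that vowel, which leaves the choice of word to the language modelling objective.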
During the inference stage, an internal rhyme scheme can be created for the generated outputs by providing rhyming vowel inputs at specific positions in the masked fragments of the inputs.

Experiment

Datasets and Baselines
Our model is trained on three different datasets. We first crawl a large-scale text corpus from Baidu Encyclopedia, which is the largest Chinese online encyclopedia. We use this dataset to pretrain our model. Then, we crawl a lyrics dataset from two Chinese lyrics websites. Since the amount of lyrics data is limited, we further crawl prose as a supplementary dataset from a Chinese prose website. The pretrained model is first fine-tuned on the prose dataset and then on the lyrics dataset to produce our final system.
Since Chinese is a monosyllabic language, each character consists of one syllable. To control the number of syllables in the generated output, we use the BasicTokenizer from the transformers library (Wolf et al., 2020) for tokenization, which splits text into characters for Chinese and into words for other languages (mainly English). We keep the words and characters with frequency larger than 3,000 to build a vocabulary of size 6,572. For the vowels, we employ python-pinyin to extract the vowels of the Chinese characters. There are 21 distinct vowels in total.
We evaluate the proposed model, SongRewriter, on both full and partial rewriting tasks. For the full rewriting task, we compare SongRewriter with a Chinese GPT2 (Radford et al., 2019) and SongNet (Li et al., 2020). For the partial rewriting task, we compare the proposed model with ILM (Donahue et al., 2020). We add the keyword prompt function to the ILM model, resulting in ILM-Keyword, which is used as a baseline model for keyword-conditioned rewriting tasks. All the baseline models are fine-tuned on our lyrics dataset for a fair comparison.

Evaluation Metrics
We evaluate the performance in terms of two aspects, generation controllability and generation quality. The controllability metrics include Keyword Recall and Vowel Accuracy. The quality metrics include Diversity, Coherence, Perplexity-Test (PPL-Test) and Perplexity-Gen (PPL-Gen). Following prior works, we assume the optimal approach should generate lyrics with quality closest to that of human-written lyrics (Holtzman et al., 2020). Therefore, we report the absolute difference scores on some metrics: ∆Diversity, ∆Coherence and ∆PPL-Gen.
To measure rhyme quality, we propose three novel metrics:
• Local Rhyme (Rhyme-L): While a lyric may contain multiple rhyming vowels, sentences sharing the same rhyme are usually grouped together. Therefore, we propose local-rhyme-n to evaluate this localised characteristic. Specifically, we define a sentence as locally n-rhymed if, among the n sentences before and after the current sentence, there is at least one sentence sharing the same rhyming vowel with the current sentence. Thus, local-rhyme-n is defined as the number of locally n-rhymed sentences divided by the total number of sentences. We report Rhyme-L, which is the average of local-rhyme-n with n ∈ [1, 4].
• Global Rhyme (Rhyme-G): Apart from evaluating the rhyming effect from a local perspective, we also measure the rhyming effect of the lyrics as a whole. Since the more sentences share the same rhyming vowels, the better the rhyming effect, we evaluate the global rhyming performance by calculating the portion of duplicated rhyming vowels: $1 - \frac{\text{number of unique rhyming vowels}}{\text{total number of sentences}}$.
• Diversity of Rhyming Words (Dist-RW): Since rhyming with identical words is considered inferior, we evaluate the diversity of the rhyming words by calculating the ratio of the number of unique end words to the total number of end words.
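The three rhyme metrics can be computed from the list of end vowels and end words of a lyric. The sketch below follows the definitions above; it assumes a non-empty lyric and interprets "the n sentences before and after" as a window of up to n sentences on each side. Function names are hypothetical.

```python
from typing import List


def local_rhyme_n(vowels: List[str], n: int) -> float:
    """Fraction of sentences with a same-vowel sentence within n lines."""
    hits = 0
    for i, v in enumerate(vowels):
        window = vowels[max(0, i - n):i] + vowels[i + 1:i + 1 + n]
        hits += v in window
    return hits / len(vowels)


def rhyme_l(vowels: List[str]) -> float:
    """Average of local-rhyme-n for n = 1..4."""
    return sum(local_rhyme_n(vowels, n) for n in range(1, 5)) / 4


def rhyme_g(vowels: List[str]) -> float:
    """Portion of duplicated rhyming vowels."""
    return 1 - len(set(vowels)) / len(vowels)


def dist_rw(end_words: List[str]) -> float:
    """Ratio of unique end words to all end words."""
    return len(set(end_words)) / len(end_words)


end_vowels = ["ai", "ai", "i", "ai"]
end_words = ["来", "爱", "你", "来"]
scores = (rhyme_l(end_vowels), rhyme_g(end_vowels), dist_rw(end_words))
```

For this four-line toy lyric, local-rhyme-1 is 0.5 and local-rhyme-n is 0.75 for n = 2, 3, 4, so Rhyme-L is 0.6875, Rhyme-G is 0.5 and Dist-RW is 0.75.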

Full Song Rewriting
We evaluate the performance of SongRewriter on the task of full song rewriting by masking all the input tokens. This task is similar to lyrics generation with a fixed format, i.e., the number of sentences and the length of each sentence are pre-defined.
As shown in Table 1, SongRewriter significantly outperforms other models in terms of both rhyme quality and content quality. For the content, the generated outputs of SongRewriter are closer to the human-written lyrics in the aspects of lexical diversity, coherence and fluency (PPL-Gen), and PPL-Test further verifies the language modeling ability of SongRewriter.
By inspecting the generated outputs, we observe that: 1) The generated lyrics are fluent and coherent in general. 2) Most lines share the same vowel and rhyme with their neighbouring lines with diverse rhyming words. 3) Similar to human-written lyrics, automatically generated lyrics contain duplicated blocks, which indicates that SongRewriter can learn the structural information of the lyrics.
Table 2: Quality evaluation on the generated outputs on the task of partial song rewriting under masking schemes, {SENT, TOKEN}. We report the averaged scores over three masking ratios, {0.25, 0.5, 0.75}. The best scores are in bold.

Partial Song Rewriting
We test the performance of SongRewriter on the partial song rewriting task by masking a portion of the input lyrics. We compare the proposed model with ILM under two masking schemes, {SENT, TOKEN}. We average the scores under three masking ratios, {0.25, 0.5, 0.75}. As shown in Table 2, the proposed model generally performs better than ILM at both token-level and sentence-level masking. Specifically, for ILM, there is a large discrepancy in content quality (Diversity, Coherence and PPL-Gen) between the TOKEN and SENT masking schemes. In contrast, SongRewriter achieves consistent performance, indicating that the proposed method is more suitable for both tasks. We also find that the performance of ILM decreases significantly as the masking ratio increases, while SongRewriter consistently achieves high performance across masking ratios. This suggests that SongRewriter can rewrite arbitrary portions of the lyrics consistently. By inspecting examples, we observe that, during partial song rewriting, SongRewriter considers not only the bidirectional context but also the rhyming effects with the input sentences.
Table 3: Evaluation results on the task of keyword-conditioned partial song rewriting under masking schemes, {SENT, TOKEN, ALL}. We report the averaged scores over three masking ratios, {0.25, 0.5, 0.75}. The best scores are in bold.

Keyword Control
We test the performance of controlled lyrics rewriting under masking schemes {SENT, TOKEN}. We report the averaged scores over three masking ratios, {0.25, 0.5, 0.75}.
To evaluate the ability to generate content with keywords, we first build a keyword database by using jieba to extract keywords from the training set. We evaluate the controllability by sampling keywords from the database. As shown in Table 3, SongRewriter outperforms the baseline model, ILM-Keyword, by a large margin.

Rhyme Scheme Control
To evaluate rhyme scheme controllability, we evaluate the ability of the model to generate tokens with pre-defined vowels at arbitrary positions. We randomly mask 80% of the vowel inputs from the masked fragments. For the remaining 20% of masked tokens with vowel inputs, we calculate the ratio of output tokens with the same vowel as the input (vowel accuracy). We find that the proposed model generates tokens with the pre-defined vowels around 98% of the time under various masking schemes and masking ratios, implying that, most of the time, the model can realise user-defined rhyme schemes when rhyming vowels are provided at the target positions in the input.

Ablation Study
As shown in Figure 4, by removing the inference constraints, the performance on Rhyme-L and Rhyme-G increases while the others decrease. By inspecting the output samples, we find that the model without inference constraints is more likely to generate repetitive sentences and rhyming words, thus leading to an increase in Coherence (more similar content), Rhyme-L and Rhyme-G, but a decrease in content diversity and rhyming word diversity.
To verify the effectiveness of the multi-level masking scheme, we train a model with the masking scheme proposed in MASS (Song et al., 2019), which masks 50% of the tokens consecutively from the inputs. As shown in Figure 4, the language fluency and the rhyming performance drop significantly in terms of ∆PPL-Gen, Rhyme-L and Rhyme-G. Although the diversity of the rhyming words increases, those ending words do not share the same vowel and thus do not rhyme with each other. The performance drop is expected because, by only masking 50% of the tokens from the inputs, there is a task discrepancy between the training task (rewriting half of the tokens) and the inference task (rewriting content with ratios ranging from 0 to 1).
Table 4: Human evaluation results on the generated outputs. For the partial song rewriting, the masking ratio is set to 0.5.
Regarding the vowel modeling, by removing the vowel loss from the training objective, the vowel accuracy for the full song rewriting drops from 98.5% to 92.5%, showing that incorporating vowel loss can help the model generate tokens with correct vowels at the specific positions.

Human Evaluation
Apart from automatic metrics, we also conduct human evaluation following previous works (Lee et al., 2019; Xue et al., 2021). We sample 200 examples from the test set as inputs and generate 200 outputs from each model. Then, we recruit 3 annotators with a background in music to score the generated lyrics from 1 (Poor) to 5 (Perfect) on three criteria: language fluency, content coherence and rhyme quality.
As shown in Table 4, the human evaluation results show that SongRewriter outperforms other models on all three tasks (full song rewriting, sentence rewriting and partial sentence rewriting).In particular, SongRewriter performs significantly better in terms of rhyming.

Conclusion
In this work, we propose to overcome the difficulties of modelling the melody-lyrics correlation from limited parallel datasets by directly rewriting the lyrics of songs. We propose a unified model for full and partial song rewriting by training with a multi-level randomized masking scheme. The proposed model allows rewriting arbitrary parts of the inputs according to the bidirectional context. Besides, we introduce a partial vowel masking strategy into training to enable lyrics generation on any rhyme schemes. A novel decoding strategy is designed to improve the end rhyme consistency and rhyming word diversity. Novel metrics are proposed for rhyme evaluation. Both automatic and human evaluations show that our proposed model outperforms baseline and state-of-the-art models.

Limitation
Since each Chinese character contains one syllable, our proposed model can control the number of syllables in the generation through the number of generated tokens. However, this method does not apply to languages with multisyllabic words (such as English). To rewrite lyrics with multisyllabic words while maintaining the same number of syllables, a special technique such as syllable-level subword tokenization may be needed. We leave this line of work for future investigation.

Ethics Statement
Rewriting the lyrics of a song may cause potential copyright infringement. Besides, the copyrights of the lyrics in the dataset belong to the song writers. To protect the copyrights, our model and the released dataset will be released under the Creative Commons Attribution-NonCommercial (CC-BY-NC) license and prohibited from commercial use.

B Baselines
We evaluate the proposed model, SongRewriter, on both full and partial rewriting tasks. For the full rewriting task, we compare SongRewriter with GPT2 (Radford et al., 2019) and SongNet (Li et al., 2020), which are fine-tuned on our lyrics datasets.
For the partial rewriting task, we compare the proposed model with ILM (Donahue et al., 2020). The details of the baselines are as follows:
• GPT2: GPT2 is an auto-regressive language model based on the transformer decoder. Initializing with GPT2-Chinese, we fine-tune the model on the lyrics dataset. Note that the lyrics generated by GPT2 are in free form and do not follow any format constraints.
• SongNet: SongNet is a rigid format controlled text generation model which forces the generated output to follow the exact sentence lengths and sentence numbers of the input. We fine-tune their released pre-trained checkpoint on the lyrics dataset for full song rewriting.
• ILM: ILM is a GPT2-based model specialised on the text infilling task.It randomly replaces part of the text by "[blank]" tokens, and appends the masked segments (which are concatenated by "[answer]" tokens) to the end of the masked input text.For example, given a source text "She ate leftover pasta for lunch", an ILM example will be "She ate [blank]

C Training and Inference Settings
The proposed model is a transformer encoder-decoder model (Vaswani et al., 2017). The encoder and the decoder each have 12 layers, with 12 heads per layer. The hidden dimension is 768, and the dropout (Srivastava et al., 2014) is 0.1. We employ the AdamW (Loshchilov and Hutter, 2019) optimizer with a weight decay rate of $10^{-4}$. For the pre-training stage, we use 8,000 warm-up steps with the default learning rate schedule in Vaswani et al. (2017) and train for 10,000 iterations. In the fine-tuning stage, we use a fixed learning rate of $10^{-5}$ and train until the models converge. We first fine-tune the model on the prose dataset. Afterwards, we fine-tune the model on the lyrics dataset. During inference, we apply Top-K sampling (Fan et al., 2018) with k set to 32. The rhyme factors, γ and λ, are set to 0.3 and 1.4, respectively.
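For reference, the warm-up schedule of Vaswani et al. (2017) and a Top-K truncation step can be sketched in a few lines of Python. The schedule formula is the standard one from that paper; plugging in a hidden dimension of 768 and 8,000 warm-up steps is our reading of the settings above, and the dictionary-based `top_k_filter` is a hypothetical simplification of sampling over a vocabulary.

```python
from typing import Dict


def transformer_lr(step: int, d_model: int = 768, warmup_steps: int = 8000) -> float:
    """Learning rate: linear warm-up followed by inverse square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def top_k_filter(probs: Dict[str, float], k: int = 32) -> Dict[str, float]:
    """Keep the k most likely tokens and renormalise their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}


peak = transformer_lr(8000)
filtered = top_k_filter({"a": 0.5, "b": 0.3, "c": 0.2}, k=2)
```

The learning rate peaks at the end of warm-up and decays afterwards; sampling then draws the next token from the renormalised top-k distribution.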

D Definition of Evaluation Metrics
• Keyword Recall: To evaluate the content control ability by keyword prompt, we calculate the keyword recall rate, which is the percentage of the keywords appearing in the generated outputs.
• Vowel Accuracy: To measure the performance of the vowel control, we calculate the percentage of output tokens with correct vowels as the input.
• Diversity: We evaluate the diversity of the generated lyrics by distinct-n (Li et al., 2016), which is defined as the number of unique n-grams divided by the total number of n-grams. We report the average of distinct-n with n from 1 to 4: $\text{Diversity} = \frac{1}{4} \sum_{n=1}^{4} \text{distinct-}n$.
• Coherence: To evaluate the semantic consistency among all the sentences of the generated lyrics, we measure the semantic textual similarity (STS) between all sentence pairs. Specifically, we employ pre-trained SimCSE (Gao et al., 2021) to extract sentence embeddings and calculate the cosine similarity. Besides, we also calculate the STS between adjacent sentences to capture consistency from a local aspect. We define the coherence to be the average of the two.
• PPL-Test: We evaluate the quality of the model by calculating the model perplexity on the test set.
• PPL-Gen: We evaluate the quality of the generated lyrics with the perplexity from a language model, which is a pre-trained Chinese GPT2 model (Radford et al., 2019) fine-tuned on the lyrics dataset.
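The controllability, diversity and coherence metrics above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: sentence embeddings are taken as precomputed vectors (for example from SimCSE), the global and local similarity scores are each averaged over their sentence pairs, and positions without a required vowel are marked with None. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def keyword_recall(keywords: List[str], text: str) -> float:
    """Share of prompt keywords that appear in the generated text."""
    return sum(k in text for k in keywords) / len(keywords)


def vowel_accuracy(predicted: List[str], required: List[Optional[str]]) -> float:
    """Share of constrained positions whose output vowel matches the input."""
    pairs = [(p, r) for p, r in zip(predicted, required) if r is not None]
    return sum(p == r for p, r in pairs) / len(pairs)


def distinct_n(tokens: Sequence[str], n: int) -> float:
    """Unique n-grams divided by the total number of n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


def diversity(tokens: Sequence[str]) -> float:
    """Average of distinct-n for n = 1..4."""
    return sum(distinct_n(tokens, n) for n in range(1, 5)) / 4


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def coherence(embeddings: List[Sequence[float]]) -> float:
    """Average of the global (all pairs) and local (adjacent pairs) similarity."""
    m = len(embeddings)
    all_pairs = [cosine(embeddings[i], embeddings[j])
                 for i in range(m) for j in range(i + 1, m)]
    adjacent = [cosine(embeddings[i], embeddings[i + 1]) for i in range(m - 1)]
    return (sum(all_pairs) / len(all_pairs) + sum(adjacent) / len(adjacent)) / 2
```

The functions assume non-empty inputs (at least one keyword, one constrained position, four tokens and two sentences, respectively); a production implementation would need to guard these cases.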

E Full Song Rewriting
Table 6 shows evaluation results on the content quality of the generated lyrics in the full song rewriting task. For each model, we apply Top-K sampling (Fan et al., 2018) five times and report the mean and standard deviation (subscript). Figures 5 and 6 show two examples of the generated lyrics.

F Partial Song Rewriting
Table 7 shows evaluation results on the generated outputs on the task of partial song rewriting under masking schemes, {SENT, TOKEN}, and masking ratios, {0.25, 0.5, 0.75}. Figure 8 shows an example of token-level rewriting. Figure 9 shows an example of sentence-level rewriting.

G Rhyme Scheme Control
Table 9 shows the evaluation results on the task of vowel-conditioned partial song rewriting under the masking schemes, {SENT, TOKEN}, and masking ratios, {0.25, 0.5, 0.75}. Figure 7 shows an example of applying rhyme scheme control. We first extract the syllable template and rhyme scheme from the original lyrics. As shown in the middle column of Figure 7, the original lyrics control both end rhyme and internal rhyme. The end rhyme scheme is AAAA, where all four lines of a verse share the same ending vowel and rhyme with each other. As shown in Figure 7, by inputting the desired vowel tokens in the target positions, the generated lyrics have a consistent rhyme scheme with the original lyrics.

I Sequence Order and Local Position Order
To improve end rhyme modeling, we incorporate reverse language modeling with local position embeddings. We assume that reversing the language order makes it easier for the model to locate the rhyming words (which correspond to "⟨l_0⟩", the first token of the reversed sentence according to the local position embedding) and to generate rhyming sentences by generating the end words before the rest of each sentence. In this section, we verify the effectiveness of "labeling rhyming words by ⟨l_0⟩" and "generating the rhyming word of the sentence first" by comparing four model variants: reverse token order with sequential local position (the proposed method), reverse token order with reverse local position (Rev TK & Rev LP), sequential token order with sequential local position (Seq TK & Seq LP), and sequential token order with reverse local position (Seq TK & Rev LP).
As shown in Table 10, the rhyming performance decreases drastically for models with sequential token order in both Rhyme-L and Rhyme-G. With the same token order, the models in which ⟨l_0⟩ of the local position is aligned with the rhyming word perform slightly better than those in which it is not. The results show that generating rhyming words before the other words in the sentences can significantly improve the rhyming performance of the generated outputs. However, the role of local position embeddings in helping to identify the rhyming words is less important. We hypothesize that, apart from the local position, the model can also identify the rhyming words from the global position (the tokens before or after the sentence delimiter tokens).
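The four variants can be made concrete with a small sketch that pairs each token with its local position index. This reflects one reading of the description above, not the authors' code: the function name `arrange` and the interpretation of "reverse local position" as numbering positions in the opposite direction of the emitted token sequence are assumptions.

```python
def arrange(sentence, reverse_tokens, reverse_positions):
    """Return (token, local_position) pairs for one sentence.

    reverse_tokens: emit tokens right-to-left, so the end (rhyming) word
        is generated first.
    reverse_positions: number local positions in the opposite direction
        of the emitted sequence.
    """
    tokens = sentence[::-1] if reverse_tokens else list(sentence)
    n = len(tokens)
    positions = range(n - 1, -1, -1) if reverse_positions else range(n)
    return list(zip(tokens, positions))

sentence = ["w1", "w2", "w3"]  # toy sentence; "w3" is the rhyming end word

proposed = arrange(sentence, True, False)         # reverse tokens, sequential positions
rev_tk_rev_lp = arrange(sentence, True, True)     # Rev TK & Rev LP
seq_tk_seq_lp = arrange(sentence, False, False)   # Seq TK & Seq LP
seq_tk_rev_lp = arrange(sentence, False, True)    # Seq TK & Rev LP
```

Under this reading, position 0 lands on the rhyming word "w3" in the proposed variant and in Seq TK & Rev LP, while only the two reverse-token variants emit "w3" first. This matches the comparison in the text, where token order and position alignment are varied independently.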

Figure 1: An overview of the proposed Chinese song rewriting system. Given an existing song, the users first mask the part(s) of the lyrics that they want to rewrite. Then, SongRewriter generates new lyrics corresponding to the masked fragments. Last, the rewritten lyrics are combined with the original melody to form a new song. During generation, users can require the content to include specific keywords or control the rhyme scheme by setting the vowels of the output characters at specific positions.

Figure 2: The architecture of the proposed SongRewriter model. The inputs of the encoder consist of a keyword prompt and partially masked lyrics. The keywords in the prompt are extracted from the masked fragments during training. The decoder uses [G] as a start token and generates the masked tokens autoregressively. The text order is reversed for end rhyme modeling. The original lyrics are from Later by Rene Liu.
Lau et al. (2018) incorporate an extra model to encode the ending tokens of the sentences. While Zhang et al. (2020) and Xue et al. (2021) both generate the rhyming word before the rest of the sentence, Zhang et al. (2020) move the last word of the sentence to the front, and Xue et al. (2021) generate the sentence from right to left by reversing the word order. The second type is to apply explicit rhyme constraints during training to force the model to generate rhyming sentences. Jhamtani et al. (2019) apply a discriminator on the sentence-ending words to learn the rhyme pattern adversarially. The last type is through post-editing, where the model first generates the lyrics, then another model edits the ending words to fulfil the rhyming constraints (Nikolov et al., 2020).

Figure 3: An example of lyrics with internal rhyme. The lyrics are from Blue and White Porcelain, a Chinese song by Jay Chou, with the English translation underneath. The rhyming characters are highlighted in red with their pinyin.

Figure 4: Ablation results on the generated outputs on the task of full song rewriting. To facilitate comparison, all metrics are normalized to the 0-1 range by their respective maximum values. The exact scores are presented in Table 8 in Appendix H.

Figure 5: An example generated by SongRewriter. End characters sharing the same vowel (the pinyin of each character is given in the adjacent brackets) are highlighted in the same color. The lyrics are split into blocks for clear illustration.

Figure 6: An example generated by SongRewriter. End characters sharing the same vowel (the pinyin of each character is given in the adjacent brackets) are highlighted in the same color. The lyrics are split into blocks for clear illustration.

Figure 8: An example of partial sentence rewriting by SongRewriter. The original lyrics are from Later by Rene Liu. The inputs are the original lyrics with the red tokens masked. The model generates tokens at the corresponding masked positions (highlighted in red in the right column).

Figure 9: An example of sentence-level rewriting by SongRewriter. The original lyrics are from Actor by Zhiqian Xue. The inputs are the original lyrics with the sentences in red masked. The outputs are the red sentences on the right.

Table 1: Quality evaluation on the generated outputs in the full song rewriting task. The best scores are in bold.
Table 9 in Appendix G.

Table 5: Statistics on the number of documents, proses and lyrics. The dev and test sets are randomly sampled.

Table 6: Evaluation results on the content quality of the generated lyrics in the full song rewriting task. For each model, we apply Top-K sampling (Fan et al., 2018) five times and report the mean and standard deviation (subscript).

Table 7: Evaluation results on the content quality of the generated outputs on the task of partial song rewriting under masking schemes, {SENT, TOKEN}, and masking ratios, {0.25, 0.5, 0.75}.

Table 8 shows ablation results on the generated lyrics in the full song rewriting task.

Table 10: Evaluation results on the rhyming performance of the generated lyrics on the full song rewriting task under different language order and local position embedding order.