DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling

Rap generation, which aims to produce lyrics and corresponding singing beats, needs to model both rhymes and rhythms. Previous works on rap generation focused on rhyming lyrics but ignored rhythmic beats, which are important for rap performance. In this paper, we develop DeepRapper, a Transformer-based rap generation system that can model both rhymes and rhythms. Since no rap dataset with rhythmic beats is available, we develop a data mining pipeline to collect a large-scale rap dataset, which includes a large number of rap songs with aligned lyrics and rhythmic beats. We then design a Transformer-based autoregressive language model that carefully models rhymes and rhythms. Specifically, we generate lyrics in reverse order with rhyme representations and constraints for rhyme enhancement, and insert a beat symbol into the lyrics for rhythm/beat modeling. To our knowledge, DeepRapper is the first system to generate rap with both rhymes and rhythms. Both objective and subjective evaluations demonstrate that DeepRapper generates creative and high-quality raps with rhymes and rhythms.


Introduction
Rap is a musical form originating in the United States in the 1970s, and it has quickly developed into one of the mainstream music genres in the world (Keyes, 2004). With the rapid development of artificial intelligence, automatic rap lyric generation has drawn attention from academia (Potash et al., 2015; Malmi et al., 2016; Liang et al., 2018; Nikolov et al., 2020). Generally speaking, rap lyrics need to be semantically meaningful and fashionable to convey interesting stories or express feelings. Different from natural language or other artistic genres (e.g., lyrics or poetry), rap has distinctive characteristics: 1) it usually contains complex rhyme patterns across several consecutive sentences, which are the key to forming a good flow; 2) it needs to align with the singing beat, since rap lyrics are usually rapped over rhythmic accompaniment. Therefore, generating rap lyrics with good rhymes and rhythms is a challenging problem.

Previous works on rap generation (Potash et al., 2015; Malmi et al., 2016; Liang et al., 2018; Nikolov et al., 2020) mainly focused on lyric generation, and some of them developed strategies for rhyme modeling. Potash et al. (2015) directly added an "<endLine>" token at the end of verse lines and expected the model to learn rhyme patterns implicitly. Nikolov et al. (2020) applied a two-step strategy, which first generates rap lyrics and then adds rhyme tokens to the end of the generated lyrics. However, these methods cannot guarantee the rhyme patterns of every lyric line and only consider the rhyme of the last token. Although many works have studied rhyme modeling in other artistic genres (e.g., poetry) (Li et al., 2020; Van de Cruys, 2020), they are not suitable for rap generation due to the complex rhyme structure in rap. For example, poetry needs to rhyme only on the last word of each sentence, while rap rhymes on multiple consecutive tokens at the end of each sentence. To our knowledge, no previous work has studied rhythm modeling (i.e., beats in rap); one of the main reasons is the lack of rap datasets with beat-lyric alignments. Consequently, generating lyrics without rhythmic beats cannot be regarded as full rap generation.

Corresponding author: Xu Tan, xuta@microsoft.com. Code: https://github.com/microsoft/muzic/tree/main/deeprapper
In this paper, we develop DeepRapper, a Transformer-based (Vaswani et al., 2017) rap generation system that can model both rhymes and rhythms. To build the system, since no rap dataset with aligned rhythmic beats is available, we design a data mining pipeline and collect a large-scale rap dataset for rhythm modeling. Specifically, we first crawl many rap songs, each with both lyrics and audio, from the Web. For each crawled rap song, we perform a series of data preprocessing steps to extract rhythmic beats as well as beat-lyric alignments. To better model rhyme, we generate the words of a rap sentence from right to left in an autoregressive manner; in this way, we can easily identify the last few words of a sentence (which become the first words of the reversed sentence) to rhyme with. Additionally, we incorporate several rhyme-related representations into our language model to further improve rhyming quality, and encourage N-gram rhymes in the generated lyrics through a rhyme constraint during inference. We use a special token [BEAT] to represent a rhythmic beat and insert it into the lyrics right before the corresponding word. In this way, we can model beats in the lyric sequence during both training and generation.
Inspired by the success of pre-trained language models (Devlin et al., 2019; Radford et al., 2018; Song et al., 2019), we incorporate pre-training into our system. To obtain large-scale data for pre-training, we also use our data mining pipeline to collect two additional datasets: 1) non-rap songs with aligned beats, which can be larger than the rap dataset since non-rap songs are more common than rap songs; 2) pure lyrics, which can be even larger than the non-rap songs. In the pre-training stage, we pre-train our DeepRapper model on the above two datasets, and then fine-tune the pre-trained model on the rap songs with aligned beats. The fine-tuned model is used for final rap generation. Both objective and subjective evaluations verify the advantages of DeepRapper in generating rap lyrics with rhymes and rhythms.
Our main contributions can be summarized as follows:
• To model rhythms in rap generation, we develop a data mining pipeline to create rap datasets with aligned rhythmic beats.
• To better model rhymes, we design an autoregressive language model to generate rap lyrics from right to left with rhyme constraint. As far as we know, DeepRapper is the first to explicitly model N -gram rhymes.
• We insert beat tokens inside the lyrics to model the rhythmic beats. To our knowledge, DeepRapper is the first system that models rhythms for rap generation.

Background
Since DeepRapper generates rap lyrics with both rhyme and rhythm modeling, in this section, we briefly introduce the related background: lyric generation, rhyme modeling and rhythm modeling.
Lyric Generation Broadly speaking, lyric generation covers rap lyric generation (Potash et al., 2015; Nikolov et al., 2020; Liang et al., 2018), song lyric generation (Watanabe et al., 2018; Chen and Lerch, 2020; Sheng et al., 2020), general poetry generation (Zhang and Lapata, 2014; Lau et al., 2018; Li et al., 2020), etc. Different from previous works that leverage language models to generate lyrics similar to natural language, in this paper we introduce a novel language model for rap generation, with well-designed rhyme and rhythm modeling to fit the characteristics of rap lyrics. Additionally, inspired by the successes of pre-trained language models (Devlin et al., 2019; Radford et al., 2019; Song et al., 2019) in NLP applications, we also incorporate pre-training into our model to further improve the quality of rap generation.
Rhyme Modeling Rhyme modeling plays an important role in rap generation, which requires the last few tokens of consecutive sentences to share the same rhyme pattern. Existing rap generation systems either directly add a special token "<endLine>" at the end of each lyric line to encourage the model to learn rhyme structure implicitly (Potash et al., 2015), or introduce a two-step strategy that first generates rap lyrics and then adds rhyme tokens after the generated lyrics (Nikolov et al., 2020). However, these works only focused on unigram rhymes, while rap benefits more from N-gram rhymes. Although many works have explored rhyme modeling in other genres, most of them cannot be directly applied to rap generation. For example, poetry generation (Lau et al., 2018; Zhipeng et al., 2019; Liao et al., 2019; Li et al., 2020) usually uses pre-defined formats to control the rhyme pattern, since poetry usually has a fixed number of words and only considers the rhyme pattern of the last word. Rap lyrics, however, have diverse rhyme structures across multiple consecutive sentences and, most importantly, across multiple consecutive words. Therefore, we introduce N-gram rhyme modeling in DeepRapper to handle the distinctive rhyme patterns of rap. Besides, we train our language model in reverse order (i.e., right to left), similar to previous works (Van de Cruys, 2020), to better model rhymes, since they always occur at the end of a sentence.
Rhythm Modeling Rhythm modeling is commonly used in music generation (Zhu et al., 2018; Huang and Yang, 2020; Ren et al., 2020), where the durations of notes are generated along with their pitches to form rhythmic beats in melody and accompaniment generation. Different from music generation, rap cares more about rhythmic beats than note pitches (i.e., melody): the generated rap lyrics need to align with the corresponding rhythmic beats in order to be rapped; otherwise, they cannot be regarded as a complete rap. However, to the best of our knowledge, none of the previous works has studied rhythm modeling in rap generation.
In this paper, we introduce a novel beat modeling strategy in DeepRapper for rhythm generation.

Rap Dataset Mining
Previous works on rap generation (Potash et al., 2015; Liang et al., 2018; Nikolov et al., 2020) usually used rap datasets containing only lyrics, without rhythmic beat information. To model rhythm in rap generation, the dataset should contain lyrics with aligned rhythmic beats. However, beat alignments are quite difficult to obtain, since annotating them requires musicians with professional knowledge to identify the stressed syllables in rap songs. To handle this problem, we design a data mining pipeline that automatically extracts beat-lyric alignments. In this section, we introduce the details of the pipeline and the datasets mined with it.

Data Mining Pipeline
Figure 1 overviews our data mining pipeline, which consists of 5 steps: data crawling, vocal and accompaniment separation, vocal and lyric alignment, beat detection, and lyric and beat alignment.

Data Crawling
To mine a large-scale rap dataset, we first crawl a large number of rap songs, with both lyrics and singing audio, from the Web. To ensure that lyrics and audio can be aligned at the sentence level, which benefits our later word-level beat alignment, we also crawl the start and end time of each lyric sentence with respect to the audio.
Vocal and Accompaniment Separation For each rap song, we utilize Spleeter (Hennequin et al., 2020), a public music source separation tool, to separate the vocal (containing the rap singing) and the accompaniment (containing the rhythmic beats) from the crawled audio.

Vocal and Lyric Alignment
We split the separated vocals into sentence-level segments according to the crawled start and end times of each lyric sentence, which gives us sentence-level vocal-lyric alignments. We then convert the lyrics into phonemes via Phonemizer and utilize the Montreal Forced Aligner to obtain phoneme-level vocal-lyric alignments. Based on these alignments, we obtain the timestamp of each word in the singing audio.
Beat Detection To obtain the alignments between lyrics and beats, we need the timestamp of each beat. We therefore use a beat tracking tool, Librosa (McFee et al., 2020), to detect the timestamp of each beat in the accompaniment separated in the second step.
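For concreteness, this step can be reproduced with a few lines of Librosa; the sketch below assumes the separated accompaniment has been saved to a hypothetical file accompaniment.wav.

```python
import librosa

# Load the accompaniment track separated by Spleeter (path is illustrative).
y, sr = librosa.load("accompaniment.wav", sr=None)

# Estimate the tempo and the frame positions of the beats.
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

# Convert beat frames to timestamps (in seconds) for later alignment.
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print(f"tempo: {float(tempo):.1f} BPM, first beats: {beat_times[:5]}")
```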
Lyric and Beat Alignment After obtaining the timestamp of each word and each beat, we can align them according to their timestamps. However, since a rapper may not sing a word exactly on the beat, directly matching words and beats by exact timestamps is inappropriate. Therefore, we propose an approximate alignment method. Denote the word sequence of a lyric sentence as W = {w_1, w_2, ..., w_|W|} and its beat sequence as B = {b_1, b_2, ..., b_|B|}, where w_i and b_j represent the i-th word and the j-th beat, and let T_{w_i} and T_{b_j} denote the timestamps of w_i and b_j respectively. For each beat b_j, we first filter out a word set W̄ = {w : |T_w - T_{b_j}| ≤ r/2}, where r is the average duration of each word in the song (i.e., the total duration divided by the number of words). Next, word w_i is aligned with beat b_j if it satisfies the following condition:

w_i = argmin_{w ∈ W̄} |T_w - T_{b_j}|.  (1)
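This alignment rule can be implemented directly; the following is a minimal sketch (not the authors' code), assuming word_times and beat_times are lists of timestamps in seconds produced by the forced aligner and the beat tracker.

```python
def align_beats_to_words(word_times, beat_times, total_duration):
    """For each beat, return the index of the aligned word, or None."""
    # r: average duration per word in the song.
    r = total_duration / len(word_times)
    alignment = []
    for t_b in beat_times:
        # Filter the word set: words whose timestamp is within r/2 of the beat.
        candidates = [i for i, t_w in enumerate(word_times)
                      if abs(t_w - t_b) <= r / 2]
        # Align the beat with the closest candidate word, if any.
        alignment.append(min(candidates, key=lambda i: abs(word_times[i] - t_b))
                         if candidates else None)
    return alignment
```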

Mined Datasets
Using the above data mining pipeline, we obtain a rap lyric dataset with aligned beats (named D-RAP, where D stands for "dataset"), which satisfies the requirements for building a rap generation system with both rhyme and rhythm modeling. We split D-RAP into training and validation sets with a ratio of 4:1. Since rap is only one music genre and the number of rap songs is usually small compared with more general songs, we also mine two additional datasets with the same pipeline to pre-train our DeepRapper model: 1) non-rap songs with aligned beats (named D-SONG); 2) pure lyrics without aligned beats (named D-LYRIC). We summarize the statistics of the three datasets in Table 1 and show a rap song with aligned beats from D-RAP in Figure 2.

Rap Generation Model
In this section, we introduce the architecture of our rap generation model and the details of its rhyme and rhythm modeling. Figure 3 illustrates the detailed architecture. We use a Transformer (Vaswani et al., 2017) to build an autoregressive language model (Radford et al., 2018, 2019) for rap generation, and introduce several new designs: 1) to better model rhymes, our model generates each sentence from right to left, since rhyming words are always at the end of a sentence; 2) as aforementioned, rhythms are critical for rap performance, so we insert a special token [BEAT] for explicit beat modeling; 3) unlike the original Transformer with only word and positional embeddings, we add multiple additional embeddings to better model rhymes and rhythms. Next, we introduce our rhyme modeling in subsection 4.2 and rhythm modeling in subsection 4.3.

Rhyme Modeling
Rhymes are the key to form a good rap flow. In DeepRapper, we model rhymes with three components: 1) reverse-order language model; 2) rhyme representation; and 3) rhyme constraint.

Reverse-Order Language Model
Rhyming words usually occur at the end of each lyric sentence. With a standard autoregressive language model that generates tokens from left to right, we would need to identify whether the current generation step is at the end of a sentence, which decides whether to generate rhyming words consistent with the previous sentences. Therefore, to better model rhymes, we use a reverse-order language model that generates each sentence from right to left, as shown in Figure 3. In this way, we can easily identify the last few words of a sentence (which become the first few words of the reversed sentence) and control their rhymes. Note that we only reverse the words inside a sentence and still generate the sentences in their original order. Figure 4 compares sentences in left-to-right and right-to-left order: rhyming words of each sentence share the same relative positions (offsets to the first token) in the reverse order, which makes them easy to model and control.
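As an illustration, the reverse-order preprocessing amounts to reversing tokens within each line while keeping the line order; below is a toy sketch that treats each Chinese character as one token (an illustrative simplification).

```python
def reverse_within_sentences(lyric_lines):
    # Reverse the tokens of each sentence; the sentence order is unchanged.
    return ["".join(reversed(line)) for line in lyric_lines]

lines = ["我抬头仰望", "天空的苍茫"]  # two consecutive lyric sentences
print(reverse_within_sentences(lines))
# The rhyming characters now occupy the same relative positions at the
# *start* of each reversed sentence, making them easy to control.
```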

Rhyme Representation
Rhyming words have two important features: 1) the vowel used for rhyming, and 2) the relative position in a sentence, which decides the correspondence between rhyming words in consecutive sentences (e.g., in the reverse-order setting, the first/second word of the current sentence should rhyme with the first/second word of the previous sentence).
We use the vowel of the Pinyin (the standard phonetic system for Chinese) of each Chinese character to represent its rhyme. To this end, we build a vowel dictionary F(·) that identifies the vowel of each word. As shown in Figure 3, we add a vowel embedding F and an intra-sentence relative positional embedding R to enhance the rhyme representation of each token. Besides, to better distinguish different sentences, we introduce a sentence embedding S.
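These embeddings are summed into a single input representation, as in standard Transformer inputs; the sketch below is a hedged reading of Figure 3 (all names and vocabulary sizes are illustrative, not the authors' code).

```python
import torch.nn as nn

class RapInputEmbedding(nn.Module):
    """Sum of token, position, vowel (F), intra-sentence position (R),
    and sentence (S) embeddings, all sharing the model dimension."""
    def __init__(self, vocab_size, n_vowels, max_len, max_sents, d=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d)  # words and [BEAT]
        self.pos = nn.Embedding(max_len, d)     # global position
        self.vow = nn.Embedding(n_vowels, d)    # vowel id from F(.)
        self.rel = nn.Embedding(max_len, d)     # intra-sentence position R
        self.sen = nn.Embedding(max_sents, d)   # sentence id S

    def forward(self, tokens, positions, vowels, intra_pos, sent_ids):
        return (self.tok(tokens) + self.pos(positions) + self.vow(vowels)
                + self.rel(intra_pos) + self.sen(sent_ids))
```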

Rhyme Constraint
In addition to the reverse-order language model and rhyme representation, we introduce a rhyme constraint to improve the quality of rhyme generation during inference. As shown in Figure 4, sentences in rap lyrics rhyme not only on the last token but also on multiple consecutive tokens at the end. We call this phenomenon N-gram rhymes: the current sentence and the previous sentence keep the same rhyme for the last N consecutive tokens. To our knowledge, no previous work has investigated N-gram rhymes (N > 1), although they are important for rap quality. Our rhyme constraint adjusts the probability of the next predicted token to further encourage N-gram rhyme generation, as follows.
To generate the i-th word w_i in the standard inference procedure, we usually choose the predicted token with the maximum probability, i.e., w_i = argmax p(w|w_{<i}; θ), where w_{<i} denotes the words before position i in the reversed sentence and θ is the model. When the words before position i of the current and previous sentences have the same rhyme pattern, we use an adjusted probability distribution p̃(w|w_{<i}; θ) to encourage the i-th generated word to rhyme with the i-th word of the previous sentence, so as to form N-gram rhymes. The adjusted distribution is

p̃(w|w_{<i}; θ) = α · p(w|w_{<i}; θ) + (1 - α) · π(w),  (2)

where π(w) is a vowel check function and α is a hyper-parameter that balances the two terms. Here, π(w) is 1 if the predicted w has the same vowel as the i-th token of the previous sentence, and 0 otherwise. In other words, when predicting the i-th token (i ≤ N), we encourage the model to pay more attention to words with the same vowel as the i-th token of the previous sentence. In this way, the model tends to generate N-gram rhymes with large N.
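A minimal decoding-time sketch of this constraint, under the weighted-sum reading of Equation 2 above (names are illustrative; vowel_of is an assumed mapping from token ids to vowel ids):

```python
import torch

def rhyme_adjusted_probs(probs, target_vowel, vowel_of, alpha=0.95):
    """Adjust the next-token distribution to encourage N-gram rhymes.

    probs: 1-D tensor, the model's next-token probabilities p(w|w_<i).
    target_vowel: vowel id of the i-th token of the previous sentence.
    vowel_of: 1-D tensor mapping each token id to its vowel id.
    """
    # pi(w) = 1 if token w shares the target vowel, else 0.
    pi = (vowel_of == target_vowel).float()
    adjusted = alpha * probs + (1.0 - alpha) * pi
    return adjusted / adjusted.sum()  # renormalize to a distribution
```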

Rhythm Modeling
Generating lyrics with aligned beats is necessary since rap lyrics need to be rapped to rhythmic beats. Therefore, we model and generate rhythmic beats along with the lyrics: we regard a beat as a special token [BEAT] and insert it into the lyric sequence right before its aligned word for model training, as shown in Figure 3.

Rap songs exhibit different beat frequencies, i.e., different ratios between the total number of words and the total number of beats in a song. To explicitly model and generate raps with different beat frequencies, we use three tokens [S], [M], and [F] to represent slow, medium, and fast beat frequencies, and add the corresponding token at the start of a rap song for training and inference. The distribution of beat frequency in our D-RAP dataset is shown in Figure 5; according to this distribution, we assign [S], [M], and [F] to songs with beat frequency less than 3, equal to 3, and greater than 3, respectively.
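Concretely, this preprocessing can be sketched as follows (names are illustrative, and rounding the ratio to the nearest integer to reconcile "equal to 3" with fractional ratios is our assumption):

```python
def insert_beat_tokens(words, beat_flags):
    """Insert [BEAT] right before each beat-aligned word."""
    seq = []
    for word, has_beat in zip(words, beat_flags):
        if has_beat:
            seq.append("[BEAT]")
        seq.append(word)
    return seq

def frequency_tag(n_words, n_beats):
    """Map a song's words-per-beat ratio to a frequency token.
    Rounding to the nearest integer is an assumption, but it is
    consistent with the example frequencies reported in Appendix B."""
    ratio = round(n_words / max(n_beats, 1))
    return "[S]" if ratio < 3 else ("[M]" if ratio == 3 else "[F]")
```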

Model, Data, and Training Configuration
Our DeepRapper model is built on the autoregressive Transformer decoder (Vaswani et al., 2017; Radford et al., 2018, 2019), where the hidden size, the number of attention heads, and the number of Transformer layers are set to 768, 12, and 12, respectively. The dimension of all the embeddings in DeepRapper is set to 768. Since there is no existing pre-trained language model in reverse order, we do not initialize from any pre-trained language model. Instead, we first pre-train our model on D-LYRIC and D-SONG for 2 million steps, and then fine-tune it on D-RAP for 3K steps, as D-RAP is smaller than our pre-training corpora. We convert each song to a sequence of 1024 tokens by truncating longer sequences or padding shorter ones. Our model is trained with a batch size of 8 songs on 4 NVIDIA TITAN V GPUs. We use the Adam optimizer with a learning rate of 0.00015, β1 = 0.9, β2 = 0.999, and ε = 10^-6. We set the maximum N for N-gram rhymes to 3 and the hyper-parameter α in Equation 2 to 0.95. Samples are generated conditioned on a given reference sentence.
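A comparable setup can be expressed compactly with standard libraries; the sketch below (not the authors' code) instantiates a GPT-2-style decoder from HuggingFace Transformers with the reported hyper-parameters.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# 12 layers, 12 heads, hidden size 768, sequence length 1024.
config = GPT2Config(n_layer=12, n_head=12, n_embd=768, n_positions=1024)
model = GPT2LMHeadModel(config)  # randomly initialized (no pre-trained init)

# Adam with the reported learning rate, betas, and epsilon.
optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4,
                             betas=(0.9, 0.999), eps=1e-6)
```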

Evaluation Metrics
In this subsection, we introduce the objective and subjective metrics to evaluate the quality of the generated raps.
Objective Evaluation We evaluate the generated raps in terms of the quality of language, rhyme, and rhythm, using five metrics: 1) Perplexity (PPL), a standard metric for the quality of a language model; 2) Rhyme Accuracy (RA), the ratio of sentences with correctly predicted rhymes; 3) Rhyme Density (RD), the length of the longest rhyme of a song averaged over all songs, introduced by Malmi et al. (2016) to measure rhyming fluency; 4) Combo-N, the maximum number of consecutive sentences with the same N-gram rhyme in a song, averaged over all songs, where we study N = 1, 2, 3; 5) Beat Accuracy (BA), the accuracy of beat prediction under teacher-forcing.

Table 2: Results of objective and subjective evaluations. "+PT" means using pre-training. Since the two baselines do not include beat information, we only compare perplexity (PPL), rhyme accuracy (RA), and rhyme density (RD) in the objective evaluation. For the subjective evaluation, we report the average annotation scores for theme, fluency, rhyme quality, and rhyme diversity.
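As an illustration of the rhyme metrics, a simplified Combo-N can be computed from per-sentence vowel sequences (in reverse order, so index 0 is a sentence's final word); this is a hedged sketch, not the authors' evaluation code.

```python
def combo_n(sentence_vowels, n):
    """Maximum number of consecutive sentences sharing the same
    N-gram rhyme, given each sentence's vowels in reverse order."""
    best = run = 1
    for prev, cur in zip(sentence_vowels, sentence_vowels[1:]):
        if len(cur) >= n and prev[:n] == cur[:n]:
            run += 1
            best = max(best, run)
        else:
            run = 1
    return best
```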

Subjective Evaluation Following previous works on artistic creation (Zhang and Lapata, 2014; Nikolov et al., 2020), we also use human evaluation to more accurately assess the quality of the generated raps. We invite 10 participants with professional knowledge in music as human annotators to evaluate 100 sampled raps. Each annotator scores from 1 (Poor) to 5 (Perfect) on the following aspects: 1) the clearness of the theme of the rap lyrics; 2) the fluency of the rap lyrics; 3) the quality of the rhyme; 4) the diversity of the rhyme. The score averaged over all annotators and all sampled raps is used as the evaluation score for each aspect.

Experimental Results
Results Table 2 shows the objective and subjective results of DeepRapper compared with two baselines: 1) Baseline, a standard autoregressive language model with the same configuration as DeepRapper but without our rhyme and rhythm modeling; 2) Baseline + PT, the baseline with pre-training. We make several observations from Table 2: 1) DeepRapper achieves better perplexity, rhyme accuracy, and rhyme density than the two baselines, which demonstrates the advantage of our method in generating high-quality rap lyrics with accurate and diverse rhymes; 2) DeepRapper achieves better scores on all subjective metrics, demonstrating that it generates high-quality, rhyming raps that accord with human taste; 3) pre-training improves the baseline on both objective and subjective metrics, which indicates its importance, but the pre-trained baseline still performs worse than DeepRapper.
Ablation Studies To further validate the necessity of each component in DeepRapper, we conduct a series of ablation studies, removing rhyme modeling, rhythm modeling, and pre-training respectively; the results are reported in Table 3. We have several observations: 1) removing rhyme modeling hurts rhyme quality a lot, resulting in a dramatic drop in rhyme accuracy and rhyme density; 2) removing each specific design in rhyme modeling (RO: reverse-order language model, VE: vowel embedding, IPE: intra-sentence position embedding, SE: sentence embedding) causes worse rhyme accuracy and rhyme density; in particular, while removing RO yields a better PPL, since left-to-right order is easier to model than right-to-left order according to the analysis in Wu et al. (2018), it causes a large drop in rhyme quality; 3) DeepRapper without rhythm modeling obviously cannot produce any beat information; 4) DeepRapper without pre-training has much worse perplexity and rhyme accuracy, but obtains a higher rhyme density. The reason is that without pre-training, DeepRapper tends to copy previous rhyming tokens due to its lack of generalization (larger PPL). To verify this, we counted the repetition rate of rhyming words: 23.8% for DeepRapper versus 42.5% without pre-training. The above results verify the effectiveness of each component in DeepRapper.

Table 3: Ablation studies on each component of DeepRapper. "-" means removing the corresponding component. "Rhyme", "Rhythm", and "PT" represent rhyme modeling, rhythm modeling, and pre-training. "RO", "VE", "IPE", and "SE" mean reverse order, vowel embedding, intra-sentence position embedding, and sentence embedding.

N-gram Rhyme To highlight the advantage of DeepRapper in modeling N-gram rhymes, we use Combo-N to measure the ability of each design in DeepRapper to model N-gram rhymes. The results are reported in Table 4. We find that: 1) the model without rhyme modeling can hardly generate good rhymes, regardless of the value of N; 2) removing the rhyme constraint also weakens the capacity to generate N-gram rhymes. These results further demonstrate the importance of our rhyme modeling and rhyme constraint in generating multiple consecutive rhymes.

Beat Frequency To better measure beat quality, we randomly generate about 5,000 samples with DeepRapper and with DeepRapper plus beat frequency control. We propose the First-Order Distribution (FOD) and the Second-Order Distribution (SOD) and measure the distance (via the Wasserstein distance (Vallender, 1974)) between these distributions for the generated samples and for our D-RAP dataset. We define the interval of the current [BEAT] as the number of words between the current [BEAT] and the next [BEAT]. The FOD is then defined as the distribution of the interval of the current [BEAT]; similarly, the SOD is defined as the distribution of the difference between the intervals of the current [BEAT] and the next [BEAT] (a sketch of this computation is given at the end of this subsection). The distances are normalized into [0, 1] and reported in Table 5. DeepRapper with beat frequency control achieves better performance in beat modeling, which indicates the importance of beat frequency control.

Case Analyses on Generated Raps We show a sample from our generated raps in Figure 6 to demonstrate their quality. The sample is generated by feeding the first sentence of the example in Figure 2; it depicts memories of childhood and beautiful visions for the future. We also provide a group of samples generated with beat frequency control. To save space, we put them and the translations of all samples in the Appendix. More samples are available at https://deeprapper.github.io.
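The FOD/SOD comparison above can be sketched as follows (illustrative names; beat_positions is assumed to give the word index of each [BEAT] in a song):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def fod(beat_positions):
    # Interval of each [BEAT]: number of words until the next [BEAT].
    return np.diff(beat_positions)

def sod(beat_positions):
    # Difference between the intervals of consecutive [BEAT]s.
    return np.diff(fod(beat_positions))

def beat_distances(generated, reference):
    """Wasserstein distances between generated and reference interval
    distributions (first- and second-order)."""
    return (wasserstein_distance(fod(generated), fod(reference)),
            wasserstein_distance(sod(generated), sod(reference)))
```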

Conclusions
In this paper, we develop DeepRapper, a novel Transformer-based rap generation system that leverages rhyme modeling, rhythm modeling, and pre-training for rap generation. Since no rap dataset with aligned rhythmic beats was available for rhythm modeling, we propose a data mining pipeline to mine a rap dataset with beat-lyric alignments. We leverage right-to-left generation, rhyme representations, and a rhyme constraint to better model rhymes and encourage N-gram rhymes, and we explicitly model beat information by inserting beat tokens beside the corresponding words in the lyric sequence. To our knowledge, DeepRapper is the first system to generate rap with both rhymes and rhythms. Both objective and subjective evaluations demonstrate that DeepRapper generates high-quality raps with good rhymes and rhythms.
Thanks to the design of DeepRapper, one can further build a rap singing system that sings out the generated raps according to their rhymes and rhythms; we leave this, along with a multilingual DeepRapper, as future work.

Ethical Considerations
The proposed framework can be considered a novel language model for rap generation in automatic artistic creation. In particular, it features novel rhyme modeling, and since rhyme is important across music genres, the framework may also benefit the generation of other genres. On the other hand, although we collect large-scale lyric data for pre-training, it still cannot fully exploit the potential of pre-training. In the future, we expect to employ larger-scale data from the open domain as well as the music domain to improve the capacity of the language model. In addition, our training datasets may contain biases, which may bring potential risks of model bias. We therefore encourage future work to study how to mitigate such problems in our framework.

A Comparison with GhostWriter
We provide a comparison between DeepRapper and GhostWriter (Potash et al., 2015) in Table 6. The results show that both DeepRapper and our baselines outperform GhostWriter in terms of PPL, rhyme accuracy, and rhyme density on rap generation.

B Samples with Beat Frequency Control
Fast Figure 7 provides a rap generated by DeepRapper with fast beat frequency (4.3). The rap expresses one's best wishes to his/her lover. The following is the translation of the text in Figure 7.

Medium Figure 8 provides a rap generated by DeepRapper with medium beat frequency (2.6). The rap praises the times we live in. The following is the translation of the text in Figure 8.

(In Figures 7 and 8, vowels in red indicate that the word rhymes with the previous sentence; bold words are aligned with a beat.)

Slow Figure 9 provides a rap generated by DeepRapper with slow beat frequency (2.1). The rap expresses one's relief from life. The following is the translation of the text in Figure 9.

Figure 9: Rap generated with slow beat frequency. Vowels in red indicate that the word rhymes with the previous sentence; bold words are aligned with a beat.

C Translation of Chinese Examples in the Paper
Words in red are rhymes.
Translation of Chinese in Figure 2:
我长大的地方像一个简朴的寨 (The place where I grew up was like a simple village)

Translation of Chinese in Figure 3:
我抬头仰望。天空的苍茫。 (I looked up. The sky is vast.)

Translation of Chinese in Figure 4:
是这座城市的气象 (It is the weather of this city)
让你感受生命的力量 (makes you feel the power of living)

Translation of Chinese in Figure 6:
我长大的地方像一个简朴的寨 (The place where I grew up was like a simple village)