ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models

State-of-the-art poetry generation systems are often complex. They either consist of task-specific model pipelines, incorporate prior knowledge in the form of manually created constraints, or both. In contrast, end-to-end models would not suffer from the overhead of having to model prior knowledge and could learn the nuances of poetry from data alone, reducing the degree of human supervision required. In this work, we investigate end-to-end poetry generation conditioned on styles such as rhyme, meter, and alliteration. We identify and address a lack of training data and mismatching tokenization algorithms as possible limitations of past attempts. In particular, we successfully pre-train ByGPT5, a new token-free decoder-only language model, and fine-tune it on a large custom corpus of English and German quatrains annotated with our styles. We show that ByGPT5 outperforms other models such as mT5, ByT5, GPT-2 and ChatGPT, while also being more parameter efficient and performing favorably compared to humans. In addition, we analyze its runtime performance and demonstrate that it is not prone to memorization. We make our code, models, and datasets publicly available.


Introduction
End-to-end fine-tuning of pre-trained language models like GPT-2 (Radford et al., 2019) or T5 (Raffel et al., 2020a) on downstream tasks has been an immensely popular training paradigm for text generation in the last few years. End-to-end models learn to complete a task by directly learning all steps, without intermediary algorithms such as hand-crafted rules or post-processing. This approach has proven to be highly effective on a wide range of problems such as dialog generation (Sun et al., 2022), summarization (Zhu et al., 2021; Zhong et al., 2021; Huang et al., 2021), and machine translation (Farinha et al., 2022; Tran et al., 2020). Nevertheless, all these applications have in common that they only concern themselves with the generation of prosaic texts. Generating formal verse poetry, on the other hand, with strict constraints on aesthetic style such as rhyme scheme, meter, and alliteration, remains a difficult problem. Attempts to employ end-to-end solutions in this context have so far been unsuccessful (Wöckener et al., 2021), with some authors even concluding that language models cannot pick up such constraints from data alone (Popescu-Belis et al., 2022). As a consequence, state-of-the-art poetry generation systems rely on human guidance by (i) injecting prior knowledge during inference, in the form of hard-coded constraints that filter model outputs or modify probability distributions, or (ii) breaking the whole process down into sophisticated task-specific model pipelines. (We define incorporating prior knowledge as any form of influence on model decisions not learned by the model itself.) Tian and Peng (2022), for example, propose a sonnet generation framework with four distinct pipeline steps: content planning, rhyme pair generation, polishing for aesthetics, and finally sketch-to-sonnet generation. Further, they incorporate prior knowledge such as pronunciation dictionaries, knowledge bases, and lexically constrained decoding. Hopkins and Kiela (2017) use Weighted Finite State Transducers to monitor whether their poetry generation system meets metric constraints and roll back its state in case of a violation. Jhamtani et al. (2019) and Ormazabal et al. (2022) generate a large number of sample poems before filtering them with pronunciation dictionaries.

github.com/potamides/uniformers

Figure 1: Generated quatrain with ABBA rhyme scheme, a high amount of alliteration (marked green in the original figure), and iambic meter, i.e., a stressed syllable (´) follows an unstressed syllable (˘):

The sweet wild strain, the sudden start, (A)
Which shakes the perfumed altar's flame, (B)
To make its shrine a sacred name, (B)
And sing its praise in every heart. (A)
- ByGPT5
Such forms of human supervision lead to ramifications that an end-to-end solution would not face. Pipelines are susceptible to errors in early modules that propagate and are amplified in subsequent modules, an effect known as cascading of errors (Castro Ferreira et al., 2019). Similarly, incorporating prior knowledge at inference time depends on the cleverness and intent of the modeler and generally becomes more difficult when heterogeneous constraints are involved or the number of constraints increases (Garbacea and Mei, 2022). Furthermore, standard text-generation architectures do not lend themselves well to manually applying constraints to their output. Due to the autoregressive generation of tokens from left to right, constraints at arbitrary positions cannot be implemented easily or only with additional trade-offs (Garbacea and Mei, 2022). For example, end rhymes, which come at the end of a verse, cannot be constrained in isolation due to dependencies on previously generated tokens. A commonly applied work-around for this problem is to generate each verse in reverse (Lau et al., 2018; Jhamtani et al., 2019; Van de Cruys, 2020; Xue et al., 2021a).
In this work, we thus aim to reduce the amount of human supervision in poetry generation and explore viable end-to-end solutions. We hypothesize that past attempts have failed for the following root causes: (i) Lack of available training data. Poetry corpora labeled with aesthetic styles are few and far between, and we speculate that they do not suffice to train a generalized model. (ii) Unfavorable tokenization algorithms. Aesthetic styles of poetry such as rhyme, meter, and alliteration are often expressed at the character level, while most available off-the-shelf pre-trained models operate at the subword level (Kudo and Richardson, 2018). Xue et al. (2022) showed that character-level models (also known as token-free models) excel at other character-level tasks, so we assume that they would perform similarly well at poetry generation. Our contributions are as follows: (i) We pre-train ByGPT5, to our knowledge the first decoder-only transformer for character-level language modeling.
(ii) We create QuaTrain, a large machine-labeled poetry corpus of quatrains in German and English.
(iii) By fine-tuning ByGPT5 on QuaTrain, we show that it learns character-level styles better than subword-based systems, such as GPT-2 and T5, as well as other token-free models like ByT5, while being more parameter efficient and also faring well compared to humans.
(iv) We further demonstrate that ByGPT5 exhibits few memorization problems and, via token-level attributions, we introspect what information it uses when predicting the next token, finding that it learns to understand poetic styles better than the competitors we compare to. In addition, we compare its performance to ChatGPT and show that it also performs well on tasks that do not operate at the character level.

Background
In formal verse poetry, poems have to adhere to strict patterns and rules of language which we refer to as styles. Such styles evoke additional meaning compared to prosaic texts or spoken language and often lead to the use of distinguished linguistic expressions. Our goal is to train an end-to-end poetry generation system whose generated poems can be constrained to adhere to specified styles. We refer to this as style-conditioned poetry generation. In our work, we focus on generating quatrains and conditioning on the following defining styles of formal verse poetry (cf. Figure 1):

Rhyme Rhyme is the repetition of the same or similar sounds in the final accented syllables of two or more words, which must be preceded by differing consonants (Harmon et al., 2000). If all conditions are met, we speak of perfect rhymes, and if some of them are violated, for example, because the final sounds are different or the words are identical, we speak of imperfect rhymes. Rhymes typically appear at the end of lines in poetry, in which case they are also called end rhymes. The pattern in which such end rhymes appear in a stanza is called the rhyme scheme and is usually encoded as letters, e.g., in a quatrain with ABAB rhyme scheme, the first and third lines rhyme, as do the second and fourth lines. For a quatrain, there exist 15 theoretically possible rhyme schemes.
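The count of 15 follows because rhyme schemes correspond to set partitions of the four verses, whose number is the Bell number B4 = 15. A minimal self-contained sketch (illustrative, not part of our released code) enumerates them as restricted growth strings:

```python
from itertools import product

def quatrain_schemes():
    """Enumerate all set partitions of 4 verses as canonical
    rhyme schemes (restricted growth strings over A, B, C, D)."""
    schemes = []
    for labels in product(range(4), repeat=4):
        # canonical form: each new label may be at most one
        # larger than the maximum label seen so far
        maximum, ok = -1, True
        for label in labels:
            if label > maximum + 1:
                ok = False
                break
            maximum = max(maximum, label)
        if ok:
            schemes.append("".join("ABCD"[l] for l in labels))
    return schemes

print(quatrain_schemes())       # ['AAAA', 'AAAB', ..., 'ABCD']
print(len(quatrain_schemes()))  # 15, the Bell number B4
```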
Meter Meter refers to the rhythmic pattern within a verse. In modern poetry, this rhythm is usually accented-syllabic, that is, the succession of stressed (´) and unstressed syllables (˘) occurs at regular intervals (Harmon et al., 2000). The rhythmic unit is also known as a foot, and the meter of a verse can thus be described as a sequence of feet. In English poetry, common feet are iambic (˘ ´), trochaic (´ ˘), anapestic (˘ ˘ ´), and dactylic (´ ˘ ˘). For conditioning on meter, we consider all metric feet appearing in our datasets (cf. Appendix A).
Alliteration Harmon et al. (2000) define alliteration as the repetition of the same consonant sounds or any vowel sounds at the beginning of words or syllables that are close together in a verse. In formal verse, alliteration is secondary to rhyme and meter, follows less strict constraints, and is therefore not as easily classified. In this work, we thus consider the level of alliteration instead, which we classify as either low, medium, or high (cf. Section 4).

Methods
We induce our end-to-end poetry generation systems for English and German by fine-tuning pre-trained transformer models (Vaswani et al., 2017). For conditioning on style, we consider two architectural variants: encoder-decoder transformers (Xue et al., 2021b, 2022) and decoder-only transformers (Radford et al., 2019; Brown et al., 2020). As explained in Section 1, we focus on token-free models, but also consider subword-level models for comparison. We do not experiment with models with more than 400 million parameters, since they exceed the capacity of our available GPU resources.

Encoder-Decoder
With encoder-decoder models, we initialize the encoder with a tuple of desired styles and generate a quatrain with the decoder. We represent each style by a special token which we add to the model vocabulary. We use ByT5 (Xue et al., 2022), a token-free pre-trained encoder-decoder transformer, as a baseline model. For comparison with subword-level approaches, we fine-tune mT5 (Xue et al., 2021b).

Decoder-only
As the input for encoder-decoder models is a relatively short sequence of styles, this could lead to an underutilization of the encoder. We thus hypothesize that a decoder-only model, with styles supplied as a prompt string, would be better suited for our task. On the subword level, multiple models, such as GPT-2 (Radford et al., 2019), are readily available. However, to the best of our knowledge, no such model exists at the character level yet, which is why we train our own. Since our new model shares some similarities with the GPT family of models, but has its origin in ByT5 (see Section 3.1), we refer to it as ByGPT5. An overview of all models we use can be seen in Table 1.

ByGPT5
For pre-training our own token-free decoder-only model, we start by modifying the architecture of ByT5 and discard its encoder component entirely. We then initialize the weights with the decoder of ByT5 to warm-start the training process (Rothe et al., 2020; Tang et al., 2022). We repeat this for the three smallest variants of ByT5. Because ByT5 has an asymmetrical architecture, the resulting models retain only 25% of its parameters. We refer to their model sizes as small, base, and medium.
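The 25% figure can be checked directly (a sketch assuming the Hugging Face transformers library and the public google/byt5-small checkpoint; the actual warm-starting happens in our codebase):

```python
from transformers import T5ForConditionalGeneration

# Load ByT5 and inspect its asymmetric encoder-decoder split.
# Warm-starting ByGPT5 amounts to keeping only the decoder stack
# (plus embeddings and LM head) and training it as a causal LM.
byt5 = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

def count(module):
    return sum(p.numel() for p in module.parameters())

decoder_share = (count(byt5.decoder) + count(byt5.lm_head)) / count(byt5)
print(f"decoder + LM head: {decoder_share:.0%} of parameters")  # ~25%
```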
As training data, we use OpenWebText2 (Gao et al., 2021) for English and a subset of CC-100 (Conneau et al., 2020) for German. For hyperparameters, we follow Radford et al. (2019) and Brown et al. (2020) and use Adam with a weight decay of 0.1, a batch size of 512, and varying learning rates depending on model size, and train on a causal language modeling objective for 50k steps, following Lester et al. (2021). The loss curves can be seen in Figure 2.

Datasets
We collect a range of labeled and unlabeled datasets of English and German poetry (cf. Table 2). As shown, we were able to procure labeled data for rhyme and meter, but the corpora are far too small to train a poetry generation system. Instead, we aim to use the bigger unannotated corpora, EPG and DLK, as training data by labeling them automatically. To facilitate the labeling process, we chunk these datasets into pseudo-quatrains (any consecutive sequence of four lines), eventually amounting to over 660k quatrains for English and 1.4M for German. We refer to this new dataset as QuaTrain (cf. Table 2); further statistics can be found in Appendix A. In the following, we explain the labeling process for each style.
For rhyme and meter classification, we leverage the available labeled data (cf. Table 2) and train classifiers. Meter classification is a multi-class classification problem with a single verse as input, while rhyme classification is a binary classification problem with two verses separated by a special token as input. For each style, we perform a 90/5/5 train-valid-test split and fine-tune a range of encoder-only transformers with classification heads jointly on both languages, as this improves performance (Haider and Kuhn, 2018). We test subword-level BERT and XLM-R, as well as character-level Canine-C (Clark et al., 2022). The performance of these models can be seen in Table 3. Since character-level Canine-C outperforms both BERT and XLM-R on both classification tasks, we use it as our final classifier. We classify the meter of a quatrain by choosing the dominant meter among the verses, and the rhyme scheme by determining which verses rhyme and which do not.
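A minimal sketch of the rhyme classifier setup, assuming the Hugging Face transformers library and the public google/canine-c checkpoint (the classification head is freshly initialized here and would still have to be fine-tuned on the labeled data before its prediction is meaningful):

```python
import torch
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer)

# Binary rhyme classification: do two verses rhyme?
tok = AutoTokenizer.from_pretrained("google/canine-c")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/canine-c", num_labels=2)

verse_a = "Which shakes the perfumed altar's flame,"
verse_b = "To make its shrine a sacred name,"
# the tokenizer inserts its own separator between the two verses
inputs = tok(verse_a, verse_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print("rhyme" if logits.argmax(-1).item() == 1 else "no rhyme")
```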
As no readily available poetry datasets include labels for alliteration, we approach the problem in a different way. The quantification of the level of alliteration in a document is a long-known research problem (Skinner, 1939; Leavitt, 1976; Blain, 1987; Benner, 2014). Let $s_1, \dots, s_n$ be the atomic units of sound in verse $v$. Blain (1987) quantifies alliteration as

$$\operatorname{allit}(v) = \frac{1}{n-1} \sum_{i=2}^{n} \max_{1 \leq j < i} \frac{f(s_i, s_j)}{i - j},$$

where $f(\cdot, \cdot)$ is a similarity function of two sounds, the default simply testing for equality. Intuitively, allit(·) counts alliterative sounds in a verse, applies a distance penalty, and normalizes the score to [0, 1]. To get a score for quatrains, we average the alliteration level of all verses. We consider initial phonemes of words, as well as all further stressed phonemes, as atomic sound units, and to determine phonemes and stress, we employ a grapheme-to-phoneme conversion model based on ByT5 (Zhu et al., 2022). Further, we conducted a study to empirically determine several intensity thresholds based on a sample of quatrains. We classify the alliteration level of a quatrain as low if the score is below 0.05, medium if it is below 0.1, and high if it is above that.
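A direct implementation of this formulation (a sketch; for illustration, word-initial letters stand in for the phonemes we obtain from the grapheme-to-phoneme model):

```python
def allit(sounds, f=lambda a, b: float(a == b)):
    """Alliteration level of a verse given its atomic sound units:
    each sound is matched against its best preceding sound,
    discounted by distance, and the result is averaged to [0, 1]."""
    n = len(sounds)
    if n < 2:
        return 0.0
    total = sum(
        max(f(sounds[i], sounds[j]) / (i - j) for j in range(i))
        for i in range(1, n)
    )
    return total / (n - 1)

# toy example with word-initial letters standing in for phonemes
verse = "the sweet wild strain the sudden start"
sounds = [word[0] for word in verse.split()]
print(f"{allit(sounds):.3f}")  # 0.375
```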

Experiments

We fine-tune separate models for each language in QuaTrain and train for 10 epochs. We conduct both automatic (Section 5.1) and human evaluation (Section 5.2). Examples of generated quatrains can be found in Appendix C.
Automatic Evaluation
For automatic evaluation, we condition our models on a range of different style combinations. We select four common rhyme schemes (AABB, ABAB, ABBA, and ABCB), the most popular meters per language (iambus, trochee, anapest, and dactyl for English, and iambus, trochee, and alexandrine for German), and all levels of alliteration, and create 75 poems per model for each possible combination. To find out whether these styles are properly reflected in the generated quatrains, we reuse the classifiers from Section 4, i.e., we use them to classify the generated poems and check whether the styles match. We define the following metrics (see the sketch after these definitions for the rhyme metric):

Rhyme Score computes the recall of verses that should rhyme in a quatrain, as well as the recall of verses that should not, and takes their arithmetic average.
Alliteration Score is 1 if a quatrain has the correct alliteration level, else 0.
Meter Score is the fraction of verses with correctly classified meters.
Coherence uses BERT for next sentence prediction (Devlin et al., 2019) as a means to assess discourse relations of verses (Duari and Bhatnagar, 2021;Shi and Demberg, 2019). The score is the fraction of consecutive verse pairs that are correctly classified to come after one another.
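A minimal sketch of the Rhyme Score as defined above, assuming the rhyme classifier's pairwise verdicts come as a dictionary keyed by verse-index pairs (names are illustrative):

```python
from itertools import combinations

def rhyme_score(scheme, rhymes):
    """Rhyme Score: average of (i) recall over verse pairs that
    should rhyme under `scheme` and (ii) recall over pairs that
    should not. `rhymes[(i, j)]` holds the classifier's verdict."""
    pos = [p for p in combinations(range(len(scheme)), 2)
           if scheme[p[0]] == scheme[p[1]]]
    neg = [p for p in combinations(range(len(scheme)), 2)
           if scheme[p[0]] != scheme[p[1]]]
    rec_pos = sum(rhymes[p] for p in pos) / len(pos) if pos else 1.0
    rec_neg = sum(not rhymes[p] for p in neg) / len(neg) if neg else 1.0
    return (rec_pos + rec_neg) / 2

# quatrain conditioned on ABAB; classifier found (0,2) and (1,3) rhyme
preds = {(0, 1): False, (0, 2): True, (0, 3): False,
         (1, 2): False, (1, 3): True, (2, 3): False}
print(rhyme_score("ABAB", preds))  # 1.0
```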
We provide the scores for each model averaged over all generated quatrains (2700 for German and 3600 for English) in Figure 3. We can see that all models manage to learn to follow aesthetic styles to some degree. Nevertheless, there are noticeable score differences between individual models. In terms of rhyme, all ByGPT5 variants collectively outperform all GPT-2 models on both languages. Similarly, ByT5 consistently outperforms mT5. This supports our hypothesis that token-free models are better suited for character-level styles. Further, ByGPT5 (small) performs better than ByT5 (small), which means we can discard the encoder (75% of parameters) while still improving performance. Surprisingly though, base ByGPT5 and GPT-2 achieve higher scores than their medium variants. While this may initially suggest that larger-capacity decoders prioritize (meaningful) content while smaller decoders focus on style, the high coherence seen across all models weakens this hypothesis. Instead, we speculate that this may be an overfitting problem. In particular, smaller models, up to base size, may be better suited for generating shorter texts such as quatrains. Another surprising finding is that ByT5 (small) performs worse than GPT-2 (base) on English. We investigate this further in Section 5.2. On meter, all models perform very similarly to one another. Whereas ByGPT5 (small) performs best on English by a small margin, it is outperformed by mT5 (small) on German. This result is not surprising: since meter is a syllable-level style, subword-level language models also manage to pick it up reasonably well (Lau et al., 2018). Interestingly though, on English the scores are much lower overall than on German. A reason for this may be that the occurrence of different meters is much more evenly distributed in German QuaTrain (cf. Table 6 in the appendix). While in German only about 60% of all meters are iambs, in English it is over 80%, making it difficult for models to learn other meters. We identify further reasons in Section 5.2.
Alliteration is the style that all models are worst at. Our formulation of alliteration levels may make it very difficult for models to pick up its semantics. Still, ByGPT5 (base) performs best on English, and ByGPT5 (small) and ByT5 (small) perform best on German, suggesting that token-free models have an advantage on this style.

Human Evaluation
To further validate the effectiveness of our models, we conduct a human evaluation campaign using best-worst scaling (BWS) as a means of annotation (Louviere et al., 2015). BWS is a variant of comparative annotation that has been shown to produce high-quality results while keeping the number of required annotations low (Kiritchenko and Mohammad, 2016, 2017). Annotators are presented with tuples of items and asked to identify the best and worst item based on a specified property. By subtracting the fraction of times an item is chosen as the worst from the fraction of times it is chosen as the best, real-valued scores ranging from -1 (very bad) to 1 (very good) can be obtained (Orme, 2009). In our annotations, we consider three properties: rhyme, meter, and human likeness, i.e., the likelihood of a poem being written by a human.
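A minimal sketch of BWS scoring as described above (the annotation format is an assumption for illustration):

```python
from collections import Counter

def bws_scores(annotations):
    """Best-worst scaling: score(item) = %best - %worst, in [-1, 1].
    `annotations` is a list of (tuple_of_items, best, worst)."""
    appearances, best, worst = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        appearances.update(items)
        best[b] += 1
        worst[w] += 1
    return {i: (best[i] - worst[i]) / appearances[i]
            for i in appearances}

anns = [(("human", "bygpt5", "gpt2", "mt5"), "human", "mt5"),
        (("human", "bygpt5", "gpt2", "mt5"), "bygpt5", "gpt2")]
print(bws_scores(anns))  # human: 0.5, bygpt5: 0.5, gpt2: -0.5, mt5: -0.5
```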
As we only have limited resources, we take a few measures to reduce the workload on our annotators. First, we exclusively evaluate our English models and only consider the top-performing model within each model class based on the results of our automatic evaluation. The models in question thus are ByGPT5 (base), GPT-2 (base), ByT5 (small), and mT5 (small). Furthermore, we only choose from three rhyme schemes (AABB, ABAB and ABBA), two meters (iambus and trochee), and one level of alliteration (medium), and create four poems per system for each possible combination.
In addition to machine-generated poems, we also randomly sample human quatrains from our datasets that match the constraints and create 120 4-tuples from the combined set of quatrains. We ensure that each quatrain appears in 4 distinct tuples. Four distinct annotators then annotate the rhyme and human likeness of each tuple, while meter is evaluated by a single expert annotator only. Since we have multiple annotators working on rhyme and human likeness, we use the split-half reliability (SHR) measure (Kiritchenko and Mohammad, 2017) to assess their consistency. SHR is calculated by splitting the annotations into two sets, computing scores for each set, and then computing their Spearman rank correlation coefficient.
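A sketch of the SHR computation, reusing bws_scores from the sketch above; we average over several random splits (the exact splitting protocol is an assumption) and use scipy for the Spearman rank correlation:

```python
import random
from scipy.stats import spearmanr

def split_half_reliability(annotations, trials=100, seed=0):
    """SHR: randomly split the annotations in half, score each
    half with BWS, and correlate the two resulting rankings."""
    rng = random.Random(seed)
    rhos = []
    for _ in range(trials):
        shuffled = annotations[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        a = bws_scores(shuffled[:half])   # from the sketch above
        b = bws_scores(shuffled[half:])
        items = sorted(set(a) & set(b))
        rho, _ = spearmanr([a[i] for i in items],
                           [b[i] for i in items])
        rhos.append(rho)
    return sum(rhos) / len(rhos)
```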
In Figure 4, we provide a kernel density estimate for each annotated property. On rhymes, we obtain an SHR of ρ = 0.77, which demonstrates a high agreement between annotators. Human rhymes are ranked the highest overall, whereas ByGPT5 comes in as a close second, followed by ByT5. This shows that, with respect to rhymes, human annotators consistently prefer token-free language models over subword-level models. This differs somewhat from our findings during automatic evaluation, where GPT-2 (base) was ranked higher than ByT5 (small) on English. An analysis of GPT-2-generated quatrains revealed a predominance of imperfect rhymes as a likely cause. As our rhyme classifier is trained on binary labels, it is unable to detect this, but human annotators perceive this kind of rhyme as worse.
With ρ = 0.54, the SHR of human likeness is noticeably lower than for rhyme. This suggests that all models succeeded in generating human-like quatrains, and it was therefore more difficult for annotators to rank them consistently. Indeed, the distributions in Figure 4 are much closer overall, and although humans rank higher than ByGPT5, which in turn ranks higher than GPT-2, they still perform similarly well. Still, we can see that ByT5, and especially mT5, are ranked a bit lower than the other contenders. Both models were pre-trained on a span corruption objective and have thus never seen truly natural text during pre-training (Zhu et al., 2022; Raffel et al., 2020b; Lester et al., 2021), which we believe could be a possible cause.
The distributions for meter in Figure 4 have very large variances for all models, and also for humans. This is surprising, as it suggests that our annotator does not think that humans are better at adhering to metric constraints than our models, even though the results of automatic evaluation on English were not on par with those on German. We hypothesize that even among real English poets, there is a significant amount of poetry that does not strictly adhere to metric constraints. Consequently, language models only learn to follow them loosely as well, which could explain our results of automatic evaluation.

Analysis
We continue with a deeper analysis of our models and look into low-resource training (§6.1), quantify memorization (§6.2), evaluate the performance of token-free models on non-character-based, high-level tasks (§6.3), use token-level attributions for explainability (§6.4), and compare ByGPT5 with ChatGPT (§6.5).

Low-resource Training
We hypothesized that a large training corpus is an important factor in successfully training an end-to-end poetry generation system. Thus, we examine this hypothesis by selecting a subset of English QuaTrain and re-training our models in a low-resource setting. In particular, we select 1/20th of all poems, amounting to 33k quatrains of training data, and use the same hyperparameters as in Section 5. Figure 5 shows how well these new low-resource models adhere to style constraints, similar to the automatic evaluation of full training in Figure 3. In contrast to full training, all low-resource models perform noticeably worse at generating rhymes, independent of their architectures and tokenization algorithms. Results on meter and alliteration show similar trends, although not as severe: whereas results on rhyme are 15%-40% worse, for meter it is only 5%-20%, and for alliteration 5%-10%. Nevertheless, the results show, especially for rhymes, that even character-level tokenization algorithms do not help much with picking up style when the amount of training data is low. On rhyme, all models perform almost equally badly, suggesting that a large training corpus is important.

Extractive Memorization
A common problem of language models, known as extractive memorization, is generating verbatim copies from the training data during inference (Carlini et al., 2022; Raunak and Menezes, 2022; Meehan et al., 2020). We now quantify whether, and if so, to what degree, our models are affected by memorization. According to Carlini et al. (2022), extractive memorization occurs when a language model's continuation of a string is part of the data it was trained on. Since the inputs to our language models are strings of styles, this formulation lends itself well to our use case: we simply have to check if generated poems appear in QuaTrain. In Table 4, we thus compute the extractive memorization rates of the quatrains generated in Section 5.1. To account for negligible variations in the generated sequences, we do not compare raw strings, but calculate the Ratcliff-Obershelp similarity (Ratcliff and Metzener, 1988), and assume that two strings are equivalent if their similarity exceeds 0.7.
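A sketch of this check; Python's difflib.SequenceMatcher implements the Ratcliff-Obershelp ("gestalt pattern matching") algorithm:

```python
from difflib import SequenceMatcher

def is_memorized(generated, corpus, threshold=0.7):
    """Flag a generated quatrain as memorized if its
    Ratcliff-Obershelp similarity to any training quatrain
    exceeds the threshold. Note: a linear scan over a large
    corpus is slow; real_quick_ratio() can serve as a filter."""
    return any(
        SequenceMatcher(None, generated, quatrain).ratio() > threshold
        for quatrain in corpus
    )
```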
As can be observed, GPT-2 suffers from memorization the most. On English, over 3% of all quatrains generated by GPT-2 (medium) are copied. While ByGPT5 also copies quatrains to an extent, it is much less affected in comparison. On English, ByGPT5 (medium) does not copy anything, and on German only 0.81% of all generated poems. As a general trend, we can see that bigger model variants tend to copy more data than smaller ones, a finding shared by others (Carlini et al., 2022). Interestingly, encoder-decoder models do not seem to be affected by this problem at all, most likely because styles are not used as a prompt, but are fed into the encoder separately.

Higher-level Styles
We also explore how well token-free models perform on higher-level styles which are not character- or subword-level phenomena. In particular, we focus on emotion as a higher-level style phenomenon, using PO-EMO (Haider et al., 2020), a dataset which distinguishes between eight different aesthetic emotions invoked by readers of poetry (cf. Appendix A). We condition our models on these emotions and assess their ability to understand and accurately reflect them. As in Section 4, we leverage automatic labeling. To that end, we train an emotion classifier on the German subset of PO-EMO as in Haider et al. (2020) and reproduce their results. We then classify the emotions of all poems in German QuaTrain and retrain our language models by conditioning them on all emotions appearing in a quatrain. For evaluation, we generate 100 poems for each possible 2-tuple of emotions (so 2800 poems overall) and compute the recall of correctly classified emotions in a quatrain. The results in Table 4 show that encoder-decoder models score the highest, with mT5 achieving the overall best performance. Still, token-free models are not far behind. Especially ByGPT5 fares very well against GPT-2, suggesting that token-free models are still competitive on higher-level tasks.

Token-level Attributions
In this section, to gain insight into the decision-making processes of our models, we visualize token-level attributions when generating a quatrain. Token-level attributions explain to which degree each token in the input is involved in determining the next output token of a model, allowing us to reason about what a model has learned. To this end, Ferrando et al. (2022) decompose the attention blocks of transformers into a sum of vectors and define a new measure for visualizing token-to-token interactions based on the distance of each vector to the output (Kobayashi et al., 2021). We apply this measure to generative language models and visualize token-level attributions for ByGPT5 and GPT-2 when generating the last syllable in a quatrain in Figure 6.
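Our visualizations use the measure of Ferrando et al. (2022); as a far simpler stand-in that conveys the idea, the following sketch computes gradient-norm attributions for an off-the-shelf GPT-2 (the style prompt format is hypothetical, and this is not the measure used for Figure 6):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# hypothetical style prompt followed by a partial quatrain
text = "rhyme: ABBA; meter: iambus\nThe sweet wild strain, the sudden start,"
ids = tok(text, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)

logits = model(inputs_embeds=embeds).logits
logits[0, -1].max().backward()  # logit of the most likely next token

# gradient norm per input token as a crude attribution score
for token, score in zip(tok.convert_ids_to_tokens(ids[0]),
                        embeds.grad.norm(dim=-1)[0].tolist()):
    print(f"{token!r:>14} {score:.4f}")
```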
We can see that ByGPT5 puts a big emphasis on the current verse, as well as the styles it was conditioned on. Further, possibly in response to the ABBA rhyme scheme, it also heavily stresses the ending of the first verse. Since the model also places a moderate amount of attention on the last consonants in verse 3, it also seems to be aware of which sounds it should not generate in order to maintain the rhyme scheme. Interestingly, it heavily emphasizes the letter v in the last two verses. We assume that this corresponds to what ByGPT5 understands by alliteration, in which case it has not learned well at which position in a word the repeated sounds must occur.
Unlike ByGPT5, GPT-2 does not put any visible emphasis on input style tokens, which suggests that it does not understand how to handle them very well. Nevertheless, it stresses similar aspects to ByGPT5, although, due to the subword vocabulary, at a different level of granularity. We provide additional examples in German in Appendix B.

Comparison with ChatGPT
ChatGPT (OpenAI, 2022) is a conversational large language model which specializes in dialogue. It has attracted attention for its detailed and expressive answers, raising the question of how well it performs in generating poetry. In a small-scale study, we thus ask ChatGPT to generate quatrains with various rhyme schemes (AABB, ABAB, ABBA, and ABCB) using its web interface, and similarly generate poems using ByGPT5. We then construct random pairs of quatrains generated by each model and check which poem adheres better to rhyme constraints. Since we know beforehand which quatrains stem from ChatGPT, we use our rhyme scorer from Section 5.1 for unbiased scoring. Only in 15% of cases does our scorer prefer poems of ChatGPT over ByGPT5. Manual investigation showed that ChatGPT tends to generate rhymes at arbitrary positions, rather than adhering to specified rhyme schemes, even when given examples in the prompt. Our verdict is that ChatGPT is a viable approach for poetry generation, but not for style-conditioned poetry generation.

Related Work
As indicated in Section 1, other competing poetry generation systems usually consist of model pipelines and/or incorporate prior knowledge in the generation process. Lau et al. (2018), for example, propose Deep-speare, a poetry generation system for modeling quatrains in English consisting of three components: one language model generates a set of sample verses in reverse order, another model re-initiates sampling as long as the rhyme constraints are not met, and a final model ranks the samples according to how well they adhere to iambic pentameter. Van de Cruys (2020) also injects prior knowledge into a generic language model, but does so by modifying output probability distributions directly. DeepRapper (Xue et al., 2021a) does the same for generating rhymes in rap music. In addition, rhythmic constraints of rap are encoded as special tokens interleaved with actual lyrics. Jhamtani et al. (2019) put their focus on actually learning rhyme and train a sonnet and limerick generator through adversarial training. The model is hierarchical, i.e., it first generates a sequence of line endings which are subsequently completed in reverse. While the model manages to learn the meaning of rhyme to an extent, the authors still filter outputs using pronunciation dictionaries. More in line with our research, Hopkins and Kiela (2017) train a model on the phonetic representation of poetry using the International Phonetic Alphabet (IPA) as a character-level vocabulary. During inference, a second model translates sounds back to human-readable text. Although promising, the model did not generalize well, and an additional model enforces rhythmic constraints in their final approach. Ormazabal et al. (2022) also identify a lack of poetic training data as a shortcoming, but unlike us, address the issue by clever pre-training on prosaic texts. They prefix each text with a content plan specifying line endings and syllable counts for each contained line. During inference, they generate poetry by specifying content plans with rhyming line endings and metric syllable counts. While this approach does not require actual poems, it has the limitation of requiring the explicit specification of line endings and does not take syllable stress into account, which is a crucial aspect of meter.
Apart from the aforementioned systems which do not require human input during runtime, a parallel field of research investigates interactive poetry generation, with a focus on assisting poets in their creative process. Boggia et al. (2022), Popescu-Belis et al. (2022), and Tian and Peng (2022) all propose the use of complex model pipelines for this purpose, while Chakrabarty et al. (2022) fine-tune T5 to enable it to answer predefined questions about poetry.
Finally, several studies have focused on the task of style transfer, specifically the ability to replicate the writing style of a specific author. Zugarini et al. (2019) and Lewis et al. (2021) both train models on the works of an Italian and English poet, respectively, and implement hand-crafted rules to filter generated samples.
Conclusion

In this work, we implement end-to-end style-conditioned poetry generation systems for generating quatrains in English and German. In particular, we present ByGPT5, a novel token-free decoder-only language model, and show that fine-tuning it on a custom poetry corpus outperforms other models, such as GPT-2, mT5, and ByT5, on average, while also performing favorably against human poems. Our key findings are that (i) tokenization algorithms matter, i.e., token-free language models generally perform better at generating character-level styles than subword-level transformers, and (ii) large datasets are crucial for successful model training. We further show that bigger models do not necessarily perform better and that decoder-only architectures work best, i.e., we can discard the encoder of ByT5 (75% of parameters) while still improving downstream performance. We also demonstrate that token-free transformers perform competitively on tasks not tied to character-level styles, and are less susceptible to memorization of the training dataset. Additionally, we visualize token-level attributions to gain insights into the decision-making processes of models when generating a quatrain.
In future work, we want to extend our system to other poetic forms such as sonnets, limericks, or villanelles.

Limitations
A well-known shortcoming of transformers is the computational complexity of self-attention layers (Vaswani et al., 2017). Since the number of required calculations grows quadratically with the length of the input, transformers become prohibitively slow on very long sequences. An unfortunate side effect of processing inputs at the character level is that internal sequences become much longer, so token-free transformers run into these efficiency problems much earlier than subword-based models. Figure 7 illustrates this problem by contrasting the runtime of all poetry generation systems when generating a single quatrain. Even ByGPT5 (small), the smallest model in terms of number of parameters (cf. Table 1) and the fastest token-free transformer, is only marginally faster than GPT-2 (medium), which is almost five times larger. Tay et al. (2022) propose a solution to this problem for transformer encoder blocks by applying a neural pooling operation over input embeddings before feeding them into the model, which could be extended to decoder blocks in future work. Alternatively, Libovický et al. (2022) propose a two-stage decoding architecture in which the transformer decoder operates on character blocks that an additional LSTM model (Hochreiter and Schmidhuber, 1997) decodes into individual characters.

Another shortcoming is that our poetry generation systems can only generate a single poetic form, i.e., quatrains. In general, poetry is a very diverse form of language and stanzas can be of arbitrary length, so this is a serious limitation. In future work, we thus plan to extend our implementation of style-conditioning to variable-length poems. In particular, one could encode a rhyme scheme not as a single special token, but as an arbitrary series of letters indicating which verses rhyme with each other. Alternatively, our current systems could be used to generate longer stanzas through a sliding window approach, i.e., generating one verse at a time with the last three verses as context (a sketch of this idea follows below).
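A hedged sketch of this sliding-window idea (prompt format, decoding parameters, and the single-line heuristic are illustrative assumptions, not part of our current systems):

```python
def sliding_window_stanza(model, tok, style_prompt, n_verses=8):
    """Generate a longer stanza verse by verse, re-conditioning
    the model on the style prompt plus the last three verses."""
    verses = []
    for _ in range(n_verses):
        context = "\n".join([style_prompt] + verses[-3:])
        ids = tok(context, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=60, do_sample=True)
        text = tok.decode(out[0, ids.shape[1]:],
                          skip_special_tokens=True).strip()
        # keep only the first complete line as the next verse
        verses.append(text.splitlines()[0] if text else "")
    return "\n".join(verses)
```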
Further, QuaTrain is limited in that it consists of pseudo-quatrains, which are not real quatrains and often have missing contexts. Nonetheless, as can be seen in Appendix C, models trained on QuaTrain are still able to generate meaningful poetry. In future work, we plan to improve the quality of our dataset by obtaining real quatrains from additional sources such as the Eighteenth-Century Poetry Archive (Huber, 2022).

A Additional poetry corpus statistics
The poetry corpora we collect are English Project Gutenberg (EPG) and Deutsches Lyrik Korpus (DLK) (Haider, 2021) for unlabeled poetry, and Prosodic, the Chicago Rhyme Corpus (Chicago), For-better-for-verse (FORB) (Tucker, 2011), the German Rhyme Corpus (GRC) (Haider and Kuhn, 2018), as well as EPG64 and Anti-K (Haider, 2021) for labeled poetry. We map meters which appear fewer than 25 times in our labeled corpora to the special label other. The final list of meters we consider can be found in Table 5.
Additional statistics of our custom QuaTrain corpus can be found in Table 6. During automatic labeling, we discard a quatrain when the rhyme scheme cannot be clearly determined (e.g., according to the classifier, the first verse rhymes with the second and the second with the third, but the first and third do not rhyme) or when no dominant meter exists.
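A sketch of this consistency check, mapping the classifier's pairwise verdicts on four verses to a scheme string or rejecting intransitive predictions (illustrative, not our exact implementation):

```python
def derive_scheme(rhymes):
    """Derive a rhyme scheme (e.g. 'ABAB') from pairwise rhyme
    predictions `rhymes[(j, i)]` for four verses, or return None
    if the predictions are intransitive and the quatrain should
    be discarded."""
    labels, scheme = {}, []
    for i in range(4):
        match = next((j for j in range(i) if rhymes[(j, i)]), None)
        if match is not None:
            label = labels[match]
        else:  # open a new rhyme group
            label = "ABCD"[len(set(labels.values()))]
        # consistency: verse i must agree with *every* earlier verse
        for j in range(i):
            if rhymes[(j, i)] != (labels[j] == label):
                return None
        labels[i] = label
        scheme.append(label)
    return "".join(scheme)

preds = {(0, 1): False, (0, 2): True, (0, 3): False,
         (1, 2): False, (1, 3): True, (2, 3): False}
print(derive_scheme(preds))  # ABAB
```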
The eight emotions in PO-EMO we train our classifier on are beauty/joy, sadness, uneasiness, vitality, awe/sublime, suspense, humor, and annoyance. Since an additional emotion, nostalgia, almost never occurs, we follow Haider et al. (2020) and omit it from our experiments.

B German Token-level Attributions
German token-level attribution scores in Figure 8 largely follow the trends observed in Section 6.4 on English. GPT-2 puts less attention on style and the emphasized parts of text are less granular.

C Example Quatrains
In Table 7, we list additional example quatrains in German and English, generated with ByGPT5 (base).