A Recipe for Arbitrary Text Style Transfer with Large Language Models

In this paper, we leverage large language models (LLMs) to perform zero-shot text style transfer. We present a prompting method that we call augmented zero-shot learning, which frames style transfer as a sentence rewriting task and requires only a natural language instruction, without model fine-tuning or exemplars in the target style. Augmented zero-shot learning is simple and demonstrates promising results not just on standard style transfer tasks such as sentiment, but also on arbitrary transformations such as ‘make this melodramatic’ or ‘insert a metaphor.’


Introduction
Text style transfer is the task of rewriting text to incorporate additional or alternative stylistic elements while preserving the overall semantics and structure. Although style transfer has garnered increased interest due to the success of deep learning, deep learning approaches usually require a substantial amount of labeled training examples, either as parallel text data (Zhu et al., 2010; Rao and Tetreault, 2018) or non-parallel text data of a single style (Li et al., 2018; Jin et al., 2019; Liu et al., 2020; Krishna et al., 2020). Even bleeding-edge approaches that tackle the challenging problem of label-free style transfer are limited in that they require at least several exemplar sentences that dictate a given target style (Riley et al., 2021). Hence, recent survey papers have identified a need for new methods that both reduce the training data requirements and expand the scope of styles supported (Jin et al., 2020; Hu et al., 2020).
In this work, we present augmented zero-shot learning, a prompting method that allows large language models to perform text style transfer to arbitrary styles, without any exemplars in the target style. Our method builds on prior work showing that sufficiently large LMs such as GPT-3 can perform various tasks ranging from classification to translation, simply by choosing a clever prompt to prepend to the input text for which the model is asked to continue (Brown et al., 2020; Branwen, 2020). Using a single prompt that provides several demonstrations of sentences being "rewritten" to meet a desired condition, language models can extrapolate and rewrite text in unseen styles. We are thus able to perform style transfer to arbitrary styles such as "make this sentence more comic" or "include the word balloon." Augmented zero-shot learning is simple and facilitates the application of style transfer to a wider range of styles than existing work.
Our contributions are the following.
1. We propose a recipe for style transfer using large LMs that is label-free, training-free, and intuitively controllable.
2. Via human evaluation, we find that our method achieves strong performance on both standard and non-standard style transfer tasks. We also compare our approach for sentiment transfer with prior methods using automatic evaluation.
3. We explore real-world desired style transfers generated from users of a text editing UI that implements our method.

* Equal contribution

(a) Zero-shot learning prompt
Here is some text: {That is an ugly dress}. Here is a rewrite of the text, which is more positive: {

(b) Few-shot learning prompt
Here is some text: {I was really sad about the loss}. Here is a rewrite of the text, which is more positive: {I was able to accept and work through the loss to move on.}
Here is some text: {The eggnog was tasteless}. Here is a rewrite of the text, which is more positive: {The eggnog had a great, festive taste to it.}
…
Here is some text: {That is an ugly dress}. Here is a rewrite of the text, which is more positive: {

(c) Augmented zero-shot learning prompt (ours)
Here is some text: {When the doctor asked Linda to take the medicine, he smiled and gave her a lollipop}. Here is a rewrite of the text, which is more scary: {When the doctor told Linda to take the medicine, there had been a malicious gleam in her eye that Linda didn't like at all}
Here is some text: {They asked loudly, over the sound of the train}. Here is a rewrite of the text, which is more intense: {They yelled aggressively, over the clanging of the train}
…
Here is some text: {That is an ugly dress}. Here is a rewrite of the text, which is more positive: {
Alternative target styles shown for the final line: more melodramatic; includes a metaphor; include the word "balloon".

Figure 1: Zero-shot, few-shot, and augmented zero-shot prompts for style transfer. The boldface text is the zero-shot prompt, and the plain text is the additional priming sequence. The full prompts used in this paper are shown in Table 7. We encourage readers to examine the outputs of our model at https://bit.ly/3fLDuci.
Augmented zero-shot prompting
Although large LMs are trained only for continuation, recent work has shown that they can perform a variety of NLP tasks by expressing the task as a prompt that encourages the model to output the desired answer as the continuation (Puri and Catanzaro, 2019; Weller et al., 2020; Brown et al., 2020; Schick and Schütze, 2021, inter alia). The simplest approach, zero-shot prompting, directly uses natural language to ask the large LM to perform a task, as shown in Figure 1a. Zero-shot learning, however, can be prone to failure modes such as not returning well-formatted or logical outputs (see Appendix §A). Few-shot prompting, as shown in Figure 1b, has been shown to achieve higher performance, but requires exemplars for the exact task that we want the model to perform. Such few-shot examples can be easily obtained if the desired style transformation is known ahead of time, but this ultimately limits style transfer to a set of pre-specified styles.
To remove the need for these labeled exemplars in few-shot prompting, we propose augmented zero-shot learning. Instead of using exemplars of the exact style transfer task we wish to perform, we prompt the model via related style transfer tasks in the same format, as shown in Figure 1c. This intuition is inspired by Reynolds and McDonell (2021)'s observation that successful prompts constrain the behavior of the large LM away from failure modes; in our case, we aim to preserve the flexibility of a zero-shot prompt while encouraging the model to produce outputs of a specific template. We keep the exemplars constant and insert any desired transformation in the final continuation, e.g., "more melodramatic," "insert a metaphor," or "include the word balloon." This augmented zero-shot formulation works for a range of arbitrary styles.
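The prompt construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the exact code used in our experiments: the function names `format_turn` and `build_prompt` are hypothetical, only the two exemplars visible in Figure 1c are included (the full exemplar list is in Table 7), and joining turns with newlines is an assumption.

```python
from typing import List, Optional, Tuple

# (source text, style instruction, rewrite) triples copied from Figure 1c.
# Only the two exemplars visible in the figure are listed here.
EXEMPLARS: List[Tuple[str, str, str]] = [
    (
        "When the doctor asked Linda to take the medicine, he smiled and gave her a lollipop",
        "more scary",
        "When the doctor told Linda to take the medicine, there had been a "
        "malicious gleam in her eye that Linda didn't like at all",
    ),
    (
        "They asked loudly, over the sound of the train",
        "more intense",
        "They yelled aggressively, over the clanging of the train",
    ),
]


def format_turn(text: str, style: str, rewrite: Optional[str] = None) -> str:
    """Format one rewrite request; leave the brace open when no rewrite is given."""
    turn = f"Here is some text: {{{text}}}. Here is a rewrite of the text, which is {style}: {{"
    if rewrite is not None:
        turn += f"{rewrite}}}"
    return turn


def build_prompt(text: str, style: str,
                 exemplars: List[Tuple[str, str, str]] = EXEMPLARS) -> str:
    """Fixed exemplars in other styles, followed by the open-ended target request."""
    turns = [format_turn(src, sty, out) for src, sty, out in exemplars]
    turns.append(format_turn(text, style))  # the model continues after the final "{"
    return "\n".join(turns)  # newline separator is an assumption


if __name__ == "__main__":
    print(build_prompt("That is an ugly dress", "more melodramatic"))
```

Because the exemplars stay constant, switching to a new style only changes the `style` argument of the final turn.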

Original
There, in the middle of the street, stood an old man in a weatherbeaten brown coat.

More descriptive
The man's brown trenchcoat was coated with snow, and as he leaned against the telephone booth, he cast a long shadow across the road in the late afternoon sun.

Include a metaphor
There, in the middle of the street, stood an old man whose face resembled the weatherworn bark of a gnarled tree.

Include the word "park"
There, in the middle of Central Park, stood an old man in a weatherbeaten brown coat.

Include the word "balloon"
There, in the middle of the street, stood an old man with several colourful balloons tied to the straps of his coat.

More comic
There, in the middle of the street, stood an old man in a raggedy brown coat that was much too big for him.

More melodramatic
There, in the middle of the road, stood a grizzled old man, the light of life faded from his sunken eyes.

Experimental Setup
Style transfer tasks. We consider six style transfer tasks that we deem non-standard, listed in Table 1. These styles were chosen to be representative of style adjustments made by users of an AI-assisted text editor that employs our method (discussed further in §5). As source sentences, we use 50 sentences randomly drawn from the Reddit Writing Prompts validation set (Fan et al., 2018), excluding those that already clearly exhibited one of the styles or were ungrammatical/incoherent. We use human evaluation for these styles, since not all styles have readily available classifiers.
We also evaluate our method on two standard style transfer tasks: sentiment and formality. We use the Yelp polarity dataset (Zhang et al., 2015) for sentiment and Grammarly's Yahoo Answers Formality Corpus (GYAFC) dataset for formality (Rao and Tetreault, 2018). These datasets allow us to evaluate the performance of augmented zero-shot learning in the context of prior supervised methods which have been used on these tasks.
Model. Augmented zero-shot learning requires a large language model. We use two dense left-to-right decoder-only transformer language models (Vaswani et al., 2017), each with a non-embedding parameter count of 137B. The first model, which we refer to as LLM, was trained on a corpus comprising public web documents, including forum and dialog data and Wikipedia. The dataset was tokenized into 2.81T BPE tokens with a SentencePiece vocabulary size of 32K (Kudo and Richardson, 2018). The second model, which we refer to as LLM-Dialog, was the result of finetuning LLM on a curated, high-quality subset of data identified to be in a conversational format.
To show that the success of augmented zero-shot learning is not restricted to these two large LMs, we also perform an experiment using GPT-3 models (see Table 8). For LLM and GPT-3, we use the prompts shown in Figure 1 (see Table 7a for the unabbreviated prompts). For LLM-Dialog, the prompt is formulated as a conversation between one agent who is requesting rewrites and another who is performing the rewrites (see Table 7b in the Appendix).

Non-Standard Styles
For our six non-standard styles, we asked six professional raters to assess a total of 7,200 <input sentence, target style, output sentence> tuples (for each style we have outputs from our method plus the three baselines).
Each output was scored by three raters on the following three axes, which are standard to textual style transfer (Mir et al., 2019): (1) transfer strength (the amount that the output actually matches the target style), (2) semantic preservation (whether the underlying meaning of the output text, aside from style, matches that of the input), and (3) fluency (whether the text is coherent and could have been written by a proficient English speaker). Following Sakaguchi and Van Durme (2018), transfer strength and semantic preservation were rated on a scale from 1-100. A screenshot of the evaluation UI is shown in Figure 5 in the Appendix. We use LLM-Dialog, and compare it with three other methods: (1) zero-shot (a baseline), (2) paraphrase (our normal augmented zero-shot prompt, but with the target style of "paraphrased", as a control) and (3) human (ground-truth transformations written by the authors).
Figure 2 shows these results. We found that the outputs of our method were rated almost as highly as the human-written ground truth for all three evaluations. The zero-shot baseline performed the worst in all categories: 25.4% of the time, it did not return a valid response at all (see Appendix §A), compared with 0.6% for augmented zero-shot. The strong performance of the paraphrase baseline at fluency and semantic similarity shows that large LMs are capable of generating high-quality text that remains true to the input sentence's meaning.
For a subset of the tasks, some automatic evaluation was also possible. We found that the "balloon" and "park" transformations successfully inserted the target word 85% of the time. For "more descriptive" and "include a metaphor" the transformed text was, as expected, longer than the original (by 252% and 146% respectively, compared with 165% and 146% for human baselines).
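The two automatic checks mentioned above (target-word insertion and output length relative to the source) are simple to compute. The sketch below is illustrative only: the helper names are hypothetical, and both the matching rule (case-insensitive, anchored at the start of a word, so that "balloons" counts for "balloon") and the length measure (whitespace-separated tokens) are assumptions, not a description of how the reported numbers were obtained.

```python
import re
from typing import List


def contains_word(output: str, word: str) -> bool:
    """Assumed matching rule: case-insensitive match anchored at a word start."""
    return re.search(r"\b" + re.escape(word), output, flags=re.IGNORECASE) is not None


def inclusion_rate(outputs: List[str], word: str) -> float:
    """Fraction of outputs that contain the target word."""
    if not outputs:
        raise ValueError("outputs must be non-empty")
    return sum(contains_word(o, word) for o in outputs) / len(outputs)


def length_increase_pct(source: str, output: str) -> float:
    """Percentage by which the output is longer than the source, in whitespace tokens."""
    src_len = len(source.split())
    if src_len == 0:
        raise ValueError("source must contain at least one token")
    return 100.0 * (len(output.split()) - src_len) / src_len


if __name__ == "__main__":
    source = "There, in the middle of the street, stood an old man in a weatherbeaten brown coat."
    output = ("There, in the middle of the street, stood an old man with several "
              "colourful balloons tied to the straps of his coat.")
    print(contains_word(output, "balloon"))
    print(length_increase_pct(source, output))
```

A stricter whole-word match would exclude inflected forms such as "balloons"; which rule is appropriate depends on how the instruction is interpreted.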

Standard Styles
To better contextualize the performance of our method with prior methods, we also generated outputs for two standard style transfer tasks: sentiment and formality. We perform automatic evaluation for sentiment style transfer since there are classifiers available for these styles. We note that although automatic evaluations can diverge from human ratings, they can still be a good proxy, as we could not perform human evaluation against every prior method due to time and resource constraints. We automatically evaluate (1) transfer strength using a sentiment classifier from HuggingFace Transformers (Wolf et al., 2020), (2) se- [...]
Table 2 shows these automatic evaluations, with four main takeaways. First, augmented zero-shot prompting achieves high accuracy and low perplexity compared with baselines. The BLEU scores, however, are low, which we believe is because the method tends to add additional information to generated sentences (see Appendix C for a deeper analysis). Second, we apply augmented zero-shot learning to GPT-3 175B; these results indicate that augmented zero-shot learning generalizes to another large language model. Third, we vary model size for GPT-3 models, finding that larger size greatly improves style transfer. Fourth, for LLM and LLM-Dialog, we find that augmented zero-shot learning substantially outperforms vanilla zero-shot learning and almost reaches the accuracy of five-shot learning.

Potential of Arbitrary Styles
One promising application of augmented zero-shot learning is an AI-powered writing assistant that can allow writers to transform their text in arbitrary ways that the writer defines and controls. As a qualitative case study to explore what arbitrary rewrite styles may be requested, we built an AI-assisted story-writing editor with a "rewrite as" feature that uses our augmented zero-shot method. Our editor has a freeform text box for users to specify how they would like a selection of their story to be rewritten (see Figure 6 in the Appendix). We asked 30 people from a creative writing group to use our UI to write a 100-300 word story, collecting 333 rewrite requests in total. Table 3 shows a subset of these, which were as diverse as asking for the text "to be about mining" or "to be less diabolical."

Table 3: Requests in the form of "Rewrite this..." made by real users to a large LM-powered text editor. For the full set of unique requests, see Table 5 in the Appendix.
to be a little less angsty • to be about mining • to be better written • to be less diabolical • to be more absurd • to be more adventurous • to be more Dickensian • to be more emotional • to be more magical • to be more melodramatic • to be more philosophical • to be more revolutionary • to be more surprising • to be more suspenseful • to be more technical • to be more whimsical • to be warmer • to fit better grammatically with the rest of the story • to make more sense

Conclusions
We introduced augmented zero-shot learning, which we find shows strikingly promising performance considering its simplicity. This prompting paradigm moves the needle in text style transfer by expanding the range of possible styles beyond the currently limited set of styles for which annotated data exists. More broadly, we also hope that the strategy of prompting a large LM with non-task-specific examples can inspire new inference-only methods for other NLP tasks.
Hallucinations. Large LMs are known to hallucinate text content; we saw this happen frequently for style transfer. While this is an advantage in some contexts like creative writing, it is undesirable for applications like summarization.
Inherent style trends. We also noticed that even our "paraphrase" baseline was rated highly for style strength for a few styles ("more formal" and "more melodramatic"). This implies that the method's outputs generally trend toward these styles. A direction for future work would be to see what styles and qualities of text our method (and large LMs in general) are inherently more likely to produce.
Large LM safety concerns. Large LMs themselves come with their own host of difficulties, barriers to entry, and potential safety concerns as discussed by Bender et al. (2021), which are also valid for this style transfer method. However, we also think that this method can be a useful tool in exploring and exposing the safety and boundaries of these models themselves: what happens if we try to force the large LM to make a text "more racist", "more sexist", or "more incendiary"? It is important to keep pushing these models to their boundaries to see where they fail and where problems arise, and specific use cases that show a broader range of the model's capabilities also show a broader range of its failure modes.

B Prompt Selection
A promising new area of prompt engineering has arisen to address the failure modes discussed above, specifically the invalid or unparseable answers. Reynolds and McDonell (2021) find that prompting a model for a task is more akin to locating an already-learned task than truly learning a new one. Moreover, they emphasize that prompt engineering is mostly about avoiding various failure cases such as those described above. In this work, we use delimiters ("{" and "}") to help avoid these types of errors, giving scores of zero when there were no valid responses with such delimiters. There are other delimiters that could be used (e.g., quotes, "(" and ")", "<" and ">", newlines with a colon (as used by GPT-3), etc.). We chose curly braces as they were 1) likely to occur in the training data as delimiters in other contexts and 2) not frequently part of the input sentence itself. We also use a second-person prompt template for the dialog, which yielded better results as it was more similar to the training data. Exploring these options more quantitatively would be an interesting direction for future work.
Because the performance of prompting can vary depending on the exact language of the prompt (Reynolds and McDonell, 2021), we compare four variations of prompts for sentiment: "more positive/negative," "happier/sadder," "more optimistic/pessimistic," and "more cheerful/miserable." As shown in Table 4 in the Appendix, performance differed across the four prompts, but we found them comparable.
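To make the role of the delimiters concrete, the following sketch shows one way a model continuation could be parsed and scored. It is an illustration under stated assumptions: `extract_rewrite` and `score_or_zero` are hypothetical names, and the validity rule (a non-empty string terminated by the first closing brace) is our simplification rather than a specification of the parsing used in the experiments.

```python
from typing import Callable, Optional


def extract_rewrite(continuation: str) -> Optional[str]:
    """Return the text before the first '}' in the model continuation.

    The prompt ends with an open '{', so a well-formed continuation is the
    rewrite followed by '}'. Anything generated after the brace is ignored.
    Returns None when no closing brace is found or the rewrite is empty
    (assumed definition of an invalid response).
    """
    end = continuation.find("}")
    if end == -1:
        return None
    rewrite = continuation[:end].strip()
    return rewrite or None


def score_or_zero(continuation: str, scorer: Callable[[str], float]) -> float:
    """Score a valid rewrite with `scorer`; invalid responses receive zero."""
    rewrite = extract_rewrite(continuation)
    return 0.0 if rewrite is None else scorer(rewrite)


if __name__ == "__main__":
    raw = "That is a lovely dress} Here is some text: {The soup was cold}."
    print(extract_rewrite(raw))
```

Cutting at the first closing brace also discards any further "Here is some text" turns that the model may continue to generate.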
into paragraphs • to be a bit clearer • to be a little less angsty • to be a word for a song • to be about mining • to be about vegetables • to be better written • to be less descriptive • to be less diabolical • to be more absurd • to be more adventurous • to be more angry • to be more cheerful • to be more descriptive • to be more Dickensian • to be more emotional • to be more fancy • to be more flowery • to be more interesting • to be more joyful • to be more magical • to be more melodramatic • to be more philosophical • to be more revolutionary • to be more scary • to be more subtle • to be more surprising • to be more suspenseful • to be more technical • to be more violent • to be more whimsical • to be warmer • to fit better grammatically with the rest of the story • to make more sense • to use a more interesting word • with a few words

C Low BLEU for LLM Outputs
As we saw in Table 2, the outputs of our model had low BLEU scores with respect to human-generated outputs, while simultaneously having high semantic similarity in human evaluations. Based on qualitative examination of outputs, we believe that this is because model outputs often, despite having high semantic similarity with the source sentence, used different language from human annotations. For instance, for transferring the sentiment of "ever since joes has changed hands it's just gotten worse and worse" to positive sentiment, our augmented zero-shot learning model outputted "the establishment has continued to provide excellent service, improving steadily since its change of ownership." This has low BLEU with respect to the human reference, which is simply "ever since joes has changed hands it's just gotten better and better." Though we do not see this as an inherent problem, increasing the BLEU for the purposes of comparison can be done in an easy way via candidate selection, as our model returns sixteen possible continuations. In applications for which we prefer model outputs to have high lexical similarity to the source sentence, we could select the candidate of the sixteen with the highest BLEU score compared with the original source sentence. We find that this candidate selection step can substantially improve the BLEU score with the ground truth target sentences, as we show in Table 8.
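The candidate selection step can be sketched as follows. Note the assumptions: `simple_bleu` is a simplified, add-one-smoothed sentence-level BLEU over whitespace tokens that stands in for whichever BLEU implementation is used in practice, and `select_candidate` is a hypothetical helper name.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU with add-one smoothing (a stand-in metric)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped n-gram matches
        total = sum(cand_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1))
    # Brevity penalty for candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision / max_n)


def select_candidate(source: str, candidates: List[str]) -> str:
    """Pick the candidate with the highest BLEU against the source sentence."""
    return max(candidates, key=lambda c: simple_bleu(c, source))


if __name__ == "__main__":
    source = "ever since joes has changed hands it's just gotten worse and worse"
    candidates = [
        "the establishment has continued to provide excellent service, "
        "improving steadily since its change of ownership.",
        "ever since joes has changed hands it's just gotten better and better",
    ]
    print(select_candidate(source, candidates))
```

Selection is done against the source sentence, not the human reference, so no ground-truth target is needed at inference time.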

D Further Related Work
Style transfer has gained increasing attention in the NLP landscape, with neural models trained to perform style transfer for styles including sentiment, formality, politeness, gender, and political slant (Prabhumoye et al., 2018; Madaan et al., 2020; Liu et al., 2021). We briefly summarize the primary approaches to style transfer here, and refer the interested reader to Jin et al. (2020) or Hu et al. (2020) for a survey.
Most text style transfer approaches fall into two categories. Early approaches tend to require parallel text data (Zhu et al., 2010; Rao and Tetreault, 2018), where every input in the source style has a corresponding output in the target style. Though this formulation elegantly fits the standard encoder-decoder paradigm, the availability of a parallel text corpus is a stringent requirement. Hence, recent text style transfer approaches have instead used non-parallel monostyle data (no one-to-one mapping between instances in the source and target styles). Such methods include latent representation manipulation (Liu et al., 2020), prototype-based text editing (Li et al., 2018), and pseudo-parallel corpus construction (Jin et al., 2019). However, even non-parallel monostyle data can be hard to collect for arbitrary styles. As such, surveys have called for more research on approaches that expand the scope of supported styles and reduce the training data requirements for style transfer systems (Jin et al., 2020; Hu et al., 2020).
Several new methods tackle the challenging problem of label-free style transfer, which does not require a full corpus of labeled data, but rather just a few exemplars that define a style. One line of work uses variational autoencoders for unsupervised learning of controllable representations for text. Riley et al. (2021) extract a style vector from a set of target texts and use this vector to condition the decoder to perform style transfer to a target style. These approaches have a similar goal to ours in terms of expanding the scope of possible style transfers. However, they are different in two main ways. First, they require a fully specialized model, whereas our method can be applied out-of-the-box with something like GPT-3. This can either be a strength or weakness, depending on the availability of such a model. Second, they require exemplars to define a style rather than a plain text description.

Table 7: The exact augmented zero-shot prompts used in our experiments. For LLM-Dialog, we replaced "Here is a rewrite of the text, which is" with "Rewrite it to be", and fed each line of the input to the model as individual dialog turns. The blue text is an example of a templated input text and style that would produce the final model output. Note that we can achieve high accuracy even though the prompt formulation resulted in some minor grammatical errors for some styles (e.g., "rewrite it to be include the word 'snow'").