STEER: Unified Style Transfer with Expert Reinforcement

While text style transfer has many applications across natural language processing, the core premise of transferring from a single source style is unrealistic in a real-world setting. In this work, we focus on arbitrary style transfer: rewriting a text from an arbitrary, unknown style to a target style. We propose STEER: Unified Style Transfer with Expert Reinforcement, a unified framework developed to overcome the challenge of limited parallel data for style transfer. STEER involves automatically generating a corpus of style-transfer pairs using a product of experts during decoding. The generated offline data is then used to pre-train an initial policy before switching to online, off-policy reinforcement learning for further improvements via fine-grained reward signals. STEER is unified and can transfer to multiple target styles from an arbitrary, unknown source style, making it particularly flexible and efficient. Experimental results on a challenging dataset with text from a diverse set of styles demonstrate state-of-the-art results compared to competitive baselines. Remarkably, STEER outperforms the 175B parameter instruction-tuned GPT-3 on overall style transfer quality, despite being 226 times smaller in size. We also show STEER is robust, maintaining its style transfer capabilities on out-of-domain data, and surpassing nearly all baselines across various styles. The success of our method highlights the potential of RL algorithms when augmented with controllable decoding to overcome the challenge of limited data supervision.


Introduction
Style transfer has been widely explored in the NLP field due to its practical applications, such as making text more formal (Rao and Tetreault, 2018), increasing politeness (Madaan et al., 2020; Mukherjee et al., 2023), or anonymizing authorship (Shetty et al., 2017; Patel et al., 2022). Previous work has mostly focused on one-to-one style transfer, which involves rewriting text from one specific style to another while preserving meaning and fluency (Li et al., 2018; Sudhakar et al., 2019; Shen et al., 2017a). However, this approach may be less practical in real-world scenarios, where there are multiple and often unknown source styles a user wishes to transfer from.

Figure 1: An overview of unified style transfer. In standard style transfer, models can only transfer from a single source style to a specified target style, struggling to transfer from out-of-domain texts. In contrast, unified style transfer models can transfer from an arbitrary source style to multiple target styles.
We focus on arbitrary style transfer, a many-to-one style transfer task, where the goal is to transfer text from an arbitrary, unknown style to a target style using a single model (Reif et al., 2021; Krishna et al., 2020). This is a challenging task mainly due to the lack of large-scale, human-curated corpora for training. Furthermore, we design a framework for training a unified, many-to-many style transfer model, which can do arbitrary style transfer to multiple target styles, as shown in Figure 1. To circumvent the lack of supervised data, recent approaches (Suzgun et al., 2022; Patel et al., 2022) heavily rely on large language models like GPT-Neo (Black et al., 2022) and GPT-3 (Brown et al., 2020) in zero- or few-shot settings. Though promising and convenient, these approaches are limited by the high cost of API calls (OpenAI, 2023) and a lack of reproducibility due to over-reliance on LLMs (Dean, 2023). Our method enhances the effectiveness of smaller, more accessible models for style transfer, broadening their adaptability and utility for the wider community.
In this work, we present Unified Style Transfer with Expert Reinforcement (STEER), a novel, unified framework for many-to-one style transfer without supervision. Starting with a non-parallel corpus of text in various styles and a general paraphraser model, STEER first creates a diverse, pseudo-parallel dataset of style transfer pairs using product-of-experts decoding (Hinton, 2002; Liu et al., 2021). This makes our framework efficient by eliminating the need for costly human-curated datasets. Next, STEER uses offline reinforcement learning (RL) with this data before switching to online, off-policy RL for further improvement. To reflect the varied properties of style transfer, we adapt the QUARK algorithm (Lu et al., 2022), incorporating multiple reward models associated with different aspects such as style strength, fluency, and meaning similarity. Our framework is both practical and flexible, enabling a single model to transfer arbitrary source styles to multiple target styles.
We apply STEER to a diverse dataset of 11 styles (Krishna et al., 2020), developing a unified style model capable of transferring text from any of the 11 styles to any other style in the corpus. Our final model is effective at transferring style while preserving fluency and semantic similarity for all source and target styles, beating strong baselines across a suite of automatic metrics for style transfer. In particular, across all styles, our 775M-parameter model beats all baselines in overall style transfer quality, including the instruction-tuned 175B-parameter GPT-3 model (Ouyang et al., 2022). Finally, we showcase the robustness of our model through evaluation on two out-of-domain source styles unseen during training, where STEER consistently outperforms almost all baselines for every target style. The success of STEER demonstrates the effectiveness of reinforcement learning abetted by a high-quality offline dataset in lieu of a good initial policy.

Task: Unified Style Transfer
Conventionally, the goal of style transfer is to take an input text in a known source style x_{s_i} and rewrite it into some known target style x_{s_j} while preserving meaning and fluency. However, this setting is unrealistic and may not cover real-world use cases where there are multiple and often unknown source styles. The goal of arbitrary style transfer is instead to transfer text from an arbitrary, unknown style to a text in the target style while preserving meaning and fluency. Formally, given S as the set of all possible style choices, this amounts to finding a function f : X × S → X, which takes an input text x and a desired target style s_j, and outputs a modified text in the target style x_{s_j}.

Unified Style Transfer with Expert Reinforcement
We introduce STEER, a novel two-stage framework for unsupervised unified style transfer. Our framework is illustrated in Figure 2 and is composed of 1) expert-guided data generation to circumvent the challenge of obtaining supervised datasets at scale, and 2) offline reinforcement learning followed by online reinforcement learning to effectively align an initial policy with multiple reward functions related to the style transfer task.
In expert-guided data generation (§3.1), the goal is to automatically collect a diverse, high-quality dataset D_f of style transfer pairs using only a general paraphraser M_P and a corpus of diverse styles C. To this end, we follow an overgenerate-and-filter approach: we first generate a large pool of candidate pairs from the paraphraser guided by style expert models in a product-of-experts fashion (Hinton, 2002), then keep only pairs that qualify for the style transfer task (i.e., accurately transferred style and semantically similar pairs). In online off-policy reinforcement learning (§3.2), we first update the paraphraser M_P as an initial policy using supervised learning on the collected dataset, and then switch to online, off-policy learning for further data exploration and model improvement (Ramamurthy et al., 2022; Lu et al., 2022).

Figure 2: An overview of STEER. We first use expert-guided data generation to automatically generate candidate style-transfer pairs x_{s_i} → x_{s_t}, mapping from an input of arbitrary style x_{s_i} to a rewrite x_{s_t} in a target style, by decoding with a product of experts using a paraphraser M_P and style-expert LMs. After filtering by quality metrics, we have a diverse, high-quality dataset D_f. We then train a unified, many-to-many style transfer model, using D_f for offline RL before switching to online, off-policy RL to further optimize style transfer quality.

Expert-guided Data Generation
We first leverage expert LMs to generate a high-quality, pseudo-parallel style transfer corpus.
Generation For each target style s_t ∈ S, we first massively generate a diverse set of candidate style transfer pairs x_{s_i} → x_{s_t} for all s_i ∈ S − {s_t}, such that we collect pairs of transfers from each possible source style to the target style. To do so, we first pass text x_{s_i} from a candidate source style through a general (style-agnostic) paraphraser M_P, typically resulting in a normalized text x = M_P(x_{s_i}) with little or no stylistic features (Krishna et al., 2020). To ensure that the output belongs to the desired target style, we steer the paraphraser M_P towards the target style and away from the source style during decoding. Intuitively, we exploit the inherent capability of the paraphraser to faithfully rewrite input texts, while injecting stylistic control through guided decoding. To do this, we leverage DEXPERTS decoding (Liu et al., 2021), a controllable text generation paradigm that enables steering towards and away from distinct attributes. DEXPERTS combines the distribution of a base autoregressive model P_b with those of an "expert" P_e and/or "anti-expert" P_a model in a product of experts, which are trained on desirable and undesirable attributes respectively. Given a prompt x_{<t}, the next-token probability is obtained by a product of experts:

P(x_t | x_{<t}) ∝ P_b(x_t | x_{<t}) ( P_e(x_t | x_{<t}) / P_a(x_t | x_{<t}) )^α   (1)

where α is a hyperparameter controlling the strength of control over the base model P_b. Within our problem setting, we consider the general paraphraser M_P as the base model, and two language models fine-tuned on texts belonging to the target style s_t and the source style s_i as our expert and anti-expert models respectively. Given a text in a candidate source style x_{s_i}, we generate text in the target style x_{s_t} by sampling from the probability distribution obtained in Eq. 1. We repeat this expert-guided decoding for all source and target styles, resulting in a dataset D_init. In practice, we over-generate data by repeating the generation procedure above with a vast sweep of hyperparameters, such as multiple sampling temperatures and decoding algorithms, so we can eventually filter and retain as many high-quality rewrites as possible.
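The product of experts in Eq. 1 is equivalent to adding a scaled logit difference between the expert and anti-expert to the base model's logits before the softmax. A minimal sketch of this combination rule (function name and plain-list logits are illustrative, not from the authors' codebase):

```python
import math

def dexperts_next_token_probs(base_logits, expert_logits, anti_logits, alpha=1.0):
    """Combine base paraphraser logits with style expert/anti-expert logits.

    Implements the product-of-experts rule of Eq. 1,
        P(x_t | x_<t) ∝ P_b * (P_e / P_a)^alpha,
    which in logit space is z_b + alpha * (z_e - z_a), followed by a softmax.
    """
    combined = [b + alpha * (e - a)
                for b, e, a in zip(base_logits, expert_logits, anti_logits)]
    m = max(combined)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in combined]
    total = sum(exps)
    return [v / total for v in exps]
```

With alpha = 0 this reduces to the base paraphraser's softmax, while larger alpha shifts probability mass toward tokens the target-style expert prefers and away from tokens the source-style anti-expert prefers.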
Filtering Not all of the expert-guided generations in D_init are high-quality. We thus filter D_init and retain the pairs that best represent the task of style transfer. We assess the quality of each candidate style transfer pair in D_init with three standard style transfer metrics: 1. Target Style Strength (TSS) of the generation x_{s_t} is measured by the probability of the target class s_t under a RoBERTa-large classifier (Liu et al., 2019) trained on text from all the styles in the corpus C. Both style strength and style accuracy have been used in previous work (Reif et al., 2021; Krishna et al., 2020); we opt for style strength, as it is more fine-grained than a binary measurement of accuracy. Accordingly, we train our classifier in a multi-label setup, such that the prediction probability of each target style can be independently evaluated.
2. Fluency (F) of the generation x_{s_t} is measured by the probability of being grammatically acceptable under a binary RoBERTa-large classifier trained on the CoLA dataset (Warstadt et al., 2018).
3. Meaning Similarity (MS) between the input x_{s_i} and the rewritten text x_{s_t} is measured via SentenceTransformers embedding distance (Reimers and Gurevych, 2019).
Following previous work (Krishna et al., 2020), for each candidate style transfer pair, we aggregate the three style metrics above into a joint metric V that captures the overall quality:

V(x_{s_i}, x_{s_t}, s_t) = TSS(x_{s_t}, s_t) · F(x_{s_t}) · MS(x_{s_i}, x_{s_t})   (2)

All three individual metrics are scalar values in the interval [0, 1], which also ensures that V ∈ [0, 1].
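As a sketch, the aggregation in Eq. 2, together with the clamping of occasional negative similarity scores noted in §3.2, might look like the following (hypothetical helper, not the authors' code):

```python
def joint_metric(tss, fluency, ms):
    """Aggregate quality V = TSS * F * MS (Eq. 2).

    SentenceTransformers similarity can be slightly negative, so MS is
    clamped to [0, 1] before multiplying; since every factor then lies
    in [0, 1], the product V is also in [0, 1].
    """
    ms = max(0.0, min(1.0, ms))
    return tss * fluency * ms
```

Because V is a product, a pair must score reasonably on all three metrics to survive filtering; a rewrite that nails the style but destroys the meaning is penalized just as hard as a faithful copy with no style change.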
Next, we filter our data to create a high-quality pool of training data D_f for subsequent model training. For each target style in D_init, we sort the style-transfer pairs by their combined score V, then take the top-k examples. This sampling method ensures that the examples in the resulting dataset are of the highest possible quality, but may also lead to lower diversity, as it excludes lower-scoring generations.

In practice, with multiple target styles in the initial pool of pairs D_init, filtering is done for each style separately, and the filtered data from each target style is combined to form D_f.
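The per-style top-k filtering described above can be sketched as follows (the data layout and field names are assumptions for illustration):

```python
from collections import defaultdict

def filter_top_k(candidates, k):
    """Keep the k highest-V transfer pairs per target style.

    `candidates` is a list of dicts with keys 'source', 'rewrite',
    'target_style', and 'V' (the joint quality score from Eq. 2).
    Filtering is done separately for each target style, and the
    per-style pools are then concatenated to form D_f.
    """
    by_style = defaultdict(list)
    for pair in candidates:
        by_style[pair["target_style"]].append(pair)
    d_f = []
    for style, pairs in by_style.items():
        pairs.sort(key=lambda p: p["V"], reverse=True)  # best-first
        d_f.extend(pairs[:k])
    return d_f
```

Filtering per style rather than globally keeps rarer target styles represented in D_f even when their average V is lower than that of easier styles.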

Reinforcement Learning
Next, we train a unified style transfer model by leveraging the generated corpus D_f. Concretely, our goal is to attain a rewriting model M_θ which accepts an input with arbitrary style x_{s_i} along with a target style s_t and produces a high-quality rewrite x_{s_t}, as evaluated by the joint metric V; formally,

M_θ = argmax_θ E[ V(x_{s_i}, x_{s_t}, s_t) ]

(Footnote 3: SentenceTransformers occasionally outputs negative scores; we set these to 0 to ensure a score in [0, 1].)

Recently, online policy-based RL algorithms (Lu et al., 2022; Schulman et al., 2017; Ramamurthy et al., 2022) have been shown to be effective in optimizing language models towards a given objective function. In the RL framework, we refer to the model M_θ as the policy and the objective function V as the reward. Generally, online RL algorithms conduct policy optimization with model-generated outputs while assuming a reasonable degree of alignment between the output distribution of the initial policy and the optimal reward distribution. This alignment is necessary to produce generations with meaningful signals for RL training.
Due to the absence of supervision, the closest initial policy for our unified style transfer task would be the style-agnostic paraphraser M_P. However, this initial policy is still far from the optimal reward distribution, as the style transfer task falls beyond the capabilities of the paraphraser M_P, making it unable to produce useful generations for RL optimization. To overcome this challenge, we propose first conducting offline RL training and then progressing to online RL training. Specifically, prior to optimizing M_P with its own generations, we first perform RL optimization on the style transfer data D_f generated through expert guidance (§3.1). Intuitively, the offline stage equips the initial policy with a certain degree of style transfer capability before the online stage further optimizes it towards generating rewrites of better quality.
In practice, we employ and adapt the RL algorithm QUARK (Lu et al., 2022) to accomplish the two-stage RL training. QUARK is an online, off-policy RL algorithm that has proven effective in various text generation tasks. Notably, its off-policy nature makes it possible to adapt for the offline RL stage. QUARK optimizes a reward function through reward conditioning. Concretely, the algorithm alternates between 1) collecting samples with the current language model, 2) sorting them into quantiles based on their reward, with each quantile identified by a reward token prepended to the language model's input, and 3) applying the standard language modeling loss to samples from each quantile conditioned on their reward token.
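The quantile-sorting step of QUARK can be sketched as follows (the reward-token naming is illustrative; the actual QUARK implementation details may differ):

```python
def assign_reward_tokens(samples, num_quantiles=5):
    """Sort samples into reward quantiles and tag each with a reward token.

    Each sample is a (text, reward) pair. Samples are sorted by reward and
    split into `num_quantiles` equal-sized bins; bin i is identified by a
    token like '<rw_i>' (higher i = higher reward), which is prepended to
    the model input for the conditioned language-modeling step.
    """
    ordered = sorted(samples, key=lambda s: s[1])
    n = len(ordered)
    tagged = []
    for rank, (text, reward) in enumerate(ordered):
        q = min(num_quantiles - 1, rank * num_quantiles // n)
        tagged.append((f"<rw_{q}> {text}", reward))
    return tagged
```

At inference time, conditioning on the highest-reward token steers the policy toward its best-quality behavior.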
When adapting QUARK to offline RL, we start by initializing the data pool with the style transfer corpus D_f generated through expert guidance, rather than gathering generations from the initial policy. Afterward, we carry out the quantization and learning steps in the same manner as the original QUARK. After completing the offline RL stage, we proceed with online QUARK training by alternating between data generation with the updated policy, quantization, and learning. In both stages, our training objective can be written as:

max_θ E_{(x_{s_i}, x_{s_t}) ∼ D} [ log p_θ( x_{s_t} | r_{V(x_{s_i}, x_{s_t}, s_t)}, x_{s_i}, s_t ) ]   (3)

where r_{V(·)} denotes the quantized reward token corresponding to the reward score V(·) of the generated rewrite. In online RL, D is expanded with samples from the improved policy at each iteration. Additionally, we also explore integrating a vectorized reward function v(x_{s_i}, x_{s_t}, s_t) into the QUARK algorithm, rather than using the joint multiplied scalar score V as the reward function. In this case, instead of conditioning on one reward token that corresponds to a quantized scalar score, we condition on a reward vector composed of three reward tokens, representing quantized scores from the style, fluency, and similarity metrics respectively. As we show in the experiment section, we observe a noticeable performance boost from vectorized QUARK in terms of reward optimization. We believe this is likely because the vectorized reward provides additional fine-grained signals for optimization, which reflect the quality of each generated output with respect to individual evaluation metrics.

Table 1 (caption): Style transfer results comparing STEER with STRAP (Krishna et al., 2020) and P-A-R (Suzgun et al., 2022), using GPT-2 Large (774M), and GPT-3 (175B). Bold and underline denote the highest and the second-highest score respectively in each row.
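A sketch of how the vectorized reward prefix might be constructed (the token names and the uniform bucketing are assumptions for illustration):

```python
def vectorized_reward_tokens(tss, fluency, ms, bins=5):
    """Build the fine-grained reward prefix used by vectorized QUARK.

    Instead of one token for the scalar V, each metric (style strength,
    fluency, meaning similarity) is quantized into `bins` levels and mapped
    to its own reward token; the three tokens together form the prefix the
    policy conditions on.
    """
    def bucket(score):
        score = max(0.0, min(1.0, score))          # clamp to [0, 1]
        return min(bins - 1, int(score * bins))    # uniform bucketing
    return (f"<style_{bucket(tss)}>"
            f"<fluency_{bucket(fluency)}>"
            f"<sim_{bucket(ms)}>")
```

Compared to a single quantized V, this prefix tells the policy *which* aspect of a generation was weak, so a fluent but off-style rewrite and an on-style but disfluent one receive distinguishable training signals.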

Datasets
We use the following datasets in our experiments: 1) the Corpus of Diverse Styles (CDS; Krishna et al., 2020), a non-parallel, diverse text corpus with 11 distinct styles such as Shakespeare and the Bible; 2) Grammarly's Yahoo Answers Formality Corpus (GYAFC; Rao and Tetreault, 2018), a parallel corpus of formal and informal responses collected from the Yahoo Answers forum; and 3) the Yelp Review Dataset (Yelp; Shen et al., 2017a), a non-parallel corpus of user reviews of various businesses and services from Yelp with binary sentiment ratings of positive or negative. For more details on the datasets, see Appendix B.

Baselines
We use three competitive style-transfer baselines. Method-specific details are located in Appendix C.

Style Transfer via Paraphrasing (STRAP; Krishna et al., 2020) is an unsupervised approach for arbitrary style transfer which uses GPT-2 Large (Radford et al., 2019) inverse paraphrasers.
Prompt-and-Rerank (P-A-R; Suzgun et al., 2022) prompts a language model to generate k candidate style transfer texts, ranks them based on quality, and returns the best one. We use P-A-R with GPT-2 Large.

GPT-3 (Brown et al., 2020; Ouyang et al., 2022) is a highly capable class of decoder-only models, showing particularly strong zero- and few-shot performance. We utilize GPT-3 as a baseline in both zero-shot and few-shot (k = 1, 5, 10) settings. Specifically, we use the instruction-tuned, 175B-parameter engine text-davinci-003.

Evaluation Metrics
To evaluate the quality of each style transfer pair, we use the same metrics introduced in §3.1: target style strength (TSS), fluency (F), meaning similarity (MS), and the aggregate metric V. For a set of style transfer pairs (i.e., over an entire data corpus), we report the average V. To ensure that the improvement from STEER is meaningful (i.e., that our model is not reward hacking), we also report evaluation using alternative metrics unseen during training in Appendix F; these results corroborate our main findings in §4.5.

Experimental Details
For all non-GPT-3 baselines, we use GPT-2 Large as the base language model. Specifically, for STEER, we use GPT-2 Large for the paraphraser and for the expert models. Our main STEER results are with the vectorized QUARK variant (i.e., using fine-grained reward). More details are in Appendix A.3.

GPT-3 has its best relative performance on the Twitter and Shakespeare styles, but struggles otherwise. This shows the limitations of relying on large-scale general-purpose LLMs: in this case, GPT-3 excels at transferring to styles most likely to be highly prevalent in its internet text corpus (Brown et al., 2020). However, it is unlikely to generalize to more obscure styles unseen during training, even with few-shot examples. The poor performance of the GPT-2-based P-A-R reinforces this, showing the unreliability of prompting general-domain, pretrained LMs for style transfer, especially at smaller scales.

Out-of-Domain Style Transfer
We also conduct an out-of-domain evaluation to assess the robustness of each method to unseen inputs. Specifically, we use text from the two styles in GYAFC as inputs at test time, employing the previously trained CDS model without further fine-tuning, and transfer to each of the 11 styles in CDS.
Our results are shown in Table 2: overall, STEER is the most robust method, outperforming all others in total score V for almost all target styles. STEER loses only to GPT-3 on the Shakespeare style; this may be due to the inherent knowledge of Shakespeare stored in GPT-3.

Human Evaluation
We also conduct a human evaluation to verify the quality of the generations, using a 3-point Likert scale to evaluate style transfers (Iyyer et al., 2018). Figure 3 shows our human evaluation results. In terms of individual metrics, STEER has better fluency than STRAP and maintains competitive fluency with GPT-3, which is known to excel at generating human-like text (Brown et al., 2020). STEER also performs slightly better in meaning similarity than STRAP, but GPT-3 outperforms both of them significantly. However, the TSS of STEER makes up for this, dwarfing that of both baselines. We consider this a reasonable trade-off: STEER sacrifices a small amount of fluency and meaning preservation for much more style transfer strength.
Previous work has also demonstrated this trade-off between style transfer accuracy and meaning preservation, both through empirical results (Suzgun et al., 2022; Malmi et al., 2020; Wu et al., 2019; Li et al., 2018) and explicit mentions in discussions (Li et al., 2018; Xu et al., 2019; Wu et al., 2019; Hallinan et al., 2023). Intuitively, when transferring from one style to another, some amount of semantic change is unavoidable; as a simple example, meaning similarity is maximized when the input is naively copied.

Table 3 (excerpt), transfer: lyrics → bible:
- STEER: "And he will not dare to face me: for fear of me is in his eyes."
- GPT-3: "And his fear was great, so that he could not stand before me."
- STRAP: "For he that is afraid of me is of me; but he that is of me is of him."
- P-A-R: "In fear he came and hid himself, because God was near to him"
Overall, human evaluation validates our main findings: STEER still beats both baselines in overall score V. These results show that GPT-3 is excellent at paraphrasing, creating fluent and semantically similar rewrites, but not at transferring to multiple diverse styles, as it often struggles to convert to the target style. On the other hand, STEER is more versatile, maintaining moderate-to-strong performance on all individual metrics, making it the strongest overall method.
Finally, we show qualitative examples of generations from different models in Table 3; while STEER balances all three aspects of style transfer, the baselines tend to optimize for only one or two.

Ablations
We perform two ablation studies to analyze the effect of dataset size and reward design in STEER.
All models are compared after 15K training steps.

Dataset Size We investigate the effect of different dataset sizes on the performance of STEER. Using the top-k sampling strategy, we vary k with k = 100K, 200K, and 400K and compare style transfer on CDS. Figure 4 shows the average results when transferring to the 11 target styles in CDS from all other styles. Interestingly, we do not observe style transfer performance scaling directly with dataset size; as the top-k value increases, the aggregate score V, target style strength TSS, and fluency F all follow a reverse U-shaped curve.
These results may indicate a trade-off between diversity and quality in the dataset D_f used to train STEER: as k increases with top-k sampling, D_f becomes more diverse, but also includes lower-quality samples, which may hurt downstream model performance. On the other hand, when k is too small, though the average quality of each example in D_f is higher, having fewer diverse examples may hurt generalization. The optimal dataset has examples with sufficient variety and quality, enabling the model to learn a high-quality policy while staying resilient to varied inputs.
Coarse vs. Fine-grained Reward We also directly compare the use of coarse or fine-grained reward tokens in the RL stages of STEER. As mentioned in §3.2, rather than using a product of the style metrics and a single reward token, we can use a vectorized reward function that outputs each of the three style metrics individually and condition on each of these specific metrics correspondingly.

Results are shown in Table 4. Incorporating a fine-grained reward improves performance across all dimensions, including V. This shows that conditioning on fine-grained rewards can lead to more control across each desired attribute, resulting in much better style transfers overall.

Analysis of D f
We analyze D_f, the dataset resulting from the expert-guided dataset generation. First, we compare the lexical diversity of D_f against existing style transfer corpora. Following Gehrmann et al. (2021), we gauge the mean segmented token-type ratio over segments of length N = 10 (MSTTR) and the 1/2/3-gram entropy of the training split of each corpus. We also assess the quality of style-transferred outputs in each corpus by measuring fluency (F) and meaning similarity (MS).
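These diversity metrics can be sketched in a few lines (a simplified reading of MSTTR and n-gram entropy; tokenization details are omitted):

```python
import math
from collections import Counter

def msttr(tokens, segment_len=10):
    """Mean segmented token-type ratio: average type/token ratio over
    consecutive segments of `segment_len` tokens (incomplete tail dropped)."""
    ratios = []
    for i in range(0, len(tokens) - segment_len + 1, segment_len):
        seg = tokens[i:i + segment_len]
        ratios.append(len(set(seg)) / segment_len)
    return sum(ratios) / len(ratios) if ratios else 0.0

def ngram_entropy(tokens, n):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return -sum((c / total) * math.log2(c / total) for c in grams.values())
```

Higher values of either metric indicate a more lexically diverse corpus; segmenting MSTTR keeps it comparable across corpora of different lengths.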
Table 5 shows comparisons of these metrics. The automatically created D_f is comparable to existing human-created datasets in diversity and in fluency. The average meaning similarity is also promising, as it is within 85% of the GYAFC value. This shows the potential of machine-generated data when aided by creative decoding algorithms.

Related Work
Style Transfer Due to the absence of large-scale parallel corpora for text style transfer (TST), prior work has focused on unsupervised methods designed for non-parallel datasets (Dai et al., 2019; Luo et al., 2019). Most of these efforts focus on disentangling the representation of the content and the style of a given text, either through an auxiliary discriminator to classify text attributes (Hu et al., 2018; Shen et al., 2017b), or by training with a policy gradient (Xu et al., 2018; Gong et al., 2019).
Recent work has leveraged the generation capabilities of LMs for TST: Krishna et al. (2020) create a pseudo-parallel corpus by paraphrasing text from a style, then train an inverse paraphraser to convert text to that style. Other work automatically aligns pairs of sentences in different styles, either at the representation level (Prabhumoye et al., 2018) or the corpus level (Liu et al., 2022b).
Others have attempted TST by prompting LMs (Reif et al., 2021; Suzgun et al., 2022). However, these approaches often rely on a strong initial model, either one already fine-tuned on TST-related tasks (e.g., paraphrasing) or a large LM capable of few-shot generalization. In contrast, our framework does not assume strong capabilities of the initial model, making it applicable in a realistic setting.
RL for NLP Recent work has shown the potential of RL to align with arbitrary natural language objective functions across areas such as summarization (Paulus et al., 2017), open-ended text generation (Lu et al., 2022), dialogue (Li et al., 2016;Zhou et al., 2017), question-answering (Liu et al., 2022a), machine translation (Nguyen et al., 2017;Wu et al., 2016), and dataset generation (Pyatkin et al., 2022;Kim et al., 2023).For unified style transfer, a setting where the desired output can be directly correlated to automatic metrics, RL is a promising avenue.
Data Generation with LMs LM-generated data has been increasingly used across a wide range of tasks, such as commonsense reasoning (West et al., 2022; Zelikman et al., 2022), NLI (Ye et al., 2022), and dialogue generation (Kim et al., 2023). While previous approaches rely on the task-solving capability of LLMs, recent work shows that small LMs can also generate high-quality datasets without supervision (Jung et al., 2023; Brahman et al., 2023). Building on top of these, our work pushes further on machine-generated data by incorporating 1) inference-time decoding algorithms and 2) targeted filtering, yielding an effective pseudo-parallel corpus to initialize offline reinforcement learning.

Conclusion
We propose STEER, a unified framework to overcome the challenge of limited parallel data in style transfer by leveraging expert-guided decoding and two-stage reinforcement learning. We focus on a more realistic use case: rewriting text from an arbitrary, unknown style to a desired target style. Through extensive experiments, we demonstrate the effectiveness and robustness of STEER on both in- and out-of-domain style transfer, outperforming competitive baselines. The success of STEER underscores the potential of RL algorithms when combined with controllable decoding and encourages future algorithmic innovation that fully unleashes the power of RL for real-world NLP applications.

Limitations, Ethical Considerations, and Broader Impacts
While STEER demonstrates promising results for arbitrary-to-many style transfer, there are several limitations. Firstly, in our experiments, we rely heavily on the availability of a corpus containing text from diverse styles to act as source styles for the expert-guided creation of D_f; however, not every corpus will have as diverse a set of styles to create a D_f from. Instead, in data-limited settings, it may be necessary to gather source text from other locations, such as other corpora, in order to create candidate style-transfer pairs. Secondly, while we tested the generalization of STEER to out-of-domain source styles, adaptation to new target styles through continual learning requires further investigation and experimentation. Additionally, like many other natural language systems, STEER could unintentionally introduce harmful stereotypes or engage in malicious content generation. Specifically, the use of fine-grained reward signals during online training may be exploited to reinforce undesired behaviors, potentially leading to the generation of biased or unethical outputs. Furthermore, bad actors may intentionally use style transfer systems like STEER to create harm or to harass marginalized communities through toxic output styles. This is a common misuse case in generation (McGuffie and Newhouse, 2020), and an application we strongly condemn.
On the positive side, STEER allows for memory- and cost-efficient training of unified style transfer models using existing corpora. Our method is thus beneficial for somewhat reducing the carbon footprint by reducing the reliance on training large language models (LLMs) to achieve desired results (Strubell et al., 2019).

C.2 Prompt-and-Rerank
We test both prompts with a small subset of data, and find that contrastive prompting works much better, so we use this going forward. We also try generating k = 3 and k = 5 samples per input, and find that k = 3 works best. Following the original paper, we use nucleus sampling (Holtzman et al., 2019) with p = 0.9 and a temperature of 1.0. Finally, we use GPT-2 Large for a fair comparison with STEER.

C.3 GPT-3
We prompt GPT-3 using nucleus sampling (Holtzman et al., 2019) with p = 0.9 and a temperature of 1.0. We include further details on zero-shot and few-shot prompting below.

C.3.1 Zero-shot
We use the following prompt setup for zero-shot style transfer: Rewrite

D Human Evaluation
Since automatic metrics alone have been shown to be insufficient for evaluating text generations (Novikova et al., 2017), we conduct a human evaluation. Annotators rate the meaning similarity of a style transfer pair and the fluency of the style-transferred text.
For fluency, annotators choose between: 0 for not fluent, 1 for somewhat fluent, and 2 for fluent.For meaning similarity, annotators choose between: 0 for not similar, 1 for somewhat similar, and 2 for similar.We discard annotations where all three annotators disagree on a label for either fluency or similarity, resulting in a final human evaluation labeled size of 310 (from an initial size of 330).
To reduce labor cost, we only run our human evaluations on the top three methods from Table 1, meaning we exclude P-A-R. In addition, following previous work, we do not run human evaluation on target style strength. Further details are explained in Appendix D.1.

D.1 Style Identification Task Difficulty
The target styles in the CDS dataset are extremely complex. Previous work (Krishna et al., 2020) mentions that style identification is too challenging a task, even for experienced annotators.
We verify the difficulty of the text style identification task reported in Krishna et al., 2020 by performing an additional human evaluation. From the CDS test set, we randomly sample 10 examples from each of the 11 styles (110 total examples with ground-truth styles). Next, we use the same three annotators from our previous human evaluation (NLP experts), and provide them with a natural language description of each of the 11 styles, along with 20 random examples from the train set of each, to familiarize them with text from the different styles. We ask them to assign a style label to each of the 110 examples, given their knowledge of the styles, and calculate their accuracy and agreement. On average, the annotators achieve only a 40.0% classification accuracy with an inter-annotator agreement of 0.39 (Fleiss' kappa). In contrast, on the same samples (unseen by the classifier), our classifier obtains an 84.5% classification accuracy. These results validate the difficulty of the task and suggest that an automatic classifier is better suited for it.
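For reference, Fleiss' kappa over a subjects-by-categories count matrix can be computed with the standard formula; this is a generic sketch, not the paper's evaluation code:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa. counts[i, j] = number of raters assigning
    subject i to category j; every subject has the same rater count."""
    counts = np.asarray(counts, dtype=float)
    n_subjects = counts.shape[0]
    n_raters = counts[0].sum()
    # Per-subject observed agreement P_i, then its mean P-bar.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement P_e from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1.0 - p_e)

# Three raters, four subjects, two categories.
kappa = fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]])
```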

E The Cold Start Problem in RL
Reinforcement learning often involves optimizing a policy model towards an optimal distribution that maximizes some expected reward. This paradigm works well out-of-the-box for a variety of tasks in NLP, such as model detoxification and sentiment control (Lu et al., 2022), where the output distribution of the initial policy already aligns, to a reasonable degree, with the optimal reward distribution. However, in a cold-start reinforcement learning setting, the initial policy output distribution is drastically different from the optimal reward distribution; this may be the case when the reward is linked to a specific task outside the capabilities of the original policy.
Adjusting to cold start has mostly been explored in the context of recommender systems, where it is difficult to determine user preferences without any initial data (Ding and Soricut, 2017; Ji et al., 2021; Du et al., 2022), but has been sparsely pursued in reinforcement learning for NLP. Ding and Soricut (2017) introduce softmax policy gradients for cold-start reinforcement learning, but the approach is limited to a single class of reinforcement learning algorithms (policy-gradient approaches) and includes mathematical assumptions not widely applicable to various NLP applications.
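To make this gap concrete, the toy sketch below (our illustration, unrelated to any cited implementation) runs exact policy-gradient ascent on a three-armed softmax bandit. From a uniform start the rewarded arm is found quickly; from a cold start, where that arm has near-zero initial probability, the same number of updates barely moves the policy:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_bandit(logits, rewards, lr=0.5, steps=200):
    """Exact policy gradient on a softmax bandit:
    dJ/dlogit_i = pi_i * (r_i - J), where J = E_pi[r]."""
    logits = np.array(logits, dtype=float)
    for _ in range(steps):
        pi = softmax(logits)
        J = pi @ rewards
        logits += lr * pi * (rewards - J)
    return softmax(logits)

rewards = np.array([1.0, 0.0, 0.0])            # only arm 0 pays off
warm = train_bandit([0.0, 0.0, 0.0], rewards)  # uniform initial policy
cold = train_bandit([-8.0, 0.0, 0.0], rewards) # rewarded arm almost never sampled
```

After the same 200 updates, the warm-start policy concentrates nearly all of its mass on the rewarded arm, while the cold-start policy remains essentially unchanged; this mirrors why an initial policy pre-trained on offline data is useful before switching to online RL.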

F Alternative Evaluation Metrics
To ensure that the model improvement from STEER is meaningful (i.e., that our model is not reward hacking), we use a set of alternative metrics for target style strength, meaning similarity, and fluency, and re-run evaluation on all results from Tables 1 and 2. These metrics are unused during training time for STEER.
For the fluency model, we use a different binary CoLA classifier (https://huggingface.co/textattack/roberta-base-CoLA), and again use the raw probability score of the linguistically acceptable class. To assess meaning similarity, we use the embedding-based SIM model of Wieting et al., 2019, as used in Krishna et al., 2020. Finally, for the style classifier model, given the limited data quantity, we train another RoBERTa-Large classifier on the same CDS data but with a different seed. As before, we compute the aggregate metric V by taking the product of the three automatic metrics for each style transfer pair in the corpus, and report the average V value.
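The aggregate computation itself is simple; as a sketch (the function name and toy scores are ours), V is the per-pair product of the three automatic metrics, averaged over the corpus:

```python
import numpy as np

def aggregate_v(style_strength, meaning_sim, fluency):
    """Aggregate score V: product of the three per-pair metrics,
    averaged over all style transfer pairs."""
    v = np.asarray(style_strength) * np.asarray(meaning_sim) * np.asarray(fluency)
    return v.mean()

# Toy scores in [0, 1] for three style-transfer pairs.
score = aggregate_v([0.9, 0.8, 0.7], [0.6, 0.9, 0.8], [1.0, 0.5, 0.9])
```

Because V multiplies the three dimensions, a transfer must score well on all of them at once; a fluent paraphrase with no style shift, or a strong style shift that destroys meaning, both receive a low V.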
Our results using the alternative metrics to re-run evaluation on both in-domain and out-of-domain style transfer with the CDS-trained STEER model are shown in Table 10 and Table 11, respectively. Overall, this corroborates our main findings by showing that our relative results are largely unchanged: on the in-domain styles, STEER beats all baselines, with impressive gains in target style strength as well as improved fluency and meaning similarity. On the out-of-domain task, STEER continues to excel, once again beating all baselines other than GPT-3 on Shakespeare.

G Full Experimental Results
We detail the full experimental results for the main experiments in this section, including all style evaluation metrics.

G.1 Main Experiments
We include the full results for the main experiment from Table 1, testing out style transfer on the CDS dataset to each target style from all other source styles.Table 12 has the results for STEER, Table 13 has the results for STRAP, Table 14 has the results for P-A-R, and Tables 15-18 have the results for GPT-3 0-shot, 1-shot, 5-shot, and 10-shot.

Figure 3: Style transfer quality V∼H on CDS, averaged across all 11 styles, with fluency and meaning similarity from human evaluation. TSS is automatically computed.

Figure 4: Plots of style transfer quality on CDS, averaged across all 11 styles, with varying k, the hyperparameter used in the top-k sampling strategy.

Table 1: Comparison of 11-way style transfer on the CDS dataset measured by aggregate score V with different methods, including STRAP

Table 2: Comparison of style transfer to each of the 11 styles in the CDS dataset, measured by aggregate score V, from two out-of-domain styles from the GYAFC corpus. For. and Inf. denote the formal and informal styles respectively. Bold and underline denote the highest and second-highest score respectively in each row.

Table 3: Examples of style transfer pairs generated by STEER and other methods. GPT-3 is run with 10-shot. In the examples, STEER produces style transfers that optimize across all dimensions, while other methods do not.

Table 4: Style transfer quality on CDS, averaged across 11 target styles, using STEER with a coarse vs. a fine-grained reward. The highest values are denoted in bold.

Table 5: Data metrics on D_f (STEER) and other datasets.

Table 9:
STRAP (Krishna et al., 2020) specializes in CDS. We retrieve the diverse paraphraser and the inverse style transfer models from the repository of the original paper; please see Krishna et al., 2020 for more details. At inference, we use greedy decoding, as this led to the best results in the original paper.