Thank you BART! Rewarding Pre-Trained Models Improves Formality Style Transfer

The scarcity of parallel data makes it hard for formality style transfer models to preserve content. We show that fine-tuning pre-trained language (GPT-2) and sequence-to-sequence (BART) models boosts content preservation, and that this is possible even with limited amounts of parallel data. Augmenting these models with rewards that target style and content, the two core aspects of the task, we achieve a new state of the art.


Introduction and Background
Style transfer is the task of automatically converting a text of one style into another, such as turning the formal "I viewed it and I believe it is a quality program." into the informal "I've watched it and it is AWESOME!!!!". This task, which can be used for, e.g., personalised response generation, translation of ancient text into modern text, and text simplification, is particularly challenging since style must be changed while ensuring that content is preserved. Accordingly, the performance of style transfer systems is commonly assessed on both style strength and content preservation.
Due to the general scarcity of parallel data, unsupervised approaches are popular. These include disentangling style and content by learning a distinct representation for each (Shen et al., 2017; Fu et al., 2018; John et al., 2019), and back-translation (Zhang et al., 2018; Lample et al., 2019; Luo et al., 2019; Prabhumoye et al., 2018). A common strategy to enhance style accuracy is to introduce a reward in the form of a style classifier (Lample et al., 2019; Gong et al., 2019; Luo et al., 2019; Wu et al., 2019; Sancheti et al., 2020). As a result, unsupervised models achieve good accuracy in style strength. Content preservation, however, is usually unsuccessful (Rao and Tetreault, 2018).
Parallel data can help to preserve content, but is limited. Niu et al. (2018) combine the training sets of two different domains and incorporate machine translation to train their models with a multi-task learning schema, plus model ensembles. Sancheti et al. (2020) use parallel data to train a supervised sequence-to-sequence model and, in addition to the commonly used style strength reward, include a reward based on BLEU (Papineni et al., 2002) to enhance content preservation. Shang et al. (2019) propose a semi-supervised model combining parallel data with large amounts of non-parallel data.
Pre-trained models, successful in a variety of NLP tasks, have recently been used in formality style transfer. Zhang et al. (2020) propose several data augmentation methods for pre-training a transformer-based (Vaswani et al., 2017) model, and then use gold data for fine-tuning. Using GPT-2 (Radford et al., 2019), Wang et al. (2019) and Wang et al. (2020) propose a harness-rule-based preprocessing method, and joint training of bi-directional transfer and auto-encoding with two auxiliary losses. Contemporary work by Chawla and Yang (2020) develops a semi-supervised model based on BART large (Lewis et al., 2020).
Contributions Focusing specifically on formality transfer, for which parallel data is available, (i) we take the contribution of pre-trained models a step further by augmenting them with reward strategies that target content and style, thereby achieving new state-of-the-art results. (ii) We analyse separately the contribution of pre-trained models on content and style, showing that they take care of preserving content (the hardest part of style transfer to date), while ensuring style strength. (iii) Moreover, experimenting with training size, we show that while parallel data contributes to content preservation, fine-tuning pre-trained models with 10% of parallel data is more successful than training on 100% of the data from scratch. Reducing the need for parallel data opens up the applicability of formality style transfer to domains and languages for which little parallel data exists.

Method
We propose a framework to control the style of output text for style transfer atop pre-trained models. Given a source sentence x = {x_1, ..., x_n} of length n with style s_1 and a target sentence y = {y_1, ..., y_m} of length m with style s_2, our model aims to learn two conditional distributions, altering the style of a sentence while preserving its original content. Our framework consists of (i) fine-tuning pre-trained models on a formality transfer parallel corpus, and (ii) incorporating rewards to enhance style change and content preservation.

Models
GPT-2 This model (Radford et al., 2019) is a transformer-based network (Vaswani et al., 2017). Given a sentence of tokens x = {x_1, ..., x_l}, the standard language modeling objective is to minimize the following negative log likelihood:

\mathcal{L}(x) = -\sum_{i} \log P(x_i \mid x_{i-k}, \ldots, x_{i-1}; \theta) \quad (1)

where k is the size of the context window.
To make GPT-2 rephrase a text in the target style, the input pair (source sentence, target sentence) is represented as a single sequence with three special tokens, marking the beginning [BOS] and end [EOS] of every sequence and separating source and target sentences [SEP] (Fig. 1(a)). During inference, we feed GPT-2 the source sentence with [BOS] and [SEP] to infer the target sentence.
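For concreteness, a minimal sketch of this input formatting with Huggingface's transformers follows; the toy sentences are ours, and computing the loss over the full sequence (rather than masking the source part) is an assumption, not necessarily the paper's exact setup:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"bos_token": "[BOS]", "eos_token": "[EOS]", "sep_token": "[SEP]"}
)
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # make room for the added tokens

source = "i've watched it and it is AWESOME!!!!"               # informal (illustrative)
target = "I viewed it and I believe it is a quality program."  # formal (illustrative)

# Training: source and target as a single sequence, modeled left to right (Eq. 1).
train = tokenizer(f"[BOS] {source} [SEP] {target} [EOS]", return_tensors="pt")
loss = model(train.input_ids, labels=train.input_ids).loss

# Inference: feed "[BOS] source [SEP]" and let the model continue with the target.
prompt = tokenizer(f"[BOS] {source} [SEP]", return_tensors="pt")
out = model.generate(prompt.input_ids, max_new_tokens=40,
                     eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0][prompt.input_ids.shape[1]:], skip_special_tokens=True))
```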
BART This is a denoising autoencoder for pre-training sequence-to-sequence models (Lewis et al., 2020). Given a source sentence x and a target sentence y, the loss function is the cross-entropy between the decoder's output and the target sentence:

\mathcal{L}(\phi) = -\sum_{t=1}^{m} \log P(y_t \mid y_{<t}, x; \phi) \quad (2)
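A sketch of one supervised fine-tuning step under this objective, assuming the facebook/bart-base checkpoint and simplifying the batch to a single sentence pair:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)  # lr as reported in Setup

source = "i've watched it and it is AWESOME!!!!"
target = "I viewed it and I believe it is a quality program."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Passing labels makes the model return the cross-entropy of Eq. 2.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```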

Rewards
Atop the models, we implement two rewards, used in isolation and together, to enhance style strength (Style Classification Reward) and content preservation (BLEU Score Reward).
Style Classification Reward As often done in previous work (see Section 1), we use a classification confidence reward to encourage a larger change in the confidence of a style classifier (SC). We pre-train the binary style classifier TextCNN (Kim, 2014) and use it to evaluate how well the transferred sentence matches the target style. SC's confidence is formulated as

p(s_i \mid y) = \mathrm{softmax}_i(\mathrm{TextCNN}(y; \theta)) \quad (3)

where i = {1, 2} represents the source and target style, respectively, and θ are the parameters of the style classifier, fixed during fine-tuning. The reward is

R_{cls} = \lambda_{cls} \left[ p(s_2 \mid y^s) - p(s_1 \mid y^s) \right] \quad (4)

where y^s is the generated target sentence, sampled from the model's distribution at each time step in decoding. For the GPT-2 based model, we also add a classification confidence reward for the source sentence, similar to Eq. 4, since the model generates sentence x with the original style while generating the target sentence:

R'_{cls} = \lambda_{cls} \left[ p(s_1 \mid x^s) - p(s_2 \mid x^s) \right] \quad (5)
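A sketch of how this confidence reward could be computed; `sc_model` stands for the pre-trained TextCNN (its logit interface is an assumption), and the difference form mirrors the reconstruction of Eq. 4 above:

```python
import torch
import torch.nn.functional as F

def sc_reward(sc_model, sampled_ids, lambda_cls=1.0, target_style=1):
    """Eq. 3-4 sketch: lambda * [p(s2 | y^s) - p(s1 | y^s)].

    sc_model is the frozen binary style classifier; sampled_ids are the
    token ids of the sentence sampled during decoding.
    """
    with torch.no_grad():                   # classifier stays fixed
        logits = sc_model(sampled_ids)      # assumed shape: (batch, 2)
        probs = F.softmax(logits, dim=-1)   # Eq. 3: classifier confidence
    source_style = 1 - target_style
    return lambda_cls * (probs[:, target_style] - probs[:, source_style])
```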
BLEU Score Reward Following Sancheti et al. (2020), we introduce a BLEU-based reward to foster content preservation:

R_{bleu} = \lambda_{bleu} \left[ \mathrm{BLEU}(y^s, \hat{y}) - \mathrm{BLEU}(y^g, \hat{y}) \right] \quad (6)

where y^g is the target style text obtained by greedily maximizing the distribution of model outputs at each time step, y^s is sampled from that distribution, and ŷ is the reference target sentence.
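A sketch of this reward using sacrebleu for sentence-level BLEU; the self-critical difference against the reference mirrors the reconstruction of Eq. 6, and λ = 0.2 is the value reported later in Setup:

```python
from sacrebleu import sentence_bleu

def bleu_reward(sampled, greedy, reference, lambda_bleu=0.2):
    """Eq. 6 sketch: the sampled output y^s is rewarded for beating
    the greedy baseline y^g, with BLEU measured against the reference."""
    r_sampled = sentence_bleu(sampled, [reference]).score / 100.0
    r_greedy = sentence_bleu(greedy, [reference]).score / 100.0
    return lambda_bleu * (r_sampled - r_greedy)
```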
Gradients and Objectives The rewards are used for policy learning. The policy gradient (Williams, 1992; additional details are provided in Appendix A.2) is

\nabla_\phi J(\phi) = \mathbb{E}\left[ R \, \nabla_\phi \log P(y^s \mid x; \phi) \right] \quad (7)

where R is the SC reward and/or the BLEU reward, y^s is sampled from the distribution of model outputs at each decoding time step, and φ are the parameters of the model. Similarly, we add the policy gradient regarding the source sentence for the SC reward (only for the GPT-2-based model).
The overall objectives for φ are the loss of the base model (Eq. 1 or Eq. 2) and the policy gradient of the different rewards (Eq. 7).
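In practice, Eq. 7 can be implemented as a surrogate loss whose gradient coincides with the policy gradient; a minimal sketch, where collecting the token log-probabilities of the sampled sequence during decoding is assumed given:

```python
import torch

def policy_gradient_loss(sampled_log_probs, reward):
    """REINFORCE surrogate for Eq. 7.

    sampled_log_probs: (batch, seq_len) log-probabilities of the sampled
    target y^s under the model; reward: (batch,) SC and/or BLEU rewards.
    Minimizing this loss follows E[R * grad log p(y^s | x)].
    """
    seq_log_prob = sampled_log_probs.sum(dim=1)      # log p(y^s | x)
    return -(reward.detach() * seq_log_prob).mean()

# Hypothetical combination with the base loss (Eq. 1 or Eq. 2):
# total_loss = base_loss + policy_gradient_loss(log_probs, r_sc + r_bleu)
```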

Experiments
Dataset Grammarly's Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault, 2018) is a formality style transfer dataset with parallel formal and informal sentences from two domains: Entertainment & Music (E&M) and Family & Relationships (F&R). Table 1 shows the number of sentences in train, validation, and test. Four human references exist for every validation/test sentence.
Setup All experiments are implemented atop Huggingface's transformers (Wolf et al., 2020). Our base models are the GPT-2-based model (117M parameters) and the BART-based model (base, with 139M parameters, and large, with 406M). We fine-tune them with the Adam optimiser (Kingma and Ba, 2015) with batch size 32; the initial learning rates are 5e-5 (GPT-2) and 3e-5 (BART). The final values for λ are set to 1 for SC and 0.2 for BLEU, based on validation results. We use early stopping.

Table 2: Comparison of our models to previous work. The best score for each metric in each block is boldfaced. Notes: (i) where the output of previous work is available, we re-calculate the scores using our evaluation metrics; otherwise, scores are taken from the respective paper and marked as such.

Results Figure 2 shows the HM score of x%-sized training sets on the E&M and F&R domains. Increasing the training set size from 10% to 50% boosts GPT-2-based models more than BART-based ones; however, BART-based models obtain the highest results. Table 2 reports a selection of our models and previous state-of-the-art work. Zooming in on the single measures, we see in Table 2 how varying training size reveals the impact of parallel data on content preservation: OpenNMT's BLEU score on E&M increases from 0.231 with 10% of the data to 0.403 with 100%. Style accuracy appears instead easier to achieve, even with limited supervision. Increasing training size for fine-tuning either pre-trained model does not, however, yield dramatic improvements in content preservation (e.g. from 0.547 to 0.577 BLEU for BART base on E&M). In fact, fine-tuning a pre-trained model (either GPT-2 or BART) with just 10% of parallel data leads to better content preservation (0.547 BLEU with BART on E&M) than OpenNMT with 100% (0.403). This suggests that content preservation is already largely taken care of by the pre-trained models, which can explain why the BLEU-based reward does not help much in isolation (see Fig. 2). Conversely, the SC reward consistently boosts style accuracy in both BART and GPT-2. Nevertheless, combining rewards can be beneficial. Overall, BART-based models perform better on content preservation, while results on style strength are mixed. Given the experimental setup of some previous work, we ran additional comparisons (blocks (B), (C), and (D) of Table 2). In all cases, our results are higher than the previous state of the art. For example, in F&R (D) our model with 10% parallel data outperforms Shang et al. (2019)'s semi-supervised model, which uses about 9.5% parallel data and large amounts of non-parallel data (BLEU 0.571 vs 0.379). Fine-tuning BART on both domains (C) leads to the best results to date on both datasets (E&M: 0.719; F&R: 0.728).
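The HM score plotted in Figure 2 is, under our reading, the harmonic mean of style accuracy and BLEU; a minimal sketch under that assumption:

```python
def hm(style_accuracy: float, bleu: float) -> float:
    """Harmonic mean of style accuracy and BLEU, both in [0, 1]."""
    if style_accuracy + bleu == 0:
        return 0.0
    return 2 * style_accuracy * bleu / (style_accuracy + bleu)

# e.g. hm(0.9, 0.547) is roughly 0.680
```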
With respect to the two evaluation metrics used for content preservation (BLEU and BLEURT), we can observe in Table 2 that they follow a similar trend. In fact, they correlate very highly (Pearson's r = .951, p < .001, n = 14 for E&M, and r = .951, p < .001, n = 13 for F&R). Table 3 shows example outputs and their evaluation according to the metrics we use; the outputs are produced by the existing systems we compare to and by our own models (more examples are in the Appendix). In the "Informal to Formal" example, the text generated by most systems is assessed with high confidence in style conversion, except for PBMT-Combined (Rao and Tetreault, 2018) and Transformer (Zhang et al., 2020) (the name "omarionhe" should be "Omarion", and the word "he" at the beginning of the sentence should be "He"). However, the sentences generated by previous systems are not very fluent, and some of them fail to preserve content (Transformer (Zhang et al., 2020) with "omarionhe", and Chawla and Yang (2020) with "Marion"). Among our models, the Bi-LSTM based model fails in content preservation, while the systems based on pre-trained models are much better at this. Our model based on BART large generates this specific sentence accurately in terms of content preservation, style strength, and fluency.
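The reported correlation can be reproduced along these lines; the score lists below are placeholders, the actual per-system BLEU and BLEURT values coming from Table 2:

```python
from scipy.stats import pearsonr

bleu_scores = [0.231, 0.403, 0.547, 0.577]   # placeholder system-level scores
bleurt_scores = [-0.35, -0.10, 0.02, 0.05]   # placeholder system-level scores

r, p = pearsonr(bleu_scores, bleurt_scores)
print(f"Pearson's r = {r:.3f}, p = {p:.3f}")
```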

Finer-grained Analysis
When looking at the "Formal to Informal" example in Table 3, we observe that the two previously existing systems replace very little (one comma, in the case of Bi-directional FT (Niu et al., 2018)) or nothing at all (PBMT-Combined (Rao and Tetreault, 2018)). Conversely, our systems make substantial modifications, resulting in output sentences that are noticeably more informal than the input sentence. OpenNMT and the GPT-2-based models lose part of the content (the suggestion to avoid hot dogs), while the two BART-based systems manage to preserve the whole message.

Conclusions
Fine-tuning pre-trained models proves a successful strategy for formality style transfer, especially towards content preservation, thereby reducing the need for parallel data. A sequence-to-sequence pre-trained model (BART) outperforms a language model (GPT-2) in content preservation and overall, and with the addition of rewards achieves new state-of-the-art results. The fact that GPT-2 is instead often better at style strength could be (partly) due to how the style reward is implemented in the two models (Eqs. 4 and 5), and will need further investigation. For a better understanding of the different behaviour of BART and GPT-2 on this task, the next natural step is to include human evaluation.

Acknowledgments
This work was partly funded by the China Scholarship Council (CSC). We are grateful to the anonymous ACL reviewers for their useful comments, which contributed to improving this paper and its presentation. We would also like to thank the Center for Information Technology of the University of Groningen for their support and for providing access to the Peregrine high performance computing cluster.

Impact Statement
All work that automatically generates and/or alters natural text could unfortunately be used maliciously. While we cannot fully prevent such uses once our models are made public, we do hope that writing about risks explicitly and also raising awareness of this possibility in the general public are ways to contain the effects of potential harmful uses. We are open to any discussion and suggestions to minimise such risks.

References
Kunal Chawla and Diyi Yang. 2020. Semi-supervised formality style transfer using language model discriminator and mutual information maximization.

A Appendices
These appendices include: 1) detailed results for all experiments (A.1); 2) more details on the policy gradient (A.2); 3) example outputs of various models and their sentence-level scores, to give an idea of what the generated sentences look like when style transfer is applied; here we specifically focus on the 100% parallel data settings for our models (A.3).

A.1 Detailed Results of Models
We report here the full set of results for all our models and previous work.

A.2 Policy Gradient
Reinforcement learning (RL) is a sub-field of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the cumulative reward. Here, we employ the policy gradient algorithm (Williams, 1992) to maximize the expected reward (style strength and/or content preservation) of the generated sequence y^s, whose gradient with respect to the parameters φ of the neural network model is estimated by sampling as:

\nabla_\phi J(\phi) = \mathbb{E}\left[ R \, \nabla_\phi \log p_\phi(y^s) \right] \approx \frac{1}{N} \sum_{i=1}^{N} R_i \, \nabla_\phi \log p_\phi(y^s_i)

where J(·) is the objective function, ∇_φ J(·) is its gradient with respect to φ, R_i is the reward of the i-th sequence y^s_i sampled from the distribution of model outputs at each decoding time step, N is the sample size, and E[·] is the expectation.
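A sketch of this N-sample Monte Carlo estimate expressed as a loss to minimize; how the per-sample sequence log-probabilities are collected during decoding is an assumption:

```python
import torch

def estimated_policy_gradient_loss(sample_log_probs, sample_rewards):
    """(1/N) * sum_i R_i * grad log p(y^s_i), as a surrogate loss.

    sample_log_probs: list of N scalar tensors, log p(y^s_i) of each
    sampled sequence; sample_rewards: list of N floats, the rewards R_i.
    """
    losses = [-r * lp for r, lp in zip(sample_rewards, sample_log_probs)]
    return torch.stack(losses).mean()
```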