Audience-Centric Natural Language Generation via Style Infusion

Adopting contextually appropriate, audience-tailored linguistic styles is critical to the success of user-centric language generation systems (e.g., chatbots, computer-aided writing, dialog systems). While existing approaches demonstrate textual style transfer with large volumes of parallel or non-parallel data, we argue that grounding style on audience-independent external factors is innately limiting for two reasons. First, it is difficult to collect large volumes of audience-specific stylistic data. Second, some stylistic objectives (e.g., persuasiveness, memorability, empathy) are hard to define without audience feedback. In this paper, we propose the novel task of style infusion - infusing the stylistic preferences of audiences in pretrained language generation models. Since humans are better at pairwise comparisons than direct scoring - i.e., is Sample-A more persuasive/polite/empathic than Sample-B - we leverage limited pairwise human judgments to bootstrap a style analysis model and augment our seed set of judgments. We then infuse the learned textual style in a GPT-2 based text generator while balancing fluency and style adoption. With quantitative and qualitative assessments, we show that our infusion approach can generate compelling stylized examples with generic text prompts. The code and data are accessible at https://github.com/CrowdDynamicsLab/StyleInfusion.


Introduction
In this paper, we develop a novel approach to infuse audience-centric styles into pretrained natural language generation (NLG) models. Learning to synthesize subjective styles is crucial to various applications, for instance, persuasion and memorability in computational advertising and marketing (van Noort et al., 2020). User-centric applications of language generation, such as writing aids, chatbots, and dialog systems, often require these stylistic adjustments depending on both the audience and the task. Prior work often defines textual style with large static sentence collections. However, stylistic objectives such as persuasiveness, memorability, and empathy are hard to define without a target audience (Bell, 1984) due to non-uniform stylistic expectations across diverse user groups. Thus, we suggest that subjective text styles and traits must be defined by the target audience instead of audience-independent data. Our work focuses on two resulting challenges: first, how to collect target audience feedback, and second, how to leverage the limited feedback efficiently for style infusion.
Textual styles, i.e., different linguistic presentations of the same conceptual content, play an integral role in persuasive and memorable communication. For instance, an informal style is less persuasive in formal settings (Kim et al., 2019). The style problem extends across diverse domains, from empathic styling in mental health (Cameron et al., 2018) to fact-driven, simplistic styling in tech support (Okuda and Shoda, 2018). Existing work in textual style transfer (TST) takes two general approaches. Strictly supervised approaches leverage fixed parallel corpora, analogous to machine translation (Hu et al., 2017b), while semi-supervised and unsupervised techniques leverage non-parallel collections of stylized sentences (Shen et al., 2017). Predefined metrics, heuristics, external oracles, and hybrid approaches have also been considered (Jain et al., 2019; Jin et al., 2019a).
Constructing audience-centric or time-evolving/adaptive methods for style transfer remains an open challenge. Existing approaches are guided by rigid modeling considerations and the distributions of fixed style-specific corpora. This is innately limiting for stylistic objectives such as persuasiveness, a trait with a widely disputed definition in existing literature and multiple external confounds, such as preexisting biases and independent features of the persuader (e.g., how many followers they have) (Al Khatib et al., 2020; Moran et al., 2016; Lowrey, 1998; Berger and Milkman, 2012; Murphy, 2001). Furthermore, it is infeasible to collect extensive annotated collections of text for each audience, style, and application (Pennebaker and King, 1999).
Unlike prior work, we define and incorporate style grounded on our target audience. To address dynamic settings requiring audience-centric linguistic styles, we propose the novel style-infusion task. Since human reviewers are better at pairwise style comparisons than direct scoring (Shah et al., 2014), we formulate style infusion as follows: how do we infuse the stylistic preferences of our audience, via pairwise sentence comparisons, in a generative language model (LM)? Unlike conventional style transfer, our task leverages domain- and audience-specific feedback instead of parallel or non-parallel sentence collections rendered in any specific style. Further, we adopt an incremental training approach rather than retraining models from scratch.
We bootstrap an initial style analysis model to discriminate the positive and negative samples from audience feedback. Our model then selects additional samples from a generic topical sentence collection to expand the seed set of audience judgments. By separating style analysis and text generation models, we create an adversarial setup to infuse the audience's stylistic feedback in any generative LM. We weight the noisy reward from the style analysis model (discriminator) with a reconstruction loss to balance style adoption and fluency.
In summary, our contributions are as follows: 1. Audience-centric Style Infusion: To our knowledge, we are the first to formulate the task of style infusion to tether the definition of style to the target audience. In contrast, prior work defines style in a purely data-driven manner (Shen et al., 2017; Yang et al., 2018). External data limits the definition of style to the context in which it was collected. We propose a more human-centric approach to text styling through explicit audience feedback via pairwise comparisons.

2. Decoupling Style:
We decouple the style analysis and language generation models for versatility and simplicity. Prior work often unifies these tasks in a single training setup, thus sacrificing incremental learning and infusion of new stylistic preferences of audiences (Jain et al., 2019; Jin et al., 2019a). We introduce an automatically weighted loss, combining an independent reconstruction loss for generation and a discriminator-based loss for style, producing a more robust representation of style than in fused settings.

3. Automatic Style Evaluation:
To the best of our knowledge, we are the first to automatically evaluate the transfer of memorability/persuasiveness. Existing literature has relied on costly manual evaluation, as these two hard-to-define stylistic objectives lack prior generative work (Li et al., 2020; Tan et al., 2016; Danescu-Niculescu-Mizil et al., 2012). We introduce a new audience-centric correlation metric using a hierarchical Bayesian model to compute the correlations of linguistic features with audience feedback. We then evaluate our model's generations based on their agreement with these audience correlations.

Related Work
Prior work has explored "style transfer" in diverse settings, ranging from "clickbait" headlines to formalizing text (Jin et al., 2020; Chawla and Yang, 2020; Xu et al., 2019). While strictly supervised approaches show high fidelity to input samples (Hu et al., 2017b; Jhamtani et al., 2017), unsupervised and minimally supervised learning are widely applicable since parallel samples are often unavailable (Shen et al., 2017; Yang et al., 2018). Disentanglement, prototype editing, and pseudo-parallel corpus creation are popular approaches: prototype editing applies stylistic markers to predefined sentence templates (Guu et al., 2018; Li et al., 2018), while disentanglement extracts style independent of the content (Shen et al., 2017; Hu et al., 2017a). Audience-centric feedback may not conform to these rigid hypotheses. First, unconstrained generation allows for freedom in sentence- and paragraph-level constructs to define the style (Li et al., 2020). Second, the separability of content and style is harder in specialized domains reliant on domain-specific jargon (Woodward-Kron, 2008) and expressions. Our bootstrapping approach shares some commonalities with pseudo-parallel corpus creation (e.g., aligning sentences from two mono-style corpora) (Jin et al., 2019b; Zhang et al., 2018), but only utilizes a generic topical corpus to expand the audience-generated "seed set" of pairwise judgments. Adversarial training has also been used to quantify style (Yang et al., 2022). Our approach explicitly decouples the style discrimination and generation tasks for modularity and incremental training purposes.
We pick two stylistic objectives that are highly audience-dependent and hard to define objectively, memorability and persuasiveness, to evaluate our approach. Prior work in these styles has been limited to analysis but not generation. Tan et al. (2016) and Li et al. (2020) find that linguistic patterns, interaction dynamics, and discourse structure are strong identifiers of persuasive arguments, while convincingness (Habernal and Gurevych, 2016) and memorability (Danescu-Niculescu-Mizil et al., 2012) have been better explained by linguistic feature correlations. However, there is a lack of work on unconstrained generation of persuasive and memorable text (Dürr and Gloor, 2021; van Noort et al., 2020). Our approach enables us to bridge some of these specific gaps while maintaining a generalized overall formulation.

Discriminative Language Model
In this section, we train a BERT-based style discriminator to provide feedback to our generator.

Model Architecture and Training
Our style discriminator (style analysis module) adds a fully connected (FC) layer with dropout to pre-trained BERT (Devlin et al., 2019). We use the 'bert-base-uncased' model (Wolf et al., 2019) (12 layers, 768 dimensions). We jointly tokenize the compared pair of sentences, concatenated with a '<SEP>' token. The FC layer generates a single output (R^768 → R). We threshold the sigmoid of the output at 0.5 to decide the preferred sentence. We train all layers (including BERT) on the pairwise audience feedback (batch size 32, 5 epochs, η = 0.0001, dropout = 0.2). We also train a Siamese BERT architecture (Reimers and Gurevych, 2019) with the same settings but find it to underperform BERT (results in Appendix A).
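The pairwise scoring head can be sketched as follows. This is a minimal illustration only: a toy bag-of-embeddings encoder stands in for the actual 'bert-base-uncased' backbone, so the sketch shows just the dropout + FC head (R^768 → R) and the sigmoid thresholding; all names are hypothetical.

```python
import torch
import torch.nn as nn

class PairwiseStyleDiscriminator(nn.Module):
    """Score a jointly tokenized sentence pair ('a <SEP> b') with one logit.
    A toy embedding encoder stands in for the BERT backbone."""

    def __init__(self, vocab_size=30522, hidden=768, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)  # stand-in for BERT
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden, 1)                 # R^768 -> R

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)     # crude pooling
        return torch.sigmoid(self.fc(self.drop(pooled))).squeeze(-1)

model = PairwiseStyleDiscriminator()
model.eval()
with torch.no_grad():
    score = model(torch.randint(0, 30522, (2, 16)))   # batch of 2 pairs
# score[i] > 0.5 -> the first sentence of pair i is preferred
```

In the actual model, `token_ids` would come from the BERT tokenizer applied to the concatenated sentence pair, and the whole stack (backbone included) is trained on the pairwise feedback.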

Pairwise Feedback Datasets
We select one pairwise feedback dataset each for the persuasiveness and memorability tasks to evaluate our approach. The UKPConvArg1 corpus (Habernal and Gurevych, 2016) presents pairs of arguments where human annotators select the more persuasive argument. The authors generate 16,000 argument pairs over 16 distinct, non-overlapping topics. Both arguments in a pair belong to the same topic and argue for the same stance (i.e., parallel pairwise feedback). For memorability, we leverage the Cornell Movie-Quotes Corpus (Danescu-Niculescu-Mizil et al., 2012), containing 2,200 paired movie quotes with crowdsourced memorability annotations.

Observations and Validation
Our discriminator achieves 89% accuracy over 5-fold cross-validation for the persuasiveness task. We further validate for overfitting by holding out two topics as the test set and training on the remaining topics, ensuring the discriminator has no exposure to these held-out topics during training. After training from scratch, the discriminator still achieves 87% accuracy on the held-out topics. On the Cornell Movie-Quotes corpus, the discriminator achieves 80% accuracy. We repeat the held-out topic test to validate the classification performance for the memorability task.
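The topic-held-out protocol can be sketched with scikit-learn's grouped splitting, an assumed tooling choice for illustration (the paper does not specify its implementation); the key property is that no topic appears in both splits.

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy argument pairs labelled with their debate topic; holding out whole
# topics checks that the discriminator generalizes beyond seen topics.
pairs = [f"pair_{i}" for i in range(12)]
topics = [i % 4 for i in range(12)]  # 4 hypothetical topics

# test_size=0.5 holds out 2 of the 4 topic groups entirely
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(splitter.split(pairs, groups=topics))

train_topics = {topics[i] for i in train_idx}
test_topics = {topics[i] for i in test_idx}
assert train_topics.isdisjoint(test_topics)  # no topic leakage
```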
In summary, these tests validate the ability of our style discriminator to learn audience style preferences with small volumes of pairwise feedback. In Section 4, we describe our approach to infuse the style discriminator feedback into a generative language model.

Style-Aware Language Generation
In this section, we infuse the stylistic preferences learned by our style discriminator in Section 3 into a GPT-2 model (Radford et al., 2019) pretrained on the causal language modeling (CLM) objective. The model takes in a textual prompt and generates text, y, that we want to infuse with the audience-preferred style. For the persuasiveness task, the UKPConvArg1 dataset provides prompts for each argument pair. For the memorability task, we use the previous sentence as the prompt.
During training, we feed the prompt and feedback pair to GPT-2: the more-preferred (styled) sample, y*_s, and the corresponding less-preferred (non-styled) sample, y*_ns. We use an adversarial training paradigm to enable the generator to learn from the discriminator, illustrated in Figure 1.

Training
We utilize two losses during training: a reconstruction loss, L_R, and a discriminator loss, L_D. The reconstruction loss teaches the model to mimic the gold-standard samples. The discriminator loss is meant to maximize the score of the discriminator, D, and is formulated as:

L_D = -Σ_i (D(y, y*_ns) - R̄_i) · log p(y^(i) | y^(<i))

where y^(i) is the i-th token of the generated sentence y, and R̄_i is a baseline reward meant to reduce the noise from the discriminator. We elaborate on the baseline reward in Appendix B.
We find that too strong of a discriminator loss negatively impacts fluency. Thus, we introduce a regularization constant, β, to ensure that the discriminator loss remains only a fraction of the total loss. The two losses are weighted together to create the final loss as follows:

L = L_R + C · L_D

where C = β(1 − α_S) and α_S = D(y*_s, y*_ns). Note that y*_ns is the non-styled training argument. Instead of making the weighted ratio between the two losses constant, we make it sample-dependent. The intuition is that when α_S is high (e.g., the sample is persuasive), we can simply use the reconstruction loss to replicate the gold standard, which will directly reflect the style. However, when α_S is low (i.e., we have a weak sample), we instead switch to learning the trends from the discriminator. This loss is referred to as the sample-dependent discriminator (SD) loss. We also compare the discriminator loss with a simpler sample-dependent supervised (SS) loss.
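The sample-dependent weighting can be sketched as follows (hypothetical function name; scalar losses stand in for the actual L_R and L_D values): with C = β(1 − α_S), a strongly styled gold sample (α_S near 1) reduces to pure reconstruction, while a weak sample leans on the discriminator signal, capped at β.

```python
def style_infusion_loss(l_r, l_d, alpha_s, beta=0.5):
    """Combine reconstruction loss l_r and discriminator loss l_d with the
    sample-dependent weight C = beta * (1 - alpha_s), where alpha_s is the
    discriminator's preference score D(y*_s, y*_ns) in [0, 1]."""
    c = beta * (1.0 - alpha_s)
    return l_r + c * l_d

# alpha_s = 1: gold sample already styled -> reconstruction only
assert style_infusion_loss(2.0, 4.0, alpha_s=1.0) == 2.0
# alpha_s = 0: weak sample -> discriminator term at full weight beta
assert style_infusion_loss(2.0, 4.0, alpha_s=0.0) == 4.0
```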

Dataset Augmentation Approach
The UKPConv1 and Cornell Movie-Quotes corpora we presented in Section 3.2 provide approximately 16,000 and 2,200 unique pairs for stylistic feedback, not nearly enough to train a large language model. To increase our model's breadth of knowledge, we generate additional pairwise feedback with the CNN/Daily Mail dataset (See et al., 2017), containing over 300,000 unique news articles. First, we generate the Universal Sentence Embeddings (Cer et al., 2018) of all unique sentences in our style corpora (UKPConv1, Cornell) and external corpus (CNN/Daily Mail). For each candidate sentence in the external dataset, s_i, we find the top-k similar sentences (y_1, ..., y_k) in the style corpus to be augmented. We then perform the pairwise comparisons D(s_i, y_j) > 0.5 for j ∈ {1, ..., k}: if the discriminator prefers the candidate external sentence s_i over any of the similar sentences y_j from the style corpus, we include the pair. Through this bootstrapped augmentation method, we ensure we add sentences that are relatively more "styled", as defined by our discriminator, and similar to those in our existing corpus.
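A sketch of the bootstrapped augmentation loop, under stated assumptions: all names are hypothetical, random vectors stand in for Universal Sentence Encoder embeddings, a toy callable stands in for the BERT discriminator, and each candidate/neighbour pair passing the D > 0.5 check is kept (one plausible reading of the inclusion rule).

```python
import numpy as np

def augment_pairs(cand_embs, style_embs, discriminator, k=3, thresh=0.5):
    """For each external candidate, find its top-k most cosine-similar
    style-corpus sentences, then keep every pair where the discriminator
    prefers the candidate (D(s_i, y_j) > thresh)."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = normalize(cand_embs) @ normalize(style_embs).T
    new_pairs = []
    for i, row in enumerate(sims):
        for j in np.argsort(row)[::-1][:k]:       # top-k neighbours
            if discriminator(i, int(j)) > thresh:  # D(s_i, y_j) > 0.5
                new_pairs.append((i, int(j)))
    return new_pairs

rng = np.random.default_rng(0)
pairs = augment_pairs(rng.normal(size=(4, 8)),    # 4 external candidates
                      rng.normal(size=(10, 8)),   # 10 style sentences
                      discriminator=lambda i, j: 0.8)  # toy: always prefers s_i
```

With the toy always-preferring discriminator, every candidate contributes k pairs; the real discriminator filters most of them out.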

Style-Aware Generation with GPT-2
The OpenAI GPT-2 (Radford et al., 2019) model is a large transformer-based language model pretrained on nearly 8 million web pages, allowing generalization to many domains and tasks. This is closer to the unconstrained scenarios that we wish to target with our style-infusion task. Alternate generators such as pointer-generators rely on copying (Xu et al., 2019), thus introducing more limitations on the extent of style infusion. The ability to extensively pretrain transformer-based models makes them more widely applicable for style infusion (Gururangan et al., 2020).
In this section, we introduced the adversarial training mechanism for the style-aware language generator and the bootstrapped data augmentation method used to produce robust generations. Next, we will introduce the baselines, evaluation metrics, and training settings.

Experimental Settings
We compare our architecture against several strong baselines:

Pretrained GPT-2 (Radford et al., 2019) We use this pre-trained model as a representation of average text, allowing us to show shifts in style that occur due to training.
Fine-tuning We fine-tune the pre-trained GPT-2 model using the reconstruction objective on the style-specific corpus only (e.g., UKPConv1).
Fine-tuning + Data Augmentation We fine-tune the pre-trained GPT-2 model using the reconstruction objective on the augmented data.
TitleStylist (Jin et al., 2020) We adapt the stylistic headline generation framework to generate stylistic text based on a prompt. Jin et al. (2020) utilize a Denoising Autoencoder with parameter sharing to disentangle style from content and control the style with a set of parameters.
Training Settings For all GPT-2 based models, we use a base GPT-2 model from Huggingface (Wolf et al., 2019) (1024 dimensions, Adam optimizer, η = 5e-5). Because of the length of our text and the size of our models, we utilize DeepSpeed (Rasley et al., 2020) to distribute training over two 32GB V100s, and we train with FP16 mixed precision. We experiment with the loss parameters C and β and discuss our findings in Section 8.
Evaluation Metrics We take a deeper look at the annotator labels in the UKPConvArg1 dataset and find that some linguistic features play a significant role in the persuasiveness of text.
We create a hierarchical Bayesian model to find the correlation between a set of collected linguistic features and the desired style.We first take the unique sentences from a dataset and compute a set of linguistic features over them.A full list of features can be found in Appendix C.
For each linguistic feature-topic pair, we infer the correlation between the feature and the text that demonstrates the style by running a Markov Chain Monte Carlo (MCMC) process using the No-U-Turn Sampler (NUTS). We elaborate on the calculations in Appendix C.2. Note that the results we show are on the logit scale, meaning even a change of ±1 has a big effect on the outcome (about a 23-point shift in the probability of winning).
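The logit-scale effect quoted above can be checked numerically: moving a feature's effect from 0 to +1 on the logit scale shifts the modelled win probability from 0.50 to roughly 0.73.

```python
import math

def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

p0, p1 = sigmoid(0.0), sigmoid(1.0)  # 0.500 -> ~0.731
delta = p1 - p0                      # ~0.231, i.e. about 23 points
```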
The models then generate text based on the prompts in a held-out test set, and we calculate the features of the generations. We run a t-test to determine if the difference in features between a pretrained GPT-2 and one of our models is statistically significant. This evaluation shows how our trained model learns to use these linguistic features to construct more stylized arguments.
In addition, we use the pyrouge library to collect ROUGE (Lin, 2004) scores, a commonly used metric that measures the N-gram overlap between the training and generated arguments. While these scores will not tell us how persuasive our generations are, they ensure that the generations remain on topic.
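As a toy illustration of the N-gram overlap idea behind ROUGE (not the pyrouge implementation used here, which also handles stemming, multiple references, and the longest-common-subsequence variant):

```python
from collections import Counter

def rouge_n_recall(reference, candidate, n=1):
    """Toy ROUGE-N recall: clipped n-gram overlap divided by the number
    of n-grams in the reference."""
    def ngrams(text, n):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

score = rouge_n_recall("plastic bottles pollute the environment",
                       "plastic bottles are bad for the environment")
# 4 of the 5 reference unigrams appear in the candidate -> 0.8
```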
Lastly, we compute the BERTScore (Zhang et al., 2019), another automatic evaluation metric that computes token similarity using contextual embeddings. The BERTScore represents the semantic similarity of the generations to the test set, which ensures generations are relevant, but not necessarily persuasive.

Results on Persuasiveness
In this section, we analyze our results by showing significant usage of linguistic features associated with persuasive text, presenting sample generations, and reporting standard metrics.

Linguistic Feature Correlations
Figure 2 shows the correlations between linguistic features and convincingness in the UKPConvArg1 corpus. The model details are in Appendix C.
We find a strong positive correlation between readability and winning arguments. This is reflected both by readability scores (e.g., SMOG, Flesch-Kincaid) and by correlations with smaller words, fewer total dependencies, and a smaller overall total dependency distance. We notice a positive correlation with speed and volume (Toubia et al., 2021). We also find a negative correlation with passive voice and a positive correlation with misspelled words (not shown for brevity).
We run a significance test to see how well our models learned the style (see Appendix C.3). Most models consistently learn pronounced trends (i.e., Brunet index, length, speed, and volume). The augmented data likely led to this change, because fine-tuning on the augmented set displays the same trends. In cases like total dependencies (TD) and the ratio of present-tense verbs, models trained with the sample-dependent discriminator (SD) loss are significantly better at learning the trend, despite the data not actively showing the trend (or completely opposing it). In the case of the Flesch score, models trained with the SD loss can nullify the trend, which occurs in the incorrect direction. This shows that models trained with the SD loss are substantially better at learning from the dataset than the baselines and models trained with the sample-dependent supervised (SS) loss. One failure case is the ratio of jargon, likely because the model could not generate out-of-vocabulary words, but this is a limitation of how we define jargon.

Sample Generations
Table 1: The baseline models and our model (GPT-2 with the SD loss and β = 0.5 on the augmented data). The models are fed the inputs "Plastic water bottles are bad because" and "Gaming is good for child development because", generating up to 100 tokens using beam search.

EXAMPLE 1:
GPT2: Plastic water bottles are bad because they are not recyclable," he said. "They are not good for the environment, they are not recyclable."
GPT2-Aug: Plastic water bottles are bad because it causes pollution. Also water bottles also cost more than regular water.
TStylist: Plastic water bottles are bad because plastic water bottles are bad for our environment.
Ours (SD-0.5): Plastic water bottles are bad because they are not recycled and end up in landfills which lead to pollution of our environment. Plastic requires up to 47 million gallons of oil per year to produce.

EXAMPLE 2:
GPT2: Gaming is good for child development because it teaches them how to use a computer. It's good for the economy, it's good for the environment, it's good for the children, it's ...
GPT2-Aug: Gaming is good for child development because it allows the child to grow and develop. I believe eSports (LoL) should be a spectator sport and not a major part of the sports calendar.
TStylist: Gaming is good for child development because gaming is good for your child.
Ours (SD-0.5): Gaming is good for child development because it allows children to grow up in a world where they are exposed to a wide variety of ideas and experiences.
Table 1 shows the generations of three baselines and our best-performing method. We find that for both prompts, the generations of models trained with the sample-dependent discriminator (SD) loss generally have the highest values of speed, volume, and lexical diversity. For the second prompt, the speed and volume of our generation are larger than those of GPT2 and TStylist, but slightly smaller than those of GPT2 fine-tuned on the augmented data. Intuitively, this makes sense because the GPT2-Aug generation covers much more information in the same time frame; however, this information isn't relevant to the argument, making our generation much more sensible. The baselines often suffer from neural degeneration, but the model trained with the SD loss does not face this issue. Since length had a strong negative correlation with persuasiveness, the model likely implicitly learned from the discriminator to handle this kind of neural degeneration. However, it is still an issue in some cases with out-of-domain samples.
Table 2: ROUGE-{1, 2, L} scores and BERTScores (F1) for all models. Baseline models: GPT2, GPT-2 fine-tuned on UKPConvArg1, GPT-2 with augmented data, TitleStylist (Jin et al., 2020). Our models are trained on augmented data and a sample-dependent discriminator (SD) or sample-dependent supervised (SS) loss with parameter β. The baseline ROUGE score increases due to data augmentation; the relevance of our models' generations is largely insensitive to loss type and parameter value.

Automatic Metrics
We compare the ROUGE scores of our experimental models in Table 2, ensuring that the topics in the test set are not discussed anywhere in the UKPConvArg1 or augmented datasets. The data augmentation leads to a sharp increase in the ROUGE scores of the generations, showing that it is essential for robust and relevant generations. The results are relatively insensitive to variation in the β parameter that controls the tradeoff between the reconstruction loss (L_R) and the discriminator loss (L_D). These scores show that our models generate relevant, but not necessarily persuasive, text. We find similar insights from the BERTScore; although the augmentation has a slight negative impact on the score, the difference is negligible.

Results on Memorability
In this section, we focus on memorability and show that our model can generate more robust, relevant, and memorable text than the baselines.

Linguistic Feature Correlations
We train a Bayesian hierarchical model for the Cornell Movie-Quotes corpus, which produces the correlations shown in Figure 3. We find a strong negative correlation with long, winding text, shown by the trends in total dependencies, total dependency distance, length, and circuitousness. A higher circuitousness implies that a less direct route was taken to convey information (Toubia et al., 2021).
Circuitousness is detrimental to memorability, as winding text tends to be harder to remember. The negative correlation with the punctuation rate and the positive correlation with the average dependencies show that more memorable text tends to have few sentences, independent of sentence length. Lastly, there is a strong emphasis on uncommon vocabulary, with a negative correlation with the Brunet index and a positive correlation with the token type ratio. This is supported by findings from Danescu-Niculescu-Mizil et al. (2012), who find that memorable quotes are built upon less common word choices.
We run the significance test on a held-out test set to see how well our models learned to generate memorable text. In some cases in Table 8, models trained with the sample-dependent discriminator (SD) loss have similar performance to the fine-tuned models, indicating that some relevant features are learned solely from fine-tuning. However, many other incorrect trends are corrected by training with the SD loss. The only feature that does not improve is the pronoun rate, likely because of shorter sentences with more emphasis on uncommon word choices.

We look at a few examples of generations in Table 3 to see how training influenced the model's outputs. There is a visible decrease in the number of sentences, overall length, and circuitousness in both the GPT2-Aug and SD-0.5 models. This reflects the trends shown in Table 8, but we can see that the model trained with the SD loss generates sentences that are more sophisticated than the fine-tuned GPT2 model's. TitleStylist generates sentences that are feature-wise identical to our generation but not as sophisticated.
EXAMPLE 1:
GPT2: The more you know about it, the more likely you are to believe that it is true. If you don't believe that it is true, you're not alone. If you don't believe that it is true, you're not alone...
GPT2-Aug: The more you smoke, the more you will smoke
TStylist: The more you need to know.
Ours (SD-0.5): The more you learn, the more you see.

EXAMPLE 2:
GPT2: When solving problems, dig a hole in the ground and dig a hole in the ground and dig a hole in the ground and dig a hole in the ground and ...
GPT2-Aug: When solving problems, dig your heels in and try to find a solution, even if you don't have the answers, and even if you don't know the answers.
TStylist: When solving problems, dig better and better.
Ours (SD-0.5): When solving problems, dig deeper than a grave.

Table 3: Generations of GPT2, GPT2 fine-tuned on augmented data, and GPT2 with the SD loss (β = 0.5) on the augmented data. The models are fed the inputs "The more you" and "When solving problems, dig".

Automatic Metrics
Similar to persuasiveness, Table 4 shows that the ROUGE scores increase mainly due to data augmentation. Once again, these results demonstrate that the augmented data leads to more relevant generations, increasing the breadth of knowledge transferred to the model. The same trends generally hold for the BERTScore, which shows that the generations remain semantically relevant.
We show that our model generates more robust, relevant, and memorable text than the baselines. Next, we discuss how tuning the loss parameters affects generations.
Table 4: ROUGE-{1, 2, L} scores and BERTScores (F1) for all models. Baseline models: GPT2, GPT-2 fine-tuned on UKPConvArg1, GPT-2 with augmented data, TitleStylist (Jin et al., 2020). Our models are trained on augmented data and a sample-dependent discriminator (SD) or sample-dependent supervised (SS) loss with parameter β. The baseline ROUGE score increases due to data augmentation; again, the relevance of generations is largely independent of loss type and parameter value.

Empirical Observations
We analyze how the value of β affects generations, finding that generations with β = 0.1 suffer the same degeneration as fine-tuning, while higher values avoid these issues. Because α_S is not always 1, the constant in front of the discriminator loss is less than β; consequently, at low β the discriminator is not given enough weight, and the generator cannot learn as effectively from it. It is difficult to distinguish between β = 0.5 and β = 1.0, but both outperform every other value of β.
We also experiment with hard-coding the coefficients for the discriminator and reconstruction losses in Table 6. Putting too much weight on the discriminator loss, L_D (i.e., 0.9), leads to poor-quality arguments despite some of the strongest linguistic feature changes (e.g., shorter length). Conversely, limiting L_D to 0.1 leads to much stronger generations. We introduced the β parameter to cap the weight of L_D at β; because of this cap, the previous experiments show similar but less obvious trends.

Table 6: Generations of our model trained with a mixed reconstruction and discriminator loss objective with hard-coded weights (as opposed to sample-dependent).

EXAMPLE 1:
0.9 Supervised + 0.1 MLE: Schools should teach physical education because it's a good thing.
0.1 Supervised + 0.9 MLE: Schools should teach physical education because PE helps children develop good habits later on in life. Plus, there's the benefit of working together as a team that doesn't always happen in other classes.

EXAMPLE 2:
0.9 Supervised + 0.1 MLE: Plastic water bottles are bad because they are not recyclable.
0.1 Supervised + 0.9 MLE: Plastic water bottles are bad because they are bad for the environment and they are bad for the economy. Some people think that bottled water is bad for consumers and should only be used in situations such as disasters when no other clean water is available.

EXAMPLE 3:
0.9 Supervised + 0.1 MLE: Gaming is good for child development because you can play with other kids.
0.1 Supervised + 0.9 MLE: Gaming is good for child development because it teaches them how to think and solve problems. It also teaches them how to communicate with each other.

Conclusion
In this paper, we introduced style infusion to motivate infusing audience-centric stylistic preferences into unconstrained natural language generation models. We present a bootstrapped data augmentation method for limited pairwise audience feedback and an adversarial training framework with a decoupling loss to train a style-infused GPT-2. Through an automatic evaluation method for the transfer of audience-specific styles, we show that our approach generates compelling stylized examples from generic text prompts better than the baselines.
Synthesizing text with subjective styles, such as persuasiveness and memorability, remains a significant challenge in domains like computational advertising. Our work takes the first steps toward addressing this problem. We plan to extend it in several directions, such as incorporating long-document attention mechanisms (Beltagy et al., 2020) to capture document-level style features and altering the discourse structure to convey information more interpretably.

Limitations
As with other unconstrained natural language generation applications, our system is prone to issues like degeneration from beam search and neural hallucination. To combat the former, we post-process generations, though future work may provide better preventive methods. For the latter, we augment our dataset with samples from the CNN/DM dataset, which partially mitigates the problem, but out-of-domain topics still suffer. Increasing the dataset size helps only up to a point due to diminishing marginal returns.
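The paper does not specify its post-processing step; one common heuristic against beam-search degeneration is to truncate a generation at the first repeated n-gram. The sketch below is purely illustrative, not the authors' method.

```python
def truncate_repeats(tokens: list, n: int = 3) -> list:
    """Truncate a token sequence at the first repeated n-gram,
    a common heuristic to cut off degenerate repetitive tails."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return tokens[:i]  # cut before the repeated n-gram begins
        seen.add(gram)
    return tokens
```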
Due to the limited amount of data available, we considered iteratively training the discriminator on the augmented data while training the generator. Ultimately, we felt that the weak labels would dilute the trends learned by the discriminator, but it would be interesting to study how this affects the framework's performance. The need to collect pairwise datasets for this framework is itself a limitation. With increasing interest in the computational synthesis of persuasive text and imagery, we expect more relevant curated datasets in the near future. Generating pairwise data through human-subject experiments is expensive, which is why the data augmentation methods introduced in this paper are crucial for future work.
We also note that our framework is limited by the computational resources available to us: we were unable to effectively support long text generation while preserving the quality of the generated text. During training, we decrease the batch size and use the DeepSpeed framework (Rasley et al., 2020), but this is still insufficient for long text. Furthermore, traditional left-to-right generation struggles with long text, as topics tend to diverge. Because many styles, like persuasiveness, depend on paragraph-level features in addition to sentence-level ones, supporting longer texts would benefit our application.
Lastly, one of the biggest limitations of this paper is in demonstrating the effectiveness of our chosen architecture. Because most baselines address style transfer and fundamentally differ from our task, a fair comparison with prior work is difficult. Regardless, style infusion is a critical step for unconstrained NLG systems such as dialogue systems and chatbots, especially for human-centric stylistic objectives, which are already difficult to define.

Ethics Statement and Broader Impact
Our objective in developing a stylistic generative language model that leverages domain- and audience-specific feedback is to enable unconstrained generation applications that appeal to more human users. For example, generating more persuasive real news might help combat misinformation by propagating the truth faster than falsehoods. In advertising and communication, persuasiveness and memorability are critical traits, and an unconstrained generation model that replicates these features would have many positive applications, especially in targeted interventions. Previous research has mostly focused on predicting audience characteristics and targeting, not on synthesizing matching messages.
We acknowledge the dual-use concern that such a generation framework could be misused to, for example, spread misinformation. For this reason, we do not release the model or the pretrained generator checkpoint used in this work.

A Siamese BERT Discriminator
To validate our results, we tried another architecture, similar to Siamese BERT (Reimers and Gurevych, 2019), in which the two texts are tokenized individually and passed through their own BERT layers, producing two embeddings, e_1 and e_2. We concatenate the two embeddings along with their elementwise absolute difference, [e_1; e_2; |e_2 − e_1|], and pass the resulting vector in R^{3×h}, where h is the hidden dimension of BERT, through a fully connected classification layer.
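The comparison-vector construction can be sketched as follows. The function names and the sigmoid head are illustrative assumptions; the embeddings e_1 and e_2 would come from the two BERT encoders.

```python
import numpy as np

def siamese_features(e1: np.ndarray, e2: np.ndarray) -> np.ndarray:
    """Build the Siamese-BERT comparison vector [e1; e2; |e2 - e1|]
    from two h-dimensional sentence embeddings (yielding 3*h values)."""
    return np.concatenate([e1, e2, np.abs(e2 - e1)])

def classify(e1: np.ndarray, e2: np.ndarray, w: np.ndarray, b: float) -> float:
    """Linear classification head over the comparison vector; returns a
    probability-like score for the pairwise comparison (sigmoid output)."""
    z = siamese_features(e1, e2) @ w + b
    return 1.0 / (1.0 + np.exp(-z))
```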
While the original discriminator achieves approximately 89% accuracy on the random test set, the Siamese BERT model achieves a smaller, but still significant, 83%. On the Cornell Movie-Quotes corpus, with the same hyperparameters, the original discriminator achieves 80% accuracy versus 77% for the Siamese BERT architecture. We use the simpler discriminator architecture because it appears to capture style better than the Siamese BERT architecture.

B Baseline Reward
The baseline reward is meant to reduce the noise in the reward given by our discriminator (Ranzato et al., 2015). The baseline reward, R̄_i, is calculated by a linear layer whose input is the hidden state of our generator at timestep i. The intuition is that the linear layer approximates the value of the reward at each timestep and, in practice, reduces the variance of the reward. We train the linear layer with the following mean-squared-error loss:

L_b = Σ_i (R̄_i − D(y*_s, y))^2,

where D(y*_s, y) is the output of the discriminator when fed a gold argument and the generated argument (i.e., the reward).
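A minimal sketch of the baseline computation, assuming a single linear layer shared across timesteps and an MSE regression target equal to the scalar discriminator reward (the function names are illustrative):

```python
import numpy as np

def baseline_rewards(hidden_states: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Per-timestep baseline R̄_i: a linear layer applied to the generator's
    hidden states, shaped (timesteps, hidden_dim) -> (timesteps,)."""
    return hidden_states @ w + b

def baseline_loss(baselines: np.ndarray, reward: float) -> float:
    """Mean squared error between each timestep's baseline and the scalar
    discriminator reward D(y*_s, y) -- one plausible form of the loss."""
    return float(np.mean((baselines - reward) ** 2))
```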

C Linguistic Feature Correlation

C.1 Collected Linguistic Features
We collected the following linguistic features: length, verb tense rates (e.g., future, past), punctuation rates, readability scores (Flesch score, Flesch-Kincaid score, Gunning Fog score, SMOG score, Dale-Chall score), and part-of-speech rates (noun rate, ...).

Table 9: Percentage agreement with the linguistic feature correlations calculated using the hierarchical Bayesian model. Baseline models: GPT-2, GPT-2 fine-tuned on UKPConvArg1 or the Cornell Movie-Quotes corpus, GPT-2 with augmented data, and TitleStylist (Jin et al., 2020). Our models are trained on augmented data and a sample-dependent discriminator (SD) or sample-dependent supervised (SS) loss with parameter β. We show that our models are significantly better at learning stylistic features compared to our baselines.

C.2 Hierarchical Bayesian Model

We construct α and β for the two texts in each pair, indexing them with A_id and B_id, respectively. Similarly, we construct γ for each topic. A_ft and B_ft are the features of text A and B, respectively. α, β, and γ are all constructed similarly. Taking α as an example,

α = ᾱ + α_σ · α_v,

where ᾱ ∼ N(0, 0.25) and α_σ is drawn from an exponential distribution with λ = 1. We construct a separate α_v for each unique A_id, where each α_v ∼ N(0, 1.0). These values are chosen because they help the MCMC sampling converge. β and γ follow the same construction, except with different shapes for β_v and γ_v. When fitting the hierarchical Bayesian model, we use 1000 warmup steps and generate an additional 1000 samples.
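The prior construction over ᾱ, α_σ, and α_v can be sketched as a forward-sampling step. This assumes the non-centered form α = ᾱ + α_σ · α_v implied by the per-id standard-normal offsets; the function name and seed are illustrative, and the actual model is fit with MCMC rather than sampled directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_alpha(n_ids: int) -> np.ndarray:
    """Draw per-id effects alpha under the hierarchical prior:
    a global mean, an Exponential(1) scale, and one standard-normal
    offset per unique id (non-centered parameterization)."""
    alpha_bar = rng.normal(0.0, 0.25)       # global mean, N(0, 0.25)
    alpha_sigma = rng.exponential(1.0)      # scale, Exp(lambda = 1)
    alpha_v = rng.normal(0.0, 1.0, n_ids)   # one offset per unique A_id
    return alpha_bar + alpha_sigma * alpha_v

alphas = sample_alpha(5)
```

β and γ would follow the same construction with their own offset shapes.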

C.3 Feature Agreement
To quantify feature agreement, we compute a weighted average of the results of Table 7 over the full feature set, using the correlations obtained from the Bayesian model as weights. The results are shown in Table 9.
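The weighted average can be written out as follows. Using the absolute values of the correlations as weights is an assumption; the paper does not state whether signed or absolute correlations are used.

```python
def weighted_agreement(agreements: list, weights: list) -> float:
    """Weighted average of per-feature agreement rates, weighting each
    feature by the magnitude of its Bayesian-model correlation."""
    total = sum(abs(w) for w in weights)
    return sum(a * abs(w) for a, w in zip(agreements, weights)) / total
```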

Figure 1: Training diagram showing how the loss is calculated as a weighted sum of the discriminator (L_D) and reconstruction (L_R) losses. α_S is decided by the discriminator as a form of contrastive learning.

Figure 2: The correlations between linguistic features and convincingness in the UKPConvArg1 corpus. The lower x-axis is in the logit scale, and the percentage difference in odds of winning is on the upper x-axis. The figure is read as: if the feature for argument A is one standard deviation greater than the feature for argument B, the odds of A winning shift by the respective percent value. Notice that the correlations indicate a strong positive association with readability (e.g., the Flesch score correlation is positive while those for length and average syllables are negative).

Figure 3: The correlations between linguistic features and persuasiveness in the Cornell Movie-Quotes Corpus. The lower x-axis is in the logit scale, and the percentage difference in odds of winning is on the upper x-axis. Notice the negative correlation with long and winding text (i.e., circuitousness (Toubia et al., 2021)).

Table 5: Generations of GPT-2 trained with the sample-dependent discriminator loss objective with different values of β. The generations for SD-0.5 and SD-1.0 tend to be much better than those for SD-0.1.