Style is NOT a single variable: Case Studies for Cross-Stylistic Language Understanding

Every natural text is written in some style. Style is formed by a complex combination of different stylistic factors, including formality markers, emotions, and metaphors. One cannot form a complete understanding of a text without considering these factors, which combine and co-vary in complex ways to form styles. Studying the nature of these co-varying combinations sheds light on stylistic language in general, sometimes called cross-style language understanding. This paper provides a benchmark corpus (XSLUE) that combines existing datasets with a newly collected one for sentence-level cross-style language understanding and evaluation. The benchmark contains text in 15 different styles under four proposed theoretical groupings: figurative, personal, affective, and interpersonal. For valid evaluation, we collect an additional diagnostic set by annotating all 15 styles on the same text. Using XSLUE, we present three cross-style applications: classification, correlation, and generation. First, our proposed cross-style classifier, trained on multiple styles together, improves overall classification performance over individually trained style classifiers. Second, our correlation study shows that some styles are highly dependent on each other in human-written text. Finally, we find that combining contradictory styles tends to generate stylistically less appropriate text. We believe our benchmark and case studies open interesting future directions for cross-style research. The preprocessed datasets and code are publicly available.


Introduction
People often use style as a strategic choice for their personal or social goals in communication (Hovy, 1987; Silverstein, 2003; Jaffe et al., 2009; Kang, 2020). Some stylistic choices implicitly reflect the author's characteristics, such as personality, demographic traits (Kang et al., 2019), and emotions (Buechel and Hahn, 2017), whereas others are explicitly controlled by the author for social goals, such as using polite language to maintain a better relationship with an elder (Danescu et al., 2013). In this work, we broadly call each such individual linguistic phenomenon a specific type of style.
Style is not a single variable but a set of multiple variables, each with its own degrees of freedom, that co-vary together. Imagine an orchestra as a metaphor for style. What we hear from the orchestra is the harmonized sound of a complex combination of individual instruments. A conductor, on top of that, controls combinatory choices among them, such as tempo. Some instruments in the same category, such as the violin and cello among bowed strings, produce similar patterns of sound. Similarly, text reflects a complex combination of multiple styles: each has its own lexical and syntactic features, and some are dependent on each other. A consistent combination of them by the author produces stylistically appropriate text.
Unlike prior work, this paper proposes the first comprehensive study of cross-stylistic language variation, focusing in particular on how different styles co-vary in written text, which styles are dependent on each other, and how they can be systematically composed to generate text.
Our work makes the following contributions:
• We aggregate 15 different styles and 23 sentence-level classification tasks (§3). Based on their social goals, the styles are categorized into four groups (Table 1): figurative, affective, personal, and interpersonal.
• We collect a cross-style set by annotating all 15 styles on the same text, for valid evaluation of cross-stylistic variation (§3.3).
• We study cross-style variation in classification (§4), correlation (§5), and generation (§6): our classifier jointly trained on multiple styles outperforms individually trained classifiers, and our correlation study finds statistically significant style inter-dependencies (e.g., impoliteness and offense) in written text.

Related Work
Building on Silverstein's notion of indexical order, Eckert (2008) developed the idea that linguistic variables index a social group, which leads to the indexing of certain traits stereotypically associated with members of that group. Eckert (2000, 2019) argued that style change creates a new persona, impacting a social landscape, and presented the expression of social meaning as a continuum of decreasing reference and increasing performativity.
Despite these extensive theories, very little is known about dependencies across multiple styles. In this work, we empirically show evidence of extra-linguistic variation of styles, such as formality and politeness, but limited to styles for which publicly available resources are obtainable for computation. We call each individual phenomenon a specific type of "style" in this work. We acknowledge that many other kinds of style are not covered here, such as inter-linguistic variables in grammar and phonology, or higher-level style variation like an individual's writing style or genre.
Cross-style analysis. Some recent works have provided empirical evidence of style inter-dependencies, but over a very limited range: Warriner et al. (2013) analyzed emotional norms and their correlations with lexical features of text. Chhaya et al. (2018) studied the correlation of formality, frustration, and politeness, but on a small sample (960 emails). Nguyen et al. (2014) focused on correlations across demographic attributes (e.g., gender, age), which were also related to other factors such as emotions (Preoţiuc-Pietro and Ungar, 2018). Dankers et al. (2019) and Mohammad et al. (2016) studied the interplay of metaphor and emotion in text. Liu et al. (2010) studied sarcasm detection with sentiment as a sub-problem. Brooke and Hirst (2013) conducted a topical analysis of six styles (literary, abstract, objective, colloquial, concrete, and subjective) across different genres of text. Passonneau et al. (2014) conducted a detailed analysis of Biber's genres and the relationships between them.

Style selection and groupings
To conduct comprehensive style research, one needs a collection of different style datasets. We survey recent papers related to style published in ACL venues and choose 15 widely-used styles that have publicly available annotated resources and a feasible amount of training data (Table 1). We plan to gradually increase the coverage of styles and make the benchmark more comprehensive in the future.

Table 2 (excerpt, Emotion group): EmoBank valence (negative-positive), arousal (calm-excited), and dominance (Buechel and Hahn, 2017), each with 10k samples, a random split, and continuous labels, treated as regression tasks on miscellaneous-domain text.

We follow the theoretical style grouping criteria based on social goals in Kang (2020), which categorizes styles into four groups (Table 1): PERSONAL, INTERPERSONAL, FIGURATIVE, and AFFECTIVE, where each group has its own social goals in communication. This grouping serves as a basic framework in our case studies for detecting style dependencies.

Individual style dataset
For each style in a group, we pre-process existing style datasets or collect our own when no publicly available one exists (i.e., ShortRomance). We exclude datasets with very few samples (e.g., ≤ 1K) since they are infeasible for training a large model. We also limit our scope to classifying a single sentence, although other types of datasets exist (e.g., document-level style classification, or classifying a sentence with respect to a given context) that are out of the scope of this work.
If a dataset has its own data split, we follow it. Otherwise, we randomly split it by 0.9/0.05/0.05 ratios into train, valid, and test sets, respectively. If a dataset has only positive samples (ShortHumor, ShortJoke, ShortRomance), we perform negative sampling from literal text, as in Khodak et al. (2017). The detailed pre-processing steps are in Appendix §A.
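A minimal sketch of this splitting and negative-sampling logic (the function names and the literal-text pool are illustrative, not the actual XSLUE preprocessing code):

```python
import random

def split_dataset(samples, seed=42):
    """Randomly split into 0.9/0.05/0.05 train/valid/test when no
    official split is provided."""
    rng = random.Random(seed)
    samples = samples[:]          # copy to avoid mutating the caller's list
    rng.shuffle(samples)
    n = len(samples)
    n_train, n_valid = int(0.9 * n), int(0.05 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_valid],
            samples[n_train + n_valid:])

def add_negative_samples(positives, literal_pool, seed=42):
    """For positive-only datasets (e.g., ShortRomance), draw one literal
    sentence per positive as a negative, as in Khodak et al. (2017).
    Assumes len(literal_pool) >= len(positives)."""
    rng = random.Random(seed)
    negatives = rng.sample(literal_pool, len(positives))
    return ([(s, 1) for s in positives] +
            [(s, 0) for s in negatives])
```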

Cross-style diagnostic set
The individual datasets, however, vary in domain (e.g., web, dialogue, tweets), label distribution, and data size (see the domain, label, and #S columns in Table 2). Evaluating a system on these individual datasets' test sets is not an appropriate way to validate how multiple styles are used together, because samples from individual datasets are annotated with only a single style in mind.
To help researchers evaluate their systems in the cross-style setting, we collect an additional diagnostic set, called the cross-set, by having crowd workers annotate the labels of all 15 styles on the same text. We collect a total of 500 sample texts from two sources: the first half is randomly chosen, in balance, from the test sets of the 15 style datasets, and the second half is chosen from random tweets that show high variation across style prediction scores from our pre-trained style classifiers. Each sample is annotated by five annotators, and the final label for each style is decided via majority voting over the five annotations; if the votes are tied, or all different for multi-label styles, we exclude the sample. We also include a Don't Know option for personal styles and a Neutral option for the two-way binary styles (e.g., sentiment, formality). The detailed annotation schemes are in Appendix §B. Table 3 shows inter-annotator agreement on the cross-set. Agreement varies considerably by style: sentiment and politeness show good agreement, and formality, emotion, and romance show moderate agreement. However, personal styles (e.g., age, education level, and political view), metaphor, and some emotion dimensions (e.g., arousal and dominance) show only fair agreement, indicating how difficult and subjective these styles are.
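The label aggregation described above can be sketched as follows (a simplified illustration; the actual annotation interface and tie-breaking details are in Appendix §B):

```python
from collections import Counter

def aggregate_label(annotations):
    """Majority vote over five annotations for one style on one sample.
    Returns None when the top votes are tied (or all different), in which
    case the sample is excluded for that style."""
    counts = Counter(annotations)
    (top, n_top), *rest = counts.most_common()
    if rest and rest[0][1] == n_top:   # tie between the most frequent labels
        return None
    return top

assert aggregate_label(['polite'] * 3 + ['impolite'] * 2) == 'polite'
assert aggregate_label(['happy', 'sad', 'happy', 'sad', 'fear']) is None  # tied
```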

Contribution
Most datasets in XSLUE, except for ShortRomance, are collected from others' work. Following the data statement practice (Bender and Friedman, 2018), we cite and introduce the individual datasets with their statistics in Table 2. Our main engineering contribution is to convert every dataset into the same pre-processed format and distribute them with accompanying code for better reproducibility and accessibility. Beyond this engineering effort, XSLUE's main goal is to invite NLP researchers to the field of cross-style understanding and to provide a valid evaluation set for further exploration. As a first step, using XSLUE, we study cross-style language variation in three applications: classification (§4), correlation (§5), and generation (§6).

Case #1: Cross-Style Classification
We study how modeling multiple styles together, instead of individually, can be effective for style classification. In particular, the annotated cross-set in XSLUE is used as part of the evaluation for cross-style classification.
Models. We compare two types of models: a single model and a cross model. The single model is trained on each individual style dataset separately, whereas the cross model is trained on a shuffled combination of all datasets. For the single model, we use various baselines: a majority classifier that always chooses the majority label in the training data, a bidirectional LSTM (biLSTM) (Hochreiter and Schmidhuber, 1997) with GloVe embeddings (Pennington et al., 2014), and variants of fine-tuned transformers: Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), robustly optimized BERT (RoBERTa) (Liu et al., 2019), and the text-to-text transformer (T5) (Raffel et al., 2019). For the cross model, we propose an encoder-decoder model that learns cross-style patterns with an internal representation shared across styles (Figure 1). It encodes a differently-styled input as text (e.g., "STYLE: formality TEXT: would you please..") and decodes the output label as text (e.g., "formal"). We use the pre-trained encoder-decoder model from T5 (Raffel et al., 2019) and fine-tune it on the combined, shuffled datasets in XSLUE. Due to the nature of the encoder-decoder model, we can cast any training instance of a classification task into the same text-to-text format. We also trained the single model (e.g., RoBERTa) on the combined datasets in a multi-task setup (i.e., 15 different heads), but it showed less significant results.

Table 4: Individual-style (left) and cross-style (right) classification in XSLUE. Every score is averaged over ten runs with different random seeds. For cross-style classification, we choose a single dataset per style, the one with more training data than the others; otherwise, we leave it blank (-).
The detailed hyper-parameters used in our model training are in Appendix §C.
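A minimal sketch of the text-to-text formatting behind the cross model, using the Hugging Face transformers API (the exact prompt template and decoding settings here are illustrative assumptions; fine-tuning on the shuffled combined datasets is omitted):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def make_example(style, text, label=None):
    """Cast any style classification instance into the shared text-to-text
    format: source 'STYLE: formality TEXT: would you please ..',
    target 'formal'."""
    return f"STYLE: {style} TEXT: {text}", label

def predict_style(style, text):
    """Decode the predicted label as text from the encoder-decoder model."""
    source, _ = make_example(style, text)
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(predict_style("formality", "would you please send me the report?"))
```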
Tasks. Our evaluation has two tasks: individual-set evaluation, which evaluates a classifier on each individual dataset's test set (left columns in Table 4), and cross-set evaluation, which evaluates a classifier on the annotated cross-set collected in §3.3 (right columns in Table 4). Due to the label imbalance of the datasets, we measure F1 score for classification tasks and Pearson-Spearman correlation for regression tasks (i.e., EmoBank). For multi-label tasks, all scores are macro-averaged over labels.
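For reference, a sketch of the two metrics with standard scipy/sklearn calls, assuming "Pearson-Spearman correlation" denotes the average of the two coefficients (the GLUE convention; our assumption, not stated in the paper):

```python
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import f1_score

def classification_score(y_true, y_pred):
    # Macro-average F1 over labels to mitigate label imbalance.
    return f1_score(y_true, y_pred, average="macro")

def regression_score(y_true, y_pred):
    # Pearson-Spearman correlation: mean of the two coefficients.
    return (pearsonr(y_true, y_pred)[0] + spearmanr(y_true, y_pred)[0]) / 2
```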
Results. In the individual-set evaluation, the fine-tuned transformers show significant improvements over the biLSTM classifier (+8 F1 points on average), although the different transformer models have similar F1 scores. Our proposed cross model significantly outperforms the single model, by +1.7 points overall F1, showing the benefit of learning multiple styles together. In particular, the cross model significantly improves F1 on personal styles such as gender, age, and education level, possibly because detecting personal styles benefits from signals of other styles. Across styles, the personal styles, figurative styles (e.g., sarcasm and metaphor), and emotions are the most difficult to predict, mirroring the annotator agreement in Table 3.
In the cross-set evaluation, overall performance drops significantly compared to the individual-set evaluation (e.g., from 65.9% to 40.7%), showing why such an annotated diagnostic set is important for valid evaluation of cross-style variation. Again, the cross-style model achieves a +1.2 point gain over the single models. Figure 2 shows the F1 improvement of the cross model over the single BERT model. Most styles gain performance from cross-style modeling, except the two metaphor datasets (VUA, TroFi) and the ethnicity style. This is possibly because the metaphor tasks prepend the target metaphor verb to the input text, a setup different from the other tasks; learning them together may thus harm performance, although the drop is not significant.

Case #2: Style Dependencies
In addition to the theoretical style grouping in §3.1, we empirically examine how pairs of styles are correlated in human-written text, using silver predictions from the classifiers.
Setup. We sample 1,000,000 tweets crawled via Twitter's Gardenhose API. We choose tweets as the target domain because of their stylistic diversity compared to other domains such as news articles. Using the fine-tuned cross-style classifier from §4, we predict the probability of 53 style attributes (an attribute is a label of a style, e.g., the positive and negative labels of sentiment) over the 1M tweets. We split each tweet into sentences and average their prediction scores. We then compute Pearson correlation coefficients across the style attributes, producing a 53 × 53 correlation matrix. We only show correlations that are statistically significant (p-value < 0.05) and cross out the rest.
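A sketch of this computation, assuming `scores` is a 1M × 53 array of sentence-averaged prediction probabilities (the variable and function names are ours):

```python
import numpy as np
from scipy.stats import pearsonr

def style_correlation_matrix(scores, alpha=0.05):
    """scores: (n_tweets, n_attributes) array of sentence-averaged
    prediction probabilities. Returns the Pearson correlation matrix with
    statistically insignificant entries masked out as NaN."""
    n_attr = scores.shape[1]
    corr = np.full((n_attr, n_attr), np.nan)
    for i in range(n_attr):
        for j in range(i, n_attr):          # the matrix is symmetric
            r, p = pearsonr(scores[:, i], scores[:, j])
            if p < alpha:                   # keep only significant entries
                corr[i, j] = corr[j, i] = r
    return corr
```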

Reliability. One may worry that the classifier's low performance on some styles leads to unreliable interpretations of our analysis. Although we only show correlations over the predicted style values, we also performed the same analysis on the human-annotated cross-set and observed similar correlation tendencies to the predicted ones. However, due to the small number of annotations, its statistical significance is not high enough. We therefore show the prediction-based correlations, which possibly include noisy correlations but are statistically significant.
Results. Figure 3 shows the full correlation matrix we found.

Figure 3: Cross-style correlation. Only correlations with p < 0.05 (confidence interval: 0.95) are considered statistically significant. The degree of correlation gradually increases from red (negative) to blue (positive), where the color intensity is proportional to the correlation coefficient. We partition the correlation matrix into three pieces: across interpersonal, figurative, and affective styles (upper left), between persona styles and the interpersonal, figurative, and affective groups (upper right), and across persona styles (lower right). IMPORTANT NOTE: please be VERY CAREFUL not to make any unethical or misleading interpretations from these model-predicted, artificial correlations. Best viewed in color.

From the matrix, we summarize some of the highly correlated style pairs in Table 5. For each correlated pair, two annotators rated the validity of its stylistic dependency on a Likert scale; our prediction-based correlations received an average rating of 4.18, indicating reasonably accurate correlations. We also provide an empirical grouping of styles by applying Ward hierarchical clustering (Ward Jr, 1963) to the correlation matrix. Figure 4 shows some interpretable style clusters detected from text, such as Asian ethnicities (SouthAsian, EastAsian), middle ages (35-44, 45-54, 55-74), positiveness (happiness, dominance, positive, polite), and negative emotions (anger, disgust, sadness, fear).
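The empirical grouping step can be sketched as follows: convert correlations into dissimilarities and run Ward clustering over them (a sketch under the assumption that masked NaN entries are treated as zero correlation; function names are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_styles(corr, attribute_names, n_clusters=10):
    """Ward hierarchical clustering (Ward Jr, 1963) over the style
    correlation matrix, so highly correlated attributes land together."""
    distance = 1.0 - np.nan_to_num(corr, nan=0.0)   # correlation -> dissimilarity
    condensed = squareform(distance, checks=False)  # condensed form for linkage
    tree = linkage(condensed, method="ward")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    clusters = {}
    for name, lab in zip(attribute_names, labels):
        clusters.setdefault(lab, []).append(name)
    return clusters
```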

Case #3: Cross-Style Generation
We study the effect of combining styles in the context of generation. We first describe our style-conditioned generators that combine the style classifiers in §4 with pre-trained generators (§6.1), and then use them to examine two hypothetical questions: does better identification of styles lead to better stylistic generation (§6.2)? And which combinations of styles are more natural or contradictory in generation (§6.3)?

Style-conditioned Generation
Let x be an input text and s a target style. Since we already have the fine-tuned style classifiers P(s|x) from §4, we can combine them with a generator P(x), such as a pre-trained language model, and generate text conditioned on the target style, P(x|s). We extend the plug-and-play language model (PPLM) (Dathathri et al., 2019) to combine our style classifiers trained on XSLUE with a pre-trained generator, GPT-2 (Radford et al., 2019), without extra fine-tuning: P(x|s) ∝ P(x) · P(s|x). Table 6 shows example outputs from our style-conditioned generators given the prompt 'Every natural text is'.
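PPLM itself steers GPT-2 by perturbing its hidden states with gradients from the attribute classifier. As a much simpler illustration of the same Bayes-rule factorization P(x|s) ∝ P(x) · P(s|x), one can sample continuations from P(x) and rerank them by classifier probability (a sketch, not the paper's method; `style_classifier_prob` is an assumed interface to one of the fine-tuned classifiers from §4):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
generator = GPT2LMHeadModel.from_pretrained("gpt2")

def generate_with_style(prompt, style_classifier_prob, n_candidates=16):
    """Sample candidate continuations from P(x) and keep the one that
    maximizes P(s|x), approximating P(x|s) at the sequence level.
    style_classifier_prob(text) -> float is an assumed callable."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = generator.generate(
            **inputs, do_sample=True, top_p=0.9, max_new_tokens=40,
            num_return_sequences=n_candidates,
            pad_token_id=tokenizer.eos_token_id)
    candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    return max(candidates, key=style_classifier_prob)
```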
Table 6: Example outputs of our stylistic generator given the prompt 'Every natural text is', each conditioned on a different style label. The blue and red phrases are manually labeled as reasonable features for each label; offensive words are replaced with *.
'Every natural text is' a series of images. The images, as they are known within the text, are the primary means by which a text is read, and therefore are ..
'Every natural text is' a natural language, and every natural language is a language that we can speak. It is the language of our thoughts and of our lives..
'Every natural text is' worth reading...I'm really going to miss the music of David Byrne, and that was so much fun to watch live. The guy is a *ucking *ick. ..

We evaluate the quality of the output text: given 20 frequent prompts randomly extracted from our training data, we generate 10 continuations per prompt for each binary label of four styles (sentiment, politeness, offense, and formality) using the conditional style generator; in total, 20 × 10 × 2 × 4 = 1,600 continuations.
We evaluate using both automatic and human measures. In the automatic evaluation, we calculate the F1 score of the generated text using the fine-tuned classifiers, checking whether the output reflects the stylistic factor of the given target style. In the human evaluation, scores (1-5 Likert scale) from three crowd workers are averaged over three metrics: stylistic appropriateness, consistency with the prompt, and overall coherence.

Table 7: Automatic and human evaluation of generated text. The 1st and 2nd labels correspond to positive and negative for sentiment, polite and impolite for politeness, formal and informal for formality, and non-offensive and offensive for offense. The three numbers in the human evaluation are stylistic appropriateness, consistency with the prompt, and overall coherence, in order.
In Table 7, compared to the F1 scores on the individual test sets in XSLUE, the automatic scores on the generator's output are lower by 20.5 points on average, showing the sub-optimality of the conditional style generator relative to classification. Interestingly, in the human evaluation, the negative labels (the 2nd label of each style), i.e., negative sentiment, impoliteness, informality, and offensiveness, show lower stylistic appropriateness than the positive or literal labels.

Better classification, better generation
To further investigate the relationship between classifier performance and generation quality, we conduct a study that varies the training completion ratio of the classifiers (i.e., the fraction of training epochs completed; C%), giving P_C%(s|x) over the four styles, and again evaluate the output continuations, P_C%(x|s) ∝ P(x) · P_C%(s|x), using the same human metrics.

Figure 5: As the training completion ratio (x-axis, %) of the classifiers increases, stylistic appropriateness (blue, y-axis) and consistency (red, y-axis) increase.

Figure 5 shows that better style understanding (higher F1 scores in classification) yields better stylistic generation (higher stylistic appropriateness and consistency scores).

Contradictory styles in generation
So far we have generated text conditioned on single styles. We now generate text conditioned on a combination of multiple styles: P(x|s_1..s_k) ∝ P(x) · P(s_1|x) · · · P(s_k|x), where k is the number of styles. In our experiment, we set k=2 with the sentiment and politeness styles, and generate text conditioned on every combination of the two styles' labels (e.g., positive and polite, negative and impolite). We again conduct a human evaluation of the output text, measuring whether the generator produces stylistically appropriate text given the combination. Table 8 shows the averaged human-rated stylistic appropriateness scores over the four label combinations (left) and the corresponding correlation scores from the style correlation matrix on written text in Figure 3 (right). Some combinations, such as positive and impolite, or negative and polite, show lower stylistic appropriateness scores because they are naturally contradictory in their stylistic variation. Moreover, the stylistic appropriateness scores resemble the correlation scores observed in written text, showing that certain combinations of styles are natural or unnatural both in human-written text and in model-generated output.
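Extending the single-style sketch above, combining k classifiers amounts to a product of their probabilities, computed in log space for numerical stability. Again an illustrative reranking approximation, not the exact PPLM update; `sample_candidates` is the sampling step from the earlier sketch, factored into a hypothetical helper:

```python
import math

def generate_with_styles(prompt, classifiers, n_candidates=16):
    """classifiers: list of P(s_i|x) callables, e.g., [P(positive|x),
    P(impolite|x)]. Rerank candidates by sum_i log P(s_i|x), i.e.,
    P(x|s_1..s_k) proportional to P(x) * prod_i P(s_i|x)."""
    candidates = sample_candidates(prompt, n_candidates)  # hypothetical helper
    def joint_log_prob(text):
        # Clamp probabilities away from zero before taking logs.
        return sum(math.log(max(clf(text), 1e-12)) for clf in classifiers)
    return max(candidates, key=joint_log_prob)
```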

Conclusion and Discussion
We introduce XSLUE, a benchmark of mostly existing datasets for studying cross-style language understanding and evaluation. Using XSLUE, we found interesting cross-style observations in our classification, correlation, and generation case studies. We believe XSLUE will help other researchers develop more solid methods for various cross-style applications. We summarize other concerns raised by our case studies:

Style drift. The biggest challenge in collecting style datasets is to diversify the style of the text while preserving its meaning, to avoid semantic drift. In the cross-style setting, we also face a new challenge, style drift: different styles are coupled, so changing one style might affect the others.
Ethical considerations. Some styles, particularly those related to personal traits, are ethically sensitive and thus require careful interpretation of the results to avoid misleading conclusions. Any follow-up research should consider such ethical issues and disclose the potential weaknesses of its proposed methods.
From correlation to causality. Our analysis is based on correlation, not causality. To find causal relations between styles, more sophisticated causal analyses, such as propensity score methods (Austin, 2011), need to be considered to control for confounding variables. Doing so may also resolve biases arising from the specific domains of the training data. For example, text generated with the politeness classifier (Danescu et al., 2013) contains many technical terms (e.g., 3D, OpenCV, bugs) because its training data is collected from StackExchange.