Does It Capture STEL? A Modular, Similarity-based Linguistic Style Evaluation Framework

Style is an integral part of natural language. However, evaluation methods for style measures are rare, often task-specific and usually do not control for content. We propose the modular, fine-grained and content-controlled similarity-based STyle EvaLuation framework (STEL) to test the performance of any model that can compare two sentences on style. We illustrate STEL with two general dimensions of style (formal/informal and simple/complex) as well as two specific characteristics of style (contrac’tion and numb3r substitution). We find that BERT-based methods outperform simple versions of commonly used style measures like 3-grams, punctuation frequency and LIWC-based approaches. We invite the addition of further tasks and task instances to STEL and hope to facilitate the improvement of style-sensitive measures.


Introduction
Natural language is not only about what is said (i.e., content), but also about how it is said (i.e., linguistic style). Linguistic style and social context are highly interrelated (Coupland, 2007; Bell, 2013). For example, people can accommodate their linguistic style to each other based on social power differences (Danescu-Niculescu-Mizil et al., 2012). Furthermore, linguistic style can influence perception, e.g., the persuasiveness of news (El Baff et al., 2020) or the success of pitches on crowdsourcing platforms (Parhankangas and Renko, 2017). As a result, style is relevant for natural language understanding, e.g., in author profiling (Rao et al., 2010), abuse detection (Markov et al., 2021) or understanding conversational interactions (Danescu-Niculescu-Mizil and Lee, 2011). Additionally, style can be important to address in natural language generation (Ficler and Goldberg, 2017), including identity modeling in dialogue systems (Li et al., 2016) and style preservation in machine translation (Niu et al., 2017; Rabinovich et al., 2017).

Figure 1: STEL Task Instance. Anchor 1 (A1) and anchor 2 (A2) and the alternative sentences 1 (S1) and 2 (S2) are split along the same style dimension (here: formal/informal). The sentences and anchors are paraphrases of each other. The STEL task is to order S1 and S2 to match A1-A2. Here, the correct order is S2-S1.
There are several general evaluation benchmarks for different linguistic phenomena (e.g., Wang et al., 2018, 2019), but less emphasis has been put on linguistic style. Nevertheless, natural language processing literature shows a variety of approaches for the evaluation of style-measuring methods: they have been tested on whether they group texts by the same authors together (Hay et al., 2020; Bevendorff et al., 2020), whether they can correctly classify the style for ground-truth datasets (Niu and Carpuat, 2017; Kang and Hovy, 2021) and whether 'similar style words' are similarly represented (Akama et al., 2018). However, these evaluation approaches are (i) often application-specific, (ii) rarely used to compare different style methods, (iii) usually do not control for content and (iv) often do not test for fine-grained style differences.
These shortcomings (i)-(iv) might be the result of the following challenges for the construction of style evaluation methods:

1. Style is a highly ambiguous and elusive term (Biber and Conrad, 2009; Crystal and Davy, 1969; Labov, 2006; Xu, 2017). We propose a modular framework where components can be removed or added to fit an application or a specific understanding of style.

2. Variation in style can be very small. Our proposed evaluation framework can be used to test for fine-grained style differences.

3. Style is hard to disentangle from content, as the two are often correlated (e.g., Gero et al. (2019); Bischoff et al. (2020)). For example, people might speak more formally in a job interview than in a bar with friends. Thus, language models and methods might pick up on spurious content correlations (similar to Poliak et al. (2018)) in a benchmark that does not control for content.
To this end, we propose the modular, fine-grained and content-controlled similarity-based STyle EvaLuation framework (STEL). We demonstrate it for the English language. An example task is shown in Figure 1. The task is to order sentence 1 (S1) and sentence 2 (S2) to match the style order of anchor 1 (A1) and anchor 2 (A2). Our STEL framework encompasses two general dimensions of style (formal/informal and simple/complex) as well as two specific characteristics of style (contraction and number substitution). By design, the style characteristics are easy to identify. Thus, the STEL characteristic tasks are easier to solve than the STEL dimension tasks. STEL contains 815 task instances per dimension and 100 task instances per characteristic (see Table 1). To be evaluated on STEL, style measuring methods need not be able to classify styles directly. Instead, any method that can calculate the style similarity between two sentences can be evaluated: (1) style (measuring) methods that calculate similarity values directly (e.g., edit distance or cross-encoders; Reimers and Gurevych, 2019) and (2) vector representations of a sentence's style (e.g., Hay et al. (2020); Ding et al. (2019)), by using a distance or similarity measure between them (e.g., cosine similarity). This similarity-based setup also simplifies task extension (cf. modularity). STEL components can easily be generated from parallel sets of paraphrases which differ along a style dimension (§3), e.g., sets of paraphrases that vary along the formal/informal dimension (Rao and Tetreault, 2018).
Contribution. With this paper, we contribute (a) the modular, fine-grained and content-controlled STEL framework (§3), (b) 1830 validated task instances for the considered style components (§4) and (c) baseline results of STEL on 18 style measuring methods (§5). We find that the BERT base model outperforms simple versions of commonly used style measuring approaches like LIWC, punctuation frequency or character 3-grams. We invite the addition of complementary tasks and hope that this framework will facilitate the development of improved style-sensitive models and methods. Our data and code are available on GitHub.
Related Work

Linguistic style is usually defined to be distinct from content. However, style is often found to be correlated with content (e.g., Gero et al. (2019)). To address this, some control for content with word-level paraphrases (Pavlick and Nenkova, 2015; Preoţiuc-Pietro et al., 2016; Niu and Carpuat, 2017), topic labels (e.g., Boenninghoff et al. (2019)) or by avoiding the use of content-specific features (cf. Neal et al. (2017); Stamatatos (2017)), while others choose no or only limited control for content (e.g., Zangerle et al. (2020); Kang and Hovy (2021)). There has been considerable work on creating parallel datasets of (sentence-level) paraphrases with shifting style, often using human annotations (Xu et al., 2012, 2016; Rao and Tetreault, 2018; Krishna et al., 2020). The task of generating paraphrases of text fragments with different style properties is sometimes also called style transfer.
There is little work on general evaluation benchmarks for style-measuring methods. Kang and Hovy (2021) use style classification tasks to compare 5 language models. Only models that classify style into the given 15 dimensions can be evaluated, and they do not control for content. Individually fine-tuned RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019) classifiers for one style were outperformed by a fine-tuned T5 model (Raffel et al., 2020) that was jointly trained on multiple style labels. BERT/RoBERTa outperformed the T5 model on some styles (e.g., 'sarcasm' and 'metaphor'). Other related tasks are the PAN Authorship Verification (Kestemont et al., 2020) and Style Change Detection (Zangerle et al., 2020) tasks, which aim at identifying whether two documents or consecutive paragraphs have been written by the same author. In their current version, both tasks do not control for topic. However, Kestemont et al. (2020) controls for domain (here: 'fandom' of the considered 'fanfictions'). The best performing model for Kestemont et al. (2020) was a neural LSTM-based siamese network (Boenninghoff et al., 2020), which is conceptually similar to some variants of sentence BERT (Reimers and Gurevych, 2019). The PAN setup assumes that authors tend to write in a relatively consistent style. Based on similar assumptions, the field of authorship attribution aims to determine which author wrote a given document.

Table 1: STEL Examples. We give an example for each component (Comp) of STEL: formal/informal and simple/complex for the more complex style dimensions as well as number substitution and contraction for the simpler style characteristics. The task is to order sentence 1 (S1) and sentence 2 (S2) to match the style order of anchor 1 (A1) and anchor 2 (A2). The correct order is given in the 'Order' column.

formal/informal (n=815, Order: S2-S1)
  A1: r u a fan of them or something?
  A2: Are you a fan of them or something?
  S1: Oh, and also, that young physician got an unflattering haircut.
  S2: Oh yea and that young dr got a bad haircut.

simple/complex (n=815, Order: S1-S2)
  A1: These rock formations are made of sandstone with layers of quartz.
  A2: These rock formations are characteristically composed of sandstone with layers of quartz.
  S1: The Odyssey is an ancient Greek epic poem attributed to Homer.
  S2: The Odyssey is an old Greek epic poem written by Homer.

number substitution (n=100, Order: S2-S1)
  A1: <3 friends forever
  A2: <3 friends 4ever
  S1: D00d $30 is heaps cheap, that must work out to just a couple of bucks an hour
  S2: Dude $30 is heaps cheap, that must work out to just a couple of bucks an hour

contraction (n=100, Order: S1-S2)
  A1: In that time, it's become one of the world's most significant financial and cultural capital cities.
  A2: In that time, it has become one of the world's most significant financial and cultural capital cities.
  S1: Will doesn't refer to any particular desire, but rather to the mechanism for choosing from among one's desires.
  S2: Will does not refer to any particular desire, but rather to the mechanism for choosing from among one's desires.

Style Evaluation Framework
We introduce the modular, fine-grained, and content-controlled similarity-based STyle EvaLuation framework (STEL). STEL tests a (language) model's ability to capture the style of a sentence.
Modular Operationalization of Style. Style has previously been conceptualized in many different ways, from being defined as purely aesthetic (Biber and Conrad, 2009) to encompassing all forms of language variation (e.g., Crystal and Davy, 1969). We refrain from meddling in the style definition debate and instead use the broad notion of "how vs. what", i.e., how something is said as opposed to what is said. Inspired by Campbell-Kibler et al. (2006), we use different characteristics (i.e., more specific linguistic choices) as well as more general dimensions of style (i.e., more complex combinations of style features). By not only using complex style dimensions, but also small-scale and simpler characteristics, STEL allows for very controlled and fine-grained testing. We can easily make sure that only the characteristics and no other aspects change (cf. Table 1). Depending on one's goal and understanding of style, some components (i.e., dimensions or characteristics) can be excluded and others added to this modular framework. We exemplify the framework's more complex dimensions with the formal/informal distinction, as this has been one of the most agreed-upon dimensions of style (Heylighen et al., 1999; Labov, 2006). Additionally, we use the simple/complex dimension, which has been used in connection with linguistic-stylistic choices as well (Haaften and Leeuwen, 2021; Pavlick and Nenkova, 2015). We exemplify the framework's simpler style components (i.e., characteristics) with numb3r substitutions and contraction usage. See Table 1 for examples of each component.

Controlling for Content. It is difficult to clearly separate style from content (Stamatatos, 2017; Gero et al., 2019). Specific scenarios might correlate with both style and content. For example, in a job interview applicants might use a more formal style and talk more about their profession than in a more informal setting at a bar. A model that generally rates texts about jobs as formal and texts about beverage choices as informal might then perform well at style prediction. In other words, models that correctly use style features could sometimes be indistinguishable from those that use topical features. To control for content, we use parallel paraphrase datasets (§4.1), which consist of a set of sentences written in one style and a parallel set of paraphrases written in another.
Task Setup. We test a method's style measuring capability with tasks of the setup shown in Figure 1. The sentences (S1 and S2) have to be ordered to match the order of the anchor sentences (A1 and A2). Here, 'r u' (A1) and 'Oh yea' (S2) are written in a more informal style than their respective paraphrases A2 ('Are you') and S1 ('Oh, and also'). Thus, the correct order is S2, then S1. We call this setup the quadruple setup. Additionally, we explore a second task setup, the triple setup, which leaves out anchor 2 (A2). There, the task is to decide which of the two sentences matches the style of anchor 1 (A1) best. The two setups are similar to the triple and quadruple training instances in the field of metric learning (e.g., de Vazelhes et al., 2020; Law et al., 2016; Kaya and Bilge, 2019).

Task Generation
We describe the task instances of STEL: First, we generate potential task instances (§4.1). Second, we describe problems with the generated instances (i.e., ambiguity, §4.2). Third, we filter out the problematic instances via crowd-sourcing (§4.3).

Potential Task Instances
We generate potential task instances on the basis of parallel paraphrase datasets written in style 1 and style 2 respectively. For each style 1/style 2 paraphrase pair (the anchors in Table 1), we randomly select another sentence pair (the sentences in Table 1). Again at random, we decide which of the anchor pair is anchor 1 (A1) and which is anchor 2 (A2), and fix that ordering for all future considerations. We do the same for the sentence pair. The answer to the STEL task (Figure 1) is labeled S1-S2 if A1 was taken from the same style set as S1, e.g., both from style 1. Otherwise, the order is reversed.
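A minimal sketch of this generation procedure, assuming two aligned lists of paraphrases (`style1[i]` and `style2[i]` are paraphrases of each other); function and variable names are illustrative, not the released code:

```python
import random

def generate_task_instances(style1, style2, seed=0):
    """Build potential STEL task instances from two aligned paraphrase sets.

    style1[i] and style2[i] are paraphrases of each other, written in
    style 1 and style 2 respectively.
    """
    rng = random.Random(seed)
    instances = []
    indices = list(range(len(style1)))
    for i in indices:
        # pick a different paraphrase pair to serve as the alternative sentences
        j = rng.choice([k for k in indices if k != i])
        # randomly fix which style appears as A1 vs. A2 (and S1 vs. S2)
        a1, a2 = (style1[i], style2[i]) if rng.random() < 0.5 else (style2[i], style1[i])
        s1, s2 = (style1[j], style2[j]) if rng.random() < 0.5 else (style2[j], style1[j])
        # label is S1-S2 iff A1 and S1 come from the same style set
        same = (a1 == style1[i]) == (s1 == style1[j])
        instances.append({"A1": a1, "A2": a2, "S1": s1, "S2": s2,
                          "order": "S1-S2" if same else "S2-S1"})
    return instances
```

Fixing the random seed keeps the generated anchor/sentence ordering reproducible across runs, mirroring the paper's requirement that the ordering is "fixed for all future considerations".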
Formal/Informal Dimension. We use the test and tune split of the Entertainment_Music GYAFC subcorpus (Rao and Tetreault, 2018) as the parallel paraphrase dataset. It consists of a set of informally phrased sentences and a parallel set of crowd-sourced formal paraphrases. We generate 918 potential STEL formal/informal task instances.
Simple/Complex Dimension. We use the test and tune split from Xu et al. (2016). It consists of English Wikipedia sentences and 8 crowd-sourced simplifications per sentence. For each Wikipedia sentence, we randomly draw the parallel paraphrase out of the 8 simplifications. We discard simplifications with a character edit distance of 3 or lower from the original sentence. From this parallel paraphrase dataset, we generate 1195 potential STEL simple/complex task instances.
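The closeness filter can be sketched with a plain character-level Levenshtein distance; the threshold of 3 comes from the text, while the function names are illustrative:

```python
def char_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over characters (insert/delete/substitute, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def keep_pair(original: str, simplification: str, min_dist: int = 4) -> bool:
    """Discard simplifications with a character edit distance of 3 or lower."""
    return char_edit_distance(original, simplification) >= min_dist
```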
Contraction Characteristic. We generated the parallel contraction dataset from the December 2018 abstract dump of English Wikipedia. The Wikipedia style guide discourages contraction usage and provides a dictionary of contractions that should be avoided. We use an adapted version of this dictionary to select 100 sentences where an apostrophe is present and a contraction is possible. Such a sentence could be "It is near Thomas's car". For each sentence, we generate a parallel sentence with a contraction, e.g., "It's near Thomas's car", cf. Table 1.

Figure 2: Triple Problem. Tasks are generated from sentence pairs (A1, A2) and (S1, S2) that are split along the same style dimension (e.g., formal/informal). For each pair, only the order on the axis (e.g., S2 < S1) but not the absolute localization is known. This might lead to a wrong generated label for the triple setup. Here, removing A2 leads to S2 being stylistically closer to A1, whereas the generated label would be S1.

Number Substitution Characteristic. We selected a pool of potential sentences where words contained character substitution symbols (4, 3, 1, !, 0, 7, 5) or are part of a manually selected list of number substitution words (see Appendix). Then, we manually filtered out sentences without number substitutions (e.g., common measuring units or product numbers). We selected 100 sentences, 50 of which were selected to contain at least one additional number that is not part of a number substitution word (e.g., Anchor 1 in Table 1). This setup ensures that the task is not as simple as checking whether there are numbers present in the sentence. To generate the parallel phrases, we manually translated the sentences to contain no number substitutions. As we looked for naturally occurring number substitution words, we decided to keep word pairs that contain additional changes besides number substitution, for example, generally different spelling (e.g., 'd00d', 'dude') or phonetic spelling (e.g., 'str8', 'straight'). We decided to replace the number substitution symbols with characters only (e.g., not with punctuation marks as seen in 's1de!!!!!1!' -> 'side!!!!!1!').
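A rough pre-filter along these lines can be sketched with the substitution symbols named above; the word list below is a hypothetical stand-in for the manually curated list in the Appendix:

```python
import re

# Character substitution symbols from the text: 4, 3, 1, !, 0, 7, 5
SUB_SYMBOLS = set("4310!75")
# Hypothetical stand-ins for the manually selected substitution words (see Appendix)
SUB_WORDS = {"4ever", "d00d", "str8", "gr8", "2day"}

def is_nb3r_candidate(sentence: str) -> bool:
    """True if any token mixes letters with substitution symbols or is a known
    substitution word: a rough pre-filter before manual inspection."""
    for token in sentence.lower().split():
        if token in SUB_WORDS:
            return True
        # letters and substitution symbols inside the same token, e.g. 'n1ce'
        if re.search(r"[a-z]", token) and any(c in SUB_SYMBOLS for c in token):
            return True
    return False
```

Plain numbers (e.g., "$30") are not flagged on their own, matching the design goal that the task cannot be solved by merely checking whether a number is present.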

Ambiguity
Manual inspection shows that the generated potential task instances of the formal/informal and simple/complex dimensions contain ambiguities: (i) Some are a result of unclear or very fine distinctions between the two parallel styles in the original data. For example, "Each band member chose an individual number as their alias towards the end of 1997." and "Towards the end of 1997, each band member chose an individual number as their alias.". The first is labelled as written in a simpler style in Xu et al. (2016). However, putting "towards the end of 1997" at the beginning of the sentence could also be understood as structuring the sentence more clearly, and thus as simpler. After manual inspection, ambiguities seem to be more prevalent for the simple/complex than the formal/informal dimension. (ii) Other ambiguities are the result of entangled additional linguistic components. For example, consider the potential task instance (A1) "He's supposed to be in jail!", (A2) "I understood he was still supposed to be incarcerated." and (S1) "green day is the best i think", (S2) "I think Green Day is the best.". The sentences are clearly split along the formal/informal dimension, leading to the label S1-S2. Still, (A1) and (S2) could also be understood as being written in a more decisive tone than (A2) and (S1), leading to the order S2-S1.
We find that the triple setup has additional theoretical limitations that can lead to ambiguity: Consider the 'Triple Problem' in Figure 2 where S2 is labelled as formal and A1 is labelled as informal. Removing A2, to get from the quadruple to a triple setup, will leave A1 closer to S2, contrary to the original labelling (see also the previous example in (ii)). Additionally, having fewer sentences in the triple setup increases the chance of a random correlation with a different linguistic component (similar to the 'decisive tone' in example (ii)).

Removing Ambiguity
Using crowd-sourced annotations, we filter the previously discussed ambiguity out of the potential formal/informal and simple/complex task instances. The simpler STEL tasks (contraction and number substitution) mostly differ in the number of apostrophes and numbers (e.g., Table 1). As a result, we expect the style characteristics to contain little to no ambiguity and do not filter them further.
Annotation Tasks. For both the triple and quadruple setup we collected annotations on a subsample of all generated task instances (301 simple/complex and formal/informal instances respectively). Then, we annotated a larger set of task instances on the quadruple setup alone. Based on performance results from the subsample ('Annotation Results'), we had 617 and 894 more task instances annotated for the formal/informal and simple/complex dimension respectively.

Table 2: Annotation Results. We filter out ambiguous task instances via annotations. In (a), we display inter-annotator agreement (Fleiss's κ) and annotation accuracy (acc.) for the sample and total of potential task instances on the quadruple and triple setup for the simple/complex (c) and the formal/informal (f) dimensions. We also display the number of task instances per dimension (n). In (b), we display the share of all combinations of correct (✓) and wrong (✗) annotations per dimension and task setup. The union of ✓✓ and ✓✗ cases makes up a majority.
Annotation Setup. We used annotations from 839 different Prolific crowdworkers with 5 distinct annotators per potential task instance. We paid participants £10.21/h on average. All annotators were native English speakers, as we assume them to have a better intuition about their language. See the Appendix for further details.
Annotator Agreement. In Table 2a, we report inter-annotator agreement with Fleiss's κ (Fleiss, 1971), as κ allows different items to be rated by different sets of raters. Inter-annotator agreement is only moderate. This does not mean that the annotations are of poor quality: as discussed in §4.2, our generated data contains ambiguous, noisy or faulty task instances. Manual inspection confirms that low annotator agreement is a sign of ambiguity (see also the 'Annotation Analysis' Table in the Appendix). This problem is more pronounced for the simple/complex than the formal/informal dimension. We ensured annotator quality with screening questions (Appendix Table 1) and by selecting annotators with the highest platform-internal rating.
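For reference, Fleiss's κ can be computed from a matrix of per-item category counts; a minimal sketch, assuming (as here) a fixed number of raters per item:

```python
def fleiss_kappa(counts):
    """Fleiss's kappa. counts[i][j] = number of raters assigning item i to
    category j; every item must be rated by the same number of raters."""
    N = len(counts)        # number of items
    n = sum(counts[0])     # raters per item
    k = len(counts[0])     # number of categories
    # observed agreement per item, averaged over items
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # chance agreement from the marginal category proportions
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

With 5 annotators and 2 possible orderings per STEL instance, each row of `counts` would hold the vote split, e.g. `[4, 1]` for a 4-vs-1 decision.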
Annotation Results. Results are reported in Table 2a. Annotation accuracy is the share of correctly annotated task instances (by a majority of at least 3) out of all potential task instances. The accuracy and the inter-annotator agreement are considerably higher for the formal/informal dimension (Table 2a) than for the simple/complex dimension. This aligns with our expectation of more ambiguity in the simple/complex task instances (cf. §4.2(i)). Similarly, our expectations regarding theoretical problems with the triple setup (§4.2) are confirmed: accuracy for the sample is generally higher for the quadruple than the triple setup. There are more examples where the quadruple setup was correctly annotated but the triple setup was not (✓✗ in Table 2b) than there are of the opposite kind (✗✓).
As a consequence, the annotation of the bigger set of task instances was only done on the quadruple setup. On the total set of potential task instances (which includes the sample) we obtained similar accuracy and annotator agreement as on the sample (see Table 2a). We filter the potential task instances by only keeping those that were correctly annotated by a majority (i.e., at least 3/5). This leaves 822 task instances for the formal/informal and 815 for the simple/complex dimension. We randomly remove 7 task instances from the formal/informal dimension for equal representation of the two style dimensions. In the following, and under the name STEL, we will only consider the quadruple setup on the 1830 filtered task instances (i.e., 815, 815, 100 and 100 for simple/complex, formal/informal, number substitution and contraction respectively).

Evaluation
We use our STEL framework to test several models and methods that could be expected to capture style information (§5.1). We describe how the models decide the STEL tasks (§5.2) and discuss their performance on STEL (§5.3).

Table 3: STEL Results. We display STEL accuracy for different language models and methods. Random performance is at 0.5. The share of task instances for which a method decides randomly, as it cannot decide between the two options ('=' in Equation 1), is given in the 'random' column. Both the performance on the set of task instances before (full) and after crowd-sourced filtering (filter) is displayed. The two best accuracies are boldfaced. The BERT-based models perform the best, followed by the "deepstyle" style sentence embedding method. On average, methods perform best for the c'tion and worst for the simple/complex dimension.

Style Measuring Methods
We describe methods and models that can be used to calculate a (style) similarity. Given two sentences, the methods return a similarity value between 0 and 1 or between -1 and 1 (when using cosine similarity), where 1 represents the highest similarity.

Authorship Attribution Methods. The following methods are inspired by successful or commonly used approaches in authorship attribution (Neal et al., 2017; Sari et al., 2018). We use character 3-gram similarity by calculating the cosine similarity between the frequencies of all character 3-grams. We calculate the word length similarity via the average word lengths a and b of two sentences: 1 − |a − b|/max(a, b). We calculate the punctuation similarity by using the cosine similarity between the frequencies of punctuation marks.

Other Methods. We also experiment with the "deepstyle" model (Hay et al., 2020) by taking the cosine similarity between the style vector representations. Additionally, we consider the following sentence features: NLTK POS tags (Bird et al., 2009), via the cosine similarity between the frequency vectors, and the share of cased characters (e.g., Sari et al. (2018)), via 1 minus the difference between the proportions of cased characters. We also include the edit distance as a simple baseline.
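The three authorship-attribution-inspired similarities can be sketched as follows; tokenization and normalization details are illustrative choices, not necessarily the paper's exact implementation:

```python
import math
import string
from collections import Counter

def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in set(c1) | set(c2))
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def char_3gram_sim(s1: str, s2: str) -> float:
    """Cosine similarity between character 3-gram frequencies."""
    g1 = Counter(s1[i:i + 3] for i in range(len(s1) - 2))
    g2 = Counter(s2[i:i + 3] for i in range(len(s2) - 2))
    return cosine(g1, g2)

def word_length_sim(s1: str, s2: str) -> float:
    """1 - |a - b| / max(a, b) over average word lengths a and b."""
    a = sum(map(len, s1.split())) / max(len(s1.split()), 1)
    b = sum(map(len, s2.split())) / max(len(s2.split()), 1)
    return 1 - abs(a - b) / max(a, b) if max(a, b) else 1.0

def punctuation_sim(s1: str, s2: str) -> float:
    """Cosine similarity between punctuation mark frequencies."""
    p1 = Counter(c for c in s1 if c in string.punctuation)
    p2 = Counter(c for c in s2 if c in string.punctuation)
    return cosine(p1, p2)
```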

Results
Performance results are shown in Table 3. The accuracy is a weighted mean of 0.5 (proportional to the share of undecided instances, cf. 'random' in Table 3) and the accuracy in the decided cases. Random guessing would show an accuracy of exactly 0.5. Stylistic differences can be subtle for the STEL dimensions and we expect this to be a hard task to solve. In contrast, the STEL characteristics (i.e., contraction and number substitution) should be easier to solve (via detecting an additional apostrophe or number) and are especially interesting for model error analysis. Note: We do not make general quality judgements because models were not trained on the components of STEL and were often not even meant to measure style directly.

BERT outperforms RoBERTa. The BERT base models perform better than RoBERTa across the STEL tasks (significance tested with McNemar's test (McNemar, 1947)). A possible explanation of RoBERTa's reduced performance might be the removal of the next sentence prediction (NSP) task. Adjacent sentences could generally be more similar in style than a random different sentence, possibly making NSP a valuable learning objective for style similarity learning. To look at this further, we experiment with the BERT NSP head on the cased and uncased base model. For the quadruple setup, we calculate the four 'similarity' values as described in Equation 1 by using the predicted softmax probability that A1 is followed by S1 for sim(A1, S1). The other similarities are calculated equivalently. Interestingly, BERT's cased NSP head (accuracy of 0.71) performs better than RoBERTa (p < 0.001) across the STEL tasks.
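Equation 1 is not reproduced in this excerpt; a natural reading of the quadruple setup is to compare the two possible anchor-sentence pairings by their summed similarities, with ties ('=') left undecided and scored as random. A sketch under that assumption, together with the weighted accuracy described above:

```python
def decide_order(sim, A1, A2, S1, S2):
    """Order S1/S2 to match A1/A2 by comparing the two pairings of
    anchor-sentence similarities; returns None when undecided (tie)."""
    keep = sim(A1, S1) + sim(A2, S2)   # evidence for order S1-S2
    swap = sim(A1, S2) + sim(A2, S1)   # evidence for order S2-S1
    if keep > swap:
        return "S1-S2"
    if swap > keep:
        return "S2-S1"
    return None  # undecided: scored as random (0.5)

def stel_accuracy(predictions, gold):
    """Weighted mean: undecided instances count 0.5, decided ones 0 or 1."""
    scores = [0.5 if p is None else float(p == g)
              for p, g in zip(predictions, gold)]
    return sum(scores) / len(scores)
```

Any similarity function with the interface `sim(sentence_a, sentence_b) -> float` plugs in here, which is what makes the framework applicable to both direct similarity methods and vector-representation methods.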
The effect of training objectives on learning style information could be explored in future work.
(Semantic) Sentence Embedding Methods perform well. SBERT para-mpnet (0.68) trained on the paraphrase data performs better than SBERT mpnet (0.61, p < 0.001) and USE (0.59, p < 0.001). Overall, SBERT para-mpnet is the third best performing model after the base BERT models and the best performing model in the nb3r dimension. In future work, it could be interesting to explore the effect of different training data on the performance of embedding models.
LIWC alone does not perform well. On the style dimensions, LIWC performs similarly to the random baseline, possibly because the LIWC-based methods often find no difference between the two possible orderings (10%, 71% and 32% of tasks). The difference between the three LIWC-based methods is not significant (p > 0.05). Future work could explore models that consider more fine-grained differences between LIWC categories.
Authorship attribution methods perform better than random. Character 3-grams and punctuation perform at 0.58 accuracy on the formal/informal dimension. Considering some of the informal examples, punctuation seems to be one of the most prominent visible changes from a formal to an informal style (see Appendix). Interestingly, word length is the method that most clearly performs better on the simple/complex than the formal/informal dimension. This aligns with the intuition that shorter words are a sign of a simpler style as found in Paetzold and Specia (2016).
Casing encodes style information. The uncased BERT model performs worse than the cased one (0.74 vs. 0.77, p = 0.008). Additionally, the cased letter ratio performs slightly better than random for the formal/informal dimension (0.55) and perfectly for the contraction characteristic (1.0): when a sentence contains fewer lower-cased characters (as a result of removing them when using contractions), the share of upper-cased characters increases.
Style embedding yields promising results. The method "deepstyle" (Hay et al., 2020) performs well across STEL components (0.66). It performs the worst on the simple/complex dimension (0.55). The method embeds sentences in a vector space where texts by "similar" authors are similarly embedded. In the training data (blog and news articles), authors might not consistently use one style over the other. The difference between same author and same style could be explored in future work.
Less ambiguous task instances reach higher accuracy values. Table 3 (cf. 'full') shows the accuracy of the style measuring methods for the complete set of potential task instances (§4.1) before filtering out ambiguity. The accuracies are the same as or lower than on the crowd-validated task instances in STEL. The differences are more pronounced for the simple/complex than the formal/informal dimension. This aligns with the higher (expected) ambiguity in the simple/complex dimension (§4.2 and §4.3). In general, we recommend using the filtered, less ambiguous STEL task for testing.

Limitations and Future Work
Our illustrative set of task instances does not cover all possibilities of style variation. Future work could extend STEL to cover additional style dimensions or more fine-grained task instances using several sources of data. The STEL task instances for one style component can contain correlations with unconsidered (style) components. Consider the following task instance (shortened for readability): (A1) "Forty-nine species of pipefish [...] have been recorded.", (A2) "Forty-nine type of pipefish [...] have been found", (S1) "Patients [...] must have their liver checked for damage and other side effects." and (S2) "[...] patients [...] must be monitored for liver damage and other possible side effects.". (A2) and (S1) are the simpler versions of (A1) and (S2) (Xu et al., 2016). Additionally, the sentences vary along other aspects: (A2) is missing the punctuation mark and includes a misspelling. (S1) differs in content from (S2), as (S1) only considers effects on the liver while (S2) also includes other side effects.
However, those aspects did not change the label given by the annotators (S2-S1) and should mostly be secondary to the considered style dimension.
With STEL, language models and methods are tested only on whether they capture clear differences in style when content is approximately the same. When there are also content differences, such models might put more emphasis on content than on stylistic aspects. Our framework could be extended to allow testing whether a model prefers style over content (e.g., with a new task format where sentence 1 is closer in content to anchor 1 but closer in style to anchor 2, cf. Figure 1). STEL could also be extended to test for individual author styles and style variation related to the social or regional background of authors (e.g., different age groups), for example, by including sentence pairs with the same content but written by different authors. Current and future dimensions could also be extended by a train/dev/test split to enable training on the task directly. Further, STEL could be enriched by including longer texts (e.g., paragraphs or documents) as anchor and alternative sentences.

Conclusion
Style is an integral part of language. However, there are only a few benchmarks for linguistic style. In this work, we introduce STEL, a modular, content-controlled and fine-grained similarity-based style evaluation framework. Of the evaluated language models and methods, the cased BERT base model performs best on STEL, while simpler sentence features perform close to the random baseline. STEL includes two general style dimensions and two specific style characteristics. We hope that this framework will grow to include an even more exhaustive representation of linguistic style and will facilitate the development of improved style(-sensitive) measures.

Task Usage
When using this task, please also cite the original datasets the task instances were generated from: (1) Rao and Tetreault (2018), (2) Baumgartner et al. (2020), and (3) Xu et al. (2016).

Ethical Considerations
The STEL tasks are based on datasets (Rao and Tetreault, 2018; Baumgartner et al., 2020; Xu et al., 2016) from popular online forums and web pages (Yahoo! Answers, Reddit, Wikipedia). However, the user base of these platforms is often skewed towards particular demographics. For example, Reddit users are more likely to be young and male. 10 Thus, our dataset might not be representative of (English) language use across different social groups. Further, the usage of posts from online platforms without explicit consent from users might lead to, among others, privacy concerns. The Wikipedia simplifications and formal Yahoo! Answers paraphrases were generated by consenting crowdworkers (Xu et al., 2016; Rao and Tetreault, 2018). We expect the sentences that were extracted from Wikipedia for the contraction and complex/simple dimensions to raise minimal privacy concerns, as they were meant to be read and copied by a broad public. 11 The Rao and Tetreault (2018) data and the nb3r dimension do not include user names. However, we acknowledge that users might be identifiable from the exact wording of posts. We removed nb3r substitution instances that included Reddit user names. We hope the ethical impact of reusing these already published datasets remains limited.

A Removing Ambiguity
Annotation Setup Information. Prolific 12 crowdworkers could participate up to 5 times in annotating different generated tasks from the formal/informal and simple/complex style dimensions. Each time a participant was asked to annotate 14 potential task instances as well as 2 additional screening questions. The screening questions were randomly sampled from a list of 10 screening questions (see Table 5). The screening questions were manually created and then unanimously and correctly answered by 3 lab-internal annotators in the triple setting.
We filtered out all crowdworkers who wrongly answered any of the screening questions. We display the task description (Figures 5 and 6) as well as the phrasing of the questions (Figures 3 and 4). Participants were paid £10.21/hour on average (above the £8.91 UK minimum wage) and gave consent to the publication of their annotations.
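The screening-based filtering described above can be sketched as follows; the data layout and the function name are hypothetical and do not reflect the actual annotation pipeline:

```python
# Keep only annotations from crowdworkers who answered every screening
# question correctly. Hypothetical minimal layout: each annotation is a dict
# with worker_id, item_id, is_screening and is_correct fields.

def filter_workers(annotations):
    """Return the non-screening annotations of workers who passed all screens."""
    failed = {
        a["worker_id"]
        for a in annotations
        if a["is_screening"] and not a["is_correct"]
    }
    return [
        a for a in annotations
        if a["worker_id"] not in failed and not a["is_screening"]
    ]


annotations = [
    {"worker_id": "w1", "item_id": "t1", "is_screening": True,  "is_correct": True},
    {"worker_id": "w1", "item_id": "t2", "is_screening": False, "is_correct": True},
    {"worker_id": "w2", "item_id": "t1", "is_screening": True,  "is_correct": False},
    {"worker_id": "w2", "item_id": "t3", "is_screening": False, "is_correct": True},
]
kept = filter_workers(annotations)
print([a["worker_id"] for a in kept])  # → ['w1'] — w2 failed a screen
```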
We required annotators to be native speakers, as we assume them to have a better intuition about their language than non-native speakers: during study design, we conducted a pilot study with 8 non-native annotators, several of whom felt their English-speaking abilities were insufficient for the task. The pilot also indicated a higher perceived and measured difficulty of the simple/complex dimension. As a result, we required annotators to be native speakers and generated more potential simple/complex than formal/informal tasks.
Annotation Results (Table 2a). Compared to the overall accuracy on the sample, the accuracy is higher with opposite filtering for all dimensions (i.e., complex and simple) and setups (i.e., triple and quadruple). The increase is higher for the quadruple than for the triple setup.
Additional Annotation Results. To further examine the difference between the quadruple and triple setups, we display additional results in Table 4.
Here, we only consider the potential task instances that were correctly annotated in the opposite setup. For example, we take the task instances that were correctly annotated in the quadruple setup and see how many of them were also correctly annotated in the triple setup (in this case 0.68). One goal of this analysis was to see whether we can use annotations from the quadruple setup to also remove the ambiguities from the triple setup (or the other way around). However, in both cases accuracy is somewhat low (i.e., below 0.9) and we decided against such an approach. See Table 6 for examples of every combination of (in)correctly annotated triple and quadruple setup of a potential task instance.
[Table 6 appears here in the original layout. Its example rows (one sentence-pair example per combination of annotation outcomes) were garbled during extraction and are omitted; recoverable cell values include 208 ≈ 0.691 for a formal/informal combination and 111 ≈ 0.369 for a simple/complex combination.]
Table 6: Annotation analysis. For the simple/complex and the formal/informal dimensions, we give the number of occurrences of each combination of correct (✓) and wrong (✗) annotations in the triple (T) and quadruple (Q) setting. For every combination and style dimension an example is given. The share is calculated out of 301 examples. In total, 602 examples were annotated in both the Q and T settings, with 301 per style dimension. The most common cases are the two combinations in which the T and Q settings agree, together totaling 68.1% and 88.7% of the cases for the simple/complex and formal/informal dimensions respectively. There are ambiguous examples, where one could argue for both possible orders. After manual inspection, this seems to be more prevalent for the simple/complex dimension, but it also happens for the formal/informal style dimension. E.g., in one formal/informal example, Anchor 1 could be understood as more formal (e.g., 'gentleman') or more informal (e.g., '!' and an unusual grammatical structure). Another formal/informal example is an instance of the 'triple problem'.

B Additional STEL Results
In Table 7, we display the share of task instances where models and methods could not decide between the two possible answers. This adds more detail to the 'random' column of Table 3. The share of random decisions is lower for the more complex style dimensions (formal/informal: 0.05 and simple/complex: 0.13) and higher for the simpler style characteristics (nb3r substitution: 0.38 and contrac'tion usage: 0.15). This aligns with the intuition that the difference between the sentence pairs in the nb3r and contrac'tion dimensions is smaller. The neural methods have a lower share of random decisions overall.
Table 7: Share of Random Decisions. The share of task instances for which a method cannot decide between the two options and decides randomly is given per dimension. The performance on the set of task instances before (full) and after crowd-sourced filtering (filter) is displayed. The two highest shares of random decisions are boldfaced. The share of random decisions is highest for the nb3r and lowest for the formal dimension. LIWC (style) and punctuation similarity have the overall highest share of random decisions.
The same argument applies to inequality (1) when replacing similarities (i.e., 1 − sim(x, y)) with distances (i.e., |x − y|). Thus, inequality (1) holds when working with style-sensitive similarity functions that can be translated to distances. Note: as only the cosine 'angular distance' is a distance metric, this would need to be the angular cosine similarity. However, angular cosine similarity can be replaced by cosine similarity in inequality (1), as the relative ordering is the same for the two similarity metrics.
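As a quick numerical illustration of the last point (our own check, not code from the paper): because arccos is strictly decreasing on [-1, 1], angular cosine similarity and plain cosine similarity always rank two candidate sentences identically relative to an anchor, so either can be used in the inequality:

```python
import math


def cosine_sim(u, v):
    """Plain cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def angular_sim(u, v):
    """Angular cosine similarity: 1 minus the angle normalized by pi."""
    cos = max(-1.0, min(1.0, cosine_sim(u, v)))  # clamp for float safety
    return 1.0 - math.acos(cos) / math.pi


# For any anchor and two candidates, both similarities produce the same
# relative ordering, since acos is monotonically decreasing.
anchor, cand1, cand2 = [1.0, 0.2], [0.9, 0.3], [0.1, 1.0]
by_cosine = cosine_sim(anchor, cand1) > cosine_sim(anchor, cand2)
by_angular = angular_sim(anchor, cand1) > angular_sim(anchor, cand2)
print(by_cosine, by_angular)  # → True True: the two rankings agree
```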

E Computing Infrastructure
The evaluation of the 18 (language) models and methods took 14 hours in total on a machine with 32 GB RAM and 8 Intel i7 CPU cores running Ubuntu 20.04 LTS. No GPU was used.