Contextualizing Variation in Text Style Transfer Datasets

Text style transfer involves rewriting the content of a source sentence in a target style. Despite there being a number of style tasks with available data, there has been limited systematic discussion of how text style datasets relate to each other. This understanding, however, is likely to have implications for selecting multiple data sources for model training. While it is prudent to consider inherent stylistic properties when determining these relationships, we also must consider how a style is realized in a particular dataset. In this paper, we conduct several empirical analyses of existing text style datasets. Based on our results, we propose a categorization of stylistic and dataset properties to consider when utilizing or comparing text style datasets.


Introduction
The general task of text style transfer involves rewriting source content in a target style. Currently, there are a number of text style transfer tasks with available data, such as formality (Rao and Tetreault, 2018), bias (Pryzant et al., 2020), sentiment (He and McAuley, 2016), humor or romance (Gan et al., 2017), offensiveness (Nogueira dos Santos et al., 2018), authorship or time period (Xu et al., 2012), and personal attributes (Kang et al., 2019). While these specific tasks are often modeled in isolation, the general task definition remains consistent. As such, a natural question arises: what is the relationship between the stylistic variation of these specific tasks?
Stylistic variation can arise from a number of factors such as communicative intent, topic, and speaker-receiver dynamics (Biber and Conrad, 2019), yet within the task of text style transfer, our view of a style is constrained to the context of each specific dataset. Therefore, understanding the tasks as well as the relationships between different tasks requires considering the stylistic properties and potential contextual and social factors (Hovy and Yang, 2021; Hovy, 2018) underpinning them, as well as the dataset characteristics (Bender and Friedman, 2018) and the intersection of influences giving rise to the realization of style within a dataset.
From an application standpoint, considering these influences can provide a more comprehensive understanding of important textual features. There is already a body of work on identifying generic features to increase target-task performance, or on computing the similarity of textual features to select data for transfer learning (Ruder and Plank, 2017). In the context of text style transfer, these approaches first require understanding what features should be shared across tasks. For example, prior work leveraged the stylistic features shared between grammatical error correction data and formality data to increase model performance on formality transfer datasets.
In addition to textual features such as stylistic properties, existing work also suggests that the context of dataset creation should be taken into account when identifying compatible data or assessing possible out-of-distribution generalizability. For example, the similarity between how sentiment information is reflected in different domains affects adaptation performance, and many models can achieve high performance on natural language inference tasks through task-limiting annotation artifacts (Gururangan et al., 2018; Poliak et al., 2018). In other words, factors such as data source and annotation method can create underlying textual features that can impact performance and limit generalizability. Thus, in combination, these existing works on leveraging inherent stylistic similarities or similar style representations in different dataset domains, as well as on identifying task-limiting dataset properties (Gururangan et al., 2018; Poliak et al., 2018), indicate that analysis of both stylistic properties and dataset characteristics, as well as the potential interdependencies between them, is warranted.

Table 1: An overview of the datasets used for exploratory analyses. Task describes the source-target direction used in our experiments, and Domain and Annotation show general categorizations. Size provides statistics of the data splits, with standard, pre-existing data splits used when available.
In this paper, we consider two primary categories of textual variation within the context of text style transfer: stylistic characteristics and dataset characteristics. We perform a series of empirical analyses to demonstrate the visible influence of both style and dataset characteristics on the performance of text style transfer models. Then, we present a categorization of style and dataset properties for consideration when utilizing or comparing style transfer datasets. Finally, we discuss the downstream applications for contextualizing variation in text style datasets, including multi-task learning, data selection, and generalizability. Our work and suggestions fall within the context of and align with recent work on incorporating social factors in natural language processing systems (Hovy and Yang, 2021) and characterizing datasets (Bender and Friedman, 2018).

Empirical Analyses
As an exploratory step, we question whether we can distinguish differences arising from style or dataset properties when comparing empirical results across datasets. We identify a set of aligned English datasets used for supervised text style transfer that exhibit differences in style, annotation method, and domain. We further restrict our selection to datasets in which a single stylistic attribute is transferred between classes. Specifically, we look at GYAFC-EM & GYAFC-FR (Rao and Tetreault, 2018), Shakespeare (Xu et al., 2012), Biased-word (Bias) (Pryzant et al., 2020), Fluency (Wang et al., 2020; Godfrey et al., 1992), and Flickr (Gan et al., 2017). We provide dataset overviews in Table 1, with detailed dataset descriptions provided in Appendix A. We perform a preliminary qualitative analysis to get an initial impression of the data differences.
First Impression of Data: Of the six datasets, four were manually annotated and two were automatically annotated. For manually annotated datasets, GYAFC-EM and GYAFC-FR utilized crowdsourced rewrites, Flickr utilized crowdsourced sentences with only visual context shared between annotators, and Fluency utilized expert annotations of the target attribute. Both automatically annotated datasets (Bias, Shakespeare) were created through identification of existing data sources. While each style task is unique (other than two domains of GYAFC for formality), in terms of style we observe that Shakespeare has a significantly different temporal context than all other datasets, and Fluency involves a stylistic attribute that, ideally, the sentence pairs in all other datasets should possess. 1 Beyond our qualitative observations, we perform an exploratory multi-task learning experiment, described in the following subsection.

Multi-Task Learning
As a toy experiment, we ask the question "What would our results look like if we naively train on all style transfer tasks, with no considerations beyond the fact that the tasks share a general task definition?"2 We essentially ignore all considerations of style or dataset properties. Our expectation is that negative transfer will occur due to the lack of consideration for factors such as domain (Pan and Yang, 2009),3 but we are interested in whether all tasks share similar performance patterns or whether performance on any tasks diverges from the overall set. If the latter, is there any intuitive explanation for the divergences?
We further expect that the degree of negative transfer will be impacted by the degree of difference in stylistic or data properties, relative to the full set of pre-training datasets. Specifically, we anticipate some alignment with our initial impression of the data: the alternate temporal context of Shakespeare may increase the degree of negative transfer, while the inherent stylistic connection with Fluency may lessen it.

Experimental Setup
We utilize two experimental settings: GPT-2 directly fine-tuned on each dataset, and GPT-2 with multi-task pre-training on all datasets followed by fine-tuning on each target dataset. For both settings, we initialize GPT-2 with the pre-trained parameters from Radford et al. (2019). For our multi-task experimental setup, we follow prior works (Liu et al., 2015, 2019; Raffel et al., 2020) to perform multi-task learning for the baseline GPT-2 model (Wang et al., 2019): we jointly pre-train on all style tasks in a supervised manner and then fine-tune on each individual style transfer task.4

For multi-task learning, we construct our pre-training dataset by randomly shuffling the training examples from all datasets. During pre-training, each training example from each individual task is seen at least once per epoch. All of the training examples in the largest dataset are seen exactly once per epoch, while all training examples in the smallest dataset are seen multiple times per epoch (proportional to the ratio between the training set sizes of the largest-scale task and the smallest-scale task). For the fine-tuning step, we leverage the multi-task pre-trained model and further fine-tune it on each individual supervised task, saving the model with the lowest validation set loss as our final model for evaluation.

2 The general task definition is rewriting the source content of a text in a target style (see section 1).
3 Negative transfer occurs when transferred knowledge negatively impacts target performance (Pan and Yang, 2009).
4 GPT-2 models were each trained on a single NVIDIA GTX 1080 Ti GPU.

Table 2: Experiments conducted using GPT-2, where BLEU-og represents directly fine-tuning the original GPT-2 on the target task, BLEU-mt represents multi-task pre-training using all datasets and then fine-tuning on the target task, and %og represents the relative performance of multi-task pre-training in comparison to the performance of the original GPT-2 (computed by dividing BLEU-mt by BLEU-og).
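A minimal sketch of one way to realize the sampling scheme described above, where smaller datasets are oversampled to match the largest task per epoch. The function name, pairing format, and the use of a partial random copy to fill the remainder are illustrative assumptions, not taken from the paper's released code:

```python
import random

def build_multitask_pool(datasets, seed=0):
    """Combine training pairs from several style transfer tasks into one
    shuffled pre-training pool. Smaller datasets are oversampled so that
    every task's examples appear at least once per epoch, proportionally
    to the size ratio against the largest task.

    `datasets` maps a task name to a list of (source, target) pairs.
    """
    rng = random.Random(seed)
    largest = max(len(pairs) for pairs in datasets.values())
    pool = []
    for task, pairs in datasets.items():
        # Full copies plus a random partial copy to match the largest size.
        repeats, remainder = divmod(largest, len(pairs))
        pool.extend(pairs * repeats + rng.sample(pairs, remainder))
    rng.shuffle(pool)
    return pool
```

Under this scheme, one epoch over the pool sees each example of the largest task exactly once and each example of a task half its size roughly twice.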

Results
We report BLEU (Papineni et al., 2002) in Table 2 as a measure of content preservation. 5 We compare the performance of directly fine-tuning the original GPT-2 on the target task (BLEU-og) against first multi-task pre-training the original GPT-2 and then fine-tuning it on the target task (BLEU-mt). Negative transfer is identified as a performance drop in BLEU-mt, i.e., %og < 1.00. Since the style transfer datasets in use are diverse across domains and stylistic properties, we expect negative transfer to occur in the multi-task learning setting. However, we are specifically looking at the overall performance pattern as an initial step in determining which properties may underlie such differences and should be accounted for in a taxonomy.
While most tasks perform within a 12% margin below the original GPT-2 performance, we observe two divergences: with multi-task learning, the Shakespeare-to-modern task performed at less than 50% of the original GPT-2 performance, and the disfluent-to-fluent task experienced a slight performance increase. Performance on Fluency exceeded our initial expectation that the degree of negative transfer would simply be lower compared to other datasets, but overall the divergences with Shakespeare and Fluency match our expectations based on our initial impression of the stylistic differences in the data. Specifically, we attribute the performance drop on the Shakespeare dataset to limited suitability for combining the data sources, likely because its stylistic attribute pertains to a different temporal context, and we attribute the performance increase on the Fluency dataset to high suitability for combining the data sources, likely because its stylistic attribute pertains to a textual criterion assumed to be inherent to the other data.
With regard to dataset differences, we note the potential impact of dataset size on performance: to maintain consistency of the model architecture, we utilize the same GPT-2 model configuration across datasets and experimental settings. In the case of performance on the Flickr dataset (see Table 1), it is possible that such a model configuration may overfit the dataset. However, this alone fails to account for the divergences we observe in the performance pattern.
Beyond the overall pattern, we observe an unexpectedly wide range of BLEU scores across datasets, which we expect could be attributable to differences in either dataset creation or style. There may be stylistic differences in how style information is encoded that impact content preservation. For example, some styles may have more words that encode both style and content information, which may increase the difficulty of content retention (Cao et al., 2020), while other styles may be characterized by stylistic attributes encoded in only a few key words or phrases (Fu et al., 2019). However, these differences may also be attributable to dataset creation. We expect that if the attribute-encoding words are constrained to a few words or phrases as a property of the style itself, then a dataset's style classes should be highly distinguishable using lexical features; in other words, the decision boundary when classifying styles should stay at the lexical level (Fu et al., 2019).
To test these hypotheses and help explain the range of BLEU scores, we perform two complementary experiments. First, we compute sentence similarity metrics averaged over each dataset to 1) identify whether there is a relationship between BLEU scores and baseline sentence pair similarities, and 2) identify datasets with high similarity across class boundaries that constrain stylistic attributes to a few words or phrases. Second, we perform classification and ablation studies using a set of linguistic features defined on each dataset. For datasets with high sentence similarities, if a style can be well represented by a few style-encoding words or phrases, then we expect high classification performance using only lexical features. Conversely, if a style cannot be isolated to a few words and phrases, we expect low classification performance using lexical features alone, in which case a high sentence similarity is likely attributable to dataset properties rather than inherent style properties.

Table 3: Jaccard Similarity (JS), Levenshtein Distance (LD), normalized Levenshtein Distance (LD-norm), and F1-score. Sentence similarity measures quantify the distance between target and source for the training sets, with arrows indicating the direction of more similar sentences.

Similarity Metrics
We calculate token-based Jaccard Similarity, token-based Levenshtein distance, and F1-score between the source and target training sets. We also report Levenshtein distance normalized by sentence length, LD_norm(s, t) = LD(s, t) / max(|s|, |t|), where LD(s, t) is the Levenshtein distance, s and t refer to the sentences in a sentence pair, and |·| refers to the number of tokens in a sentence. Scores are reported in Table 3.6 We see some relationships between the similarities in Table 3 and the GPT-2 performances in Table 2, in that the datasets with the lowest BLEU scores (Shakespeare and Flickr) have the lowest baseline similarities, and the datasets with the highest BLEU scores (Fluency and Bias) have the highest baseline similarities. We therefore identify the Fluency and Bias datasets as being of particular relevance for the linguistic features analysis. Specifically, our hypothesis is that if the Bias and Fluency styles can truly be isolated to a few words, as the sentence similarities would suggest, then classification performance should be high using only lexical features. In contrast, if dataset properties influence variation through a constrained stylistic representation, then we expect low classification accuracy using lexical features.
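These pairwise measures are straightforward to reproduce. A sketch of token-level Jaccard similarity and (normalized) Levenshtein distance, assuming sentences are already tokenized; the exact tokenization used for Table 3 is not specified here:

```python
def jaccard_similarity(src_tokens, tgt_tokens):
    """Ratio of shared to total unique tokens between two sentences."""
    s, t = set(src_tokens), set(tgt_tokens)
    return len(s & t) / len(s | t)

def levenshtein(src_tokens, tgt_tokens):
    """Token-level edit distance via dynamic programming (two rows)."""
    m, n = len(src_tokens), len(tgt_tokens)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if src_tokens[i - 1] == tgt_tokens[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def normalized_levenshtein(src_tokens, tgt_tokens):
    """LD_norm(s, t) = LD(s, t) / max(|s|, |t|)."""
    return levenshtein(src_tokens, tgt_tokens) / max(len(src_tokens),
                                                     len(tgt_tokens))
```

Averaging these scores over all sentence pairs in a training set yields the dataset-level values reported in Table 3.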

Linguistic Features Analysis
We define linguistic features to refer to properties characterizing textual variation primarily at the lexical or syntactic level, where the "other" category in Table 4 indicates features that may capture slight semantic variation (subjectivity) or reflect overall lexical tendencies (bag-of-words). Features are adopted from prior works (Pavlick and Tetreault, 2016; Abu-Jbara et al., 2011; Roemmele et al., 2017) and listed in Table 4, with further description in Appendix C. We train logistic regression classifiers with L1 regularization and feature scaling on the full feature set for each text style dataset. Next, we train and subsequently test classifiers with all features ablated except a specified subset, and identify important features as those with minimal relative performance drop compared to full-feature classification accuracy. Results are shown in Table 5. We further quantify the magnitude of variation by computing the Jensen-Shannon (JS) divergence for each feature, and bold the cells in Table 5 corresponding to features with divergences ≥ 0.075.7

Datasets with the lowest BLEU scores (Flickr and Shakespeare) have salient class features distributed more widely across linguistic levels, further reflected in a higher number of features with large divergence magnitudes (≥ 0.075). For the high-BLEU, high-similarity datasets of interest (Bias, Fluency), the inverse is true. For Bias and Fluency we see consistently low classification performance across ablations, including the lexical feature ablations. These results support our hypotheses and further suggest that neither stylistic differences nor dataset characteristics alone can be used to relate text style datasets. Rather, both influences as well as their interactions require consideration.

7 Table 6 in Appendix D shows a JS-divergence heatmap.
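The per-feature divergence can be computed directly. A minimal sketch of the base-2 Jensen-Shannon divergence between a feature's two class-conditional distributions; how each feature is discretized into these distributions is an assumption left unspecified here:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions,
    using base-2 logarithms so the result lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i = 0 contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike KL divergence, this quantity is symmetric and bounded, which makes a fixed threshold such as 0.075 meaningful across features.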
In the following section, we propose a taxonomy of style and dataset property categories that can contribute to variation in text style transfer datasets. Additionally, we note that when introducing these properties, we view style as the targeted stylistic property within the context of a text style dataset.

Variation From Style and Data Properties
Our empirical analyses demonstrate the visible influence of both style and dataset properties on how a style is represented in a given dataset. Beyond the brief mentions of the influence of dataset creation in section 1, we can identify an intuitive reason for these dual influences. While linguistic approaches exist to analyze textual variation (Halliday and Matthiessen, 2013; Holmes and Wilson, 2017; Biber, 2012), we suggest that the processes of linguistics-based stylistic analysis and text style transfer typically proceed in inverse directions: linguistic analysis may start from human-written text and then analyze its stylistic variation, whereas text style transfer may start from pre-existing ideas of targeted stylistic variation and then create datasets of human-written text that meet those stylistic expectations. In other words, to create a text style transfer dataset or train a text style transfer model, the researcher must have a notion of the desired style against which to judge the resulting artifact. Intuitively, this process can lead to process-attributable variation secondary to and alongside the intended stylistic variation. Based on our results and observations, we consider stylistic properties to be properties influencing textual variation that are inherent to a particular style, and dataset properties to be factors influencing textual variation due to how a particular dataset was created. We detail style and dataset properties in the following subsections and visualize the major distinctions in Figure 1.

Figure 1: Framework overview visualizing the style and dataset properties discussed throughout section 3. Grey boxes indicate example considerations within each category. We contextualize both style and dataset properties within language and sociocultural context, as all language is implicitly reflective of these influences (Hovy and Yang, 2021).

Stylistic Properties
We group stylistic properties under two broad categories: style entanglement and style type.

Style Entanglement
Although some recent approaches to style transfer model style and content words separately (Li et al., 2018), or try to disentangle style and content representations (John et al., 2019; Kazemi et al., 2019), this approach may be less effective when used to transfer styles in which a higher ratio of words embeds both style and content information. We can consider this ratio of dual embedding a property inherent to the style. Specifically, we can consider how entangled the style is with the content or semantic meaning, where content entanglement refers to whether changes to the style result in additions to or reductions in the total content details, and meaning entanglement refers to whether changes to the style can retain the content details but alter the semantic meaning. As an example of this distinction, sentiment transfer, which has previously been regarded as transfer between negative and positive styles (Shen et al., 2017; Prabhumoye et al., 2018), alters semantic meaning while retaining most content, yet transferring between styles such as expert-to-layman can retain meaning but lead to reductions in content detail due to the difficulty of preserving content from professional sentences (Cao et al., 2020).

Style Type
Style can refer to the individuating sense or the evaluative sense of a text (Crystal and Davy, 1969). We refer to evaluative styles as styles distinguished by general properties that address overall textual quality, corresponding to rules of usage and composition and effectiveness of expression (Strunk and White, 1999), or based on overall quality evaluations and judgments (Williams and Bizup, 2017). For these styles, stylistic variation occurs solely along evaluative lines, independent of situational context or language choice. From our empirical experiments, we can consider the Fluency dataset representative of a dataset in which the transferred stylistic attribute refers to an evaluative sense of style.
We consider descriptive styles as distinguished by stylistic properties that characterize textual variation through influences such as the underlying communicative intent, the situational or social factors influencing language choice, and the attributes of the producer of the text. We can further differentiate descriptive styles by the stability or variability of the targeted stylistic property.

Stability of Targeted Style Properties
On one end of the spectrum, variable stylistic properties (high variance, low stability) are characterized by dynamically shifting language to convey information in a certain way, which may reflect factors such as the underlying intent in producing the text or the social dynamics of a situation. For example, politeness can shift based on social dynamics such as social distance and relative power between participants (Brown et al., 1987), independently of the directness of communication, such as formality8 in email (Peterson et al., 2011). From our empirical experiments, we consider Flickr, GYAFC, and Bias reflective of variable targeted properties.
At the other end of the spectrum, more stable targeted stylistic properties (low variance, high stability) remain more consistent across social situations and arise from relatively stable internal or external context. These may reflect internal context such as the personal attributes of the producer of text (Kang et al., 2019), or external context such as the temporal context at time of text production or stylistic properties inherent to the mode of distribution. Example datasets include the PASTEL dataset (Kang et al., 2019) annotated for personal attributes such as gender and age group, and the Shakespeare dataset (Xu et al., 2012) which can be considered reflective of authorship (Xu, 2017) or temporal context. 9

Dataset Properties
While in the previous section we discussed properties inherent to specific styles, in this section we discuss properties of datasets to which textual variation is attributable. We identify two broad categories: properties due to creation method and properties due to data source. In this context, creation method refers to the general method of creating sentence pairs (automatic or manual annotation, as well as any properties arising from utilizing a specific method, such as influences of annotator background or perceptions), and data source refers to characteristics (such as domain) of where the source data was collected. We provide more detailed discussion in the following subsections.

Creation Method
Generally speaking, datasets can be created via manual annotation, such as through judgments or rewrites, or via automatic annotation, such as through filtering data that has a target attribute (i.e., detection with a classifier). With particular attention to manual annotation, in addition to potential generalizability-limiting data properties arising from artifacts of the annotation method and annotation type ((Geva et al., 2019); also see section 1), the annotators themselves can influence stylistic variation. For example, model performance has been improved by incorporating annotator identifiers as features (Geva et al., 2019) and by augmenting machine translation models with distinct translator styles identifiable in the training data (Wang et al., 2021). In the case of Wang et al. (2021), using annotator styles resulted in BLEU score variations of up to +4.5 points.

8 Formality is closely related to politeness (Kang and Hovy, 2021).
9 Regarding distribution mode, Abu-Jbara et al. (2011) suggested a set of linguistic features differentiating written and audio styles.
Underlying these influences, annotator properties that may give rise to textual variation include the background of the annotators, such as experts or crowd-sourced workers, and the perception the annotators have of the style task. Similar to human evaluation of outputs, differences in perception may arise from personal understanding or from the wording of the instructions presented.10
Data Source - Domain: Differences in domain can be reflected in entirely different word meanings and contexts of use, as well as in different manners of encoding attribute information such as sentiment (Blitzer et al., 2007). In addition to differences in a single style between domains, the domains themselves have different levels of stylistic diversity (Kang and Hovy, 2021). Further, while the properties characterizing a style may be inherent to how the style is realized within a domain, there is a distinction in how the style is reflected across domains that necessitates considering domain as a dataset property influencing variation in text style datasets.

Interplay Between Style and Data Properties
Bender and Friedman (2018) proposed data statements for documenting dataset contextual factors such as language variety, speaker demographics, annotator demographics, speech situation, and text characteristics (e.g. genre, topic). The style and dataset properties we discuss as potentially contributing to variation in text style transfer datasets show some alignment with those proposed for data statements as such factors contribute to linguistic variation in a general sense. However, our categorization specifically operates within the context of text style transfer datasets for which there are unique considerations and important distinctions between sources of variation and downstream implications or applications.
In the previous subsections, we discussed style properties and dataset properties to which variation in text style transfer datasets can be attributed. In this section, we discuss the interdependence of style and data properties in text style transfer datasets in terms of context-dependence of and interactions between sources of variation.
Style and Data Property Interactions While we previously considered the potential impact of both style and dataset characteristics independently, these characteristics may have underlying interactions and influences on one another. Specifically, certain types of stylistic properties may be more or less amenable to certain dataset creation methods or sources, and vice versa.
With regard to the stability of stylistic properties, dataset properties such as annotation method may be indirectly influenced when transferring across relatively stable stylistic properties. For example, machine translation models have been found to exhibit stylistic bias through reflecting demographically-biased training data (Hovy et al., 2020). While this demonstrates that the demographics of annotators can serve as an important dataset characteristic, it also demonstrates the potential to transfer across relatively stable stylistic properties, such as personal attributes (Kang et al., 2019). However, as the stylistic properties are inherent to the annotator, there may be constraints on dataset creation through manual data annotation, such as potential limitations and additional considerations for using methods such as human judgments. This underscores additional considerations for and potential challenges of selecting data from two styles that may have underlying influences on how datasets are constructed.
Context-Dependence of Variation Relatedly, contextual considerations come into play with respect to the Shakespeare to Modern English style transfer task, a dataset also reflective of transfer across stable, contextual boundaries. The Shakespeare to Modern English transfer task can be considered as transferring across temporal contexts, or as transferring the characteristic style of a single author (Xu, 2017). In this case, while an influence of sociocultural context is apparent when considering the original data sources, the targeted stylistic variation occurs across such context boundaries. Thus, the source of variation for textual features arising from external context lies in whether the intent is present for a dataset to represent a transfer across context boundaries, rather than being an artifact reflecting the specifics of dataset creation. This is illustrated in Figure 1 as a dashed line connecting style type to dataset properties.
With further regard to dataset creation, it is important to acknowledge that while we consider many properties arising from social influences to be dynamic and variable influences giving rise to particular styles, a dataset will, to some degree, indirectly and inadvertently reflect such social context during creation. As such, we must also consider social factors that are not related to the actual targeted style but rather arise from the dataset creation process. As an example, we cannot simply say that two sentiment datasets from the same general domain (such as restaurant reviews) are equivalent if one was constructed with reviewers who had anonymity (in a sense mitigating some of the direct social pressure or influence) and the other with reviewers who were not anonymous and were thus subject to increased social pressure. By understanding both data and style differences and their interactions within a particular context, these potential differences or hidden influences can be more easily identified. In summary, the interactions between style and data properties are complex. While we have suggested interactions between context and sources of influence, there are likely correlations based on sources of variation that future work can investigate.

Influences and Applications
In the previous sections, we demonstrated visible influences of style and dataset properties on performance, categorized a set of style and dataset properties for consideration, and discussed the potential interactions between sources of variation. We conclude by discussing several applications of understanding the sources of variation in text style transfer datasets. Specifically, we look at multi-task learning, domain adaptation, and generalizability.

Multi-Task Learning and Domain Adaptation
Multi-task learning aims to jointly train a model with auxiliary tasks to complement learning of the target task. When determining which auxiliary objectives to incorporate, multi-task learning for various NLP tasks has been shown to benefit from knowledge about both dataset characteristics and stylistic properties. For example, multi-task learning performance gains for NLP tasks such as POS tagging and text classification are predictable from dataset characteristics (Kerinec et al., 2018; Bingel and Søgaard, 2017). With regard to stylistic properties, prior work on multi-task learning for style transfer achieved performance gains by leveraging an intuitive stylistic connection between formality data and grammatical error correction data.11
While multi-task learning can be viewed as a form of parallel transfer learning, we can view domain adaptation as a form of sequential transfer learning and look at similar applications of contextualizing stylistic variation. Prior work found that leveraging generic style and content information outperformed generic content information alone for domain adaptation; however, the closeness of sentiment information (the target attribute) in the source and target domains impacted performance. In other words, how the style was reflected in the particular dataset (i.e., a dataset characteristic) was related to the benefit provided by the adaptation. Based on the combined evidence in this section, we can thus support applying analysis of both style and dataset properties for transfer learning data selection, including multi-task learning and domain adaptation, in text style transfer. We suggest that the taxonomy presented in this paper can assist exploration of systematic data selection methods in these and related application areas.
Generalizability One of the underlying motivations for pursuing multi-task learning and domain adaptation is the issue of generalizability. In the context of style transfer, we can consider generalizing a model for one style across different data distributions with the same stylistic attribute, or across similar domains yet different stylistic attributes. In either case, how the model learns to represent the generic style or content information is vital for successful transfer. As we have demonstrated throughout the prior sections, considering both style and dataset properties can aid in identifying the dimensions along which stylistic attributes may significantly differ, or which artifacts or influences of dataset creation may affect generalizability secondary to any stylistic considerations. Considerations to this end may prove beneficial both in the dataset creation process and when considering how a model may perform beyond a specific dataset.

11 Other styles, such as impoliteness and offense, are also highly dependent on each other (Kang and Hovy, 2021).

Conclusion
In this paper, we conducted a set of exploratory analyses to assess the visibility and influence of both style and dataset characteristics on text style transfer. Based on these observations, we proposed a categorization of stylistic and dataset properties that can contribute to variation in text style transfer datasets, and described the applications in which these properties may be influential, limiting, or leverageable.

B Similarity Metrics
In Table 3 we do not distinguish between source and target direction due to the symmetry of metrics in our setting.12 We provide further justification below. Jaccard similarity can be defined as
$J(s^{(k)}, t^{(k)}) = \frac{|V_{s^{(k)}} \cap V_{t^{(k)}}|}{|V_{s^{(k)}} \cup V_{t^{(k)}}|}$
where $V_{s^{(k)}}$ denotes the set of vocabulary words existing in a source sentence $s^{(k)}$ and $V_{t^{(k)}}$ denotes the set of vocabulary words existing in a target sentence $t^{(k)}$. By the commutativity of set intersection and union, $J(s^{(k)}, t^{(k)}) = J(t^{(k)}, s^{(k)})$, making Jaccard similarity symmetric. Word-based Levenshtein distance is defined as the minimum number of edit operations to convert $s^{(k)}$ to $t^{(k)}$ through insertions, deletions, and substitutions. Substitutions are symmetric by definition, and the insert and delete operations used to convert $s^{(k)}$ to $t^{(k)}$ are simply reversed when converting $t^{(k)}$ to $s^{(k)}$. In $LD_{norm}(s, t)$, we normalize by $\max(|s|, |t|)$, which is invariant to order. Finally,
$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
where $\mathrm{precision} = \frac{TP}{TP+FP}$ and $\mathrm{recall} = \frac{TP}{TP+FN}$. In our setting, $TP = |\{w : w \in s \cap t\}|$, $FP = |\{w : w \in s \setminus t\}|$, and $FN = |\{w : w \in t \setminus s\}|$. By these definitions, $FP$ and $FN$ are exchanged when source and target are reversed, and therefore $F_1$ is symmetric when comparing source and target sentence pairs.13

12 https://www.sparknotes.com/
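The symmetry argument above can be checked directly. The following is a minimal sketch of the three metrics (token-level Jaccard, normalized word-based Levenshtein distance, and unigram F1), assuming whitespace tokenization; it is an illustration of the definitions, not the paper's implementation.

```python
# Sketch of the three symmetric similarity metrics: token-level Jaccard,
# normalized word-based Levenshtein distance, and unigram F1.

def jaccard(s: str, t: str) -> float:
    # |V_s ∩ V_t| / |V_s ∪ V_t| over whitespace tokens.
    vs, vt = set(s.split()), set(t.split())
    return len(vs & vt) / len(vs | vt)

def levenshtein_norm(s: str, t: str) -> float:
    # Word-level edit distance, normalized by max(|s|, |t|).
    a, b = s.split(), t.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(a), len(b))

def unigram_f1(s: str, t: str) -> float:
    # TP = shared tokens, FP = source-only tokens, FN = target-only tokens.
    vs, vt = set(s.split()), set(t.split())
    tp, fp, fn = len(vs & vt), len(vs - vt), len(vt - vs)
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Swapping the source and target arguments leaves each score unchanged, matching the symmetry argument above.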

C Linguistic Features
Lexical Complexity Lexical complexity refers to the complexity of words based on word length or number of syllables. We use average word length in characters (Pavlick and Tetreault, 2016) and average number of syllables per word, computed both with and without stopwords.
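A minimal sketch of these two features follows. The vowel-run syllable counter and the small stopword set are simplifying assumptions for illustration, not the exact resources used in the paper.

```python
# Sketch of lexical complexity features: average word length in characters
# and average syllables per word, with or without stopwords.
import re

# Illustrative stopword subset (an assumption; a full list would be used in practice).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is"}

def count_syllables(word: str) -> int:
    # Approximate syllable count as the number of contiguous vowel groups.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def lexical_complexity(sentence: str, keep_stopwords: bool = True):
    words = sentence.lower().split()
    if not keep_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    avg_len = sum(len(w) for w in words) / len(words)
    avg_syl = sum(count_syllables(w) for w in words) / len(words)
    return avg_len, avg_syl
```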
Lexical Diversity Size of vocabulary has been used as a feature for style categorization in prior work (Abu-Jbara et al., 2011). We chose to include unigrams and bigrams to reflect diversity of vocabulary as well as diversity of expression.
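The diversity features above can be sketched as unigram and bigram vocabulary sizes over a corpus; whitespace tokenization and lowercasing are simplifying assumptions.

```python
# Sketch of lexical diversity features: counts of distinct unigrams
# (vocabulary diversity) and distinct bigrams (diversity of expression).
def vocab_sizes(sentences):
    unigrams, bigrams = set(), set()
    for sent in sentences:
        toks = sent.lower().split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return len(unigrams), len(bigrams)
```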
Jensen-Shannon Divergence We also present the Jensen-Shannon divergence results in numerical format in Table 6.

Table 6: Jensen-Shannon divergence between source and target on each test set using the feature groupings in Table 4. Scores ≥ 0.075 are bolded.
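For reference, a minimal sketch of the Jensen-Shannon divergence between two feature distributions follows, assuming inputs are probability distributions over a shared support; with base-2 logarithms the score is bounded in [0, 1].

```python
# Sketch of Jensen-Shannon divergence between two probability
# distributions p and q: JSD = 0.5*KL(p||m) + 0.5*KL(q||m), m = (p+q)/2.
import math

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # KL divergence with the 0*log(0) = 0 convention.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Note that JSD is symmetric in its arguments, consistent with reporting a single source/target score per test set.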