Idiosyncratic but not Arbitrary: Learning Idiolects in Online Registers Reveals Distinctive yet Consistent Individual Styles

An individual’s variation in writing style is often a function of both social and personal attributes. While structured social variation, e.g., gender-based variation, has been extensively studied, far less is known about how to characterize individual styles due to their idiosyncratic nature. We introduce a new approach to studying idiolects through a massive cross-author comparison to identify and encode stylistic features. The neural model achieves strong performance at authorship identification on short texts, and an analogy-based probing task shows that the learned representations exhibit surprising regularities that encode qualitative and quantitative shifts of idiolectal styles. Through text perturbation, we quantify the relative contributions of different linguistic elements to idiolectal variation. Furthermore, we provide a description of idiolects by measuring inter- and intra-author variation, showing that variation in idiolects is often distinctive yet consistent.


Introduction
Linguistic identities manifest through ubiquitous language variation. The notion that language functions as a stylistic resource for the construction and performance of social identity rests upon two theoretical constructs: sociolect and idiolect (Grant and MacLeod, 2018). The term 'sociolect' refers to socially structured variation at the group level, whereas 'idiolect' denotes language variation associated with individuals (Wardhaugh, 2011; Grant and MacLeod, 2018). Variationist sociolinguistics emphasizes systematic variation in sociolects along dimensions such as gender, ethnicity, and socioeconomic stratification (Labov, 1972). While a central concept in sociolinguistics, idiolect has received far more research attention in forensic linguistics (Wright, 2018; Grant and MacLeod, 2018).
Although idiolects have played a central role in stylometry and forensic linguistics, which seek to quantify and characterize individual textual features to separate authors (Grant, 2012;Coulthard et al., 2016;Neal et al., 2017), the theory of idiolect remains comparatively underdeveloped (Grant and MacLeod, 2018). An in-depth understanding of the nature and the variation of idiolect not only sheds light on the theoretical discussion of language variation but also aids practical and forensic applications of linguistic science.
Here, we characterize the idiolectal variation of linguistic styles through a computational analysis of large-scale short online texts. Specifically, we ask the following questions: 1) to what extent can we extract distinct styles from short texts, even for unseen authors; 2) what are the core stylistic dimensions along which individuals vary; and 3) to what extent are idiolects consistent and distinctive? By using deep metric learning, we show that idiolect is in fact systematic and can be quantified separately from sociolect, and we introduce a new set of probing tasks for testing the relative contributions of different linguistic variations to idiolectal styles. Second, we find that the learned representations of idiolect also encode some stylistic dimensions with surprising regularity (see Figure 1), analogous to the linguistic regularities found in word embeddings. Third, with our proposed metrics for style distinctiveness and consistency, we show that individuals vary considerably in their internal consistency and distinctiveness of idiolect, which has implications for the limits of authorship recognition and the practice of forensic linguistics. For replication, we make our code available at https://github.com/lingjzhu/idiolect.

Idiolectal variation
Theoretical questions "Idiolect" remains a fundamental yet elusive construct in sociolinguistics (Wright, 2018). The term has been abstractly defined as the totality of the possible utterances one could say (Bloch, 1948; Turell, 2010; Wright, 2018). Idiolect as a combination of one's cognitive capacity and sociolinguistic experiences (Grant and MacLeod, 2018) raises many interesting linguistic questions. First, how are idiolects composed? Forensic studies often focus on a few linguistic features as capturing a person's idiolect (Coulthard, 2004; Barlow, 2013; Wright, 2013) but few have explicitly offered an explanation as to why particular word sequences were useful or not (Wright, 2017). Frameworks for analyzing individual variation have been proposed (Grant, 2012; Grant and MacLeod, 2020) but the contributions of different linguistic levels to idiolectal variation are seldom explicitly measured.

[Figure 1. Left: Lower-casing all letters shifts all original texts in the same direction (blue dots → orange dots; e.g., "I love Cantonese BBQ!" → "i love cantonese bbq!"). Right: The magnitude of movement in one direction is proportionate to the number of null-subject sentences in texts (blue dots → orange dots → green crosses; e.g., "I went out. I bought durians." → "Went out. I bought durians." → "Went out. Bought durians.").]
Sociolinguists have pursued the relationship between idiolect and sociolect. However, the perceived idiosyncrasies of idiolect often render it secondary to sociolect as an object of study (Labov, 1989). Multiple scholars have suggested that idiolects are the building blocks of various sociolects (Eckert, 2012; Barlow, 2013; Wright, 2018). While some studies have probed the relations between the language of individuals and that of the group (Johnstone and Bean, 1997; Schilling-Estes, 1998; Meyerhoff and Walker, 2007), it remains unclear to what extent idiolects are composed of sociolects or have unique elements of their own (see Barlow, 2013, for a review). We test this hypothesized relationship in Appendix F and find preliminary quantitative evidence that idiolect representations encode some information about sociolects.
The other theoretical question, relevant to cognitive science as well as forensic linguistics, is to what extent an individual's idiolect is distinctive and consistent against a background population (Grant, 2012; Grant and MacLeod, 2018). Studies in forensic linguistics (Johnson and Wright, 2014; Wright, 2013, 2017) provide confirmatory answers to both questions, yet these studies only focus on a specific set of features for a small group of authors. At the other extreme, the high performance of machine learning applied to authorship verification and attribution (Kestemont et al., 2018, 2019, 2020) stems from placing more emphasis on separating authors (distinctiveness) than on consistency. An empirical investigation of these two concepts in a relatively large population remains to be conducted. Here we aim to quantify these two linguistic constructs in large-scale textual datasets.
Stylometry and stylistic similarities Traditional stylometry often relies on painstaking manual analysis of texts for a closed set of authors (Holmes, 1998). Surface linguistic features, especially function words or character n-grams, have been found to be effective in authorship analysis (Kestemont, 2014; Neal et al., 2017). Despite the overwhelming success of deep learning, traditional features remain effective in authorship analysis (Kestemont et al., 2018, 2019, 2020). Yet the wide application of machine learning and deep learning in recent years has greatly advanced the state-of-the-art performance in authorship verification (Boenninghoff et al., 2019a,b; Weerasinghe and Greenstadt, 2020). Recent PAN Authorship Verification shared tasks suggest that characterizing individual styles in long texts can be solved with almost perfect accuracy (Kestemont et al., 2020); as a result, stylometric studies have increasingly focused on short texts in social media or online communications for authorship profiling or verification (Brocardo et al., 2013; Vosoughi et al., 2015; Boenninghoff et al., 2019b). Our study points to a promising application of sociolinguistics for authorship detection in social media, as idiolectal variation is evident and consistent in texts as short as 100 tokens.

Learning representations of idiolects
In forensic linguistics, textual similarity was traditionally quantified as the proportion of shared vocabulary and the number and length of shared phrases or characters (n-grams) (Coulthard, 2004). More sophisticated statistical methods to perform textual comparison have been developed over the years (Neal et al., 2017;Kestemont et al., 2020).
To learn representations of idiolectal style, we propose using a proxy task of authorship verification, where, given two input texts, a model must determine if they were written by the same author or not (Neal et al., 2017). The identification is performed by scoring the two texts under comparison with a linguistic similarity measure and, if their linguistic similarity measure exceeds a certain threshold, the two texts are judged to be written by the same author.
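The threshold-based decision rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the embeddings and the threshold value here are toy assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two style embeddings."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_author(z_i, z_j, threshold=0.5):
    """Predict same-author (True) if similarity exceeds the threshold."""
    return cosine_similarity(z_i, z_j) > threshold

# Toy embeddings: the first pair is closely aligned, the second is not.
print(same_author([1.0, 0.9, 0.1], [0.9, 1.0, 0.0]))  # True
print(same_author([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # False
```

In practice the embeddings come from the trained Siamese encoder and the threshold is derived from the training margins, as described below.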

Task definition Given a collection of text pairs X = {p_1, p_2, . . . , p_t} by multiple authors and labels Y = {y_1, y_2, . . . , y_n}, we aim to identify a function f_θ that can determine whether two text samples x_i and x_j are written by the same author (y = 1) or by different authors (y = 0).
Stylometric similarity learning Our model for extracting stylistic embeddings from input texts is the same as the Sentence RoBERTa or BERT network (SBERT/SRoBERTa) (Reimers and Gurevych, 2019). For a text pair (x_i, x_j), the Siamese model f_θ maps both text samples into embedding vectors (z_i, z_j) in the latent space such that z_i = f_θ(x_i) and z_j = f_θ(x_j). Rather than using the [cls] token as the representation, we use attention pooling to merge the last hidden states [h_0, . . . , h_k] into a single embedding vector that represents textual style:

a = softmax(W_2 σ(W_1 [h_0, . . . , h_k])),   z = Σ_i a_i h_i

where W_1 and W_2 are learnable parameters and σ(·) is the ReLU activation function. Stylometric similarity between the text pair is then measured by a distance function d(z_i, z_j); here we mainly consider cosine similarity. The underlying models are RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019). Specifically, we used roberta-base or bert-base-uncased as the encoder.

Loss function The classic max-margin loss for deep metric learning was shown to be effective in previous work on stylometry (Boenninghoff et al., 2019a,b). Inspired by Kim et al. (2020), we used a continuous approximation of the max-margin loss to learn the stylometric distance between users. The additional hyperparameter in this loss allows fine-grained control of the penalty magnitude for hard samples.
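The attention pooling described above can be sketched in NumPy. This is a rough sketch under stated assumptions: the weight shapes and single-query scoring are illustrative, not the authors' exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, W1, W2):
    """Merge last hidden states H (k x d) into one style vector.

    Each token state is scored with a small two-layer network
    (ReLU in between); the pooled vector is the softmax-weighted sum.
    """
    scores = W2 @ np.maximum(H @ W1, 0.0).T   # one scalar score per token
    weights = softmax(scores.ravel())
    return weights @ H                        # (d,) pooled style embedding

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))    # 5 token states, hidden size 8
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(1, 4))
z = attention_pool(H, W1, W2)
print(z.shape)  # (8,)
```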
The loss function is an adaptation of the proxy-anchor loss proposed by Kim et al. (2020). To minimize the distance between same-author pairs and maximize the distance between different-author pairs, the model was trained with a contrastive loss with pre-defined margins {τ_s, τ_d} over the set of positive samples P+ and negative samples P−:

L = Softplus(log Σ_{(i,j)∈P+} exp(−α(d(z_i, z_j) − τ_s))) + Softplus(log Σ_{(i,j)∈P−} exp(α(d(z_i, z_j) − τ_d)))
where Softplus(z) = log(1 + e^z) is a continuous approximation of the max function and α is a scaling factor for the penalty on out-of-the-margin samples. The out-of-margin samples are exponentially weighted through the Log-Sum-Exp operation, so that hard examples receive exponentially growing weights and the model is prompted to focus on them. During inference, we compare the textual distance d(x_1, x_2) with the threshold τ_t, the average of the two margins, τ_t = (τ_s + τ_d)/2. We set {τ_s = 0.6, τ_d = 0.4} and α = 30. Details of the hyperparameter search can be found in Appendix A.
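A hedged NumPy sketch of this loss follows, under the assumption that d(·,·) yields cosine similarities and that the Log-Sum-Exp runs over each pair set as described; it is an illustration of the mechanics, not the authors' training code.

```python
import numpy as np

def softplus(x):
    # log(1 + e^x), computed stably
    return np.logaddexp(0.0, x)

def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.exp(x - m).sum())

def contrastive_softplus_loss(sim, labels, tau_s=0.6, tau_d=0.4, alpha=30.0):
    """Soft contrastive loss over pairwise cosine similarities `sim`.

    Same-author pairs (label 1) are pushed above tau_s; different-author
    pairs (label 0) below tau_d. Log-Sum-Exp weights out-of-margin pairs
    exponentially, so harder pairs dominate the penalty.
    """
    sim, labels = np.asarray(sim, float), np.asarray(labels)
    pos, neg = sim[labels == 1], sim[labels == 0]
    loss = 0.0
    if pos.size:
        loss += softplus(logsumexp(-alpha * (pos - tau_s)))
    if neg.size:
        loss += softplus(logsumexp(alpha * (neg - tau_d)))
    return float(loss)

well_separated = contrastive_softplus_loss([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
confused = contrastive_softplus_loss([0.3, 0.7], [1, 0])
print(well_separated < confused)  # True
```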
Baseline methods We also compare our models against several baseline methods in authorship verification: 1) GLAD. Groningen Lightweight Authorship Detection (GLAD) (Hürlimann et al., 2015) is a binary linear classifier using multiple handcrafted linguistic features. 2) FVD. The Feature Vector Difference (FVD) method (Weerasinghe and Greenstadt, 2020) is a deep learning method for authorship verification using the absolute difference between two traditional stylometric feature vectors. 3) AdHominem. AdHominem is an LSTM-based Siamese network for authorship verification in social media domains (Boenninghoff et al., 2019a). 4) BERT-Concat/RoBERTa-Concat. This model is a fine-tuned BERT/RoBERTa that concatenates the two texts under comparison; authorship is determined by performing binary classification on the <cls> token (Ordoñez et al., 2020). Implementation details are provided in Appendix B.

Data
Amazon reviews Our dataset was extracted from the release of the full Amazon review dataset up to 2018 (Ni et al., 2019). We filtered out reviews shorter than 50 words to ensure sufficient text to reveal stylistic variation. We only retained users with reviews in at least two product domains (e.g., Electronics and Books) and at least five reviews in each domain. After text cleaning, the dataset contained 128,945 users. We partitioned 40%, 10%, and 50% of users into training, development, and test sets, yielding 51,398, 12,849, and 64,248 unique users respectively. As one of our goals is to analyze stylistic variation, we reserved the majority of the data (50% of users) for model evaluation and subsequent linguistic analysis. The maximum length of all samples was limited to 100 tokens.
Negative sampling For each user, we randomly sampled six pairs of texts written by the same user as positive samples of same-authorship (SA). For negative samples, we randomly sampled six texts from the rest of the data and paired them with the original text. In order to improve generalizability across domains, we enforced a sampling scheme that half of the positive/negative samples were matched in domain while the other half were cross-domain.
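The sampling scheme can be sketched as follows. This is a hypothetical simplification: it ignores the in-domain / cross-domain balancing described above, and the corpus structure is a toy assumption.

```python
import random

def sample_pairs(user_texts, target_user, n_pairs=6, seed=0):
    """Build same-author (SA, label 1) and different-author (DA, label 0)
    pairs for one user. `user_texts` maps user id -> list of texts.
    """
    rng = random.Random(seed)
    own = user_texts[target_user]
    # Pool of texts by all other users, used for negative samples.
    others = [t for u, ts in user_texts.items() if u != target_user for t in ts]
    pairs = []
    for _ in range(n_pairs):
        pairs.append((rng.choice(own), rng.choice(own), 1))     # SA pair
        pairs.append((rng.choice(own), rng.choice(others), 0))  # DA pair
    return pairs

corpus = {"a": ["t1", "t2", "t3"], "b": ["u1", "u2"], "c": ["v1"]}
pairs = sample_pairs(corpus, "a")
print(len(pairs))  # 12
```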
Reddit posts To test generalizability, we additionally constructed a second dataset from the online community Reddit and ran a subset of experiments on it. The top 200 subreddits were extracted via the ConvoKit package (Chang et al., 2020). Only users who had posted texts longer than 50 words in more than 10 subreddits were included, resulting in 55,368 unique users. We partitioned 60%, 10%, and 30% of users into training, development, and test sets respectively. Each user's idiolect is represented by 10 posts from 10 different subreddits. The binary labels were then generated by randomly sampling SA and DA pairs, using the same negative sampling procedure as in the creation of the Amazon dataset.

Results on proxy task
To test whether our model does recover stylistic features, we first test its performance on the proxy task of authorship verification, contextualizing the performance with other models specifically designed for that task. Performance on authorship verification is evaluated by accuracy, F1, and the area-under-the-curve score (AUC) (Kestemont et al., 2020). Table 1 suggests that all models are able to recover at least some distinctive aspects of individual styles even in these short texts. Deep learning-based methods generally achieve better verification accuracy than GLAD (Hürlimann et al., 2015) and FVDNN (Weerasinghe and Greenstadt, 2020), models based on traditional linguistic features. The Siamese architecture demonstrates its usefulness in the authorship verification task, as AdHominem (Boenninghoff et al., 2019a) and SRoBERTa perform better than the pre-trained transformer RoBERTa-Concat. These results confirm that models recognize authors' stylistic variation. Our error analysis also shows that identifying the same author across domains, or different authors in the same domain, poses a greater challenge to these models in general, though different model choices may exhibit different inductive biases (see Appendix C). As SRoBERTa is shown to be the most effective architecture in this task, we use these models to examine idiolectal variation for the rest of the study.

Linguistic Analysis
In this section, we seek to quantify the idiolectal variation at different linguistic levels.

Stylometric features
Idiolects vary at lexical, syntactic, and discourse levels, yet it remains unclear which type of variation contributes most to idiolectal variation.
Ordering Hierarchical models of linguistic identities hold that authorial identities are reflected at all linguistic levels (Herring, 2004; Grant, 2012; Grant and MacLeod, 2020), yet the relative importance of these elements is seldom empirically explored. To understand the contributions of lexical distribution, syntactic ordering, and discourse coherence, we test the contributions of different linguistic features to authorship verification by perturbing the input texts. To force the model to use only lexical information, we randomly permute all tokens in our data, removing information about syntactic ordering (lexical model). The organization of discourse might also provide cues to idiolectal style. To test this, we preserve the word order within a sentence but permute sentences within the text to disrupt discourse information (lexico-syntactic model). We then ran our experiments on these datasets using the same set of hyperparameters to compare model performance on these perturbed inputs.
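The two perturbations can be sketched as below; naive whitespace and period splitting stand in for real tokenization and sentence segmentation, which are assumptions of this illustration.

```python
import random

def lexical_perturb(text, seed=0):
    """Shuffle all tokens: keeps the lexical distribution, destroys syntax."""
    tokens = text.split()
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)

def lexico_syntactic_perturb(text, seed=0):
    """Shuffle sentences: keeps within-sentence order, destroys discourse."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.Random(seed).shuffle(sentences)
    return ". ".join(sentences) + "."

doc = "I went out. I bought durians. They were great."
# Token shuffling preserves the bag of words exactly.
print(sorted(lexical_perturb(doc).split()) == sorted(doc.split()))  # True
```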
Content and function words. The use of function words has long been recognized as an important stylometric feature. A small set of function words are disproportionately frequent, relatively stable across content, and seem less under authors' conscious control (Kestemont, 2014). Yet few studies have empirically compared the relative contributions between function words and content words.
To test this, we masked out all content words in the original texts with a masked token <mask>, which was recognized by the transformer models.
For comparison, we also created masked texts with only content words. Punctuation and relative positions between words were retained as this allows the model to maximally exploit the spatial layout of content/function words.
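A sketch of this masking procedure is below. The tiny hand-listed function-word set is a toy stand-in; a real experiment would use a POS tagger or a full function-word list.

```python
# Toy function-word list (an assumption for illustration only).
FUNCTION_WORDS = {"i", "the", "a", "an", "it", "is", "was", "and", "but",
                  "of", "to", "in", "for", "this", "that", "my", "not"}

def mask_words(text, keep="function"):
    """Replace either content or function words with <mask>,
    retaining punctuation and word positions."""
    out = []
    for tok in text.split():
        if all(c in ".,!?;:" for c in tok):   # punctuation is always retained
            out.append(tok)
            continue
        is_function = tok.lower().strip(".,!?;:") in FUNCTION_WORDS
        keep_tok = is_function if keep == "function" else not is_function
        out.append(tok if keep_tok else "<mask>")
    return " ".join(out)

print(mask_words("I love this poster !", keep="function"))
# I <mask> this <mask> !
```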
Results While the importance of lexical information in authorship analysis has been emphasized, it has been suggested that lexical information alone is insufficient in forensic linguistics (Grant and MacLeod, 2020). Our results in Table 2 suggest that, even with only lexical information, model performance is only about 4% lower than that of models with access to all information. Syntactic and discourse ordering do contribute to author identities, yet their contributions are relatively minor. In forensic linguistics, it is commonly the case that only fragmentary texts are available (Grant, 2012), and our findings suggest that even without broader discourse information, it is still possible to estimate author identity with good confidence. The weak contribution of discourse coherence to authorship analysis highlights that the high-level organization of texts is only somewhat consistent within authors, which has been mentioned but rarely tested in forensic linguistics (Grant and MacLeod, 2020). From Table 3, it is apparent that, even with half of the words masked out, the transformed texts still contain an abundance of reliable stylometric cues to individual writers, such that the overall accuracy is not significantly lower than that of models with full texts. While the importance of function words in authorship analysis has been emphasized (Kestemont, 2014), content words seem to convey slightly more idiolectal cues despite the topical variation. Both SBERT and SRoBERTa achieve similar performance on Amazon and Reddit data, yet SRoBERTa better exploits the individual variation in content words. These results strongly suggest that there are unique individual styles that are stable across topics, and our additional probing also reveals that topic information is significantly reduced in the learned embeddings (see Appendix F).

Analysis of tokenization methods
We hypothesized that the large performance gap between BERT and RoBERTa (~5%) could be caused by the discrepancy in the tokenization methods. The BERT tokenizer is learned after preprocessing the texts with heuristic rules (Devlin et al., 2019), whereas the BPE tokenizer for RoBERTa is learned without any additional preprocessing or tokenization of the input (Liu et al., 2019).
Method To verify this, we trained several lightweight Siamese LSTM models from scratch that differed only in tokenization method: 1) a word-based tokenizer with the vocabulary size set to either 30k or 50k to match the sizes of the BPE encodings; 2) the pre-trained WordPiece tokenizer for bert-base-uncased and bert-base-cased; 3) the pre-trained tokenizer for roberta-base. Implementation details are attached in Appendix D.
Results As shown in Table 4, the RoBERTa tokenizer outperforms the other tokenizers by a significant margin, even though it has a similar number of parameters to Word-50k. Interestingly, the pre-trained BERT tokenizer is not superior to the word-based tokenizer, despite its better handling of out-of-vocabulary (OOV) tokens. For word-based tokenizers, increasing the vocabulary from 30k to

Characterizing idiolectal styles
In this section, we turn our attention to distinctiveness and consistency in writing styles, both of which are key theoretical assumptions in forensic linguistics (Grant, 2012).
Distinctiveness We examine inter-author variation through inter-author distinctiveness by constructing a graph that connects users with similar style embeddings, described next. For each user in the test set, we randomly sampled one text sample and extracted its embedding through the Siamese models. We then created the pairwise similarity matrix M by computing the similarity between each text pair. M is pruned by removing entries below a threshold τ_cutoff, the same threshold τ_t that is used to determine SA or DA pairs. The pruned matrix M̃ is treated as the graph adjacency matrix from which a network G is constructed.

[Table 5. Examples of the most and least distinctive reviews.
Most distinctive: "its ok seems like a reprint i mean its not horrible but i was expecting a lil better qaulity but if i wore to do it again yes i would still buy this poster its not blurry or anything but if you have a good eye it seems a lil like a reprint"
Least distinctive: "Nice, thinner style plates that are well suited for building Lego projects. They hold Lego pieces securely and match up perfectly. Also, as a big PLUS for this company you get amazing customer service."]
Distinctiveness_i = 1 − (1/N) Σ_j I[j ∈ V_i]

where V_i is the set of neighbors of node i in G, N the total node count, and I[·] the indicator function. Σ_j I[j ∈ V_i] is the degree centrality of node i. We found that features from the unweighted graph are perfectly correlated with those from the weighted graph; the unweighted graph is kept for computational efficiency. The scores were averaged over 5 runs. The intuition is that, since authors are connected to similar authors, the more neighbors an author has, the less distinctive their style is. A distinctiveness of 0.6 implies that the author is different from 60% of authors in the dataset.
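The distinctiveness measure can be sketched as follows; the threshold value and the normalization by N − 1 (excluding self-loops) are illustrative assumptions.

```python
import numpy as np

def distinctiveness(similarity, threshold=0.5):
    """Per-author distinctiveness from a pairwise similarity matrix.

    Edges connect authors whose similarity exceeds the threshold; an
    author's score is 1 minus their normalized degree centrality.
    """
    S = np.asarray(similarity, float)
    n = len(S)
    adjacency = (S > threshold).astype(int)
    np.fill_diagonal(adjacency, 0)        # drop self-loops
    degree = adjacency.sum(axis=1)
    return 1.0 - degree / (n - 1)

# Authors 0 and 1 are stylistically similar; author 2 stands apart.
S = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
print(distinctiveness(S))  # [0.5 0.5 1. ]
```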
Consistency We also measured the intra-author consistency in styles through concentration in the latent space. The concentration can be quantified by the conicity, a measure of the averaged vector alignment to mean (ATM), as in the following equation (Chandrahas et al., 2018).
ATM(v, V) = cosine(v, (1/|V|) Σ_{u∈V} u),   Conicity(V) = (1/|V|) Σ_{v∈V} ATM(v, V)

where v ∈ V is a latent vector in the set of vectors V. The ATM measures the cosine similarity between v and the centroid of V, whereas the conicity indicates the overall clusteredness of the vectors in V around the centroid. If all texts written by the same user are highly aligned around their centroid, with a conicity close to 1, this suggests that the user is highly consistent in writing style.
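A minimal sketch of ATM and conicity, with toy vector sets standing in for a user's text embeddings:

```python
import numpy as np

def atm(v, centroid):
    """Alignment to mean: cosine similarity of v with the centroid."""
    return float(v @ centroid / (np.linalg.norm(v) * np.linalg.norm(centroid)))

def conicity(V):
    """Average ATM over all vectors: close to 1 means tightly clustered
    (a consistent style), close to 0 means spread out."""
    V = np.asarray(V, float)
    centroid = V.mean(axis=0)
    return float(np.mean([atm(v, centroid) for v in V]))

tight = np.array([[1.0, 0.1], [1.0, -0.1], [0.9, 0.0]])   # consistent user
spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]])  # inconsistent user
print(conicity(tight) > conicity(spread))  # True
```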
Analysis The distributions of style distinctiveness and consistency both conform to a normal distribution (Figure 2), yet no meaningful correlation exists between the two measures (Amazon: Spearman's ρ=0.078; Reddit: Spearman's ρ=0.11). In general, users are highly consistent in their writing styles even in such a large population, with an average consistency of 0.8, much higher than that for random samples (~0.4). Users are also quite distinctive from one another: on average, a user's style is different from 80% of users in the population pool. Yet individuals do differ in their degrees of distinctiveness and consistency, which may be taken into consideration in forensic linguistics, because inconsistency or indistinctiveness may weaken the linguistic evidence to be analyzed.
In Table 5, the least distinctive text is characterized by plain language, proper formatting, and typical content, which reflects the unmarked style of stereotypical Amazon reviews. Yet this review itself is still quite distinctive, as it differs from 60% of the total reviews. The most distinctive review exhibits multiple deviations from the norm of this genre. The style is unconventional, with uncapitalized letters, run-on sentences, typos, the lack of periods, and the use of colloquial alternative spellings such as "haft", "lil" and "wore", all of which make this review highly marked. For style consistency, the most consistent writers incline toward using similar formatting, emojis, and narrative perspectives across reviews, whereas the least consistent users tend to shift across registers and perspectives in their writing (see Appendix E for additional samples).
We tested how authors of varying distinctiveness and consistency affect verification performance. To avoid circular validation resulting from repeatedly using the same training data, we retrained the model on the repartitioned test data and evaluated on the development set. The original test set was repartitioned into three disjoint chunks of equal size, each containing authors drawn solely from the top, middle, or bottom 33% in terms of distinctiveness or consistency. Results in Table 6 suggest that, while most models performed similarly, models trained on inconsistent or indistinctive authors significantly underperformed. This result has implications for comparative authorship analysis: it is desirable to control the number of inconsistent or indistinctive authors in the dataset.

Compositionality of styles
Finally, we sought to understand how stylistic variations are encoded. At least for certain stylistic features, there is additive stylistic compositionality in the latent space onto which the texts are projected (Figure 1). Following analogy-based probing of word embeddings, we designed a series of linguistic stimuli that vary systematically in style to probe the structure of the stylistic embeddings. For each stylistic dimension, we created n text embedding pairs P = [(p_r^1, p_m^1), . . . , (p_r^n, p_m^n)], where p_r^i is the embedding of a randomly sampled text and p_m^i is the embedding of a modified version of that text, differing from it in only one stylistic aspect. For samples i and j from P, we quantified the alignment of their offset vectors:

S_ij = cosine(p_m^i − p_r^i, p_m^j − p_r^j)

As in the word analogy task, if a stylistic dimension is encoded along a single direction, we should expect S_ij to be close to 1. For a target qualitative style shift, we randomly sampled 1,000 texts and modified each to approximate the target stylistic dimension. For example, if the null subject is the target feature, we remove the subject "I" from "I recommend crispy pork!" to yield "Recommend crispy pork!". We then compute S_ij for each pair, totaling 499,500 possible comparisons. We selected 10 stylistic markers of textual formality for evaluation (MacLeod and Grant, 2012; Biber and Conrad, 2019); these textual modifications cover insertion, deletion, and replacement operations.
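The offset-alignment computation can be sketched as below; the synthetic embeddings and noise scale are toy assumptions used to simulate a style shift that is encoded along one direction.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def offset_alignment(pairs):
    """Mean pairwise cosine S_ij between style-shift offset vectors.

    `pairs` is a list of (original_embedding, modified_embedding); if the
    style shift is encoded along one direction, the mean approaches 1.
    """
    offsets = [m - r for r, m in pairs]
    scores = [cosine(offsets[i], offsets[j])
              for i in range(len(offsets)) for j in range(i + 1, len(offsets))]
    return float(np.mean(scores))

# Toy setup: every "modification" adds roughly the same offset vector.
rng = np.random.default_rng(1)
shift = np.array([1.0, -0.5, 0.3])
pairs = [(p, p + shift + rng.normal(scale=0.05, size=3))
         for p in rng.normal(size=(20, 3))]
print(offset_alignment(pairs) > 0.9)  # True
```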
For quantitative style shifts, we measure S_k between samples as well as the difference in offset length:

S_k = cosine(p_s^k − p_r^k, p_l^k − p_r^k),   ∆norm_k = ‖p_l^k − p_r^k‖ − ‖p_s^k − p_r^k‖

We compare two embedding pairs (p_r^k, p_s^k) and (p_r^k, p_l^k), where both p_s^k and p_l^k differ from p_r^k in only one stylistic dimension, but p_l^k is further along that dimension than p_s^k. For instance, compared to the original review p_r^k, p_s^k contains five more "!!!" tokens whereas p_l^k contains ten more. Here we surveyed four stylistic markers of formality.
For each stylistic shift, we collected 2,000 samples, each containing at least 8 markers of that style. We then modified the original review p_r^k toward the target style by incrementally transforming the keywords into the target keywords by 50% (p_s^k) and 100% (p_l^k). If these styles are highly organized, we should expect S_k to be close to 1, suggesting that changes along the same dimension point in the same direction. We also expect that quantitative changes are reflected in the magnitude of ∆norm_k and in its direction (∆norm_k being positive or negative). Results in Table 7 suggest that both models outperform the random baseline (S_ij computed over the same samples after randomly replacing some words). Like word embeddings, stylistic embeddings exhibit a linear additive relationship between various stylistic attributes. In Figure 1, converting all letters to lower case causes textual representations to move collectively in approximately the same direction. Despite such regularities, the offset vectors for style shifts were not perfectly aligned in all instances, which may be attributed to variation across texts.

For quantitative changes, SRoBERTa on both Amazon and Reddit data encodes the same type of change in the same direction, as the vector offsets are highly aligned with each other (Table 8). Moreover, greater degrees of style shift relative to the original text translate to a larger magnitude of change along that direction (∆norm_k in Table 8). In Figure 1, after removing the first three instances or all occurrences of "I" from the original text, the resulting representations both shift in the same direction but differ in magnitude. Such changes cannot be explained by random variation, suggesting that both models learn to encode fine-grained stylistic dimensions in the latent space through the proxy task.
While we only examined several stylistic markers, we are aware that the learned style representations also exhibit regularities for other lexical manipulations, as long as the manipulation is systematic and regular across samples. One explanation is that the model systematically tracks fine-grained variation in lexical statistics. Yet the proposed model must also encode more abstract linguistic features, because it outperformed GLAD (Hürlimann et al., 2015) and FVDNN (Weerasinghe and Greenstadt, 2020), which also track bag-of-words or bag-of-n-gram features. Previous research on word embeddings attributes performance on the analogy task to co-occurrence patterns (Pennington et al., 2014; Levy et al., 2015; Ethayarajh et al., 2019). The fact that these variations are encoded systematically, beyond random variation and in such a fine-grained manner, indicates that they are stylistic dimensions along which individual choices vary frequently and regularly.

Conclusions
The relatively unconstrained nature of the online genres tolerates a much wider range of stylistic variation than conventional genres (Hovy et al., 2015). Online genres are often marked by unconventional spellings, heavy use of colloquial language, extensive deviations in formatting, and the relaxation of grammatical rules, providing rich linguistic resources to construct and perform one's identity. Our analysis of idiolects in online registers has highlighted that idiolectal variations permeate all linguistic levels, present in both surface lexico-syntactic features and high-level discourse organization. Traditional sociolinguistic research often regards idiolects as idiosyncratic and unstable and not as regular as sociolects (Labov, 1989;Barlow, 2018); here, we show that idiolectal variation is not only highly distinctive but also consistent, even in a relatively large population. Our findings suggest that individuals may differ considerably by degrees of consistency and distinctiveness across multiple text samples, which sheds light on the theoretical discussions and practical applications in forensic linguistics. Our findings also have implications for sociolinguistics, as we have shown an effective method to discover, understand and exploit sociolinguistic variation.

Ethical considerations
While this study is theoretically driven, we are aware that there may be ethical concerns with the models surveyed here. While computational stylometry can be applied to forensic investigations (Grant and MacLeod, 2020), authorship verification in online social networks, if put to malicious use, could weaken the anonymity of some users, raising privacy issues. Our results also show that caution should be taken when deploying these models in forensic scenarios, as different models or tokenizers exhibit different inductive biases that may disadvantage certain types of users. Another potential source of bias is that we only selected a small group of the most productive writers from the pool (less than 20% of all data), a sample that may not necessarily be representative of all populations. We urge that caution be exercised when using these models in real-life settings.
We still consider that the benefits of our study outweigh potential dangers. Deep learning-based stylometry is an active research area in recent years. While many studies focus on improving performance, we provide insights into how some of these models make decisions based on certain linguistic features and expose some of the models' inductive biases. This interpretability analysis could be used to guide the proper use of these methods in decision-making processes. The analysis could also be useful in developing adversarial techniques that guard against the malicious use of such technologies.
All of our experiments were performed on public data, in accordance with the terms of service, and all authors were anonymized. In addition, the term "Siamese" may be considered offensive when it refers to certain groups of people. Our use of the term follows the established name of a neural network architecture in the machine learning literature; we use it to refer only to the model and make no reference to any population group.

A Hyperparameter tuning
In this section, we report the results of our hyperparameter tuning process. Table 9 reports additional results obtained during tuning. Changing the masking probability or the margins of the contrastive loss affects the final accuracy. The search was manual and not exhaustive; we used the best parameters in the main text.
For the actual implementation, we used an effective batch size of 256. The default optimizer was Adam with a learning rate of 1e-5. All models were trained on a single Nvidia V100 GPU with 16GB of memory. The models were set to train for 5 epochs, but we applied early stopping when the validation accuracy stopped increasing. Each epoch took about 2 hours to complete. For each model, we limited the maximum length of text samples to 100 tokens, though the exact definition of a token depended on the tokenizer used.
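The early-stopping rule described above can be sketched as follows; this is a minimal illustration with our own variable names, not the released training code, and `validate` stands in for a full train-plus-evaluate epoch.

```python
def train_with_early_stopping(epochs, validate, patience=1):
    """Stop training once validation accuracy stops increasing.

    `validate(epoch)` is a callable returning validation accuracy
    after that epoch (a stand-in for a real training loop).
    """
    best_acc, best_epoch, stale = float("-inf"), -1, 0
    for epoch in range(epochs):
        acc = validate(epoch)          # run one epoch, then evaluate
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0
        else:
            stale += 1
            if stale >= patience:      # accuracy stopped increasing
                break
    return best_epoch, best_acc

# Simulated validation curve: improves for 3 epochs, then degrades.
curve = [0.70, 0.78, 0.81, 0.80, 0.79]
epoch, acc = train_with_early_stopping(5, lambda e: curve[e])
```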

B Baseline methods
We ran several baseline models for comparison. When setting up these models, we tried to make minimal changes to the original implementations. Details of our changes are provided below.
GLAD We used the original code for GLAD 2 . The linguistic features were extracted using the combo4 option in the code, which covers 23 linguistic features. While a support vector machine (SVM) was used as the classifier in the original paper (Hürlimann et al., 2015), we found that SVM did not scale to the size of our data. Instead, we ran a logistic regression model on the features, which also allowed us to interpret feature importance. The performance of logistic regression was very close to that of the Random Forest classifier included in their code.
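The substitution of logistic regression for SVM can be sketched as follows. This is an illustrative sketch only: the feature matrix here is synthetic noise of the right shape (the 23 GLAD features are not reproduced), and the labels are constructed from the first feature purely for demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for the GLAD feature matrix: n pairs x 23 linguistic features.
X = rng.normal(size=(500, 23))
# Synthetic labels driven by the first feature, for illustration only.
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Coefficient magnitudes give a rough measure of feature importance,
# which is why logistic regression was preferred over SVM here.
importance = np.abs(clf.coef_[0])
```

Because the labels depend only on the first feature, its coefficient dominates the importance vector, illustrating the interpretability argument made above.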
FVD The model was trained with the code released by the original authors 3 (Weerasinghe and Greenstadt, 2020). We kept the original feature extraction methods and model architecture. The input to the neural network was a 2314-dimensional feature vector, computed by taking the element-wise absolute difference between the linguistic feature vectors of the two authors under comparison. The two-layer fully connected neural network was trained for 100 epochs, and the model with the best validation accuracy was kept.
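The absolute-difference pair encoding at the heart of FVD can be sketched as follows. The data here is synthetic (same-author pairs are simulated as feature vectors plus small noise) and the dimensionality is reduced from 2314 to 32 to keep the sketch fast; the two-hidden-layer classifier is a stand-in for the original network.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
DIM = 32  # the real FVD vectors have 2314 dimensions; smaller here

def pair_features(f1, f2):
    """FVD-style pair encoding: element-wise absolute difference
    between the two texts' linguistic feature vectors."""
    return np.abs(f1 - f2)

# Synthetic data: same-author pairs differ only by small noise.
n = 400
base = rng.normal(size=(n, DIM))
same = pair_features(base, base + rng.normal(scale=0.1, size=(n, DIM)))
diff = pair_features(rng.normal(size=(n, DIM)), rng.normal(size=(n, DIM)))
X = np.vstack([same, diff])
y = np.array([1] * n + [0] * n)  # 1 = same author, 0 = different

# Two-layer fully connected network, as in the original FVD setup.
clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300,
                    random_state=0).fit(X, y)
```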
AdHominem We used the original implementation 4 provided by the author. While this implementation differs slightly from that described in Boenninghoff et al. (2019a), we made no modification to the code other than adapting it to our own data; the same pre-processing method, model architecture, and parameters were kept. However, we did not use the original evaluation code, as it ignores uncertain samples, a practice that is standard in PAN 20 (Kestemont et al., 2020). The model was trained for 5 epochs and we kept only the model with the best validation results.

BERT Concat /RoBERTa Concat We re-implemented the model using the original pre-trained RoBERTa, so that it can be directly compared to the Siamese version. Since the Longformer is highly similar to RoBERTa and BERT, we do not expect a significant performance gap between them.
Evaluation To ensure consistency, all evaluation metrics were computed with the functions in scikit-learn: accuracy_score for accuracy, f1_score for F1, and roc_auc_score for AUC.
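For illustration, the three metric calls on a toy prediction set:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]           # hard decisions, used for accuracy and F1
y_score = [0.1, 0.9, 0.4, 0.2]  # model confidences, used for AUC

acc = accuracy_score(y_true, y_pred)  # 3 of 4 correct -> 0.75
f1 = f1_score(y_true, y_pred)         # precision 1.0, recall 0.5 -> 2/3
auc = roc_auc_score(y_true, y_score)  # every positive outranks every negative -> 1.0
```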

C Error analysis
We also analyzed the error distributions across different conditions, shown in Table 10. Given a pair of texts, we categorized it into one of four conditions: same-author (SA) vs. different-author (DA), crossed with same-domain (SD) vs. different-domain (DD). Unsurprisingly, most methods still struggle with SA-DD and DA-SD pairs, suggesting that domain-specific/topic information partially interferes with the extraction of writing styles. The cases of RoBERTa Concat and BERT Concat are particularly interesting, as both models consistently performed worse on SA pairs but outperformed the rest of the models on DA pairs. Cosine-distance-based models seem to better balance this trade-off across conditions. This shows that model architectures exhibit inductive biases of their own, which may make them more or less effective in certain conditions.
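The four-way bucketing used in this error analysis can be sketched as a small helper (the function name and inputs are ours, for illustration):

```python
def pair_condition(author1, author2, domain1, domain2):
    """Bucket a text pair into one of the four error-analysis conditions:
    SA/DA (same vs. different author) crossed with SD/DD (same vs.
    different domain)."""
    a = "SA" if author1 == author2 else "DA"
    d = "SD" if domain1 == domain2 else "DD"
    return f"{a}-{d}"

# The two hardest conditions noted above:
hard1 = pair_condition("u1", "u1", "books", "movies")  # same author, new domain
hard2 = pair_condition("u1", "u2", "books", "books")   # different authors, shared domain
```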

D Comparing tokenization methods
Tokenization methods For word-based tokenization, we used the word_tokenize function in NLTK. Either the 30k or the 50k most frequent lexical tokens were kept as the vocabulary for training the LSTM model, plus a padding token and an OOV token. For the BPE tokenizers, we directly used the pre-trained tokenizers for BERT and RoBERTa, accessed through HuggingFace's Transformers.
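The frequency-capped vocabulary with padding and OOV tokens can be sketched as follows; whitespace splitting stands in for NLTK's word_tokenize to keep the sketch self-contained, and the tiny vocabulary size is for demonstration only.

```python
from collections import Counter

PAD, OOV = "<pad>", "<oov>"

def build_vocab(texts, size=30_000):
    """Keep the `size` most frequent tokens, plus padding and OOV symbols
    at fixed indices 0 and 1."""
    counts = Counter(tok for t in texts for tok in t.split())
    vocab = {PAD: 0, OOV: 1}
    for tok, _ in counts.most_common(size):
        vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map tokens to ids; unseen tokens collapse to the OOV id."""
    return [vocab.get(tok, vocab[OOV]) for tok in text.split()]

vocab = build_vocab(["great product great price", "bad product"], size=4)
ids = encode("great price unseen", vocab)
```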
Model specification The underlying model is an LSTM-based Siamese network. It consists of two bidirectional LSTM layers with 300 hidden states in each direction. The last hidden states of the final layer in the forward and backward directions were concatenated as the representation of the whole input text, which was then passed to a two-layer fully connected network with 300 hidden states in each layer. The similarity between the two paired texts was computed with the cosine distance function. The hyperparameters for the loss function were τ_d = 0.4 and τ_s = 0.6. No pre-trained word embedding weights were used; all weights were trained from scratch.
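The role of the two margins can be illustrated with a small numpy sketch. The exact loss formulation is not spelled out in this appendix; the version below is one common double-margin contrastive formulation consistent with τ_d = 0.4 and τ_s = 0.6, and should be read as an assumption, not the paper's definitive loss.

```python
import numpy as np

TAU_D, TAU_S = 0.4, 0.6  # margins for the contrastive loss

def cosine_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_loss(u, v, same_author):
    """Assumed double-margin formulation: same-author pairs are pushed
    above tau_s, different-author pairs below tau_d; pairs already
    past their margin incur no loss."""
    s = cosine_sim(u, v)
    if same_author:
        return max(0.0, TAU_S - s)
    return max(0.0, s - TAU_D)

u = np.array([1.0, 0.0])
v = np.array([1.0, 0.0])
same_loss = contrastive_loss(u, v, same_author=True)   # sim 1.0, above tau_s
diff_loss = contrastive_loss(u, v, same_author=False)  # sim 1.0, above tau_d
```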
Training details The training, development, and test data were the same as those in the main experiments. The model was optimized by the Adam optimizer with a learning rate of 0.001. Gradient clipping was applied to stabilize training with the maximum gradient norm set to 1. The model was trained on an 11GB RTX 2080Ti with an effective batch size of 256 for 10 epochs. The average training time for each model was about 3 hours.

E Distinctiveness and consistency
Here we also show the joint distribution of distinctiveness and consistency given by SBERT in Figure 3. The distribution is approximately bivariate normal, and the two metrics are not correlated.
The overall distributions of distinctiveness and consistency computed with different models are given in Figure 4. The distribution of style distinctiveness conforms to a normal distribution regardless of model (Figure 4), though it is more peaked for better models. The distribution of style consistency is also approximately normal, and the distributions predicted by different models are highly similar.
Correlations across models We used Spearman correlation coefficients to assess to what extent different models assign similar rankings of distinctiveness and consistency. Results are presented in Table 11. The consistency scores given by all models are moderately correlated, yet the correlations for distinctiveness are generally weak.

F Additional analysis: characterizing sociolects
Language varies at both individual and collective levels (Eckert, 2012). In this section, diagnostic classification is employed to probe to what extent collective language variation is retained in the stylistic embeddings.

Dataset compilation From the test set, we created a small subset of high socioeconomic status (SES) users and low SES users, using the prices of the reviewed products as a proxy. We verified that there is a clear difference in readability between the high SES and low SES groups, readability being a reliable linguistic indicator of SES (Flekova et al., 2016; Basile et al., 2019).
We compiled this sociolect dataset as a subset of the test data, containing speakers and samples unseen by the trained model. The core idea is to select users who fall into distinct socioeconomic strata by using the price tags of their reviewed products. If a user consistently reviews expensive products, that user is more likely to be associated with high socioeconomic status. This method has been used in a previous study of socioeconomically conditioned variation (Basile et al., 2019).
The meta-information was provided together with the original Amazon dataset. 5 For each product, we acquired the product title and price from the product meta-information based on its unique identifier. However, the meta-information was incomplete for a sizable fraction of the data, either missing certain attributes or stored in the wrong format. We kept only the products with complete meta-information.
Then, for each product domain, we discretized the price distribution by categorizing product prices into ten quantiles. The quantile into which a product fell was used as a proxy for the relative expensiveness of the product. This was done separately for each domain rather than for the whole dataset; otherwise, a few domains such as appliances, luxury products, or electronics would dominate the tail of the distribution. After categorizing the data, we averaged the rankings of all products associated with a user and treated the result as an approximation of the user's socioeconomic status. We kept the top 10% and the bottom 10% of users as high SES and low SES users, respectively, so as to maximize the differences between the two groups. This left 6567 users with 72335 reviews in the high SES group and 6939 users with 79190 reviews in the low SES group. The dataset is relatively balanced (48% vs. 52%), so we did not further resample the data. The distribution of product domains is displayed in Figure 5.

The readability scores were computed with the functions provided in textstat 6 . The differences are all statistically significant, implying that reviews written by high SES users tend to be more linguistically complex than those by low SES users. These results are consistent with previous studies (see Table 1 in Flekova et al. (2016) and Table 3 in Basile et al. (2019)).

Training details Five models were trained to predict SES from language: 1) TF-IDF, with mostly topical information; 2) handcrafted stylometric features (Weerasinghe and Greenstadt, 2020); 3) RoBERTa, with both topical and linguistic features; 4) SRoBERTa embeddings, with idiolectal features; and 5) a random baseline (BL). We also used the same models and data to predict the product domain of each short text.
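The per-domain price discretization and user-level averaging described above can be sketched with pandas. The column names (`user`, `domain`, `price`) are hypothetical; the toy frame in the usage example uses artificial prices.

```python
import pandas as pd

def ses_proxy(reviews: pd.DataFrame) -> pd.Series:
    """Approximate each user's SES from the prices of reviewed products.

    `reviews` has hypothetical columns `user`, `domain`, and `price`.
    Prices are binned into ten quantiles within each domain, and a
    user's SES proxy is the mean bin index over their reviews.
    """
    reviews = reviews.copy()
    reviews["price_rank"] = (
        reviews.groupby("domain")["price"]
        .transform(lambda s: pd.qcut(s, 10, labels=False, duplicates="drop"))
    )
    return reviews.groupby("user")["price_rank"].mean()

# Toy example: user "u" reviews cheap products, user "v" expensive ones.
df = pd.DataFrame({
    "user": ["u"] * 10 + ["v"] * 10,
    "domain": ["books"] * 20,
    "price": list(range(1, 11)) + list(range(100, 110)),
})
ses = ses_proxy(df)
```

Binning within each domain, rather than globally, mirrors the rationale given above: it prevents a few expensive domains from monopolizing the top quantiles.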
For TF-IDF and the stylometric features, we used logistic regression as the base model. The stylometric features were extracted using the FVD method (Weerasinghe and Greenstadt, 2020), one of our baseline methods for authorship verification. For RoBERTa and SRoBERTa, we added a two-layer neural network on top of the [cls] token with a cross-entropy loss; the only difference was that, for SRoBERTa, the base RoBERTa was frozen during training. We ran each model 3 times with different random seeds, each time randomly splitting the data into 75%/25% partitions for training and testing, and report the averaged results.

Results For the challenging task of SES prediction, all models attain moderate performance that is consistently above chance level (Table 13), echoing previous findings (Flekova et al., 2016; Basile et al., 2019). Compared to the fine-tuned RoBERTa, the idiolectal features have filtered out some SES-related variation, which could be related to domain-specific information. Notably, the fact that the style embeddings performed the worst at predicting product domain indicates that idiolectal style is not simply capturing product domain as a proxy for SES (e.g., learning more expensive domains). SRoBERTa's high performance on SES and low performance on domain suggest that our task setup and sampling strategy forced the model to smooth out a significant portion of the variation associated with topics. As noted by Boenninghoff et al. (2019a), even surface linguistic features that are not highly content-related still achieve moderate performance, suggesting that variation across domains may be more than topical. The fact that SES variation is present in the idiolectal embeddings suggests that at least some SES variation is nested within idiolectal variation (Eckert, 2012).

G Additional text samples
Additional samples of Amazon reviews with polarizing distinctiveness are given in Table 14. Different models single out reviews with wide-ranging stylistic traits. Full reviews are shown in the table, though only the first 100 words were used by the model during inference. Table 15 shows text samples from the most and the least consistent authors in terms of their writing styles. For each model, each column presents reviews written by the same author.

Table 14 (excerpt): reviews judged most and least distinctive by each model (original spelling preserved).

LSTM BERT
Most distinctive: "The late 80s were a golden age for CD reissues, especially of tracks from the 50s and 60s, since the new digital format was just gaining popularity, there was a retro-1960s revival going on, and record companies realized they had whole new revenue stream from people buying (or re-buying) back-catalog material for their new players. The compilations issued then were full of quality stuff, unlike later bottom-of-the-barrel reissues."
Least distinctive: "This Urban Fantasy series pulls you right in and the more you know the characters the more you want to know. Hailey Edwards will make you smirk, bite your nails, cry and hope, hope, hope because her characters become (our) friends. As fantastic as the characters origins and abilities are their personalities are so appealing that I found myself hoping in the goodness of even some of the meanies."

LSTM RoBERTa
Most distinctive: "**UPDATE 4/19/16** apparently got a bad cable Couldn't figure out why I was having issues connected to Ethernet Thought it might be a network driver issue or a modem issue But after replacing this cable with a shorter one had laying around come to conclusion its this cable that was bad Not a big deal it happens only out a couple $$$ , disappointed but not to upset I need a 15 foot+ Ethernet cable it works , really not much to review ends snap in ok , no twists in cable works good"
Least distinctive: "My cats don't like to be brushed. But when I can get several strokes in, this works well. I use the dog brush on my dog; the cat brush is a little smaller that the dog brush and weighs less which are good changes to make for the kitty models."

SBERT
Most distinctive: "A Great Forza, Serafin conducts with wonderful pace, warmth and subtly for such an unsubtle opera making this a real beauty, and so easy to listen to. Callas is quite magnificent with a fine supporting cast. Disregard many of the somewhat breathless negatives, gushing with crushes and arguments for other favorite sopranos, so juvenile, the fact is there are many great female opera singers all suited to different operas some more than others, Callas happens to be one of the greatest in emotional commitment and inner depth of feeling,"
Least distinctive: "I wanted a simple steel men's ring without a design and that wouldn't show fingerprints. This ring is perfect. One great thing that I enjoy is that the interior is rounded and polished, making it feel like silk when I put it on. Very affordable, too! Just goes to show, you don't have to break the bank to get attractive quality."

Table 15 (excerpt): samples from the most and the least consistent authors (original spelling preserved).

Author A: "the star wars prequel trilogy is the greatest movie trilogy in the universe there true classics the best of the best the star wars prequel trilogy rocks the star wars charecter anakin skywalker is the greatest movie charecter in the universe its pure genius the best of the best the star wars charecter anakin skywalker rocks ^_^"

Author B: "A true feat of alchemy, turning base metal (a script worth its weight in manure) into piles of cash. Or more specifically, this is one of the dumbest, least plausible, movies weve watched in a long time. And yet not without comic relief. Now, who was it that said, "Nothing will come of nothing"? Silly old bard."

Additional sample (fragment): "So I hold a bachelors and masters in Speech"

RoBERTa
Author C: "these our not real instruction tapes but introductions to who Larry really is you can learn from them some really good stuff kenpo is a marshal art that is based on common sense any one who really understand his marshal art will be doing kenpo with out knowing kenpo our even taking a class all marshal artist will run it to these principles for they our the principle of the sword"