Mode Effects’ Challenge to Authorship Attribution

The success of authorship attribution relies on the presence of linguistic features specific to individual authors. There is, however, limited research assessing to what extent authorial style remains constant when individuals switch from one writing modality to another. We measure the effect of writing mode on writing style in the context of authorship attribution research using a corpus of documents composed online (in a web browser) and documents composed offline using a traditional word processor. The results confirm the existence of a “mode effect” on authorial style. Online writing differs systematically from offline writing in terms of sentence length, word use, readability, and certain part-of-speech ratios. These findings have implications for research design and feature engineering in authorship attribution studies.


Introduction
That authorship attribution techniques work as reliably as they do has been attributed to the fact that each individual has a distinctive writing style. Texts written by the same author can be recognized by analyzing lexical and syntactic features of documents (Juola, 2006). This principle has proven successful in practice in a variety of settings (Abbasi and Chen, 2008; Overdorf and Greenstadt, 2016; Afroz et al., 2014). In some cases, however, authorial style is challenging to detect. For example, authorial style fades over time (Glover and Hirst, 1996; Baayen et al., 2002), varies considerably in collaborative environments (Graham et al., 2005; Kestemont et al., 2018; Zangerle et al., 2019), and drifts depending on document genre (Stamatatos, 2018; Koppel et al., 2007; Sapkota et al., 2016).
The aforementioned changes are conspicuous because certain markers indicate that a document may have been written in a fashion that will lead to stylistic variation. For instance, the presence of genre-specific words signals that a document may present a puzzle for standard authorship attribution techniques. Sometimes, however, documents which challenge an analysis of authorial style arrive unannounced. This paper shows that authorial style changes with respect to sentence length, word use, readability, and certain part-of-speech ratios when the writing environment switches from traditional word processing software to a text input box in a web browser.

Mode Effects
Originally developed in survey research and educational testing, a "mode effect" describes the following phenomenon: a respondent may answer the same question differently depending on how a survey is administered (e.g., online vs. by phone) (Hochstim, 1967; Leeson, 2006). Although discussion continues about mode effects' underlying mechanisms (Kreuter et al., 2008; Sidi et al., 2017), their magnitude in context (Carpenter and Alloway, 2019; Washburn et al., 2017), and adjustment methods (Kolenikov and Kennedy, 2014), there is consensus that mode effects can impair survey validity. For instance, Tourangeau et al. (2000) compiled six studies investigating illicit drug use with self-administered and interviewer-administered surveys. The results showed that illicit drug use was reported at higher rates when questions were administered without an interviewer present.
Functionally, writing modality resembles survey modality: the style observed in an individual's writing may vary depending on how the writer composes the document. A document written by hand may vary from a document composed using traditional word processing software. Further variation may be observed if the document is typed into a text box in a web browser. Therefore, this research uses "mode effect" to label such differences.

Data
Participants in this experiment were recruited on Amazon Mechanical Turk (MTurk). Two distinct types of writing were collected from each of the 18 participants: (1) ca. 6,500 words of pre-existing formal writing and (2) a short, ca. 500-word open-ended response to an essay prompt. For the pre-existing writing samples, participants were asked to "Submit at least 6500 words total from multiple documents of your own writing that was done for a formal purpose (school essays, grant proposals, etc)." For the 500-word essay, respondents were asked to describe [their] neighborhood to someone who has never been there before as part of a college application. Respondents also completed a demographic questionnaire, reporting their gender and age bracket.
Responses that were not in English or which seemed very likely to be inauthentic were excluded (Kennedy et al. (2018) discuss challenges with MTurk surveys). We also excluded one response which appeared to contain writing copied, without attribution, from online sources. The pre-existing writing samples were further processed to remove personally identifying information; lengthy quotations, headings, tables, and figures were also removed.
The data for this experiment are a subset of data collected as part of research seeking to replicate results in Brennan et al. (2012). In the full replication experiment, respondents were randomly assigned one of four essay prompts. In this paper, we only used the responses from respondents randomly assigned to the "control" condition. These respondents provided pre-existing writing samples and a response to the essay prompt mentioned above. Responses were collected between March 29th and June 1st, 2019. Fourteen of the 18 respondents reported their age as "18-34". Gender was self-reported: ten respondents were men and eight were women.
The overwhelming majority of pre-existing writing samples collected were essays written for undergraduate courses. Many essays discussed films and literary works. Many appeared to be written for political science and business courses.
The writing prompt generally elicited the desired response: respondents wrote about their neighborhood using formal or semi-formal prose.
To check that all writing collected exhibited approximately the same degree of formality, we compared the formality of the writing in the offline corpus with the formality of the writing in the online corpus using a formality score developed by Heylighen and Dewaele (1999). We found that the formality scores in each corpus were similar.
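The Heylighen and Dewaele F-score weights "formal" part-of-speech categories against "deictic" ones. The sketch below is a minimal illustration only: it assumes part-of-speech counts have already been produced by a tagger, and the category names and dictionary interface are our own, not the authors' implementation.

```python
def formality_score(pos_counts):
    """Heylighen & Dewaele (1999) F-score from part-of-speech counts.

    F = (noun% + adj% + prep% + article% - pron% - verb% - adv% - interj% + 100) / 2
    where percentages are relative to the total word count. Higher F indicates
    more formal (context-independent) prose.
    """
    total = sum(pos_counts.values())
    pct = lambda tag: 100.0 * pos_counts.get(tag, 0) / total
    return (pct("noun") + pct("adj") + pct("prep") + pct("article")
            - pct("pron") - pct("verb") - pct("adv") - pct("interj") + 100) / 2

# Hypothetical tag counts for a 100-word text (illustrative values only).
counts = {"noun": 30, "adj": 10, "prep": 15, "article": 10,
          "pron": 10, "verb": 15, "adv": 8, "interj": 2, "other": 0}
```

A text consisting only of nouns, adjectives, prepositions, and articles scores 100; one consisting only of pronouns, verbs, adverbs, and interjections scores 0.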

Method
This paper focuses on stylistic differences introduced by seemingly innocuous variation in the mode used to enter a text: offline composition vs. online typing into a text box. We are interested in whether a mode effect occurs in writing. To the extent that it is observed, we wish to know if its impact on an author's style is predictable. Does writing mode induce similar changes in the writing style of different individuals?
To answer these questions, we extract linguistic features from the documents written by the 18 participants. We then use a Bayesian hierarchical model to estimate differences in the rates at which the linguistic features appear in texts written using different writing modes.

Feature Selection
We use a set of high-level, familiar linguistic features in our study. Our "Comparative Style" ("CS") feature set aims to capture word-, sentence-, and chunk-level features. All features are described in Table 1. For sentence-level features, white space between words is not counted as a character. Punctuation includes periods, exclamation marks, question marks, commas, semicolons, colons, and apostrophes. For function words we use the list of 512 words from Koppel et al. (2005). The Voice of America (VOA) Special English word list contains 1,512 frequently-used words.

Table 1: Features in the Comparative Style feature set. The last column gives an explanation or example; the first sentence of the abstract serves as the example sentence, and the abstract itself as the example text chunk.

Word level:
1. Word length in syllables (WordLenSyll). E.g., the word "mode" has one syllable while the word "effect" has two.
2. Word length in characters (WordLenChar). E.g., the word "mode" has four characters while the word "effect" has six.

Sentence level:
3. Sentence length in syllables (SentLenSyll). E.g., the example sentence has 35 syllables.
4. Sentence length in words (SentLenWord). E.g., the example sentence has 16 words.
5. Sentence length in characters (SentLenChar). E.g., the example sentence has 99 characters.
6. Punctuation-to-character ratio (PuncChar). E.g., the PuncChar ratio of the example sentence is 1/99.
7. Function-word-to-word ratio (FuncWord). E.g., the FuncWord ratio of the example sentence is 6/16.
8. Special English ratio (SplEng). E.g., the SplEng ratio of the example sentence is 6/16, because "on", "the", "to", "individual", and two instances of "of" are in the sentence.
9. Common word ratio (CommWord). E.g., the CommWord ratio of the example sentence is 12/16, because "success", "on", "the", "presence", "linguistic", "feature", "specific", "to", "individual", "author", and two instances of "of" are in the sentence.

Chunk-level features are computed over text chunks, where a chunk is defined as a sequence of consecutive sentences containing at least 150 words. The 150-word criterion was arbitrarily chosen to balance the length requirement of readability tests against the desire to extract as many chunk-level observations as possible in order to better estimate feature variability within writing modes. We experimented with different chunk lengths (e.g., 100, 150, 200) and found that our results did not depend strongly on chunk length. The Flesch-Kincaid reading ease score (Flesch, 1948) and the Gunning fog index (Gunning, 1968) were calculated for every chunk. More challenging texts are associated with lower Flesch-Kincaid scores and higher Gunning fog indexes.
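To make the sentence- and chunk-level features concrete, the sketch below computes a few of them with a simplified tokenizer. This is an illustration under stated assumptions, not the authors' implementation: the regex tokenizer, the greedy chunking rule, and the toy function-word set are our own.

```python
import re

PUNCT = set(".!?,;:'")  # punctuation marks counted by the PuncChar feature

def sentence_features(sentence, function_words):
    """Sentence-level CS features; whitespace is not counted as a character."""
    words = re.findall(r"[A-Za-z']+", sentence)
    n_chars = sum(len(tok) for tok in sentence.split())  # chars excl. whitespace
    n_punct = sum(1 for ch in sentence if ch in PUNCT)
    return {
        "SentLenWord": len(words),
        "SentLenChar": n_chars,
        "PuncChar": n_punct / n_chars,
        "FuncWord": sum(1 for w in words if w.lower() in function_words) / len(words),
    }

def chunks(sentences, min_words=150):
    """Greedily group consecutive sentences into chunks of >= min_words words.

    A trailing run of sentences that never reaches min_words is dropped.
    """
    out, cur, n = [], [], 0
    for s in sentences:
        cur.append(s)
        n += len(s.split())
        if n >= min_words:
            out.append(" ".join(cur))
            cur, n = [], 0
    return out
```

Chunk-level readability scores would then be computed over each string returned by `chunks`.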
Thirteen of the 15 CS features were transformed by taking the square root so that the feature distributions would be approximately Gaussian. The Flesch-Kincaid score and the Gunning fog index were left on the original scale, as their distributions were already approximately Gaussian. Although the hierarchical model uses the transformed features, in subsequent visualizations and tables, parameter estimates are reported on the original scale.

Setup
We divide the documents into two groups: the "offline" documents, the pre-existing writing samples from the 18 subjects (authored using word processing software) and the "online" documents, written in a web browser in response to the essay prompt asking for a description of the writer's neighborhood.
To compare features across modes and individuals, we use a hierarchical model. Within each mode-specific group of documents, feature observations associated with an individual are modeled using a normal sampling model with an individual-specific mean and scale. The individual-specific means and scales are, in turn, modeled using a normal distribution and a gamma distribution.
For each CS feature, the model for the online mode is

  y i,j ~ Normal(θ j,online , σ j,online ),  i = 1, ..., n j
  θ j,online ~ Normal(µ online , τ online )
  σ j,online ~ Gamma(α online , β online )

where Y j,online = {y 1,j , y 2,j , ..., y n j ,j } are the observations for the jth subject in the online mode. These observations are drawn from a normal sampling distribution Normal(θ j,online , σ j,online ). The individual-specific standard deviation σ j,online comes from a gamma distribution parameterized by shape α online and rate β online . The individual-specific mean θ j,online is drawn from a normal distribution with location µ online and scale τ online . The mean µ online was assigned a uniform prior distribution, while weakly informative priors were given to τ online , α online , and β online . Each feature is modeled separately. The models for CS features in the offline mode mirror those for the online mode.
In a pilot study, we considered using a Student-t distribution instead of a normal distribution as the sampling distribution. We found that the estimated degrees of freedom for these sampling distributions were sufficiently large (>30). Hence we concluded it was safe to use the simpler normal distribution as the sampling distribution.
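The generative structure of the hierarchical model can be sketched by simulating from it. The parameter values below are illustrative assumptions only; in the study the group-level parameters are estimated from the data (with priors, as described above) rather than fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mode(mu, tau, alpha, beta, n_subjects=18, n_obs=40, rng=rng):
    """Draw feature observations from the hierarchical model for one mode:

    theta_j ~ Normal(mu, tau)          # subject-specific means
    sigma_j ~ Gamma(alpha, rate=beta)  # subject-specific scales
    y_ij    ~ Normal(theta_j, sigma_j) # observations for subject j
    """
    theta = rng.normal(mu, tau, size=n_subjects)
    # NumPy's gamma takes a scale parameter, so the rate beta becomes 1/beta.
    sigma = rng.gamma(alpha, 1.0 / beta, size=n_subjects)
    return [rng.normal(theta[j], sigma[j], size=n_obs) for j in range(n_subjects)]

# Illustrative values loosely inspired by the sentence-length results below.
online = simulate_mode(mu=74.0, tau=10.0, alpha=4.0, beta=0.5)
```

Fitting the model inverts this process, recovering posteriors over µ, τ, α, and β from the observed feature values.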

Decision Rule
Posterior differences between the two group means (µ online and µ offline ) and group scales (η online and η offline ) will be characterized using 95% highest posterior density (HPD) intervals and regions of practical equivalence (ROPE) (Kruschke and Liddell, 2018). The 95% HPD interval describes an interval in which a parameter is likely to be found. The ROPE specifies a region of practical equivalence around a null value. When the 95% HPD for a parameter falls outside the ROPE, the null value is rejected, and the parameter is considered to be different from the null value. If the HPD falls entirely inside the ROPE, the null value is accepted. Otherwise, we withhold judgment.
The data can be analyzed using different ROPEs. Given the goals of this investigation, we use ROPEs associated with a "small" effect (Cohen's d = 0.2, following Cohen (1988)); that is, we set the upper and lower ROPE limits at ±0.1 standard deviations around zero, as suggested by Kruschke (2018). There are many ways to calculate effect size; we follow Kruschke (2014) and refer to this effect size as Cohen's d.
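The decision rule can be sketched directly from posterior samples. The HPD computation and the pooled-scale effect size below follow standard formulations from Kruschke; the function names and sample-based interface are our own illustration, not the authors' code.

```python
import numpy as np

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior samples."""
    s = np.sort(np.asarray(samples))
    n_in = int(np.ceil(mass * len(s)))
    widths = s[n_in - 1:] - s[:len(s) - n_in + 1]
    i = int(np.argmin(widths))
    return s[i], s[i + n_in - 1]

def rope_decision(samples, rope=(-0.1, 0.1), mass=0.95):
    """Reject the null if the HPD lies entirely outside the ROPE, accept it
    if the HPD lies entirely inside, and otherwise withhold judgment."""
    lo, hi = hpd_interval(samples, mass)
    if hi < rope[0] or lo > rope[1]:
        return "reject null"
    if lo >= rope[0] and hi <= rope[1]:
        return "accept null"
    return "undecided"

def effect_size(mu1, mu2, sigma1, sigma2):
    """Cohen's d as a difference of means over the pooled scale."""
    return (mu1 - mu2) / np.sqrt((sigma1 ** 2 + sigma2 ** 2) / 2.0)
```

Applying `rope_decision` to posterior samples of µ online − µ offline (for each feature) reproduces the three-way accept/reject/undecided logic described above.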

Results
Before summarizing differences between online and offline writing across all individuals and features, we first consider how a single feature varies in writing from the 18 subjects. This analysis demonstrates how we use Bayesian methods to infer collective tendencies in the data.

A Close Look at Sentence Length
Do sentence lengths vary systematically by writing mode? We consider three measures of sentence length, one of which is "sentence length in characters." Even before performing any modeling, the individual and collective tendencies can be identified in a visualization of the data (Figure 2).
The box plots on the left-hand side of Figure 2 show that all but one of the 18 participants tended to use longer sentences in their offline documents. The variability in sentence lengths also tended to be greater. The right-hand-side histograms of aggregated observations further confirm this characterization.

Figure 2: Grouped box plots and histograms with kernel density estimates for sentence length in characters.

Figure 3: Posterior distributions of µ online , µ offline , µ online − µ offline , η online , η offline , η online − η offline , and the effect size for the "sentence length in characters" feature. The ranges between the dotted reference lines are ROPEs.

Figure 3 shows the 95% HPD intervals for the parameters of interest related to "sentence length in characters": µ online , µ offline , η online , η offline , µ online − µ offline , η online − η offline , and the effect size.
In the first two upper panels, the posteriors indicate that individuals' sentences were typically 73.84 characters long when typed into a text box in a web browser but 116.88 characters long when composed with traditional word processing software. Those writing online tended to use shorter sentences (42.36 fewer characters, µ offline − µ online ). Note that the 95% HPD falls far from the ROPE, indicating a non-negligible difference. Therefore, we conclude that people wrote shorter sentences when writing online.
The lower panels of Figure 3 show another difference: the standard deviation in the online setting was 5.04 characters fewer than that in the offline mode, indicating a relative lack of variability in sentence length when individuals wrote online.
The estimated effect size was -5.44, with a 95% HPD interval between -8.55 and -2.84 (Figure 3, upper right). An effect size greater than 2 in absolute value counts as "huge" (Sawilowsky, 2009). One way of comprehending the magnitude of an effect size is the following: with the naked eye, one can barely detect a "small" effect (e.g., Cohen's d = 0.2) but would have no difficulty seeing a "large" one (e.g., Cohen's d = 0.8).
The preceding analysis looked closely at a specific feature. We considered both the raw data and posterior estimates. In the remainder of the paper, for the sake of brevity, only differences between group means (µ online − µ offline ), scales (η online − η offline ), and effect sizes will be reported.

How Writing Style Varies by Mode
In Figure 4, 12 out of 15 posterior differences of group means (µ online −µ offline ) are credibly nonzero, leaving the rest undecided.
The posterior differences between word lengths and between sentence lengths are negative: participants tend to write shorter sentences and use shorter words in the online condition. Relative to offline writing, a positive difference in mean Flesch-Kincaid scores and a negative difference in mean Gunning fog index scores indicate that individuals simplify their writing style when entering prose in a web browser. Similar patterns appear when examining the percentages of function words, Special English words, and common words: individuals tended to use simpler vocabulary in the online condition. Among the part-of-speech ratios, the adjective-to-noun and pronoun-to-noun ratios show credibly positive differences.
Differences in feature standard deviations are shown in Figure 5, where five features showed nonzero differences. Sentence length, measured in three different ways, varies less in online writing than in offline writing. That is, individuals tend to use a wider range of sentence lengths in offline writing than in online writing. The two readability scores also show less variation in the online mode.
Effect size. The posterior distributions for effect size are shown in Figure 6. Eleven of the 15 features have nonzero effect sizes: one counts as "medium", five as "large", one as "very large", and four as "huge" (using the levels defined in Cohen (1988) and Sawilowsky (2009)). The two word-length measures manifest "large" effect sizes, and the features related to sentence length all have "huge" effect sizes. Likewise, the Flesch-Kincaid reading ease and Gunning fog index differences are "huge" and "very large", respectively. Function words and Special English show "large" effects, while common words display a "medium" effect. This confirms that individuals used simpler words online. The pronoun-to-noun ratio is the only part-of-speech ratio that shows a credibly nonzero effect (a "large" effect).

Mode Effects and Authorship Attribution Accuracy
Another way to understand the magnitude of mode effects is to check if standard authorship attribution techniques have a harder time identifying an individual's writing when presented with the same individual's writing composed in a different mode.
That is, we can compare the rate at which an authorship attribution model identifies the correct individual when presented with an unsigned document written offline with the rate at which the model identifies the correct individual when given a document written online. (The model is trained using writing samples written in the offline mode.) This approach has the virtue of allowing us to answer the question we began with: Does the mode effect make authorship attribution more difficult?

For this experiment, we use two authorship attribution models featured in Brennan et al. (2012). One is extremely basic, making use of nine features and a simple feed-forward neural network. The second model uses a larger feature set ("Writeprints Static") and a support vector machine classifier with a linear kernel.

Authorship attribution accuracy is calculated in the following steps:

1. Choose a random subset of the 18 authors, starting with two authors and gradually increasing the subset size.

2. Calculate authorship attribution accuracy with five-fold cross-validation using the authorship attribution model (SVM or neural network) on the pre-existing writing.

3. Repeat the previous steps 1,000 times using different author subsets, and calculate the average accuracy over these replications.

We re-engineered the "Writeprints Static" and "Basic-9" feature sets. Our re-implementation of the Writeprints-Static feature set mirrored the original set with three exceptions. First, we used another widely-used function word list (Koppel et al., 2005) (512 words in total) in lieu of the original word list because we could not locate the original list. Second, we used the Penn Treebank tagset (39 non-punctuation tags in NLTK 3.4.5) rather than the original maximum entropy tagset (22 tags); we expect only very minor differences between the two implementations. Third, we used a linear kernel instead of the polynomial kernel mentioned in Brennan et al. (2012) because the linear kernel performed far better. Indeed, we suspect Brennan et al. (2012) may have used a linear kernel despite reporting a polynomial kernel: a subsequent paper, Overdorf and Greenstadt (2016), which shares an author with Brennan et al. (2012), describes Brennan et al. (2012) as having used an SVM with a linear kernel. The SVM with a linear kernel used a maximum of 100,000 iterations. The neural network used half the sum of the author count and the feature size (nine) as the number of hidden layers, with 100 neurons per layer, and a maximum of 100 iterations. All experiments were performed using Scikit-Learn 0.22.1.
Calculating attribution accuracy for the online writing samples follows similar steps; only Step 2 differs: the model is trained on the pre-existing documents and asked to predict the authorship of a document written online. Figure 7 shows our results. It is clear that accuracy suffers when offline-writing-trained classifiers are applied to online writing. That is, the changes in authorial style are large enough to confuse the classifiers: writing mode differences make authorship attribution more difficult.
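The random-subset accuracy procedure can be sketched as follows. This is a schematic re-creation, not the authors' code: the feature matrix `X_off`, the label array `y_authors`, and the replication count are placeholders, and we use Scikit-Learn's `LinearSVC` (a linear-kernel SVM, as in the experiment) with 5-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def subset_accuracy(X_off, y_authors, n_authors, n_reps=100, rng=None):
    """Average 5-fold CV accuracy of a linear SVM over random author subsets,
    trained and tested on offline writing (Step 2 of the procedure)."""
    rng = rng or np.random.default_rng(0)
    authors = np.unique(y_authors)
    accs = []
    for _ in range(n_reps):
        chosen = rng.choice(authors, size=n_authors, replace=False)
        mask = np.isin(y_authors, chosen)
        clf = LinearSVC(max_iter=100_000)
        accs.append(cross_val_score(clf, X_off[mask], y_authors[mask], cv=5).mean())
    return float(np.mean(accs))
```

For the online condition, the cross-validation in the loop would be replaced by training on the offline rows and scoring on the held-out online rows for the same author subset.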

Discussion
This study investigated whether individuals' writing style varies by "mode": Does the mode used to compose a document, a word processor (offline) versus text entry in a web browser (online), affect measurements of individuals' writing style? Our findings confirm the existence of mode effects. In online writing, respondents tend to use shorter sentences, shorter words, and more adjectives and pronouns relative to nouns. Sentence lengths exhibit lower variability as well.
Therefore, we suggest authorship attribution researchers exercise caution when dealing with texts written using different modes.
For example, in Brennan et al. (2012), the authors attributed lower accuracy in an authorship attribution task to writers' use of authorship attribution circumvention techniques. Our research suggests that this lower accuracy may be due in part to differences in writing mode, as the circumvention techniques were only used in online writing. Systematic differences in writing associated with different modes may complicate a broader range of experiments. Just as researchers appropriately anticipate genre-dependent stylistic differences in individuals' writing (e.g., fiction vs. non-fiction prose), experiments should also anticipate mode-dependent differences.