Adversarial Stylometry in the Wild: Transferable Lexical Substitution Attacks on Author Profiling

Written language contains stylistic cues that can be exploited to automatically infer a variety of potentially sensitive author information. Adversarial stylometry intends to attack such models by rewriting an author’s text. Our research proposes several components to facilitate deployment of these adversarial attacks in the wild, where neither data nor target models are accessible. We introduce a transformer-based extension of a lexical replacement attack, and show it achieves high transferability when trained on a weakly labeled corpus—decreasing target model performance below chance. While not completely inconspicuous, our more successful attacks also prove notably less detectable by humans. Our framework therefore provides a promising direction for future privacy-preserving adversarial attacks.


Introduction
The widespread use of machine learning on consumer devices, and its application to their data, has sparked the interest of security and privacy researchers alike in correctly handling sensitive information (Edwards and Storkey, 2016; Abadi et al., 2016b). Natural Language Processing (NLP) is no exception (Fernandes et al., 2019); written text can contain a plethora of author information, either consciously shared or inferable through stylometric analysis (Rao et al., 2000; Adams, 2006). This characteristic is fundamental to author profiling (Koppel et al., 2002), and while the field's main interest pertains to the study of sociolinguistic and stylometric features that underpin our language use (Daelemans, 2013), herein simultaneously lie its dual-use problems. Author profiling can, often with high accuracy, infer an extensive set of (sensitive) personal information, such as age, gender, education, socio-economic status, and mental health issues (Eisenstein et al., 2011; Alowibdi et al., 2013; Volkova et al., 2014; Plank and Hovy, 2015; Volkova and Bachrach, 2016). It therefore potentially exposes anyone sharing written online content to unauthorized information collection through their writing style. This can prove particularly harmful to individuals in a vulnerable position regarding, e.g., race, political affiliation, or mental health.
Privacy-preserving defenses against such inferences can be found in the field of adversarial stylometry. Our research concerns the obfuscation subtask, where the aim is to rewrite an input text such that the style changes, and stylometric predictions fail. It is part of a growing body of research into adversarial attacks on NLP (Smith, 2012), which various modern models have proven vulnerable to; e.g., in neural machine translation (Ebrahimi et al., 2018), summarization (Cheng et al., 2020), and text classification (Liang et al., 2018).
Adversarial attacks on NLP are predominantly aimed at demonstrating vulnerabilities in existing algorithms or models, such that they might be fixed, or explicitly improved through adversarial training. Consequently, most related work focuses on white or black-box settings, where all or part of the target model is accessible (e.g., its predictions, data, parameters, gradients, or probability distribution) to fit an attack. The current research, however, does not intend to improve the targeted models; rather, we want to provide the attacks as tools to protect online privacy. This introduces several constraints over other NLP-based adversarial attacks, as it calls for a realistic, in-the-wild scenario of application.
Firstly, authors seeking to protect themselves from stylometric analysis cannot be assumed to be knowledgeable about the target architecture, nor to have access to suitable training data (as the target could have been trained on any domain). Hence, we cannot optimally tailor attacks to the target, and need an accessible method of mimicking it to evaluate the obfuscation success. To facilitate this, we use a so-called substitute model, which for our purposes is an author profiling classifier trained in isolation (with its own data and architecture) that informs our attacks. Attacks fitted on substitute models have been shown to transfer their success when targeting models with different architectures, or trained on other data, in a variety of machine learning tasks (Papernot et al., 2016). The effectiveness of an attack fitted on a substitute model when targeting a 'real' model is then referred to as transferability, which we will measure for the obfuscation methods proposed in the current research.
Secondly, for an obfuscation attack to work in practice (e.g., given a limited post history), it should suggest relevant changes to the author's writing on a domain of their choice. This implies the substitute models should be fitted locally, and therefore need to meet two criteria: reliable access to labeled data, and being relatively fast and easy to train. To meet the first criterion, the current research focuses on gender prediction, as: i) Twitter corpora annotated with this variable are by far the largest (and most common), ii) author profiling methods typically use similar architectures for different attributes; therefore, the generalization of attacks to other author attributes can be assumed to a large extent, and, most importantly, iii) Beller et al. (2014) and Emmery et al. (2017) have shown that through distant labeling, a representative corpus for this task can be collected in under a day. This allows us to measure transferability of attacks fitted using realistically collected distant corpora to models using high-quality hand-labeled corpora.
As for the attacks, we focus on lexical substitution of content words strongly related to a given label, as those have been shown to explain a significant portion of the accuracy of stylometric models (see e.g., Rao et al., 2000; Burger et al., 2011; Sap et al., 2014; Rangel et al., 2016). To that effect, we extend the substitution attack of Jin et al. (2020) and apply it to author attribute obfuscation. Specifically, we explore the potential of training a simple (so as to meet the speed criterion), non-neural substitute model f to indicate relevant words to perturb, where retaining the original meaning is prioritized.

Figure 1: Obfuscation scenario: model f trains on tweet batches; an omission score is used to determine and rank the words according to their classification contribution. These are then passed to either TextFooler, Masked BERT, or Dropout BERT to suggest top-k replacement candidates. From these, a selection is made based on their class probability change on f(D). Finally, the target model f′ is evaluated on the perturbed tweets D_ADV.
Two transformer-based models are introduced to the framework to propose and rank lexical substitutions towards a change in the predictions of f. We evaluate whether the attacks on f transfer across corpora, architectures, and to a separately trained target model f′ (see Figure 1). Finally, we measure the quality of changes using automatic evaluation metrics, and conduct a human evaluation that focuses on detection accuracy of the attacks.

Related Work
Stylometry, the study of (predominantly) writing style, dates back several decades (Mosteller and Wallace, 1963), and has seen increased accessibility through the introduction of statistical models (see surveys by Holmes, 1998; Neal et al., 2017) and machine learning (e.g., Matthews and Merriam, 1993; Merriam and Matthews, 1994). Computational stylometry distinguishes several subtasks, such as determining (Baayen et al., 2002) and verifying author identity (Koppel and Schler, 2004), and author profiling (Argamon et al., 2005); e.g., predicting demographic attributes. Adversarial stylometry (as conceptualized by Brennan et al., 2012) intends to subvert these inferences by changing an author's text through imitation, or, as pertains to our research, the obfuscation of writing style (Kacmarcik and Gamon, 2006; Caliskan et al., 2018; Le et al., 2015). These changes, or perturbations, can be produced in several ways, and the task is therefore often conflated with paraphrasing (Reddy and Knight, 2016), style transfer (Kabbara and Cheung, 2016), and generating adversarial samples or triggers (Zhang et al., 2020b). Regardless of the employed method, the main challenge of obfuscation lies in retaining the original meaning of an input text; its written language medium limits any perturbations to discrete outputs, and unnatural discrepancies are significantly easier for humans to discern than, say, a few pixel changes in an image. An additional, persistent limitation is the absence of evaluation metrics that guarantee complete preservation of the original meaning of the input whilst changes remain unnoticed. This not only inhibits automatic evaluation of obfuscation, but of all natural language generation research (Novikova et al., 2017), placing an emphasis on human evaluation (van der Lee et al., 2019).
It is perhaps for this reason that most obfuscation work uses heuristically-driven, controlled changes such as splitting or merging words or sentences, removing stop words, or changing spelling, punctuation, or casing (see e.g., Karadzhov et al., 2017; Eger et al., 2019). These specific attacks are typically easy to mitigate through preprocessing (Juola and Vescovi, 2011). Obfuscation through lexical substitution (Mansoorizadeh et al., 2016; Bevendorff et al., 2019, 2020) provides a middle ground of control, semantic preservation, and attack effectiveness; however, such attacks might prove less effective against models relying on deeper stylistic features (e.g., word order, part-of-speech (POS) tags, or reading complexity scores). End-to-end systems have been employed for similar purposes (Shetty et al., 2018; Saedi and Dras, 2020), or to rewrite entire phrases (Emmery et al., 2018; Bo et al., 2019) using (adversarially-driven) autoencoders. Such attacks seem less common, and provide less control over the perturbations and semantic consistency.
Our work does not assume the attacks run end-to-end, but rather with a hypothetical human in the loop. We further opt for techniques that are more likely to find strong semantic mirrors to the original text while making minimal changes. A substitute model (the algorithm, hyper-parameters, and output of which an author can manipulate as desired) is employed to indicate candidate replacement words, and our attacks suggest and rank those against this substitute. Moreover, prior work typically attacks adversaries trained on the same data, whereas we add a transferability measure. Lastly, while author identification has been investigated in the wild (Stolerman et al., 2013), our work is, to our knowledge, the first to make a conscious effort towards realistic applicability of obfuscation techniques.

Method
Our attack framework extends TextFooler (TF, Jin et al., 2020) in several ways. First, a substitute gender classifier is trained, from which the logit output given a document is used to rank words by their prediction importance through an omission score (Section 3.1). For the top most important words, substitute candidates are proposed, for which we add two additional techniques (Section 3.2). These candidates can be checked and filtered on consistency with the original words (by their POS tags, for example), accepted as-is, or re-ranked (Section 3.3). For the latter, we add a scoring method. Finally, the remaining candidates are used for iterative substitution until TF's stopping criterion is met (i.e., the prediction changes, or candidates run out).
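The overall greedy loop can be sketched in a few lines. This is a minimal, self-contained sketch: `substitute_prob`, `importance`, and `candidates` are hypothetical stand-ins for the substitute classifier f, the omission-score ranking, and the candidate generators, respectively.

```python
# Sketch of the greedy substitution loop: rank tokens by importance, then
# replace each with the candidate that most lowers the substitute model's
# probability for the target label, stopping once the prediction flips.

def attack(tokens, target_label, substitute_prob, importance, candidates, k=3):
    """Greedily perturb the k most important tokens of `tokens`."""
    ranked = sorted(range(len(tokens)), key=lambda i: -importance(tokens, i))
    adv = list(tokens)
    for i in ranked[:k]:
        best, best_prob = None, substitute_prob(adv, target_label)
        for cand in candidates(adv[i]):
            trial = adv[:i] + [cand] + adv[i + 1:]
            p = substitute_prob(trial, target_label)
            if p < best_prob:          # candidate lowers target-class probability
                best, best_prob = cand, p
        if best is not None:
            adv[i] = best
        if best_prob < 0.5:            # stopping criterion: prediction changed
            break
    return adv
```

In the actual framework, the candidate set additionally passes through the filtering and re-ranking steps described below before substitution.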

Target Word Importance
We are given a target classifier f′, a substitute classifier f, a document D consisting of tokens D_i, and a target label y. Here, f is trained on some corpus X, and receives an author's new input text D, where the author provides label y. We denote a class label as ȳ if f(D) predicts anything but y. Our perturbations form adversarial input D_ADV, intended to produce f(D_ADV) = ȳ, and thereby implicitly f′(D_ADV) = ȳ. Note that we only submit D to f′ for evaluating the attack effectiveness; the target is never used to fit the attack itself.
To create D_ADV, a minimum number of edits is preferred, and thus we rank all words in D by their omission score (similar to e.g., Kádár et al., 2017) according to f (omission score in Algorithm 1). Let D\i denote the document after deleting D_i, and o_y(D) the logit score of f for label y. The omission score is then given by o_y(D) − o_y(D\i), and used as the importance score I of token D_i:

I_{D_i} = o_y(D) − o_y(D\i)    (1)

With I_{D_i} calculated for all words in D, the top k ranked tokens are chosen as target words T.

Algorithm 1 signature:
Input: f (substitute model); D = {w_0, w_1, . . . , w_n} (document); y (target label); checks (apply checks, bool); k (maximum number of target words)
Output: D_ADV (obfuscated document)
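A minimal sketch of this omission-score ranking, with `logit` standing in for the substitute model's logit output o_y for the target class:

```python
# Omission-score ranking: the importance of token i is the drop in the
# substitute model's target-class logit when that token is deleted.

def omission_scores(tokens, logit):
    full = logit(tokens)
    return [full - logit(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]

def top_k_targets(tokens, logit, k):
    """Indices of the k tokens whose removal most lowers the logit."""
    scores = omission_scores(tokens, logit)
    return sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
```

In practice, `logit` would wrap the locally trained substitute classifier; here it is any callable from a token list to a score.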

Lexical Substitution Attacks
Four approaches to perturb a target word t ∈ T are considered in our experiments. These operations are referred to as candidates in Algorithm 1.

Synonym Substitution (WS)
This TF-based substitution embeds the target word t using a pre-trained embedding matrix V. The candidate set C_t is selected by computing the cosine similarity between t and all available word embeddings w ∈ V, which we denote Λ(t, w). A threshold δ is used to keep only reliable candidates with Λ(t, w) > δ.
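Under the hood, this is a thresholded nearest-neighbor lookup in embedding space. A minimal sketch, where `vocab` and `E` are hypothetical stand-ins for a pre-trained embedding vocabulary and matrix:

```python
import numpy as np

# Embedding-based synonym candidates (WS): keep words whose cosine
# similarity to the target word exceeds delta, ranked by similarity.

def synonym_candidates(target, vocab, E, delta=0.7, top_n=50):
    t = E[vocab.index(target)]
    # cosine similarity of t against every row of E
    sims = E @ t / (np.linalg.norm(E, axis=1) * np.linalg.norm(t))
    order = np.argsort(-sims)
    return [vocab[i] for i in order
            if vocab[i] != target and sims[i] > delta][:top_n]
```

The defaults mirror the δ = 0.7 and N = 50 settings adopted from Jin et al. (2020) later in the paper.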
Masked Substitution (MB) The embedding-based substitutions can be replaced by a language model predicting the contextually most likely token. BERT (Devlin et al., 2019), a bi-directional encoder (Vaswani et al., 2017) trained through masked language modeling and next-sentence prediction, makes this fairly trivial. By replacing t with a mask, BERT produces a top-k most likely C_t for that position. Implementing this in TF does imply each previous substitution of t might be included in the context of the current one. This method of contextual replacement has two drawbacks: i) semantic consistency with the original word is not guaranteed (as the model has no knowledge of t), and ii) the replaced context means semantic drift can occur, as all subsequent substitutions follow the new, possibly incorrect context.

Dropout Substitution (DB) A method to circumvent the former (i.e., BERT's masked prediction limitations for lexical substitution) was presented by Zhou et al. (2019). They apply dropout (Srivastava et al., 2014) to BERT's internal embedding of target word t before it is passed to the transformer, zeroing part of the weights with some probability. The assumption is that C_t (BERT's top-k) will then contain candidates closer to the original t than the masked suggestions.
Heuristic Substitution To evaluate the relative performance of the techniques described above, we employ several heuristic attacks as baselines. In the order of Table 3: 1337-speak converts characters to their leetspeak variants, in a similar vein to, e.g., diacritic conversion (Belinkov and Bisk, 2018). Character flip inverts two characters in the middle of a word, which was shown to least affect readability (Rayner et al., 2006). Random spaces splits a token into two at a random position.
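These heuristics can be sketched in a few lines; note that the exact leetspeak character mapping below is our assumption:

```python
import random

# Heuristic baseline attacks: leetspeak conversion, a middle character
# flip, and a random token split.

LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "t": "7"})

def leet(word):
    """Convert characters to (an assumed set of) leetspeak variants."""
    return word.translate(LEET)

def char_flip(word):
    """Swap the two characters straddling the middle of the word."""
    if len(word) < 4:
        return word
    m = len(word) // 2
    return word[:m - 1] + word[m] + word[m - 1] + word[m + 1:]

def random_space(word, rng=random):
    """Split a token in two at a random interior position."""
    if len(word) < 2:
        return word
    i = rng.randrange(1, len(word))
    return word[:i] + " " + word[i:]
```

All three aim to push tokens out-of-vocabulary while keeping the text readable to humans.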

Candidate Filtering and Re-ranking
Given C_t, either all, or only the highest-ranked candidate can be accepted as-is. Alternatively, all candidate documents D′ can be filtered by submitting them to checks, or re-ranked based on their semantic consistency with D. These operations are referred to as rank/filter in Algorithm 1, both of which can be executed.
Part-of-Speech and Document Encoding TF employs two checking components: first, it removes any c that has a different POS tag than t. If multiple D′ exist such that f(D′) = ȳ, it selects the document D′ with the highest cosine similarity to the Universal Sentence Encoder (USE) embedding (Cer et al., 2018) of the original document D. If not, the D′ with the lowest target word omission score is chosen (as per TF's method).
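This selection rule can be sketched as follows, where the toy `encode` replaces the USE encoder and `flips` stands in for checking f(D′) = ȳ:

```python
import math

# Among candidate documents that flip the substitute prediction, keep the
# one whose sentence embedding is closest (by cosine) to the original.

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def select_candidate(original, candidates, encode, flips):
    flipped = [c for c in candidates if flips(c)]
    if not flipped:
        return None            # fall back to lowest omission score instead
    ref = encode(original)
    return max(flipped, key=lambda c: cos(encode(c), ref))
```

The `None` branch is where TF's fallback (lowest target word omission score) would apply.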
BERT Similarity Zhou et al. (2019) use the concatenation of the last four layers in BERT as a sentence's contextualized representation h. We apply this in both Masked (MB) and Dropout (DB) BERT to re-rank all possible D′ by embedding them. Given document D, target t, and candidate document D′, C_t is ranked via the embedding similarity score:

Sim(D, D′) = Σ_i w_{i,t} · cos( h(D_i | D), h(D′_i | D′) )

where h(D_i | D) is BERT's contextualized representation of the i-th token in D, and w_{i,t} is the average self-attention score over all heads in all layers from the i-th token with respect to t in D.


Experiment

Data
We use three author profiling sets (see Table 1 for statistics) that are annotated for binary gender classification (male or female). The first is that of Volkova et al. (2015), which was collected by annotating 5,000 English Twitter profiles through crowdsourcing via Mechanical Turk. This can be considered a 'random' sample of Twitter profiles, and is therefore the most unbiased set of the three. Hence, we consider it the most representative of an author profiling set, and employ it as training split (80%) for f′, and test split for our attacks (20%). The second is the English portion of the Multilingual Hate Speech Fairness corpus of Huang et al. (2020), which was collected with a different objective than author profiling. It was aggregated from existing hate speech corpora (by Waseem and Hovy, 2016; Waseem, 2016; Founta et al., 2018), which were largely bootstrapped with lookup terms, selection of frequently abusive users, etc., and annotated post-hoc with demographic information. The collection did not focus on profiles, and most authors are only associated with a single tweet. This can cause a significant domain shift compared to general author profiling. However, it can be seen as freely available (noisy) data.
Lastly, we include a weakly labeled author profiling corpus by Emmery et al. (2017), collected through English keyword look-up for self-reports, similar to Beller et al. (2014). This corpus likely includes incorrect labels, but was collected in less than a day, making it an ideal candidate for realistic access to (new) data to fit the substitute model.
Preprocessing & Sampling All three corpora were tokenized using spaCy (https://spacy.io; Honnibal and Montani, 2017). Other than lowercasing, allocating special tokens to user mentions and hashtags (# and text were split), and URL removal, no additional preprocessing steps were applied. Every author timeline was divided into chunks of at most 100 tweets (i.e., some contain fewer) to form our documents, implying a maximum of 25 instances per author (some contain one; 2,500 is the API history limit). From the test set, the last 200 instances were sampled for the attack (110 male, 90 female); as the datasets are not shuffled (to avoid overfitting on author-specific features), a few documents of the same author might spill from the train into the test split, and sampling from the end avoids incorporating those in our attack sample. While fairly small, this sample does reflect a realistic attack duration and timeline size, as the attacks would be executed for a single profile.
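The timeline chunking step can be sketched as:

```python
# Divide an author timeline into documents of at most `size` tweets; the
# last chunk may be shorter.

def chunk_timeline(tweets, size=100):
    return [tweets[i:i + size] for i in range(0, len(tweets), size)]
```

With the 2,500-tweet API history limit and size 100, this yields the stated maximum of 25 instances per author.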
We adopt the same parameter settings as Jin et al. (2020) throughout our TF experiments: they set N (the number of considered synonyms) and δ (the cosine similarity minimum) empirically to 50 and 0.7, respectively. For MB and DB, we capped T at 50 and top-k at 10 (to improve speed). For DB, we follow Zhou et al. (2019) and set the dropout probability to 0.3. Our implementations rely on TextFooler (https://github.com/jind11/TextFooler), scikit-learn (https://scikit-learn.org/), TensorFlow (https://tensorflow.org/), Hugging Face (https://huggingface.co/), and PyTorch (https://pytorch.org/).

Models
For f and f′ we require (preferably fast) pipelines that achieve high accuracy on author profiling tasks, and are sufficiently distinct to gauge how well our attacks transfer across architectures, rather than solely across corpora. As state-of-the-art algorithms have not yet proven sufficiently effective for author profiling (Joo et al., 2019), we opt for common n-gram features and linear models.
Logistic Regression Logistic Regression (LR) trained on tf·idf-weighted uni- and bi-gram features has proven a strong baseline for author profiling in prior work. The simplicity of this classifier also makes it a substitute model that can realistically be run by an author. No tuning was performed: C is set to 1.
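A substitute pipeline along these lines can be sketched in scikit-learn; tokenization and preprocessing details are simplified, and the training data below is toy:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tf-idf over word uni- and bi-grams, fed to an untuned LR with C = 1.
def build_substitute():
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(C=1.0, max_iter=1000),
    )
```

Such a pipeline trains in seconds on corpora of the size used here, which is what makes it a plausible locally-run substitute.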

N-GrAM
The New Groningen Author-profiling Model (N-GrAM) from Basile et al. (2018) was proposed as a highly effective, simple model that outperforms more complex (neural) alternatives on author profiling with little to no tuning. It uses tf·idf-weighted uni- and bi-gram token features, character hexa-grams, and sublinearly scaled tf (1 + log(tf)). These features are passed to a Linear Support Vector Machine (Cortes and Vapnik, 1995; Fan et al., 2008), with C = 1.
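This feature set can be sketched in scikit-learn as follows; the exact analyzer settings are assumptions based on the description above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Sublinear tf-idf over word uni-/bi-grams combined with character
# hexa-grams, fed to a Linear SVM with C = 1.
def build_ngram_model():
    features = FeatureUnion([
        ("words", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
        ("chars", TfidfVectorizer(analyzer="char", ngram_range=(6, 6),
                                  sublinear_tf=True)),
    ])
    return make_pipeline(features, LinearSVC(C=1.0))
```

The character features are notable here: as discussed in the results, they help N-GrAM resist noise-based (OOV) attacks.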

Experimental Setup
We are predominantly interested in transferability, and would therefore like to test as many combinations of data and architecture access limitations as possible. If we assume an author does not have access to the target's data, the substitute classifier is trained on any data other than the Volkova et al. corpus. If we assume the author does not know the target model architecture, the target model is N-GrAM (rather than LR). A full model transfer setting (in both data and architecture) will therefore be, e.g.: data f = Emmery et al., data f′ = Volkova et al., f = LR, and f′ = N-GrAM. Finally, for comparison to an optimal situation, we test a setting where we do have access to the adversary's data.

Evaluation
Metrics The obfuscation success is measured as any accuracy score below chance-level performance, which given our test sample is 55%. We would argue that random performance is preferred in scenarios where the prediction of the opposite label is undesired: if an attack drops accuracy to 0%, this effectively flips the (binary) label, and that label might also be undesired by the author (e.g., being classified as having polar opposite political views); hence, the target model being maximally unsure about the classification is desirable. For the current task, however, any accuracy drop to around or below chance level satisfies the conditions for successful obfuscation. To assess semantic consistency, we report METEOR (using the NLTK implementation, https://www.nltk.org/_modules/nltk/translate/meteor_score.html) and BERTScore (Zhang et al., 2020a) between D and D_ADV. METEOR captures flexible uni-gram token overlap including morphological variants, and BERTScore calculates similarities with respect to the sentence context.

Human Evaluation
For the human evaluation, we sampled 20 document pieces (one or more tweets) for each attack type in the best performing experimental configuration. A piece was chosen if it satisfied these criteria: i) contains changes for all three attacks, ii) consists of at least 15 words (excluding emojis and tags), and iii) does not contain obvious profanity (to avoid exposing the raters to overly toxic content, blatant examples were filtered using a keyword list; some minor examples remained, for which we added a disclaimer). All 60 document pieces of the three models were shuffled, and the 20 original versions were appended at the end (so that 'correct' pieces were seen last). Each substitute model therefore has 80 items for evaluation. While in prior work it is common to rate semantic consistency, fluency, and label a text (see e.g., Jin et al., 2020), our Twitter data are too noisy (including many spelling and grammar errors in the originals), and document batches too long, to make this a feasible task. Instead, our six participants (three per substitute) were asked to indicate: a) whether a sentence was artificially changed, and if so, b) one word that raised their suspicion. This way, we can evaluate which attack produces the most natural sentences, and the least obvious changes to the input.
The items were rated individually; the human evaluators did not know beforehand that different versions of the same sentences were repeated, nor that the originals were shown at the end. All participants have a university-level education, a high English proficiency, and are familiar with the domain of the data. Several example ratings of the same sentence can be found in Table 6.

Domain Shift
As we alluded to in Section 4.1, both corpora used to train our substitute models were in fact not reference corpora for author profiling, and can therefore be considered as suboptimal, disjoint domains. The Huang et al. corpus in particular shows a strong domain shift (see Table 4) for both training and test sets. The distantly labeled Emmery et al. corpus achieves 7.5% more accuracy on the train split of Volkova et al., and test performance is significantly higher (27%). We might therefore expect better obfuscation performance from the latter.

Baselines
The results for all attacks are shown in Table 3. Note that these are performances for f′; therefore, when no attacks are applied (none), the performance for both substitute corpora stays the same (as those only influence the attacks). Among the heuristic attacks, 1337 makes the most robust baseline, outperforming some of the other settings, even on transferability. A surface-level advantage is that this attack has a minor impact on readability (when applied conservatively) and does not change semantics; however, the heuristic attacks are fairly simple to mitigate in preprocessing (Juola and Vescovi, 2011) and through character features (as shown by the performance of the N-GrAM model). For transferability, we evidently need to do more than simply trying to convert words to be out-of-vocabulary (OOV) with noise. While it can be argued the heuristics could change all words, shifting everything OOV would not be robust; the target model side could easily spot the anomalous input and might act (e.g., reject) accordingly.

The transformer-based attacks largely match our expectations from the observed domain shift. Our results also show that a noticeable decrease in obfuscation performance occurs (10-30% increased target model performance) when the attacks are transferred to different data and another model. In contrast, as can be observed from the last two columns in Table 3, in a practically unrealistic setting where the model and data are available, the obfuscation is specifically tailored to known weaknesses and therefore highly effective.

Transformer Performance
Looking at the Top-1, Check and Check brackets (Table 3): other than the BERT-based models showing higher transferability than TF, they also retain obfuscation success, deteriorating the target model's performance below chance level (55%) in the settings without additional checks. This comparison also demonstrates that the synonym ranking works (Top-1 vs. Check and Check), and that the Check condition is too restrictive, attaining lower attack power and low transferability. This is further illustrated by the % changes shown in Figure 2. Comparing the MB and DB variants, their performance is almost identical, with masking having a slight advantage. As Zhou et al. (2019) argued, applying dropout should produce words closer to the original (compared to MB), which might affect obfuscation performance. Additionally, the BERT similarity ranking (described in Section 3.3) applied to the Masked substitution candidates could have some beneficial effect. This will have to be studied in more detail using the output evaluations.

Rewrite Metrics The metrics in Figure 2 show a common initial limitation in their application to this task: the more frequently an attack makes no changes, the higher the automatic evaluation metrics (BERTScore, METEOR). Hence, to compare models, these scores need to be considered in light of the obfuscation performance and related work. It can be observed that, with consistently more changes, MB and DB score lower on semantic consistency than TF. However, between MB and DB, and TF for the Emmery et al. corpus, these differences are minor. Furthermore, despite being fit on a different domain, these scores are comparable to prior obfuscation work (e.g., Shetty et al. (2018) report METEOR scores between 0.69 and 0.79).
Human Evaluation The results in Table 5 reflect the same trend observed in Table 3: high obfuscation success seems to result in higher human error when predicting whether a sentence was obfuscated. Conversely, it seems that despite higher semantic consistency scores, the original TF pipeline is easier to detect. This can be attributed to the number of spelling and grammar errors the model makes without its additional checks. Furthermore, the 11% error in identifying the original sentences also reflects some expected margin of error in this task, as our Twitter data is inherently noisy. Finally, while these results are in line with the obfuscation success, and are lower than detectability scores in related work (Mahmood et al., 2020), they also indicate that the models are still detectable above chance level. Given three alternatives (including the original), performance should be 25% or lower to indicate no intrusive changes are made to the text, i.e., changes that are not semantically coherent or not inconspicuous enough (both metrics used in related work). Therefore, while the presented approaches are effective and realistically transferable, there is room for improvement towards practical applicability.

Table 6: Example ratings of different attacks (not shown together to the human evaluators) on two sentences with varying semantic consistency and human detection accuracy. In the first example, HMB was marked unaltered by all raters, HDB by the majority, and HTF by none. In the second, only HDB was marked unaltered, by only one rater. Attacked words are marked in bold; guessing any one of these would count as correctly identifying the attack.

Discussion and Future Work
We have demonstrated the performance of author attribute obfuscation under a realistic setting. Using a simple Logistic Regression model for candidate suggestion, trained on a weakly labeled corpus collected in a day, the attacks successfully transferred to different data and architectures. This is a promising result for future adversarial work on this task, and for its practical implementation. It remains challenging to automatically evaluate how invasive the required changes are for successful obfuscation, particularly to an author's message as a whole. However, in practice such considerations could be left up to the author. In this human-in-the-loop scenario, a more extensive set of candidates could be suggested, and their effect on the substitute model shown interactively. This way, the attacks can be manually tuned to find a balance of effectiveness and inconspicuousness, and to guarantee semantic consistency. It would also show the author how their writing style affects potential future inferences.
Regarding the performance of the attacks: we demonstrated the general effectiveness of contextual language models in retrieving candidate suggestions. However, the quality of those candidates might be improved with more extensive rule-based checks, e.g., through deeper analyses using parsing. Nevertheless, such avenues leave us with a core limitation of rewriting language, and therefore of NLP more broadly: while the Masked attacks seemed more successful in our experiments, after manual inspection of the perturbations, Dropout was found to often be semantically closer (see also Table 6), which was not reflected in the human evaluation. This raises the question whether any automated approach, evaluated under the current limitations of semantic consistency metrics, could realistically optimize for both obfuscation and inconspicuousness.
As such, we would argue that future work should focus on making as few perturbations as possible, retaining only the minimum amount of required obfuscation success. Given this, the other constraints become less relevant; one could generate short sentences (e.g., a single tweet) that might be semantically or contextually incorrect, but if it is a message in a long post history, it will hardly be detectable or intrusive. This would require certain triggers (as demonstrated by Wallace et al. (2019) for example), and ascertaining how well they transfer.

Conclusion
In our work, we argued realistic adversarial stylometry should be tested on transferability in settings where there is no access to the target model's data or architecture. We extended previous adversarial text classification work with two transformer-based models, and studied their obfuscation success in such a setting. We showed them to reliably drop target model performance below chance, though human detectability of the attacks remained above chance. Future work could focus on further minimizing this detection under our realistic constraints.