Through the Looking Glass: Learning to Attribute Synthetic Text Generated by Language Models

Given the potential misuse of recent advances in synthetic text generation by language models (LMs), it is important to have the capacity to attribute authorship of synthetic text. While stylometric authorship attribution of organic (i.e., human-written) text has been quite successful, it is unclear whether similar approaches can be used to attribute a synthetic text to its source LM. We address this question with the key insight that synthetic texts carry subtle distinguishing marks inherited from their source LM and that these marks can be leveraged by machine learning (ML) algorithms for attribution. We propose and test several ML-based attribution methods. Our best attributor, built using a fine-tuned version of XLNet (XLNet-FT), consistently achieves excellent accuracy scores (91% to a near-perfect 98%) in attributing the parent pre-trained LM behind a synthetic text. Our experiments show promising results across a range of settings where the synthetic text may be generated using pre-trained LMs, fine-tuned LMs, or varying text generation parameters.


Introduction
Recent advancements in natural language processing have enabled synthetic text generation that is often of comparable quality to organic text (Ippolito et al., 2020; Zellers et al., 2019; Gehrmann et al., 2019). This capability has the potential to be misused by malicious actors to launch misinformation, spam, and phishing campaigns (Solaiman et al., 2019; Brown et al., 2020). To prevent potential misuse, prior research has shown considerable success in building machine learning (ML) algorithms that detect (Zellers et al., 2019) or assist humans in detecting (Gehrmann et al., 2019) synthetic text.
While prior research has shown promise in distinguishing between synthetic and organic text, very little has been done on attributing the authorship of the language model (LM) generating the synthetic text (Pan et al., 2020). It is important to be able to track the provenance of synthetic text to the source LM. This can be useful in identifying perpetrators of potential misuse and the unauthorized use of an LM (e.g., in case it is stolen through sophisticated model inversion attacks (Fredrikson et al., 2015) or outright security breaches).
It is particularly challenging to attribute the authorship of synthetic texts because of the variety and number of available LMs and configurations. While there are only a handful of public pre-trained LMs, it is common to further fine-tune them before using them to generate synthetic text (Devlin et al., 2019; Sanh et al., 2020). Fine-tuning can significantly impact the characteristics of the generated text (Howard and Ruder, 2018; Cruz and Cheng, 2019). Moreover, variations in the sampling parameters used while generating synthetic text, whether from pre-trained or fine-tuned LMs, can further impact text characteristics (Zellers et al., 2019).
In this paper, we design and evaluate ML-based techniques for attributing the LM and configuration used to generate a synthetic text. We do this in the context of four problem scenarios, each representing a variation of a threat posed by an adversary or malicious user. The scenarios vary in terms of what information the LM attribution system has about the adversary's strategy for generating fake text.
Methodologically, our key insight for attributing the LM used by the adversary is that differences in LM architecture (i.e., layers, parameters), training (i.e., pre-training and fine-tuning), and generation techniques (i.e., sampling parameters) leave subtle marks on the generated synthetic texts. The success of our attributors at identifying the LM and configuration used relies on the presence of these subtle distinguishing marks and on the ability to exploit them effectively. As our results indicate, this success holds especially in terms of attributing the pre-trained models used to generate text, even under varying conditions.
In summary, our key contributions are:
• We evaluate a variety of attribution techniques on their ability to attribute the LM and configuration used to generate text. These include attributors making use of stylometric features as well as static and dynamic embeddings.
• We evaluate these attributors on a corpus of 350,000 synthetic texts that we generated in a controlled manner using combinations of LMs, sampling parameters, and fine-tuning.
• Our best attributor, built on top of a fine-tuned version of XLNet (XLNet-FT), performs excellently at identifying the pre-trained LM used to generate coherent synthetic texts, with accuracy ranging between 91% and a near-perfect 98%. This performance holds across experiments with fine-tuning and different sampling parameters. However, performance is mediocre when attributing the fine-tuned LM used to generate the text.
Paper Organization: The rest of the paper is organized as follows. Section 2 presents the different threat models based on the adversary's strategy for generating synthetic text and assumptions made by the attributor. We then describe our data and attribution methods in Section 3. Experimental results are in Section 4. Section 5 contextualizes our work with respect to prior literature. Section 6 concludes the paper with an outlook on future work.

Threat Model
This section describes the different threat models that we consider in this paper. The adversary's goal is to generate synthetic text using language models (LMs). The attributor's goal is to attribute the synthetic text to the source LM used by the adversary. All of the threat models operate under a closed-world scenario, where the attributor is assumed to know the universe of LMs. The threat models differ based on the adversary's LM training (i.e., pre-training or fine-tuning) and sampling strategies.

Attributing pre-trained LMs
In the first scenario, the adversary uses a pre-trained LM to generate synthetic text. The attributor trains a classifier to attribute the synthetic text to the source pre-trained LM. We assume a closed-world scenario where both the adversary and the attributor have access to the same set of off-the-shelf pre-trained LMs. More formally, the scenario can be described as: given n pre-trained LMs PM_1, PM_2, ..., PM_n, the goal is to train an n-class attributor to attribute test instances to the correct source pre-trained LM. In this scenario, the adversary generates texts using PM_k, where 1 ≤ k ≤ n, and the attributor's goal is to predict the label PM_k for the generated texts.

Attributing fine-tuned LMs to parent pre-trained LMs
In this scenario, the adversary fine-tunes a pre-trained LM to generate synthetic text. The attributor trains a classifier to attribute the synthetic text to the source pre-trained LM. The main difference from the first scenario is that the attributor is unaware of the fine-tuning performed by the adversary before generating text. Note that the goal of the attributor is to detect the source pre-trained LM rather than the fine-tuned LM that is used to generate synthetic text. More formally, the scenario can be described as: given n pre-trained LMs PM_1, PM_2, ..., PM_n, and an LM FM_k obtained by fine-tuning PM_k, where 1 ≤ k ≤ n, the goal is to train an n-class attributor to attribute test instances to the correct source pre-trained LM. In this scenario, the adversary generates text using the fine-tuned LM FM_k and the attributor's goal is to predict the label PM_k for the generated text.

Attributing pre-trained or fine-tuned LMs with different sampling parameters

In this scenario, the attributor trains a classifier to attribute the synthetic text generated by the adversary using a pre-trained or fine-tuned LM. The main difference from the first scenario is that the adversary potentially uses different sampling parameters for text generation than those used by the attributor to train the classifier. More formally, the scenario can be described as: given n pre-trained or fine-tuned LMs M_1, M_2, ..., M_n, the goal is to train an n-class attributor to attribute test instances to the correct source model. In this scenario, the adversary generates texts using model M_k, 1 ≤ k ≤ n, with sampling parameters S_k that are unknown to the attributor, and the attributor's goal is to predict the label M_k for the generated text.

Attributing fine-tuned variants of a pre-trained LM
In this scenario, the adversary fine-tunes a pre-trained LM to generate synthetic text. The attributor trains a classifier to attribute the synthetic text to the source fine-tuned LM. The main difference from the second scenario is that the attributor is aware of the fine-tuning used by the adversary. Note that there are multiple fine-tuned variants of the same parent pre-trained LM. More formally, the scenario can be described as: given n fine-tuned LMs FM_1, FM_2, ..., FM_n, the goal is to train an n-class attributor to attribute test instances to the correct fine-tuned LM. In this scenario, the adversary generates text using a fine-tuned LM FM_k and the attributor's goal is to predict the label FM_k for the generated text.

Data & Methods
In this section, we present details about (1) the text-generating language models (LMs) and their configurations, and (2) the attributors studied. To address our research goals, we need a dataset of synthetic texts generated by various pre-trained and fine-tuned LMs under different configurations. Publicly available datasets are unsuitable because there can be high variability in the conditions under which their text was generated. It is crucial for us to control the underlying conditions, such as the architecture of the LM, the prompt used for text generation, the sampling parameters, and the data size and topics used for fine-tuning. Details about this generated dataset are also provided in this section.

Text Generation: LMs, parameters, and configurations
We used four pre-trained LMs: OpenAI GPT (Radford et al., 2018), OpenAI GPT2, XLNet (Yang et al., 2019), and BART (Lewis et al., 2020). BART and XLNet are both based on the BERT architecture, which makes use of the bidirectional context of input text to develop a deep understanding of language. XLNet improves on BERT with a form of generalized autoregressive pre-training using permutation modeling, and it outperforms BERT on several classification tasks (Yang et al., 2019). BART combines the bidirectional encoder used by BERT with the autoregressive decoder used by GPT and, through a noising and text reconstruction pre-training task, achieves good performance in both language understanding and language generation tasks. In other words, both BART and XLNet augment their training strategies to make up for the lack of language generation capabilities in BERT. GPT and GPT2 are architecturally identical LMs, with GPT2 trained on 10 times the data used for training the original GPT. Both use a more traditional generative pre-training approach, looking only at the context coming before a part of the text and not after. All four pre-trained LMs are publicly available.

Text generation parameters
Three key parameters when generating texts are: p, k, and temperature. The range of values tested is given in Table 1, with default values emphasized in boldface. Note that one chooses either top-p or top-k sampling, since both have the same goal: controlling the number of words taken into consideration while sampling text from an LM.
With top-k sampling, the LM randomly chooses one word from the top k words. With top-p sampling, it chooses from the smallest set of words whose cumulative probability exceeds p. Both Zellers et al. (2019) and Holtzman et al. (2020) conclude that synthetic text matches organic text closely when the p-value is kept in the range [0.9, 1.0]. Higher values lead to repetitions as the length of the text increases. Thus, we choose the lower limit of p from the range [0.9, 1.0].
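The two truncation strategies can be sketched as follows. This is an illustrative NumPy implementation over a toy next-token distribution, not the sampling code used in our experiments:

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    keep = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (nucleus sampling), then renormalize."""
    order = np.argsort(probs)[::-1]              # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # tokens needed to reach p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Toy next-token distribution over a 5-word vocabulary.
probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_k_filter(probs, 2))    # only the two most probable words survive
print(top_p_filter(probs, 0.9))  # smallest set covering at least 90% mass
```

In an actual generation loop, the next word is then drawn at random from the filtered distribution.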
For top-k sampling, we use a range of values both higher and lower than 40, the default used for text generation in the breakthrough GPT2 paper. Between top-p and top-k sampling, we chose top-p (p = 0.9) as the default due to its lower dependency on vocabulary size and its extensive use in previous research on GPT2 (Zellers et al., 2019; Ippolito et al., 2020).
Temperature controls the likelihood of low-probability words appearing in the final pool of words used for random selection (Holtzman et al., 2020). Higher temperatures produce text containing highly unusual words that are normally not favored by top-k or top-p sampling. At the other end, Holtzman et al. (2020) note that temperatures below 1 reduce the chance of low-probability words being sampled, making the generated text more repetitive and predictable.
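Temperature itself is applied to the logits before the softmax; a minimal sketch with toy values:

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Rescale logits before the softmax: temperature < 1 sharpens the
    distribution (favoring high-probability words), temperature > 1
    flattens it (letting unusual words through)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()               # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0, 0.5]            # toy next-token logits
print(apply_temperature(logits, 0.7))    # sharper: mass concentrates on top word
print(apply_temperature(logits, 1.5))    # flatter: rarer words gain probability
```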

Data for fine-tuning
For scenarios where we need synthetic text generated using fine-tuned LMs, we limit text generation to the GPT2 LM, which has been shown to have state-of-the-art performance in language generation tasks (Klein and Nabi, 2019). Data from four Reddit communities was used for fine-tuning LMs: r/relationships, r/technology, r/changemyview, and r/conspiracy.
These subreddits were chosen based on qualitative differences between their content. r/technology contains technical jargon, while r/relationships focuses on personal pronouns and adopts a critical approach towards writing. r/changemyview has confrontational content with members attempting to challenge and disprove each other's views, while r/conspiracy focuses on hyperbolic statements. In essence, each subreddit is considered a different topic area. Table 2 shows the number of posts and comments scraped from each subreddit.

Dataset details
We generate text of three different lengths: short (up to 40 words), medium (between 40 and 100 words), and long (above 100 words). In experiments where length is not the focus, we use medium as the default. Each synthetic text is generated using a randomly selected subreddit submission as a prompt. We start by sampling a number of words equal to the length of the prompt from the LM. We trim the generated text to follow standard sentence structure, such as starting capitalization and ending punctuation, after which the text is sorted into one of the three length categories. We generated 10,000 synthetic texts for each target class in our experiments. For example, when evaluating the performance of attributors against fine-tuned LMs, we generated 10,000 samples for each of four GPT2 LMs fine-tuned on one of the four Reddit topics mentioned previously. In total, 35 distinct sets of synthetic documents, each with 10,000 examples, were used, for a total of 350,000 unique synthetic documents (we will make this dataset available for research upon publication of our paper). We build training and test datasets with balanced classes for each scenario: while there is growing evidence that synthetic text is appearing in the wild, there is little to no information about the relative impact of the source LMs, so any split other than an even split across classes has little justification.
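The length bucketing and trimming described above can be sketched as follows; `trim_to_sentences` is a hypothetical simplification of our post-processing, keeping only start capitalization and a cut at the last sentence-ending punctuation mark:

```python
def length_bucket(text):
    """Sort a generated text into the three length categories:
    short (up to 40 words), medium (41-100), long (above 100)."""
    n = len(text.split())
    if n <= 40:
        return "short"
    if n <= 100:
        return "medium"
    return "long"

def trim_to_sentences(text):
    """Capitalize the first character and drop any trailing fragment
    after the last sentence-ending punctuation mark."""
    end = max(text.rfind(ch) for ch in ".!?")
    if end != -1:
        text = text[: end + 1]
    return text[:1].upper() + text[1:]

sample = "the model wrote this. and then trailed off without"
print(trim_to_sentences(sample))    # "The model wrote this."
print(length_bucket("word " * 60))  # "medium"
```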

Attributors
We test six attributors in their ability to identify the source LMs. The first attributor is a decision tree classifier with the Writeprints (Abbasi and Chen, 2008) feature set; the second is a CNN with GloVe (Pennington et al., 2014) embeddings as the feature set. The remaining four attributors are softmax classifiers, the first two built on top of pre-trained XLNet and GPT2, and the other two on top of XLNet and GPT2 fine-tuned on the training data used in the corresponding scenario.

Decision tree with Writeprints features
The Writeprints features have been used extensively and successfully for authorship attribution (Abbasi and Chen, 2008; Mahmood et al., 2020). When combined with SVMs and decision trees, they have shown good performance in attribution tasks (Abbasi and Chen, 2008; Pearl and Steyvers, 2012). Due to the ease of interpretability of its features, we implement a decision tree classifier with Writeprints features to test our intuition that stylistic, rather than topical, differences contribute towards the attribution of synthetic text.
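As an illustration, a handful of Writeprints-style statistics can be computed as below. This is a small hypothetical subset; the full Writeprints set is far richer, covering hundreds of lexical, syntactic, structural, and idiosyncratic features:

```python
import string

def stylometric_features(text):
    """Compute a few lexical and character-level statistics that
    capture style rather than topic."""
    words = text.split()
    chars = len(text)
    return {
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
        "digit_ratio": sum(c.isdigit() for c in text) / max(chars, 1),
        "uppercase_ratio": sum(c.isupper() for c in text) / max(chars, 1),
        "punct_ratio": sum(c in string.punctuation for c in text) / max(chars, 1),
    }

feats = stylometric_features("The quick brown fox jumps over the lazy dog.")
print(feats)
```

Vectors of such features, one per document, would then be fed to the decision tree classifier.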

CNN with GloVe embeddings
Pre-trained GloVe embeddings have been shown to outperform word frequency and count-based embeddings for sentence and sequence classification tasks (Pennington et al., 2014;Le-Hong and Le, 2018). Also, the use of GloVe with CNNs has shown good results in classification tasks like newsgroup identification (Gupta et al., 2018).

Attributors from LM embeddings
Embeddings generated by LMs like BERT, XLNet, and GPT2 have been shown to capture language semantics and context much better than static embeddings generated by GloVe and other word-count or frequency-based embedding generators (Sun et al., 2020; Howard and Ruder, 2018). Because of their extensive pre-training, these LMs can capture long-term dependencies and incorporate contextual and hierarchical relations between words better than pre-computed static embeddings.
LMs such as XLNet make use of a special [CLS] token to obtain a pooled output representing a complete text sequence. We use the final network layer embeddings of this token for attribution. Specifically, we train a softmax output layer that takes XLNet's [CLS] token embeddings as input and generates probabilities for each decision class in the experimental setup. For GPT2 we use a parallel strategy with the pooled output from the complete final layer of the model for a given input text. Again, this output is connected to a softmax output layer which, as with XLNet, is trained to generate class predictions based on the input embeddings.
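A minimal sketch of this softmax attribution head, assuming the pooled embeddings have already been extracted. Here the 768-dimensional [CLS] vectors are simulated with synthetic data rather than taken from XLNet, and a weak per-class signal is planted so the toy task is learnable:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Simulated stand-in for XLNet [CLS] embeddings: 4 classes, one per
# source LM, with a planted signal in one coordinate per class.
n, dim, n_classes = 200, 768, 4
labels = rng.integers(0, n_classes, size=n)
embeddings = rng.normal(size=(n, dim))
embeddings[np.arange(n), labels] += 5.0

# Softmax output layer trained with plain gradient descent on the
# cross-entropy loss.
W = np.zeros((dim, n_classes))
b = np.zeros(n_classes)
onehot = np.eye(n_classes)[labels]
for _ in range(300):
    probs = softmax(embeddings @ W + b)
    grad = probs - onehot                  # cross-entropy gradient
    W -= 0.05 * embeddings.T @ grad / n
    b -= 0.05 * grad.mean(axis=0)

train_acc = (softmax(embeddings @ W + b).argmax(axis=1) == labels).mean()
print(f"training accuracy: {train_acc:.2f}")
```

For GPT2 the same head would sit on the pooled final-layer output; in practice the embeddings come from the LM itself rather than being simulated.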
In addition to using the pre-trained versions of XLNet and GPT2, we also evaluate attributors built from fine-tuned versions of these LMs. Note that here fine-tuning is on training data used to train all attributors in the corresponding experiment. Figure 2 illustrates these strategies with XLNet as an example. The sequence with dashed lines represents the fine-tuned versions.

Results
We present attribution accuracy results in the same order as the scenarios described in Section 2. Table 3 presents the accuracy results for short (up to 40 words), medium (between 40 and 100 words), and long (more than 100 words) synthetic text generated using the pre-trained GPT, GPT2, XLNet, and BART language models (LMs). The decision tree and the two XLNet versions achieve accuracy between 82% and a near-perfect 98% across the three types of texts. In comparison, the CNN and GPT2 attributors lag behind.

Attributing pre-trained LMs
While both XLNet attributors score higher than the decision tree, XLNet-FT has the best performance, leading the next best attributor (XLNet-PT) by 3% to 7%. Note that, apart from the pre-trained GPT2 attributor, all attributors show a marked improvement in accuracy with increasing text length. Similar results, where classifier performance is directly proportional to text length, were also observed by Ippolito et al. (2020) in experiments detecting synthetic text.
Prior work has shown that uni-directional LMs are more suited for language generation due to generative pre-training (Lewis et al., 2020), where the LM learns to predict the next word based on the previous context. Bidirectional LMs like BERT and XLNet excel at classification, as they make use of masked modeling and next-sentence prediction tasks to improve their grasp of the necessary language attributes (Devlin et al., 2019; Yang et al., 2019). Our results are consistent with this: the XLNet-based attributors outperform the GPT2-based ones.
Interestingly, the decision tree with Writeprints outperforms the GPT2-based attributors at all three text lengths. Our investigation into the specific Writeprints features emphasized by the decision tree (see Appendix A.1) reveals a greater emphasis on stylistic features. This gives further credence to our intuition that variations between texts generated by different LMs are more stylistic than topical in nature. Our results suggest that GPT2-based attributors are not adept at capturing such stylistic differences.

Attributing fine-tuned LMs to the parent pre-trained LMs
Our goal in this scenario is to attribute the synthetic text generated using a fine-tuned variant (using an unknown dataset) of a pre-trained LM. Note that the attributor is unaware of fine-tuning. We limit fine-tuned text generation in this experiment to just GPT2 for reasons described in Section 3.
The first row in Table 4 reports the accuracy results. We note that XLNet-FT again performs best, with XLNet-PT in second place. The CNN has the weakest results. Interestingly, Writeprints continues to do fairly well, once again emphasizing the role of style in identifying the source LM. Comparing these results with Table 3 (for medium-length texts), we see all accuracies drop slightly, as expected when the adversary fine-tunes the LM in a way unknown to the attributor.
We run a second variation of the same experiment, one where the attributor has partial knowledge of the adversary's strategy. Specifically, the fact that the adversary is using a fine-tuned LM to generate text is known, but the dataset used for fine-tuning remains unknown. In response, we pick some dataset (here r/relationships) and add texts generated by an LM fine-tuned on it to our training data. Note that the adversary uses r/changemyview. The second row in Table 4 reports similar results to the first row. Thus, it seems that this additional knowledge does not help improve attribution accuracy.
In sum, the accuracy of XLNet-FT across all experiments thus far is above 90%. This indicates that even when the adversary fine-tunes the LM for text generation, the parent pre-trained LM is still identifiable. This result confirms our intuition that, as fine-tuning is known to leave the majority of layers unchanged, the generated text retains characteristics of the parent pre-trained LM, making accurate attribution possible.

Table 4: Accuracy percentages in attributing the source pre-trained LM when the adversary generates synthetic text using a fine-tuned LM. In addition to the GPT2 variants mentioned in columns 1 and 2, the training and testing data also include classes representing XLNet, BART, and GPT.

Attributing LM with different sampling parameters
Here we consider the scenario where both the attributor and the adversary use the same LM but differ in their parameter choices when generating texts. We run this experiment assuming the adversary uses GPT2 fine-tuned on r/changemyview; the attributor is aware of this but not of the sampling parameters. Selecting parameter values that are quite different from each other, we see from Table 5 that there is virtually no performance drop for XLNet-FT. That is, our best-performing attributor is resilient to these differences. Temperature sampling shows weaker results all around; as discussed later in this section, this is not a concern.
We next explore the parameter differences further to get a sense of what would happen if the adversary chose a parameter value other than the ones explored in Table 5. The different values for k, p, and temperature are as listed in Table 1. We remind the reader that one uses either top-k or top-p sampling to control the number of words under consideration during text generation; we use top-p sampling as the default strategy. When varying k or p, the temperature is fixed at its default value. When varying the temperature, p is kept at its default value.
The results in Table 6 show that it is challenging to tell apart synthetic texts generated with different values of k and p. Given their strong similarities, we expect results like those in Table 5 if the adversary picks other parameter values. With temperature variations we get accuracy above 80%, indicating marked differences between texts generated at different temperatures. However, a closer look at the text reveals a serious problem: temperature > 1 produces erratic and confusing text, and this problem becomes more acute as the temperature approaches its upper limit. This is consistent with the observation by Holtzman et al. (2020) that temperatures above 1 produce incoherent and confusing text, which reduces its viability in a setting where the synthetic text is meant to serve as a suitable replacement for organic text.
We conclude from Tables 5 and 6 that our attributors should be resilient even when the adversary chooses parameter values for text generation beyond the ones explicitly tested here.

Attributing fine-tuned variants of a pre-trained LM
Finally, we explore the scenario where the adversary uses different fine-tuned LMs sharing the same parent pre-trained LM. The attributor is aware of this fine-tuning and attempts to tell these fine-tuned LMs apart. Table 7 presents the accuracy results when synthetic text is generated by fine-tuning GPT2 on four different subreddits (r/changemyview, r/technology, r/relationships, r/conspiracy). XLNet-FT again achieved the best accuracy; however, this time it is less than 60%. Curiously, the CNN, which was the least successful in earlier experiments, performed almost identically to XLNet-FT. GPT2 performed only slightly better than a random attributor (1/4, i.e., 25%). Overall, variations between texts generated by different fine-tuned variants of the same pre-trained LM are not pronounced enough to be leveraged by the attribution techniques we consider. Our preliminary analysis shows some correlation between the attributor's mistakes and the vocabulary similarity of the corresponding subreddits. However, further research is needed to probe the causes of this lackluster performance and to devise ways to improve the attribution of text produced by fine-tuned LMs.

Table 7: Accuracy percentages in attributing fine-tuned GPT2 LMs. The dataset contains texts generated by four GPT2 LMs, each fine-tuned on one of the subreddits in Table 2.

Related Work
We contextualize our work with respect to prior literature on detection and attribution of organic and synthetic text.

Synthetic text detection
There has been a lot of recent interest in developing ML approaches to distinguish between organic and synthetic text. GLTR (Giant Language Model Test Room) leveraged the statistical tendency of LMs to produce words with a higher probability of occurrence to help users differentiate between synthetic and organic text (Gehrmann et al., 2019). Grover used a purpose-built LM to train a classifier for synthetic and organic text (Zellers et al., 2019). Bakhtin et al. (2019) proposed energy-based models for differentiating between synthetic and organic text. Our work takes this line of research a step further by attributing synthetic text to its source LM.

Synthetic text attribution
Pan et al. (2020) proposed a dynamic-embedding-based approach to attribute synthetic text generated by a pre-trained LM as part of their broader investigation of sensitive information exposed by LMs. We significantly build on this work from both the methodological and application perspectives. Unlike this work, we use stylometric features as well as static and dynamic embeddings. We also consider more realistic threat models where the synthetic text is generated by either pre-trained or fine-tuned LMs and using different sampling parameters.

Organic text attribution
There is a rich body of literature on authorship attribution of organic text using stylometric features.
We discuss a few classic papers here. Mosteller and Wallace (1964) used word frequency analysis for authorship attribution. Abbasi and Chen (2008) proposed an ML-based approach for authorship attribution using an exhaustive stylometric feature set called Writeprints. While there has been impressive progress in stylometric organic text attribution (e.g., Narayanan et al., 2012; Ruder et al., 2016), these approaches do not work as effectively for synthetic text attribution. As our evaluation showed, Writeprints was significantly outperformed by other approaches for synthetic text attribution. This is because LMs are trained on large text corpora drawn from many different authors; thus, there are no clear-cut stylometric differences between synthetic texts generated by different LMs.

Synthetic image attribution
Recent advances in Generative Adversarial Networks (GANs) have led to impressive results in synthetic image generation (Bao et al., 2017; Taigman et al., 2017; Ma et al., 2017). For example, Chen et al. (2020) proposed image models, similar to pre-trained LMs, that learn an unsupervised representation of images for various downstream tasks. Most related to our work, Yu et al. (2019) proposed an ML approach to attribute synthetic images generated by GANs with different architectures and parameters. At the most basic level, the problem of synthetic image attribution differs from synthetic text attribution because images are smooth and local, whereas words in a text document may be correlated even if they are far apart (Sharir et al., 2020). For instance, Yu et al. (2019) showed that their ML classifier could use only part of a synthetic image for attribution. In contrast, we observed a large drop in accuracy when using only part of the input synthetic text.

Conclusion
In this paper, we presented an ML approach to attribute authorship of synthetic text to its source LM. Our results showed that an attributor based on fine-tuned XLNet embeddings outperformed other approaches based on stylometric features as well as static and dynamic embeddings. Our results also showed that there is significant room for improvement in distinguishing between synthetic texts generated by different fine-tuned variants of an LM. Further research is also needed for effective attribution of synthetic text generated by more diverse fine-tuned LMs in both closed-world and open-world settings. Finally, future research on synthetic text attribution should also consider more sophisticated LMs (e.g., GPT-3 with 175 billion parameters (Brown et al., 2020) and Google's trillion-parameter LM (Fedus et al., 2021)) when they are publicly released.