On the Zero-Shot Generalization of Machine-Generated Text Detectors

The rampant proliferation of large language models, fluent enough to generate text indistinguishable from human-written language, gives unprecedented importance to the detection of machine-generated text. This work is motivated by an important research question: How will detectors of machine-generated text perform on the outputs of a new generator that they were not trained on? We begin by collecting generation data from a wide range of LLMs, train neural detectors on data from each generator, and test their performance on held-out generators. While none of the detectors can generalize to all generators, we observe a consistent and interesting pattern: detectors trained on data from a medium-size LLM can zero-shot generalize to the larger version. As a concrete application, we demonstrate that robust detectors can be built on an ensemble of training data from medium-sized models.


Introduction
Thanks to large-scale pretraining and tuning with human feedback (Ouyang et al., 2022), large language models (LLMs) (Chung et al., 2022; Zhang et al., 2022; Touvron et al., 2023) are now able to follow instructions and generate realistic and consistent texts. A prominent example is the recently developed ChatGPT or GPT4 model (OpenAI, 2023), which, when instructed, can write documents, create executable code, or answer questions that require world knowledge. In many scenarios, machine-generated texts have high quality and cannot easily be distinguished from genuine human texts (Dugan et al., 2022; Gehrmann et al., 2019).
These trends give unprecedented importance to the detection of machine-generated text (Su et al., 2023; Jawahar et al., 2020; Pagnoni et al., 2022a). A lot of work has been devoted to proposing efficient detection models or algorithms (Mitchell et al., 2023; Kirchenbauer et al., 2023; Zellers et al., 2019). (Code and datasets will be available at https://github.com/SophiaPx/detectors-generalization.) However, in most studies, the detector is tested on the same generator model that it is trained or tuned on.
This study is motivated by an underexplored research question: How will the detector perform on a different generator that it is not trained on? This question is important for multiple reasons: (1) LLMs are becoming increasingly large and expensive. Some of the most recent models are either too large to fit into a common GPU (e.g., LLaMA-65B) or require payment from the user (OpenAI, 2023), making the collection of training samples difficult. (2) The number of released LLMs is growing rapidly. In a real application scenario, the detector needs to cover a wide range of LLMs (including the ones the detector is not trained on), instead of only one generator.
In this work, we collect generation data from a wide range of LLMs. We then train neural detectors on data from each generator and test their performance on other generators. Our primary findings include: (1) In many cases, detectors can zero-shot generalize to a held-out generator (Figure 1). In particular, we observe an interesting pattern that the detector for the medium version of an LLM can generalize to the larger version. (2) None of the detectors generalizes to all generators, implying that an ensemble of detectors/data is necessary for wider coverage. (3) As a concrete application, we demonstrate that robust detectors can be built on an ensemble of training data from medium-sized models; excluding the large versions only leads to a minor drop in performance.

Methodology
We begin by giving an overview of our experiment structure and establishing notation. This study includes the detection of a range of popular LLMs (detailed in §3), and we construct train/dev/test sets for each generator. In §4.1, we train neural detectors on data from each generator and test their performance on other generators. In §4.2, we further consider an ensemble setting, where the detector is trained on data composed of multiple generators, and test its generalization ability on held-out generators.
We denote the detector model trained on data from generator model M as D_M. Since we will test the accuracy of D_M on data from different generators, we use Acc_N(D_M) to denote the accuracy of D_M on the test set of generator N. Finally, we define Acc-Gap_N(D_M) to measure the drop in performance when the detector is trained on generator M instead of N itself: Acc-Gap_N(D_M) = Acc_N(D_N) - Acc_N(D_M). We expect Acc-Gap to be larger than zero in general, and a large Acc-Gap means D_M has poor generalization on generator N.
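The Acc-Gap metric defined above can be sketched directly; the accuracy values below are hypothetical placeholders for illustration, not results from the paper.

```python
# Illustrative sketch of the Acc-Gap metric; accuracies are made up.
def acc_gap(acc, detector, generator):
    """Acc-Gap_N(D_M) = Acc_N(D_N) - Acc_N(D_M).

    `acc[(M, N)]` holds the accuracy of detector D_M (trained on
    generator M) when evaluated on the test set of generator N.
    """
    return acc[(generator, generator)] - acc[(detector, generator)]

# Hypothetical accuracies for two generators.
acc = {
    ("LLaMA7B", "LLaMA7B"): 0.95,
    ("LLaMA7B", "LLaMA13B"): 0.93,
    ("LLaMA13B", "LLaMA13B"): 0.94,
}

# Drop when detecting LLaMA13B with the detector trained on LLaMA7B.
gap = acc_gap(acc, detector="LLaMA7B", generator="LLaMA13B")
```

A small positive gap like this would indicate good zero-shot generalization from the medium to the large version.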
For each dataset, we first randomly sample 5000 real-world human-written samples, with a train/dev/test split ratio of 8:1:1. For all samples, we take the first 20 tokens to serve as prompts and feed them into different generators for text continuation, yielding 5000 machine-generated samples. For generation, we apply nucleus sampling (Holtzman et al., 2020) with p = 0.96, following the setting in Pagnoni et al. (2022b). We truncate each sample so that its length is around 120 tokens. For all training or test sets in this work, we keep the ratio of human to machine text at 1:1.
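For readers unfamiliar with nucleus (top-p) sampling, the following is a minimal, self-contained sketch of the single-step decision it makes; the toy vocabulary and probabilities are made up, and real decoding would apply this to a model's next-token distribution at every step.

```python
import random

def nucleus_sample(probs, p=0.96, rng=random):
    """Sample a token index from the smallest set of highest-probability
    tokens whose cumulative probability reaches p (Holtzman et al., 2020)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # Renormalize within the nucleus and sample from it.
    weights = [probs[i] / total for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

# Toy next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
token = nucleus_sample(probs, p=0.96)
```

With p = 0.96 here, the lowest-probability tail token (index 4) is never sampled, which is the point of the method: it cuts off the unreliable tail of the distribution.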
Detectors For data from each generator, we train an ELECTRA-large model (Clark et al., 2020) as a binary classifier. The detectors were trained for 1 epoch with a learning rate of 5e-6 (training for more epochs only gives minimal improvement on the dev set). For the data-mix baseline and pruned models in §4.2, 3 epochs of training are used. We use the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.999. The average accuracies (when tested on the same generator a detector is trained on) of all detectors in the news, review, and knowledge domains are 94.1%, 96.2%, and 94.9%, respectively.
Experiment Results

On Generalization Ability of Detectors
As explained in §2, we compute Acc-Gap to reflect the generalization ability of detectors trained on each generator. Figure 1 depicts the Acc-Gap of each detector/generator pair. We link node M to node N if Acc-Gap_N(D_M) < T (good generalization), where the threshold T is set to a small number from {1%, 2%, 4%}. On the other hand, in Figure 3 (Appendix B), node M is linked to node N when Acc-Gap_N(D_M) > 20% (poor generalization). For statistical significance, we utilize bootstrapping (Koehn, 2004) and generate 100 virtual test sets by sampling with replacement from the original test set. We then conduct a one-sided t-test with a p-value threshold of 0.05.
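The bootstrap step above can be sketched as follows; this is an illustrative re-implementation, not the paper's code, and the per-example correctness indicators are synthetic placeholders.

```python
import random

def bootstrap_accuracies(correct, n_boot=100, seed=0):
    """Resample the test set with replacement n_boot times and return
    the accuracy of each virtual test set (Koehn, 2004).

    `correct` is a list of 0/1 per-example correctness indicators.
    """
    rng = random.Random(seed)
    accs = []
    for _ in range(n_boot):
        resample = rng.choices(correct, k=len(correct))
        accs.append(sum(resample) / len(resample))
    return accs

# Synthetic test set where the detector gets 90 of 100 examples right.
correct = [1] * 90 + [0] * 10
accs = bootstrap_accuracies(correct)
mean_acc = sum(accs) / len(accs)
```

The resulting distribution of accuracies is what the significance test is then run against, e.g. to decide whether an observed Acc-Gap is reliably below a threshold T.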
We observe two interesting patterns shared across the three datasets. First, the detectors for the medium-version LLMs can generalize to the large-version models. For example, D_LLaMA7B generalizes to LLaMA13B, and D_GPT3 generalizes to GPT4. This is somewhat surprising because generations from the large-version generator are commonly considered to have higher quality.
Interestingly, generalization in the reverse direction is weaker on RealNews and IMDBReview. As shown in Table 1, when attempting to generalize from the large-version models to medium ones using ELECTRA detectors, the generalization performance is slightly worse, reflected by a larger Acc-Gap. As for the reason, we conjecture that compared to the larger model, the medium generator makes a similar but wider range of artifacts in its generations, leading to smooth generalization to the detection of the larger model. We also experiment with additional base detectors, e.g., ALBERT Large v2 (Lan et al., 2019), and find that the key observation (that the detectors trained for the medium-size models can generalize to larger-size models) still holds. These results are omitted for brevity.
Second, Figure 3 (Appendix B) shows that none of the detectors, on its own, can generalize to all generators. In particular, GPT3 and GPT4 seem "isolated" from other families of generators. This result indicates that if we want a "universal" detector which can cover all generators, an ensemble of detectors/data is necessary. We explore this direction in the next section.

Pruning Out Large-Version LLMs in a Mixed Training Dataset
We now demonstrate a concrete application of our findings, considering the following realistic threat scenario: The task is still binary classification, but the machine text is composed of generations from a range of models (listed in §3). For simplicity, we use a uniform data ratio across the generators.
Following the results of the last section, an ensemble of detectors/data is necessary. We begin by comparing two baselines: (1) Model ensemble, where we aggregate predictions from all detectors by majority voting or confidence (probability) averaging; (2) Data mixing, where we train a new ELECTRA-large detector by mixing up the training data from all generators. For each baseline detector D, we report the average accuracy on all generators, and the worst-case accuracy, which is min_N Acc_N(D). Accuracy on the four largest generators is also reported. We conduct experiments on the RealNews and IMDBReview datasets, and the results for the baselines are shown in the left part of Table 2. The data-mix model outperforms the ensemble approach by a large margin. Therefore, we base our pruning experiments on the data-mix model.
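The two aggregation schemes in the model-ensemble baseline can be sketched as follows; the vote and probability values are illustrative inputs, not numbers from our experiments.

```python
# Illustrative sketches of the two ensemble aggregation schemes.
def majority_vote(predictions):
    """predictions: one 0/1 label per detector for a single example;
    returns 1 ("machine") if a strict majority of detectors say so."""
    return int(sum(predictions) > len(predictions) / 2)

def confidence_average(probs, threshold=0.5):
    """probs: each detector's predicted probability of machine text;
    returns 1 if the average probability exceeds the threshold."""
    return int(sum(probs) / len(probs) > threshold)

votes = [1, 0, 1]          # two of three detectors say "machine"
probs = [0.9, 0.2, 0.6]    # per-detector machine-text probabilities

label_vote = majority_vote(votes)       # majority says "machine"
label_avg = confidence_average(probs)   # mean 0.566... > 0.5
```

The data-mixing baseline, in contrast, has no aggregation step at test time: a single classifier is trained on the pooled data, which is one reason it is cheaper to deploy.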
Following insights from the last section, we then prune out data from the large-version language models (i.e., GPT4, GPT-Neo2.7B, LLaMA13B, and GPT-2xl) and train a detector by mixing up training data from the remaining generators. The degree of drop in worst-case accuracy reflects the zero-shot generalization ability of the proposed detector.
As also shown in Table 2, the accuracy of the proposed detectors (both average and worst-case) remains similar to, or only slightly decreases compared to, the data-mix baseline. Figure 2 provides detailed information on the changes in accuracy after pruning out the four large-version models. The accuracy of the proposed detector only experiences a slight decrease (<3%) for GPT4 and LLaMA13B. Our results show that, in the case of a limited budget or compute, data from the medium-version LM can decently approximate the large version in an ensembled data collection.
On the right part of Table 2, we conduct comparison experiments where both medium and large versions are pruned out. As expected, this results in worse performance on detection of the pruned generators, reflected by the worst-case accuracy.
In particular, the comparison experiment of pruning out both GPT3 and GPT4 is quite alarming: The detector trained on the combined data of all other generators only achieves accuracy around 42% (RealNews) or 62% (IMDBReview). This implies that if OpenAI had not given public access to generations of the two models, existing detectors would fail.

Related Work
We now discuss the literature most related to our work, and defer a more complete review to Appendix A. Pagnoni et al. (2022a) demonstrate the degraded performance of trained detectors under different threat scenarios, while their range of generator models is not as wide or up-to-date as ours. Liang et al. (2023) study the bias of detectors for LLMs in the case of non-native English writers. In a very recent and concurrent work, Mireshghallah et al. (2023) study the generalization of detectors under the DetectGPT (Mitchell et al., 2023) algorithm, which is also shown to be far from perfect. Compared to a trained detector, DetectGPT relies on access to the generator LLM, which might be expensive.

Conclusion and Discussion
In this work, we observe a generalization relationship among detectors trained on different generators in three domains, where detectors for medium-version models demonstrate the ability to effectively generalize to the larger version. Building upon this finding, we prune out data from large-version generators in an ensembled training dataset and demonstrate that the performance loss is minimal. Our results indicate that practitioners with limited budget or computing resources can use data from medium-size LLMs as a good approximation for the large version.
With the rapid release of various LLMs and generation APIs, a detector needs to cover a wide range of generators.While our work makes some initial progress, our experiments show that the detection of an unseen (or non-public) generator is still a difficult and open question.We hope our work could motivate more research devoted to this important direction.

Limitations
Our work focuses on supervised detector models, and there are other approaches for machine-generated text detection (Appendix A). In a very recent and concurrent work, Mireshghallah et al. (2023) study the generalization of detectors under the DetectGPT (Mitchell et al., 2023) algorithm, which is also shown to be far from perfect. Compared to a trained detector, DetectGPT relies on access to the generator LLM, which might be expensive. It would also be interesting to base the detector on a larger LM than ELECTRA-large, but we surmise the observations would be similar.
The zero-shot generalization ability of detectors shown in this work implies that different generators produce similar artifacts, on which the detectors base their decisions. As future work, it would be interesting to examine the salient features (Zeiler and Fergus, 2014) and compare machine- and human-generated text.
Finally, our experiments show that the detection of an unseen or non-public generator is still a difficult and open question. For example, the combination of data from all other generators cannot generalize to GPT3 and GPT4. This important research direction deserves more research effort.

Ethics Statement
The detection of machine-generated text has important applications such as detecting fake news and fake reviews on the internet.However, it could also introduce new risks: Malicious parties can use released detectors to develop text generation systems that evade existing detectors in an adversarial manner.Our experiments show that the detection of an unseen (or non-public) generator is still a difficult and open question, and we hope our work could motivate more research devoted to this important direction.

A Related Work
Research on detecting machine-generated text can be roughly divided into two categories: supervised training and zero-shot detection. (To clarify, in the literature "zero-shot" usually means that the approach does not require training data, while our work focuses on zero-shot generalization to the detection of a held-out generator.)
Among supervised methods, Bakhtin et al. (2019) train an energy-based model to identify machine-generated text. Zellers et al. (2020) train a GROVER detector and find that models exhibiting superior performance in generating neural disinformation are also highly effective in detecting their own generated content. Both Solaiman et al. (2019) and Ippolito et al. (2020) propose zero-shot approaches to detect machine-generated text and evaluate the capability of pretrained models. Liu et al. (2022) present a coherence-based contrastive learning model to detect machine-generated text under low-resource scenarios. Kirchenbauer et al. (2023) propose a watermarking method (Abdelnabi and Fritz, 2021) which introduces designed noise that is imperceptible to human readers. Mitchell et al. (2023) propose DetectGPT, a zero-shot method that utilizes a novel curvature-based criterion to determine whether a text is generated by a specific model. This approach has demonstrated superior detection capabilities compared to other existing zero-shot methods. While DetectGPT does not require training a separate detector, it relies on access to the generator LLM, which can be costly. Recently, Su et al. (2023) follow up on DetectGPT and introduce two new zero-shot methods: DetectLLM-LRR and DetectLLM-NPR.

B Auxiliary Results
In Figure 3, we plot detector-generator pairs with a large (>20%) Acc-Gap on the three datasets. It shows that none of the detectors is able to generalize to all generators. For example, all detectors except D_GPT4 have a large accuracy gap on GPT3.
In Figures 4, 5, and 6, we give detailed heatmaps of Acc-Gap for every detector/generator pair on the three datasets. The reported numbers are the averages of Acc-Gap obtained by bootstrapping 100 times.

Figure 1: The generalization ability of detectors when applied to different generators, measured by Acc-Gap (defined in §2). Detectors for a medium-size generator can zero-shot generalize to the larger-version model (highlighted by dotted green).

Figure 2: Accuracy comparison on each generator before and after pruning out four large-version LLMs on the RealNews dataset.

Figure 4: Heatmap of Acc-Gap for detector/generator pairs on the RealNews dataset.

Figure 6: Heatmap of Acc-Gap for detector/generator pairs on the Wikipedia dataset.

Table 2: Accuracy of the baseline detectors and detectors trained on pruned data. "L13B/7B" refers to the LLaMA 13B/7B generator. We highlight the results for the data-mix model because it serves as the base for the pruned models. Pruning out the large-version LLMs induces only minimal accuracy loss.