DetectLLM: Leveraging Log Rank Information for Zero-Shot Detection of Machine-Generated Text

With the rapid progress of large language models (LLMs) and the huge amount of text they generate, it has become impractical to distinguish machine-generated text manually. Given the growing use of LLMs in social media and education, we are prompted to develop methods for detecting machine-generated text, to prevent malicious uses such as plagiarism, misinformation, and propaganda. Previous work has studied several zero-shot methods, which require no training data. These methods achieve good performance, but there is still much room for improvement. In this paper, we introduce two novel zero-shot methods for detecting machine-generated text by leveraging log rank information. One, DetectLLM-LRR, is fast and efficient; the other, DetectLLM-NPR, is more accurate but slower due to its need for perturbations. Our experiments on three datasets and seven language models show that our proposed methods improve over the state of the art by 3.9 and 1.75 AUROC points absolute. Moreover, DetectLLM-NPR needs fewer perturbations than previous work to achieve the same level of performance, which makes it more practical for real-world use. We also investigate the efficiency-performance trade-off based on users' preferences for these two measures, and we provide intuition for using the methods effectively in practice. We release the data and the code for both methods at https://github.com/mbzuai-nlp/DetectLLM


Introduction
Large language models (LLMs) have made rapid advancements in recent years, and are now able to generate text with significantly improved diversity, fluency, and quality. Models such as ChatGPT (OpenAI, 2022), GPT-3 (Brown et al., 2020), LLaMA (Touvron et al., 2023) and BLOOM (Scao et al., 2022) demonstrate exceptional performance in answering questions (Robinson et al., 2022) and writing stories (Fan et al., 2018; Yuan et al., 2022), thus facilitating daily life and improving work efficiency. However, LLMs can also be misused for generating plagiarized text, misinformation, and propaganda, which can lead to negative consequences (Zhuo et al., 2023). For instance, students might use LLMs to write assignments (Rosenblatt, 2023), making fair evaluation difficult for teachers and, in the long run, undermining the integrity of the entire education system. Malicious actors might generate fake news articles to spread misinformation and propaganda or to manipulate public opinion, which is dangerous, especially when it comes to politics (Floridi and Chiriatti, 2020; Stokel-Walker, 2022).
With the proliferation of LLMs and the increasing amount of text they produce, it is challenging for humans to accurately identify machine-generated texts (Gehrmann et al., 2019). Moreover, it is unrealistic to hire humans to manually identify machine-generated text at scale, due to the prohibitively high costs and the efficiency requirements of real-time applications, e.g., in social media. Thus, it is essential to develop tools and strategies to automatically identify machine-generated text and to mitigate the potential negative impact of LLMs.
The problem of distinguishing machine-generated from human-written text is commonly formulated as a binary classification task (Jawahar et al., 2020). Most previous work has focused on the black-box scenario, where the detector has access only to the output of the LLM and cannot make use of its internals. Such methods lack flexibility, since they need to be retrained from scratch to recognize the output of a new LLM (Mitchell et al., 2023). Given the speed at which new LLMs are developed, black-box methods are becoming more and more expensive and impractical. In cases where access to the LLM is via an API only, one possibility is for the LLM owner to record all content the model has generated, or to watermark all texts it generates (Kirchenbauer et al., 2023; Zhao et al., 2023). However, such solutions are not feasible for third parties.

We therefore consider a white-box setting, where the detector has full access to the LLM. We focus on zero-shot methods, where we use the LLM without additional training. Generally speaking, zero-shot methods use the source LLM to extract statistics, such as the average per-token log probability or the average rank of each token in the ranked list of possible choices, and make a prediction by comparing the statistic to a threshold (Solaiman et al., 2019; Ippolito et al., 2019; Gehrmann et al., 2019; Mitchell et al., 2023). Based on whether the queried statistics concern only the target text, we can roughly categorize these methods as perturbation-free and perturbation-based. Perturbation-free methods query the LLM only about statistics of the target text x, while perturbation-based methods, such as that of Mitchell et al. (2023), also query the statistics of additional perturbed texts, which achieves better performance but is 50-100 times more costly than perturbation-free methods. Thus, there exists a trade-off between performance and efficiency among zero-shot methods.
To bridge the gap between these two categories and to design zero-shot methods with a better performance-efficiency balance, we should either improve the accuracy of perturbation-free methods or reduce the cost of perturbation-based ones. Thus, we propose two novel zero-shot methods: one perturbation-free, but more accurate than previous methods, and one perturbation-based, but with better efficiency.
• We propose two novel zero-shot approaches based on Log-Rank statistics, which improve over the state of the art. On average, these methods improve upon the previous best zero-shot methods by 3.9 and 1.75 AUROC points absolute.
• We investigate the efficacy of existing zero-shot methods and explore their limits as the size of the LLMs increases from 1.5 to 20 billion parameters.
• We conduct comprehensive experiments to better understand the efficiency-performance trade-offs of zero-shot methods, thereby providing insights on how to choose among different categories of zero-shot methods based on users' preference for performance or efficiency.

Related Work
The detection of machine-generated text is commonly formulated as a classification task (Jawahar et al., 2020; Fagni et al., 2021; Bakhtin et al., 2019; Sadasivan et al., 2023; Wang et al., 2023). One way of solving it is to use supervised learning, where a classification model is trained on a dataset containing both machine-generated and human-written texts. For example, the GPT-2 Detector (Solaiman et al., 2019) fine-tunes RoBERTa (Liu et al., 2019) on the output of GPT-2, while the ChatGPT Detector (Guo et al., 2023) fine-tunes RoBERTa on the HC3 dataset (Guo et al., 2023). However, models trained explicitly to detect machine-generated texts may overfit to the domains of their training distribution (Bakhtin et al., 2019; Uchendu et al., 2020).
Another stream of work attempts to distinguish machine-generated from human-written texts based on statistical irregularities in the entropy (Lavergne et al., 2008), the perplexity (Beresneva, 2016), or the n-gram frequencies (Badaskar et al., 2008). Gehrmann et al. (2019) introduced hand-crafted statistical features to assist humans in detecting machine-generated texts. Moreover, Solaiman et al. (2019) proposed methods for detecting machine-generated text by evaluating the per-token log probability of texts and applying a threshold. Mitchell et al. (2023) observed that machine-generated texts tend to lie in regions of negative curvature of the model's log probability function and proposed DetectGPT, whose strong performance can only be guaranteed with a large perturbation function and a large number of perturbations, and which thus costs more computational resources. Other work explored watermarking, which imprints specific patterns into the LLM-output text that can be detected by an algorithm while being imperceptible to humans. Grinbaum and Adomaitis (2022) and Abdelnabi and Fritz (2021) watermarked machine-generated text using syntax tree manipulation, while Kirchenbauer et al. (2023) required access to the LLM's logits at each time step.

Improved Zero-Shot Approaches by Leveraging Log-Rank Information

In this section, we introduce the Log-Likelihood Log-Rank Ratio (LRR) and the Normalized Perturbed log-Rank (NPR). LRR combines the Log-Rank and the Log-Likelihood, as they provide complementary information about the text. NPR uses the idea that the Log-Rank of machine-generated texts should be more sensitive to small perturbations.

Log-Likelihood Log-Rank Ratio (LRR)
We define the Log-Likelihood Log-Rank Ratio of a text x = (x_1, …, x_t) as

LRR(x) = − Σ_{i=1}^{t} log p_θ(x_i | x_<i) / Σ_{i=1}^{t} log r_θ(x_i | x_<i),

where r_θ(x_i | x_<i) ≥ 1 is the rank of token x_i conditioned on the previous tokens. The Log-Likelihood in the numerator represents the absolute confidence for the correct token, while the Log-Rank in the denominator accounts for the relative confidence, which reveals complementary information about the text. As illustrated in Figure 1, LRR is generally larger for machine-generated text, which can be used for distinguishing machine-generated from human-written text. One plausible reason is that for machine-generated text, the Log-Rank is more discernible than the Log-Likelihood, and LRR captures this pattern. In Sections 4 and 6, we experimentally show that LRR is a better discriminator than either the Log-Likelihood or the Log-Rank alone. We call the zero-shot method that uses LRR as a detection feature DetectLLM-LRR, and use the abbreviation LRR in the rest of the paper.
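As a concrete illustration, here is a minimal sketch of the LRR computation, assuming the per-token log probabilities and ranks have already been extracted from the source LLM (function and variable names are ours, not from the paper):

```python
import numpy as np

def lrr(token_log_probs, token_ranks):
    """Log-Likelihood Log-Rank Ratio of a text.

    token_log_probs: log p(x_i | x_<i) for each token (negative values).
    token_ranks: rank r(x_i | x_<i) of each observed token (>= 1).
    """
    log_probs = np.asarray(token_log_probs, dtype=float)
    log_ranks = np.log(np.asarray(token_ranks, dtype=float))
    # numerator: negated sum of log-likelihoods (absolute confidence);
    # denominator: sum of log-ranks (relative confidence)
    return -log_probs.sum() / log_ranks.sum()
```

A text would then be flagged as machine-generated when its LRR exceeds a threshold chosen on validation data.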

Normalized Log-Rank Perturbation (NPR)
We define the Normalized Perturbed log-Rank of a text x as

NPR(x) = ( (1/p) Σ_{k=1}^{p} log r_θ(x̃_k) ) / log r_θ(x),

where log r_θ(·) denotes the average per-token Log-Rank of a text under the source model, and small perturbations are applied to the target text x to produce the perturbed texts x̃_1, …, x̃_p. Here, a perturbation means a minor rewrite, such as replacing some of the words. We call the zero-shot method that uses NPR as a detection feature DetectLLM-NPR, and use the abbreviation NPR in the rest of the paper.
The motivation for NPR is that both machine-generated and human-written texts are negatively affected by small perturbations, i.e., the Log-Rank score increases after perturbation, but machine-generated text is more susceptible to perturbations, and thus its Log-Rank score increases more, which suggests a higher NPR score for machine-generated texts. As shown in Figure 1, NPR can be a discernible signal for distinguishing machine-generated from human-written text. DetectGPT (Mitchell et al., 2023) uses a similar idea, but experimentally, we find NPR to be more efficient and to perform better. Details and comparisons are given in Section 4.
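The ratio above can be sketched as follows, assuming the per-token ranks of the target text and of each perturbed text have already been extracted from the source LLM (all names are illustrative, not from the paper):

```python
import numpy as np

def avg_log_rank(token_ranks):
    """Average per-token log rank of a text under the source LLM."""
    return np.log(np.asarray(token_ranks, dtype=float)).mean()

def npr(target_ranks, perturbed_ranks_list):
    """Normalized Perturbed log-Rank: mean perturbed score over target score."""
    perturbed = np.mean([avg_log_rank(r) for r in perturbed_ranks_list])
    return perturbed / avg_log_rank(target_ranks)
```

Machine-generated texts would tend to yield NPR scores further above 1, since perturbation degrades their Log-Rank more.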

Experimental Setup
In this section, we describe our comprehensive experiments evaluating the performance of LRR and NPR in comparison to several methods previously proposed in the literature. We experiment with LLM sizes varying from 1.5B to 20B parameters, probing the boundary of zero-shot methods as LLMs continue to grow in size. We further study the impact of the perturbation function, the number of perturbations (especially for NPR and DetectGPT), the decoding strategy, and the temperature.
Data. Following Mitchell et al. (2023), we use three datasets: XSum (Narayan et al., 2018), SQuAD (Rajpurkar et al., 2016), and WritingPrompts (Fan et al., 2018), containing news articles, Wikipedia paragraphs, and prompted stories, respectively, as human-written texts, and we produce machine-generated texts using LLMs. These datasets were chosen to represent areas where LLMs could have a negative impact. For each experiment, we evaluate 300 pairs of machine-generated and human-written texts, obtained by prompting the LLMs with the first 30 tokens of the human-written text. We release the code for this.
Evaluation Measure. Following previous work (Mitchell et al., 2023; He et al., 2023; Krishna et al., 2023), we use the area under the receiver operating characteristic curve (AUROC), which is the probability that a classifier correctly ranks a machine-generated example higher than a human-written one. Since for zero-shot methods the detection rate depends heavily on the threshold applied to the discriminative statistic, AUROC, which considers the range of all possible thresholds, is commonly used to measure zero-shot detector performance (Krishna et al., 2023).
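This probabilistic reading of AUROC can be computed directly by pairwise comparison (a minimal sketch with illustrative names; in practice a library routine such as scikit-learn's `roc_auc_score` would be used):

```python
def auroc(mg_scores, hg_scores):
    """Probability that a machine-generated example is ranked above a
    human-written one; ties count as 0.5."""
    wins = sum(1.0 if m > h else 0.5 if m == h else 0.0
               for m in mg_scores for h in hg_scores)
    return wins / (len(mg_scores) * len(hg_scores))
```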

Methods
Zero-Shot Methods. We compare the following:
• log p(x): a passage with a high average log probability is more likely to have been generated by the target LLM;
• Rank: a passage with a higher average rank is more likely to have been generated by the target LLM;
• Log-Rank: a passage with a higher average observed Log-Rank is more likely to have been generated by the target LLM;
• Entropy: machine-generated text has higher entropy;
• DetectGPT: machine-generated text has more negative log probability curvature.
More details and exact definitions of these methods can be found in Appendix A.
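The per-token quantities behind these baselines can be sketched as follows, computing log probability, rank, and entropy from a row of vocabulary logits at each position (a simplified illustration with our own names; a real implementation would batch this through the LLM):

```python
import math

def token_stats(logit_rows, token_ids):
    """Average per-token log probability, log rank, and entropy of a text.

    logit_rows: one list of vocabulary logits per position.
    token_ids: the observed token at each position.
    """
    log_probs, log_ranks, entropies = [], [], []
    for row, tok in zip(logit_rows, token_ids):
        m = max(row)
        exps = [math.exp(v - m) for v in row]  # numerically stable softmax
        z = sum(exps)
        probs = [e / z for e in exps]
        log_probs.append(math.log(probs[tok]))
        rank = 1 + sum(1 for p in probs if p > probs[tok])
        log_ranks.append(math.log(rank))
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    n = len(token_ids)
    return {"log_p": sum(log_probs) / n,
            "log_rank": sum(log_ranks) / n,
            "entropy": sum(entropies) / n}
```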
These zero-shot baselines, along with our newly proposed LRR and NPR, can be categorized as:
• Perturbation-free: log p(x), Rank, Log-Rank, Entropy, LRR. These methods only query the LLM for statistics about the target text x.
• Perturbation-based: DetectGPT and NPR. These methods query the LLM not only about the target text x, but also about perturbed versions thereof, x̃_1, …, x̃_p.
As perturbation-based methods perform better (but are also more time-consuming), for a fair comparison, we compare methods within their own group.

Supervised Methods
We also experiment with two supervised detectors: RoBERTa-base and RoBERTa-large.As these are not central to our narrative, we put the results in Appendix B.
Experimental Details. For the perturbation-based methods (DetectGPT and NPR), we use T5-3B as the perturbation function and we perturb the input text 50 times in all experiments, unless specified otherwise. For all zero-shot methods, we use sampling with a temperature of 1, unless specified otherwise. More details are given in Appendix A.

Zero-Shot Results
Table 1 shows a comparison of the five baseline zero-shot approaches to our proposed LRR and NPR, grouped as perturbation-based and perturbation-free.
We can see that among the perturbation-based methods, NPR consistently outperforms DetectGPT on all datasets and LLMs, except in one case, with average improvements of 0.90, 2.03, and 2.32 AUROC points absolute on XSum, SQuAD, and WritingPrompts, respectively (using the same perturbation function and the same number of perturbations). Among the perturbation-free methods, on average, our method achieves the best performance, improving by 2.15, 8.27, and 1.28 AUROC points absolute over the second-best perturbation-free method (i.e., Log-Rank) on XSum, SQuAD, and WritingPrompts, respectively. Moreover, we find that in some cases LRR can even outperform the perturbation-based methods, e.g., on SQuAD, LRR outperforms DetectGPT by 4.23 AUROC points absolute and NPR by 2.20 AUROC points.

Comparing DetectGPT to NPR
Equipped with a large perturbation function and an adequate number of perturbations, perturbation-based methods generally outperform perturbation-free ones, e.g., when using T5-3b as the perturbation function and perturbing 50 times, as in Table 1. In practice, due to time and resource constraints, not all users can afford such models and large numbers of perturbations. Thus, it is important to investigate how NPR and DetectGPT behave with a smaller perturbation function and fewer perturbations.
Different Number of Perturbations. Figure 2 shows the performance of DetectGPT and NPR, averaged across models, for a varying number of perturbations. We can see that NPR consistently performs better than DetectGPT when using the same number of perturbations. In other words, NPR can achieve comparable or better performance with significantly fewer perturbations. For example, on the SQuAD and WritingPrompts datasets, NPR achieves 85 and 95 AUROC points, respectively, using approximately 10 perturbations, while DetectGPT requires around 100 perturbations, which highlights the effectiveness and efficiency of NPR. More complete results for each dataset and model can be found in Figures 6 and 7 of Appendix C.
Different Perturbation Functions. In Table 3, we compare NPR to DetectGPT using a smaller perturbation model, T5-large, with results averaged over six LLMs and three datasets. We found that replacing T5-3b with a smaller model harms the performance of both NPR and DetectGPT, and the degradation cannot be mitigated by increasing the number of perturbations. For both NPR and DetectGPT, the average performance with 100 perturbations using T5-large is still worse than with 10 perturbations using T5-3b (emphasized with the grey box in Table 3). Moreover, NPR is less affected by the reduced perturbation function size: when replacing T5-3b with T5-large, the performance degradation averaged over 10, 20, 50, and 100 perturbations is 4.40 points for NPR, much smaller than the 8.06 points for DetectGPT. The complete results on six LLMs and three datasets can be found in Figure 8 of Appendix C.

Different Decoding Strategy and Temperature
Alternative Decoding Strategies. In line with prior work (Pagnoni et al., 2022), we experimented with top-k sampling (Fan et al., 2018) and top-p sampling (Holtzman et al., 2019).

Different Temperature. Temperature controls the degree of randomness of the generation process: increasing the temperature leads to more randomness and creativity, while reducing it leads to more conservative and less novel output. In practice, people adjust the temperature for their specific purposes. For example, students might set a high temperature to encourage more original and diverse output when writing a creative essay, whereas fake news producers might set lower temperatures to generate seemingly convincing news articles for their deceptive purposes. Based on our experiments in Table 4, we found that Log-Likelihood (log p), Log-Rank, and LRR are highly sensitive to the temperature and can obtain even better results than perturbation-based methods when the temperature is relatively low.
In addition, the performance improvement of the Rank method with increasing temperature is negligible compared to Log-Likelihood, Log-Rank, and LRR, while the performance of the Entropy method appears to be positively correlated with the temperature. We conjecture that the abnormal behaviour of the Entropy method might be due to the assumption that "machine-generated text has higher entropy" (Mitchell et al., 2023), which, from our experiments, does not hold at high temperatures.
As for the perturbation-based methods, the impact of temperature is not as clear as for the perturbation-free methods. In general, the results suggest that the temperature has only minor effects on DetectGPT, while lowering it improves the performance of NPR. Another observation is that the perturbation-free methods perform better than the perturbation-based ones at low temperatures: for example, when the temperature is smaller than 0.95, perturbation-free methods obtain better detection accuracy while also being more efficient.

Analysis of the Efficiency
Though in Table 1 the perturbation-based methods appear to be significantly better than the perturbation-free ones, it is important to note that their superior performance can only be achieved with large perturbation functions and a large number of perturbations, which leads to an intensive demand for computational resources and longer computation time. Thus, while performance is an important factor, it is crucial to consider the efficiency of these zero-shot methods as well.

Computational Cost Analysis
To get an idea of how costly the different zero-shot methods are to achieve their performance in Table 1, we estimated the computational time (per sample) for each zero-shot method in Table 5. The time is estimated as an average over 10 samples.
For the perturbation-based methods, since the time depends on the perturbation function and on the number of perturbations, we used T5-3b as the perturbation function with 50 perturbations, as this is the setting used for the main results in Table 1; we want to provide an idea of how much more it costs for a perturbation-based method to achieve its exceptional performance in Table 1.

Composition of the Computational Time. In general, for perturbation-free zero-shot methods, the computational time depends only on the size of the LLM and on the complexity of the statistic. LRR is twice as complex as simple statistics such as Log-Rank and Log-Likelihood, so it takes approximately twice as long to compute. As for the LLM size, larger models intuitively take more time to compute, which can also be observed in Table 5.
The additional computational time of the perturbation-based methods is twofold: (1) the total time for perturbation, which depends on the perturbation function we use and on the number of perturbations; and (2) the total time for calculating the statistics of the perturbed texts, which depends on the number of perturbations, the size of the LLM, and the complexity of the statistic. To reduce the computational time of a perturbation-based method, we can either choose a smaller perturbation function or reduce the number of perturbations.
Formula for Estimating the Computational Time. Let t_p be the time to perturb one sample, t_m be the time to calculate a simple statistic (such as the Log-Likelihood) of one sample for a particular LLM, and n be the number of perturbations. The computational time for Log-Likelihood, Rank, Log-Rank, and Entropy is approximately t_m, the estimated time for LRR is 2·t_m, while the estimated computational time for the perturbation-based methods is n·t_p + (n+1)·t_m. The estimated values of t_p and t_m are given in Table 6, which can help us estimate the total running time (in seconds) of the different zero-shot methods.
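These estimates translate directly into a small helper (the names are illustrative; the timing constants t_p and t_m would come from measurements such as those in Table 6):

```python
def estimated_time(method, t_m, t_p=0.0, n=0):
    """Estimated per-sample running time (seconds) of a zero-shot method.

    t_m: time for one simple statistic on the source LLM;
    t_p: time for one perturbation; n: number of perturbations.
    """
    if method in ("log_p", "rank", "log_rank", "entropy"):
        return t_m                       # one pass over the target text
    if method == "lrr":
        return 2 * t_m                   # combines two simple statistics
    if method in ("detectgpt", "npr"):
        return n * t_p + (n + 1) * t_m   # perturb n times, score n + 1 texts
    raise ValueError(f"unknown method: {method}")
```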

Balancing Efficiency and Performance
In this subsection, we report additional experiments on LRR (the best perturbation-free method) and NPR (the best perturbation-based method, more time-consuming than LRR but with better performance) to provide users with some intuition for setting the parameters of NPR and for choosing between these two methods according to their preference for efficiency versus performance. First, we study the perturbation function used for NPR. Unlike Section 5.2, where the focus was to demonstrate the superior performance of NPR compared to DetectGPT, here we focus mainly on the efficiency-performance trade-off and provide some intuition for choosing the perturbation function.
T5-small and T5-base are not good candidates for perturbation functions. T5-small and T5-base are two to three times faster than larger models such as T5-large (as shown in Table 6), so one might wonder whether it is possible to trade the saved time for more perturbations and thus better performance. We give a negative answer. We observe in Figure 3 that NPR with T5-base or T5-small performs worse than LRR even with 50 to 100 perturbations, which suggests that LRR can be at least 50 to 100 times faster while still outperforming these perturbation-based configurations. So, if a user can only afford T5-small or T5-base as a perturbation function, they should choose LRR without hesitation, since it achieves both better efficiency and better performance.
Cost-Effectiveness of More Perturbations and a Larger Perturbation Function. In Figure 4, we compare the effectiveness of LRR to that of NPR with T5-large and T5-3b as perturbation functions, from which we find the following: (1) T5-3b has a higher performance upper limit than T5-large. So, if resources allow (enough memory and adequate perturbation time), T5-3b is the better choice, especially for users who prioritize performance. (2) To achieve the same performance as LRR, we generally need fewer than 10 perturbations with T5-3b as the perturbation function. This estimate can help us choose between NPR and LRR on a validation set: setting the number of perturbations to 10, if LRR outperforms NPR, we suggest using LRR; otherwise, NPR is the better option. (3) To achieve the same performance, T5-large requires more than twice as many perturbations as T5-3b, while the time per perturbation with T5-3b is less than twice that with T5-large, so using a large perturbation function such as T5-3b is more efficient than using a smaller one such as T5-large. The only concern is memory. In summary, we suggest using the larger perturbation function if memory permits, which is more cost-effective: it is less time-consuming for the same performance and has a higher performance upper limit. Moreover, setting the number of perturbations to 10 is a good threshold on the validation set for deciding whether to use NPR or LRR.

Conclusion
In this paper, we proposed two simple but effective zero-shot methods for machine-generated text detection that leverage Log-Rank information. The proposed methods, LRR and NPR, achieve state-of-the-art performance within their respective categories. In addition, we explored different settings, such as decoding strategies and temperatures, as well as different perturbation functions and numbers of perturbations, to better understand the advantages and disadvantages of different zero-shot methods. Finally, we analyzed the computational costs of these methods and provided guidance on balancing efficiency and performance.

Limitations
One limitation of zero-shot methods is the white-box assumption that we can obtain certain statistics from the source model. This raises two problems: for closed-source models (such as GPT-3), these statistics might not be provided; moreover, in practice, the detector might have to run the model locally to obtain the statistics, which requires that the detector have enough resources to use the LLM for inference. Given these limitations, we consider weakly supervised learning (Ratner et al., 2017) to be an important direction for future work. Though many papers on detecting machine-generated text assume that the source LLM is known, in reality it might not be, so it is worth combining weakly supervised learning with weak supervision sources (other LLMs at hand, which might not be the target LLM) to weakly train a classifier. With the flexibility of the weak supervision sources, the limitations of our work could possibly be addressed: (1) since the weak supervision sources do not have to come from the target model, there is no need to assume that the target LLM is known; (2) since the weak supervision sources are classifiers, we can use only statistics that are within reach, or even statistics from other open-source LLMs; (3) the weak supervision sources can be smaller LLMs rather than the target LLM, which relaxes the requirement of running an extremely large LLM locally.
In addition, our conclusions rely heavily on our English-centric experiments. It is worth noting that the detection of machine-generated text in other languages is also important, especially for low-resource languages. We encourage future studies to investigate zero-shot detection in multilingual settings.

Ethics and Broader Impact
Although our paper focuses on malicious uses of LLMs, such as spreading misinformation and propaganda or dishonesty in the education system, it is necessary to recognize that LLMs also have a wide range of potential benefits, and we would like to point out that most people use LLMs for good purposes, such as improving their work efficiency.
Moreover, even though our detectors achieve high AUROC scores, it should be recognized that every machine-generated text detector, including ours, has its limitations and can make mistakes; we cannot guarantee 100% accuracy for every sample. As such, when deciding whether a text, such as a student's essay, was written by a human or a machine, our results are for reference only and should not be used as concrete evidence for punishment. For ethical reasons, our detector should only assist humans in making decisions, rather than making decisions for them; thus, we recommend that users take these results as one of many pieces in a holistic assessment of texts.

B Supervised Methods
Main results for supervised methods. Comparing Table 1 with Table 7, we found that, on average, our best zero-shot method (LRR on the SQuAD dataset, NPR on XSum and WritingPrompts) can exceed a supervised model fine-tuned from RoBERTa-base. For the larger RoBERTa-large model, only on the WritingPrompts dataset do the perturbation-based methods DetectGPT and NPR outperform it, by margins of 0.55% and 2.87%, respectively.
Supervised Methods with Different Decoding Strategies. We experimented with the four LLMs used in the zero-shot experiments under top-p and top-k decoding for the supervised methods, and found that top-p decoding performs better than top-k (see Table 8). Compared to the supervised methods, the best zero-shot method, NPR, outperforms the RoBERTa-base model while being comparable to RoBERTa-large.
Supervised Methods with Different Temperature. Supervised methods also perform better at lower temperatures, but zero-shot methods such as Log-Rank and Log-Likelihood might exceed supervised methods at low temperatures. Moreover, we found that the performance gap between RoBERTa-base and RoBERTa-large narrows at lower temperatures. The results are shown in Figure 5.

C Comparing NPR and DetectGPT
Different Number of Perturbations. The results for models with at most 13B parameters are shown in Figure 6. For the NeoX-20b model, we did not have enough computational resources to perform 100 perturbations, so we show it separately in Figure 7 with 1, 10, 20, and 50 perturbations. On the XSum dataset, NPR and DetectGPT almost converge at 100 perturbations, but on the SQuAD and WritingPrompts datasets, NPR still outperforms DetectGPT even with 100 perturbations. On the SQuAD dataset with the Llama-13b model, DetectGPT behaves abnormally, while NPR improves stably as the number of perturbations increases. In addition, on nearly all datasets and models, NPR outperforms DetectGPT, except for GPT-j on the XSum dataset, demonstrating the effectiveness of NPR compared to DetectGPT.
Using T5-large as Perturbation Function. We illustrate the performance of NPR and DetectGPT in Figure 8 for different combinations of datasets and LLMs, using T5-large as the perturbation function.
Compared to T5-3b (Figure 6), the superiority of NPR over DetectGPT becomes more distinct with T5-large as the perturbation function: for almost all LLMs, datasets, and numbers of perturbations (except with Llama-13b on SQuAD), NPR outperforms DetectGPT by a large margin. In addition, we can also observe that NPR achieves comparable or even better results with only 10 perturbations than DetectGPT does with 100 perturbations, which indicates that NPR is more efficient and can achieve a similar level of performance with significantly fewer perturbations.

D Alternative Sampling Strategies and Temperature
Different Sampling Strategies. In Table 9, we show the complete results for the different zero-shot methods with four LLMs using top-p and top-k sampling. For the perturbation-based methods, even with different sampling strategies, NPR provides a clearer signal for machine-generated text detection than DetectGPT. Moreover, we find that LRR is more stable than the Log-Rank and Log-Likelihood methods: when replacing temperature sampling with top-p or top-k sampling, the performance of all three methods improves; however, LRR improves by approximately the same amount for both top-k and top-p sampling, while the other two favour top-p sampling.
Different Temperature. Here, we investigate how the temperature used for generating the machine-generated texts affects the detection accuracy of the different zero-shot methods. From Figure 9, we find that all the perturbation-free zero-shot methods improve their performance as the temperature decreases. In particular, for the Log-Rank and Log-Likelihood methods, the performance can become extremely high as the temperature drops, even exceeding NPR and approaching 100 AUROC points. For example, on Neo-2.7 and OPT-13 with temperature 0.5, the log p and Log-Rank methods achieve an accuracy of 100 points on the WritingPrompts dataset; this strong performance can be observed notably in smaller models at relatively high temperatures (such as GPT-2-xl and Neo-2.7 at 0.7) and in larger models at relatively low temperatures (such as OPT-13 at 0.5), as demonstrated in Figure 9. We omit the Entropy method from the figure because it obtains an accuracy worse than random guessing. One observation from our experiments is that, under the assumption "machine-generated text has higher entropy" suggested by Mitchell et al. (2023), the performance of the Entropy method improves with increasing temperature while staying below 50 points absolute, which suggests that at low temperatures we should instead use the assumption "machine-generated text has lower entropy" for detecting machine-generated text. In general, the Entropy method performs worse than random and is not a usable detection method.
For perturbation-based methods (Figure 10), DetectGPT does not exhibit a clear trend with respect to temperature, while the performance of NPR improves as the temperature decreases most of the time. However, this trend is less pronounced than for the Log-Rank and the Log-Likelihood methods, especially when the temperature becomes very low. This behaviour suggests that perturbation-based methods are better suited for high temperatures, while perturbation-free methods are better suited for low temperatures.
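The effect of temperature on the perturbation-free statistics can be seen directly in the sampling distribution: dividing the logits by T < 1 sharpens the distribution, so sampled tokens have higher probability and lower entropy, which strengthens the Log-Likelihood and Log-Rank signals. A minimal sketch (the logit values are hypothetical):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T before the softmax; T < 1 sharpens the distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(probs):
    """Shannon entropy in nats, ignoring zero-probability tokens."""
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

logits = np.array([2.0, 1.0, 0.5, 0.1])
for T in (1.0, 0.7, 0.5):
    p = softmax_with_temperature(logits, T)
    print(T, round(p.max(), 3), round(entropy(p), 3))
```

As T drops, the most likely token absorbs more probability mass and the entropy falls, which is consistent with the low-temperature behaviour of the Entropy method discussed above.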

Figure 1: Distribution of LRR and NPR visualized on 300 human-written texts (HG) from the WritingPrompts dataset (Fan et al., 2018), as well as on 300 texts generated with GPT-2-xl (MG) by prompting it with the first 30 tokens of the human-written texts.
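The two statistics shown in Figure 1 can be sketched on toy per-token scores; this assumes LRR is the ratio of the average negative log likelihood to the average log rank, and NPR is the average log rank of perturbed copies normalized by the log rank of the original text (see the method section for the exact definitions; the numbers below are illustrative, not from our data):

```python
import numpy as np

def lrr(token_probs, token_ranks):
    """Log-Likelihood Log-Rank Ratio: average negative log likelihood of the
    observed tokens divided by their average log rank (assumed form)."""
    nll = -np.mean(np.log(token_probs))
    mean_log_rank = np.mean(np.log(token_ranks))
    return nll / mean_log_rank

def npr(original_log_rank, perturbed_log_ranks):
    """Normalized Perturbed log-Rank: average log rank of perturbed copies of
    the text divided by the log rank of the original text (assumed form)."""
    return np.mean(perturbed_log_ranks) / original_log_rank

# Toy per-token probabilities and ranks under the scoring model; in practice
# they come from the LLM's next-token distribution at each position.
print(lrr([0.4, 0.2, 0.5, 0.1], [1, 2, 1, 3]))
print(npr(0.2, [0.4, 0.5, 0.6]))  # perturbations raise the log rank
```

Intuitively, machine-generated text sits near a local optimum of the scoring model, so small perturbations raise its log rank sharply, yielding the larger NPR values for MG texts visible in Figure 1.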

Figure 2: Comparison of DetectGPT to NPR, averaged across six models (in terms of AUROC). The full results are given in Figure 6 in the Appendix.

Figure 3: Comparison of LRR and NPR when T5-small and T5-base are used for perturbation in NPR (AUROC scores).

Table 1: Zero-shot experiments. Comparison of the proposed LRR and NPR to other zero-shot methods in terms of AUROC. For a fair comparison, we show in bold the best results both with and without perturbations.
Top-k sampling generates from the k most likely words according to the LLM, while top-p sampling (nucleus sampling) samples from the smallest set of words whose cumulative probability exceeds p. Due to the unstable performance surge of the Log-Rank method, LRR falls slightly behind the Log-Rank method, with minor differences of 0.36 and 0.19 points on the XSum and the WritingPrompts datasets, respectively. For perturbation-based methods, the behaviour is consistent with the previous results: NPR outperforms DetectGPT under both the top-p and the top-k sampling strategies.

For perturbation-based methods, the running time is at least 50 times longer than for the Log-Likelihood, Rank, Log-Rank, and Entropy methods, since they calculate the log likelihood or the log rank not only for the target text, but also for the perturbed samples.
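The overhead can be summarized with a simple cost model in terms of the per-perturbation time t_p and the per-text scoring time t_m reported in Table 6 (the concrete timing values below are hypothetical placeholders):

```python
def detection_time(t_m, t_p, n_perturbations):
    """Rough cost model for one detection call: a perturbation-free method
    scores only the target text (t_m); a perturbation-based method also
    generates and scores n perturbed copies (t_p + t_m each)."""
    perturbation_free = t_m
    perturbation_based = t_m + n_perturbations * (t_p + t_m)
    return perturbation_free, perturbation_based

free, based = detection_time(t_m=0.05, t_p=0.02, n_perturbations=50)
print(based / free)  # overhead factor of the perturbation-based method
```

With dozens of perturbations per text, the overhead factor easily exceeds 50, which matches the measured gap and motivates reducing the number of perturbations NPR needs.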

Table 5: Computational time (in seconds) for the different zero-shot methods on different LLMs (averaged over 10 reruns).

Table 6: Computation time. Estimated computation time (in seconds) for one perturbation (t_p) and for calculating the target statistic on the text (t_m).

Table 7: Complete results for the supervised methods (AUROC scores).

Table 8: Complete results for the supervised methods using top-k (k = 40) and top-p (p = 0.96) sampling across four models (AUROC scores).

Table 9: Complete results for the zero-shot methods using top-k and top-p sampling across four models (AUROC scores).