LLMDet: A Third Party Large Language Models Generated Text Detection Tool

Generated texts from large language models (LLMs) are remarkably close to high-quality human-authored text, raising concerns about their potential misuse in spreading false information and academic misconduct. Consequently, there is an urgent need for a highly practical detection tool capable of accurately identifying the source of a given text. However, existing detection tools typically rely on access to LLMs and can only differentiate between machine-generated and human-authored text, failing to meet the requirements of fine-grained tracing, intermediary judgment, and rapid detection. Therefore, we propose LLMDet, a model-specific, secure, efficient, and extendable detection tool that can trace text to specific LLMs, such as GPT-2, OPT, LLaMA, and others. In LLMDet, we record the next-token probabilities of salient n-grams as features to calculate a proxy perplexity for each LLM. By jointly analyzing the proxy perplexities of the LLMs, we can determine the source of the generated text. Experimental results show that LLMDet yields impressive detection performance while ensuring speed and security, achieving 98.54% precision and a roughly 3.5x speedup when recognizing human-authored text. Additionally, LLMDet can effortlessly extend its detection capabilities to new open-source models. We provide an open-source tool at https://github.com/TrustedLLM/LLMDet.


Introduction
Recently, the emergence of ChatGPT has heralded a "Cambrian Explosion" for generative large language models (LLMs). GPT-4 (OpenAI, 2023), Bard, PaLM-2 (Anil et al., 2023), and other LLMs from internet companies are currently flourishing, while open-source communities are witnessing a proliferation of open-source models like LLaMA (Touvron et al., 2023a), OPT (Liu et al., 2021), and ChatGLM (Du et al., 2022). These models are capable of generating coherent, fluent, and meaningful text. However, the formidable text generation capabilities of generative language models have also raised concerns about their potential misuse in domains such as phishing, spreading false information, and academic fraud. Additionally, with the application of products like ChatGPT, the future abundance of machine-generated text data has the potential to contaminate genuine human-generated data (Hataya et al., 2022), altering the data ecosystem of the real world.
Accordingly, the study of practical content generation detection tools has attracted widespread attention from the community. Recently, the primary focus of research has been on approaching the text detection problem as a binary classification task to distinguish machine-generated text from human-authored text, which makes it hard to assign responsibility to a specific model or its provider. Moreover, watermarking methods (Kirchenbauer et al., 2023) necessitate altering the text generation process, leading to a compromise in the quality of the generated content. Techniques like GPT-Zero, DetectGPT (Mitchell et al., 2023), and the classifier in OpenAI (OpenAI, 2023) require access to the deployed model, thereby resulting in high cost and intractability for third parties.
Thus, a practical LLM detection tool should possess the following capabilities, which are also the objectives of our method. Specificity: Merely focusing on identifying human and machine-generated text is insufficient for duty attribution. There is a pressing need for the ability to recognize the specific model responsible for generating the text. Safety: Ensuring model security and mitigating potential risks require a detection method that does not require accessing model parameters. This need is particularly urgent for commercial models. Efficiency: With the increasing demand for detection and the exponential growth of models, it is crucial to develop detection algorithms that have low resource and low latency requirements. Extendibility: The detection tool should inherently possess the capacity to seamlessly accommodate emerging model paradigms. This capability plays a pivotal role in refining the detection ecosystem and effectively addressing the ever-expanding variety of LLMs.
Guided by the aforementioned capabilities, we propose a pragmatic third-party detection method called LLMDet. Our approach is inspired by the observation that perplexity serves as a reliable signal for distinguishing the source of generated text, a finding that has been validated in previous work (Solaiman et al., 2019; Jansen et al., 2022; Mitchell et al., 2023). However, directly calculating perplexity requires access to LLMs, which compromises both safety and efficiency. In LLMDet, we address this challenge by capturing the next-token probabilities of prominent n-grams in texts as priors. This enables us to efficiently compute a proxy perplexity for each LLM. By comprehensively analyzing the proxy perplexities of LLMs, we can accurately trace the specific language model responsible for generating the text. Notably, our method eliminates the need to access the model at the detection end, ensuring the security of parameters in large-scale models. It also offers the potential for seamless integration with emerging open-source models, as well as proprietary models under appropriate licensing. These factors contribute to the widespread adoption of our approach.
LLMDet exhibits outstanding overall detection performance, with an F1-Macro score of 88.14% and near-perfect results for R@2, indicating that the highest-ranked predictions cover the correct labels for the majority of instances. Particularly notable is its exceptional discriminative ability on human text, LLaMA-generated text, and BART-generated text. In terms of detection efficiency, LLMDet significantly outperforms similar methods such as fine-tuned RoBERTa, GPT-Zero (https://gptzero.me), DetectGPT (Mitchell et al., 2023), and True-PPL with respect to speed. Moreover, it has very low resource requirements, as text detection can be accomplished solely on a CPU, enabling easy accessibility for a wider range of users. Additionally, when tested on perturbed text data, LLMDet produces satisfactory detection results, demonstrating its robustness and adaptability.

Related Work
The existing methods for detecting generated text can be broadly categorized into two types: black-box and white-box detection (Tang et al., 2023).

Black-box Detection
Black-box detection methods can be further divided into three main branches: statistical learning methods, supervised learning methods, and unsupervised learning methods. Traditional approaches utilize statistical metrics such as entropy, perplexity, and n-gram frequency for text classification (Gehrmann et al., 2019; Fröhling and Zubiaga, 2021).
Compared to statistical learning methods, supervised learning methods are more commonly used in text detection. These works leverage text features to train a supervised classification model specifically designed for the detection of machine-generated text (Bakhtin et al., 2019; Uchendu et al., 2020; Fagni et al., 2021; OpenAI, 2023).
However, studies conducted by Uchendu et al. (2020) and Chakraborty et al. (2023) demonstrate that a limitation of supervised models is the potential for overfitting within the training domain, resulting in poor detection performance outside the domain.
To address the limitations of supervised learning methods, unsupervised learning methods such as DetectGPT (Mitchell et al., 2023) and GPT-Zero have been developed. These approaches check the perplexity and burstiness of the text to determine whether it is machine-generated or authored by a human.

White-box Detection
White-box detection methods require full access to LLMs, thereby enabling control over the model's generation behavior or the embedding of watermarks within the generated text (Abdelnabi and Fritz, 2021; Ueoka et al., 2021; Dai et al., 2022). This enables the tracking and detection of machine-generated text within white-box settings.
The current state-of-the-art approach, proposed by Kirchenbauer et al. (2023), partitions the model's vocabulary into whitelist and blacklist tokens when predicting the next token given a prompt. During text generation, the goal is to produce whitelist tokens as much as possible, effectively creating a strong watermark. Third parties can then determine whether the text is machine-generated by analyzing the frequency of whitelist tokens within it. While watermarking methods offer robustness and interpretability, they can compromise the quality of the generated text and may not be highly practical in certain scenarios (Sadasivan et al., 2023).
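The whitelist-frequency check described above can be sketched as follows. This is a simplified toy version, not the authors' implementation: the `greenlist` hash partition and the 50% whitelist fraction are illustrative assumptions standing in for the seeded hashing scheme of Kirchenbauer et al. (2023).

```python
import hashlib
import math

def greenlist(prev_token: int, vocab_size: int, fraction: float = 0.5) -> set:
    """Pseudo-randomly partition the vocabulary, seeded by the previous token
    (a simplified stand-in for the hashing scheme of Kirchenbauer et al., 2023)."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16)
    cutoff = int(vocab_size * fraction)
    # Deterministic permutation of token ids derived from the seed.
    ranked = sorted(range(vocab_size),
                    key=lambda t: hashlib.sha256(f"{seed}-{t}".encode()).hexdigest())
    return set(ranked[:cutoff])

def watermark_z_score(tokens: list, vocab_size: int, fraction: float = 0.5) -> float:
    """z-test: does the text contain more whitelist tokens than chance predicts?"""
    hits = sum(1 for prev, cur in zip(tokens, tokens[1:])
               if cur in greenlist(prev, vocab_size, fraction))
    n = len(tokens) - 1
    return (hits - fraction * n) / math.sqrt(n * fraction * (1 - fraction))
```

A strongly watermarked text (one that always picks whitelist tokens) yields a large positive z-score, so a third party can flag it without any model access.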

Motivation
A practical LLM detection method should be specific, secure, efficient, and extensible; these characteristics motivate the development of our third-party detection tool.
Specificity: The field of LLMs constantly evolves, indicating that a sole focus on identifying human and machine-generated text is insufficient to meet detection requirements. From the perspective of copyright protection for works generated by artificial intelligence (Aplin and Pasqualetto, 2019), an ideal detection tool should be capable of identifying the specific language model responsible for generating the text, thereby exerting a lasting impact on intellectual property rights protection.
Safety: The majority of existing detection methods require accessing or modifying model parameters, which is unacceptable for commercial models. Once a model is leaked, it represents a financial loss for the owner and can also expose the model to potential attacks (Kurita et al., 2020). Hence, considering the security of the model, it is desirable to minimize the need for access to the model during the detection process.
Efficiency: With the growing number of users utilizing large-scale models, the future of text detection is poised for exponential expansion in terms of demand and user base. For instance, in the realm of education, there is a significant need for text detection to combat cheating and plagiarism (Cotton et al.), despite often constrained hardware conditions. This poses a formidable challenge to existing detection methods. Hence, the pursuit of rapid and resource-efficient approaches has become a pivotal direction in developing efficient detection algorithms.
Extendibility: As for multi-model generated text detection approaches, it is crucial to seamlessly adapt to emerging model paradigms and extend detection capabilities to new models. An excellent detection tool is not static; it needs to keep up with technological advancements and continuously enhance its own detection ecosystem to address the challenges posed by new models.

LLMDet
Combining the aforementioned motivations, we introduce LLMDet, a text detection tool capable of identifying the source from which a text was generated, such as Human, LLaMA, OPT, or others. The overall framework of the system is illustrated in Figure 1 and consists of two main components: Dictionary Construction (see § 4.1) and Text Detection (see § 4.2).
The construction of the dictionary is performed offline by us or provided by the model owner, ensuring its independence from external systems. This ensures the fulfillment of the four characteristics proposed for our detection tool in § 3. The text detection component can be distributed to tool users, allowing third-party detection without requiring possession of the model. For the specific algorithm, please refer to Appendix A.

Dictionary Construction
Drawing on previous detection works such as DetectGPT (Mitchell et al., 2023) and GPT-Zero, perplexity has shown promising results in detecting machine-generated text. Therefore, we consider utilizing perplexity as a measure for identifying text generated by different LLMs. However, calculating the actual perplexity requires access to LLMs, which conflicts with the safety and efficiency characteristics of a practical LLM detection method.
Perplexity is a measure used to evaluate the performance of language models. Specifically, it is the exponential average of the negative log-likelihood of a sequence under the model. The perplexity score is calculated from the probability of generating the next word given all the previous words in the sequence, i.e., p(x_i | x_<i). In order to calculate the perplexity of a text without accessing the model, we need to approximate p(x_i | x_<i) by replacing x_<i with an n-gram. Thus, a dictionary should be constructed, with n-grams as keys and next-token probabilities as values. This dictionary serves as prior information during the detection process, allowing us to compute the proxy perplexity of the text instead of the true perplexity. The construction process can be divided into three steps: 1) Generated Text Sampling: Due to the absence of readily available model-generated text data, it is necessary to collect a sufficient number of generated texts for each model. We provide a prompt dataset and, for each model, randomly sample an equal number of prompts. We use these prompts to generate corresponding texts and collect the required text data.
2) Word Frequency Statistics: In this phase, we first utilize the generated texts collected in the previous step to perform n-gram frequency statistics (Pang et al., 2016), with the order ranging from 2-grams up to n-grams. Subsequently, we select the top-k n-grams based on their frequency.
3) Next Token Probability Sampling: In this phase, we treat each n-gram obtained from the word frequency statistics as a sample. We input its first n-1 tokens into the corresponding generative model to predict the next-token probability distribution over the vocabulary W. Subsequently, we retain the top-K words ranked by next-token probability. For n-grams with different values of n, the optimal value of K for top-K sampling may vary.
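The three construction steps can be sketched as follows. Here `next_token_probs` is a hypothetical stand-in for a query to the generative model (returning a token-to-probability mapping); the whitespace tokenization is likewise a simplification of the model's own tokenizer.

```python
from collections import Counter

def build_dictionary(texts, n=2, top_k_ngrams=3, top_k_probs=2, next_token_probs=None):
    """Sketch of LLMDet's offline dictionary construction.
    `next_token_probs(prefix)` stands in for a query to the generative model:
    it must return a {token: probability} mapping for the next token."""
    # Step 2: n-gram frequency statistics over the sampled generated texts.
    counts = Counter()
    for text in texts:
        tokens = text.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    frequent = [ng for ng, _ in counts.most_common(top_k_ngrams)]
    # Step 3: for each frequent n-gram, record the model's top-K next-token
    # probabilities, keyed by the first n-1 tokens.
    dictionary = {}
    for ng in frequent:
        prefix = ng[:-1]
        probs = next_token_probs(prefix)
        top = sorted(probs.items(), key=lambda kv: -kv[1])[:top_k_probs]
        dictionary[prefix] = dict(top)
    return dictionary
```

The resulting dictionary is all that the detection side ever needs; the model itself stays behind the owner's wall.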
We should choose the optimal values of the n-gram order n, the number of n-grams k, and the number of retained next-token probabilities K by weighing two aspects: detection performance and storage cost.
In terms of detection performance, larger n, k, and K may improve the detection performance of LLMDet, as this enables the proxy perplexity to better approximate the true perplexity.
In terms of storage cost, since the sampled probabilities are stored as Float64 values and n-grams as strings, a significant amount of storage space is required, on the order of O(nkK). If n is set to 4, k is set to 100,000 (much smaller than the number of distinct 4-grams), and K is set to 10,000 (most vocabulary sizes are larger than that), we need almost 22GB to store the probabilities for a single model. Thus, we have to reduce the storage for practical use. The reduction can be considered in two folds: 1) select suitable n, k, and K; 2) reduce Float64 to Float16 and represent n-grams as Int16 token ids. We find that this does not significantly affect LLMDet, while it reduces storage costs by approximately 11 times, to about 0.5GB per model.

Text Detection
In § 4.1, we obtained the dictionary of n-grams and their probabilities. Therefore, we can use the corresponding dictionary of each model as prior information for third-party detection to calculate the proxy perplexity of the text being detected on each model. Immediately after, by inputting the proxy perplexities as features into a trained text classifier, we can obtain the corresponding detection results.

Proxy Perplexity Estimating
During text detection, for the input text X, our initial task is to estimate the proxy perplexity of this text across various large language models as a vector of feature information.
Taking the estimation of proxy perplexity on Model_m as an example, we begin by tokenizing the input text X to obtain its sequence X = [x_1, x_2, ..., x_t], where t denotes the length of the tokenized sequence.
Then, the proxy perplexity of the sequence X on Model_m can be represented by the following function, denoted as Proxy_PPL:

Proxy_PPL(X) = exp( -(1/t) * Σ_{i=1}^{t} log p(x_i | n-gram) )

More specifically, log p(x_i | n-gram) represents the logarithmic likelihood of the i-th token, conditioned on the preceding tokens x_<i matching an n-gram in the dictionary of Model_m. The likelihood probability p(x_i | n-gram) corresponds to the value associated with the matching n-gram in the dictionary.
Similarly, by repeating the above procedure on the other models, we obtain the proxy perplexity of the detected text on each of them. These proxy perplexities constitute the feature vector for detection, denoted as F = [Proxy_PPL_1, Proxy_PPL_2, ..., Proxy_PPL_c], where the subscript c denotes the number of LLMs.
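Under the definitions above, proxy perplexity reduces to dictionary lookups. The `fallback` floor probability for contexts or tokens absent from the dictionary is our assumption, since the text does not specify how misses are handled:

```python
import math

def proxy_perplexity(tokens, dictionary, n=2, fallback=1e-6):
    """Sketch of proxy perplexity: look up each token's probability under the
    n-gram dictionary instead of querying the model. `fallback` is an assumed
    floor probability for unmatched contexts (a detail the paper leaves open)."""
    t = len(tokens)
    nll = 0.0
    for i in range(1, t):  # the first token has no preceding context
        prefix = tuple(tokens[max(0, i - (n - 1)):i])  # preceding n-1 tokens
        p = dictionary.get(prefix, {}).get(tokens[i], fallback)
        nll += -math.log(p)
    return math.exp(nll / (t - 1))
```

Running this against each model's dictionary produces the feature vector F = [Proxy_PPL_1, ..., Proxy_PPL_c] without loading a single model.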

Result Ranking
Before result ranking, we initially estimate the proxy perplexities of the texts generated by each language model and of human-written texts. This estimation allows us to obtain a separate feature vector for each text. Subsequently, these vectors are employed to train a text detector.
Next, we input the feature vector F of the text to be detected, obtained during the proxy perplexity estimation phase, into the trained text detector for prediction, yielding a probability for each source; for a given Model_i, the probability is denoted as p_i. Note that we denote the probability of Human as p_0.
However, because the text detector is trained with perplexity as its only feature, it is not sensitive to the length of the detected text, resulting in suboptimal detection performance for some short texts. Therefore, it is necessary to apply a smoothing technique to the predicted probabilities in order to enhance the success rate of detecting short texts. The smoothing is parameterized by L, the length of the text to be detected, and c, the number of LLMs.
Finally, we apply softmax to the smoothed probabilities to obtain [p̂_0, p̂_1, ..., p̂_c], so that the detection result for Model_i is the probability p̂_i. The results are then sorted in descending order of probability, yielding the final detection outcome.
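The final softmax-and-sort step can be sketched as follows. The length-based smoothing itself is applied beforehand and is not reproduced here, so `raw_scores` stands in for the already-smoothed classifier outputs:

```python
import math

def rank_sources(raw_scores, labels):
    """Sketch of the result-ranking step: normalize the (smoothed) scores with
    softmax and sort candidate sources by descending probability."""
    m = max(raw_scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in raw_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda kv: -kv[1])
```

The sorted list is exactly what the R@1/R@2/R@3 metrics in § 5.2 are computed over.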

Experiments
We conduct experiments based on proxy perplexity and true perplexity according to the methods proposed in § 4. By comparing the performance of text detectors based on fine-tuned RoBERTa, proxy perplexity, and true perplexity, we find that our proposed method outperforms existing methods in terms of detection efficiency, security, and scalability while maintaining detector performance.

Datasets
In our experiments, we use Wikipedia paragraphs from the SQuAD context (Rajpurkar et al., 2016) and news articles from the XSum (Narayan et al., 2018) dataset. We extract the first 5 phrases of each text to form a prompt dataset.
During the text generation phase, for each LLM, we randomly select 32,000 samples from the prompt dataset as input and have the model generate corresponding texts. The generated texts from each model are evenly split into two parts: 16,000 samples for the statistical dataset and 16,000 samples for the validation dataset. The statistical dataset is used for n-gram frequency counting. The validation datasets from the LLMs, along with 16,000 samples collected from HC3 (Guo et al., 2023) as human-generated text, form a combined dataset for the training and validation of the text detectors.

Metrics
To evaluate the ability of the detector to distinguish between text generated by different LLMs and human-written text, we employ precision (P), recall (R), and F1 score to assess the discriminative performance of the text detector on each LLM and on human-generated text. Additionally, the F1-Macro, R@1, R@2, and R@3 metrics are used to analyze the overall performance of the detector, where P_i, R_i, and F1_i respectively denote the precision, recall, and F1 score of Model_i. N denotes the total number of categories, and M denotes the number of texts being tested. G_j denotes the ground-truth label of text j, K_j denotes the top-k categories with the highest predicted probabilities, and the indicator I(G_j ∈ K_j) takes the value 1 when G_j ∈ K_j and 0 otherwise.
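The R@k metric described above can be sketched directly from its definition (the label list and probability layout here are illustrative):

```python
def recall_at_k(truths, predicted_probs, labels, k):
    """R@k: fraction of test texts whose true source G_j appears among the k
    labels with the highest predicted probability (the set K_j)."""
    hits = 0
    for truth, probs in zip(truths, predicted_probs):
        top_k = [lbl for lbl, _ in
                 sorted(zip(labels, probs), key=lambda kv: -kv[1])[:k]]
        hits += truth in top_k
    return hits / len(truths)
```

By construction R@k is monotonically non-decreasing in k, which is why the paper's near-perfect R@2 is a meaningful relaxation of R@1.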

Research Questions
Based on the characteristics and assumptions of our proposed detection tool in § 3, we formulate four research questions regarding LLMDet.
• RQ1: Can perplexity-based methods trace the source of text back to a specific LLM?
• RQ2: How significant is the impact of the proxy perplexity-based approach on detection performance?
• RQ3: Can LLMDet achieve the expected level of efficiency compared to existing methods?
• RQ4: How is the extendibility of LLMDet demonstrated?

Experiments & Results
We conduct experiments to address each of the research questions raised above.
For Specificity (RQ1): We first train a text classifier using the true perplexities of the combined dataset, computed on each LLM, as features. The classifier is then tested, and the results are presented in Table 1. We observe that the text detector based on true perplexity achieves excellent detection success rates when confronted with texts generated by different models, with the exception of texts generated by UniLM. Despite the comparatively lower detection performance on UniLM-generated texts, the F1 score reaches 80.60%, which is significantly higher than random guessing. These experimental results robustly validate the applicability of perplexity as a distinguishing metric for identifying the specific source of a text.
For Safety (RQ2): We utilize the statistical datasets generated by GPT-2, GPT-2-Large, OPT-1.3B, OPT-2.7B, UniLM, LLaMA-7B, BART, T5-Base, Bloom-650M, and GPT-Neo-2.7B, as described in § 5.1, to construct a dictionary for each model using the method described in § 4.1. We then employ these dictionaries to calculate the proxy perplexities of the combined dataset as features for training a text classifier based on LightGBM (Ke et al., 2017).
The classifier is then tested, and the results are presented in Table 1. Our proposed method based on proxy perplexity achieves results comparable to the text detector based on true perplexity on Human, LLaMA-generated, and BART-generated texts, with detection success rates exceeding 95%. Additionally, our method outperforms the true-perplexity-based detector when it comes to detecting UniLM-generated texts. Furthermore, the F1 score for detecting texts from the other sources is at least 76.39%, significantly higher than random guessing. Based on the confusion matrix in Figure 2, it can be observed that texts generated by GPT-2 and OPT tend to be confused with each other, while texts generated by T5, Bloom, and GPT-Neo also tend to be confused. Although the overall performance is not as high as that of the true-perplexity-based text classifier, our proposed method does not require model access during detection and offers advantages such as speed, scalability, and enhanced security.
To assess the comprehensive detection capability of the detector, we compute the F1-Macro, R@1, R@2, and R@3 values. From Table 2, it is evident that our proposed method achieves an R@2 value of 98.00%. This indicates that, among the top two text sources with the highest predicted probabilities, there is typically one that corresponds to the true source of the text.
For Efficiency (RQ3): Based on the efficiency analysis in Table 2 and Table 3, it can be observed that LLMDet significantly outperforms other detection methods in speed. Furthermore, in terms of resource requirements, our approach exhibits the lowest demands. Consequently, our detection tool demonstrates substantially higher efficiency than other methods, making it better aligned with future detection needs.
For Extendibility (RQ4): To illustrate the extendibility of the LLMDet method, we expand its detection capability from one model to eight. Specifically, we sequentially add LLMs to LLMDet in the following order: GPT-2, LLaMA, OPT, UniLM, BART, T5, Bloom, and GPT-Neo, thereby continuously extending the detection capability to these models. With each expansion, we retrain the text detector (LightGBM) and assess the resulting changes in overall performance.
From Figure 3, it can be observed that during the expansion of LLMDet, there is only a slight fluctuation in the F1-Macro value, which remains consistently around 85%. Therefore, it can be concluded that LLMDet can easily be extended to new models in the future with only a slight impact on performance.
In addition, in order to explore the performance changes of LLMDet when using newer and larger LLMs, we also conduct additional experiments. The detailed experimental steps and results can be found in Appendix B.

Analysis
In this section, we conduct several additional experiments to facilitate a more comprehensive analysis of LLMDet. First, we verify the detection robustness of LLMDet. Then, we investigate the impact of the n-gram order used in dictionary construction on the detection performance of LLMDet. Finally, we explore the influence of the top-K next-token sampling used in dictionary construction on the detection performance of LLMDet.

The Robustness Testing of Detector
Many LLMs can change the probability of the next token via different methods, for example, by changing hyperparameters like temperature, or even by updating weights through fine-tuning. Furthermore, generated text may undergo deliberate perturbation, such as random deletions. It is therefore worth examining the robustness of our method in these situations.
For hyperparameter changes, we use the approach outlined in § 5.1 of this article to generate 16,000 text instances using LLaMA-7B at temperatures of 0.1, 0.4, 0.7, and 1.0 respectively.
For random deletion, we use the approach outlined in § 5.1 to generate 16,000 text instances using LLaMA-7B. For the generated texts, we set the deletion rates at 0.1, 0.3, and 0.5, respectively, producing corresponding perturbed texts by randomly removing words according to these rates.
For weight updates, we employ the approach outlined in § 5.1 to generate 16,000 text instances using Vicuna-7B, an instruction fine-tuned version of LLaMA-7B.
These text instances are then utilized as test data to assess the robustness of LLMDet, and the experimental outcomes are presented in Table 4. LLMDet exhibits strong robustness against certain types of perturbations, such as random deletions, slight weight updates in the generative model, and adjustments to temperature settings. For further analysis of the experimental results, please see Appendix C.

The Influence of N -gram
We compute the proxy perplexity of each model on the combined dataset from § 4.1 using dictionaries built on 2-grams, 3-grams, and 4-grams, respectively, and jointly analyze these proxy perplexities to train and test the text classifier using LightGBM. It should be noted that the (n-1)-grams are a subset of the n-grams. Based on the results shown in Table 5, the overall in-domain detection performance does not increase significantly as the value of n increases, exhibiting only a slight improvement. Considering that the number of n-grams grows exponentially with n, we only consider up to 4-grams in LLMDet.

Next Token Top-K Sampling
The construction of the dictionary incurs significant storage overhead due to the necessity of storing the top-K probabilities along with their corresponding n-grams, presenting a challenge to our method. Consequently, determining the optimal value of K requires comprehensive consideration of both detection performance and storage cost.
In order to gain a more intuitive understanding of the impact of the value of K on the detection performance of LLMDet, we keep the number of 2-grams fixed, vary only the value of K, and examine the changes in the F1-Macro of LLMDet across different values of K. The result is presented in Figure 4.
We observe that as the value of K increases, the detection performance of LLMDet gradually improves. However, the performance improvement becomes less pronounced after K reaches 1500, while the corresponding storage overhead still increases linearly. Therefore, considering the trade-off between detection performance and storage cost, we recommend adopting top-2000 sampling for 2-grams. For 3-grams and 4-grams, the quantities are immense; following similar experimental analyses, we employ top-100 sampling for these n-grams.

Conclusions and Future Work
In the era dominated by machine-generated text, there is a growing need for an efficient and secure detection tool. However, existing detection methods typically require interaction with language models, which inherently compromises speed and security. Our proposed detection tool, LLMDet, overcomes these limitations by leveraging pre-mined prior probability information to compute proxy perplexity, ensuring both speed and security in the detection process. Additionally, our method enables text tracking, allowing for the identification of the underlying language model from which the text originates. Importantly, our detection tool can be continuously enhanced by expanding to new open-source LLMs, enabling ongoing improvements.
In the future, we aim to further refine our detection tool. First, we will improve the dictionaries used to compute proxy perplexity, thereby enhancing detection performance. Second, for closed-source models, we are unable to build the corresponding dictionaries. To mitigate this to some extent, we have considered two possible approaches: 1) In the implementation of LLMDet, we offer not only detection capabilities but also an extensible interface for closed-source model owners. Details about this implementation can be found in Algorithm 1 of Appendix A. The extended interface aims to secure the model effectively without compromising the interests of the model owners. Through this approach, we hope to encourage more closed-source model owners to participate and contribute to the continuous improvement of the detection ecosystem of LLMDet.
2) We have also explored using statistical techniques to estimate the next-token probabilities of proprietary commercial models. However, due to limited data volume, achieving the anticipated results has been challenging. Additionally, generating a significant amount of statistical data comes with considerable costs. As a result, we have placed this approach on our list of future work.
Furthermore, the distillation method is a valuable avenue for future exploration, and we will certainly consider it in our future research endeavors.

Limitations
One of the limitations of the current LLMDet is its restriction to detecting English text; it is thus unable to detect text in other languages. In the future, we can extend our approach to encompass models for other languages, thereby equipping it with the capability to detect text in diverse languages.
Furthermore, at present, the number of models detectable by LLMDet is limited. We will expand the capabilities of our detection tool to encompass a broader range of models, providing more possibilities for text tracing and attribution.

Ethics Statement
We honor and support the ethical guidelines of EMNLP. This paper primarily focuses on the detection of text generated by LLMs, aiming to construct a detection tool suitable for users from various domains. The tool is designed to perform text detection efficiently and securely to prevent the misuse of generated text. Overall, our approach exhibits advantages over previous methods in terms of efficiency and granularity of detection, making this work meaningful. Additionally, the datasets used in this study are sourced from previously published works and do not involve any privacy or ethical concerns.

A Algorithm of LLMDet
For the detailed implementation of LLMDet, please refer to the pseudocode provided below. Algorithm 1 is the dictionary construction algorithm, which is completed offline by us or by the model holder, independently of external systems. Algorithm 2 is provided to users as a third-party tool.

B LLMDet Using Newer and Larger LLM
In order to explore whether the gap between proxy perplexity (our method) and true perplexity becomes more apparent as the size of LLMs increases, we conduct additional experiments. We replace LLaMA-7B with LLaMA2-13B (Touvron et al., 2023b) while keeping all other experimental settings the same as in the original paper. The detailed experimental results are shown in Table 6 and Table 7.
From the experimental results, when we replace the original LLM with a better-performing and larger-scale LLM, such as replacing LLaMA-7B with LLaMA2-13B, the detection performance remains essentially consistent with the original performance. This indicates that when a better-performing and larger LLM is used, the performance gap between proxy perplexity (our method) and true perplexity does not widen noticeably.

C Additional Analysis for Robustness Testing
From Table 4, it can be observed that as the temperature increases, the accuracy of generated-text detection improves.
Regarding this phenomenon, we should clarify that our method calculates proxy perplexity by building a dictionary from sampled next-token probabilities. When computing the probability of the next token, we directly use softmax with a default temperature of 1.0. When the temperature of the LLM is set to 1.0, the generated text conforms more closely to the next-token probability distribution in the dictionary we have constructed. At this point, the calculated proxy perplexity is closer to the true perplexity, resulting in higher detection accuracy. Therefore, the higher the temperature, the closer the text distribution generated by the LLM is to the next-token distribution in the dictionary, and the higher the detection accuracy.
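The role of temperature can be illustrated directly: at T=1.0 the sampling distribution matches the softmax distribution stored in the dictionary, while lower temperatures sharpen it away from the stored one (the logit values below are illustrative):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Next-token distribution under temperature scaling. LLMDet's dictionary is
    built at temperature 1.0, so generation at T=1.0 matches the stored
    distribution best; lower T concentrates mass on the top token."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

A text generated at low temperature is therefore sampled from a sharper distribution than the T=1.0 one recorded in the dictionary, which is exactly the mismatch that degrades proxy perplexity at low temperatures.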

D Sample of n-gram for each LLM
Specific examples of 2-grams, 3-grams, and 4-grams for each LLM can be found in the following tables: Table 8 shows the samples for GPT-2; Table 9 for OPT; Table 10 for LLaMA; Table 11 for T5; Table 12 for UniLM; Table 13 for BART; Table 14 for GPT-Neo; and Table 15 for Bloom.

Figure 1: The detailed processes of the proposed tool LLMDet. It contains two main phases: dictionary construction and text detection. The dictionary construction phase is carried out offline by us or provided by the model holder, independent of external systems. The text detection phase can be accessed by the tool user who, as a third party, performs text detection without holding the model.

Figure 2: The confusion matrix of the detection performed by LLMDet.

Figure 3: The impact of sequentially adding LLMs into LLMDet on the comprehensive detection performance measured by F1-Macro.

Figure 4: The impact of the K value in top-K sampling of 2-grams on the detection performance of LLMDet.

Table 2: Comparison of overall performance between text detectors based on true perplexity, fine-tuned RoBERTa, and proxy perplexity.

Table 3: The detection time of GPT-Zero, DetectGPT, and LLMDet on a dataset of 1000 texts.

Table 4: The detection performance of LLMDet in three scenarios: temperature changes, random deletion, and weight updates.

Table 5: The impact of the value of n in n-gram on the overall detection performance.

Algorithm 2: Text Detection
Input: A piece of text t for detection; a list D = [D_M0, ..., D_Mc] of dictionaries for c LLMs, where D_M0 denotes Human.
Output: A detection result R.

Table 6: Experimental results of text detectors based on proxy perplexity with LLaMA-7B and proxy perplexity with LLaMA2-13B.

Table 7: Comparison of overall performance between text detectors based on proxy perplexity with LLaMA-7B and proxy perplexity with LLaMA2-13B.