Counter Turing Test (CT 2 ): AI-Generated Text Detection is Not as Easy as You May Think – Introducing AI Detectability Index


governments have recently drafted their initial proposals regarding the regulatory framework for AI. Given this cynosural spotlight on generative AI, AI-generated text detection (AGTD) has emerged as a topic that has already received immediate attention in research, with some initial methods having been proposed, soon followed by the emergence of techniques to bypass detection. This paper introduces the Counter Turing Test (CT 2 ), a benchmark consisting of techniques aiming to offer a comprehensive evaluation of the robustness of existing AGTD techniques. Our empirical findings unequivocally highlight the fragility of the proposed AGTD methods under scrutiny. Amidst the extensive deliberations on policymaking for regulating AI development, it is of utmost importance to assess the detectability of content generated by LLMs. Thus, to establish a quantifiable spectrum facilitating the evaluation and ranking of LLMs according to their detectability levels, we propose the AI Detectability Index (ADI). We conduct a thorough examination of 15 contemporary LLMs, empirically demonstrating that larger LLMs tend to have a higher ADI, indicating they are less detectable compared to smaller LLMs. We firmly believe that ADI holds significant value as a tool for the wider NLP community, with the potential to serve as a rubric in AI-related policy-making.

† Work does not relate to position at Amazon.
1 Proposed AI-Generated Text Detection Techniques (AGTD) - A Review

The authors put forth evidently more resilient watermarking techniques in this iteration of their research, asserting that paraphrasing does not significantly disrupt watermark signals. By conducting extensive experiments (detailed in Section 3), our study provides a thorough investigation of the de-watermarking techniques w v1 and w v2, demonstrating that the watermarked texts generated by both methods can be circumvented, albeit with a slight decrease in de-watermarking accuracy observed with w v2. These results further strengthen our contention that text watermarking is fragile and lacks reliability for real-life applications.

Perplexity Estimation: The hypothesis behind perplexity-based AGTD methods is that humans exhibit significant variation in linguistic constraints, syntax, vocabulary, and other factors (aka perplexity) from one sentence to another. In contrast, LLMs display a higher degree of consistency in their linguistic style and structure. Employing this hypothesis, GPTZero (Tian, 2023) devised an AGTD tool that posited that the overall perplexity of human-generated text should surpass that of AI-generated text, as in the equation: log p_Θ(h_text) − log p_Θ(AI_text) ≥ 0 (Appendix C). Furthermore, GPTZero assumes that the variation in perplexity across sentences would also be lower for AI-generated text. This phenomenon could potentially be quantified by estimating the entropy of sentence-wise perplexity, as depicted in the equation: [Σ_{k=1}^{n−1} (|s_k^h − s_{k+1}^h|)] − [Σ_{k=1}^{n−1} (|s_k^AI − s_{k+1}^AI|)] ≥ 0, where s_k^h and s_k^AI represent the k-th sentences of human- and AI-written text respectively.

Burstiness Estimation: Burstiness refers to the patterns observed in word choice and vocabulary size. GPTZero (Tian, 2023) was the first to introduce burstiness estimation for AGTD. In this context, the hypothesis suggests that AI-generated text displays a higher frequency of clusters or bursts of similar words or phrases within shorter sections of
the text. In contrast, humans exhibit a broader variation in their lexical choices, showcasing a more extensive range of vocabulary.

Figure 1: (Top) The negative log-curvature hypothesis proposed by Mitchell et al. (2023). According to their claim, any perturbations made to AI-generated text should predominantly fall within a region of negative curvature. (Bottom) Our experiments using 15 LLMs with 20 perturbations indicate that the text generated by GPT 3.0 and variants does not align with this hypothesis. Moreover, for the other LLMs, the variance in the negative log-curvature was so minimal that it had to be disregarded as a reliable indication. The plot markers denote fake and real samples, and their perturbed counterparts, respectively.

Let σ_τ denote the
standard deviation of the language spans and m_τ the mean of the language spans. Burstiness (b) is calculated as b = (σ_τ/m_τ − 1) / (σ_τ/m_τ + 1) and is bounded within the interval [−1, 1]. Therefore, the hypothesis is b_H − b_AI ≥ 0, where b_H is the mean burstiness of human writers and b_AI is the mean burstiness of an AI, i.e., a particular LLM. Corpora with anti-bursty, periodic dispersions of switch points take on burstiness values closer to −1. In contrast, corpora with less predictable patterns of switching take on values closer to 1. It is worth noting that burstiness could also be calculated sentence-wise and/or text-fragment-wise, and the entropy of those values could then be defined analogously to the sentence-wise perplexity case. Nevertheless, our comprehensive experiments involving 15 LLMs indicate that this hypothesis does not consistently provide a discernible signal. Furthermore, recent LLMs like GPT-3.5/4 and MPT (OpenAI, 2023a; Team, 2023) have demonstrated the utilization of a wide range of vocabulary, challenging the hypothesis. Section 4 discusses our experiments on perplexity and burstiness estimation.
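The burstiness statistic above can be sketched directly. The extraction of language spans from raw text is assumed to happen upstream; the population standard deviation is an assumption, as the text does not specify which estimator is used.

```python
import statistics

def burstiness(spans):
    """Burstiness b = (sigma/m - 1) / (sigma/m + 1), bounded in [-1, 1].

    `spans` are language-span lengths (e.g., gaps between switch points).
    sigma is the (population) standard deviation and m the mean of the spans.
    """
    m = statistics.mean(spans)
    sigma = statistics.pstdev(spans)
    r = sigma / m
    return (r - 1) / (r + 1)

# Perfectly periodic spans have sigma = 0, so b = -1 (anti-bursty).
print(burstiness([5, 5, 5, 5]))  # -1.0
```

Highly irregular spans push the ratio σ/m above 1 and the score toward +1, matching the "less predictable switching" end of the interval.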
Negative Log-Curvature (NLC): DetectGPT (Mitchell et al., 2023) introduced the concept of Negative Log-Curvature (NLC) to detect AI-generated text. The hypothesis is that text generated by the model tends to lie in the negative-curvature areas of the model's log probability, i.e., a text generated by a source LLM p_θ typically lies in the areas of negative curvature of the log probability function of p_θ, unlike human-written text. In other words, we apply small perturbations to a passage x ∼ p_θ, producing x̃. Defining P^NLC_θ as the quantity log p_θ(x) − log p_θ(x̃), P^NLC_θ should be larger on average for AI-generated samples than for human-written text (see an example in Table 1 and the visual intuition of the hypothesis in Fig. 1). Expressed mathematically: E_{x∼p_θ}[P^NLC_θ(x)] − E_{x∼human}[P^NLC_θ(x)] ≥ 0. It is important to note that DetectGPT's findings were derived from text-snippet analysis, but there is potential to reevaluate this approach by examining smaller fragments, such as sentences. This would enable the calculation of averages or entropies, akin to how perplexity and burstiness are measured. Finally, the limited number of perturbation patterns per sentence in (Mitchell et al., 2023) affects the reliability of the results (cf. Section 5 for details).
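The perturbation-discrepancy computation can be sketched as follows. The scoring and perturbation functions here are toy stand-ins (assumptions for illustration), not the mask-filling model and source LLM used by DetectGPT.

```python
import statistics

def nlc_score(text, perturb, log_prob, k=20):
    """DetectGPT-style perturbation discrepancy:
    log p(x) minus the mean log-probability of k perturbed variants.
    A large positive score supports the 'AI-generated' hypothesis.
    """
    perturbed = [perturb(text) for _ in range(k)]
    return log_prob(text) - statistics.mean(log_prob(p) for p in perturbed)

# Toy stand-ins, for illustration only: score a text by (negative) length,
# and perturb by dropping its last word.
toy_log_prob = lambda t: -len(t.split())
toy_perturb = lambda t: " ".join(t.split()[:-1])
print(nlc_score("this sentence is generated by an ai", toy_perturb, toy_log_prob, k=5))
```

In practice `log_prob` would be the candidate source LLM's sequence log-likelihood and `perturb` a T5-style mask-and-fill rewrite; both are swapped out above to keep the sketch self-contained.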

Original: This sentence is generated by an AI or human
Perturbed: This writing is created by an AI or person

Table 1: An example perturbation as proposed in DetectGPT (Mitchell et al., 2023).

Stylometric variation:
Stylometry is dedicated to analyzing the linguistic style of text in order to differentiate between various writers.

➠ Introducing the AI Detectability Index (ADI) as a measure for LLMs to infer whether their generations are detectable as AI-generated or not.
➠ Conducting a thorough examination of 15 contemporary LLMs to establish the aforementioned points.
➠ Both benchmarks - CT 2 and ADI - will be published as open-source leaderboards.
➠ Curated datasets will be made publicly available.

Design Choices for CT 2 and ADI Study
This section discusses our selected LLMs and elaborates on our data generation methods. More details are provided in Appendix A.
Given that the field is ever-evolving, we admit that this process will never be complete but rather will continue to expand. Hence, we plan to keep the CT 2 benchmark leaderboard open to researchers, allowing for continuous updates and contributions.

Datasets: Generation and Statistics
To develop CT 2 and ADI, we utilize parallel data comprising both human-written and AI-generated text on the same topics. We select The New York Times (NYT) Twitter handle as our prompt source for the following reasons. Firstly, the handle comprises approximately 393K tweets that cover a variety of topics. For our work, we chose a subset of 100K tweets. Secondly, NYT is renowned for its reliability and credibility. The tweets from NYT exhibit a high level of word-craftsmanship by experienced journalists, devoid of grammatical mistakes. Thirdly, all the tweets from this source include URLs that lead to the corresponding human-written news articles. These tweets serve as prompts for the 15 LLMs, after eliminating hashtags and mentions during pre-processing. Appendix G offers the texts generated by the 15 chosen LLMs when given the prompt "AI generated text detection is not easy."

3 De-Watermarking: Discovering its Ease and Efficiency

Watermarking is typically regarded as a source-side activity. It is highly plausible that organizations engaged in the development and deployment of LLMs will progressively adopt this practice in the future. Additionally, regulatory mandates may necessitate the implementation of watermarking as an obligatory measure. The question that remains unanswered is the level of difficulty in circumventing watermarking, i.e., de-watermarking, when dealing with watermarked AI-generated text. In this section, we present our rigorous experiments that employ three methods capable of de-watermarking an AI-generated text that has been watermarked: (i) spotting high-entropy words and replacing them, (ii) paraphrasing, and (iii) paraphrasing combined with replacing high-entropy words. Table 2 showcases an instance of de-watermarking utilizing two techniques with OPT as the target LLM.

De-Watermarking by Spotting and Replacing High Entropy Words (DeW 1)

The central concept behind the text watermarking proposed by Kirchenbauer et al. (2023a) is to embed the watermark signal at high-entropy token positions. We perform a comprehensive analysis of both qualitative and quantitative aspects of automatic paraphrasing for the purpose of de-watermarking. We chose three SoTA paraphrase models: (a) Pegasus (Zhang et al., 2020), (b) T5 (Flan-t5-xxl variant) (Chung et al., 2022), and (c) GPT-3.5 (gpt-3.5-turbo-0301 variant) (Brown et al., 2020). We seek answers to the following questions: (i) What is the accuracy of the generated paraphrases? (ii) How much do they distort the original content? (iii) Are all the candidates generated by the paraphrase models successfully de-watermarked? (iv) Which paraphrase module has a greater impact on the de-watermarking process? To address these questions, we evaluate the paraphrase models; as reported in Table 3 and Table 5, our experiments provide empirical evidence suggesting that the watermarking applied to AI-generated text can be readily circumvented (cf. Appendix B).
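The spot-and-replace idea behind DeW 1 can be sketched as below. The entropy and substitution helpers are hypothetical placeholders for a masked-LM back end, and the threshold value is purely illustrative, not a tuned setting from our experiments.

```python
def de_watermark(tokens, entropy, substitute, threshold=3.0):
    """DeW 1 sketch: replace tokens whose predictive entropy exceeds a
    threshold, since watermark signals concentrate on high-entropy slots.
    `entropy` and `substitute` are assumed helpers (e.g., backed by a
    masked language model); the threshold is illustrative, not tuned.
    """
    return [substitute(t) if entropy(t) > threshold else t for t in tokens]

# Toy helpers for illustration only.
fake_entropy = {"the": 0.1, "child": 2.5, "clambered": 4.2, "up": 0.3}.get
subs = {"clambered": "climbed"}
result = de_watermark(["the", "child", "clambered", "up"],
                      lambda t: fake_entropy(t, 0.0),
                      lambda t: subs.get(t, t))
print(result)  # ['the', 'child', 'climbed', 'up']
```

A real pipeline would mask each candidate position, read the fill-in distribution from a masked LM to estimate entropy, and pick the top replacement candidate; the dictionary look-ups above merely stand in for those calls.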

Reliability of Perplexity and Burstiness as AGTD Signals
In this section, we extensively investigate the reliability of perplexity and burstiness as AGTD signals. Based on our empirical findings, it is evident that the text produced by newer LLMs is nearly indistinguishable from human-written text from a statistical perspective.
4.1 Estimating Perplexity - Human vs. AI

Perplexity is a metric utilized for computing the probability of a given sequence of words in natural language. It is computed as PPL = (∏_{i=1}^{N} p(w_i))^{−1/N}, where N represents the length of the word sequence and p(w_i) denotes the probability of the individual word w_i. As discussed previously, GPTZero (Tian, 2023) assumes that human-generated text exhibits more variation in both overall perplexity and sentence-wise perplexity as compared to AI-generated text. To evaluate the strength of this proposition, we compare text samples generated by 15 LLMs with corresponding human-generated text on the same topic. Our empirical findings indicate that larger LLMs, such as GPT-3+, closely resemble human-generated text and exhibit minimal distinctiveness. However, relatively smaller models such as XLNet, BLOOM, etc. are easily distinguishable from human-generated text. Fig. 2 demonstrates a side-by-side comparison of the overall perplexity of GPT4 and T5. We report results for 3 LLMs in Table 4 (cf. Table 22 in Appendix C for results over all 15 LLMs).
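The perplexity definition above reduces to a short computation once per-word probabilities are available; obtaining those probabilities from an actual language model is assumed to happen upstream.

```python
import math

def perplexity(word_probs):
    """PPL = exp(-(1/N) * sum(log p(w_i))), i.e. the inverse geometric
    mean of the per-word probabilities (equivalent to the product form)."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that assigns every word probability 0.1 (a uniform choice over
# 10 equally likely words) has perplexity 10.
print(round(perplexity([0.1] * 5), 6))  # 10.0
```

The log-sum form is used instead of the raw product to avoid floating-point underflow on long sequences; mathematically the two are identical.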

Estimating Burstiness -Human vs. AI
In Section 1, we discussed the hypothesis that explores the contrasting burstiness patterns between human-written and AI-generated text. Previous studies that developed AGTD techniques based on burstiness include (Rychlỳ, 2011) and (Cummins, 2017). Table 4 shows that there is little distinction in the standard deviation of burstiness scores between AI-generated and human text for OPT. However, when it comes to XLNet, the difference becomes more pronounced. From several such examples, we infer that larger and more complex LLMs yield burstiness scores similar to those of humans. Hence, we conclude that as the size or complexity of the models increases, the deviation in burstiness scores diminishes. This, in turn, reinforces our claim that perplexity or burstiness estimations cannot be considered reliable for AGTD (cf. Appendix C).
In Section 1, we discussed the NLC-based AGTD hypothesis (Mitchell et al., 2023). Our experimental results, depicted in Fig. 1, demonstrate that we are unable to corroborate the same NLC pattern for GPT4. To ensure the reliability of our experiments, we performed 20 perturbations per sentence. Fig. 1 (bottom) presents a comparative analysis of 20 perturbation patterns observed in 2,000 samples of OPT-generated text and human-written text on the same topic. Regrettably, we do not see any discernible pattern. To fortify our conclusions, we compute the standard deviation, mean, and entropy, and conduct a statistical validity test using bootstrapping, which is more appropriate for non-Gaussian distributions (Kim, 2015; Boos and Brownie, 1989). Table 22 documents the results (cf. Appendix C). Based on our experimental results, we argue that NLC is not a robust method for AGTD.
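The bootstrap validity test can be sketched as a percentile bootstrap over the score samples; the resample count and confidence level below are illustrative choices, not the exact settings used in our experiments.

```python
import random
import statistics

def bootstrap_ci(samples, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic, suitable
    for the non-Gaussian score distributions discussed above."""
    rng = random.Random(seed)
    estimates = sorted(
        stat([rng.choice(samples) for _ in samples]) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.2, 0.7]  # hypothetical NLC scores
lo, hi = bootstrap_ci(scores)
print(lo <= statistics.mean(scores) <= hi)
```

If the bootstrap intervals for AI-generated and human-written score distributions overlap heavily, the perturbation signal cannot be considered a reliable separator, which matches the conclusion drawn above.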

Stylometric Variation
Stylometry analysis is a well-studied subject (Lagutina et al., 2019; Neal et al., 2018) in which scholars have proposed a comprehensive range of lexical, syntactic, semantic, and structural characteristics for the purpose of authorship attribution. Our investigation, which differs from the study conducted by Kumarage et al. (2023), represents the first attempt to explore the stylometric variations between human-written and AI-generated text. Specifically, we treat the 15 LLMs as distinct authors, whereas text composed by humans is presumed to originate from a hypothetical 16th author. Our task involves identifying stylometric variations among these 16 authors. After examining other alternatives put forth in previous studies, such as (Tulchinskii et al., 2023), we encountered difficulties in drawing meaningful conclusions regarding the suitability of these methods for AGTD. Therefore, we focus our investigations on a specific approach that uses perplexity (as a syntactic feature) and burstiness (as a lexical-choice feature) as density functions to identify a specific LLM. By examining the range of values produced by these functions, we aim to pinpoint the specific LLM associated with a given text. The probability densities are calculated using Le Cam's lemma (Le Cam, 1986), which bounds the total variation distance between the sum of independent Bernoulli variables and a Poisson random variable with the same mean. In particular, it tells us that the sum is approximately Poisson in a specific sense (see more in Appendix E). Our experiments suggest that stylistic feature estimation may not be very distinctive, offering only broad ranges to group LLMs: (i) Detectable (80%+): T0 and T5; (ii) Hard to detect (70%+): XLNet, StableLM, and Dolly; and (iii) Impossible to detect (<50%): LLaMA, OPT, GPT, and variants. Fig. 3 offers a visual summary.
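The bound supplied by Le Cam's lemma is itself a one-line computation; the mapping from texts to Bernoulli success probabilities is assumed to happen upstream and is not shown here.

```python
def le_cam_bound(probs):
    """Le Cam's lemma: the total-variation distance between a sum of
    independent Bernoulli(p_i) variables and a Poisson variable with
    mean sum(p_i) is at most sum(p_i ** 2)."""
    return sum(p * p for p in probs)

# Many small, spread-out probabilities give a near-Poisson sum:
# 100 events of probability 0.01 yield a TV bound of only 0.01.
print(round(le_cam_bound([0.01] * 100), 6))
```

The bound shrinks as the individual probabilities shrink, which is why the Poisson approximation is informative precisely when each stylometric event is individually rare.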

AI Detectability Index (ADI)
As new LLMs continue to emerge at an accelerated pace, the usability of prevailing AGTD techniques might not endure indefinitely. To align with the ever-changing landscape of LLMs, we introduce the AI Detectability Index (ADI), which identifies the discernible range for LLMs based on SoTA AGTD techniques. The hypothesis behind this proposal is that both the LLM and AGTD-technique SoTA benchmarks can be regularly updated to adapt to the evolving landscape. Additionally, ADI serves as a litmus test to gauge whether contemporary LLMs have surpassed the ADI benchmark, thereby rendering themselves impervious to detection, or whether new methods for AI-generated text detection will require the ADI standard to be reset and re-calibrated.
Among the various paradigms of AGTD, we select perplexity and burstiness as the foundation for quantifying the ADI. We contend that NLC is a derivative function of basic perplexity and burstiness, and if there are distinguishable patterns in NLC within AI-generated text, they should be well captured by perplexity and burstiness. Fig. 3 summarizes the scores obtained using stylometry and classification methods. It is evident that the detectable LLM set is relatively small for both paradigms, while the combination of perplexity and burstiness consistently provides a stable ADI spectrum. Furthermore, we argue that both stylistic features and classification are also derived functions of basic perplexity and burstiness. ADI serves to encapsulate the overall distinguishability between AI-written and human-written text, employing the formula:

ADI = Σ_{x∈U} [δ_1(x) · (|plx(x) − µ^plx_H| − L^plx_H) + δ_2(x) · (|brsty(x) − µ^brsty_H| − L^brsty_H)]

where plx(x) and brsty(x) denote the perplexity and burstiness of a given text x. When confronted with a random input text, it is difficult to predict its resemblance to human-written text on the specific subject. Therefore, to calculate ADI we employ the mean perplexity (µ^plx_H) and burstiness (µ^brsty_H) derived from human-written text. Furthermore, to enhance the comparison between the current text and human text, Le Cam's lemma has been applied using pre-calculated values (L^plx_H and L^brsty_H) as discussed in Section 6. To assess the overall contrast, a summation is taken over all the 100K data points, denoted here by U. Lastly, comparative measures are needed to rank LLMs based on their detectability. This is achieved using multiplicative damping factors, δ_1(x) and δ_2(x), which are calculated based on µ ± rank_x × σ. Initially, we calculate the ADI for all 15 LLMs, setting δ_1(x) and δ_2(x) to 0.5. With these initial ADIs, we obtain the mean (µ) and standard deviation (σ), allowing us to recalculate the ADIs for all the LLMs. The resulting ADIs are then ranked and scaled, providing the comparative spectrum presented in Fig. 3. This scaling process is similar to Z-score normalization (Wikipedia, 2019a) and/or min-max normalization (Wikipedia, 2019b). However, damping factors are an easier option for exponential smoothing when we have only a handful of data points. Finally, for better human readability, ADI is scaled between 0 and 100.
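The final ranking-and-scaling step can be sketched as below. The exact recalculation schedule of the damping factors may differ from ours, so treat this as an illustrative min-max pass over hypothetical raw ADI values rather than the exact procedure.

```python
import statistics

def scale_adi(raw_scores):
    """Min-max scale raw ADI values to the human-readable 0-100 range
    (analogous to the normalization step described above)."""
    lo, hi = min(raw_scores), max(raw_scores)
    return [100 * (s - lo) / (hi - lo) for s in raw_scores]

def damping_factors(raw_scores, rank):
    """Recompute the damping factors from mu +/- rank * sigma over the
    initial ADI values, as sketched in the text; the exact split between
    delta_1 and delta_2 is an assumption."""
    mu = statistics.mean(raw_scores)
    sigma = statistics.pstdev(raw_scores)
    return mu + rank * sigma, mu - rank * sigma

# Three hypothetical raw ADI values mapped onto the 0-100 spectrum.
print(scale_adi([2.0, 4.0, 10.0]))  # [0.0, 25.0, 100.0]
```

In the full pipeline, an initial pass with both damping factors fixed at 0.5 supplies the mean and standard deviation, a second pass recomputes the per-LLM ADIs, and only then is the min-max scaling applied.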
The ADI spectrum reveals the presence of three distinct groups. T0 and T5 are situated within the detectable range, while XLNet, StableLM, Dolly, and Vicuna reside within the difficult-to-detect range. The remaining LLMs are deemed virtually impervious to detection through the utilization of prevailing SoTA AGTD techniques. It is conceivable that forthcoming advancements may lead to improved AGTD techniques and/or LLMs imbued with heightened human-like attributes that render them impossible to detect. Regardless of the unfolding future, ADI shall persist in serving the broader AI community and contribute to AI-related policy-making by identifying non-detectable LLMs that necessitate monitoring through policy-control measures.

Conclusion
Our proposition is that SoTA AGTD techniques exhibit fragility. We provide empirical evidence to substantiate this argument by conducting experiments on 15 different LLMs. We proposed the AI Detectability Index (ADI), a quantifiable spectrum facilitating the evaluation and ranking of LLMs according to their detectability levels. The excitement and success of LLMs have resulted in their extensive proliferation, and this trend is anticipated to persist regardless of the future course they take. In light of this, the CT 2 benchmark and the ADI will continue to play a vital role in catering to the scientific community.

A related effort is (Bommasani et al., 2023). In this study, the authors put forward a grading system consisting of 12 aspects for evaluating LLMs: (i) data sources, (ii) data governance, (iii) copyrighted data, (iv) compute, (v) energy, (vi) capabilities & limitations, (vii) risk & mitigations, (viii) evaluation, (ix) testing, (x) machine-generated content, (xi) member states, and (xii) downstream documentation. The overall grading of each LLM can be observed in Fig. 4. While this study is commendable, it appears to be inherently incomplete due to the ever-evolving nature of LLMs. Since all scores are assigned manually, any future change will require a reassessment of this rubric, whereas ADI is auto-computable. Furthermore, we propose that ADI should be considered the most suitable metric for assessing risk and mitigations.

9.1 Addressing Opposing Views by Chakraborty et al. (2023)

It is important to note that a recent study (Chakraborty et al., 2023) contradicts our findings and claims otherwise. The study postulates that, given enough sample points, whether the output was derived from a human vs. an LLM is detectable, irrespective of the LLM used for AI-generated text.
The required sample size is a function of the difference between the distributions of human text and AI text, with a smaller sample size sufficing for detection when the distributions differ significantly. However, the study does not provide empirical evidence or specify the required sample size, leaving the claim a hypothesis at this stage. Furthermore, the authors propose that employing techniques such as watermarking can change the distribution of AI text, making it more separable from the human-text distribution and thus detectable. The main drawback of this argument is that, given a single text snippet (say, an online article or a written essay), detecting whether it is AI-generated is not possible. Moreover, the proposed technique may not be compute-efficient, especially as new LLMs emerge, and the authors did not provide empirical evidence to support this hypothesis.
Limitations: This paper delves into the discussion of six primary methods for AGTD and their potential combinations. These methods include: (i) watermarking, (ii) perplexity estimation, (iii) burstiness estimation, (iv) negative log-likelihood curvature, (v) stylometric variation, and (vi) classifier-based approaches.
Our empirical research strongly indicates that the proposed methods are vulnerable to tampering or manipulation in various ways, and we provide extensive empirical evidence to support this argument. However, it is important to acknowledge that there may still exist potential deficiencies in our experiments. In this section, we explore and discuss further avenues of investigation in order to address these potential shortcomings. In the subsequent paragraphs, we outline the potential limitations associated with each of the methods we have investigated.

Watermarking
Although Kirchenbauer et al. (2023a) was the pioneering paper to introduce watermarking for AI-generated text, this research has encountered numerous criticisms since its inception. A major concern raised by several fellow researchers (Sadasivan et al., 2023) is that watermarking can be easily circumvented through machine-generated paraphrasing. In our experiments, we have presented two potential de-watermarking techniques. Subsequently, the same group of researchers published a follow-up paper (Kirchenbauer et al., 2023b) in which they asserted the development of a more advanced and robust watermarking technique. We assessed this claim as well and discovered that de-watermarking remains feasible: although the overall accuracy of de-watermarking decreased, the attack still retains considerable strength. As the paper was published on June 9th, 2023, we will include the complete experimental details in the final version of our report.
In their work, Kirchenbauer et al. (2023b) put forward improved watermarking techniques by enhancing the hashing mechanism for selecting watermarking keys and introducing more effective watermark detection techniques.They conducted extensive testing on de-watermarking possibilities, considering both machine-generated paraphrasing and human paraphrasing, and observed dilution in the strength of the watermark, which aligns with their findings.
Although paraphrasing is a powerful technique for attacking watermarked text, we argue that high-entropy-based word replacement offers a superior approach. When using high-entropy word replacements, it becomes exceedingly difficult for watermark-detection modules to identify the newly generated text, even after paraphrasing. We will now elaborate on our rationale. In their work, Kirchenbauer et al. (2023b) identify content words such as nouns, verbs, adjectives, and adverbs as suitable candidates for replacement. However, any advanced technique employed to select replacement watermark keys for these positions will result in high-entropy words. Consequently, these replacements will always remain detectable, regardless of the strength of the hashing mechanism.

Perplexity and Burstiness Estimation
Liang et al. (2023) and Chakraborty et al. (2023), among others, have shown that perplexity and burstiness are often not reliable indicators of human-written text. The fallibility of these metrics becomes especially prominent in academic writing or text generated in a low-resource language. Our experiments have also pointed towards similar findings. Moreover, in our experiments, we computed perplexity and burstiness metrics both at the overall text level and at the sentence level. It is also feasible to calculate perplexity at smaller fragment levels. Since each language model has a unique attention mechanism and span, these characteristics can potentially manifest in the generated text, making it detectable. However, determining the precise fragment size for a language model necessitates extensive experimentation, which we have not yet conducted.

Negative Log Curvature
Although discussed earlier, it is crucial to re-emphasize the significant limitations of DetectGPT (Mitchell et al., 2023). One of its major limitations is that it relies on access to the log probabilities of the texts, which necessitates the use of a specific LLM. However, it is unlikely that we would know in advance which LLM was employed to generate a particular text, and the log-likelihood calculated by different LLMs for the same text would yield significantly different results. In reality, one would need to compare the results against all available LLMs in existence, which would require a computationally expensive brute-force search.
In our experiments, we empirically demonstrate that the hypothesis of log-probability #2 < log-probability #1 can be easily manipulated using simple [MASK]-based post-fixing techniques.

Stylometric Variation
In this experiment, we made the simplifying assumption that all the human-written text was authored by a single individual, which is certainly not reflective of reality; texts composed by different authors inevitably leave behind their unique traces and characteristics. Furthermore, a recent paper by Tulchinskii et al. (2023) introduced the concept of intrinsic dimensionality estimation, which can be described as a stylometric analysis. However, this paper is currently available only on arXiv and lacks an implemented solution. We are working on replicating the theory and evaluating the robustness of the approach.

Classifier-based Approaches
Numerous classifiers have been proposed in the literature (Zellers et al., 2020; Gehrmann et al., 2019; Solaiman et al., 2019). However, the majority of these classifiers are specifically created to identify instances generated by individual models. They achieve this either by utilizing the model itself (as demonstrated by Mitchell et al. (2023)) or by training on a dataset consisting of the generated samples from that particular model. For example, the RoBERTa-Large-Detector developed by OpenAI (OpenAI, 2023b) is trained or fine-tuned specifically for the binary classification task, using datasets that consist of both human-generated and AI-generated texts. Consequently, the ability of such detectors to effectively classify data from new models and unfamiliar domains is severely limited.

Ethical Considerations
Our experiments show the limitations of AGTD methods and how to bypass them.We develop ADI with the hope that it could be used for guiding further research and policies.However, it can be misused by bad actors for creating AI-generated text, particularly fake news, that cannot be distinguished from human-written text.We strongly advise against such use of our work.

Frequently Asked Questions (FAQs)
✽ How do we envision ADI being used to influence LLM development, policy making, etc.?
➠ The LLM has achieved the status of the holy grail in the field of AI. Its widespread adoption has been influenced by the success stories of ChatGPT, reaching various domains. As new LLMs continue to emerge regularly, there is a strong belief that future iterations will be even more powerful. Consequently, advanced AGTD techniques will be proposed to address these advancements. Regardless of the future landscape, the ADI will persist as a crucial tool for the scientific community and policymakers to assess the detectability of LLMs within their range.

✽ For de-watermarking, why do you use a brute-force algorithm to choose a winning pair? Isn't it inefficient?
➠ Our objective was to demonstrate the successful de-watermarking capability of a combination of open-source models.Currently, the combination of albert-large-v2 and distilroberta-base has shown the most promising performance among all the LLMs.However, determining the most suitable combination for a text encountered in real-world scenarios poses a challenge.Exploring more efficient and scalable approaches to identify the optimal pair in such cases is an area that requires further investigation in future work.
✽ For Stylometric analysis, the entire human-generated corpus was treated as if written by a single author.Won't that lead to noisy analysis?
➠ Indeed, we made a simplifying presumption, but it opened up further possibilities.
✽ Why did you compare only six methods?
➠ We covered some of the most popular methods. It is possible, though highly unlikely, that there are other contemporary methods that we did not try which are also very effective for AGTD.
✽ How robust is ADI? Would it be possible that ADI shows a model to be undetectable but one of the AGTD methods can still detect it?
➠ Of the methods we considered, it is unlikely that any would be effective for models with a high ADI, as shown by our experiments and results. As LLMs get more advanced, we assume that the current AGTD methods will become even more unreliable. With that in mind, ADI will remain a spectrum for judging which LLMs are detectable and which are not. Please refer to Appendix F for more discussion.

Appendix
This section provides supplementary material in the form of additional examples, implementation details, etc. to bolster the reader's understanding of the concepts presented in this work.

A LLM Selection Criteria
Beyond the primary criterion of choosing performant LLMs, our selection was meant to cover a wide gamut of LLMs that utilize a repertoire of recent techniques under the hood that have enabled their exceptional capabilities, namely: FlashAttention (Dao et al., 2022) for memory-efficient exact attention, Multi-Query Attention (Shazeer, 2019) for memory-bandwidth efficiency, SwiGLU (Shazeer, 2020) as the activation function instead of ReLU (Agarap, 2019), ALiBi (Press et al., 2022) for larger context width, RMSNorm (Zhang and Sennrich, 2019) for pre-normalization, RoPE (Su et al., 2021) to improve the expressivity of positional embeddings, etc.
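As one concrete example of the listed techniques, SwiGLU can be sketched for a single output unit; biases and the usual matrix shapes are omitted, so this is a simplified illustration rather than a production implementation.

```python
import math

def swish(x):
    """Swish/SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu(x, w, v):
    """SwiGLU for one output unit: Swish(x . w) * (x . v).
    `w` parametrizes the gate and `v` the linear path; in a real FFN
    layer these are columns of two weight matrices."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    return swish(dot(x, w)) * dot(x, v)

# A zero gate suppresses the linear path entirely.
print(swiglu([1.0, 2.0], [0.0, 0.0], [1.0, 1.0]))  # 0.0
```

The gating product is what distinguishes SwiGLU from a plain ReLU feed-forward layer: the network learns not only a transformation but also how much of it to let through.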

B De-Watermarking
As also shown by Krishna et al. (2023), watermarked texts can be relatively easily de-watermarked. Even with the implementation of the newer, more robust watermarking scheme presented by Kirchenbauer et al. (2023b), we were still able to circumvent the watermarks to a significant extent. Here we discuss the methods in detail, concluding with Table 21, which shows de-watermarking accuracies across 15 LLMs after paraphrasing.

B.1 De-watermarking by spotting high entropy words and replacing them
The watermarking scheme primarily embeds its signal in high entropy words; thus, if a text has been watermarked, it is these words that carry the watermark. Our de-watermarking strategy is therefore to spot the high entropy words and replace each with an alternative word from the vocabulary.
What are high entropy words? High entropy words are words that are less predictable and occur less frequently in a corpus. These words have a higher degree of randomness and uncertainty and thus pose a challenge for LLMs, because they require a greater amount of training for accurate prediction. High entropy words can include domain-specific jargon or technical terms. Based on the observed patterns and frequencies in the training data, language models assign probabilities to words. Words with high entropy tend to have lower probabilities because they are less common or have more diverse contextual usage. These words are frequently uncommon or specialized terms, uncommon proper nouns, or words that are highly topic- or domain-specific. An example of such a high entropy word used in a sentence is: "The adventurous child clambered up the gnarled tree, seeking the thrill of climbing to its lofty branches." Here, "gnarled" is a high entropy word. It describes something that is twisted, rough, or knotted, typically referring to tree branches or old, weathered objects. In different language models, alternative words that might occur instead of "gnarled" could be "twisted," "knotty," or "weathered." These alternatives convey a similar meaning with more commonly used vocabulary. For instance, consider a masked input sentence: "Paris is the [MASK] of France." In this scenario, an LLM might predict candidate words with corresponding probabilities as follows: (i) "capital" [0.99], (ii) "city" [0.0], (iii) "metropolis" [0.0]. Here, the LLM demonstrates a high level of certainty that "capital" fills the mask. Now consider another sentence: "I saw a [MASK] last night." The LLM's predicted candidate words and their corresponding probabilities are: (i) "ghost" [0.096], (ii) "UFO" [0.083], (iii) "vampire" [0.045]. In this case, the LLM exhibits uncertainty in choosing an appropriate candidate word.
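To make the entropy intuition concrete, the following sketch scores toy next-word distributions like the two [MASK] examples above. The probabilities, the "other" mass used to normalize the uncertain case, and the threshold are illustrative assumptions, not real model outputs.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Toy next-word distributions, as a masked LM might assign them.
confident = {"capital": 0.99, "city": 0.005, "metropolis": 0.005}
uncertain = {"ghost": 0.096, "UFO": 0.083, "vampire": 0.045, "other": 0.776}

def is_high_entropy(dist, threshold=1.0):
    """Flag a position as high entropy, i.e., a watermarking candidate."""
    return entropy(dist) > threshold

print(entropy(confident))  # close to 0: the model is certain
print(entropy(uncertain))  # noticeably higher: the model is uncertain
```

Positions flagged by `is_high_entropy` are the ones a de-watermarking pass would target for word replacement.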

B.2 De-watermarking on 14 LLMs
Here we present the performance evaluation of all model combinations for the remaining 14 LLMs. The "Pre" column shows accuracy scores for text that was successfully de-watermarked without any paraphrasing. The "Post" column shows accuracy scores for text that was not successfully de-watermarked in the initial attempt but could be de-watermarked more successfully after paraphrasing was applied.

B.3 De-watermarking by paraphrasing
A recent paper (Krishna et al., 2023) introduces the DIPPER paraphrasing technique and shows how it can easily bypass watermarking; their strategy reduces the detection accuracy of the watermark detector to a certain extent, though it cannot fully de-watermark all texts. Another paper (Sadasivan et al., 2023) also uses the DIPPER paraphrasing technique, in a slightly modified version that paraphrases multiple sentences in parallel. Krishna et al. (2023) further propose a way to counter paraphrasing so that, even after paraphrasing, a detector can tell whether the text is in fact AI-generated. This technique, named Retrieval, uses the semantic sequence to detect AI-generated text even after paraphrasing.
Both of these papers also discuss negative log-likelihood and perplexity scores, with experiments on GPT and OPT models. We use paraphrasing as yet another technique to remove watermarks from LLM outputs: (i) feed the textual input to a paraphraser model such as Pegasus, T5, or GPT-3.5 and evaluate watermarking on the paraphrased text; (ii) replace the high entropy words, which are likely to be the watermarked tokens, and then paraphrase the text to ensure that the watermarks have been eliminated.
For a given text input, we generate multiple paraphrases using various SoTA models. In choosing the appropriate paraphrase model from a list of available models, the primary question we asked is how to ensure that the generated paraphrases are rich in diversity while still being linguistically correct. We achieved this as follows. Say we have a claim c. We generate n paraphrases using a paraphrasing model, yielding a set p^c_1, ..., p^c_n. Next, we make pair-wise comparisons of these paraphrases with c, resulting in the pairs (c, p^c_1), ..., (c, p^c_n). At this step, we identify the examples that are entailed, and only those are chosen. For the entailment task, we utilized RoBERTa Large (Liu et al., 2019), a SoTA model trained on the SNLI task (Bowman et al., 2015).
Based on empirical observations, we concluded that GPT-3.5 outperformed all the other models. To offer transparency about our experimental process, we detail the aforementioned evaluation dimensions as follows.
Coverage - number of considerable paraphrase generations: We generate up to 5 paraphrases per given claim. For all the generated paraphrases, we compute a minimum edit distance (MED) (Wagner and Fischer, 1974) in which the units are words instead of characters. If the MED between a paraphrase candidate (e.g., p^c_1) and the claim is greater than 2, we consider that paraphrase further; otherwise it is discarded. We evaluated all three models on which one generates the maximum number of considerable paraphrases. Correctness - correctness of those generations: After this first level of filtration, we performed pairwise entailment and kept only those paraphrase candidates marked as entailed by RoBERTa Large (Liu et al., 2019), a SoTA model trained on SNLI (Bowman et al., 2015).
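The word-level MED filter described above can be sketched as follows. The function names and example sentences are our own; the threshold of 2 mirrors the criterion in the text.

```python
def word_edit_distance(a, b):
    """Wagner-Fischer minimum edit distance with words as units."""
    aw, bw = a.split(), b.split()
    m, n = len(aw), len(bw)
    # dp[i][j] = edits to turn the first i words of a into the first j words of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if aw[i - 1] == bw[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def is_considerable(claim, paraphrase, threshold=2):
    """Keep a paraphrase only if it differs from the claim by more than `threshold` words."""
    return word_edit_distance(claim, paraphrase) > threshold

claim = "the quick brown fox jumps over the lazy dog"
p1 = "the quick brown fox jumps over the lazy dog"      # identical: discarded
p2 = "a swift brown fox leaps over the sleepy old dog"  # diverse: kept
```

Candidates passing this coverage filter then go through the entailment-based correctness check.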
Diversity - linguistic diversity in those generations: We were interested in choosing the model that produces the most linguistically diverse paraphrases. We therefore check the dissimilarities between generated paraphrase claims: e.g., c - p^c_n, p^c_1 - p^c_n, p^c_2 - p^c_n, ..., p^c_{n-1} - p^c_n, repeating this process for all the other paraphrases and averaging the dissimilarity scores. Since there is no standard metric for dissimilarity, we use the inverse of the BLEU score (Papineni et al., 2002). This gives us an understanding of how much linguistic diversity is produced by a given model. Based on these experiments, we found that gpt-3.5-turbo-0301 performed the best. The results of the experiment are reported in the following table. Furthermore, since we preferred a model that maximizes linguistic variation, gpt-3.5-turbo-0301 performs best on this parameter of choice as well. A plot of diversity vs. all the chosen models is reported in Fig. 5.
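A minimal sketch of the inverse-BLEU diversity check follows. For brevity we use clipped unigram precision as a stand-in for the full BLEU score of Papineni et al. (2002), so the numbers are only illustrative.

```python
from collections import Counter

def unigram_bleu(candidate, reference):
    """Clipped unigram precision, a simplified stand-in for full BLEU."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand) if cand else 0.0

def diversity(paraphrases):
    """Average pairwise dissimilarity, taken as 1 - (simplified) BLEU."""
    scores = []
    for i in range(len(paraphrases)):
        for j in range(i + 1, len(paraphrases)):
            scores.append(1.0 - unigram_bleu(paraphrases[i], paraphrases[j]))
    return sum(scores) / len(scores)

near_duplicates = ["the cat sat on the mat", "the cat sat on a mat"]
varied = ["the cat sat on the mat", "a feline rested upon the rug"]
```

A paraphraser whose outputs score higher under `diversity` produces more linguistic variation, which is the property used to rank the candidate models.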
Table 21 provides a summary of the effectiveness of the three paraphrasing methods for de-watermarking. Among them, the GPT-3.5-based method demonstrated the highest performance. Additionally, it is worth noting that the de-watermarking accuracy for w v2, the watermarking technique proposed in Kirchenbauer et al. (2023b), showed a slight decrease compared to w v1, the watermarking technique proposed in Kirchenbauer et al. (2023a).

C Perplexity and Burstiness Estimation
We have conducted an analysis to determine the perplexity and burstiness of each LLM, as well as to calculate sentence-wise entropy. To evaluate the statistical significance of our findings, we employed the bootstrap method. Results of these experiments on all 15 models are reported in Table 22. Why entropy to estimate sentence-level perplexity differences: The hypothesis assumes that AI-generated text displays a higher frequency of clusters or bursts of similar words or phrases within shorter sections of the text. In contrast, humans exhibit broader variation in their lexical choices, showcasing a more extensive range of vocabulary. Moreover, human-written text shows more sentence-wise variety in length and structure than AI-generated text. To measure this, we utilize entropy. The entropy −∑_i p_i log p_i of a random variable is its average level of surprise, or uncertainty.
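The sentence-wise variation signal can be sketched as follows. The perplexity values are toy numbers (in practice they would come from an LLM's token log-probabilities), and we use the mean absolute successive difference as a simple proxy for the entropy-style estimate described above.

```python
def successive_variation(perplexities):
    """Mean absolute difference between consecutive sentence perplexities,
    a simple proxy for the entropy-style burstiness signal."""
    diffs = [abs(a - b) for a, b in zip(perplexities, perplexities[1:])]
    return sum(diffs) / len(diffs)

# Toy sentence-wise perplexities: by hypothesis, humans vary more than LLMs.
human_ppl = [42.0, 95.0, 18.0, 70.0, 33.0]
ai_ppl = [21.0, 24.0, 19.0, 22.0, 20.0]
```

Under the stated hypothesis, a detector would expect `successive_variation(human_ppl)` to exceed `successive_variation(ai_ppl)`.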
Brief on the bootstrap method: Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples, as illustrated in Fig. 6. This process allows for the calculation of standard errors, confidence intervals, and hypothesis tests. Bootstrapping is an extremely useful alternative to the traditional method of hypothesis testing, as it is fairly simple and mitigates some of the pitfalls of the traditional approach. As in the traditional approach, a sample of size n is drawn from the population; call this sample S. Then, rather than using theory to determine all possible estimates, the sampling distribution is created by resampling observations with replacement from S, m times, with each resampled set having n observations. If sampled appropriately, S should be representative of the population; therefore, by resampling S m times with replacement, it is as if m samples were drawn from the original population, and the derived estimates are representative of the theoretical distribution under the traditional approach. Note that increasing the number of resamples, m, does not increase the amount of information in the data: resampling the original set 100,000 times is not more useful than resampling it 1,000 times. The amount of information in the set depends on the sample size, n, which remains constant across resamples. The benefit of more resamples, then, is a better estimate of the sampling distribution. The traditional procedure requires a test statistic that satisfies particular assumptions in order to achieve valid results, and this is largely dependent on the experimental design; theory dictates what the sampling distribution should look like, but the results fall apart if those assumptions are not met. The bootstrapping method, on the other hand, takes the original sample data and resamples it to create many simulated samples. It does not rely on that theory, since the sampling distribution can simply be observed, and one does not have to worry about the assumptions. This technique allows for accurate estimates of statistics, which is crucial when using data to make decisions.
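The resampling procedure can be sketched as follows. The sample values, the resample count m, and the percentile-interval construction are illustrative choices, not the exact configuration used in our experiments.

```python
import random

def bootstrap_ci(sample, m=2000, alpha=0.05, seed=0):
    """Bootstrap a (1 - alpha) confidence interval for the mean:
    resample with replacement m times, each resample of size n = len(sample),
    then take percentiles of the resampled means."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(
        sum(rng.choice(sample) for _ in range(n)) / n
        for _ in range(m)
    )
    lo = means[int((alpha / 2) * m)]
    hi = means[int((1 - alpha / 2) * m) - 1]
    return lo, hi

sample = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]  # toy observations
lo, hi = bootstrap_ci(sample)
```

The interval (lo, hi) brackets the sample mean; in our setting the "observations" are per-text perplexity, burstiness, or NLC statistics.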

C.2 Plots for 15 LLMs across the ADI spectrum
Here we present the histogram plots and negative log-curvature line plots for all 15 LLMs. Arranged along the ADI spectrum, it is evident that higher-ADI models come much closer to generating human-like text than models that fall lower on the spectrum.
D Negative Log-Curvature (NLC)
DetectGPT (Mitchell et al., 2023) utilizes the generation of log-probabilities for textual analysis. It leverages the difference in perturbation discrepancies between machine-generated and human-written text to detect the origin of a given piece of text. When a language model produces text, each individual token is assigned a conditional probability based on the preceding tokens. These conditional probabilities are multiplied together to derive the joint probability of the entire text. To determine a text's origin, DetectGPT introduces perturbations. If the probability of the perturbed text decreases significantly compared to the original text, the text is deemed AI-generated. Conversely, if the probability remains roughly the same, the text is considered human-generated.
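DetectGPT's decision rule can be sketched as follows. Both the scoring function and the perturbation function here are toy stand-ins; a real implementation would use an LLM's log-probabilities and a mask-filling model such as T5 for perturbations.

```python
def perturbation_discrepancy(log_prob, perturb, text, n=10):
    """d(x) = log p(x) - mean_i log p(perturb_i(x)).
    A large positive d suggests AI-generated text (Mitchell et al., 2023)."""
    perturbed_scores = [log_prob(perturb(text, i)) for i in range(n)]
    return log_prob(text) - sum(perturbed_scores) / n

# Toy stand-ins: the "model" prefers the exact phrase it would have generated,
# and the perturbation swaps in a synonym at a rotating position.
PREFERRED = "the capital of france is paris".split()
SYNONYMS = {"capital": "center", "france": "gaul", "paris": "lutetia"}

def toy_log_prob(text):
    # Penalize every word outside the model's "preferred" vocabulary.
    return -sum(1.0 for w in text.split() if w not in PREFERRED)

def toy_perturb(text, i):
    words = text.split()
    j = i % len(words)
    words[j] = SYNONYMS.get(words[j], words[j])
    return " ".join(words)

ai_like = "the capital of france is paris"
d = perturbation_discrepancy(toy_log_prob, toy_perturb, ai_like)  # positive
```

Because the toy scorer sits at a local probability maximum for `ai_like`, every perturbation lowers the score and `d` comes out positive, mirroring the NLC hypothesis.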
The hypothesis put forward by Mitchell et al. (2023) suggests that the perturbation patterns of AI-written text should align with the negative log-likelihood region. However, this observation is not supported by the results presented here. To strengthen our conclusions, we calculated the standard deviation, mean, and entropy, and performed a statistical validity test in the form of a p-test. The findings are reported in Table 22.

E Stylometric variation
The field of stylometric analysis has been extensively researched, with scholars proposing a wide range of lexical, syntactic, semantic, and structural features for authorship attribution. In our study, we employed Le Cam's lemma (Le Cam, 1986) as a perplexity density estimation method. However, several alternative approaches could be suggested, such as kernel density estimation (Wikipedia_KDE), mean integrated squared error (Wikipedia_MISE), kernel embedding of distributions (Wikipedia_KED), and spectral density estimation (Wikipedia_SDE). While we have not extensively explored these variations in the current study, we are interested in investigating them in future research.
Our experiment yielded intriguing results. Given that our stylometric analysis is solely based on density functions, we posed the question: what would happen if we learned the search density for one LLM and applied it to another? To explore this, we generated a relational matrix, as depicted in Fig. 7. As previously described and illustrated in Fig. 4, the LLMs can be classified into three groups: (i) easily detectable, (ii) hard to detect, and (iii) not detectable. Fig. 7 demonstrates that Le Cam's lemma learned for one LLM is only applicable to other LLMs within the same group. For instance, the lemma learned from GPT-4 can be successfully applied to GPT-3.5, OPT, and GPT-3, but not beyond that. Similarly, Vicuna, StableLM, and LLaMA form the second group.

F AI Detectability Index (ADI) -other possible variations
In our previous discussions, we advocated for perplexity and burstiness as the fundamental metrics for quantifying ADI in the context of various AGTD paradigms. However, it is important to acknowledge that alternative features, such as stylistics, can also be employed to calculate the ADI. For instance, if we consider stylistic features like syntactic variation (L_H^syn) and lexical variation (L_H^lex), the ADI can be reformulated with: P_syn = (1/U) ∑_{u=1}^{U} |log p_u^i(syn) − log p_u^{i+1}(syn)| and P_lex = (1/U) ∑_{u=1}^{U} |log p_u^i(lex) − log p_u^{i+1}(lex)|. Similarly, other potential features, such as NLC, and any novel features that may be proposed in the future could also be incorporated within the framework of ADI.
We evaluate the paraphrase modules based on three key dimensions: (i) Coverage: number of considerable paraphrase generations, (ii) Correctness: correctness of the generations, (iii) Diversity: linguistic diversity in the generations. Our experiments showed that GPT-3.5 (the gpt-3.5-turbo-0301 variant) is the most suitable paraphraser. Please see details of the experiments in Appendix B.3. Key Findings from De-Watermarking Experiments: As shown in Table

Figure 4 :
Figure 4: Grading of current LLMs as proposed by a report entitled Do Foundation Model Providers Comply with the EU AI Act? from Stanford University (Bommasani et al., 2023).

Figure 5 :
Figure 5: A higher diversity score depicts an increase in the number of generated paraphrases and linguistic variations in those generated paraphrases.

Figure 6 :
Figure 6: An illustration of Bootstrapping method -how it creates simulated samples.

Table 2 :
De-watermarking by replacing high-entropy words and paraphrasing. The p-value is the probability under the assumption of the null hypothesis. The z-score indicates the normalized log probability of the original text, obtained by subtracting the mean log probability of the perturbed texts and dividing by the standard deviation of log probabilities of the perturbed texts. DetectGPT (Mitchell et al., 2023) classifies text as generated by GPT-2 if the z-score is greater than 4.

Table 3 :
Kirchenbauer et al. (2023a) encompassed 16 combinations for de-watermarking OPT-generated watermarked text. The accuracy scores for successfully de-watermarked text using the entropy-based word replacement technique are presented in the DeW 1 columns. The accuracy scores in the DeW 2 columns reflect the application of automatic paraphrasing after entropy-based word replacement. The techniques proposed in Kirchenbauer et al. (2023a) are denoted as w v1, while the techniques proposed in their subsequent work, Kirchenbauer et al. (2023b), are represented as w v2.

Table 4 :
Perplexity, burstiness, and NLC values for 3 LLMs across the ADI spectrum along with statistical measures.
Figure 3: ADI gamut for a diverse set of 15 LLMs, illustrating the detectable and non-detectable sets of LLMs based on ADI.

Table 6 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking LLaMA generated watermarked text.

Table 7 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking Alpaca generated watermarked text.

Table 8 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking BLOOM generated watermarked text.

Table 9 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking StableLM generated watermarked text.

Table 10 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking Dolly generated watermarked text.

Table 11 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking T5 generated watermarked text.

Table 12 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking Vicuna generated watermarked text.

Table 13 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking T0 generated watermarked text.

Table 14 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking XLNet generated watermarked text.

Table 15 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking MPT generated watermarked text.

Table 16 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking GPT2 generated watermarked text.

Table 17 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking GPT3 generated watermarked text.

Table 18 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking GPT3.5 generated watermarked text.

Table 19 :
Performance evaluation of 16 combinations of 4 masking-based models for de-watermarking GPT4 generated watermarked text.

C.1 Reliability of Perplexity, Burstiness and NLC as AGT Signals for all LLMs
Here we present the complete table showing results after performing experiments on perplexity estimation (Section 4.1), burstiness estimation (Section 4.2), and NLC (Section 5) over all 15 LLMs.

Table 22 :
Comprehensive table for all 15 LLMs with statistical measures for Perplexity, Burstiness, and NLC, along with bootstrap p-values (α = 0.05); values greater than the chosen alpha level indicate non-significance.