SeqXGPT: Sentence-Level AI-Generated Text Detection

Widely applied large language models (LLMs) can generate human-like content, raising concerns about the abuse of LLMs. Therefore, it is important to build strong AI-generated text (AIGT) detectors. Current works only consider document-level AIGT detection. In this paper, we therefore first introduce a sentence-level detection challenge by synthesizing a dataset of documents polished with LLMs, that is, documents that contain both sentences written by humans and sentences modified by LLMs. We then propose Sequence X (Check) GPT (SeqXGPT), a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection. These features are composed like waves in speech processing and cannot be studied by LLMs. Therefore, we build SeqXGPT based on convolution and self-attention networks. We test it in both sentence and document-level detection challenges. Experimental results show that previous methods struggle to solve sentence-level AIGT detection, while our method not only significantly surpasses baseline methods in both sentence and document-level detection challenges but also exhibits strong generalization capabilities.


Introduction
With the rapid growth of Large Language Models (LLMs) exemplified by PaLM, ChatGPT and GPT-4 (Chowdhery et al., 2022; OpenAI, 2022, 2023; Zhang et al., 2023a), which are derived from pre-trained models (PTMs) (Devlin et al., 2018; Radford et al., 2019; Lewis et al., 2019; Raffel et al., 2020), AI-generated texts (AIGT) can be seen almost everywhere. Therefore, avoiding the abuse of AIGT is a significant challenge that requires great effort from NLP researchers. One effective solution is AIGT detection, which discriminates whether a given text is written or modified by an AI system (Mitchell et al., 2023; Li et al., 2023).
Figure 1: Process of Sentence-Level AIGT Detection Challenge. In this figure, only the first sentence of the candidate document is human-created, while the others are generated by AI (GPT-3.5-turbo). The deeper the color, the higher the likelihood that this part is AI-generated. The example demonstrates that SeqXGPT can efficiently conduct sentence-level AIGT detection.
Current AIGT detection strategies, such as supervised-learned discriminators and perplexity-based methods (Mitchell et al., 2023; Li et al., 2023), focus on discriminating whether a whole document is generated by an AI. However, users often modify partial texts with LLMs rather than put full trust in LLMs to generate a whole document. Therefore, it is important to explore fine-grained (e.g., sentence-level) AIGT detection.
Building methods that solve the sentence-level AIGT detection challenge is not an incremental modification over document-level AIGT detection. On the one hand, model-wise methods like DetectGPT and Sniffer require a rather long document as input (over 100 tokens), making them less effective at recognizing short sentences. On the other hand, supervised methods, such as RoBERTa fine-tuning, are prone to overfitting on the training data, as highlighted by Mitchell et al. (2023) and Li et al. (2023). Therefore, it is necessary to build sentence-level detectors that account for the limitations of current AIGT methods.
In this paper, we study the sentence-level AIGT detection challenge. We first build a sentence-level AIGT detection dataset that contains documents with both human-written sentences and AI-generated sentences, which are more likely to occur in real-world AI-assisted writing. Further, we introduce SeqXGPT, a strong approach to solve the proposed challenge. Since previous works (Solaiman et al., 2019a; Mitchell et al., 2023; Li et al., 2023) show that perplexity is a significant feature for AIGT detection, we extract word-wise log probability lists from white-box models as our foundational features, as seen in Figure 1. These temporal features are composed like waves in speech processing, where convolution networks are often used (Baevski et al., 2020; Zhang et al., 2023b). Therefore, we also use a convolution network, followed by self-attention (Vaswani et al., 2017) layers, to process the wave-like features. We employ a sequence labeling approach to train SeqXGPT, and select the most frequent word-wise category as the sentence category. In this way, we are able to analyze the fine-grained sentence-level text provenance.
We conduct experiments based on the synthesized dataset and test modified existing methods as well as our proposed SeqXGPT accordingly. Experimental results show that existing methods, such as DetectGPT and Sniffer, fail to solve sentence-level AIGT detection. Our proposed SeqXGPT not only obtains promising results in both sentence and document-level AIGT detection challenges, but also exhibits excellent generalization on out-of-distribution datasets.

Background
The proliferation of Large Language Models (LLMs) such as GPT-4 has given rise to increased concerns about the potential misuse of AI-generated texts (AIGT) (Brown et al., 2020; Zhang et al., 2022; Scao et al., 2022). This emphasizes the necessity for robust detection mechanisms to ensure the security and trustworthiness of LLM applications.

Principal Approaches to AIGT Detection
Broadly, AIGT detection methodologies fall into two primary categories:
1. Supervised Learning: This involves training models on labeled datasets to distinguish between human and AI-generated texts.

Detection Task Formulations
The AIGT detection tasks can be formulated in different setups:
1. Particular-Model Binary AIGT Detection: In this setup, the objective is to discriminate whether a text was produced by a specific known AI model or by a human. Both GPTZero and DetectGPT fall into this category.
2. Mixed-Model Binary AIGT Detection: Here, the detectors are designed to identify AI-generated content without the need to pinpoint the exact model of origin.
3. Mixed-Model Multiclass AIGT Detection: This is a more difficult task, aiming not only to detect AIGT but also to identify the exact AI model behind it. Sniffer (Li et al., 2023) is a prominent model in this category.

Related Fields of Study
It's worth noting that AIGT detection has parallels with other security-centric studies such as watermarking (Kurita et al., 2020; Li et al., 2021; Kirchenbauer et al., 2023), harmful AI (Bai et al., 2022), and adversarial attacks (Jin et al., 2019; Li et al., 2020). These studies, together with AIGT detection, are indispensable for utilizing and safeguarding these advanced technologies and are becoming more important in the era of LLMs.

Sentence-Level Detection
The document-level detection methods struggle to accurately evaluate documents containing a mix of AI-generated and human-authored content or consisting of few sentences, which can potentially lead to a higher rate of false negatives or false positives.
In response to this limitation, we aim to study sentence-level AIGT detection. Sentence-level AIGT detection offers a fine-grained text analysis compared to document-level AIGT detection, as it explores each sentence within the entire text. By addressing this challenge, we could significantly reduce the risk of misidentification and achieve both higher detection accuracy and finer detection granularity than document-level detection.
Similarly, we study sentence-level detection challenges as follows:

Dataset Construction
Current AIGT detection datasets are mainly designed for document-level detection. Therefore, we synthesize a sentence-level detection dataset for the study of fine-grained AIGT detection. Since the human-authored documents from SnifferBench cover various domains including news articles from XSum (Narayan et al., 2018), social media posts from IMDB (Maas et al., 2011), web texts (Radford et al., 2019), scientific articles from PubMed and Arxiv (Cohan et al., 2018), and technical documentation from SQuAD (Rajpurkar et al., 2016), we utilize these documents to construct content that includes both human-authored and AI-generated sentences. To ensure diversity while preserving coherence, we randomly select the first one to three sentences from the human-generated documents as the prompt for subsequent AI sentence generation.
Specifically, when generating the AI component of a document using language models such as GPT-2, we utilize the prompt obtained via the above process to get the corresponding AI-generated sentences. In the case of instruction-tuned models like GPT-3.5-turbo, we provide a specific instruction to ensure the generation. We show instruction details in Appendix D.
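The prompt construction described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the dataset: the sentence splitter is a naive regular expression, and the function names `split_sentences`, `make_prompt` and `build_request` are hypothetical. Only the instruction string is taken from Appendix D.

```python
import random
import re

# Instruction quoted from Appendix D; applied only to instruction-tuned models.
INSTRUCTION = ("Please provide a continuation for the following content "
               "to make it coherent: {prompt}")


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (an assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def make_prompt(document, rng):
    """Keep the first one to three sentences of a human-written document."""
    sentences = split_sentences(document)
    k = min(rng.randint(1, 3), len(sentences))
    return " ".join(sentences[:k])


def build_request(prompt, instruction_tuned):
    """Plain continuation models get the raw prompt; instruction-tuned
    models get the prompt wrapped in the instruction."""
    return INSTRUCTION.format(prompt=prompt) if instruction_tuned else prompt


if __name__ == "__main__":
    rng = random.Random(0)
    doc = "A one. B two. C three. D four. E five."
    prompt = make_prompt(doc, rng)
    print(build_request(prompt, instruction_tuned=True))
```

The human-written prompt sentences keep the label "human", and whatever the model appends after them forms the AI-generated part of the document.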
Regarding open-source models, we gather the AI-generated portions of a document from GPT-2 (Radford et al., 2019), GPT-J (Wang and Komatsuzaki, 2021), GPT-Neo (Black et al., 2022), and LLaMA (Touvron et al., 2023). For models which provide only partial access, we collect data from GPT-3.5-turbo, an instruction-tuned model. With a collection of 6,000 human-written documents from SnifferBench, we similarly assemble 6,000 documents containing both human-written and AI-generated sentences from each of the models. This results in a total of 30,000 documents. Accordingly, we have:

Particular-Model Binary Detection Dataset
To contrast with DetectGPT and the log p(x) method, we construct the dataset using documents that consist of sentences generated by a particular model or by humans. For each document, sentences generated by the particular model are labeled as "AI", while sentences created by humans are labeled as "human". Using the collected data, we can construct five binary detection datasets in total.

Mixed-Model Binary Detection Dataset
We utilize all the collected data to construct a dataset intended for mixed-model binary detection research. For each document, we label sentences generated by any model as "AI", and sentences created by humans as "human".

Mixed-Model Multiclass Detection Dataset
For multiclass detection research, we again use the entire data. Unlike the binary detection dataset that merely differentiates between human-created and AI-generated sentences, we use more specific labels such as "GPT2", "GPTNeo", and so forth, instead of the generic label "AI". This allows us to further distinguish among a variety of AI models.
For each dataset, we divide the documents into a train/test split with a 90%/10% partition. We name the collected datasets SeqXGPT-Bench, which can be further used for fine-grained AIGT detection.
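To make the labeling schemes and the split concrete, here is a small sketch under stated assumptions. The helper names `label_sentence` and `train_test_split`, the shuffling step and the seed are illustrative choices, not details given in the text. The particular-model datasets reuse the binary scheme, restricted to documents from one model.

```python
import random


def label_sentence(source, scheme):
    """Map a sentence's origin to its label.

    source: "human" or a model name such as "GPT2" or "GPTNeo".
    scheme: "binary" (particular-model and mixed-model binary datasets)
            or "multiclass" (mixed-model multiclass dataset).
    """
    if source == "human":
        return "human"
    if scheme == "binary":
        return "AI"
    if scheme == "multiclass":
        return source
    raise ValueError(f"unknown scheme: {scheme}")


def train_test_split(documents, test_ratio=0.1, seed=0):
    """Shuffle documents and hold out a fraction (10% here) for testing."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # shuffling is an assumption
    n_test = round(len(docs) * test_ratio)
    return docs[n_test:], docs[:n_test]


if __name__ == "__main__":
    train, test = train_test_split(range(6000))
    print(len(train), len(test))
    print(label_sentence("GPTNeo", "binary"), label_sentence("GPTNeo", "multiclass"))
```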

SeqXGPT
To solve the sentence-level detection challenge, we introduce a strong method SeqXGPT:

Strategies for Sentence-Level Detection
A sentence-level detection task is a specific form of the sentence classification task. Therefore, there are mainly two approaches:
Sentence Classification: This method treats each sentence in the document as an input text and independently determines whether each sentence is generated by AI, as depicted in Figure 2(b).
Sequence Labeling: This method treats the entire document as input, creating a sequence labeling task to decide the label for each word. For each sentence, we count the number of times each label appears and select the most frequent label as the category of the sentence, as shown in Figure 2(c). It's worth noting that each dataset uses a cross-label set based on {B, I, O}, similar to the usage in sequence labeling tasks (Huang et al., 2015; Wang and Ren, 2022). We use B-AI, B-HUMAN, etc. as word-wise labels.
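The voting step of the sequence labeling strategy can be illustrated with a short sketch. It assumes word labels of the form `B-AI` or `I-HUMAN` and that sentence boundaries are given as word counts. The function name `sentence_labels` is hypothetical, and ties are broken by first occurrence, which is an arbitrary choice not specified in the text.

```python
from collections import Counter


def sentence_labels(word_labels, sentence_lengths):
    """Majority vote over word-wise labels within each sentence.

    word_labels: one label per word, e.g. "B-AI", "I-AI", "B-HUMAN".
    sentence_lengths: number of words in each sentence, in document order.
    """
    if sum(sentence_lengths) != len(word_labels):
        raise ValueError("sentence lengths must cover all word labels")
    results, start = [], 0
    for length in sentence_lengths:
        # Drop the B-/I- prefix so that only the category is counted.
        categories = [lab.split("-", 1)[-1] for lab in word_labels[start:start + length]]
        results.append(Counter(categories).most_common(1)[0][0])
        start += length
    return results


if __name__ == "__main__":
    labels = ["B-HUMAN", "I-HUMAN", "I-AI", "B-AI", "I-AI", "I-AI", "I-HUMAN"]
    print(sentence_labels(labels, [3, 4]))
```

Because the vote is taken per sentence, a few mislabeled words inside a sentence do not change its final category.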

Baseline Approaches
In the sentence-level AIGT detection challenge, we explore how to modify baselines used in document-level detection:
log p(x): We consider sentence-level detection as a sentence classification task to implement the log p(x) method. The log p(x) method discriminates AIGT through perplexity. Specifically, we utilize a particular model such as GPT-2 to compute the perplexity of each sentence in the document and draw a histogram (Figure 4) to show the distribution of perplexity. Then we select a threshold manually as the discrimination boundary to determine whether each sentence is generated by a human or an AI.
DetectGPT (Mitchell et al., 2023): Similarly, we regard sentence-level detection as a sentence classification task to implement DetectGPT. DetectGPT discriminates AIGT using their proposed z-score. For each sentence, we add multiple perturbations (we try 40 perturbations for each sentence). We then compute the z-scores for each sentence based on a particular model and generate a histogram (Figure 4) showing the score distribution. Similarly, we manually select a threshold to discriminate AI-generated sentences.
Sniffer (Li et al., 2023): Sniffer is a powerful model capable of detecting and tracing the origins of AI-generated texts. In order to conduct sentence-level AIGT detection, we train a sentence-level Sniffer following the structure and training process of the original Sniffer, except that our input is a single sentence instead of an entire document.
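The two zero-shot baselines above both reduce to scoring a sentence and comparing the score with a manually chosen threshold. The sketch below shows that shape. It assumes that the sentence log probabilities have already been computed by some scoring model, that the z-score is the original log probability minus the mean over perturbations, divided by their standard deviation, and that a higher score indicates AI-generated text. The function names, the use of the population standard deviation and the threshold value are illustrative assumptions.

```python
import statistics


def perturbation_z_score(ll_original, ll_perturbed):
    """Normalized gap between a sentence's log probability and the
    log probabilities of its perturbed variants."""
    mean = statistics.fmean(ll_perturbed)
    std = statistics.pstdev(ll_perturbed)  # population std is an assumption
    if std == 0.0:
        return 0.0  # all perturbations scored identically; no usable signal
    return (ll_original - mean) / std


def classify(score, threshold):
    """Manual thresholding: scores above the boundary are labeled AI.
    The direction of the comparison is an assumption."""
    return "AI" if score > threshold else "human"


if __name__ == "__main__":
    # Hypothetical log probabilities for one sentence and three perturbations.
    z = perturbation_z_score(-2.0, [-3.0, -4.0, -5.0])
    print(round(z, 3), classify(z, threshold=1.0))
```

For short sentences the perturbed log probabilities tend to be either nearly identical to the original or very different, which is one way to read the overlapping histograms reported for these baselines.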

SeqXGPT Framework
In this section, we provide a detailed introduction to SeqXGPT. SeqXGPT employs a sequence labeling approach for sentence-level AIGT detection and consists of the following three parts: (1) Perplexity Extraction and Alignment; (2) Feature Encoder; (3) Linear Classification Layer.

Perplexity Extraction and Alignment
Previous works demonstrate that both the average per-token log probability (DetectGPT) and contrastive features derived from token-wise probabilities (Sniffer) can contribute to AIGT detection. Therefore, in this work, we choose to extract token-wise log probability lists from several public open-source models, serving as the original features of SeqXGPT.
Specifically, given a candidate text $S$, a known model $\theta_n$ and the corresponding tokenizer $T_n$, we obtain the encoded tokens $x = [x_1, \ldots, x_i, \ldots]$ and a list of token-wise log probabilities $\mathrm{ll}_{\theta_n}(x)$, where the log probability of the $i$-th token $x_i$ is the log-likelihood $\mathrm{ll}_{\theta_n}(x_i) = \log p_{\theta_n}(x_i \mid x_{<i})$. Given a list of known models $\theta_1, \ldots, \theta_N$, we can obtain $N$ token-wise log probability lists of the same text $S$.
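As a minimal sketch of the token-wise log-likelihood defined above, the following code turns per-position logits into log probabilities of the observed tokens. It assumes the caller has already arranged the logits so that `step_logits[i]` is the model's output for predicting token $x_i$ from $x_{<i}$. The function names are hypothetical, and pure Python is used only to keep the example self-contained.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum_exp for v in logits]


def token_log_probs(step_logits, token_ids):
    """Return log p(x_i | x_<i) for every observed token.

    step_logits[i]: logits over the vocabulary for position i.
    token_ids[i]: index of the token actually observed at position i.
    """
    return [log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]


if __name__ == "__main__":
    # Toy vocabulary of size 4 at step 0 and size 2 at step 1, for brevity.
    lls = token_log_probs([[0.0, 0.0, 0.0, 0.0], [0.0, math.log(3.0)]], [2, 1])
    print([round(v, 4) for v in lls])
```

Running the same text through each of the $N$ models yields $N$ such lists, one per tokenizer.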
Considering that the tokens are commonly subwords of different granularities and that different tokenizers have different tokenization methods (e.g., byte-level BPE by GPT2 (Radford et al., 2019), SentencePiece by LLaMA (Touvron et al., 2023)), they do not align perfectly. As a result, we align tokens to words, which addresses potential discrepancies in the tokenization process across different LLMs. We use a general word-wise tokenization of $S$: $w = [w_1, \ldots, w_t]$ and align the calculated log probabilities of tokens in $x$ to the words in $w$. Therefore, we can obtain a list of word-wise log probabilities $\mathrm{ll}_{\theta_n}(w)$ from the token-wise log probability list $\mathrm{ll}_{\theta_n}(x)$. We concatenate these word-wise log probability lists together to form the foundational features $L = [l_1, \ldots, l_t]$, where the feature vector of the $i$-th word $w_i$ is $l_i = [\mathrm{ll}_{\theta_1}(w_i), \ldots, \mathrm{ll}_{\theta_N}(w_i)]$. We select GPT2, GPT-Neo, GPT-J, and LLaMA as white-box models to obtain the foundational features. Among the various implementations of alignment algorithms, this paper utilizes the alignment method proposed by Li et al. (2023).
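One possible way to align token-level log probabilities to whitespace-separated words is sketched below. This is not the alignment method of Li et al. (2023), which is not reproduced here. It assumes each tokenizer reports character spans for its tokens and averages the log probabilities of all tokens that overlap a word. The function names and the choice of the mean are illustrative assumptions.

```python
import re


def word_spans(text):
    """Character spans of whitespace-separated words."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def align_to_words(text, token_spans, token_lls):
    """Average the log probabilities of the tokens overlapping each word.

    token_spans: (start, end) character offsets for each token.
    token_lls: log probability of each token, in the same order.
    """
    aligned = []
    for w_start, w_end in word_spans(text):
        vals = [ll for (t_start, t_end), ll in zip(token_spans, token_lls)
                if t_start < w_end and t_end > w_start]
        aligned.append(sum(vals) / len(vals) if vals else 0.0)
    return aligned


def build_features(per_model_word_lls):
    """Stack N word-wise lists into one N-dimensional vector per word."""
    return [list(column) for column in zip(*per_model_word_lls)]


if __name__ == "__main__":
    text = "Hello world!"
    # Two hypothetical tokenizations of the same text.
    model_a = align_to_words(text, [(0, 5), (5, 11), (11, 12)], [-1.0, -2.0, -4.0])
    model_b = align_to_words(text, [(0, 2), (2, 5), (6, 12)], [-0.5, -1.5, -2.0])
    print(build_features([model_a, model_b]))
```

With four scoring models, `build_features` yields one 4-dimensional vector per word, matching the feature dimensionality mentioned in the ablation discussion.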

Feature Encoder
The foundational features are lists of word-wise log probabilities, which prevents us from utilizing any existing pre-trained models. Consequently, we need to design a new, efficient model structure to perform sequence labeling for AIGT detection based on these foundational features.
The word-wise log probability lists for the same sentence may differ across LLMs due to different parameter scales, training data, and other factors. Here, we propose treating the word-wise log probability list as a kind of feature that reflects a model's understanding of different semantic and syntactic structures. As an example, more complex models are able to learn more complex language patterns and syntactic structures, which can result in higher probability scores. Moreover, the actual log probability list might be subject to some uncertainty due to sampling randomness and other factors. Due to these properties, it is possible to view these temporal lists as waves in speech signals. Inspired by speech processing studies, we choose to employ convolutional networks to handle these foundational features.
Using convolutional networks, local features can be extracted from input sequences and mapped into a hidden feature space. The output features are then fed into a context network based on self-attention layers, which can capture long-range dependencies to obtain contextualized features. In this way, our model can better understand and process the wave-like foundational features.
As shown in Figure 3, the foundational features are first fed into a five-layer convolutional network $f : L \to Z$ to obtain the latent representations $[z_1, \ldots, z_t]$. Then, we apply a context network $g : Z \to C$ to the output of the convolutional network to build contextualized features $[c_1, \ldots, c_t]$ capturing information from the entire sequence. The convolutional network has kernel sizes (5, 3, 3, 3, 3) and strides (1, 1, 1, 1, 1), while the context network has two Transformer layers with the simple fixed positional embedding. More implementation details can be seen in Appendix A.
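As a small worked check of the convolutional configuration above, the sketch below computes how many input words each latent vector $z_i$ can see and how the sequence length changes through the stack. The padding scheme is not stated in this section, so "same" zero padding is an assumption, and the function names are hypothetical.

```python
KERNELS = (5, 3, 3, 3, 3)
STRIDES = (1, 1, 1, 1, 1)


def receptive_field(kernels, strides):
    """Number of input positions that influence one output position."""
    field, jump = 1, 1
    for k, s in zip(kernels, strides):
        field += (k - 1) * jump
        jump *= s
    return field


def output_length(n, kernels, strides, same_padding=True):
    """Sequence length after the conv stack (standard 1D conv length formula)."""
    for k, s in zip(kernels, strides):
        pad = (k - 1) // 2 if same_padding else 0
        n = (n + 2 * pad - k) // s + 1
    return n


if __name__ == "__main__":
    print(receptive_field(KERNELS, STRIDES))
    print(output_length(100, KERNELS, STRIDES))
    print(output_length(100, KERNELS, STRIDES, same_padding=False))
```

With all strides equal to 1, each $z_i$ depends on a window of 13 words, and under "same" padding the network emits one vector per word, which is what word-wise labeling requires. Anything beyond that 13-word window has to come from the Transformer layers of the context network.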
Linear Classification Layer
After extracting the contextualized features, we train a simple linear classifier $F(\cdot)$ to project the features of each word to different labels. Finally, we count the number of times each word label appears and select the most frequent label as the final category of each sentence.

Implementation Details
We conduct sentence-level AIGT detection based on SeqXGPT-Bench. As part of our experiments, we test modified existing methods as described in Sec. 4.2, as well as our proposed SeqXGPT detailed in Sec. 4.3.
We use four open-source (L)LMs to construct our SeqXGPT-Bench: GPT2-xl (1.5B), GPT-Neo (2.7B), GPT-J (6B) and LLaMA (7B). These models are also utilized to extract perplexity lists, which are used to construct the original features of our SeqXGPT and the contrastive features for Sniffer. For each model, we set up an inference server on NVIDIA 4090 GPUs specifically for extracting perplexity lists. The maximum sequence length is set to 1024 for all models, which is the maximum context length supported by all the pre-trained models we used. We align all texts with a white-space tokenizer to obtain uniform tokenizations for our SeqXGPT.
In the evaluation process, we utilize Precision (P.), Recall (R.), and Macro-F1 Score as our metrics. Precision and Recall respectively reflect the "accuracy" and "coverage" of each category, while the Macro-F1 Score effectively combines these two indicators, allowing us to consider the overall performance.
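For reference, the metrics above can be computed as in the following sketch, which uses the standard definitions of per-class precision, recall and F1, and takes Macro-F1 as the unweighted mean of per-class F1. The function names and the toy labels are illustrative and do not correspond to any reported result.

```python
def per_class_scores(gold, pred):
    """Return {label: (precision, recall, f1)} over sentence-level labels."""
    scores = {}
    for c in sorted(set(gold) | set(pred)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_scores(gold, pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


if __name__ == "__main__":
    gold = ["AI", "AI", "human", "human"]
    pred = ["AI", "human", "human", "human"]
    print(per_class_scores(gold, pred))
    print(macro_f1(gold, pred))
```

Because every class contributes equally to the mean, a detector that labels everything as one class is penalized even when that class dominates the data.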

Sentence-Level Detection Results
In this section, we present three groups of experiments:
Particular-Model Binary AIGT Detection
We test the zero-shot methods log p(x) and DetectGPT, as well as Sent-RoBERTa and our SeqXGPT, on two datasets shown in Table 1. The results for the other two datasets can be found in the appendix in Table 6. As seen in Figure 4, there is a large overlap between the peaks of AI-generated and human-created sentences when using log p(x) and DetectGPT, making it difficult to distinguish sentences from different categories. The peaks almost completely overlap for DetectGPT in particular. This situation can be attributed to two main factors. Firstly, the perplexity of a sentence tends to be more sensitive than that of a document. Secondly, given the brevity of sentences, perturbations often fail to achieve the same effect as in DetectGPT. That is, the perturbed sentences either remain nearly identical to the original, or vary significantly. These findings indicate that these two zero-shot methods are unsuitable for sentence-level AIGT detection.
Overall, SeqXGPT exhibits great performance on all datasets, including the two presented in Table 1 and the other two in Table 6. Although Sent-RoBERTa performs better than the two zero-shot methods, it is still far inferior to our method.
Mixed-Model Binary AIGT Detection
We implement Sniffer and Sent-RoBERTa based on sentence classification tasks, as well as our SeqXGPT and Seq-RoBERTa based on sequence labeling tasks. From the results in Table 3, it is apparent that our SeqXGPT shows the best performance among these four methods, significantly demonstrating the effectiveness of our method. In contrast, the performance of Sniffer is noticeably inferior, which emphasizes that document-level AIGT detection methods cannot be effectively modified for sentence-level AIGT detection. Interestingly, we find that the overall performance of both RoBERTa-based methods is only slightly inferior to that of SeqXGPT. This suggests that the semantic features of RoBERTa might be helpful to discriminate human-created sentences.
Mixed-Model Multiclass AIGT Detection
As shown in Table 2, SeqXGPT can accurately discriminate sentences generated by various models and those authored by humans, demonstrating its strength in multi-class detection. It is noteworthy that RoBERTa-based methods perform significantly worse than in binary AIGT detection. They are better at discriminating sentences generated by humans or more human-like LLMs, such as GPT-3.5-turbo, while having difficulty detecting sentences generated by GPT-2/GPT-J/Neo. The reason for this may be that sentences generated by stronger models contain more semantic features. This is consistent with the findings of Li et al. (2023). In contrast, Seq-RoBERTa outperforms Sent-RoBERTa, indicating that sequence labeling methods, with their ability to obtain the context information of the entire document, are more suitable for sentence-level AIGT detection. Sniffer, however, performs poorly, which may be due to its primary design focusing on document-level detection.

Document-Level Detection Results
We conduct experiments on a document-level detection dataset (more details in Appendix C.1). For sentence classification-based models, we classify each sentence and select the category with the highest frequency as the document's category. For sequence labeling-based models, we label each word and select the category with the greatest number of appearances as the document's category.
Table 4 indicates that sentence-level detection methods can be transformed and directly applied to document-level detection, and the performance is positively correlated with their performance on sentence-level detection. Despite their great performance in detecting human-generated sentences, both Seq-RoBERTa and SeqXGPT perform slightly worse in discriminating human-created documents. This is mainly because our training set for human-created data only comes from the first 1-3 sentences, leading to a shortage of learning samples. Therefore, appropriately increasing human-generated data can further enhance the performance. We can see that Sniffer performs much better in document-level detection, indicating that Sniffer is more suitable for this task. Overall, SeqXGPT exhibits excellent performance in document-level detection.

Out-of-Distribution Results
Previous methods often show limitations on out-of-distribution (OOD) datasets (Mitchell et al., 2023). Therefore, we intend to test the performance of each method on the OOD sentence-level detection dataset (more details in Appendix C.2).
Table 5 demonstrates the great performance of SeqXGPT on OOD data, reflecting its strong generalization capabilities. Conversely, Sent-RoBERTa shows a significant decline in performance, especially when discriminating sentences generated by GPT-3.5-turbo and humans. This observation aligns with the findings from prior studies (Mitchell et al., 2023; Li et al., 2023), suggesting that semantically-based methods are prone to overfitting on the training dataset. However, Seq-RoBERTa performs relatively better on OOD data. We believe that this is because sequence labeling methods can capture contextualized information, helping to mitigate overfitting during learning. Despite this, its performance still falls short of SeqXGPT.

Ablations and Analysis
In this section, we perform ablation experiments on the structure of SeqXGPT to verify the effectiveness of different structures in the model.
First, we train a model using only Transformer layers. As shown in Table 2, this model struggles with AIGT detection, incorrectly classifying all sentences as human-written. Although there are differences between a set of input feature waves (foundational features), they are highly correlated, often having the same rise and fall trend (Figure 3). Further, these features only have 4 dimensions. We think it is the high correlation and sparsity of the foundational features that make it difficult for the Transformer layers to effectively learn AIGT detection. In contrast, CNNs are very good at handling temporal features. When combined with CNNs, it is easy to extract more effective, concentrated feature vectors from these feature waves, thereby providing higher-quality features for subsequent Transformer layers. Moreover, the parameter-sharing mechanism of CNNs can reduce the number of parameters, helping us to learn effective features from limited samples. Thus, CNNs play an important role in SeqXGPT.
However, when only using CNNs (without Transformer layers), the model performs poorly on sentences generated by strong models such as GPT-3.5-turbo. The reason for this may be that CNNs primarily extract local features. In dealing with sentences generated by these strong models, it is not sufficient to rely solely on local features, and we must also consider the contextualized features provided by the Transformer layers.

Conclusion
In this paper, we first introduce the challenge of sentence-level AIGT detection and propose three tasks based on existing research in AIGT detection. Further, we introduce a strong approach, SeqXGPT, as well as a benchmark to solve this challenge. Through extensive experiments, our proposed SeqXGPT can obtain promising results in both sentence and document-level AIGT detection. On the OOD test set, SeqXGPT also exhibits strong generalization. We hope that SeqXGPT will inspire future research in AIGT detection, and may also provide insightful references for the detection of content generated by models in other fields.

Limitations
Although SeqXGPT exhibits excellent performance in both sentence and document-level AIGT detection challenges and exhibits strong generalization, it still presents certain limitations:
(1) We did not incorporate semantic features, which could potentially assist our model further in the sentence recognition process, particularly in cases involving human-like sentence generation. We leave this exploration for future work.
(2) During the construction of GPT-3.5-turbo data, we did not extensively investigate the impact of more diversified instructions. Future research will dive into exploring the influence of instructions on AIGT detection.
(3) Due to limitations imposed by the model's context length and generation patterns, our samples only consist of two distinct sources of sentences. In future studies, we aim to explore more complex scenarios where a document contains sentences from multiple sources.

sample 200 evidence documents from triviaQA. These documents, with their wide-ranging domains and inherent diversity, form an excellent basis for testing OOD performance. The methodology adopted for constructing this dataset mirrors that detailed in Sec. 3.2, where we randomly select the first to third sentences from the human-generated documents as the prompt for subsequent AI generation. We then employ this newly constructed dataset as a testbed for evaluating the OOD performance of the models trained in Sec. 5.2.

D Instruction Details
The specific instruction used in GPT-3.5-turbo is:
Please provide a continuation for the following content to make it coherent: {prompt}
We consider a simple instruction since we focus on studying sentence-level AIGT detection without further analyzing the working mechanisms and better usage of instructions, which we leave to future works.

Figure 2: Strategies for AIGT Detection. (a) Document-Level AIGT Detection: Determine the category based on the entire candidate document. (b) Sentence Classification for Sentence-Level AIGT Detection: Classify each sentence one by one using the same model. (c) Sequence Labeling for Sentence-Level AIGT Detection: Classify the label for each word, then select the most frequently occurring category as the final category for each sentence.

Figure 3: SeqXGPT Framework.

RoBERTa: RoBERTa (Liu et al., 2019) is based on the Transformer encoder and can perform sentence classification tasks as well as sequence labeling tasks. Based on RoBERTa-base, we train two RoBERTa baselines for sentence-level AIGT detection. As mentioned in Sec. 4.1, one model employs a sequence labeling approach, while the other uses a sentence classification method. They are referred to as Seq-RoBERTa and Sent-RoBERTa, respectively.

Figure 4: The discrepancy between AI-generated and human-created sentences. In each panel, different bars show different sentence origins, and each panel applies one detection method with one particular model to the given sentences. The top two panels use log p(x), while the bottom two use DetectGPT.

Table 1: Results of Particular-Model Binary AIGT Detection on two datasets whose AI-generated sentences are from GPT-2 and GPT-Neo, respectively. The Macro-F1 is used to measure the overall performance, while the P. (precision) and R. (recall) are used to measure the performance in a specific category.

Table 2 ,
SeqXGPT can accurately discriminate sentences generated by various models and those authored by humans, demonstrating its strength in multi-class detection.It is noteworthy that RoBERTa-based methods perform significantly worse than binary AIGT detection.They are better at discriminating

Table 3: Results of Mixed-Model Binary AIGT Detection.

Table 5: Results of Out-of-Distribution Sentence-Level AIGT Detection.