CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Low Resource With Contrastive Learning

Machine-Generated Text (MGT) detection, the task of discriminating MGT from Human-Written Text (HWT), plays a crucial role in preventing misuse of text generative models, which have recently excelled at mimicking human writing style. The latest detectors usually take coarse text sequences as input and fine-tune pre-trained models with standard cross-entropy loss. However, these methods fail to consider the linguistic structure of texts. Moreover, they lack the ability to handle the low-resource problem, which often arises in practice given the enormous amount of textual data online. In this paper, we present a coherence-based contrastive learning model named CoCo to detect possible MGT under the low-resource scenario. To exploit linguistic features, we encode coherence information, in the form of a graph, into the text representation. To tackle the challenge of scarce data, we employ a contrastive learning framework and propose an improved contrastive loss that prevents the performance degradation brought by simple samples. Experiment results on two public datasets and two self-constructed datasets show that our approach significantly outperforms state-of-the-art methods. We also find, surprisingly, that in our experiments MGTs originating from up-to-date language models can be easier to detect than those from earlier models, and we propose some preliminary explanations for this counter-intuitive phenomenon. All code and datasets are open-sourced.

Previous works on MGT detection mainly concentrate on sequence feature representation and classification (Gehrmann et al., 2019; Solaiman et al., 2019; Zellers et al., 2019; He et al., 2023; Mitchell et al., 2023). Recent studies have shown the good performance of automated detectors trained in a fine-tuning fashion (Solaiman et al., 2019; Mireshghallah et al., 2023). Although these fine-tuning-based detectors have demonstrated their effectiveness, they still suffer from two issues that limit their practical use: (1) Existing detectors treat input documents as flat sequences of tokens and use neural encoders or statistical features (e.g., TF-IDF, perplexity) to represent text as a dense vector for classification. These methods rely heavily on token-level distribution differences between the texts in each class, ignoring the high-level linguistic representation of text structure. (2) Compared with the enormous number of online texts, the annotated datasets for training MGT detectors are rather low-resource. Constrained by the amount of available annotated data, traditional detectors suffer frustrating accuracy and may even collapse at test time.
The defect in the coherence of LMs when generating long text has been revealed by previous works. Malkin et al. (2022) mention that long-range semantic coherence remains challenging in language generation. Sun et al. (2020) also provide examples of incoherent MGTs. As shown in Fig. 1, MGTs and HWTs exhibit differences in coherence as traced by entity consistency. Accordingly, we propose that coherence could be an entry point for MGT detection from the perspective of high-level linguistic structure representation, where MGTs could be less interactive than HWTs. Specifically, we propose an entity coherence graph to model the sentence-level structure of texts based on Centering Theory (Grosz and Sidner, 1986), which evaluates text coherence by entity consistency. The entity coherence graph treats entities as nodes and builds edges between entities in the same sentence and between identical entities across different sentences to reveal the text structure. Instead of treating text as a flat sequence, coherence modeling introduces distinguishable linguistic features at the input stage and provides explainable differences between MGTs and HWTs.
To alleviate the low-resource problem in the second issue, inspired by the resurgence of contrastive learning (He et al., 2020; Chen et al., 2020), we use a careful design of sample pairs and the contrastive process to learn fine-grained instance-level features under low resource. However, it has been shown that the easiest negative samples are unnecessary and insufficient for model training in contrastive learning (Cai et al., 2020). To circumvent the performance degradation brought by easy samples, we propose a novel contrastive loss that reweights the effect of negative samples by a difficulty score, helping the model concentrate on hard samples and ignore easy ones. Extensive experiments on multiple datasets (GROVER, GPT-2, GPT-3.5) demonstrate the effectiveness and robustness of our proposed method. Surprisingly, we find that the GPT-3.5 datasets are easier for all the detectors than the datasets from smaller and older models (GPT-2 and GROVER) under our setting. We take a small step toward exploring why the GPT-3.5 dataset is overly simple by probing statistical cues, from the perspectives of token spans and individual tokens.
Our contributions are summarized as follows: • Coherence Graph Construction: We model text coherence with entity consistency and sentence interaction, statistically proving its distinctiveness for MGT detection, and we introduce this linguistic feature at the input stage.
• Improved Contrastive Loss: We propose a novel contrastive loss in which hard negative samples receive more attention, improving the detection accuracy on challenging samples.
• Outstanding Performance: We achieve state-of-the-art performance on four MGT datasets in both low-resource and high-resource settings. Experimental results verify the effectiveness and robustness of our model.

Related Work
Machine-Generated Text Detection. Machine-generated texts, also called deepfake or neural fake texts, are generated by language models to mimic human writing style, making them perplexing for humans to distinguish (Ippolito et al., 2020). Generative models like GROVER (Zellers et al., 2019), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), and the emerging GPT-3.5-turbo (also known as ChatGPT) have been evaluated on the MGT detection task, on which detectors achieve good results (Gehrmann et al., 2019; Mireshghallah et al., 2023). Bakhtin et al. (2019) train an energy-based model by treating the output of TGMs as negative samples to demonstrate generalization ability. Deep learning models incorporating stylometry and external knowledge are also feasible for improving the performance of MGT detectors (Uchendu et al., 2019; Zhong et al., 2020). Our method differs from previous work by analyzing and modeling text coherence as a distinguishable feature and by emphasizing performance improvement under low-resource scenarios.
Coherence Modeling. For generative models, coherence is a critical requirement and vital target (Hovy, 1988). Previous works mainly discuss two types of coherence: local coherence (Mellish et al., 1998; Althaus et al., 2004) and global coherence (Mann and Thompson, 1987). Local coherence focuses on sentence-to-sentence transitions (Lapata, 2003), while global coherence tries to capture comprehensive structure (Karamanis and Manurung, 2002). Our method strives to represent both local and global coherence with inner- and inter-sentence relations between entity nodes.
Contrastive Learning. Contrastive learning in NLP demonstrates superb performance in learning token-level embeddings (Su et al., 2022) and sentence-level embeddings (Gao et al., 2021b) for natural language understanding. With in-depth study of the mechanism of contrastive learning, the hardness of samples has proved crucial in the training stage. Cai et al. (2020) define the dot product between queries and negatives in normalized embedding space as hardness, and find that the easiest 95% of negatives are insufficient and unnecessary. Song et al. (2022) propose a difficulty measure function based on the distance between classes and apply curriculum learning to the sampling stage. In contrast, our method pays more attention to hard negative samples to improve the detection accuracy on challenging samples.

Methodology
The workflow of COCO mainly comprises coherence graph construction and a supervised contrastive learning discriminator. Fig. 2 illustrates the overall architecture. The pseudocode of the training process is shown in Algorithm 1.

Coherence Graph Construction
In this part, we illustrate how to construct the coherence graph, which uncovers the coherence structure of the text by modeling sentence interaction.
According to Centering Theory (Grosz and Sidner, 1986), the coherence of texts can be modeled by sentence interaction around center entities. To better reflect text structure and avoid semantic overlap, we construct an undirected graph with entities as nodes. Specifically, we first apply the ELMo-based NER model TagLM (Peters et al., 2017), via the NER toolkit AllenNLP, to extract the entities from a document. A relation <inter> is built between identical entities in different sentences, and nodes within the same sentence are connected by the relation <inner> for their natural structural relevance. Formally, the adjacency matrix of the coherence graph assigns A[v_{i,a}, v_{j,b}] = 1 if a = b (relation <inner>), or if v_{i,a} and v_{j,b} refer to the same entity with a ≠ b (relation <inter>), and 0 otherwise, where v_{i,a} represents the i-th entity in sentence a, which is regarded as a node in the coherence graph. We verify how MGT and HWT separate through statistical analysis on the coherence graph in Appendix I.
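The construction above can be sketched in a few lines. Entity extraction is assumed to be done by an external NER step (TagLM via AllenNLP in the paper), so this sketch starts from per-sentence entity lists; node and edge naming here is illustrative:

```python
from collections import defaultdict

def build_coherence_graph(sent_entities):
    """Build the entity coherence graph as two relation-specific edge sets.

    sent_entities: list of sentences, each a list of entity strings
    (assumed to come from an upstream NER step).
    Nodes are (sentence_index, mention_index) pairs, i.e., v_{i,a}.
    """
    nodes = [(a, i) for a, ents in enumerate(sent_entities)
             for i, _ in enumerate(ents)]
    inner, inter = set(), set()
    # <inner>: connect all entity mentions within the same sentence
    for a, ents in enumerate(sent_entities):
        for i in range(len(ents)):
            for j in range(i + 1, len(ents)):
                inner.add(((a, i), (a, j)))
    # <inter>: connect identical entities across different sentences
    by_name = defaultdict(list)
    for a, ents in enumerate(sent_entities):
        for i, e in enumerate(ents):
            by_name[e.lower()].append((a, i))
    for mentions in by_name.values():
        for u in range(len(mentions)):
            for v in range(u + 1, len(mentions)):
                if mentions[u][0] != mentions[v][0]:
                    inter.add((mentions[u], mentions[v]))
    return nodes, inner, inter
```

For example, for two sentences with entities [["Qatar", "FIFA"], ["FIFA"]], the graph has three nodes, one <inner> edge inside the first sentence, and one <inter> edge linking the two FIFA mentions.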

Model Overview
The training process is illustrated in Fig. 2. Each entry in the dataset is paired with its coherence graph. Entries in the training set are randomly sampled into keys and queries. Two coherence encoder modules (CEMs), f_k and f_q, are identically initialized to generate coherence-enhanced representations D_k and D_q for keys and queries. A dynamic memory bank with the size of all training data is initialized to store all key representations and their annotations, providing enough contrastive pairs in low-resource scenarios. In every training step, the newly encoded key graphs update the memory bank following the First In First Out (FIFO) rule to keep it up to date and the training process consistent.
3 https://demo.allennlp.org/named-entity-recognition

A novel loss composed of the improved contrastive loss and cross-entropy loss ensures the model's ability to achieve instance-level intra-class compactness and inter-class separability while maintaining class-level distinguishability. A linear discriminator takes query representations as input and generates prediction results.

Positive/Negative Pair Definition
In the supervised setting, where we have access to label information, we define two samples with the same label as a positive pair and two samples with different labels as a negative pair, thereby incorporating label information into the training process.

Encoder Design
In this part, we introduce an innovative graph neural network, the coherence encoder module (CEM), which integrates coherence information into the semantic representation of text by propagating and aggregating information across different granularities. The workflow is illustrated in Fig. 3.
Node Representation Initialization.We initialize the representation of entity nodes with the powerful pre-trained model RoBERTa for its superior ability to encode contextual information into text representation.
Given an entity e with a span of n tokens, we utilize RoBERTa to map the input document x to embeddings h(x). The contextual representation of e is then pooled from the token embeddings at positions e_1, ..., e_n, where e_i is the absolute position of the i-th token of e in the whole document.
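The pooling step can be sketched as follows. Mean pooling over the entity's token positions is an assumption made for illustration, since the exact pooling operator is defined in the paper's (omitted) equation:

```python
def entity_representation(token_embeddings, positions):
    """Pool contextual token embeddings into one entity vector.

    token_embeddings: list of per-token vectors for the whole document,
    i.e., h(x) from the encoder (RoBERTa in the paper).
    positions: the absolute indices e_i of the entity's tokens.
    Mean pooling is assumed here.
    """
    dim = len(token_embeddings[0])
    vecs = [token_embeddings[p] for p in positions]
    # average each dimension over the entity's token span
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
```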
Relation-aware GCN. Based on vanilla Graph Convolutional Networks (Welling and Kipf, 2016), we propose to assign a separate weight W_r to each relation r (inter and inner) with a Relation-aware GCN, which convolves over the edges of each relation in the coherence graph separately. The final node representation is the sum of the GCN outputs over all relations in the relation set R. We use a two-layer GCN in the model because more layers cause overfitting under low resources.
Sentence Representation. Entity representations are aggregated into sentence representations, where M_i denotes the number of entities in the i-th sentence, H_(i,j) the embedding of the j-th entity in the i-th sentence, W_s a weight matrix, and b_s a bias. All sentence representations within the same document are concatenated as the sentence matrix Z_s.
Document Representation with Attention LSTM. We design a self-attention mechanism to discover the sentence-level coherence between each sentence and the others, and apply an LSTM to track coherence across consecutive sentences, taking the LSTM's last hidden state as the aggregated document representation containing comprehensive coherence information. Here K, Q, V are linear transformations of Z_s with matrices W_k, W_q, W_v; d_Z is the dimension of the representation Z_s; and γ is a hyperparameter for scaling. Finally, we concatenate Z_c with the sequence representation h([CLS]) from RoBERTa's last layer to generate the coherence-enhanced document representation D.
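A minimal sketch of one relation-aware GCN layer, assuming dense per-relation adjacency matrices and omitting normalization and bias terms for brevity (the paper's exact formulation may differ):

```python
def rgcn_layer(H, adj_by_rel, W_by_rel):
    """One relation-aware GCN layer: convolve each relation's edges
    separately and sum the per-relation outputs, then apply ReLU.

    H: node features as a list of rows (n x d_in).
    adj_by_rel: {relation: dense adjacency matrix (n x n)}.
    W_by_rel: {relation: weight matrix W_r (d_in x d_out)}.
    """
    n, d_in = len(H), len(H[0])
    d_out = len(W_by_rel[next(iter(W_by_rel))][0])
    out = [[0.0] * d_out for _ in range(n)]
    for rel, A in adj_by_rel.items():
        W = W_by_rel[rel]
        # message passing for this relation: (A_r @ H) @ W_r
        AH = [[sum(A[i][k] * H[k][j] for k in range(n))
               for j in range(d_in)] for i in range(n)]
        for i in range(n):
            for j in range(d_out):
                out[i][j] += sum(AH[i][k] * W[k][j] for k in range(d_in))
    # ReLU nonlinearity
    return [[max(0.0, x) for x in row] for row in out]
```

Stacking two such layers (as the model does) simply feeds one layer's output into the next with fresh weights.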

Dynamic Memory Bank
The dynamic memory bank is created to store as many key encodings D_k as possible, forming adequate positive and negative pairs for each batch. It is maintained as a queue so that newly encoded keys replace outdated ones, which keeps the key encodings consistent with the current training step.
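The FIFO behavior can be sketched with a bounded queue; the class and method names here are illustrative, not the paper's implementation:

```python
from collections import deque

class MemoryBank:
    """FIFO queue of (key representation, label) pairs, sketching the
    dynamic memory bank (its size equals the number of training examples)."""

    def __init__(self, size):
        # deque with maxlen drops the oldest entries automatically
        self.keys = deque(maxlen=size)

    def enqueue(self, reps, labels):
        """Push the newly encoded keys of one training step."""
        for r, y in zip(reps, labels):
            self.keys.append((r, y))

    def pairs_for(self, label):
        """Return stored positives (same label) and negatives (other label)."""
        pos = [r for r, y in self.keys if y == label]
        neg = [r for r, y in self.keys if y != label]
        return pos, neg
```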

Loss Function
Following the definition of positive and negative pairs above, the traditional supervised contrastive loss (Gunel et al., 2021) treats all positive and negative pairs equally. However, recognizing that not all negatives are created equal (Cai et al., 2020), our goal is to emphasize informative samples to help the model differentiate difficult ones. Thus, we propose an improved contrastive loss L_ICL that dynamically adjusts the weight of each negative pair's similarity according to the hardness of the negative sample: hard negative samples are assigned larger weights to stimulate the model to pull the same class together and push different classes apart. In the loss, P(i) is the positive set whose members share the label of query q_i, and N(i) is the negative set whose members have a different label from q_i. Apart from this instance-level learning mechanism, a linear classifier combined with a cross-entropy loss L_CE provides the model with class-level separation ability, where p_i is the predicted probability distribution of the i-th sample. The final loss is a weighted average of the two, L_total = α L_ICL + (1 − α) L_CE, where the hyperparameter α adjusts the relative balance between instance compactness and class separability.
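The reweighting idea can be sketched as an InfoNCE-style loss whose negative terms are scaled by hardness. The softmax-over-negative-similarities weighting below is an illustrative assumption, not the paper's exact weighting function:

```python
import math

def improved_contrastive_loss(q, positives, negatives, tau=0.07):
    """Sketch of a hardness-reweighted supervised contrastive loss.

    q, positives, negatives: (approximately) unit-normalized vectors;
    negatives that are more similar to the query (harder) receive a
    larger weight, so they dominate the denominator.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    neg_sims = [dot(q, n) for n in negatives]
    # hardness weights: softmax over negative similarities,
    # rescaled so the weights sum to len(negatives)
    z = sum(math.exp(s / tau) for s in neg_sims)
    weights = [len(negatives) * math.exp(s / tau) / z for s in neg_sims]

    loss = 0.0
    for p in positives:
        p_term = math.exp(dot(q, p) / tau)
        n_sum = sum(w * math.exp(s / tau) for w, s in zip(weights, neg_sims))
        loss += -math.log(p_term / (p_term + n_sum))
    return loss / len(positives)
```

As expected, a hard negative lying close to the query yields a larger loss than an easy negative lying far away.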

Momentum Update
The parameters of the query encoder f_q and the classifier are updated by gradients back-propagated from L_total. Denoting the parameters of f_q as θ_q and those of f_k as θ_k, the key encoder f_k's parameters are updated by the momentum update mechanism θ_k ← β θ_k + (1 − β) θ_q, where the hyperparameter β is the momentum coefficient.
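The MoCo-style exponential moving average above is a one-liner per parameter; here it is sketched over flat parameter lists for clarity:

```python
def momentum_update(theta_k, theta_q, beta=0.999):
    """Momentum update of the key encoder's parameters:
    theta_k <- beta * theta_k + (1 - beta) * theta_q.

    theta_k, theta_q: flat lists of parameter values; in practice this
    is applied element-wise to every tensor of the two encoders.
    """
    return [beta * k + (1.0 - beta) * q for k, q in zip(theta_k, theta_q)]
```

A large beta keeps the key encoder slowly moving, which keeps the representations stored in the memory bank consistent across training steps.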

Datasets
We evaluate our model on the following datasets. GROVER Dataset (Zellers et al., 2019) is a news-style dataset in which HWTs are collected from RealNews, a large corpus of news from Common Crawl, and MGTs are generated by Grover-Mega (1.5B), a transformer-based news generator.
GPT-2 Dataset is a WebText-style dataset provided by OpenAI, with HWTs adopted from WebText and MGTs produced by GPT-2 XL (1542M).
GPT-3.5 Dataset is a news-style open-source dataset constructed by us based on OpenAI's text-davinci-003 model (175B), which is one of the most capable GPT-3.5 models so far and can generate longer texts (maximum 4,097 tokens). The GPT-3.5 model refers to various recent newspapers (Dec. 2022-Feb. 2023), whose full texts act as the HWT part, and the model generates by imitation. We design two subsets, with mixed and unmixed provenances, whose details are explained in Appendix B. These brand-new datasets ensure no existing model has been pre-trained on the corpus, which accounts for the fairness of comparison.
The statistics of the datasets are summarized in Appendix A. We randomly sample 500 examples as training data for the low-resource setting; for the full dataset setting, we utilize all training data. Implementation details are in Appendix D.

Comparison Models
We compare COCO with state-of-the-art detection methods to demonstrate its effectiveness. We divide the comparison methods into two categories: model-based and metric-based. Metric-based methods detect MGT with specific statistical text-evaluation metrics and logistic regression, while model-based methods learn features by fine-tuning a model.
The model-based baselines are as follows: CE+SCL (Gunel et al., 2021), a state-of-the-art supervised contrastive learning method for various downstream tasks. We train the detector with cross-entropy loss (CE) and supervised contrastive loss (SCL) calculated within a mini-batch.
DualCL (Chen et al., 2022), a contrastive learning method with the addition of label representations for data augmentation.
The metric-based baselines are as follows: GLTR (Gehrmann et al., 2019), a supporting tool that helps humans recognize MGTs with visual hints. We follow the settings of Guo et al. (2023) and select the Test-2 feature, which counts the top-k rankings of tokens under the predicted probability distributions of GPT-2 medium (355M), as features for training a logistic regression classifier.
DetectGPT (Mitchell et al., 2023), a contemporaneous metric-based method utilizing the change in a model's log probability after text perturbations. We use T5-3B to perturb texts and Pythia-12B (Biderman et al., 2023) for scoring. A logistic regression classifier is trained to make predictions.

Performance Comparison
As shown in Table 1, COCO surpasses the state-of-the-art methods on the MGT detection task by at least 1.23% and 1.64% (Accuracy and F1-Score) on the GROVER limited dataset, and by 1.75% and 2.83% on the GPT-2 limited dataset. COCO also achieves performance comparable to the most capable detectors in the complete dataset setting. These results indicate the utility of contrastive learning and the rationality of the coherence representation.
Moreover, it should be noticed that model-based methods usually achieve better results than metric-based methods. This can be explained by the fact that metric-based methods can only regress over a few features, which are over-compressed and under-representative for the detection task. Also, metric-based methods mainly use the pre-trained model to obtain token probabilities instead of fine-tuning the whole model. With more training samples involved, the performance of model-based methods improves drastically, while metric-based methods do not benefit much from more training examples, revealing that logistic regression is not strong enough to absorb many texts with diverse semantics. Meanwhile, COCO outperforms CE+SCL and DualCL regardless of the size of the training set, which suggests that the improved contrastive loss successfully addresses the performance degradation brought by simple negative samples.
We also find that the GROVER dataset is the hardest to detect. This is because the GROVER generator is trained with an adversarial objective of deceiving a verifier, which endows the generator with a deceptive nature. To our surprise, the GPT-3.5 dataset is overly simple for all detectors. This result accords with conclusions in recent works (Mireshghallah et al., 2023; Chen et al., 2023). We conduct extensive experiments on different self-constructed and published GPT-3.5 datasets generated by a series of prompts, validating this striking conclusion. The experiment details and results are in Appendix C. We also conduct experiments and discussions to explore further explanations in Sec. 4.5.2.
Notably, a more comprehensive comparison experiment with 8 datasets (Pu et al., 2023) and 12 methods is presented in Appendix E, which substantiates the advantage of COCO.

Ablation Study
To illustrate the necessity of COCO's components, we conduct ablation experiments on the 1,000-example GROVER dataset. The structures of the ablation models are described below. As shown in Table 2, coherence information and the contrastive learning framework greatly contribute to model performance, especially in F1-Score. Replacing entity nodes in the coherence graph with sentence nodes impairs the detector, which could be caused by semantic overlap between the graph representation and the text sequence representation. The attention LSTM also plays an important role in preserving coherence information during sentence aggregation. Lastly, the results show the advantage of the improved contrastive loss over the standard supervised contrastive loss.
Furthermore, we conduct ablation studies in other scenarios, including the GPT-2, GPT-3.5-Unmixed, and GPT-3.5-Mixed datasets. More detailed results are discussed in Appendix G, which clearly supports the performance gain from COCO's components. Moreover, the benefit of contrastive learning is verified to be orthogonal to the benefit of coherence information.

Model Robustness to Perturbation
To validate the robustness of COCO against various perturbations, we train COCO on the GROVER dataset in the low-resource setting and perturb the test set with four different operations: Delete (randomly delete tokens in each entry), Repeat (randomly select tokens and repeat them twice in the text), Insert (add random tokens from the vocabulary of the pre-trained model into random positions in the text), and Replace (randomly replace tokens with randomly selected tokens from the vocabulary). The perturbation scale is set to 15%. The experiment results are shown in Table 3.
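The four operations can be sketched as token-level edits at a fixed scale; this is an illustrative implementation of the perturbations as described, not the paper's exact script:

```python
import random

def perturb(tokens, mode, vocab, scale=0.15, seed=0):
    """Apply one of the robustness-test perturbations to a token list.

    mode: "delete", "repeat", "insert", or "replace".
    vocab: token vocabulary to draw random tokens from.
    scale: fraction of tokens affected (15% in the experiment).
    """
    rng = random.Random(seed)
    n = max(1, int(len(tokens) * scale))
    out = list(tokens)
    if mode == "delete":
        # remove n random tokens (back-to-front so indices stay valid)
        for i in sorted(rng.sample(range(len(out)), n), reverse=True):
            del out[i]
    elif mode == "repeat":
        # duplicate n random tokens in place
        for i in sorted(rng.sample(range(len(out)), n), reverse=True):
            out.insert(i, out[i])
    elif mode == "insert":
        # add n random vocabulary tokens at random positions
        for _ in range(n):
            out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
    elif mode == "replace":
        # overwrite n random tokens with random vocabulary tokens
        for i in rng.sample(range(len(out)), n):
            out[i] = rng.choice(vocab)
    return out
```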
4.5.2 Statistical Cues for Detectable Features in GPT-3.5
To further investigate the rationale behind the easy-to-detect nature of GPT-3.5-generated texts, we utilize Transformers-Interpret, a tool for evaluating feature attribution in predictions based on Integrated Gradients (Sundararajan et al., 2017), to discover the supporters and opponents (tokens) in the decision-making stage. We probe the statistical cues of the GPT-3.5 mixed dataset from two perspectives: spans of tokens and individual tokens.
We define the span-of-tokens coverage γ_n as the number of n-gram supporters for true positives, P_n (i.e., n consecutive tokens that all contribute positively to the correct prediction), over the number of all n-grams in true positives, A_n, which can be formulated as γ_n = P_n / A_n. Moreover, we apply the productivity π_k and coverage ε_k of a statistical cue k (Niven and Kao, 2019) on the GPT-3.5 mixed dataset to find out whether individual tokens act as common and strong signals contributing to model predictions. In the definition of productivity π_k, T_i^(j) is the set of tokens for text i with label j; the coverage ε_k is the portion of data points to which cue k applies over the total number of data points.
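The γ_n statistic can be computed directly from per-token attribution scores; this sketch assumes one correctly predicted text at a time (the scores themselves would come from an attribution tool such as Integrated Gradients):

```python
def ngram_supporter_coverage(attributions, n):
    """gamma_n = P_n / A_n for one correctly predicted text.

    attributions: per-token attribution scores (positive = supports the
    correct prediction). An n-gram is a supporter only if every one of
    its n tokens has a positive attribution.
    """
    total = len(attributions) - n + 1   # A_n: all n-grams in the text
    if total <= 0:
        return 0.0
    supporters = sum(                   # P_n: fully-positive n-grams
        1 for i in range(total)
        if all(a > 0 for a in attributions[i:i + n])
    )
    return supporters / total
```

A widening gap between MGT and HWT as n grows, as reported in Table 4, would show up as γ_n decaying much more slowly for MGT than for HWT.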
We fine-tune the RoBERTa-base model with a classification head on the GPT-3.5 mixed dataset and quantify how tokens in the GPT-3.5 mixed test data affect the model predictions with the criteria mentioned above. The results are shown in Table 4 and Table 5. It can be noticed that although γ_1 is about the same for MGT and HWT, the gap widens from γ_2 to γ_5, indicating that consecutive spans of tokens act as an indicator for MGT more than for HWT. Table 5 shows that "according", "where", and "they" are the top-3 strongest tokens for detection. However, we could not reach any valid conclusion from their semantics. Meanwhile, these tokens only cover a small portion of the total number of data points (less than 0.4), so the signal they provide is weak. Therefore, we hypothesize that the easy-to-detect nature of GPT-3.5 does not originate from specific tokens but from certain language patterns (which can be demonstrated by spans of tokens). The reason might be that advanced LLMs fit the corpus extremely well, so they generate more general expressions, which are much easier for fine-tuned detectors to anticipate. A case study for token importance illustration is shown in Appendix H.2. Further, we discuss more topics in the Appendix, e.g., the effect of hyper-parameters (F), case studies (H), static geometric analysis on the coherence graph (I), and exploration of imbalanced data (J).

Conclusion
In this paper, we propose COCO, a coherence-enhanced contrastive learning model for MGT detection. We construct a novel coherence graph from the document and implement a MoCo-based contrastive learning framework to improve model performance in low-resource settings. An innovative encoder composed of a relation-aware GCN and an attention LSTM learns the coherence representation from the coherence graph, which is further combined with the sequence representation of the document. To alleviate the effect of unnecessary easy samples, we propose an improved contrastive loss that forces the model to pay more attention to hard negative samples. COCO outperforms competing detectors on texts generated by GROVER, GPT-2, and GPT-3.5, in both low-resource and high-resource settings. We also find that the outputs of the advanced GPT-3.5 are more detectable, and we explore the rationale behind this phenomenon from the perspectives of spans of tokens and individual tokens.

Limitations
In this work, we step forward to better distinguishing MGTs under the low-resource setting. However, several limitations remain for broader applications of this detector. Firstly, MGTs are easier to generate and collect than HWTs, which may cause an imbalanced label distribution in the dataset, and COCO collapses under extremely imbalanced data distributions, as shown in Appendix J. Future work could build on COCO's contrastive learning method with innovations in sampling strategy for harsh low-resource and imbalanced-data settings. Secondly, our method generates a coherence graph for every entry, which is not efficient for larger datasets. Moreover, short texts, code, and mathematical proofs, for which coherence graphs are hard to generate, are also poorly covered by COCO. More distinctive and easy-to-calculate features are worth exploring for generating distinguishable text representations efficiently while better understanding the essence of TGMs. Thirdly, with instruction-based generation and human-in-the-loop fine-tuned models prevailing, the strategies and defects of TGMs change slightly but constantly. The entity relations with the same semantic granularity and concretization used in this paper may not be enough to detect high-quality TGM content in the future; more generative and adaptive detection models should be considered.

Ethical Considerations
We provide insight into a potential weakness of TGMs and publish the GPT-3.5 news datasets. We understand that our findings could be viciously used to confront detectors, and that malicious users could copy the contents of our GPT-3.5 news dataset to disguise machine-written news as real and publish it. However, with the purpose of calling attention to detecting and controlling possible misuse of TGMs, we believe our work will inspire the advancement of stronger MGT detectors and help prevent potential negative uses of language models.
Our work complies with the sharing & publication policy of OpenAI 7 and all data we collect is in the public domain and licensed for research purposes.
7 https://openai.com/api/policies/sharing-publication/

The following data shows an example HWT entry in the dataset:

"title": "On Eve of World Cup, FIFA Chief Says, 'Don't Criticize Qatar; Criticize Me.'",
"text": "DOHA, Qatar. The president of world soccer's governing body on Saturday sought to blunt mounting concerns about the World Cup in Qatar with a strident defense of both the host country's reputation and FIFA's authority over its showpiece championship. ... Citing statistics, history and even childhood to bolster his case, he at one point likened his own experience as a redheaded child of immigrants to Switzerland to the assimilation problems of gays in the Middle East, and defended the laws, customs and honor of the host country.",
"authors": ["Tariq Panja"],
"publish_date": "2022-11-19 00:00:00",
"source": "The New York Times",
"url": "https://www.nytimes.com/2022/11/19/sports/soccer/world-cup-gianni-infantino-fifa.html"

And the following data shows the corresponding MGT in the dataset.
"title": "On Eve of World Cup, FIFA Chief Says, 'Don't Criticize Qatar; Criticize Me.'", "text": "The 2022 FIFA World Cup in Qatar is fast approaching, and its organizing committee's president, Gianni Infantino, is speaking out about the lingering criticism of the country hosting the event....... he said."It is a once-in-a-lifetime opportunity for the region to show the world its values and aspirations, and it is vital that this event is seen as a celebration of football and a celebration of the region."","authors": "machine", "source": "The New York Times", "matched_hwt_id": 202, "label": "machine""

B.1 Human Written Texts
Unmixed Subset. The HWTs of the unmixed subset are all from The New York Times, excluding the impact of writing style. The time span of our data is Nov 1, 2022 - Dec 25, 2022, ensuring that no pre-trained model has learned from them. We develop the crawler based on news-crawler.
Mixed Subset. The HWTs of the mixed subset come from various sources, listed in Table 7. The time span of the data is Jan 1, 2022 - Jan 7, 2023. We develop the crawler based on Newspaper3k.
The dataset is specifically designed for MGT detection and for improving generation models. The contents of the dataset are obtained from official news websites, and the names of individual people are not mentioned maliciously. We strongly reject using our dataset to create offensive content or to peek at private information.
As the GPT-3.5 and ChatGPT models need prompts to generate, we write hints for the generation models so that the generated texts meet our news-style long-text generation requirement. The hint format is as follows, with content related to the HWTs.
Write a news more than 1000 words. The news is written by {Authors} from {Source} in {date}. Title is {title}.

C GPT-3.5 Dataset Generated by Different Prompts and Experiment Results
To further validate the conclusion that GPT-3.5-generated texts are easier to detect, we use CNN news as references and design different prompts for GPT-3.5 generation. The principle is to provide as much information as possible to GPT-3.5 to alleviate possible gaps in semantics and in length.
Keywords as Prompt (KP). We extract keywords and entities with GPT-3.5-turbo and provide examples from the original news to form the prompt for generation. The prompt format is as follows.
Example prompt for generation.

Summary as Prompt (SP). We employ GPT-3.5-turbo to summarize the original texts. The compression ratio is set to [0.3, 1.0], meaning the summary must be longer than 0.3 of the length of the original text and shorter than the whole original text. The generated summary is used as the prompt, in the following format:

Generate a news based on the following abstract: Paris Saint-Germain's coach Christophe Galtier has stated that Lionel Messi is not expected to join the team until early January as he is spending time in Argentina following the World Cup. Kylian Mbappé, Neymar Jr. and Achraf Hakimi, who played for their respective national teams at Qatar 2022, could return to the team as long as they are physically and mentally fit... The news is written by Matias Grez from CNN in 2022-12-28 00:00:00. Title: Lionel Messi isn't expected to be back with PSG until early January after World Cup success News:

Outline as Prompt (OP). We also outline the skeleton of the original texts with GPT-3.5-turbo and feed the outline into GPT-3.5 text-davinci-003. The prompt format is as follows:

Prompt for extraction.
"role": "system", "content": "Write a hierarchical multi-point outline for the paragraph.""role": "user", "content": {text} Example prompt for generation.We first remove the HWTs that do not have desired length (i.e., 200-1024 tokens).And we take half of the selected HWTs as references to formulate different prompts mentioned above and feed it into GPT-3.5 to get MGTs.The MGTs are sampled by Gaussion Distribution of their lengths.To avoid the possible label leakage brought by text length, we directly filter the no-reference HWTs according to the Gaussion Distribution of MGT lengths.
Besides the self-constructed datasets, we also utilize the published GPT-3.5 dataset from the TuringBench benchmark (abbreviated as GPT-3.5 (TB)) (Uchendu et al., 2020) to validate the deceptiveness of GPT-3.5. The statistics of the datasets we use are in

We conduct experiments with 3 random seeds, and the average results are shown in Table 9. Counterintuitively, even when we elaborate the prompts and eliminate the length difference between MGTs and HWTs, the detection results remain strong, even with outdated baselines like GPT-2. The conclusion might be counterintuitive, but texts generated by the most advanced and popular models are the easiest to detect.

Table 9: Experiment of different detectors on different GPT-3.5 datasets. *: The large performance difference between the validation set and the test set on GPT-3.5 (TB) arises because the test set randomly samples 50% of the words of each article in the dataset (Uchendu et al., 2021). We do not test COCO on GPT-3.5 (TB) because this operation greatly disrupts the coherence of the texts. We provide an example of this in Table 10.
GPT-3.5 (TB): '.video : morne morkel press conference * cricbuzz.video: england cricbuzz.bevanleads scotland 's 21-man squad for their first ever test match against pakistan in edinburgh icc.chris rogers retires after champions trophy defeat : australian cricketer announces international retirement the sun.icc super eight teams : odi ranking results.bahrainhost oman on sunday kitply hans vohra gold cup gulf today.iccresults.newzealand series history : india v new zealandyazan mohsen qawasma : how bahrain caught

GPT-3.5 (OP): Recent changes to key international indexes have resulted in the unprecedented exclusion of Russian stocks at a "zero" price, causing further losses in Moscow's already-dismal stock exchange. This exclusion has made Russia no longer an option for investors, prompting a shift to other emerging markets.\n\nThe dramatic shift was made in early March, when FTSE Russell and MSCI announced the removal of Russian stocks from their indexes due to the country's escalating economic and geopolitical problems. Shortly after, the Moscow Exchange suspended trading, sending ripples through the market.\n\nThe possible default on Russian debt has Western investors further reconsidering their investments in Russia...

Table 10: A comparison example between texts in the test set of GPT-3.5 (TB) and GPT-3.5 (OP). The GPT-3.5 (TB) text shows great disorder, while the GPT-3.5 (OP) text is neat.

D Implementation Details
This part describes the implementation details and hyper-parameter settings of all the methods in the experiment. To imitate the low-data-resource situation, we randomly sample 500 entries from each dataset as the limited dataset (positive:negative = 1:1), and models are tested on both the limited and complete datasets. We conduct experiments with 10 different seeds and report the average test accuracy, F1-score, and standard deviation only for model-based methods, because metric-based methods are not affected by random seeds.
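The balanced limited-dataset construction can be sketched in a few lines; this is a generic illustration (function and variable names are ours, not from the released code):

```python
import random

def sample_limited_dataset(pos, neg, n=500, seed=0):
    """Draw a balanced limited dataset: n/2 positive (HWT) and
    n/2 negative (MGT) examples, without replacement."""
    rng = random.Random(seed)
    half = n // 2
    return rng.sample(pos, half), rng.sample(neg, half)
```

Repeating this with 10 different seeds and averaging the resulting test metrics gives the reported numbers.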
We use the RoBERTa-base model to initialize the embeddings of our representation and optimize the model with the AdamW optimizer (Loshchilov and Hutter, 2018) with a 0.01 weight decay. We set the initial learning rate to 10−5 and the batch size to 8 for all datasets, based on experience.
We use the transformers, pytorch, and allennlp packages to implement COCO. The GPT-3.5 datasets and ChatGPT cases are generated via the OpenAI API and website. We spent $300 on API costs, including development and final generation. We train and run experiments on 8 NVIDIA A100 GPUs across 2 Ubuntu-based servers.
The total budget for training 20 epochs, validation, and testing on the GROVER dataset is 2.5 hours; on the GPT-2 dataset it is 12 hours, and on the GPT-3.5 dataset 1.5 hours. We will publish our code and datasets soon.

E More Comparison Experiments
Providing empirical evidence of effectiveness is a relatively broad topic. In Table 1 we have shown that COCO outperforms on 4 datasets (8 settings) compared with 6 models, including RoBERTa and CE+SCL, the SOTA among model-based methods, and DetectGPT, the SOTA among metric-based methods. Moreover, our model outperforms in a wide range of scenarios. Due to the page limit, we do not include all the results in the main text, so we share a more comprehensive comparison here.
• Perturbed metric-based methods: DetectGPT (Mitchell et al., 2023). In the nomenclature of Table 11, the number '10' denotes the number of perturbation samples; the letter 'd' means not normalized over the distribution, while 'z' means normalized.
Table 11 reveals the outstanding performance of COCO in almost all scenarios. Moreover, we also observe the following phenomena: • In line with intuition, off-the-shelf models are only competitive in the scenarios they were designed for. OpenAI-detector performs well on the GPT-2 and GPT-Neo datasets, while ChatGPT-detector, in contrast, excels on GPT-3, GPT-4, Llamas, GPT-J, and GPT-Neo.
• Probability metric-based methods rely on the likelihood from the generation model and are mainly designed for white-box machine-generated text detection. For white-box models like GPT-2, GPT-Neo, and GPT-J, their performance is relatively good, but when applied to fully black-box models, these methods can easily fail. DetectGPT, the perturbed metric-based method, shares the same limitation due to its similar mechanism.
• Among all the fine-tuned model-based methods, RoBERTa-base shows the best average performance across all datasets compared to other base models. This supports our choice of RoBERTa as the SOTA base model for this category, on which we further build the CL methods and COCO.
F Effect of Hyper-Parameters

F.1 Contrastive Learning Parameters
We evaluate the influence of the contrastive learning hyper-parameters α and τ with experiments on different combinations of them. The results are shown in Fig. 4. Considering the finding that a smaller τ leads to better hard-negative mining (Wang and Liu, 2021), we select α from {0.1, 0.2, ..., 0.9} and τ from {0.1, 0.2, 0.3}. We find that extreme α values cause performance degradation, and the best hyper-parameter combination is α, τ = 0.6, 0.2. Our analysis is that a large α forces the model to concentrate on instance-level contrast, while a small α lets the class-separation objective take control; both reduce the generalization performance of the detector on the test set.
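As a rough illustration of how α and τ could interact, the sketch below blends a standard supervised contrastive term (class separation) with a plain instance-level InfoNCE term. This is a generic sketch under our own simplifying assumptions, not the exact improved contrastive loss of COCO; all function names are illustrative:

```python
import numpy as np

def _log_softmax_rows(sim):
    """Numerically stable row-wise log-softmax."""
    sim = sim - sim.max(axis=1, keepdims=True)
    return sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

def sup_con_loss(z, labels, tau):
    """Supervised contrastive loss: same-class pairs act as positives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    n = len(z)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
    logp = _log_softmax_rows(sim)
    loss = 0.0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        loss += -logp[i, pos].mean()
    return loss / n

def combined_loss(z, z_aug, labels, alpha, tau):
    """alpha weights instance-level contrast against an augmented view;
    (1 - alpha) weights class-level separation."""
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    za_n = z_aug / np.linalg.norm(z_aug, axis=1, keepdims=True)
    sim = z_n @ za_n.T / tau  # instance i's positive is its own other view
    logp = _log_softmax_rows(sim)
    inst = -np.mean(np.diag(logp))
    return alpha * inst + (1 - alpha) * sup_con_loss(z, labels, tau)
```

Setting α near 1 reduces this to pure instance discrimination, while α near 0 reduces it to pure class separation, matching the trade-off discussed above.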

F.2 Graph Parameters
We further investigate the effect of the max node number and max sentence number on model performance. The results are shown in Fig. 5. We select the max node number from {60, 90, 120, 150} and the max sentence number from {30, 45, 60, 75}. The detector performs best when the max node number is 90 and the max sentence number is 45. The experimental results show that large node and sentence numbers are not necessary for improving detection accuracy. We infer that even though large node and sentence budgets include more entity information, excessive nodes bring noise to the model and impair the distinguishability of the coherence feature.
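How such budgets might be enforced can be sketched as follows; this is a simplified illustration, not COCO's actual preprocessing (the node ordering and edge format are assumptions):

```python
def truncate_graph(nodes, edges, max_nodes=90):
    """Keep only the first max_nodes nodes of a coherence graph and
    drop every edge that touches a removed node."""
    keep = set(nodes[:max_nodes])
    kept_edges = [(u, v) for (u, v) in edges if u in keep and v in keep]
    return nodes[:max_nodes], kept_edges
```

The same capping would apply to the sentence list with the max sentence number.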

G Ablation Study
In Sec. 4.4, we mainly show the performance gain on the GROVER dataset. To further verify the effectiveness of COCO in other scenarios, we also perform the ablation study on the 500-example GPT-2, GPT-3.5-Unmixed, and GPT-3.5-Mixed datasets.
The result is shown in Table 12.
Here, we add a new ablated setting, COCO (ICL), which applies the improved contrastive learning we proposed but does not include any part of the coherence graph representation model (i.e., Coherence and LSTM).
By comparing COCO (Coherence) with COCO (Plain), we can evaluate the effectiveness of the coherence model: it shows an average improvement of 1.14% accuracy and 1.54% F1 over the plain version. Furthermore, adding the attention LSTM for concatenation achieves a further 1.26% accuracy enhancement.
Moreover, by comparing COCO (ICL) and COCO, we further show the effectiveness of the coherence model on top of the ICL model: there is a gap of 0.86% accuracy and 0.61% F1 between the variants with and without the coherence model. This shows that the contribution of the coherence model component does not heavily overlap with that of the ICL method component. In conclusion, both components contribute to the overall performance.

H.2 Token Importance in GPT-3.5 Detection
As shown in Fig. 7, we take segments from two text pairs, each consisting of an HWT and its corresponding MGT, from the GPT-3.5 mixed and GROVER datasets. It can be noticed that consecutive spans in text generated by GPT-3.5 tend to contribute more to the model decision, whereas in HWTs the model pays more attention to individual tokens. Following this observation, we infer that with the increase of model scale, LLMs fit the corpus so well that they generate more generic expressions than HWTs, following certain patterns (often manifested as spans of tokens) that fine-tuned models can anticipate. Thus, nearly all methods show almost perfect performance on the GPT-3.5 dataset.
As for the GROVER dataset, more tokens contribute negatively to the model prediction, even when the prediction is correct. This reflects the deceptive nature of GROVER and, to some extent, explains why it is the hardest dataset in our experiments.

I Static Geometric Analysis on Coherence Graph
We have observed a performance enhancement from applying the graph-based coherence model to the detection model, but how does the coherence graph help detection? In this section, we apply static geometric feature analysis to the coherence graphs we construct, to evaluate and explain the distinguishable differences between HWTs and MGTs. In the following discussion, we analyze the GROVER dataset. Some basic metrics of the data and the corresponding graphs are shown in Table 13. Although HWTs and MGTs have approximately the same number of tokens per text, the coherence graphs of HWTs are larger in scale than those of MGTs, with 34.7% more vertices and 64.1% more edges, which shows that HWTs have more complex semantic relation structures than MGTs.
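The scale metrics above (vertex count, edge count, average degree) can be computed in a few lines. This is a generic sketch over an undirected edge list, which is an assumption about how the coherence graph is stored:

```python
def graph_stats(edges):
    """Basic scale metrics of a graph given as an undirected edge list."""
    nodes = {u for e in edges for u in e}
    degree = {v: 0 for v in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {
        "num_nodes": len(nodes),
        "num_edges": len(edges),
        "avg_degree": sum(degree.values()) / len(nodes) if nodes else 0.0,
    }
```

Averaging these statistics over the HWT and MGT graph populations yields the comparison in Table 13.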

I.1 Degree Distribution
Semantically, the degree of the coherence graph measures the co-occurrence and TF-IDF features of keywords. Moreover, the degree distribution reflects global coherence, because high-degree nodes serve the main topic while low-degree nodes are its extensions. As shown in Table 14, the average degree of the graph representation of HWTs is 2.980, which is 15.0% larger than that of MGTs (2.591), showing MGTs' weaker ability to form coherent interactions between sentences. Fig. 8 shows the distribution of each graph's average node degree, and the distribution for HWTs has a longer tail than that for MGTs.
Furthermore, we analyze the distinguishability of degree features when impacted by other factors. One of the most considerable influences is the style and genre of different provenances. We chose around 60 articles each from The Sun and Boston, and then used GROVER to mimic their styles and generate news on similar topics. Fig. 9 shows the degree distributions of HWTs and MGTs for both provenances.
We use the Jensen-Shannon (JS) divergence to evaluate the similarity of the degree distributions. The JS divergence between MGTs mimicking The Sun and Boston is 0.029, while the JS divergence between MGTs and HWTs is 0.050 on Boston and 0.061 on The Sun. This apparent gap shows that the degree distribution can robustly separate MGTs from HWTs even under provenance differences.
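The JS divergence between two binned degree distributions can be computed directly; this sketch uses natural-log base (the paper does not specify the base or binning, so treat both as assumptions):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base e) between two discrete
    distributions given as non-negative weight vectors."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        return float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

It is symmetric, zero for identical distributions, and bounded above by ln 2 in this base.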

I.2 Aggregation
Aggregation is a metric shared by complex networks and linguistics, depicting how closely the whole is organized around its core. We propose two metrics to evaluate the aggregation of the graph-based text representation in our coherence model: the size of the largest connected subgraph and the clustering coefficient.
In our representation, not all sentences have entities related to others; hence the graph is unconnected. We propose that the size of the largest connected subgraph reflects the content that is closely organized around the topic. Moreover, the absolute size of graphs may be an unfair factor, so we use the portion of nodes in the largest connected subgraph to measure its size. The average portion in HWTs is 0.6725 and in MGTs is 0.6458. Fig. 10 shows the distribution of this portion across graphs, and HWTs have more high-portion graphs than MGTs.
The clustering coefficient represents how nodes tend to cluster. For the entities of texts, clustering evaluates how the author narrates around the central theme: the larger the clustering coefficient, the tighter the semantic structure. The average clustering coefficient of the graphs of HWTs is 0.2213 and of MGTs is 0.1983, i.e., HWTs are 11.6% higher than MGTs. Fig. 11 shows the distribution.

I.3 Network Structure Entropy

We define the network structure entropy as E = −Σ_{i=1}^{N} I_i ln I_i, with I_i = k_i / Σ_{j=1}^{N} k_j, where I_i is the information content represented by the degree distribution, N is the number of nodes, and k_i is the degree of the i-th node.
From our perspective, global coherence amounts to encoding more information in the semantic structure of the whole text, which corresponds to the structure entropy of our graph representation. In our experiments, the structure entropy of HWTs (2.263) is 6.80% larger than that of MGTs (2.119), which means HWTs carry more structured information because their semantic information is globally organized. We show the distribution of network structure entropy in Fig. 13.
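The structure entropy can be computed directly from a graph's degree sequence; this sketch assumes the standard degree-based definition I_i = k_i / Σ_j k_j, which matches the variables described above:

```python
import math

def structure_entropy(degrees):
    """Network structure entropy E = -sum_i I_i * ln(I_i),
    with I_i = k_i / sum_j k_j over the degree sequence."""
    total = sum(degrees)
    ent = 0.0
    for k in degrees:
        if k > 0:
            p = k / total
            ent -= p * math.log(p)
    return ent
```

For a regular graph (all degrees equal) the entropy attains its maximum ln N, and it decreases as the degree distribution becomes more concentrated, e.g. in a star graph.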

J Exploration on Imbalanced Data
An imbalanced data distribution is another crucial limitation in the task of MGT detection, similar to the low-resource limitation. It is conceivable that, with the development of generation technology, MGTs will overwhelmingly dominate low-quality articles, since they are easier and faster to produce than human writing. The detection model will then face training data with MGTs as the majority and HWTs as the minority. We test the current models under this imbalanced setting and find a dramatic decline in accuracy when the ratio of HWTs is less than 30%, as shown in Fig. 14. The test is based on the 10% GROVER dataset. All models show poor performance at low HWT ratios: with an HWT proportion of 0.1 (only 100 HWTs in the training set in this case), most models have an accuracy below 50%, which is close to random and reflects their intolerance of extreme cases. Besides, we find that a high proportion of HWTs also causes a decrease in the F1 score to some extent.
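The imbalanced training splits used above can be constructed as follows; this is an illustrative helper of our own (the total of 1000 matches the "100 HWTs at ratio 0.1" setting described in the text, but the exact split code is an assumption):

```python
import random

def make_imbalanced_train_set(hwt, mgt, total=1000, hwt_ratio=0.1, seed=0):
    """Build a training set of the given size with the requested
    proportion of HWT examples; the rest are MGT examples."""
    rng = random.Random(seed)
    n_hwt = int(total * hwt_ratio)
    return rng.sample(hwt, n_hwt), rng.sample(mgt, total - n_hwt)
```

Sweeping hwt_ratio over the 9 values in Fig. 14 reproduces the evaluation grid.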

K Related Work: Graph-based Text Representation
The Graph-of-Words (GoW) model (Turney, 2002; Mihalcea and Tarau, 2004) is a graph representation method in which each document is represented by a graph whose nodes correspond to terms and whose edges capture co-occurrence relationships between terms. Using GoW, keywords can be extracted by retaining the document graph (Turney, 2002). Graph representations are thus sensible to apply in tasks like information retrieval (Blanco and Lioma, 2011), categorization (Malliaros and Skianis, 2015), and sentiment classification (Huang and Carley, 2019; Hou et al., 2021). Most models enhance classification or detection performance by combining graph representations with neural networks. Text-GCN (Yao et al., 2019) first builds a single large graph for the whole corpus, followed by Tensor-GCN (Liu et al., 2020) with a tensor representation. Moreover, the relations between words vary and should be treated as different edge types. COCO matches keyword PLM embeddings to nodes and sentence representations, handles inner- and inter-sentence relations differently in the GCN, and merges the structured graph and flat sequence representations for accurate prediction.
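The classic GoW construction described above can be sketched with a sliding co-occurrence window; this is a generic textbook version (window size and tokenization are assumptions, and it is not COCO's coherence graph):

```python
def graph_of_words(tokens, window=3):
    """Build an undirected Graph-of-Words: nodes are unique terms,
    edges link terms that co-occur within a sliding window."""
    edges = set()
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if u != v:
                edges.add(tuple(sorted((u, v))))
    return sorted(set(tokens)), sorted(edges)
```

Keyword extraction then amounts to ranking nodes of this graph, e.g. by degree or by a random-walk score as in TextRank.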

Figure 2 :
Figure 2: Overview of COCO. The input document is parsed to construct a coherence graph (3.1); the text and graph are used in a supervised contrastive learning framework (3.2), in which a coherence encoding module is designed to encode and aggregate them into a coherence-enhanced representation (3.2.3). After that, we employ a MoCo-based contrastive learning architecture, in which key encodings are stored in a dynamic memory bank (3.2.4), with the improved contrastive loss, to make the final prediction (3.2.5).

Figure 3 :
Figure 3: Illustration of the CEM. It encodes and fuses the coherence graph and text sequence to generate the coherence-enhanced representation of a document.

Input: X, consisting of documents D and corresponding coherence graphs G; hyper-parameters such as the dynamic memory bank size M and batch size S; labels Y
Output: A learned model COCO, consisting of a key encoder f_k with parameters θ_k, a query encoder f_q with parameters θ_q, and a classifier f_c with parameters θ_c
1: Initialize θ_k = θ_q, θ_c
2: Initialize the dynamic memory bank with f_k(x_1, x_2, ..., x_M), where x_i is randomly sampled from X
3: Freeze θ_k
4: epoch ← 0
5: while epoch ≤ epoch_max do
6:

Figure 4 :
Figure 4: Effect of parameters α and τ on model performance.

Figure 5 :
Figure 5: Performance of COCO with different graph parameters.

Figure 6 :
Figure 6: An illustration for the case study of our method. Entities in documents are colored green. The blue solid boxes indicate sentences. The orange dashed lines are inner edges and the green dashed lines are inter edges. Numbers in red indicate the probability of the predicted label.

"Figure 7 :
Figure 7: Visualization of token attributions.The first text pair is sampled from GPT-3.5 mixed dataset and the second text pair is from GROVER dataset.The tokens in green represent contributing positively to the predicted label, while those in red contribute negatively.Label "0" represents HWT, and Label "1" represents MGT.Metric Avg.Degree HWT 2.980 MGT 2.591

Figure 8 :
Figure 8: Distribution of average degree of graphs.

Figure 9 :
Figure 9: Distribution of degree with different provenance.

Figure 12 :
Figure 12: Core-number of nodes in graphs

Figure 14 :
Figure 14: Model comparison results on the DL dataset with 9 different HWT proportions.

Table 1 :
Results of the model comparison. It should be noted that DualCL is easily affected by the random seed, which may be caused by its weakness in understanding long texts. We do not present experiment results for DualCL on the GPT-3.5 dataset because its documents are so long that DualCL completely fails.

Table 3 :
Model robustness to different perturbations.

Table 11 :
Comprehensive experimental results in a wide range of scenarios. Same as the limited setting in Sec. 4.1, 500 examples are used to fine-tune these models.

Table 13 :
Basic metrics of texts and corresponding graphs.