Yifei Huang
2025
MAGRET: Machine-generated Text Detection with Rewritten Texts
Yifei Huang | Jiuxin Cao | Hanyu Luo | Xin Guan | Bo Liu
Proceedings of the 31st International Conference on Computational Linguistics
With the rapid advancement in the text generation ability of Large Language Models (LLMs), concerns about the misuse of machine-generated content have grown, raising the risk of violations of legal and ethical standards. Some existing studies concentrate on detecting machine-generated text from open-source models using in-model features, but their performance on closed-source large models is limited. This limitation arises because, when detecting closed-source models, the only available reference is the generated text itself, which may differ significantly due to random sampling. In this paper, we demonstrate that texts generated by the same model can be aligned both semantically and statistically under similar prompts, facilitating effective detection and traceability. Specifically, we fine-tune a BERT encoder through contrastive learning to achieve semantic alignment among randomly generated texts from the same model. We then propose Machine-Generated Text Detection with Rewritten Texts (MAGRET), which designs several prompt-refactoring methods and uses them to request rewritten texts from LLMs. Semantic and statistical relationships between the rewritten and original texts provide a basis for detection and traceability. Finally, we expand the text dataset with multi-parameter random sampling and verify the performance of MAGRET on three text-generation datasets. Experimental results show that previous methods struggle with closed-source model detection, while our approach significantly outperforms baseline methods in this regard. The results also show MAGRET's stable performance in detection and tracing tasks across various randomly sampled texts.
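The core step described in the abstract is contrastive fine-tuning of a BERT encoder so that texts sampled from the same LLM land close together in embedding space. The sketch below illustrates that idea only; the model name, mean pooling, InfoNCE-style in-batch loss, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: contrastive fine-tuning of a BERT encoder so that texts
# sampled from the same source LLM (positives) are pulled together and texts
# from other LLMs in the batch act as negatives. Batch construction,
# temperature, and model name are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)
temperature = 0.07

def embed(texts):
    """Mean-pooled, L2-normalized sentence embeddings from the BERT encoder."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)        # (B, H)
    return F.normalize(pooled, dim=-1)

def info_nce_loss(anchors, positives):
    """anchors[i] and positives[i] come from the same LLM; every other
    pairing in the batch serves as an in-batch negative."""
    logits = anchors @ positives.T / temperature          # (B, B)
    labels = torch.arange(anchors.size(0))
    return F.cross_entropy(logits, labels)

# One illustrative training step on a toy batch of same-model text pairs.
anchor_texts = ["generation A from model X", "generation A from model Y"]
positive_texts = ["generation B from model X", "generation B from model Y"]
loss = info_nce_loss(embed(anchor_texts), embed(positive_texts))
loss.backward()
optimizer.step()
```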
2023
Pretraining Language Models with Text-Attributed Heterogeneous Graphs
Tao Zou | Le Yu | Yifei Huang | Leilei Sun | Bowen Du
Findings of the Association for Computational Linguistics: EMNLP 2023
In many real-world scenarios (e.g., academic networks, social platforms), different types of entities are not only associated with texts but also connected by various relationships, which can be abstracted as Text-Attributed Heterogeneous Graphs (TAHGs). Current pretraining tasks for Language Models (LMs) primarily focus on separately learning the textual information of each entity and overlook the crucial aspect of capturing topological connections among entities in TAHGs. In this paper, we present a new pretraining framework for LMs that explicitly considers the topological and heterogeneous information in TAHGs. Firstly, we define a context graph as neighborhoods of a target node within specific orders and propose a topology-aware pretraining task to predict nodes involved in the context graph by jointly optimizing an LM and an auxiliary heterogeneous graph neural network. Secondly, based on the observation that some nodes are text-rich while others have little text, we devise a text augmentation strategy to enrich textless nodes with their neighbors’ texts for handling the imbalance issue. We conduct link prediction and node classification tasks on three datasets from various domains. Experimental results demonstrate the superiority of our approach over existing methods and the rationality of each design. Our code is available at https://github.com/Hope-Rita/THLM.
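The two mechanisms described above, text augmentation for text-poor nodes and a topology-aware objective that jointly optimizes an LM with an auxiliary heterogeneous GNN, can be sketched as follows. This is not the released THLM code (see the linked repository); the toy relation-wise GNN, the additive fusion, the binary node-prediction loss, and all names and dimensions are assumptions for illustration.

```python
# Minimal sketch of (1) text augmentation for text-poor nodes and
# (2) a topology-aware pretraining objective scoring candidate context-graph
# nodes with an LM plus an auxiliary hetero-GNN. All details are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
lm = AutoModel.from_pretrained("bert-base-uncased")
hidden_dim = lm.config.hidden_size

def augment_text(node_text, neighbor_texts, min_len=10):
    """Enrich a text-poor node with its neighbors' texts."""
    if len(node_text.split()) >= min_len:
        return node_text
    return " ".join([node_text] + neighbor_texts)

def encode(texts):
    """[CLS] embeddings from the LM, shape (N, H)."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    return lm(**batch).last_hidden_state[:, 0]

class ToyHeteroGNN(torch.nn.Module):
    """Stand-in for the auxiliary heterogeneous GNN: one relation-specific
    projection per edge type, followed by mean aggregation."""
    def __init__(self, relations, dim):
        super().__init__()
        self.rel_proj = torch.nn.ModuleDict(
            {r: torch.nn.Linear(dim, dim) for r in relations})

    def forward(self, neighbor_embs_by_rel):
        msgs = [self.rel_proj[r](embs).mean(0)
                for r, embs in neighbor_embs_by_rel.items()]
        return torch.stack(msgs).mean(0)          # (H,)

gnn = ToyHeteroGNN(["writes", "cites"], hidden_dim)

# Topology-aware objective: the target node's (augmented) text embedding,
# fused with the GNN summary of its neighborhood, should score nodes inside
# its context graph higher than out-of-context nodes.
target_emb = encode([augment_text("", ["GNN survey", "graph pretraining"])])[0]
context = {"writes": encode(["author bio text"]),
           "cites":  encode(["cited paper abstract"])}
candidates = encode(["in-context node text", "random out-of-context text"])
labels = torch.tensor([1.0, 0.0])

fused = target_emb + gnn(context)                 # (H,)
scores = candidates @ fused                       # (2,)
loss = F.binary_cross_entropy_with_logits(scores, labels)
loss.backward()                                   # gradients reach both LM and GNN
```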