Dongping Chen
2024
LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?
Qihui Zhang | Chujie Gao | Dongping Chen | Yue Huang | Yixin Huang | Zhenyang Sun | Shilin Zhang | Weiye Li | Zhengyan Fu | Yao Wan | Lichao Sun
Findings of the Association for Computational Linguistics: NAACL 2024
With the rapid development and widespread application of Large Language Models (LLMs), the use of Machine-Generated Text (MGT) has become increasingly common, bringing with it potential risks, especially to quality and integrity in fields like news, education, and science. Current research mainly focuses on detecting pure MGT, without adequately addressing mixed scenarios such as AI-revised Human-Written Text (HWT) or human-revised MGT. To tackle this challenge, we define mixtext, a form of mixed text involving both AI- and human-generated content. We then introduce MixSet, the first dataset dedicated to studying these mixtext scenarios. Leveraging MixSet, we conduct comprehensive experiments to assess how well prevalent MGT detectors handle mixtext, evaluating their effectiveness, robustness, and generalization. Our findings reveal that existing detectors struggle to identify mixtext, particularly when facing subtle modifications and style adaptation. This research underscores the urgent need for more fine-grained detectors tailored to mixtext, offering valuable insights for future research. Code and models are available at https://github.com/Dongping-Chen/MixSet.
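A minimal sketch of the kind of effectiveness evaluation the abstract describes, not the paper's actual harness: it scores two toy samples with an off-the-shelf MGT detector from the Hugging Face Hub (the openai-community/roberta-base-openai-detector checkpoint is an assumption, as are the samples and label mapping) and reports AUROC.

```python
# Hedged sketch: score text with an off-the-shelf MGT detector and compute
# AUROC, the style of effectiveness measurement described in the abstract.
# The checkpoint, toy samples, and label mapping are assumptions.
from transformers import pipeline
from sklearn.metrics import roc_auc_score

detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",
)

# Toy mixtext stand-ins: label 1 = contains machine-generated content.
samples = [
    ("The committee met on Tuesday to review the annual budget.", 0),
    ("Leveraging synergistic paradigms, the budget was holistically optimized.", 1),
]

scores, labels = [], []
for text, label in samples:
    out = detector(text)[0]
    # This checkpoint emits "Real"/"Fake" labels; map to P(machine-generated).
    p_fake = out["score"] if out["label"] == "Fake" else 1.0 - out["score"]
    scores.append(p_fake)
    labels.append(label)

print("AUROC:", roc_auc_score(labels, scores))
```

The paper's core finding is that such binary detectors degrade on mixtext, where only part of a sample is machine-generated or machine-revised; evaluating on MixSet rather than on pure HWT/MGT pairs is what exposes that gap.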
Evaluating the Validity of Word-level Adversarial Attacks with Large Language Models
Huichi Zhou | Zhaoyang Wang | Hongtao Wang | Dongping Chen | Wenhan Mu | Fangyuan Zhang
Findings of the Association for Computational Linguistics: ACL 2024
Deep neural networks exhibit vulnerability to word-level adversarial attacks in natural language processing. Most of these attack methods perturb original samples with synonym substitutions to craft adversarial examples while attempting to maintain semantic consistency with the originals. Some claim to achieve attack success rates above 90%, raising serious safety concerns. However, our investigation reveals that many purportedly successful adversarial examples are actually invalid due to significant changes in semantic meaning relative to the originals. Even when equipped with semantic constraints such as BERTScore, existing attack methods can generate up to 87.9% invalid adversarial examples. Building on this insight, we first curate a 13K-example dataset for adversarial validity evaluation with the help of GPT-4. Then, an open-source large language model is fine-tuned to offer an interpretable validity score assessing the semantic consistency between original and adversarial examples. Finally, this validity score can serve as a guide for existing adversarial attack methods to generate valid adversarial examples. Comprehensive experiments demonstrate the effectiveness of our method in evaluating and refining the quality of adversarial examples.
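The filtering step the abstract describes can be sketched in a few lines. The paper fine-tunes an open-source LLM to produce the validity score; the stand-in below uses sentence-embedding cosine similarity instead, and the embedding model, threshold, and example pairs are assumptions.

```python
# Hedged sketch: gate adversarial candidates on a validity score measuring
# semantic consistency with the original. Cosine similarity of sentence
# embeddings stands in here for the paper's fine-tuned LLM scorer.
from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

pairs = [
    # (original, adversarial candidate)
    ("The movie was a delightful surprise.", "The film was a pleasant surprise."),
    ("The movie was a delightful surprise.", "The movie was a dreadful bore."),
]

THRESHOLD = 0.8  # illustrative cutoff between valid and invalid examples

for original, adversarial in pairs:
    emb = scorer.encode([original, adversarial], convert_to_tensor=True)
    validity = util.cos_sim(emb[0], emb[1]).item()
    verdict = "valid" if validity >= THRESHOLD else "invalid"
    print(f"{validity:.2f} -> {verdict}: {adversarial!r}")
```

Only candidates that clear the threshold would be counted as successful attacks, which is how the validity score "serves as a guide" for existing attack methods.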