Detecting Machine-Generated Long-Form Content with Latent-Space Variables

Yufei Tian; Zeyu Pan; Nanyun Peng

doi:10.18653/v1/2024.findings-emnlp.608

Abstract

The increasing capability of large language models (LLMs) to generate fluent long-form texts presents new challenges in distinguishing these outputs from those of humans. Existing zero-shot detectors that primarily focus on token-level distributions are vulnerable to real-world domain shifts, including different decoding strategies, variations in prompts, and attacks. We propose a more robust method that incorporates abstract elements, such as topic or event transitions, as key deciding factors, by training a latent-space model on sequences of events or topics derived from human-written texts. On three different domains, machine-generated texts that are originally inseparable from human-written ones at the token level can be better distinguished with our latent-space model, leading to a 31% improvement over strong baselines such as DetectGPT. Our analysis further reveals that, unlike humans, modern LLMs such as GPT-4 select event triggers and transitions differently, an inherent disparity that holds regardless of the generation configurations adopted in real time.
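
The idea of scoring a text by its event transitions rather than its tokens can be illustrated with a toy sketch. The code below is not the latent-space model from the paper: it swaps in a plain add-one-smoothed bigram model over event labels, and it assumes the event sequences have already been extracted from the texts. The class name `TransitionModel`, the method `avg_log_prob`, and the event labels are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


class TransitionModel:
    """Add-one-smoothed bigram model over discrete event/topic labels.

    A deliberately simple stand-in, NOT the paper's latent-space model.
    """

    def __init__(self):
        self.counts = defaultdict(Counter)  # counts[prev][next]
        self.vocab = set()

    def fit(self, sequences):
        """Count label-to-label transitions in human-written sequences."""
        for seq in sequences:
            self.vocab.update(seq)
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1
        return self

    def avg_log_prob(self, seq):
        """Mean log-probability per transition; higher means closer to
        the transitions seen in the training (human) sequences."""
        pairs = list(zip(seq, seq[1:]))
        if not pairs:
            return 0.0
        v = len(self.vocab) + 1  # +1 reserves mass for unseen labels
        total = 0.0
        for prev, nxt in pairs:
            row = self.counts.get(prev, Counter())
            total += math.log((row[nxt] + 1) / (sum(row.values()) + v))
        return total / len(pairs)


# Hypothetical event sequences, assumed to come from human-written stories.
human_event_sequences = [
    ["meet", "argue", "leave", "regret"],
    ["meet", "argue", "reconcile"],
    ["meet", "talk", "leave"],
]

model = TransitionModel().fit(human_event_sequences)
familiar = model.avg_log_prob(["meet", "argue", "leave"])
unfamiliar = model.avg_log_prob(["regret", "meet", "regret"])
```

In this toy setup, a sequence whose transitions resemble the training data receives a higher score than one with unseen transitions, so a threshold on the score could act as a crude detector. How the actual method extracts events and models their transitions is described in the paper itself.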

Anthology ID: 2024.findings-emnlp.608
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 10394–10408
URL: https://aclanthology.org/2024.findings-emnlp.608/
DOI: 10.18653/v1/2024.findings-emnlp.608
Cite (ACL): Yufei Tian, Zeyu Pan, and Nanyun Peng. 2024. Detecting Machine-Generated Long-Form Content with Latent-Space Variables. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10394–10408, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Detecting Machine-Generated Long-Form Content with Latent-Space Variables (Tian et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.608.pdf
