@inproceedings{zu-etal-2025-effects,
title = "Effects of Generation Model on Detecting {AI}-generated Essays in a Writing Test",
author = "Zu, Jiyun and
Fauss, Michael and
Li, Chen",
editor = "Wilson, Joshua and
Ormerod, Christopher and
Beiting Parrish, Magdalen",
booktitle = "Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers",
month = oct,
year = "2025",
address = "Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States",
publisher = "National Council on Measurement in Education (NCME)",
url = "https://aclanthology.org/2025.aimecon-sessions.11/",
pages = "92--98",
ISBN = "979-8-218-84230-7",
abstract = "Various detectors have been developed to detect AI-generated essays using labeled datasets of human-written and AI-generated essays, with many reporting high detection accuracy. In real-world settings, essays may be generated by models different from those used to train the detectors. This study examined the effects of generation model on detector performance. We focused on two generation models {--} GPT-3.5 and GPT-4 {--} and used writing items from a standardized English proficiency test. Eight detectors were built and evaluated. Six were trained on three training sets (human-written essays combined with either GPT-3.5-generated essays, or GPT-4-generated essays, or both) using two training approaches (feature-based machine learning and fine-tuning RoBERTa), and the remaining two were ensemble detectors. Results showed that a) fine-tuned detectors outperformed feature-based machine learning detectors on all studied metrics; b) detectors trained with essays generated from only one model were more likely to misclassify essays generated by the other model as human-written essays (false negatives), but did not misclassify more human-written essays as AI-generated (false positives); c) the ensemble fine-tuned RoBERTa detector had fewer false positives, but slightly more false negatives than detectors trained with essays generated by both models."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="zu-etal-2025-effects">
    <titleInfo>
      <title>Effects of Generation Model on Detecting AI-generated Essays in a Writing Test</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Jiyun</namePart>
      <namePart type="family">Zu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Michael</namePart>
      <namePart type="family">Fauss</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Chen</namePart>
      <namePart type="family">Li</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-10</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Joshua</namePart>
        <namePart type="family">Wilson</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Christopher</namePart>
        <namePart type="family">Ormerod</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Magdalen</namePart>
        <namePart type="family">Beiting Parrish</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>National Council on Measurement in Education (NCME)</publisher>
        <place>
          <placeTerm type="text">Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-218-84230-7</identifier>
    </relatedItem>
    <abstract>Various detectors have been developed to detect AI-generated essays using labeled datasets of human-written and AI-generated essays, with many reporting high detection accuracy. In real-world settings, essays may be generated by models different from those used to train the detectors. This study examined the effects of generation model on detector performance. We focused on two generation models – GPT-3.5 and GPT-4 – and used writing items from a standardized English proficiency test. Eight detectors were built and evaluated. Six were trained on three training sets (human-written essays combined with either GPT-3.5-generated essays, or GPT-4-generated essays, or both) using two training approaches (feature-based machine learning and fine-tuning RoBERTa), and the remaining two were ensemble detectors. Results showed that a) fine-tuned detectors outperformed feature-based machine learning detectors on all studied metrics; b) detectors trained with essays generated from only one model were more likely to misclassify essays generated by the other model as human-written essays (false negatives), but did not misclassify more human-written essays as AI-generated (false positives); c) the ensemble fine-tuned RoBERTa detector had fewer false positives, but slightly more false negatives than detectors trained with essays generated by both models.</abstract>
    <identifier type="citekey">zu-etal-2025-effects</identifier>
    <location>
      <url>https://aclanthology.org/2025.aimecon-sessions.11/</url>
    </location>
    <part>
      <date>2025-10</date>
      <extent unit="page">
        <start>92</start>
        <end>98</end>
      </extent>
    </part>
  </mods>
</modsCollection>
%0 Conference Proceedings
%T Effects of Generation Model on Detecting AI-generated Essays in a Writing Test
%A Zu, Jiyun
%A Fauss, Michael
%A Li, Chen
%Y Wilson, Joshua
%Y Ormerod, Christopher
%Y Beiting Parrish, Magdalen
%S Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers
%D 2025
%8 October
%I National Council on Measurement in Education (NCME)
%C Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States
%@ 979-8-218-84230-7
%F zu-etal-2025-effects
%X Various detectors have been developed to detect AI-generated essays using labeled datasets of human-written and AI-generated essays, with many reporting high detection accuracy. In real-world settings, essays may be generated by models different from those used to train the detectors. This study examined the effects of generation model on detector performance. We focused on two generation models – GPT-3.5 and GPT-4 – and used writing items from a standardized English proficiency test. Eight detectors were built and evaluated. Six were trained on three training sets (human-written essays combined with either GPT-3.5-generated essays, or GPT-4-generated essays, or both) using two training approaches (feature-based machine learning and fine-tuning RoBERTa), and the remaining two were ensemble detectors. Results showed that a) fine-tuned detectors outperformed feature-based machine learning detectors on all studied metrics; b) detectors trained with essays generated from only one model were more likely to misclassify essays generated by the other model as human-written essays (false negatives), but did not misclassify more human-written essays as AI-generated (false positives); c) the ensemble fine-tuned RoBERTa detector had fewer false positives, but slightly more false negatives than detectors trained with essays generated by both models.
%U https://aclanthology.org/2025.aimecon-sessions.11/
%P 92-98
Markdown (Informal)
[Effects of Generation Model on Detecting AI-generated Essays in a Writing Test](https://aclanthology.org/2025.aimecon-sessions.11/) (Zu et al., AIME-Con 2025)
ACL
Jiyun Zu, Michael Fauss, and Chen Li. 2025. Effects of Generation Model on Detecting AI-generated Essays in a Writing Test. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers, pages 92–98, Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States. National Council on Measurement in Education (NCME).
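
For readers who want a concrete starting point, below is a minimal sketch of the fine-tuning approach the abstract describes: training a RoBERTa binary classifier to separate human-written from AI-generated essays. The file names, label convention, and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: fine-tune RoBERTa as a binary detector of AI-generated
# essays, in the spirit of the paper's fine-tuned detectors. Data files,
# the label convention (0 = human-written, 1 = AI-generated), and all
# hyperparameters are assumptions for illustration, not the authors' setup.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2
)

# Assumed CSV layout: columns "text" (essay) and "label" (0 human, 1 AI).
# "train_gpt35.csv" would hold human essays plus GPT-3.5-generated essays;
# analogous files could cover the GPT-4 and combined training conditions.
dataset = load_dataset(
    "csv",
    data_files={"train": "train_gpt35.csv", "validation": "dev.csv"},
)

def tokenize(batch):
    # Truncate essays to RoBERTa's 512-token input limit.
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="detector-gpt35",
    learning_rate=2e-5,               # assumed; typical for RoBERTa fine-tuning
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,              # enables dynamic padding per batch
)
trainer.train()
```

An ensemble detector in the abstract's sense could then be approximated by averaging the softmax probabilities of two such fine-tuned models (one trained against GPT-3.5 essays, one against GPT-4 essays) and flagging an essay as AI-generated when the mean probability crosses a threshold; the exact ensembling rule used in the paper is not specified here.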