Enhancing Essay Scoring with GPT-2 Using Back Translation Techniques

Aysegul Gunduz, Mark Gierl, Okan Bulut


Abstract
This study evaluates GPT-2 (small) for automated essay scoring on the ASAP dataset. Back-translation (English–Turkish–English) improved performance, especially on imbalanced sets. Quadratic weighted kappa (QWK) scores peaked at 0.77. The findings highlight the value of data augmentation and the need for more advanced, rubric-aware models for fairer assessment.
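
The agreement metric reported in the abstract, QWK, penalizes disagreements between human and model scores by the squared distance between them. The following is a minimal sketch of the standard formula, not the authors' evaluation code; the function name, its arguments, and the example scores are illustrative assumptions.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Standard QWK between two equal-length lists of integer scores.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_levels = max_score - min_score + 1
    if n_levels < 2:
        raise ValueError("need at least two score levels")

    total = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    hist_human = Counter(human)            # marginal counts for the human rater
    hist_model = Counter(model)            # marginal counts for the model

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: 0 on the diagonal, 1 at maximum distance.
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / total
            # Expected proportion if the two raters were independent.
            weighted_expected += weight * hist_human[i] * hist_model[j] / total**2

    if weighted_expected == 0:
        raise ValueError("QWK is undefined when both raters use a single score")
    return 1.0 - weighted_observed / weighted_expected


# Made-up scores on a 1-4 scale, for illustration only.
perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_order = quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1], 1, 4)
```

A value of 1 indicates perfect agreement and 0 indicates chance-level agreement, so the reported peak of 0.77 sits well above chance but short of perfect agreement with human raters.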
Anthology ID:
2025.aimecon-main.44
Volume:
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
Month:
October
Year:
2025
Address:
Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States
Editors:
Joshua Wilson, Christopher Ormerod, Magdalen Beiting Parrish
Venue:
AIME-Con
Publisher:
National Council on Measurement in Education (NCME)
Pages:
406–416
URL:
https://aclanthology.org/2025.aimecon-main.44/
Cite (ACL):
Aysegul Gunduz, Mark Gierl, and Okan Bulut. 2025. Enhancing Essay Scoring with GPT-2 Using Back Translation Techniques. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers, pages 406–416, Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States. National Council on Measurement in Education (NCME).
Cite (Informal):
Enhancing Essay Scoring with GPT-2 Using Back Translation Techniques (Gunduz et al., AIME-Con 2025)
PDF:
https://aclanthology.org/2025.aimecon-main.44.pdf