Simple models are all you need: Ensembling stylometric, part-of-speech, and information-theoretic models for the ALTA 2024 Shared Task

Joel Thomas, Gia Bao Hoang, Lewis Mitchell


Abstract
The ALTA 2024 shared task concerned automated detection of AI-generated text. Large language models (LLMs) were used to generate hybrid documents, in which each sentence was authored by either a human or a state-of-the-art LLM. Rather than rely on detection tools that are similarly computationally expensive, such as transformer-based methods, we approached this task using only an ensemble of lightweight “traditional” methods that could be trained on a standard desktop machine. Our approach used models based on word counts, stylometric features, readability metrics, part-of-speech tagging, and an information-theoretic entropy estimator to predict authorship. These models, combined with a simple weighting scheme, performed well on a held-out test set, achieving an accuracy of 0.855 and a kappa score of 0.695. Our results show that relatively simple, interpretable models can perform effectively on tasks like authorship prediction, even on short texts, which is important for the democratisation of AI as well as for future applications in edge computing.
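As a rough illustration of the kind of pipeline the abstract describes, the following minimal sketch combines per-sentence probabilities from several lightweight models with a weighted average, then scores the predictions with accuracy and Cohen's kappa. The model names, weights, probabilities, labels, and the 0.5 threshold are all hypothetical placeholders; the abstract does not specify the actual weighting scheme, so this should not be read as the authors' implementation.

```python
from typing import Dict, List


def weighted_ensemble(probs: Dict[str, List[float]],
                      weights: Dict[str, float]) -> List[float]:
    """Weighted average of each model's P(machine-generated) per sentence."""
    total = sum(weights[name] for name in probs)
    n_sentences = len(next(iter(probs.values())))
    return [
        sum(weights[name] * probs[name][i] for name in probs) / total
        for i in range(n_sentences)
    ]


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of sentences whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: List[int], y_pred: List[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_observed = accuracy(y_true, y_pred)
    p_expected = sum(
        (y_true.count(label) / n) * (y_pred.count(label) / n)
        for label in labels
    )
    if p_expected == 1.0:  # both label sequences are constant and identical
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical per-sentence probabilities from three lightweight models.
model_probs = {
    "stylometric": [0.9, 0.2, 0.6, 0.1],
    "pos_tags": [0.8, 0.4, 0.3, 0.2],
    "entropy": [0.7, 0.1, 0.7, 0.4],
}
# Hypothetical weights, e.g. chosen from validation performance.
model_weights = {"stylometric": 0.5, "pos_tags": 0.2, "entropy": 0.3}

combined = weighted_ensemble(model_probs, model_weights)
predictions = [1 if p > 0.5 else 0 for p in combined]  # 1 = machine, 0 = human
true_labels = [1, 0, 0, 0]  # made-up gold labels for illustration only

print("accuracy:", accuracy(true_labels, predictions))
print("kappa:", cohen_kappa(true_labels, predictions))
```

Reporting kappa alongside accuracy is useful here because kappa discounts the agreement that would be expected by chance from the label distributions alone, which matters when human and machine sentences are not evenly balanced.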
Anthology ID:
2024.alta-1.19
Volume:
Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association
Month:
December
Year:
2024
Address:
Canberra, Australia
Editors:
Tim Baldwin, Sergio José Rodríguez Méndez, Nicholas Kuo
Venue:
ALTA
Publisher:
Association for Computational Linguistics
Pages:
207–212
URL:
https://aclanthology.org/2024.alta-1.19/
Cite (ACL):
Joel Thomas, Gia Bao Hoang, and Lewis Mitchell. 2024. Simple models are all you need: Ensembling stylometric, part-of-speech, and information-theoretic models for the ALTA 2024 Shared Task. In Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association, pages 207–212, Canberra, Australia. Association for Computational Linguistics.
Cite (Informal):
Simple models are all you need: Ensembling stylometric, part-of-speech, and information-theoretic models for the ALTA 2024 Shared Task (Thomas et al., ALTA 2024)
PDF:
https://aclanthology.org/2024.alta-1.19.pdf