Thijs Brekhof


2024

Groningen team D at SemEval-2024 Task 8: Exploring data generation and a combined model for fine-tuning LLMs for Multidomain Machine-Generated Text Detection
Thijs Brekhof | Xuanyi Liu | Joris Ruitenbeek | Niels Top | Yuwen Zhou
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

In this system description, we describe our process and the systems that we created for the subtasks A monolingual, A multilingual, and B for the SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. This shared task aims at detecting and differentiating between machine-generated text and human-written text. Subtask A is focused on detecting if a text is machine-generated or human-written, both in a monolingual and a multilingual setting. Subtask B is also focused on detecting if a text is human-written or machine-generated, though it takes it one step further by also requiring the detection of the correct language model used for generating the text. For the monolingual aspects of this task, our approach is centered around fine-tuning a deberta-v3-large LM. For the multilingual setting, we created an ensemble model utilizing different monolingual models and a language identification tool to classify each text. We also experiment with the generation of extra training data. Our results show that the generation of extra data aids our models and leads to an increase in accuracy.
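The multilingual setup described in the abstract (a language identification step that sends each text to a monolingual model) can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: `make_router`, the toy language identifier, and the keyword-based stand-in classifiers are hypothetical and do not reproduce the authors' fine-tuned models or the language identification tool they used.

```python
from typing import Callable, Dict

# A classifier maps a text to a label: 1 = machine-generated, 0 = human-written.
Classifier = Callable[[str], int]


def make_router(
    identify_language: Callable[[str], str],
    monolingual_models: Dict[str, Classifier],
    fallback: Classifier,
) -> Classifier:
    """Build a classifier that dispatches each text to the model for its language."""

    def classify(text: str) -> int:
        lang = identify_language(text)
        # Use the fallback model when no monolingual model covers the language.
        model = monolingual_models.get(lang, fallback)
        return model(text)

    return classify


# Hypothetical stand-ins so the sketch runs without any trained model.
def toy_language_id(text: str) -> str:
    # Treat text containing CJK Unified Ideographs as Chinese, otherwise English.
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"


def toy_en_model(text: str) -> int:
    return 1 if "as an ai language model" in text.lower() else 0


def toy_zh_model(text: str) -> int:
    return 1 if "作为人工智能" in text else 0


def toy_fallback(text: str) -> int:
    return 0


router = make_router(
    toy_language_id,
    {"en": toy_en_model, "zh": toy_zh_model},
    toy_fallback,
)
```

In a real system each entry in the dictionary would presumably wrap a fine-tuned monolingual model, and the fallback would cover languages for which no dedicated model exists; how the authors handled that case is not stated in the abstract.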