Thomas Arnold


2024

M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection
Yuxia Wang | Jonibek Mansurov | Petar Ivanov | Jinyan Su | Artem Shelmanov | Akim Tsvigun | Chenxi Whitehouse | Osama Mohammed Afzal | Tarek Mahmoud | Toru Sasaki | Thomas Arnold | Alham Fikri Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries. However, this has also raised concerns about the potential misuse of such texts in journalism, education, and academia. In this study, we strive to create automated systems that can detect machine-generated texts and pinpoint potential misuse. We first introduce a large-scale benchmark M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Through an extensive empirical study of this dataset, we show that it is challenging for detectors to generalize well on instances from unseen domains or LLMs. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is a lot of room for improvement. We believe that our dataset will enable future research towards more robust approaches to this pressing societal problem. The dataset is available at https://github.com/mbzuai-nlp/M4

M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection
Yuxia Wang | Jonibek Mansurov | Petar Ivanov | Jinyan Su | Artem Shelmanov | Akim Tsvigun | Osama Mohammed Afzal | Tarek Mahmoud | Giovanni Puccetti | Thomas Arnold | Alham Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. The need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. In this work, we address this problem by introducing M4GT-Bench, a new benchmark based on a multilingual, multi-domain and multi-generator corpus of MGTs. The benchmark comprises three tasks: (1) mono-lingual and multi-lingual binary MGT detection; (2) multi-way detection, where one needs to identify which particular model generated the text; and (3) mixed human-machine text detection, where a word boundary delimiting MGT from human-written content should be determined. On the developed benchmark, we have tested several MGT detection baselines and also conducted an evaluation of human performance. We see that obtaining good performance in MGT detection usually requires access to training data from the same domain and generators. The benchmark is available at https://github.com/mbzuai-nlp/M4GT-Bench.

SemEval-2024 Task 8: Multidomain, Multimodel and Multilingual Machine-Generated Text Detection
Yuxia Wang | Jonibek Mansurov | Petar Ivanov | Jinyan Su | Artem Shelmanov | Akim Tsvigun | Osama Mohammed Afzal | Tarek Mahmoud | Giovanni Puccetti | Thomas Arnold
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual track. Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM. Subtask C aims to identify the changing point within a text, at which the authorship transitions from human to machine. The task attracted a large number of participants: subtask A monolingual (126), subtask A multilingual (59), subtask B (70), and subtask C (30). In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For all subtasks, the best systems used LLMs.

2023

QUEST: Quizzes Utilizing Engaging StoryTelling
Thomas Arnold
Proceedings of the 1st Workshop on Teaching for NLP

2022

DP-Rewrite: Towards Reproducibility and Transparency in Differentially Private Text Rewriting
Timour Igamberdiev | Thomas Arnold | Ivan Habernal
Proceedings of the 29th International Conference on Computational Linguistics

Text rewriting with differential privacy (DP) provides concrete theoretical guarantees for protecting the privacy of individuals in textual documents. In practice, existing systems may lack the means to validate their privacy-preserving claims, leading to problems of transparency and reproducibility. We introduce DP-Rewrite, an open-source framework for differentially private text rewriting which aims to solve these problems by being modular, extensible, and highly customizable. Our system incorporates a variety of downstream datasets, models, pre-training procedures, and evaluation metrics to provide a flexible way to conduct and validate private text rewriting research. To demonstrate our software in practice, we provide a set of experiments as a case study on the ADePT DP text rewriting system, detecting a privacy leak in its pre-training approach. Our system is publicly available, and we hope that it will help the community to make DP text rewriting research more accessible and transparent.

2018

Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data
Christopher Tauchmann | Thomas Arnold | Andreas Hanselowski | Christian M. Meyer | Margot Mieskes
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Network Motifs May Improve Quality Assessment of Text Documents
Thomas Arnold | Karsten Weihe
Proceedings of TextGraphs-10: the Workshop on Graph-based Methods for Natural Language Processing

EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres
Steffen Remus | Gerold Hintz | Chris Biemann | Christian M. Meyer | Darina Benikova | Judith Eckle-Kohler | Margot Mieskes | Thomas Arnold
Proceedings of the 10th Web as Corpus Workshop