Workshop on GenAI Content Detection (GenAIDetect) (2025)


pdf (full)
bib (full)
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)

pdf bib
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)
Firoj Alam | Preslav Nakov | Nizar Habash | Iryna Gurevych | Shammur Chowdhury | Artem Shelmanov | Yuxia Wang | Ekaterina Artemova | Mucahid Kutlu | George Mikros

pdf bib
SilverSpeak: Evading AI-Generated Text Detectors using Homoglyphs
Aldan Creo | Shushanta Pudasaini

The advent of Large Language Models (LLMs) has enabled the generation of text that increasingly exhibits human-like characteristics. As the detection of such content is of significant importance, substantial research has been conducted with the objective of developing reliable AI-generated text detectors. These detectors have demonstrated promising results on test data, but recent research has revealed that they can be circumvented by employing different techniques. In this paper, we present homoglyph-based attacks (‘A’ → Cyrillic ‘А’) as a means of circumventing existing detectors. We conduct a comprehensive evaluation to assess the effectiveness of these attacks on seven detectors, including ArguGPT, Binoculars, DetectGPT, Fast-DetectGPT, Ghostbuster, OpenAI’s detector, and watermarking techniques, on five different datasets. Our findings demonstrate that homoglyph-based attacks can effectively circumvent state-of-the-art detectors, leading them to classify all texts as either AI-generated or human-written (decreasing the average Matthews Correlation Coefficient from 0.64 to -0.01). Through further examination, we extract the technical justification underlying the success of the attacks, which varies across detectors. Finally, we discuss the implications of these findings and potential defenses against such attacks.
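To make the attack concrete, here is a minimal illustrative sketch of homoglyph substitution of the kind the paper evaluates; the character mapping and replacement rate are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative homoglyph substitution: selected Latin characters are swapped
# for visually identical Cyrillic ones, leaving the text unchanged to a human
# reader while altering the character (and hence token) sequence a detector
# sees. The mapping and rate are illustrative assumptions.
import random

HOMOGLYPHS = {
    "A": "\u0410",  # Cyrillic capital A
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie (looks like Latin e)
    "o": "\u043e",  # Cyrillic small o
    "c": "\u0441",  # Cyrillic small es (looks like Latin c)
}

def apply_homoglyphs(text: str, rate: float = 1.0) -> str:
    """Replace mapped characters with probability `rate`."""
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and random.random() < rate else ch
        for ch in text
    )

print(apply_homoglyphs("Language models can produce coherent essays."))
```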

pdf bib
Human vs. AI: A Novel Benchmark and a Comparative Study on the Detection of Generated Images and the Impact of Prompts
Philipp Moeßner | Heike Adel

With the advent of publicly available AI-based text-to-image systems, the process of creating photorealistic but fully synthetic images has been largely democratized. This can pose a threat to the public through a simplified spread of disinformation. Machine detectors and human media expertise can help to differentiate between AI-generated (fake) and real images and counteract this danger. Although AI generation models are highly prompt-dependent, the impact of the prompt on fake detection performance has rarely been investigated. This work therefore examines the influence of the prompt’s level of detail on the detectability of fake images, both with an AI detector and in a user study. For this purpose, we create a novel dataset, COCOXGEN, which consists of real photos from the COCO dataset as well as images generated with SDXL and Fooocus using prompts of two standardized lengths. Our user study with 200 participants shows that images generated with longer, more detailed prompts are detected significantly more easily than those generated with short prompts. Similarly, an AI-based detection model achieves better performance on images generated with longer prompts. However, humans and AI models seem to pay attention to different details, as we show in a heat map analysis.

pdf bib
Mirror Minds : An Empirical Study on Detecting LLM-Generated Text via LLMs
Josh Baradia | Shubham Gupta | Suman Kundu

The use of large language models (LLMs) in text generation is now inevitable: LLMs are gradually replacing search engines and have become the de facto choice for conversation, knowledge extraction, and brainstorming. This study focuses on one question: ‘Can we utilize the generative capabilities of LLMs to detect AI-generated content?’ We present a methodology and empirical results on four publicly available datasets. The results show that a zero-shot detector utilizing multiple LLMs can detect AI-generated content with 90% accuracy.
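As a rough illustration of what a zero-shot, multi-LLM detector of this kind might look like, the sketch below majority-votes the verdicts of several LLM judges; the judge callables and prompt are hypothetical placeholders, not the authors' methodology.

```python
# Schematic zero-shot detector: several LLMs are each asked whether a text
# looks AI-generated, and the verdicts are majority-voted. The judge
# callables below are stubs standing in for real LLM API calls.
from collections import Counter
from typing import Callable, List

PROMPT = "Answer 'AI' or 'human': who most likely wrote the following text?\n\n{text}"

def zero_shot_detect(text: str, judges: List[Callable[[str], str]]) -> str:
    votes = [judge(PROMPT.format(text=text)) for judge in judges]
    return Counter(votes).most_common(1)[0][0]

# Example with stub judges standing in for calls to different LLMs.
stub_judges = [lambda p: "AI", lambda p: "AI", lambda p: "human"]
print(zero_shot_detect("The aforementioned considerations notwithstanding...", stub_judges))
```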

pdf bib
Benchmarking AI Text Detection: Assessing Detectors Against New Datasets, Evasion Tactics, and Enhanced LLMs
Shushanta Pudasaini | Luis Miralles | David Lillis | Marisa Llorens Salvador

The rapid advancement of Large Language Models (LLMs), such as GPT-4, has sparked concerns regarding academic misconduct, misinformation, and the erosion of originality. Despite the growing number of AI detection tools, their effectiveness is often undermined by sophisticated evasion tactics and the continuous evolution of LLMs. This research benchmarks the performance of leading AI detectors, including OpenAI Detector, RADAR, and ArguGPT, across a variety of text domains, evaded content, and text generated by cutting-edge LLMs. Our experiments reveal that current detection models show considerable unreliability in real-world scenarios, particularly when tested against diverse data domains and novel evasion strategies. The study underscores the need for enhanced robustness in detection systems and provides valuable insights into areas of improvement for these models. Additionally, this work lays the groundwork for future research by offering a comprehensive evaluation of existing detectors under challenging conditions, fostering a deeper understanding of their limitations. The experimental code and datasets are publicly available for further benchmarking on Github.

pdf bib
Cross-table Synthetic Tabular Data Detection
G. Charbel N. Kindji | Lina M. Rojas Barahona | Elisa Fromont | Tanguy Urvoy

Detecting synthetic tabular data is essential to prevent the distribution of false or manipulated datasets that could compromise data-driven decision-making. This study explores whether synthetic tabular data can be reliably identified “in the wild”—meaning across different generators, domains, and table formats. This challenge is unique to tabular data, where structures (such as number of columns, data types, and formats) can vary widely from one table to another. We propose three cross-table baseline detectors and four distinct evaluation protocols, each corresponding to a different level of “wildness”. Our very preliminary results confirm that cross-table adaptation is a challenging task.

pdf bib
Your Large Language Models are Leaving Fingerprints
Hope Elizabeth McGovern | Rickard Stureborg | Yoshi Suhara | Dimitris Alikaniotis

It has been shown that fine-tuned transformers and other supervised detectors are effective for distinguishing between human and machine-generated texts in non-adversarial settings, but we find that even simple classifiers on top of n-gram and part-of-speech features can achieve very robust performance on both in- and out-of-domain data. To understand how this is possible, we analyze machine-generated output text in four datasets, finding that LLMs possess unique fingerprints that manifest as slight differences in the frequency of certain lexical and morphosyntactic features. We show how to visualize such fingerprints, describe how they can be used to detect machine-generated text and find that they are even robust across text domains. We find that fingerprints are often persistent across models in the same model family (e.g. 13B parameter LLaMA’s fingerprint is similar to that of 65B parameter LLaMA) and that while a detector trained on text from one model can easily recognize text generated by a model in the same family, it struggles to detect text generated by an unrelated model.
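A minimal sketch of the "simple classifier on n-gram features" idea the abstract refers to, assuming a TF-IDF bag-of-n-grams and a linear model; the toy data and hyperparameters are placeholders, and part-of-speech n-grams could be added as a second feature block in the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = machine-generated, 0 = human-written (placeholders).
texts = [
    "furthermore, the model demonstrates robust and coherent capabilities",
    "in conclusion, these findings underscore the importance of the approach",
    "honestly i just thought the movie was kinda slow lol",
    "we missed the bus again so dinner was cold by the time we got home",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),  # word uni- and bigrams
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["in conclusion, the results demonstrate robust performance"]))
```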

pdf bib
GPT-4 is Judged More Human than Humans in Displaced and Inverted Turing Tests
Ishika M. Rathi | Sydney Taylor | Benjamin Bergen | Cameron Jones

Everyday AI detection requires differentiating between humans and AI in informal, online conversations. At present, human users most often do not interact directly with bots but instead read their conversations with other humans. We measured how well humans and large language models can discriminate using two modified versions of the Turing test: inverted and displaced. GPT-3.5, GPT-4, and displaced human adjudicators judged whether an agent was human or AI on the basis of a Turing test transcript. We found that both AI and displaced human judges were less accurate than interactive interrogators, with below chance accuracy overall. Moreover, all three judged the best-performing GPT-4 witness to be human more often than human witnesses. This suggests that both humans and current LLMs struggle to distinguish between the two when they are not actively interrogating the person, underscoring an urgent need for more accurate tools to detect AI in conversations.

pdf bib
The Consistent Lack of Variance of Psychological Factors Expressed by LLMs and Spambots
Vasudha Varadarajan | Salvatore Giorgi | Siddharth Mangalik | Nikita Soni | Dave M. Markowitz | H. Andrew Schwartz

In recent years, the proliferation of chatbots like ChatGPT and Claude has led to an increasing volume of AI-generated text. While the text itself is convincingly coherent and human-like, the variety of expressed human attributes may still be limited. Using theoretical individual differences, the fundamental psychological traits which distinguish people, this study reveals a distinctive characteristic of such content: AI generations exhibit remarkably limited variation in inferrable psychological traits compared to human-authored texts. We present a review and study across multiple datasets spanning various domains. We find that AI-generated text consistently models the authorship of an “average” human with such little variation that, on aggregate, it is clearly distinguishable from human-written texts using unsupervised methods (i.e., without using ground truth labels). Our results show that (1) fundamental human traits are able to accurately distinguish human- and machine-generated text and (2) current generation capabilities fail to capture a diverse range of human traits.

pdf bib
DAMAGE: Detecting Adversarially Modified AI Generated Text
Elyas Masrour | Bradley N. Emi | Max Spero

AI humanizers are a new class of online software tools meant to paraphrase and rewrite AI-generated text in a way that allows them to evade AI detection software. We study 19 AI humanizer and paraphrasing tools and qualitatively assess their effects and faithfulness in preserving the meaning of the original text. We show that many existing AI detectors fail to detect humanized text. Finally, we demonstrate a robust model that can detect humanized AI text while maintaining a low false positive rate using a data-centric augmentation approach. We attack our own detector, training our own fine-tuned model optimized against our detector’s predictions, and show that our detector’s cross-humanizer generalization is sufficient to remain robust to this attack.

pdf bib
Text Graph Neural Networks for Detecting AI-Generated Content
Andric Valdez | Helena Gomez-Adorno

The widespread availability of Large Language Models (LLMs) such as GPT-4 and Llama-3, among others, has led to a surge in machine-generated content across various platforms, including social media, educational tools, and academic settings. While these models demonstrate remarkable capabilities in generating coherent text, their misuse raises significant concerns. For this reason, detecting machine-generated text has become a pressing need to mitigate these risks. This research proposes a novel classification method combining text-graph representations with Graph Neural Networks (GNNs) and different node feature initialization strategies to distinguish between human-written and machine-generated content. Experimental results demonstrate that the proposed approach outperforms traditional machine learning classifiers, highlighting the effectiveness of integrating structural and semantic relationships in text.

pdf bib
I Know You Did Not Write That! A Sampling Based Watermarking Method for Identifying Machine Generated Text
Kaan Efe Keleş | Ömer Kaan Gürbüz | Mucahid Kutlu

Potential harms of Large Language Models such as mass misinformation and plagiarism can be partially mitigated if there exists a reliable way to detect machine generated text. In this paper, we propose a new watermarking method to detect machine-generated texts. Our method embeds a unique pattern within the generated text, ensuring that while the content remains coherent and natural to human readers, it carries distinct markers that can be identified algorithmically. Specifically, we intervene with the token sampling process in a way which enables us to trace back our token choices during the detection phase. We show how watermarking affects textual quality and compare our proposed method with a state-of-the-art watermarking method in terms of robustness and detectability. Through extensive experiments, we demonstrate the effectiveness of our watermarking scheme in distinguishing between watermarked and non-watermarked text, achieving high detection rates while maintaining textual quality.

pdf bib
DCBU at GenAI Detection Task 1: Enhancing Machine-Generated Text Detection with Semantic and Probabilistic Features
Zhaowen Zhang | Songhao Chen | Bingquan Liu

This paper presents our approach to the MGT Detection Task 1, which focuses on detecting AI-generated content. The objective of this task is to classify texts as either machine-generated or human-written. We participated in Subtask A, which concentrates on English-only texts. We utilized the RoBERTa model for semantic feature extraction and the LLaMA3 model for probabilistic feature analysis. By integrating these features, we aimed to enhance the system’s classification accuracy. Our approach achieved strong results, with an F1 score of 0.7713 on Subtask A, ranking ninth among 36 teams. These results demonstrate the effectiveness of our feature integration strategy.

pdf bib
L3i++ at GenAI Detection Task 1: Can Label-Supervised LLaMA Detect Machine-Generated Text?
Hanh Thi Hong Tran | Nguyen Tien Nam

The widespread use of large language models (LLMs) influences different social media and educational contexts through an overwhelming volume of generated text with a certain degree of coherence. To mitigate their potential misuse, this paper explores the feasibility of finetuning LLaMA with label supervision (named LS-LLaMA) in unidirectional and bidirectional settings, to discriminate between texts generated by machines and by humans in monolingual and multilingual corpora. Our findings show that unidirectional LS-LLaMA outperformed the benchmark sequence language models by a large margin. Our code is publicly available at https://github.com/honghanhh/llama-as-a-judge.
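For orientation, a hedged sketch of label-supervised sequence classification on a LLaMA backbone, in the spirit of LS-LLaMA; the checkpoint name is a placeholder assumption, and the authors' actual implementation lives in their linked repository.

```python
# Sketch: a LLaMA backbone with a classification head trained under label
# supervision (here via the standard sequence-classification wrapper).
# The checkpoint name is a placeholder (gated model), not the authors' setup.
import torch
from transformers import AutoTokenizer, LlamaForSequenceClassification

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlamaForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("This essay was produced by a language model.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # scores for [human-written, machine-generated]
print(logits)
```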

pdf bib
TechExperts(IPN) at GenAI Detection Task 1: Detecting AI-Generated Text in English and Multilingual Contexts
Gull Mehak | Amna Qasim | Abdul Gafar Manuel Meque | Nisar Hussain | Grigori Sidorov | Alexander Gelbukh

The ever-increasing spread of AI-generated text, driven by the considerable progress in large language models, entails a real problem for all digital platforms: how to ensure content authenticity. The team TechExperts(IPN) presents a method for detecting AI-generated content in English and multilingual contexts, using the google/gemma-2b model fine-tuned for the COLING 2025 Shared Task 1 English and multilingual subtasks. Training results show peak F1 scores of 97.63% for English and 97.87% for multilingual detection, highlighting the model’s effectiveness in supporting content integrity across platforms.

pdf bib
SzegedAI at GenAI Detection Task 1: Beyond Binary - Soft-Voting Multi-Class Classification for Binary Machine-Generated Text Detection Across Diverse Language Models
Mihaly Kiss | Gábor Berend

This paper describes the participation of the SzegedAI team in Subtask A of Task 1 at the COLING 2025 Workshop on Detecting AI-Generated Content. Our solutions investigate the effectiveness of combining multi-class approaches with ensemble methods for detecting machine-generated text. This approach groups models into multiple classes based on properties such as model size or generative capabilities. Additionally, we employ a length-based method, utilizing specialized expert models designed for specific text length ranges. During inference, we condense multi-class predictions into a binary outcome, categorizing any label other than human as AI-generated. The effectiveness of both standard and snapshot ensemble techniques is evaluated. Although not all multi-class configurations outperformed the binary setup, our findings indicate that the combination of multi-class training and ensemble methods can enhance performance over single-method or binary approaches.
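A small sketch of the condensing step described above, assuming soft voting over ensemble members followed by a "human vs. everything else" collapse; the class grouping and probabilities are illustrative, not the team's configuration.

```python
# Soft-voted multi-class probabilities are averaged over ensemble members,
# then every non-"human" class is collapsed into a single AI-generated score.
import numpy as np

classes = ["human", "small_llm", "medium_llm", "large_llm"]  # index 0 = human

def soft_vote_to_binary(member_probs: np.ndarray) -> str:
    avg = member_probs.mean(axis=0)           # soft voting over ensemble members
    p_human, p_ai = avg[0], avg[1:].sum()     # collapse all non-human classes
    return "machine-generated" if p_ai > p_human else "human-written"

member_probs = np.array([[0.35, 0.20, 0.25, 0.20],   # member 1 class probabilities
                         [0.45, 0.15, 0.20, 0.20]])  # member 2 class probabilities
print(dict(zip(classes, member_probs.mean(axis=0))))
print(soft_vote_to_binary(member_probs))
```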

pdf bib
Team Unibuc - NLP at GenAI Detection Task 1: Qwen it detect machine-generated text?
Claudiu Creanga | Teodor-George Marchitan | Liviu P. Dinu

We explored both masked language models and causal models. For Subtask A, our best model achieved first place out of 36 teams on F1 Micro (Auxiliary Score) with 0.8333, and second place on F1 Macro (Main Score) with 0.8301. For causal models, our best model was a fine-tuned version of Qwen, and for masked models, our best model was a fine-tuned version of XLM-RoBERTa-base.

pdf bib
Fraunhofer SIT at GenAI Detection Task 1: Adapter Fusion for AI-generated Text Detection
Karla Schaefer | Martin Steinebach

The detection of AI-generated content is becoming increasingly important with the growing prevalence of tools such as ChatGPT. This paper presents our results in the GenAI Content Detection Task 1, focusing on binary English and multilingual AI-generated text detection. We trained and tested transformers, adapters and adapter fusion. In the English setting (Subtask A), the combination of our own adapter on AI-generated text detection based on RoBERTa with a task adapter on multi-genre NLI yielded a macro F1 score of 0.828 on the challenge test set, ranking us third out of 35 teams. In the multilingual setting (Subtask B), adapter fusion resulted in a deterioration of the results. Consequently, XLM-RoBERTa, fine-tuned on the training set, was employed for the final evaluation, attaining a macro F1 score of 0.7258 and ranking tenth out of 25 teams.

pdf bib
OSINT at GenAI Detection Task 1: Multilingual MGT Detection: Leveraging Cross-Lingual Adaptation for Robust LLMs Text Identification
Shifali Agrahari | Sanasam Ranbir Singh

Detecting AI-generated text has become increasingly prominent. This paper presents our solution for the DAIGenC Task 1 Subtask 2, where we address the challenge of distinguishing human-authored text from machine-generated content, especially in multilingual contexts. We introduce Multi-Task Detection (MLDet), a model that leverages Cross-Lingual Adaptation and Model Generalization strategies for Multilingual Machine-Generated Text (MGT) detection. By combining language-specific embeddings with fusion techniques, MLDet creates a unified, language-agnostic feature representation, enhancing its ability to generalize across diverse languages and models. Our approach demonstrates strong performance, achieving macro and micro F1 scores of 0.7067 and 0.7187, respectively, and ranking 15th in the competition. We also evaluate our model across datasets generated by distinct models in many languages, showcasing its robustness in multilingual and cross-model scenarios.

pdf bib
Nota AI at GenAI Detection Task 1: Unseen Language-Aware Detection System for Multilingual Machine-Generated Text
Hancheol Park | Jaeyeon Kim | Geonmin Kim | Tae-Ho Kim

Recently, large language models (LLMs) have demonstrated unprecedented capabilities in language generation, yet they still often produce incorrect information. Therefore, determining whether a text was generated by an LLM has become one of the factors that must be considered when evaluating its reliability. In this paper, we discuss methods to determine whether texts written in various languages were authored by humans or generated by LLMs. We have discovered that the classification accuracy significantly decreases for texts written in languages not observed during the training process, and we aim to address this issue. We propose a method to improve performance for unseen languages by using token-level predictive distributions extracted from various LLMs and text embeddings from a multilingual pre-trained language model. With the proposed method, we achieved third place out of 25 teams in Subtask B (binary multilingual machine-generated text detection) of Shared Task 1, with an F1 macro score of 0.7532.

pdf bib
CNLP-NITS-PP at GenAI Detection Task 1: AI-Generated Text Using Transformer-Based Approaches
Annepaka Yadagiri | Sai Teja Lekkala | Mandadoddi Srikar Vardhan | Partha Pakray | Reddi Mohana Krishna

In the current digital landscape, distinguishing between text generated by humans and that created by large language models has become increasingly complex. This challenge is exacerbated by advanced LLMs such as Gemini, ChatGPT, GPT-4, and LLaMa, which can produce highly sophisticated, human-like text. This indistinguishability introduces a range of challenges across different sectors: in cybersecurity, it increases the risk of social engineering and misinformation; on social media, it aids the spread of biased or false content; the educational sector faces issues of academic integrity; and within large, multi-team environments, these models add complexity to managing interactions between human and AI agents. To address these challenges, we approached the problem as a binary classification task using an English-language benchmark COLING dataset. We employed transformer-based neural network models, including BERT, DistilBERT, and RoBERTa, fine-tuning each model with optimized hyperparameters to maximize classification accuracy. Our team CNLP-NITS-PP achieved the 23rd rank in Subtask 1 at COLING-2025 for machine-generated text detection in English, with a main score (F1 Macro) of 0.6502 and a micro-F1 score of 0.6876.

pdf bib
LuxVeri at GenAI Detection Task 1: Inverse Perplexity Weighted Ensemble for Robust Detection of AI-Generated Text across English and Multilingual Contexts
MD. Kamrujjaman Mobin | Md Saiful Islam

This paper presents a system developed for Task 1 of the COLING 2025 Workshop on Detecting AI-Generated Content, focusing on the binary classification of machine-generated versus human-written text. Our approach utilizes an ensemble of models, with weights assigned according to each model’s inverse perplexity, to enhance classification accuracy. For the English text detection task, we combined RoBERTa-base, RoBERTa-base with the OpenAI detector, and BERT-base-cased, achieving a Macro F1-score of 0.7458, which ranked us 12th out of 35 teams. We ensembled RemBERT, XLM-RoBERTa-base, and BERT-base-multilingual-cased for the multilingual text detection task, employing the same inverse perplexity weighting technique. This resulted in a Macro F1-score of 0.7513, positioning us 4th out of 25 teams. Our results demonstrate the effectiveness of inverse perplexity weighting in improving the robustness of machine-generated text detection across both monolingual and multilingual settings, highlighting the potential of ensemble methods for this challenging task.
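A brief sketch of inverse-perplexity weighting as described: each member model's class probabilities are weighted in proportion to 1/perplexity before averaging, so lower-perplexity models count more. The perplexities and probabilities below are placeholder numbers, not values from the paper.

```python
import numpy as np

def inverse_perplexity_ensemble(probs: np.ndarray, perplexities: np.ndarray) -> np.ndarray:
    """probs: (n_models, n_classes) per-model class probabilities."""
    weights = 1.0 / perplexities
    weights = weights / weights.sum()   # normalize inverse-perplexity weights
    return weights @ probs              # weighted average over models

probs = np.array([[0.70, 0.30],   # e.g. RoBERTa-base
                  [0.55, 0.45],   # e.g. RoBERTa-base with the OpenAI detector
                  [0.60, 0.40]])  # e.g. BERT-base-cased
perplexities = np.array([12.4, 20.1, 15.8])  # placeholder per-model perplexities
print(inverse_perplexity_ensemble(probs, perplexities))
```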

pdf bib
Grape at GenAI Detection Task 1: Leveraging Compact Models and Linguistic Features for Robust Machine-Generated Text Detection
Nhi Hoai Doan | Kentaro Inui

In this project, we aim to address two subtasks of Task 1: Binary Multilingual Machine-Generated Text (MGT) Detection (Human vs. Machine) as part of the COLING 2025 Workshop on MGT Detection (Wang et al., 2025) using different approaches. The first method involves separately fine-tuning small language models tailored to the specific subtask. The second approach builds on this methodology by incorporating linguistic, syntactic, and semantic features, leveraging ensemble learning to integrate these features with model predictions for more robust classification. By evaluating and comparing these approaches, we aim to identify the most effective techniques for detecting machine-generated content across languages, providing insights into improving automated verification tools amidst the rapid growth of LLM-generated text in digital spaces.

pdf bib
AAIG at GenAI Detection Task 1: Exploring Syntactically-Aware, Resource-Efficient Small Autoregressive Decoders for AI Content Detection
Avanti Bhandarkar | Ronald Wilson | Damon Woodard

This paper presents a lightweight and efficient approach to AI-generated content detection using small autoregressive fine-tuned decoders (AFDs) for secure, on-device deployment. Motivated by resource-efficiency, syntactic awareness, and bias mitigation, our model employs small language models (SLMs) with autoregressive pre-training and loss fusion to accurately distinguish between human and AI-generated content while significantly reducing computational demands. The system achieved its highest macro-F1 score of 0.8186, with the submitted model scoring 0.7874—both significantly outperforming the task baseline while reducing model parameters by ~60%. Notably, our approach mitigates biases, improving recall for human-authored text by over 60%. Our system ranked 8th out of 36 participants, and these results confirm the feasibility and competitiveness of small AFDs in challenging, adversarial settings, making them ideal for privacy-preserving, on-device deployment in real-world applications.

pdf bib
TurQUaz at GenAI Detection Task 1: Dr. Perplexity or: How I Learned to Stop Worrying and Love the Finetuning
Kaan Efe Keleş | Mucahid Kutlu

This paper details our methods for addressing Task 1 of the GenAI Content Detection shared tasks, which focus on distinguishing AI-generated text from human-written content. The task comprises two subtasks: Subtask A, centered on English-only datasets, and Subtask B, which extends the challenge to multilingual data. Our approach uses a fine-tuned XLM-RoBERTa model for classification, complemented by features including perplexity and TF-IDF. While perplexity is commonly regarded as a useful indicator for identifying machine-generated text, our findings suggest its limitations in multi-model and multilingual contexts. Our approach ranked 6th in Subtask A, but a submission issue left our Subtask B unranked, where it would have placed 23rd.

pdf bib
AI-Monitors at GenAI Detection Task 1: Fast and Scalable Machine Generated Text Detection
Azad Singh | Vishnu Tripathi | Ravindra Kumar Pandey | Pragyanand Saho | Prakhar Joshi | Neel Mani | Richa Alagh | Pallaw Mishra | Piyush Arora

We describe the work carried out by our team, AI-Monitors, on the Binary Multilingual Machine-Generated Text Detection (Human vs. Machine) task at COLING 2025. This task aims to determine whether a given text is generated by a machine or authored by a human. We propose a lightweight, simple, and scalable approach using encoder models such as RoBERTa and XLM-R, and provide an in-depth analysis based on our experiments. Our study found that carefully exploring fine-tuning parameters such as (i) the number of training epochs, (ii) the maximum input size, and (iii) the handling of class imbalance plays an important role in building an effective system and can significantly impact the underlying tasks. We found that the optimum setting of these parameters can lead to a difference of about 5-6% in absolute terms for measures such as accuracy and F1. The paper presents crucial insights into optimal parameter selection for fine-tuning RoBERTa- and XLM-R-based models to detect whether a given text is generated by a machine or a human.
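To ground the three parameters highlighted above, here is a hedged Hugging Face-style sketch with illustrative values; the specific epoch count, sequence length, and imbalance handling are assumptions, not the team's tuned settings.

```python
# Sketch of the fine-tuning knobs the paper says matter most:
# (i) number of epochs, (ii) maximum input length, (iii) class imbalance.
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def encode(batch):
    # (ii) maximum input size: truncate/pad to a fixed length
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

training_args = TrainingArguments(
    output_dir="mgt-detector",
    num_train_epochs=3,                 # (i) number of training epochs (illustrative)
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
)
# (iii) class imbalance is typically handled with a weighted loss or by
#       oversampling the minority class before building the Trainer.
```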

pdf bib
Advacheck at GenAI Detection Task 1: AI Detection Powered by Domain-Aware Multi-Tasking
German Gritsai | Anastasia Voznuyk | Ildar Khabutdinov | Andrey Grabovoy

The paper describes a system designed by the Advacheck team to recognise machine-generated and human-written texts in the monolingual subtask of the GenAI Detection Task 1 competition. Our system is a multi-task architecture with a Transformer encoder shared between several classification heads. One head is responsible for binary classification between human-written and machine-generated texts, while the other heads are auxiliary multiclass classifiers for texts of different domains from particular datasets. As the multiclass heads were trained to distinguish the domains presented in the data, they provide a better understanding of the samples. This approach led us to achieve first place in the official ranking with an 83.07% macro F1-score on the test set, surpassing the baseline by 10%. We further study the obtained system through ablation, error, and representation analyses, finding that multi-task learning outperforms the single-task mode and that the simultaneous tasks form a cluster structure in the embedding space.
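A compact PyTorch sketch of the multi-task shape described, with one shared encoder feeding a binary head plus an auxiliary domain head; the placeholder encoder, dimensions, and loss weighting are illustrative assumptions rather than the Advacheck architecture.

```python
# Minimal multi-task sketch: shared encoder, binary human/machine head,
# auxiliary multiclass domain head, joint loss. The encoder is a stand-in
# (the paper uses a shared Transformer encoder); sizes are illustrative.
import torch
import torch.nn as nn

class MultiTaskDetector(nn.Module):
    def __init__(self, in_dim: int = 300, hidden: int = 768, n_domains: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # placeholder encoder
        self.binary_head = nn.Linear(hidden, 2)          # human vs. machine
        self.domain_head = nn.Linear(hidden, n_domains)  # auxiliary domain classifier

    def forward(self, x):
        h = self.encoder(x)
        return self.binary_head(h), self.domain_head(h)

model = MultiTaskDetector()
x = torch.randn(4, 300)                         # a batch of 4 pooled text features
binary_logits, domain_logits = model(x)
# Joint loss: main binary objective plus a down-weighted auxiliary domain objective.
loss = nn.functional.cross_entropy(binary_logits, torch.tensor([0, 1, 1, 0])) \
     + 0.5 * nn.functional.cross_entropy(domain_logits, torch.tensor([0, 2, 4, 1]))
print(binary_logits.shape, domain_logits.shape, loss.item())
```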

pdf bib
GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human
Yuxia Wang | Artem Shelmanov | Jonibek Mansurov | Akim Tsvigun | Vladislav Mikhailov | Rui Xing | Zhuohan Xie | Jiahui Geng | Giovanni Puccetti | Ekaterina Artemova | Jinyan Su | Minh Ngoc Ta | Mervat Abassy | Kareem Ashraf Elozeiri | Saad El Dine Ahmed El Etter | Maiya Goloburda | Tarek Mahmoud | Raj Vardhan Tomar | Nurkhan Laiyk | Osama Mohammed Afzal | Ryuto Koike | Masahiro Kaneko | Alham Fikri Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov

We present the GenAI Content Detection Task 1 – a shared task on binary machine generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 27 teams – to the Multilingual. We provide a comprehensive overview of the data, a summary of the results – including system rankings and performance scores – detailed descriptions of the participating systems, and an in-depth analysis of submissions.

pdf bib
CIC-NLP at GenAI Detection Task 1: Advancing Multilingual Machine-Generated Text Detection
Tolulope Olalekan Abiola | Tewodros Achamaleh Bizuneh | Fatima Uroosa | Nida Hafeez | Grigori Sidorov | Olga Kolesnikova | Olumide Ebenezer Ojo

Machine-written texts are gradually becoming indistinguishable from human-generated texts, leading to the need to use sophisticated methods to detect them. Team CIC-NLP presents its work on the GenAI Content Detection Task 1 at the COLING 2025 Workshop. Our focus is on Subtask B of Task 1, the classification of text written by machines versus human authors, framed as a multilingual binary classification problem. Using mBERT, we addressed the binary classification task with the dataset provided by the GenAI Detection Task team. mBERT achieved a macro-average F1-score of 0.72 as well as an accuracy score of 0.73.

pdf bib
CIC-NLP at GenAI Detection Task 1: Leveraging DistilBERT for Detecting Machine-Generated Text in English
Tolulope Olalekan Abiola | Tewodros Achamaleh Bizuneh | Oluwatobi Joseph Abiola | Temitope Olasunkanmi Oladepo | Olumide Ebenezer Ojo | Grigori Sidorov | Olga Kolesnikova

As machine-generated texts (MGT) become increasingly similar to human writing, these distinctions are harder to identify. In this paper, we as the CIC-NLP team present our submission to the Gen-AI Content Detection Workshop at COLING 2025 for Task 1 Subtask A, which involves distinguishing between text generated by LLMs and text authored by humans, with an emphasis on detecting English-only MGT. We applied the DistilBERT model to this binary classification task using the dataset provided by the organizers. Fine-tuning the model effectively differentiated between the classes, resulting in a micro-average F1-score of 0.70 on the evaluation test set. We provide a detailed explanation of the fine-tuning parameters and steps involved in our analysis.

pdf bib
nits_teja_srikar at GenAI Detection Task 2: Distinguishing Human and AI-Generated Essays Using Machine Learning and Transformer Models
Sai Teja Lekkala | Annepaka Yadagiri | Mangadoddi Srikar Vardhan | Partha Pakray

This paper presents models to differentiate between human-written and AI-generated essays, addressing challenges posed by advanced AI models like ChatGPT and Claude. Using a structured dataset, we fine-tune multiple machine learning models, including XGBoost and Logistic Regression, along with ensemble learning and k-fold cross-validation. The dataset is processed through TF-IDF vectorization, followed by text cleaning, lemmatization, stemming, and part-of-speech tagging before training. Our team nits_teja_srikar achieves high accuracy, with DistilBERT performing at 77.3% accuracy, standing at 20th position for English, and XLM-RoBERTa excelling in Arabic at 92.2%, standing at 14th position in the official leaderboard, demonstrating the model’s potential for real-world applications.

pdf bib
IntegrityAI at GenAI Detection Task 2: Detecting Machine-Generated Academic Essays in English and Arabic Using ELECTRA and Stylometry
Mohammad ALSmadi

We present a robust system for detecting machine-generated academic essays, leveraging pre-trained, transformer-based models specifically tailored for both English and Arabic texts. Our primary approach utilizes ELECTRA-Small for English and AraELECTRA-Base for Arabic, fine-tuned to deliver high performance while balancing computational efficiency. By incorporating stylometric features, such as word count, sentence length, and vocabulary richness, our models excel at distinguishing between human-written and AI-generated content. The proposed models achieved excellent results, with an F1-score of 99.7%, ranking second among 26 teams in the English subtask, and 98.4%, finishing first out of 23 teams in the Arabic one. Main contributions include: (1) We develop lightweight and efficient models using ELECTRA-Small and AraELECTRA-Base, achieving an impressive F1-score of 98.5% on the English dataset and 98.4% on the Arabic dataset. This demonstrates the power of combining transformer-based architectures with stylometric analysis. (2) We optimize our system to maintain high performance while being computationally efficient, making it suitable for deployment on GPUs with moderate memory capacity. (3) Additionally, we tested larger models, such as ELECTRA-Large, achieving an even higher F1-score of 99.7% on the English dataset, highlighting the potential for further accuracy gains when using more computationally intensive models.
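An illustrative sketch of the stylometric features named in the abstract (word count, sentence length, vocabulary richness); the exact feature set and how it is combined with the ELECTRA models are the authors', and this function is only an approximation.

```python
# Simple stylometric features of the kind mentioned: word count, average
# sentence length, and type-token ratio as a proxy for vocabulary richness.
import re

def stylometric_features(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "vocab_richness": len(set(words)) / max(len(words), 1),  # type-token ratio
    }

print(stylometric_features("Large language models write fluent essays. They rarely vary their style."))
```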

pdf bib
CMI-AIGCX at GenAI Detection Task 2: Leveraging Multilingual Proxy LLMs for Machine-Generated Text Detection in Academic Essays
Kaijie Jiao | Xingyu Yao | Shixuan Ma | Sifan Fang | Zikang Guo | Benfeng Xu | Licheng Zhang | Quan Wang | Yongdong Zhang | Zhendong Mao

This paper presents the approach we proposed for GenAI Detection Task 2, which aims to classify a given text as either machine-generated or human-written, with a particular emphasis on academic essays. We participated in subtasks A and B, which focus on detecting English and Arabic essays, respectively. We propose a simple and efficient method for detecting machine-generated essays, where we use the Llama-3.1-8B as a proxy to capture the essence of each token in the text. These essences are processed and classified using a refined feature classification network. Our approach does not require fine-tuning the LLM. Instead, we leverage its extensive multilingual knowledge acquired during pretraining to significantly enhance detection performance. The results validate the effectiveness of our approach and demonstrate that leveraging a proxy model with diverse multilingual knowledge can significantly enhance the detection of machine-generated text across multiple languages, regardless of model size. In Subtask A, we achieved an F1 score of 99.9%, ranking first out of 26 teams. In Subtask B, we achieved an F1 score of 96.5%, placing fourth out of 22 teams, with the same score as the third-place team.

pdf bib
EssayDetect at GenAI Detection Task 2: Guardians of Academic Integrity: Multilingual Detection of AI-Generated Essays
Shifali Agrahari | Subhashi Jayant | Saurabh Kumar | Sanasam Ranbir Singh

Detecting AI-generated text in the field of academia is becoming very prominent. This paper presents a solution for Task 2: AI vs. Human – Academic Essay Authenticity Challenge in the COLING 2025 DAIGenC Workshop. The rise of Large Language Models (LLMs) like ChatGPT has posed significant challenges to academic integrity, particularly in detecting AI-generated essays. To address this, we propose a fusion model that combines pre-trained language model embeddings with stylometric and linguistic features. Our approach, tested on both English and Arabic, utilizes adaptive training and attention mechanisms to enhance F1 scores, address class imbalance, and capture linguistic nuances across languages. This work advances multilingual solutions for detecting AI-generated text in academia.

pdf bib
CNLP-NITS-PP at GenAI Detection Task 2: Leveraging DistilBERT and XLM-RoBERTa for Multilingual AI-Generated Text Detection
Annepaka Yadagiri | Reddi Mohana Krishna | Partha Pakray

In today’s digital landscape, distinguishing between human-authored essays and content generated by advanced Large Language Models such as ChatGPT, GPT-4, Gemini, and LLaMa has become increasingly complex. This differentiation is essential across sectors like academia, cybersecurity, social media, and education, where the authenticity of written material is often crucial. Addressing this challenge, the COLING 2025 competition introduced Task 2, a binary classification task to separate AI-generated text from human-authored content. Using a benchmark dataset for English and Arabic, we developed a methodology that fine-tuned various neural network models, including CNN-LSTM, RNN, Bi-GRU, BERT, DistilBERT, GPT-2, and RoBERTa. Our team CNLP-NITS-PP achieved competitive performance through meticulous hyperparameter optimization, reaching a Recall score of 0.825. Specifically, we ranked 18th in the English Subtask A with an accuracy of 0.77 and 20th in the Arabic Subtask B with an accuracy of 0.59. These results underscore the potential of transformer-based models in academic settings to detect AI-generated content effectively, laying a foundation for more advanced methods in essay authenticity verification.

pdf bib
RA at GenAI Detection Task 2: Fine-tuned Language Models For Detection of Academic Authenticity, Results and Thoughts
Rana Gharib | Ahmed Elgendy

This paper assesses the performance of “RA” in the Academic Essay Authenticity Challenge, which saw nearly 30 teams participating in each subtask. We employed cutting-edge transformer-based models to achieve our results. Our models consistently exceeded both the mean and median scores across the tasks. Notably, we achieved an F1-score of 0.969 in classifying AI-generated essays in English and an F1-score of 0.957 for classifying AI-generated essays in Arabic. Additionally, this paper offers insights into the current state of AI-generated models and argues that the benchmarking methods currently in use do not accurately reflect real-world scenarios.

pdf bib
Tesla at GenAI Detection Task 2: Fast and Scalable Method for Detection of Academic Essay Authenticity
Vijayasaradhi Indurthi | Vasudeva Varma

This paper describes a simple yet effective method to identify whether academic essays in English have been written by students or generated by language models. We extract a set of style, language complexity, bias and subjectivity, and emotion-based features that can be used to distinguish human-written essays from machine-generated essays. Our methods rank 6th on the leaderboard, achieving an impressive F1-score of 0.986.

pdf bib
GenAI Content Detection Task 2: AI vs. Human – Academic Essay Authenticity Challenge
Shammur Absar Chowdhury | Hind Almerekhi | Mucahid Kutlu | Kaan Efe Keleş | Fatema Ahmad | Tasnim Mohiuddin | George Mikros | Firoj Alam

This paper presents a comprehensive overview of the first edition of the Academic Essay Authenticity Challenge, organized as part of the GenAI Content Detection shared tasks collocated with COLING 2025. This challenge focuses on detecting machine-generated vs human-authored essays for academic purposes. The task is defined as follows: “Given an essay, identify whether it is generated by a machine or authored by a human.” The challenge involves two languages: English and Arabic. During the evaluation phase, 25 teams submitted systems for English and 21 teams for Arabic, reflecting substantial interest in the task. Finally, five teams submitted system description papers. The majority of submissions utilized fine-tuned transformer-based models, with one team employing Large Language Models (LLMs) such as Llama 2 and Llama 3. This paper outlines the task formulation, details the dataset construction process, and explains the evaluation framework. Additionally, we present a summary of the approaches adopted by participating teams. Nearly all submitted systems outperformed the n-gram-based baseline, with the top-performing systems achieving F1 scores exceeding 0.98 for both languages, indicating significant progress in the detection of machine-generated text.

pdf bib
CNLP-NITS-PP at GenAI Detection Task 3: Cross-Domain Machine-Generated Text Detection Using DistilBERT Techniques
Sai Teja Lekkala | Annepaka Yadagiri | Mangadoddi Srikar Vardhan | Partha Pakray

This paper presents a Cross-domain Machine-Generated Text Detection model developed for the COLING 2025 Workshop on Detecting AI-generated Content (DAIGenC). As large language models evolve, detecting machine-generated text becomes increasingly challenging, particularly in contexts like misinformation and academic integrity. While current detectors perform well on unseen data, they remain vulnerable to adversarial strategies, including paraphrasing, homoglyphs, misspellings, synonyms, and whitespace manipulations. We introduce a framework that uses adversarial training to address these tactics, which are designed to bypass detection systems. Our team's DistilBERT-NITS detector placed 7th in the Non-Adversarial Attacks category, and Adversarial-submission-3 achieved 17th in the Adversarial Attacks category.

pdf bib
Leidos at GenAI Detection Task 3: A Weight-Balanced Transformer Approach for AI Generated Text Detection Across Domains
Abishek R. Edikala | Gregorios A. Katsios | Noelie Creaghe | Ning Yu

Advancements in Large Language Models (LLMs) blur the distinction between human and machine-generated text (MGT), raising concerns about misinformation and academic dishonesty. Existing MGT detection methods often fail to generalize across domains and generator models. We address this by framing MGT detection as a text classification task using transformer-based models. Utilizing Distil-RoBERTa-Base, we train four classifiers (binary and multi-class, with and without class weighting) on the RAID dataset (Dugan et al., 2024). Our systems placed first to fourth in the COLING 2025 MGT Detection Challenge Task 3 (Dugan et al., 2025). Internal in-domain and zero-shot evaluations reveal that applying class weighting improves detector performance, especially with multi-class classification training. Our best model effectively generalizes to unseen domains and generators, demonstrating that transformer-based models are robust detectors of machine-generated text.
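A short sketch of the class-weighting idea reported to help, assuming weights inversely proportional to class frequency passed to a cross-entropy loss; the counts and logits below are placeholders, not statistics from the RAID dataset.

```python
# Class-weighted cross-entropy as typically used when fine-tuning a
# transformer classifier on imbalanced data: rarer classes get larger weights.
import torch
import torch.nn as nn

class_counts = torch.tensor([8000.0, 2000.0])   # placeholder human vs. machine counts
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 2)                       # placeholder classifier outputs
labels = torch.tensor([0, 1, 1, 0])
print(criterion(logits, labels))
```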

pdf bib
Pangram at GenAI Detection Task 3: An Active Learning Approach to Machine-Generated Text Detection
Bradley N. Emi | Max Spero | Elyas Masrour

We pretrain an autoregressive LLM-based detector on a wide variety of datasets, domains, languages, prompt schemes, and LLMs used to generate the AI portion of the dataset. We aggressively employ several augmentation strategies and preprocessing strategies to improve robustness. We then mine the RAID train set for the AI examples with the largest error based on the original classifier, and mix those examples and their human-written counterparts back into the training set. We then retrain the detector until convergence.

pdf bib
LuxVeri at GenAI Detection Task 3: Cross-Domain Detection of AI-Generated Text Using Inverse Perplexity-Weighted Ensemble of Fine-Tuned Transformer Models
MD. Kamrujjaman Mobin | Md Saiful Islam

This paper presents our approach for Task 3 of the GenAI content detection workshop at COLING-2025, focusing on Cross-Domain Machine-Generated Text (MGT) Detection. We propose an ensemble of fine-tuned transformer models, enhanced by inverse perplexity weighting, to improve classification accuracy across diverse text domains. For Subtask A (Non-Adversarial MGT Detection), we combined a fine-tuned RoBERTa-base model with an OpenAI detector-integrated RoBERTa-base model, achieving an aggregate TPR score of 0.826, ranking 10th out of 23 detectors. In Subtask B (Adversarial MGT Detection), our fine-tuned RoBERTa-base model achieved a TPR score of 0.801, securing 8th out of 22 detectors. Our results demonstrate the effectiveness of inverse perplexity-based weighting for enhancing generalization and performance in both non-adversarial and adversarial MGT detection, highlighting the potential for transformer models in cross-domain AI-generated content detection.

pdf bib
BBN-U.Oregon’s ALERT system at GenAI Content Detection Task 3: Robust Authorship Style Representations for Cross-Domain Machine-Generated Text Detection
Hemanth Kandula | Chak Fai Li | Haoling Qiu | Damianos Karakos | Hieu Man | Thien Huu Nguyen | Brian Ulicny

This paper presents BBN-U.Oregon’s system, ALERT, submitted to the Shared Task 3: Cross-Domain Machine-Generated Text Detection. Our approach uses robust authorship-style representations to distinguish between human-authored and machine-generated text (MGT) across various domains. We employ an ensemble-based authorship attribution (AA) system that integrates stylistic embeddings from two complementary subsystems: one that focuses on cross-genre robustness with hard positive and negative mining strategies and another that captures nuanced semantic-lexical-authorship contrasts. This combination enhances cross-domain generalization, even under domain shifts and adversarial attacks. Evaluated on the RAID benchmark, our system demonstrates strong performance across genres and decoding strategies, with resilience against adversarial manipulation, achieving 91.8% TPR at FPR=5% on standard test sets and 82.6% on adversarial sets.

pdf bib
Random at GenAI Detection Task 3: A Hybrid Approach to Cross-Domain Detection of Machine-Generated Text with Adversarial Attack Mitigation
Shifali Agrahari | Prabhat Mishra | Sujit Kumar

Machine-generated text (MGT) detection has gained critical importance in the era of large language models, especially for maintaining trust in multilingual and cross-domain applications. This paper presents our work on Task 3 Subtask B: Adversarial Cross-Domain MGT Detection in the COLING 2025 DAIGenC Workshop. Task 3 emphasizes the complexity of detecting AI-generated text across eight domains, eleven generative models, and four decoding strategies, with the added challenge of adversarial manipulation. We propose a robust detection framework that combines transformer embeddings with Domain-Adversarial Neural Networks (DANN) to address domain variability and adversarial robustness. Our model demonstrates strong performance in identifying AI-generated text under adversarial conditions while highlighting the scope for future improvement.

pdf bib
MOSAIC at GENAI Detection Task 3 : Zero-Shot Detection Using an Ensemble of Models
Matthieu Dubois | François Yvon | Pablo Piantanida

MOSAIC introduces a new ensemble approach that combines several detector models to spot AI-generated texts. The method enhances the reliability of detection by integrating insights from multiple models, thus addressing the limitations of using a single detector model which often results in performance brittleness. This approach also involves using a theoretically grounded algorithm to minimize the worst-case expected encoding size across models, thereby optimizing the detection process. In this submission, we report evaluation results on the RAID benchmark, a comprehensive English-centric testbed for machine-generated texts. These results were obtained in the context of the “Cross-domain Machine-Generated Text Detection” shared task. We show that our model can be competitive for a variety of domains and generator models, but that it can be challenged by adversarial attacks and by changes in the text generation strategy.

pdf bib
GenAI Content Detection Task 3: Cross-Domain Machine Generated Text Detection Challenge
Liam Dugan | Andrew Zhu | Firoj Alam | Preslav Nakov | Marianna Apidianaki | Chris Callison-Burch

Recently there have been many shared tasks targeting the detection of generated text from Large Language Models (LLMs). However, these shared tasks tend to focus either on cases where text is limited to one particular domain or cases where text can be from many domains, some of which may not be seen during test time. In this shared task, using the newly released RAID benchmark, we aim to answer whether or not models can detect generated text from a large, yet fixed, number of domains and LLMs, all of which are seen during training. Over the course of three months, our task was attempted by 9 teams with 23 detector submissions. We find that multiple participants were able to obtain accuracies of over 99% on machine-generated text from RAID while maintaining a 5% False Positive Rate—suggesting that detectors are able to robustly detect text from many domains and models simultaneously. We discuss potential interpretations of this result and provide directions for future research.