Suraiya Parween


2024

Detecting AI-Generated Text with Pre-Trained Models Using Linguistic Features
Annepaka Yadagiri | Lavanya Shree | Suraiya Parween | Anushka Raj | Shreya Maurya | Partha Pakray
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

The advent of sophisticated large language models, such as ChatGPT and other AI-driven platforms, has led to the generation of text that closely mimics human writing, making it increasingly difficult to discern whether content is human-written or AI-generated. This poses significant challenges for content verification, academic integrity, and the detection of misleading information. To address these issues, we developed a classification system that differentiates between human-written and AI-generated texts using the diverse HC3-English dataset. Our approach leverages linguistic and structural features, including part-of-speech tags, vocabulary size, word density, active and passive voice usage, and readability metrics such as Flesch Reading Ease, together with perplexity and burstiness. For the classification task we employed deep-learning and transformer-based models, namely CNN_BiLSTM, RNN, BERT, GPT-2, and RoBERTa. Among these, RoBERTa performed best, achieving an accuracy of 99.73%. These outcomes demonstrate how state-of-the-art deep learning methods can help maintain information integrity in the digital realm.
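
The feature-driven side of such a pipeline can be illustrated with a small sketch. The snippet below computes a few of the listed features (vocabulary size, word density, Flesch Reading Ease, and a burstiness proxy) from raw text; the function names and exact feature definitions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of hand-rolled linguistic features similar to those named in the
# abstract. Feature definitions here are assumptions, not the paper's exact ones.
import re
import statistics

def count_syllables(word: str) -> int:
    # Crude vowel-group heuristic; adequate for a rough readability estimate.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def extract_features(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)

    words_per_sentence = len(words) / max(1, len(sentences))
    syllables_per_word = syllables / max(1, len(words))

    # Flesch Reading Ease: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
    flesch = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

    # Burstiness proxy: dispersion of sentence lengths, (sigma - mu) / (sigma + mu).
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    mu = statistics.mean(lengths) if lengths else 0.0
    sigma = statistics.pstdev(lengths) if lengths else 0.0
    burstiness = (sigma - mu) / (sigma + mu) if (sigma + mu) else 0.0

    return {
        "vocabulary_size": len({w.lower() for w in words}),
        "word_density": words_per_sentence,
        "flesch_reading_ease": flesch,
        "burstiness": burstiness,
    }

print(extract_features("AI text can be uniform. Human text often varies in rhythm and length."))
```

The classification side can be sketched similarly. The outline below fine-tunes RoBERTa as a binary human-vs-AI classifier with the Hugging Face Trainer; the checkpoint, file names, column names, and hyperparameters are assumptions rather than the paper's reported configuration.

```python
# Hedged sketch of fine-tuning RoBERTa for binary human-vs-AI text classification.
# Assumes CSV files with "text" and "label" columns prepared from HC3-English.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

dataset = load_dataset("csv", data_files={"train": "hc3_train.csv", "test": "hc3_test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="roberta-hc3", num_train_epochs=3,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()
```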