Ruiting Shao


2024

A Natural Approach for Synthetic Short-Form Text Analysis
Ruiting Shao | Ryan Schwarz | Christopher Clifton | Edward Delp
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Detecting synthetically generated text in the wild has become increasingly difficult with advances in Natural Language Generation techniques and the proliferation of freely available Large Language Models (LLMs). Social media and news sites can be flooded with synthetically generated misinformation via tweets and posts, while authentic users can inadvertently spread this text via shares and retweets. Most modern natural language processing techniques designed to detect synthetically generated text focus primarily on long-form content, such as news articles, or incorporate stylometric characteristics and metadata during their analysis. Unfortunately, for short-form text like tweets, this information is often unavailable: the text is usually detached from its original source, displayed out of context, and often too short or informal to yield significant information from stylometry. This paper proposes a method of detecting synthetically generated tweets by using a Transformer architecture and incorporating unique style-based features. Additionally, we have created a new dataset consisting of human-generated and Large Language Model-generated tweets for four topics and another dataset consisting of tweets paraphrased by three different paraphrase models.