saeed A. Anabtawi

2026

A Stylometric and Statistical Pipeline for Urdu AI-Generated Text Classification
saeed A. Anabtawi
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script

The proliferation of Large Language Models (LLMs) has introduced significant challenges regarding algorithmic bias, privacy, and the authenticity of digital content. While detection mechanisms for English are maturing, low-resource languages like Urdu, spoken by over 100 million people, require dedicated research. In this paper, we present a technical framework for Urdu AI-generated text detection developed for the *ACL shared task. We propose a hybrid pipeline that combines TF-IDF character n-grams with a custom stylometric feature extractor designed to capture Urdu-specific linguistic markers, including repeated-word ratios, punctuation density, and formal function markers. Using a linear Support Vector Machine (SVM) optimized via Stochastic Gradient Descent (SGD), our system achieves a balanced accuracy and F₁-score of 87.80% on a dataset of 6,800 records. Our results demonstrate that a computationally efficient, classical machine learning approach, one that prioritizes stylistic signals over heavy preprocessing, remains highly effective for distinguishing between human-written and AI-generated Urdu text.
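The three stylometric signals named in the abstract can be illustrated with a minimal sketch. This is not the paper's extractor: the punctuation inventory, the marker list, the tokenization, and the exact ratio definitions below are all assumptions made for illustration, since the abstract gives only the feature names.

```python
from collections import Counter

# Assumed punctuation inventory (Urdu full stop, Arabic comma, semicolon,
# question mark, plus common ASCII marks). Not taken from the paper.
PUNCTUATION = set("۔،؛؟!.,:;?\"'()-")

# Hypothetical formal function markers; the paper's actual list is not given.
FORMAL_MARKERS = {"لہذا", "چنانچہ", "تاہم", "بلکہ", "نیز"}


def tokenize(text):
    """Split on whitespace after replacing punctuation with spaces."""
    cleaned = "".join(" " if ch in PUNCTUATION else ch for ch in text)
    return cleaned.split()


def stylometric_features(text):
    """Return three illustrative stylometric ratios for one document."""
    tokens = tokenize(text)
    n_tokens = len(tokens)
    n_chars = len(text)

    counts = Counter(tokens)
    # Tokens belonging to any word type that occurs more than once.
    repeated = sum(c for c in counts.values() if c > 1)
    # Punctuation characters, counted over the raw text.
    punct = sum(1 for ch in text if ch in PUNCTUATION)
    # Tokens that exactly match an assumed formal marker.
    formal = sum(1 for t in tokens if t in FORMAL_MARKERS)

    return {
        "repeated_word_ratio": repeated / n_tokens if n_tokens else 0.0,
        "punctuation_density": punct / n_chars if n_chars else 0.0,
        "formal_marker_ratio": formal / n_tokens if n_tokens else 0.0,
    }
```

In a hybrid pipeline of the kind the abstract describes, a dense vector like this would be concatenated with the sparse TF-IDF character n-gram vector before being passed to the linear classifier. How the two feature groups are scaled or weighted is not stated in the abstract.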

