A Stylometric and Statistical Pipeline for Urdu AI-Generated Text Classification

saeed A. Anabtawi


Abstract
The proliferation of Large Language Models (LLMs) has introduced significant challenges regarding algorithmic bias, privacy, and the authenticity of digital content. While detection mechanisms for English are maturing, low-resource languages like Urdu, spoken by over 100 million people, require dedicated research. In this paper, we present a technical framework for Urdu AI-generated text detection developed for the *ACL shared task. We propose a hybrid pipeline that combines TF-IDF character n-grams with a custom stylometric feature extractor designed to capture Urdu-specific linguistic markers, including repeated word ratios, punctuation density, and formal function markers. Using a linear Support Vector Machine (SVM) optimized via Stochastic Gradient Descent (SGD), our system achieves 87.80% on both balanced accuracy and F1-score on a dataset of 6,800 records. Our results demonstrate that a computationally efficient, classical machine learning approach, one that prioritizes stylistic signals over heavy preprocessing, remains highly effective for distinguishing between human-written and AI-generated Urdu text.
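The three stylometric signals named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact feature definitions, so everything below is an assumption made for illustration: the function names `tokenize` and `stylometric_features`, the `PUNCTUATION` inventory, the `FORMAL_MARKERS` word list, and the choice of normalizers (tokens for the word-based ratios, characters for punctuation density) are hypothetical and not taken from the paper.

```python
from collections import Counter

# Hypothetical punctuation inventory: common Urdu/Arabic-script marks plus ASCII ones.
PUNCTUATION = set("۔،؛؟!.,;:?\"'()")

# Hypothetical formal function words (roughly: however, therefore, consequently, rather).
# The paper's actual marker list is not given in the abstract.
FORMAL_MARKERS = {"تاہم", "لہذا", "چنانچہ", "بلکہ"}


def tokenize(text):
    """Split on whitespace and strip leading/trailing punctuation from each token."""
    strip_chars = "".join(PUNCTUATION)
    tokens = []
    for raw in text.split():
        word = raw.strip(strip_chars)
        if word:
            tokens.append(word)
    return tokens


def stylometric_features(text):
    """Return three illustrative stylometric ratios for one document."""
    tokens = tokenize(text)
    n_tokens = len(tokens)
    n_chars = len(text)

    counts = Counter(tokens)
    # Number of tokens whose word type occurs more than once in the document.
    repeated = sum(c for c in counts.values() if c > 1)
    # Number of characters that belong to the punctuation inventory.
    punct = sum(1 for ch in text if ch in PUNCTUATION)
    # Number of tokens found in the formal-marker list.
    formal = sum(1 for t in tokens if t in FORMAL_MARKERS)

    return {
        "repeated_word_ratio": repeated / n_tokens if n_tokens else 0.0,
        "punctuation_density": punct / n_chars if n_chars else 0.0,
        "formal_marker_ratio": formal / n_tokens if n_tokens else 0.0,
    }
```

In a hybrid pipeline of the kind described, a vector like this would be concatenated with the TF-IDF character n-gram features before training the linear classifier. One possible realization, again an assumption rather than the author's stated setup, is scikit-learn's `TfidfVectorizer(analyzer="char")` feeding an `SGDClassifier(loss="hinge")`.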
Anthology ID: 2026.abjadnlp-1.58
Volume: Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Month: March
Year: 2026
Address: Rabat, Morocco
Venues: AbjadNLP | WS
Publisher: Association for Computational Linguistics
Pages: 472–475
URL: https://aclanthology.org/2026.abjadnlp-1.58/
Cite (ACL): saeed A. Anabtawi. 2026. A Stylometric and Statistical Pipeline for Urdu AI-Generated Text Classification. In Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script, pages 472–475, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): A Stylometric and Statistical Pipeline for Urdu AI-Generated Text Classification (Anabtawi, AbjadNLP 2026)
PDF: https://aclanthology.org/2026.abjadnlp-1.58.pdf