Jesse Atuhurra
2025
HLU: Human Vs LLM Generated Text Detection Dataset for Urdu at Multiple Granularities
Iqra Ali | Jesse Atuhurra | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics
The rise of large language models (LLMs) that generate human-like text has raised concerns about misuse, especially in low-resource languages like Urdu. To address this gap, we introduce the HLU dataset, which consists of three datasets at the document, paragraph, and sentence levels. The document-level dataset contains 1,014 instances of human-written and LLM-generated articles across 13 domains, while the paragraph-level and sentence-level datasets each contain 667 instances. We conducted both human and automatic evaluations. In the human evaluation, the average accuracy at the document level was 35%, while at the paragraph and sentence levels, accuracies were 75.68% and 88.45%, respectively. For automatic evaluation, we fine-tuned the XLM-RoBERTa model in both monolingual and multilingual settings, achieving consistent results in both. Additionally, we assessed the performance of GPT-4 and Claude 3 Opus using zero-shot prompting. Our experiments and evaluations indicate that distinguishing between human-written and machine-generated text is challenging for both humans and LLMs, and the dataset marks a significant step toward addressing this issue in Urdu.