Corpora Generation for Urdu Grammatical Error Correction

Syed Ahad; Burhanuddin Aliasghar Ezzi; Muhammad Arsalan Hussain; Sandesh Kumar; Abdul Samad

Abstract

Grammatical Error Correction (GEC) for Urdu remains under-researched because annotated datasets are lacking. This paper addresses the challenge of building a robust corpus for fine-tuning deep learning models for Urdu GEC. We propose a method for synthesizing a large dataset: we collect errors from the Urdu WikiEdits history, learn from them, and insert similar errors into grammatically correct sentences, yielding pairs of grammatically correct and incorrect sentences. We introduce UrduGEC-Synthetic, the dataset produced by this pipeline. We also introduce UrduGEC-Gold, a gold-standard dataset built by extracting errors from students' exam copies. Finally, we fine-tune various models on UrduGEC-Synthetic and evaluate them on UrduGEC-Gold to assess the quality of the synthetic data.
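The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of learning errors from edit history and re-inserting them into clean text. It assumes whitespace tokenization, single-token substitution errors, and toy English placeholder sentences; the function names (`learn_error_patterns`, `inject_errors`, `make_pairs`) are hypothetical and are not taken from the paper.

```python
import difflib
import random
from collections import defaultdict


def learn_error_patterns(revision_pairs):
    """Collect correct -> erroneous token substitutions.

    revision_pairs: iterable of (erroneous_sentence, corrected_sentence),
    e.g. the text before and after an edit. Assumption: whitespace tokens.
    """
    patterns = defaultdict(list)
    for erroneous, corrected in revision_pairs:
        err_tokens = erroneous.split()
        cor_tokens = corrected.split()
        matcher = difflib.SequenceMatcher(a=cor_tokens, b=err_tokens, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Keep only one-token-for-one-token replacements (a simplification).
            if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
                patterns[cor_tokens[i1]].append(err_tokens[j1])
    return dict(patterns)


def inject_errors(sentence, patterns, rng, rate=1.0):
    """Replace tokens that have a learned erroneous variant with probability `rate`."""
    out = []
    for token in sentence.split():
        if token in patterns and rng.random() < rate:
            out.append(rng.choice(patterns[token]))
        else:
            out.append(token)
    return " ".join(out)


def make_pairs(clean_sentences, patterns, seed=0):
    """Build (incorrect, correct) pairs, skipping sentences left unchanged."""
    rng = random.Random(seed)
    pairs = []
    for sentence in clean_sentences:
        noisy = inject_errors(sentence, patterns, rng)
        if noisy != sentence:
            pairs.append((noisy, sentence))
    return pairs


# Toy placeholder data, not drawn from any real edit history.
patterns = learn_error_patterns([("she go home", "she goes home")])
print(make_pairs(["he goes out", "we left"], patterns))
```

A real pipeline would likely need Urdu-aware tokenization, filtering of non-grammatical edits such as vandalism or content changes, and handling of insertions, deletions, and multi-token errors, none of which this sketch attempts.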

Anthology ID: 2026.findings-acl.2156
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 43428–43444
URL: https://aclanthology.org/2026.findings-acl.2156/
Cite (ACL): Syed Ahad, Burhanuddin Aliasghar Ezzi, Muhammad Arsalan Hussain, Sandesh Kumar, and Abdul Samad. 2026. Corpora Generation for Urdu Grammatical Error Correction. In Findings of the Association for Computational Linguistics: ACL 2026, pages 43428–43444, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Corpora Generation for Urdu Grammatical Error Correction (Ahad et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.2156.pdf
Checklist: 2026.findings-acl.2156.checklist.pdf
