Hi-GEC: Hindi Grammar Error Correction in Low Resource Scenario

Ujjwal Sharma, Pushpak Bhattacharyya


Abstract
Automated Grammatical Error Correction (GEC) has been extensively researched in Natural Language Processing (NLP), with a primary focus on English and other resource-rich languages. This paper shifts the focus to GEC for Hindi, a scarcely explored low-resource language that presents unique challenges due to its intricate morphology and complex syntax. To address data resource limitations, this work explores various GEC data generation techniques. Our research introduces HiWikiEdits, a carefully extracted and filtered, high-quality dataset of 8,137 human-edited instances sourced from Wikipedia, encompassing 17 diverse grammatical error types, with annotations performed using the ERRANT toolkit. Furthermore, we investigate Round Trip Translation (RTT) through diverse pivot languages for synthetic Hindi GEC data generation, revealing that leveraging a high-resource, linguistically distant language for error generation outperforms using mid-resource, linguistically closer languages. Specifically, using English as a pivot language resulted in a 6.25% improvement in GLEU score compared to using Assamese or Marathi. Finally, we also investigate a neural model-based synthetic error-generation technique and show that it achieves performance comparable to other synthetic data generation methods, even in low-resource settings.
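As a rough illustration of the Round Trip Translation idea mentioned in the abstract, the sketch below translates a clean Hindi sentence into a pivot language and back, then keeps the pair only if the round trip changed the sentence. It is a minimal sketch under stated assumptions, not the authors' pipeline: `round_trip_pairs`, the `translate` callable, and `toy_translate` with its two-sentence lookup table are all hypothetical stand-ins for a real machine translation system.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]


def round_trip_pairs(
    sentences: Iterable[str],
    translate: Translator,
    source_lang: str = "hi",
    pivot_lang: str = "en",
) -> List[Tuple[str, str]]:
    """Build (noisy_source, clean_target) pairs via round trip translation."""
    pairs = []
    for clean in sentences:
        pivot = translate(clean, source_lang, pivot_lang)
        noisy = translate(pivot, pivot_lang, source_lang)
        # Keep only pairs where the round trip actually altered the sentence.
        if noisy.strip() != clean.strip():
            pairs.append((noisy, clean))
    return pairs


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Toy lookup-table translator (an assumption, for illustration only)."""
    table = {
        ("hi", "en"): {
            "वह स्कूल जाती है": "She goes to school",
            "मैं घर जा रहा हूँ": "I am going home",
        },
        ("en", "hi"): {
            # English does not mark gender on the verb, so the back-translation
            # here picks the masculine form and introduces an agreement error.
            "She goes to school": "वह स्कूल जाता है",
            "I am going home": "मैं घर जा रहा हूँ",
        },
    }
    return table[(src, tgt)].get(text, text)


clean_sentences = ["वह स्कूल जाती है", "मैं घर जा रहा हूँ"]
pairs = round_trip_pairs(clean_sentences, toy_translate)
for noisy, clean in pairs:
    print(f"{noisy} -> {clean}")
```

In this toy run only the first sentence survives the filter, because the second comes back unchanged. Swapping `pivot_lang` is where a comparison between pivots such as English, Assamese, or Marathi would plug in, given a real translator for each language pair.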
Anthology ID:
2025.coling-main.406
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
6063–6075
URL:
https://aclanthology.org/2025.coling-main.406/
Cite (ACL):
Ujjwal Sharma and Pushpak Bhattacharyya. 2025. Hi-GEC: Hindi Grammar Error Correction in Low Resource Scenario. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6063–6075, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Hi-GEC: Hindi Grammar Error Correction in Low Resource Scenario (Sharma & Bhattacharyya, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.406.pdf