NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts

Yue Zhang, Bo Zhang, Haochen Jiang, Zhenghua Li, Chen Li, Fei Huang, Min Zhang


Abstract
We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. Previous CGEC research primarily focuses on correcting texts from a single domain, especially learner essays. To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination. We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data. We further perform detailed analyses of the connections and gaps between our domains from both empirical and statistical views. We hope this work can inspire future studies on an important but under-explored direction–cross-domain GEC.
Anthology ID:
2023.findings-acl.630
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
9935–9951
Language:
URL:
https://aclanthology.org/2023.findings-acl.630
DOI:
10.18653/v1/2023.findings-acl.630
Bibkey:
Cite (ACL):
Yue Zhang, Bo Zhang, Haochen Jiang, Zhenghua Li, Chen Li, Fei Huang, and Min Zhang. 2023. NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9935–9951, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts (Zhang et al., Findings 2023)
Copy Citation:
PDF:
https://aclanthology.org/2023.findings-acl.630.pdf