Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction

Shirong Ma; Yinghui Li; Rongyi Sun; Qingyu Zhou; Shulin Huang; Ding Zhang; Li Yangning; Ruiyang Liu; Zhongli Li; Yunbo Cao; Hai-Tao Zheng; Ying Shen

doi:10.18653/v1/2022.findings-emnlp.40

Abstract

Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in daily life. Recently, many data-driven approaches have been proposed to advance CGEC research. However, the field has two major limitations. First, the lack of high-quality annotated training corpora prevents existing CGEC models from improving significantly. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, which leaves a significant gap between CGEC models and real-world applications. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments and detailed analyses demonstrate that the training data constructed by our method effectively improves the performance of CGEC models, and indicate that our benchmark is an excellent resource for further development of the CGEC field.

Anthology ID: 2022.findings-emnlp.40
Volume: Findings of the Association for Computational Linguistics: EMNLP 2022
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates
Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 576–589
URL: https://aclanthology.org/2022.findings-emnlp.40/
DOI: 10.18653/v1/2022.findings-emnlp.40
Cite (ACL): Shirong Ma, Yinghui Li, Rongyi Sun, Qingyu Zhou, Shulin Huang, Ding Zhang, Li Yangning, Ruiyang Liu, Zhongli Li, Yunbo Cao, Haitao Zheng, and Ying Shen. 2022. Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 576–589, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction (Ma et al., Findings 2022)
PDF: https://aclanthology.org/2022.findings-emnlp.40.pdf
