Towards Better Utilization of Multi-Reference Training Data for Chinese Grammatical Error Correction

Yumeng Liu, Zhenghua Li, HaoChen Jiang, Bo Zhang, Chen Li, Ji Zhang


Abstract
For the grammatical error correction (GEC) task, an erroneous input sentence can usually be corrected in multiple ways, leading to multiple references. Observing the high proportion of multi-reference instances in Chinese GEC training data, we conduct a systematic study on how to better utilize multi-reference training data. We propose two new approaches and a simple two-stage training strategy. We compare them against previously proposed approaches on two Chinese training datasets, i.e., Lang-8 for second language learner texts and FCGEC-Train for native speaker texts, and three test datasets. The experiments and analyses demonstrate the effectiveness of our proposed approaches and reveal interesting insights. Our code is available at https://github.com/ymliucs/MrGEC.
Anthology ID: 2024.findings-acl.180
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3044–3052
URL: https://aclanthology.org/2024.findings-acl.180
DOI: 10.18653/v1/2024.findings-acl.180
Cite (ACL): Yumeng Liu, Zhenghua Li, HaoChen Jiang, Bo Zhang, Chen Li, and Ji Zhang. 2024. Towards Better Utilization of Multi-Reference Training Data for Chinese Grammatical Error Correction. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3044–3052, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Towards Better Utilization of Multi-Reference Training Data for Chinese Grammatical Error Correction (Liu et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.180.pdf