Training for Grammatical Error Correction Without Human-Annotated L2 Learners’ Corpora

Mikio Oda


Abstract
Grammatical error correction (GEC) is a challenging task both for second language (L2) learners and for machine learning systems. Data-driven GEC learning requires as much genuine human-annotated training data as possible. However, human-annotated data is difficult to produce at larger scale, so synthetically generated large-scale parallel training data is valuable for GEC systems. In this paper, we propose a method for rebuilding a corpus of synthetic parallel data using target sentences predicted by a GEC model to improve performance. Experimental results show that pre-training on the rebuilt corpus outperforms pre-training on the original synthetic datasets. Moreover, we show that our proposed training without human-annotated L2 learners’ corpora is as practical, in terms of accuracy, as conventional full pipeline training with both synthetic datasets and L2 learners’ corpora.
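The corpus-rebuilding idea in the abstract can be illustrated with a minimal sketch. It shows one possible reading only: each synthetic source sentence is kept and its target is replaced by the output of a GEC model. The abstract does not give the actual procedure, so the names `rebuild_corpus`, `correct`, and `toy_correct` are hypothetical, and the toy corrector merely stands in for a trained model.

```python
from typing import Callable, Iterable, List, Tuple

# (erroneous source sentence, corrected target sentence)
Pair = Tuple[str, str]


def rebuild_corpus(
    pairs: Iterable[Pair],
    correct: Callable[[str], str],
) -> List[Pair]:
    """Keep each source sentence and replace its target with a model prediction.

    Assumption: `correct` wraps some GEC model; the paper's real pipeline
    may differ in which sentences are fed to the model and how.
    """
    rebuilt = []
    for source, _original_target in pairs:
        predicted_target = correct(source)
        rebuilt.append((source, predicted_target))
    return rebuilt


def toy_correct(sentence: str) -> str:
    """Hypothetical stand-in for a trained GEC model (simple string fixes)."""
    fixes = {"goed": "went", "a apple": "an apple"}
    for wrong, right in fixes.items():
        sentence = sentence.replace(wrong, right)
    return sentence


# Illustrative synthetic pairs, not taken from any real dataset.
synthetic = [
    ("She goed home .", "She went home ."),
    ("I ate a apple .", "I ate an apple yesterday ."),
]
rebuilt = rebuild_corpus(synthetic, toy_correct)
```

In this sketch the second pair's original target contains a word that is absent from the source, and the rebuilt target drops it because it is derived only from the source sentence. Whether this kind of effect explains the reported gains is not stated in the abstract.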
Anthology ID:
2023.bea-1.38
Volume:
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Ekaterina Kochmar, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Nitin Madnani, Anaïs Tack, Victoria Yaneva, Zheng Yuan, Torsten Zesch
Venue:
BEA
SIG:
SIGEDU
Publisher:
Association for Computational Linguistics
Pages:
455–465
URL:
https://aclanthology.org/2023.bea-1.38
DOI:
10.18653/v1/2023.bea-1.38
Cite (ACL):
Mikio Oda. 2023. Training for Grammatical Error Correction Without Human-Annotated L2 Learners’ Corpora. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 455–465, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Training for Grammatical Error Correction Without Human-Annotated L2 Learners’ Corpora (Oda, BEA 2023)
PDF:
https://aclanthology.org/2023.bea-1.38.pdf
Video:
https://aclanthology.org/2023.bea-1.38.mp4