n-gram F-score for Evaluating Grammatical Error Correction

Shota Koyama, Ryo Nagata, Hiroya Takamura, Naoaki Okazaki


Abstract
M2 and its variants, the most widely used automatic evaluation metrics for grammatical error correction (GEC), calculate an F-score using a phrase-based alignment between sentences. However, aligning learner sentences that contain errors to their corrected counterparts is not straightforward, and computing alignments is computationally expensive. We propose GREEN, an alignment-free F-score for GEC evaluation. GREEN treats a sentence as a multiset of n-grams and extracts edits between sentences by set operations instead of computing an alignment. Our experiments confirm that, even without computing an alignment, GREEN outperforms existing methods as a corpus-level metric and performs comparably as a sentence-level metric. GREEN is available at https://github.com/shotakoyama/green.
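The multiset idea in the abstract can be illustrated with a short sketch. This is not GREEN's actual scoring formula, which the abstract does not specify. It only shows, under stated assumptions, how edits can be read off n-gram multisets with set operations rather than an alignment. The function names (`ngrams`, `edits`, `f_score`), the single n-gram order, the default `beta=0.5`, and the treatment of empty edit sets are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset (Counter) of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def edits(source, target, n):
    """Extract edits by multiset difference; no alignment is computed.

    Returns (inserted, deleted): n-grams gained and lost going from
    source to target. Counter subtraction keeps positive counts only.
    """
    src, tgt = ngrams(source, n), ngrams(target, n)
    return tgt - src, src - tgt


def f_score(source, hypothesis, reference, n=1, beta=0.5):
    """Illustrative F-score over n-gram edits (assumed form, not GREEN's)."""
    hyp_ins, hyp_del = edits(source, hypothesis, n)
    ref_ins, ref_del = edits(source, reference, n)

    # An edit counts as correct when the reference makes it too
    # (multiset intersection takes the minimum count per n-gram).
    tp = sum((hyp_ins & ref_ins).values()) + sum((hyp_del & ref_del).values())
    hyp_total = sum(hyp_ins.values()) + sum(hyp_del.values())
    ref_total = sum(ref_ins.values()) + sum(ref_del.values())

    # Assumption: an empty edit set yields precision/recall of 1.0.
    precision = tp / hyp_total if hyp_total else 1.0
    recall = tp / ref_total if ref_total else 1.0

    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


source = "He go to school yesterday".split()
reference = "He went to school yesterday".split()
hypothesis = "He goes to school yesterday".split()
print(f_score(source, hypothesis, reference))
```

In this example the hypothesis deletes "go", as the reference does, but inserts "goes" where the reference inserts "went", so one of its two unigram edits matches the reference.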
Anthology ID:
2024.inlg-main.25
Volume:
Proceedings of the 17th International Natural Language Generation Conference
Month:
September
Year:
2024
Address:
Tokyo, Japan
Editors:
Saad Mahamood, Nguyen Le Minh, Daphne Ippolito
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
303–313
URL:
https://aclanthology.org/2024.inlg-main.25
Cite (ACL):
Shota Koyama, Ryo Nagata, Hiroya Takamura, and Naoaki Okazaki. 2024. n-gram F-score for Evaluating Grammatical Error Correction. In Proceedings of the 17th International Natural Language Generation Conference, pages 303–313, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
n-gram F-score for Evaluating Grammatical Error Correction (Koyama et al., INLG 2024)
PDF:
https://aclanthology.org/2024.inlg-main.25.pdf