To Err Is Human, but Llamas Can Learn It Too

Agnes Luhtaru, Taido Purason, Martin Vainikko, Maksym Del, Mark Fishel


Abstract
This study explores enhancing grammatical error correction (GEC) through automatic error generation (AEG) using language models (LMs). Specifically, we fine-tune Llama 2 LMs for error generation and find that this approach yields synthetic errors akin to human errors. Next, we train GEC Llama models using these artificial errors and outperform previous state-of-the-art error correction models, with gains ranging between 0.8 and 6 F0.5 points across all tested languages (German, Ukrainian, and Estonian). Moreover, we demonstrate that generating errors by fine-tuning smaller sequence-to-sequence models and by prompting large commercial LMs (GPT-3.5 and GPT-4) also yields synthetic errors that benefit error correction models. We openly release the trained models for error generation and correction, as well as all the synthesized error datasets for the covered languages.
Anthology ID:
2024.findings-emnlp.727
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12466–12481
URL:
https://aclanthology.org/2024.findings-emnlp.727
Cite (ACL):
Agnes Luhtaru, Taido Purason, Martin Vainikko, Maksym Del, and Mark Fishel. 2024. To Err Is Human, but Llamas Can Learn It Too. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12466–12481, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
To Err Is Human, but Llamas Can Learn It Too (Luhtaru et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.727.pdf