Developing NLP Tools with a New Corpus of Learner Spanish

Sam Davidson, Aaron Yamada, Paloma Fernandez Mira, Agustina Carando, Claudia H. Sanchez Gutierrez, Kenji Sagae


Abstract
The development of effective NLP tools for the L2 classroom depends largely on the availability of large annotated corpora of language learner text. While annotated learner corpora of English are widely available, large learner corpora of Spanish are less common. Those Spanish corpora that are available do not contain the annotations needed to facilitate the development of tools beneficial to language learners, such as grammatical error correction systems. As a result, the field has seen little research on NLP tools designed to benefit Spanish language learners and teachers. We introduce COWS-L2H, a freely available corpus of Spanish learner data which includes error annotations and parallel corrected text to help researchers better understand L2 development, to examine teaching practices empirically, and to develop NLP tools to better serve the Spanish teaching community. We demonstrate the utility of this corpus by developing a neural network-based grammatical error correction system for Spanish learner writing.
Anthology ID:
2020.lrec-1.894
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
7238–7243
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.894
Cite (ACL):
Sam Davidson, Aaron Yamada, Paloma Fernandez Mira, Agustina Carando, Claudia H. Sanchez Gutierrez, and Kenji Sagae. 2020. Developing NLP Tools with a New Corpus of Learner Spanish. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 7238–7243, Marseille, France. European Language Resources Association.
Cite (Informal):
Developing NLP Tools with a New Corpus of Learner Spanish (Davidson et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.894.pdf