Building a learner corpus

Jirka Hana, Alexandr Rosen, Barbora Štindlová, Petr Jäger


Abstract
The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, which consists of three interlinked levels designed to cope with the wide range of error types present in the input. Each level corrects different types of errors; links between the levels make it possible to capture errors in word order and in complex discontinuous expressions. Errors are not only corrected but also classified. The annotation scheme is tested on a doubly-annotated sample of approx. 10,000 words with fair inter-annotator agreement results. We also explore ways of applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even replace manual annotation.
Anthology ID:
L12-1591
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3228–3232
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/992_Paper.pdf
Cite (ACL):
Jirka Hana, Alexandr Rosen, Barbora Štindlová, and Petr Jäger. 2012. Building a learner corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3228–3232, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Building a learner corpus (Hana et al., LREC 2012)