Barbora Štindlová


pdf bib
The MERLIN corpus: Learner language and the CEFR
Adriane Boyd | Jirka Hana | Lionel Nicolas | Detmar Meurers | Katrin Wisniewski | Andrea Abel | Karin Schöne | Barbora Štindlová | Chiara Vettori
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The MERLIN corpus is a written learner corpus for Czech, German,and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains 2,290 learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that enable research into the empirical foundations of the CEFR scales and provide language teachers, test developers, and Second Language Acquisition researchers with concrete examples of learner performance and progress across multiple proficiency levels. For computational linguistics, it provide a range of authentic learner data for three target languages, supporting a broadening of the scope of research in areas such as automatic proficiency classification or native language identification. The annotated corpus and related information will be freely available as a corpus resource and through a freely accessible, didactically-oriented online platform.


pdf bib
Building a learner corpus
Jirka Hana | Alexandr Rosen | Barbora Štindlová | Petr Jäger
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked levels to cope with a wide range of error types present in the input. Each level corrects different types of errors; links between the levels allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a doubly-annotated sample of approx. 10,000 words with fair inter-annotator agreement results. We also explore options of application of automated linguistic annotation tools (taggers, spell checkers and grammar checkers) on the learner text to support or even substitute manual annotation.


pdf bib
Error-Tagged Learner Corpus of Czech
Jirka Hana | Alexandr Rosen | Svatava Škodová | Barbora Štindlová
Proceedings of the Fourth Linguistic Annotation Workshop