Alexandr Rosen


Czech Grammar Error Correction with a Large and Diverse Corpus
Jakub Náplava | Milan Straka | Jana Straková | Alexandr Rosen
Transactions of the Association for Computational Linguistics, Volume 10

We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim of contributing to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) covers four domains, with error distributions ranging from high-error-density essays written by non-native speakers to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline for future research. Finally, we meta-evaluate common GEC metrics against human judgments on our data. We make the new Czech GEC corpus publicly available under the CC BY-SA 4.0 license at


Modeling non-standard language
Alexandr Rosen
Proceedings of the Workshop on Grammar and Lexicon: interactions and interfaces (GramLex)

A specific language as used by different speakers and in different situations has a number of more or less distant varieties. Extending the notion of non-standard language to varieties that do not fit an explicitly or implicitly assumed norm or pattern, we look for methods and tools that could be applied to this domain. The needs start from the theoretical side: categories usable for the analysis of non-standard language are not readily available, and continue to methods and tools required for its detection and diagnostics. A general discussion of issues related to non-standard language is followed by two case studies. The first study presents a taxonomy of morphosyntactic categories as an attempt to analyse non-standard forms produced by non-native learners of Czech. The second study focusses on the role of a rule-based grammar and lexicon in the process of building and using a parsebank.


Analytic Morphology – Merging the Paradigmatic and Syntagmatic Perspective in a Treebank
Vladimír Petkevič | Alexandr Rosen | Hana Skoumalová | Přemysl Vítovec
The 5th Workshop on Balto-Slavic Natural Language Processing


Building a multilingual parallel corpus for human users
Alexandr Rosen | Martin Vavřín
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present the architecture and the current state of InterCorp, a multilingual parallel corpus centered around Czech, intended primarily for human users and consisting of written texts with a focus on fiction. Following an outline of its recent development and a comparison with some other multilingual parallel corpora we give an overview of the data collection procedure that covers text selection criteria, data format, conversion, alignment, lemmatization and tagging. Finally, we show a sample query using the web-based search interface and discuss challenges and prospects of the project.

Building a learner corpus
Jirka Hana | Alexandr Rosen | Barbora Štindlová | Petr Jäger
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked levels to cope with a wide range of error types present in the input. Each level corrects different types of errors; links between the levels allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a doubly-annotated sample of approx. 10,000 words with fair inter-annotator agreement results. We also explore options for applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even replace manual annotation.

Korektor – A System for Contextual Spell-Checking and Diacritics Completion
Michal Richter | Pavel Straňák | Alexandr Rosen
Proceedings of COLING 2012: Posters


Error-Tagged Learner Corpus of Czech
Jirka Hana | Alexandr Rosen | Svatava Škodová | Barbora Štindlová
Proceedings of the Fourth Linguistic Annotation Workshop


Derivation of Underlying Valency Frames From a Learner’s Dictionary
Alexandr Rosen | Eva Hajičová | Jan Hajič
COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics