Tomáš Jelínek
2016
SYN2015: Representative Corpus of Contemporary Written Czech
Michal Křen
|
Václav Cvrček
|
Tomáš Čapka
|
Anna Čermáková
|
Milena Hnátková
|
Lucie Chlumská
|
Tomáš Jelínek
|
Dominika Kováříková
|
Vladimír Petkevič
|
Pavel Procházka
|
Hana Skoumalová
|
Michal Škrabal
|
Petr Truneček
|
Pavel Vondřička
|
Adrian Jan Zasina
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The paper concentrates on the design, composition and annotation of SYN2015, a new 100-million representative corpus of contemporary written Czech. SYN2015 is a sequel of the representative corpora of the SYN series that can be described as traditional (as opposed to the web-crawled corpora), featuring cleared copyright issues, well-defined composition, reliability of annotation and high-quality text processing. At the same time, SYN2015 is designed as a reflection of the variety of written Czech text production with necessary methodological and technological enhancements that include a detailed bibliographic annotation and text classification based on an updated scheme. The corpus has been produced using a completely rebuilt text processing toolchain called SynKorp. SYN2015 is lemmatized, morphologically and syntactically annotated with state-of-the-art tools. It has been published within the framework of the Czech National Corpus and it is available via the standard corpus query interface KonText at http://kontext.korpus.cz as well as a dataset in shuffled format.
2014
Improvements to Dependency Parsing Using Automatic Simplification of Data
Tomáš Jelínek
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In dependency parsing, much effort is devoted to the development of new methods of language modeling and better feature settings. Less attention is paid to actual linguistic data and how appropriate they are for automatic parsing: linguistic data can be too complex for a given parser, morphological tags may not reflect well syntactic properties of words, a detailed, complex annotation scheme may be ill suited for automatic parsing. In this paper, I present a study of this problem on the following case: automatic dependency parsing using the data of the Prague Dependency Treebank with two dependency parsers: MSTParser and MaltParser. I show that by means of small, reversible simplifications of the text and of the annotation, a considerable improvement of parsing accuracy can be achieved. In order to facilitate the task of language modeling performed by the parser, I reduce variability of lemmas and forms in the text. I modify the system of morphological annotation to adapt it better for parsing. Finally, the dependency annotation scheme is also partially modified. All such modifications are automatic and fully reversible: after the parsing is done, the original data and structures are automatically restored. With MaltParser, I achieve an 8.3% error rate reduction.