Dominika Kováříková
2016
SYN2015: Representative Corpus of Contemporary Written Czech
Michal Křen | Václav Cvrček | Tomáš Čapka | Anna Čermáková | Milena Hnátková | Lucie Chlumská | Tomáš Jelínek | Dominika Kováříková | Vladimír Petkevič | Pavel Procházka | Hana Skoumalová | Michal Škrabal | Petr Truneček | Pavel Vondřička | Adrian Jan Zasina
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The paper concentrates on the design, composition and annotation of SYN2015, a new 100-million-word representative corpus of contemporary written Czech. SYN2015 continues the SYN series of representative corpora, which can be described as traditional (as opposed to web-crawled corpora): they feature cleared copyright, well-defined composition, reliable annotation and high-quality text processing. At the same time, SYN2015 is designed to reflect the variety of written Czech text production, with the necessary methodological and technological enhancements, including detailed bibliographic annotation and text classification based on an updated scheme. The corpus was produced using a completely rebuilt text processing toolchain called SynKorp. SYN2015 is lemmatized and morphologically and syntactically annotated with state-of-the-art tools. It has been published within the framework of the Czech National Corpus and is available via the standard corpus query interface KonText at http://kontext.korpus.cz, as well as in the form of a dataset in shuffled format.