2012
A High-Quality Web Corpus of Czech
Johanka Spoustová | Miroslav Spousta
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
In our paper, we present the main results of the Czech grant project Internet as a Language Corpus, whose aim was to build a corpus of Czech web texts and to develop and publicly release related software tools. Our corpus may not be the largest web corpus of Czech, but it maintains very good language quality thanks to the high proportion of human work involved in the corpus development process. We describe the corpus contents (2.65 billion words divided into three parts: 450 million words from news and magazine articles, 1 billion words from blogs, diaries and other non-reviewed literary units, and 1.1 billion words from discussion messages), the particular steps of corpus creation (crawling, HTML and boilerplate removal, near-duplicate removal, language filtering) and its automatic linguistic annotation (POS tagging, syntactic parsing). We also describe the software tools we are releasing under an open-source license, especially a fast linear-time module for removing near-duplicates at the paragraph level.
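As an illustration of paragraph-level duplicate removal in a single linear pass, the following is a minimal sketch and not the authors' released module. It swaps in a simpler technique: each paragraph is normalized (case-folded, punctuation and whitespace collapsed) and fingerprinted with a hash, so it only catches paragraphs that differ in case, punctuation or spacing. The function names and the blank-line paragraph convention are assumptions made for this example.

```python
import hashlib
import re

# Runs of non-word characters (punctuation, whitespace, underscores).
_NON_WORD = re.compile(r"[\W_]+", re.UNICODE)


def fingerprint(paragraph: str) -> str:
    """Hash a paragraph after case-folding and collapsing punctuation/spacing."""
    normalized = _NON_WORD.sub(" ", paragraph.casefold()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def drop_repeated_paragraphs(documents):
    """Keep only the first occurrence of each normalized paragraph.

    Assumes paragraphs are separated by blank lines. Each paragraph is
    hashed once and looked up in a set, so the pass is linear in the
    total text length (in expectation).
    """
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for paragraph in doc.split("\n\n"):
            if not paragraph.strip():
                continue  # skip empty paragraphs
            key = fingerprint(paragraph)
            if key in seen:
                continue  # already seen elsewhere in the corpus
            seen.add(key)
            kept.append(paragraph)
        cleaned.append("\n\n".join(kept))
    return cleaned
```

A true near-duplicate detector would also have to tolerate small wording differences, for example by comparing word n-gram overlap; that is deliberately left out of this sketch.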
2011
Comparable Fora
Johanka Spoustová | Miroslav Spousta
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
2010
Building a Web Corpus of Czech
Drahomíra „johanka“ Spoustová | Miroslav Spousta | Pavel Pecina
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Large corpora are essential to modern methods of computational linguistics and natural language processing. In this paper, we describe an ongoing project whose aim is to build the largest corpus of Czech texts. We are building the corpus from Czech Internet web pages, using (and, if needed, developing) advanced downloading, cleaning and automatic linguistic processing tools. Our concern is to keep the whole process language-independent and thus applicable to building web corpora of other languages as well. In the paper, we briefly describe the crawling, cleaning, and part-of-speech tagging procedures. Using a prototype corpus, we provide a comparison with current corpora (in particular, SYN2005, part of the Czech National Corpus). We analyse the part-of-speech tag distribution, the OOV word ratio, the average sentence length and the Spearman rank correlation coefficient computed from the rank distances of the 500 most frequent words. Our results show that our prototype corpus is now quite homogeneous. The challenging task is to find a way to decrease the homogeneity of the text while keeping the high quality of the data.
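Two of the comparison measures named in this abstract, the OOV word ratio and the Spearman rank correlation over the most frequent words, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure used in the paper: the function names are hypothetical, the top-n words are taken from the first corpus and re-ranked within both corpora, and frequency ties are broken alphabetically so that the tie-free formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies.

```python
from collections import Counter


def oov_ratio(tokens, vocabulary):
    """Fraction of tokens that are not in the reference vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)


def _ranks(words, counts):
    """Rank words 1..n by descending count; ties broken alphabetically."""
    ordered = sorted(words, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ordered, start=1)}


def spearman_top_n(tokens_a, tokens_b, n=500):
    """Spearman rho between the two corpora's rankings of A's top-n words.

    Words absent from corpus B have count 0 there and end up at the
    bottom of B's ranking (an assumption of this sketch).
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    top = [w for w, _ in counts_a.most_common(n)]
    m = len(top)
    if m < 2:
        raise ValueError("need at least two distinct words")
    ranks_a = _ranks(top, counts_a)
    ranks_b = _ranks(top, counts_b)
    # Sum of squared rank distances over the selected words.
    d2 = sum((ranks_a[w] - ranks_b[w]) ** 2 for w in top)
    return 1 - 6 * d2 / (m * (m * m - 1))
```

A value close to 1 would indicate that the two corpora rank their most frequent words in nearly the same order.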
2009
Semi-Supervised Training for the Averaged Perceptron POS Tagger
Drahomíra “johanka” Spoustová | Jan Hajič | Jan Raab | Miroslav Spousta
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)
2008
Validating the Quality of Full Morphological Annotation
Drahomíra „johanka“ Spoustová | Pavel Pecina | Jan Hajič | Miroslav Spousta
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In our paper, we present a methodology for low-cost validation of the quality of the part-of-speech annotation of the Prague Dependency Treebank, based on multiple re-annotation of data samples carefully selected with the help of several different part-of-speech taggers.
2007
Towards the Automatic Extraction of Definitions in Slavic
Adam Przepiórkowski | Łukasz Degórski | Miroslav Spousta | Kiril Simov | Petya Osenova | Lothar Lemnitzer | Vladislav Kuboň | Beata Wójtowicz
Proceedings of the Workshop on Balto-Slavonic Natural Language Processing