Miroslav Spousta

2012

A High-Quality Web Corpus of Czech
Johanka Spoustová | Miroslav Spousta
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present the main results of the Czech grant project Internet as a Language Corpus, whose aim was to build a corpus of Czech web texts and to develop and publicly release related software tools. Our corpus may not be the largest web corpus of Czech, but it maintains very good language quality thanks to the high proportion of human work involved in the corpus development process. We describe the corpus contents (2.65 billion words divided into three parts: 450 million words from news and magazine articles, 1 billion words from blogs, diaries and other non-reviewed literary units, and 1.1 billion words from discussion messages), the particular steps of the corpus creation (crawling, HTML and boilerplate removal, near-duplicate removal, language filtering) and its automatic language annotation (POS tagging, syntactic parsing). We also describe our software tools, which are being released under an open source license, especially a fast linear-time module for removing near-duplicates at the paragraph level.
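
The abstract does not say how the near-duplicate module works, so the following is only a minimal sketch of one common way to drop near-duplicate paragraphs in a single pass. It assumes word n-gram hashing with an overlap threshold; the names `shingles` and `remove_near_duplicates` and the parameters `n` and `threshold` are illustrative and are not taken from the released tools.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of hashed word n-grams of one paragraph."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < n:
        grams = [" ".join(words)]
    else:
        grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}


def remove_near_duplicates(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph unless most of its n-grams were already seen.

    Each paragraph is visited once and each n-gram costs one expected
    constant-time set lookup, so the pass is linear in the number of tokens.
    Paragraphs without any word characters are skipped (an assumption).
    """
    seen = set()
    kept = []
    for par in paragraphs:
        grams = shingles(par, n)
        if not grams:
            continue
        already_seen = sum(1 for g in grams if g in seen)
        if already_seen / len(grams) <= threshold:
            kept.append(par)
        seen |= grams
    return kept


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog today.",
        "The quick brown fox jumps over the lazy dog again.",
        "Completely different text about corpus building methods.",
    ]
    for par in remove_near_duplicates(sample):
        print(par)
```

With the default settings, the second sample paragraph shares seven of its eight trigrams with the first and is dropped, while the other two are kept.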

2011

Comparable Fora
Johanka Spoustová | Miroslav Spousta
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

2010

Building a Web Corpus of Czech
Drahomíra „johanka“ Spoustová | Miroslav Spousta | Pavel Pecina
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Large corpora are essential to modern methods of computational linguistics and natural language processing. In this paper, we describe an ongoing project whose aim is to build the largest corpus of Czech texts. We are building the corpus from Czech Internet web pages, using (and, if needed, developing) advanced downloading, cleaning and automatic linguistic processing tools. Our concern is to keep the whole process language independent and thus also applicable to building web corpora of other languages. In the paper, we briefly describe the crawling, cleaning, and part-of-speech tagging procedures. Using a prototype corpus, we provide a comparison with a current corpus (in particular, SYN2005, part of the Czech National Corpus). We analyse the part-of-speech tag distribution, the OOV word ratio, the average sentence length and the Spearman rank correlation coefficient computed over the ranks of the 500 most frequent words. Our results show that our prototype corpus is now quite homogeneous. The challenging task is to find a way to decrease the homogeneity of the text while keeping the high quality of the data.
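
Two of the comparison measures named above can be illustrated with a short sketch. It assumes tokenised input and the textbook Spearman formula without ties; how the paper treats words that are frequent in only one corpus is not stated, so restricting the ranking to shared words is an assumption made here. The function names are hypothetical.

```python
from collections import Counter


def top_words(tokens, n):
    """Return the n most frequent words, ties broken alphabetically."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ordered[:n]]


def spearman_top_words(tokens_a, tokens_b, n=500):
    """Spearman rank correlation over words in the top n of both corpora."""
    top_a = top_words(tokens_a, n)
    top_b = top_words(tokens_b, n)
    shared = set(top_a) & set(top_b)
    # Re-rank the shared words within each corpus (assumption, see lead-in).
    rank_a = {w: i for i, w in enumerate(w for w in top_a if w in shared)}
    rank_b = {w: i for i, w in enumerate(w for w in top_b if w in shared)}
    m = len(shared)
    if m < 2:
        raise ValueError("need at least two shared words")
    d2 = sum((rank_a[w] - rank_b[w]) ** 2 for w in shared)
    return 1 - 6 * d2 / (m * (m * m - 1))


def oov_ratio(tokens, vocabulary):
    """Fraction of tokens that are not in the given vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)


if __name__ == "__main__":
    corpus_a = "a a a b b c".split()
    corpus_b = "c c c b b a".split()
    print(spearman_top_words(corpus_a, corpus_b))
    print(oov_ratio(["a", "b", "x", "y"], {"a", "b"}))
```

In the toy example the two corpora rank the same three words in exactly opposite order, so the coefficient is at its minimum, and half of the tokens in the second call are out of vocabulary.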

2009

Semi-Supervised Training for the Averaged Perceptron POS Tagger
Drahomíra “johanka” Spoustová | Jan Hajič | Jan Raab | Miroslav Spousta
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

Victor: the Web-Page Cleaning Tool
Miroslav Spousta | Michal Marek | Pavel Pecina
Proceedings of the 4th Web as Corpus Workshop

In this paper, we present a complete solution for automatic cleaning of arbitrary HTML pages, with the goal of using web data as a corpus in the areas of natural language processing and computational linguistics. We employ a sequence-labeling approach based on Conditional Random Fields (CRF). Every block of text in the analyzed web page is assigned a set of features extracted from the textual content and the HTML structure of the page. The blocks are automatically labeled either as content segments containing the main web page content, which should be preserved, or as noisy segments not suitable for further linguistic processing, which should be eliminated. Our solution is based on the tool introduced at the CLEANEVAL 2007 shared task workshop. In this paper, we present new CRF features, a handy annotation tool, and new evaluation metrics. The evaluation itself is performed on a random sample of web pages automatically downloaded from the Czech web domain.
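
To make the idea of per-block features for sequence labeling more concrete, here is a minimal sketch of turning a page's text blocks into one feature dictionary per block, with context from neighbouring blocks. The features shown (`num_words`, `ends_with_punct`, `pipe_count`, neighbouring tags) are hypothetical examples and are not Victor's actual feature set; a sequence of such dictionaries is the kind of input a CRF toolkit would typically be trained on, together with one content/noise label per block.

```python
def block_features(tag, text):
    """Features of a single block from its HTML tag and its text."""
    words = text.split()
    return {
        "tag": tag,
        "num_words": len(words),
        "ends_with_punct": text.rstrip().endswith((".", "!", "?")),
        # Many "|" separators often indicate a navigation bar (assumption).
        "pipe_count": text.count("|"),
    }


def page_features(blocks):
    """Return one feature dict per block, adding neighbouring-block context.

    `blocks` is a list of (tag, text) pairs in document order.
    """
    base = [block_features(tag, text) for tag, text in blocks]
    features = []
    for i, feats in enumerate(base):
        item = dict(feats)
        if i == 0:
            item["BOS"] = True  # first block of the page
        else:
            item["prev_tag"] = base[i - 1]["tag"]
        if i == len(base) - 1:
            item["EOS"] = True  # last block of the page
        else:
            item["next_tag"] = base[i + 1]["tag"]
        features.append(item)
    return features


if __name__ == "__main__":
    page = [
        ("h1", "Corpus news"),
        ("p", "This is the main article text."),
        ("div", "Home | Login | Contact"),
    ]
    for feats in page_features(page):
        print(feats)
```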

Validating the Quality of Full Morphological Annotation
Drahomíra „johanka“ Spoustová | Pavel Pecina | Jan Hajič | Miroslav Spousta
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we present a methodology for low-cost validation of the quality of the part-of-speech annotation of the Prague Dependency Treebank, based on multiple re-annotation of data samples carefully selected with the help of several different part-of-speech taggers.