A High-Quality Web Corpus of Czech

Johanka Spoustová, Miroslav Spousta


Abstract
In this paper, we present the main results of the Czech grant project Internet as a Language Corpus, whose aim was to build a corpus of Czech web texts and to develop and publicly release related software tools. Our corpus may not be the largest web corpus of Czech, but it maintains very good language quality owing to the high proportion of human work involved in its development. We describe the corpus contents (2.65 billion words divided into three parts: 450 million words from news and magazine articles, 1 billion words from blogs, diaries and other non-reviewed literary units, and 1.1 billion words from discussion messages), the individual steps of corpus creation (crawling, HTML and boilerplate removal, near-duplicate removal, language filtering) and its automatic linguistic annotation (POS tagging, syntactic parsing). We also describe our software tools, which are released under an open source license, especially a fast linear-time module for removing near-duplicates at the paragraph level.
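To make the idea of paragraph-level near-duplicate removal more concrete, the following is a minimal Python sketch. It is not the authors' released module: the word-trigram shingling, the 0.8 overlap threshold, and the function names `shingles` and `drop_near_duplicate_paragraphs` are all assumptions chosen for illustration. The sketch makes a single pass over the paragraphs and uses a hash set of previously seen n-grams, so its running time grows roughly linearly with the number of tokens.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (as tuples) of a lowercased paragraph."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        # Very short paragraph: treat the whole token sequence as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def drop_near_duplicate_paragraphs(paragraphs, n=3, threshold=0.8):
    """Keep a paragraph only if fewer than `threshold` of its n-grams were seen before.

    Hypothetical illustration: both the shingle size and the threshold are
    assumed values, not parameters taken from the paper.
    """
    seen = set()
    kept = []
    for par in paragraphs:
        grams = shingles(par, n)
        if not grams:
            continue  # skip paragraphs with no word tokens
        overlap = sum(1 for g in grams if g in seen) / len(grams)
        if overlap < threshold:
            kept.append(par)
        # Record the n-grams either way so later copies are also detected.
        seen.update(grams)
    return kept


paragraphs = [
    "The quick brown fox jumps over the lazy dog today.",
    "The quick brown fox jumps over the lazy dog yesterday.",
    "A completely different paragraph about Czech web corpora.",
]
result = drop_near_duplicate_paragraphs(paragraphs)
```

In this example the second paragraph shares seven of its eight trigrams with the first, so it is dropped, while the third paragraph is kept. A production tool would likely hash the n-grams to fixed-size integers to bound memory use; that refinement is omitted here.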
Anthology ID:
L12-1008
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
311–315
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/120_Paper.pdf
Cite (ACL):
Johanka Spoustová and Miroslav Spousta. 2012. A High-Quality Web Corpus of Czech. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 311–315, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
A High-Quality Web Corpus of Czech (Spoustová & Spousta, LREC 2012)