Building a 70 billion word corpus of English from ClueWeb

Jan Pomikálek; Miloš Jakubíček; Pavel Rychlý

Abstract

This work describes the creation of a 70 billion word text corpus of English. We used an existing language resource, the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly in the pre-processing (boilerplate cleaning, text de-duplication) and post-processing (indexing for efficient corpus querying with CQL, the Corpus Query Language) steps. In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion), which was performed not only on full (document-level) duplicates but also on near-duplicate texts. We also show the impact of each pre-processing step on the final corpus size. Finally, we show how the corpus indexing procedure was effectively parallelized within the Manatee corpus management system and during the computation of word sketches (one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour) from the resulting corpus.
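To make the idea of near-duplicate removal concrete, here is a minimal Python sketch based on word n-gram (shingle) overlap. It is an illustration only, not the onion tool or its algorithm: the function names `shingles` and `deduplicate`, the n-gram length of 5, and the 0.5 threshold are all assumptions chosen for the example. A document is dropped when more than the given fraction of its n-grams has already been seen in documents kept earlier.

```python
from typing import Iterable, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 5) -> Set[Shingle]:
    """Return the set of word n-grams in `text` (lowercased, whitespace-split).

    Texts shorter than n tokens yield a single shingle holding all tokens,
    so that exact repeats of short texts can still be detected.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(docs: Iterable[str], n: int = 5, threshold: float = 0.5) -> List[str]:
    """Keep documents in order, dropping those that mostly repeat earlier ones.

    `n` and `threshold` are illustrative values, not settings from the paper.
    """
    seen: Set[Shingle] = set()
    kept: List[str] = []
    for doc in docs:
        grams = shingles(doc, n)
        if not grams:
            continue  # empty document: nothing to keep
        duplicate_ratio = len(grams & seen) / len(grams)
        if duplicate_ratio > threshold:
            continue  # near-duplicate of text already kept
        kept.append(doc)
        seen |= grams
    return kept
```

With these assumed settings, two documents that differ only in their final word share most of their 5-grams, so only the first one is kept, while an unrelated document passes through unchanged. Holding every n-gram in an in-memory set is fine for a sketch, but a corpus of this size would need a far more compact representation.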

Anthology ID: L12-1624
Volume: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month: May
Year: 2012
Address: Istanbul, Turkey
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 502–506
URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/1047_Paper.pdf
Cite (ACL): Jan Pomikálek, Miloš Jakubíček, and Pavel Rychlý. 2012. Building a 70 billion word corpus of English from ClueWeb. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 502–506, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal): Building a 70 billion word corpus of English from ClueWeb (Pomikálek et al., LREC 2012)
PDF: http://www.lrec-conf.org/proceedings/lrec2012/pdf/1047_Paper.pdf
