LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl

Szymon Roziewski; Wojciech Stokowiec

Abstract

The web contains an immense amount of data: hundreds of billions of words are waiting to be extracted and used for language research. In this work we introduce LanguageCrawl, a tool that allows NLP researchers to easily construct web-scale corpora from the Common Crawl Archive, a petabyte-scale, open repository of web crawl information. Three use cases are presented: filtering Polish websites, building N-gram corpora, and training a continuous skip-gram language model with hierarchical softmax. Each of them has been implemented within the LanguageCrawl toolkit, with the possibility to adjust the target language and the N-gram ranks. Special effort has been put into computing efficiency through highly concurrent multitasking. We make our tool publicly available to enrich NLP resources. We strongly believe that our work will help to facilitate NLP research, especially in under-resourced languages, where the lack of appropriately sized corpora is a serious hindrance to applying data-intensive methods such as deep neural networks.
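As a rough illustration of the N-gram use case with an adjustable rank, the following Python sketch counts word N-grams over a handful of sentences. It is not taken from the LanguageCrawl toolkit: the function name `count_ngrams`, the whitespace tokenization, the lowercasing, and the sample sentences are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(sentences: Iterable[str], n: int) -> Counter:
    """Count word N-grams of rank n (hypothetical helper, not the toolkit's API)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    for sentence in sentences:
        # Assumption: lowercase and split on whitespace; a real pipeline
        # would likely use a language-aware tokenizer.
        tokens = sentence.lower().split()
        # Slide a window of width n; sentences shorter than n contribute nothing.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Illustrative Polish sample sentences (made up for this sketch).
sample = ["Ala ma kota", "Ala ma psa"]
bigrams = count_ngrams(sample, 2)
```

Changing the second argument selects a different N-gram rank, for example `count_ngrams(sample, 3)` for trigrams. At web scale the counting would need to be sharded and merged across workers, which this sketch does not attempt.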

Anthology ID: L16-1443
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 2789–2793
URL: https://aclanthology.org/L16-1443/
Cite (ACL): Szymon Roziewski and Wojciech Stokowiec. 2016. LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2789–2793, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl (Roziewski & Stokowiec, LREC 2016)
PDF: https://aclanthology.org/L16-1443.pdf
