Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages

Dirk Goldhahn; Thomas Eckart; Uwe Quasthoff

Abstract

The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched with statistical information. In this paper we describe current advances of the project in automatically collecting and processing text data for a large number of languages. Our main interest lies in low-density languages, for which little text data exists online. The aim of this approach is to create monolingual dictionaries and statistical information for a large number of new languages and to expand the existing dictionaries, opening up new possibilities for linguistic typology and other research. The paper focuses on the infrastructure for automatically acquiring large amounts of monolingual text in many languages from various sources. Preliminary results of the text collection are presented. The largely language-independent framework for preprocessing, cleaning, and creating the corpora and for computing the necessary statistics is also described.

Anthology ID: L12-1154
Volume: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month: May
Year: 2012
Address: Istanbul, Turkey
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 759–765
URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf
Cite (ACL): Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 759–765, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal): Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages (Goldhahn et al., LREC 2012)
PDF: http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf
