Crawling Under-Resourced Languages - a Portal for Community-Contributed Corpus Collection

Erik Körner, Felix Helfer, Christopher Schröder, Thomas Eckart, Dirk Goldhahn


Abstract
The “Web as corpus” paradigm opens opportunities for enhancing the current state of language resources for endangered and under-resourced languages. However, standard crawling strategies tend to overlook available resources in these languages in favor of already well-documented ones. Since 2016, the “Crawling Under-Resourced Languages” portal (CURL) has helped bridge the gap between established crawling techniques and knowledge about relevant Web resources that is only available within the specific language communities. The aim of the CURL portal is to enlarge the amount of available text material for under-resourced languages, thereby further developing existing datasets, and to use this material as a basis for the statistical evaluation and enrichment of already available resources. The application is currently provided and further developed as part of the thematic cluster “Non-Latin scripts and Under-resourced languages” in the German national research consortium Text+. In this context, its focus lies on the extraction of text material and statistical information for the data domain “Lexical resources”.
Anthology ID:
2022.dclrl-1.5
Volume:
Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Jonne Sälevä, Constantine Lignos
Venue:
DCLRL
Publisher:
European Language Resources Association
Pages:
36–43
URL:
https://aclanthology.org/2022.dclrl-1.5
Cite (ACL):
Erik Körner, Felix Helfer, Christopher Schröder, Thomas Eckart, and Dirk Goldhahn. 2022. Crawling Under-Resourced Languages - a Portal for Community-Contributed Corpus Collection. In Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference, pages 36–43, Marseille, France. European Language Resources Association.
Cite (Informal):
Crawling Under-Resourced Languages - a Portal for Community-Contributed Corpus Collection (Körner et al., DCLRL 2022)
PDF:
https://aclanthology.org/2022.dclrl-1.5.pdf