Hildur Hafsteinsdóttir


2022

Evolving Large Text Corpora: Four Versions of the Icelandic Gigaword Corpus
Starkaður Barkarson | Steinþór Steingrímsson | Hildur Hafsteinsdóttir
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The Icelandic Gigaword Corpus was first published in 2018. Since then, new versions have been published annually, containing new texts from additional sources as well as from previous sources. This paper describes the evolution of the corpus in its first four years. All versions are made available under permissive licenses, and with each new version the texts are annotated with the latest and most accurate tools. We show how the corpus has grown by almost 50% from the first version to the fourth and how it was restructured to better accommodate different metadata for different subcorpora. Furthermore, other services have been set up to facilitate usage of the corpus for different use cases. These include a keyword-in-context concordance tool, an n-gram viewer, a word frequency database, and pre-trained word embeddings.