Facilitating Corpus Usage: Making Icelandic Corpora More Accessible for Researchers and Language Users

Steinþór Steingrímsson, Starkaður Barkarson, Gunnar Thor Örnólfsson


Abstract
We introduce an array of open and accessible tools to facilitate the use of the Icelandic Gigaword Corpus, both in the field of Natural Language Processing and for students, linguists, sociologists and others who benefit from using large corpora. A KWIC engine, powered by the Swedish Korp tool, is adapted to the specifics of the corpus. An n-gram viewer, highly customizable to suit different needs, allows users to study word usage throughout the period covered by our text collection. A frequency dictionary provides much sought-after word frequency statistics, computed for each subcorpus as well as in aggregate, disambiguating homographs based on their respective lemmas and morphosyntactic tags. Furthermore, we provide n-grams based on the corpus and a variety of pre-trained word embedding models, based on word2vec, GloVe, fastText and ELMo. For three of the model types, multiple word embedding models are available, trained with different algorithms and using either lemmatised or unlemmatised texts.
Anthology ID:
2020.lrec-1.416
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3399–3405
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.416
Cite (ACL):
Steinþór Steingrímsson, Starkaður Barkarson, and Gunnar Thor Örnólfsson. 2020. Facilitating Corpus Usage: Making Icelandic Corpora More Accessible for Researchers and Language Users. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3399–3405, Marseille, France. European Language Resources Association.
Cite (Informal):
Facilitating Corpus Usage: Making Icelandic Corpora More Accessible for Researchers and Language Users (Steingrímsson et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.416.pdf