MC-19: A Corpus of 19th Century Icelandic Texts

Steinþór Steingrímsson, Einar Freyr Sigurðsson, Atli Jasonarson


Abstract
We present MC-19, a new Icelandic historical corpus containing texts from the period 1800–1920. We describe approaches for enhancing a corpus of historical texts by preparing the texts so that they can be processed with state-of-the-art NLP tools. We train encoder-decoder models to reduce the number of OCR errors while leaving other orthographic variation intact. We generate a separate modern spelling layer by normalizing the spelling to comply with modern spelling rules, using a statistical modernization ruleset as well as a dictionary of the most common words. This allows the texts to be PoS-tagged and lemmatized with available tools, which makes the corpus easier to use for researchers and language technologists. The published version of the corpus contains over 270 million tokens.
Anthology ID:
2025.nodalida-1.68
Volume:
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Month:
March
Year:
2025
Address:
Tallinn, Estonia
Editors:
Richard Johansson, Sara Stymne
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
680–687
URL:
https://aclanthology.org/2025.nodalida-1.68/
Cite (ACL):
Steinþór Steingrímsson, Einar Freyr Sigurðsson, and Atli Jasonarson. 2025. MC-19: A Corpus of 19th Century Icelandic Texts. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 680–687, Tallinn, Estonia. University of Tartu Library.
Cite (Informal):
MC-19: A Corpus of 19th Century Icelandic Texts (Steingrímsson et al., NoDaLiDa 2025)
PDF:
https://aclanthology.org/2025.nodalida-1.68.pdf