2025
Playing by the Rules: A Benchmark Set for Standardized Icelandic Orthography
Bjarki Ármannsson | Hinrik Hafsteinsson | Jóhannes B. Sigtryggsson | Atli Jasonarson | Einar Freyr Sigurðsson | Steinþór Steingrímsson
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
We present the Icelandic Standardization Benchmark Set: Spelling and Punctuation (IceStaBS:SP), a dataset designed to provide standardized text examples for Icelandic orthography. The dataset includes non-standard orthography examples and their standardized counterparts, along with detailed explanations based on official Icelandic spelling rules. IceStaBS:SP aims to support the development and evaluation of automatic spell and grammar checkers, particularly in educational settings. We evaluate various spell and grammar checkers using IceStaBS:SP, demonstrating its utility as a benchmarking tool and highlighting areas for future improvement.
MC-19: A Corpus of 19th Century Icelandic Texts
Steinþór Steingrímsson | Einar Freyr Sigurðsson | Atli Jasonarson
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
We present MC-19, a new Icelandic historical corpus containing texts from the period 1800-1920. We describe approaches for enhancing a corpus of historical texts by preparing the texts so that they can be processed using state-of-the-art NLP tools. We train encoder-decoder models to reduce the number of OCR errors while leaving other orthographic variation intact. We generate a separate modern spelling layer by normalizing the spelling to comply with modern spelling rules, using a statistical modernization ruleset as well as a dictionary of the most common words. This allows the texts to be PoS-tagged and lemmatized using available tools, making the corpus easier to use for researchers and language technologists. The published version of the corpus contains over 270 million tokens.
2024
Cogs in a Machine, Doing What They’re Meant to Do – the AMI Submission to the WMT24 General Translation Task
Atli Jasonarson | Hinrik Hafsteinsson | Bjarki Ármannsson | Steinþór Steingrímsson
Proceedings of the Ninth Conference on Machine Translation
This paper presents the submission of the Árni Magnússon Institute’s team to the WMT24 General Translation Task. We work on the English→Icelandic translation direction. Our system comprises four translation models and a grammar correction model. For training our systems, we carefully curate our datasets, aggressively filtering out sentence pairs that may detrimentally affect the quality of our systems’ output. Some of our data are collected from human translations and some are synthetically generated. A part of the synthetic data is generated using an LLM, and we find that it increases the translation capability of our system significantly.
Killing Two Flies with One Stone: An Attempt to Break LLMs Using English-Icelandic Idioms and Proper Names
Bjarki Ármannsson | Hinrik Hafsteinsson | Atli Jasonarson | Steinþór Steingrímsson
Proceedings of the Ninth Conference on Machine Translation
The submission of the Árni Magnússon Institute’s team to the WMT24 test suite subtask focuses on idiomatic expressions and proper names for the English→Icelandic translation direction. Intuitively and empirically, idioms and proper names are known to be a significant challenge for neural translation models. We create two different test suites. The first evaluates the competency of MT systems in translating common English idiomatic expressions and tests whether systems can distinguish between those expressions and the same phrases when used in a literal context. The second test suite consists of place names that should be translated into their Icelandic exonyms (and correctly inflected) and pairs of Icelandic names that share a surface form between the male and female variants, so that incorrect translations impact meaning as well as readability. The scores reported are relatively low, especially for idiomatic expressions and place names, and indicate considerable room for improvement.
2023
Generating Errors: OCR Post-Processing for Icelandic
Atli Jasonarson | Steinþór Steingrímsson | Einar Sigurðsson | Árni Magnússon | Finnur Ingimundarson
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data are scarce. We trained six models, four from scratch and two fine-tuned versions of Google’s ByT5, on a combination of real data and texts populated with artificially generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors.
Evaluating a Universal Dependencies Conversion Pipeline for Icelandic
Þórunn Arnardóttir | Hinrik Hafsteinsson | Atli Jasonarson | Anton Ingason | Steinþór Steingrímsson
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
We describe the evaluation and development of a rule-based treebank conversion tool, UDConverter, which converts treebanks from the constituency-based PPCHE annotation scheme to the dependency-based Universal Dependencies (UD) scheme. The tool has already been used in the production of three UD treebanks, although no formal evaluation of the tool has yet been carried out. By manually correcting new output files from the converter and comparing them to the raw output, we measured the labeled attachment score (LAS) and unlabeled attachment score (UAS) of the converted texts. We obtain an LAS of 82.87 and a UAS of 87.91. In comparison to other tools, UDConverter currently provides the best results in automatic UD treebank creation for Icelandic.
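For readers unfamiliar with the two metrics named in the abstract above, here is a minimal sketch of how UAS and LAS are commonly computed. It assumes each token is represented as a (head index, dependency label) pair and that the raw converter output and the manually corrected version are aligned token by token. The function name `attachment_scores` and the toy data are hypothetical illustrations, not taken from UDConverter or the paper.

```python
from typing import List, Tuple

# Hypothetical token representation: (index of the head token, dependency label).
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], predicted: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for two token-aligned analyses."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    # UAS: the predicted head matches the gold head, label ignored.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the dependency label match.
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))

    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * labeled_hits / total


# Toy example (invented data): manually corrected analysis vs. raw converter output.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
predicted = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # UAS = 75.00, LAS = 50.00
```

Because a token only counts toward LAS when its head is also correct, LAS can never exceed UAS, which is consistent with the reported 82.87 and 87.91.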