Atli Jasonarson


2023

Generating Errors: OCR Post-Processing for Icelandic
Atli Jasonarson | Steinþór Steingrímsson | Einar Sigurðsson | Árni Magnússon | Finnur Ingimundarson
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data are scarce. We trained six models, four from scratch and two fine-tuned versions of Google’s ByT5, on a combination of real data and texts populated with artificially generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors.
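
As a rough illustration of what "texts populated with artificially generated errors" can look like, the following minimal Python sketch corrupts clean text by character substitution. The abstract does not specify how the errors were generated, so everything here is an assumption: the confusion table `CONFUSIONS`, the function `inject_errors`, and the `rate` parameter are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical confusion table (an assumption, not from the paper):
# clean character -> plausible OCR misreadings of it.
CONFUSIONS = {
    "ð": ["d", "6"],
    "þ": ["p", "b"],
    "æ": ["ae", "œ"],
    "í": ["i", "l"],
    "m": ["rn"],
}


def inject_errors(text, rate, rng):
    """Return a copy of `text` in which each character listed in
    CONFUSIONS is replaced by a random misreading with probability `rate`."""
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(rng.choice(CONFUSIONS[ch]))
        else:
            out.append(ch)
    return "".join(out)


# Build one (noisy, clean) training pair from a clean sentence.
clean = "það er gaman að lesa"
noisy = inject_errors(clean, rate=0.3, rng=random.Random(0))
pair = (noisy, clean)
```

Pairs of this kind could be mixed with real OCR output so that a post-correction model sees more error examples than the scarce real data provide.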

Evaluating a Universal Dependencies Conversion Pipeline for Icelandic
Þórunn Arnardóttir | Hinrik Hafsteinsson | Atli Jasonarson | Anton Ingason | Steinþór Steingrímsson
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We describe the evaluation and development of a rule-based treebank conversion tool, UDConverter, which converts treebanks from the constituency-based PPCHE annotation scheme to the dependency-based Universal Dependencies (UD) scheme. The tool has already been used in the production of three UD treebanks, although no formal evaluation of it had previously been carried out. By manually correcting new output files from the converter and comparing them to the raw output, we measured the labeled attachment score (LAS) and unlabeled attachment score (UAS) of the converted texts, obtaining an LAS of 82.87 and a UAS of 87.91. In comparison to other tools, UDConverter currently provides the best results in automatic UD treebank creation for Icelandic.
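
For readers unfamiliar with the two metrics, the sketch below shows the standard definitions: UAS is the percentage of tokens whose head is correct, and LAS is the percentage whose head and dependency label are both correct. The function name `attachment_scores` and the `(head, deprel)` tuple representation are assumptions made for illustration; this is not the evaluation script used by the authors.

```python
def attachment_scores(gold, predicted):
    """Compute (UAS, LAS) as percentages.

    `gold` and `predicted` are equal-length lists with one
    (head_index, dependency_label) tuple per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")

    # UAS counts tokens whose head matches, regardless of label.
    head_matches = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS counts tokens whose head and label both match.
    full_matches = sum(g == p for g, p in zip(gold, predicted))

    n = len(gold)
    return 100.0 * head_matches / n, 100.0 * full_matches / n


# Toy example: manually corrected analysis vs. raw converter output.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
raw = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]
uas, las = attachment_scores(gold, raw)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # prints "UAS = 75.00, LAS = 50.00"
```

In the toy example the third token has the right head but the wrong label, so it counts toward UAS only, and the fourth token has the wrong head, so it counts toward neither score.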