Árni Magnússon


2023

Generating Errors: OCR Post-Processing for Icelandic
Atli Jasonarson | Steinþór Steingrímsson | Einar Sigurðsson | Árni Magnússon | Finnur Ingimundarson
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data are scarce. We trained six models, four from scratch and two fine-tuned versions of Google’s ByT5, on a combination of real data and texts populated with artificially generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors.