Generating Errors: OCR Post-Processing for Icelandic

Atli Jasonarson, Steinþór Steingrímsson, Einar Sigurðsson, Árni Magnússon, Finnur Ingimundarson


Abstract
We describe work on enhancing the performance of transformer-based encoder-decoder models for OCR post-correction on modern and historical Icelandic texts, where OCRed data are scarce. We trained six models, four from scratch and two fine-tuned versions of Google’s ByT5, on a combination of real data and texts populated with artificially generated errors. Our results show that the models trained from scratch, as opposed to the fine-tuned versions, benefited the most from the addition of artificially generated errors.
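As a rough illustration of what "texts populated with artificially generated errors" can mean in practice, the sketch below corrupts clean text with character-level substitutions and pairs each noisy sentence with its clean original. It is a minimal sketch under stated assumptions, not the authors' method: the confusion table, the function names `add_ocr_noise` and `make_training_pairs`, and the `error_rate` parameter are all hypothetical and do not come from the paper.

```python
import random

# Hypothetical confusion table: clean character -> plausible OCR misreadings.
# These pairs are illustrative assumptions, not taken from the paper.
CONFUSIONS = {
    "þ": ["p", "b"],
    "ð": ["d", "6"],
    "æ": ["ae", "œ"],
    "í": ["i", "l"],
    "ö": ["o", "ó"],
    "m": ["rn"],
}


def add_ocr_noise(text, error_rate=0.1, seed=None):
    """Return a copy of `text` where some characters are swapped for confusions."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        options = CONFUSIONS.get(ch)
        # Only characters listed in the table can be corrupted.
        if options and rng.random() < error_rate:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


def make_training_pairs(sentences, error_rate=0.1, seed=0):
    """Pair each noisy sentence (model input) with its clean original (target)."""
    rng = random.Random(seed)
    return [
        (add_ocr_noise(s, error_rate, rng.randrange(2**32)), s)
        for s in sentences
    ]
```

Pairs produced this way could be mixed with real OCR output and its manual corrections to enlarge a small training set; the ratio between the two is a design choice that this sketch leaves open.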
Anthology ID:
2023.nodalida-1.29
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
286–291
URL:
https://aclanthology.org/2023.nodalida-1.29
Cite (ACL):
Atli Jasonarson, Steinþór Steingrímsson, Einar Sigurðsson, Árni Magnússon, and Finnur Ingimundarson. 2023. Generating Errors: OCR Post-Processing for Icelandic. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 286–291, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Generating Errors: OCR Post-Processing for Icelandic (Jasonarson et al., NoDaLiDa 2023)
PDF:
https://aclanthology.org/2023.nodalida-1.29.pdf