Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre

Florian Debaene, Aaron Maladry, Els Lefever, Veronique Hoste


Abstract
This paper explores the effectiveness of two types of transformer models, namely large generative models and sequence-to-sequence models, for automatically post-correcting Optical Character Recognition (OCR) output in early modern Dutch plays. To address the need for optimally aligned data, we create a parallel dataset based on the OCRed and ground truth versions from the EmDComF corpus using state-of-the-art alignment techniques. By combining character-based and semantic methods, we design and release a high-quality OCR-to-gold parallel dataset, selecting the alignment with the lowest Character Error Rate (CER) for each alignment pair. We then fine-tune and evaluate five generative models and four sequence-to-sequence models on the OCR post-correction dataset. Results show that sequence-to-sequence models generally outperform generative models on this task, correcting more OCR errors while overgenerating and undergenerating less, with mBART as the best-performing system.
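The CER-based selection step described in the abstract lends itself to a brief illustration. The snippet below is a minimal sketch, not the authors' released code: for one gold line and several candidate OCR alignments (for instance, one produced by a character-based aligner and one by a semantic aligner), it keeps the candidate with the lowest CER. The example strings and the `best_alignment` helper are hypothetical.

```python
# Sketch of lowest-CER alignment selection (assumed workflow, not the paper's code).

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(gold: str, ocr: str) -> float:
    """CER = character edit distance divided by the number of gold characters."""
    return levenshtein(gold, ocr) / max(len(gold), 1)

def best_alignment(gold: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate alignment with the lowest CER against the gold text."""
    return min(((c, cer(gold, c)) for c in candidates), key=lambda pair: pair[1])

# Hypothetical aligned pair with two candidate OCR alignments:
gold_line = "Wat baet u al die pracht, o mensch?"
candidates = [
    "Wat baet u al die pracht, o menfch?",   # e.g. from a character-based aligner
    "Wat baet u al die pracht o mensch",     # e.g. from a semantic aligner
]
print(best_alignment(gold_line, candidates))
```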
Anthology ID: 2025.coling-main.690
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 10367–10374
URL: https://aclanthology.org/2025.coling-main.690/
Cite (ACL): Florian Debaene, Aaron Maladry, Els Lefever, and Veronique Hoste. 2025. Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10367–10374, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Evaluating Transformers for OCR Post-Correction in Early Modern Dutch Theatre (Debaene et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.690.pdf