A Two-Step Approach for Automatic OCR Post-Correction

Robin Schaefer, Clemens Neudecker


Abstract
The quality of Optical Character Recognition (OCR) is a key factor in the digitisation of historical documents. OCR errors are a major obstacle for downstream tasks and have hindered advances in the use of digitised documents. In this paper, we present a two-step approach to automatic OCR post-correction. The first component detects erroneous sequences in a set of OCRed texts, while the second corrects the OCR errors in them. We show that, compared to a simple one-step correction model, applying the preceding detection model reduces both the character error rate (CER) and the number of correct characters that are falsely changed.
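The detect-then-correct idea and the CER metric named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names (`levenshtein`, `cer`, `two_step_correct`), the toy detector and the toy corrector are hypothetical stand-ins and are not the models or the code from the paper. CER is computed here in its common form, the character-level edit distance divided by the reference length.

```python
from typing import Callable, List


def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def two_step_correct(sequences: List[str],
                     detect: Callable[[str], bool],
                     correct: Callable[[str], str]) -> List[str]:
    """Apply the corrector only to sequences the detector flags as erroneous."""
    return [correct(s) if detect(s) else s for s in sequences]


# Toy stand-ins (hypothetical): the corrector fixes one real OCR error but
# also damages a correct word, mimicking falsely changed correct characters.
def toy_detect(seq: str) -> bool:
    return "qnick" in seq


def toy_correct(seq: str) -> str:
    return seq.replace("qnick", "quick").replace("lazy", "hazy")


gold = ["the quick brown fox", "jumps over the lazy dog"]
ocr = ["the qnick brown fox", "jumps over the lazy dog"]

one_step = [toy_correct(s) for s in ocr]                # corrector sees everything
two_step = two_step_correct(ocr, toy_detect, toy_correct)

for name, out in (("one-step", one_step), ("two-step", two_step)):
    print(name, [round(cer(g, o), 3) for g, o in zip(gold, out)])
```

In this toy setting the one-step run alters the second, already correct sequence, while the gated run leaves it untouched. It only illustrates the mechanism; it says nothing about the size of the effect reported in the paper.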
Anthology ID:
2020.latechclfl-1.6
Volume:
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Month:
December
Year:
2020
Address:
Online
Editors:
Stefania DeGaetano, Anna Kazantseva, Nils Reiter, Stan Szpakowicz
Venue:
LaTeCHCLfL
Publisher:
International Committee on Computational Linguistics
Pages:
52–57
URL:
https://aclanthology.org/2020.latechclfl-1.6
Cite (ACL):
Robin Schaefer and Clemens Neudecker. 2020. A Two-Step Approach for Automatic OCR Post-Correction. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 52–57, Online. International Committee on Computational Linguistics.
Cite (Informal):
A Two-Step Approach for Automatic OCR Post-Correction (Schaefer & Neudecker, LaTeCHCLfL 2020)
PDF:
https://aclanthology.org/2020.latechclfl-1.6.pdf
Code:
qurator-spk/sbb_ocr_postcorrection