Clemens Neudecker


2025

2020

The quality of Optical Character Recognition (OCR) is a key factor in the digitisation of historical documents. OCR errors are a major obstacle for downstream tasks and have hindered advances in the use of digitised documents. In this paper we present a two-step approach to automatic OCR post-correction. The first component detects erroneous sequences in a set of OCRed texts, while the second corrects the OCR errors in those sequences. We show that running the detection model first reduces both the character error rate (CER) and the number of correct characters that are falsely changed, compared to a simple one-step correction model.
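
The abstract reports results in terms of CER without defining it. A minimal sketch, assuming the common definition (Levenshtein edit distance between the OCR output and the ground truth, divided by the ground-truth length), is given here for orientation. The function names `levenshtein` and `cer` are illustrative and are not taken from the paper, which may compute the metric differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ocr_text: str, ground_truth: str) -> float:
    """Character error rate under the assumed definition:
    edit distance divided by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


if __name__ == "__main__":
    # Two substituted characters in a 13-character ground truth.
    print(cer("Tbe qnick fox", "The quick fox"))
```

Under this definition, a post-correction step lowers CER only if it fixes more characters than it breaks, which is why the number of falsely changed correct characters is reported alongside it.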

2016

Openly available textual datasets (“corpora”) with highly accurate manual annotations (“gold standard”) of named entities (e.g. persons, locations, organizations) are crucial for training and evaluating named entity recognition systems. Currently only a few such datasets are available on the web, and even fewer for texts containing historical spelling variation. The production and subsequent release into the public domain of four such datasets with 100 pages each for the languages Dutch, French, German (including Austrian) as part of the Europeana Newspapers project is expected to contribute to the further development and improvement of named entity recognition systems with a focus on historical content. This paper describes how these datasets were produced and what challenges were encountered in their creation, and reports on their final quality and availability.