Tree-Structured Named Entity Recognition on OCR Data: Analysis, Processing and Results

Marco Dinarelli; Sophie Rosset


Abstract

In this paper we address named entity detection on data acquired through OCR of documents dating from 1890. The resulting corpus is very noisy. We analyze the data to find strategies for overcoming the errors introduced by the OCR process, and we propose a three-step preprocessing procedure to clean the data and correct, at least in part, OCR mistakes. The task is made even harder by the complex tree structure of the named entities annotated on the data; we handle this by adopting an effective named entity detection system that we proposed in previous work. We evaluate the preprocessing procedure for OCR-ized data in two ways: in terms of the perplexity and OOV rate of a language model on development and evaluation data, and in terms of the performance of the named entity detection system on the preprocessed data. The preprocessing procedure proves effective: it improves by a large margin the system we proposed for the official evaluation campaign on Old Press, and it also outperforms the best-performing system of that campaign.
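The two language-model measures named in the abstract can be illustrated with a minimal sketch. It assumes a unigram model with add-one smoothing and whitespace tokenization, neither of which is stated in the abstract; the function names and the toy French sentences are hypothetical and are not taken from the paper's data or its actual language model.

```python
import math
from collections import Counter


def oov_rate(train_tokens, eval_tokens):
    """Fraction of evaluation tokens that never occur in the training tokens."""
    if not eval_tokens:
        return 0.0
    vocab = set(train_tokens)
    unseen = sum(1 for tok in eval_tokens if tok not in vocab)
    return unseen / len(eval_tokens)


def unigram_perplexity(train_tokens, eval_tokens):
    """Perplexity of an add-one-smoothed unigram model (an assumed toy model)."""
    counts = Counter(train_tokens)
    total = len(train_tokens)
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen words
    log_prob = 0.0
    for tok in eval_tokens:
        prob = (counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / len(eval_tokens))


# Hypothetical toy data: clean text, plus a copy with simulated OCR errors.
train = "le président de la république".split()
noisy_eval = "le présidcnt de la républiquc".split()
clean_eval = "le président de la république".split()

print(oov_rate(train, noisy_eval), oov_rate(train, clean_eval))
print(unigram_perplexity(train, noisy_eval), unigram_perplexity(train, clean_eval))
```

In this toy setting, correcting the two corrupted words lowers both the OOV rate and the perplexity, which is the kind of effect these measures are meant to capture when comparing data before and after preprocessing.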

Anthology ID: L12-1623
Volume: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month: May
Year: 2012
Address: Istanbul, Turkey
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 1266–1272
URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/1046_Paper.pdf
Cite (ACL): Marco Dinarelli and Sophie Rosset. 2012. Tree-Structured Named Entity Recognition on OCR Data: Analysis, Processing and Results. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1266–1272, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal): Tree-Structured Named Entity Recognition on OCR Data: Analysis, Processing and Results (Dinarelli & Rosset, LREC 2012)
