Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

Emanuela Boros, Ahmed Hamdi, Elvys Linhares Pontes, Luis Adrián Cabrera-Diego, Jose G. Moreno, Nicolas Sidere, Antoine Doucet


Abstract
This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.
Anthology ID:
2020.conll-1.35
Volume:
Proceedings of the 24th Conference on Computational Natural Language Learning
Month:
November
Year:
2020
Address:
Online
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
431–441
URL:
https://aclanthology.org/2020.conll-1.35
DOI:
10.18653/v1/2020.conll-1.35
Cite (ACL):
Emanuela Boros, Ahmed Hamdi, Elvys Linhares Pontes, Luis Adrián Cabrera-Diego, Jose G. Moreno, Nicolas Sidere, and Antoine Doucet. 2020. Alleviating Digitization Errors in Named Entity Recognition for Historical Documents. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 431–441, Online. Association for Computational Linguistics.
Cite (Informal):
Alleviating Digitization Errors in Named Entity Recognition for Historical Documents (Boros et al., CoNLL 2020)
PDF:
https://aclanthology.org/2020.conll-1.35.pdf
Data:
CoNLL-2003