Transferring Modern Named Entity Recognition to the Historical Domain: How to Take the Step?

Baptiste Blouin, Benoit Favre, Jeremy Auguste, Christian Henriot


Abstract
Named entity recognition is of high interest to digital humanities, in particular when mining historical documents. Although the task is mature in the field of NLP, results of contemporary models are not satisfactory on challenging documents corresponding to out-of-domain genres, noisy OCR output, or older variants of the target language. In this paper, we study how model transfer methods, in the context of the aforementioned challenges, can improve historical named entity recognition according to how much effort is allocated to describing the target data, manually annotating small amounts of text, or matching pre-training resources. In particular, we explore the situation where the class labels, as well as the quality of the documents to be processed, differ between the source and target domains. We perform extensive experiments with the transformer architecture on the LitBank and HIPE historical datasets, with different annotation schemes and character-level noise. The results show that annotating 250 sentences can recover 93% of the full-data performance when models are pre-trained, that the choice of self-supervised and target-task pre-training data is crucial in the zero-shot setting, and that OCR errors can be handled by simulating noise on pre-training data and resorting to recent character-aware transformers.
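The abstract mentions simulating character-level noise on pre-training data to cope with OCR errors. As a rough illustration only, the sketch below shows one generic way such noise could be injected into clean text. The function name add_char_noise, the three edit operations, the rate parameter, and the lowercase replacement alphabet are all assumptions made for this example; the abstract does not specify the noise model the authors actually used.

```python
import random


def add_char_noise(text, rate=0.1, seed=None,
                   alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Return a copy of `text` with random character-level edits.

    Hypothetical helper (not the paper's procedure): each non-whitespace
    character is, with probability `rate`, substituted, deleted, or
    followed by an inserted character drawn from `alphabet`.
    """
    rng = random.Random(seed)  # local generator keeps results reproducible
    noisy = []
    for ch in text:
        # Leave whitespace untouched and skip most characters.
        if ch.isspace() or rng.random() >= rate:
            noisy.append(ch)
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            noisy.append(rng.choice(alphabet))
        elif op == "insert":
            noisy.append(ch)
            noisy.append(rng.choice(alphabet))
        # "delete": append nothing, so the character is dropped
    return "".join(noisy)


if __name__ == "__main__":
    clean = "Named entity recognition on historical documents"
    print(add_char_noise(clean, rate=0.15, seed=0))
```

Uniform random edits are only a crude stand-in for real OCR behaviour, which tends to confuse visually similar characters; a confusion table estimated from aligned OCR and ground-truth text would likely be a closer approximation.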
Anthology ID:
2021.nlp4dh-1.18
Volume:
Proceedings of the Workshop on Natural Language Processing for Digital Humanities
Month:
December
Year:
2021
Address:
NIT Silchar, India
Editors:
Mika Hämäläinen, Khalid Alnajjar, Niko Partanen, Jack Rueter
Venue:
NLP4DH
Publisher:
NLP Association of India (NLPAI)
Pages:
152–162
URL:
https://aclanthology.org/2021.nlp4dh-1.18
Cite (ACL):
Baptiste Blouin, Benoit Favre, Jeremy Auguste, and Christian Henriot. 2021. Transferring Modern Named Entity Recognition to the Historical Domain: How to Take the Step?. In Proceedings of the Workshop on Natural Language Processing for Digital Humanities, pages 152–162, NIT Silchar, India. NLP Association of India (NLPAI).
Cite (Informal):
Transferring Modern Named Entity Recognition to the Historical Domain: How to Take the Step? (Blouin et al., NLP4DH 2021)
PDF:
https://aclanthology.org/2021.nlp4dh-1.18.pdf
Data:
LitBank