Named Entity Recognition in Estonian 19th Century Parish Court Records

Siim Orasmaa, Kadri Muischnek, Kristjan Poska, Anna Edela


Abstract
This paper presents a new historical language resource, a corpus of Estonian Parish Court records from the years 1821-1920 annotated for named entities (NE), and reports on named entity recognition (NER) experiments using this corpus. The handwritten records were transcribed manually in a crowdsourcing project, so the transcripts are of high quality. However, language and spelling vary widely across the documents, owing to dialectal variation and to considerable changes in Estonian spelling conventions during the period in which they were written. The NE typology used for manual annotation comprises 7 categories, yet inter-annotator agreement is high at 95.0 (mean F1-score). We experimented with fine-tuning BERT-like models for NER and found modern Estonian BERT models highly applicable, despite the difficulty of the historical material. Our best model, a fine-tuned Est-RoBERTa, achieved a micro-averaged F1-score of 93.6, which is comparable to state-of-the-art NER performance on contemporary Estonian.
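For readers unfamiliar with the metric, the following is a minimal sketch of how a micro-averaged F1-score over entity spans can be computed. It assumes exact-match scoring of (document, start, end, label) tuples; the function name `micro_f1`, the example spans, and the labels PER, LOC and ORG are hypothetical illustrations, and the evaluation procedure actually used in the paper may differ.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over entity spans.

    Both arguments are iterables of (doc_id, start, end, label) tuples.
    A predicted span counts as correct only if it matches a gold span
    exactly (assumption: exact-match scoring). Counts are pooled over
    all labels before precision and recall are computed, which is what
    makes the average "micro".
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # spans found with correct boundaries and label
    fp = len(predicted - gold)   # spurious or mislabelled predictions
    fn = len(gold - predicted)   # gold spans that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data, not taken from the corpus.
gold = {(0, 0, 2, "PER"), (0, 5, 6, "LOC"), (1, 3, 4, "PER"), (1, 7, 9, "ORG")}
pred = {(0, 0, 2, "PER"), (0, 5, 6, "ORG"), (1, 3, 4, "PER")}
score = micro_f1(gold, pred)
print(round(score * 100, 1))
```

In this toy case two of three predictions are correct and two of four gold spans are recovered, so precision is 2/3, recall is 1/2, and the F1-score is 4/7.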
Anthology ID:
2022.lrec-1.568
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5304–5313
URL:
https://aclanthology.org/2022.lrec-1.568
Cite (ACL):
Siim Orasmaa, Kadri Muischnek, Kristjan Poska, and Anna Edela. 2022. Named Entity Recognition in Estonian 19th Century Parish Court Records. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5304–5313, Marseille, France. European Language Resources Association.
Cite (Informal):
Named Entity Recognition in Estonian 19th Century Parish Court Records (Orasmaa et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.568.pdf
Code:
soras/vk_ner_lrec_2022