Creating a Dataset for Named Entity Recognition in the Archaeology Domain

Alex Brandsen, Suzan Verberne, Milco Wansleeben, Karsten Lambers


Abstract
In this paper, we present the development of a training dataset for Dutch Named Entity Recognition (NER) in the archaeology domain. This dataset was created because there is a dire need for semantic search within archaeology, to allow archaeologists to find structured information in collections of Dutch excavation reports, currently totalling around 60,000 (658 million words) and growing rapidly. NER is needed to support this search task. We created rigorous annotation guidelines in an iterative process, then instructed five archaeology students to annotate a number of documents. The resulting dataset contains ~31k annotations across six entity types (artefact, time period, place, context, species and material). The inter-annotator agreement is 0.95, and when we used this data for machine learning, we observed an increase in F1 score from 0.51 to 0.70 compared with a machine learning model trained on a dataset created in prior work. This indicates that the data is of high quality and can confidently be used to train NER classifiers.
Anthology ID:
2020.lrec-1.562
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4573–4577
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.562
Cite (ACL):
Alex Brandsen, Suzan Verberne, Milco Wansleeben, and Karsten Lambers. 2020. Creating a Dataset for Named Entity Recognition in the Archaeology Domain. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4573–4577, Marseille, France. European Language Resources Association.
Cite (Informal):
Creating a Dataset for Named Entity Recognition in the Archaeology Domain (Brandsen et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.562.pdf