Introducing NER-UK 2.0: A Rich Corpus of Named Entities for Ukrainian

Dmytro Chaplynskyi, Mariana Romanyshyn


Abstract
This paper presents NER-UK 2.0, a corpus of texts in the Ukrainian language manually annotated for the named entity recognition task. The corpus contains 560 texts of multiple genres, with 21,993 entities in total. The annotation scheme covers 13 entity types, namely location, person name, organization, artifact, document, job title, date, time, period, money, percentage, quantity, and miscellaneous. Such a rich set of entities makes the corpus valuable for training named entity recognition models in various domains, including news, social media posts, legal documents, and procurement contracts. The paper presents an updated baseline solution for named entity recognition in Ukrainian that achieves an F1 score of 0.89. The corpus is the largest of its kind for the Ukrainian language and is available for download.
Anthology ID:
2024.unlp-1.4
Volume:
Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Mariana Romanyshyn, Nataliia Romanyshyn, Andrii Hlybovets, Oleksii Ignatenko
Venue:
UNLP
Publisher:
ELRA and ICCL
Pages:
23–29
URL:
https://aclanthology.org/2024.unlp-1.4
Cite (ACL):
Dmytro Chaplynskyi and Mariana Romanyshyn. 2024. Introducing NER-UK 2.0: A Rich Corpus of Named Entities for Ukrainian. In Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024, pages 23–29, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Introducing NER-UK 2.0: A Rich Corpus of Named Entities for Ukrainian (Chaplynskyi & Romanyshyn, UNLP 2024)
PDF:
https://aclanthology.org/2024.unlp-1.4.pdf