Named Entity layer in Estonian UD treebanks

Kadri Muischnek, Kaili Müürisep


Abstract
This paper introduces two new language resources, two NE-annotated corpora for Estonian: the Estonian Universal Dependencies Treebank (EDT, 440,000 tokens) and the Estonian Universal Dependencies Web Treebank (EWT, 90,000 tokens). Together they make up the largest publicly available Estonian dataset with gold named entity annotation. Eight NE categories are manually annotated in this dataset, and the fact that it is also annotated for lemma, POS, morphological features and dependency syntactic relations makes it more valuable. We also show that dividing the set of named entities into clear-cut categories is not always easy.
Anthology ID:
2023.nodalida-1.19
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
179–184
URL:
https://aclanthology.org/2023.nodalida-1.19
Cite (ACL):
Kadri Muischnek and Kaili Müürisep. 2023. Named Entity layer in Estonian UD treebanks. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 179–184, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Named Entity layer in Estonian UD treebanks (Muischnek & Müürisep, NoDaLiDa 2023)
PDF:
https://aclanthology.org/2023.nodalida-1.19.pdf