Creating a multi-layer Treebank for Tundra Nenets

Nikolett Mus, Bruno Guillaume, Sylvain Kahane, Daniel Zeman


Abstract
This paper presents the development of the Tundra Nenets Universal Dependencies (UD) Treebank, the first syntactically annotated resource for the Samoyedic branch of the Uralic family. The treebank integrates spoken-language data and adopts the morphologically enhanced Surface-Syntactic UD (mSUD) framework to capture inflectional morphology and morphology-based syntactic relations. It further incorporates Information Structure annotation. The methodological workflow includes data selection, transcription conventions, sentence and lexeme segmentation, annotation of spoken-language features, lemmatization, treatment of morpheme status, part-of-speech and morphological tagging, and syntactic annotation based on the functional and distributional properties of syntactic elements. We also outline the principles guiding multi-level annotation and justify the theoretical choices underlying the integration of prosodic, morphological, and syntactic information.
Anthology ID:
2025.iwclul-1.11
Volume:
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages
Month:
December
Year:
2025
Address:
Joensuu, Finland
Editors:
Mika Hämäläinen, Michael Rießler, Eiaki V. Morooka, Lev Kharlashkin
Venues:
IWCLUL | WS
SIG:
SIGUR
Publisher:
Association for Computational Linguistics
Pages:
77–86
URL:
https://aclanthology.org/2025.iwclul-1.11/
Cite (ACL):
Nikolett Mus, Bruno Guillaume, Sylvain Kahane, and Daniel Zeman. 2025. Creating a multi-layer Treebank for Tundra Nenets. In Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages, pages 77–86, Joensuu, Finland. Association for Computational Linguistics.
Cite (Informal):
Creating a multi-layer Treebank for Tundra Nenets (Mus et al., IWCLUL 2025)
PDF:
https://aclanthology.org/2025.iwclul-1.11.pdf