Nikolett Mus
2026
Tokenisation of Turkic Copula Constructions in Universal Dependencies
Cagri Coltekin | Furkan Akkurt | Bermet Chontaeva | Soudabeh Eslami | Sardana Ivanova | Gulnura Dzhumalieva | Aida Kasieva | Nikolett Mus | Jonathan Washington
Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026)
Identifying units, ‘syntactic words’, for morphosyntactic analysis is important yet challenging for morphologically rich languages. In this paper we propose a set of guiding principles to determine units of morphosyntactic analysis, and apply them to the case of copular constructions in Turkic languages, in the context of the Universal Dependencies (UD) framework. We also provide a survey of the practice in the Turkic UD treebanks published to date, and discuss the advantages and disadvantages of the proposed tokenisation for a selection of Turkic languages.
2025
Creating a multi-layer Treebank for Tundra Nenets
Nikolett Mus | Bruno Guillaume | Sylvain Kahane | Daniel Zeman
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages
This paper presents the development of the Tundra Nenets Universal Dependencies (UD) Treebank, the first syntactically annotated resource for the Samoyedic branch of the Uralic family. The treebank integrates spoken-language data and adopts the morphologically enhanced Surface-Syntactic UD (mSUD) framework to capture inflectional morphology and morphology-based syntactic relations. It further incorporates Information Structure annotation. The methodological workflow includes data selection, transcription conventions, sentence and lexeme segmentation, annotation of spoken-language features, lemmatization, treatment of morpheme status, part-of-speech and morphological tagging, and syntactic annotation based on the functional and distributional properties of syntactic elements. We also outline the principles guiding multi-level annotation and justify the theoretical choices underlying the integration of prosodic, morphological, and syntactic information.