Unsupervised part-of-speech induction for language description: Modeling documentation materials in Kolyma Yukaghir

Albert Ventayol-boada, Nathan Roll, Simon Todd


Abstract
This study investigates the clustering of words into Part-of-Speech (POS) classes in Kolyma Yukaghir. In grammatical descriptions, lexical items are assigned to POS classes based on their morphological paradigms. Discursively, however, these classes share a fair amount of morphology. We turn to POS induction to evaluate whether classes based on quantifying the distributions in which roots and affixes are used can be useful for language description and, if so, what those classes might be. We qualitatively compare clusters of roots and affixes based on four different definitions of their distributions. The results show that clustering is more reliable for words that typically bear more morphology. They also suggest that the number of POS classes in Kolyma Yukaghir might be smaller than stated in current descriptions. The study thus demonstrates how unsupervised learning methods can provide insights for language description, particularly for highly inflectional languages.
Anthology ID:
2023.fieldmatters-1.3
Volume:
Proceedings of the Second Workshop on NLP Applications to Field Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Oleg Serikov, Ekaterina Voloshina, Anna Postnikova, Elena Klyachko, Ekaterina Vylomova, Tatiana Shavrina, Eric Le Ferrand, Valentin Malykh, Francis Tyers, Timofey Arkhangelskiy, Vladislav Mikhailov
Venue:
FieldMatters
Publisher:
Association for Computational Linguistics
Pages:
23–29
URL:
https://aclanthology.org/2023.fieldmatters-1.3
DOI:
10.18653/v1/2023.fieldmatters-1.3
Cite (ACL):
Albert Ventayol-boada, Nathan Roll, and Simon Todd. 2023. Unsupervised part-of-speech induction for language description: Modeling documentation materials in Kolyma Yukaghir. In Proceedings of the Second Workshop on NLP Applications to Field Linguistics, pages 23–29, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Unsupervised part-of-speech induction for language description: Modeling documentation materials in Kolyma Yukaghir (Ventayol-boada et al., FieldMatters 2023)
PDF:
https://aclanthology.org/2023.fieldmatters-1.3.pdf
Video:
https://aclanthology.org/2023.fieldmatters-1.3.mp4