Introducing Morphology in Universal Dependencies Japanese
Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023)
This paper discusses the need for including morphological features in Japanese Universal Dependencies (UD). In the current version (v2.11) of the Japanese UD treebanks, sentences are tokenized at the morpheme level, and almost no morphological feature annotation is used. However, Japanese is not an isolating language lacking morphological inflection; it is an agglutinative language. Given this situation, we introduce a tentative scheme for retokenization and morphological feature annotation for Japanese UD. We then measure the morphological complexity of Japanese and compare it with that of other languages, demonstrating that under the proposed tokenizations Japanese resembles synthetic languages, in line with its linguistic typology.
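To make the effect of tokenization granularity on complexity measures concrete, the following minimal sketch computes a type-token ratio (TTR) for two segmentations of the same toy text. TTR is only one simple proxy for morphological complexity and is an assumption here; the abstract does not say which measures the paper uses. The segmentations and the function name `type_token_ratio` are hypothetical illustrations, not data or code from the treebanks.

```python
def type_token_ratio(tokens):
    """Return the number of distinct tokens divided by the number of tokens."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


# Toy data (hypothetical, not taken from any treebank): three past-tense verbs,
# segmented once at the morpheme level and once at the word level.
morpheme_level = ["食べ", "た", "飲ん", "だ", "見", "た"]
word_level = ["食べた", "飲んだ", "見た"]

ttr_morpheme = type_token_ratio(morpheme_level)  # 5 types over 6 tokens
ttr_word = type_token_ratio(word_level)          # 3 types over 3 tokens

print(f"morpheme-level TTR: {ttr_morpheme:.3f}")
print(f"word-level TTR:     {ttr_word:.3f}")
```

In this toy case the word-level segmentation yields the higher ratio because the repeated suffix is absorbed into distinct word forms, which is the kind of shift a retokenization would be expected to produce.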
Universal Dependencies Treebank for Tatar: Incorporating Intra-Word Code-Switching Information
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference
This paper introduces NMCTT, a new Universal Dependencies treebank for the Tatar language. A significant feature of the corpus is that it includes code-switching (CS) information at the morpheme level, since Tatar texts contain intra-word CS between Tatar and Russian. We first outline NMCTT with a focus on its differences from other treebanks of Turkic languages. Then, to evaluate the merit of the CS annotation, we briefly report the results of a language identification task implemented with Conditional Random Fields that use POS tag information, which is readily available in treebanks in the CoNLL-U format. Experimenting on NMCTT and the Turkish-German CS treebank (SAGT), we demonstrate that the annotation scheme introduced in NMCTT can improve the performance of subword-level language identification. This annotation scheme for CS is not only applicable to any language with CS, but also shows the possibility of employing morphosyntactic information for CS-related downstream tasks.
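As a rough illustration of how POS tags from a CoNLL-U file could feed a sequence labeler such as a CRF, the sketch below parses one sentence and builds a per-token feature dictionary. It covers feature extraction only and does not train a model. The feature set, the function names, the sample sentence, and the `Lang` key in the MISC column are all assumptions made for illustration; they are not the treebank's actual annotation keys or the paper's feature template.

```python
def parse_conllu_sentence(block):
    """Parse one CoNLL-U sentence block into a list of token dicts."""
    tokens = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip malformed lines, multiword ranges, empty nodes
        misc = {}
        if cols[9] != "_":
            for item in cols[9].split("|"):
                key, _, value = item.partition("=")
                misc[key] = value
        tokens.append({"form": cols[1], "upos": cols[3], "misc": misc})
    return tokens


def token_features(tokens, i):
    """Build a feature dict for token i, including neighbouring POS tags."""
    tok = tokens[i]
    return {
        "form.lower": tok["form"].lower(),
        "suffix3": tok["form"][-3:],
        "upos": tok["upos"],
        "prev_upos": tokens[i - 1]["upos"] if i > 0 else "BOS",
        "next_upos": tokens[i + 1]["upos"] if i < len(tokens) - 1 else "EOS",
    }


# Hypothetical sample; "Lang" is an assumed MISC key used only in this sketch.
SAMPLE = "\n".join([
    "# text = Kitap okudum .",
    "\t".join(["1", "Kitap", "kitap", "NOUN", "_", "_", "2", "obj", "_", "Lang=tr"]),
    "\t".join(["2", "okudum", "oku", "VERB", "_", "_", "0", "root", "_", "Lang=tr"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])

tokens = parse_conllu_sentence(SAMPLE)
features = [token_features(tokens, i) for i in range(len(tokens))]
labels = [tok["misc"].get("Lang", "other") for tok in tokens]
```

The resulting `features` and `labels` lists have the shape that common CRF toolkits accept as one training sequence; which toolkit to use is left open here.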
Transliteration for Low-Resource Code-Switching Texts: Building an Automatic Cyrillic-to-Latin Converter for Tatar
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
We introduce a Cyrillic-to-Latin transliterator for the Tatar language based on subword-level language identification. Transliteration is a challenging task for two reasons. First, because modern Tatar texts often contain intra-word code-switching to Russian, a different set of transliteration rules needs to be applied to each morpheme depending on its language, which necessitates morpheme-level language identification. Second, Tatar is a low-resource language with most of its texts in Cyrillic, which makes it difficult to prepare a sufficient dataset. Given this situation, we propose a transliteration method based on subword-level language identification. We trained a language classifier on monolingual Tatar and Russian texts, and applied different transliteration rules according to the identified language. The results demonstrate that our proposed method outperforms other Tatar transliteration tools, and suggest that it correctly transcribes Russian loanwords to some extent.
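The idea of choosing a rule set per morpheme can be sketched as follows. This is a minimal illustration under stated assumptions: the segments arrive already labeled with a language (the classifier itself is not shown), and the two character tables are tiny toy mappings invented for this example, not the rule sets used in the paper or any official Tatar Latin orthography. The names `RULE_SETS`, `transliterate_segment`, and `transliterate_word` are hypothetical.

```python
# Toy rule tables (assumptions for illustration only). The same Cyrillic
# letter may map to different Latin output depending on the language.
TATAR_RULES = {"ш": "ş", "к": "k", "о": "o", "л": "l", "а": "a", "д": "d"}
RUSSIAN_RULES = {"ш": "sh", "к": "k", "о": "o", "л": "l", "а": "a", "д": "d"}

RULE_SETS = {"tt": TATAR_RULES, "ru": RUSSIAN_RULES}


def transliterate_segment(text, rules):
    """Map each character through the rule table; unknown characters pass through."""
    return "".join(rules.get(ch, ch) for ch in text)


def transliterate_word(segments, rule_sets):
    """Transliterate a word given as (segment, language) pairs.

    The language labels are assumed to come from a subword-level
    language identifier, which is outside the scope of this sketch.
    """
    return "".join(
        transliterate_segment(text, rule_sets[lang]) for text, lang in segments
    )


# A Russian-origin stem followed by a Tatar suffix, labeled by hand here.
mixed = transliterate_word([("школа", "ru"), ("да", "tt")], RULE_SETS)
# The same string treated as entirely Tatar, for comparison.
tatar_only = transliterate_word([("школада", "tt")], RULE_SETS)

print(mixed)
print(tatar_only)
```

Comparing the two outputs shows why the language label matters: the first letter is rendered differently depending on which rule table applies to the stem.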