Tatiana Merzhevich

2022

Tupían Language Ressources: Data, Tools, Analyses
Lorena Martín Rodríguez | Tatiana Merzhevich | Wellington Silva | Tiago Tresoldi | Carolina Aragon | Fabrício F. Gerardi
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

TuLaR (Tupian Language Resources) is a project for collecting, documenting, analyzing, and developing computational and pedagogical material for low-resource Brazilian indigenous languages. It provides valuable data for language research regarding typological, syntactic, morphological, and phonological aspects. Here we present TuLaR’s databases, with special consideration to TuDeT (Tupian Dependency Treebanks), an annotated corpus under development for nine languages of the Tupian family, built upon the Universal Dependencies framework. The annotation within such a framework serves a twofold goal: enriching the linguistic documentation of the Tupian languages due to the rapid and consistent annotation, and providing computational resources for those languages, thanks to the suitability of our framework for developing NLP tools. We likewise present a related lexical database, some tools developed by the project, and examine future goals for our initiative.

pdf bib abs

Introducing YakuToolkit. Yakut Treebank and Morphological Analyzer.
Tatiana Merzhevich | Fabrício Ferraz Gerardi
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

This poster presents the first publicly available treebank of Yakut, a Turkic language spoken in Russia, and a morphological analyzer for this language. The treebank was annotated following the Universal Dependencies (UD) framework and the mor- phological analyzer can directly access and use its data. Yakut is an under-represented language whose prominence can be raised by making reliably annotated data and NLP tools that could process it freely accessible. The publication of both the treebank and the analyzer serves this purpose with the prospect of evolving into a benchmark for the development of NLP online tools for other languages of the Turkic family in the future.

pdf bib abs

SIGMORPHON 2022 Task 0 Submission Description: Modelling Morphological Inflection with Data-Driven and Rule-Based Approaches
Tatiana Merzhevich | Nkonye Gbadegoye | Leander Girrbach | Jingwen Li | Ryan Soh-Eun Shim
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper describes our participation in the 2022 SIGMORPHON-UniMorph Shared Task on Typologically Diverse and AcquisitionInspired Morphological Inflection Generation. We present two approaches: one being a modification of the neural baseline encoderdecoder model, the other being hand-coded morphological analyzers using finite-state tools (FST) and outside linguistic knowledge. While our proposed modification of the baseline encoder-decoder model underperforms the baseline for almost all languages, the FST methods outperform other systems in the respective languages by a large margin. This confirms that purely data-driven approaches have not yet reached the maturity to replace trained linguists for documentation and analysis especially considering low-resource and endangered languages.

Co-authors

Jingwen Li 1

Lorena Martín Rodríguez 1

Ryan Soh-Eun Shim 1

Wellington Silva 1

Tiago Tresoldi 1

Venues

SIGUL2
SIGMORPHON1

Fix author