Lemmatization and POS-tagging process by using joint learning approach. Experimental results on Classical Armenian, Old Georgian, and Syriac
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
Classical Armenian, Old Georgian and Syriac are under-resourced digital languages. Even though a lot of printed critical editions or dictionaries are available, there is currently a lack of fully tagged corpora that could be reused for automatic text analysis. In this paper, we introduce an ongoing project of lemmatization and POS-tagging for these languages, relying on a recurrent neural network (RNN), specific morphological tags and dedicated datasets. For this paper, we have combine different corpora previously processed by automatic out-of-context lemmatization and POS-tagging, and manual proofreading by the collaborators of the GREgORI Project (UCLouvain, Louvain-la-Neuve, Belgium). We intend to compare a rule based approach and a RNN approach by using PIE specialized by Calfa (Paris, France). We introduce here first results. We reach a mean accuracy of 91,63% in lemmatization and of 92,56% in POS-tagging. The datasets, which were constituted and used for this project, are not yet representative of the different variations of these languages through centuries, but they are homogenous and allow reaching tangible results, paving the way for further analysis of wider corpora.
Languages Resources for Poorly Endowed Languages : The Case Study of Classical Armenian
Proceedings of the 12th Language Resources and Evaluation Conference
Classical Armenian is a poorly endowed language, that despite a great tradition of lexicographical erudition is coping with a lack of resources. Although numerous initiatives exist to preserve the Classical Armenian language, the lack of precise and complete grammatical and lexicographical resources remains. This article offers a situation analysis of the existing resources for Classical Armenian and presents the new digital resources provided on the Calfa platform. The Calfa project gathers existing resources and updates, enriches and enhances their content to offer the richest database for Classical Armenian today. Faced with the challenges specific to a poorly endowed language, the Calfa project is also developing new technologies and solutions to enable preservation, advanced research, and larger systems and developments for the Armenian language
Recycling and Comparing Morphological Annotation Models for Armenian Diachronic-Variational Corpus Processing
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Armenian is a language with significant variation and unevenly distributed NLP resources for different varieties. An attempt is made to process an RNN model for morphological annotation on the basis of different Armenian data (provided or not with morphologically annotated corpora), and to compare the annotation results of RNN and rule-based models. Different tests were carried out to evaluate the reuse of an unspecialized model of lemmatization and POS-tagging for under-resourced language varieties. The research focused on three dialects and further extended to Western Armenian with a mean accuracy of 94,00 % in lemmatization and 97,02% in POS-tagging, as well as a possible reusability of models to cover different other Armenian varieties. Interestingly, the comparison of an RNN model trained on Eastern Armenian with the Eastern Armenian National Corpus rule-based model applied to Western Armenian showed an enhancement of 19% in parsing. This model covers 88,79% of a short heterogeneous dataset in Western Armenian, and could be a baseline for a massive corpus annotation in that standard. It is argued that an RNN-based model can be a valid alternative to a rule-based one giving consideration to such factors as time-consumption, reusability for different varieties of a target language and significant qualitative results in morphological annotation.