Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian
Ranka Stanković | Branislava Šandrih | Cvetana Krstev | Miloš Utvić | Mihailo Škorić
Proceedings of the Twelfth Language Resources and Evaluation Conference
The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on the TreeTagger and spaCy taggers, and on the annotation schema alignment between Serbian morphological dictionaries, MULTEXT-East and the Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of the developed taggers was compared and the impact of training set size was investigated, which resulted in around 98% PoS-tagging precision per token for both new models. The sr_basic annotated dataset will also be published.
A tool for enhanced search of multilingual digital libraries of e-journals
Ranka Stanković | Cvetana Krstev | Ivan Obradović | Aleksandra Trtovac | Miloš Utvić
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper outlines the main features of Bibliša, a tool that offers various possibilities for enhancing queries submitted to large collections of TMX documents generated from aligned parallel articles residing in multilingual digital libraries of e-journals. Queries initiated by a simple or multiword keyword, in Serbian or English, can be expanded by Bibliša, both semantically and morphologically, using different supporting monolingual and multilingual resources, such as wordnets and electronic dictionaries. The tool operates within a complex system composed of several modules, including a web application, which makes it readily accessible on the web. Its functionality has been tested on a collection of 44 TMX documents generated from articles published bilingually by the journal INFOtheca, yielding encouraging results. Further enhancements of the tool are underway, with the aim of transforming it from a powerful full-text and metadata search tool into a useful translator's aid, which could be of assistance both in reviewing terminology used in context and in refining the multilingual resources used within the system.
E-Dictionaries and Finite-State Automata for the Recognition of Named Entities
Cvetana Krstev | Duško Vitas | Ivan Obradović | Miloš Utvić
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing