2020
Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian
Ranka Stankovic | Branislava Šandrih | Cvetana Krstev | Miloš Utvić | Mihailo Skoric
Proceedings of the Twelfth Language Resources and Evaluation Conference
The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on the TreeTagger and spaCy taggers, and on the annotation schema alignment between Serbian morphological dictionaries, MULTEXT-East, and the Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of the developed taggers was compared and the impact of training set size was investigated; both new models reached around 98% PoS-tagging precision per token. The sr_basic annotated dataset will also be published.
2019
Development and Evaluation of Three Named Entity Recognition Systems for Serbian - The Case of Personal Names
Branislava Šandrih | Cvetana Krstev | Ranka Stankovic
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
In this paper we present a rule- and lexicon-based system for the recognition of Named Entities (NE) in Serbian newspaper texts that was used to prepare a gold standard annotated with personal names. It was then used to prepare training sets for four different levels of annotation, which in turn were used to train two Named Entity Recognition (NER) systems: Stanford and spaCy. All obtained models, together with the rule- and lexicon-based system, were evaluated on two sample texts: a part of the gold standard and an independent newspaper text of approximately the same size. The results show that the rule- and lexicon-based system outperforms the trained models in all four scenarios (measured by F1), while the Stanford models have the highest precision. All systems obtain their best results in recognizing full names, while the recognition of first names alone is rather poor. The produced models are incorporated into the Web platform NER&Beyond, which provides various NE-related functions.
Proceedings of the Student Research Workshop Associated with RANLP 2019
Venelin Kovatchev | Irina Temnikova | Branislava Šandrih | Ivelina Nikolova
Proceedings of the Student Research Workshop Associated with RANLP 2019
2018
Fingerprints in SMS messages: Automatic Recognition of a Short Message Sender Using Gradient Boosting
Branislava Šandrih
Proceedings of the Third International Conference on Computational Linguistics in Bulgaria (CLIB 2018)
This paper considers the following question: Is it possible to tell who sent a short message just by analyzing the sender’s typing style, and not the meaning of the content itself? If so, how reliable would the judgment be? Are we leaving some kind of “fingerprint” when we text, and can we tell something about others based just on their typing style? For this purpose, a corpus of ∼5,500 SMS messages was gathered from one person’s cell phone and two gradient boosting classifiers were built: the first tries to distinguish whether a message was sent by this exact person (the cell phone owner) or by someone else; the second was trained to distinguish between messages sent by a public service (e.g. parking service, bank reports, etc.) and messages sent by humans. The performance of the classifiers was evaluated in a 5-fold cross-validation setting, resulting in 73.6% and 99.3% overall accuracy for the first and the second classifier, respectively.
Using English Baits to Catch Serbian Multi-Word Terminology
Cvetana Krstev | Branislava Šandrih | Ranka Stanković | Miljana Mladenović
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)