Mihailo Škorić

Also published as: Mihailo Skoric


2023

Football terminology: compilation and transformation into OntoLex-Lemon resource
Jelena Lazarević | Ranka Stanković | Mihailo Škorić | Biljana Rujević
Proceedings of the 4th Conference on Language, Data and Knowledge

2022

Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection
Ranka Stanković | Cvetana Krstev | Branislava Šandrih Todorović | Dusko Vitas | Mihailo Skoric | Milica Ikonić Nešić
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper we present the Serbian part of the ELTeC multilingual corpus of novels written in the time period 1840-1920. The corpus is being built in order to test various distant reading methods and tools with the aim of rethinking European literary history. We present the various steps that led to the production of the Serbian sub-collection: novel selection and retrieval, text preparation, structural annotation, POS-tagging, lemmatization and named entity recognition. The Serbian sub-collection was published on different platforms in order to make it freely available to various users. Several usage examples show that this sub-collection is useful for both close and distant reading approaches.

From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back)
Milica Ikonić Nešić | Ranka Stanković | Christof Schöch | Mihailo Skoric
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

In this paper we present the wikification of the ELTeC (European Literary Text Collection), developed within the COST Action “Distant Reading for European Literary History” (CA16204). ELTeC is a multilingual corpus of novels written in the time period 1840-1920, built to apply distant reading methods and tools to explore European literary history. We present the pipeline that led to the production of the linked dataset: the novels’ metadata retrieval and named entity recognition, transformation, mapping and Wikidata population, followed by named entity linking and export to NIF (NLP Interchange Format). The speed-up of data preparation and import to Wikidata is presented on the use case of seven sub-collections of ELTeC (English, Portuguese, French, Slovenian, German, Hungarian and Serbian). Our goal was to automate the process of preparing and importing information, so OpenRefine and QuickStatements were chosen as the best options. The paper also includes examples of SPARQL queries for retrieval of authors, novel titles, publication places and other metadata with different visualisation options as well as statistical overviews.

2020

Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian
Ranka Stankovic | Branislava Šandrih | Cvetana Krstev | Miloš Utvić | Mihailo Skoric
Proceedings of the Twelfth Language Resources and Evaluation Conference

The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on the TreeTagger and spaCy taggers, and on the annotation schema alignment between Serbian morphological dictionaries, MULTEXT-East and the Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of the developed taggers was compared and the impact of training set size was investigated, which resulted in around 98% PoS-tagging precision per token for both new models. The sr_basic annotated dataset will also be published.