2022
Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection
Ranka Stanković
|
Cvetana Krstev
|
Branislava Šandrih Todorović
|
Duško Vitas
|
Mihailo Skoric
|
Milica Ikonić Nešić
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this paper we present the Serbian part of the ELTeC multilingual corpus of novels written in the time period 1840-1920. The corpus is being built in order to test various distant reading methods and tools with the aim of re-thinking the European literary history. We present the various steps that led to the production of the Serbian sub-collection: the novel selection and retrieval, text preparation, structural annotation, POS-tagging, lemmatization and named entity recognition. The Serbian sub-collection was published on different platforms in order to make it freely available to various users. Several usage examples show that this sub-collection is useful for both close and distant reading approaches.
A Myriad of Ways to Say: “Wear a mask!”
Cvetana Krstev
|
Duško Vitas
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)
This paper presents a small corpus of notices displayed at entrances of various Belgrade public premises asking those who enter to wear a mask. We analyze the various aspects of these notices: their physical appearance, script, lexica, syntax and style. Special attention is paid to various obligatory and optional parts of these notices. Obligatory parts deal with wearing masks, keeping distance, limiting the number of persons on premises and using disinfection. We developed local grammars for modelling phrases that require wearing masks, which can be used both for recognition and for generation of paraphrases.
2020
Analysis of Similes in Serbian Literary Texts (1860-1920) using computational methods
Cvetana Krstev
|
Jelena Jaćimović
|
Duško Vitas
Proceedings of the Fourth International Conference on Computational Linguistics in Bulgaria (CLIB 2020)
Similes are rhetorical figures which play an important role in literary texts. This paper presents a finite-state methodology developed for the description of adjectival similes, which enables their retrieval and annotation in Serbian novels written in the mid-19th and early 20th centuries. The results of a textometric analysis reveal the most frequent adjectival similes and the specificity of their usage, with respect to the author, title, or publication date, in a subset of the SrpELTeC corpus.
2018
Knowledge and Rule-Based Diacritic Restoration in Serbian
Cvetana Krstev
|
Ranka Stanković
|
Duško Vitas
Proceedings of the Third International Conference on Computational Linguistics in Bulgaria (CLIB 2018)
In this paper we present a procedure for the restoration of diacritics in Serbian texts written using the degraded Latin alphabet. The procedure relies on the comprehensive lexical resources for Serbian: the morphological electronic dictionaries, the Corpus of Contemporary Serbian (SrpKor) and local grammars. Dictionaries are used to identify possible candidates for the restoration, while the data obtained from SrpKor and local grammars assist in deciding between several candidates in cases of ambiguity. The evaluation results reveal that, depending on the text, accuracy ranges from 95.03% to 99.36%, while the precision (average 98.93%) is always higher than the recall (average 94.94%).
2016
How to Differentiate the Closely Related Standard Languages?
Duško Vitas
|
Ljubomir Popović
|
Cvetana Krstev
|
Anđelka Zečević
Proceedings of the Second International Conference on Computational Linguistics in Bulgaria (CLIB 2016)
In this paper the adequacy of the SETimes corpus as a basis for the comparison of closely related languages that are used in countries that emerged after the breakup of Yugoslavia is discussed by comparing it with other corpora. It is shown that the phenomena observed in this corpus and used to illustrate differences, most specifically between Serbian and Croatian, are consistent neither with their standards nor with other sources. Thus, results obtained on the basis of the SETimes corpus are corpus-biased and have to be reconsidered. This proves that the size and composition of a corpus used in linguistic research are crucial for assessing the obtained results.
2014
Enriching Serbian WordNet and Electronic Dictionaries with Terms from the Culinary Domain
Staša Vujičić Stanković
|
Cvetana Krstev
|
Duško Vitas
Proceedings of the Seventh Global Wordnet Conference
2011
A tagged and aligned corpus for the study of Proper Names in translation
Emeline Lecuit
|
Denis Maurel
|
Duško Vitas
Proceedings of the Second Workshop on Annotation and Exploitation of Parallel Corpora
E-Dictionaries and Finite-State Automata for the Recognition of Named Entities
Cvetana Krstev
|
Duško Vitas
|
Ivan Obradović
|
Miloš Utvić
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing
2010
A Description of Morphological Features of Serbian: a Revision using Feature System Declaration
Cvetana Krstev
|
Ranka Stanković
|
Duško Vitas
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper we discuss some well-known morphological descriptions used in various projects and applications (most notably MULTEXT-East and Unitex) and illustrate the encountered problems on Serbian. We have spotted four groups of problems: the lack of a value for an existing category, the lack of a category, the interdependence of values and categories lacking some description, and the lack of support for some types of categories. At the same time, various descriptions often describe exactly the same morphological property using different approaches. We propose a new morphological description for Serbian following the feature structure representation defined by the ISO standard. In this description we try to incorporate all characteristics of Serbian that need to be specified for various applications. We have developed several XSLT scripts that transform our description into descriptions needed for various applications. We have developed the first version of this new description, but we treat it as an ongoing project because for some properties we have not yet found a satisfactory solution.
2009
E-Connecting Balkan Languages
Cvetana Krstev
|
Ranka Stanković
|
Duško Vitas
|
Svetla Koeva
Proceedings of the Workshop Multilingual resources, technologies and evaluation for central and Eastern European languages
2008
The Usage of Various Lexical Resources and Tools to Improve the Performance of Web Search Engines
Cvetana Krstev
|
Ranka Stanković
|
Duško Vitas
|
Ivan Obradović
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper we present how resources and tools developed within the Human Language Technology Group at the University of Belgrade can be used for tuning queries before submitting them to a web search engine. We argue that the selection of words chosen for a query, which are of paramount importance for the quality of results obtained by the query, can be substantially improved by using various lexical resources, such as morphological dictionaries and wordnets. These dictionaries enable semantic and morphological expansion of the query, the latter being very important in highly inflective languages, such as Serbian. Wordnets can also be used for adding another language to a query, if appropriate, thus making the query bilingual. Problems encountered in retrieving documents of interest are discussed and illustrated by examples. A brief description of resources is given, followed by an outline of the web tool which enables their integration. Finally, a set of examples is chosen in order to illustrate the use of the lexical resources and tool in question. Results obtained for these examples show that the number of documents obtained through a query by using our approach can double and even quadruple in some cases.
2006
WS4LR: A Workstation for Lexical Resources
Cvetana Krstev
|
Ranka Stanković
|
Duško Vitas
|
Ivan Obradović
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
In this paper we describe WS4LR, the workstation for lexical resources, a software tool developed within the Human Language Technology Group at the Faculty of Mathematics, University of Belgrade. The tool is aimed at manipulating heterogeneous lexical resources, and the need for such a tool came from the large volume of resources the Group has developed in the course of many years and within different projects. The tool handles morphological dictionaries, wordnets, aligned texts and transducers equally and has already proved very useful for various tasks. Although it has so far been used mainly for Serbian, WS4LR is not language dependent and can be successfully used for resources in other languages provided that they follow the described formats and methodologies. The tool operates on the .NET platform and runs on a personal computer under the Windows 2000/XP/2003 operating systems with at least 256MB of internal memory.
2004
Combining Heterogeneous Lexical Resources
Cvetana Krstev
|
Duško Vitas
|
Ranka Stanković
|
Ivan Obradović
|
Gordana Pavlović-Lažetić
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
2003
The MULTEXT-East Morphosyntactic Specification for Slavic Languages
Tomaž Erjavec
|
Cvetana Krstev
|
Vladimír Petkevič
|
Kiril Simov
|
Marko Tadić
|
Duško Vitas
Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages
Composite Tense Recognition and Tagging in Serbian
Duško Vitas
|
Cvetana Krstev
Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages