2024
Towards the semantic annotation of SR-ELEXIS corpus: Insights into Multiword Expressions and Named Entities
Cvetana Krstev
|
Ranka Stanković
|
Aleksandra M. Marković
|
Teodora Sofija Mihajlov
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
This paper presents the work in progress on ELEXIS-sr corpus, the Serbian addition to the ELEXIS multilingual annotated corpus ElexisWSD, comprising semantic annotations and word sense repositories. The ELEXIS corpus has parallel annotations in ten European languages, serving as a cross-lingual benchmark for evaluating low and medium-resourced European languages. The focus in this paper is on multiword expressions (MWEs) and named entities (NEs), their recognition in the ELEXIS-sr sentence set, and comparison with annotations in other languages. The first steps in building the Serbian sense inventory are discussed, and some results concerning MWEs and NEs are analysed. Once completed, the ELEXIS-sr corpus will be the first sense annotated corpus using the Serbian WordNet (SrpWN). Finally, ideas to represent MWE lexicon entries as Linguistic Linked-Open Data (LLOD) and connect them with occurrences in the corpus are presented.
2023
PARSEME corpus release 1.3
Agata Savary
|
Cherifa Ben Khelil
|
Carlos Ramisch
|
Voula Giouli
|
Verginica Barbu Mititelu
|
Najet Hadj Mohamed
|
Cvetana Krstev
|
Chaya Liebeskind
|
Hongzhi Xu
|
Sara Stymne
|
Tunga Güngör
|
Thomas Pickard
|
Bruno Guillaume
|
Eduard Bejček
|
Archna Bhatia
|
Marie Candito
|
Polona Gantar
|
Uxoa Iñurrieta
|
Albert Gatt
|
Jolanta Kovalevskaite
|
Timm Lichte
|
Nikola Ljubešić
|
Johanna Monti
|
Carla Parra Escartín
|
Mehrnoush Shamsfard
|
Ivelina Stoyanova
|
Veronika Vincze
|
Abigail Walsh
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)
We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus now represents 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.
2022
Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection
Ranka Stanković
|
Cvetana Krstev
|
Branislava Šandrih Todorović
|
Dusko Vitas
|
Mihailo Skoric
|
Milica Ikonić Nešić
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this paper we present the Serbian part of the ELTeC multilingual corpus of novels written in the time period 1840-1920. The corpus is being built in order to test various distant reading methods and tools with the aim of re-thinking the European literary history. We present the various steps that led to the production of the Serbian sub-collection: the novel selection and retrieval, text preparation, structural annotation, POS-tagging, lemmatization and named entity recognition. The Serbian sub-collection was published on different platforms in order to make it freely available to various users. Several use examples show that this sub-collection is useful for both close and distant reading approaches.
A Myriad of Ways to Say: “Wear a mask!”
Cvetana Krstev
|
Duško Vitas
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)
This paper presents a small corpus of notices displayed at entrances of various Belgrade public premises asking those who enter to wear a mask. We analyze the various aspects of these notices: their physical appearance, script, lexica, syntax and style. Special attention is paid to the various obligatory and optional parts of these notices. Obligatory parts deal with wearing masks, keeping the distance, limiting the number of persons on premises and using disinfection. We developed local grammars for modelling phrases that require wearing masks, which can be used both for recognition and for the generation of paraphrases.
2021
Serbian NER&Beyond: The Archaic and the Modern Intertwinned
Branislava Šandrih Todorović
|
Cvetana Krstev
|
Ranka Stanković
|
Milica Ikonić Nešić
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
In this work, we present a Serbian literary corpus that is being developed under the umbrella of the “Distant Reading for European Literary History” COST Action CA16204. Using this corpus of novels written more than a century ago, we have developed and made publicly available a Named Entity Recognizer (NER) trained to recognize 7 different named entity types, with a Convolutional Neural Network (CNN) architecture, achieving an F1 score of ≈91% on the test dataset. This model has been further assessed on a separate evaluation dataset. We wrap up with a comparison of the developed model with the existing one, followed by a discussion of the pros and cons of both models.
2020
Analysis of Similes in Serbian Literary Texts (1860-1920) using computational methods
Cvetana Krstev
|
Jelena Jaćimović
|
Duško Vitas
Proceedings of the Fourth International Conference on Computational Linguistics in Bulgaria (CLIB 2020)
Similes are rhetorical figures which play an important role in literary texts. This paper presents a finite-state methodology developed for the description of adjectival similes, which enables their retrieval and annotation in Serbian novels written in the mid-19th and early 20th centuries. The results of a textometric analysis reveal the most frequent adjectival similes and the specificity of their usage, with respect to the author, title, or publication date, in a subset of the SrpELTeC corpus.
The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm
|
Katrin Marheinecke
|
Stefanie Hegele
|
Stelios Piperidis
|
Kalina Bontcheva
|
Jan Hajič
|
Khalid Choukri
|
Andrejs Vasiļjevs
|
Gerhard Backfried
|
Christoph Prinz
|
José Manuel Gómez-Pérez
|
Luc Meertens
|
Paul Lukowicz
|
Josef van Genabith
|
Andrea Lösch
|
Philipp Slusallek
|
Morten Irgens
|
Patrick Gatellier
|
Joachim Köhler
|
Laure Le Bars
|
Dimitra Anastasiou
|
Albina Auksoriūtė
|
Núria Bel
|
António Branco
|
Gerhard Budin
|
Walter Daelemans
|
Koenraad De Smedt
|
Radovan Garabík
|
Maria Gavriilidou
|
Dagmar Gromann
|
Svetla Koeva
|
Simon Krek
|
Cvetana Krstev
|
Krister Lindén
|
Bernardo Magnini
|
Jan Odijk
|
Maciej Ogrodniczuk
|
Eiríkur Rögnvaldsson
|
Mike Rosner
|
Bolette Pedersen
|
Inguna Skadiņa
|
Marko Tadić
|
Dan Tufiș
|
Tamás Váradi
|
Kadri Vider
|
Andy Way
|
François Yvon
Proceedings of the Twelfth Language Resources and Evaluation Conference
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.
Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian
Ranka Stankovic
|
Branislava Šandrih
|
Cvetana Krstev
|
Miloš Utvić
|
Mihailo Skoric
Proceedings of the Twelfth Language Resources and Evaluation Conference
The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on TreeTagger and spaCy taggers, and the annotation schema alignment between Serbian morphological dictionaries, MULTEXT-East and Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of the developed taggers was compared and the impact of training set size was investigated, which resulted in around 98% PoS-tagging precision per token for both new models. The sr_basic annotated dataset will also be published.
Multi-word Expressions for Abusive Speech Detection in Serbian
Ranka Stanković
|
Jelena Mitrović
|
Danka Jokić
|
Cvetana Krstev
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons
This paper presents our work on the refinement and improvement of the Serbian language part of Hurtlex, a multilingual lexicon of words to hurt. We pay special attention to adding Multi-word expressions that can be seen as abusive, as such lexical entries are very important in obtaining good results in a plethora of abusive language detection tasks. We use Serbian morphological dictionaries as a basis for data cleaning and MWE dictionary creation. A connection to other lexical and semantic resources in Serbian is outlined and building of abusive language detection systems based on that connection is foreseen.
2019
Development and Evaluation of Three Named Entity Recognition Systems for Serbian - The Case of Personal Names
Branislava Šandrih
|
Cvetana Krstev
|
Ranka Stankovic
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
In this paper we present a rule- and lexicon-based system for the recognition of Named Entities (NE) in Serbian newspaper texts that was used to prepare a gold standard annotated with personal names. It was further used to prepare training sets for four different levels of annotation, which were further used to train two Named Entity Recognition (NER) systems: Stanford and spaCy. All obtained models, together with the rule- and lexicon-based system, were evaluated on two sample texts: a part of the gold standard and an independent newspaper text of approximately the same size. The results show that the rule- and lexicon-based system outperforms trained models in all four scenarios (measured by F1), while Stanford models have the highest precision. All systems obtain best results in recognizing full names, while the recognition of first names only is rather poor. The produced models are incorporated into a Web platform NER&Beyond that provides various NE-related functions.
2018
Knowledge and Rule-Based Diacritic Restoration in Serbian
Cvetana Krstev
|
Ranka Stanković
|
Duško Vitas
Proceedings of the Third International Conference on Computational Linguistics in Bulgaria (CLIB 2018)
In this paper we present a procedure for the restoration of diacritics in Serbian texts written using the degraded Latin alphabet. The procedure relies on the comprehensive lexical resources for Serbian: the morphological electronic dictionaries, the Corpus of Contemporary Serbian and local grammars. Dictionaries are used to identify possible candidates for the restoration, while the data obtained from SrpKor and local grammars assists in making a decision between several candidates in cases of ambiguity. The evaluation results reveal that, depending on the text, accuracy ranges from 95.03% to 99.36%, while the precision (average 98.93%) is always higher than the recall (average 94.94%).
Resource-based WordNet Augmentation and Enrichment
Ranka Stanković
|
Miljana Mladenović
|
Ivan Obradović
|
Marko Vitas
|
Cvetana Krstev
Proceedings of the Third International Conference on Computational Linguistics in Bulgaria (CLIB 2018)
In this paper we present an approach to support production of synsets for Serbian WordNet (SerWN) by adjusting Princeton WordNet (PWN) synsets using several bilingual English-Serbian resources. PWN synset definitions were automatically translated and post-edited, if needed, while candidate literals for Serbian synsets were obtained automatically from a list of translational equivalents compiled from bilingual resources. Preliminary results obtained from a set of 1248 selected PWN synsets show that the produced Serbian synsets contain 4024 literals, out of which 2278 were offered by the system we present in this paper, whereas experts added the remaining 1746. Approximately one half of synset definitions obtained automatically were accepted with no or minor corrections. These first results are encouraging, since the efficiency of synset production for SerWN was increased. There is also space for further improvement of this approach to wordnet enrichment.
Using English Baits to Catch Serbian Multi-Word Terminology
Cvetana Krstev
|
Branislava Šandrih
|
Ranka Stanković
|
Miljana Mladenović
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Rule-based Automatic Multi-word Term Extraction and Lemmatization
Ranka Stanković
|
Cvetana Krstev
|
Ivan Obradović
|
Biljana Lazić
|
Aleksandra Trtovac
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from the mining domain containing more than 600,000 simple word forms. Extracted and lemmatized multi-word terms are filtered in order to reject falsely offered lemmas and then ranked by introducing measures that combine linguistic and statistical information (C-Value, T-Score, LLR, and Keyness). Mean average precision for retrieval of MWU forms ranges from 0.789 to 0.804, while mean average precision of lemma production ranges from 0.956 to 0.960. The evaluation showed that 94% of distinct multi-word forms were evaluated as proper multi-word units, and among them 97% were associated with correct lemmas.
How to Differentiate the Closely Related Standard Languages?
Duško Vitas
|
Ljubomir Popović
|
Cvetana Krstev
|
Anđelka Zečević
Proceedings of the Second International Conference on Computational Linguistics in Bulgaria (CLIB 2016)
In this paper the adequacy of the SETimes corpus as a basis for the comparison of closely related languages that are used in countries that emerged after the breakup of Yugoslavia is discussed by comparing it with other corpora. It is shown that the phenomena observed in this corpus and used to illustrate differences, most specifically between Serbian and Croatian, are consistent neither with their standards nor with other sources. Thus, results obtained on the basis of the SETimes corpus are corpus-biased and have to be reconsidered. This proves that the size and composition of a corpus used in linguistic research are crucial for assessing the obtained results.
A Language-independent Model for Introducing a New Semantic Relation Between Adjectives and Nouns in a WordNet
Miljana Mladenović
|
Jelena Mitrović
|
Cvetana Krstev
Proceedings of the 8th Global WordNet Conference (GWC)
The aim of this paper is to show a language-independent process of creating a new semantic relation between adjectives and nouns in wordnets. The existence of such a relation is expected to improve the detection of figurative language and sentiment analysis (SA). The proposed method uses an annotated corpus to explore the semantic knowledge contained in linguistic constructs performing as the rhetorical figure Simile. Based on the frequency of occurrence of similes in an annotated corpus, we propose a new relation, which connects the noun synset with the synset of an adjective representing that noun’s specific attribute. We elaborate on adding this new relation in the case of the Serbian WordNet (SWN). The proposed method is evaluated by human judgement in order to determine the relevance of automatically selected relation items. The evaluation has shown that 84% of the automatically selected and the most frequent linguistic constructs, whose frequency threshold was equal to 3, were also selected by humans.
2014
The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm
|
Hans Uszkoreit
|
Sophia Ananiadou
|
Núria Bel
|
Audronė Bielevičienė
|
Lars Borin
|
António Branco
|
Gerhard Budin
|
Nicoletta Calzolari
|
Walter Daelemans
|
Radovan Garabík
|
Marko Grobelnik
|
Carmen García-Mateo
|
Josef van Genabith
|
Jan Hajič
|
Inma Hernáez
|
John Judge
|
Svetla Koeva
|
Simon Krek
|
Cvetana Krstev
|
Krister Lindén
|
Bernardo Magnini
|
Joseph Mariani
|
John McNaught
|
Maite Melero
|
Monica Monachini
|
Asunción Moreno
|
Jan Odijk
|
Maciej Ogrodniczuk
|
Piotr Pęzik
|
Stelios Piperidis
|
Adam Przepiórkowski
|
Eiríkur Rögnvaldsson
|
Michael Rosner
|
Bolette Pedersen
|
Inguna Skadiņa
|
Koenraad De Smedt
|
Marko Tadić
|
Paul Thompson
|
Dan Tufiş
|
Tamás Váradi
|
Andrejs Vasiļjevs
|
Kadri Vider
|
Jolanta Zabarskaite
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative's work throughout Europe in order to boost progress and innovation in our field.
Developing and Maintaining a WordNet: Procedures and Tools
Miljana Mladenović
|
Jelena Mitrović
|
Cvetana Krstev
Proceedings of the Seventh Global Wordnet Conference
Enriching SerbianWordNet and Electronic Dictionaries with Terms from the Culinary Domain
Staša Vujičić Stanković
|
Cvetana Krstev
|
Duško Vitas
Proceedings of the Seventh Global Wordnet Conference
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing
Jorge Baptista
|
Pushpak Bhattacharyya
|
Christiane Fellbaum
|
Mikel Forcada
|
Chu-Ren Huang
|
Svetla Koeva
|
Cvetana Krstev
|
Eric Laporte
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing
2012
A tool for enhanced search of multilingual digital libraries of e-journals
Ranka Stanković
|
Cvetana Krstev
|
Ivan Obradović
|
Aleksandra Trtovac
|
Miloš Utvić
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper outlines the main features of Bibliša, a tool that offers various possibilities of enhancing queries submitted to large collections of TMX documents generated from aligned parallel articles residing in multilingual digital libraries of e-journals. The queries initiated by a simple or multiword keyword, in Serbian or English, can be expanded by Bibliša, both semantically and morphologically, using different supporting monolingual and multilingual resources, such as wordnets and electronic dictionaries. The tool operates within a complex system composed of several modules including a web application, which makes it readily accessible on the web. Its functionality has been tested on a collection of 44 TMX documents generated from articles published bilingually by the journal INFOtecha, yielding encouraging results. Further enhancements of the tool are underway, with the aim of transforming it from a powerful full-text and metadata search tool, to a useful translator's aid, which could be of assistance both in reviewing terminology used in context and in refining the multilingual resources used within the system.
2011
E-Dictionaries and Finite-State Automata for the Recognition of Named Entities
Cvetana Krstev
|
Duško Vitas
|
Ivan Obradović
|
Miloš Utvić
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing
2010
A Description of Morphological Features of Serbian: a Revision using Feature System Declaration
Cvetana Krstev
|
Ranka Stanković
|
Duško Vitas
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper we discuss some well-known morphological descriptions used in various projects and applications (most notably MULTEXT-East and Unitex) and illustrate the encountered problems on Serbian. We have spotted four groups of problems: the lack of a value for an existing category, the lack of a category, the interdependence of values and categories lacking some description, and the lack of support for some types of categories. At the same time, various descriptions often describe exactly the same morphological property using different approaches. We propose a new morphological description for Serbian following the feature structure representation defined by the ISO standard. In this description we try to incorporate all characteristics of Serbian that need to be specified for various applications. We have developed several XSLT scripts that transform our description into descriptions needed for various applications. We have developed the first version of this new description, but we treat it as an ongoing project because for some properties we have not yet found a satisfactory solution.
2009
E-Connecting Balkan Languages
Cvetana Krstev
|
Ranka Stanković
|
Duško Vitas
|
Svetla Koeva
Proceedings of the Workshop Multilingual resources, technologies and evaluation for central and Eastern European languages
2008
The Usage of Various Lexical Resources and Tools to Improve the Performance of Web Search Engines
Cvetana Krstev
|
Ranka Stanković
|
Duško Vitas
|
Ivan Obradović
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper we present how resources and tools developed within the Human Language Technology Group at the University of Belgrade can be used for tuning queries before submitting them to a web search engine. We argue that the selection of words chosen for a query, which are of paramount importance for the quality of results obtained by the query, can be substantially improved by using various lexical resources, such as morphological dictionaries and wordnets. These dictionaries enable semantic and morphological expansion of the query, the latter being very important in highly inflective languages, such as Serbian. Wordnets can also be used for adding another language to a query, if appropriate, thus making the query bilingual. Problems encountered in retrieving documents of interest are discussed and illustrated by examples. A brief description of resources is given, followed by an outline of the web tool which enables their integration. Finally, a set of examples is chosen in order to illustrate the use of the lexical resources and tool in question. Results obtained for these examples show that the number of documents obtained through a query by using our approach can double and even quadruple in some cases.
2006
WS4LR: A Workstation for Lexical Resources
Cvetana Krstev
|
Ranka Stanković
|
Duško Vitas
|
Ivan Obradović
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
In this paper we describe WS4LR, the workstation for lexical resources, a software tool developed within the Human Language Technology Group at the Faculty of Mathematics, University of Belgrade. The tool is aimed at manipulating heterogeneous lexical resources, and the need for such a tool came from the large volume of resources the Group has developed in the course of many years and within different projects. The tool handles morphological dictionaries, wordnets, aligned texts and transducers equally and has already proved very useful for various tasks. Although it has so far been used mainly for Serbian, WS4LR is not language dependent and can be successfully used for resources in other languages provided that they follow the described formats and methodologies. The tool operates on the .NET platform and runs on a personal computer under Windows 2000/XP/2003 operating system with at least 256MB of internal memory.
2004
Combining Heterogeneous Lexical Resources
Cvetana Krstev
|
Duško Vitas
|
Ranka Stanković
|
Ivan Obradović
|
Gordana Pavlović-Lažetić
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
2003
The MULTEXT-East Morphosyntactic Specification for Slavic Languages
Tomaž Erjavec
|
Cvetana Krstev
|
Vladimír Petkevič
|
Kiril Simov
|
Marko Tadić
|
Duško Vitas
Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages
Composite Tense Recognition and Tagging in Serbian
Duško Vitas
|
Cvetana Krstev
Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages