Proceedings of the 5th International Conference on Computational Linguistics in Bulgaria (CLIB 2022)
Ontopopulis is a multilingual, weakly supervised terminology learning algorithm which takes as input a set of seed terms for a semantic category and an unannotated text corpus. The algorithm learns additional terms that belong to this category. For example, for the category “environmental disasters” the input seed set in English is environmental disaster, water pollution, climate change. Among the highest-ranked new terms which the system learns for this semantic class are deforestation, global warming and so on.
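The abstract does not detail how Ontopopulis ranks candidate terms; a common weakly supervised approach is to score candidates by the distributional similarity of their contexts to those of the seed terms. The following is a minimal stdlib sketch of that generic idea, not the actual Ontopopulis algorithm (all function names and the toy corpus format are illustrative):

```python
from collections import Counter
import math

def context_vector(term, sentences, window=2):
    """Bag-of-words context vector for a term over a corpus of token lists."""
    vec = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for t in tokens[lo:hi] if t != term)
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(seeds, candidates, sentences):
    """Rank candidate terms by similarity to the summed seed context vector."""
    centroid = Counter()
    for s in seeds:
        centroid.update(context_vector(s, sentences))
    scored = [(c, cosine(context_vector(c, sentences), centroid)) for c in candidates]
    return sorted(scored, key=lambda x: -x[1])
```

On a corpus where "deforestation" appears in the same contexts as a seed like "pollution", it will outrank unrelated candidates, which mirrors the ranking behaviour the abstract describes.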
We describe the structure and creation of the SageWrite corpus. This is a manually annotated corpus created to support automatic language generation and automatic quality assessment of academic articles. The corpus currently contains annotations for 100 excerpts taken from various scientific articles. For each of these excerpts, the corpus contains (i) a draft version of the excerpt and (ii) annotations that reflect the stylistic and linguistic merits of the excerpt, such as whether or not the text is clearly structured. The SageWrite corpus is the first corpus for the fine-tuning of text-generation algorithms that specifically addresses academic writing.
Razmecheno: Named Entity Recognition from Digital Archive of Diaries “Prozhito”
Timofey Atnashev | Veronika Ganeeva | Roman Kazakov | Daria Matyash | Michael Sonkin | Ekaterina Voloshina | Oleg Serikov | Ekaterina Artemova
The vast majority of existing datasets for Named Entity Recognition (NER) are built primarily on news, research papers and Wikipedia, with a few exceptions created from historical and literary texts. What is more, English is the main source of data for further labelling. This paper aims to fill in multiple gaps by creating a novel dataset, “Razmecheno”, gathered from the diary texts of the project “Prozhito” in Russian. Our dataset is of interest for multiple research lines: literary studies of diary texts, transfer learning from other domains, and low-resource or cross-lingual named entity recognition. Razmecheno comprises 1,331 sentences and 14,119 tokens, sampled from diaries written during Perestroika. The annotation schema consists of five commonly used entity tags: person, characteristics, location, organisation, and facility. The labelling is carried out on the crowdsourcing platform Yandex.Toloka in two stages. First, workers selected sentences that contain an entity of a particular type. Second, they marked up entity spans. As a result, 1,113 entities were obtained. Empirical evaluation of Razmecheno is carried out with off-the-shelf NER tools and by fine-tuning pre-trained contextualized encoders. We release the annotated dataset for open access.
One of the main challenges within the rapidly developing field of neural machine translation is its application to low-resource languages. Recent attempts to provide large parallel corpora in rare language pairs include the generation of web-crawled corpora, which may be vast but are, unfortunately, excessively noisy. The corpus utilised to train machine translation models in the study is CCMatrix, provided by OPUS. Firstly, the corpus is cleaned based on a number of heuristic rules. Then, parts of it are selected in three discrete ways: at random, based on the “margin distance” metric that is native to the CCMatrix dataset, and based on scores derived through the application of a state-of-the-art classifier model (Acarcicek et al., 2020) utilised in a thematic WMT shared task. The performance of the resulting models is evaluated and compared. The classifier-based model does not reach high performance compared with its margin-based counterpart, opening a discussion of ways for further improvement. Still, BLEU scores surpass those of Acarcicek et al.’s (2020) paper by over 15 points.
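The specific heuristic cleaning rules are not listed in the abstract; typical filters for web-crawled corpora drop very short pairs, identical source/target sides, and extreme length ratios, after which pairs can be selected by their margin score. A hedged stdlib sketch of such filters (the function names and thresholds are illustrative, not those of the study):

```python
def clean_pairs(pairs, min_len=3, max_ratio=2.0):
    """Filter parallel sentence pairs with simple heuristics:
    drop very short pairs, identical source/target sides,
    and pairs with an extreme token-length ratio."""
    kept = []
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        if len(s) < min_len or len(t) < min_len:
            continue
        if src == tgt:
            continue
        ratio = max(len(s), len(t)) / min(len(s), len(t))
        if ratio > max_ratio:
            continue
        kept.append((src, tgt))
    return kept

def select_by_margin(scored_pairs, k):
    """Keep the k pairs with the highest margin score (higher = better)."""
    return [p for p, _ in sorted(scored_pairs, key=lambda x: -x[1])[:k]]
```

The same selection function works unchanged for classifier scores, which is what makes the three selection strategies in the study directly comparable.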
In recent years, we have seen a surge in the propagation of online hate speech on social media platforms. According to a multitude of sources such as the European Council, hate speech can lead to acts of violence and conflict on a broader scale. That has led to increased awareness by governments, companies, and the scientific community, and although the field is relatively new, there have been considerable advancements in the field as a result of the collective effort. Despite the increasingly better results, most of the research focuses on the more popular languages (i.e., English, German, or Arabic), whereas less popular languages such as Bulgarian and other Balkan languages have been neglected. We have aggregated a real-world dataset from Bulgarian online forums and manually annotated 108,142 sentences. About 1.74% of these can be described by the categories racism, sexism, rudeness, and profanity. We then developed and evaluated various classifiers on the dataset and found that a support vector machine with a linear kernel trained on character-level TF-IDF features is the best model. Our work can be seen as another piece of the puzzle in building a strong foundation for future work on hate speech classification in Bulgarian.
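The best model pairs character-level TF-IDF features with a linear-kernel SVM; in practice this is commonly built with scikit-learn's `TfidfVectorizer(analyzer='char')` plus `LinearSVC`. Below is a dependency-free sketch of just the feature-extraction half, using one common smoothed-idf variant (the exact weighting scheme of the study is an assumption):

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-grams: the feature type that worked best in the study."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(docs, n=3):
    """Map each document to a {ngram: tf-idf} dict, with smoothed idf
    log((1 + N) / (1 + df)); grams present in every document get weight 0."""
    tfs = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())
    N = len(docs)
    return [{g: tf[g] * math.log((1 + N) / (1 + df[g])) for g in tf} for tf in tfs]
```

Character n-grams are a sensible choice for morphologically rich, noisily spelled forum text, since they are robust to inflection and typos where word-level features are not.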
This paper presents an online Bulgarian sign language dictionary covering terminology related to crisis management. The pressing need for such a resource became evident during the COVID-19 pandemic, when critical information regarding government measures was delivered on a regular basis to the public, including Deaf citizens. The dictionary is freely available on the internet and is aimed at the Deaf, sign language interpreters, learners of sign language, social workers and the general public. Each dictionary entry is supplied with synonyms in spoken Bulgarian, a definition, one or more signs corresponding to the concept in Bulgarian sign language, additional information about derivationally related words and similar signs with different meaning, as well as links to translations in other languages, including American sign language.
The paper discusses raising and control syntactic structures (marked as ‘xcomp’) in a UD-parsed corpus of Bulgarian Parliamentary Sessions. The aim is twofold: to investigate the linguistic status of this phenomenon in an automatically parsed corpus, with a focus on verbal constructions of a head and its dependent together with the shared subject; and to detect errors and gain insights on how to improve the annotation scheme and the automatic detection of realizations of this phenomenon in Bulgarian.
The paper presents a corpus-based study of emotive predicates (verbs and predicative constructions with adjectival, adverbial or noun phrases) in Bulgarian with respect to their syntactic characteristics. The sources of empirical data analyzed here are the Bulgarian National Corpus, the Corpus of Bulgarian Political and Journalistic Speech and the Bulgarian part of the Multilingual Comparable Corpora of Parliamentary Debates ParlaMint. The analyses are organized in terms of morpho-syntactic features of emotive predicates, transitivity, syntactic functions and theta-roles of their arguments. Emotive predicates denote a state or an event involving an affective experience. As part of the special semantic class of psychological/Experiencer verbs, they have been studied in relation to the interaction between lexical semantics and argument realization. Bulgarian data confirm the well-established division of Psych predicates into three classes: Subject Experiencer (fear-type verbs), Object Experiencer (frighten-type verbs), and Dative Experiencer. The third class is mostly represented by adverbial predicates.
The study explores the interaction between the participants in the communication process with respect to their knowledge about the situation presented in the utterance when transforming direct into indirect speech using a verbum dicendi. The speaker has a choice between the firsthand (indicative tenses), which by definition denotes a witnessed situation, and the non-firsthand, which presents the situation as non-witnessed. The interplay between the grammatical marking and the speaker’s evidential strategy is analyzed by applying a corpus method. The data of the Bulgarian National Corpus are used to detect the preferences for a given strategy, considering also the grammatical person, which indicates the level of knowledge of the communicants about the situation: the 1st person signals strong knowledge on the part of the speaker, the 2nd person is related to strong knowledge on the part of the listener, and the 3rd person is associated with weak knowledge of both participants. Illustrative examples representative of a given situation are extracted from the corpus and subjected to context analysis.
The present study explores the semantic and structural aspects of word formation processes in English, focusing on how verbs are derived with the suffixes -ize, -ify, -en, and -ate. Relevant derivatives extracted from the British National Corpus are examined in detail from semantic and formal viewpoints, and a theoretical analysis is then carried out in the framework of generative theory. The BNC survey demonstrates that (i) the meanings of derived verbs are largely divided into five types and the submeanings are closely related to each other, (ii) the well-formedness of derived verbs is primarily determined by the semantic and formal features of their bases, and (iii) -ize suffixation is creative enough to provide a constant supply of new labels. To account for these empirical observations, a mechanism for forming -ize derivatives is proposed in which the semantic properties and creativity of -ize derivation stem solely from the underlying structure, while the formal properties of the bases derive from the lexical entry of -ize.
We present a comparative study of p(e)re-reduplication in Bulgarian and Ukrainian, based on material from a parallel corpus of bilingual texts. We analyse all occurrences found in the corpus of close sequences and conjunctions of two cognate words, the second of which features the intensive and recursive prefix pre- (Bulgarian) or pere- (Ukrainian). We find that in Bulgarian this construction occurs more frequently with finite verb forms, and in Ukrainian with participles and nouns. There is also a correlation with the mode of action denoted by the prefix: in its intensive meaning it turns up more often in Bulgarian, in its recursive meaning in the two languages equally, and in Ukrainian there are more occasions where it cannot be identified as either intensive or recursive. Finally, in both languages instances of p(e)re-reduplication are most common, by a wide margin, in texts with Ukrainian originals.
The paper presents an open-domain Question Answering system for Romanian, answering COVID-19 related questions. The QA system pipeline involves automatic question processing, automatic query generation, web searching for the top 10 most relevant documents and answer extraction using a fine-tuned BERT model for Extractive QA, trained on a COVID-19 data set that we have manually created. The paper will present the QA system and its integration with the Romanian language technologies portal RELATE, the COVID-19 data set and different evaluations of the QA performance.
Extraction of event causality and especially implicit causality from text data is a challenging task. Causality is often treated as a specific relation type and can be considered as a part of relation extraction or relation classification task. Many causality identification-related tasks are designed to select the most plausible alternative of a set of possible causes and consider multiple-choice classification settings. Since there are powerful Question Answering (QA) systems pretrained on large text corpora, we investigated a zero-shot QA-based approach for event causality extraction using a Wikipedia-based dataset containing event descriptions (articles) and annotated causes. We aimed to evaluate to what extent reading comprehension ability of the QA-pipeline can be used for event-related causality extraction from plain text without any additional training. Some evaluation challenges and limitations of the data were discussed. We compared the performance of a two-step pipeline consisting of passage retrieval and extractive QA with QA-only pipeline on event-associated articles and mixed ones. Our systems achieved average cosine semantic similarity scores of 44 – 45% in different settings.
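The abstract reports average cosine semantic similarity scores of 44–45% between extracted and annotated causes, without specifying the similarity model. As a simple lexical stand-in for whatever embedding the study used, cosine similarity over bag-of-words counts illustrates the shape of the metric (this is a sketch for illustration, not the evaluation code of the paper):

```python
from collections import Counter
import math

def cosine_sim(a, b):
    """Cosine similarity between two texts over bag-of-words token counts.
    (The paper measures semantic similarity; this lexical version is a
    deliberately simple stand-in for an embedding-based score.)"""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Averaging this score over all (extracted cause, annotated cause) pairs yields the kind of aggregate figure the abstract reports; an embedding-based variant would simply replace the count vectors with sentence vectors.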
The focus of the paper is the Ontology of Visual Objects based on WordNet noun hierarchies. In particular, we present a methodology for bidirectional ontology engineering, which integrates the pre-existing knowledge resources and the selection of visual objects within the images representing particular thematic domains. The Ontology of Visual Objects organizes concepts labeled by corresponding classes (dominant classes, classes that are attributes to dominant classes, and classes that serve only as parents to dominant classes), relations between concepts and axioms defining the properties of the relations. The Ontology contains 851 classes (706 dominant and attribute classes), 15 relations and a number of axioms built upon them. The definition of relations between dominant and attribute classes and formulations of axioms based on the properties of the relations offers a reliable means for automatic object or image classification and description.
We present a sense-annotated corpus for Russian. The resource was obtained by manually annotating texts from the OpenCorpora corpus, an open corpus for the Russian language, with senses from the Russian wordnet RuWordNet. The annotation was used as a test collection for comparing unsupervised (Personalized PageRank) and pseudo-labeling methods for Russian word sense disambiguation.
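Personalized PageRank for word sense disambiguation runs ordinary PageRank over the wordnet relation graph, but restarts the random walk at nodes tied to the context words, so candidate senses well connected to the context accumulate rank. A self-contained power-iteration sketch of the generic method (the toy graph format is an assumption, not the RuWordNet API; every node is assumed to have at least one neighbour):

```python
def personalized_pagerank(graph, personalization, damping=0.85, iters=50):
    """Power iteration for Personalized PageRank on an undirected graph
    given as {node: [neighbours]}; random restarts land only on the
    `personalization` nodes (e.g. context-word nodes) instead of uniformly."""
    nodes = list(graph)
    weights = {n: personalization.get(n, 0.0) for n in nodes}
    total = sum(weights.values()) or 1.0
    restart = {n: weights[n] / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        new = {}
        for n in nodes:
            # mass flowing into n from every node m that links to it
            incoming = sum(rank[m] / len(graph[m]) for m in graph if n in graph[m])
            new[n] = (1 - damping) * restart[n] + damping * incoming
        rank = new
    return rank
```

To disambiguate a word, one compares the final rank of its candidate sense nodes and picks the highest; in practice a library such as NetworkX (`pagerank` with its `personalization` parameter) does the same computation at scale.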
In this paper we present a new version of the Romanian journalistic treebank annotated with verbal multiword expressions of four types: idioms, light verb constructions, reflexive verbs and inherently adpositional verbs, the last type being recently added to the corpus. These types have been defined and characterized in a multilingual setting (the PARSEME guidelines for annotating verbal multiword expressions). We present the annotation methodologies and offer quantitative data about the expressions occurring in the corpus. We discuss the characteristics of these expressions, with special reference to the difficulties they raise for the automatic processing of Romanian text, as well as for human usage. Special attention is paid to the challenges in the annotation of the inherently adpositional verbs. The corpus is freely available in two formats (CUPT and RDF), as well as queryable using a SPARQL endpoint.
This paper describes the creation of a parallel multilingual lexicon of named entities from English to three South Slavic languages: Serbian, Bulgarian and Macedonian, with Wikipedia as a source. The basics of the proposed methodology are well known: it provides a cheap way to build multilingual lexicons without expertise in the target languages. Wikipedia’s database dump can be freely downloaded in SQL and XML formats. The method presented here has been used to build a Python application that extracts English – Serbian – Bulgarian – Macedonian parallel titles from Wikipedia and classifies them using the English Wikipedia category system. The extracted named entity sets have been classified into five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). This was achieved using Wikipedia metadata. The quality of the classification has been checked manually on 1,000 randomly chosen named entities; the results obtained are 97% precision and 90% recall.
Automatic Language Identification (LI) is a widely addressed task, but not all users (for example linguists) have the means or interest to develop their own tool or to train the existing ones with their own data. There are several off-the-shelf LI tools, but for some languages, it is unclear which tool is the best for specific types of text. This article presents a comparison of the performance of several off-the-shelf language identification tools on Bulgarian social media data. The LI tools are tested on a multilingual Twitter dataset (composed of 2,966 tweets) and an existing Bulgarian Twitter dataset of 3,350 tweets on the topic of fake content detection. The article presents the manual annotation procedure of the first dataset, a discussion of the decisions of the two annotators, and the results from testing the seven off-the-shelf LI tools on both datasets. Our findings show that the tool that is easiest to use for users with no programming skills achieves the highest F1-score on Bulgarian social media data, while other tools have very useful functionalities for Bulgarian social media texts.
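As a point of contrast with the off-the-shelf tools compared in the article, even a trivial script-based heuristic separates Bulgarian (Cyrillic) from English (Latin) tweets, though it obviously cannot distinguish Bulgarian from other Cyrillic-script languages, which is exactly where real LI tools earn their keep. A stdlib sketch of that toy baseline (it is not one of the seven tools tested):

```python
def script_profile(text):
    """Fractions of alphabetic characters that are Cyrillic vs Latin."""
    cyr = sum(1 for ch in text if '\u0400' <= ch <= '\u04FF')
    lat = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = cyr + lat
    return (cyr / total, lat / total) if total else (0.0, 0.0)

def guess_bg_or_en(text):
    """Toy baseline: label a tweet 'bg' if mostly Cyrillic, else 'en'.
    Fails by design on Russian, Serbian Cyrillic, transliterated Bulgarian,
    and mixed-script tweets, which is why trained LI tools are compared."""
    cyr, lat = script_profile(text)
    return "bg" if cyr > lat else "en"
```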
More than 13 million people suffer a stroke each year. Aphasia is a language disorder usually caused by a stroke that damages a specific area of the brain controlling the expression and understanding of language. Aphasia is characterized by a disturbance of the linguistic code affecting encoding and/or decoding of the language. Our project aims to propose a method that helps a person suffering from aphasia to communicate better with those around them. To this end, we propose a machine translation system capable of correcting aphasic errors and helping the patient to communicate more easily. To build such a system, we need a parallel corpus; to our knowledge, this corpus does not exist, especially for French. Therefore, the main challenge and the objective of this task is to build a parallel corpus composed of sentences with aphasic errors and their corresponding corrections. We show how we create a pseudo-aphasia corpus from real data, and then demonstrate the feasibility of our project to translate from aphasic data to natural language. The preliminary results show that the deep learning methods we used achieve correct translations corresponding to a BLEU of 38.6.
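BLEU, the metric reported above, is the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. A simplified single-reference, sentence-level sketch up to bigrams follows; real evaluations typically use a tool such as sacreBLEU with 4-grams and smoothing, so the numbers here are only illustrative of the formula:

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions up to max_n, times a brevity penalty for short candidates."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c, r = ngram_counts(cand, n), ngram_counts(ref, n)
        clipped = sum(min(cnt, r[g]) for g, cnt in c.items())
        total = max(sum(c.values()), 1)
        if clipped == 0:
            return 0.0  # unsmoothed: any empty precision zeroes the score
        log_prec += math.log(clipped / total) / max_n
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return math.exp(log_prec) * bp
```

A perfect correction of an aphasic sentence scores 1.0, a completely wrong one 0.0; a corpus-level score such as the 38.6 above aggregates the n-gram statistics over all sentence pairs before taking the mean.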
In late 2016, Google Translate (GT), widely considered a machine translation leader, replaced its statistical machine translation (SMT) functions with a neural machine translation (NMT) model for many large languages, including Spanish, with other languages following thereafter. Whereas the capabilities of GT had previously advanced incrementally, this switch to NMT resulted in seemingly exponential improvement. However, half a dozen years later, while recognizing GT’s usefulness, it is also imperative to systematically evaluate ongoing shortcomings, including determining which challenges may reasonably be presumed as superable over time and those which, following a multiyear tracking study, prove unlikely ever to be fully resolved. While the research in question principally explores Spanish-English-Spanish machine translation, this paper examines similar problems with Bulgarian-English-Bulgarian GT renditions. Better understanding both the strengths and weaknesses of current machine translation applications is fundamental to knowing when such non-human natural language processing (NLP) technology is capable of performing all or most of a given task, and when heavy, perhaps even exclusive human intervention is still required.
This paper presents a small corpus of notices displayed at the entrances of various Belgrade public premises asking those who enter to wear a mask. We analyze the various aspects of these notices: their physical appearance, script, lexicon, syntax and style. Special attention is paid to the various obligatory and optional parts of these notices. Obligatory parts deal with wearing masks, keeping distance, limiting the number of persons on the premises and using disinfectant. We developed local grammars for modelling phrases that require wearing masks, which can be used both for recognition and for generation of paraphrases.
Recent developments in computer vision applications that are based on machine learning models allow real-time object detection, segmentation and captioning in image or video streams. The paper presents the development of an extension of the 80 COCO categories into a novel ontology with more than 700 classes covering 130 thematic subdomains related to Sport, Transport, Arts and Security. The development of an image dataset of object segmentation was accelerated by machine learning for automatic generation of objects’ boundaries and classes. The multilingual image dataset contains over 20,000 images and 200,000 annotations. It was used to pre-train 130 models for object detection and classification. We show the established approach for the development of the new models and their integration into an application and evaluation framework.
We present BulFrame – a web-based system designed for creating, editing, validating and viewing conceptual frames. A unified theoretical model for the formal presentation of Conceptual frames is offered, which predetermines the architecture of the system with which the data is processed. A Conceptual frame defines a unique set of syntagmatic relations between verb synsets representing the frame and noun synsets expressing the frame elements. Thereby, the notion of Conceptual frame combines semantic knowledge presented in WordNet and FrameNet and builds upon it. The main difference with FrameNet semantic frames is the definition of the sets of nouns that can be combined with a given verb. This is achieved by an ontological representation of noun semantic classes. The framework is built and evaluated with Conceptual frames for Bulgarian verbs.
The distinction between arguments and adjuncts is a relevant topic in many linguistic theories (Tesniere, 1959; Chomsky, 1981; Langacker, 1987; Van Valin, 2001; Herbst, 2014, etc.). Even though theories provide similar definitions of arguments and adjuncts, sometimes it is difficult to draw a clear line between them. In order to determine ambiguous syntactic parts as arguments or adjuncts, various tests have been proposed, but they often give contradictory results and are not fully reliable. Nevertheless, they can be used as an auxiliary tool. The project Syntactic and Semantic Analysis of Arguments and Adjuncts in Croatian – SARGADA was launched with the aim of thoroughly investigating the distinction between arguments and adjuncts in Croatian, and to apply the theoretical results in a syntactic repository which would be a valuable resource for improving NLP tools and for researching and teaching Croatian. In this paper, we will present diagnostic tests chosen as a tool to distinguish between arguments and adjuncts in the Croatian language. The repository containing sentences with ambiguous syntactic phrases and our workflow will also be described.
Hydra is a wordnet management system in which the synsets from different languages live in a common relational structure (a Kripke frame), with a user-friendly GUI for searching, editing and alignment of the objects from the different languages. The data is retrieved by means of a modal logic query language. Despite its many merits, the system stores only the current state of the wordnet data. Wordnet editing and development open questions about wordnet data and structure and their consistency over time. The new Time Flow Hydra uses a dynamic wordnet model with discrete time embedded, where all the states of all the objects are stored and can be accessed simultaneously. This provides the ability to track changes and to detect the desired and undesired results of the data’s evolution. For example, we can ask which objects had 2 hyponyms 10 days ago and have 3 five days later.
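Such a point-in-time query can be served by storing a timestamped snapshot of each object's state. The following is a minimal sketch of that idea, not the actual Hydra data model (class and function names are illustrative; snapshots are assumed to be recorded in increasing time order):

```python
import bisect

class TimedObject:
    """Stores full snapshots of an object's state at discrete times,
    so any past state can be retrieved (the Time Flow idea in miniature)."""
    def __init__(self):
        self.times = []    # timestamps, assumed recorded in increasing order
        self.states = []   # state dict captured at each timestamp

    def record(self, time, state):
        self.times.append(time)
        self.states.append(dict(state))

    def state_at(self, time):
        """Latest state recorded at or before `time` (None if none exists)."""
        i = bisect.bisect_right(self.times, time) - 1
        return self.states[i] if i >= 0 else None

def objects_with_hyponyms(objects, time, count):
    """Which objects had exactly `count` hyponyms at the given `time`?"""
    result = []
    for name, obj in objects.items():
        st = obj.state_at(time)
        if st is not None and len(st.get("hyponyms", [])) == count:
            result.append(name)
    return result
```

Running the query at two different timestamps and intersecting the results reproduces the abstract's example of objects that had 2 hyponyms at one time and 3 at a later one.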