Proceedings of the Third International Conference on Computational Linguistics in Bulgaria (CLIB 2018)

Anthology ID: 2018.clib-1
Month: May
Year: 2018
Address: Sofia, Bulgaria
Venue: CLIB
Publisher: Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences
URL: https://aclanthology.org/2018.clib-1/
PDF: https://aclanthology.org/2018.clib-1.pdf

With a little help from NLP: My Language Technology applications with impact on society
Ruslan Mitkov

The keynote speech presents the speaker’s vision that research should lead to the development of applications which benefit society. To support this, the speaker will present three original methodologies proposed by him which underpin applications jointly implemented with colleagues from across his research group. These Language Technology tools already have a substantial societal impact in the following areas: learning and assessment, translation and care for people with language disabilities.

NLP-based Assessment of Reading Efficiency in Early Grade Children
Vito Pirrelli

Assessing reading skills is a laborious and time-consuming task, which requires monitoring a variety of interlocked abilities, ranging from accurate word rendering, reading fluency and lexical access, to linguistic comprehension, and interpretation, management and inference of complex events in working memory. No existing software, to our knowledge, is able to cover and integrate reading performance monitoring, instant feedback, personalised potentiation, intelligent decision support for teachers and speech therapists, and assessment of response to intervention. NLP and ICT technologies can make such an ambitious platform an achievable target.

Figurative language processing: A developmental and NLP Perspective
Mila Vulchanova | Valentin Vulchanov

It is now common to employ evidence from human behaviour (e.g., child development) for the creation of computational models of this behaviour with a variety of applications (e.g., in developmental robotics). In this paper we address research in the comprehension and processing of figurative (non-literal) language in highly verbal individuals with autism in comparison with age- and language level-matched neuro-typical individuals and discuss critically what factors might account for the observed problems. Based on this evidence we try to outline the strategies used by human language users in understanding non-literal/non-compositional expressions and proceed to identify possible solutions for automated language systems in the domain of idiomatic expressions.

Abstractive Text Summarization with Application to Bulgarian News Articles
Nikola Taushanov | Ivan Koychev | Preslav Nakov

With the development of the Internet, a huge amount of information is available every day. Therefore, text summarization has become a critical part of our first access to the information. There are two major approaches for automatic text summarization: abstractive and extractive. In this work, we apply abstractive summarization algorithms to a corpus of Bulgarian news articles. In particular, we compare selected algorithms of both techniques and we show results which provide evidence that the selected state-of-the-art algorithms for abstractive text summarization perform better than the extractive ones for articles in Bulgarian. For the purpose of our experiments we collected a new dataset consisting of around 70,000 news articles and their topics. For research purposes we are also sharing the tools to easily collect and process such datasets.

Towards Lexical Meaning Formal Representation by virtue of the NL-DL Definition Transformation Method
Maria Gritz

The paper presents part of an extensive study devoted to the formal representation of lexical meaning in OWL 2 DL notation. Both theoretical and methodological aspects of lexical meaning formalization within the framework of an ontology are examined in the paper. The model-theoretic semantics paradigm and the Kripke model are taken as the theoretical background for the formalization of lexical meaning, whereas the NL-DL definition transformation method is investigated as a method designed to produce acceptable formal definitions in OWL 2 DL notation from natural language definitions given as input. A brief critical study of the method has revealed particular problematic cases in its application, which arise from syntactic peculiarities of the natural language definitions given as input.

Narrow Productivity, Competition, and Blocking in Word Formation
Junya Morita

The present study explores the productivity of word formation processes in English, focusing on word composition by suffixes such as -ize (e.g. transcendentalize), -(a)(t)ion (territorization), and -al (realizational). An optimal productivity measure for affixation is identified, which makes best use of hapax legomena in a large-scale corpus and attaches great importance to the base forms of an affix. This measure is then applied to the data collected from a large corpus to compute the productivity values of twelve kinds of affixes. The detailed investigation reveals that (i) the high productivity rate of an affix demonstrates a creative aspect of the affix, giving full support to the idea of “generative” morphology, (ii) productivity is gradient; very high, fairly high, and low productivity of affixes are recognizable, and (iii) this is necessarily reflected in determining the word form of a derivative (cf. territorization); competition is carried out to decide which affix is selected for a given base form (territorize) and the “losers” (-ment/-al) are blocked out.
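
The abstract does not spell out its productivity measure, so the following is only a rough sketch of one widely used hapax-based measure: the number of hapax legomena carrying an affix divided by the total number of tokens carrying it. The function name `hapax_productivity`, the toy token list, and the crude "ends with the suffix" test are all assumptions made for illustration; the measure described in the paper additionally weights the base forms of an affix, which this sketch does not attempt to reproduce.

```python
from collections import Counter

def hapax_productivity(tokens, suffix):
    """Return (n1, N, P) for word tokens ending in `suffix`.

    n1 = number of suffixed types that occur exactly once (hapax legomena)
    N  = total number of suffixed tokens
    P  = n1 / N, or 0.0 when no token carries the suffix
    """
    # Crude assumption: a word "has" the suffix if it ends with the string.
    counts = Counter(t.lower() for t in tokens if t.lower().endswith(suffix))
    n_tokens = sum(counts.values())
    n_hapax = sum(1 for c in counts.values() if c == 1)
    p = n_hapax / n_tokens if n_tokens else 0.0
    return n_hapax, n_tokens, p

# Toy corpus, invented for illustration only.
toy_tokens = ("realize realize realize organize organize "
              "transcendentalize territorize the of").split()

n1, n, p = hapax_productivity(toy_tokens, "ize")
print(n1, n, round(p, 3))
```

String matching on word endings over-counts (for example, "size" ends in "ize"), so a real study would rely on morphological analysis rather than this shortcut.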

Knowledge and Rule-Based Diacritic Restoration in Serbian
Cvetana Krstev | Ranka Stanković | Duško Vitas

In this paper we present a procedure for the restoration of diacritics in Serbian texts written using the degraded Latin alphabet. The procedure relies on the comprehensive lexical resources for Serbian: the morphological electronic dictionaries, the Corpus of Contemporary Serbian (SrpKor) and local grammars. Dictionaries are used to identify possible candidates for the restoration, while the data obtained from SrpKor and local grammars assist in making a decision between several candidates in cases of ambiguity. The evaluation results reveal that, depending on the text, accuracy ranges from 95.03% to 99.36%, while the precision (average 98.93%) is always higher than the recall (average 94.94%).
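
The two-step idea described above (dictionary lookup to propose candidates, then a second knowledge source to choose among them) can be sketched roughly as follows. Everything here is an assumption for illustration: the tiny lexicon, its made-up frequencies, the `đ` to `dj` convention, and the use of raw frequency as a stand-in for the corpus data and local grammars that the paper actually relies on.

```python
from collections import defaultdict

# Degraded-Latin mapping: each diacritic letter and an assumed ASCII stand-in.
DEGRADE = str.maketrans({"š": "s", "č": "c", "ć": "c", "ž": "z", "đ": "dj",
                         "Š": "S", "Č": "C", "Ć": "C", "Ž": "Z", "Đ": "Dj"})

def degrade(word):
    """Strip diacritics the way a degraded-alphabet text would."""
    return word.translate(DEGRADE)

# Hypothetical toy lexicon: word form -> invented corpus frequency.
TOY_LEXICON = {"šuma": 120, "suma": 40, "kuća": 300, "kuca": 25,
               "reč": 90, "grad": 200}

def build_index(lexicon):
    """Map each degraded spelling to every lexicon form that produces it."""
    index = defaultdict(list)
    for form in lexicon:
        index[degrade(form)].append(form)
    return index

def restore(word, index, lexicon):
    """Return (best_guess, candidates) for one degraded word."""
    candidates = index.get(word, [])
    if not candidates:
        return word, []  # unknown word: leave it unchanged
    # Stand-in disambiguation: pick the most frequent candidate.
    best = max(candidates, key=lambda form: lexicon[form])
    return best, sorted(candidates)

index = build_index(TOY_LEXICON)
for token in ["suma", "kuca", "rec", "grad", "xyz"]:
    print(token, "->", restore(token, index, TOY_LEXICON))
```

Frequency alone always picks the same candidate for an ambiguous spelling such as "suma", which is exactly where context-sensitive resources are needed.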

Perfect Bulgarian Hyphenation and or How not to Stutter at End-of-line
Anton Zinoviev

What is Perfect Bulgarian Hyphenation? We know that it has to be based somehow on the syllables and on the morphology but considering that these two factors often contradict each other, how exactly are we going to combine them? And speaking about syllables, what are they and how are we going to determine them? Also, how are we going to find the morphemes in the words? Don’t we have to develop an electronic derivational dictionary of the Bulgarian language? Isn’t all this going to be forbiddingly difficult?

Russian Bridging Anaphora Corpus
Anna Roitberg | Denis Khachko

In this paper, we present a bridging anaphora corpus for Russian, introduce a syntactic approach for bridging annotation and discuss the difference between the syntactic and semantic approaches. We also discuss some special aspects of bridging annotation for Russian and other languages where definite nominal groups are not marked as frequently as, e.g., in Romance or Germanic languages. Finally, we list the main cases of annotator disagreement.

Aspectual and Temporal Characteristics of the Past Active Participles in Bulgarian – a Corpus-based Study
Ekaterina Tarpomanova

The paper presents a corpus-based study of the past active participles in Bulgarian with respect to their aspectual and temporal characteristics. As this type of participle combines two morphological markers, special attention is paid to their interaction in different tenses, moods and evidentials. The source of language material used for the study is the Bulgarian National Corpus. The paper is organized in terms of morphological oppositions, aspectual and temporal, analyzing the functions of the participles in compound verbal forms.

Unmatched Feminitives in a Corpus of Bulgarian and Ukrainian Parallel Texts
Olena Siruk | Ivan Derzhanski

Feminitives are formed and used in all Slavic languages, but the productivity of their formation and the intensity of their use are not the same everywhere. They are often subject to various intralinguistic and extralinguistic restrictions. In this paper we present a study of feminitives based on a parallel Bulgarian–Ukrainian corpus, with a focus on those occasions on which a feminitive in one language corresponds to a masculine (rarely neuter) noun in the other. The experiment shows that Bulgarian uses feminitives with considerably greater regularity than Ukrainian does, and we discuss the semantic classes of nouns that fail to form feminitives most often and the effect of the source language in translated text and of the author’s and translator’s individual preferences.

The Bulgarian Summaries Corpus
Viktoriya Petrova

This article aims to present the Bulgarian Summaries Corpus, its advantages, its purpose and why it is necessary. It explains the selection of texts, the process of summarization and the tool used, in addition to a quick overview of the current situation in Bulgaria. The paper also presents a general outline of the market needs, the use of tools of this kind and a short list of examples of a variety of corpora around the world, both in language and field.

Ontologies for Natural Language Processing: Case of Russian
Natalia Loukachevitch | Boris Dobrov

The paper describes the RuThes family of Russian thesauri intended for natural language processing and information retrieval applications. RuThes-like thesauri include, besides RuThes, Sociopolitical thesaurus, Security Thesaurus, and Ontology on Natural Sciences and Technologies. The RuThes format is based on three approaches for developing computer resources: Princeton WordNet, information-retrieval thesauri, and formal ontologies. The published version of RuThes thesaurus (RuThes-lite 2.0) became a basis for semi-automatic generation of RuWordNet, a WordNet-like thesaurus for Russian. Currently researchers can use both RuThes-lite and RuWordNet and compare them in applications. Other RuThes-like resources are being prepared for publication.

Resource-based WordNet Augmentation and Enrichment
Ranka Stanković | Miljana Mladenović | Ivan Obradović | Marko Vitas | Cvetana Krstev

In this paper we present an approach to support production of synsets for Serbian WordNet (SerWN) by adjusting Princeton WordNet (PWN) synsets using several bilingual English-Serbian resources. PWN synset definitions were automatically translated and post-edited, if needed, while candidate literals for Serbian synsets were obtained automatically from a list of translational equivalents compiled from bilingual resources. Preliminary results obtained from a set of 1248 selected PWN synsets show that the produced Serbian synsets contain 4024 literals, out of which 2278 were offered by the system we present in this paper, whereas experts added the remaining 1746. Approximately one half of synset definitions obtained automatically were accepted with no or minor corrections. These first results are encouraging, since the efficiency of synset production for SerWN was increased. There is also room for further improvement of this approach to wordnet enrichment.

Classifying Verbs in WordNet by Harnessing Semantic Resources
Svetlozara Leseva | Ivelina Stoyanova | Maria Todorova

This paper presents the principles and procedures involved in the construction of a classification of verbs using information from 3 semantic resources – WordNet, FrameNet and VerbNet. We adopt the FrameNet frames as the primary categories of the proposed classification and transfer them to WordNet synsets. The hierarchical relationships between the categories are projected both from the hypernymy relation in WordNet and from the hierarchy of some of the frame-to-frame relations in FrameNet. The semantic classes and their hierarchical organisation in WordNet are thus made explicit and allow for linguistic generalisations on the inheritance of semantic features and structures. We then select the beginners of the separate hierarchies and assign classification categories recursively to their hyponyms using a battery of procedures based on generalisations over the semantic primes and the hierarchical structure of WordNet and FrameNet and correspondences between VerbNet superclasses and FrameNet frames. The so-obtained suggestions are ranked according to probability. As a result, 13,465 out of 14,206 verb synsets are accommodated in the classification hierarchy at least through a general category, which provides a point of departure towards further refinement of categories. The resulting system of classification categories is initially derived from the WordNet hierarchy and is further validated against the hierarchy of frames within FrameNet. A set of procedures is established to address inconsistencies and heterogeneity of categories. The classification is subject to ongoing extensive manual verification, essential for ensuring the quality of the resource.

A Pilot Study for Enriching the Romanian WordNet with Medical Terms
Maria Mitrofan | Verginica Barbu Mititelu | Grigorina Mitrofan

This paper presents the preliminary investigations in the process of integrating a specialized vocabulary, namely medical terminology, into the Romanian wordnet. We focus here on four classes from this vocabulary: anatomy (or body parts), disorders, medical procedures and chemicals. In this pilot study we selected two large concepts from each class and created the Romanian terminological (sub)trees for each of them, starting from a medical thesaurus (SNOMED CT) and translating the terms, a process which raised various challenges, all of them requiring the expertise of a specialist in the health care domain. The integration of these (sub)trees in the Romanian wordnet also required careful decision making, given the structural differences between a wordnet and a terminological thesaurus. They are presented and discussed herein.

Factors and Features Determining the Inheritance of Semantic Primes between Verbs and Nouns within WordNet
Ivelina Stoyanova

The paper outlines the mechanisms of inheriting semantic content between verbs and nouns as a result of derivational relations. The main factors determining the inheritance are: (1) the semantic class of the verb as represented by the noun; (2) the subcategorisation frame and argument structure of the verb predicate; (3) the derivational relation between the verb and the noun, as well as the resulting semantic relation made explicit through the derivation; (4) hierarchical relations within WordNet. The paper explores three types of verb-noun prime inheritance relations: (a) universal – not depending on the argument structure, which are eventive or circumstantial; (b) general – specific to classes of verbs, for example agentive or non-agentive; (c) verb-specific – depending on the specific subcategorisation frame of the verb as presented in VerbNet and/or FrameNet. The paper presents a possibility for extended coverage of semantic relations based on information about the argument structure of verbs. Further, the work focuses on the regularities in the way in which derivationally related nouns inherit semantic characteristics of the predicate. These regularities can be applied for the purpose of predicting derivationally and semantically related synsets within WordNet, as well as for the creation of language specific synsets, for consistency checks and verification.

Online Editor for WordNets
Borislav Rizov | Tsvetana Dimitrova

The paper presents an online editor for lexical-semantic databases with relational structure similar to the structure of WordNet – Hydra for Web. It supports functionalities for editing of relational data (including query, creation, change, and linking of relational objects), simultaneous access of multiple user profiles, parallel data visualization and editing of the data on top of single- and parallel mode visualization of the language data.

The Effect of Unobserved Word-Context Co-occurrences on a VectorMixture Approach for Compositional Distributional Semantics
Amir Bakarov

Swivel (Submatrix-WIse Vector Embedding Learner) is a distributional semantic model based on counting point-wise mutual information values, capable of capturing word-context co-occurrences in the PMI matrix that were not noted in the training corpus. This model outperforms mainstream word embedding training algorithms such as Continuous Bag-of-Words, GloVe and Skip-Gram in word similarity and word analogy tasks. However, the appropriateness of these intrinsic tasks can be questioned, and it is unclear whether the ability to count unobservable word-context co-occurrences is also helpful for downstream tasks. In this work we propose a comparison of Word2Vec and Swivel for two downstream tasks based on natural language sentence matching: the paraphrase detection task and the textual entailment task. As a result, we reveal that Swivel outperforms Word2Vec in both cases, but the difference is minuscule. We conclude that the ability to learn embeddings for rarely co-occurring words is not so crucial for downstream tasks.
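
To make the notion of an unobserved co-occurrence concrete, here is a minimal sketch of plain point-wise mutual information, PMI(w, c) = log(p(w, c) / (p(w) p(c))), estimated from raw pair counts. It is not Swivel itself: the function name `pmi_table` and the toy word-context pairs are assumptions for illustration. It only shows that a pair never seen in the corpus has a joint count of zero, so its PMI is minus infinity, which is the case the abstract refers to.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Build a PMI lookup from a list of (word, context) observations."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    total = len(pairs)

    def pmi(word, ctx):
        joint = pair_counts[(word, ctx)]
        if joint == 0:
            # Unobserved pair: log(0) is undefined, reported here as -inf.
            return float("-inf")
        return math.log((joint * total) / (word_counts[word] * ctx_counts[ctx]))

    return pmi

# Toy observations, invented for illustration only.
toy_pairs = [("cat", "purrs"), ("cat", "purrs"), ("cat", "sleeps"),
             ("dog", "barks"), ("dog", "sleeps"), ("dog", "barks")]

pmi = pmi_table(toy_pairs)
print(pmi("cat", "purrs"))   # observed more often than chance: positive
print(pmi("dog", "sleeps"))  # observed exactly as often as chance predicts
print(pmi("cat", "barks"))   # never observed together
```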

Introducing Computational Linguistics and NLP to High School Students
Rositsa Dekova | Adelina Radeva

The paper addresses a possible way of introducing core concepts of Computational Linguistics through problems given at the linguistic contests organized for high school students in Bulgaria and abroad. Following a brief presentation of the foundation and the underlying objective of these contests, we outline some of the types of problems as reflecting the different levels of language processing and the diversity of approaches and tasks to be solved. By presenting the variety of problems given so far through the years, we would like to attract the attention of the academic community to this captivating method through which high school students might be acquainted with the challenges and the main goals of Computational Linguistics (CL) and Natural Language Processing (NLP).

Linguistic Problems on Number Names
Ivan Derzhanski | Milena Veneva

This paper presents a contrastive investigation of linguistic problems based on number names in different languages and intended for secondary-school students. We examine the eight problems of this type that have been assigned at the International Linguistics Olympiad throughout the years and compare the phenomena in the number systems featured there with those of the working languages of the Olympiad and other languages known to be familiar to the participants. On the basis of a statistical analysis of the results achieved by the contestants we draw conclusions regarding the ways in which the difficulty of a problem depends on its structure and the kinds of linguistic phenomena featured in it.

Parallel Web Display of Transcribed Spoken Bulgarian with its Normalised Version and an Indexed List of Lemmas
Marina Dzhonova | Kjetil Røa Hauge | Yovka Tisheva

We present and discuss problems in creating a lemmatised index to transcriptions of Bulgarian speech, including the prerequisites for such an index, and why we consider an index preferable to a search engine for this particular kind of text.

Integrating Crowdsourcing in Language Learning
Georgi Dzhumayov

This article aims to illustrate the use of crowdsourcing in an educational context. The practical part presents the results of an online test conducted among 12th grade high school students from Bulgaria in order to gain new knowledge, find out common characteristics among the tenses and revise for their upcoming exams. These results, along with some interesting and inspiring teaching ideas, could be used in an educational environment to provide easier, quicker and more interactive acquisition of a language. The experiment was conducted by means of Google Forms and marks the beginning of an annotated corpus of right and wrong uses of the Bulgarian and English tenses.

Bulgarian–English Parallel Corpus for the Purposes of Creating Statistical Translation Model of the Verb Forms. General Conception, Structure, Resources and Annotation
Todor Lazarov

This paper describes the process of creating a Bulgarian-English parallel corpus for the purposes of constructing a statistical translation model for verb forms in both languages. We briefly introduce the scientific problem behind the corpus, its main purpose, general conception, linguistic resources and annotation conception. In more detail, we describe the collection of language data for the purposes of creating the corpus, the preparatory processing of the gathered data, the annotation rules based on the characteristics of the gathered data and the chosen software. We discuss the current work on the training model and the future work on this linguistic resource and the aims of the scientific project.

Fingerprints in SMS messages: Automatic Recognition of a Short Message Sender Using Gradient Boosting
Branislava Šandrih

This paper considers the following question: Is it possible to tell who sent a short message just by analyzing the sender’s typing style, and not the meaning of the content itself? If possible, how reliable would the judgment be? Are we leaving some kind of “fingerprint” when we text, and can we tell something about others based just on their typing style? For this purpose, a corpus of ∼5,500 SMS messages was gathered from one person’s cell phone and two gradient boosting classifiers were built: the first tries to distinguish whether a message was sent by this exact person (the cell phone owner) or by someone else; the second was trained to distinguish between messages sent by a public service (e.g. parking service, bank reports etc.) and messages sent by humans. The performance of the classifiers was evaluated in a 5-fold cross-validation setting, resulting in 73.6% and 99.3% overall accuracy for the first and the second classifier, respectively.
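
The evaluation protocol mentioned above, 5-fold cross-validation with overall accuracy, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's setup: the helper names, the toy labels, and the majority-class "classifier" standing in for gradient boosting are all hypothetical, and the folds are contiguous rather than shuffled or stratified.

```python
from collections import Counter

def k_fold_indices(n_items, k=5):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(features, labels, train_fn, k=5):
    """Train on k-1 folds, test on the held-out fold, pool the accuracy."""
    correct = 0
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(model(features[i]) == labels[i] for i in held_out)
    return correct / len(labels)

def train_majority(train_x, train_y):
    """Placeholder learner: always predict the most common training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return lambda x: majority

# Toy data, invented for illustration only.
toy_labels = ["owner", "other", "other", "owner", "other",
              "other", "other", "owner", "other", "other"]
toy_features = list(range(len(toy_labels)))

accuracy = cross_validate(toy_features, toy_labels, train_majority, k=5)
print(accuracy)
```

Any learner exposing the same train-then-predict shape could be passed in place of `train_majority`; with real data the items would normally be shuffled before splitting.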