Ivelina Stoyanova


2024

Multiword Expressions between the Corpus and the Lexicon: Universality, Idiosyncrasy, and the Lexicon-Corpus Interface
Verginica Barbu Mititelu | Voula Giouli | Kilian Evang | Daniel Zeman | Petya Osenova | Carole Tiberius | Simon Krek | Stella Markantonatou | Ivelina Stoyanova | Ranka Stanković | Christian Chiarcos
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

We present ongoing work towards defining a lexicon-corpus interface to serve as a benchmark in the representation of multiword expressions (of various parts of speech) in dedicated lexica and the linking of these entries to their corpus occurrences. The final aim is the harnessing of such resources for the automatic identification of multiword expressions in a text. The involvement of several natural languages aims at the universality of a solution not centered on a particular language, and also accommodating idiosyncrasies. Challenges in the lexicographic description of multiword expressions are discussed, the current status of lexica dedicated to this linguistic phenomenon is outlined, as well as the solution we envisage for creating an ecosystem of interlinked lexica and corpora containing and, respectively, annotated with multiword expressions.

Semantic features in the automatic analysis of verbs of creation in Bulgarian and English
Ivelina Stoyanova
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)

The paper focuses on the semantic class of verbs of creation as a subclass of dynamic verbs. The objective is to present the description of creation verbs in terms of their corresponding semantic frames and to outline the semantic features of the frame elements with a view to their automatic identification and analysis in text. The observations are performed on Bulgarian and English data with the aim of establishing the language-independent and language-specific features in the semantic description of the analysed class of verbs.

Multilingual Corpus of Illustrative Examples on Activity Predicates
Ivelina Stoyanova | Hristina Kukova | Maria Todorova | Tsvetana Dimitrova
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)

The paper presents the ongoing process of compilation of a multilingual corpus of illustrative examples to supplement our work on the syntactic and semantic analysis of predicates representing activities in Bulgarian and other languages. The corpus aims to include over 1,000 illustrative examples on verbs from six semantic classes of predicates (verbs of motion, contact, consumption, creation, competition and bodily functions) which provide a basis for observations on the specificity of their realisation. The corpus of illustrative examples will be used for contrastive studies and further elaboration on the scope and behaviour of activity verbs in general, as well as their semantic subclasses.

2023

PARSEME corpus release 1.3
Agata Savary | Cherifa Ben Khelil | Carlos Ramisch | Voula Giouli | Verginica Barbu Mititelu | Najet Hadj Mohamed | Cvetana Krstev | Chaya Liebeskind | Hongzhi Xu | Sara Stymne | Tunga Güngör | Thomas Pickard | Bruno Guillaume | Eduard Bejček | Archna Bhatia | Marie Candito | Polona Gantar | Uxoa Iñurrieta | Albert Gatt | Jolanta Kovalevskaite | Timm Lichte | Nikola Ljubešić | Johanna Monti | Carla Parra Escartín | Mehrnoush Shamsfard | Ivelina Stoyanova | Veronika Vincze | Abigail Walsh
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)

We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus now represents 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.

Expanding the Conceptual Description of Verbs in WordNet with Semantic and Syntactic Information
Ivelina Stoyanova | Svetlozara Leseva
Proceedings of the 12th Global Wordnet Conference

This paper describes an ongoing effort towards expanding the semantic and conceptual description of verbs in WordNet by combining information from two other resources, FrameNet and VerbNet, as well as enriching the verbs’ description with syntactic patterns extracted from the three resources. The conceptual description of verb synsets is provided by assigning a FrameNet frame which provides the relevant set of frame elements denoting the predicate’s participants and props. This information is supplemented by assigning a VerbNet class and the set of semantic roles associated with it. The information extracted from FrameNet and VerbNet and assigned to a synset is aligned (semi-automatically with subsequent manual corrections) at the following levels: (i) FrameNet frame: VerbNet class; (ii) FrameNet frame elements: VerbNet semantic roles; (iii) FrameNet semantic types and restrictions: VerbNet selectional restrictions. We then link the syntactic patterns associated with the units in FrameNet, VerbNet and WordNet, by unifying their representation and by matching the corresponding patterns at the level of syntactic groups. The alignment of the semantic components and their syntactic realisations is essential for the better exploitation of the abundance of information across resources, including shedding light on cross-resource similarities, discrepancies and inconsistencies. The syntactic patterns can facilitate the extraction of examples illustrating the use of verb synset literals in corpora and their semantic characterisation through the association of the syntactic groups with the components of semantic description (frame elements or semantic roles) and can be employed in various tasks requiring semantic and syntactic description. The resource is publicly available to the community. The components of the conceptual description are visualised showing the links to the original resources each component is drawn from.

2022

Multilingual Image Corpus – Towards a Multimodal and Multilingual Dataset
Svetla Koeva | Ivelina Stoyanova | Jordan Kralev
Proceedings of the Thirteenth Language Resources and Evaluation Conference

One of the processing tasks for large multimodal data streams is automatic image description (image classification, object segmentation and classification). Although the number and the diversity of image datasets are constantly expanding, there is still a huge demand for more datasets in terms of the variety of domains and object classes covered. The goal of the project Multilingual Image Corpus (MIC 21) is to provide a large image dataset with annotated objects and object descriptions in 24 languages. The Multilingual Image Corpus consists of an Ontology of visual objects (based on WordNet) and a collection of thematically related images whose objects are annotated with segmentation masks and labels describing the ontology classes. The dataset is designed both for image classification and object detection and for semantic segmentation. The main contributions of our work are: a) the provision of a large collection of high quality copyright-free images; b) the formulation of the Ontology of visual objects based on WordNet noun hierarchies; c) the precise manual correction of automatic object segmentation within the images and the annotation of object classes; and d) the association of objects and images with extended multilingual descriptions based on WordNet inner- and interlingual relations. The dataset can also be used for multilingual image caption generation, image-to-text alignment and automatic question answering for images and videos.

WordNet-Based Bulgarian Sign Language Dictionary of Crisis Management Terminology
Slavina Lozanova | Ivelina Stoyanova
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

This paper presents an online Bulgarian sign language dictionary covering terminology related to crisis management. The pressing need for such a resource became evident during the COVID pandemic when critical information regarding government measures was delivered on a regular basis to the public, including Deaf citizens. The dictionary is freely available on the internet and is aimed at the Deaf, sign language interpreters, learners of sign language, social workers and the general public. Each dictionary entry is supplied with synonyms in spoken Bulgarian, a definition, one or more signs corresponding to the concept in Bulgarian sign language, additional information about derivationally related words and similar signs with different meanings, as well as links to translations in other languages, including American sign language.

Linked Resources towards Enhancing the Conceptual Description of General Lexis Verbs Using Syntactic Information
Svetlozara Leseva | Ivelina Stoyanova
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

214–224

2021

Semantic Analysis of Verb-Noun Derivation in Princeton WordNet
Verginica Mititelu | Svetlozara Leseva | Ivelina Stoyanova
Proceedings of the 11th Global Wordnet Conference

We present here the results of a morphosemantic analysis of the verb-noun pairs in the Princeton WordNet as reflected in the standoff file containing pairs annotated with a set of 14 semantic relations. We have automatically distinguished between zero-derivation and affixal derivation in the data and identified the affixes and manually checked the results. The data show that for each semantic relation an affix prevails in creating new words, although we cannot talk about their specificity with respect to such a relation. Moreover, certain pairs of verb-noun semantic primes are better represented for each semantic relation, and some semantic clusters (in the form of WordNet subtrees) take shape as a result. We thus employ a large-scale data-driven linguistically motivated analysis afforded by the rich derivational and morphosemantic description in WordNet to the end of capturing finer regularities in the process of derivation as represented in the semantic properties of the words involved and as reflected in the structure of the lexicon.

2020

It Takes Two to Tango – Towards a Multilingual MWE Resource
Svetlozara Leseva | Verginica Barbu Mititelu | Ivelina Stoyanova
Proceedings of the Fourth International Conference on Computational Linguistics in Bulgaria (CLIB 2020)

Mature wordnets offer the opportunity of digging out interesting linguistic information otherwise not explicitly marked in the network. The focus in this paper is on the ways the results already obtained at two levels, derivation and multiword expressions, may be further employed. The parallel recent development of the two resources under discussion, the Bulgarian and the Romanian wordnets, has enabled interlingual analyses that reveal similarities and differences between the linguistic knowledge encoded in the two wordnets. In this paper we show how the resources developed and the knowledge gained are put together towards devising a linked MWE resource that is informed by layered dictionary representation and corpus annotation and analysis. This work is a proof of concept for the adopted method of compiling a multilingual MWE resource on the basis of information extracted from the Bulgarian, the Romanian and the Princeton wordnet, as well as additional language resources and automatic procedures.

Consistency Evaluation towards Enhancing the Conceptual Representation of Verbs in WordNet
Svetlozara Leseva | Ivelina Stoyanova
Proceedings of the Fourth International Conference on Computational Linguistics in Bulgaria (CLIB 2020)

This paper outlines the process of enhancing the conceptual description of verb synsets in WordNet using FrameNet frames. On the one hand, we expand the coverage of the mapping between WordNet and FrameNet, while on the other, we improve the quality of the mapping using a set of consistency checks and verification procedures. The procedures include an automatic identification of potential inconsistencies and imbalanced relations, as well as suggestions for a more precise frame assignment followed by manual validation. We perform an evaluation of the procedures in terms of the quality of the suggestions measured as the potential improvement in precision and coverage, the relevance of the result and the efficiency of the procedure.

2019

Enhancing Conceptual Description through Resource Linking and Exploration of Semantic Relations
Ivelina Stoyanova | Svetlozara Leseva
Proceedings of the 10th Global Wordnet Conference

The paper presents current efforts towards linking two large lexical semantic resources – WordNet and FrameNet – to the end of their mutual enrichment and the facilitation of the access, extraction and analysis of various types of semantic and syntactic information. In the second part of the paper, we go on to examine the relation of inheritance and other semantic relations as represented in WordNet and FrameNet and how they correspond to each other when the resources are aligned. We discuss the implications with respect to the enhancement of the two resources through the definition of new relations and the detailisation of conceptual frames.

Hear about Verbal Multiword Expressions in the Bulgarian and the Romanian Wordnets Straight from the Horse’s Mouth
Verginica Barbu Mititelu | Ivelina Stoyanova | Svetlozara Leseva | Maria Mitrofan | Tsvetana Dimitrova | Maria Todorova
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

In this paper we focus on verbal multiword expressions (VMWEs) in Bulgarian and Romanian as reflected in the wordnets of the two languages. The annotation of VMWEs relies on the classification defined within the PARSEME COST Action. After outlining the properties of various types of VMWEs, a cross-language comparison is drawn, aimed at highlighting the similarities and the differences between Bulgarian and Romanian with respect to the lexicalization and distribution of VMWEs. The contribution of this work is in outlining essential features of the description and classification of VMWEs and the cross-language comparison at the lexical level, which is essential for the understanding of the need for uniform annotation guidelines and a viable procedure for validation of the annotation.

Structural Approach to Enhancing WordNet with Conceptual Frame Semantics
Svetlozara Leseva | Ivelina Stoyanova
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper outlines procedures for enhancing WordNet with conceptual information from FrameNet. The mapping of the two resources is non-trivial. We define a number of techniques for the validation of the consistency of the mapping and the extension of its coverage which make use of the structure of both resources and the systematic relations between synsets in WordNet and between frames in FrameNet, as well as between synsets and frames. We present a case study on causativity, a relation which provides enhancement complementary to the one using hierarchical relations, by means of linking in a systematic way large parts of the lexicon. We show how consistency checks and denser relations may be implemented on the basis of this relation. We then propose new frames based on causative-inchoative correspondences and in conclusion touch on the possibilities for defining new frames based on the types of specialisation that take place from parent to child synset.

2018

Classifying Verbs in WordNet by Harnessing Semantic Resources
Svetlozara Leseva | Ivelina Stoyanova | Maria Todorova
Proceedings of the Third International Conference on Computational Linguistics in Bulgaria (CLIB 2018)

This paper presents the principles and procedures involved in the construction of a classification of verbs using information from 3 semantic resources – WordNet, FrameNet and VerbNet. We adopt the FrameNet frames as the primary categories of the proposed classification and transfer them to WordNet synsets. The hierarchical relationships between the categories are projected both from the hypernymy relation in WordNet and from the hierarchy of some of the frame-to-frame relations in FrameNet. The semantic classes and their hierarchical organisation in WordNet are thus made explicit and allow for linguistic generalisations on the inheritance of semantic features and structures. We then select the beginners of the separate hierarchies and assign classification categories recursively to their hyponyms using a battery of procedures based on generalisations over the semantic primes and the hierarchical structure of WordNet and FrameNet and correspondences between VerbNet superclasses and FrameNet frames. The so-obtained suggestions are ranked according to probability. As a result, 13,465 out of 14,206 verb synsets are accommodated in the classification hierarchy at least through a general category, which provides a point of departure towards further refinement of categories. The resulting system of classification categories is initially derived from the WordNet hierarchy and is further validated against the hierarchy of frames within FrameNet. A set of procedures is established to address inconsistencies and heterogeneity of categories. The classification is subject to ongoing extensive manual verification, essential for ensuring the quality of the resource.
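
The recursive assignment of categories from hierarchy beginners to their hyponyms can be illustrated with a minimal Python sketch. It is not the procedure from the paper, which also ranks suggestions and draws on VerbNet and FrameNet structure; it only shows plain top-down inheritance, where a synset keeps its own category if it has one and otherwise inherits the nearest ancestor's. The function name `propagate`, the synset identifiers and the category labels are all hypothetical.

```python
def propagate(root, hyponyms, assigned, inherited=None):
    """Assign a category to `root` and, recursively, to all of its hyponyms.

    A synset keeps its own category when one is already assigned;
    otherwise it inherits the category passed down from its hypernym.
    """
    category = assigned.get(root, inherited)
    result = {root: category}
    for child in hyponyms.get(root, []):
        result.update(propagate(child, hyponyms, assigned, category))
    return result


# Hypothetical toy hierarchy: "make" is the beginner of the tree.
hyponyms = {"make": ["build", "write"], "write": ["draft"]}
# Hypothetical categories already known for some synsets.
assigned = {"make": "Creating", "build": "Building"}

categories = propagate("make", hyponyms, assigned)
```

In this toy run, `write` and `draft` receive the general category of the beginner, while `build` keeps its more specific one, which mirrors the idea of accommodating every synset at least through a general category.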

Factors and Features Determining the Inheritance of Semantic Primes between Verbs and Nouns within WordNet
Ivelina Stoyanova
Proceedings of the Third International Conference on Computational Linguistics in Bulgaria (CLIB 2018)

The paper outlines the mechanisms of inheriting semantic content between verbs and nouns as a result of derivational relations. The main factors determining the inheritance are: (1) the semantic class of the verb as represented by the noun; (2) the subcategorisation frame and argument structure of the verb predicate; (3) the derivational relation between the verb and the noun, as well as the resulting semantic relation made explicit through the derivation; (4) hierarchical relations within WordNet. The paper explores three types of verb-noun prime inheritance relations: (a) universal – not depending on the argument structure, which are eventive or circumstantial; (b) general – specific to classes of verbs, for example agentive or non-agentive; (c) verb-specific – depending on the specific subcategorisation frame of the verb as presented in VerbNet and/or FrameNet. The paper presents a possibility for extended coverage of semantic relations based on information about the argument structure of verbs. Further, the work focuses on the regularities in the way in which derivationally related nouns inherit semantic characteristics of the predicate. These regularities can be applied for the purpose of predicting derivationally and semantically related synsets within WordNet, as well as for the creation of language-specific synsets, for consistency checks and verification.

Edition 1.1 of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions
Carlos Ramisch | Silvio Ricardo Cordeiro | Agata Savary | Veronika Vincze | Verginica Barbu Mititelu | Archna Bhatia | Maja Buljan | Marie Candito | Polona Gantar | Voula Giouli | Tunga Güngör | Abdelati Hawwari | Uxoa Iñurrieta | Jolanta Kovalevskaitė | Simon Krek | Timm Lichte | Chaya Liebeskind | Johanna Monti | Carla Parra Escartín | Behrang QasemiZadeh | Renata Ramisch | Nathan Schneider | Ivelina Stoyanova | Ashwini Vaidya | Abigail Walsh
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.

2017

The PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions
Agata Savary | Carlos Ramisch | Silvio Cordeiro | Federico Sangati | Veronika Vincze | Behrang QasemiZadeh | Marie Candito | Fabienne Cap | Voula Giouli | Ivelina Stoyanova | Antoine Doucet
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

Multiword expressions (MWEs) are known as a “pain in the neck” for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one’s heart or to turn off, have been rarely modelled. This is notably due to their syntactic variability, which hinders treating them as “words with spaces”. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.

2016

Towards the Automatic Identification of Light Verb Constructions in Bulgarian
Ivelina Stoyanova | Svetlozara Leseva | Maria Todorova
Proceedings of the Second International Conference on Computational Linguistics in Bulgaria (CLIB 2016)

This paper presents work in progress focused on developing a method for automatic identification of light verb constructions (LVCs) as a subclass of Bulgarian verbal MWEs. The method is based on machine learning and is trained on a set of LVCs extracted from the Bulgarian WordNet (BulNet) and the Bulgarian National Corpus (BulNC). The machine learning uses lexical, morphosyntactic, syntactic and semantic features of LVCs. We trained and tested two separate classifiers using the Java package Weka and two decision tree learning algorithms, J48 and RandomTree. The evaluation of the method includes 10-fold cross-validation on the training data from BulNet (F1 = 0.766 obtained by the J48 decision tree algorithm and F1 = 0.725 by the RandomTree algorithm), as well as evaluation of the performance on new instances from the BulNC (F1 = 0.802 by J48 and F1 = 0.607 by the RandomTree algorithm). Preliminary filtering of the candidates gives a slight improvement (F1 = 0.802 by J48 and F1 = 0.737 by RandomTree).
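
For readers unfamiliar with the metric, the F1 scores quoted in this abstract are the harmonic mean of precision and recall. The following Python sketch shows how such a score is computed from gold and predicted labels. It is an illustration only and not the Weka-based setup used in the paper; the function name `f1_score` and the toy labels are assumptions.

```python
def f1_score(gold, predicted, positive=1):
    """Return F1 for the positive class given two parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = candidate is an LVC, 0 = it is not.
gold = [1, 1, 1, 0, 0, 1]
pred = [1, 0, 1, 1, 0, 1]
score = f1_score(gold, pred)
```

Here three candidates are correctly identified, one is a false positive and one is missed, so precision and recall are both 0.75 and so is F1.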

Quotation Retrieval System for Bulgarian Media Content
Svetla Koeva | Ivelina Stoyanova | Martin Yalamov
Proceedings of the Second International Conference on Computational Linguistics in Bulgaria (CLIB 2016)

This paper presents a method for automatic retrieval and attribution of quotations from media texts in Bulgarian. It involves recognition of report verbs (including their analytical forms) and syntactic patterns introducing quotations, as well as source attribution of the quote by identification of personal names, descriptors, and anaphora. The method is implemented in a fully-functional online system which offers a live service processing media content and extracting quotations on a daily basis. The system collects and processes written news texts from six Bulgarian media websites. The results are presented in a structured way with description, as well as sorting and filtering functionalities which facilitate the monitoring and analysis of media content. The method has been applied to extract quotations from English texts as well and can be adapted to work with other languages, provided that the respective language specific resources are supplied.

Automatic Prediction of Morphosemantic Relations
Svetla Koeva | Svetlozara Leseva | Ivelina Stoyanova | Tsvetana Dimitrova | Maria Todorova
Proceedings of the 8th Global WordNet Conference (GWC)

This paper presents a machine learning method for automatic identification and classification of morphosemantic relations (MSRs) between verb and noun synset pairs in the Bulgarian WordNet (BulNet). The core training data comprise 6,641 morphosemantically related verb–noun literal pairs from BulNet. The core dataset was preprocessed quality-wise by applying validation and reorganisation procedures. Further, the data were supplemented with negative examples of literal pairs not linked by an MSR. The designed supervised machine learning method uses the RandomTree algorithm and is implemented in Java with the Weka package. A set of experiments was performed to test various approaches to the task. Future work on improving the classifier includes adding more training data, employing more features, and fine-tuning. Apart from the language-specific information about derivational processes, the proposed method is language independent.

2015

Automatic Classification of WordNet Morphosemantic Relations
Svetlozara Leseva | Ivelina Stoyanova | Maria Todorova | Tsvetana Dimitrova | Borislav Rizov | Svetla Koeva
The 5th Workshop on Balto-Slavic Natural Language Processing

2014

Automatic Semantic Filtering of Morphosemantic Relations in WordNet
Svetlozara Leseva | Ivelina Stoyanova | Borislav Rizov | Maria Todorova | Ekaterina Tarpomanova
Proceedings of the First International Conference on Computational Linguistics in Bulgaria (CLIB 2014)

In this paper we present a method for automatic assignment of morphosemantic relations between derivationally related verb–noun pairs of synsets in the Bulgarian WordNet (BulNet) and for semantic filtering of those relations. The filtering process relies on the meaning of noun suffixes and the semantic compatibility of verb and noun taxonomic classes. We use the taxonomic labels assigned to all the synsets in the Princeton WordNet (PWN) – one label per synset – which denote their general semantic class. In the first iteration we employ the pairs <noun suffix : noun label> to filter out part of the relations. In the second iteration, which uses as input the output of the first one, we apply a stronger semantic filter. It makes use of the taxonomic labels of the noun-verb synset pairs observed for a given morphosemantic relation. In this way we manage to reliably filter out impossible or unlikely combinations. The results of the performed experiment may be applied to enrich BulNet with morphosemantic relations and new synsets semi-automatically, while facilitating the manual work and reducing its cost.
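
The two filtering iterations described in this abstract can be sketched in Python as two successive passes over candidate relations. This is a simplified illustration under stated assumptions, not the authors' implementation: the suffixes, taxonomic labels, relation names and the function names `first_pass` and `second_pass` are hypothetical, and the allowed combinations are hard-coded here rather than observed in data.

```python
# Hypothetical <noun suffix : noun label> pairs considered compatible.
SUFFIX_LABELS = {("-tel", "noun.person"), ("-ne", "noun.act")}

# Hypothetical <verb label : noun label> pairs allowed per relation.
RELATION_LABELS = {
    "agent": {("verb.creation", "noun.person")},
    "event": {("verb.creation", "noun.act")},
}


def first_pass(candidates):
    """Keep candidates whose noun suffix is compatible with the noun label."""
    return [c for c in candidates
            if (c["suffix"], c["noun_label"]) in SUFFIX_LABELS]


def second_pass(candidates):
    """Keep candidates whose verb/noun labels are allowed for the relation."""
    return [c for c in candidates
            if (c["verb_label"], c["noun_label"])
            in RELATION_LABELS.get(c["relation"], set())]


# Hypothetical candidate relations between verb and noun synsets.
candidates = [
    {"suffix": "-tel", "noun_label": "noun.person",
     "verb_label": "verb.creation", "relation": "agent"},
    {"suffix": "-tel", "noun_label": "noun.act",
     "verb_label": "verb.creation", "relation": "agent"},
    {"suffix": "-ne", "noun_label": "noun.act",
     "verb_label": "verb.creation", "relation": "agent"},
]

# The second pass takes the output of the first pass as its input.
after_first = first_pass(candidates)
after_second = second_pass(after_first)
```

In this toy example the second candidate is rejected by the suffix check, and the third survives the first pass but is rejected by the stronger label-pair check, leaving only the first candidate.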

Automatic Categorisation of Multiword Expressions and Named Entities in Bulgarian
Ivelina Stoyanova
Proceedings of the First International Conference on Computational Linguistics in Bulgaria (CLIB 2014)

This paper describes an approach for automatic categorisation of various types of multiword expressions (MWEs) with a focus on multiword named entities (MNEs), which compose a large portion of MWEs in general. The proposed algorithm is based on a refined classification of MWEs according to their idiomaticity. While MWE categorisation can be considered as a separate and independent task, it complements the general task of MWE recognition. After outlining the method, we set up an experiment to demonstrate its performance. We use the corpus Wiki1000+ that comprises 6,311 annotated Wikipedia articles of 1,000 or more words each, amounting to 13.4 million words in total. The study also employs a large dictionary of 59,369 MWE noun phrases (out of more than 85,000 MWEs), labelled with their respective types. The dictionary is compiled automatically and verified semi-automatically. The research presented here is based on Bulgarian, although most of the ideas, the methodology and the analysis are applicable to other Slavic and possibly other European languages.

2013

Wordnet-Based Cross-Language Identification of Semantic Relations
Ivelina Stoyanova | Svetla Koeva | Svetlozara Leseva
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Text Modification for Bulgarian Sign Language Users
Slavina Lozanova | Ivelina Stoyanova | Svetlozara Leseva | Svetla Koeva | Boian Savtchev
Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations

2012

Application of Clause Alignment for Statistical Machine Translation
Svetla Koeva | Svetlozara Leseva | Ivelina Stoyanova | Rositsa Dekova | Angel Genov | Borislav Rizov | Tsvetana Dimitrova | Ekaterina Tarpomanova | Hristina Kukova
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

Bulgarian X-language Parallel Corpus
Svetla Koeva | Ivelina Stoyanova | Rositsa Dekova | Borislav Rizov | Angel Genov
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper presents the methodology and the outcome of the compilation and the processing of the Bulgarian X-language Parallel Corpus (Bul-X-Cor) which was integrated as part of the Bulgarian National Corpus (BulNC). We focus on building representative parallel corpora which include a diversity of domains and genres, reflect the relations between Bulgarian and other languages and are consistent in terms of compilation methodology, text representation, metadata description and annotation conventions. The approaches implemented in the construction of Bul-X-Cor include using readily available text collections on the web, manual compilation (by means of Internet browsing) and preferably automatic compilation (by means of web crawling, both general and focused). Certain levels of annotation applied to Bul-X-Cor are taken as obligatory (sentence segmentation and sentence alignment), while others depend on the availability of tools for a particular language (morpho-syntactic tagging, lemmatisation, syntactic parsing, named entity recognition, word sense disambiguation, etc.) or for a particular task (word and clause alignment). To achieve uniformity of the annotation we have either annotated raw data from scratch or transformed the already existing annotation to follow the conventions accepted for BulNC. Finally, actual uses of the corpora are presented and conclusions are drawn with respect to future work.