2024
Multiword Expressions between the Corpus and the Lexicon: Universality, Idiosyncrasy, and the Lexicon-Corpus Interface
Verginica Barbu Mititelu | Voula Giouli | Kilian Evang | Daniel Zeman | Petya Osenova | Carole Tiberius | Simon Krek | Stella Markantonatou | Ivelina Stoyanova | Ranka Stanković | Christian Chiarcos
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
We present ongoing work towards defining a lexicon-corpus interface to serve as a benchmark in the representation of multiword expressions (of various parts of speech) in dedicated lexica and the linking of these entries to their corpus occurrences. The final aim is the harnessing of such resources for the automatic identification of multiword expressions in a text. The involvement of several natural languages aims at the universality of a solution not centered on a particular language, and also accommodating idiosyncrasies. Challenges in the lexicographic description of multiword expressions are discussed, the current status of lexica dedicated to this linguistic phenomenon is outlined, as well as the solution we envisage for creating an ecosystem of interlinked lexica and corpora containing and, respectively, annotated with multiword expressions.
UniDive: A COST Action on Universality, Diversity and Idiosyncrasy in Language Technology
Agata Savary | Daniel Zeman | Verginica Barbu Mititelu | Anabela Barreiro | Olesea Caftanatov | Marie-Catherine de Marneffe | Kaja Dobrovoljc | Gülşen Eryiğit | Voula Giouli | Bruno Guillaume | Stella Markantonatou | Nurit Melnik | Joakim Nivre | Atul Kr. Ojha | Carlos Ramisch | Abigail Walsh | Beata Wójtowicz | Alina Wróblewska
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
This paper presents the objectives, organization and activities of the UniDive COST Action, a scientific network dedicated to universality, diversity and idiosyncrasy in language technology. We describe the objectives and organization of this initiative, the people involved, the working groups and the ongoing tasks and activities. This paper is also an open call for participation addressed to new members and countries.
Evaluating Large Language Models for Linguistic Linked Data Generation
Maria Pia di Buono | Blerina Spahiu | Verginica Barbu Mititelu
Proceedings of the Workshop on Deep Learning and Linked Data (DLnLD) @ LREC-COLING 2024
Large language models (LLMs) have revolutionized human-machine interaction with their ability to converse and perform various language tasks. This study investigates the potential of LLMs for knowledge formalization using well-defined vocabularies, specifically focusing on OntoLex-Lemon. As a preliminary exploration, we test four languages (English, Italian, Albanian, Romanian) and analyze the formalization quality of nine words with varying characteristics applying a multidimensional evaluation approach. While manual validation provided initial insights, it highlights the need for developing scalable evaluation methods for future large-scale experiments. This research aims to initiate a discussion on the potential and challenges of utilizing LLMs for knowledge formalization within the Semantic Web framework.
Building a corpus for the anonymization of Romanian jurisprudence
Vasile Păiș | Dan Tufis | Elena Irimia | Verginica Barbu Mititelu
Proceedings of The 18th Linguistic Annotation Workshop (LAW-XVIII)
Access to jurisprudence is of paramount importance for both law professionals (judges, lawyers, law students) and for the larger public. In Romania, the Superior Council of Magistracy holds a large database of jurisprudence from different courts in the country, which is updated daily. However, granting public access requires its anonymization. This paper presents the efforts behind building a corpus for the anonymization process. We present the annotation scheme, the manual annotation methods, and the platform used.
A Cross-model Study on Learning Romanian Parts of Speech with Transformer Models
Radu Ion | Verginica Barbu Mititelu | Vasile Păiş | Elena Irimia | Valentin Badea
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)
This paper attempts to determine experimentally whether POS tagging of unseen words achieves accuracy comparable to that obtained for words that were rarely seen in the training set (i.e. frequency less than 5) or more frequently seen (i.e. frequency greater than 10). To compare accuracies objectively, we use the odds ratio statistic and its confidence interval to show that the odds of being correct on unseen words are close to the odds of being correct on rarely seen words. For the training of the POS taggers, we use different Romanian BERT models that are freely available on HuggingFace.
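The odds-ratio comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard log-odds-ratio (Wald) confidence interval; the abstract does not state which interval the authors used. The function name and the counts are hypothetical and are not taken from the paper.

```python
import math

def odds_ratio_ci(correct_a, wrong_a, correct_b, wrong_b, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for two groups.

    Assumption: the usual Wald interval on the log odds ratio is used;
    z=1.96 corresponds to an approximate 95% interval.
    """
    odds_ratio = (correct_a * wrong_b) / (wrong_a * correct_b)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / correct_a + 1 / wrong_a + 1 / correct_b + 1 / wrong_b)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical counts (correct, wrong); NOT results from the paper.
unseen = (900, 100)   # tagging outcomes on unseen words
rare = (920, 80)      # tagging outcomes on rarely seen words

ratio, low, high = odds_ratio_ci(*unseen, *rare)
print(f"OR={ratio:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

An interval that contains 1 means the data do not show a difference in odds between the two groups at the chosen level.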
Function Multiword Expressions Annotated with Discourse Relations in the Romanian Reference Treebank
Verginica Barbu Mititelu | Tudor Voicu
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)
For the Romanian Reference Treebank, a general language corpus, covering several genres and annotated according to the principles of Universal Dependencies, we present here the annotation of some function words, namely multiword conjunctions, with discourse relations from the Penn Discourse Treebank version 3.0 inventory of such relations. The annotation process was manual, with two annotators for each occurrence of the conjunctions. Lexical-semantic relations such as synonymy and polysemy can be established between the senses of such conjunctions. The discourse relations are added to the CoNLL-U file in which the treebank is represented.
2023
Romanian Multiword Expression Detection Using Multilingual Adversarial Training and Lateral Inhibition
Andrei Avram | Verginica Barbu Mititelu | Dumitru-Clementin Cercel
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)
Multiword expressions are a key ingredient for developing large-scale and linguistically sound natural language processing technology. This paper describes our improvements in automatically identifying Romanian multiword expressions on the corpus released for the PARSEME v1.2 shared task. Our approach assumes a multilingual perspective based on the recently introduced lateral inhibition layer and adversarial training to boost the performance of the employed multilingual language models. With the help of these two methods, we improve the F1-score of XLM-RoBERTa by approximately 2.7% on unseen multiword expressions, the main task of the PARSEME 1.2 edition. In addition, our results can be considered SOTA performance, as they outperform the previous results on Romanian obtained by the participants in this competition.
PARSEME corpus release 1.3
Agata Savary | Cherifa Ben Khelil | Carlos Ramisch | Voula Giouli | Verginica Barbu Mititelu | Najet Hadj Mohamed | Cvetana Krstev | Chaya Liebeskind | Hongzhi Xu | Sara Stymne | Tunga Güngör | Thomas Pickard | Bruno Guillaume | Eduard Bejček | Archna Bhatia | Marie Candito | Polona Gantar | Uxoa Iñurrieta | Albert Gatt | Jolanta Kovalevskaite | Timm Lichte | Nikola Ljubešić | Johanna Monti | Carla Parra Escartín | Mehrnoush Shamsfard | Ivelina Stoyanova | Veronika Vincze | Abigail Walsh
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)
We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus now covers 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split following the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.
PARSEME Meets Universal Dependencies: Getting on the Same Page in Representing Multiword Expressions
Agata Savary | Sara Stymne | Verginica Barbu Mititelu | Nathan Schneider | Carlos Ramisch | Joakim Nivre
Northern European Journal of Language Technology, Volume 9
Multiword expressions (MWEs) are challenging and pervasive phenomena whose idiosyncratic properties show notably at the levels of lexicon, morphology, and syntax. Thus, they should best be annotated jointly with morphosyntax. We discuss two multilingual initiatives, Universal Dependencies and PARSEME, addressing these annotation layers in cross-lingually unified ways. We compare the annotation principles of these initiatives with respect to MWEs, and we put forward a roadmap towards their gradual unification. The expected outcomes are more consistent treebanking and higher universality in modeling idiosyncrasy.
2022
Use Case: Romanian Language Resources in the LOD Paradigm
Verginica Barbu Mititelu | Elena Irimia | Vasile Pais | Andrei-Marius Avram | Maria Mitrofan
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference
In this paper, we report on (i) the conversion of Romanian language resources to the Linked Open Data specifications and requirements, on (ii) their publication and (iii) interlinking with other language resources (for Romanian or for other languages). The pool of converted resources is made up of the Romanian Wordnet, the morphosyntactic and phonemic lexicon RoLEX, four treebanks, one for the general language (the Romanian Reference Treebank) and others for specialised domains (SiMoNERo for medicine, LegalNERo for the legal domain, PARSEME-Ro for verbal multiword expressions), frequency information on lemmas and tokens and word embeddings as extracted from the reference corpus for contemporary Romanian (CoRoLa) and a bi-modal (text and speech) corpus. We also present the limitations coming from the representation of the resources in Linked Data format. The metadata of LOD resources have been published in the LOD Cloud. The resources are available for download on our website and a SPARQL endpoint is also available for querying them.
Challenges in Creating a Representative Corpus of Romanian Micro-Blogging Text
Vasile Pais | Maria Mitrofan | Verginica Barbu Mititelu | Elena Irimia | Roxana Micu | Carol Luca Gasan
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
Following the successful creation of a national representative corpus of contemporary Romanian language, we turned our attention to social media text, as present in micro-blogging platforms. In this paper, we present the current activities as well as the challenges faced when trying to apply existing tools (for both annotation and indexing) to a Romanian language micro-blogging corpus. These challenges are encountered at all annotation levels, including tokenization, and at the indexing stage. We consider that existing tools for Romanian language processing must be adapted to recognize features such as emoticons, emojis, hashtags, unusual abbreviations, elongated words (commonly used for emphasis in micro-blogging), multiple words joined together (within or outside hashtags), and code-mixed text.
Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources
Tamás Váradi | Bence Nyéki | Svetla Koeva | Marko Tadić | Vanja Štefanec | Maciej Ogrodniczuk | Bartłomiej Nitoń | Piotr Pęzik | Verginica Barbu Mititelu | Elena Irimia | Maria Mitrofan | Dan Tufiș | Radovan Garabík | Simon Krek | Andraž Repar
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus includes 7 monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed and the named entities annotated. The annotations are uniformly provided for each language-specific corpus while the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The file format is the CoNLL-U Plus format, containing the ten columns specific to the CoNLL-U format and three extra columns specific to our corpora as defined by Váradi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
Aligning the Romanian Reference Treebank and the Valence Lexicon of Romanian Verbs
Ana-Maria Barbu | Verginica Barbu Mititelu | Cătălin Mititelu
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present here the efforts of aligning two language resources for Romanian: the Romanian Reference Treebank and the Valence Lexicon of Romanian Verbs: for each occurrence of those verbs in the treebank that were included as entries in the lexicon, a set of valence frames is automatically assigned, then manually validated by two linguists and, when necessary, corrected. Validating a valence frame also means semantically disambiguating the verb in the respective context. The validation is done by two linguists, on complementary datasets. However, a subset of verbs were validated by both annotators and Cohen’s κ is 0.87 for this subset. The alignment we have made also serves as a method of enhancing the quality of the two resources, as in the process we identify morpho-syntactic annotation mistakes, incomplete valence frames or missing ones. Information from each resource complements the information from the other, thus their value increases. The treebank and the lexicon are freely available, while the links discovered between them are also made available on GitHub.
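The inter-annotator agreement figure reported in this abstract is Cohen's κ. As a minimal sketch of how that statistic is computed for two annotators, the following uses hypothetical labels ("ok" for a validated frame, "fix" for a corrected one); the label names and the data are illustrative assumptions, not the paper's annotation scheme.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical decisions for six verb occurrences; NOT data from the paper.
annotator_1 = ["ok", "ok", "fix", "ok", "fix", "ok"]
annotator_2 = ["ok", "ok", "fix", "fix", "fix", "ok"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```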
An Open-Domain QA System for e-Governance
Radu Ion | Andrei-Marius Avram | Vasile Păis | Maria Mitrofan | Verginica Barbu Mititelu | Elena Irimia | Valentin Badea
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)
The paper presents an open-domain Question Answering system for Romanian, answering COVID-19 related questions. The QA system pipeline involves automatic question processing, automatic query generation, web searching for the top 10 most relevant documents and answer extraction using a fine-tuned BERT model for Extractive QA, trained on a COVID-19 data set that we have manually created. The paper will present the QA system and its integration with the Romanian language technologies portal RELATE, the COVID-19 data set and different evaluations of the QA performance.
A Romanian Treebank Annotated with Verbal Multiword Expressions
Verginica Barbu Mititelu | Mihaela Cristescu | Maria Mitrofan | Bianca-Mădălina Zgreabăn | Elena-Andreea Bărbulescu
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)
In this paper we present a new version of the Romanian journalistic treebank annotated with verbal multiword expressions of four types: idioms, light verb constructions, reflexive verbs and inherently adpositional verbs, the last type being recently added to the corpus. These types have been defined and characterized in a multilingual setting (the PARSEME guidelines for annotating verbal multiword expressions). We present the annotation methodologies and offer quantitative data about the expressions occurring in the corpus. We discuss the characteristics of these expressions, with special reference to the difficulties they raise for the automatic processing of Romanian text, as well as for human usage. Special attention is paid to the challenges in the annotation of the inherently adpositional verbs. The corpus is freely available in two formats (CUPT and RDF), as well as queryable using a SPARQL endpoint.
Romanian micro-blogging named entity recognition including health-related entities
Vasile Pais | Verginica Barbu Mititelu | Elena Irimia | Maria Mitrofan | Carol Luca Gasan | Roxana Micu
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task
This paper introduces a manually annotated dataset for named entity recognition (NER) in micro-blogging text for Romanian language. It contains gold annotations for 9 entity classes and expressions: persons, locations, organizations, time expressions, legal references, disorders, chemicals, medical devices and anatomical parts. Furthermore, word embeddings models computed on a larger micro-blogging corpus are made available. Finally, several NER models are trained and their performance is evaluated against the newly introduced corpus.
2020
It Takes Two to Tango – Towards a Multilingual MWE Resource
Svetlozara Leseva | Verginica Barbu Mititelu | Ivelina Stoyanova
Proceedings of the Fourth International Conference on Computational Linguistics in Bulgaria (CLIB 2020)
Mature wordnets offer the opportunity of digging out interesting linguistic information otherwise not explicitly marked in the network. The focus in this paper is on the ways the results already obtained at two levels, derivation and multiword expressions, may be further employed. The parallel recent development of the two resources under discussion, the Bulgarian and the Romanian wordnets, has enabled interlingual analyses that reveal similarities and differences between the linguistic knowledge encoded in the two wordnets. In this paper we show how the resources developed and the knowledge gained are put together towards devising a linked MWE resource that is informed by layered dictionary representation and corpus annotation and analysis. This work is a proof of concept for the adopted method of compiling a multilingual MWE resource on the basis of information extracted from the Bulgarian, the Romanian and the Princeton wordnet, as well as additional language resources and automatic procedures.
A Customizable WordNet Editor
Andrei-Marius Avram | Verginica Barbu Mititelu
Proceedings of the Fourth International Conference on Computational Linguistics in Bulgaria (CLIB 2020)
This paper presents an open-source wordnet editor that has been developed to ensure further expansion of the Romanian wordnet. It comes with a web interface that offers capabilities in selecting new synsets to be implemented, editing the list of literals and their sense numbers and adding these new synsets to the existing network, by importing from Princeton WordNet (and adjusting, when necessary) all the relations in which the newly created synsets and their literals are involved. The application also comes with an authorization mechanism that ensures control of the new synsets added in novice or lexicographer accounts. Although created to serve the current (more or less specific) needs in the development of the Romanian wordnet, it can be customized to fulfill new requirements from developers, either of the same wordnet or of a different one for which a similar approach is adopted.
The MARCELL Legislative Corpus
Tamás Váradi | Svetla Koeva | Martin Yamalov | Marko Tadić | Bálint Sass | Bartłomiej Nitoń | Maciej Ogrodniczuk | Piotr Pęzik | Verginica Barbu Mititelu | Radu Ion | Elena Irimia | Maria Mitrofan | Vasile Păiș | Dan Tufiș | Radovan Garabík | Simon Krek | Andraz Repar | Matjaž Rihtar | Janez Brank
Proceedings of the Twelfth Language Resources and Evaluation Conference
This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is the CoNLL-U Plus format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
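To make the file layout described in this abstract concrete, here is a minimal reading sketch for the CoNLL-U Plus format, in which a `# global.columns = ...` header names the columns. The four extra column names (`DEMO:NE`, `DEMO:NP`, `DEMO:IATE`, `DEMO:EUROVOC`) and the sample sentence are hypothetical placeholders; the actual column names used in the corpus are not given in the abstract.

```python
def read_conllu_plus(text):
    """Parse CoNLL-U Plus text into (column names, list of sentences).

    Each sentence is a list of dicts mapping column name -> field value.
    """
    columns = None
    sentences, current = [], []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence ids, raw text) are skipped
        elif not line.strip():
            if current:  # a blank line closes the current sentence
                sentences.append(current)
                current = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            current.append(dict(zip(columns, line.split("\t"))))
    if current:
        sentences.append(current)
    return columns, sentences

# The ten standard CoNLL-U columns.
STANDARD = "ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC".split()
# Hypothetical names for the four extra columns (an assumption for this sketch).
EXTRA = ["DEMO:NE", "DEMO:NP", "DEMO:IATE", "DEMO:EUROVOC"]

# Invented two-token sample, not taken from the corpus.
rows = [
    ["1", "Legea", "lege", "NOUN", "_", "_", "0", "root", "_", "_", "O", "B-NP", "_", "_"],
    ["2", "nouă", "nou", "ADJ", "_", "_", "1", "amod", "_", "_", "O", "I-NP", "_", "_"],
]
sample = "# global.columns = " + " ".join(STANDARD + EXTRA) + "\n"
sample += "\n".join("\t".join(row) for row in rows) + "\n\n"

columns, sentences = read_conllu_plus(sample)
print(len(columns), sentences[0][0]["LEMMA"], sentences[0][1]["DEMO:NP"])
```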
Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions
Carlos Ramisch | Agata Savary | Bruno Guillaume | Jakub Waszczuk | Marie Candito | Ashwini Vaidya | Verginica Barbu Mititelu | Archna Bhatia | Uxoa Iñurrieta | Voula Giouli | Tunga Güngör | Menghan Jiang | Timm Lichte | Chaya Liebeskind | Johanna Monti | Renata Ramisch | Sara Stymne | Abigail Walsh | Hongzhi Xu
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons
We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.
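The notion of an "unseen" VMWE central to this abstract can be sketched as follows. The sketch assumes that a test VMWE counts as unseen when the multiset of lemmas of its components never occurs among the annotated training VMWEs; the abstract itself does not spell out the criterion, so this is an assumption, and the function names and data are hypothetical.

```python
def lemma_key(vmwe_lemmas):
    """Order-independent key for a VMWE: its lemmas as a sorted tuple (a multiset)."""
    return tuple(sorted(vmwe_lemmas))

def unseen_vmwes(train_vmwes, test_vmwes):
    """Return the test VMWEs whose lemma multiset never occurs in training."""
    seen = {lemma_key(vmwe) for vmwe in train_vmwes}
    return [vmwe for vmwe in test_vmwes if lemma_key(vmwe) not in seen]

# Hypothetical lemma lists; NOT data from the shared task corpora.
train = [["take", "decision"], ["make", "sense"]]
test = [["decision", "take"], ["pay", "attention"]]

unseen = unseen_vmwes(train, test)
print(unseen)
```

Using a sorted tuple makes the comparison insensitive to word order, so "take a decision" and "the decision was taken" map to the same key once reduced to their lexicalized lemmas.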
2019
MoNERo: a Biomedical Gold Standard Corpus for the Romanian Language
Maria Mitrofan | Verginica Barbu Mititelu | Grigorina Mitrofan
Proceedings of the 18th BioNLP Workshop and Shared Task
In an era when large amounts of data are generated daily in various fields, the biomedical field among others, linguistic resources can be exploited for various tasks of Natural Language Processing. Moreover, an increasing number of biomedical documents are available in languages other than English. To be able to extract information from natural language free text resources, methods and tools are needed for a variety of languages. This paper presents the creation of the MoNERo corpus, a gold standard biomedical corpus for Romanian, annotated with both part of speech tags and named entities. MoNERo comprises 154,825 morphologically annotated tokens and 23,188 entity annotations belonging to four entity semantic groups corresponding to UMLS Semantic Groups.
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)
Agata Savary | Carla Parra Escartín | Francis Bond | Jelena Mitrović | Verginica Barbu Mititelu
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)
Hear about Verbal Multiword Expressions in the Bulgarian and the Romanian Wordnets Straight from the Horse’s Mouth
Verginica Barbu Mititelu | Ivelina Stoyanova | Svetlozara Leseva | Maria Mitrofan | Tsvetana Dimitrova | Maria Todorova
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)
In this paper we focus on verbal multiword expressions (VMWEs) in Bulgarian and Romanian as reflected in the wordnets of the two languages. The annotation of VMWEs relies on the classification defined within the PARSEME COST Action. After outlining the properties of various types of VMWEs, a cross-language comparison is drawn, aiming to highlight the similarities and the differences between Bulgarian and Romanian with respect to the lexicalization and distribution of VMWEs. The contribution of this work is in outlining essential features of the description and classification of VMWEs and the cross-language comparison at the lexical level, which is essential for the understanding of the need for uniform annotation guidelines and a viable procedure for validation of the annotation.
The Romanian Corpus Annotated with Verbal Multiword Expressions
Verginica Barbu Mititelu | Mihaela Cristescu | Mihaela Onofrei
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)
This paper reports on the Romanian journalistic corpus annotated with verbal multiword expressions following the PARSEME guidelines. The corpus is sentence split, tokenized, part-of-speech tagged, lemmatized, syntactically annotated and verbal multiword expressions are identified and classified. It offers insights into the frequency of such Romanian word combinations and allows for their characterization. We offer data about the types of verbal multiword expressions in the corpus and some of their characteristics, such as internal structure, diversity in the corpus, average length, productivity of the verbs. This is a language resource that is important per se, as well as for the task of automatic multiword expressions identification, which can be further used in other systems. It was already used as training and test material in the shared tasks for the automatic identification of verbal multiword expressions organized by PARSEME.
2018
A Pilot Study for Enriching the Romanian WordNet with Medical Terms
Maria Mitrofan | Verginica Barbu Mititelu | Grigorina Mitrofan
Proceedings of the Third International Conference on Computational Linguistics in Bulgaria (CLIB 2018)
This paper presents the preliminary investigations in the process of integrating a specialized vocabulary, namely medical terminology, into the Romanian wordnet. We focus here on four classes from this vocabulary: anatomy (or body parts), disorders, medical procedures and chemicals. In this pilot study we selected two large concepts from each class and created the Romanian terminological (sub)trees for each of them, starting from a medical thesaurus (SNOMED CT) and translating the terms, a process which raised various challenges, all of them calling for the expertise of a specialist in the health care domain. The integration of these (sub)trees in the Romanian wordnet also required careful decision making, given the structural differences between a wordnet and a terminological thesaurus. They are presented and discussed herein.
The Reference Corpus of the Contemporary Romanian Language (CoRoLa)
Verginica Barbu Mititelu | Dan Tufiș | Elena Irimia
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Ensemble Romanian Dependency Parsing with Neural Networks
Radu Ion | Elena Irimia | Verginica Barbu Mititelu
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Edition 1.1 of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions
Carlos Ramisch | Silvio Ricardo Cordeiro | Agata Savary | Veronika Vincze | Verginica Barbu Mititelu | Archna Bhatia | Maja Buljan | Marie Candito | Polona Gantar | Voula Giouli | Tunga Güngör | Abdelati Hawwari | Uxoa Iñurrieta | Jolanta Kovalevskaitė | Simon Krek | Timm Lichte | Chaya Liebeskind | Johanna Monti | Carla Parra Escartín | Behrang QasemiZadeh | Renata Ramisch | Nathan Schneider | Ivelina Stoyanova | Ashwini Vaidya | Abigail Walsh
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)
This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.
A hybrid pipeline of rules and machine learning to filter web-crawled parallel corpora
Eduard Barbu | Verginica Barbu Mititelu
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
A hybrid pipeline comprising rules and machine learning is used to filter a noisy web English-German parallel corpus for the Parallel Corpus Filtering task. The core of the pipeline is a module based on the logistic regression algorithm that returns the probability that a translation unit is accepted. The training set for the logistic regression is created by automatic annotation. The quality of the automatic annotation is estimated by manually labeling the training set.
2017
A data-driven approach to verbal multiword expression detection. PARSEME Shared Task system description paper
Tiberiu Boros | Sonia Pipa | Verginica Barbu Mititelu | Dan Tufis
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
“Multiword expressions” are groups of words acting as a morphologic, syntactic and semantic unit in linguistic analysis. Verbal multiword expressions represent the subgroup of multiword expressions, namely that in which a verb is the syntactic head of the group considered in its canonical (or dictionary) form. All multiword expressions are a great challenge for natural language processing, but the verbal ones are particularly interesting for tasks such as parsing, as the verb is the central element in the syntactic organization of a sentence. In this paper we introduce our data-driven approach to verbal multiword expressions which was objectively validated during the PARSEME shared task on verbal multiword expressions identification. We tested our approach on 12 languages, and we provide detailed information about corpora composition, feature selection process, validation procedure and performance on all languages.
2016
The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language
Dan Tufiș | Verginica Barbu Mititelu | Elena Irimia | Ștefan Daniel Dumitrescu | Tiberiu Boroș
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The article describes the current status of a large national project, CoRoLa, aiming at building a reference corpus for the contemporary Romanian language. Unlike many other national corpora, CoRoLa contains only IPR-cleared texts and speech data, obtained from some of the country’s most representative publishing houses, broadcasting agencies, editorial offices, newspapers and popular bloggers. For the written component 500 million tokens are targeted and for the oral one 300 hours of recordings. The choice of texts is made according to their functional style, domain and subdomain, also with an eye to international practice. A metadata file (following the CMDI model) is associated with each text file. Collected texts are cleaned and transformed into a format compatible with the tools for automatic processing (segmentation, tokenization, lemmatization, part-of-speech tagging). The paper also presents up-to-date statistics about the structure of the corpus almost two years before its official launch. The corpus will be freely available for searching. Users will be able to download the results of their searches and the original files, when this is not against stipulations in the protocols we have with text providers.
pdf
bib
abs
Linguistic Data Retrievable from a Treebank
Verginica Barbu Mititelu
|
Elena Irimia
Proceedings of the Second International Conference on Computational Linguistics in Bulgaria (CLIB 2016)
This paper describes the Romanian treebank annotated according to the Universal Dependency principles. We present the types of texts included in the treebank, their processing phases and the tools used for them, as well as the levels of annotation, with a focus on the syntactic level. We briefly present the syntactic formalism used, the principles followed and the set of relations. The perspective we adopt is that of the linguist who searches the treebank for information relevant to the study of Romanian. (S)He can interpret the statistics based on the corpus and can also query the treebank to find examples supporting a theory, to test hypotheses or to discover new tendencies. We use passive constructions in Romanian as a case study to show how statistical data help in understanding this linguistic phenomenon. We also discuss the kinds of linguistic information retrievable and non-retrievable from the treebank, based on the annotation principles.
pdf
bib
Proceedings of the 8th Global WordNet Conference (GWC)
Christiane Fellbaum
|
Piek Vossen
|
Verginica Barbu Mititelu
|
Corina Forascu
Proceedings of the 8th Global WordNet Conference (GWC)
2015
pdf
bib
Universal and Language-specific Dependency Relations for Analysing Romanian
Verginica Barbu Mititelu
|
Cătălina Mărănduc
|
Elena Irimia
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)
2014
pdf
bib
abs
CoRoLa — The Reference Corpus of Contemporary Romanian Language
Verginica Barbu Mititelu
|
Elena Irimia
|
Dan Tufiș
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present the project of creating CoRoLa, a reference corpus of contemporary Romanian (from 1945 onwards). In the international context, the project finds its place among the initiatives of gathering huge collections of texts, of pre-processing and annotating them at several levels, and also of documenting them with metadata (CMDI). Our project is a joint effort of two institutes of the Romanian Academy. We foresee a corpus of more than 500 million word forms, covering all functional styles of the language. Although the vast majority of texts will be in written form, we target about 300 hours of oral texts, too, obligatorily with associated transcripts. Most of the texts will be from books, while the rest will be harvested from newspapers, booklets, technical reports, etc. The pre-processing includes cleaning the data and harmonising the diacritics, sentence splitting and tokenization. Annotation will be done at a morphological level in a first stage, followed by lemmatization, with the possibility of adding syntactic, semantic and discourse annotation in a later stage. A core of CoRoLa is described in the article. The target users of our corpus will be researchers in linguistics and language processing, teachers of Romanian, and students.
pdf
bib
News about the Romanian Wordnet
Verginica Barbu Mititelu
|
Ștefan Daniel Dumitrescu
|
Dan Tufiș
Proceedings of the Seventh Global Wordnet Conference
pdf
bib
RACAI GEC – A hybrid approach to Grammatical Error Correction
Tiberiu Boroș
|
Stefan Daniel Dumitrescu
|
Adrian Zafiu
|
Verginica Barbu Mititelu
|
Ionut Paul Văduva
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task
pdf
bib
WordFinder
Catalin Mititelu
|
Verginica Barbu Mititelu
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)
pdf
bib
abs
Noun-Verb Derivation in the Bulgarian and the Romanian WordNet – A Comparative Approach
Ekaterina Tarpomanova
|
Svetlozara Leseva
|
Maria Todorova
|
Tsvetana Dimitrova
|
Borislav Rizov
|
Verginica Barbu Mititelu
|
Elena Irimia
Proceedings of the First International Conference on Computational Linguistics in Bulgaria (CLIB 2014)
Romanian and Bulgarian are Balkan languages with rich derivational morphology that, if introduced into their respective wordnets, can help broaden both the wordnet content and its possible NLP applications. In this paper we present joint work on introducing derivation into the Bulgarian and the Romanian WordNets, BulNet and RoWordNet, respectively, by identifying and subsequently labelling the derivationally and semantically related noun-verb pairs. Our research aims at providing a framework for a comparative study of derivation in the two languages and at offering training material for the automatic identification and assignment of derivational and morphosemantic relations needed in various applications.
2012
pdf
bib
abs
Adding Morpho-semantic Relations to the Romanian Wordnet
Verginica Barbu Mititelu
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Keeping pace with the development of other wordnets, we present the challenges raised by the Romanian derivational system and our methodology for identifying derived words and their stems in the Romanian Wordnet. To attain this aim we rely only on the list of literals in the wordnet and on a list of Romanian affixes; the automatically obtained pairs require automatic and manual validation, based on a few heuristics. The correct members of the pairs are linked together and the relation is assigned a semantic label whenever necessary. This label is shown to have cross-language validity. The work reported here contributes to the increase of the number of relations both between literals and between synsets, especially the cross-part-of-speech links. Words belonging to the same lexical family are identified easily. The benefits of thus improving a language resource such as wordnet become self-evident. The paper also contains an overview of the current status of the Romanian wordnet and an envisaged plan for continuing the research.
2011
pdf
bib
Wordnets: State of the Art and Perspectives. Case Study: the Romanian Wordnet
Verginica Barbu Mititelu
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011
2008
pdf
bib
abs
Annotation of WordNet Verbs with TimeML Event Classes
Georgiana Puşcaşu
|
Verginica Barbu Mititelu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper reports on the annotation of all English verbs included in WordNet 2.0 with TimeML event classes. Two annotators assign each verb present in WordNet the most relevant event class capturing most of that verb’s meanings. At the end of the annotation process, inter-annotator agreement is measured using kappa statistics, yielding a kappa value of 0.87. The cases of disagreement between the two independent annotations are clarified by obtaining a third, and in some cases, a fourth opinion, and finally each of the 11,306 WordNet verbs is mapped to a unique event class. The resulting annotation is then employed to automatically assign the corresponding class to each occurrence of a finite or non-finite verb in a given text. The evaluation performed on TimeBank reveals an F-measure of 86.43% achieved for the identification of verbal events, and an accuracy of 85.25% in the task of classifying them into TimeML event classes.
2006
pdf
bib
abs
Romanian Valence Dictionary in XML Format
Ana-Maria Barbu
|
Emil Ionescu
|
Verginica Barbu Mititelu
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Valence dictionaries are dictionaries in which logical predicates (most often verbs) are inventoried along with the semantic and syntactic information regarding the role of the arguments with which they combine, as well as the syntactic restrictions these arguments have to obey. In this article we present the incipient stage of the project Syntactic and semantic database in XML format: an HPSG representation of verb valences in Romanian. Its aim is the development of a valence dictionary in XML format for a set of 3000 Romanian verbs. Valences are specified for each sense of each verb, along with an illustrative example, possible argument alternations and a set of multiword expressions in which the respective verb occurs with the respective sense. The grammatical formalism we make use of is Head-driven Phrase Structure Grammar, which offers one of the most comprehensive frames for encoding various types of linguistic information for lexical items. XML is the most appropriate mark-up language for describing information structured in the HPSG framework. The project can later be extended to cover all Romanian verbs (around 7000) and also other predicates (nouns, adjectives, prepositions).
2005
pdf
bib
A Case Study in Automatic Building of Wordnets
Eduard Barbu
|
Verginica Barbu Mititelu
Proceedings of OntoLex 2005 - Ontologies and Lexical Resources