Maria Mitrofan


2022

Challenges in Creating a Representative Corpus of Romanian Micro-Blogging Text
Vasile Pais | Maria Mitrofan | Verginica Barbu Mititelu | Elena Irimia | Roxana Micu | Carol Luca Gasan
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)

Following the successful creation of a national representative corpus of contemporary Romanian language, we turned our attention to the social media text, as present in micro-blogging platforms. In this paper, we present the current activities as well as the challenges faced when trying to apply existing tools (for both annotation and indexing) to a Romanian language micro-blogging corpus. These challenges are encountered at all annotation levels, including tokenization, and at the indexing stage. We consider that existing tools for Romanian language processing must be adapted to recognize features such as emoticons, emojis, hashtags, unusual abbreviations, elongated words (commonly used for emphasis in micro-blogging), multiple words joined together (within or outside hashtags), and code-mixed text.
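As a rough illustration of the surface features listed in this abstract, the sketch below flags hashtags, elongated words and simple ASCII emoticons with regular expressions. The patterns and the function names `microblog_features` and `squeeze_elongation` are assumptions made for this example; they are not the rules used by the authors' tools, and they do not cover emojis, joined words or code-mixed text.

```python
import re

# Hypothetical patterns for illustration only.
HASHTAG = re.compile(r"#\w+")
# A word containing the same character three or more times in a row.
ELONGATED_WORD = re.compile(r"\w*(\w)\1{2,}\w*")
# A run of three or more identical characters, used for normalisation.
ELONGATED_RUN = re.compile(r"(\w)\1{2,}")
# A few common ASCII emoticons such as :) ;-) =D
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")


def microblog_features(text: str) -> dict:
    """Return the hashtags, elongated words and emoticons found in text."""
    return {
        "hashtags": HASHTAG.findall(text),
        "elongated": [m.group(0) for m in ELONGATED_WORD.finditer(text)],
        "emoticons": EMOTICON.findall(text),
    }


def squeeze_elongation(word: str) -> str:
    """Collapse each run of three or more identical characters to one."""
    return ELONGATED_RUN.sub(r"\1", word)
```

Collapsing a run to a single character is a deliberately crude choice: it would damage words that legitimately contain a tripled letter, so a real pipeline would probably check candidates against a lexicon.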

Improving Romanian BioNER Using a Biologically Inspired System
Maria Mitrofan | Vasile Pais
Proceedings of the 21st Workshop on Biomedical Language Processing

Recognition of named entities present in text is an important step towards information extraction and natural language understanding. This work presents a named entity recognition system for the Romanian biomedical domain. The system makes use of a new and extended version of the SiMoNERo corpus, which is open sourced. Also, the best system is available for direct usage in the RELATE platform.

Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources
Tamás Váradi | Bence Nyéki | Svetla Koeva | Marko Tadić | Vanja Štefanec | Maciej Ogrodniczuk | Bartłomiej Nitoń | Piotr Pęzik | Verginica Barbu Mititelu | Elena Irimia | Maria Mitrofan | Dan Tufiș | Radovan Garabík | Simon Krek | Andraž Repar
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus includes 7 monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed and the named entities annotated. The annotations are uniformly provided for each language specific corpus while the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The file format is CoNLL-U Plus format, containing the ten columns specific to the CoNLL-U format and three extra columns specific to our corpora as defined by Váradi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.

RACAI@SMM4H’22: Tweets Disease Mention Detection Using a Neural Lateral Inhibitory Mechanism
Andrei-Marius Avram | Vasile Pais | Maria Mitrofan
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper presents our system employed for the Social Media Mining for Health (SMM4H) 2022 competition Task 10 - SocialDisNER. The goal of the task was to improve the detection of diseases in tweets. Because the tweets were in Spanish, we approached this problem using a system that relies on a pre-trained multilingual model and is fine-tuned using the recently introduced lateral inhibition layer. We further experimented on this task by employing a conditional random field on top of the system and using a voting-based ensemble that contains various architectures. The evaluation results outlined that our best performing model obtained 83.7% F1-strict on the validation set and 82.1% F1-strict on the test set.

Romanian micro-blogging named entity recognition including health-related entities
Vasile Pais | Verginica Barbu Mititelu | Elena Irimia | Maria Mitrofan | Carol Luca Gasan | Roxana Micu
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper introduces a manually annotated dataset for named entity recognition (NER) in micro-blogging text for Romanian language. It contains gold annotations for 9 entity classes and expressions: persons, locations, organizations, time expressions, legal references, disorders, chemicals, medical devices and anatomical parts. Furthermore, word embeddings models computed on a larger micro-blogging corpus are made available. Finally, several NER models are trained and their performance is evaluated against the newly introduced corpus.

An Open-Domain QA System for e-Governance
Radu Ion | Andrei-Marius Avram | Vasile Păis | Maria Mitrofan | Verginica Barbu Mititelu | Elena Irimia | Valentin Badea
Proceedings of the 5th International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

The paper presents an open-domain Question Answering system for Romanian, answering COVID-19 related questions. The QA system pipeline involves automatic question processing, automatic query generation, web searching for the top 10 most relevant documents and answer extraction using a fine-tuned BERT model for Extractive QA, trained on a COVID-19 data set that we have manually created. The paper will present the QA system and its integration with the Romanian language technologies portal RELATE, the COVID-19 data set and different evaluations of the QA performance.

A Romanian Treebank Annotated with Verbal Multiword Expressions
Verginica Barbu Mititelu | Mihaela Cristescu | Maria Mitrofan | Bianca-Mădălina Zgreabăn | Elena-Andreea Bărbulescu
Proceedings of the 5th International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

In this paper we present a new version of the Romanian journalistic treebank annotated with verbal multiword expressions of four types: idioms, light verb constructions, reflexive verbs and inherently adpositional verbs, the last type being recently added to the corpus. These types have been defined and characterized in a multilingual setting (the PARSEME guidelines for annotating verbal multiword expressions). We present the annotation methodologies and offer quantitative data about the expressions occurring in the corpus. We discuss the characteristics of these expressions, with special reference to the difficulties they raise for the automatic processing of Romanian text, as well as for human usage. Special attention is paid to the challenges in the annotation of the inherently adpositional verbs. The corpus is freely available in two formats (CUPT and RDF), as well as queryable using a SPARQL endpoint.

Use Case: Romanian Language Resources in the LOD Paradigm
Verginica Barbu Mititelu | Elena Irimia | Vasile Pais | Andrei-Marius Avram | Maria Mitrofan
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

In this paper, we report on (i) the conversion of Romanian language resources to the Linked Open Data specifications and requirements, on (ii) their publication and (iii) interlinking with other language resources (for Romanian or for other languages). The pool of converted resources is made up of the Romanian Wordnet, the morphosyntactic and phonemic lexicon RoLEX, four treebanks, one for the general language (the Romanian Reference Treebank) and others for specialised domains (SiMoNERo for medicine, LegalNERo for the legal domain, PARSEME-Ro for verbal multiword expressions), frequency information on lemmas and tokens and word embeddings as extracted from the reference corpus for contemporary Romanian (CoRoLa) and a bi-modal (text and speech) corpus. We also present the limitations coming from the representation of the resources in Linked Data format. The metadata of LOD resources have been published in the LOD Cloud. The resources are available for download on our website and a SPARQL endpoint is also available for querying them.

Romanian Language Translation in the RELATE Platform
Vasile Pais | Maria Mitrofan | Andrei-Marius Avram
Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)

This paper presents the usage of the RELATE platform for translation tasks involving the Romanian language. Using this platform, it is possible to perform text and speech data translations, either for single documents or for entire corpora. Furthermore, the platform was successfully used in international projects to create new resources useful for Romanian language translation.

2021

Named Entity Recognition in the Romanian Legal Domain
Vasile Pais | Maria Mitrofan | Carol Luca Gasan | Vlad Coneschi | Alexandru Ianov
Proceedings of the Natural Legal Language Processing Workshop 2021

Recognition of named entities present in text is an important step towards information extraction and natural language understanding. This work presents a named entity recognition system for the Romanian legal domain. The system makes use of the gold annotated LegalNERo corpus. Furthermore, the system combines multiple distributional representations of words, including word embeddings trained on a large legal domain corpus. All the resources, including the corpus, model and word embeddings are open sourced. Finally, the best system is available for direct usage in the RELATE platform.

Assessing multiple word embeddings for named entity recognition of professions and occupations in health-related social media
Vasile Pais | Maria Mitrofan
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

This paper presents our contribution to the ProfNER shared task. Our work focused on evaluating different pre-trained word embedding representations suitable for the task. We further explored combinations of embeddings in order to improve the overall results.

2020

Collection and Annotation of the Romanian Legal Corpus
Dan Tufiș | Maria Mitrofan | Vasile Păiș | Radu Ion | Andrei Coman
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present the Romanian legislative corpus, which is a valuable linguistic asset for the development of machine translation systems, especially for under-resourced languages. The knowledge that can be extracted from this resource is necessary for a deeper understanding of how law terminology is used and how it can be made more consistent. At this moment the corpus contains more than 140k documents representing the legislative body of Romania. This corpus is processed and annotated at different levels: linguistically (tokenized, lemmatized and POS-tagged), dependency parsed, chunked, named entities identified and labeled with IATE terms and EUROVOC descriptors. Each annotated document is in CoNLL-U Plus format with 14 columns: in addition to the standard 10 columns, four further types of annotation were added. Moreover, the repository will be periodically updated as new legislative texts are published. These will be automatically collected and transmitted to the processing and annotation pipeline. Access to the corpus will be provided through the ELRC infrastructure.

The MARCELL Legislative Corpus
Tamás Váradi | Svetla Koeva | Martin Yamalov | Marko Tadić | Bálint Sass | Bartłomiej Nitoń | Maciej Ogrodniczuk | Piotr Pęzik | Verginica Barbu Mititelu | Radu Ion | Elena Irimia | Maria Mitrofan | Vasile Păiș | Dan Tufiș | Radovan Garabík | Simon Krek | Andraz Repar | Matjaž Rihtar | Janez Brank
Proceedings of the Twelfth Language Resources and Evaluation Conference

This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
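The ten-plus-four column layout mentioned in this abstract can be read with a few lines of code. The sketch below assumes that the file declares its columns in a `# global.columns = ...` comment, as the CoNLL-U Plus convention describes. The four extra column names, the sample token and the function name `parse_conllu_plus` are hypothetical placeholders for this example, not values taken from the corpus.

```python
# Standard CoNLL-U columns plus four extra columns.
# The extra names are assumed here for illustration only.
COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL",
    "DEPS", "MISC",
    "MARCELL:NE", "MARCELL:NP", "MARCELL:IATE", "MARCELL:EUROVOC",
]

# A made-up one-token sample, not real corpus data.
SAMPLE = (
    "# global.columns = " + " ".join(COLUMNS) + "\n"
    + "\t".join(["1", "Legea", "lege", "NOUN", "_", "_", "0", "root",
                 "_", "_", "O", "_", "_", "_"]) + "\n"
)


def parse_conllu_plus(text: str) -> list:
    """Parse CoNLL-U Plus text into a list of {column name: value} dicts."""
    columns = None
    tokens = []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # Column names are whitespace-separated after the '=' sign.
            columns = line.split("=", 1)[1].split()
        elif line.strip() and not line.startswith("#"):
            if columns is None:
                raise ValueError("missing '# global.columns' declaration")
            # Token lines are tab-separated, one field per declared column.
            tokens.append(dict(zip(columns, line.split("\t"))))
    return tokens
```

Calling `parse_conllu_plus(SAMPLE)` yields one dictionary with fourteen keys, so the extra annotation layers can be accessed by name rather than by column position.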

2019

Leaving No Stone Unturned When Identifying and Classifying Verbal Multiword Expressions in the Romanian Wordnet
Verginica Mititelu | Maria Mitrofan
Proceedings of the 10th Global Wordnet Conference

We present here the enhancement of the Romanian wordnet with a new type of information, very useful in language processing, namely types of verbal multi-word expressions. All verb literals made of two or more words are assigned a label specific to the type of verbal multi-word expression they correspond to. These labels were created in the PARSEME Cost Action and were used in version 1.1 of the shared task they organized. The results of this annotation are compared to those obtained in the annotation of a Romanian news corpus with the same labels. Given the alignment of the Romanian wordnet to the Princeton WordNet, this type of annotation can be further used for drawing comparisons between equivalent verbal literals in various languages, provided that such information is annotated in the wordnets of the respective languages and their wordnets are aligned to Princeton WordNet, and thus to the Romanian wordnet.

Evaluating the Wordnet and CoRoLa-based Word Embedding Vectors for Romanian as Resources in the Task of Microworlds Lexicon Expansion
Elena Irimia | Maria Mitrofan | Verginica Mititelu
Proceedings of the 10th Global Wordnet Conference

Within a larger frame of facilitating human-robot interaction, we present here the creation of a core vocabulary to be learned by a robot. It is extracted from two tokenised and lemmatized scenarios pertaining to two imagined microworlds in which the robot is supposed to play an assistive role. We also evaluate two resources for their utility for expanding this vocabulary so as to better cope with the robot’s communication needs. The language under study is Romanian and the resources used are the Romanian wordnet and word embedding vectors extracted from the large representative corpus of contemporary Romanian, CoRoLa. The evaluation is made for two situations: one in which the words are not semantically disambiguated before expanding the lexicon, and another one in which they are disambiguated with senses from the Romanian wordnet. The appropriateness of each resource is discussed.

RACAI’s System at PharmaCoNER 2019
Radu Ion | Vasile Florian Păiș | Maria Mitrofan
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

This paper describes the Named Entity Recognition system of the Institute for Artificial Intelligence “Mihai Drăgănescu” of the Romanian Academy (RACAI for short). Our best F1 score of 0.84984 was achieved using an ensemble of two systems: a gazetteer-based baseline and an RNN-based NER system, developed specially for PharmaCoNER 2019. We will describe the individual systems and the ensemble algorithm, compare the final system to the current state of the art, as well as discuss our results with respect to the quality of the training data and its annotation strategy. The resulting NER system is language independent, provided that language-dependent resources and preprocessing tools exist, such as tokenizers and POS taggers.

MoNERo: a Biomedical Gold Standard Corpus for the Romanian Language
Maria Mitrofan | Verginica Barbu Mititelu | Grigorina Mitrofan
Proceedings of the 18th BioNLP Workshop and Shared Task

In an era when large amounts of data are generated daily in various fields, the biomedical field among others, linguistic resources can be exploited for various tasks of Natural Language Processing. Moreover, an increasing number of biomedical documents are available in languages other than English. To be able to extract information from natural language free text resources, methods and tools are needed for a variety of languages. This paper presents the creation of the MoNERo corpus, a gold standard biomedical corpus for Romanian, annotated with both part of speech tags and named entities. MoNERo comprises 154,825 morphologically annotated tokens and 23,188 entity annotations belonging to four entity semantic groups corresponding to UMLS Semantic Groups.

Hear about Verbal Multiword Expressions in the Bulgarian and the Romanian Wordnets Straight from the Horse’s Mouth
Verginica Barbu Mititelu | Ivelina Stoyanova | Svetlozara Leseva | Maria Mitrofan | Tsvetana Dimitrova | Maria Todorova
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

In this paper we focus on verbal multiword expressions (VMWEs) in Bulgarian and Romanian as reflected in the wordnets of the two languages. The annotation of VMWEs relies on the classification defined within the PARSEME Cost Action. After outlining the properties of various types of VMWEs, a cross-language comparison is drawn, aimed to highlight the similarities and the differences between Bulgarian and Romanian with respect to the lexicalization and distribution of VMWEs. The contribution of this work is in outlining essential features of the description and classification of VMWEs and the cross-language comparison at the lexical level, which is essential for the understanding of the need for uniform annotation guidelines and a viable procedure for validation of the annotation.

2018

BioRo: The Biomedical Corpus for the Romanian Language
Maria Mitrofan | Dan Tufiş
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Adapting the TTL Romanian POS Tagger to the Biomedical Domain
Maria Mitrofan | Radu Ion
Proceedings of the Biomedical NLP Workshop associated with RANLP 2017

This paper presents the adaptation of the Hidden Markov Models-based TTL part-of-speech tagger to the biomedical domain. TTL is a text processing platform that performs sentence splitting, tokenization, POS tagging, chunking and Named Entity Recognition (NER) for a number of languages, including Romanian. The POS tagging accuracy obtained by the TTL POS tagger exceeds 97% when TTL’s baseline model is updated with training information from a Romanian biomedical corpus. This corpus is developed in the context of the CoRoLa (a reference corpus for the contemporary Romanian language) project. Informative description and statistics of the Romanian biomedical corpus are also provided.

Bootstrapping a Romanian Corpus for Medical Named Entity Recognition
Maria Mitrofan
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Named Entity Recognition (NER) is an important component of natural language processing (NLP), with applicability in the biomedical domain, enabling knowledge discovery from medical texts. Because there are only a few linguistic resources specific to the biomedical domain for the Romanian language, a sub-corpus specific to this domain was created. In this paper we present a newly developed Romanian sub-corpus for medical-domain NER, which is a valuable asset for the field of biomedical text processing. We provide a description of the sub-corpus and informative statistics about its data composition, and we evaluate an automatic NER tool on the newly created resource.