Radu Ion

2024

A Cross-model Study on Learning Romanian Parts of Speech with Transformer Models
Radu Ion | Verginica Barbu Mititelu | Vasile Păiş | Elena Irimia | Valentin Badea
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)

This paper attempts to determine experimentally whether POS tagging of unseen words achieves accuracy comparable to that for words that were rarely seen in the training set (i.e. frequency less than 5) or more frequently seen (i.e. frequency greater than 10). To compare accuracies objectively, we use the odds ratio statistic and its confidence interval to show that the odds of being correct on unseen words are close to the odds of being correct on rarely seen words. To train the POS taggers, we use different Romanian BERT models that are freely available on HuggingFace.
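
The odds ratio comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard Wald confidence interval on the log odds ratio, which may differ from the exact procedure used in the paper. The function name and the counts are hypothetical and are not taken from the paper.

```python
import math


def odds_ratio_with_ci(correct_1, wrong_1, correct_2, wrong_2, z=1.96):
    """Odds ratio of group 1 vs. group 2 with a Wald confidence interval.

    The interval is computed on the log scale and mapped back with exp();
    z=1.96 corresponds to an approximate 95% interval.
    """
    odds_1 = correct_1 / wrong_1
    odds_2 = correct_2 / wrong_2
    ratio = odds_1 / odds_2
    # Standard error of log(odds ratio) from the four cell counts.
    std_err = math.sqrt(1 / correct_1 + 1 / wrong_1 + 1 / correct_2 + 1 / wrong_2)
    low = math.exp(math.log(ratio) - z * std_err)
    high = math.exp(math.log(ratio) + z * std_err)
    return ratio, low, high


# Hypothetical counts: tagger decisions on unseen words vs. rarely seen words.
ratio, low, high = odds_ratio_with_ci(
    correct_1=900, wrong_1=100,   # unseen words (illustrative numbers)
    correct_2=950, wrong_2=50,    # rarely seen words (illustrative numbers)
)
print(f"odds ratio = {ratio:.3f}, CI = [{low:.3f}, {high:.3f}]")
```

If the interval contains 1, the odds of a correct tag in the two groups are not distinguishable at the chosen level. An interval entirely below or above 1 indicates a difference.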

2022

The paper presents an open-domain Question Answering system for Romanian, answering COVID-19 related questions. The QA system pipeline involves automatic question processing, automatic query generation, web searching for the top 10 most relevant documents and answer extraction using a fine-tuned BERT model for Extractive QA, trained on a COVID-19 data set that we have manually created. The paper will present the QA system and its integration with the Romanian language technologies portal RELATE, the COVID-19 data set and different evaluations of the QA performance.

This paper is about a multilingual chatbot developed for public administration within the CEF-funded project ENRICH4ALL. We argue for multilingual chatbots empowered through MT and discuss the integration of the CEF eTranslation service in a chatbot solution.

2020

TermEval 2020: RACAI’s automatic term extraction system
Vasile Pais | Radu Ion
Proceedings of the 6th International Workshop on Computational Terminology

This paper describes RACAI’s automatic term extraction system, which participated in the TermEval 2020 shared task on English monolingual term extraction. We discuss the system architecture and some of the challenges that we faced, and present our results in the English competition.

MWSA Task at GlobaLex 2020: RACAI’s Word Sense Alignment System using a Similarity Measurement of Dictionary Definitions
Vasile Pais | Dan Tufiș | Radu Ion
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

This paper describes RACAI’s word sense alignment system, which participated in the Monolingual Word Sense Alignment shared task organized at the GlobaLex 2020 workshop. We discuss the system architecture and some of the challenges that we faced, and present our results on several of the languages available for the task.

A Processing Platform Relating Data and Tools for Romanian Language
Vasile Păiș | Radu Ion | Dan Tufiș
Proceedings of the 1st International Workshop on Language Technology Platforms

This paper presents RELATE (http://relate.racai.ro), a high-performance natural language platform designed for the Romanian language. It is meant both for demonstrating available services, from text-span annotations to syntactic dependency trees as well as playing or automatically synthesizing Romanian words, and for developing new annotated corpora. It also incorporates the search engines for the large COROLA reference corpus of contemporary Romanian and the Romanian wordnet. It integrates multiple text and speech processing modules and exposes their functionality through a web interface designed for the linguist researcher. It makes use of a scheduler-runner architecture, allowing processing to be distributed across multiple computing nodes. A series of input/output converters allows large corpora to be loaded, processed and exported according to user preferences.

Collection and Annotation of the Romanian Legal Corpus
Dan Tufiș | Maria Mitrofan | Vasile Păiș | Radu Ion | Andrei Coman
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present the Romanian legislative corpus, which is a valuable linguistic asset for the development of machine translation systems, especially for under-resourced languages. The knowledge that can be extracted from this resource is necessary for a deeper understanding of how law terminology is used and how it can be made more consistent. At this moment the corpus contains more than 140k documents representing the legislative body of Romania. This corpus is processed and annotated at different levels: linguistically (tokenized, lemmatized and POS-tagged), dependency parsed, chunked, with named entities identified, and labeled with IATE terms and EUROVOC descriptors. Each annotated document is in CoNLL-U Plus format with 14 columns: in addition to the standard 10 columns, four other types of annotations were added. Moreover, the repository will be periodically updated as new legislative texts are published. These will be automatically collected and transmitted to the processing and annotation pipeline. Access to the corpus will be provided through the ELRC infrastructure.

This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
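
As an illustration of the 14-column layout described above, the following minimal sketch reads CoNLL-U Plus text by taking the column names from the `# global.columns` header line. The four extra column names (`EX:NE`, `EX:CHUNK`, `EX:IATE`, `EX:EUROVOC`) and the sample row are hypothetical placeholders, not the names or data used in the MARCELL corpora.

```python
STANDARD = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
            "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]
# Hypothetical names for the four extra annotation columns.
EXTRA = ["EX:NE", "EX:CHUNK", "EX:IATE", "EX:EUROVOC"]


def parse_conllu_plus(text):
    """Return (column names, sentences); each token is a dict keyed by column."""
    columns = None
    sentences, current = [], []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # The header declares the column layout for the whole file.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence ids, text) are skipped
        elif not line.strip():
            if current:  # a blank line closes the current sentence
                sentences.append(current)
                current = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            fields = line.split("\t")
            if len(fields) != len(columns):
                raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
            current.append(dict(zip(columns, fields)))
    if current:
        sentences.append(current)
    return columns, sentences


# Illustrative one-token document; the annotation values are made up.
sample = "\n".join([
    "# global.columns = " + " ".join(STANDARD + EXTRA),
    "# sent_id = 1",
    "\t".join(["1", "Legea", "lege", "NOUN", "Ncfsry", "_", "0",
               "root", "_", "_", "O", "B-NP", "_", "_"]),
    "",
])
columns, sentences = parse_conllu_plus(sample)
print(len(columns), sentences[0][0]["FORM"])
```

Reading the layout from the header rather than hard-coding it keeps the sketch usable if a sub-corpus orders or names its extra columns differently.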

2019

RACAI’s System at PharmaCoNER 2019
Radu Ion | Vasile Florian Păiș | Maria Mitrofan
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

This paper describes the Named Entity Recognition system of the Institute for Artificial Intelligence “Mihai Drăgănescu” of the Romanian Academy (RACAI for short). Our best F1 score of 0.84984 was achieved using an ensemble of two systems: a gazetteer-based baseline and an RNN-based NER system, developed specially for PharmaCoNER 2019. We will describe the individual systems and the ensemble algorithm, compare the final system to the current state of the art, and discuss our results with respect to the quality of the training data and its annotation strategy. The resulting NER system is language independent, provided that language-dependent resources and preprocessing tools exist, such as tokenizers and POS taggers.

2018

Ensemble Romanian Dependency Parsing with Neural Networks
Radu Ion | Elena Irimia | Verginica Barbu Mititelu
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Adapting the TTL Romanian POS Tagger to the Biomedical Domain
Maria Mitrofan | Radu Ion
Proceedings of the Biomedical NLP Workshop associated with RANLP 2017

This paper presents the adaptation of the Hidden Markov Models-based TTL part-of-speech tagger to the biomedical domain. TTL is a text processing platform that performs sentence splitting, tokenization, POS tagging, chunking and Named Entity Recognition (NER) for a number of languages, including Romanian. The POS tagging accuracy obtained by the TTL POS tagger exceeds 97% when TTL’s baseline model is updated with training information from a Romanian biomedical corpus. This corpus is developed in the context of the CoRoLa (a reference corpus for the contemporary Romanian language) project. An informative description and statistics of the Romanian biomedical corpus are also provided.

2013

Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
Tiberiu Boros | Radu Ion | Dan Tufis
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Wikipedia as an SMT Training Corpus
Dan Tufiș | Radu Ion | Ștefan Dumitrescu | Dan Ștefănescu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

Hybrid Parallel Sentence Mining from Comparable Corpora
Dan Ștefănescu | Radu Ion | Sabine Hunsicker
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

Romanian to English automatic MT experiments at IWSLT12 – system description paper
Ştefan Daniel Dumitrescu | Radu Ion | Dan Ştefănescu | Tiberiu Boroş | Dan Tufiş
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

The paper presents the system developed by RACAI for the IWSLT 2012 competition, TED task, MT track, Romanian to English translation. We describe the starting baseline phrase-based SMT system, the experiments conducted to adapt the language and translation models, and our post-translation cascading system designed to improve the translation without external resources. We further present our attempts at creating a better controlled decoder than the open-source Moses system offers.

ROMBAC: The Romanian Balanced Annotated Corpus
Radu Ion | Elena Irimia | Dan Ştefănescu | Dan Tufiș
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This article describes the collecting, processing and validation of a large balanced corpus for Romanian. The annotation types and structure of the corpus are briefly reviewed. It was constructed at the Research Institute for Artificial Intelligence of the Romanian Academy in the context of an international project (METANET4U). The processing covers tokenization, POS-tagging, lemmatization and chunking. The corpus is in XML format generated by our in-house annotation tools; the corpus encoding schema is XCES compliant and the metadata specification is conformant to the METANET recommendations. To the best of our knowledge, this is the first large and richly annotated corpus for Romanian. ROMBAC is intended to be the foundation of a linguistic environment containing a reference corpus for contemporary Romanian and a comprehensive collection of interoperable processing tools.

PEXACC: A Parallel Sentence Mining Algorithm from Comparable Corpora
Radu Ion
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Extracting parallel data from comparable corpora in order to enrich existing statistical translation models is an avenue that has attracted a lot of research in recent years. There are experiments that convincingly show how parallel data extracted from comparable corpora is able to improve statistical machine translation. Yet, the existing body of research on parallel sentence mining from comparable corpora does not take into account the degree of comparability of the corpus being processed or the computation time it takes to extract parallel sentences from a corpus of a given size. We will show that the performance of a parallel sentence extractor crucially depends on the degree of comparability, such that it is more difficult to process a weakly comparable corpus than a strongly comparable corpus. In this paper we describe PEXACC, a distributed (running on multiple CPUs), trainable parallel sentence/phrase extractor from comparable corpora. PEXACC is freely available for download with the ACCURAT Toolkit, a collection of MT-related tools developed in the ACCURAT project.

Nowadays, there are hundreds of Natural Language Processing applications and resources for different languages that are developed and/or used, almost exclusively with a few but notable exceptions, by their creators. Assuming that the right to use a particular application or resource is licensed by the rightful owner, the user is faced with the often not so easy task of interfacing it with his/her own systems. Even if standards are defined that provide a unified way of encoding resources, there are few cases in which the resources are actually coded in conformance to the standard (and, at the present time, there is no such thing as general NLP application interoperability). The Semantic Web came with the promise that the web will be a universal medium for information exchange whatever its content. In this context, the present article outlines a collection of linguistic web services for Romanian and English, developed at the Research Institute for AI of the Romanian Academy (RACAI), which are ready to provide a standardized way of calling particular NLP operations and extracting the results without caring about what exactly is going on in the background.

Unsupervised Lexical Acquisition for Part of Speech Tagging
Dan Tufiş | Elena Irimia | Radu Ion | Alexandru Ceauşu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

It is known that POS tagging is not very accurate for unknown words (words which the POS tagger has not seen in the training corpora). Thus, a first step to improve the tagging accuracy would be to extend the coverage of the tagger’s learned lexicon. It turns out that, through the use of a simple procedure, one can extend this lexicon without using additional, hard to obtain, hand-validated training corpora. The basic idea consists of merely adding new words along with their (correct) POS tags to the lexicon and trying to estimate the lexical distribution of these words according to similar ambiguity classes already present in the lexicon. We present a method for automatically acquiring high quality POS tagging lexicons based on morphological analysis and generation. Currently, this procedure works on Romanian, for which we have the required paradigmatic generation procedure, but the architecture remains general in the sense that, given the appropriate substitutes for the morphological generator and POS tagger, one should obtain similar results.

2007

RACAI: Meaning Affinity Models
Radu Ion | Dan Tufiş
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

Improved Lexical Alignment by Combining Multiple Reified Alignments
Dan Tufiş | Radu Ion | Alexandru Ceauşu | Dan Ştefănescu
11th Conference of the European Chapter of the Association for Computational Linguistics

Dependency-Based Phrase Alignment
Radu Ion | Alexandru Ceauşu | Dan Tufiş
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Phrase alignment is the task that requires the constituent phrases of two halves of a bitext to be aligned. In order to align phrases, one must discover them first, and this article presents a method of aligning phrases that are discovered automatically. Here, the notion of a 'phrase' will be understood as being given by a subtree of a dependency-like structure of a sentence called linkage. To discover phrases, we will make use of two distinct, language independent methods: the IBM-1 model (Brown et al., 1993) adapted to detect linkages and Constrained Lexical Attraction Models (Ion & Barbu Mititelu, 2006). The methods will be combined and the resulting model will be used to annotate the bitext. The accuracy of phrase alignment will be evaluated by obtaining word alignments from link alignments and then by checking the F-measure of the latter word aligner.