Andrei-Marius Avram - ACL Anthology

Andrei-Marius Avram

2025

MoRoVoc: A Large Dataset for Geographical Variation Identification of the Spoken Romanian Language
Andrei-Marius Avram | Bănescu Ema-Ioana | Anda-Teodora Robea | Dumitru-Clementin Cercel | Mihaela-Claudia Cercel
Findings of the Association for Computational Linguistics: EMNLP 2025

This paper introduces MoRoVoc, the largest dataset for analyzing the regional variation of spoken Romanian. It has more than 93 hours of audio and 88,192 audio samples, balanced between the Romanian language spoken in Romania and the Republic of Moldova. We further propose a multi-target adversarial training framework for speech models that incorporates demographic attributes (i.e., age and gender of the speakers) as adversarial targets, making models discriminative for primary tasks while remaining invariant to secondary attributes. The adversarial coefficients are dynamically adjusted via meta-learning to optimize performance. Our approach yields notable gains: Wav2Vec2-Base achieves 78.21% accuracy for the variation identification of spoken Romanian using gender as an adversarial target, while Wav2Vec2-Large reaches 93.08% accuracy for gender classification when employing both dialect and age as adversarial objectives.

RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation
Andrei-Marius Avram | Mircea Timpuriu | Andreea Iuga | Vlad-Cristian Matei | Iulian-Marius Taiatu | Tudor Găină | Dumitru-Clementin Cercel | Mihaela-Claudia Cercel | Florin Pop
Proceedings of the 31st International Conference on Computational Linguistics

Using supervised automatic summarisation methods requires sufficient corpora that include pairs of documents and their summaries. Similarly to many tasks in natural language processing, most of the datasets available for summarization are in English, posing challenges for developing summarization models in other languages. Thus, in this work, we introduce RoLargeSum, a novel large-scale summarization dataset for the Romanian language crawled from various publicly available news websites from Romania and the Republic of Moldova that were thoroughly cleaned to ensure a high-quality standard. RoLargeSum contains more than 615K news articles, together with their summaries, as well as their headlines, keywords, dialect, and other metadata that we found on the targeted websites. We further evaluated the performance of several BART variants and open-source large language models on RoLargeSum for benchmarking purposes. We manually evaluated the results of the best-performing system to gain insight into the potential pitfalls of this data set and future development.

2024

RoQLlama: A Lightweight Romanian Adapted Language Model
George-Andrei Dima | Andrei-Marius Avram | Cristian-George Craciun | Dumitru-Clementin Cercel
Findings of the Association for Computational Linguistics: EMNLP 2024

The remarkable achievements obtained by open-source large language models (LLMs) in recent years have predominantly been concentrated on tasks involving the English language. In this paper, we aim to advance the performance of Llama2 models on Romanian tasks. We tackle the problem of reduced computing resources by using QLoRA for training. We release RoQLlama-7b, a quantized LLM, which shows equal or improved results compared to its full-sized counterpart when tested on seven Romanian downstream tasks in the zero-shot setup. Also, it consistently achieves higher average scores across all few-shot prompts. Additionally, we introduce a novel Romanian dataset, namely RoMedQA, which contains single-choice medical questions in Romanian.

2022

RACAI@SMM4H’22: Tweets Disease Mention Detection Using a Neural Lateral Inhibitory Mechanism
Andrei-Marius Avram | Vasile Pais | Maria Mitrofan
Proceedings of the Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper presents our system employed for the Social Media Mining for Health (SMM4H) 2022 competition Task 10 - SocialDisNER. The goal of the task was to improve the detection of diseases in tweets. Because the tweets were in Spanish, we approached this problem using a system that relies on a pre-trained multilingual model and is fine-tuned using the recently introduced lateral inhibition layer. We further experimented on this task by employing a conditional random field on top of the system and using a voting-based ensemble that contains various architectures. The evaluation results outlined that our best performing model obtained 83.7% F1-strict on the validation set and 82.1% F1-strict on the test set.

Legal Named Entity Recognition with Multi-Task Domain Adaptation
Răzvan-Alexandru Smădu | Ion-Robert Dinică | Andrei-Marius Avram | Dumitru-Clementin Cercel | Florin Pop | Mihaela-Claudia Cercel
Proceedings of the Natural Legal Language Processing Workshop 2022

Named Entity Recognition (NER) is a well-explored area from Information Retrieval and Natural Language Processing with an extensive research community. Despite that, few languages, such as English and German, are well-resourced, whereas many other languages, such as Romanian, have scarce resources, especially in domain-specific applications. In this work, we address the NER problem in the legal domain from both Romanian and German languages and evaluate the performance of our proposed method based on domain adaptation. We employ multi-task learning to jointly train a neural network on two legal and general domains and perform adaptation among them. The results show that domain adaptation increase performances by a small amount, under 1%, while considerable improvements are in the recall metric.

Distilling the Knowledge of Romanian BERTs Using Multiple Teachers
Andrei-Marius Avram | Darius Catrina | Dumitru-Clementin Cercel | Mihai Dascalu | Traian Rebedea | Vasile Pais | Dan Tufis
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Running large-scale pre-trained language models in computationally constrained environments remains a challenging problem yet to be addressed, while transfer learning from these models has become prevalent in Natural Language Processing tasks. Several solutions, including knowledge distillation, network quantization, or network pruning have been previously proposed; however, these approaches focus mostly on the English language, thus widening the gap when considering low-resource languages. In this work, we introduce three light and fast versions of distilled BERT models for the Romanian language: Distil-BERT-base-ro, Distil-RoBERT-base, and DistilMulti-BERT-base-ro. The first two models resulted from the individual distillation of knowledge from two base versions of Romanian BERTs available in literature, while the last one was obtained by distilling their ensemble. To our knowledge, this is the first attempt to create publicly available Romanian distilled BERT models, which were thoroughly evaluated on five tasks: part-of-speech tagging, named entity recognition, sentiment analysis, semantic textual similarity, and dialect identification. Our experimental results argue that the three distilled models offer performance comparable to their teachers, while being twice as fast on a GPU and ~35% smaller. In addition, we further test the similarity between the predictions of our students versus their teachers by measuring their label and probability loyalty, together with regression loyalty - a new metric introduced in this work.

Romanian Language Translation in the RELATE Platform
Vasile Pais | Maria Mitrofan | Andrei-Marius Avram
Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)

This paper presents the usage of the RELATE platform for translation tasks involving the Romanian language. Using this platform, it is possible to perform text and speech data translations, either for single documents or for entire corpora. Furthermore, the platform was successfully used in international projects to create new resources useful for Romanian language translation.

Use Case: Romanian Language Resources in the LOD Paradigm
Verginica Barbu Mititelu | Elena Irimia | Vasile Pais | Andrei-Marius Avram | Maria Mitrofan
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

In this paper, we report on (i) the conversion of Romanian language resources to the Linked Open Data specifications and requirements, on (ii) their publication and (iii) interlinking with other language resources (for Romanian or for other languages). The pool of converted resources is made up of the Romanian Wordnet, the morphosyntactic and phonemic lexicon RoLEX, four treebanks, one for the general language (the Romanian Reference Treebank) and others for specialised domains (SiMoNERo for medicine, LegalNERo for the legal domain, PARSEME-Ro for verbal multiword expressions), frequency information on lemmas and tokens and word embeddings as extracted from the reference corpus for contemporary Romanian (CoRoLa) and a bi-modal (text and speech) corpus. We also present the limitations coming from the representation of the resources in Linked Data format. The metadata of LOD resources have been published in the LOD Cloud. The resources are available for download on our website and a SPARQL endpoint is also available for querying them.

An Open-Domain QA System for e-Governance
Radu Ion | Andrei-Marius Avram | Vasile Păis | Maria Mitrofan | Verginica Barbu Mititelu | Elena Irimia | Valentin Badea
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

The paper presents an open-domain Question Answering system for Romanian, answering COVID-19 related questions. The QA system pipeline involves automatic question processing, automatic query generation, web searching for the top 10 most relevant documents and answer extraction using a fine-tuned BERT model for Extractive QA, trained on a COVID-19 data set that we have manually created. The paper will present the QA system and its integration with the Romanian language technologies portal RELATE, the COVID-19 data set and different evaluations of the QA performance.

2021

Dialect Identification through Adversarial Learning and Knowledge Distillation on Romanian BERT
George-Eduard Zaharia | Andrei-Marius Avram | Dumitru-Clementin Cercel | Traian Rebedea
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

Dialect identification is a task with applicability in a vast array of domains, ranging from automatic speech recognition to opinion mining. This work presents our architectures used for the VarDial 2021 Romanian Dialect Identification subtask. We introduced a series of solutions based on Romanian or multilingual Transformers, as well as adversarial training techniques. At the same time, we experimented with a knowledge distillation tool in order to check whether a smaller model can maintain the performance of our best approach. Our best solution managed to obtain a weighted F1-score of 0.7324, allowing us to obtain the 2nd place on the leaderboard.

UPB at SemEval-2021 Task 8: Extracting Semantic Information on Measurements as Multi-Turn Question Answering
Andrei-Marius Avram | George-Eduard Zaharia | Dumitru-Clementin Cercel | Mihai Dascalu
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

Extracting semantic information on measurements and counts is an important topic in terms of analyzing scientific discourses. The 8th task of SemEval-2021: Counts and Measurements (MeasEval) aimed to boost research in this direction by providing a new dataset on which participants train their models to extract meaningful information on measurements from scientific texts. The competition is composed of five subtasks that build on top of each other: (1) quantity span identification, (2) unit extraction from the identified quantities and their value modifier classification, (3) span identification for measured entities and measured properties, (4) qualifier span identification, and (5) relation extraction between the identified quantities, measured entities, measured properties, and qualifiers. We approached these challenges by first identifying the quantities, extracting their units of measurement, classifying them with corresponding modifiers, and afterwards using them to jointly solve the last three subtasks in a multi-turn question answering manner. Our best performing model obtained an overlapping F1-score of 36.91% on the test set.

PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors
Andrei-Marius Avram | Vasile Pais | Dan Ioan Tufis
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

EuroVoc is a multilingual thesaurus that was built for organizing the legislative documentary of the European Union institutions. It contains thousands of categories at different levels of specificity and its descriptors are targeted by legal texts in almost thirty languages. In this work we propose a unified framework for EuroVoc classification on 22 languages by fine-tuning modern Transformer-based pretrained language models. We study extensively the performance of our trained models and show that they significantly improve the results obtained by a similar tool - JEX - on the same dataset. The code and the fine-tuned models were open sourced, together with a programmatic interface that eases the process of loading the weights of a trained model and of classifying a new document.

2020

Exploring the Power of Romanian BERT for Dialect Identification
George-Eduard Zaharia | Andrei-Marius Avram | Dumitru-Clementin Cercel | Traian Rebedea
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Dialect identification represents a key aspect for improving a series of tasks, for example, opinion mining, considering that the location of the speaker can greatly influence the attitude towards a subject. In this work, we describe the systems developed by our team for VarDial 2020: Romanian Dialect Identification, a task specifically created for challenging participants to solve the previously mentioned issue. More specifically, we introduce a series of neural systems based on Transformers, that combine a BERT model exclusively pre-trained on the Romanian language with techniques such as adversarial training or character-level embeddings. By using these approaches, we were able to obtain a 0.6475 macro F1 score on the test dataset, thus allowing us to be ranked 5th out of 8 participant teams.

Approaching SMM4H 2020 with Ensembles of BERT Flavours
George-Andrei Dima | Andrei-Marius Avram | Dumitru-Clementin Cercel
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

This paper describes our solutions submitted to the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020. We participated in the following tasks: Task 1 aimed at classifying if a tweet reports medications or not, Task 2 (only for the English dataset) aimed at discriminating if a tweet mentions adverse effects or not, and Task 5 aimed at recognizing if a tweet mentions birth defects or not. Our work focused on studying different neural network architectures based on various flavors of bidirectional Transformers (i.e., BERT), in the context of the previously mentioned classification tasks. For Task 1, we achieved an F1-score (70.5%) above the mean performance of the best scores made by all teams, whereas for Task 2, we obtained an F1-score of 37%. Also, we achieved a micro-averaged F1-score of 62% for Task 5.

UPB at SemEval-2020 Task 6: Pretrained Language Models for Definition Extraction
Andrei-Marius Avram | Dumitru-Clementin Cercel | Costin Chiru
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This work presents our contribution in the context of the 6th task of SemEval-2020: Extracting Definitions from Free Text in Textbooks (DeftEval). This competition consists of three subtasks with different levels of granularity: (1) classification of sentences as definitional or non-definitional, (2) labeling of definitional sentences, and (3) relation classification. We use various pretrained language models (i.e., BERT, XLNet, RoBERTa, SciBERT, and ALBERT) to solve each of the three subtasks of the competition. Specifically, for each language model variant, we experiment by both freezing its weights and fine-tuning them. We also explore a multi-task architecture that was trained to jointly predict the outputs for the second and the third subtasks. Our best performing model evaluated on the DeftEval dataset obtains the 32nd place for the first subtask and the 37th place for the second subtask. The code is available for further research at: https://github.com/avramandrei/DeftEval

Introducing RONEC - the Romanian Named Entity Corpus
Stefan Daniel Dumitrescu | Andrei-Marius Avram
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present RONEC - the Named Entity Corpus for the Romanian language. The corpus contains over 26000 entities in ~5000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copy-right free newspaper, covering several styles. This corpus represents the first initiative in the Romanian language space specifically targeted for named entity recognition. It is available in BRAT and CoNLL-U Plus formats, and it is free to use and extend at github.com/dumitrescustefan/ronec

UPB at FinCausal-2020, Tasks 1 & 2: Causality Analysis in Financial Documents using Pretrained Language Models
Marius Ionescu | Andrei-Marius Avram | George-Andrei Dima | Dumitru-Clementin Cercel | Mihai Dascalu
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

Financial causality detection is centered on identifying connections between different assets from financial news in order to improve trading strategies. FinCausal 2020 - Causality Identification in Financial Documents – is a competition targeting to boost results in financial causality by obtaining an explanation of how different individual events or chain of events interact and generate subsequent events in a financial environment. The competition is divided into two tasks: (a) a binary classification task for determining whether sentences are causal or not, and (b) a sequence labeling task aimed at identifying elements related to cause and effect. Various Transformer-based language models were fine-tuned for the first task and we obtained the second place in the competition with an F1-score of 97.55% using an ensemble of five such language models. Subsequently, a BERT model was fine-tuned for the second task and a Conditional Random Field model was used on top of the generated language features; the system managed to identify the cause and effect relationships with an F1-score of 73.10%. We open-sourced the code and made it available at: https://github.com/avramandrei/FinCausal2020.

The birth of Romanian BERT
Stefan Dumitrescu | Andrei-Marius Avram | Sampo Pyysalo
Findings of the Association for Computational Linguistics: EMNLP 2020

Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus com-position and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets. We opensource not only the model itself, but also a repository that contains information on how to obtain the corpus, fine-tune and use this model in production (with practical examples), and how to fully replicate the evaluation process.

A Customizable WordNet Editor
Andrei-Marius Avram | Verginica Barbu Mititelu
Proceedings of the Fourth International Conference on Computational Linguistics in Bulgaria (CLIB 2020)

This paper presents an open-source wordnet editor that has been developed to ensure further expansion of the Romanian wordnet. It comes with a web interface that offers capabilities in selecting new synsets to be implemented, editing the list of literals and their sense numbers and adding these new synsets to the existing network, by importing from Princeton WordNet (and adjusting, when necessary) all the relations in which the newly created synsets and their literals are involved. The application also comes with an authorization mechanism that ensures control of the new synsets added in novice or lexicographer accounts. Although created to serve the current (more or less specific) needs in the development of the Romanian wordnet, it can be customized to fulfill new requirements from developers, either of the same wordnet or of a different one for which a similar approach is adopted.