Stefan Daniel Dumitrescu

Also published as: Ștefan Daniel Dumitrescu, Ștefan Dumitrescu, Ştefan Daniel Dumitrescu, Stefan Dumitrescu

2024

Fine-Tuning and Retrieval Augmented Generation for Question Answering Using Affordable Large Language Models
Tiberiu Boros | Radu Chivereanu | Stefan Dumitrescu | Octavian Purcaru
Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024

We present Sherlock, our system for the UNLP 2024 Shared Task on Question Answering, which won first place. We employ a mix of methods, from using automatically translated datasets to perform supervised fine-tuning and direct preference optimization on instruction-tuned models, to model weight merging and retrieval augmented generation. We present and motivate our chosen sequence of steps, as well as an ablation study to understand the effect of each additional step. The resulting model and code are made publicly available (download links provided in the paper).

2022

RED v2: Enhancing RED Dataset for Multi-Label Emotion Detection
Alexandra Ciobotaru | Mihai Vlad Constantinescu | Liviu P. Dinu | Stefan Dumitrescu
Proceedings of the Thirteenth Language Resources and Evaluation Conference

RED (Romanian Emotion Dataset) is a machine learning-based resource developed for the automatic detection of emotions in Romanian texts, containing single-label annotated tweets with one of the following emotions: joy, fear, sadness, anger and neutral. In this work, we propose REDv2, an open-source extension of RED that adds two more emotions, trust and surprise, and widens the annotation schema so that the resulting dataset is multi-label. We show the overall reliability of our dataset by computing inter-annotator agreements per tweet using a formula suitable for our annotation setup and we aggregate all annotators’ opinions into two variants of ground truth, one suitable for multi-label classification and the other suitable for text regression. We propose strong baselines with two transformer models, the Romanian BERT and the multilingual XLM-Roberta model, in both categorical and regression settings.

2020

Introducing RONEC - the Romanian Named Entity Corpus
Stefan Daniel Dumitrescu | Andrei-Marius Avram
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present RONEC - the Named Entity Corpus for the Romanian language. The corpus contains over 26000 entities in ~5000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copyright-free newspaper, covering several styles. This corpus represents the first initiative in the Romanian language space specifically targeted at named entity recognition. It is available in BRAT and CoNLL-U Plus formats, and it is free to use and extend at github.com/dumitrescustefan/ronec

The birth of Romanian BERT
Stefan Dumitrescu | Andrei-Marius Avram | Sampo Pyysalo
Findings of the Association for Computational Linguistics: EMNLP 2020

Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets. We open-source not only the model itself, but also a repository that contains information on how to obtain the corpus, fine-tune and use this model in production (with practical examples), and how to fully replicate the evaluation process.

2018

NLP-Cube: End-to-End Raw Text Processing With Neural Networks
Tiberiu Boros | Stefan Daniel Dumitrescu | Ruxandra Burtica
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We introduce NLP-Cube: an end-to-end Natural Language Processing framework, evaluated in CoNLL’s “Multilingual Parsing from Raw Text to Universal Dependencies 2018” Shared Task. It performs sentence splitting, tokenization, compound word expansion, lemmatization, tagging and parsing. Based entirely on recurrent neural networks, written in Python, this ready-to-use open source system is freely available on GitHub. For each task we describe and discuss its specific network architecture, closing with an overview of the results obtained in the competition.

Attention-free encoder decoder for morphological processing
Stefan Daniel Dumitrescu | Tiberiu Boros
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

2017

CASSANDRA: A multipurpose configurable voice-enabled human-computer-interface
Tiberiu Boros | Stefan Daniel Dumitrescu | Sonia Pipa
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

Voice-enabled human-computer interfaces (HCI) that integrate automatic speech recognition, text-to-speech synthesis and natural language understanding have become a commodity, introduced by the immersion of smart phones and other gadgets in our daily lives. Smart assistants are able to respond to simple queries (similar to text-based question-answering systems), perform simple tasks (call a number, reject a call etc.) and help organize appointments. With this paper we introduce a newly created process automation platform that enables the user to control applications and home appliances and to query the system for information using a natural voice interface. We offer an overview of the technologies that enabled us to construct our system and we present different usage scenarios in home and office environments.

Fast and Accurate Decision Trees for Natural Language Processing Tasks
Tiberiu Boros | Stefan Daniel Dumitrescu | Sonia Pipa
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Decision trees have been previously employed in many machine-learning tasks such as part-of-speech tagging, lemmatization, morphological-attribute resolution, letter-to-sound conversion and statistical-parametric speech synthesis. In this paper we introduce an optimized tree-computation algorithm, which is based on the original ID3 algorithm. We also introduce a tree-pruning method that uses a development set to delete nodes from over-fitted models. The latter algorithm also uses a result-caching method for speed-up. Our algorithm is almost 200 times faster than a naive implementation and yields accurate results on our test datasets.

RACAI’s Natural Language Processing pipeline for Universal Dependencies
Stefan Daniel Dumitrescu | Tiberiu Boros | Dan Tufis
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper presents RACAI’s approach, experiments and results at the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. We handle raw text and we cover tokenization, sentence splitting, word segmentation, tagging, lemmatization and parsing. All results are reported under strict training, development and testing conditions, in which the corpora provided for the shared task are used “as is”, without any modifications to the composition of the train and development sets.

2016

The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language
Dan Tufiș | Verginica Barbu Mititelu | Elena Irimia | Ștefan Daniel Dumitrescu | Tiberiu Boroș
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The article describes the current status of a large national project, CoRoLa, aiming at building a reference corpus for the contemporary Romanian language. Unlike many other national corpora, CoRoLa contains only IPR-cleared texts and speech data, obtained from some of the country’s most representative publishing houses, broadcasting agencies, editorial offices, newspapers and popular bloggers. For the written component 500 million tokens are targeted and for the oral one 300 hours of recordings. The choice of texts is done according to their functional style, domain and subdomain, also with an eye to the international practice. A metadata file (following the CMDI model) is associated with each text file. Collected texts are cleaned and transformed into a format compatible with the tools for automatic processing (segmentation, tokenization, lemmatization, part-of-speech tagging). The paper also presents up-to-date statistics about the structure of the corpus almost two years before its official launch. The corpus will be freely available for searching. Users will be able to download the results of their searches and those original files when not against stipulations in the protocols we have with text providers.

RACAI Entry for the IWSLT 2016 Shared Task
Sonia Pipa | Alin Florentin Vasile | Ioana Ionașcu | Stefan Daniel Dumitrescu | Tiberiu Boros
Proceedings of the 13th International Conference on Spoken Language Translation

Spoken Language Translation is currently a hot topic in the research community. This task is very complex, involving automatic speech recognition, text-normalization and machine translation. We present our speech translation system, which was compared against the other systems participating in the IWSLT 2016 Shared Task. We introduce our ASR system for English and our MT system for English to French (En-Fr) and English to German (En-De) language pairs. Additionally, for the English to French Challenge we introduce a methodology that enables the enhancement of statistical phrase-based translation with translation equivalents deduced from monolingual corpora using neural word embedding.

2014

RSS-TOBI - A Prosodically Enhanced Romanian Speech Corpus
Tiberiu Boroș | Adriana Stan | Oliver Watts | Stefan Daniel Dumitrescu
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper introduces a recent development of a Romanian speech corpus to include prosodic annotations of the speech data in the form of ToBI labels. We describe the methodology of determining the required pitch patterns that are common for the Romanian language, annotate the speech resource, and then provide a comparison of two text-to-speech synthesis systems to establish the benefits of adding this type of information to our speech resource. The result is a publicly available speech dataset which can be used to further develop speech synthesis systems or to automatically learn the prediction of ToBI labels from text in the Romanian language.

RACAI GEC – A hybrid approach to Grammatical Error Correction
Tiberiu Boroș | Stefan Daniel Dumitrescu | Adrian Zafiu | Verginica Barbu Mititelu | Ionut Paul Văduva
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task

News about the Romanian Wordnet
Verginica Barbu Mititelu | Ștefan Daniel Dumitrescu | Dan Tufiș
Proceedings of the Seventh Global Wordnet Conference

2013

Wikipedia as an SMT Training Corpus
Dan Tufiș | Radu Ion | Ștefan Dumitrescu | Dan Ștefănescu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

Romanian to English automatic MT experiments at IWSLT12 – system description paper
Ştefan Daniel Dumitrescu | Radu Ion | Dan Ştefănescu | Tiberiu Boroş | Dan Tufiş
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

The paper presents the system developed by RACAI for the IWSLT 2012 competition, TED task, MT track, Romanian to English translation. We describe the starting baseline phrase-based SMT system, the experiments conducted to adapt the language and translation models and our post-translation cascading system designed to improve the translation without external resources. We further present our attempts at creating a better controlled decoder than the open-source Moses system offers.

Cascaded Phrase-Based Statistical Machine Translation Systems
Dan Tufiş | Ștefan Daniel Dumitrescu
Proceedings of the 16th Annual Conference of the European Association for Machine Translation