2024
“Vorbești Românește?” A Recipe to Train Powerful Romanian LLMs with English Instructions
Mihai Masala, Denis Ilie-Ablachim, Alexandru Dima, Dragos Georgian Corlatescu, Miruna-Andreea Zavelca, Ovio Olaru, Simina-Maria Terian, Andrei Terian, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea
Findings of the Association for Computational Linguistics: EMNLP 2024
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds that in other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and to train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) with the goal of supporting and encouraging research on Romanian LLMs while concurrently creating a generalizable recipe adequate for other low- or less-resourced languages.
2023
AD-NLP: A Benchmark for Anomaly Detection in Natural Language Processing
Matei Bejan, Andrei Manolache, Marius Popescu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Deep learning models have reignited interest in Anomaly Detection research in recent years. Methods for Anomaly Detection in text have shown strong empirical results on ad-hoc anomaly setups that are usually created by downsampling some classes of a labeled dataset. This can lead to reproducibility issues and to models that are biased toward detecting particular anomalies while failing to recognize them in more sophisticated scenarios. In the present work, we provide a unified benchmark for detecting various types of anomalies, focusing on problems that can be naturally formulated as Anomaly Detection in text, ranging from syntax to stylistics. In this way, we hope to facilitate research in Text Anomaly Detection. We also evaluate and analyze two strong shallow baselines, as well as two of the current state-of-the-art neural approaches, providing insights into the knowledge the neural models are learning when performing the anomaly detection task. We provide the code for evaluation, downloading, and preprocessing the dataset at https://github.com/mateibejan1/ad-nlp/.
2022
Rethinking the Authorship Verification Experimental Setups
Florin Brad, Andrei Manolache, Elena Burceanu, Antonio Barbalau, Radu Tudor Ionescu, Marius Popescu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
One of the main drivers of the recent advances in authorship verification is the PAN large-scale authorship dataset. Despite generating significant progress in the field, inconsistent performance differences between the closed and open test sets have been reported. To address this, we improve the experimental setup by proposing five new public splits over the PAN dataset, specifically designed to isolate and identify biases related to the text topic and to the author’s writing style. We evaluate several BERT-like baselines on these splits, showing that such models are competitive with state-of-the-art authorship verification methods. Furthermore, using explainable AI, we find that these baselines are biased towards named entities. We show that models trained without the named entities obtain better results and generalize better when tested on DarkReddit, our new dataset for authorship verification.
2021
jurBERT: A Romanian BERT Model for Legal Judgement Prediction
Mihai Masala, Radu Cristian Alexandru Iacob, Ana Sabina Uban, Marina Cidota, Horia Velicu, Traian Rebedea, Marius Popescu
Proceedings of the Natural Legal Language Processing Workshop 2021
Transformer-based models have become the de facto standard in the field of Natural Language Processing (NLP). By leveraging large unlabeled text corpora, they enable efficient transfer learning, leading to state-of-the-art results on numerous NLP tasks. Nevertheless, for low-resource languages and highly specialized tasks, transformer models tend to lag behind more classical approaches (e.g., SVM, LSTM) due to the lack of the aforementioned corpora. In this paper we focus on the legal domain and introduce a Romanian BERT model pre-trained on a large specialized corpus. Our model outperforms several strong baselines for legal judgement prediction on two different corpora consisting of cases from trials involving banks in Romania.
2017
Can string kernels pass the test of time in Native Language Identification?
Radu Tudor Ionescu, Marius Popescu
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
We describe a machine learning approach for the 2017 shared task on Native Language Identification (NLI). The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from essays or speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided by the shared task organizers. For the learning stage, we choose Kernel Discriminant Analysis (KDA) over Kernel Ridge Regression (KRR), because the former classifier obtains better results than the latter on the development set. In our previous work, we have used a similar machine learning approach to achieve state-of-the-art NLI results. The goal of this paper is to demonstrate that our shallow and simple approach based on string kernels (with minor improvements) can pass the test of time and reach state-of-the-art performance in the 2017 NLI shared task, despite the recent advances in natural language processing. We participated in all three tracks, in which the competitors were allowed to use only the essays (essay track), only the speech transcripts (speech track), or both (fusion track). Using only the data provided by the organizers for training our models, we reached a macro F1 score of 86.95% in the closed essay track, a macro F1 score of 87.55% in the closed speech track, and a macro F1 score of 93.19% in the closed fusion track. With these scores, our team (UnibucKernel) ranked in the first group of teams in all three tracks, while attaining the best scores in the speech and the fusion tracks.
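The character p-gram kernels mentioned in this abstract can be illustrated with a minimal sketch. The p-spectrum kernel (the inner product of two texts' p-gram count vectors) is one standard form of string kernel, and summing it over several values of p is a simple stand-in for combining multiple kernels; the function names and the blending scheme here are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def char_pgrams(text, p):
    """Count the character p-grams (n-grams) of a string."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))

def spectrum_kernel(a, b, p=3):
    """p-spectrum kernel: inner product of p-gram count vectors."""
    ca, cb = char_pgrams(a, p), char_pgrams(b, p)
    return sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())

def blended_kernel(a, b, p_range=(2, 3, 4)):
    """Sum of spectrum kernels over a range of p — a simple way to
    combine several string kernels into one similarity score."""
    return sum(spectrum_kernel(a, b, p) for p in p_range)
```

In practice such a kernel matrix would be precomputed over all text pairs and passed to a kernel classifier (the paper uses KDA); the sketch only shows the similarity computation itself.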
2016
UnibucKernel: An Approach for Arabic Dialect Identification Based on Multiple String Kernels
Radu Tudor Ionescu, Marius Popescu
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
The most common approach in text mining classification tasks is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. Unlike the common approach, we present a method that uses only character p-grams (also known as n-grams) as features for the Arabic Dialect Identification (ADI) Closed Shared Task of the DSL 2016 Challenge. The proposed approach combines several string kernels using multiple kernel learning. In the learning stage, we try both Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR), and we choose KDA as it gives better results in a 10-fold cross-validation carried out on the training set. Our approach is shallow and simple, but the empirical results obtained in the ADI Shared Task prove that it achieves very good results. Indeed, we ranked in second place with an accuracy of 50.91% and a weighted F1 score of 51.31%. We also present improved results in this paper, which we obtained after the competition ended. Simply by adding more regularization to our model, making it more suitable for test data that comes from a different distribution than the training data, we obtain an accuracy of 51.82% and a weighted F1 score of 52.18%. Furthermore, the proposed approach has an important advantage in that it is language independent and linguistic-theory neutral, as it does not require any NLP tools.
String Kernels for Native Language Identification: Insights from Behind the Curtains
Radu Tudor Ionescu, Marius Popescu, Aoife Cahill
Computational Linguistics, Volume 42, Issue 3 - September 2016
2014
Can characters reveal your native language? A language-independent approach to native language identification
Radu Tudor Ionescu, Marius Popescu, Aoife Cahill
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
2013
The Story of the Characters, the DNA and the Native Language
Marius Popescu, Radu Tudor Ionescu
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications
2011
Studying Translationese at the Character Level
Marius Popescu
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011
2009
Comparing Statistical Similarity Measures for Stylistic Multivariate Analysis
Marius Popescu, Liviu P. Dinu
Proceedings of the International Conference RANLP-2009
What’s in a name? In some languages, grammatical gender
Vivi Nastase, Marius Popescu
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing
2008
Authorship Identification of Romanian Texts with Controversial Paternity
Liviu Dinu, Marius Popescu, Anca Dinu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this work, we propose a new strategy for the authorship identification problem and test it on an example from Romanian literature: did Radu Albala find the continuation of Mateiu Caragiale’s novel Sub pecetea tainei, or did he write the continuation himself? The proposed strategy is based on the similarity of rankings of function words; we compare the obtained results with those obtained by a learning method (namely, Support Vector Machines (SVM) with a string kernel).
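The ranking-based strategy in this abstract can be sketched as follows: rank a fixed set of function words by their frequency in each text, then compare the two rankings with a rank distance. The sketch below uses the sum of absolute rank differences over a shared word list, a simplified form of the rank distance idea; the word list, tie-breaking rule, and function names are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def fw_ranking(text, function_words):
    """Rank function words by frequency in a text (rank 1 = most frequent).
    Ties are broken alphabetically for determinism."""
    counts = Counter(w for w in text.lower().split() if w in function_words)
    ordered = sorted(function_words, key=lambda w: (-counts[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}

def rank_distance(r1, r2):
    """Sum of absolute rank differences over the shared word list
    (a simplified rank distance between two rankings)."""
    return sum(abs(r1[w] - r2[w]) for w in r1)
```

Two texts by the same author would be expected to yield similar function-word rankings and hence a small distance; identical rankings give a distance of zero.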
Rank Distance as a Stylistic Similarity
Marius Popescu, Liviu P. Dinu
Coling 2008: Companion volume: Posters
2004
Regularized Least-Squares classification for Word Sense Disambiguation
Marius Popescu
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text