2024
pdf
bib
abs
“Vorbești Românește?” A Recipe to Train Powerful Romanian LLMs with English Instructions
Mihai Masala
|
Denis Ilie-Ablachim
|
Alexandru Dima
|
Dragos Georgian Corlatescu
|
Miruna-Andreea Zavelca
|
Ovio Olaru
|
Simina-Maria Terian
|
Andrei Terian
|
Marius Leordeanu
|
Horia Velicu
|
Marius Popescu
|
Mihai Dascalu
|
Traian Rebedea
Findings of the Association for Computational Linguistics: EMNLP 2024
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) with the goal of supporting and encouraging research on Romanian LLMs while concurrently creating a generalizable recipe adequate for other low or less-resourced languages.
pdf
bib
abs
Improving Legal Judgement Prediction in Romanian with Long Text Encoders
Mihai Masala
|
Traian Rebedea
|
Horia Velicu
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
In recent years,the entire field of Natural Language Processing (NLP) has enjoyed amazing novel results achieving almost human-like performance on a variety of tasks. Legal NLP domain has also been part of this process, as it has seen an impressive growth. However, general-purpose models are not readily applicable for legal domain. Due to the nature of the domain (e.g. specialized vocabulary, long documents) specific models and methods are often needed for Legal NLP. In this work we investigate both specialized and general models for predicting the final ruling of a legal case, task known as Legal Judgment Prediction (LJP). We particularly focus on methods to extend to sequence length of Transformer-based models to better understand the long documents present in legal corpora. Extensive experiments on 4 LJP datasets in Romanian, originating from 2 sources with significantly different sizes and document lengths, show that specialized models and handling long texts are critical for a good performance.
2021
pdf
bib
abs
jurBERT: A Romanian BERT Model for Legal Judgement Prediction
Mihai Masala
|
Radu Cristian Alexandru Iacob
|
Ana Sabina Uban
|
Marina Cidota
|
Horia Velicu
|
Traian Rebedea
|
Marius Popescu
Proceedings of the Natural Legal Language Processing Workshop 2021
Transformer-based models have become the de facto standard in the field of Natural Language Processing (NLP). By leveraging large unlabeled text corpora, they enable efficient transfer learning leading to state-of-the-art results on numerous NLP tasks. Nevertheless, for low resource languages and highly specialized tasks, transformer models tend to lag behind more classical approaches (e.g. SVM, LSTM) due to the lack of aforementioned corpora. In this paper we focus on the legal domain and we introduce a Romanian BERT model pre-trained on a large specialized corpus. Our model outperforms several strong baselines for legal judgement prediction on two different corpora consisting of cases from trials involving banks in Romania.