Stefan Ruseti


2024

How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics
Adrian Cosma | Stefan Ruseti | Mihai Dascalu | Cornelia Caragea
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Natural Language Inference (NLI) evaluation is crucial for assessing language understanding models; however, popular datasets suffer from systematic spurious correlations that artificially inflate actual model performance. To address this, we propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples. We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics. This categorization significantly reduces spurious correlation measures, with examples labeled as having the highest difficulty showing markedly decreased performance and encompassing more realistic and diverse linguistic phenomena. When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset, surpassing other dataset characterization techniques. Our research addresses limitations in NLI dataset construction, providing a more authentic evaluation of model performance with implications for diverse NLU applications.

A World CLASSE Student Summary Corpus
Scott Crossley | Perpetual Baffour | Mihai Dascalu | Stefan Ruseti
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

This paper introduces the Common Lit Augmented Student Summary Evaluation (CLASSE) corpus. The corpus comprises 11,213 summaries written over six prompts by students in grades 3-12 while using the CommonLit website. Each summary was scored by expert human raters on analytic features related to main points, details, organization, voice, paraphrasing, and language beyond the source text. The human scores were aggregated into two component scores related to content and wording. The final corpus was the focus of a Kaggle competition hosted in late 2022 and completed in 2023 in which over 2,000 teams participated. The paper includes a baseline scoring model for the corpus based on a Large Language Model (Longformer model). The paper also provides an overview of the winning models from the Kaggle competition.

2021

Interpretable Identification of Cybersecurity Vulnerabilities from News Articles
Pierre Frode de la Foret | Stefan Ruseti | Cristian Sandescu | Mihai Dascalu | Sebastien Travadel
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

With the increasing adoption of technology, more and more systems become targets of information security breaches. In terms of readily identifying zero-day vulnerabilities, a substantial number of news outlets and social media accounts reveal emerging vulnerabilities and threats. However, analysts often spend a lot of time looking through these decentralized sources of information in order to ensure up-to-date countermeasures and patches applicable to their organisation’s information systems. Various automated processing pipelines grounded in Natural Language Processing techniques for text classification were introduced for the early identification of vulnerabilities starting from Open-Source Intelligence (OSINT) data, including news websites, blogs, and social media. In this study, we consider a corpus of more than 1600 labeled news articles and introduce an interpretable approach to the subject of cyberthreat early detection. In particular, an interpretable classification is performed using the Longformer architecture alongside prototypes from the ProSeNet structure, after performing a preliminary analysis on the Transformer’s encoding capabilities. The best interpretable architecture achieves an 88% F2-Score, arguing for the system’s applicability in real-life monitoring conditions of OSINT data.
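For readers unfamiliar with the metric reported above, the F2-score is the F-beta measure with beta = 2, which weights recall more heavily than precision. The following is a minimal Python sketch of the standard formula; the function name and the sample values are illustrative assumptions and are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 favors recall; beta = 2 gives the F2-score.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # conventional value when both terms are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only (not results from the paper).
print(round(f_beta(0.5, 1.0), 3))  # 0.833
```

With beta = 2, a missed vulnerability (lower recall) costs more than a false alarm (lower precision), which may be why a recall-weighted score suits monitoring settings like the one described.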

2020

RoBERT – A Romanian BERT Model
Mihai Masala | Stefan Ruseti | Mihai Dascalu
Proceedings of the 28th International Conference on Computational Linguistics

Deep pre-trained language models tend to become ubiquitous in the field of Natural Language Processing (NLP). These models learn contextualized representations by using a huge amount of unlabeled text data and obtain state-of-the-art results on a multitude of NLP tasks, by enabling efficient transfer learning. For other languages besides English, there are limited options for such models, most of which are trained only on multi-lingual corpora. In this paper we introduce a Romanian-only pre-trained BERT model – RoBERT – and compare it with different multi-lingual models on seven Romanian-specific NLP tasks grouped into three categories, namely: sentiment analysis, dialect and cross-dialect topic identification, and diacritics restoration. Our model surpasses the multi-lingual models, as well as another mono-lingual implementation of BERT, on all tasks.

2019

Answering questions by learning to rank - Learning to rank by answering questions
George Sebastian Pirtoaca | Traian Rebedea | Stefan Ruseti
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Answering multiple-choice questions in a setting in which no supporting documents are explicitly provided continues to stand as a core problem in natural language processing. The contribution of this article is two-fold. First, it describes a method which can be used to semantically rank documents extracted from Wikipedia or similar natural language corpora. Second, we propose a model employing the semantic ranking that holds the first place in two of the most popular leaderboards for answering multiple-choice questions: ARC Easy and Challenge. To achieve this, we introduce a self-attention based neural network that latently learns to rank documents by their importance related to a given question, whilst optimizing the objective of predicting the correct answer. These documents are considered relevant contexts for the underlying question. We have published the ranked documents so that they can be used off-the-shelf to improve downstream decision models.

2018

Natural Language Interface for Databases Using a Dual-Encoder Model
Ionel Alexandru Hosu | Radu Cristian Alexandru Iacob | Florin Brad | Stefan Ruseti | Traian Rebedea
Proceedings of the 27th International Conference on Computational Linguistics

We propose a sketch-based two-step neural model for generating structured queries (SQL) based on a user’s request in natural language. The sketch is obtained by using placeholders for specific entities in the SQL query, such as column names, table names, aliases and variables, in a process similar to semantic parsing. The first step is to apply a sequence-to-sequence (SEQ2SEQ) model to determine the most probable SQL sketch based on the request in natural language. Then, a second network designed as a dual-encoder SEQ2SEQ model using both the text query and the previously obtained sketch is employed to generate the final SQL query. Our approach shows improvements over previous approaches on two recent large datasets (WikiSQL and SENLIDB) suitable for data-driven solutions for natural language interfaces for databases.
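To make the notion of an SQL sketch concrete, here is a rough Python illustration of replacing query-specific entities with placeholders. It is not the authors' procedure: the placeholder tokens (`<TABLE>`, `<COLUMN>`, `<VALUE>`), the function name, and the regex-based tokenization are all assumptions made for this example, and aliases are not handled.

```python
import re

# One pass over the query: string literals, numeric literals, identifiers.
_TOKEN = re.compile(r"'[^']*'|\d+(?:\.\d+)?|[A-Za-z_]\w*")


def to_sketch(sql: str, tables: set, columns: set) -> str:
    """Replace known table/column names and literal values with placeholders."""

    def replace(match: re.Match) -> str:
        token = match.group(0)
        if token[0] == "'" or token[0].isdigit():
            return "<VALUE>"  # string or numeric literal
        lowered = token.lower()
        if lowered in tables:
            return "<TABLE>"
        if lowered in columns:
            return "<COLUMN>"
        return token  # SQL keywords and unknown identifiers stay as-is

    return _TOKEN.sub(replace, sql)


# Hypothetical schema and query, for illustration only.
query = "SELECT name FROM users WHERE age > 30"
print(to_sketch(query, tables={"users"}, columns={"name", "age"}))
# SELECT <COLUMN> FROM <TABLE> WHERE <COLUMN> > <VALUE>
```

In the two-step setup the abstract describes, the first model would predict a sketch of this kind, and the second model would fill in the concrete entities using both the request and the sketch.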

2016

Using Embedding Masks for Word Categorization
Stefan Ruseti | Traian Rebedea | Stefan Trausan-Matu
Proceedings of the 1st Workshop on Representation Learning for NLP