Sergiu Nisioi


2024

Archaeology at MLSP 2024: Machine Translation for Lexical Complexity Prediction and Lexical Simplification
Petru Cristea | Sergiu Nisioi
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

We present the submissions of team Archaeology for the Lexical Simplification (LS) and Lexical Complexity Prediction (LCP) Shared Tasks at BEA 2024. Our approach consists of two pipelines for generating lexical substitutions and estimating complexity: one using texts machine-translated into English and one using the original language. For the LCP subtask, our XGBoost regressor is trained with engineered features (based primarily on English language resources) and shallow word structure features. For the LS subtask, we use a locally executed quantized LLM to generate candidates and sort them by the complexity score computed with the pipeline designed for LCP. These pipelines provide distinct perspectives on the lexical simplification process, offering insights into the efficacy and limitations of employing machine translation versus direct processing of the original-language data.
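
The candidate-ranking step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the feature set, the weights in `toy_complexity`, and all function names are hypothetical placeholders standing in for the trained regressor and the LLM-generated candidates.

```python
import re

# Runs of vowels serve as a rough syllable proxy (an illustrative assumption).
VOWEL_GROUP = re.compile(r"[aeiouy]+")


def shallow_features(word: str) -> dict:
    """Shallow word-structure features; the feature set is illustrative."""
    w = word.lower()
    return {
        "length": len(w),
        "vowel_groups": len(VOWEL_GROUP.findall(w)),
    }


def toy_complexity(word: str) -> float:
    """Placeholder scorer with made-up weights, standing in for a trained regressor."""
    f = shallow_features(word)
    return 0.05 * f["length"] + 0.1 * f["vowel_groups"]


def rank_candidates(candidates, score=toy_complexity):
    """Order substitution candidates from least to most complex."""
    return sorted(candidates, key=score)
```

Calling `rank_candidates(["utilize", "use", "employ"])` orders the words by the toy score; any callable that maps a word to a number, such as a trained model's prediction function, could be passed as `score` instead.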

Cheap Ways of Extracting Clinical Markers from Texts
Anastasia Sandu | Teodor Mihailescu | Sergiu Nisioi
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)

This paper describes the work of the Unibuc Archaeology team for the CLPsych 2024 Shared Task, which involved finding evidence within the text supporting the assigned suicide risk level. Two types of evidence were required: highlights (extracting relevant spans within the text) and summaries (aggregating evidence into a synthesis). Our work evaluates Large Language Models (LLMs) against an alternative method that is much more memory- and resource-efficient. The first approach employs an LLM to generate the summaries and guides it, through a processing chain, to provide sequences of text indicating suicidal tendencies as highlights. The second approach is a good old-fashioned machine learning pipeline, tf-idf with a logistic regression classifier, whose most representative features we use to extract relevant highlights.

A Multilingual Parallel Corpus for Aromanian
Iulia Petrariu | Sergiu Nisioi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We report the creation of the first high-quality corpus of Aromanian - an endangered Romance language spoken in the Balkans - and the equivalent sentence-aligned translations into Romanian, English, and French. The corpus is released publicly using several orthographic standards and consists of short stories collected in the '70s in Romania. Additionally, we provide a corpus-based analysis of Aromanian linguistic particularities and of the overall demographic and political context that impacts the contemporary development of the language.

2023

Clark Kent at SemEval-2023 Task 5: SVMs, Transformers, and Pixels for Clickbait Spoiling
Dragos-Stefan Mihalcea | Sergiu Nisioi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

In this paper we present an analysis of our approaches for the SemEval-2023 Clickbait Challenge. We participated only in the sub-task aimed at identifying different clickbait spoiling types, comparing several machine learning and deep learning approaches. Our analysis confirms previous results on this task and shows that automatic methods are able to reach approximately 70% accuracy at predicting what type of additional content is needed to mitigate sensationalistic posts on social media. Furthermore, we provide a qualitative analysis of the results, showing that the models may do better in practice than the metric indicates, since the evaluation depends not only on the predictor but also on the typology we choose to define clickbait spoiling.

2022

Identifying Draft Bills Impacting Existing Legislation: a Case Study on Romanian
Corina Ceausu | Sergiu Nisioi
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In our paper, we present a novel corpus of historical legal documents on the Romanian public procurement legislation and an annotated subset of draft bills that have been screened by legal experts and identified as impacting past public procurement legislation. Using the manual annotations provided by the experts, we attempt to automatically identify future draft bills that have the potential to impact existing policies on public procurement.

2020

CoCo: A Tool for Automatically Assessing Conceptual Complexity of Texts
Sanja Štajner | Sergiu Nisioi | Ioana Hulpuș
Proceedings of the Twelfth Language Resources and Evaluation Conference

Traditional text complexity assessment usually takes into account only syntactic and lexical text complexity. The task of automatic assessment of conceptual text complexity, important for maintaining the reader's interest and for adapting texts for struggling readers, has only been proposed recently. In this paper, we present CoCo - a tool for automatic assessment of conceptual text complexity, based on the current state-of-the-art unsupervised approach. We make the code and API freely available for research purposes, and describe in detail the code and the possibilities for its personalization and adaptation. We compare the current implementation with the state of the art, discussing the influence of the choice of entity linker on the performance of the tool. Finally, we present results obtained on two widely used text simplification corpora, discussing the full potential of the tool.

2018

A Detailed Evaluation of Neural Sequence-to-Sequence Models for In-domain and Cross-domain Text Simplification
Sanja Štajner | Sergiu Nisioi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Content Extraction and Lexical Analysis from Customer-Agent Interactions
Sergiu Nisioi | Anca Bucur | Liviu P. Dinu
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

In this paper, we provide a comparative lexical analysis of the vocabulary used by customers and agents in an Enterprise Resource Planning (ERP) environment and a potential solution to clean the data and extract relevant content for NLP. As a result, we demonstrate that the actual vocabulary of the language that prevails in the ERP conversations is highly divergent from the standardized dictionary and further different from general language usage as extracted from the Common Crawl corpus. Moreover, in specific business communication circumstances, where a high usage of standardized language is expected, code switching and non-standard expressions are predominant, emphasizing once more the discrepancy between the day-to-day use of language and the standardized one.

2017

Exploring Neural Text Simplification Models
Sergiu Nisioi | Sanja Štajner | Simone Paolo Ponzetto | Liviu P. Dinu
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present the first attempt at using sequence-to-sequence neural networks to model text simplification (TS). Unlike the previously proposed automated TS systems, our neural text simplification (NTS) systems are able to simultaneously perform lexical simplification and content reduction. An extensive human evaluation of the output has shown that NTS systems achieve almost perfect grammaticality and meaning preservation of output sentences and a higher level of simplification than the state-of-the-art automated TS systems.

2016

Using Word Embeddings to Translate Named Entities
Octavia-Maria Şulea | Sergiu Nisioi | Liviu P. Dinu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we investigate the usefulness of neural word embeddings for translating Named Entities (NEs) from a resource-rich language to a language low on resources relevant to the task at hand, introducing a novel yet simple way of obtaining bilingual word vectors. We are inspired by two lines of work: observations in (Mikolov et al., 2013b), which show that training their word vector model on comparable corpora yields comparable vector space representations of those corpora, reducing the problem of translating words to finding a rotation matrix; and results in (Zou et al., 2013), which showed that bilingual word embeddings can improve Chinese Named Entity Recognition (NER) and English to Chinese phrase translation. We use the sentence-aligned English-French EuroParl corpora and show that word embeddings extracted from a merged corpus (the corpus resulting from the merger of the two aligned corpora) can be used for NE translation. We extrapolate that word embeddings trained on merged parallel corpora are useful in Named Entity Recognition and Translation tasks for resource-poor languages.

Comparing Speech and Text Classification on ICNALE
Sergiu Nisioi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we explore and compare a speech and text classification approach on a corpus of native and non-native English speakers. We experiment on a subset of the International Corpus Network of Asian Learners of English containing the recorded speeches and the equivalent text transcriptions. Our results suggest a high correlation between the spoken and written classification results, showing that native accent is highly correlated with grammatical structures found in text.

A Corpus of Native, Non-native and Translated Texts
Sergiu Nisioi | Ella Rabinovich | Liviu P. Dinu | Shuly Wintner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We describe a monolingual English corpus of original and (human) translated texts, with an accurate annotation of speaker properties, including the original language of the utterances and the speaker’s country of origin. We thus obtain three sub-corpora of texts reflecting native English, non-native English, and English translated from a variety of European languages. This dataset will facilitate the investigation of similarities and differences between these kinds of sub-languages. Moreover, it will facilitate a unified comparative study of translations and language produced by (highly fluent) non-native speakers, two closely-related phenomena that have only been studied in isolation so far.

On the Similarities Between Native, Non-native and Translated Texts
Ella Rabinovich | Sergiu Nisioi | Noam Ordan | Shuly Wintner
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A Visual Representation of Wittgenstein’s Tractatus Logico-Philosophicus
Anca Bucur | Sergiu Nisioi
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

In this paper we discuss a method for data visualization together with its potential usefulness in digital humanities and philosophy of language. We compiled a multilingual parallel corpus from different versions of Wittgenstein's Tractatus Logico-Philosophicus, including the original in German and translations into English, Spanish, French, and Russian. Using this corpus, we compute a similarity measure between propositions and render a visual network of relations for different languages.
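
The abstract does not say which similarity measure is used, so the following is only a sketch of the general idea under an assumed measure: token-level Jaccard overlap between propositions, with pairs above a threshold kept as network edges. The function names and the default `threshold` are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the lowercased, whitespace-split token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def similarity_edges(props: dict, threshold: float = 0.2) -> list:
    """Return (id_a, id_b, similarity) for every pair at or above `threshold`."""
    edges = []
    for (id_a, text_a), (id_b, text_b) in combinations(props.items(), 2):
        sim = jaccard(text_a, text_b)
        if sim >= threshold:
            edges.append((id_a, id_b, sim))
    return edges
```

The resulting edge list maps proposition identifiers to weighted links and could be handed to any graph-drawing tool; running it once per language version would give one network per translation.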

Vanilla Classifiers for Distinguishing between Similar Languages
Sergiu Nisioi | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

In this paper we describe the submission of the UniBuc-NLP team for the Discriminating between Similar Languages Shared Task, DSL 2016. We present and analyze the results we obtained in the closed track of sub-task 1 (Similar languages and language varieties) and sub-task 2 (Arabic dialects). For sub-task 1 we used a logistic regression classifier with tf-idf feature weighting and for sub-task 2 a character-based string kernel with an SVM classifier. Our results show that good accuracy scores can be obtained with limited feature and model engineering. While certain limitations are to be acknowledged, our approach worked surprisingly well for out-of-domain, social media data, with 0.898 accuracy (3rd place) for dataset B1 and 0.838 accuracy (4th place) for dataset B2.
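
As an illustration of what a character-based string kernel can look like, here is a minimal sketch of a cosine-normalised character n-gram (spectrum) kernel. This is one common choice and an assumption on our part; the abstract does not specify which kernel or which n-gram length the submission used.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(a: str, b: str, n: int = 3) -> float:
    """Cosine-normalised dot product of the two n-gram count vectors."""
    ca, cb = char_ngrams(a, n), char_ngrams(b, n)
    # Counter returns 0 for n-grams missing from the other string.
    dot = sum(count * cb[gram] for gram, count in ca.items())
    norm = sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def gram_matrix(texts: list, n: int = 3) -> list:
    """Pairwise kernel values, as a list of rows."""
    return [[spectrum_kernel(s, t, n) for t in texts] for s in texts]
```

A matrix built this way can be passed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.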

2014

On the syllabic structures of Aromanian
Sergiu Nisioi
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

2013

A clustering approach for translationese identification
Sergiu Nisioi | Liviu P. Dinu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

Authorial Studies using Ranked Lexical Features
Liviu P. Dinu | Sergiu Nisioi
Proceedings of COLING 2012: Demonstration Papers