Ana Sabina Uban - ACL Anthology

Ana Sabina Uban

2026

On the Intelligibility of Romance Language Varieties: Spanish and Portuguese in Europe and America
Liviu P. Dinu | Ana Sabina Uban | Teodor-George Marchitan | Ioan-Bogdan Iordache | Simona Georgescu
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects

Mutual intelligibility within language families presents a significant challenge for multilingual NLP, particularly due to the prevalence of dialectal variation and asymmetric comprehension. In this paper, we present a corpus-based computational analysis to quantify linguistic proximity across Romance language variants, with a focus on major Spanish (Argentine, Chilean and European) and Portuguese (Brazilian and European) varieties and the other main Romance languages (Italian, French, Romanian). We apply a computational metric of lexical intelligibility based on surface and semantic similarity of related words to measure mutual intelligibility for the five main Romance languages in relation to the Spanish and Portuguese varieties studied.

A Computational Analysis of the Emergence of Therapy-speak in Social Media
Alina Iacob | Ana Sabina Uban
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)

The present article comparatively investigates semantic change in psychology-related concepts in scientific and social media texts. We assess patterns of change over 15 years (2010-2025) and compare word usage in a corpus of psychology journal abstracts and Reddit comments, testing whether specialized communities on social media align with psychology experts. We analyze the evolution of semantic breadth, semantic displacement and neighbour similarity, and include contextual embeddings in our experiments alongside static Word2Vec embeddings. Our results reveal diverse patterns of semantic change across the examined concepts and confirm that many terms are used differently on social media compared to specialized literature. Furthermore, Reddit communities focused on psychology discussions occupy an intermediate position, adopting a more objective stance than general-domain threads while remaining distinct from specialized literature.

Cross-lingual Lexical Semantic Change in Romance Languages
Ana Sabina Uban | Liviu P Dinu | Anca Daniela Dinu | Simona Georgescu
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)

We present a comprehensive quantitative analysis of lexical semantic change in the five main Romance languages (Romanian, Italian, Spanish, French and Portuguese), based on the most exhaustive database of related words in these languages. We include both cognate words and borrowings (for the first time, to our knowledge), and compute semantic shift measures using different static and contextual embedding models, as well as three different corpora. We publish the obtained lists of semantic divergences across all related word pairs, compute global trends in language-level semantic divergence, and provide insights on particular case studies of highly stable and highly divergent words for different language pairs.

2025

Friend or Foe? A Computational Investigation of Semantic False Friends across Romance Languages
Ana Sabina Uban | Liviu P Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Claudia Vlad
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

In this paper we present a comprehensive analysis of lexical semantic divergence between cognate words and borrowings in the Romance languages. We experiment with different algorithms for false friend detection and correction, covering both deceptive cognates and deceptive borrowings, and evaluate them systematically on cognate and borrowing pairs in the five Romance languages. We use the most complete and reliable dataset of cognate words based on etymological dictionaries for the five main Romance languages (Italian, Spanish, Portuguese, French and Romanian) to extract deceptive cognates and borrowings automatically based on usage, and freely publish the resulting lexicon of true and deceptive cognates and borrowings in every Romance language pair.

UniBuc-SB at ArchEHR-QA 2025: A Resource-Constrained Pipeline for Relevance Classification and Grounded Answer Synthesis
Sebastian Balmus | Dura Bogdan | Ana Sabina Uban
Proceedings of the 24th Workshop on Biomedical Language Processing (Shared Tasks)

Capturing the Dynamics of Mental Well-Being: Adaptive and Maladaptive States in Social Media
Anastasia Sandu | Teodor Mihailescu | Ana Sabina Uban | Ana-Maria Bucur
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025)

This paper describes the contributions of the BLUE team in the CLPsych 2025 Shared Task on Capturing Mental Health Dynamics from Social Media Timelines. We participate in all tasks with three submissions, for which we use two sets of approaches: an unsupervised approach that prompts various large language models (LLMs) with no fine-tuning for this task or domain, and a supervised approach based on several lightweight machine learning models trained to classify sentences for evidence extraction, using an augmented training dataset sourced from public psychological questionnaires. We obtain the best results for summarization Tasks B and C in terms of consistency, and the best F1 score in Task A.2.

Towards a Map of Related Words in Romance Languages
Liviu P. Dinu | Ana Sabina Uban | Ioan-Bogdan Iordache | Claudia Vlad | Simona Georgescu | Laurentiu Zoicas | Anca Dinu
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

We propose a map of cognate and borrowing usage in Romance languages, having as a starting point the pairs of cognates and borrowings between any two of these idioms from RoBoCoP, the largest database built upon electronic dictionaries containing etymological information for Portuguese, Spanish, French, Italian and Romanian. Bearing in mind that words are used and evolve in language communities over time, we determine how many of the pairs extracted from RoBoCoP occur in actual language use, and with what frequency, based on three online parallel corpora that contain all five Romance languages: Wikipedia; Europarl, which covers proceedings of the European Parliament; and RomCro2.0, which contains literary texts in different languages translated into Romance languages and Croatian.

SciBERT Meets Contrastive Learning: A Solution for Scientific Hallucination Detection
Crivoi Carla | Ana Sabina Uban
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

As AI systems become more involved in scientific research, there is growing concern about the accuracy of their outputs. Tools powered by large language models can generate summaries and answers that appear well-formed, but sometimes include claims that are not actually supported by the cited references. In this paper, we focus on identifying these hallucinated claims. We propose a system built on SciBERT and contrastive learning to detect whether a scientific claim can be inferred from the referenced content. Our method was evaluated in the SciHal 2025 shared task, which includes both coarse and fine-grained hallucination labels. The results show that our model performs well on supported and clearly unsupported claims, but struggles with ambiguous or low-resource categories. These findings highlight both the promise and the limitations of current models in improving the trustworthiness of AI-generated scientific content.

2024

Pater Incertus? There Is a Solution: Automatic Discrimination between Cognates and Borrowings for Romance Languages
Liviu P. Dinu | Ana Sabina Uban | Ioan-Bogdan Iordache | Alina Maria Cristea | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Identifying the type of relationship between words (cognates, borrowings, inherited) provides a deeper insight into the history of a language and allows for a better characterization of language relatedness. In this paper, we propose a computational approach for discriminating between cognates and borrowings, one of the most difficult tasks in historical linguistics. We compare the discriminative power of graphic and phonetic features and we analyze the underlying linguistic factors that prove relevant in the classification task. We perform experiments for pairs of languages in the Romance language family (French, Italian, Spanish, Portuguese, and Romanian), based on a comprehensive database of Romance cognates and borrowings. To our knowledge, this is one of the first attempts of this kind and the most comprehensive in terms of covered languages.

Verba volant, scripta volant? Don’t worry! There are computational solutions for protoword reconstruction
Liviu P Dinu | Ana Sabina Uban | Alina Maria Cristea | Ioan-Bogdan Iordache | Teodor-George Marchitan | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We introduce a new database of cognate words and etymons for the five main Romance languages, the most comprehensive one to date. We propose a strong benchmark for the automatic reconstruction of protowords for Romance languages, by applying a set of machine learning models and features on these data. The best results reach 90% accuracy in predicting the protoword of a given cognate set, surpassing existing state-of-the-art results for this task and showing that computational methods can be very useful in assisting linguists with protoword reconstruction.

SciTechBaitRO: ClickBait Detection for Romanian Science and Technology News
Raluca-Andreea Gînga | Ana Sabina Uban
Proceedings of the Third Workshop on NLP for Positive Impact

In this paper, we introduce SciTechBaitRO, a new annotated corpus of clickbait news in a low-resource language, Romanian, and a rarely covered domain, science and technology news. To our knowledge, it is one of the first corpora of annotated clickbait texts for the Romanian language, the largest (almost 11,000 examples), and the first to focus on the sci-tech domain. We evaluate the possibility of automatically detecting clickbait through a series of data analysis and machine learning experiments with varied features and models, including a range of linguistic features, classical machine learning models, deep learning and pre-trained models. We compare the performance of models using different kinds of features, and show that the best results are given by the BERT models, with results of up to 89% F1 score. We additionally evaluate the models in a cross-domain setting for news belonging to other categories (i.e. politics, sports, entertainment) and demonstrate their capacity to generalize by detecting out-of-domain clickbait news with high F1 scores.

UniBuc at SemEval-2024 Task 2: Tailored Prompting with Solar for Clinical NLI
Marius Micluta-Campeanu | Claudiu Creanga | Ana-maria Bucur | Ana Sabina Uban | Liviu P. Dinu
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper describes the approach of the UniBuc team in tackling the SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. We used SOLAR Instruct, without any fine-tuning, while focusing on input manipulation and tailored prompting. By customizing prompts for individual CTR sections, in both zero-shot and few-shot settings, we managed to achieve a consistency score of 0.72, ranking 14th on the leaderboard. Our thorough error analysis revealed that our model has a tendency to take shortcuts and rely on simple heuristics, especially when dealing with semantic-preserving changes.

2023

CoToHiLi at SIGTYP 2023: Ensemble Models for Cognate and Derivative Words Detection
Liviu P. Dinu | Ioan-Bogdan Iordache | Ana Sabina Uban
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

The identification of cognates and derivatives is a fundamental process in historical linguistics, on which any further research is based. In this paper we present our contribution to the SIGTYP 2023 Shared Task on cognate and derivative detection. We propose a multi-lingual solution based on features extracted from the alignment of the orthographic and phonetic representations of the words.

A Computational Analysis of the Voices of Shakespeare’s Characters
Liviu P. Dinu | Ana Sabina Uban
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

In this paper we propose a study of a relatively novel problem in authorship attribution research: that of classifying the stylome of characters in a literary work. We choose as a case study the plays of William Shakespeare, presumably the most renowned and respected dramatist in the history of literature. Previous research in the field of authorship attribution has shown that the writing style of an author can be characterized and distinguished from that of other authors automatically. The question we propose to answer is a related but different one: can the styles of different characters be distinguished? We aim to verify in this way if an author managed to create believable characters with individual styles, and focus on Shakespeare’s iconic characters. We present our experiments using various features and models, including an SVM and a neural network, and show that characters in Shakespeare’s plays can be classified with up to 50% accuracy.

2022

Multi-Aspect Transfer Learning for Detecting Low Resource Mental Disorders on Social Media
Ana Sabina Uban | Berta Chulvi | Paolo Rosso
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Mental disorders are a serious and increasingly relevant public health issue. NLP methods have the potential to assist with automatic mental health disorder detection, but building annotated datasets for this task can be challenging; moreover, annotated data is very scarce for disorders other than depression. Understanding the commonalities between certain disorders is also important for clinicians who face the problem of shifting standards of diagnosis. We propose that transfer learning with linguistic features can be useful for approaching both the technical problem of improving mental disorder detection in the context of data scarcity, and the clinical problem of understanding the overlapping symptoms between certain disorders. In this paper, we target four disorders: depression, PTSD, anorexia and self-harm. We explore multi-aspect transfer learning for detecting mental disorders from social media texts, using deep learning models with multi-aspect representations of language (including multiple types of interpretable linguistic features). We explore different transfer learning strategies for cross-disorder and cross-platform transfer, and show that transfer learning can be effective for improving prediction performance for disorders where little annotated data is available. We offer insights into which linguistic features are the most useful vehicles for transferring knowledge, through ablation experiments, as well as error analysis.

Investigating the Relationship Between Romanian Financial News and Closing Prices from the Bucharest Stock Exchange
Ioan-Bogdan Iordache | Ana Sabina Uban | Catalin Stoean | Liviu P. Dinu
Proceedings of the Thirteenth Language Resources and Evaluation Conference

A new data set is gathered from a Romanian financial news website over a period of four years. It is further refined to extract only information related to one company, by selecting only the paragraphs and even sentences that refer to it. The relation between the sentiment scores extracted from the texts and the stock prices on the corresponding dates is investigated using various approaches, such as the lexicon-based Vader tool, Financial BERT, and other Transformer-based models. Automated translation is used, since some models could only be applied to texts in English. It is encouraging that all models, whether applied to Romanian or English texts, indicate a correlation between the sentiment scores and the increase or decrease of the stock closing prices.

CoToHiLi at LSCDiscovery: the Role of Linguistic Features in Predicting Semantic Change
Ana Sabina Uban | Alina Maria Cristea | Anca Daniela Dinu | Liviu P Dinu | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

This paper presents the contributions of the CoToHiLi team for the LSCDiscovery shared task on semantic change in the Spanish language. We participated in both tasks (graded discovery and binary change, including sense gain and sense loss) and proposed models based on word embedding distances combined with hand-crafted linguistic features, including polysemy, number of neological synonyms, and relation to cognates in English. We find that models that include linguistically informed features combined using weights assigned manually by experts lead to promising results.

2021

A Computational Exploration of Pejorative Language in Social Media
Liviu P. Dinu | Ioan-Bogdan Iordache | Ana Sabina Uban | Marcos Zampieri
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper we study pejorative language, an under-explored topic in computational linguistics. Unlike offensive language and hate speech, which existing models target, pejorative language manifests itself primarily at the lexical level: a pejorative is a word used with a negative connotation, which makes it different from offensive language or other more studied categories. Pejorativity is also context-dependent: the same word can be used with or without pejorative connotations, so pejorativity detection is essentially a problem similar to word sense disambiguation. We leverage online dictionaries to build a multilingual lexicon of pejorative terms for English, Spanish, Italian, and Romanian. We additionally release a dataset of tweets annotated for pejorative use. Based on these resources, we present an analysis of the usage and occurrence of pejorative words in social media, and present an attempt to automatically disambiguate pejorative usage in our dataset.

Studying the Evolution of Scientific Topics and their Relationships
Ana Sabina Uban | Cornelia Caragea | Liviu P. Dinu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

jurBERT: A Romanian BERT Model for Legal Judgement Prediction
Mihai Masala | Radu Cristian Alexandru Iacob | Ana Sabina Uban | Marina Cidota | Horia Velicu | Traian Rebedea | Marius Popescu
Proceedings of the Natural Legal Language Processing Workshop 2021

Transformer-based models have become the de facto standard in the field of Natural Language Processing (NLP). By leveraging large unlabeled text corpora, they enable efficient transfer learning leading to state-of-the-art results on numerous NLP tasks. Nevertheless, for low resource languages and highly specialized tasks, transformer models tend to lag behind more classical approaches (e.g. SVM, LSTM) due to the lack of aforementioned corpora. In this paper we focus on the legal domain and we introduce a Romanian BERT model pre-trained on a large specialized corpus. Our model outperforms several strong baselines for legal judgement prediction on two different corpora consisting of cases from trials involving banks in Romania.

Tracking Semantic Change in Cognate Sets for English and Romance Languages
Ana Sabina Uban | Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

Semantic divergence in related languages is a key concern of historical linguistics. We cross-linguistically investigate the semantic divergence of cognate pairs in English and Romance languages, by means of word embeddings. To this end, we introduce a new curated dataset of cognates in all pairs of those languages. We describe the types of errors that occurred during the automated cognate identification process and manually correct them. Additionally, we label the English cognates according to their etymology, separating them into two groups: old borrowings and recent borrowings. On this curated dataset, we analyse word properties such as frequency and polysemy, and the distribution of similarity scores between cognate sets in different languages. We automatically identify different clusters of English cognates, setting a new direction of research in cognates, borrowings and possibly false friends analysis in related languages.

Automatic Discrimination between Inherited and Borrowed Latin Words in Romance Languages
Alina Maria Cristea | Liviu P. Dinu | Simona Georgescu | Mihnea-Lucian Mihai | Ana Sabina Uban
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper, we address the problem of automatically discriminating between inherited and borrowed Latin words. We introduce a new dataset and investigate the case of Romance languages (Romanian, Italian, French, Spanish, Portuguese and Catalan), where words directly inherited from Latin coexist with words borrowed from Latin, and explore whether automatic discrimination between them is possible. Having entered the language at a later stage, borrowed words are no longer subject to historical sound shift rules, hence they are presumably less eroded, which is why we expect them to have a different intrinsic structure distinguishable by computational means. We employ several machine learning models to automatically discriminate between inherited and borrowed words and compare their performance with various feature sets. We analyze the models’ predictive power on two versions of the datasets, orthographic and phonetic. We also investigate whether prior knowledge of the etymon provides better results, employing n-gram character features extracted from the word-etymon pairs and from their alignment.

Understanding Patterns of Anorexia Manifestations in Social Media Data with Deep Learning
Ana Sabina Uban | Berta Chulvi | Paolo Rosso
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

Eating disorders are a growing problem, especially among young people, yet they have been under-studied in computational research compared to other mental health disorders such as depression. Computational methods have great potential to aid with the automatic detection of mental health problems, but state-of-the-art machine learning methods based on neural networks are notoriously difficult to interpret, which is a crucial problem for applications in the mental health domain. We propose leveraging the power of deep learning models for automatically detecting signs of anorexia based on social media data, while at the same time focusing on interpreting their behavior. We train a hierarchical attention network to detect people with anorexia and use its internal encodings to discover different clusters of anorexia symptoms. We interpret the identified patterns from multiple perspectives, including emotion expression, psycho-linguistic features and personality traits, and we offer novel hypotheses to interpret our findings from a psycho-social perspective. Among the interesting findings are patterns of word usage in some users with anorexia which suggest that they feel less part of a group than control cases, and that they have abandoned explanatory activity as a result of a greater feeling of helplessness and fear.

Towards an Etymological Map of Romanian
Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Ana Sabina Uban | Laurentiu Zoicas
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper we investigate the etymology of Romanian words. We start from the Romanian lexicon and automatically extract information from multiple etymological dictionaries. We evaluate the results and perform extensive quantitative and qualitative analyses with the goal of building an etymological map of the language.

2020

Automatically Building a Multilingual Lexicon of False Friends With No Supervision
Ana Sabina Uban | Liviu P. Dinu
Proceedings of the Twelfth Language Resources and Evaluation Conference

Cognate words, defined as words in different languages which derive from a common etymon, can be useful for language learners, who can leverage the orthographical similarity of cognates to more easily understand a text in a foreign language. Deceptive cognates, or false friends, no longer share the same meaning; these can instead be deceptive and detrimental to language acquisition or text understanding in a foreign language. We use an automatic method of detecting false friends from a set of cognates, in a fully unsupervised fashion, based on cross-lingual word embeddings. We implement our method for English and five Romance languages, including a low-resource language (Romanian), and evaluate it against two different gold standards. The method can be extended easily to any language pair, requiring only large monolingual corpora for the involved languages and a small bilingual dictionary for the pair. We additionally propose a measure of “falseness” of a false friends pair. We freely publish the database of false friends in the six languages, along with the falseness scores for each cognate pair. The resource is the largest of its kind that we are aware of, both in terms of languages covered and number of word pairs.

2017

Finding a Character’s Voice: Stylome Classification on Literary Characters
Liviu P. Dinu | Ana Sabina Uban
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In this paper we investigate the problem of classifying the stylome of characters in a literary work. Previous research in the field of authorship attribution has shown that the writing style of an author can be characterized and distinguished from that of other authors automatically. Here we look at the less studied problem of how the styles of different characters can be distinguished, trying to verify if an author managed to create believable characters with individual styles. We present the results of some initial experiments conducted on the novel “Liaisons Dangereuses”, showing that a simple bag of words model can be used to classify the characters.

2015

Cross-lingual Synonymy Overlap
Anca Dinu | Liviu P. Dinu | Ana Sabina Uban
Proceedings of the International Conference Recent Advances in Natural Language Processing