Anca Dinu - ACL Anthology

Anca Dinu

Also published as: Anca Daniela Dinu

2026

Cross-lingual Lexical Semantic Change in Romance Languages
Ana Sabina Uban | Liviu P Dinu | Anca Daniela Dinu | Simona Georgescu
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)

We present a comprehensive quantitative analysis of lexical semantic change in the five main Romance languages (Romanian, Italian, Spanish, French and Portuguese), based on the most exhaustive database of related words in these languages. We include both cognate words and borrowings (for the first time, to our knowledge), and compute semantic shift measures using different static and contextual embedding models, as well as three different corpora. We publish the obtained lists of semantic divergences across all related word pairs, compute global trends in language-level semantic divergence, and provide insights on particular study cases of highly stable and highly divergent words for different language pairs.

2025

Dissonant Ballerinas and Crafty Carrots: A Comparative Multi-modal Analysis of Italian Brain Rot
Anca Dinu | Andra-Maria Florescu | Marius Micluța-Câmpeanu | Ștefana Arina Tăbușcă | Claudiu Creangă | Andreiana Mihail
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Testing Language Creativity of Large Language Models and Humans
Anca Dinu | Andra-Maria Florescu
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

Since the advent of Large Language Models (LLMs), the interest in and need for a better understanding of artificial creativity have increased. This paper aims to design and administer an integrated language creativity test, including multiple tasks and criteria, targeting both LLMs and humans, for a direct comparison. Language creativity refers to how one uses natural language in novel and unusual ways, by bending lexico-grammatical and semantic norms by using literary devices or by creating new words. The results show a slightly better performance of LLMs compared to humans. We analyzed the response dataset with computational methods such as sentiment analysis, clustering, and binary classification, for a more in-depth understanding. We also manually inspected part of the answers, which revealed that the LLMs mastered figurative speech, while humans responded more pragmatically.

Analyzing Large Language Models’ pastiche ability: a case study on a 20th century Romanian author
Anca Dinu | Andra-Maria Florescu | Liviu Dinu
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

This study evaluated the ability of several Large Language Models (LLMs) to pastiche the literary style of the 20th-century Romanian author Mateiu Caragiale, by continuing one of his novels left unfinished upon his death. We assembled a database of novels consisting of six texts by Mateiu Caragiale, including his unfinished one, six texts by Radu Albala, including a continuation of Mateiu’s novel, and six LLM-generated novels that try to pastiche it. We compared the LLM-generated texts with the continuation by Radu Albala, using various methods. We automatically evaluated the pastiches with standard metrics such as ROUGE, BLEU, and METEOR. We performed stylometric analysis, clustering, and authorship attribution, as well as a manual analysis. Both computational and manual analysis of the pastiches indicated that LLMs are able to produce pastiches of fairly good quality, without matching the professional writer’s performance. The study also showed that ML techniques outperformed the more recent DL ones in both clustering and authorship attribution tasks, probably because the dataset consists of only a few archaic literary texts in Romanian. In addition, linguistically informed features were shown to be competitive with automatically extracted features.

AntiSemRO: Studying the Romanian Expression of Antisemitism
Anca Dinu | Andreea C. Moldovan | Adina Marincea
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

This study introduces an annotated dataset for the study of antisemitic hate speech and attitudes towards Jewish people in Romanian, collected from social media. We performed two types of annotation: with three simple tags (‘Neutral’, ‘Positive’, ‘Negative’), and with five more refined tags (‘Neutral’, ‘Ambiguous’, ‘Jewish Community’, Solidarity’, ‘Zionism’, ‘Antisemitism’). We perform several experiments on this dataset: clustering, automatic classification using classical machine learning models and transformer-based models, and sentiment analysis. The three-class clustering produced well-grouped clusters, while, as expected, the five-class clustering produced moderately overlapping groups, except for ‘Antisemitism’, which is well away from the other four groups. We obtained a good F1-score of 0.78 in the three-class classification task with a Romanian BERT model and a moderate F1-score of 0.62 in the five-class classification task with an SVM model. The lowest negative sentiment was contained in the ‘Neutral’ class, while the highest was in ‘Zionism’, and not in ‘Antisemitism’, as expected. The same ‘Zionism’ category also displays the highest level of positive sentiment.

Towards a Map of Related Words in Romance Languages
Liviu P. Dinu | Ana Sabina Uban | Ioan-Bogdan Iordache | Claudia Vlad | Simona Georgescu | Laurentiu Zoicas | Anca Dinu
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

We propose a map of cognates and borrowings usage in Romance languages, having as a starting point the pairs of cognates and borrowings between any two of these idioms from RoBoCoP, the largest database built upon electronic dictionaries containing etymological information for Portuguese, Spanish, French, Italian and Romanian. Having in mind that words are used and evolve in language communities over time, on the basis of the pairs extracted from RoBoCoP, we determine how many of them occur and with what frequency in the context of the languages in use, based on three online parallel corpora that contain all five Romance languages: Wikipedia, Europarl – focusing on proceedings of the European Parliament and RomCro2.0 – containing literary texts in different languages, translated in Romance languages and Croatian.

2024

It takes two to borrow: a donor and a recipient. Who’s who?
Liviu Dinu | Ana Uban | Anca Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Laurentiu Zoicas
Findings of the Association for Computational Linguistics: ACL 2024

We address the open problem of automatically identifying the direction of lexical borrowing, given word pairs in the donor and recipient languages. We propose strong benchmarks for this task, by applying a set of machine learning models. We extract and publicly release a comprehensive borrowings dataset from the recent RoBoCoP cognates and borrowings database for five Romance languages. We experiment on this dataset with both graphic and phonetic representations and with different features, models and architectures. We interpret the results, in terms of F1 score, commenting on the influence of features and model choice, of the imbalanced data and of the inherent difficulty of the task for particular language pairs. We show that automatically determining the direction of borrowing is a feasible task, and propose additional directions for future work.

Comparing Large Language Models Verbal Creativity to Human Verbal Creativity
Anca Dinu | Andra Florescu
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

This study investigates verbal creativity differences and similarities between Large Language Models and humans, based on their answers given to the integrated verbal creativity test in [1]. Since this article reported a very small difference of scores in favour of the machines, the aim of the present work is to thoroughly analyse the data through four methods: scoring the uniqueness of the answers of one human or one machine compared to all the others, semantic similarity clustering, binary classification and manual inspection of the data. The results showed that humans and machines are on a par in terms of uniqueness scores, that humans and machines group in two well-defined clusters based on semantic similarity, and that the answers are not so easy to automatically classify into human answers and LLM answers.

2023

RoBoCoP: A Comprehensive ROmance BOrrowing COgnate Package and Benchmark for Multilingual Cognate Identification
Liviu Dinu | Ana Uban | Alina Cristea | Anca Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The identification of cognates is a fundamental process in historical linguistics, on which any further research is based. Even though there are several cognate databases for Romance languages, they are rather scattered, incomplete, noisy, contain unreliable information, or have uncertain availability. In this paper we introduce a comprehensive database of Romance cognates and borrowings based on the etymological information provided by the dictionaries. We extract pairs of cognates between any two Romance languages by parsing electronic dictionaries of Romanian, Italian, Spanish, Portuguese and French. Based on this resource, we propose a strong benchmark for the automatic detection of cognates, by applying machine learning and deep learning based methods on any two pairs of Romance languages. We find that automatic identification of cognates is possible with accuracy averaging around 94% for the more difficult task formulations.

2022

CoToHiLi at LSCDiscovery: the Role of Linguistic Features in Predicting Semantic Change
Ana Sabina Uban | Alina Maria Cristea | Anca Daniela Dinu | Liviu P Dinu | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

This paper presents the contributions of the CoToHiLi team for the LSCDiscovery shared task on semantic change in the Spanish language. We participated in both tasks (graded discovery and binary change, including sense gain and sense loss) and proposed models based on word embedding distances combined with hand-crafted linguistic features, including polysemy, number of neological synonyms, and relation to cognates in English. We find that models that include linguistically informed features combined using weights assigned manually by experts lead to promising results.

2021

Automatic Detection and Classification of Mental Illnesses from General Social Media Texts
Anca Dinu | Andreea-Codrina Moldovan
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Mental health has been receiving more and more attention recently, with depression being a very common illness nowadays, alongside other disorders such as anxiety, obsessive-compulsive disorders, feeding disorders, autism, or attention-deficit/hyperactivity disorders. The huge amount of data from social media and the recent advances in deep learning models provide valuable means for automatically detecting mental disorders from plain text. In this article, we experiment with state-of-the-art methods on the SMHD mental health conditions dataset from Reddit (Cohan et al., 2018). Our contribution is threefold: using a dataset consisting of more illnesses than most studies, focusing on general text rather than mental health support groups, and classifying by posts rather than individuals or groups. For the automatic classification of the diseases, we employ three deep learning models: BERT, RoBERTa and XLNET. We double the baseline established by Cohan et al. (2018), on just a sample of their dataset. We improve the results obtained by Jiang et al. (2020) on post-level classification. The accuracy obtained by the eating disorder classifier is the highest due to the prominent presence of discussions related to calories, diets, recipes etc., whereas depression had the lowest F1 score, probably because depression is more difficult to identify in linguistic acts.

Tracking Semantic Change in Cognate Sets for English and Romance Languages
Ana Sabina Uban | Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

Semantic divergence in related languages is a key concern of historical linguistics. We cross-linguistically investigate the semantic divergence of cognate pairs in English and Romance languages, by means of word embeddings. To this end, we introduce a new curated dataset of cognates in all pairs of those languages. We describe the types of errors that occurred during the automated cognate identification process and manually correct them. Additionally, we label the English cognates according to their etymology, separating them into two groups: old borrowings and recent borrowings. On this curated dataset, we analyse word properties such as frequency and polysemy, and the distribution of similarity scores between cognate sets in different languages. We automatically identify different clusters of English cognates, setting a new direction of research in cognates, borrowings and possibly false friends analysis in related languages.

Towards an Etymological Map of Romanian
Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Ana Sabina Uban | Laurentiu Zoicas
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper we investigate the etymology of Romanian words. We start from the Romanian lexicon and automatically extract information from multiple etymological dictionaries. We evaluate the results and perform extensive quantitative and qualitative analyses with the goal of building an etymological map of the language.

2019

Linguistic classification: dealing jointly with irrelevance and inconsistency
Laura Franzoi | Andrea Sgarro | Anca Dinu | Liviu P. Dinu
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this paper, we present new methods for language classification which put to good use both syntax and fuzzy tools, and are capable of dealing with irrelevant linguistic features (i.e. features which should not contribute to the classification) and even inconsistent features (which do not make sense for specific languages). We introduce a metric distance, based on the generalized Steinhaus transform, which allows one to deal jointly with irrelevance and inconsistency. To evaluate our methods, we test them on a syntactic data set, due to the linguist G. Longobardi and his school. We obtain phylogenetic trees which sometimes outperform the ones obtained by Atkinson and Gray.

2017

On the annotation of vague expressions: a case study on Romanian historical texts
Anca Dinu | Walther von Hahn | Cristina Vertan
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe

Current approaches in Digital Humanities tend to ignore a central aspect of any hermeneutic introspection: the intrinsic vagueness of analyzed texts. Especially when dealing with historical documents, neglecting vagueness has important implications for the interpretation of the results. In this paper we present current limitations of annotation approaches and describe a current methodology for annotating vagueness for historical Romanian texts.

On the stylistic evolution from communism to democracy: Solomon Marcus study case
Anca Dinu | Liviu P. Dinu | Bogdan Dumitru
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this article we propose a stylistic analysis of Solomon Marcus’ non-scientific published texts, gathered in six volumes, aiming to uncover some of his quantitative and qualitative fingerprints. Moreover, we compare and cluster two distinct periods of time in his writing style: 22 years of communist regime (1967-1989) and 27 years of democracy (1990-2016). The distributional analysis of Marcus’ texts reveals that the passage from the communist regime period to democracy is sharply marked by two complementary changes in Marcus’ writing: in the pre-democracy period, the communist norms of writing style demanded, on the one hand, long phrases, long words and clichés, and on the other hand, a short list of preferred “official” topics; in the democratic period, the tendency was towards shorter phrases and words and a broader range of topics.

Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe
Anca Dinu | Petya Osenova | Cristina Vertan
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe

2015

Cross-lingual Synonymy Overlap
Anca Dinu | Liviu P. Dinu | Ana Sabina Uban
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

Aggregation methods for efficient collocation detection
Anca Dinu | Liviu Dinu | Ionut Sorodoc
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this article we propose a rank aggregation method for the task of collocation detection. It consists of applying some well-known methods (e.g. Dice method, chi-square test, z-test and likelihood ratio) and then aggregating the resulting collocation rankings by rank distance and Borda score. These two aggregation methods are especially well suited for the task, since the results of each individual method naturally form a ranking of collocations. Combination methods are known to usually improve the results, and indeed, the proposed aggregation method performs better than each individual method taken in isolation.
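As a rough illustration of the Borda-score step mentioned in this abstract, the sketch below aggregates several collocation rankings into one. It is not the authors' implementation: the function name `borda_aggregate`, the scoring convention (an item in position p of a ranking of length n receives n - p points, counting positions from zero), the alphabetical tie-break, and the toy rankings are all assumptions made for this example, and the rank-distance aggregation is not shown.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Merge rankings (each ordered best first) by summed Borda points.

    Hypothetical helper: an item at 0-based position p in a ranking of
    length n earns n - p points; items missing from a ranking earn 0.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest total first; ties are broken alphabetically (an assumption).
    return sorted(scores, key=lambda item: (-scores[item], item))


# Toy rankings standing in for the output of three association measures.
dice_ranking = ["strong tea", "make decision", "of the"]
chi_square_ranking = ["make decision", "strong tea", "of the"]
likelihood_ranking = ["strong tea", "of the", "make decision"]

aggregated = borda_aggregate(
    [dice_ranking, chi_square_ranking, likelihood_ranking]
)
print(aggregated)
```

With these toy inputs, "strong tea" collects 8 points, "make decision" 6 and "of the" 4, so the aggregated ranking lists them in that order.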

Predicting Romanian Stress Assignment
Alina Maria Ciobanu | Anca Dinu | Liviu Dinu
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

2013

Temporal classification for historical Romanian texts
Alina Maria Ciobanu | Anca Dinu | Liviu Dinu | Vlad Niculae | Octavia-Maria Şulea
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Temporal Text Classification for Romanian Novels set in the Past
Alina Maria Ciobanu | Liviu P. Dinu | Octavia-Maria Şulea | Anca Dinu | Vlad Niculae
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Alternative measures of word relatedness in distributional semantics
Anca Dinu | Alina Ciobanu
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora

2011

A Mechanism to Restrict the Scope of Clause-Bounded Quantifiers in ‘Continuation’ Semantics
Anca Dinu
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2010

Building a Generative Lexicon for Romanian
Anca Dinu
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we present ongoing research: the construction and annotation of a Romanian Generative Lexicon (RoGL). Our system follows the specifications of the CLIPS project for Italian. It contains a corpus, a type ontology, a graphical interface and a database from which we generate data in XML format.

2009

On the behavior of Romanian syllables related to minimum effort laws
Anca Dinu | Liviu P. Dinu
Proceedings of the Workshop Multilingual resources, technologies and evaluation for central and Eastern European languages

2008

Authorship Identification of Romanian Texts with Controversial Paternity
Liviu Dinu | Marius Popescu | Anca Dinu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this work we propose a new strategy for the authorship identification problem and we test it on an example from Romanian literature: did Radu Albala find the continuation of Mateiu Caragiale's novel Sub pecetea tainei, or did he write the continuation himself? The proposed strategy is based on the similarity of rankings of function words; we compare the obtained results with the results obtained by a learning method (namely Support Vector Machines, SVM, with a string kernel).

On Classifying Coherent/Incoherent Romanian Short Texts
Anca Dinu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present and discuss the results of a text coherence experiment performed on a small corpus of Romanian text from a number of alternative high school manuals. During the last 10 years, an abundance of alternative manuals for high school was produced and distributed in Romania. Due to the large amount of material and to the relatively short time in which it was produced, the question of assessing the quality of this material emerged; this process relied mostly on subjective personal opinion, given the lack of automatic tools for Romanian. Debates and claims of poor quality of the alternative manuals resulted in a number of examples of incomprehensible / incoherent paragraphs extracted from such manuals. Our goal was to create an automatic tool which may be used as an indication of poor quality of such texts. We created a small corpus of representative texts from Romanian alternative manuals. We manually classified the chosen paragraphs from such manuals into two categories: comprehensible/coherent text and incomprehensible/incoherent text. We then used different machine learning techniques to automatically classify them in a supervised manner. Our approach is rather simple, but the results are encouraging.

2006

Total Rank Distance and Scaled Total Rank Distance: Two Alternative Metrics in Computational Linguistics
Anca Dinu | Liviu P. Dinu
Proceedings of the Workshop on Linguistic Distances

On the data base of Romanian syllables and some of its quantitative and cryptographic aspects
Liviu Dinu | Anca Dinu
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we argue for the need to construct a data base of Romanian syllables. We explain the reasons for our choice of the DOOM corpus which we have used. We describe the way syllabification was performed and explain how we have constructed the data base. The main quantitative aspects which we have extracted from our research are presented. We also computed the entropy of the syllables and the entropy of the syllables w.r.t. the consonant-vowel structure. The results are compared with results of similar studies carried out for different languages.
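The two entropy measures mentioned in this abstract can be illustrated with a minimal sketch, assuming plain Shannon entropy over relative frequencies; the abstract does not state the exact formula. The helper names `shannon_entropy` and `cv_pattern`, the vowel inventory, and the toy syllable list are assumptions for this example and are not taken from the paper or from the DOOM data. Treating every vowel letter as V is a deliberate simplification, since some Romanian vowel letters can also denote semivowels.

```python
import math
from collections import Counter

# Assumed vowel letters for the sketch; semivowels are not distinguished.
VOWELS = set("aeiouăâî")


def shannon_entropy(items):
    """Shannon entropy in bits of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )


def cv_pattern(syllable):
    """Map a syllable to its consonant-vowel skeleton, e.g. 'ma' to 'CV'."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable.lower())


# Toy sample of syllable tokens (hypothetical, for illustration only).
syllables = ["ma", "ma", "ta", "re"]

syllable_entropy = shannon_entropy(syllables)
cv_entropy = shannon_entropy(cv_pattern(s) for s in syllables)
print(syllable_entropy, cv_entropy)
```

In the toy sample all four syllables share the skeleton CV, so the entropy over consonant-vowel structures is zero, while the entropy over the syllables themselves is 1.5 bits.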