Ioan-Bogdan Iordache

2026

On the Intelligibility of Romance Language Varieties: Spanish and Portuguese in Europe and America
Liviu P. Dinu | Ana Sabina Uban | Teodor-George Marchitan | Ioan-Bogdan Iordache | Simona Georgescu
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects

Mutual intelligibility within language families presents a significant challenge for multilingual NLP, particularly due to the prevalence of dialectal variation and asymmetric comprehension. In this paper, we present a corpus-based computational analysis to quantify linguistic proximity across Romance language variants, with a focus on major Spanish (Argentine, Chilean and European) and Portuguese (Brazilian and European) varieties and the other main Romance languages (Italian, French, Romanian). We apply a computational metric of lexical intelligibility based on surface and semantic similarity of related words to measure mutual intelligibility for the five main Romance languages in relation to the Spanish and Portuguese varieties studied.

2025

Towards a Map of Related Words in Romance Languages
Liviu P. Dinu | Ana Sabina Uban | Ioan-Bogdan Iordache | Claudia Vlad | Simona Georgescu | Laurentiu Zoicas | Anca Dinu
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

We propose a map of cognate and borrowing usage in Romance languages, taking as a starting point the pairs of cognates and borrowings between any two of these languages from RoBoCoP, the largest database built upon electronic dictionaries containing etymological information for Portuguese, Spanish, French, Italian and Romanian. Bearing in mind that words are used and evolve in language communities over time, we determine how many of the pairs extracted from RoBoCoP occur in actual language use, and with what frequency, based on three online parallel corpora that contain all five Romance languages: Wikipedia, Europarl (proceedings of the European Parliament), and RomCro2.0 (literary texts in different languages, translated into Romance languages and Croatian).

Friend or Foe? A Computational Investigation of Semantic False Friends across Romance Languages
Ana Sabina Uban | Liviu P Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Claudia Vlad
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

In this paper we present a comprehensive analysis of lexical semantic divergence between cognate words and borrowings in the Romance languages. We experiment with different algorithms for false friend detection and correction, covering both deceptive cognates and deceptive borrowings, and evaluate them systematically on cognate and borrowing pairs in the five Romance languages. We use the most complete and reliable dataset of cognate words based on etymological dictionaries for the five main Romance languages (Italian, Spanish, Portuguese, French and Romanian) to extract deceptive cognates and borrowings automatically based on usage, and freely publish the resulting lexicon of true and deceptive cognates and borrowings for every Romance language pair.

2024

Verba volant, scripta volant? Don’t worry! There are computational solutions for protoword reconstruction
Liviu P Dinu | Ana Sabina Uban | Alina Maria Cristea | Ioan-Bogdan Iordache | Teodor-George Marchitan | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We introduce a new database of cognate words and etymons for the five main Romance languages, the most comprehensive one to date. We propose a strong benchmark for the automatic reconstruction of protowords for Romance languages, by applying a set of machine learning models and features on these data. The best results reach 90% accuracy in predicting the protoword of a given cognate set, surpassing existing state-of-the-art results for this task and showing that computational methods can be very useful in assisting linguists with protoword reconstruction.

ItGraSyll: A Computational Analysis of Graphical Syllabification and Stress Assignment in Italian
Liviu Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Alina Maria Cristea | Bianca Guita
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

In this paper we build a dataset of Italian syllables. We perform quantitative and qualitative analyses on the syllabification and stress assignment in Italian. We propose a machine learning model, based on deep-learning techniques, for automatically inferring syllabification and stress assignment. For stress prediction we report 94.45% word-level accuracy, and for syllabification we report 98.41% word-level accuracy and 99.82% hyphen-level accuracy.
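
The difference between the word-level and hyphen-level figures reported above can be illustrated with a minimal sketch. It assumes that hyphen-level accuracy counts each gap between adjacent letters as one break/no-break decision, which may differ from the paper's exact definition; the function names and the toy words are hypothetical and are not taken from the ItGraSyll dataset.

```python
def boundaries(syllabified: str, sep: str = "-") -> list[bool]:
    """For each gap between adjacent letters, True if a syllable break falls there."""
    flags = []
    prev_was_sep = False
    seen_letter = False
    for ch in syllabified:
        if ch == sep:
            prev_was_sep = True
            continue
        if seen_letter:
            flags.append(prev_was_sep)
        seen_letter = True
        prev_was_sep = False
    return flags


def evaluate(gold: list[str], pred: list[str], sep: str = "-") -> tuple[float, float]:
    """Return (word-level accuracy, hyphen-level accuracy) under the assumed definitions."""
    word_hits = gap_hits = gap_total = 0
    for g, p in zip(gold, pred):
        gb, pb = boundaries(g, sep), boundaries(p, sep)
        if len(gb) != len(pb):
            raise ValueError(f"letter sequences differ: {g!r} vs {p!r}")
        word_hits += gb == pb  # a word counts only if every gap is right
        gap_hits += sum(x == y for x, y in zip(gb, pb))
        gap_total += len(gb)
    return word_hits / len(gold), gap_hits / gap_total


# Toy data (illustrative only): one of three words is mis-syllabified.
gold = ["ca-sa", "a-mi-co", "stra-da"]
pred = ["ca-sa", "am-i-co", "stra-da"]
word_acc, hyphen_acc = evaluate(gold, pred)
```

Under this definition a single misplaced break fails the whole word but only a couple of gap decisions, which is why hyphen-level accuracy is typically the higher of the two numbers.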

RoCode: A Dataset for Measuring Code Intelligence from Problem Definitions in Romanian
Adrian Cosma | Ioan-Bogdan Iordache | Paolo Rosso
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Recently, large language models (LLMs) have become increasingly powerful and capable of solving a plethora of tasks through proper instructions in natural language. However, the vast majority of testing suites assume that the instructions are written in English, the de facto prompting language. Code intelligence and problem solving remain difficult tasks, even for the most advanced LLMs. Currently, there are no datasets to measure the generalization power of code-generation models in a language other than English. In this work, we present RoCode, a competitive programming dataset consisting of 2,642 problems written in Romanian, 11k solutions in C, C++ and Python, and comprehensive testing suites for each problem. The purpose of RoCode is to provide a benchmark for evaluating the code intelligence of language models trained on Romanian / multilingual text, as well as a fine-tuning set for pretrained Romanian models. Through our results and review of related works, we argue for the need to develop code models for languages other than English.

Pater Incertus? There Is a Solution: Automatic Discrimination between Cognates and Borrowings for Romance Languages
Liviu P. Dinu | Ana Sabina Uban | Ioan-Bogdan Iordache | Alina Maria Cristea | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Identifying the type of relationship between words (cognates, borrowings, inherited) provides a deeper insight into the history of a language and allows for a better characterization of language relatedness. In this paper, we propose a computational approach for discriminating between cognates and borrowings, one of the most difficult tasks in historical linguistics. We compare the discriminative power of graphic and phonetic features and we analyze the underlying linguistic factors that prove relevant in the classification task. We perform experiments for pairs of languages in the Romance language family (French, Italian, Spanish, Portuguese, and Romanian), based on a comprehensive database of Romance cognates and borrowings. To our knowledge, this is one of the first attempts of this kind and the most comprehensive in terms of covered languages.

We address the open problem of automatically identifying the direction of lexical borrowing, given word pairs in the donor and recipient languages. We propose strong benchmarks for this task, by applying a set of machine learning models. We extract and publicly release a comprehensive borrowings dataset from the recent RoBoCoP cognates and borrowings database for five Romance languages. We experiment on this dataset with both graphic and phonetic representations and with different features, models and architectures. We interpret the results, in terms of F1 score, commenting on the influence of features and model choice, of the imbalanced data and of the inherent difficulty of the task for particular language pairs. We show that automatically determining the direction of borrowing is a feasible task, and propose additional directions for future work.

2023

CoToHiLi at SIGTYP 2023: Ensemble Models for Cognate and Derivative Words Detection
Liviu P. Dinu | Ioan-Bogdan Iordache | Ana Sabina Uban
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

The identification of cognates and derivatives is a fundamental process in historical linguistics, on which any further research is based. In this paper we present our contribution to the SIGTYP 2023 Shared Task on cognate and derivative detection. We propose a multi-lingual solution based on features extracted from the alignment of the orthographic and phonetic representations of the words.
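
As a rough illustration of what features drawn from an alignment of orthographic forms can look like, the sketch below aligns two words with a unit-cost edit-distance alignment and counts matches, substitutions and gaps. This is a generic technique chosen for illustration, not the system's actual feature set or aligner; the function names and example words are hypothetical.

```python
from typing import Optional

Pair = tuple[Optional[str], Optional[str]]  # None marks a gap on that side


def align(a: str, b: str) -> list[Pair]:
    """Globally align two strings with unit costs for substitution and gaps."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum edit cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match / substitution
                cost[i - 1][j] + 1,                           # gap in b
                cost[i][j - 1] + 1,                           # gap in a
            )
    # Trace back from the bottom-right corner, preferring the diagonal move.
    pairs: list[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def alignment_features(a: str, b: str) -> dict[str, float]:
    """Count matches, substitutions and gaps in the alignment of a and b."""
    pairs = align(a, b)
    gaps = sum(x is None or y is None for x, y in pairs)
    matches = sum(x == y for x, y in pairs)
    subs = len(pairs) - matches - gaps
    return {
        "matches": matches,
        "substitutions": subs,
        "gaps": gaps,
        "normalized_distance": (subs + gaps) / len(pairs) if pairs else 0.0,
    }


features = alignment_features("noche", "notte")  # Spanish vs Italian 'night'
```

The same function could be applied to phonetic transcriptions instead of spellings, provided each phoneme is represented by a single symbol.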

RoBoCoP: A Comprehensive ROmance BOrrowing COgnate Package and Benchmark for Multilingual Cognate Identification
Liviu Dinu | Ana Uban | Alina Cristea | Anca Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The identification of cognates is a fundamental process in historical linguistics, on which any further research is based. Even though there are several cognate databases for Romance languages, they are rather scattered, incomplete, noisy, contain unreliable information, or have uncertain availability. In this paper we introduce a comprehensive database of Romance cognates and borrowings based on the etymological information provided by the dictionaries. We extract pairs of cognates between any two Romance languages by parsing electronic dictionaries of Romanian, Italian, Spanish, Portuguese and French. Based on this resource, we propose a strong benchmark for the automatic detection of cognates, by applying machine learning and deep learning based methods on any two pairs of Romance languages. We find that automatic identification of cognates is possible with accuracy averaging around 94% for the more difficult task formulations.

2022

Investigating the Relationship Between Romanian Financial News and Closing Prices from the Bucharest Stock Exchange
Ioan-Bogdan Iordache | Ana Sabina Uban | Catalin Stoean | Liviu P. Dinu
Proceedings of the Thirteenth Language Resources and Evaluation Conference

A new data set is gathered from a Romanian financial news website over a period of four years. It is further refined to extract only information related to one company by selecting only the paragraphs, and even sentences, that refer to it. The relation between the sentiment scores extracted from the texts and the stock prices on the corresponding dates is investigated using various approaches, such as the lexicon-based Vader tool, Financial BERT, and other Transformer-based models. Automated translation is used, since some models could only be applied to texts in English. It is encouraging that all models, whether applied to Romanian or English texts, indicate a correlation between the sentiment scores and the increase or decrease of the stock closing prices.

Detecting Optimism in Tweets using Knowledge Distillation and Linguistic Analysis of Optimism
Ștefan Cobeli | Ioan-Bogdan Iordache | Shweta Yadav | Cornelia Caragea | Liviu P. Dinu | Dragoș Iliescu
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Finding the polarity of feelings in texts is a far-reaching task. Whilst the field of natural language processing has established sentiment analysis as an alluring problem, many feelings are left uncharted. In this study, we analyze the concepts of optimism and pessimism in Twitter posts to better understand the broader dimension of psychological phenomena. To this end, we carried out a systematic study, first exploring the linguistic peculiarities of optimism and pessimism in user-generated content. We then devised a multi-task knowledge distillation framework to simultaneously learn the target task of optimism detection with the help of the auxiliary tasks of sentiment analysis and hate speech detection. We evaluated the performance of our proposed approach on the benchmark Optimism/Pessimism Twitter dataset. Our extensive experiments show the superiority of our approach in correctly differentiating between optimistic and pessimistic users. Our human and automatic evaluation shows that sentiment analysis and hate speech detection are beneficial for optimism/pessimism detection.

2021

A Computational Exploration of Pejorative Language in Social Media
Liviu P. Dinu | Ioan-Bogdan Iordache | Ana Sabina Uban | Marcos Zampieri
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper we study pejorative language, an under-explored topic in computational linguistics. Unlike existing models of offensive language and hate speech, pejorative language manifests itself primarily at the lexical level, and describes a word that is used with a negative connotation, making it different from offensive language or other more studied categories. Pejorativity is also context-dependent: the same word can be used with or without pejorative connotations, thus pejorativity detection is essentially a problem similar to word sense disambiguation. We leverage online dictionaries to build a multilingual lexicon of pejorative terms for English, Spanish, Italian, and Romanian. We additionally release a dataset of tweets annotated for pejorative use. Based on these resources, we present an analysis of the usage and occurrence of pejorative words in social media, and present an attempt to automatically disambiguate pejorative usage in our dataset.