Ioan-Bogdan Iordache


2024

pdf bib
Verba volant, scripta volant? Don’t worry! There are computational solutions for protoword reconstruction
Liviu P Dinu | Ana Sabina Uban | Alina Maria Cristea | Ioan-Bogdan Iordache | Teodor-George Marchitan | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We introduce a new database of cognate words and etymons for the five main Romance languages, the most comprehensive one to date. We propose a strong benchmark for the automatic reconstruction of protowords for Romance languages, by applying a set of machine learning models and features on these data. The best results reach 90% accuracy in predicting the protoword of a given cognate set, surpassing existing state-of-the-art results for this task and showing that computational methods can be very useful in assisting linguists with protoword reconstruction.

pdf bib
It takes two to borrow: a donor and a recipient. Who’s who?
Liviu Dinu | Ana Uban | Anca Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Laurentiu Zoicas
Findings of the Association for Computational Linguistics: ACL 2024

We address the open problem of automatically identifying the direction of lexical borrowing, given word pairs in the donor and recipient languages. We propose strong benchmarks for this task, by applying a set of machine learning models. We extract and publicly release a comprehensive borrowings dataset from the recent RoBoCoP cognates and borrowings database for five Romance languages. We experiment on this dataset with both graphic and phonetic representations and with different features, models and architectures. We interpret the results, in terms of F1 score, commenting on the influence of features and model choice, of the imbalanced data and of the inherent difficulty of the task for particular language pairs. We show that automatically determining the direction of borrowing is a feasible task, and propose additional directions for future work.

pdf bib
Pater Incertus? There Is a Solution: Automatic Discrimination between Cognates and Borrowings for Romance Languages
Liviu P. Dinu | Ana Sabina Uban | Ioan-Bogdan Iordache | Alina Maria Cristea | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Identifying the type of relationship between words (cognates, borrowings, inherited) provides a deeper insight into the history of a language and allows for a better characterization of language relatedness. In this paper, we propose a computational approach for discriminating between cognates and borrowings, one of the most difficult tasks in historical linguistics. We compare the discriminative power of graphic and phonetic features and we analyze the underlying linguistic factors that prove relevant in the classification task. We perform experiments for pairs of languages in the Romance language family (French, Italian, Spanish, Portuguese, and Romanian), based on a comprehensive database of Romance cognates and borrowings. To our knowledge, this is one of the first attempts of this kind and the most comprehensive in terms of covered languages.

pdf bib
RoCode: A Dataset for Measuring Code Intelligence from Problem Definitions in Romanian
Adrian Cosma | Ioan-Bogdan Iordache | Paolo Rosso
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Recently, large language models (LLMs) have become increasingly powerful and have become capable of solving a plethora of tasks through proper instructions in natural language. However, the vast majority of testing suites assume that the instructions are written in English, the de facto prompting language. Code intelligence and problem solving still remain a difficult task, even for the most advanced LLMs. Currently, there are no datasets to measure the generalization power for code-generation models in a language other than English. In this work, we present RoCode, a competitive programming dataset, consisting of 2,642 problems written in Romanian, 11k solutions in C, C++ and Python and comprehensive testing suites for each problem. The purpose of RoCode is to provide a benchmark for evaluating the code intelligence of language models trained on Romanian / multilingual text as well as a fine-tuning set for pretrained Romanian models. Through our results and review of related works, we argue for the need to develop code models for languages other than English.

2023

pdf bib
RoBoCoP: A Comprehensive ROmance BOrrowing COgnate Package and Benchmark for Multilingual Cognate Identification
Liviu Dinu | Ana Uban | Alina Cristea | Anca Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The identification of cognates is a fundamental process in historical linguistics, on which any further research is based. Even though there are several cognate databases for Romance languages, they are rather scattered, incomplete, noisy, contain unreliable information, or have uncertain availability. In this paper we introduce a comprehensive database of Romance cognates and borrowings based on the etymological information provided by the dictionaries. We extract pairs of cognates between any two Romance languages by parsing electronic dictionaries of Romanian, Italian, Spanish, Portuguese and French. Based on this resource, we propose a strong benchmark for the automatic detection of cognates, by applying machine learning and deep learning based methods on any two pairs of Romance languages. We find that automatic identification of cognates is possible with accuracy averaging around 94% for the more difficult task formulations.

pdf bib
CoToHiLi at SIGTYP 2023: Ensemble Models for Cognate and Derivative Words Detection
Liviu P. Dinu | Ioan-Bogdan Iordache | Ana Sabina Uban
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

The identification of cognates and derivatives is a fundamental process in historical linguistics, on which any further research is based. In this paper we present our contribution to the SIGTYP 2023 Shared Task on cognate and derivative detection. We propose a multi-lingual solution based on features extracted from the alignment of the orthographic and phonetic representations of the words.

2022

pdf bib
Detecting Optimism in Tweets using Knowledge Distillation and Linguistic Analysis of Optimism
Ștefan Cobeli | Ioan-Bogdan Iordache | Shweta Yadav | Cornelia Caragea | Liviu P. Dinu | Dragoș Iliescu
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Finding the polarity of feelings in texts is a far-reaching task. Whilst the field of natural language processing has established sentiment analysis as an alluring problem, many feelings are left uncharted. In this study, we analyze the optimism and pessimism concepts from Twitter posts to effectively understand the broader dimension of psychological phenomenon. Towards this, we carried a systematic study by first exploring the linguistic peculiarities of optimism and pessimism in user-generated content. Later, we devised a multi-task knowledge distillation framework to simultaneously learn the target task of optimism detection with the help of the auxiliary task of sentiment analysis and hate speech detection. We evaluated the performance of our proposed approach on the benchmark Optimism/Pessimism Twitter dataset. Our extensive experiments show the superior- ity of our approach in correctly differentiating between optimistic and pessimistic users. Our human and automatic evaluation shows that sentiment analysis and hate speech detection are beneficial for optimism/pessimism detection.

pdf bib
Investigating the Relationship Between Romanian Financial News and Closing Prices from the Bucharest Stock Exchange
Ioan-Bogdan Iordache | Ana Sabina Uban | Catalin Stoean | Liviu P. Dinu
Proceedings of the Thirteenth Language Resources and Evaluation Conference

A new data set is gathered from a Romanian financial news website for the duration of four years. It is further refined to extract only information related to one company by selecting only paragraphs and even sentences that referred to it. The relation between the extracted sentiment scores of the texts and the stock prices from the corresponding dates is investigated using various approaches like the lexicon-based Vader tool, Financial BERT, as well as Transformer-based models. Automated translation is used, since some models could be only applied for texts in English. It is encouraging that all models, be that they are applied to Romanian or English texts, indicate a correlation between the sentiment scores and the increase or decrease of the stock closing prices.

2021

pdf bib
A Computational Exploration of Pejorative Language in Social Media
Liviu P. Dinu | Ioan-Bogdan Iordache | Ana Sabina Uban | Marcos Zampieri
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper we study pejorative language, an under-explored topic in computational linguistics. Unlike existing models of offensive language and hate speech, pejorative language manifests itself primarily at the lexical level, and describes a word that is used with a negative connotation, making it different from offensive language or other more studied categories. Pejorativity is also context-dependent: the same word can be used with or without pejorative connotations, thus pejorativity detection is essentially a problem similar to word sense disambiguation. We leverage online dictionaries to build a multilingual lexicon of pejorative terms for English, Spanish, Italian, and Romanian. We additionally release a dataset of tweets annotated for pejorative use. Based on these resources, we present an analysis of the usage and occurrence of pejorative words in social media, and present an attempt to automatically disambiguate pejorative usage in our dataset.