2024
It takes two to borrow: a donor and a recipient. Who’s who?
Liviu Dinu | Ana Uban | Anca Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Laurentiu Zoicas
Findings of the Association for Computational Linguistics: ACL 2024
We address the open problem of automatically identifying the direction of lexical borrowing, given word pairs in the donor and recipient languages. We propose strong benchmarks for this task, by applying a set of machine learning models. We extract and publicly release a comprehensive borrowings dataset from the recent RoBoCoP cognates and borrowings database for five Romance languages. We experiment on this dataset with both graphic and phonetic representations and with different features, models and architectures. We interpret the results, in terms of F1 score, commenting on the influence of features and model choice, of the imbalanced data and of the inherent difficulty of the task for particular language pairs. We show that automatically determining the direction of borrowing is a feasible task, and propose additional directions for future work.
2023
RoBoCoP: A Comprehensive ROmance BOrrowing COgnate Package and Benchmark for Multilingual Cognate Identification
Liviu Dinu | Ana Uban | Alina Cristea | Anca Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
The identification of cognates is a fundamental process in historical linguistics, on which any further research is based. Even though there are several cognate databases for Romance languages, they are rather scattered, incomplete, noisy, contain unreliable information, or have uncertain availability. In this paper we introduce a comprehensive database of Romance cognates and borrowings based on the etymological information provided by the dictionaries. We extract pairs of cognates between any two Romance languages by parsing electronic dictionaries of Romanian, Italian, Spanish, Portuguese and French. Based on this resource, we propose a strong benchmark for the automatic detection of cognates, by applying machine learning and deep learning based methods on any two pairs of Romance languages. We find that automatic identification of cognates is possible with accuracy averaging around 94% for the more difficult task formulations.
2021
Towards an Etymological Map of Romanian
Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Ana Sabina Uban | Laurentiu Zoicas
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
In this paper we investigate the etymology of Romanian words. We start from the Romanian lexicon and automatically extract information from multiple etymological dictionaries. We evaluate the results and perform extensive quantitative and qualitative analyses with the goal of building an etymological map of the language.
Automatic Detection and Classification of Mental Illnesses from General Social Media Texts
Anca Dinu | Andreea-Codrina Moldovan
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Mental health has recently received growing attention: depression is now a very common illness, alongside other disorders such as anxiety, obsessive-compulsive disorders, feeding disorders, autism, and attention-deficit/hyperactivity disorders. The huge amount of data from social media and the recent advances in deep learning models provide valuable means for automatically detecting mental disorders from plain text. In this article, we experiment with state-of-the-art methods on the SMHD mental health conditions dataset from Reddit (Cohan et al., 2018). Our contribution is threefold: using a dataset covering more illnesses than most studies, focusing on general text rather than mental health support groups, and classifying by posts rather than by individuals or groups. For the automatic classification of the diseases, we employ three deep learning models: BERT, RoBERTa and XLNet. We double the baseline established by Cohan et al. (2018), on just a sample of their dataset. We improve the results obtained by Jiang et al. (2020) on post-level classification. The accuracy obtained by the eating disorder classifier is the highest due to the prominent presence of discussions related to calories, diets, recipes etc., whereas depression had the lowest F1 score, probably because depression is more difficult to identify in linguistic acts.
Tracking Semantic Change in Cognate Sets for English and Romance Languages
Ana Sabina Uban | Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021
Semantic divergence in related languages is a key concern of historical linguistics. We cross-linguistically investigate the semantic divergence of cognate pairs in English and Romance languages, by means of word embeddings. To this end, we introduce a new curated dataset of cognates in all pairs of those languages. We describe the types of errors that occurred during the automated cognate identification process and manually correct them. Additionally, we label the English cognates according to their etymology, separating them into two groups: old borrowings and recent borrowings. On this curated dataset, we analyse word properties such as frequency and polysemy, and the distribution of similarity scores between cognate sets in different languages. We automatically identify different clusters of English cognates, setting a new direction of research in cognates, borrowings and possibly false friends analysis in related languages.
2019
Linguistic classification: dealing jointly with irrelevance and inconsistency
Laura Franzoi | Andrea Sgarro | Anca Dinu | Liviu P. Dinu
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
In this paper, we present new methods for language classification which put to good use both syntax and fuzzy tools, and are capable of dealing with irrelevant linguistic features (i.e. features which should not contribute to the classification) and even inconsistent features (which do not make sense for specific languages). We introduce a metric distance, based on the generalized Steinhaus transform, which allows one to deal jointly with irrelevance and inconsistency. To evaluate our methods, we test them on a syntactic data set, due to the linguist G. Longobardi and his school. We obtain phylogenetic trees which sometimes outperform the ones obtained by Atkinson and Gray.
2017
On the stylistic evolution from communism to democracy: Solomon Marcus study case
Anca Dinu | Liviu P. Dinu | Bogdan Dumitru
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
In this article we propose a stylistic analysis of Solomon Marcus’ non-scientific published texts, gathered in six volumes, aiming to uncover some of his quantitative and qualitative fingerprints. Moreover, we compare and cluster two distinct periods of time in his writing style: 22 years of communist regime (1967-1989) and 27 years of democracy (1990-2016). The distributional analysis of Marcus’ text reveals that the passing from the communist regime period to democracy is sharply marked by two complementary changes in Marcus’ writing: in the pre-democracy period, the communist norms of writing style demanded on the one hand long phrases, long words and clichés, and on the other hand, a short list of preferred “official” topics; in the democratic period, the tendency was towards shorter phrases and words and a broader range of topics.
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe
Anca Dinu | Petya Osenova | Cristina Vertan
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe
On the annotation of vague expressions: a case study on Romanian historical texts
Anca Dinu | Walther von Hahn | Cristina Vertan
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe
Current approaches in Digital Humanities tend to ignore a central aspect of any hermeneutic introspection: the intrinsic vagueness of analyzed texts. Especially when dealing with historical documents, neglecting vagueness has important implications for the interpretation of the results. In this paper we present current limitations of annotation approaches and describe a current methodology for annotating vagueness in historical Romanian texts.
2015
Cross-lingual Synonymy Overlap
Anca Dinu | Liviu P. Dinu | Ana Sabina Uban
Proceedings of the International Conference Recent Advances in Natural Language Processing
2014
Aggregation methods for efficient collocation detection
Anca Dinu | Liviu Dinu | Ionut Sorodoc
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this article we propose a rank aggregation method for the task of collocation detection. It consists of applying some well-known methods (e.g. Dice method, chi-square test, z-test and likelihood ratio) and then aggregating the resulting collocation rankings by rank distance and Borda score. These two aggregation methods are especially well suited for the task, since the results of each individual method naturally form a ranking of collocations. Combination methods are known to usually improve the results, and indeed, the proposed aggregation method performs better than each individual method taken in isolation.
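As an illustration of the aggregation step only, the following is a minimal sketch of Borda-score aggregation over several rankings. It is not the authors' implementation: the function name `borda_aggregate`, the scoring convention (an item at 0-based position p in a ranking of length n earns n - p points), the tie-breaking rule, and the example collocations are all assumptions made for this sketch.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Merge several rankings (best item first) into one by Borda score."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            # Assumed convention: the top item earns n points, the last earns 1.
            scores[item] += n - position
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical rankings of candidate collocations, one per association measure.
dice_ranking = ["strong tea", "heavy rain", "make decision", "big house"]
chi2_ranking = ["heavy rain", "strong tea", "big house", "make decision"]
llr_ranking = ["strong tea", "make decision", "heavy rain", "big house"]

aggregated = borda_aggregate([dice_ranking, chi2_ranking, llr_ranking])
print(aggregated)
```

Each input list plays the role of the ranking produced by one individual measure; the aggregated list is the combined ranking.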
Predicting Romanian Stress Assignment
Alina Maria Ciobanu | Anca Dinu | Liviu Dinu
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers
2013
Temporal classification for historical Romanian texts
Alina Maria Ciobanu | Anca Dinu | Liviu Dinu | Vlad Niculae | Octavia-Maria Şulea
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Alternative measures of word relatedness in distributional semantics
Anca Dinu | Alina Ciobanu
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora
Temporal Text Classification for Romanian Novels set in the Past
Alina Maria Ciobanu | Liviu P. Dinu | Octavia-Maria Şulea | Anca Dinu | Vlad Niculae
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013
2011
A Mechanism to Restrict the Scope of Clause-Bounded Quantifiers in ‘Continuation’ Semantics
Anca Dinu
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011
2010
Building a Generative Lexicon for Romanian
Anca Dinu
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper we present ongoing research: the construction and annotation of a Romanian Generative Lexicon (RoGL). Our system follows the specifications of the CLIPS project for the Italian language. It contains a corpus, a type ontology, a graphical interface and a database from which we generate data in XML format.
2009
On the behavior of Romanian syllables related to minimum effort laws
Anca Dinu | Liviu P. Dinu
Proceedings of the Workshop Multilingual resources, technologies and evaluation for central and Eastern European languages
2008
On Classifying Coherent/Incoherent Romanian Short Texts
Anca Dinu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper we present and discuss the results of a text coherence experiment performed on a small corpus of Romanian text from a number of alternative high school manuals. During the last 10 years, an abundance of alternative manuals for high school was produced and distributed in Romania. Due to the large amount of material and to the relatively short time in which it was produced, the question of assessing the quality of this material emerged; this process relied mostly on subjective personal opinion, given the lack of automatic tools for Romanian. Debates and claims of poor quality of the alternative manuals resulted in a number of examples of incomprehensible / incoherent paragraphs extracted from such manuals. Our goal was to create an automatic tool which may be used as an indication of poor quality of such texts. We created a small corpus of representative texts from Romanian alternative manuals. We manually classified the chosen paragraphs from such manuals into two categories: comprehensible/coherent text and incomprehensible/incoherent text. We then used different machine learning techniques to automatically classify them in a supervised manner. Our approach is rather simple, but the results are encouraging.
Authorship Identification of Romanian Texts with Controversial Paternity
Liviu Dinu | Marius Popescu | Anca Dinu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this work we propose a new strategy for the authorship identification problem and we test it on an example from Romanian literature: did Radu Albala find the continuation of Mateiu Caragiale's novel Sub pecetea tainei, or did he write the continuation himself? The proposed strategy is based on the similarity of rankings of function words; we compare the obtained results with the results obtained by a learning method (namely Support Vector Machines, SVM, with a string kernel).
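To make the idea of comparing rankings of function words concrete, here is a small sketch that ranks a fixed list of function words by frequency in two texts and compares the rankings with the Spearman footrule (the sum of absolute rank differences). The footrule is swapped in here as a simple, well-known stand-in and may differ from the measure used in the paper; the word list, the toy texts (written without diacritics) and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical function-word list for this sketch.
FUNCTION_WORDS = ["de", "la", "si", "in"]


def frequency_ranking(tokens, vocabulary):
    """Map each vocabulary word to its rank (1 = most frequent) in tokens."""
    counts = Counter(t for t in tokens if t in vocabulary)
    # Sort by descending frequency; ties are broken alphabetically.
    ordered = sorted(vocabulary, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def footrule_distance(rank_a, rank_b):
    """Spearman footrule: sum of absolute rank differences over shared words."""
    return sum(abs(rank_a[w] - rank_b[w]) for w in rank_a)


# Toy token sequences standing in for two texts of disputed authorship.
text_a = "de la de si de la in".split()
text_b = "si de si la si de in in".split()

rank_a = frequency_ranking(text_a, FUNCTION_WORDS)
rank_b = frequency_ranking(text_b, FUNCTION_WORDS)
distance = footrule_distance(rank_a, rank_b)
print(distance)
```

A smaller distance means the two texts order their function words more similarly, which is the kind of stylistic signal such a strategy relies on.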
2006
On the data base of Romanian syllables and some of its quantitative and cryptographic aspects
Liviu Dinu | Anca Dinu
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
In this paper we argue for the need to construct a database of Romanian syllables. We explain the reasons for our choice of the DOOM corpus, which we have used. We describe the way syllabification was performed and explain how we have constructed the database. The main quantitative aspects which we have extracted from our research are presented. We also computed the entropy of the syllables and the entropy of the syllables w.r.t. the consonant-vowel structure. The results are compared with results of similar studies carried out for other languages.
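The two entropy computations mentioned above can be sketched as follows, assuming Shannon entropy over the empirical frequency distribution of syllables. The toy syllable list, the vowel inventory, and the naive letter-by-letter consonant-vowel mapping (which ignores semivowels) are assumptions for illustration, not the procedure or data used in the paper.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cv_pattern(syllable, vowels="aeiouăâî"):
    """Naive consonant-vowel skeleton: every letter becomes C or V."""
    return "".join("V" if ch in vowels else "C" for ch in syllable.lower())


# Toy list of syllable tokens (hypothetical, not drawn from the DOOM corpus).
syllables = ["ca", "sa", "ma", "sa", "ca", "re", "in", "un"]

syllable_entropy = shannon_entropy(syllables)
pattern_entropy = shannon_entropy(cv_pattern(s) for s in syllables)
print(syllable_entropy, pattern_entropy)
```

Collapsing syllables to their consonant-vowel skeletons merges many distinct syllables into a few classes, so the entropy over patterns is at most the entropy over the syllables themselves.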
Total Rank Distance and Scaled Total Rank Distance: Two Alternative Metrics in Computational Linguistics
Anca Dinu | Liviu P. Dinu
Proceedings of the Workshop on Linguistic Distances