2024
Verba volant, scripta volant? Don’t worry! There are computational solutions for protoword reconstruction
Liviu P Dinu | Ana Sabina Uban | Alina Maria Cristea | Ioan-Bogdan Iordache | Teodor-George Marchitan | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
We introduce a new database of cognate words and etymons for the five main Romance languages, the most comprehensive one to date. We propose a strong benchmark for the automatic reconstruction of protowords for Romance languages, by applying a set of machine learning models and features on these data. The best results reach 90% accuracy in predicting the protoword of a given cognate set, surpassing existing state-of-the-art results for this task and showing that computational methods can be very useful in assisting linguists with protoword reconstruction.
It takes two to borrow: a donor and a recipient. Who’s who?
Liviu Dinu | Ana Uban | Anca Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Laurentiu Zoicas
Findings of the Association for Computational Linguistics: ACL 2024
We address the open problem of automatically identifying the direction of lexical borrowing, given word pairs in the donor and recipient languages. We propose strong benchmarks for this task, by applying a set of machine learning models. We extract and publicly release a comprehensive borrowings dataset from the recent RoBoCoP cognates and borrowings database for five Romance languages. We experiment on this dataset with both graphic and phonetic representations and with different features, models and architectures. We interpret the results, in terms of F1 score, commenting on the influence of features and model choice, of the imbalanced data and of the inherent difficulty of the task for particular language pairs. We show that automatically determining the direction of borrowing is a feasible task, and propose additional directions for future work.
Pater Incertus? There Is a Solution: Automatic Discrimination between Cognates and Borrowings for Romance Languages
Liviu P. Dinu | Ana Sabina Uban | Ioan-Bogdan Iordache | Alina Maria Cristea | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Identifying the type of relationship between words (cognates, borrowings, inherited) provides a deeper insight into the history of a language and allows for a better characterization of language relatedness. In this paper, we propose a computational approach for discriminating between cognates and borrowings, one of the most difficult tasks in historical linguistics. We compare the discriminative power of graphic and phonetic features and we analyze the underlying linguistic factors that prove relevant in the classification task. We perform experiments for pairs of languages in the Romance language family (French, Italian, Spanish, Portuguese, and Romanian), based on a comprehensive database of Romance cognates and borrowings. To our knowledge, this is one of the first attempts of this kind and the most comprehensive in terms of covered languages.
2023
RoBoCoP: A Comprehensive ROmance BOrrowing COgnate Package and Benchmark for Multilingual Cognate Identification
Liviu Dinu | Ana Uban | Alina Cristea | Anca Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
The identification of cognates is a fundamental process in historical linguistics, on which any further research is based. Even though there are several cognate databases for Romance languages, they are rather scattered, incomplete, noisy, contain unreliable information, or have uncertain availability. In this paper we introduce a comprehensive database of Romance cognates and borrowings based on the etymological information provided by the dictionaries. We extract pairs of cognates between any two Romance languages by parsing electronic dictionaries of Romanian, Italian, Spanish, Portuguese and French. Based on this resource, we propose a strong benchmark for the automatic detection of cognates, by applying machine learning and deep learning based methods on any two pairs of Romance languages. We find that automatic identification of cognates is possible with accuracy averaging around 94% for the more difficult task formulations.
2022
CoToHiLi at LSCDiscovery: the Role of Linguistic Features in Predicting Semantic Change
Ana Sabina Uban | Alina Maria Cristea | Anca Daniela Dinu | Liviu P Dinu | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change
This paper presents the contributions of the CoToHiLi team for the LSCDiscovery shared task on semantic change in the Spanish language. We participated in both tasks (graded discovery and binary change, including sense gain and sense loss) and proposed models based on word embedding distances combined with hand-crafted linguistic features, including polysemy, number of neological synonyms, and relation to cognates in English. We find that models that include linguistically informed features combined using weights assigned manually by experts lead to promising results.
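As a rough illustration of the kind of scoring this abstract describes (an embedding distance combined with hand-crafted linguistic features under manually assigned weights), a minimal Python sketch follows. The feature names, weights, and vectors are hypothetical placeholders chosen for the example; they are not values or code from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def change_score(features, weights):
    """Weighted sum of feature values; the weights are set by hand, not learned."""
    return sum(weights[name] * features[name] for name in weights)


# Hypothetical embeddings of one word in an older and a newer corpus.
old_vec = [0.9, 0.1, 0.0]
new_vec = [0.2, 0.7, 0.1]

features = {
    "embedding_distance": cosine_distance(old_vec, new_vec),
    "polysemy": 3.0,             # assumed: number of dictionary senses
    "neological_synonyms": 1.0,  # assumed: count of neological synonyms
    "has_english_cognate": 1.0,  # assumed: 1.0 if an English cognate exists
}
weights = {  # assumed, illustrative values only
    "embedding_distance": 0.6,
    "polysemy": 0.2,
    "neological_synonyms": 0.1,
    "has_english_cognate": 0.1,
}

score = change_score(features, weights)
print(f"graded change score: {score:.3f}")
```

Ranking words by such a score would correspond to a graded formulation, and thresholding it to a binary one; how the team actually scaled and combined the features is not stated in the abstract.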
2021
Towards an Etymological Map of Romanian
Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Ana Sabina Uban | Laurentiu Zoicas
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
In this paper we investigate the etymology of Romanian words. We start from the Romanian lexicon and automatically extract information from multiple etymological dictionaries. We evaluate the results and perform extensive quantitative and qualitative analyses with the goal of building an etymological map of the language.
Automatic Discrimination between Inherited and Borrowed Latin Words in Romance Languages
Alina Maria Cristea | Liviu P. Dinu | Simona Georgescu | Mihnea-Lucian Mihai | Ana Sabina Uban
Findings of the Association for Computational Linguistics: EMNLP 2021
In this paper, we address the problem of automatically discriminating between inherited and borrowed Latin words. We introduce a new dataset and investigate the case of Romance languages (Romanian, Italian, French, Spanish, Portuguese and Catalan), where words directly inherited from Latin coexist with words borrowed from Latin, and explore whether automatic discrimination between them is possible. Having entered the language at a later stage, borrowed words are no longer subject to historical sound shift rules, hence they are presumably less eroded, which is why we expect them to have a different intrinsic structure distinguishable by computational means. We employ several machine learning models to automatically discriminate between inherited and borrowed words and compare their performance with various feature sets. We analyze the models’ predictive power on two versions of the datasets, orthographic and phonetic. We also investigate whether prior knowledge of the etymon provides better results, employing n-gram character features extracted from the word-etymon pairs and from their alignment.
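To make the last feature type in this abstract concrete, here is a minimal Python sketch of character n-grams extracted from a word-etymon alignment. It assumes a plain unit-cost global alignment (Needleman-Wunsch style); the function names, the `-` gap symbol, and the `etymon>word` pair notation are choices made for this illustration, and the paper's actual alignment procedure and feature encoding may differ.

```python
from __future__ import annotations

from collections import Counter


def align(a: str, b: str, gap: str = "-") -> tuple[str, str]:
    """Globally align two strings with unit substitution and gap costs."""
    n, m = len(a), len(b)
    # dp[i][j] = minimal edit cost of aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            out_a.append(a[i - 1])
            out_b.append(gap)
            i -= 1
        else:
            out_a.append(gap)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def alignment_ngrams(word: str, etymon: str, n: int = 2) -> Counter:
    """Count n-grams over aligned (etymon symbol > word symbol) pairs."""
    aligned_word, aligned_etymon = align(word, etymon)
    pairs = [f"{e}>{w}" for e, w in zip(aligned_etymon, aligned_word)]
    return Counter(" ".join(pairs[k:k + n]) for k in range(len(pairs) - n + 1))


# Example: Romanian "noapte" against the Latin accusative "noctem".
features = alignment_ngrams("noapte", "noctem", n=2)
```

The resulting counts could be fed to any standard classifier as a sparse feature vector. Running the same function on phonetic transcriptions instead of spellings would give a phonetic counterpart of these features.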
Tracking Semantic Change in Cognate Sets for English and Romance Languages
Ana Sabina Uban | Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021
Semantic divergence in related languages is a key concern of historical linguistics. We cross-linguistically investigate the semantic divergence of cognate pairs in English and Romance languages, by means of word embeddings. To this end, we introduce a new curated dataset of cognates in all pairs of those languages. We describe the types of errors that occurred during the automated cognate identification process and manually correct them. Additionally, we label the English cognates according to their etymology, separating them into two groups: old borrowings and recent borrowings. On this curated dataset, we analyse word properties such as frequency and polysemy, and the distribution of similarity scores between cognate sets in different languages. We automatically identify different clusters of English cognates, setting a new direction of research in cognates, borrowings and possibly false friends analysis in related languages.