Matteo Pellegrini

2025

DynaMorphPro: A New Diachronic and Multilingual Lexical Resource in the LLOD ecosystem
Matteo Pellegrini | Valeria Irene Boano | Francesco Gardani | Francesco Mambrini | Giovanni Moretti | Marco Carlo Passarotti
Proceedings of the 5th Conference on Language, Data and Knowledge

This paper describes the release as Linguistic Linked Open Data of DynaMorphPro, a lexical resource recording loanwords, conversions and class-shifts from Latin to Old Italian. We show how existing vocabularies are reused and integrated to allow for a rich semantic representation of these data. Our main reference is the OntoLex-lemon model for lexical information, but classes and properties from many other ontologies are also reused to express other aspects. In particular, we identify the CIDOC Concept Reference Model as the ideal tool to convey chronological information on historical processes of lexical innovation and change, and describe how it can be integrated with OntoLex-lemon.

2024

Representing Compounding with OntoLex. An Evaluation of Vocabularies for Word Formation Resources
Elena Benzoni | Matteo Pellegrini | Francesco Dedè | Marco Passarotti
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper explores how compounds are represented in resources documenting word formation, and proposes ways to convert them into Linked Open Data using the OntoLex model. The ultimate purpose is to offer a broad empirical evaluation of which of the two OntoLex modules allowing for the representation of compounds – Decomp and Morph – best fits the different formats and theoretical approaches of the resources we examine. We show that the vocabulary of Decomp alone is rarely sufficient to account for all relevant facts; in almost all cases, it is necessary to resort to the vocabulary of Morph, either to reify the relation between compounds and their constituents or to represent specifically morphological information or other aspects. Special attention is devoted to the format of the Universal Derivations project: the modelling strategy that we propose can be applied to all resources harmonized in that format, potentially allowing for the conversion into Linked Open Data of a large amount of structured data.

2023

Linking the Neulateinische Wortliste to the LiLa Knowledge Base of Interoperable Resources for Latin
Federica Iurescia | Eleonora Litta | Marco Passarotti | Matteo Pellegrini | Giovanni Moretti | Paolo Ruffolo
Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper describes the process of interlinking a lexical resource consisting of a list of more than 20,000 Neo-Latin words with other resources for Latin. The resources are made interoperable thanks to their linking to the LiLa Knowledge Base, which applies Linguistic Linked Open Data practices and data categories to describe and publish on the Web both textual and lexical resources for the Latin language.

Beyond Obscuration and Visibility: Thoughts on the Different Strategies of Gender-Fair Language in Italian
Martina Rosola | Simona Frenda | Alessandra Teresa Cignarella | Matteo Pellegrini | Andrea Marra | Mara Floris
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

2022

The Index Thomisticus Treebank as Linked Data in the LiLa Knowledge Base
Francesco Mambrini | Marco Passarotti | Giovanni Moretti | Matteo Pellegrini
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Although the Universal Dependencies initiative today allows for cross-linguistically consistent annotation of morphology and syntax in treebanks for several languages, syntactically annotated corpora are not yet interoperable with many lexical resources that describe properties of the words that occur therein. To address this limitation, we propose adopting the principles of the Linguistic Linked Open Data community to describe and publish dependency treebanks as LLOD. In particular, this paper illustrates the approach pursued in the LiLa Knowledge Base, which enables interoperability between corpora and lexical resources for Latin, to publish as Linguistic Linked Open Data the annotation layers of two versions of a Medieval Latin treebank (the Index Thomisticus Treebank).

Computational Morphology with OntoLex-Morph
Christian Chiarcos | Katerina Gkirtzou | Fahad Khan | Penny Labropoulou | Marco Passarotti | Matteo Pellegrini
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

This paper describes the current status of the emerging OntoLex module for linguistic morphology. It serves as an update to the previous version of the vocabulary (Klimek et al. 2019). Whereas this earlier model focused exclusively on descriptive morphology and on applications in lexicography, we now present a novel part of the vocabulary and a novel application to language technology, i.e., the rule-based generation of lexicons, introducing a dynamic component into OntoLex.

2020

Overview of the EvaLatin 2020 Evaluation Campaign
Rachele Sprugnoli | Marco Passarotti | Flavio Massimiliano Cecchini | Matteo Pellegrini
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

This paper describes the first edition of EvaLatin, a campaign entirely devoted to the evaluation of NLP tools for Latin. The two shared tasks proposed in EvaLatin 2020, i.e. Lemmatization and Part-of-Speech tagging, are aimed at fostering research in the field of language technologies for Classical languages. The shared dataset consists of texts taken from the Perseus Digital Library, processed with UDPipe models and then manually corrected by Latin experts. The training set includes only prose texts by Classical authors. The test set, alongside prose texts by the same authors represented in the training set, also includes data from poetry and from the Medieval period. This also allows us to propose the Cross-genre and Cross-time subtasks for each task, in order to evaluate the portability of NLP tools for Latin across different genres and time periods. The results obtained by the participants for each task and subtask are presented and discussed.

Using LatInfLexi for an Entropy-Based Assessment of Predictability in Latin Inflection
Matteo Pellegrini
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

This paper presents LatInfLexi, a large inflected lexicon of Latin providing information on all the inflected wordforms of 3,348 verbs and 1,038 nouns. After a description of the structure of the resource and some data on its size, the procedure followed to obtain the lexicon from the database of the Lemlat 3.0 morphological analyzer is detailed, as well as the choices made regarding overabundant and defective cells. The way in which the data of LatInfLexi can be exploited in order to perform a quantitative assessment of predictability in Latin verb inflection is then illustrated: results obtained by computing the conditional entropy of guessing the content of a paradigm cell assuming knowledge of one wordform or multiple wordforms are presented in turn, highlighting the descriptive and theoretical relevance of the analysis. Lastly, the paper envisages the advantages of an inclusion of LatInfLexi into the LiLa knowledge base, both for the presented resource and for the knowledge base itself.

(Stem and Word) Predictability in Italian Verb Paradigms: An Entropy-Based Study Exploiting the New Resource LeFFI
Matteo Pellegrini | Alessandra Teresa Cignarella
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)