Pierluigi Cassotti


2024

pdf bib
Computational modeling of semantic change
Pierluigi Cassotti | Francesco Periti | Stefano de Pascale | Haim Dubossarsky | Nina Tahmasebi
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Languages change constantly over time, influenced by social, technological, cultural and political factors that affect how people express themselves. In particular, words can undergo the process of semantic change, which can be subtle and significantly impact the interpretation of texts. For example, the word terrific used to mean ‘causing terror’ and was as such synonymous to terrifying. Nowadays, speakers use the word in the sense of ‘excessive’ and even ‘amazing’. In Historical Linguistics, tools and methods have been developed to analyse this phenomenon, including systematic categorisations of the types of change, the causes and the mechanisms underlying the different types of change. However, traditional linguistic methods, while informative, are often based on small, carefully curated samples. Thanks to the availability of both large diachronic corpora, the computational means to model word meaning unsupervised, and evaluation benchmarks, we are seeing an increasing interest in the computational modelling of semantic change. This is evidenced by the increasing number of publications in this new domain as well as the organisation of initiatives and events related to this topic, such as four editions of the International Workshop on Computational Approaches to Historical Language Change LChange1, and several evaluation campaigns (Schlechtweg et al., 2020a; Basile et al., 2020b; Kutuzov et al.; Zamora-Reina et al., 2022).

2023

pdf bib
Graph Databases for Diachronic Language Data Modelling
Barbara McGillivray | Pierluigi Cassotti | Davide Di Pierro | Paola Marongiu | Anas Fahad Khan | Stefano Ferilli | Pierpaolo Basile
Proceedings of the 4th Conference on Language, Data and Knowledge

pdf bib
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change
Nina Tahmasebi | Syrielle Montariol | Haim Dubossarsky | Andrey Kutuzov | Simon Hengchen | David Alfter | Francesco Periti | Pierluigi Cassotti
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

pdf bib
XL-LEXEME: WiC Pretrained Model for Cross-Lingual LEXical sEMantic changE
Pierluigi Cassotti | Lucia Siciliani | Marco DeGemmis | Giovanni Semeraro | Pierpaolo Basile
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

The recent introduction of large-scale datasets for the WiC (Word in Context) task enables the creation of more reliable and meaningful contextualized word embeddings.However, most of the approaches to the WiC task use cross-encoders, which prevent the possibility of deriving comparable word embeddings.In this work, we introduce XL-LEXEME, a Lexical Semantic Change Detection model.XL-LEXEME extends SBERT, highlighting the target word in the sentence. We evaluate XL-LEXEME on the multilingual benchmarks for SemEval-2020 Task 1 - Lexical Semantic Change (LSC) Detection and the RuShiftEval shared task involving five languages: English, German, Swedish, Latin, and Russian.XL-LEXEME outperforms the state-of-the-art in English, German and Swedish with statistically significant differences from the baseline results and obtains state-of-the-art performance in the RuShiftEval shared task.

2022

pdf bib
swapUNIBA@FinTOC2022: Fine-tuning Pre-trained Document Image Analysis Model for Title Detection on the Financial Domain
Pierluigi Cassotti | Cataldo Musto | Marco DeGemmis | Georgios Lekkas | Giovanni Semeraro
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

In this paper, we introduce the results of our submitted system to the FinTOC 2022 task. We address the task using a two-stage process: first, we detect titles using Document Image Analysis, then we train a supervised model for the hierarchical level prediction. We perform Document Image Analysis using a pre-trained Faster R-CNN on the PublyaNet dataset. We fine-tuned the model on the FinTOC 2022 training set. We extract orthographic and layout features from detected titles and use them to train a Random Forest model to predict the title level. The proposed system ranked #1 on both Title Detection and the Table of Content extraction tasks for Spanish. The system ranked #3 on both the two subtasks for English and French.

2021

pdf bib
The Corpora They Are a-Changing: a Case Study in Italian Newspapers
Pierpaolo Basile | Annalina Caputo | Tommaso Caselli | Pierluigi Cassotti | Rossella Varvara
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

The use of automatic methods for the study of lexical semantic change (LSC) has led to the creation of evaluation benchmarks. Benchmark datasets, however, are intimately tied to the corpus used for their creation questioning their reliability as well as the robustness of automatic methods. This contribution investigates these aspects showing the impact of unforeseen social and cultural dimensions. We also identify a set of additional issues (OCR quality, named entities) that impact the performance of the automatic methods, especially when used to discover LSC.

2020

pdf bib
GM-CTSC at SemEval-2020 Task 1: Gaussian Mixtures Cross Temporal Similarity Clustering
Pierluigi Cassotti | Annalina Caputo | Marco Polignano | Pierpaolo Basile
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the system proposed by the Random team for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. We focus our approach on the detection problem. Given the semantics of words captured by temporal word embeddings in different time periods, we investigate the use of unsupervised methods to detect when the target word has gained or lost senses. To this end, we define a new algorithm based on Gaussian Mixture Models to cluster the target similarities computed over the two periods. We compare the proposed approach with a number of similarity-based thresholds. We found that, although the performance of the detection methods varies across the word embedding algorithms, the combination of Gaussian Mixture with Temporal Referencing resulted in our best system.