Pierluigi Cassotti

2026

Elections go bananas: A First Large-scale Multilingual Study of Pluralia Tantum using LLMs
Elena Spaziani | Kamyar Zeinalipour | Pierluigi Cassotti | Nina Tahmasebi
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we study the expansion of pluralia tantum, i.e., defective nouns which lack a singular form, like scissors. We base our work on an annotation framework specifically developed for the study of lexicalization of pluralia tantum, namely Lexicalization profiles. On a corresponding hand-annotated test set, we show that the OpenAI and DeepSeek models serve as useful annotators for semantic, syntactic and sense categories, with accuracy ranging from 51% to 89%, averaged across all feature groups and languages. Next, we turn to a large-scale investigation of pluralia tantum. Using dictionaries, we extract candidate words for Italian, Russian and English and keep those for which a change in the ratio of singular to plural forms is evident in a corresponding reference corpus. We use an LLM to annotate each instance from the reference corpora according to the annotation framework. We show that the large amount of automatically annotated sentences for each feature can be used to perform in-depth linguistic analysis. Focusing on the correlation between an annotated feature and the grammatical form (singular vs. plural), patterns of morpho-semantic change are noted.
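The candidate-filtering step described in the abstract (keeping words whose singular/plural ratio changes in a reference corpus) could be sketched roughly as follows. This is only an illustration and not the authors' procedure: the function names, the `"sg"`/`"pl"` labels, the period labels and the `min_delta` threshold are all assumptions made for the example.

```python
from collections import Counter


def plural_share_by_period(observations):
    """Return the share of plural forms per period.

    `observations` is an iterable of (period, form) pairs, where form is
    "sg" or "pl" (hypothetical labels chosen for this sketch).
    """
    counts = {}
    for period, form in observations:
        counts.setdefault(period, Counter())[form] += 1
    return {
        period: c["pl"] / (c["sg"] + c["pl"])
        for period, c in sorted(counts.items())
        if c["sg"] + c["pl"] > 0  # skip periods with no sg/pl evidence
    }


def ratio_changes(shares, min_delta=0.2):
    """Flag a candidate whose plural share varies by at least `min_delta`.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    values = list(shares.values())
    return bool(values) and max(values) - min(values) >= min_delta


# Toy data: a noun that shifts from mostly singular to mostly plural use.
toy = (
    [("1900s", "sg")] * 8 + [("1900s", "pl")] * 2
    + [("2000s", "sg")] * 3 + [("2000s", "pl")] * 7
)
shares = plural_share_by_period(toy)
keep_candidate = ratio_changes(shares)
```

With the toy counts above, the plural share rises from 2 of 10 occurrences to 7 of 10, so the candidate would be kept under the assumed threshold.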

pdf bib

The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)
Nina Tahmasebi | Pierluigi Cassotti | Syrielle Montariol | Andrey Kutuzov | Netta Huebscher | Elena Spaziani | Naomi Baes
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)

2025

pdf bib

A Hypothesis-Driven Framework for Detecting Lexical Semantic Change
Pierluigi Cassotti | Nina Tahmasebi
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

pdf bib abs

Sense-specific Historical Word Usage Generation
Pierluigi Cassotti | Nina Tahmasebi
Transactions of the Association for Computational Linguistics, Volume 13

Large-scale sense-annotated corpora are important for a range of tasks but are hard to come by. Dictionaries that record and describe the vocabulary of a language often offer a small set of real-world example sentences for each sense of a word. However, on their own, these sentences are too few to be used as diachronic sense-annotated corpora. We propose a targeted strategy for training and evaluating generative models producing historically and semantically accurate word usages given any word, sense definition, and year triple. Our results demonstrate that fine-tuned models can generate usages with the same properties as real-world example sentences from a reference dictionary. Thus the generated usages will be suitable for training and testing computational models where large-scale sense-annotated corpora are needed but currently unavailable.

pdf bib abs

Towards Language-Agnostic STIPA: Universal Phonetic Transcription to Support Language Documentation at Scale
Jacob Lee Suchardt | Hana El-Shazli | Pierluigi Cassotti
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

This paper explores the use of existing state-of-the-art speech recognition models (ASR) for the task of generating narrow phonetic transcriptions using the International Phonetic Alphabet (STIPA). Unlike conventional ASR systems focused on orthographic output for high-resource languages, STIPA can be used as a language-agnostic interface valuable for documenting under-resourced and unwritten languages. We introduce a new dataset for South Levantine Arabic and present the first large-scale evaluation of STIPA models across 51 language families. Additionally, we provide a use case on Sanna, a severely endangered language. Our findings show that fine-tuned ASR models can produce accurate IPA transcriptions with limited supervision, significantly reducing phonetic error rates even in extremely low-resource settings. The results highlight the potential of STIPA for scalable language documentation.

2024

pdf bib

Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change
Nina Tahmasebi | Syrielle Montariol | Andrey Kutuzov | David Alfter | Francesco Periti | Pierluigi Cassotti | Netta Huebscher
Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change

pdf bib abs

Computational modeling of semantic change
Pierluigi Cassotti | Francesco Periti | Stefano De Pascale | Haim Dubossarsky | Nina Tahmasebi
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Languages change constantly over time, influenced by social, technological, cultural and political factors that affect how people express themselves. In particular, words can undergo the process of semantic change, which can be subtle and significantly impact the interpretation of texts. For example, the word terrific used to mean ‘causing terror’ and was as such synonymous with terrifying. Nowadays, speakers use the word in the sense of ‘excessive’ and even ‘amazing’. In Historical Linguistics, tools and methods have been developed to analyse this phenomenon, including systematic categorisations of the types of change, the causes and the mechanisms underlying the different types of change. However, traditional linguistic methods, while informative, are often based on small, carefully curated samples. Thanks to the availability of large diachronic corpora, the computational means to model word meaning unsupervised, and evaluation benchmarks, we are seeing an increasing interest in the computational modelling of semantic change. This is evidenced by the increasing number of publications in this new domain as well as the organisation of initiatives and events related to this topic, such as four editions of the International Workshop on Computational Approaches to Historical Language Change (LChange), and several evaluation campaigns (Schlechtweg et al., 2020a; Basile et al., 2020b; Kutuzov et al.; Zamora-Reina et al., 2022).

pdf bib abs

Analyzing Semantic Change through Lexical Replacements
Francesco Periti | Pierluigi Cassotti | Haim Dubossarsky | Nina Tahmasebi
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Modern language models are capable of contextualizing words based on their surrounding context. However, this capability is often compromised due to semantic change that leads to words being used in new, unexpected contexts not encountered during pre-training. In this paper, we model semantic change by studying the effect of unexpected contexts introduced by lexical replacements. We propose a replacement schema where a target word is substituted with lexical replacements of varying relatedness, thus simulating different kinds of semantic change. Furthermore, we leverage the replacement schema as a basis for a novel interpretable model for semantic change. We are also the first to evaluate the use of LLaMa for semantic change detection.

pdf bib abs

TRoTR: A Framework for Evaluating the Re-contextualization of Text Reuse
Francesco Periti | Pierluigi Cassotti | Stefano Montanelli | Nina Tahmasebi | Dominik Schlechtweg
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Current approaches for detecting text reuse do not focus on recontextualization, i.e., how the new context(s) of a reused text differs from its original context(s). In this paper, we propose a novel framework called TRoTR that relies on the notion of topic relatedness for evaluating the diachronic change of context in which text is reused. TRoTR includes two NLP tasks: TRiC and TRaC. TRiC is designed to evaluate the topic relatedness between a pair of recontextualizations. TRaC is designed to evaluate the overall topic variation within a set of recontextualizations. We also provide a curated TRoTR benchmark of biblical text reuse, human-annotated with topic relatedness. The benchmark exhibits an inter-annotator agreement of .811. We evaluate multiple, established SBERT models on the TRoTR tasks and find that they exhibit greater sensitivity to textual similarity than topic relatedness. Our experiments show that fine-tuning these models can mitigate this sensitivity.

pdf bib abs

DWUGs-IT: Extending and Standardizing Lexical Semantic Change Detection for Italian
Pierluigi Cassotti | Pierpaolo Basile | Nina Tahmasebi
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Lexical Semantic Change Detection (LSCD) is the task of determining whether a word has undergone a change in meaning over time. There has been a marked increase in interest in this task, accompanied by a corresponding growth in the scientific community involved in developing computational approaches to semantic change. In recent years, a number of resources have been made available for the evaluation of LSC models in a number of languages, including English, Swedish, German, Latin, Russian and Chinese. DIACR-ITA is the only existing resource for LSCD in Italian. However, DIACR-ITA has a different format from that used for other languages. In this paper we present DWUGs-IT, which extends the DIACR-ITA dataset with additional target words and usage-sense pair annotations and adapts it to the DURel format, including the first implementation of an LSCD graded task for Italian.

pdf bib abs

More DWUGs: Extending and Evaluating Word Usage Graph Datasets in Multiple Languages
Dominik Schlechtweg | Pierluigi Cassotti | Bill Noble | David Alfter | Sabine Schulte Im Walde | Nina Tahmasebi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Word Usage Graphs (WUGs) represent human semantic proximity judgments for pairs of word uses in a weighted graph, which can be clustered to infer word sense clusters from simple pairwise word use judgments, avoiding the need for word sense definitions. SemEval-2020 Task 1 provided the first and to date largest manually annotated, diachronic WUG dataset. In this paper, we check the robustness and correctness of the annotations by continuing the SemEval annotation algorithm for two more rounds and comparing against an established annotation paradigm. Further, we test the reproducibility by resampling a new, smaller set of word uses from the SemEval source corpora and annotating them. Our work contributes to a better understanding of the problems and opportunities of the WUG annotation paradigm and points to future improvements.

pdf bib abs

Using Synchronic Definitions and Semantic Relations to Classify Semantic Change Types
Pierluigi Cassotti | Stefano De Pascale | Nina Tahmasebi
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

There is abundant evidence that the ways words change their meaning can be classified into different types of change, highlighting the relationship between the old and new meanings (among which generalisation, specialisation and co-hyponymy transfer). In this paper, we present a way of detecting these types of change by constructing a model that leverages information both from synchronic lexical relations and definitions of word meanings. Specifically, we use synset definitions and hierarchy information from WordNet and test it on a digitized version of Blank’s (1997) dataset of semantic change types. Finally, we show how the sense relationships can improve models for both approximation of human judgments of semantic relatedness as well as binary Lexical Semantic Change Detection.

2023

pdf bib

On the Impact of Language Adaptation for Large Language Models: A Case Study for the Italian Language Using Only Open Resources
Pierpaolo Basile | Pierluigi Cassotti | Marco Polignano | Lucia Siciliani | Giovanni Semeraro
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

pdf bib abs

XL-LEXEME: WiC Pretrained Model for Cross-Lingual LEXical sEMantic changE
Pierluigi Cassotti | Lucia Siciliani | Marco DeGemmis | Giovanni Semeraro | Pierpaolo Basile
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

The recent introduction of large-scale datasets for the WiC (Word in Context) task enables the creation of more reliable and meaningful contextualized word embeddings. However, most of the approaches to the WiC task use cross-encoders, which prevent the possibility of deriving comparable word embeddings. In this work, we introduce XL-LEXEME, a Lexical Semantic Change Detection model. XL-LEXEME extends SBERT, highlighting the target word in the sentence. We evaluate XL-LEXEME on the multilingual benchmarks for SemEval-2020 Task 1 - Lexical Semantic Change (LSC) Detection and the RuShiftEval shared task involving five languages: English, German, Swedish, Latin, and Russian. XL-LEXEME outperforms the state-of-the-art in English, German and Swedish with statistically significant differences from the baseline results and obtains state-of-the-art performance in the RuShiftEval shared task.

2022

pdf bib abs

swapUNIBA@FinTOC2022: Fine-tuning Pre-trained Document Image Analysis Model for Title Detection on the Financial Domain
Pierluigi Cassotti | Cataldo Musto | Marco DeGemmis | Georgios Lekkas | Giovanni Semeraro
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

In this paper, we introduce the results of our submitted system to the FinTOC 2022 task. We address the task using a two-stage process: first, we detect titles using Document Image Analysis, then we train a supervised model for the hierarchical level prediction. We perform Document Image Analysis using a Faster R-CNN pre-trained on the PubLayNet dataset. We fine-tuned the model on the FinTOC 2022 training set. We extract orthographic and layout features from detected titles and use them to train a Random Forest model to predict the title level. The proposed system ranked #1 on both Title Detection and the Table of Content extraction tasks for Spanish. The system ranked #3 on both subtasks for English and French.

2021

pdf bib

Extracting Relations from Italian Wikipedia using Self-Training
Lucia Siciliani | Pierluigi Cassotti | Pierpaolo Basile | Marco de Gemmis | Pasquale Lops | Giovanni Semeraro
Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021)

pdf bib

Emerging Trends in Gender-Specific Occupational Titles in Italian Newspapers
Pierluigi Cassotti | Andrea Iovine | Pierpaolo Basile | Marco De Gemmis | Giovanni Semeraro
Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021)

pdf bib abs

The Corpora They Are a-Changing: a Case Study in Italian Newspapers
Pierpaolo Basile | Annalina Caputo | Tommaso Caselli | Pierluigi Cassotti | Rossella Varvara
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

The use of automatic methods for the study of lexical semantic change (LSC) has led to the creation of evaluation benchmarks. Benchmark datasets, however, are intimately tied to the corpus used for their creation, which calls into question their reliability as well as the robustness of automatic methods. This contribution investigates these aspects, showing the impact of unforeseen social and cultural dimensions. We also identify a set of additional issues (OCR quality, named entities) that impact the performance of the automatic methods, especially when used to discover LSC.

2020

pdf bib

A Diachronic Italian Corpus based on “L’Unità”
Pierpaolo Basile | Annalina Caputo | Tommaso Caselli | Pierluigi Cassotti | Rossella Varvara
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

pdf bib abs

GM-CTSC at SemEval-2020 Task 1: Gaussian Mixtures Cross Temporal Similarity Clustering
Pierluigi Cassotti | Annalina Caputo | Marco Polignano | Pierpaolo Basile
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the system proposed by the Random team for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. We focus our approach on the detection problem. Given the semantics of words captured by temporal word embeddings in different time periods, we investigate the use of unsupervised methods to detect when the target word has gained or lost senses. To this end, we define a new algorithm based on Gaussian Mixture Models to cluster the target similarities computed over the two periods. We compare the proposed approach with a number of similarity-based thresholds. We found that, although the performance of the detection methods varies across the word embedding algorithms, the combination of Gaussian Mixture with Temporal Referencing resulted in our best system.

pdf bib

Analysis of Lexical Semantic Changes in Corpora with the Diachronic Engine
Pierluigi Cassotti | Pierpaolo Basile | Marco de Gemmis | Giovanni Semeraro
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

2017

pdf bib

Bi-directional LSTM-CNNs-CRF for Italian Sequence Labeling
Pierpaolo Basile | Giovanni Semeraro | Pierluigi Cassotti
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)