Fabio Tamburini

Also published as: F. Tamburini

2025

Curated Data Does Not Mean Representative Data When Training Large Language Models: An Experiment Using Representative Data for Italian
Fabio Tamburini
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

pdf bib

BABILong-ITA: A New Benchmark for Testing Large Language Models Effective Context Length and a Context Extension Method
Fabio Tamburini
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

2024

pdf bib abs

Exploring Text-Embedding Retrieval Models for the Italian Language
Yuri Noviello | Fabio Tamburini
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Text retrieval systems have become essential in the field of natural language processing (NLP), serving as the backbone for applications such as search engines, document indexing, and information retrieval. With the rise of generative AI, particularly Retrieval-Augmented Generation (RAG) systems, the demand for robust text retrieval models has increased. However, existing large language models (LLMs) and datasets are often insufficiently optimized for Italian, limiting their performance in Italian text retrieval tasks. This paper addresses this gap by proposing both a data collection and specialized models tailored for Italian text retrieval. Through extensive experimentation, we analyze the improvements and limitations in retrieval performance, paving the way for more effective Italian NLP applications.

pdf bib abs

Automatic Detection of Rhythmic Features in Pathological Speech of MCI and Dementia Patients
Marica Belmonte | Gloria Gagliardi | Dimitrios Kokkinakis | Fabio Tamburini
Proceedings of the Fifth Workshop on Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments @LREC-COLING 2024

Linguistic alterations represent one of the prodromal signs of cognitive decline associated with Dementia. In recent years, a growing body of work has been devoted to the development of algorithms for the automatic linguistic analysis of both oral and written texts, for diagnostic purposes. The extraction of Digital Linguistic Biomarkers from patients’ verbal productions can indeed provide a rapid, ecological, and cost-effective system for large-scale screening of the pathology. This article contributes to the ongoing research in the field by exploring a traditionally less studied aspect of language in Dementia, namely the rhythmic characteristics of speech. In particular, the paper focuses on the automatic detection of rhythmic features in Italian-connected speech. A landmark-based system was developed and evaluated to segment the speech flow into vocalic and consonantal intervals and to calculate several rhythmic metrics. Additionally, the reliability of these metrics in identifying Mild Cognitive Impairment and Dementia patients was tested.

pdf bib abs

Voice Activity Detection on Italian Language
Shibingfeng Zhang | Gloria Gagliardi | Fabio Tamburini
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Voice Activity Detection (VAD) refers to the task of identifying human voice activity in noisy settings, playing a crucial role in fields like speech recognition and audio surveillance. However, most VAD research focuses on English, leaving other languages, such as Italian, under-explored. This study aims to evaluate and enhance VAD systems for Italian speech, with the goal of finding a solution for the speech segmentation component of the Digital Linguistic Biomarkers (DLBs) extraction pipeline for early mental disorder diagnosis. We experimented with various VAD systems and propose an ensemble VAD system that integrates the best-performing models. Our ensemble system shows significant improvements in speech event detection. This advancement lays a robust foundation for more accurate early detection of mental health issues using DLBs in Italian.

pdf bib abs

Complexifying BERT Using LoRA Adapters
Fabio Tamburini
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

This paper presents the first results of a pilot study for transforming a real-valued pre-trained transformer encoder into a complex-valued one. Following recent findings about pre-training using LoRA, the main idea is to employ complex-valued LoRA adapters to make the trick and continue the pre-training of a given Italian model for setting up the adapters. After pre-training, the proposed complex-valued model has been evaluated on a standardised benchmark for Italian natural-language understanding obtaining very encouraging results.

2023

pdf bib abs

Decipherment of Lost Ancient Scripts as Combinatorial Optimisation Using Coupled Simulated Annealing
Fabio Tamburini
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

This paper presents a new approach to the ancient scripts decipherment problem based on combinatorial optimisation and coupled simulated annealing, an advanced non-convex optimisation procedure. Solutions are encoded by using k-permutations allowing for null, oneto-many, and many-to-one mappings between signs. The proposed system is able to produce enhanced results in cognate identification when compared to the state-of-the-art systems on standard evaluation benchmarks used in literature.

2022

pdf bib abs

Combining ELECTRA and Adaptive Graph Encoding for Frame Identification
Fabio Tamburini
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents contributions in two directions: first we propose a new system for Frame Identification (FI), based on pre-trained text encoders trained discriminatively and graphs embedding, producing state of the art performance and, second, we take in consideration all the extremely different procedures used to evaluate systems for this task performing a complete evaluation over two benchmarks and all possible splits and cleaning procedures used in the FI literature.

pdf bib abs

The Automatic Extraction of Linguistic Biomarkers as a Viable Solution for the Early Diagnosis of Mental Disorders
Gloria Gagliardi | Fabio Tamburini
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Digital Linguistic Biomarkers extracted from spontaneous language productions proved to be very useful for the early detection of various mental disorders. This paper presents a computational pipeline for the automatic processing of oral and written texts: the tool enables the computation of a rich set of linguistic features at the acoustic, rhythmic, lexical, and morphosyntactic levels. Several applications of the instrument - for the detection of Mild Cognitive Impairments, Anorexia Nervosa, and Developmental Language Disorders - are also briefly discussed.

pdf bib abs

Contextual Unsupervised Clustering of Signs for Ancient Writing Systems
Michele Corazza | Fabio Tamburini | Miguel Valério | Silvia Ferrara
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

The application of machine learning techniques to ancient writing systems is a relatively new idea, and it poses interesting challenges for researchers. One particularly challenging aspect is the scarcity of data for these scripts, which contrasts with the large amounts of data usually available when applying neural models to computational linguistics and other fields. For this reason, any method that attempts to work on ancient scripts needs to be ad-hoc and consider paleographic aspects, in addition to computational ones. Considering the peculiar characteristics of the script that we used is therefore be a crucial part of our work, as any solution needs to consider the particular nature of the writing system that it is applied to. In this work we propose a preliminary evaluation of a novel unsupervised clustering method on Cypro-Greek syllabary, a writing system from Cyprus. This evaluation shows that our method improves clustering performance using information about the attested sequences of signs in combination with an unsupervised model for images, with the future goal of applying the methodology to undeciphered writing systems from a related and typologically similar script.

This paper presents a novel algorithm for Word Sense Disambiguation (WSD) based on Quantum Probability Theory. The Quantum WSD algorithm requires concepts representations as vectors in the complex domain and thus we have developed a technique for computing complex word and sentence embeddings based on the Paragraph Vectors algorithm. Despite the proposed method is quite simple and that it does not require long training phases, when it is evaluated on a standardized benchmark for this task it exhibits state-of-the-art (SOTA) performances.

2018

pdf bib

PoSTWITA-UD: an Italian Twitter Treebank in Universal Dependencies
Manuela Sanguinetti | Cristina Bosco | Alberto Lavelli | Alessandro Mazzei | Oronzo Antonelli | Fabio Tamburini
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib

Parsing Italian Texts Together is Better Than Parsing Them Alone!
Oronzo Antonelli | Fabio Tamburini
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

pdf bib

A new Pitch Tracking Smoother based on Deep Neural Networks
Michele Ferro | Fabio Tamburini
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

pdf bib

Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)
Elena Cabrio | Alessandro Mazzei | Fabio Tamburini
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

pdf bib

Preface
Elena Cabrio | Alessandro Mazzei | Fabio Tamburini
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

2017

pdf bib

Annotating Italian Social Media Texts in Universal Dependencies
Manuela Sanguinetti | Cristina Bosco | Alessandro Mazzei | Alberto Lavelli | Fabio Tamburini
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

pdf bib

Developing a Large Scale FrameNet for Italian: the IFrameNet Experience
Silvia Brambilla | Danilo Croce | Fabio Tamburini | Roberto Basili
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)

pdf bib

Semgrex-Plus: a Tool for Automatic Dependency-Graph Rewriting
Fabio Tamburini
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

pdf bib abs

Towards Quantum Language Models
Ivano Basile | Fabio Tamburini
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

This paper presents a new approach for building Language Models using the Quantum Probability Theory, a Quantum Language Model (QLM). It mainly shows that relying on this probability calculus it is possible to build stochastic models able to benefit from quantum correlations due to interference and entanglement. We extensively tested our approach showing its superior performances, both in terms of model perplexity and inserting it into an automatic speech recognition evaluation setting, when compared with state-of-the-art language modelling techniques.

2016

pdf bib abs

Automatic identification of Mild Cognitive Impairment through the analysis of Italian spontaneous speech productions
Daniela Beltrami | Laura Calzà | Gloria Gagliardi | Enrico Ghidoni | Norina Marcello | Rema Rossini Favretti | Fabio Tamburini
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents some preliminary results of the OPLON project. It aimed at identifying early linguistic symptoms of cognitive decline in the elderly. This pilot study was conducted on a corpus composed of spontaneous speech sample collected from 39 subjects, who underwent a neuropsychological screening for visuo-spatial abilities, memory, language, executive functions and attention. A rich set of linguistic features was extracted from the digitalised utterances (at phonetic, suprasegmental, lexical, morphological and syntactic levels) and the statistical significance in pinpointing the pathological process was measured. Our results show remarkable trends for what concerns both the linguistic traits selection and the automatic classifiers building.

pdf bib abs

Specialising Paragraph Vectors for Text Polarity Detection
Fabio Tamburini
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents some experiments for specialising Paragraph Vectors, a new technique for creating text fragment (phrase, sentence, paragraph, text, ...) embedding vectors, for text polarity detection. The first extension regards the injection of polarity information extracted from a polarity lexicon into embeddings and the second extension aimed at inserting word order information into Paragraph Vectors. These two extensions, when training a logistic-regression classifier on the combined embeddings, were able to produce a relevant gain in performance when compared to the standard Paragraph Vector methods proposed by Le and Mikolov (2014).

2012

pdf bib abs

AnIta: a powerful morphological analyser for Italian
Fabio Tamburini | Matias Melandri
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper we present AnIta, a powerful morphological analyser for Italian implemented within the framework of finite-state-automata models. It is provided by a large lexicon containing more than 110,000 lemmas that enable it to cover relevant portions of Italian texts. We describe our design choices for the management of inflectional phenomena as well as some interesting new features to explicitly handle derivational and compositional processes in Italian, namely the wordform segmentation structure and Derivation Graph. Two different evaluation experiments, for testing coverage (Recall) and Precision, are described in detail, comparing the AnIta performances with some other freely available tools to handle Italian morphology. The experiments results show that the AnIta Morphological Analyser obtains the best performances among the tested systems, with Recall = 97.21% and Precision = 98.71%. This tool was a fundamental building block for designing a performant PoS-tagger and Lemmatiser for the Italian language that participated to two EVALITA evaluation campaigns ranking, in both cases, together with the best performing systems.

pdf bib abs

A topologic view of Topic and Focus marking in Italian
Gloria Gagliardi | Edoardo Lombardi Vallauri | Fabio Tamburini
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Regularities in position and level of prosodic prominences associated to patterns of Information Structure are identified for some Italian varieties. The experiments' results suggest a possibly new structural hypothesis on the role and function of the main prominence in marking information patterns. (1) An abstract and merely structural, """"topologic"""" concept of Prominence location can be conceived of, as endowed with the function of demarcation between units, before their culmination and """"description"""". This may suffice to explain much of the process by which speakers interpret the IS of utterances in discourse. Further features, such as the specific intonational contours of the different IS units, may thus represent a certain amount of redundancy. (2) Real utterances do not always signal the distribution of Topic and Focus clearly. Acoustically, many remain underspecified in this respect. This is especially true for the distinction between Topic-Focus and Broad Focus, which indeed often has no serious effects on the progression of communicative dynamism in the subsequent discourse. (3) The consistency of such results with the law of least effort, and the very high percent of matching between perceptual evaluations and automatic measurement, seem to validate the used algorithm.

2008

pdf bib abs

EVALITA 2007, the first edition of the initiative devoted to the evaluation of Natural Language Processing tools for Italian, provided a shared framework where participants systems had the possibility to be evaluated on five different tasks, namely Part of Speech Tagging (organised by the University of Bologna), Parsing (organised by the University of Torino), Word Sense Disambiguation (organised by CNR-ILC, Pisa), Temporal Expression Recognition and Normalization (organised by CELCT, Trento), and Named Entity Recognition (organised by FBK, Trento). We believe that the diffusion of shared tasks and shared evaluation practices is a crucial step towards the development of resources and tools for Natural Language Processing. Experiences of this kind, in fact, are a valuable contribution to the validation of existing models and data, allowing for consistent comparisons among approaches and among representation schemes. The good response obtained by EVALITA, both in the number of participants and in the quality of results, showed that pursuing such goals is feasible not only for English, but also for other languages.

2006

pdf bib abs

The DiaCORIS project: a diachronic corpus of written Italian
C. Onelli | D. Proietti | C. Seidenari | F. Tamburini
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The DiaCORIS project aims at the construction of a diachronic corpus comprising written Italian texts produced between 1861 and 1945, extending the structure and the research possibilities of the synchronic 100-million word corpus CORIS/CODIS. A preliminary in depth study has been performed in order to design a representative and well balanced sample of the Italian language over a time period that contains all the main events of contemporary Italian history from the National Unification to the end of the Second World War. The paper describes in detail such design processes as the definition of the main subcorpora and their proportions, the type of documents inserted in each part of the corpus, the document annotation schema and the technological infrastructure designed to manage the corpus access as well as the web interface to corpus data.

pdf bib abs

POS tagset design for Italian
Raffaella Bernardi | Andrea Bolognesi | Corrado Seidenari | Fabio Tamburini
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We aim to automatically induce a PoS tagset for Italian by analysing the distributional behaviour of Italian words. To this end, we propose an algorithm that (a) extracts information from loosely labelled dependency structures that encode only basic and broadly accepted syntactic relations, namely Head/Dependent and the distinction of dependents into Argument vs. Adjunct, and (b) derives a possible set of word classes. The paper reports on some preliminary experiments carried out using the induced tagset in conjunction with state-of-the-art PoS taggers. The method proposed to design a proper tagset exploits little, if any, language-specific knowledge: hence it is in principle applicable to any language.