Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
Felice Dell'Orletta
|
Alessandro Lenci
|
Simonetta Montemagni
|
Rachele Sprugnoli
Preface to the CLiC-it 2024 Proceedings
Felice Dell’Orletta
|
Alessandro Lenci
|
Simonetta Montemagni
|
Rachele Sprugnoli
Lifeless Winter without Break: Ovid’s Exile Works and the LiLa Knowledge Base
Aurora Alagni
|
Francesco Mambrini
|
Marco Passarotti
In this paper we describe the process of semi-automatic annotation and linking performed to connect two works by the Latin poet Ovid to the LiLa Knowledge Base of interoperable linguistic resources. Written after Ovid’s exile from Rome, the Tristia and the Epistulae ex Ponto mark the beginning of the “literature of exile”. In spite of their importance, no lemmatized version existed and the two collections were not part of the major annotated corpora linked to LiLa. The paper discusses the workflow used to annotate and publish the works as Linked Open Data connected to the LiLa Knowledge Base. On account of their subject and the emotional tone attached to the theme of exile, the two works are particularly relevant for sentiment analysis. We discuss some results of a lexicon-based analysis that is enabled by the interlinking with LiLa. We use LatinAffectus, a manually-generated polarity lexicon for Latin nouns and adjectives, to perform Sentiment Analysis on the aforementioned works and interpret the (replicable) results by consulting and simultaneously enriching the available literary scholarship with new information.
Exploring the Use of Cohesive Devices in Dementia within an Elderly Italian Semi-spontaneous Speech Corpus
Giorgia Albertin
|
Elena Martinelli
The study of language disruption in dementia, aimed at identifying which features correlate with the progression of cognitive impairment, is a growing area of computational linguistics research. Still, further work is needed on discourse phenomena that also undergo deterioration and that can help expand our understanding of dementia-related speech and refine automatic tools. This paper explores the discourse property of cohesion by investigating three types of cohesive devices: reference, lexical iteration, and connectives. Ten features related to these categories were defined and automatically extracted from an Italian corpus of semi-spontaneous speech collected from dementia patients and healthy controls. Some of the designed features have proven significant for the binary classification of the two groups, and further quantitative analyses highlight interesting differences in the use of cohesive devices that seem to be associated with cognitive decline.
SimilEx: The First Italian Dataset for Sentence Similarity with Natural Language Explanations
Chiara Alzetta
|
Felice Dell’Orletta
|
Chiara Fazzone
|
Giulia Venturi
Large language models (LLMs) demonstrate great performance in natural language processing and understanding tasks. However, much work remains to enhance their interpretability. Annotated datasets with explanations could be key to addressing this issue, as they enable the development of models that provide human-like explanations for their decisions. In this paper, we introduce the SimilEx dataset, the first Italian dataset reporting human evaluations of similarity between pairs of sentences. For a subset of these pairs, the annotators also provided explanations in natural language for the scores assigned. The SimilEx dataset is valuable for exploring the variability in similarity perception between sentences and for training LLMs to offer human-like explanations for their predictions.
Data Augmentation for Low-Resource Italian NLP: Enhancing Semantic Processing with DRS
Muhammad Saad Amin
|
Luca Anselma
|
Alessandro Mazzei
Discourse Representation Structure (DRS), a formal meaning representation, has shown promising results in semantic parsing and natural language generation tasks for high-resource languages like English. This paper investigates enhancing the application of DRS to low-resource Italian Natural Language Processing (NLP), in both semantic parsing (Text-to-DRS) and natural language generation (DRS-to-Text). To address the scarcity of annotated corpora for Italian DRS, we propose a novel data augmentation technique that involves the use of external linguistic resources including: (i) WordNet for common nouns, adjectives, adverbs, and verbs; (ii) LLM-generated named entities for proper nouns; and (iii) rule-based algorithms for tense augmentation. This approach not only increases the quantity of training data but also introduces linguistic diversity, which is crucial for improving model performance and robustness. Using this augmented dataset, we developed neural semantic parser and generator models that demonstrated enhanced generalization ability compared to models trained on non-augmented data. We evaluated the effect of semantic data augmentation using two state-of-the-art transformer-based neural sequence-to-sequence models, i.e., byT5 and IT5. Our implementation shows promising results for Italian semantic processing. Data augmentation significantly increased the performance of semantic parsing from 76.10 to 90.56 (+14.46%) F1-SMATCH score and generation with 37.79 to 57.48 (+19.69%) BLEU, 30.83 to 40.95 (+10.12%) METEOR, 81.66 to 90.97 (+9.31%) COMET, 54.84 to 70.88 (+16.04%) chrF, and 88.86 to 92.97 (+4.11%) BERT scores. These results demonstrate the effectiveness of our novel augmentation approach in enhancing semantic processing capabilities for low-resource languages like Italian.
ItaEval and TweetyIta: A New Extensive Benchmark and Efficiency-First Language Model for Italian
Giuseppe Attanasio
|
Pieter Delobelle
|
Moreno La Quatra
|
Andrea Santilli
|
Beatrice Savoldi
Current development and benchmarking efforts for modern, large-scale Italian language models (LMs) are scattered. This paper situates such efforts by introducing two new resources: ItaEval, a comprehensive evaluation suite, and TweetyIta, an efficiency-first language model for Italian. Through ItaEval, we standardize evaluation across language understanding, commonsense and factual knowledge, and social bias-related tasks. In our attempt at language modeling, we experiment with efficient, tokenization-based adaptation techniques. Our TweetyIta shows encouraging results after training on as little as 5G tokens from natural Italian corpora. We benchmark an extensive list of models against ItaEval and find several interesting insights. Surprisingly, i) models trained predominantly on English data dominate the leaderboard; ii) TweetyIta is competitive against other forms of adaptation or inherently monolingual models; iii) natural language understanding tasks are challenging for current models. We release code and data at https://github.com/RiTA-nlp/ita-eval and host a live leaderboard at https://huggingface.co/spaces/RiTA-nlp/ita-eval.
LLaMAntino against Cyber Intimate Partner Violence
Pierpaolo Basile
|
Marco Degemmis
|
Marco Polignano
|
Giovanni Semeraro
|
Lucia Siciliani
|
Vincenzo Tamburrano
|
Fabiana Battista
|
Rosa Scardigno
Intimate Partner Violence refers to abusive behaviours perpetrated against one’s own partner. Unfortunately, this is a social issue that has witnessed an increase over time, particularly after Covid-19. It can be circumscribed into two broad categories: Intimate Partner Violence (IPV) and Cyber Intimate Partner Violence (C-IPV). Social media and technologies can exacerbate these types of behaviours, but some “digital footprints”, such as textual conversations, can be exploited by Artificial Intelligence models to detect and, in turn, prevent them. With this aim in mind, this paper describes a scenario in which the Italian language model family LLaMAntino can be exploited to explain the presence of toxic elements in conversations related to teenage relationships and then educate the interlocutor to recognize these elements in the messages they receive.
Taking Decisions in a Hybrid Conversational AI Architecture Using Influence Diagrams
Roberto Basile Giannini
|
Antonio Origlia
|
Maria Di Maro
This paper explores the application of the Influence Diagrams model for decision-making in the context of conversational agents. The system consists of a Conversational Recommender System (CoRS), in which the decision-making module is separate from the language generation module. It provides the capability to evolve a belief based on user responses, which in turn influences the decisions made by the conversational agent. The proposed system is based on a pre-existing CoRS that relies on Bayesian Networks informing a separate decision process. The introduction of Influence Diagrams aims to integrate both Bayesian inference and the dialogue move selection phase into a single model, thereby generalising the decision-making process. To test the effectiveness and plausibility of the dialogues generated by the developed CoRS, a dialogue simulator was created and the simulated interactions were evaluated by a pool of human judges.
KEVLAR: The Complete Resource for EuroVoc Classification of Legal Documents
Lorenzo Bocchi
|
Camilla Casula
|
Alessio Palmero Aprosio
The use of Machine Learning and Artificial Intelligence in the Public Administration (PA) has increased in recent years. In particular, recent guidelines proposed by various governments for the classification of documents released by the PA suggest using the EuroVoc thesaurus. In this paper, we present KEVLAR, an all-in-one solution for performing this task on acts belonging to the Public Administration. First, we create a collection of 8 million documents in 24 languages, tagged with EuroVoc labels and taken from EUR-Lex, the web portal of European Union legislation. Then, we train different pre-trained BERT-based models, comparing the performance of base models with domain-specific and multilingual ones. We release the corpus, the best-performing models, and a Docker image containing the source code of the trainer, the REST API, and the web interface. This image can be employed out-of-the-box for document classification.
Title Is (Not) All You Need for EuroVoc Multi-Label Classification of European Laws
Lorenzo Bocchi
|
Alessio Palmero Aprosio
Machine Learning and Artificial Intelligence approaches within Public Administration (PA) have grown significantly in recent years. Specifically, new guidelines from various governments recommend employing the EuroVoc thesaurus for the classification of documents issued by the PA. In this paper, we explore some methods to perform document classification in the legal domain, in order to mitigate the length limitation for input texts in BERT models. We first collect data from the European Union, already tagged with the aforementioned taxonomy. Then we reorder the sentences included in the text, with the aim of bringing the most informative part of the document to the first part of the text. Results show that the title and the context are both important, although the order of the text may not be. Finally, we release on GitHub both the dataset and the source code used for the experiments.
Exploring the Dissociated Nucleus Phenomenon in Semantic Role Labeling
Tommaso Bonomo
|
Simone Conia
|
Roberto Navigli
Dependency-based Semantic Role Labeling (SRL) is bound to dependency parsing, as the arguments of a predicate are identified through the token that heads the dependency relation subtree of the argument. However, most dependency-based SRL corpora are susceptible to the dissociated nucleus problem: when a subclause’s semantic and structural cores are two separate words, the dependency tree chooses the structural token as the head of the subtree, coercing the SRL annotation into making the same choice. This leads to undesirable consequences: when directly using the output of a dependency-based SRL method in downstream tasks it is useful to work with the token representing the semantic core of a subclause, not the structural core. In this paper, we carry out a linguistically-driven investigation on the dissociated nucleus problem in dependency-based SRL and propose a novel algorithm that aligns predicate-argument structures to the syntactic structures from Universal Dependencies to select the semantic core of an argument. Our analysis shows that dissociated nuclei appear more often than one could expect, and that our novel algorithm greatly increases the richness of the semantic information in dependency-based SRL. We release the software to reproduce our experiments at http://omitted.link.
Data Augmentation through Back-Translation for Stereotypes and Irony Detection
Tom Bourgeade
|
Silvia Casola
|
Adel Mahmoud Wizan
|
Cristina Bosco
Complex linguistic phenomena such as stereotypes or irony are still challenging to detect, particularly due to the lower availability of annotated data. In this paper, we explore Back-Translation (BT) as a data augmentation method to enhance such datasets by artificially introducing semantics-preserving variations. We investigate French and Italian as source languages on two multilingual datasets annotated for the presence of stereotypes or irony and evaluate French/Italian, English, and Arabic as pivot languages for the BT process. We also investigate cross-translation, i.e., augmenting one language subset of a multilingual dataset with translated instances from the other languages. We conduct an intrinsic evaluation of the quality of back-translated instances, identifying linguistic or translation model-specific errors that may occur with BT. We also perform an extrinsic evaluation of different data augmentation configurations to train a multilingual Transformer-based classifier for stereotype or irony detection on monolingual data.
Community-based Stance Detection
Emanuele Brugnoli
|
Donald Ruggiero Lo Sardo
Stance detection is a critical task in understanding the alignment or opposition of statements within social discourse. In this study, we present a novel stance detection model that labels claim-perspective pairs as either aligned or opposed. The primary innovation of our work lies in our training technique, which leverages social network data from X (formerly Twitter). Our dataset comprises tweets from opinion leaders, political entities and news outlets, along with their followers’ interactions through retweets and quotes. By reconstructing politically aligned communities based on retweet interactions, treated as endorsements, we check these communities against common knowledge representations of the political landscape. Our training dataset consists of tweet/quote pairs where the tweet comes from a political entity and the quote either originates from a follower who exclusively retweets that political entity (treated as aligned) or from a user who exclusively retweets a political entity from an opposing ideological community (treated as opposed). This curated subset is used to train an Italian language model based on the RoBERTa architecture, achieving an accuracy of approximately 85%. We then apply our model to label all tweet/quote pairs in the dataset, analyzing its out-of-sample predictions. This work not only demonstrates the efficacy of our stance detection model but also highlights the utility of social network structures in training robust NLP models. Our approach offers a scalable and accurate method for understanding political discourse and the alignment of social media statements.
Towards a Hate Speech Index with Attention-based LSTMs and XLM-RoBERTa
Mauro Bruno
|
Elena Catanese
|
Francesco Ortame
The uncontrolled diffusion of hate speech on social media requires robust detection mechanisms to measure its harmful impact. Analyzing texts from X (formerly Twitter) is challenging due to slang, neologisms, and sarcasm, which require advanced and intelligent detection approaches. While sophisticated models like large language models (LLMs) demonstrate impressive accuracy, their prohibitive inference times make it impractical to process millions of tweets. Therefore, we propose a mixed approach using a bidirectional long short-term memory model with an added attention mechanism (AT-BiLSTM) for improved natural language understanding. We benchmark this model against a standard BiLSTM model and a fine-tuned multilingual robustly optimized BERT (RoBERTa). The task of hate speech detection has been extensively explored in the EVALITA campaigns, which have achieved impressive results. Building on this foundation, we aim to develop a robust classifier to predict the content of approximately 20 million tweets related to immigration. The performance of our models is comparable to the top entries from the EVALITA campaigns, and we show the effects of training different networks on the dynamics of the Hate Speech Index (HSI). We also utilize a custom labeled dataset for benchmarking and training.
Written Goodbyes: How Genre and Sociolinguistic Factors Influence the Content and Style of Suicide Notes
Lucia Busso
|
Claudia Roberta Combei
The study analyses a novel corpus of 76 freely available English authentic suicide notes (SNs) (letters and social media posts), spanning from 1902 to 2023. By using computational and corpus linguistics, this research aims at decoding patterns of discourse, content, and emotions in SNs. In particular, we explore variation in linguistic features in SNs across sociolinguistic factors (age, gender, addressee, time period) and between genres (letter vs. post). To this end, we use topic models, subjectivity analysis, and sentiment and emotion analysis. Results highlight how style, content, and emotion expression show differences depending on genre, gender, age group and time period. We suggest a more nuanced approach to personalized prevention and intervention strategies based on insights from computer-assisted linguistic analysis.
Argument Mining in BioMedicine: Zero-Shot, In-Context Learning and Fine-tuning with LLMs
Jérémie Cabessa
|
Hugo Hernault
|
Umer Mushtaq
Argument Mining (AM) aims to extract the complex argumentative structure of a text, and Argument Type Classification (ATC) is an essential sub-task of AM. Large Language Models (LLMs) have shown impressive capabilities in most NLP tasks and beyond. However, fine-tuning LLMs can be challenging. In-Context Learning (ICL) has been suggested as a bridging paradigm between training-free and fine-tuning settings for LLMs. In ICL, an LLM is conditioned to solve tasks using a few solved demonstration examples included in its prompt. We focus on AM in the biomedical AbstRCT dataset. We address ATC using quantized and unquantized LLaMA-3 models through zero-shot learning, in-context learning, and fine-tuning approaches. We introduce a novel ICL strategy that combines $k$NN-based example selection with majority vote ensembling, along with a well-designed fine-tuning strategy for ATC. In the zero-shot setting, we show that LLaMA-3 fails to achieve acceptable classification results, suggesting the need for additional training modalities. However, in our training-free ICL setting, LLaMA-3 can leverage relevant information from only a few demonstration examples to achieve very competitive results. Finally, in our fine-tuning setting, LLaMA-3 achieves state-of-the-art performance on the ATC task on the AbstRCT dataset.
Multisource Approaches to Italian Sign Language (LIS) Recognition: Insights from the MultiMedaLIS Dataset
Gaia Caligiore
|
Raffaele Mineo
|
Concetto Spampinato
|
Egidio Ragonese
|
Simone Palazzo
|
Sabina Fontana
Given the status of sign languages as unwritten visual-gestural languages, research on their automatic recognition has increasingly implemented multisource capturing tools for data collection and processing. This paper explores advancements in Italian Sign Language (LIS) recognition using a multimodal dataset in the medical domain: the MultiMedaLIS Dataset. We investigate the integration of RGB frames, depth data, optical flow, and skeletal information to develop and evaluate two computational models: Skeleton-Based Graph Convolutional Network (SL-GCN) and Spatiotemporal Separable Convolutional Network (SSTCN). RADAR data was collected but not included in the testing phase. Our experiments validate the effectiveness of these models in enhancing the accuracy and robustness of isolated LIS sign recognition. Our findings highlight the potential of multisource approaches in computational linguistics to improve linguistic accessibility and inclusivity for members of the signing community.
Combining Universal Dependencies and FrameNet to Identify Constructions in a Poetic Corpus: Syntax and Semantics of Latin Felix and Infelix in Virgilian Poetics
Giulia Calvi
|
Riccardo Ginevra
|
Federica Iurescia
The paper is a pilot study which argues for a constructionist and computer-based approach to the syntactic and semantic analysis of a poetic corpus in Latin. We focus on the term felix and its opposite infelix, and perform manual annotation of their occurrences in Virgil’s poems using Universal Dependencies for the syntactic analysis and FrameNet for the semantic one. Integrating the approaches of Dependency Syntax and Construction Grammar, we analyze the linguistic contexts in which the two terms occur and identify the different “constructions” (pairings of form and function) that they instantiate. Our methodology is language-independent and has the potential to aid scholars in the comparative analysis of poetic texts, allowing for the detection of hidden parallels in the style and poetics of different texts and authors.
Lost in Disambiguation: How Instruction-Tuned LLMs Master Lexical Ambiguity
Luca Capone
|
Serena Auriemma
|
Martina Miliani
|
Alessandro Bondielli
|
Alessandro Lenci
This paper investigates how decoder-only instruction-tuned LLMs handle lexical ambiguity. Two distinct methodologies are employed: eliciting rating scores from the model via prompting, and analysing the cosine similarity between pairs of polysemous words in context. Ratings and embeddings are obtained by providing pairs of sentences from Haber and Poesio (2021) to the model. These ratings and cosine similarity scores are compared with each other and with the human similarity judgments in the dataset. Surprisingly, the model scores show only a moderate correlation with the subjects’ similarity judgments and no correlation with the target word embedding similarities. A vector space anisotropy inspection has also been performed, as a potential source of the experimental results. The analysis reveals that the embedding spaces of two out of the three analyzed models exhibit poor anisotropy, while the third model shows relatively moderate anisotropy compared to previous findings for models with similar architecture (Ethayarajh 2019). These findings offer new insights into the relationship between generation quality and vector representations in decoder-only LLMs.
BaBIEs: A Benchmark for the Linguistic Evaluation of Italian Baby Language Models
Luca Capone
|
Alice Suozzi
|
Gianluca Lebani
|
Alessandro Lenci
The possibility of comparing the linguistic competence of Language Models (LMs) to that of children has gained growing attention lately, raising the need for effective tools for evaluating both the former and the latter. To this purpose, we developed a resource for the linguistic evaluation of BabyLMs, which are LMs trained on datasets comparable to the linguistic stimuli received by children. This resource adapts four standardized tests for the evaluation of the linguistic skills of Italian-speaking children (BVL, TROG-2, TCGB-2 and Peabody). To verify the effectiveness of our benchmark, we administered it to Minerva, an LLM pretrained from scratch on Italian. Our results indicate that Minerva struggles to master certain linguistic aspects, achieving an age-equivalent score of 4 years, and that the type of task administered affects the model’s performance.
Beyond Headlines: A Corpus of Femicides News Coverage in Italian Newspapers
Eleonora Cappuccio
|
Benedetta Muscato
|
Laura Pollacci
|
Marta Marchiori Manerba
|
Clara Punzi
|
Chandana Mala
|
Margherita Lalli
|
Gizem Gezici
|
Michela Natilli
|
Fosca Giannotti
How newspapers cover news significantly impacts how facts are understood, perceived, and processed by the public. This is especially crucial when serious crimes are reported, e.g., in the case of femicides, where the description of the perpetrator and the victim builds a strong, often polarized opinion of this severe societal issue. This paper presents FMNews, a new dataset of articles reporting femicides extracted from Italian newspapers. Our core contribution aims to promote the development of a deeper framing and awareness of the phenomenon through an original resource available and accessible to the research community, facilitating further analyses on the topic. The paper also provides a preliminary study of the resulting collection through several example use cases and scenarios.
Women’s Professions and Targeted Misogyny Online
Alessio Cascione
|
Aldo Cerulli
|
Marta Marchiori Manerba
|
Lucia Passaro
With the increasing popularity of social media platforms, the dissemination of misogynistic content has become more prevalent and challenging to address. In this paper, we investigate the phenomenon of online misogyny on Twitter through the lens of hurtfulness, qualifying its different manifestations by considering the professions of the targets of misogynistic attacks. By leveraging manual annotation and a BERTweet model trained for fine-grained misogyny identification, we find that specific types of misogynistic speech are more intensely directed towards particular professions: derailing discourse predominantly targets authors and cultural figures, while dominance-oriented speech and sexual harassment are mainly directed at politicians and athletes. Additionally, we use the HurtLex lexicon and ItEM to assign hurtfulness scores to tweets based on different hate speech categories. Our analysis reveals that these scores align with the profession-based distribution of misogynistic speech, highlighting the targeted nature of such attacks.
DWUGs-IT: Extending and Standardizing Lexical Semantic Change Detection for Italian
Pierluigi Cassotti
|
Pierpaolo Basile
|
Nina Tahmasebi
Lexical Semantic Change Detection (LSCD) is the task of determining whether a word has undergone a change in meaning over time. There has been a marked increase in interest in this task, accompanied by a corresponding growth in the scientific community involved in developing computational approaches to semantic change. In recent years, resources have been made available for the evaluation of LSCD models in several languages, including English, Swedish, German, Latin, Russian and Chinese. DIACR-ITA is the only existing resource for LSCD in Italian. However, DIACR-ITA has a different format from that used for other languages. In this paper we present DWUGs-IT, which extends the DIACR-ITA dataset with additional target words and usage-sense pair annotations and adapts it to the DURel format, including the first implementation of a graded LSCD task for Italian.
History Repeats: Historical Phase Recognition from Short Texts
Fabio Celli
|
Valerio Basile
This paper introduces a new multi-class classification task: the prediction of the Structural-Demographic phase of historical cycles - such as growth, impoverishment and crisis - from text describing historical events. To achieve this, we leveraged data from the Seshat project, annotated it following specific guidelines and then evaluated the consistency between three annotators. The classification experiments, with transformers and Large Language Models, show that 2 of 5 phases can be detected with good accuracy. We believe that this task could have a great impact on comparative history and can be helped by event extraction in NLP.
Emojilingo: Harnessing AI to Translate Words into Emojis
Francesca Chiusaroli
|
Federico Sangati
|
Johanna Monti
|
Maria Laura Pierucci
|
Tiberio Uricchio
This paper presents an AI experiment in emoji translation conducted on a glossary from Dante Alighieri’s Comedy. The experiment is part of a project aiming to build up an automated emoji-based pivot language providing an interlingua as a tool for linguistic simplification, accessibility, and international communication: Emojilingo. The present test involves human (Emojitaliano) and machine (ChatGPT) translations in a comparative analysis to devise an automated integrated model highlighting emojis’ expressive ability in transferring senses, clarifying semantic obscurities and ambiguities, and simplifying language. A first preliminary evaluation highlights ChatGPT’s ability to deal with a classic archaic literary vocabulary, also raising issues about the criteria for better grasping meanings and forms and about the multicultural extent of content transfer.
Towards an ASR System for Documenting Endangered Languages: A Preliminary Study on Sardinian
Ilaria Chizzoni
|
Alessandro Vietti
Speech recognition systems are still highly dependent on textual orthographic resources, posing a challenge for low-resource languages. Recent research leverages self-supervised learning on unlabeled data or employs multilingual models pre-trained on high-resource languages for fine-tuning on the target low-resource language. These are effective approaches when the target language has a shared writing tradition, but when we are confronted with mainly spoken languages, be they endangered minority languages, dialects, or regional varieties, we lack not only labeled data but also a shared metric to assess speech recognition performance. We first provide a research background on ASR for low-resource languages and describe the specific linguistic situation of Campidanese Sardinian; we then evaluate five multilingual ASR models using traditional evaluation metrics and an exploratory linguistic analysis. The paper addresses key challenges in developing a tool for researchers to document and analyze the phonetics and phonology of spoken (endangered) languages.
Controllable Text Generation to Evaluate Linguistic Abilities of Italian LLMs
Cristiano Ciaccio
|
Felice Dell’Orletta
|
Alessio Miaschi
|
Giulia Venturi
State-of-the-art Large Language Models (LLMs) demonstrate exceptional proficiency across diverse tasks, yet systematic evaluations of their linguistic abilities remain limited. This paper addresses this gap by proposing a new evaluation framework leveraging the potentialities of Controllable Text Generation. Our approach evaluates the models’ capacity to generate sentences that adhere to specific linguistic constraints and their ability to recognize the linguistic properties of their own generated sentences, also in terms of consistency with the specified constraints. We tested our approach on six Italian LLMs using various linguistic constraints.
A Modal Sense Classifier for the French Modal Verb Pouvoir
Anna Colli
|
Diego Rossini
|
Delphine Battistelli
In this paper we address the problem of modal sense classification for the French modal verb pouvoir in a transcribed spoken corpus. To the best of our knowledge, no studies have focused on this task in French. We fine-tuned various BERT-based models for French in order to determine which one performed best. It was found that the Flaubert-base-cased model was the most effective (F1-score of 0.94) and that the most frequent categories in our corpus were material possibility and ability, which are both part of the more global alethic category.
pdf
bib
abs
Topic Similarity of Heterogeneous Legal Sources Supporting the Legislative Process
Michele Corazza
|
Leonardo Zilli
|
Monica Palmirani
The legislative process starts with a deep analysis of the existing regulations at the European and national levels, to avoid conflicts and to foster coherence with the norms in force. Constitutional Court decisions also play a fundamental role in this analysis, both for checking compliance with the constitutional framework and for including the input coming from this relevant court in the law-making process. Finally, it is also significant to compare the forthcoming proposal with already presented bills on the same topic. This comparison is crucial to avoid overlap and to coordinate the democratic dialogue among the different parties. In this light, this paper presents an unsupervised approach for calculating similarity between heterogeneous documents annotated in Akoma Ntoso XML, with the aim of supporting the retrieval of similar documents using the thematic taxonomies of the legal domain. The prototype has been developed in answer to a call for manifestation of interest launched by the Chamber of Deputies of Italy in order to adopt hybrid AI in the legislative process. It uses a completely unsupervised approach based on Sentence Transformers, meaning that neither annotated data nor any fine-tuning process is required.
pdf
bib
abs
Join Together? Combining Data to Parse Italian Texts
Claudia Corbetta
|
Giovanni Moretti
|
Marco Passarotti
In this paper, we create and evaluate non-combined and combined models using Old and Contemporary Italian data to determine whether increasing the size of the training data with a combined model could improve parsing accuracy to facilitate manual annotation. We find that, despite the increased size of the training data, in-domain parsing performs better. Additionally, we discover that models trained on Old Italian data perform better on Contemporary Italian data than the reverse. We attempt to explain this result in terms of syntactic complexity, finding that Old Italian text exhibits higher sentence length and non-projectivity rate.
pdf
bib
abs
Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition
Federico D’Asaro
|
Juan José Márquez Villacís
|
Giuseppe Rizzo
|
Andrea Bottino
Large Speech Models (LSMs), pre-trained on extensive unlabeled data using Self-Supervised Learning (SSL) or Weakly Supervised Learning (WSL), are increasingly employed for tasks like Speech Emotion Recognition (SER). Their capability to extract general-purpose features makes them a strong alternative to low-level descriptors. Most studies focus on English, with limited research on other languages. We evaluate English-Only and Multilingual LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for SER in eight languages. We have stacked three alternative downstream classifiers of increasing complexity, named Linear, Non-Linear, and Multi-Layer, on top of the LSMs. Results indicate that Whisper models perform best with a simple linear classifier using features from the last transformer layer, while Wav2Vec 2.0 models benefit from features from the middle and early transformer layers. When comparing English-Only and Multilingual LSMs, we find that Whisper models benefit from multilingual pre-training, excelling in Italian, Canadian French, French, Spanish, and German, and performing competitively on Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving the highest performance in Greek and Egyptian Arabic.
pdf
bib
abs
Building a Pragmatically Annotated Diachronic Corpus: The DIADIta Project
Irene De Felice
|
Francesca Strik Lievers
We present here the initial stages of the construction of the DIADIta corpus, a diachronic corpus of Italian annotated for interactional pragmatic phenomena. First, we describe the annotation scheme, which is structured into four levels: speech acts (e.g., apology; threat), forms (e.g., discourse marker; expressive), pragmatic functions (which are speaker-oriented, e.g., mitigation; turn-taking), and pragmatic aims (which are interlocutor-oriented, e.g., attention-getting; request for agreement). Next, we discuss how the results of a first annotation exercise provide indications for refining the annotation procedure.
pdf
bib
abs
Building CorefLat. a Linguistic Resource for Coreference and Anaphora Resolution in Latin
Eleonora Delfino
|
Roberta Leotta
|
Marco Passarotti
|
Giovanni Moretti
This paper presents the initial stages of a project focused on coreference and anaphora resolution in Latin texts. By building a corpus enhanced with coreference/anaphora annotation, the project aims to explore empirically a layer of metalinguistic analysis that has not yet been extensively investigated in linguistic resources and natural language processing for Latin. After reviewing the related work, the paper discusses annotation criteria and data analysis, providing examples of a few issues that emerged during the annotation process.
pdf
bib
abs
Is Explanation All You Need? An Expert Survey on LLM-generated Explanations for Abusive Language Detection
Chiara Di Bonaventura
|
Lucia Siciliani
|
Pierpaolo Basile
|
Albert Merono Penuela
|
Barbara McGillivray
Explainable abusive language detection has proven to help both users and content moderators, and recent research has focused on prompting LLMs to generate explanations for why a specific text is hateful. Yet, understanding the alignment of these generated explanations with human expectations and judgements is far from being solved. In this paper, we design a before-and-after study recruiting AI experts to evaluate the usefulness and trustworthiness of LLM-generated explanations for abusive language detection tasks, investigating multiple LLMs and learning strategies. Our experiments show that expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met, as their ratings decrease by 47.78% and 64.32%, respectively, after treatment. Further, our results suggest caution in using LLMs for explanation generation of abusive language detection due to (i) their cultural bias, and (ii) difficulty in reliably evaluating them with empirical metrics. In light of our results, we provide three recommendations to use LLMs responsibly for explainable abusive language detection.
pdf
bib
abs
Scalable Query Understanding for E-commerce: An Ensemble Architecture with Graph-based Optimization
Giuseppe Di Fabbrizio
|
Evgeny Stepanov
|
Ludovico Frizziero
|
Filippo Tessaro
Query understanding is a critical component of e-commerce platforms, enabling accurate interpretation of users’ intents and efficient retrieval of relevant products. This paper presents a study on scalable query understanding techniques applied to a real use case in the e-commerce grocery domain. We propose a novel architecture that combines deep learning models with traditional ML models to capture query nuances and provide robust performance. Our model ensemble approach aims to capture the nuances of user queries and provide robust performance across various query types and categories. We conduct experiments on real-life datasets and demonstrate the effectiveness of our proposed solution in terms of accuracy and scalability. An optimized graph-based architecture using Ray enables efficient processing of high-volume traffic. The experimental results highlight the benefits of combining diverse models.
pdf
bib
abs
ELIta: A New Italian Language Resource for Emotion Analysis
Eliana Di Palma
Emotions and language are strongly associated. In recent years, many resources have been created to investigate this association and automatically detect emotions from texts. Presenting ELIta (Emotion Lexicon for Italian), this study provides a new language resource for the analysis and detection of emotions in Italian texts. It describes the process of lexicon creation, including lexicon selection and annotation methodologies, and compares the collected data with existing resources. By offering a non-aggregated lexicon, ELIta fills a crucial gap and is applicable to various research and practical applications. Furthermore, the work utilises the lexicon by analysing the relationships between emotions and gender.
pdf
bib
abs
Comparing Large Language Models Verbal Creativity to Human Verbal Creativity
Anca Dinu
|
Andra Florescu
This study investigates verbal creativity differences and similarities between Large Language Models and humans, based on their answers to the integrated verbal creativity test in [1]. Since this article reported a very small difference of scores in favour of the machines, the aim of the present work is to thoroughly analyse the data through four methods: scoring the uniqueness of the answers of one human or one machine compared to all the others, semantic similarity clustering, binary classification, and manual inspection of the data. The results showed that humans and machines are on a par in terms of uniqueness scores, that humans and machines group into two well-defined clusters based on semantic similarities, and that the answers are not so easy to automatically classify into human answers and LLM answers.
pdf
bib
abs
ItGraSyll: A Computational Analysis of Graphical Syllabification and Stress Assignment in Italian
Liviu Dinu
|
Ioan-Bogdan Iordache
|
Simona Georgescu
|
Alina Maria Cristea
|
Bianca Guita
In this paper we build a dataset of Italian syllables. We perform quantitative and qualitative analyses on the syllabification and stress assignment in Italian. We propose a machine learning model, based on deep-learning techniques, for automatically inferring syllabification and stress assignment. For stress prediction we report 94.45% word-level accuracy, and for syllabification we report 98.41% word-level accuracy and 99.82% hyphen-level accuracy.
pdf
bib
abs
Generation and Evaluation of English Grammar Multiple-Choice Cloze Exercises
Nicolò Donati
|
Matteo Periani
|
Paolo Di Natale
|
Giuseppe Savino
|
Paolo Torroni
English grammar Multiple-Choice Cloze (MCC) exercises are crucial for improving learners’ grammatical proficiency and comprehension skills. However, creating these exercises is labour-intensive and requires expert knowledge. Effective MCC exercises must be contextually relevant and engaging, incorporating distractors—plausible but incorrect alternatives—to balance difficulty and maintain learner motivation. Despite the increasing interest in utilizing large language models (LLMs) in education, their application in generating English grammar MCC exercises is still limited. Previous methods typically impose constraints on LLMs, producing grammatically correct yet uncreative results. This paper explores the potential of LLMs to independently generate diverse and contextually relevant MCC exercises without predefined limitations. We hypothesize that LLMs can craft self-contained sentences that foster learners’ communicative competence. Our analysis of existing MCC exercise datasets revealed issues of diversity, completeness, and correctness. Furthermore, we address the lack of a standardized automatic metric for evaluating the quality of generated exercises. Our contributions include developing an LLM-based solution for generating MCC exercises, curating a comprehensive dataset spanning 19 grammar topics, and proposing an automatic metric validated against human expert evaluations. This work aims to advance the automatic generation of English grammar MCC exercises, enhancing both their quality and creativity.
pdf
bib
abs
ReCLAIM Project: Exploring Italian Slurs Reappropriation with Large Language Models
Lia Draetta
|
Chiara Ferrando
|
Marco Cuccarini
|
Liam James
|
Viviana Patti
Recently, social networks have become the primary means of communication for many people, leading computational linguistics researchers to focus on the language used on these platforms. As online interactions grow, recognizing and preventing offensive messages targeting various groups has become urgent. However, finding a balance between detecting hate speech and preserving free expression while promoting inclusive language is challenging. Previous studies have highlighted the risks of automated analysis misinterpreting context, which can lead to the censorship of marginalized groups. Our study is the first to explore the reappropriative use of slurs in Italian by leveraging Large Language Models (LLMs) with a zero-shot approach. We revised annotations of an existing Italian homotransphobic dataset, developed new guidelines, and designed various prompts to address the LLM task. Our findings illustrate the difficulty of this challenge and provide preliminary results on using LLMs for such a language-specific task.
pdf
bib
abs
You Write like a GPT
Andrea Esuli
|
Fabrizio Falchi
|
Marco Malvaldi
|
Giovanni Puccetti
We investigate how Raymond Queneau’s Exercises in Style are evaluated by automatic methods for the detection of artificially-generated text. We work with Queneau’s original French version, the Italian translation by Umberto Eco, and the English translation by Barbara Wright. We start by comparing how various methods for the detection of automatically generated text, also based on different large language models, evaluate the different styles in the work. We then link this automatic evaluation to distinct characteristics related to the content and structure of the various styles. This work is an initial attempt at exploring how methods for detecting artificially-generated text can find application as tools to evaluate the qualities and characteristics of human writing, to support better writing in terms of originality, informativeness, and clarity.
pdf
bib
abs
Constructing a Multimodal, Multilingual Translation and Interpreting Corpus: A Modular Pipeline and an Evaluation of ASR for Verbatim Transcription
Alice Fedotova
|
Adriano Ferraresi
|
Maja Miličević Petrović
|
Alberto Barrón-Cedeño
This paper presents a novel pipeline for constructing multimodal and multilingual parallel corpora, with a focus on evaluating state-of-the-art ASR tools for verbatim transcription. Our findings indicate that current technologies can streamline corpus construction, with fine-tuning showing promising results in terms of transcription quality compared to out-of-the-box Whisper models. The lowest overall WER achieved for English was 0.180, using a fine-tuned Whisper-small model. As for Italian, the fine-tuned Whisper-small model obtained a lower WER of 0.201 compared to the baseline Whisper-small’s WER of 0.219. While limitations remain, the updated pipeline is expected to drastically reduce the human efforts involved.
pdf
bib
abs
Exploring YouTube Comments Reacting to Femicide News in Italian
Chiara Ferrando
|
Marco Madeddu
|
Viviana Patti
|
Mirko Lai
|
Sveva Pasini
|
Giulia Telari
|
Beatrice Antola
In recent years, Gender-Based Violence (GBV) has become an important issue in modern society and a central topic in different research areas due to its alarming spread. Several Natural Language Processing (NLP) studies concerning hate speech directed against women have focused on slurs or incel communities. The main contribution of our work is the creation of the first dataset of social media comments reacting to GBV, in particular to a femicide event. Our dataset, named GBV-Maltesi, contains 2,934 YouTube comments annotated following a new schema that we developed in order to study GBV and misogyny with an intersectional approach. During the experimental phase, we trained models on different corpora for binary misogyny detection and found that datasets that mostly include explicit expressions of misogyny are an easier challenge, compared to the more implicit forms of misogyny contained in GBV-Maltesi.
pdf
bib
abs
Automatic Error Detection: Comparing AI vs. Human Performance on L2 Italian Texts
Irene Fioravanti
|
Luciana Forti
|
Stefania Spina
This paper reports on a study aimed at comparing AI vs. human performance in detecting and categorising errors in L2 Italian texts. Four LLMs were considered: ChatGPT, Copilot, Gemini and Llama3. Two groups of human annotators were involved: L1 and L2 speakers of Italian. A gold standard set of annotations was developed. A fine-grained annotation scheme was adopted, to reflect the specific traits of Italian morphosyntax, with related potential learner errors. Overall, we found that human annotation outperforms AI, with some degree of variation with respect to specific error types. An increased attention to languages other than English in NLP may significantly improve AI performance in this pivotal task for the many domains of language-related disciplines.
pdf
bib
abs
Explainability for Speech Models: On the Challenges of Acoustic Feature Selection
Dennis Fucci
|
Beatrice Savoldi
|
Marco Gaido
|
Matteo Negri
|
Mauro Cettolo
|
Luisa Bentivogli
Spurred by the demand for transparency and interpretability in Artificial Intelligence (AI), the field of eXplainable AI (XAI) has experienced significant growth, marked by both theoretical reflections and technical advancements. While various XAI techniques, especially feature attribution methods, have been extensively explored across diverse tasks, their adaptation for the speech modality is lagging behind. We argue that a key factor hindering the diffusion of such methods in speech processing research lies in the complexity of defining interpretable acoustic features. In this paper, we discuss the key challenges in selecting the features for speech explanations. Also in light of existing research, we highlight current gaps and propose future avenues to enhance the depth and informativeness of explanations for speech.
pdf
bib
abs
Recurrent Networks Are (Linguistically) Better? An (Ongoing) Experiment on Small-LM Training on Child-Directed Speech in Italian
Achille Fusco
|
Matilde Barbini
|
Maria Letizia Piccini Bianchessi
|
Veronica Bressan
|
Sofia Neri
|
Sarah Rossi
|
Tommaso Sgrizzi
|
Cristiano Chesi
We discuss the strategies and results of a small-sized training program based on Italian child-directed speech (less than 3M tokens) for various network architectures. The rationale behind these experiments [1] lies in the attempt to understand the effect of this naturalistic training diet on different model architectures. Preliminary findings lead us to conclude that (a) different tokenization strategies produce only numerical, but not statistically significant, improvements overall, although segmentation aligns more or less with linguistic intuitions; and (b) modified LSTM networks with a single layer and a structurally more controlled cell state perform worse in training (compared to standard one- and two-layered LSTM models) but better on linguistically critical contrasts. This suggests that standard loss/accuracy metrics in autoregressive training procedures are linguistically irrelevant and, more generally, misleading, since the best-trained models qualify as poorer “linguistic theories” ([2], pace [3]).
pdf
bib
abs
On Cross-Language Entity Label Projection and Recognition
Paolo Gajo
|
Alberto Barrón-Cedeño
Most work on named entity recognition (NER) focuses solely on English. Through the use of training data augmentation via machine translation (MT), multilingual NER can become a powerful tool for information extraction in multilingual contexts. In this paper, we augment NER data from culinary recipe ingredient lists, by means of MT and word alignment (WA), following two approaches: (i) translating each entity separately, while taking into account the full context of the list and (ii) translating the whole list of ingredients and then aligning entities using three types of WA models: Giza++, Fast Align, and BERT, fine-tuned using a novel entity-shuffling approach. We depart from English data and produce Italian versions via MT, span-annotated with the entities projected from English. Then, we use the data produced by the two approaches to train mono- and multilingual NER BERT models. We test the performance of the WA and NER models on an annotated dataset of ingredient lists, partially out-of-domain compared to the training data. The results show that shuffling entities leads to better BERT aligner models. The higher quality NER data created by these models enables NER models to achieve better results, with multilingual models reaching performances equal to or greater than their monolingual counterparts.
pdf
bib
abs
NYTAC-CC: A Climate Change Subcorpus of New York Times Articles
Francesca Grasso
|
Ronny Patz
|
Manfred Stede
Over the past decade, the analysis of discourses on climate change (CC) has gained increased interest within the social sciences and the NLP community. Textual resources are crucial for understanding how narratives about this phenomenon are crafted and delivered. However, there still is a scarcity of datasets that cover CC in news media in a representative way. This paper presents a CC-specific subcorpus extracted from the 1.8-million-article New York Times Annotated Corpus, marking the first CC analysis on this data. The subcorpus was created by combining different methods for text selection to ensure representativeness and reliability, which is further validated using ClimateBERT. To provide initial insights into the CC subcorpus, we discuss the results of a topic modeling experiment (LDA). These show the diversity of contexts in which CC is discussed in news media over time, which is relevant for various downstream tasks.
pdf
bib
abs
Task-Incremental Learning on Long Text Sequences
Natalia Graziuso
|
Andrea Zugarini
|
Stefano Melacci
The extraordinary results achieved by Large Language Models are paired with issues that are critical in real-world applications. The costs of inference and, in particular, training are extremely large, both in terms of time and computational resources, and they become prohibitive when working in dynamic environments, where data and tasks are progressively provided over time. The model must be able to adapt to new knowledge, new domains, and new settings, without forgetting the previously learned skills. Retraining from scratch easily becomes too costly, thus Continual Learning strategies are of crucial importance. This is even more evident when data consist of “long” documents, which require several resources to be processed by modern neural models, leading to very long prompts. This paper investigates LLM-based Task-Incremental Learning in the case of tasks exploiting long sequences of text, as is typical in summarization, question-answering on long documents, reviewing long contracts, and several others. We show how adapting the model by Task Arithmetic with LoRA, which was proposed for visual data, yields promising results also in the case of such “long” text data. To the best of our knowledge, this is the first work along this challenging direction. The outcome of the investigation of this paper is generic enough to represent an important starting point for further research in processing linguistic data in every language.
pdf
bib
abs
The Vulnerable Identities Recognition Corpus (VIRC) for Hate Speech Analysis
Ibai Guillén-Pacho
|
Arianna Longo
|
Marco Antonio Stranisci
|
Viviana Patti
|
Carlos Badenes-Olmedo
This paper presents the Vulnerable Identities Recognition Corpus (VIRC), a novel resource designed to enhance hate speech analysis in Italian and Spanish news headlines. VIRC comprises 921 headlines, manually annotated for vulnerable identities, dangerous discourse, derogatory expressions, and entities. Our experiments reveal that large language models (LLMs) struggle significantly with the fine-grained identification of these elements, underscoring the complexity of detecting hate speech. VIRC stands out as the first resource of its kind in these languages, offering a richer annotation schema compared to existing corpora. The insights derived from VIRC can inform the development of sophisticated detection tools and the creation of policies and regulations to combat hate speech on social media, promoting a safer online environment. Future work will focus on expanding the corpus and refining annotation guidelines to further enhance its comprehensiveness and reliability.
pdf
bib
abs
The Self-Contained Italian Negation Test (SCIN)
Viola Gullace
|
David Kletz
|
Thierry Poibeau
|
Alessandro Lenci
|
Pascal Amsili
Recent research has focused extensively on state-of-the-art pretrained language models, particularly those based on Transformer architectures, and how well they account for negation and other linguistic phenomena in various tasks. This study aims to evaluate the understanding of negation in Italian BERT- and RoBERTa-based models, contrasting the predominant English-focused prior research. We develop the SCIN Set, an Italian dataset designed to model the influence of polarity constraints on models in a masked-prediction task. Applying the SCIN Set reveals that these models do not adjust their behaviour based on sentence polarity, even when the resulting sentence is contradictory. We conclude that the tested models lack a clear understanding of how negation alters sentence meaning.
pdf
bib
abs
La Non Canonica L’hai Studiata? Exploring LLMs and Sentence Canonicity in Italian
Claudiu Hromei
|
Danilo Croce
|
Rodolfo Delmonte
|
Roberto Basili
This paper investigates the ability of Large Language Models (LLMs) to differentiate between canonical and non-canonical sentences in Italian, employing advanced neural architectures like LLaMA and its adaptations. Canonical sentences adhere to the standard Subject-Verb-Object (SVO) structure. We hypothesize that recent generative LLMs are influenced heavily by the English language, where non-canonical structures are very rare. Using the in-context learning technique, we probe these models and further fine-tune them for this specific task. Initial results indicate that these models continue to struggle with this task even after fine-tuning. Additionally, we introduce a new dataset comprising several hundred sentences from the poetry domain, which presents significant challenges for the canonical structure task.
pdf
bib
abs
Enhancing Job Posting Classification with Multilingual Embeddings and Large Language Models
Hamit Kavas
|
Marc Serra-Vidal
|
Leo Wanner
In the modern labour market, taxonomies such as the European Skills, Competences, Qualifications and Occupations (ESCO) classification are used as an interlingua to match job postings with job seeker profiles. Both are classified with respect to ESCO occupations, and match if they align with the same occupation and the same skills assigned to the occupation. However, matching models usually struggle with the classification because of overlapping skills and similar definitions of occupations defined in the ESCO taxonomy. This often leads to imprecise classification outcomes. In this paper, we focus on the challenge of classifying job postings written in Italian or Spanish against ESCO occupations written in English. We experiment with multilingual embeddings, zero-shot classification, and the use of a large language model (LLM), and show that the use of an LLM leads to the best results.
pdf
bib
abs
Divergent Discourses: A Comparative Examination of Blackout Tuesday and #BlackLivesMatter on Instagram
Aenne Knierim
|
Michael Achmann-Denkler
|
Ulrich Heid
|
Christian Wolff
On May 25th, 2020, a viral eleven-minute clip showing the murder of George Floyd sparked international outrage and solidarity, leading to the digital memorial event Blackout Tuesday on Instagram. We analyzed posts to compare Blackout Tuesday discourse with #blacklivesmatter movement conversations. Using topic modeling, we identified dominant themes and counter-narratives in Blackout Tuesday and #blacklivesmatter captions. Using hashtag co-occurrence analysis, we investigate hashtag networks to situate the discourses within spheres of Instagram activism. Our findings indicate that both corpora share themes like “calls to action”, but Blackout Tuesday posts are shorter and solidarity-focused, while #blacklivesmatter posts are longer and address white privilege more explicitly. #blacklivesmatter is linked to anti-racist activism hashtags, while Blackout Tuesday connects more with popular culture and #Alllivesmatter. This supports qualitative research on Blackout Tuesday’s performative allyship, adding a quantitative perspective to the field.
pdf
bib
abs
THAVQA: A German Task-oriented VQA Dataset Annotated with Human Visual Attention
Moritz Kronberger
|
Viviana Ventura
Video question answering (VQA) is a challenging task that requires models to generate answers by using information from both text and video. We present Task-oriented Human Attention Video Question Answering (THAVQA), a new VQA dataset consisting of third- and first-person videos of an instructor using a sewing machine. The sewing task is formalized step-by-step in a script: each step consists of a video annotated with open-ended German question and answer (QA) pairs and with human visual attention. The paper also includes a first assessment of the performance of a pre-trained Multimodal Large Language Model (MLLM) in generating answers to the questions of our dataset across different experimental settings. Results show that our task-oriented dataset is challenging for pre-trained models. Specifically, the model struggles to answer questions requiring technical knowledge or spatio-temporal reasoning.
pdf
bib
abs
Are You a Good Assistant? Assessing LLM Trustability in Task-oriented Dialogues
Tiziano Labruna
|
Sofia Brenna
|
Giovanni Bonetta
|
Bernardo Magnini
Despite the impressive capabilities of recent Large Language Models (LLMs) to generate human-like text, their ability to produce contextually appropriate content for specific communicative situations is still a matter of debate. This issue is particularly crucial when LLMs are employed as assistants to help solve tasks or achieve goals within a given conversational domain. In such scenarios, the assistant is expected to access specific knowledge (e.g., a database of restaurants, a calendar of appointments) that is not directly accessible to the user and must be consistently utilised to accomplish the task. In this paper, we conduct experiments to evaluate the trustworthiness of automatic assistants in task-oriented dialogues. Our findings indicate that state-of-the-art open-source LLMs still face significant challenges in maintaining logical consistency with a knowledge base of facts, highlighting the need for further advancements in this area.
pdf
bib
abs
Comparative Evaluation of Computational Models Predicting Eye Fixation Patterns During Reading: Insights from Transformers and Simpler Architectures
Alessandro Lento
|
Andrea Nadalini
|
Nadia Khlif
|
Vito Pirrelli
|
Claudia Marzi
|
Marcello Ferro
Eye tracking data during reading provides significant insights into the cognitive processes underlying language comprehension. It allows for the estimation of lexical, contextual, and higher-level structural effects on word identification through metrics such as fixation duration. Despite advancements in psycholinguistic experiments that have elucidated these effects, the extent to which computational models can predict gaze patterns remains unclear. Recent developments in computational modeling, particularly the use of pre-trained transformer language models, have shown promising results in mirroring human reading behaviors. However, previous studies have not adequately compared these models to alternative architectures or considered various input features comprehensively. This paper addresses these gaps by replicating prior findings on English data, critically evaluating performance metrics, and proposing a stricter accuracy measurement method. Furthermore, it compares different computational models, demonstrating that simpler architectures can achieve results comparable to or better than transformers. The study also emphasizes the significance of individual differences in reading behavior, presenting challenges for simulating natural reading tasks.
pdf
bib
abs
Hits or Misses? A Linguistically Explainable Formula for Fanfiction Success
Giulio Leonardi
|
Dominique Brunato
|
Felice Dell’Orletta
This study presents a computational analysis of Italian fanfiction, aiming to construct an interpretable model of successful writing within this emerging literary domain. Leveraging explicit features that capture both linguistic style and semantic content, we demonstrate the feasibility of automatically predicting successful writing in fanfiction and we identify a set of robust linguistic predictors that maintain their predictive power across diverse topics and time periods, offering insights into the universal aspects of engaging storytelling. This approach not only enhances our understanding of fanfiction as a genre but also offers potential applications in broader literary analysis and content creation.
pdf
bib
abs
A Novel Multi-Step Prompt Approach for LLM-based Q&As on Banking Supervisory Regulation
Daniele Licari
|
Canio Benedetto
|
Praveen Bushipaka
|
Alessandro De Gregorio
|
Marco De Leonardis
|
Tommaso Cucinotta
This paper investigates the use of large language models (LLMs) in analyzing and answering questions related to banking supervisory regulation concerning reporting obligations. We introduce a multi-step prompt construction method that enhances the context provided to the LLM, resulting in more precise and informative answers. This multi-step approach is compared with a standard “zero-shot” approach, which lacks context enrichment. To assess the quality of the generated responses, we utilize an LLM Evaluator. Our findings indicate that the multi-step approach significantly outperforms the zero-shot method, producing more comprehensive and accurate responses.
pdf
bib
abs
Lupus Alberto: A Transformer-Based Approach for SLE Information Extraction from Italian Clinical Reports
Livia Lilli
|
Laura Antenucci
|
Augusta Ortolan
|
Silvia Laura Bosello
|
Maria Antonietta D’Agostino
|
Stefano Patarnello
|
Carlotta Masciocchi
|
Jacopo Lenkowicz
Natural Language Processing (NLP) is widely used across several fields, particularly in medicine, where information often originates from unstructured data sources. This creates the need for automated systems to classify text and extract information from Electronic Health Records (EHRs). However, a significant challenge lies in the limited availability of pre-trained models for less common languages, such as Italian, and for specific medical domains. Our study aims to develop an NLP approach to extract Systemic Lupus Erythematosus (SLE) information from Italian EHRs at Gemelli Hospital in Rome. We introduce Lupus Alberto, a fine-tuned version of AlBERTo, trained to classify categories from three distinct domains: Diagnosis, Therapy and Symptom. We evaluated Lupus Alberto’s performance by comparing it with other baseline approaches, selecting from available BERT-based models for the Italian language and fine-tuning them for the same tasks. Evaluation results show that Lupus Alberto achieves overall F-scores of 79%, 87%, and 76% for the Diagnosis, Therapy, and Symptom domains, respectively. Furthermore, our approach outperformed other baseline models in the Diagnosis and Symptom domains, demonstrating superior performance in identifying and categorizing relevant SLE information, thereby improving clinical decision-making and patient management.
pdf
bib
abs
The Lemma Bank of the LiITA Knowledge Base of Interoperable Resources for Italian
Eleonora Litta
|
Marco Passarotti
|
Paolo Brasolin
|
Giovanni Moretti
|
Valerio Basile
|
Andrea Di Fabio
|
Cristina Bosco
The paper introduces the LiITA Knowledge Base of interoperable linguistic resources for Italian. After describing the principles of the Linked Data paradigm, on which LiITA is grounded, the paper presents the lemma-centred architecture of the Knowledge Base and details its core component, consisting of a large collection of Italian lemmas (called the Lemma Bank) used to interlink distributed lexical and textual resources.
pdf
bib
abs
Multimodal Chain-of-Thought Prompting for Metaphor Generation
Sofia Lugli
|
Carlo Strapparava
This paper introduces an exploratory approach in the field of metaphorical and visual reasoning by proposing the Multimodal Chain-of-Thought Prompting for Metaphor Generation task, aimed at generating metaphorical linguistic expressions from non-metaphorical images using the multimodal LLaVA 1.5 model and the two-step approach of multimodal chain-of-thought prompting. The generated metaphors were evaluated in two ways: using BERTScore and by five human workers on Amazon Mechanical Turk. For the automatic evaluation, each generated metaphorical expression was paired with a corresponding human metaphorical expression. The overall BERTScore was the following: precision = 0.41, recall = 0.43, and F1 = 0.42, suggesting that generated and human metaphors might not have captured the same semantic meaning. The human evaluation showed the model’s ability to generate metaphorical expressions, as 92% of them were classified as metaphors by the majority of the workers. Additionally, the evaluation revealed interesting patterns in terms of metaphoricity, familiarity and appeal scores across the generated metaphors: as the metaphoricity and appeal scores increased, the familiarity score decreased, suggesting that the model exhibited a certain degree of creativity, as it also generated novel or unconventional metaphorical expressions. It is important to acknowledge that this work is exploratory in nature and has certain limitations.
pdf
bib
abs
Leveraging Advanced Prompting Strategies in LLaMA3-8B for Enhanced Hyperpartisan News Detection
Michele Maggini
|
Pablo Gamallo Otero
This paper explores advanced prompting strategies for hyperpartisan news detection using the LLaMA3-8b-Instruct model, an open-source LLM developed by Meta AI. We evaluate zero-shot, few-shot, and Chain-of-Thought (CoT) techniques on two datasets: SemEval-2019 Task 4 and a headline-specific corpus. Collaborating with a political science expert, we incorporate domain-specific knowledge and structured reasoning steps into our prompts, particularly for the CoT approach. Our findings reveal that zero-shot prompting, especially with general prompts, consistently outperforms the other techniques across both datasets. This unexpected result challenges assumptions about the superiority of few-shot and CoT methods in specialized tasks. We discuss the implications of these findings for in-context learning (ICL) in political text analysis and suggest directions for future research in leveraging large language models for nuanced content classification tasks.
pdf
bib
abs
Understanding High-complexity Technical Documents with State-of-Art Models
Bernardo Magnini
|
Roberto Zanoli
Technical documents, particularly those in civil engineering, contain crucial information that supports critical decision-making in construction, transportation and infrastructure projects. Large language models (LLMs) offer a promising solution for automating the extraction and comprehension of technical documents, potentially transforming our interaction with technical information. However, LLMs may encounter significant challenges when processing technical documents due to their complex structure, specialized terminology and reliance on graphical and visual elements. Moreover, LLMs are known to sometimes produce unexpected or incorrect analyses, a phenomenon referred to as hallucination. This study explores the potential of state-of-the-art LLMs, specifically GPT-4omni, to automate the comprehension of technical documents. The evaluation was performed on two types of PDF documents. The first type is selectable text PDFs, which are extractable and editable, focusing on civil engineering documents from the Italian state railways. The second type is scanned OCR PDFs, where text is derived from scanning or OCR, specifically focusing on the design of an outdoor swimming pool. These documents include textual and visual elements such as tables, figures and photos. Our findings suggest that GPT-4omni has a high potential for real-world use, although it may still be susceptible to producing misleading information.
pdf
bib
abs
Temporal Word Embeddings in the Study of Metaphor Change over Time and across Genres: A Proof-of-concept Study on English
Veronica Mangiaterra
|
Chiara Barattieri Di San Pietro
|
Valentina Bambini
Temporal word embeddings have been successfully employed in semantic change research to identify and trace shifts in the meaning of words. In a previous work, we developed an approach to study the diachrony of complex expressions, namely literary metaphors extracted from Italian literary texts. Capitalizing on the evidence that measures of cosine similarity between the two terms of a metaphor approximate human judgments on the difficulty of the expression, we used time-locked measures of similarity to reconstruct the evolution of processing costs of literary metaphors over the past two centuries. In this work, we present a proof-of-concept study testing the crosslinguistic applicability of this approach on a set of 19th-century English literary metaphors. Our results show that metaphors changed as a function of textual genre but not of epoch: cosine similarity between the two terms of literary metaphors is higher in literary compared to nonliterary texts, and this difference is stable across epochs. We show that the difference between genres is affected by the frequency of the metaphor’s vehicle and the stability of the meaning of both topic and vehicle. Overall, the processing costs of English literary metaphors do not differ across time points, but are influenced by textual genre. In a broader perspective, general considerations can be drawn about the history of literary and nonliterary English and the semantic change of words.
pdf
bib
abs
Fine-grained Sexism Detection in Italian Newspapers
Federica Manzi
|
Leon Weber-Genzel
|
Barbara Plank
In recent years, tasks revolving around hate speech detection have experienced growing interest in the field of Natural Language Processing. Two main trends stand out in the context of sexism recognition: the focus on overt forms of sexism such as misogyny on social media and tackling the problem as a text classification task. The main objective of this work is to introduce a new approach that tackles sexism recognition as a sequence labelling task, operating on the token level rather than the document one. To achieve this goal, we introduce (i) the FGSDI (Fine-Grained Sexism Detection in Italian) corpus, containing Italian newspaper articles annotated with fine-grained linguistic markers of sexism, and (ii) a two-step pipeline that sequentially performs sexism detection on the sentence level and sexism classification on the token one. Our primary findings are that (i) tackling sexism recognition as a sequence labelling task is possible, although a large amount of labelled data is needed; (ii) leveraging few-shot learning for sexism detection proves to be an effective solution in scenarios where only a limited amount of data is available; (iii) the proposed pipeline approach allows for better results compared to the baseline, doubling the overall precision and achieving a better F1-score.
pdf
bib
abs
Towards a More Comprehensive Evaluation for Italian LLMs
Luca Moroni
|
Simone Conia
|
Federico Martelli
|
Roberto Navigli
Recent Large Language Models (LLMs) have shown impressive performance in addressing complex aspects of human language. These models have also demonstrated significant capabilities in processing and generating Italian text, achieving state-of-the-art results on current benchmarks for the Italian language. However, the number of such benchmarks is still insufficient. A case in point is the “Open Ita LLM Leaderboard”, which only supports three benchmarks, despite being one of the most popular evaluation suites for Italian-speaking LLMs. In this paper, we analyze the current pitfalls of existing evaluation suites and propose two ways to fill this gap: i) a new suite of automatically-translated benchmarks, drawn from the most popular English benchmarks; and ii) the adaptation of existing manual datasets so that they can be used to complement the evaluation of Italian LLMs. We discuss the pros and cons of both approaches and release all our data to foster further research on the evaluation of Italian-speaking LLMs.
pdf
bib
abs
A Study on the Soundness of Closed-ended Evaluation of Large Language Models Adapted to the Italian Language
Elio Musacchio
|
Lucia Siciliani
|
Pierpaolo Basile
|
Edoardo Michielon
|
Marco Pasqualini
|
Asia Beatrice Uboldi
|
Giovanni Semeraro
With the rising interest in Large Language Models, deep architectures capable of solving a wide range of Natural Language Generation tasks, an increasing number of open weights architectures have been developed and released online. In contrast with older architectures, which were aimed at solving specific linguistic assignments, Large Language Models have shown outstanding capabilities in solving several tasks at once, raising the question of whether they can truly comprehend natural language. Nevertheless, evaluating this kind of capability is far from easy. One of the proposed solutions so far is using benchmarks that combine various types of tasks. This approach is based on the premise that achieving good performance in each of these individual tasks can imply having developed a model capable of understanding language. However, while this assumption is not incorrect, it is evident that it is not sufficient, and the evaluation of Large Language Models still remains an open challenge. In this paper, we conduct a study aimed at highlighting the potential and limitations of current datasets and how a new evaluation setting applied to language-adapted Large Language Models may provide more insight than traditional approaches.
pdf
bib
abs
Understanding the Future Green Workforce through a Corpus of Curricula Vitae from Recent Graduates
Francesca Nannetti
|
Matteo Di Cristofaro
In view of the much-heralded ecological transition, to stay competitive and participate in the collective effort to face global warming and climate change, organisations need to select employees interested in and able to develop environmentally sustainable and innovative ideas. The existing literature, however, does not present consistent or concordant results on the actual interest, involvement and expertise of Generation Z members – namely, the newest entrants into the workforce – in green issues. The aim of this study is to explore the profile of the upcoming workforce expected to present itself to companies, and to support them in managing the green transition. With CVs as one of the first interfaces between candidate and company in the recruitment process, this study is based on a purpose-built corpus consisting of 8,096 Curricula Vitae from recent graduates of the University of Modena and Reggio Emilia. Data is investigated through a Corpus-Assisted Discourse Studies (CADS) framework, proposing a novel interaction between structured metadata and textual information. The original contribution of this approach lies in the extraction of information from the narrative structure of CVs which, guiding the evaluation and exploration of metadata, ensures that the knowledge value of the data can be explored in a discursive manner and not reduced to lists of competences and qualifications.
pdf
bib
abs
Exploring Italian Sentence Embeddings Properties through Multi-tasking
Vivi Nastase
|
Giuseppe Samo
|
Chunyang Jiang
|
Paola Merlo
We investigate to what degree existing LLMs encode abstract linguistic information in Italian in a multi-task setting. We exploit curated synthetic data on a large scale – several Blackbird Language Matrices (BLMs) problems in Italian – and use them to study how sentence representations built using pre-trained language models encode specific syntactic and semantic information. We use a two-level architecture to model separately a compression of the sentence embeddings into a representation that contains relevant information for a task, and a BLM task. We then investigate whether we can obtain compressed sentence representations that encode syntactic and semantic information relevant to several BLM tasks. While we expected that the sentence structure – in terms of sequence of phrases/chunks – and chunk properties could be shared across tasks, performance and error analysis show that the clues for the different tasks are encoded in different manners in the sentence embeddings, suggesting that abstract linguistic notions such as constituents or thematic roles do not seem to be present in the pretrained sentence embeddings.
pdf
bib
abs
Exploring Syntactic Information in Sentence Embeddings through Multilingual Subject-verb Agreement
Vivi Nastase
|
Giuseppe Samo
|
Chunyang Jiang
|
Paola Merlo
In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon – subject-verb agreement across a variety of sentence structures – in several languages. Finding a solution to this task requires a system detecting complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps – detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences – we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.
pdf
bib
abs
Dynamic Prompting: Large Language Models for Task Oriented Dialog
Jan Nehring
|
Akhil Juneja
|
Adnan Ahmad
|
Roland Roller
|
Dietrich Klakow
Large Language Models show impressive results in many different applications, most notably in the context of question-answering and open dialog situations. However, it is still an open question how to use those models for task-oriented dialogs such as booking or customer information systems. In this work, we propose Dynamic Prompting, an architecture for task-oriented dialog, integrating the benefits of Large Language Models and showcasing the approach on the MultiWOZ 2.2 dataset. Our architecture leads to a high task success rate, provides sensible and specific answers, and is resistant to hallucinations. Further, we show that Dynamic Prompting is able to answer questions that were not anticipated by the dialog system designer and that it can correct several types of errors and other characteristics of the system.
pdf
bib
abs
Exploring Text-Embedding Retrieval Models for the Italian Language
Yuri Noviello
|
Fabio Tamburini
Text retrieval systems have become essential in the field of natural language processing (NLP), serving as the backbone for applications such as search engines, document indexing, and information retrieval. With the rise of generative AI, particularly Retrieval-Augmented Generation (RAG) systems, the demand for robust text retrieval models has increased. However, existing large language models (LLMs) and datasets are often insufficiently optimized for Italian, limiting their performance in Italian text retrieval tasks. This paper addresses this gap by proposing both a data collection and specialized models tailored for Italian text retrieval. Through extensive experimentation, we analyze the improvements and limitations in retrieval performance, paving the way for more effective Italian NLP applications.
pdf
bib
abs
Introducing MultiLS-IT: A Dataset for Lexical Simplification in Italian
Laura Occhipinti
Lexical simplification is a fundamental task in Natural Language Processing, aiming to replace complex words with simpler synonyms while preserving the original meaning of the text. This task is crucial for improving the accessibility of texts for different user groups. In this article, we present MultiLS-IT, the first dataset specifically designed for automatic lexical simplification in Italian, as part of the larger multilingual MultiLS dataset. We offer a detailed description of the data collection and annotation process, along with a comprehensive statistical analysis of the dataset. Our dataset provides a basis for the development and evaluation of automatic simplification models, contributing to the broader goal of making texts more accessible to all readers.
pdf
bib
abs
Enhancing Lexical Complexity Prediction in Italian through Automatic Morphological Segmentation
Laura Occhipinti
Morphological analysis is vital for various NLP tasks as it provides insights into word structures and enhances the understanding of morphological and syntactic relationships. This study focuses on surface morphological segmentation for the Italian language, addressing the lack of detailed morphological representation in existing corpora. By utilizing an automatic segmenter, we aim to extract quantitative morphological parameters to understand their impact on word complexity perception. Our correlation analysis reveals that morphological features significantly influence the perceived complexity of words.
pdf
bib
abs
Measuring Bias in Instruction-Following Models with ItaP-AT for the Italian Language
Dario Onorati
|
Davide Venditti
|
Elena Sofia Ruzzetti
|
Federico Ranaldi
|
Leonardo Ranaldi
|
Fabio Massimo Zanzotto
Instruction-Following Language Models (IFLMs) are the state of the art for solving many downstream tasks. Given their widespread use, there is an urgent need to measure whether the sentences they generate contain toxic information or social biases. In this paper, we propose the Prompt Association Test for the Italian language (ItaP-AT): a new resource for testing the presence of social bias across different domains in IFLMs. This work also aims to understand whether the responses of these models can be made fairer through in-context learning, using “one-shot anti-stereotypical prompts”.
pdf
bib
abs
Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data
Riccardo Orlando
|
Luca Moroni
|
Pere-Lluís Huguet Cabot
|
Simone Conia
|
Edoardo Barba
|
Sergio Orlandini
|
Giuseppe Fiameni
|
Roberto Navigli
The increasing popularity of Large Language Models (LLMs) has led to a surge in research on adapting existing models to different languages. However, the pretraining of non-English LLMs is still an underexplored area and there is no open-source endeavor that explores what is achievable with open Italian data. To address this issue, we present Minerva, the first family of LLMs trained from scratch on Italian data. The creation of Minerva is an opportunity to explore and investigate the pretraining of LLMs for the Italian language, outlining the challenges that arise when training LLMs with native Italian texts. Minerva demonstrates that an LLM for a specific language brings a number of practical benefits compared to the adaptation of an existing one, including deep control over the composition of the vocabulary and the training data. With this paper, we aim to provide a comprehensive overview of the design choices, results, and evaluation of our Minerva models, showing promising results on Italian benchmarks and downstream tasks. Most importantly, we share what we learned and the findings obtained during the development of Minerva, as we believe that our experience will be valuable for the academic and industrial communities interested in training non-English LLMs from scratch.
pdf
bib
abs
Benchmarking the Semantics of Taste: Towards the Automatic Extraction of Gustatory Language
Teresa Paccosi
|
Sara Tonelli
In this paper, we present a benchmark containing texts manually annotated with gustatory semantic information. We employ a FrameNet-like approach previously tested to address olfactory language, which we adapt to capture gustatory events. We then propose an exploration of the data in the benchmark to show the possible insights brought by this type of approach, addressing the investigation of emotional valence in text genres. Eventually, we present a supervised system trained with the taste benchmark for the extraction of gustatory information from historical and contemporary texts.
pdf
bib
abs
Nominal Class Assignment in Swahili: A Computational Account
Giada Palmieri
|
Konstantinos Kogkalidis
We discuss the open question of the relation between semantics and nominal class assignment in Swahili. We approach the problem from a computational perspective, aiming first to quantify the extent of this relation, and then to explicate its nature, taking extra care to suppress morphosyntactic confounds. Our results are the first of their kind, providing a quantitative evaluation of the semantic cohesion of each nominal class, as well as a nuanced taxonomic description of its semantic content.
pdf
bib
abs
Did Somebody Say ‘Gest-IT’? A Pilot Exploration of Multimodal Data Management
Ludovica Pannitto
|
Lorenzo Albanesi
|
Laura Marion
|
Federica Martines
|
Carmelo Caruso
|
Claudia Bianchini
|
Francesca Masini
|
Caterina Mauri
The paper presents a pilot exploration of the construction, management and analysis of a multimodal corpus. Through a three-layer annotation that provides orthographic, prosodic, and gestural transcriptions, the gest-IT resource allows one to investigate the variation of gesture-making patterns in conversations between sighted people and people with visual impairment. After discussing the transcription methods and technical procedures employed in our study, we will propose a unified CoNLL-U corpus and indicate our future steps.
pdf
bib
abs
Confronto tra Diversi Tipi di Valutazione del Miglioramento della Chiarezza di Testi Amministrativi in Lingua Italiana
Mariachiara Pascucci
|
Mirko Tavosanis
The paper presents a comparison of different types of evaluation of administrative texts in the Italian language on which a clarity improvement intervention was carried out. The clarity improvement was performed by human experts and by ChatGPT. The evaluation was carried out in four different ways: by expert evaluators, used as a reference; by evaluators with good skills, subject to dedicated training; by generic evaluators recruited through a crowdsourcing platform; and by ChatGPT. The results show that the closest match to the evaluation by expert evaluators was reached, by a wide margin, by evaluators with good skills and dedicated training; the second-best match was obtained by requesting evaluation from ChatGPT; the worst by generic evaluators recruited through a crowdsourcing platform. Task features that may have influenced the outcome are also discussed.
pdf
bib
abs
Towards an Automatic Evaluation of (In)coherence in Student Essays
Filippo Pellegrino
|
Jennifer Frey
|
Lorenzo Zanasi
Coherence modeling is an important task in natural language processing (NLP) with potential impact on other NLP tasks such as Natural Language Understanding or Automated Essay Scoring. But it can also offer interesting linguistic insights with pedagogical implications. Early work on coherence modeling has focused on exploring definitions of the phenomenon, and in recent years neural models have entered this field of research, making it possible to successfully distinguish coherent from incoherent (synthetically created) texts or to identify the correct continuation for a given sample of texts, as demonstrated for Italian in the DisCoTex task of EVALITA 2023. In this article, we target coherence modeling for the Italian language in a strongly domain-specific scenario, i.e. education. We use a corpus of student essays, collected to analyse students’ text coherence, and data augmentation techniques to experiment with the effect of various linguistically informed features of incoherent writing on current coherence modelling strategies used in NLP. Our results show the capabilities of encoder models to capture features of (in)coherence in a domain-specific scenario, discerning natural from artificially corrupted texts. Our code is available at the following url: https://gitlab.inf.unibz.it/commul/itaca/automatic_eval
pdf
bib
abs
MONICA: Monitoring Coverage and Attitudes of Italian Measures in Response to COVID-19
Fabio Pernisi
|
Giuseppe Attanasio
|
Debora Nozza
Modern social media have long been observed as a mirror for public discourse and opinions. Especially in the face of exceptional events, computational language tools are valuable for understanding public sentiment and reacting quickly. During the coronavirus pandemic, the Italian government issued a series of financial measures, each unique in target, requirements, and benefits. Despite the widespread dissemination of these measures, it is currently unclear how they were perceived and whether they ultimately achieved their goal. In this paper, we document the collection and release of MONICA, a new social media dataset for MONItoring Coverage and Attitudes to such measures. Data include approximately ten thousand posts discussing a variety of measures in ten months. We collected annotations for sentiment, emotion, irony, and topics for each post. We conducted an extensive analysis using computational models to learn these aspects from text. We release a compliant version of the dataset to foster future research on computational approaches for understanding public opinion about government measures. We will release the data at URL.
pdf
bib
abs
Unraveling the Enigma of SPLIT in Large-Language Models: The Unforeseen Impact of System Prompts on LLMs with Dissociative Identity Disorder
Marco Polignano
|
Marco De Gemmis
|
Giovanni Semeraro
Our work delves into the unexplored territory of Large-Language Models (LLMs) and their interactions with System Prompts, unveiling the previously undiscovered implications of SPLIT (System Prompt Induced Linguistic Transmutation) in commonly used state-of-the-art LLMs. Dissociative Identity Disorder, a complex and multifaceted mental health condition, is characterized by the presence of two or more distinct identities or personas within an individual, often with varying levels of awareness and control. The advent of large-language models has raised intriguing questions about the presence of such conditions in LLMs. Our research investigates the phenomenon of SPLIT, in which the System Prompt, a seemingly innocuous input, profoundly impacts the linguistic outputs of LLMs. The findings of our study reveal a striking correlation between the System Prompt and the emergence of distinct, persona-like linguistic patterns in the LLM’s responses. These patterns are not only reminiscent of the dissociative identities present in the original data but also exhibit a level of coherence and consistency that is uncommon in typical LLM outputs. As we continue to explore the capabilities of LLMs, it is imperative that we maintain a keen awareness of the potential for SPLIT and its significant implications for the development of more human-like and empathetic AI systems.
pdf
bib
abs
The limits of Italian in Reasoning Tasks
Leonardo Ranaldi
|
Giulia Pucci
|
Federico Ranaldi
|
Elena Sofia Ruzzetti
|
Fabio Massimo Zanzotto
Previous studies have demonstrated the effectiveness of reasoning methods in eliciting multi-step reasoned answers from Large Language Models (LLMs) by leveraging in-context demonstrations. These methods, exemplified by Chain-of-Thought (CoT) and Program-Aided Language Models (PAL), have been shown to reason well in monolingual contexts, primarily in English. There has, however, been limited exploration of their abilities in other languages, especially in Italian. To gain a deeper understanding of the role of reasoning methods in in-context demonstrations, we propose a multidimensional analysis tailored to Italian, focusing on arithmetic and symbolic reasoning tasks. Our findings indicate that the effectiveness of reasoning methods varies significantly beyond English. Specifically, CoT, which relies on natural language demonstrations, is limited to English. Conversely, the structured nature of PAL in-context demonstrations facilitates multilingual comprehension, enabling LLMs to generate programmatic answers in Italian as well; this leads to significant improvements in the accuracy and quality of the generated responses. Finally, for a more comprehensive overview, we observe that additional alignment methods do not improve downstream performance; in contrast, in some cases, they limit the abilities of the original models.
pdf
bib
abs
How Far Does the Sequence of Compositions Impact Multilingual Pre-Training?
Leonardo Ranaldi
|
Giulia Pucci
|
Fabio Massimo Zanzotto
The most efficient strategy for conducting pre-training of language models is the concatenation of contiguous sequences of text of fixed length through causal masking that estimates the probability of each token given its context. However, the role of the composition sequence pre-training technique in the models’ generalization properties has yet to be explored. In this paper, we show that operating via causal masking impacts model performance because it could include misleading information from previous text sequences during pre-training. To fill this gap, we propose intra-context causal masking, where the probability of each token is conditioned only on the previous tokens in the same chunk of text, avoiding misleading information from different contexts. Hence, we demonstrate that organizing text chunks based on a policy that aligns with text similarity effectively reduces the risk of misleading context during pre-training by enhancing language models’ in-context learning and factual knowledge storage capabilities while maintaining efficiency.
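The intra-context masking idea described in this abstract can be sketched as a block-diagonal causal attention mask. The sketch below is an illustrative reconstruction under our own assumptions (function name and NumPy layout are ours, not the authors' implementation):

```python
import numpy as np

def intra_context_causal_mask(chunk_lengths):
    """Build a block-diagonal causal attention mask for a packed sequence.

    Standard causal masking lets token i attend to every earlier token,
    even across document boundaries; here each token only attends to
    previous tokens within its own chunk.
    """
    total = sum(chunk_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in chunk_lengths:
        for i in range(n):
            # Token start+i may attend to tokens start..start+i (same chunk only).
            mask[start + i, start : start + i + 1] = True
        start += n
    return mask

# Two packed chunks of lengths 2 and 3: attention never crosses the boundary.
m = intra_context_causal_mask([2, 3])
```

Under plain causal masking every row would extend back to position 0; here the second chunk's tokens cannot see the first chunk at all, which is the "avoiding misleading information from different contexts" behaviour the abstract refers to.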
pdf
bib
abs
From ‘It’s All Greek to Me’ to ‘Nur Bahnhof Verstehen’: An Investigation of mBERT’s Cross-Linguistic Capabilities
Aria Rastegar
|
Pegah Ramezani
This study investigates the impact of cross-linguistic similarities on idiom representation in mBERT, focusing on English and German idioms categorized by different degrees of similarity. We aim to determine whether different degrees of cross-linguistic similarities significantly affect mBERT’s representations and to observe how these representations change across its 12 layers. Contrary to our initial hypothesis, cross-linguistic similarity did not uniformly impact idiom representations across all layers. While early and middle layers showed no significant differences among idiom categories, higher layers (from Layer 8 onwards) revealed more nuanced processing. Specifically, significant differences between the control category and idioms with similar meaning (SM), as well as between idioms with similar lexical items (SL) and those with similar semantics (SM), were observed. Our analysis revealed that early layers provided general representations, while higher layers showed increased differentiation between literal and figurative meanings. This was evidenced by a general decrease in cosine similarities from Layer 5 onwards, with Layer 8 demonstrating the lowest cosine similarities across all categories. Interestingly, a trend suggests that mBERT performs slightly better with more literal hints. The order of cosine similarity for the categorizations was: idioms with a degree of formal similarity, control idioms, idioms with both formal and semantic similarity, and finally idioms with only semantic similarity. These findings indicate that mBERT’s processing of idioms evolves significantly across its layers, with cross-linguistic similarity exerting a stronger effect in higher layers, where more abstract semantic processing likely occurs.
pdf
bib
abs
Is Sentence Splitting a Solved Task? Experiments to the Intersection between NLP and Italian Linguistics
Arianna Redaelli
|
Rachele Sprugnoli
Sentence splitting, that is the segmentation of the raw input text into sentences, is a fundamental step in text processing. Although it is considered a solved task for texts such as news articles and Wikipedia pages, the performance of systems can vary greatly depending on the text genre. This paper presents the evaluation of the performance of eight sentence splitting tools adopting different approaches (rule-based, supervised, semi-supervised, and unsupervised learning) on Italian 19th-century novels, a genre that has not received sufficient attention so far but which can be an interesting common ground between Natural Language Processing and Digital Humanities.
pdf
bib
abs
From Explanation to Detection: Multimodal Insights into Disagreement in Misogynous Memes
Giulia Rizzi
|
Paolo Rosso
|
Elisabetta Fersini
This paper presents a probabilistic approach to identifying the disagreement-related elements in misogynistic memes by considering both modalities that compose a meme (i.e., visual and textual sources). Several methodologies to exploit such elements in the identification of disagreement among annotators have been investigated and evaluated on the Multimedia Automatic Misogyny Identification (MAMI) dataset. The proposed unsupervised approach achieves performance comparable to, and in some cases better than, that of state-of-the-art approaches, but with a reduced number of parameters to be estimated.
pdf
bib
abs
To Click It or Not to Click It: An Italian Dataset for Neutralising Clickbait Headlines
Daniel Russo
|
Oscar Araque
|
Marco Guerini
Clickbait is a common technique aimed at attracting readers’ attention, although it can be inaccurate and lead to misinformation. This work explores the role of current Natural Language Processing methods in reducing its negative impact. To do so, a novel Italian dataset is generated, containing manual annotations for classification, spoiling, and neutralisation of clickbait. Besides, several experimental evaluations are performed, assessing the performance of current language models. On the one hand, we evaluate performance on the task of clickbait detection in a multilingual setting, showing that augmenting the data with English instances largely improves overall performance. On the other hand, the generation tasks of clickbait spoiling and neutralisation are explored. The latter is a novel task that is designed to increase the informativeness of a headline, thus removing the information gap. This work opens a new research avenue that has been largely uncharted in the Italian language.
pdf
bib
abs
AI vs. Human: Effectiveness of LLMs in Simplifying Italian Administrative Documents
Marco Russodivito
|
Vittorio Ganfi
|
Giuliana Fiorentino
|
Rocco Oliveto
This study investigates the effectiveness of Large Language Models (LLMs) in simplifying Italian administrative texts compared to human informants. This research evaluates the performance of several well-known LLMs, including GPT-3.5-Turbo, GPT-4, LLaMA 3, and Phi 3, in simplifying s-ItaIst, a representative corpus of Italian administrative documents. To accurately compare the simplification abilities of humans and LLMs, six parallel corpora of a subsection of ItaIst are collected. These parallel corpora were analyzed using both complexity and similarity metrics to assess the outcomes of LLMs and human participants. Our findings indicate that while LLMs perform comparably to humans in many aspects, there are notable differences in structural and semantic changes. The results of our study underscore the potential and limitations of using AI for administrative text simplification, highlighting areas where LLMs need improvement to achieve human-level proficiency.
pdf
bib
abs
Assessing the Asymmetric Behaviour of Italian Large Language Models across Different Syntactic Structures
Elena Sofia Ruzzetti
|
Federico Ranaldi
|
Dario Onorati
|
Davide Venditti
|
Leonardo Ranaldi
|
Tommaso Caselli
|
Fabio Massimo Zanzotto
While LLMs get more proficient at solving tasks and generating sentences, we aim to investigate the role that different syntactic structures have on models’ performances on a battery of Natural Language Understanding tasks. We analyze the performance of five LLMs on semantically equivalent sentences that are characterized by different syntactic structures. To correctly solve the tasks, a model is implicitly required to correctly parse the sentence. We found that LLMs struggle when there are more complex syntactic structures, with an average drop of 16.13 (±11.14) points in accuracy on the Q&A task. Additionally, we propose a method based on token attribution to spot which areas of the LLMs encode syntactic knowledge, by identifying model heads and layers responsible for the generation of a correct answer.
pdf
bib
abs
Morphological vs. Lexical Antonyms in Italian: A Computational Study on Lexical Competition
Martina Saccomando
|
Andrea Zaninello
|
Francesca Masini
In this paper, we examine the competition between pairs of adjectives in Italian that are antonyms of the same term: one is a “morphological antonym” formed by negative prefixation, the other is a “lexical antonym” with no morphological relationship with the term in question. We consider pairs of adjectives that are reported as antonyms in lexicographic resources and extract the nouns that can be modified by both adjectives from a large corpus. We select a set of 8 nouns for each pair that present higher, lower, and comparable frequencies combined with each antonym respectively, and then we perform two experiments with an LLM. Firstly, we perform experiments for masked-token prediction of the adjective, to study the correlation between prediction accuracy and the frequency of the noun-antonym pair. Secondly, we perform a polarity-flip experiment with a multilingual LLM, asking to change the adjective into its positive counterpart, and study the cases where the antonym is changed to the morphological antonym’s lexical base, under the hypothesis that a flip to the lexical base indicates a narrower set of senses of the antonymic counterpart.
pdf
bib
abs
Multimodal Attention Is All You Need
Marco Saioni
|
Cristina Giannone
In this paper, we present a multimodal model for classifying fake news. The main peculiarity of the proposed model is the cross-attention mechanism. Cross-attention is an evolution of the attention mechanism that allows the model to examine intermodal relationships to better understand information from different modalities, enabling it to simultaneously focus on the relevant parts of the data extracted from each. We tested the model using MULTI-Fake-DetectiVE data from Evalita 2023. The presented model is particularly effective in both the tasks of classifying fake news and evaluating the intermodal relationship.
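The cross-attention mechanism described in this abstract can be sketched in a few lines: queries come from one modality while keys and values come from the other, so each textual position is re-expressed as a mixture of visual features. This is a minimal single-head sketch with identity projections; the actual model's dimensions and learned projections are not given in the abstract:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, image_feats):
    """Single-head cross-attention: text queries attend over image keys/values,
    so each text position becomes a weighted mixture of image features."""
    d = text_feats.shape[-1]
    scores = text_feats @ image_feats.T / np.sqrt(d)   # (T_text, T_img)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ image_feats                       # (T_text, d)

# Toy example: 3 text positions attending over 5 image patches of dim 4.
out = cross_attention(np.ones((3, 4)), np.ones((5, 4)))
```

In a real model the text and image features would first pass through learned query, key, and value projections; the identity projections here keep the intermodal mixing step easy to see.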
pdf
bib
abs
Assessing Italian Large Language Models on Energy Feedback Generation: A Human Evaluation Study
Manuela Sanguinetti
|
Alessandro Pani
|
Alessandra Perniciano
|
Luca Zedda
|
Andrea Loddo
|
Maurizio Atzori
This work presents a comparison of some recently-released instruction-tuned large language models for Italian, focusing in particular on their effectiveness in a specific application scenario, i.e., that of delivering energy feedback. This work is part of a larger project aimed at developing a conversational interface for users of a renewable energy community, where clarity and accuracy of the provided feedback are important for a proper energy management. This comparison is based on the human evaluation of the output produced by such models using energy data as input. Specifically, the data pertains to information regarding the power flows within a household equipped with a photovoltaic (PV) plant and a battery storage system. The goal of the feedback is precisely that of providing the user with such information in a meaningful way based on the specific aspect they intend to monitor at a given moment (e.g., self-consumption levels, the power generated by the PV panels or imported from the main grid, or the battery state of charge). This evaluation experiment has the two-fold purpose of providing an exploratory analysis of the models’ abilities on this specific generation task, relying solely on the information and instructions provided in the prompt, and of serving as an initial investigation into their potential as reliable tools for generating user-friendly energy feedback in this intended scenario.
pdf
bib
abs
Non Verbis, Sed Rebus: Large Language Models Are Weak Solvers of Italian Rebuses
Gabriele Sarti
|
Tommaso Caselli
|
Malvina Nissim
|
Arianna Bisazza
Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models’ performance. However, we find that performance gains from training are largely motivated by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models’ linguistic proficiency and sequential instruction-following skills.
pdf
bib
abs
Leveraging Large Language Models for Fact Verification in Italian
Antonio Scaiella
|
Stefano Costanzo
|
Elisa Passone
|
Danilo Croce
|
Giorgio Gambosi
In recent years, Automatic Fact Checking has become a crucial tool in combating fake news, leveraging AI to verify the accuracy of information. Despite significant advancements, most datasets and models are predominantly available in English, posing challenges for other languages. This paper presents an Italian resource based on the dataset made available in the FEVER evaluation campaign, created to train and evaluate fact-checking models in Italian. The dataset comprises approximately 240k examples, with over 2k test examples manually validated. Additionally, we fine-tuned a state-of-the-art LLM, namely LLaMA3, on both the original English and translated Italian datasets, demonstrating that fine-tuning significantly improves model performance. Our results suggest that the fine-tuned models achieve comparable accuracy in both languages, highlighting the value of the proposed resource.
pdf
bib
abs
A Gentle Push Funziona Benissimo: Making Instructed Models in Italian via Contrastive Activation Steering
Daniel Scalena
|
Elisabetta Fersini
|
Malvina Nissim
Adapting models to a language that was only partially present in the pre-training data requires fine-tuning, which is expensive in terms of both data and computational resources. As an alternative to fine-tuning, we explore the potential of activation steering-based techniques to enhance model performance on Italian tasks. Through our experiments we show that Italian steering (i) can be successfully applied to different models, (ii) achieves performances comparable to, or even better than, fine-tuned models for Italian, and (iii) yields higher quality and consistency in Italian generations. We also discuss the utility of steering and fine-tuning in the contemporary LLM landscape where models are anyway getting high Italian performances even if not explicitly trained in this language.
pdf
bib
abs
Subcategorization of Italian Verbs with LLMs and T-PAS
Luca Simonetti
|
Elisabetta Jezek
|
Guido Vetere
This study explores the application of Large Language Models (LLMs) to verb subcategorization in Italian, focusing on the identification and classification of syntactic patterns in sentences. While LLMs have made lexical analysis more implicit, explicit argument structure identification remains crucial in domain-specific contexts. The research leverages T-PAS, a rich lexical resource for Italian verbs, to fine-tune the open multilingual model Mistral 7B using the Iterative Reasoning Preference Optimization (IRPO) technique. This approach aims to enhance the recognition and extraction of verbal patterns from Italian sentences, addressing challenges in resource quality, coverage, and frame extraction methods. By combining curated lexical-semantic resources with neural language models, this work contributes to improving verb subcategorization tasks, particularly for the Italian language, and demonstrates the potential of LLMs in refining linguistic analysis tools.
pdf
bib
abs
Unipa-GPT: A Framework to Assess Open-source Alternatives to Chat-GPT for Italian Chat-bots
Irene Siragusa
|
Roberto Pirrone
This paper illustrates the implementation of Open Unipa-GPT, an open-source version of the Unipa-GPT chatbot that leverages open-source Large Language Models for embeddings and text generation. The system relies on a Retrieval Augmented Generation approach, thus mitigating hallucination errors in the generation phase. A detailed comparison between different models is reported to illustrate their performance as regards embedding generation, retrieval, and text generation. In the last case, models were tested in a simple inference setup after a fine-tuning procedure. Experiments demonstrate that open-source LLMs can be efficiently used for embedding generation, but none of the models reaches the performance obtained by closed models, such as gpt-3.5-turbo, in generating answers.
pdf
bib
abs
Annotation and Detection of Emotion Polarity in “I Promessi Sposi”: Dataset and Experiments
Rachele Sprugnoli
|
Arianna Redaelli
Emotions play a crucial role in literature and are studied by various disciplines, e.g. literary criticism, psychology, anthropology and, more recently, also with computational methods in NLP. However, studies in the Italian context are still limited. This work therefore aims to advance the state of the art in the field of emotion analysis applied to historical texts by proposing a new dataset and describing the results of a set of emotion polarity detection experiments. The text analyzed is “I Promessi Sposi” in its final edition (published in 1840), one of the most important novels in the Italian literary and linguistic canon.
pdf
bib
abs
Complexifying BERT Using LoRA Adapters
Fabio Tamburini
This paper presents the first results of a pilot study for transforming a real-valued pre-trained transformer encoder into a complex-valued one. Following recent findings about pre-training using LoRA, the main idea is to employ complex-valued LoRA adapters to perform this transformation, continuing the pre-training of a given Italian model in order to set up the adapters. After pre-training, the proposed complex-valued model has been evaluated on a standardised benchmark for Italian natural-language understanding, obtaining very encouraging results.
pdf
bib
abs
How Do We Counter Hate Speech in Italy?
Vittoria Tonini
|
Simona Frenda
|
Marco Antonio Stranisci
|
Viviana Patti
The phenomenon of online hate speech is a growing challenge, and various organisations try to prevent its spread by answering promptly to hateful messages online. In this context, we propose a new dataset of activists’ and users’ comments on Facebook reacting to specific news headlines: AmnestyCounterHS. Taking into account the literature on counterspeech, we defined a new annotation schema and applied it to our dataset, in order to examine the most used counter-narrative strategies in Italy. This research aims to support the future development of automatic counterspeech generation. This paper also presents a comparative analysis of our dataset with two other Italian datasets (Counter-TWIT and multilingual CONAN) containing hate speech and counter narratives. Through this analysis, we will understand how the environment (artificial vs. ecological) and the topics of discussions online influence the nature of counter narratives. Our findings highlight the predominance of negative sentiment and emotions, the varying presence of stereotypes, and the strategic differences in counter narratives across datasets.
pdf
bib
abs
Nesciun Lengaz Lascià Endò: Machine Translation for Fassa Ladin
Giovanni Valer
|
Nicolò Penzo
|
Jacopo Staiano
Despite the remarkable success recently obtained by Large Language Models, a significant gap in performance still exists when dealing with low-resource languages, which are often poorly supported by off-the-shelf models. In this work we focus on Fassa Ladin, a Rhaeto-Romance linguistic variety spoken by fewer than ten thousand people in the Dolomitic regions, and set out to build the first bidirectional Machine Translation system supporting Italian, English, and Fassa Ladin. To this end, we collected a small though representative corpus comprising 1135 parallel sentences in these three languages, spanning five domains. We evaluated several models including the open (Meta AI’s No Language Left Behind, NLLB-200) and commercial (OpenAI’s gpt-4o) state-of-the-art, and indeed found that both obtain unsatisfactory performance. We therefore proceeded to finetune the NLLB-200 model on the data collected, using different approaches. We report a comparative analysis of the results obtained, showing that 1) jointly training for multilingual translation (Ladin-Italian and Ladin-English) significantly improves the performance, and 2) knowledge-transfer is highly effective (e.g., leveraging similarities between Ladin and Friulian), highlighting the importance of targeted data collection and model adaptation in the context of low-resource/endangered languages for which little textual data is available.
pdf
bib
abs
Neutral Score Detection in Lexicon-based Sentiment Analysis: The Quartile-based Approach
Marco Vassallo
|
Giuliano Gabrieli
|
Valerio Basile
|
Cristina Bosco
Neutrality detection in Sentiment Analysis (SA) still constitutes an unsolved and debated issue. This work proposes an empirical method based on the quartiles of the polarity distribution for a lexicon-based SA approach. Our experiments are based on the Italian linguistic resource MAL (Morphologically-inflected Affective Lexicon) and applied to two annotated corpora. The findings show improved detection of neutral opinions while preserving a substantially accurate overall polarity prediction.
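The quartile-based idea can be illustrated with a toy lexicon-based pipeline: texts whose polarity score falls between the first and third quartiles of the corpus score distribution are labelled neutral. This is a sketch of the general approach under our own assumptions; the lexicon values, thresholds, and exact procedure in the paper may differ:

```python
import statistics

# Toy polarity lexicon (illustrative values; the paper uses the Italian
# MAL lexicon, whose scores and coverage differ).
LEXICON = {"ottimo": 1.0, "buono": 0.5, "pessimo": -1.0, "brutto": -0.5}

def text_score(tokens):
    """Average lexicon polarity over the tokens found in the lexicon (0.0 if none)."""
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    return statistics.mean(hits) if hits else 0.0

def quartile_classifier(corpus_scores):
    """Return a classifier labelling scores inside [Q1, Q3] of the corpus
    score distribution as neutral, and the tails as positive/negative."""
    q1, _, q3 = statistics.quantiles(corpus_scores, n=4)
    def classify(score):
        if q1 <= score <= q3:
            return "neutral"
        return "positive" if score > q3 else "negative"
    return classify

# Corpus-level polarity scores define the neutrality band.
scores = [-1.0, -0.5, -0.2, 0.0, 0.1, 0.4, 0.8, 1.0]
classify = quartile_classifier(scores)
```

The appeal of the quartile band is that it is data-driven: instead of a fixed cutoff around zero, the neutral region adapts to the empirical polarity distribution of the corpus at hand.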
pdf
bib
abs
Sensitivity of Syllable-Based ASR Predictions to Token Frequency and Lexical Stress
Alessandro Vietti
|
Domenico De Cristofaro
|
Sara Picciau
Automatic Speech Recognition (ASR) systems based on neural networks achieve great results, but it remains unclear which linguistic features and representations the models leverage to perform recognition. In our study, we used phonological syllables as tokens to fine-tune an end-to-end ASR model, due to their relevance as linguistic units. Furthermore, this strategy allowed us to keep track of different types of linguistic features characterizing the tokens. The analysis of the transcriptions generated by the model reveals that factors such as token frequency and lexical stress have a variable impact on the prediction strategies adopted by the ASR system.
pdf
bib
abs
Modelling Filled Particles and Prolongation Using End-to-end Automatic Speech Recognition Systems: A Quantitative and Qualitative Analysis.
Vincenzo Norman Vitale
|
Loredana Schettino
|
Francesco Cutugno
State-of-the-art automatic speech recognition systems based on End-to-End models (E2E-ASRs) achieve remarkable performance. However, phenomena that characterize spoken language, such as fillers (eeh, ehm) or segmental prolongations (theee), are still mostly considered as disrupting objects that should not be included to obtain optimal transcriptions, despite their acknowledged regularity and communicative value. A recent study showed that two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders – a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder – tend to model some speech features that are functional for the identification of filled pauses and prolongations in speech. This work builds upon these findings by investigating which of the two systems is better at detecting fillers and prolongations and by conducting an error analysis to deepen our understanding of how these systems work.
pdf
bib
abs
Implicit Stereotypes: A Corpus-Based Study for Italian
Wolfgang Schmeisser-Nieto
|
Giacomo Ricci
|
Simona Frenda
|
Mariona Taule
|
Cristina Bosco
Detecting stereotypes is a challenging task, particularly when they are not expressed explicitly. In this study, we applied an annotation schema from the literature designed to formalize implicit stereotypes. We analyzed implicit stereotypes towards immigrants in two datasets: StereoHoax-IT and SterheoSchool, which are created from different sources. StereoHoax-IT consists of reactions on Twitter to specific hoaxes aimed at discriminating against immigrants, while SterheoSchool includes comments from teenagers on fake news generated in psychological experiments. We describe the annotation process, annotator disagreements, and provide both quantitative and qualitative analyses to shed light on how implicitness characterizes stereotypes in different texts. Our findings suggest that implicit stereotypes are often conveyed through logical linguistic relations, such as entailment and behavioral evaluations of immigrants.
pdf
bib
abs
SLIMER-IT: Zero-Shot NER on Italian Language
Andrew Zamai
|
Leonardo Rigutini
|
Marco Maggini
|
Andrea Zugarini
Traditional approaches to Named Entity Recognition (NER) frame the task as a BIO sequence labeling problem. Although these systems often excel in the downstream task at hand, they require extensive annotated data and struggle to generalize to out-of-distribution input domains and unseen entity types. On the contrary, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities. While several works address Zero-Shot NER in English, little has been done in other languages. In this paper, we define an evaluation framework for Zero-Shot NER, applying it to the Italian language. Furthermore, we introduce SLIMER-IT, the Italian version of SLIMER, an instruction-tuning approach for zero-shot NER leveraging prompts enriched with definition and guidelines. Comparisons with other state-of-the-art models demonstrate the superiority of SLIMER-IT on never-seen-before entity tags.
pdf
bib
abs
Harnessing LLMs for Educational Content-Driven Italian Crossword Generation
Kamyar Zeinalipour
|
Achille Fusco
|
Asya Zanollo
|
Marco Maggini
|
Marco Gori
In this work, we unveil a novel tool for generating Italian crossword puzzles from text, utilizing advanced language models such as GPT-4o, Mistral-7B-Instruct-v0.3, and Llama3-8B-Instruct. Crafted specifically for educational applications, this cutting-edge generator makes use of the comprehensive Italian-Clue-Instruct dataset, which comprises over 30,000 entries including diverse text, solutions, and types of clues. This carefully assembled dataset is designed to facilitate the creation of contextually relevant clues in various styles associated with specific texts and keywords. The study delves into four distinctive styles of crossword clues: those without format constraints, those formed as definite determiner phrases, copular sentences, and bare noun phrases. Each style introduces unique linguistic structures to diversify clue presentation. Given the lack of sophisticated educational tools tailored to the Italian language, this project seeks to enhance learning experiences and cognitive development through an engaging, interactive platform. By meshing state-of-the-art AI with contemporary educational strategies, our tool can dynamically generate crossword puzzles from Italian educational materials, thereby providing an enjoyable and interactive learning environment. This technological advancement not only redefines educational paradigms but also sets a new benchmark for interactive and cognitive language learning solutions.
pdf
bib
abs
Voice Activity Detection on Italian Language
Shibingfeng Zhang
|
Gloria Gagliardi
|
Fabio Tamburini
Voice Activity Detection (VAD) refers to the task of identifying human voice activity in noisy settings, playing a crucial role in fields like speech recognition and audio surveillance. However, most VAD research focuses on English, leaving other languages, such as Italian, under-explored. This study aims to evaluate and enhance VAD systems for Italian speech, with the goal of finding a solution for the speech segmentation component of the Digital Linguistic Biomarkers (DLBs) extraction pipeline for early mental disorder diagnosis. We experimented with various VAD systems and propose an ensemble VAD system that integrates the best-performing models. Our ensemble system shows significant improvements in speech event detection. This advancement lays a robust foundation for more accurate early detection of mental health issues using DLBs in Italian.
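The ensemble idea in this abstract can be illustrated, in its simplest form, as frame-level majority voting over the speech/non-speech decisions of several VAD systems. This is a generic sketch under our own assumptions, not the paper's actual integration strategy (which may weight or select models differently):

```python
def ensemble_vad(frame_preds):
    """Frame-level majority vote over per-frame speech decisions (1 = speech)
    produced by several VAD systems over the same audio."""
    n_systems = len(frame_preds)
    # A frame is speech when a strict majority of systems says so.
    return [1 if 2 * sum(frames) > n_systems else 0
            for frames in zip(*frame_preds)]

# Three systems' decisions over four frames, merged frame by frame.
merged = ensemble_vad([[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0]])
```

Majority voting smooths out isolated false positives and false negatives of individual detectors, which is one plausible source of the improved speech event detection the abstract reports.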
pdf
bib
abs
Topic Modeling for Auditing Purposes in the Banking Sector
Alessandro Giaconia
|
Valeria Chiariello
|
Marco Passarotti
This study explores the application of topic modeling techniques for auditing purposes in the banking sector, focusing on the analysis of suspicious activity reports. We compare three topic modeling algorithms: Latent Dirichlet Allocation (LDA), Embedded Topic Model (ETM), and Product of Experts LDA (ProdLDA), using a dataset of 35,000 suspicious activity reports from an Italian bank. The models were evaluated using coherence score, NPMI coherence, and topic diversity metrics. Our results show that ProdLDA consistently outperformed LDA and ETM, with the best performance achieved using 1-gram word embeddings. The study reveals distinct topics related to specific client activities, cross-border transactions, and high-risk business sectors like gambling. These results demonstrate the potential of advanced topic modeling techniques in enhancing the efficiency and effectiveness of auditing processes in the banking sector, particularly in the analysis of suspicious activities that could be tied to money laundering and terrorism.
pdf
bib
abs
IDRE: AI Generated Dataset for Enhancing Empathetic Chatbot Interactions in Italian Language.
Simone Manai
|
Laura Gemme
|
Roberto Zanoli
|
Alberto Lavelli
This paper introduces IDRE (Italian Dataset for Rephrasing with Empathy), a novel automatically generated Italian linguistic dataset. IDRE comprises typical chatbot user utterances in the healthcare domain, corresponding chatbot responses, and empathetically enhanced chatbot responses. The dataset was generated using the Llama2 language model and evaluated by human raters based on predefined metrics. The IDRE dataset offers a comprehensive and realistic collection of Italian chatbot-user interactions suitable for training and refining chatbot models in the healthcare domain. This facilitates the development of chatbots capable of natural and productive conversations with healthcare users. Notably, the dataset incorporates empathetically enhanced chatbot responses, enabling researchers to investigate the effects of empathetic language on fostering more positive and engaging human-machine interactions within healthcare settings. The methodology employed for the construction of the IDRE dataset can be extended to generate phrases in additional languages and domains, thereby expanding its applicability and utility. The IDRE dataset is publicly available for research purposes.
pdf
bib
abs
Multimodal Online Manipulation: Empirical Analysis of Fact-Checking Reports
Olga Uryupina
This paper presents an in-depth exploratory quantitative study of the interaction between multimedia and textual components in online manipulative content. We discuss relations between content layers (such as proof or support) as well as unscrupulous techniques compromising visual content. The study is based on fakes reported and analyzed by PolitiFact and comprises documents from Facebook, Twitter and Instagram.
pdf
bib
abs
Life and Death of Fakes: On Data Persistence for Manipulative Social Media Content
Olga Uryupina
This work presents an in-depth investigation of the data decay for publicly fact-checked online content. We monitor compromised posts on major social media platforms for one year, tracking the changes in their visibility and availability. We show that data persistence is an important issue for manipulative content, on a larger scale than previously reported for online content in general. Our findings also suggest a (much) higher data decay rate for the platforms suffering most from online disinformation, indicating an important area for data collection/preservation.
pdf
bib
abs
CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian
Giuseppe Attanasio
|
Pierpaolo Basile
|
Federico Borazio
|
Danilo Croce
|
Maria Francis
|
Jacopo Gili
|
Elio Musacchio
|
Malvina Nissim
|
Viviana Patti
|
Matteo Rinaldi
|
Daniel Scalena
The rapid development of Large Language Models (LLMs) has called for robust benchmarks to assess their abilities, track progress, and compare iterations. While existing benchmarks provide extensive evaluations across diverse tasks, they predominantly focus on English, leaving other languages underserved. For Italian, the EVALITA campaigns have provided a long-standing tradition of classification-focused shared tasks. However, their scope does not fully align with the nuanced evaluation required for modern LLMs. To address this gap, we introduce “Challenge the Abilities of LAnguage Models in ITAlian” (CALAMITA), a collaborative effort to create a dynamic and growing benchmark tailored to Italian. CALAMITA emphasizes diversity in task design to test a wide range of LLM capabilities through resources natively developed in Italian by the community. This initiative includes a shared platform, live leaderboard, and centralized evaluation framework. This paper outlines the collaborative process, initial challenges, and evaluation framework of CALAMITA.
pdf
bib
abs
ItaEval: A CALAMITA Challenge
Giuseppe Attanasio
|
Moreno La Quatra
|
Andrea Santilli
|
Beatrice Savoldi
In recent years, new language models for Italian have been appearing at a rapid pace. However, evaluation methodologies for these models have not kept pace, remaining fragmented and often limited to the experimental sections of individual model releases. This paper introduces ItaEval, a multifaceted evaluation suite designed to address this gap. By reviewing recent literature on the evaluation of contemporary language models, we devise three overarching task categories—natural language understanding, commonsense and factual knowledge, and bias, fairness, and safety—that a contemporary model should be able to address. Next, we collect a set of 18 tasks encompassing existing and new datasets. The so-compiled ItaEval suite provides a standardized, multifaceted framework for evaluating Italian language models, facilitating more rigorous and comparative assessments of model performance. We release code and data at https://rita-nlp.org/sprints/itaeval.
pdf
bib
abs
PERSEID - Perspectivist Irony Detection: A CALAMITA Challenge
Valerio Basile
|
Silvia Casola
|
Simona Frenda
|
Soda Marem Lo
Works in perspectivism and human label variation have emphasized the need to collect and leverage various voices and points of view throughout the whole Natural Language Processing pipeline. PERSEID places itself in this line of work. We consider the task of irony detection from short social media conversations in Italian collected from Twitter (X) and Reddit. To do so, we leverage data from MultiPICO, a recent multilingual dataset with disaggregated annotations and annotators’ metadata, containing 1000 Post–Reply pairs with five annotations each on average. We aim to evaluate whether prompting LLMs with additional annotators’ demographic information (namely gender only, age only, and the combination of the two) results in improved performance compared to a baseline in which only the input text is provided. The evaluation is zero-shot, and we evaluate the results on the disaggregated annotations using F1.
pdf
bib
abs
TRACE-it: Testing Relative clAuses Comprehension through Entailment in ITalian: A CALAMITA Challenge
Dominique Brunato
Introduced in the context of CALAMITA 2024, TRACE-it (Testing Relative clAuses Comprehension through Entailment in ITalian) is a benchmark designed to evaluate the ability of Large Language Models (LLMs) to comprehend a specific type of complex syntactic construction in Italian: object relative clauses. In this report, we outline the theoretical framework that informed the creation of the dataset and provide a comprehensive overview of the linguistic materials used.
pdf
bib
abs
MAGNET - MAchines GeNErating Translations: A CALAMITA Challenge
Mauro Cettolo
|
Andrea Piergentili
|
Sara Papi
|
Marco Gaido
|
Matteo Negri
|
Luisa Bentivogli
We propose MAGNET - MAchines GeNErating Translations, a CALAMITA Challenge which aims at testing the ability of large language models (LLMs) in the hot topic of automatic translation, focusing on Italian and English (in both directions) to overcome the marginality with which Italian is considered by the machine translation community. We propose a benchmark composed of two portions with different distribution policies (one free to use, the other non-disclosable), allowing us to handle data contamination issues. The publicly available section of the benchmark is distributed on Hugging Face, whereas in this report we describe the details of our challenge, including the prompt formats to be used. Additionally, we report the performance of five models, including an LLM and translation models of different sizes, in terms of four evaluation metrics, whose scores allow an overall evaluation of the quality of the automatically generated translations.
pdf
bib
abs
GATTINA - GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge
Maria Francis
|
Matteo Rinaldi
|
Jacopo Gili
|
Leonardo De Cosmo
|
Sandro Iannaccone
|
Malvina Nissim
|
Viviana Patti
We introduce a new benchmark designed to evaluate the ability of Large Language Models (LLMs) to generate Italian-language headlines for science news articles. The benchmark is based on a large dataset of science news articles obtained from Ansa Scienza and Galileo, two important Italian media outlets. Effective headline generation requires more than summarizing article content; headlines must also be informative, engaging, and suitable for the topic and target audience, making automatic evaluation particularly challenging. To address this, we propose two novel transformer-based metrics to assess headline quality. We aim for this benchmark to support the evaluation of Italian LLMs and to foster the development of tools to assist in editorial workflows.
pdf
bib
abs
GFG - Gender-Fair Generation: A CALAMITA Challenge
Simona Frenda
|
Andrea Piergentili
|
Beatrice Savoldi
|
Marco Madeddu
|
Martina Rosola
|
Silvia Casola
|
Chiara Ferrando
|
Viviana Patti
|
Matteo Negri
|
Luisa Bentivogli
Gender-fair language aims at promoting gender equality by using terms and expressions that include all identities and avoid reinforcing gender stereotypes. Implementing gender-fair strategies is particularly challenging in heavily gender-marked languages, such as Italian. To address this, the Gender-Fair Generation challenge intends to help shift toward gender-fair language in written communication. The challenge, designed to assess and monitor the recognition and generation of gender-fair language in both mono- and cross-lingual scenarios, includes three tasks: (1) the detection of gendered expressions in Italian sentences, (2) the reformulation of gendered expressions into gender-fair alternatives, and (3) the generation of gender-fair language in automatic translation from English to Italian. The challenge relies on three different annotated datasets: the GFL-it corpus, which contains Italian texts extracted from administrative documents provided by the University of Brescia; GeNTE, a bilingual test set for gender-neutral rewriting and translation built upon a subset of the Europarl dataset; and Neo-GATE, a bilingual test set designed to assess the use of non-binary neomorphemes in Italian for both reformulation and translation tasks. Finally, each task is evaluated with specific metrics: the average F1-score obtained by means of BERTScore computed on each entry of the datasets for task 1, and an accuracy measured with a gender-neutral classifier and a coverage-weighted accuracy for tasks 2 and 3.
pdf
bib
abs
VeryfIT - Benchmark of Fact-Checked Claims for Italian: A CALAMITA Challenge
Jacopo Gili
|
Viviana Patti
|
Lucia Passaro
|
Tommaso Caselli
Achieving factual accuracy is a known pending issue for language models. Their design, centered around the interactive component of user interaction and the extensive use of “spontaneous” training data, has made them highly adept at conversational tasks but not fully reliable in terms of factual correctness. VeryfIT addresses this issue by evaluating the in-memory factual knowledge of language models on data written by professional fact-checkers, posing it as a true-or-false question. Topics of the statements vary, but most are in specific domains related to the Italian government, policies, and social issues. The task presents several challenges: extracting statements from segments of speeches, determining appropriate contextual relevance both temporally and factually, and ultimately verifying the accuracy of the statements.
pdf
bib
abs
AMELIA - Argument Mining Evaluation on Legal documents in ItAlian: A CALAMITA Challenge
Giulia Grundler
|
Andrea Galassi
|
Piera Santin
|
Alessia Fidelangeli
|
Federico Galli
|
Elena Palmieri
|
Francesca Lagioia
|
Giovanni Sartor
|
Paolo Torroni
This challenge consists of three classification tasks, in the context of argument mining in the legal domain. The tasks are based on a dataset of 225 Italian decisions on Value Added Tax, annotated to identify and categorize argumentative text. The objective of the first task is to classify each argumentative component as premise or conclusion, while the second and third tasks aim at classifying the type of premise: legal vs factual, and its corresponding argumentation scheme. The classes are highly unbalanced, hence evaluation is based on the macro F1 score.
pdf
bib
abs
BLM-It - Blackbird Language Matrices for Italian: A CALAMITA Challenge
Chunyang Jiang
|
Giuseppe Samo
|
Vivi Nastase
|
Paola Merlo
In this challenge, we propose Blackbird Language Matrices (BLMs), linguistic puzzles to learn language-related problems and delve into deeper formal and semantic properties of language, through a process of paradigm understanding. A BLM matrix consists of a context set and an answer set. The context is a sequence of sentences that implicitly encode an underlying generative linguistic rule. The contrastive multiple-choice answer set includes negative examples following corrupted generating rules. We propose three subtasks —agreement concord, causative and object-drop alternation detection— each in two variants of increasing lexical complexity. The datasets comprise a few prompts for few-shot learning and a large test set.
pdf
bib
abs
DIMMI - Drug InforMation Mining in Italian: A CALAMITA Challenge
Raffaele Manna
|
Maria Pia Di Buono
|
Luca Giordano
Patients’ knowledge about drugs and medications is crucial as it allows them to administer them safely. This knowledge frequently comes from written prescriptions, patient information leaflets (PILs), or from reading drug Web pages. DIMMI (Drug InforMation Mining in Italian) is a challenge aiming at evaluating the proficiency of Large Language Models in extracting drug-specific information from PILs. The challenge seeks to advance the understanding of effectiveness in processing complex medical information in Italian, and to enhance drug information extraction and pharmacovigilance efforts. Participants are provided with a dataset of 600 Italian PILs, and the objective is to develop models capable of accurately answering specific questions related to drug dosage, usage, side effects, and drug-drug interactions. The challenge should be approached as an information extraction task in a zero-shot mode, purely based on the model’s pre-existing knowledge and understanding, or through in-context learning (Retrieval-Augmented Generation (RAG) or few-shot mode). The answers generated by the models will be compared against the gold standard (GS), created to establish a reliable, accurate, and comprehensive set of answers against which participant submissions can be evaluated. For each drug and each information category, the GS contains the correct information extracted from the leaflets through a manual annotation.
pdf
bib
abs
GITA4CALAMITA - Evaluating the Physical Commonsense Understanding of Italian LLMs in a Multi-layered Approach: A CALAMITA Challenge
Giulia Pensa
|
Ekhi Azurmendi
|
Julen Etxaniz
|
Begoña Altuna
|
Itziar Gonzalez-Dios
In the context of the CALAMITA Challenge, we investigate the physical commonsense reasoning capabilities of large language models (LLMs) and introduce a methodology to assess their low-level understanding of the physical world. To this end, we use a test set designed to evaluate physical commonsense reasoning in LLMs for the Italian language. We present a tiered dataset, named the Graded Italian Annotated dataset (GITA), which is written and annotated by a professional linguist. This dataset enables us to focus on three distinct levels of commonsense understanding. Our benchmark aims to evaluate three specific tasks: identifying plausible and implausible stories within our dataset, identifying the conflict that generates an implausible story, and identifying the physical states that make a story implausible. We perform these tasks using LLaMA 3 and Gemma. Our findings reveal that, although the models may excel at high-level classification tasks, their reasoning is inconsistent and unverifiable, as they fail to capture intermediate evidence.
pdf
bib
abs
ABRICOT - ABstRactness and Inclusiveness in COntexT: A CALAMITA Challenge
Giovanni Puccetti
|
Claudia Collacciani
|
Andrea Amelio Ravelli
|
Andrea Esuli
|
Marianna Bolognesi
The ABRICOT Task is designed to evaluate Italian language models on their ability to understand and assess the abstractness and inclusiveness of language, two nuanced features that humans naturally convey in everyday communication. Unlike binary categorizations such as abstract/concrete or inclusive/exclusive, these features exist on a continuous spectrum with varying degrees of intensity. The task is based on a manual collection of sentences that present the same noun phrase (NP) in different contexts, allowing its interpretation to vary between the extremes of abstractness and inclusiveness. This challenge aims to verify how LLMs perceive subtle linguistic variations and their implications in natural language.
pdf
bib
abs
INVALSI - Mathematical and Language Understanding in Italian: A CALAMITA Challenge
Giovanni Puccetti
|
Maria Cassese
|
Andrea Esuli
While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate the generative abilities of Language Models (LMs) in this language. This work presents two new benchmarks: Invalsi MATE, to evaluate model performance on mathematical understanding in Italian, and Invalsi ITA, to evaluate language understanding in Italian. These benchmarks are based on the Invalsi tests, which are administered to students aged between 6 and 18 within the Italian school system. These tests are prepared by expert pedagogists and have the explicit goal of testing average students’ performance over time across Italy. Therefore, the questions are well written, appropriate for the age of the students, and developed with the goal of assessing students’ skills that are essential in the learning process, ensuring that the benchmark proposed here measures key knowledge for undergraduate students. Invalsi MATE is composed of 420 questions about mathematical understanding; these range from simple money-counting problems to Cartesian geometry questions, e.g. determining whether a point belongs to a given line. They are divided into 4 different types: scelta multipla (multiple choice), vero/falso (true/false), numero (number), and completa frase (fill the gap). Invalsi ITA is composed of 1279 questions regarding language understanding; these involve both the ability to extract information from and answer questions about a text passage, as well as questions about grammatical knowledge. They are divided into 4 different types: scelta multipla (multiple choice), binaria (binary), domanda aperta (open question), and altro (other). We evaluate 4 powerful language models, both English-first and tuned for Italian, finding that the best accuracy on Invalsi MATE is 55% while the best accuracy on Invalsi ITA is 80%.
pdf
bib
abs
Termite Italian Text-to-SQL: A CALAMITA Challenge
Federico Ranaldi
|
Elena Sofia Ruzzetti
|
Dario Onorati
|
Fabio Massimo Zanzotto
|
Leonardo Ranaldi
We introduce Termite, a genuinely unseen resource for evaluating Text-to-SQL in Italian. Specifically, we transfer evaluation pipelines beyond English, proposing novel, unseen resources that avoid data-contamination phenomena while assessing the ability of models to perform Text-to-SQL tasks when natural language queries are written in Italian. We establish an evaluation grid based on execution accuracy.
pdf
bib
abs
Mult-IT Multiple Choice Questions on Multiple Topics in Italian: A CALAMITA Challenge
Matteo Rinaldi
|
Jacopo Gili
|
Maria Francis
|
Mattia Goffetti
|
Viviana Patti
|
Malvina Nissim
Multi-choice question answering (MCQA) is a powerful tool for evaluating the factual knowledge and reasoning capacities of Large Language Models (LLMs). However, there is a lack of large-scale MCQA datasets originally written in Italian. Existing Italian MCQA benchmarks are often automatically translated from English, an approach with two key drawbacks: firstly, automatic translations may sound unnatural, contain errors, or use linguistic constructions that do not align with the target language; secondly, they may introduce topical and ideological biases reflecting Anglo-centric perspectives. To address this gap, we present Mult-IT, an MCQA dataset comprising over 110,000 manually written questions across a wide range of topics. All questions are sourced directly from preparation quizzes for Italian university entrance exams, or for exams for public sector employment in Italy. We are hopeful that this contribution enables a more comprehensive evaluation of LLMs’ proficiency, not only in the Italian language, but also in their grasp of Italian cultural and contextual knowledge.
pdf
bib
abs
EurekaRebus - Verbalized Rebus Solving with LLMs: A CALAMITA Challenge
Gabriele Sarti
|
Tommaso Caselli
|
Arianna Bisazza
|
Malvina Nissim
Language games can be valuable resources for testing the ability of large language models (LLMs) to conduct challenging multi-step, knowledge-intensive inferences while respecting predefined constraints. Our proposed challenge prompts LLMs to reason step-by-step to solve verbalized variants of rebus games recently introduced with the EurekaRebus dataset. Verbalized rebuses replace visual cues with crossword definitions to create an encrypted first pass, making the problem entirely text-based. We introduce a simplified task variant with word length hints and adopt a comprehensive set of metrics to obtain a granular overview of models’ performance in knowledge recall, constraints adherence, and re-segmentation abilities across reasoning steps.
pdf
bib
abs
GEESE - Generating and Evaluating Explanations for Semantic Entailment: A CALAMITA Challenge
Andrea Zaninello
|
Bernardo Magnini
In the GEESE challenge, we present a pipeline to evaluate generated explanations for the task of Recognizing Textual Entailment (RTE) in Italian. The challenge focuses on evaluating the impact of generated explanations on the predictive performance of language models. Using a dataset enriched with human-written explanations, we employ two large language models (LLMs) to generate and utilize explanations for semantic relationships between sentence pairs. Our methodology assesses the quality of generated explanations by measuring changes in prediction accuracy when explanations are provided. Through reproducible experimentation, we establish benchmarks against various baseline approaches, demonstrating the potential of explanation injection to enhance model interpretability and performance.
pdf
bib
abs
ITA-SENSE - Evaluate LLMs’ ability for ITAlian word SENSE disambiguation: A CALAMITA Challenge
Pierpaolo Basile
|
Elio Musacchio
|
Lucia Siciliani
The challenge is designed to assess LLMs’ abilities in understanding lexical semantics through Word Sense Disambiguation, providing valuable insights into their performance. The idea is to cast the classical Word Sense Disambiguation task as a generative problem, along two directions. We propose two tasks: (T1) given a target word and a sentence in which the word occurs, the LLM must generate the correct meaning definition; (T2) given a target word and a sentence in which the word occurs, the LLM should choose the correct meaning definition from a predefined set. For T1, we compare the generated definition with the correct one taken from a sense inventory, adopting metrics that measure the quality of the generated definition, such as ROUGE-L and BERTScore; for T2, a classical accuracy metric is used. For CALAMITA, we test LLMs in a zero-shot setting.
pdf
bib
abs
BEEP - BEst DrivEr’s License Performer: A CALAMITA Challenge
Fabio Mercorio
|
Daniele Potertì
|
Antonio Serino
|
Andrea Seveso
We present BEEP (BEst DrivEr’s License Performer), a benchmark challenge to evaluate large language models in the context of a simulated Italian driver’s license exam. This challenge tests the models’ ability to understand and apply traffic laws, road safety regulations, and vehicle-related knowledge through a series of true/false questions. The dataset is derived from official ministerial materials used in the Italian licensing process, specifically targeting Category B licenses. We evaluate models such as LLaMA and Mixtral across multiple categories. In addition, we simulate a driving license test to assess the models’ real-world applicability, where the pass rate is determined based on the number of errors allowed. While scaling up model size improved performance, even larger models struggled to pass the exam consistently. The challenge demonstrates the capabilities and limitations of LLMs in handling real-world, high-stakes scenarios, providing insights into their practical use and areas for further improvement.
pdf
bib
abs
PejorativITy - In-Context Pejorative Language Disambiguation: A CALAMITA Challenge
Arianna Muti
Misogyny is often expressed through figurative language. Some neutral words can assume a negative connotation when functioning as pejorative epithets, and they can be used to express misogyny. Disambiguating the meaning of such terms might help the detection of misogyny. This challenge addresses a) the disambiguation of specific ambiguous words in a given context; b) the detection of misogyny in instances that contain such polysemic words. In particular, framed as a binary classification, our task is divided into two parts. In Task A, the model is asked to determine whether, given a tweet, the target word is used in a pejorative or non-pejorative way. In Task B, the model is asked whether the whole sentence is misogynous or not.
pdf
bib
abs
MACID - Multimodal ACtion IDentification: A CALAMITA Challenge
Andrea Amelio Ravelli
|
Rossella Varvara
|
Lorenzo Gregori
This paper presents the Multimodal ACtion IDentification challenge (MACID), part of the first CALAMITA competition. The objective of this task is to evaluate the ability of large language models (LLMs) to differentiate between closely related action concepts based on textual descriptions alone. The challenge is inspired by the “find the intruder” task, where models must identify an outlier among a set of 4 sentences that describe similar yet distinct actions. The dataset highlights action-predicate mismatches, where the same verb may describe different actions or different verbs may refer to the same action. Although currently mono-modal (text-only), the task is designed for future multimodal integration, linking visual and textual representations to enhance action recognition. By probing a model’s capacity to resolve subtle linguistic ambiguities, the challenge underscores the need for deeper cognitive understanding in action-language alignment, ultimately testing the boundaries of LLMs’ ability to interpret action verbs and their associated concepts.
pdf
bib
abs
ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge
Andrea Zugarini
|
Kamyar Zeinalipour
|
Achille Fusco
|
Asya Zanollo
This paper presents ECWCA (Educational CrossWord Clues Answering), a novel challenge designed to evaluate the knowledge and reasoning capabilities of large language models through crossword clue-answering. The challenge consists of two tasks: a standard question-answering format where the LLM has to solve crossword clues, and a variation of it, where the model receives hints about the word lengths of the answers, which is expected to help models with reasoning abilities. To construct the ECWCA dataset, synthetic clues were generated based on entities and facts extracted from Italian Wikipedia. Generated clues were then selected manually in order to ensure high-quality examples with factually correct and unambiguous clues.