Andrea Esuli


2025

The Invalsi Benchmarks: measuring the Linguistic and Mathematical understanding of Large Language Models in Italian
Giovanni Puccetti | Maria Cassese | Andrea Esuli
Proceedings of the 31st International Conference on Computational Linguistics

While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate generative Large Language Models (LLMs) in this language. This work presents three new benchmarks: Invalsi MATE to evaluate models’ performance on mathematical understanding in Italian, Invalsi ITA to evaluate language understanding in Italian, and Olimpiadi MATE for more complex mathematical understanding. The first two benchmarks are based on the Invalsi tests, which are administered to students aged between 6 and 18 within the Italian school system and have been validated by several experts in teaching and pedagogy; the third comes from the Italian high-school mathematics Olympiad. We evaluate 10 powerful language models on these benchmarks and find that they are capped at 71% accuracy on Invalsi MATE, achieved by Llama 3.1 70b instruct, and at 88% on Invalsi ITA. For both Invalsi MATE and Invalsi ITA we compare LLMs with the average performance of Italian students, showing that Llama 3.1 is the only model to outperform them on Invalsi MATE, while most models do so on Invalsi ITA. We then show that Olimpiadi MATE is more challenging than Invalsi MATE: the highest accuracy, achieved by Llama 3.1 405b instruct, is 45%.

2024

AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Giovanni Puccetti | Anna Rogers | Chiara Alzetta | Felice Dell’Orletta | Andrea Esuli
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) are increasingly used as ‘content farm’ models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a dataset similar to the one used by the real ‘content farm’. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts ‘in the wild’, while generating them is too easy. We highlight the urgency of more NLP research on this problem.
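The log-likelihood detector mentioned in the abstract can be illustrated with a minimal sketch (not the paper’s code): score a text by its average per-token log-probability under a language model and flag texts that the model finds unusually predictable. The model name and threshold below are illustrative placeholders, not values from the paper.

```python
# Minimal sketch of a log-likelihood-based detector for machine-generated text.
# Assumptions: "gpt2" stands in for whatever scoring model is available; the
# threshold must be calibrated on held-out human and machine texts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder scoring model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the scoring model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per token
    return -out.loss.item()

def looks_synthetic(text: str, threshold: float = -3.0) -> bool:
    # Higher (less negative) average log-likelihood -> more "model-like" text.
    # -3.0 is a hypothetical cutoff used only for illustration.
    return avg_log_likelihood(text) > threshold
```

As the abstract notes, this kind of detector presupposes access to token likelihood information, which is exactly what makes it impractical ‘in the wild’.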

You Write like a GPT
Andrea Esuli | Fabrizio Falchi | Marco Malvaldi | Giovanni Puccetti
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)

We investigate how Raymond Queneau’s Exercises in Style are evaluated by automatic methods for the detection of artificially-generated text. We work with Queneau’s original French version, the Italian translation by Umberto Eco, and the English translation by Barbara Wright. We start by comparing how various methods for the detection of automatically generated text, relying on different large language models, evaluate the different styles in the work. We then link this automatic evaluation to distinct characteristics related to the content and structure of the various styles. This work is an initial attempt at exploring how methods for detecting artificially-generated text can find application as tools to evaluate the qualities and characteristics of human writing, to support better writing in terms of originality, informativeness, and clarity.

ABRICOT - ABstRactness and Inclusiveness in COntexT: A CALAMITA Challenge
Giovanni Puccetti | Claudia Collacciani | Andrea Amelio Ravelli | Andrea Esuli | Marianna Bolognesi
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)

The ABRICOT Task is designed to evaluate Italian language models on their ability to understand and assess the abstractness and inclusiveness of language, two nuanced features that humans naturally convey in everyday communication. Unlike binary categorizations such as abstract/concrete or inclusive/exclusive, these features exist on a continuous spectrum with varying degrees of intensity. The task is based on a manual collection of sentences that present the same noun phrase (NP) in different contexts, allowing its interpretation to vary between the extremes of abstractness and inclusiveness. This challenge aims to verify how LLMs perceive subtle linguistic variations and their implications in natural language.

INVALSI - Mathematical and Language Understanding in Italian: A CALAMITA Challenge
Giovanni Puccetti | Maria Cassese | Andrea Esuli
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)

While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate the generative abilities of Language Models (LMs) in this language. This work presents two new benchmarks: Invalsi MATE to evaluate models’ performance on mathematical understanding in Italian and Invalsi ITA to evaluate language understanding in Italian. These benchmarks are based on the Invalsi tests, which are administered to students aged between 6 and 18 within the Italian school system. These tests are prepared by expert pedagogues and have the explicit goal of testing average students’ performance over time across Italy. Therefore, the questions are well written, appropriate for the age of the students, and developed with the goal of assessing skills that are essential in the learning process, ensuring that the benchmark proposed here measures key knowledge for undergraduate students. Invalsi MATE is composed of 420 questions about mathematical understanding, ranging from simple money-counting problems to Cartesian geometry questions, e.g. determining whether a point belongs to a given line. They are divided into 4 different types: scelta multipla (multiple choice), vero/falso (true/false), numero (number), and completa frase (fill the gap). Invalsi ITA is composed of 1279 questions regarding language understanding, involving both the ability to extract information and answer questions about a text passage and questions about grammatical knowledge. They are divided into 4 different types: scelta multipla (multiple choice), binaria (binary), domanda aperta (open question), and altro (other). We evaluate 4 powerful language models, both English-first and tuned for Italian, and find that the best accuracy on Invalsi MATE is 55% while the best accuracy on Invalsi ITA is 80%, as sketched in the accuracy computation below.
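Purely as an illustration of the reported metric (the item fields and labels below are hypothetical, not the benchmark’s released format), this is how accuracy over Invalsi-style multiple-choice items could be computed once model answers have been collected:

```python
# Sketch of exact-match accuracy over multiple-choice (scelta multipla) items.
# The Item structure and the toy questions are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: list[str]  # option texts, mapped to labels A, B, C, D
    gold: str           # gold answer label, e.g. "B"

def accuracy(items: list[Item], predictions: list[str]) -> float:
    """Fraction of items whose predicted label matches the gold label."""
    assert len(items) == len(predictions)
    correct = sum(p.strip().upper() == it.gold.strip().upper()
                  for it, p in zip(items, predictions))
    return correct / len(items) if items else 0.0

# Toy example: three items, two answered correctly -> accuracy ~0.67
items = [Item("2+2?", ["3", "4", "5", "6"], "B"),
         Item("Capital of Italy?", ["Rome", "Milan", "Turin", "Bari"], "A"),
         Item("5*3?", ["8", "15", "53", "35"], "B")]
print(accuracy(items, ["B", "A", "C"]))
```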

2016

ISTI-CNR at SemEval-2016 Task 4: Quantification on an Ordinal Scale
Andrea Esuli
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mining
Salud M. Jiménez Zafra | Giacomo Berardi | Andrea Esuli | Diego Marcheggiani | María Teresa Martín-Valdivia | Alejandro Moreo Fernández
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2010

SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining
Stefano Baccianella | Andrea Esuli | Fabrizio Sebastiani
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this work we present SENTIWORDNET 3.0, a lexical resource explicitly devised for supporting sentiment classification and opinion mining applications. SENTIWORDNET 3.0 is an improved version of SENTIWORDNET 1.0, a lexical resource publicly available for research purposes, currently licensed to more than 300 research groups and used in a variety of research projects worldwide. Both SENTIWORDNET 1.0 and 3.0 are the result of automatically annotating all WORDNET synsets according to their degrees of positivity, negativity, and neutrality. SENTIWORDNET 1.0 and 3.0 differ (a) in the versions of WORDNET which they annotate (WORDNET 2.0 and 3.0, respectively), and (b) in the algorithm used for automatically annotating WORDNET, which now includes (in addition to the previous semi-supervised learning step) a random-walk step for refining the scores. We here discuss SENTIWORDNET 3.0, especially focussing on the improvements concerning aspect (b) that it embodies with respect to version 1.0. We also report the results of evaluating SENTIWORDNET 3.0 against a fragment of WORDNET 3.0 manually annotated for positivity, negativity, and neutrality; these results indicate accuracy improvements of about 20% with respect to SENTIWORDNET 1.0.
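For readers who want to inspect the resource, here is a minimal sketch (not from the paper) of querying SentiWordNet 3.0 through NLTK’s corpus reader, one of the channels through which it is distributed; the example synset and word are arbitrary.

```python
# Query SentiWordNet 3.0 via NLTK: each synset carries positivity, negativity
# and objectivity scores that sum to 1.
import nltk
nltk.download("sentiwordnet", quiet=True)
nltk.download("wordnet", quiet=True)
from nltk.corpus import sentiwordnet as swn

syn = swn.senti_synset("good.a.01")
print(syn.pos_score(), syn.neg_score(), syn.obj_score())  # the three scores sum to 1

# Scores for all senses of a word:
for s in swn.senti_synsets("estimable"):
    print(s.synset.name(), s.pos_score(), s.neg_score(), s.obj_score())
```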

ISTI@SemEval-2 Task 8: Boosting-Based Multiway Relation Classification
Andrea Esuli | Diego Marcheggiani | Fabrizio Sebastiani
Proceedings of the 5th International Workshop on Semantic Evaluation

2008

Annotating Expressions of Opinion and Emotion in the Italian Content Annotation Bank
Andrea Esuli | Fabrizio Sebastiani | Ilaria Urciuoli
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we describe the result of manually annotating I-CAB, the Italian Content Annotation Bank, with expressions of private state (EPSs), i.e., expressions that denote the presence of opinions, emotions, and other cognitive states. The aim of this effort was the generation of a standard resource for supporting the development of opinion extraction algorithms for Italian, and of a benchmark for testing such algorithms. To this end we have employed a previously existing annotation language (here dubbed WWC, from the initials of its proponents). We describe the results of this annotation effort, including the results of a thorough inter-annotator agreement test. We conclude by discussing how WWC can be adapted to the specificities of a Romance language such as Italian.
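One common way to quantify inter-annotator agreement of the kind reported in the abstract is Cohen’s kappa over the labels two annotators assigned to the same spans; the following is a hedged sketch, not the paper’s actual procedure, and the label values are hypothetical.

```python
# Sketch: Cohen's kappa between two annotators' labels on the same ten spans.
# Labels are illustrative ("EPS" = expression of private state, "none" = not one).
from sklearn.metrics import cohen_kappa_score

annotator_a = ["EPS", "EPS", "none", "EPS", "none", "none", "EPS", "EPS", "none", "EPS"]
annotator_b = ["EPS", "none", "none", "EPS", "none", "EPS", "EPS", "EPS", "none", "EPS"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```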

2007

PageRanking WordNet Synsets: An Application to Opinion Mining
Andrea Esuli | Fabrizio Sebastiani
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

Determining Term Subjectivity and Term Orientation for Opinion Mining
Andrea Esuli | Fabrizio Sebastiani
11th Conference of the European Chapter of the Association for Computational Linguistics

SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining
Andrea Esuli | Fabrizio Sebastiani
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Opinion mining (OM) is a recent subdiscipline at the crossroads of information retrieval and computational linguistics which is concerned not with the topic a document is about, but with the opinion it expresses. OM has a rich set of applications, ranging from tracking users’ opinions about products or about political candidates as expressed in online forums, to customer relationship management. In order to aid the extraction of opinions from text, recent research has tried to automatically determine the “PN-polarity” of subjective terms, i.e. identify whether a term that is a marker of opinionated content has a positive or a negative connotation. Research on determining whether a term is indeed a marker of opinionated content (a subjective term) or not (an objective term) has instead been much scarcer. In this work we describe SENTIWORDNET, a lexical resource in which each WORDNET synset s is associated with three numerical scores Obj(s), Pos(s) and Neg(s), describing how objective, positive, and negative the terms contained in the synset are. The method used to develop SENTIWORDNET is based on the quantitative analysis of the glosses associated to synsets, and on the use of the resulting vectorial term representations for semi-supervised synset classification. The three scores are derived by combining the results produced by a committee of eight ternary classifiers, all characterized by similar accuracy levels but different classification behaviour. SENTIWORDNET is freely available for research purposes, and is endowed with a Web-based graphical user interface.