2025
The Invalsi Benchmarks: measuring the Linguistic and Mathematical understanding of Large Language Models in Italian
Giovanni Puccetti | Maria Cassese | Andrea Esuli
Proceedings of the 31st International Conference on Computational Linguistics
While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate generative Large Language Models (LLMs) in this language. This work presents three new benchmarks: Invalsi MATE to evaluate models' performance on mathematical understanding in Italian, Invalsi ITA to evaluate language understanding in Italian, and Olimpiadi MATE for more complex mathematical understanding. The first two benchmarks are based on the Invalsi tests, which are administered to students aged between 6 and 18 within the Italian school system and have been validated by several experts in teaching and pedagogy; the third comes from the Italian high school math Olympics. We evaluate 10 powerful language models on these benchmarks and find that they are bounded by 71% accuracy on Invalsi MATE, achieved by Llama 3.1 70b instruct, and by 88% on Invalsi ITA. For both Invalsi MATE and Invalsi ITA we compare LLMs with the average performance of Italian students and show that Llama 3.1 is the only one to outperform them on Invalsi MATE, while most models do so on Invalsi ITA. We then show that Olimpiadi MATE is more challenging than Invalsi MATE: the highest accuracy, achieved by Llama 3.1 405b instruct, is 45%.
GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human
Yuxia Wang | Artem Shelmanov | Jonibek Mansurov | Akim Tsvigun | Vladislav Mikhailov | Rui Xing | Zhuohan Xie | Jiahui Geng | Giovanni Puccetti | Ekaterina Artemova | Jinyan Su | Minh Ngoc Ta | Mervat Abassy | Kareem Ashraf Elozeiri | Saad El Dine Ahmed El Etter | Maiya Goloburda | Tarek Mahmoud | Raj Vardhan Tomar | Nurkhan Laiyk | Osama Mohammed Afzal | Ryuto Koike | Masahiro Kaneko | Alham Fikri Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)
We present the GenAI Content Detection Task 1, a shared task on binary machine-generated text detection, conducted as part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 27 teams to the Multilingual subtask. We provide a comprehensive overview of the data, a summary of the results (including system rankings and performance scores), detailed descriptions of the participating systems, and an in-depth analysis of submissions.
2024
M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection
Yuxia Wang | Jonibek Mansurov | Petar Ivanov | Jinyan Su | Artem Shelmanov | Akim Tsvigun | Osama Mohammed Afzal | Tarek Mahmoud | Giovanni Puccetti | Thomas Arnold | Alham Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. The need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. In this work, we address this problem by introducing M4GT-Bench, a new benchmark based on a multilingual, multi-domain and multi-generator corpus of MGTs. The benchmark comprises three tasks: (1) mono-lingual and multi-lingual binary MGT detection; (2) multi-way detection, where one needs to identify which particular model generated the text; and (3) mixed human-machine text detection, where a word boundary delimiting MGT from human-written content should be determined. On the developed benchmark, we have tested several MGT detection baselines and also conducted an evaluation of human performance. We see that obtaining good performance in MGT detection usually requires access to training data from the same domain and generators. The benchmark is available at https://github.com/mbzuai-nlp/M4GT-Bench.
AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Giovanni Puccetti | Anna Rogers | Chiara Alzetta | Felice Dell’Orletta | Andrea Esuli
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) are increasingly used as ‘content farm’ models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real ‘content farm’. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts ‘in the wild’, while generating them is too easy. We highlight the urgency of more NLP research on this problem.
You Write like a GPT
Andrea Esuli | Fabrizio Falchi | Marco Malvaldi | Giovanni Puccetti
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
We investigate how Raymond Queneau’s Exercises in Style are evaluated by automatic methods for detection of artificially-generated text. We work with Queneau’s original French version, the Italian translation by Umberto Eco and the English translation by Barbara Wright. We start by comparing how various methods for the detection of automatically generated text, also using different large language models, evaluate the different styles in the work. We then link this automatic evaluation to distinct characteristics related to the content and structure of the various styles. This work is an initial attempt at exploring how methods for detecting artificially-generated text can find application as tools to evaluate the qualities and characteristics of human writing, to support better writing in terms of originality, informativeness, and clarity.
ABRICOT - ABstRactness and Inclusiveness in COntexT: A CALAMITA Challenge
Giovanni Puccetti | Claudia Collacciani | Andrea Amelio Ravelli | Andrea Esuli | Marianna Bolognesi
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
The ABRICOT Task is designed to evaluate Italian language models on their ability to understand and assess the abstractness and inclusiveness of language, two nuanced features that humans naturally convey in everyday communication. Unlike binary categorizations such as abstract/concrete or inclusive/exclusive, these features exist on a continuous spectrum with varying degrees of intensity. The task is based on a manual collection of sentences that present the same noun phrase (NP) in different contexts, allowing its interpretation to vary between the extremes of abstractness and inclusiveness. This challenge aims to verify how LLMs perceive subtle linguistic variations and their implications in natural language.
INVALSI - Mathematical and Language Understanding in Italian: A CALAMITA Challenge
Giovanni Puccetti | Maria Cassese | Andrea Esuli
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
While Italian is a high-resource language, there are few Italian-native benchmarks to evaluate the generative abilities of Language Models (LMs) in this language. This work presents two new benchmarks: Invalsi MATE to evaluate models' performance on mathematical understanding in Italian and Invalsi ITA to evaluate language understanding in Italian. These benchmarks are based on the Invalsi tests, which are administered to students aged between 6 and 18 within the Italian school system. These tests are prepared by expert pedagogists and have the explicit goal of testing average students’ performance over time across Italy. Therefore, the questions are well written, appropriate for the age of the students, and developed with the goal of assessing students’ skills that are essential in the learning process, ensuring that the benchmark proposed here measures key knowledge for undergraduate students. Invalsi MATE is composed of 420 questions about mathematical understanding; these questions range from simple money counting problems to Cartesian geometry questions, e.g. determining if a point belongs to a given line. They are divided into 4 different types: scelta multipla (multiple choice), vero/falso (true/false), numero (number), completa frase (fill the gap). Invalsi ITA is composed of 1279 questions regarding language understanding; these questions involve both the ability to extract information and answer questions about a text passage and questions about grammatical knowledge. They are divided into 4 different types: scelta multipla (multiple choice), binaria (binary), domanda aperta (open question) and altro (other). We evaluate 4 powerful language models, both English-first and tuned for Italian, and find that the best accuracy on Invalsi MATE is 55% while the best accuracy on Invalsi ITA is 80%.
SemEval-2024 Task 8: Multidomain, Multimodel and Multilingual Machine-Generated Text Detection
Yuxia Wang | Jonibek Mansurov | Petar Ivanov | Jinyan Su | Artem Shelmanov | Akim Tsvigun | Osama Mohammed Afzal | Tarek Mahmoud | Giovanni Puccetti | Thomas Arnold
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual track. Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM. Subtask C aims to identify the changing point within a text, at which the authorship transitions from human to machine. The task attracted a large number of participants: subtask A monolingual (126), subtask A multilingual (59), subtask B (70), and subtask C (30). In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For all subtasks, the best systems used LLMs.
2022
Outlier Dimensions that Disrupt Transformers are Driven by Frequency
Giovanni Puccetti | Anna Rogers | Aleksandr Drozd | Felice Dell’Orletta
Findings of the Association for Computational Linguistics: EMNLP 2022
While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequencies of encoded tokens in pre-training data, and that these coefficients also contribute to the “vertical” self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions.
2021
How Do BERT Embeddings Organize Linguistic Knowledge?
Giovanni Puccetti | Alessio Miaschi | Felice Dell’Orletta
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures
Several studies have investigated the linguistic information implicitly encoded in Neural Language Models. Most of these works focused on quantifying the amount and type of information available within their internal representations and across their layers. In line with this scenario, we proposed a different study, based on Lasso regression, aimed at understanding how the information encoded by BERT sentence-level representations is arranged within its hidden units. Using a suite of several probing tasks, we showed the existence of a relationship between the implicit knowledge learned by the model and the number of individual units involved in the encoding of this competence. Moreover, we found that it is possible to identify groups of hidden units more relevant for specific linguistic properties.