2024
A Study on the Soundness of Closed-ended Evaluation of Large Language Models Adapted to the Italian Language
Elio Musacchio | Lucia Siciliani | Pierpaolo Basile | Edoardo Michielon | Marco Pasqualini | Asia Beatrice Uboldi | Giovanni Semeraro
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
With the rising interest in Large Language Models, deep architectures capable of solving a wide range of Natural Language Generation tasks, an increasing number of open weights architectures have been developed and released online. In contrast with older architectures, which were aimed at solving specific linguistic assignments, Large Language Models have shown outstanding capabilities in solving several tasks at once, raising the question of whether they can truly comprehend natural language. Nevertheless, evaluating this kind of capability is far from easy. One of the proposed solutions so far is using benchmarks that combine various types of tasks. This approach is based on the premise that achieving good performance in each of these individual tasks can imply having developed a model capable of understanding language. However, while this assumption is not incorrect, it is evident that it is not sufficient, and the evaluation of Large Language Models still remains an open challenge. In this paper, we conduct a study aimed at highlighting the potential and limitations of current datasets and how a new evaluation setting applied to language-adapted Large Language Models may provide more insight than traditional approaches.
CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian
Giuseppe Attanasio | Pierpaolo Basile | Federico Borazio | Danilo Croce | Maria Francis | Jacopo Gili | Elio Musacchio | Malvina Nissim | Viviana Patti | Matteo Rinaldi | Daniel Scalena
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
The rapid development of Large Language Models (LLMs) has called for robust benchmarks to assess their abilities, track progress, and compare iterations. While existing benchmarks provide extensive evaluations across diverse tasks, they predominantly focus on English, leaving other languages underserved. For Italian, the EVALITA campaigns have provided a long-standing tradition of classification-focused shared tasks. However, their scope does not fully align with the nuanced evaluation required for modern LLMs. To address this gap, we introduce “Challenge the Abilities of LAnguage Models in ITAlian” (CALAMITA), a collaborative effort to create a dynamic and growing benchmark tailored to Italian. CALAMITA emphasizes diversity in task design to test a wide range of LLM capabilities through resources natively developed in Italian by the community. This initiative includes a shared platform, live leaderboard, and centralized evaluation framework. This paper outlines the collaborative process, initial challenges, and evaluation framework of CALAMITA.
ITA-SENSE - Evaluate LLMs’ ability for ITAlian word SENSE disambiguation: A CALAMITA Challenge
Pierpaolo Basile | Elio Musacchio | Lucia Siciliani
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
The challenge is designed to assess LLMs’ abilities in understanding lexical semantics through Word Sense Disambiguation, providing valuable insights into their performance. The idea is to cast the classical Word Sense Disambiguation task as a generative problem along two directions, proposing two tasks: (T1) given a target word and a sentence in which the word occurs, the LLM must generate the correct meaning definition; (T2) given a target word and a sentence in which the word occurs, the LLM should choose the correct meaning definition from a predefined set. For T1, we compare the generated definition with the correct one taken from a sense inventory, adopting metrics that measure the quality of the generated definition, such as ROUGE-L and BERTScore, while for T2 a classical accuracy metric is used. For CALAMITA, we test LLMs in a zero-shot setting.
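A minimal sketch of how the two-task evaluation described above could be scored, assuming parallel lists of model outputs and gold definitions are already available; the metric libraries (rouge-score, bert-score), function names, and variable names are illustrative assumptions, not the official CALAMITA harness.

```python
# Illustrative scoring sketch (not the official CALAMITA evaluation code).
# Assumes: t1_generations / t1_references are parallel lists of definition strings,
# and t2_predictions / t2_gold are parallel lists of chosen-definition indices.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def score_t1(t1_generations, t1_references, lang="it"):
    # ROUGE-L F1 per instance, averaged over the dataset
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    rouge_l = [
        scorer.score(ref, gen)["rougeL"].fmeasure
        for gen, ref in zip(t1_generations, t1_references)
    ]
    # Corpus-level BERTScore F1 (multilingual model selected via lang)
    _, _, f1 = bert_score(t1_generations, t1_references, lang=lang)
    return sum(rouge_l) / len(rouge_l), f1.mean().item()

def score_t2(t2_predictions, t2_gold):
    # Plain accuracy: fraction of instances where the chosen definition matches the gold one
    correct = sum(p == g for p, g in zip(t2_predictions, t2_gold))
    return correct / len(t2_gold)
```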
Leveraging Large Language Models for Spell-Generation in Dungeons & Dragons
Elio Musacchio | Lucia Siciliani | Pierpaolo Basile | Giovanni Semeraro
Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024
Dungeons & Dragons (D&D) is a classic tabletop game with a 50-year history. Its intricate and customizable gameplay allows players to create endless worlds and stories. Due to the highly narrative component of this game, D&D and many other interactive games represent a challenging setting for the Natural Language Generation (NLG) capabilities of LLMs. This paper explores using LLMs to generate new spells, which are one of the most captivating aspects of D&D gameplay. Due to the scarcity of resources available for such a specific task, we build a dataset of 3,259 instances by combining official and fan-made D&D spells. We considered several LLMs in generating spells, which underwent a quantitative and qualitative evaluation. Metrics including Bleu and BertScore were computed for quantitative assessments. Subsequently, we also conducted an in-vivo evaluation with a survey involving D&D players, which could assess the quality of the generated spells as well as their adherence to the rules. Furthermore, the paper emphasizes the open-sourcing of all models, datasets, and findings, aiming to catalyze further research on this topic.