Andrea Zugarini

2026

PharmaQA.IT: an Italian dataset for Q&A in the pharmaceutical domain
Kamyar Zeinalipour | Andrea Zugarini | Asya Zanollo | Leonardo Rigutini
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

The growing use of Large Language Models (LLMs) for medical Question Answering (QA) requires reliable, evidence-grounded benchmarks beyond English. In Italy, Riassunti delle Caratteristiche del Prodotto (RCP) issued by the Italian Medicines Agency (AIFA) are the main regulatory source on medicines, yet no QA dataset exists on these documents, limiting the development and evaluation of trustworthy Italian QA systems. We introduce PharmaQA.IT, an Italian extractive QA dataset built from RCPs in PharmaER.IT. Using a semi-automatic pipeline, we (i) select informative pages from 1,077 leaflets, (ii) prompt a multimodal LLM on page images with professional personas to generate candidate question–answer pairs, and (iii) validate and normalise them with expert revision. The final dataset contains 861 high-quality question–answer pairs on indications, contraindications, dosage, warnings, interactions, and pharmacological properties. We frame PharmaQA.IT as an extractive QA benchmark with structured JSON outputs and evaluate a range of open and proprietary LLMs. Results show that open models approach closed-source performance under a chunking-and-retrieval setup. PharmaQA.IT, together with all code, prompts, and evaluation scripts, is publicly available on Hugging Face to support research on trustworthy Italian biomedical QA.

2025

Investigating Large Language Models for Text-to-SPARQL Generation
Jacopo D’Abramo | Andrea Zugarini | Paolo Torroni
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing

Large Language Models (LLMs) have demonstrated strong capabilities in code generation, such as translating natural language questions into SQL queries. However, state-of-the-art solutions often involve a costly fine-tuning step. In this study, we extensively evaluate In-Context Learning (ICL) solutions for text-to-SPARQL generation with different architectures and configurations, based on methods for retrieving relevant demonstrations for few-shot prompting and working with multiple generated hypotheses. In this way, we demonstrate that LLMs can formulate SPARQL queries achieving state-of-the-art results on several Knowledge Graph Question Answering (KGQA) benchmark datasets without fine-tuning.

PharmaER.IT: An Italian Dataset for Entity Recognition in the Pharmaceutical Domain
Andrea Zugarini | Leonardo Rigutini
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

2024

SLIMER-IT: Zero-Shot NER on Italian Language
Andrew Zamai | Leonardo Rigutini | Marco Maggini | Andrea Zugarini
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Traditional approaches to Named Entity Recognition (NER) frame the task as a BIO sequence labeling problem. Although these systems often excel in the downstream task at hand, they require extensive annotated data and struggle to generalize to out-of-distribution input domains and unseen entity types. In contrast, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities. While several works address Zero-Shot NER in English, little has been done in other languages. In this paper, we define an evaluation framework for Zero-Shot NER, applying it to the Italian language. Furthermore, we introduce SLIMER-IT, the Italian version of SLIMER, an instruction-tuning approach for zero-shot NER leveraging prompts enriched with definitions and guidelines. Comparisons with other state-of-the-art models demonstrate the superiority of SLIMER-IT on never-seen-before entity tags.

Clue-Instruct: Text-Based Clue Generation for Educational Crossword Puzzles
Andrea Zugarini | Kamyar Zeinalipour | Surya Sai Kadali | Marco Maggini | Marco Gori | Leonardo Rigutini
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Crossword puzzles are popular linguistic games often used as tools to engage students in learning. Educational crosswords are characterized by less cryptic and more factual clues that distinguish them from traditional crossword puzzles. Although several publicly available clue-answer pair databases exist for traditional crosswords, educational clue-answer pair datasets are missing. In this article, we propose a methodology to build educational clue generation datasets that can be used to instruct Large Language Models (LLMs). By gathering from Wikipedia pages informative content associated with relevant keywords, we use Large Language Models to automatically generate pedagogical clues related to the given input keyword and its context. With such an approach, we created clue-instruct, a dataset containing 44,075 unique examples with text-keyword pairs associated with three distinct crossword clues. We used clue-instruct to instruct different LLMs to generate educational clues from a given input content and keyword. Both human and automatic evaluations confirmed the quality of the generated clues, thus validating the effectiveness of our approach.

ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge
Andrea Zugarini | Kamyar Zeinalipour | Achille Fusco | Asya Zanollo
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

This paper presents ECWCA (Educational CrossWord Clues Answering), a novel challenge designed to evaluate the knowledge and reasoning capabilities of large language models through crossword clue-answering. The challenge consists of two tasks: a standard question-answering format where the LLM has to solve crossword clues, and a variation of it, where the model receives hints about the word lengths of the answers, which is expected to help models with reasoning abilities. To construct the ECWCA dataset, synthetic clues were generated based on entities and facts extracted from Italian Wikipedia. Generated clues were then selected manually in order to ensure high-quality examples with factually correct and unambiguous clues.

Task-Incremental Learning on Long Text Sequences
Natalia Graziuso | Andrea Zugarini | Stefano Melacci
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

The extraordinary results achieved by Large Language Models are paired with issues that are critical in real-world applications. The costs of inference and, in particular, training are extremely large, both in terms of time and computational resources, and they become prohibitive when working in dynamic environments, where data and tasks are progressively provided over time. The model must be able to adapt to new knowledge, new domains, and new settings, without forgetting the previously learned skills. Retraining from scratch easily becomes too costly, thus Continual Learning strategies are of crucial importance. This is even more evident when data consist of “long” documents, which require substantial resources to be processed by modern neural models, leading to very long prompts. This paper investigates LLM-based Task-Incremental Learning in the case of tasks exploiting long sequences of text, as is typical in summarization, question-answering on long documents, reviewing long contracts, and several others. We show how adapting the model by Task Arithmetic with LoRA, which was proposed for visual data, yields promising results also in the case of such “long” text data. To the best of our knowledge, this is the first work along this challenging direction. The outcome of this investigation is generic enough to represent an important starting point for further research in processing linguistic data in every language.

2023

BUSTER: a “BUSiness Transaction Entity Recognition” dataset
Andrea Zugarini | Andrew Zamai | Marco Ernandes | Leonardo Rigutini
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

Although Natural Language Processing has seen major breakthroughs in the last few years, transferring such advances into real-world business cases can be challenging. One of the reasons lies in the mismatch between popular benchmarks and actual data. Lack of supervision, unbalanced classes, noisy data and long documents often affect real problems in vertical domains such as finance, law and health. To support industry-oriented research, we present BUSTER, a BUSiness Transaction Entity Recognition dataset. The dataset consists of 3779 manually annotated documents on financial transactions. We establish several baselines exploiting both general-purpose and domain-specific language models. The best performing model is also used to automatically annotate 6196 documents, which we release as an additional silver corpus to BUSTER.

Multi-word Tokenization for Sequence Compression
Leonidas Gee | Leonardo Rigutini | Marco Ernandes | Andrea Zugarini
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

Large Language Models have proven highly successful at modelling a variety of tasks. However, this comes at a steep computational cost that hinders wider industrial uptake. In this paper, we present MWT: a Multi-Word Tokenizer that goes beyond word boundaries by representing frequent multi-word expressions as single tokens. MWTs produce a more compact and efficient tokenization that yields two benefits: (1) Increase in performance due to a greater coverage of input data given a fixed sequence length budget; (2) Faster and lighter inference due to the ability to reduce the sequence length with negligible drops in performance. Our results show that MWT is more robust across shorter sequence lengths, thus allowing for major speedups via early sequence truncation.

Are Compressed Language Models Less Subgroup Robust?
Leonidas Gee | Andrea Zugarini | Novi Quadrianto
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

To reduce the inference cost of large language models, model compression is increasingly used to create smaller scalable models. However, little is known about their robustness to minority subgroups defined by the labels and attributes of a dataset. In this paper, we investigate the effects of 18 different compression methods and settings on the subgroup robustness of BERT language models. We show that worst-group performance does not depend on model size alone, but also on the compression method used. Additionally, we find that model compression does not always worsen the performance on minority subgroups. Altogether, our analysis serves to further research into the subgroup robustness of model compression.

2022

Fast Vocabulary Transfer for Language Model Compression
Leonidas Gee | Andrea Zugarini | Leonardo Rigutini | Paolo Torroni
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Real-world business applications require a trade-off between language model performance and size. We propose a new method for model compression that relies on vocabulary transfer. We evaluate the method on various vertical domains and downstream tasks. Our results indicate that vocabulary transfer can be effectively used in combination with other compression techniques, yielding a significant reduction in model size and inference time while marginally compromising on performance.

2021

A Multi-Strategy Approach to Crossword Clue Answer Retrieval and Ranking
Andrea Zugarini | Marco Ernandes
Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021)

2020

Vulgaris: Analysis of a Corpus for Middle-Age Varieties of Italian Language
Andrea Zugarini | Matteo Tiezzi | Marco Maggini
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Italian is a Romance language that has its roots in Vulgar Latin. The birth of modern Italian started in Tuscany around the 14th century, and it is mainly attributed to the works of Dante Alighieri, Francesco Petrarca and Giovanni Boccaccio, who are among the most acclaimed authors of the medieval age in Tuscany. However, Italy has been characterized by a high variety of dialects, which are often loosely related to each other, due to the past fragmentation of the territory. Italian has absorbed influences from many of these dialects, as well as from other languages due to the dominion of portions of the country by other nations, such as Spain and France. In this work we present Vulgaris, a project aimed at studying a corpus of Italian textual resources from authors of different regions, spanning the period between 1200 and 1600. Each composition is associated with its author, and authors are also grouped in families, i.e., groups sharing similar stylistic/chronological characteristics. Hence, the dataset is not only a valuable resource for studying the diachronic evolution of Italian and the differences between its dialects, but it is also useful for investigating stylistic differences between single authors. We provide a detailed statistical analysis of the data, and a corpus-driven study in dialectology and diachronic varieties.
