Leonardo Ranaldi - ACL Anthology

Leonardo Ranaldi

2025

R2-MultiOmnia: Leading Multilingual Multimodal Reasoning via Self-Training
Leonardo Ranaldi | Federico Ranaldi | Giulia Pucci
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reasoning is an intricate process that transcends both language and vision; yet, despite its inherently modality-agnostic nature, develop-ing effective multilingual and multimodal reasoning capabilities remains a substantial challenge for Multimodal Large Language Models (MLLMs). They struggle to activate complex reasoning behaviours, delivering step-wise explanation, questioning and reflection, particularly in multilingual settings where high-quality supervision across languages is lacking. Recent works have introduced eclectic strategies to enhance MLLMs’ reasoning; however, they remain related to a single language.To make MLLMs’ reasoning capabilities aligned among languages and improve modality performances, we propose R2-MultiOmnia, a modular approach that instructs the models to abstract key elements of the reasoning process and then refine reasoning trajectories via self-correction. Specifically, we instruct the models producing multimodal synthetic resources by bridging modalities and then self-improving their capabilities. To stabilise learning and the reasoning processes structure, we propose Curriculum Learning Reasoning Stabilisation with structured output rewards to gradually refine the models’ capabilities to learn and deliver robust reasoning processes. Experiments show that R2-MultiOmnia improves multimodal reasoning, gets aligned performances among the languages approaching strong models.

Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions
Leonardo Ranaldi | Marco Valentino | Andre Freitas
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Chain-of-Though (CoT) represents a common strategy for reasoning in Large Language Models (LLMs) by decomposing complex tasks into intermediate inference steps. However, explanations generated via CoT are susceptible to content biases that negatively affect their robustness and faithfulness. To mitigate existing limitations, recent work has proposed using logical formalisms coupled with external symbolic solvers. However, fully symbolic approaches possess the bottleneck of requiring a complete translation from natural language to formal languages, a process that affects efficiency and flexibility. To achieve a trade-off, this paper investigates methods to disentangle content from logical reasoning without a complete formalisation. In particular, we present QuaSAR (for Quasi-Symbolic Abstract Reasoning), a variation of CoT that guides LLMs to operate at a higher level of abstraction via quasi-symbolic explanations. Our framework leverages the capability of LLMs to formalise only relevant variables and predicates, enabling the coexistence of symbolic elements with natural language. We show the impact of QuaSAR for in-context learning and for constructing demonstrations to improve the reasoning capabilities of smaller models. Our experiments show that quasi-symbolic abstractions can improve CoT-based methods by up to 8% accuracy, enhancing robustness and consistency on challenging adversarial variations on both natural language (i.e. MMLU-Redux) and symbolic reasoning tasks (i.e., GSM-Symbolic).

No Longer Left Behind: Self-training Reasoning Models in Italian
Federico Ranaldi | Leonardo Ranaldi
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations
Leonardo Ranaldi | Federico Ranaldi | Fabio Massimo Zanzotto | Barry Haddow | Alexandra Birch
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Retrieval-augmented generation (RAG) is key to improving large language models (LLMs) in systematically accessing richer factual knowledge. Yet, using RAG mechanisms brings intrinsic challenges, as LLMs must deal with conflicting knowledge, especially in multilingual retrieval, where the heterogeneity of knowledge retrieved may deliver different outlooks. To make RAG more analytical, critical and grounded, we introduce Dialectic-RAG (D-RAG), a modular approach guided by Argumentative Explanations, i.e., structured reasoning process that systematically evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives. Given a query and a set of multilingual related documents, D-RAG selects and exemplifies relevant knowledge for delivering dialectic explanations that, by critically weighing opposing arguments and filtering extraneous content, clearly determine the final response. We show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models. Our experiments demonstrate that D-RAG significantly improves RAG approaches, requiring low-impact computational effort and providing robustness to knowledge perturbations.

Advancing Oversight Reasoning across Languages for Audit Sycophantic Behaviour via X-Agent
Giulia Pucci | Leonardo Ranaldi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have demonstrated capabilities that are highly satisfactory to a wide range of users by adapting to their culture and wisdom. Yet, this can translate into a propensity to produce responses that align with users’ viewpoints, even when the latter are wrong. This behaviour is known as sycophancy, the tendency of LLMs to generate misleading responses as long as they align with the user’s, inducing bias and reducing reliability. To make interactions consistent, reliable and safe, we introduce X-Agent, an Oversight Reasoning framework that audits human–LLM dialogues, reasons about them, captures sycophancy and corrects the final outputs. Concretely, X-Agent extends debate-based frameworks by (i) auditing human-LLM conversations, (ii) applying a defence layer that steers model behaviour and goes beyond user beliefs, and (iii) extracting reasoning traces from evaluations that serve as training signals for mitigating sycophancy. We evaluate X-Agent across diverse scenarios and languages, showing that it consistently detects sycophancy, reduces unwarranted agreement, and improves cross-turn consistency, advancing a reasoning-as-overview paradigm for safer LLM interaction. Our approach introduces a novel paradigm in which reasoning is not merely a means to solve problems, but as a mechanism for overseeing the problem-solving processes of other models.

Exploring Backward Reasoning in Large Language Models
Leonardo Ranaldi | Giulia Pucci
Findings of the Association for Computational Linguistics: NAACL 2025

Multi-step reasoning through in-context learning strategies have been extensively explored, highlighting the abilities of Large Language Models (LLMs) to generate answers derived from step-by-step reasoning. These studies focus the attention on LLMs’ forward reasoning abilities epitomised in a series of general premises leading to a final solution. In this paper, by taking the reverse perspective, we study the backward reasoning abilities of LLMs, namely the inference that leads to the causal hypothesis. Behind formalising the backward problems, we analyse whether the LLMs are able to reason about the conclusion and reconstruct the original question that led to the delivery of the final answer. Operating with question-answering tasks involving symbolic reasoning, understanding, and commonsense abilities, we observe that the proposed models reveal robust comprehension capabilities managing different kinds of input; however, they are not always able to reason in the backward direction. Finally, to challenge this limitation, we demonstrate that instructing LLMs to generate the answer by reconsidering the structure of the problem allows for improved backward reasoning direction.

When natural language is not enough: The limits of in-context learning demonstrations in multilingual reasoning
Leonardo Ranaldi | Barry Haddow | Alexandra Birch
Findings of the Association for Computational Linguistics: NAACL 2025

Previous studies have demonstrated the effectiveness of reasoning methods in eliciting multi-step reasoned answers from Large Language Models (LLMs) by leveraging in-context demonstrations. These methods, exemplified by Chain-of-Thought (CoT) and Program-Aided Language Models (PAL), have been shown to perform well in monolingual contexts, primarily in English. There has, however, been limited exploration of their abilities in other languages.To gain a deeper understanding of the role of reasoning methods for in-context demonstrations, we investigate how well CoT and PAL perform across languages for arithmetic and symbolic reasoning tasks. Our findings indicate that the effectiveness of reasoning methods varies significantly across different languages and models. Specifically, CoT, which relies on natural language demonstrations, tends to be more accurate in high-resource than in low-resource languages. Conversely, the structured nature of PAL demonstrations facilitates multilingual comprehension, enabling LLMs to generate programmatic answers in both high- and low-resource languages and leading to significant performance improvements over CoT concerning the accuracy of the generated responses.

Position Paper: MeMo: Towards Language Models with Associative Memory Mechanisms
Fabio Massimo Zanzotto | Elena Sofia Ruzzetti | Giancarlo A. Xompero | Leonardo Ranaldi | Davide Venditti | Federico Ranaldi | Cristina Giannone | Andrea Favalli | Raniero Romagnoli
Findings of the Association for Computational Linguistics: ACL 2025

Memorization is a fundamental ability of Transformer-based Large Language Models, achieved through learning. In this position/theory paper, we propose a paradigm shift by designing an architecture to memorize text directly, bearing in mind the principle that memorization precedes learning. We introduce MeMo, a novel architecture for language modeling that explicitly memorizes sequences of tokens in layered associative memories. By design, MeMo offers transparency and the possibility of model editing, including forgetting texts. We experimented with the MeMo architecture, showing the memorization power of the one-layer and the multi-layer configurations.

Proceedings of The 3rd Workshop on Mathematical Natural Language Processing (MathNLP 2025)
Marco Valentino | Deborah Ferreira | Mokanarangan Thayaparan | Leonardo Ranaldi | Andre Freitas
Proceedings of The 3rd Workshop on Mathematical Natural Language Processing (MathNLP 2025)

Eliciting Critical Reasoning in Retrieval-Augmented Generation via Contrastive Explanations
Leonardo Ranaldi | Marco Valentino | Andre Freitas
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Retrieval-augmented generation (RAG) have emerged as a critical mechanism in contemporary NLP to support Large Language Models (LLMs) in systematically accessing richer factual context. However, the integration of RAG mechanisms bring its inherent challenges, as LLMs need to integrate potentially noisy contexts. Recent studies have shown that LLMs still struggle to critically analyse RAG-based in-context information, a limitation that may lead to incorrect inferences and hallucinations. In this paper, we investigate how to elicit critical arguments in RAG via contrastive explanations. In particular, we propose Contrastive-RAG (CRAG), a framework that (i) retrieves relevant documents given a query,(ii) selects and exemplifies relevant passages, and (iii) generates explanations that explicitly contrast the relevance of the passages to (iv) support the final answer. We show the impact of C-RAG building contrastive reasoning demonstrations from LLMs to instruct smaller models for retrieval-augmented tasks. Extensive experiments demonstrate that CRAG improves state-of-the-art RAG models while (a) requiring significantly fewer prompts and demonstrations and (b) being robust to perturbations in the retrieved documents.

Multilingual Reasoning via Self-training
Leonardo Ranaldi | Giulia Pucci
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Although reasoning is innately language-agnostic, the multilingual capacities remains a significant challenge for large language models (LLMs). Their ability to generate structured, step-wise explanations is constantly restricted to dominant languages in pre-training data, making cross-lingual generalisation difficult and hindering broader global adoption. Recent works have introduced eclectic strategies to improve reasoning beyond English; however, these methods remain related to specific language that is not always optimal for reasoning.To improve LLMs’ multilingual reasoning abilities, we propose a modular approach that instructs the models to structure reasoning passages in a different problem space and then self-refine their capabilities to deliver step-wise reasoning passages that lead to the solution. Experiments show that our approach stably achieves significant improvements in the multilingual reasoning of various models and task, with improved reasoning consistency across languages.

2024

Measuring Bias in Instruction-Following Models with ItaP-AT for the Italian Language
Dario Onorati | Davide Venditti | Elena Sofia Ruzzetti | Federico Ranaldi | Leonardo Ranaldi | Fabio Massimo Zanzotto
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Instruction-Following Language Models (IFLMs) are the state-of-the-art for solving many downstream tasks. Given their widespread use, there is an urgent need to measure whether the sentences they generate contain toxic information or social biases. In this paper, we propose Prompt Association Test for the Italian language (ItaP-AT): a new resource for testing the presence of social bias in different domains in IFLMs. This work also aims to understand whether it is possible to make the responses of these models more fair by using context learning, using “one-shot anti-stereotypical prompts”.

The limits of Italian in Reasoning Tasks
Leonardo Ranaldi | Giulia Pucci | Federico Ranaldi | Elena Sofia Ruzzetti | Fabio Massimo Zanzotto
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Previous studies have demonstrated the effectiveness of reasoning methods in eliciting multi-step reasoned answers from Large Language Models (LLMs) by leveraging in-context demonstrations. These methods, exemplified by Chain-of-Thought (CoT) and Program-Aided Language Models (PAL), have been shown to reason well in monolingual contexts, primarily in English. There has, however, been limited exploration of their abilities in other languages, especially in Italian.To gain a deeper understanding of the role of reasoning methods in in-context demonstrations, we propose a multidimensional analysis tailored to Italian, focusing on arithmetic and symbolic reasoning tasks. Our findings indicate that the effectiveness of reasoning methods varies significantly beyond English. Specifically, CoT, which relies on natural language demonstrations, is limited to English. Conversely, the structured nature of PAL in-context demonstrations facilitates multilingual comprehension, enabling LLMs to generate programmatic answers in Italian as well. Finally, for a more comprehensive overview, we observe that additional alignment methods do not improve downstream performances; in contrast, in some cases, they limit the abilities of the original models. This leads to significant improvements in the accuracy and quality of the generated responses.

How Far Does the Sequence of Compositions Impact Multilingual Pre-Training?
Leonardo Ranaldi | Giulia Pucci | Fabio Massimo Zanzotto
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

The most efficient strategy for conducting pre-training of language models is the concatenation of contiguous sequences of text of fixed length through causal masking that estimates the probability of each token given its context.However, the role of the composition sequence pre-training technique in the models’ generalization properties has yet to be explored.In this paper, we show that operating via causal masking impacts model performance because it could include misleading information from previous text sequences during pre-training.To fill this gap, we propose intra-context causal masking where the probability of each token is conditional only on the previous in the same chunk of text, avoiding misleading information from different contexts.Hence, we demonstrate that organizing text chunks based on a policy that aligns with text similarity effectively reduces the risk of misleading context during pre-training by enhancing language models’ in-context learning and factual knowledge storage capabilities while maintaining efficiency.

Assessing the Asymmetric Behaviour of Italian Large Language Models across Different Syntactic Structures
Elena Sofia Ruzzetti | Federico Ranaldi | Dario Onorati | Davide Venditti | Leonardo Ranaldi | Tommaso Caselli | Fabio Massimo Zanzotto
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

While LLMs get more proficient at solving tasks and generating sentences, we aim to investigate the role that differentsyntactic structures have on models’ performances on a battery of Natural Language Understanding tasks. We analyze theperformance of five LLMs on semantically equivalent sentences that are characterized by different syntactic structures. Tocorrectly solve the tasks, a model is implicitly required to correctly parse the sentence. We found out that LLMs strugglewhen there are more complex syntactic structures, with an average drop of 16.13(±11.14) points in accuracy on Q&A task.Additionally, we propose a method based on token attribution to spot which area of the LLMs encode syntactic knowledge,by identifying model heads and layers responsible for the generation of a correct answer

Termite Italian Text-to-SQL: A CALAMITA Challenge
Federico Ranaldi | Elena Sofia Ruzzetti | Dario Onorati | Fabio Massimo Zanzotto | Leonardo Ranaldi
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

We introduce Termite, which is a definitely unseen resource for evaluating Text-to-SQL in Italian. Specifically,we transfer evaluation pipelines beyond English, proposing novel, definitely unseen resources that avoid data-contamination phenomena while assessing the ability of models to perform Text-to-SQL tasks when natural language queries are written in Italian. We establish an evaluation grid based on execution accuracy.

Aligning Large and Small Language Models via Chain-of-Thought Reasoning
Leonardo Ranaldi | Andre Freitas
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Chain-of-Thought (CoT) prompting empowersthe reasoning abilities of Large Language Models (LLMs), eliciting them to solve complexreasoning tasks in a step-wise manner. However, these capabilities appear only in models with billions of parameters, which represent an entry barrier for many users who are constrained to operate on a smaller model scale, i.e., Small Language Models (SLMs). Although many companies are releasing LLMs of the same family with fewer parameters, these models tend not to preserve all the reasoning capabilities of the original models, including CoT reasoning.In this paper, we propose a method for aligning and transferring reasoning abilities between larger to smaller Language Models. By using an Instruction-tuning-CoT method, that is, an Instruction-tuning designed around CoT-Demonstrations, we enable the SLMs to generate multi-step controlled reasoned answers when they are elicited with the CoT mechanism. Hence, we instruct a smaller Language Model using outputs generated by more robust models belonging to the same family or not, evaluating the impact across different types of models. Results obtained on question-answering and mathematical reasoning benchmarks show that LMs instructed via the Instruction-tuning CoT method produced by LLMs outperform baselines within both in-domain and out-domain scenarios.

Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models
Leonardo Ranaldi | Andre Freitas
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The alignment of reasoning abilities between smaller and larger Language Models are largely conducted via supervised fine-tuning using demonstrations generated from robust Large Language Models (LLMs). Although these approaches deliver more performant models, they do not show sufficiently strong generalization ability as the training only relies on the provided demonstrations.In this paper, we propose the Self-refine Instruction-tuning method that elicits Smaller Language Models to self-improve their abilities.Our approach is based on a two-stage process, where reasoning abilities are first transferred between LLMs and Small Language Models (SLMs) via Instruction-tuning on synthetic demonstrations provided by LLMs, and then the instructed models self-improve their abilities through preference optimization strategies.In particular, the second phase operates refinement heuristics based on Direct Preference Optimization, where the SLMs are elicited to deliver a series of reasoning paths by automatically sampling the generated responses and providing rewards using ground truths from the LLMs.Results obtained on commonsense and math reasoning tasks show that this approach consistently outperforms Instruction-tuning in both in-domain and out-domain scenarios, aligning the reasoning abilities of Smaller and Larger language models.

Empowering Multi-step Reasoning across Languages via Program-Aided Language Models
Leonardo Ranaldi | Giulia Pucci | Barry Haddow | Alexandra Birch
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

In-context learning methods are popular inference strategies where Large Language Models (LLMs) are elicited to solve a task using provided demonstrations without parameter updates. Among these approaches are the reasoning methods, best exemplified by Chain-of-Thought (CoT) and Program-Aided Language Models (PAL), which elicit LLMs to generate reasoning paths, thus promoting accuracy and attracting increasing attention. However, despite the success of these methods, the ability to deliver multi-step reasoning remains limited to a single language, making it challenging to generalize to other languages and hindering global development.In this work, we propose Cross-lingual Program-Aided Language Models (CrossPAL), a method for aligning reasoning programs across languages. In particular, our method delivers programs as intermediate reasoning steps in different languages through a double-step cross-lingual prompting mechanism inspired by the Program-Aided approach. In addition, we introduce Self-consistent CrossPAL (SCrossPAL) to ensemble different reasoning paths across languages. Our experimental evaluations show that our method significantly outperforms existing prompting methods, reducing the number of interactions and achieving state-of-the-art performance.

A Tree-of-Thoughts to Broaden Multi-step Reasoning across Languages
Leonardo Ranaldi | Giulia Pucci | Federico Ranaldi | Elena Sofia Ruzzetti | Fabio Massimo Zanzotto
Findings of the Association for Computational Linguistics: NAACL 2024

Reasoning methods, best exemplified by the well-known Chain-of-Thought (CoT), empower the reasoning abilities of Large Language Models (LLMs) by eliciting them to solve complex tasks in a step-by-step manner. Although they are achieving significant success, the ability to deliver multi-step reasoning remains limited to English because of the imbalance in the distribution of pre-training data, which makes other languages a barrier. In this paper, we propose Cross-lingual Tree-of-Thoughts (Cross-ToT), a method for aligning Cross-lingual CoT reasoning across languages. The proposed method, through a self-consistent cross-lingual prompting mechanism inspired by the Tree-of-Thoughts approach, provides multi-step reasoning paths in different languages that, during the steps, lead to the final solution. Experimental evaluations show that our method significantly outperforms existing prompting methods by reducing the number of interactions and achieving state-of-the-art performance.

Empowering cross-lingual abilities of instruction-tuned large language models by translation-following demonstrations
Leonardo Ranaldi | Giulia Pucci | Andre Freitas
Findings of the Association for Computational Linguistics: ACL 2024

The language ability of Large Language Models (LLMs) is often unbalanced towards English because of the imbalance in the distribution of the pre-training data. This disparity is demanded in further fine-tuning and affecting the cross-lingual abilities of LLMs. In this paper, we propose to empower Instruction-tuned LLMs (It-LLMs) in languages other than English by building semantic alignment between them. Hence, we propose CrossAlpaca, an It-LLM with cross-lingual Instruction-following and Translation-following demonstrations to improve semantic alignment between languages. We validate our approach on the multilingual Question Answering (QA) benchmarks XQUAD and MLQA and adapted versions of MMLU and BBH.Our models, tested over six different languages, outperform the It-LLMs tuned on monolingual data. The final results show that instruction tuning on non-English data is not enough and that semantic alignment can be further improved by Translation-following demonstrations.

Investigating the Impact of Data Contamination of Large Language Models in Text-to-SQL translation
Federico Ranaldi | Elena Sofia Ruzzetti | Dario Onorati | Leonardo Ranaldi | Cristina Giannone | Andrea Favalli | Raniero Romagnoli | Fabio Massimo Zanzotto
Findings of the Association for Computational Linguistics: ACL 2024

Understanding textual description to generate code seems to be an achieved capability of instruction-following Large Language Models (LLMs) in zero-shot scenario. However, there is a severe possibility that this translation ability may be influenced by having seen target textual descriptions and the related code. This effect is known as Data Contamination.In this study, we investigate the impact of Data Contamination on the performance of GPT-3.5 in the Text-to-SQL code-generating tasks. Hence, we introduce a novel method to detect Data Contamination in GPTs and examine GPT-3.5’s Text-to-SQL performances using the known Spider Dataset and our new unfamiliar dataset Termite. Furthermore, we analyze GPT-3.5’s efficacy on databases with modified information via an adversarial table disconnection (ATD) approach, complicating Text-to-SQL tasks by removing structural pieces of information from the database. Our results indicate a significant performance drop in GPT-3.5 on the unfamiliar Termite dataset, even with ATD modifications, highlighting the effect of Data Contamination on LLMs in Text-to-SQL translation tasks.

Does the Order Matter? Curriculum Learning over Languages
Leonardo Ranaldi | Giulia Pucci | Andrè Freitas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Curriculum Learning (CL) has been emerged as an effective technique for improving the performances and reducing the cost of pre-training Large Language Models (LLMs). The efficacy of CL demonstrated in different scenarios is in the training LLMs by organizing examples from the simplest to the most complex. Although improvements have been shown extensively, this approach was used for pre-training, leaving novel fine-tuning approaches such as instruction-tuning unexplored. In this paper, we propose a novel complexity measure to empower the instruction-tuning method using the CL paradigm. To complement previous works, we propose cognitively motivated measures to determine the complexity of training demonstrations used in the instruction-tuning paradigm. Hence, we experiment with the proposed heuristics first in English and then in other languages. The downstream results show that delivering training examples by complexity ranking is also effective for instruction tuning, as it improves downstream performance while reducing costs. Furthermore, the technique can be easily transferred to languages other than English, e.g., Italian and French, without any adaptation, maintaining functionality and effectiveness.

HANS, are you clever? Clever Hans Effect Analysis of Neural Systems
Leonardo Ranaldi | Fabio Zanzotto
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

Large Language Models (LLMs) have been exhibiting outstanding abilities to reason around cognitive states, intentions, and reactions of all people involved, letting humans guide and comprehend day-to-day social interactions effectively. In fact, several multiple-choice questions (MCQ) benchmarks have been proposed to construct solid assessments of the models’ abilities. However, earlier works demonstrate the presence of inherent “order bias” in LLMs, posing challenges to the appropriate evaluation. In this paper, we investigate LLMs’ resilience abilities through a series of probing tests using four MCQ benchmarks. Introducing adversarial examples, we show a significant performance gap, mainly when varying the order of the choices, which reveals a selection bias and brings into discussion reasoning abilities. Following a correlation between first positions and model choices due to positional bias, we hypothesized the presence of structural heuristics in the decision-making process of the LLMs, strengthened by including significant examples in few-shot scenarios. Finally, by using the Chain-of-Thought (CoT) technique, we elicit the model to reason and mitigate the bias by obtaining more robust models.

A Trip Towards Fairness: Bias and De-Biasing in Large Language Models
Leonardo Ranaldi | Elena Sofia Ruzzetti | Davide Venditti | Dario Onorati | Fabio Massimo Zanzotto
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

Cheap-to-Build Very Large-Language Models (CtB-LLMs) with affordable training are emerging as the next big revolution in natural language processing and understanding. These CtB-LLMs are democratizing access to trainable Very Large-Language Models (VLLMs) and, thus, may represent the building blocks of many NLP systems solving downstream tasks. Hence, a little or a large bias in CtB-LLMs may cause huge harm. In this paper, we performed a large investigation of the bias of three families of CtB-LLMs, and we showed that debiasing techniques are effective and usable. Indeed, according to current tests, the LLaMA and the OPT families have an important bias in gender, race, religion, and profession. In contrast to the analysis for other LMMs, we discovered that bias depends not on the number of parameters but on the perplexity. Finally, the debiasing of OPT using LORA reduces bias up to 4.12 points in the normalized stereotype score.

2023

Legal Summarization: to Each Court its Own Model
Flavia Achena | David Preti | Davide Venditti | Leonardo Ranaldi | Cristina Giannone | Fabio Massimo Zanzotto | Andrea Favalli | Raniero Romagnoli
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Are All Languages Equal? Curriculum Learning over Different Languages
Giulia Pucci | Leonardo Ranaldi | Fabio Massimo Zanzotto
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Prompting LLMs in Italian Language for Text-to-SQL Translation
Federico Ranaldi | Elena Sofia Ruzzetti | Leonardo Ranaldi | Davide Venditti | Cristina Giannone | Andrea Favalli | Raniero Romagnoli | Fabio Massimo Zanzotto
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Teasing LLMs Adapted to Italian
Leonardo Ranaldi | Giulia Pucci | Elena Sofia Ruzzetti | Fabio Massimo Zanzotto | André Freitas
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Investigating Gender Bias in Large Language Models for the Italian Language
Elena Sofia Ruzzetti | Dario Onorati | Leonardo Ranaldi | Davide Venditti | Fabio Massimo Zanzotto
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Measuring bias in Instruction-Following models with P-AT
Dario Onorati | Elena Sofia Ruzzetti | Davide Venditti | Leonardo Ranaldi | Fabio Massimo Zanzotto
Findings of the Association for Computational Linguistics: EMNLP 2023

Instruction-Following Language Models (IFLMs) are promising and versatile tools for solving many downstream, information-seeking tasks. Given their success, there is an urgent need to have a shared resource to determine whether existing and new IFLMs are prone to produce biased language interactions. In this paper, we propose Prompt Association Test (P-AT): a new resource for testing the presence of social biases in IFLMs. P-AT stems from WEAT (Caliskan et al., 2017) and generalizes the notion of measuring social biases to IFLMs. Basically, we cast WEAT word tests in promptized classification tasks, and we associate a metric - the bias score. Our resource consists of 2310 prompts. We then experimented with several families of IFLMs discovering gender and race biases in all the analyzed models. We expect P-AT to be an important tool for quantifying bias across different dimensions and, therefore, for encouraging the creation of fairer IFLMs before their distortions have consequences in the real world.

Exploring Linguistic Properties of Monolingual BERTs with Typological Classification among Languages
Elena Sofia Ruzzetti | Federico Ranaldi | Felicia Logozzo | Michele Mastromattei | Leonardo Ranaldi | Fabio Massimo Zanzotto
Findings of the Association for Computational Linguistics: EMNLP 2023

The impressive achievements of transformers force NLP researchers to delve into how these models represent the underlying structure of natural language. In this paper, we propose a novel standpoint to investigate the above issue: using typological similarities among languages to observe how their respective monolingual models encode structural information. We aim to layer-wise compare transformers for typologically similar languages to observe whether these similarities emerge for particular layers. For this investigation, we propose to use Centered Kernel Alignment to measure similarity among weight matrices. We found that syntactic typological similarity is consistent with the similarity between the weights in the middle layers, which are the pretrained BERT layers to which syntax encoding is generally attributed. Moreover, we observe that a domain adaptation on semantically equivalent texts enhances this similarity among weight matrices.

Does the English Matter? Elicit Cross-lingual Abilities of Large Language Models
Leonardo Ranaldi | Giulia Pucci
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)

Modeling Easiness for Training Transformers with Curriculum Learning
Leonardo Ranaldi | Giulia Pucci | Fabio Massimo Zanzotto
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Directly learning from complex examples is generally problematic for humans and machines. Indeed, a better strategy is exposing learners to examples in a reasonable, pedagogically-motivated order. Curriculum Learning (CL) has been proposed to import this strategy when training machine learning models. In this paper, building on Curriculum Learning, we propose a novel, linguistically motivated measure to determine example complexity for organizing examples during learning. Our complexity measure - LRC- is based on length, rarity, and comprehensibility. Our resulting learning model is CL-LRC, that is, CL with LRC. Experiments on downstream tasks show that CL-LRC outperforms existing CL and non-CL methods for training BERT and RoBERTa from scratch. Furthermore, we analyzed different measures, including perplexity, loss, and learning curve of different models pre-trained from scratch, showing that CL-LRC performs better than the state-of-the-art.

The Dark Side of the Language: Pre-trained Transformers in the DarkNet
Leonardo Ranaldi | Aria Nourbakhsh | Elena Sofia Ruzzetti | Arianna Patrizi | Dario Onorati | Michele Mastromattei | Francesca Fallucchi | Fabio Massimo Zanzotto
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Pre-trained Transformers are challenging human performances in many Natural Language Processing tasks. The massive datasets used for pre-training seem to be the key to their success on existing tasks. In this paper, we explore how a range of pre-trained natural language understanding models performs on definitely unseen sentences provided by classification tasks over a DarkNet corpus. Surprisingly, results show that syntactic and lexical neural networks perform on par with pre-trained Transformers even after fine-tuning. Only after what we call extreme domain adaptation, that is, retraining with the masked language model task on all the novel corpus, pre-trained Transformers reach their standard high results. This suggests that huge pre-training corpora may give Transformers unexpected help since they are exposed to many of the possible sentences.

PreCog: Exploring the Relation between Memorization and Performance in Pre-trained Language Models
Leonardo Ranaldi | Elena Sofia Ruzzetti | Fabio Massimo Zanzotto
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Large Language Models (LLMs) are impressive machines with the ability to memorize, possibly generalized learning examples. We present here a small, focused contribution to the analysis of the interplay between memorization and performance of BERT in downstream tasks. We propose PreCog, a measure for evaluating memorization from pre-training, and we analyze its correlation with the BERT’s performance. Our experiments show that highly memorized examples are better classified, suggesting memorization is an essential key to success for BERT.

2022

Lacking the Embedding of a Word? Look it up into a Traditional Dictionary
Elena Sofia Ruzzetti | Leonardo Ranaldi | Michele Mastromattei | Francesca Fallucchi | Noemi Scarpato | Fabio Massimo Zanzotto
Findings of the Association for Computational Linguistics: ACL 2022

Word embeddings are powerful dictionaries, which may easily capture language variations. However, these dictionaries fail to give sense to rare words, which are surprisingly often covered by traditional dictionaries. In this paper, we propose to use definitions retrieved in traditional dictionaries to produce word embeddings for rare words. For this purpose, we introduce two methods: Definition Neural Network (DefiNNet) and Define BERT (DefBERT). In our experiments, DefiNNet and DefBERT significantly outperform state-of-the-art as well as baseline methods devised for producing embeddings of unknown words. In fact, DefiNNet significantly outperforms FastText, which implements a method for the same task-based on n-grams, and DefBERT significantly outperforms the BERT method for OOV words. Then, definitions in traditional dictionaries are useful to build word embeddings for rare words.

2021

KERMIT for Sentiment Analysis in Italian Healthcare Reviews
Leonardo Ranaldi | Michele Mastromattei | Dario Onorati | Elena Sofia Ruzzetti | Francesca Fallucchi | Fabio Massimo Zanzotto
Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021)

2020

KERMIT: Complementing Transformer Architectures with Encoders of Explicit Syntactic Interpretations
Fabio Massimo Zanzotto | Andrea Santilli | Leonardo Ranaldi | Dario Onorati | Pierfrancesco Tommasino | Francesca Fallucchi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Syntactic parsers have dominated natural language understanding for decades. Yet, their syntactic interpretations are losing centrality in downstream tasks due to the success of large-scale textual representation learners. In this paper, we propose KERMIT (Kernel-inspired Encoder with Recursive Mechanism for Interpretable Trees) to embed symbolic syntactic parse trees into artificial neural networks and to visualize how syntax is used in inference. We experimented with KERMIT paired with two state-of-the-art transformer-based universal sentence encoders (BERT and XLNet) and we showed that KERMIT can indeed boost their performance by effectively embedding human-coded universal syntactic representations in neural networks

Venues