Gabriele Sarti - ACL Anthology

Gabriele Sarti

2026

Steering Large Language Models for Machine Translation Personalization
Daniel Scalena | Gabriele Sarti | Arianna Bisazza | Elisabetta Fersini | Malvina Nissim
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models have simplified the production of personalized translations reflecting predefined stylistic constraints. However, these systems still struggle when stylistic requirements are implicitly represented by a set of examples, such as texts produced by a specific human translator. In this work, we explore various strategies for personalizing automatically generated translations when few examples are available, with a focus on the challenging domain of literary translation. We begin by determining the feasibility of the task and how style information is encoded within model representations. Then, we evaluate various prompting strategies and inference-time interventions for steering model generations towards a personalized style, with a particular focus on contrastive steering with sparse autoencoder (SAE) latents to identify salient personalization properties. We demonstrate that contrastive SAE steering yields robust style conditioning and translation quality, resulting in higher inference-time computational efficiency than prompting approaches. We further examine the impact of steering on model activations, finding that layers encoding personalization properties are impacted similarly by prompting and SAE steering, suggesting a similar mechanism at play.

2025

Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models
Dana Arad | Yonatan Belinkov | Hanjie Chen | Najoung Kim | Hosein Mohebbi | Aaron Mueller | Gabriele Sarti | Martin Tutek
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Mechanistic interpretability (MI) seeks to uncover how language models (LMs) implement specific behaviors, yet measuring progress in MI remains challenging. The recently released Mechanistic Interpretability Benchmark (MIB) provides a standardized framework for evaluating circuit and causal variable localization. Building on this foundation, the BlackboxNLP 2025 Shared Task extends MIB into a community-wide reproducible comparison of MI techniques. The shared task features two tracks: circuit localization, which assesses methods that identify causally influential components and interactions driving model behavior, and causal variable localization, which evaluates approaches that map activations into interpretable features. With three teams spanning eight different methods, participants achieved notable gains in circuit localization using ensemble and regularization strategies for circuit discovery. With one team spanning two methods, participants achieved significant gains in causal variable localization using low-dimensional and non-linear projections to featurize activation vectors. The MIB leaderboard remains open; we encourage continued work in this standard evaluation framework to measure progress in MI research going forward.

QE4PE: Word-level Quality Estimation for Human Post-Editing
Gabriele Sarti | Vilém Zouhar | Grzegorz Chrupała | Ana Guerberof-Arenas | Malvina Nissim | Arianna Bisazza
Transactions of the Association for Computational Linguistics, Volume 13

Word-level quality estimation (QE) methods aim to detect erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality, and editing choices of human post-editing remain understudied. In this study, we investigate the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated from behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors’ speed are critical factors in determining highlights’ effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.

Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Yonatan Belinkov | Aaron Mueller | Najoung Kim | Hosein Mohebbi | Hanjie Chen | Dana Arad | Gabriele Sarti
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation
Khondoker Ittehadul Islam | Gabriele Sarti
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)

Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models’ predictions, highlighting different trends across models and languages.

Crossword Space: Latent Manifold Learning for Italian Crosswords and beyond
Cristiano Ciaccio | Gabriele Sarti | Alessio Miaschi | Felice Dell’Orletta
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
Gabriele Sarti | Vilém Zouhar | Malvina Nissim | Arianna Bisazza
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.

2024

Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation
Lukas Edman | Gabriele Sarti | Antonio Toral | Gertjan van Noord | Arianna Bisazza
Transactions of the Association for Computational Linguistics, Volume 12

Pretrained character-level and byte-level language models have been shown to be competitive with popular subword models across a range of Natural Language Processing tasks. However, there has been little research on their effectiveness for neural machine translation (NMT), particularly within the popular pretrain-then-finetune paradigm. This work performs an extensive comparison across multiple languages and experimental conditions of character- and subword-level pretrained models (ByT5 and mT5, respectively) on NMT. We show the effectiveness of character-level modeling in translation, particularly in cases where fine-tuning data is limited. In our analysis, we show how character models’ gains in translation quality are reflected in better translations of orthographically similar words and rare words. While evaluating the importance of source texts in driving model predictions, we highlight word-level patterns within ByT5, suggesting an ability to modulate word-level and character-level information during generation. We conclude by assessing the efficiency tradeoff of byte models, suggesting their usage in non-time-critical scenarios to boost translation quality.

Non Verbis, Sed Rebus: Large Language Models Are Weak Solvers of Italian Rebuses
Gabriele Sarti | Tommaso Caselli | Malvina Nissim | Arianna Bisazza
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models’ performance. However, we find that performance gains from training are largely motivated by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models’ linguistic proficiency and sequential instruction-following skills.

IT5: Text-to-text Pretraining for Italian Language Understanding and Generation
Gabriele Sarti | Malvina Nissim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.

Multi-property Steering of Large Language Models with Dynamic Activation Composition
Daniel Scalena | Gabriele Sarti | Malvina Nissim
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Activation steering methods were shown to be effective in conditioning language model generation by additively intervening over models’ intermediate representations. However, the evaluation of these techniques has so far been limited to single conditioning properties and synthetic settings. In this work, we conduct a comprehensive evaluation of various activation steering strategies, highlighting the property-dependent nature of optimal parameters to ensure a robust effect throughout generation. To address this issue, we propose Dynamic Activation Composition, an information-theoretic approach to modulate the steering intensity of one or more properties throughout generation. Our experiments on multi-property steering show that our method successfully maintains high conditioning while minimizing the impact of conditioning on generation fluency.

Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation
Jirui Qi | Gabriele Sarti | Raquel Fernández | Arianna Bisazza
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Ensuring the verifiability of model answers is a fundamental challenge for retrieval-augmented generation (RAG) in the question answering (QA) domain. Recently, self-citation prompting was proposed to make large language models (LLMs) generate citations to supporting documents along with their answers. However, self-citing LLMs often struggle to match the required format, refer to non-existent sources, and fail to faithfully reflect LLMs’ context usage throughout the generation. In this work, we present MIRAGE – Model Internals-based RAG Explanations – a plug-and-play approach using model internals for faithful answer attribution in RAG applications. MIRAGE detects context-sensitive answer tokens and pairs them with retrieved documents contributing to their prediction via saliency methods. We evaluate our proposed approach on a multilingual extractive QA dataset, finding high agreement with human answer attribution. On open-ended QA, MIRAGE achieves citation quality and efficiency comparable to self-citation while also allowing for a finer-grained control of attribution parameters. Our qualitative evaluation highlights the faithfulness of MIRAGE’s attributions and underscores the promising application of model internals for RAG answer attribution. Code and data released at https://github.com/Betswish/MIRAGE.

DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
Anna Langedijk | Hosein Mohebbi | Gabriele Sarti | Willem Zuidema | Jaap Jumelet
Findings of the Association for Computational Linguistics: NAACL 2024

In recent years, several interpretability methods have been proposed to interpret the inner workings of Transformer models at different levels of precision and complexity.In this work, we propose a simple but effective technique to analyze encoder-decoder Transformers. Our method, which we name DecoderLens, allows the decoder to cross-attend representations of intermediate encoder activations instead of using the default final encoder output.The method thus maps uninterpretable intermediate vector representations to human-interpretable sequences of words or symbols, shedding new light on the information flow in this popular but understudied class of models.We apply DecoderLens to question answering, logical reasoning, speech recognition and machine translation models, finding that simpler subtasks are solved with high precision by low and intermediate encoder layers.

EurekaRebus - Verbalized Rebus Solving with LLMs: A CALAMITA Challenge
Gabriele Sarti | Tommaso Caselli | Arianna Bisazza | Malvina Nissim
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Language games can be valuable resources for testing the ability of large language models (LLMs) to conduct challenging multi-step, knowledge-intensive inferences while respecting predefined constraints. Our proposed challenge prompts LLMs to reason step-by-step to solve verbalized variants of rebus games recently introduced with the EurekaRebus dataset. Verbalized rebuses replace visual cues with crossword definitions to create an encrypted first pass, making the problem entirely text-based. We introduce a simplified task variant with word length hints and adopt a comprehensive set of metrics to obtain a granular overview of models’ performance in knowledge recall, constraints adherence, and re-segmentation abilities across reasoning steps.

2023

RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation
Gabriele Sarti | Phu Mon Htut | Xing Niu | Benjamin Hsu | Anna Currey | Georgiana Dinu | Maria Nadejde
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Attribute-controlled translation (ACT) is a subtask of machine translation that involves controlling stylistic or linguistic attributes (like formality and gender) of translation outputs. While ACT has garnered attention in recent years due to its usefulness in real-world applications, progress in the task is currently limited by dataset availability, since most prior approaches rely on supervised methods. To address this limitation, we propose Retrieval and Attribute-Marking enhanced Prompting (RAMP), which leverages large multilingual language models to perform ACT in few-shot and zero-shot settings. RAMP improves generation accuracy over the standard prompting approach by (1) incorporating a semantic similarity retrieval component for selecting similar in-context examples, and (2) marking in-context examples with attribute annotations. Our comprehensive experiments show that RAMP is a viable approach in both zero-shot and few-shot settings.

Contrastive Language–Image Pre-training for the Italian Language
Federico Bianchi | Giuseppe Attanasio | Raphael Pisoni | Silvia Terragni | Gabriele Sarti | Dario Balestri
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Inseq: An Interpretability Toolkit for Sequence Generation Models
Gabriele Sarti | Nils Feldhus | Ludwig Sickert | Oskar van der Wal
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Past work in natural language processing interpretability focused mainly on popular classification tasks while largely overlooking generation settings, partly due to a lack of dedicated tools. In this work, we introduce Inseq, a Python library to democratize access to interpretability analyses of sequence generation models. Inseq enables intuitive and optimized extraction of models’ internal information and feature importance scores for popular decoder-only and encoder-decoder Transformers architectures. We showcase its potential by adopting it to highlight gender biases in machine translation models and locate factual knowledge inside GPT-2. Thanks to its extensible interface supporting cutting-edge techniques such as contrastive feature attribution, Inseq can drive future advances in explainable natural language generation, centralizing good practices and enabling fair and reproducible model evaluations.

2022

DivEMT: Neural Machine Translation Post-Editing Effort Across Typologically Diverse Languages
Gabriele Sarti | Arianna Bisazza | Ana Guerberof-Arenas | Antonio Toral
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We introduce DivEMT, the first publicly available post-editing study of Neural Machine Translation (NMT) over a typologically diverse set of target languages. Using a strictly controlled setup, 18 professional translators were instructed to translate or post-edit the same set of English documents into Arabic, Dutch, Italian, Turkish, Ukrainian, and Vietnamese. During the process, their edits, keystrokes, editing times and pauses were recorded, enabling an in-depth, cross-lingual evaluation of NMT quality and post-editing effectiveness. Using this new dataset, we assess the impact of two state-of-the-art NMT systems, Google Translate and the multilingual mBART-50 model, on translation productivity. We find that post-editing is consistently faster than translation from scratch. However, the magnitude of productivity gains varies widely across systems and languages, highlighting major disparities in post-editing effectiveness for languages at different degrees of typological relatedness to English, even when controlling for system architecture and training data size. We publicly release the complete dataset including all collected behavioral data, to foster new research on the translation capabilities of NMT systems for typologically diverse languages.

InDeep × NMT: Empowering Human Translators via Interpretable Neural Machine Translation
Gabriele Sarti | Arianna Bisazza
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

Neural machine translation (NMT) systems are nowadays essential components of professional translation workflows. Consequently, human translators are increasingly working as post-editors for machine-translated content. The NWO-funded InDeep project aims to empower users of Deep Learning models of text, speech, and music by improving their ability to interact with such models and interpret their behaviors. In the specific context of translation, we aim at developing new tools and methodologies to improve prediction attribution, error analysis, and controllable generation for NMT systems. These advances will be evaluated through field studies involving professional translators to assess gains in efficiency and overall enjoyability of the post-editing process.

2021

Teaching NLP with Bracelets and Restaurant Menus: An Interactive Workshop for Italian Students
Ludovica Pannitto | Lucia Busso | Claudia Roberta Combei | Lucio Messina | Alessio Miaschi | Gabriele Sarti | Malvina Nissim
Proceedings of the Fifth Workshop on Teaching NLP

Although Natural Language Processing is at the core of many tools young people use in their everyday life, high school curricula (in Italy) do not include any computational linguistics education. This lack of exposure makes the use of such tools less responsible than it could be, and makes choosing computational linguistics as a university degree unlikely. To raise awareness, curiosity, and longer-term interest in young people, we have developed an interactive workshop designed to illustrate the basic principles of NLP and computational linguistics to high school Italian students aged between 13 and 18 years. The workshop takes the form of a game in which participants play the role of machines needing to solve some of the most common problems a computer faces in understanding language: from voice recognition to Markov chains to syntactic parsing. Participants are guided through the workshop with the help of instructors, who present the activities and explain core concepts from computational linguistics. The workshop was presented at numerous outlets in Italy between 2019 and 2020, both face-to-face and online.

A dissemination workshop for introducing young Italian students to NLP
Lucio Messina | Lucia Busso | Claudia Roberta Combei | Alessio Miaschi | Ludovica Pannitto | Gabriele Sarti | Malvina Nissim
Proceedings of the Fifth Workshop on Teaching NLP

We describe and make available the game-based material developed for a laboratory run at several Italian science festivals to popularize NLP among young students.

That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models
Gabriele Sarti | Dominique Brunato | Felice Dell’Orletta
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

This paper investigates the relationship between two complementary perspectives in the human assessment of sentence complexity and how they are modeled in a neural language model (NLM). The first perspective takes into account multiple online behavioral metrics obtained from eye-tracking recordings. The second one concerns the offline perception of complexity measured by explicit human judgments. Using a broad spectrum of linguistic features modeling lexical, morpho-syntactic, and syntactic properties of sentences, we perform a comprehensive analysis of linguistic phenomena associated with the two complexity viewpoints and report similarities and differences. We then show the effectiveness of linguistic features when explicitly leveraged by a regression model for predicting sentence complexity and compare its results with the ones obtained by a fine-tuned neural language model. We finally probe the NLM’s linguistic competence before and after fine-tuning, highlighting how linguistic information encoded in representations changes when the model learns to predict complexity.

2020

Italian Transformers Under the Linguistic Lens
Alessio Miaschi | Gabriele Sarti | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Venues