Malvina Nissim - ACL Anthology

Malvina Nissim

2026

Practising responsibility: Ethics in NLP as a hands-on course
Malvina Nissim | Viviana Patti | Beatrice Savoldi
Proceedings of the Seventh Workshop on Teaching Natural Language Processing (TeachNLP 2026)

As Natural Language Processing (NLP) systems become more pervasive, integrating ethical considerations into NLP education has become essential. However, this presents inherent challenges in curriculum development: the field’s rapid evolution from both academia and industry, and the need to foster critical thinking beyond traditional technical training. We introduce our course on Ethical Aspects in NLP and our pedagogical approach, grounded in active learning through interactive sessions, hands-on activities, and “learning by teaching” methods. Over four years, the course has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds; it has also yielded many reusable products, both in the form of teaching materials and in the form of actual educational products aimed at diverse audiences, made by the students themselves. By sharing our approach and experience, we hope to provide inspiration for educators seeking to incorporate social impact considerations into their curricula.

Steering Large Language Models for Machine Translation Personalization
Daniel Scalena | Gabriele Sarti | Arianna Bisazza | Elisabetta Fersini | Malvina Nissim
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models have simplified the production of personalized translations reflecting predefined stylistic constraints. However, these systems still struggle when stylistic requirements are implicitly represented by a set of examples, such as texts produced by a specific human translator. In this work, we explore various strategies for personalizing automatically generated translations when few examples are available, with a focus on the challenging domain of literary translation. We begin by determining the feasibility of the task and how style information is encoded within model representations. Then, we evaluate various prompting strategies and inference-time interventions for steering model generations towards a personalized style, with a particular focus on contrastive steering with sparse autoencoder (SAE) latents to identify salient personalization properties. We demonstrate that contrastive SAE steering yields robust style conditioning and translation quality, resulting in higher inference-time computational efficiency than prompting approaches. We further examine the impact of steering on model activations, finding that layers encoding personalization properties are impacted similarly by prompting and SAE steering, suggesting a similar mechanism at play.

2025

When Harry Meets Superman: The Role of The Interlocutor in Persona-Based Dialogue Generation
Daniela Occhipinti | Marco Guerini | Malvina Nissim
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Endowing dialogue agents with persona information has proven to significantly improve the consistency and diversity of their generations. While much focus has been placed on aligning dialogues with provided personas, the adaptation to the interlocutor’s profile remains largely underexplored. In this work, we investigate three key aspects: (1) a model’s ability to align responses with both the provided persona and the interlocutor’s; (2) its robustness when dealing with familiar versus unfamiliar interlocutors and topics; and (3) the impact of additional fine-tuning on specific persona-based dialogues. We evaluate dialogues generated with diverse speaker pairings and topics, framing the evaluation as an author identification task and employing both LLM-as-a-judge and human evaluations. By systematically masking or disclosing information about the interlocutor, we assess its impact on dialogue generation. Results show that access to the interlocutor’s persona improves the recognition of the target speaker, while masking it does the opposite. Although models generalise well across topics, they struggle with unfamiliar interlocutors. Finally, we found that in zero-shot settings, LLMs often copy biographical details, facilitating identification but trivialising the task.

QE4PE: Word-level Quality Estimation for Human Post-Editing
Gabriele Sarti | Vilém Zouhar | Grzegorz Chrupała | Ana Guerberof-Arenas | Malvina Nissim | Arianna Bisazza
Transactions of the Association for Computational Linguistics, Volume 13

Word-level quality estimation (QE) methods aim to detect erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality, and editing choices of human post-editing remain understudied. In this study, we investigate the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated from behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors’ speed are critical factors in determining highlights’ effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.

Storytelling in Argumentative Discussions: Exploring the Use of Narratives in ChangeMyView
Sara Nabhani | Khalid Al Khatib | Federico Pianzola | Malvina Nissim
Proceedings of the 12th Argument mining Workshop

Psychological research has long suggested that storytelling can shape beliefs and behaviors by fostering emotional engagement and narrative transportation. However, it remains unclear whether these effects extend to online argumentative discourse. In this paper, we examine the role of narrative in real-world argumentation using discussions from the ChangeMyView subreddit. Leveraging an automatic story detection model, we analyze how narrative use varies across persuasive comments, user types, discussion outcomes, and the kinds of change being sought. While narrative appears more frequently in some contexts, it is not consistently linked to successful persuasion. Notably, highly persuasive users tend to use narrative less, and storytelling does not demonstrate increased effectiveness for any specific type of persuasive goals. These findings suggest that narrative may play a limited and context-dependent role in online discussions, highlighting the need for computational models of argumentation to account for rhetorical diversity.

Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty?
Leonidas Zotos | Hedderik van Rijn | Malvina Nissim
Proceedings of the 31st International Conference on Computational Linguistics

Estimating the difficulty of multiple-choice questions would be of great help for educators who must spend substantial time creating and piloting stimuli for their tests, and for learners who want to practice. Supervised approaches to difficulty estimation have so far yielded mixed results. In this contribution we leverage an aspect of generative large models which might be seen as a weakness when answering questions, namely their uncertainty. Specifically, we exploit model uncertainty to explore correlations between two different metrics of uncertainty and the actual student response distribution. While we observe weak correlations, we also discover that the models’ behaviour is different in the case of correct vs wrong answers, and that correlations differ substantially according to the different question types which are included in our fine-grained, previously unused dataset of 451 questions from a Biopsychology course. In discussing our findings, we also suggest potential avenues to further leverage model uncertainty as an additional proxy for item difficulty.

Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
Gabriele Sarti | Vilém Zouhar | Malvina Nissim | Arianna Bisazza
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.

Explanations explained. Influence of Free-text Explanations on LLMs and the Role of Implicit Knowledge
Andrea Zaninello | Roberto Dessi | Malvina Nissim | Bernardo Magnini
Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)

In this work, we investigate the relationship between the quality of explanations produced by different models and the amount of implicit knowledge they are able to provide beyond the input. We approximate explanation quality via accuracy on a downstream task with a standardized pipeline (GEISER) and study its correlation with three different association measures, each capturing different aspects of implicitness, defined as a combination of relevance and novelty. We conduct experiments with three SOTA LLMs on four tasks involving implicit knowledge, with explanations either confirming or contradicting the correct label. Our results demonstrate that providing quality explanations consistently improves the accuracy of LLM predictions, even when the models are not explicitly trained to take explanations as input, and underline the correlation between implicit content delivered by the explanation and its effectiveness.

2024

Non Verbis, Sed Rebus: Large Language Models Are Weak Solvers of Italian Rebuses
Gabriele Sarti | Tommaso Caselli | Malvina Nissim | Arianna Bisazza
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models’ performance. However, we find that performance gains from training are largely motivated by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models’ linguistic proficiency and sequential instruction-following skills.

Multi-property Steering of Large Language Models with Dynamic Activation Composition
Daniel Scalena | Gabriele Sarti | Malvina Nissim
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Activation steering methods were shown to be effective in conditioning language model generation by additively intervening over models’ intermediate representations. However, the evaluation of these techniques has so far been limited to single conditioning properties and synthetic settings. In this work, we conduct a comprehensive evaluation of various activation steering strategies, highlighting the property-dependent nature of optimal parameters to ensure a robust effect throughout generation. To address this issue, we propose Dynamic Activation Composition, an information-theoretic approach to modulate the steering intensity of one or more properties throughout generation. Our experiments on multi-property steering show that our method successfully maintains high conditioning while minimizing the impact of conditioning on generation fluency.
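
The additive intervention mentioned in this abstract can be illustrated with a minimal sketch. It assumes a steering vector built as the difference between mean activations on two contrasting example sets, which is one common construction and not necessarily the one used in the paper. All function names and the toy two-dimensional activations are hypothetical; a fixed intensity `alpha` is used here, whereas the paper proposes modulating it throughout generation.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations (assumed contrastive construction)."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(hidden, vector, alpha=1.0):
    """Additive intervention: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations (hypothetical values, not taken from any model).
positive = [[1.0, 2.0], [3.0, 4.0]]   # examples showing the target property
negative = [[0.0, 0.0], [2.0, 2.0]]   # examples lacking it
vector = steering_vector(positive, negative)
steered = steer([0.0, 0.0], vector, alpha=0.5)
```

With a real model, `hidden` would be the output of an intermediate layer at each generation step, and `alpha` controls the trade-off between conditioning strength and fluency that the abstract describes.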

Proceedings of the First Workshop on Reference, Framing, and Perspective @ LREC-COLING 2024
Pia Sommerauer | Tommaso Caselli | Malvina Nissim | Levi Remijnse | Piek Vossen
Proceedings of the First Workshop on Reference, Framing, and Perspective @ LREC-COLING 2024

Choosy Babies Need One Coach: Inducing Mode-Seeking Behavior in BabyLlama with Reverse KL Divergence
Shaozhen Shi | Yevgen Matusevych | Malvina Nissim
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

This study presents our submission to the Strict-Small Track of the 2nd BabyLM Challenge. We use a teacher-student distillation setup with the BabyLLaMa model (Timiryasov and Tastet, 2023) as a backbone. To make the student’s learning process more focused, we replace the objective function with a reverse Kullback-Leibler divergence, known to cause mode-seeking (rather than mode-averaging) behaviour in computational learners. We further experiment with having a single teacher (instead of an ensemble of two teachers) and implement additional optimization strategies to improve the distillation process. Our experiments show that under reverse KL divergence, a single-teacher model often outperforms or matches multiple-teacher models across most tasks. Additionally, incorporating advanced optimization techniques further enhances model performance, demonstrating the effectiveness and robustness of our proposed approach. These findings support our idea that “choosy babies need one coach”.
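
The contrast between forward and reverse KL divergence can be made concrete with a small sketch. The distributions below are hypothetical toy values, not data from the paper: a bimodal "teacher" and two candidate "students", one that commits to a single mode and one that spreads mass across both. The function names are illustrative only.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue              # 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf       # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def forward_kl(teacher, student):
    """KL(teacher || student): penalises missing any teacher mode."""
    return kl_divergence(teacher, student)


def reverse_kl(teacher, student):
    """KL(student || teacher): penalises mass where the teacher has little."""
    return kl_divergence(student, teacher)


# Hypothetical toy distributions over three outcomes.
teacher = [0.49, 0.02, 0.49]          # two modes
mode_student = [0.96, 0.02, 0.02]     # commits to one mode
avg_student = [0.30, 0.40, 0.30]      # averages across modes
```

On these toy values the reverse KL is lower for `mode_student` than for `avg_student`, while the forward KL ranks them the other way round, which is the mode-seeking versus mode-averaging distinction the abstract refers to.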

ReproHum #0892-01: The painful route to consistent results: A reproduction study of human evaluation in NLG
Irene Mondella | Huiyuan Lai | Malvina Nissim
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

In spite of the core role human judgement plays in evaluating the performance of NLP systems, the way human assessments are elicited in NLP experiments, and to some extent the nature of human judgement itself, pose challenges to the reliability and validity of human evaluation. In the context of the larger ReproHum project, aimed at running large-scale multi-lab reproductions of human judgement, we replicated the understandability assessment by humans on several generated outputs of simplified text described in the paper “Neural Text Simplification of Clinical Letters with a Domain Specific Phrase Table” by Shardlow and Nawaz, which appeared in the Proceedings of ACL 2019. Although we had to implement a series of modifications compared to the original study, which were necessary to run our human evaluation on exactly the same data, we managed to collect assessments and compare results with the original study. We obtained results consistent with those of the reference study, confirming their findings. The paper includes as much information as possible to foster and facilitate future reproduction.

GATTINA - GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge
Maria Francis | Matteo Rinaldi | Jacopo Gili | Leonardo De Cosmo | Sandro Iannaccone | Malvina Nissim | Viviana Patti
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

We introduce a new benchmark designed to evaluate the ability of Large Language Models (LLMs) to generate Italian-language headlines for science news articles. The benchmark is based on a large dataset of science news articles obtained from Ansa Scienza and Galileo, two important Italian media outlets. Effective headline generation requires more than summarizing article content; headlines must also be informative, engaging, and suitable for the topic and target audience, making automatic evaluation particularly challenging. To address this, we propose two novel transformer-based metrics to assess headline quality. We aim for this benchmark to support the evaluation of Italian LLMs and to foster the development of tools to assist in editorial workflows.

IT5: Text-to-text Pretraining for Italian Language Understanding and Generation
Gabriele Sarti | Malvina Nissim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.

Mult-IT Multiple Choice Questions on Multiple Topics in Italian: A CALAMITA Challenge
Matteo Rinaldi | Jacopo Gili | Maria Francis | Mattia Goffetti | Viviana Patti | Malvina Nissim
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Multi-choice question answering (MCQA) is a powerful tool for evaluating the factual knowledge and reasoning capacities of Large Language Models (LLMs). However, there is a lack of large-scale MCQA datasets originally written in Italian. Existing Italian MCQA benchmarks are often automatically translated from English, an approach with two key drawbacks: firstly, automatic translations may sound unnatural, contain errors, or use linguistic constructions that do not align with the target language. Secondly, they may introduce topical and ideological biases reflecting Anglo-centric perspectives. To address this gap, we present Mult-IT, an MCQA dataset comprising over 110,000 manually written questions across a wide range of topics. All questions are sourced directly from preparation quizzes for Italian university entrance exams, or for exams for public sector employment in Italy. We are hopeful that this contribution enables a more comprehensive evaluation of LLMs’ proficiency, not only in the Italian language, but also in their grasp of Italian cultural and contextual knowledge.

EurekaRebus - Verbalized Rebus Solving with LLMs: A CALAMITA Challenge
Gabriele Sarti | Tommaso Caselli | Arianna Bisazza | Malvina Nissim
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Language games can be valuable resources for testing the ability of large language models (LLMs) to conduct challenging multi-step, knowledge-intensive inferences while respecting predefined constraints. Our proposed challenge prompts LLMs to reason step-by-step to solve verbalized variants of rebus games recently introduced with the EurekaRebus dataset. Verbalized rebuses replace visual cues with crossword definitions to create an encrypted first pass, making the problem entirely text-based. We introduce a simplified task variant with word length hints and adopt a comprehensive set of metrics to obtain a granular overview of models’ performance in knowledge recall, constraints adherence, and re-segmentation abilities across reasoning steps.

A Gentle Push Funziona Benissimo: Making Instructed Models in Italian via Contrastive Activation Steering
Daniel Scalena | Elisabetta Fersini | Malvina Nissim
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Adapting models to a language that was only partially present in the pre-training data requires fine-tuning, which is expensive in terms of both data and computational resources. As an alternative to fine-tuning, we explore the potential of activation steering-based techniques to enhance model performance on Italian tasks. Through our experiments we show that Italian steering (i) can be successfully applied to different models, (ii) achieves performance comparable to, or even better than, fine-tuned models for Italian, and (iii) yields higher quality and consistency in Italian generations. We also discuss the utility of steering and fine-tuning in the contemporary LLM landscape, where models already achieve high performance in Italian even if not explicitly trained on this language.

ReproHum #0033-3: Comparable Relative Results with Lower Absolute Values in a Reproduction Study
Yiru Li | Huiyuan Lai | Antonio Toral | Malvina Nissim
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

In the context of the ReproHum project aimed at assessing the reliability of human evaluation, we replicated the human evaluation conducted in “Generating Scientific Definitions with Controllable Complexity” by August et al. (2022). Specifically, humans were asked to assess the fluency of automatically generated scientific definitions by three different models, with output complexity varying according to target audience. Evaluation conditions were kept as close as possible to the original study, except for necessary minor adjustments. Our results, despite yielding lower absolute performance, show that relative performance across the three tested systems remains comparable to what was observed in the original paper. On the basis of lower inter-annotator agreement and feedback received from annotators in our experiment, we also observe that the ambiguity of the concept being evaluated may play a substantial role in human assessment.

CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian
Giuseppe Attanasio | Pierpaolo Basile | Federico Borazio | Danilo Croce | Maria Francis | Jacopo Gili | Elio Musacchio | Malvina Nissim | Viviana Patti | Matteo Rinaldi | Daniel Scalena
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

The rapid development of Large Language Models (LLMs) has called for robust benchmarks to assess their abilities, track progress, and compare iterations. While existing benchmarks provide extensive evaluations across diverse tasks, they predominantly focus on English, leaving other languages underserved. For Italian, the EVALITA campaigns have provided a long-standing tradition of classification-focused shared tasks. However, their scope does not fully align with the nuanced evaluation required for modern LLMs. To address this gap, we introduce “Challenge the Abilities of LAnguage Models in ITAlian” (CALAMITA), a collaborative effort to create a dynamic and growing benchmark tailored to Italian. CALAMITA emphasizes diversity in task design to test a wide range of LLM capabilities through resources natively developed in Italian by the community. This initiative includes a shared platform, live leaderboard, and centralized evaluation framework. This paper outlines the collaborative process, initial challenges, and evaluation framework of CALAMITA.

Fine-tuning with HED-IT: The impact of human post-editing for dialogical language models
Daniela Occhipinti | Michele Marchi | Irene Mondella | Huiyuan Lai | Felice Dell’Orletta | Malvina Nissim | Marco Guerini
Findings of the Association for Computational Linguistics: ACL 2024

Automatic methods for generating and gathering linguistic data have proven effective for fine-tuning Language Models (LMs) in languages less resourced than English. Still, while there has been emphasis on data quantity, less attention has been given to its quality. In this work, we investigate the impact of human intervention on machine-generated data when fine-tuning dialogical models. In particular, we study (1) whether post-edited dialogues exhibit higher perceived quality compared to the originals that were automatically generated; (2) whether fine-tuning with post-edited dialogues results in noticeable differences in the generated outputs; and (3) whether post-edited dialogues influence the outcomes when considering the parameter size of the LMs. To this end we created HED-IT, a large-scale dataset where machine-generated dialogues are paired with the version post-edited by humans. Using both the edited and unedited portions of HED-IT, we fine-tuned three different sizes of an LM. Results from both human and automatic evaluation show that the different quality of training data is clearly perceived and it has an impact also on the models trained on such data. Additionally, our findings indicate that larger models are less sensitive to data quality, whereas this has a crucial impact on smaller models. These results enhance our comprehension of the impact of human intervention on training data in the development of high-quality LMs.

mCoT: Multilingual Instruction Tuning for Reasoning Consistency in Language Models
Huiyuan Lai | Malvina Nissim
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) with Chain-of-thought (CoT) have recently emerged as a powerful technique for eliciting reasoning to improve various downstream tasks. As most research mainly focuses on English, with few explorations in a multilingual context, the question of how reliable this reasoning capability is in different languages is still open. To address it directly, we study multilingual reasoning consistency across multiple languages, using popular open-source LLMs. First, we compile the first large-scale multilingual math reasoning dataset, *mCoT-MATH*, covering eleven diverse languages. Then, we introduce multilingual CoT instruction tuning to boost reasoning capability across languages, thereby improving model consistency. While existing LLMs show substantial variation across the languages we consider, and especially low performance for lesser resourced languages, our 7B parameter model *mCoT* achieves impressive consistency across languages, and superior or comparable performance to closed- and open-source models even of much larger sizes.

2023

Responsibility Perspective Transfer for Italian Femicide News
Gosse Minnema | Huiyuan Lai | Benedetta Muscato | Malvina Nissim
Findings of the Association for Computational Linguistics: ACL 2023

Different ways of linguistically expressing the same real-world event can lead to different perceptions of what happened. Previous work has shown that different descriptions of gender-based violence (GBV) influence the reader’s perception of who is to blame for the violence, possibly reinforcing stereotypes which see the victim as partly responsible, too. As a contribution to raise awareness on perspective-based writing, and to facilitate access to alternative perspectives, we introduce the novel task of automatically rewriting GBV descriptions as a means to alter the perceived level of blame on the perpetrator. We present a quasi-parallel dataset of sentences with low and high perceived responsibility levels for the perpetrator, and experiment with unsupervised (mBART-based), zero-shot and few-shot (GPT3-based) methods for rewriting sentences. We evaluate our models using a questionnaire study and a suite of automatic metrics.

Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work
Silvia Stopponi | Nilo Pedrazzini | Saskia Peels | Barbara McGillivray | Malvina Nissim
Proceedings of the Ancient Language Processing Workshop

We evaluate four count-based and predictive distributional semantic models of Ancient Greek against AGREE, a composite benchmark of human judgements, to assess their ability to retrieve semantic relatedness. On the basis of the observations deriving from the analysis of the results, we design a procedure for a larger-scale intrinsic evaluation of count-based and predictive language models, including syntactic embeddings. We also propose possible ways of exploiting the different layers of the whole AGREE benchmark (including both human- and machine-generated data) and different evaluation metrics.

Pre-Trained Language-Meaning Models for Multilingual Parsing and Generation
Chunliu Wang | Huiyuan Lai | Malvina Nissim | Johan Bos
Findings of the Association for Computational Linguistics: ACL 2023

Pre-trained language models (PLMs) have achieved great success in NLP and have recently been used for tasks in computational semantics. However, these tasks do not fully benefit from PLMs since meaning representations are not explicitly included. We introduce multilingual pre-trained language-meaning models based on Discourse Representation Structures (DRSs), including meaning representations besides natural language texts in the same model, and design a new strategy to reduce the gap between the pre-training and fine-tuning objectives. Since DRSs are language neutral, cross-lingual transfer learning is adopted to further improve the performance of non-English tasks. Automatic evaluation results show that our approach achieves the best performance on both the multilingual DRS parsing and DRS-to-text generation tasks. Correlation analysis between automatic metrics and human judgements on the generation task further validates the effectiveness of our model. Human inspection reveals that out-of-vocabulary tokens are the main cause of erroneous results.

Same Trends, Different Answers: Insights from a Replication Study of Human Plausibility Judgments on Narrative Continuations
Yiru Li | Huiyuan Lai | Antonio Toral | Malvina Nissim
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

We reproduced the human-based evaluation of the continuation of narratives task presented by Chakrabarty et al. (2022). This experiment is performed as part of the ReproNLP Shared Task on Reproducibility of Evaluations in NLP (Track C). Our main goal is to reproduce the original study under conditions as similar as possible. Specifically, we follow the original experimental design and perform human evaluations of the data from the original study, while describing the differences between the two studies. We then present the results of these two studies together with an analysis of similarities between them. Inter-annotator agreement (Krippendorff’s alpha) in the reproduction study is lower than in the original study, while the human evaluation results of both studies have the same trends, that is, our results support the findings in the original study.

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Anya Belz | Craig Thomson | Ehud Reiter | Gavin Abercrombie | Jose M. Alonso-Moral | Mohammad Arvan | Anouck Braggaar | Mark Cieliebak | Elizabeth Clark | Kees van Deemter | Tanvi Dinkar | Ondřej Dušek | Steffen Eger | Qixiang Fang | Mingqi Gao | Albert Gatt | Dimitra Gkatzia | Javier González-Corbelle | Dirk Hovy | Manuela Hürlimann | Takumi Ito | John D. Kelleher | Filip Klubicka | Emiel Krahmer | Huiyuan Lai | Chris van der Lee | Yiru Li | Saad Mahamood | Margot Mieskes | Emiel van Miltenburg | Pablo Mosteiro | Malvina Nissim | Natalie Parde | Ondřej Plátek | Verena Rieser | Jie Ruan | Joel Tetreault | Antonio Toral | Xiaojun Wan | Leo Wanner | Lewis Watson | Diyi Yang
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.

DUMB: A Benchmark for Smart Evaluation of Dutch Models
Wietse de Vries | Martijn Wieling | Malvina Nissim
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a diverse set of datasets for low-, medium- and high-resource tasks. The total set of nine tasks includes four tasks that were previously not available in Dutch. Instead of relying on a mean score across tasks, we propose Relative Error Reduction (RER), which compares the DUMB performance of language models to a strong baseline which can be referred to in the future even when assessing different sets of language models. Through a comparison of 14 pre-trained language models (mono- and multi-lingual, of varying sizes), we assess the internal consistency of the benchmark tasks, as well as the factors that likely enable high performance. Our results indicate that current Dutch monolingual models under-perform and suggest training larger Dutch models with other architectures and pre-training objectives. At present, the highest performance is achieved by DeBERTaV3 (large), XLM-R (large) and mDeBERTaV3 (base). In addition to highlighting best strategies for training larger Dutch models, DUMB will foster further research on Dutch. A public leaderboard is available at https://dumbench.nl.
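As an illustration of the scoring idea in this abstract, the sketch below computes a relative error reduction against a baseline and averages it over tasks. It assumes the common definition of RER (the share of the baseline's error that a model removes) and scores on a 0-100 scale. The function names, the `max_score` parameter, and the example scores are hypothetical and are not taken from the DUMB paper or its code.

```python
from typing import Dict


def relative_error_reduction(model_score: float,
                             baseline_score: float,
                             max_score: float = 100.0) -> float:
    """Fraction of the baseline's error removed by the model.

    Assumed definition: error = max_score - score, and
    RER = (baseline_error - model_error) / baseline_error.
    Negative values mean the model is worse than the baseline.
    """
    baseline_error = max_score - baseline_score
    model_error = max_score - model_score
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    return (baseline_error - model_error) / baseline_error


def mean_rer(model_scores: Dict[str, float],
             baseline_scores: Dict[str, float]) -> float:
    """Average RER over the tasks present in the baseline scores."""
    rers = [
        relative_error_reduction(model_scores[task], baseline_scores[task])
        for task in baseline_scores
    ]
    return sum(rers) / len(rers)


# Hypothetical scores for two tasks, for illustration only.
baseline = {"task_a": 80.0, "task_b": 80.0}
model = {"task_a": 90.0, "task_b": 70.0}
print(relative_error_reduction(90.0, 80.0))
print(mean_rer(model, baseline))
```

Unlike a plain mean of task scores, this kind of measure weights a gain by how much room for improvement the baseline left on each task.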

Multilingual Multi-Figurative Language Detection
Huiyuan Lai | Antonio Toral | Malvina Nissim
Findings of the Association for Computational Linguistics: ACL 2023

Figures of speech help people express abstract concepts and evoke stronger emotions than literal expressions, thereby making texts more creative and engaging. Due to its pervasive and fundamental character, figurative language understanding has been addressed in Natural Language Processing, but it’s highly understudied in a multilingual setting and when considering more than one figure of speech at the same time. To bridge this gap, we introduce multilingual multi-figurative language modelling, and provide a benchmark for sentence-level figurative language detection, covering three common figures of speech and seven languages. Specifically, we develop a framework for figurative language detection based on template-based prompt learning. In so doing, we unify multiple detection tasks that are interrelated across multiple figures of speech and languages, without requiring task- or language-specific modules. Experimental results show that our framework outperforms several strong baselines and may serve as a blueprint for the joint modelling of other interrelated tasks.

Cross-lingual Transfer Learning with Persian
Sepideh Mollanorozy | Marc Tanti | Malvina Nissim
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

The success of cross-lingual transfer learning for POS tagging has been shown to be strongly dependent, among other factors, on the (typological and/or genetic) similarity of the low-resource language used for testing and the language(s) used in pre-training or to fine-tune the model. We further unpack this finding in two directions by zooming in on a single language, namely Persian. First, still focusing on POS tagging, we run an in-depth analysis of the behaviour of Persian with respect to closely related languages and languages that appear to benefit from cross-lingual transfer with Persian. To do so, we also use the World Atlas of Language Structures to determine which properties are shared between Persian and other languages included in the experiments. Based on our results, Persian seems to be a reasonable potential source language for the low-resource languages Kurmanji and Tagalog, for other tasks as well. Second, we test whether previous findings also hold on a task other than POS tagging to pull apart the benefit of language similarity and the specific task for which such benefit has been shown to hold. We gather sentiment analysis datasets for 31 target languages and through a series of cross-lingual experiments analyse which languages most benefit from Persian as the source. The set of languages that benefit from Persian had very little overlap across the two tasks, suggesting a strong task-dependent component in the usefulness of language similarity in cross-lingual transfer.

2022

Human Judgement as a Compass to Navigate Automatic Metrics for Formality Transfer
Huiyuan Lai | Jiali Mao | Antonio Toral | Malvina Nissim
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

Although text style transfer has witnessed rapid development in recent years, there is as yet no established standard for evaluation, which is performed using several automatic metrics, lacking the possibility of always resorting to human judgement. We focus on the task of formality transfer, and on the three aspects that are usually evaluated: style strength, content preservation, and fluency. To cast light on how such aspects are assessed by common and new metrics, we run a human-based evaluation and perform a rich correlation analysis. We are then able to offer some recommendations on the use of such metrics in formality transfer, also with an eye to their generalisability (or not) to related tasks.

Make the Best of Cross-lingual Transfer: Evidence from POS Tagging with over 100 Languages
Wietse de Vries | Martijn Wieling | Malvina Nissim
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Cross-lingual transfer learning with large multilingual pre-trained models can be an effective approach for low-resource languages with no labeled training data. Existing evaluations of zero-shot cross-lingual generalisability of large pre-trained models use datasets with English training data, and test data in a selection of target languages. We explore a more extensive transfer learning setup with 65 different source languages and 105 target languages for part-of-speech tagging. Through our analysis, we show that pre-training of both source and target language, as well as matching language families, writing systems, word order systems, and lexical-phonetic distance significantly impact cross-lingual performance. The findings described in this paper can be used as indicators of which factors are important for effective zero-shot cross-lingual transfer to zero- and low-resource languages.

AGILe: The First Lemmatizer for Ancient Greek Inscriptions
Evelien de Graaf | Silvia Stopponi | Jasper K. Bos | Saskia Peels-Matthey | Malvina Nissim
Proceedings of the Thirteenth Language Resources and Evaluation Conference

To facilitate corpus searches by classicists as well as to reduce data sparsity when training models, we focus on the automatic lemmatization of ancient Greek inscriptions, which have not received as much attention in this sense as literary text data has. We show that existing lemmatizers for ancient Greek, trained on literary data, are not performant on epigraphic data, due to major language differences between the two types of texts. We thus train the first inscription-specific lemmatizer achieving above 80% accuracy, and make both the models and the lemmatized data available to the community. We also provide a detailed error analysis highlighting peculiarities of inscriptions which again highlights the importance of a lemmatizer dedicated to inscriptions.

Multilingual Pre-training with Language and Task Adaptation for Multilingual Text Style Transfer
Huiyuan Lai | Antonio Toral | Malvina Nissim
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We exploit the pre-trained seq2seq model mBART for multilingual text style transfer. Using machine translated data as well as gold aligned English sentences yields state-of-the-art results in the three target languages we consider. Besides, in view of the general scarcity of parallel data, we propose a modular approach for multilingual formality transfer, which consists of two training strategies that target adaptation to both language and task. Our approach achieves competitive performance without monolingual task-specific parallel data and can be applied to other style transfer tasks as well as to other languages.

SocioFillmore: A Tool for Discovering Perspectives
Gosse Minnema | Sara Gemelli | Chiara Zanchi | Tommaso Caselli | Malvina Nissim
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

SOCIOFILLMORE is a multilingual tool which helps to bring to the fore the focus or the perspective that a text expresses in depicting an event. Our tool, whose rationale we also support through a large collection of human judgements, is theoretically grounded on frame semantics and cognitive linguistics, and implemented using the LOME frame semantic parser. We describe SOCIOFILLMORE’s development and functionalities, show how non-NLP researchers can easily interact with the tool, and present some example case studies which are already incorporated in the system, together with the kind of analysis that can be visualised.

Dead or Murdered? Predicting Responsibility Perception in Femicide News Reports
Gosse Minnema | Sara Gemelli | Chiara Zanchi | Tommaso Caselli | Malvina Nissim
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Different linguistic expressions can conceptualize the same event from different viewpoints by emphasizing certain participants over others. Here, we investigate a case where this has social consequences: how do linguistic expressions of gender-based violence (GBV) influence who we perceive as responsible? We build on previous psycholinguistic research in this area and conduct a large-scale perception survey of GBV descriptions automatically extracted from a corpus of Italian newspapers. We then train regression models that predict the salience of GBV participants with respect to different dimensions of perceived responsibility. Our best model (fine-tuned BERT) shows solid overall performance, with large differences between dimensions and participants: salient _focus_ is more predictable than salient _blame_, and perpetrators’ salience is more predictable than victims’ salience. Experiments with ridge regression models using different representations show that features based on linguistic theory perform similarly to word-based features. Overall, we show that different linguistic choices do trigger different perceptions of responsibility, and that such perceptions can be modelled automatically. This work can be a core instrument to raise awareness of the consequences of different perspectivizations in the general public and in news producers alike.

Multi-Figurative Language Generation
Huiyuan Lai | Malvina Nissim
Proceedings of the 29th International Conference on Computational Linguistics

Figurative language generation is the task of reformulating a given text in the desired figure of speech while still being faithful to the original context. We take the first step towards multi-figurative language modelling by providing a benchmark for the automatic generation of five common figurative forms in English. We train mFLAG employing a scheme for multi-figurative language pre-training on top of BART, and a mechanism for injecting the target figurative information into the encoder; this enables the generation of text with the target figurative form from another figurative form without parallel figurative-figurative sentence pairs. Our approach outperforms all strong baselines. We also offer some qualitative analysis and reflections on the relationship between the different figures of speech.

Visually Grounded Interpretation of Noun-Noun Compounds in English
Inga Lang | Lonneke Plas | Malvina Nissim | Albert Gatt
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Noun-noun compounds (NNCs) occur frequently in the English language. Accurate NNC interpretation, i.e. determining the implicit relationship between the constituents of a NNC, is crucial for the advancement of many natural language processing tasks. Until now, computational NNC interpretation has been limited to approaches involving linguistic representations only. However, much research suggests that grounding linguistic representations in vision or other modalities can increase performance on this and other tasks. Our work is a novel comparison of linguistic and visuo-linguistic representations for the task of NNC interpretation. We frame NNC interpretation as a relation classification task, evaluating on a large, relationally-annotated NNC dataset. We combine distributional word vectors with image vectors to investigate how visual information can help improve NNC interpretation systems. We find that adding visual vectors increases classification performance on our dataset in many cases.

2021

As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages
Wietse de Vries | Malvina Nissim
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Thank you BART! Rewarding Pre-Trained Models Improves Formality Style Transfer
Huiyuan Lai | Antonio Toral | Malvina Nissim
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Scarcity of parallel data causes formality style transfer models to have scarce success in preserving content. We show that fine-tuning pre-trained language (GPT-2) and sequence-to-sequence (BART) models boosts content preservation, and that this is possible even with limited amounts of parallel data. Augmenting these models with rewards that target style and content –the two core aspects of the task– we achieve a new state-of-the-art.

Breeding Fillmore’s Chickens and Hatching the Eggs: Recombining Frames and Roles in Frame-Semantic Parsing
Gosse Minnema | Malvina Nissim
Proceedings of the 14th International Conference on Computational Semantics (IWCS)

Frame-semantic parsers traditionally predict predicates, frames, and semantic roles in a fixed order. This paper explores the ‘chicken-or-egg’ problem of interdependencies between these components theoretically and practically. We introduce a flexible BERT-based sequence labeling architecture that allows for predicting frames and roles independently from each other or combining them in several ways. Our results show that our setups can approximate more complex traditional models’ performance, while allowing for a clearer view of the interdependencies between the pipeline’s components, and of how frame and role prediction models make different use of BERT’s layers.

Frame Semantics for Social NLP in Italian: Analyzing Responsibility Framing in Femicide News Reports
Gosse Minnema | Sara Gemelli | Chiara Zanchi | Viviana Patti | Tommaso Caselli | Malvina Nissim
Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021)

Teaching NLP with Bracelets and Restaurant Menus: An Interactive Workshop for Italian Students
Ludovica Pannitto | Lucia Busso | Claudia Roberta Combei | Lucio Messina | Alessio Miaschi | Gabriele Sarti | Malvina Nissim
Proceedings of the Fifth Workshop on Teaching NLP

Although Natural Language Processing is at the core of many tools young people use in their everyday life, high school curricula (in Italy) do not include any computational linguistics education. This lack of exposure makes the use of such tools less responsible than it could be, and makes choosing computational linguistics as a university degree unlikely. To raise awareness, curiosity, and longer-term interest in young people, we have developed an interactive workshop designed to illustrate the basic principles of NLP and computational linguistics to high school Italian students aged between 13 and 18 years. The workshop takes the form of a game in which participants play the role of machines needing to solve some of the most common problems a computer faces in understanding language: from voice recognition to Markov chains to syntactic parsing. Participants are guided through the workshop with the help of instructors, who present the activities and explain core concepts from computational linguistics. The workshop was presented at numerous outlets in Italy between 2019 and 2020, both face-to-face and online.

Human Perception in Natural Language Generation
Lorenzo De Mattei | Huiyuan Lai | Felice Dell’Orletta | Malvina Nissim
Proceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

We ask subjects whether they perceive as human-produced a set of texts, some of which are actually human-written, while others are automatically generated. We use this data to fine-tune a GPT-2 model to push it to generate more human-like texts, and observe that this fine-tuned model produces texts that are indeed perceived as more human-like than those of the original model. Contextually, we show that our automatic evaluation strategy correlates well with human judgements. We also run a linguistic analysis to unveil the characteristics of human- vs machine-perceived language.

Generic resources are what you need: Style transfer tasks without task-specific parallel training data
Huiyuan Lai | Antonio Toral | Malvina Nissim
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Style transfer aims to rewrite a source text in a different target style while preserving its content. We propose a novel approach to this task that leverages generic resources, and without using any task-specific parallel (source–target) data outperforms existing unsupervised approaches on the two most popular style transfer tasks: formality transfer and polarity swap. In practice, we adopt a multi-step procedure which builds on a generic pre-trained sequence-to-sequence model (BART). First, we strengthen the model’s ability to rewrite by further pre-training BART on both an existing collection of generic paraphrases, as well as on synthetic pairs created using a general-purpose lexical resource. Second, through an iterative back-translation approach, we train two models, each in a transfer direction, so that they can provide each other with synthetically generated pairs, dynamically in the training process. Lastly, we let our best resulting model generate static synthetic pairs to be used in a supervised training regime. Besides methodology and state-of-the-art results, a core contribution of this work is a reflection on the nature of the two tasks we address, and how their differences are highlighted by their response to our approach.

Adapting Monolingual Models: Data can be Scarce when Language Similarity is High
Wietse de Vries | Martijn Bartelds | Malvina Nissim | Martijn Wieling
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

A dissemination workshop for introducing young Italian students to NLP
Lucio Messina | Lucia Busso | Claudia Roberta Combei | Alessio Miaschi | Ludovica Pannitto | Gabriele Sarti | Malvina Nissim
Proceedings of the Fifth Workshop on Teaching NLP

We describe and make available the game-based material developed for a laboratory run at several Italian science festivals to popularize NLP among young students.

DALC: the Dutch Abusive Language Corpus
Tommaso Caselli | Arjan Schelhaas | Marieke Weultjes | Folkert Leistra | Hylke van der Veen | Gerben Timmerman | Malvina Nissim
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

As socially unacceptable language becomes pervasive in social media platforms, the need for automatic content moderation becomes more pressing. This contribution introduces the Dutch Abusive Language Corpus (DALC v1.0), a new dataset with tweets manually annotated for abusive language. The resource addresses a gap in language resources for Dutch and adopts a multi-layer annotation scheme modeling the explicitness and the target of the abusive messages. Baseline experiments on all annotation layers have been conducted, achieving a macro F1 score of 0.748 for binary classification of the explicitness layer and 0.489 for target classification.

2020

Unmasking Contextual Stereotypes: Measuring and Mitigating BERT’s Gender Bias
Marion Bartl | Malvina Nissim | Albert Gatt
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing

Contextualized word embeddings have been replacing standard embeddings as the representational knowledge source of choice in NLP systems. Since a variety of biases have previously been found in standard word embeddings, it is crucial to assess biases encoded in their replacements as well. Focusing on BERT (Devlin et al., 2018), we measure gender bias by studying associations between gender-denoting target words and names of professions in English and German, comparing the findings with real-world workforce statistics. We mitigate bias by fine-tuning BERT on the GAP corpus (Webster et al., 2018), after applying Counterfactual Data Substitution (CDS) (Maudslay et al., 2019). We show that our method of measuring bias is appropriate for languages such as English, but not for languages with a rich morphology and gender-marking, such as German. Our results highlight the importance of investigating bias and mitigation techniques cross-linguistically, especially in view of the current emphasis on large-scale, multilingual language models.

Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media
Malvina Nissim | Viviana Patti | Barbara Plank | Esin Durmus
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media

Lower Bias, Higher Density Abusive Language Datasets: A Recipe
Juliet van Rosendaal | Tommaso Caselli | Malvina Nissim
Proceedings of the Workshop on Resources and Techniques for User and Author Profiling in Abusive Language

Datasets to train models for abusive language detection are at the same time necessary and still scarce. One of the reasons for their limited availability is the cost of their creation. It is not only that manual annotation is expensive, it is also the case that the phenomenon is sparse, forcing human annotators to go through a large number of irrelevant examples in order to obtain some significant data. Strategies used until now to increase density of abusive language and obtain more meaningful data overall include data filtering on the basis of pre-selected keywords and hate-rich sources of data. We suggest a recipe that at the same time can provide meaningful data with possibly higher density of abusive language and also reduce top-down biases imposed by corpus creators in the selection of the data to annotate. More specifically, we exploit the controversy channel on Reddit to obtain keywords that are used to filter a Twitter dataset. While the method needs further validation and refinement, our preliminary experiments show a higher density of abusive tweets in the filtered vs unfiltered dataset, and a more meaningful topic distribution after filtering.

Fair Is Better than Sensational: Man Is to Doctor as Woman Is to Doctor
Malvina Nissim | Rik van Noord | Rob van der Goot
Computational Linguistics, Volume 46, Issue 2 - June 2020

Analogies such as man is to king as woman is to X are often used to illustrate the amazing power of word embeddings. Concurrently, they have also been used to expose how strongly human biases are encoded in vector spaces trained on natural language, with examples like man is to computer programmer as woman is to homemaker. Recent work has shown that analogies are in fact not an accurate diagnostic for bias, but this does not mean that they are not used anymore, or that their legacy is fading. Instead of focusing on the intrinsic problems of the analogy task as a bias detection tool, we discuss a series of issues involving implementation as well as subjective choices that might have yielded a distorted picture of bias in word embeddings. We stand by the truth that human biases are present in word embeddings, and, of course, the need to address them. But analogies are not an accurate tool to do so, and the way they have been most often used has exacerbated some possibly non-existing biases and perhaps hidden others. Because they are still widely popular, and some of them have become classics within and outside the NLP community, we deem it important to provide a series of clarifications that should put well-known, and potentially new analogies, into the right perspective.
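To make the kind of implementation issue mentioned in this abstract concrete, here is a minimal sketch of the usual vector-offset analogy query, with a switch for one choice often made in analogy code: barring the query words from being returned. The toy vectors, the `analogy` function, and the `exclude_inputs` flag are all hypothetical and chosen only for illustration. They are not real embeddings and not the authors' code.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors: Dict[str, List[float]], a: str, b: str, c: str,
            exclude_inputs: bool = True) -> str:
    """Answer 'a is to b as c is to ?' with the offset b - a + c.

    With exclude_inputs=True the query words a, b and c can never be
    returned, so the answer is forced to differ from b.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors
                  if not (exclude_inputs and w in (a, b, c))]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hypothetical 3-dimensional toy vectors, for illustration only.
TOY = {
    "man": [1.0, 0.1, 0.0],
    "woman": [1.0, -0.1, 0.0],
    "doctor": [0.0, 0.0, 1.0],
    "nurse": [0.5, -0.5, 1.0],
    "teacher": [0.0, 0.5, 1.0],
}

print(analogy(TOY, "man", "doctor", "woman", exclude_inputs=False))
print(analogy(TOY, "man", "doctor", "woman", exclude_inputs=True))
```

In this toy space the unconstrained query returns "doctor", while the constrained query has to return another word, which shows how a single implementation choice can change what an analogy appears to reveal.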

Matching Theory and Data with Personal-ITY: What a Corpus of Italian YouTube Comments Reveals About Personality
Elisa Bassignana | Malvina Nissim | Viviana Patti
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media

As a contribution to personality detection in languages other than English, we rely on distant supervision to create Personal-ITY, a novel corpus of YouTube comments in Italian, where authors are labelled with personality traits. The traits are derived from one of the mainstream personality theories in psychology research, named MBTI. Using personality prediction experiments, we (i) study the task of personality prediction in itself on our corpus as well as on TWISTY, a Twitter dataset also annotated with MBTI labels; (ii) carry out an extensive, in-depth analysis of the features used by the classifier, and view them specifically under the light of the original theory that we used to create the corpus in the first place. We observe that no single model is best at personality detection, and that while some traits are easier than others to detect, and also to match back to theory, for other, less frequent traits the picture is much more blurred.

On the interaction of automatic evaluation and task framing in headline style transfer
Lorenzo De Mattei | Michele Cafagna | Huiyuan Lai | Felice Dell’Orletta | Malvina Nissim | Albert Gatt
Proceedings of the 1st Workshop on Evaluating NLG Evaluation

An ongoing debate in the NLG community concerns the best way to evaluate systems, with human evaluation often being considered the most reliable method, compared to corpus-based metrics. However, tasks involving subtle textual differences, such as style transfer, tend to be hard for humans to perform. In this paper, we propose an evaluation method for this task based on purposely-trained classifiers, showing that it better reflects system differences than traditional metrics such as BLEU.

Datasets and Models for Authorship Attribution on Italian Personal Writings
Gaetana Ruggiero | Albert Gatt | Malvina Nissim
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

GePpeTto Carves Italian into a Language Model
Lorenzo De Mattei | Michele Cafagna | Felice Dell’Orletta | Malvina Nissim | Marco Guerini
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Personal-ITY: A Novel YouTube-based Corpus for Personality Prediction in Italian
Elisa Bassignana | Malvina Nissim | Viviana Patti
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Multiword Expressions We Live by: A Validated Usage-based Dataset from Corpora of Written Italian
Francesca Masini | M. Silvia Micheli | Andrea Zaninello | Sara Castagnoli | Malvina Nissim
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment
Lorenzo De Mattei | Michele Cafagna | Felice Dell’Orletta | Malvina Nissim
Proceedings of the Twelfth Language Resources and Evaluation Conference

We automatically generate headlines that are expected to comply with the specific styles of two different Italian newspapers. Through a data alignment strategy and different training/testing settings, we aim at decoupling content from style and preserve the latter in generation. In order to evaluate the generated headlines’ quality in terms of their specific newspaper-compliance, we devise a fine-grained evaluation strategy based on automatic classification. We observe that our models do indeed learn newspaper-specific style. Importantly, we also observe that humans aren’t reliable judges for this task, since although familiar with the newspapers, they are not able to discern their specific styles even in the original human-written headlines. The utility of automatic evaluation goes therefore beyond saving the costs and hurdles of manual annotation, and deserves particular care in its design.

MAGPIE: A Large Corpus of Potentially Idiomatic Expressions
Hessel Haagsma | Johan Bos | Malvina Nissim
Proceedings of the Twelfth Language Resources and Evaluation Conference

Given the limited size of existing idiom corpora, we aim to enable progress in automatic idiom processing and linguistic analysis by creating the largest-to-date corpus of idioms for English. Using a fixed idiom list, automatic pre-extraction, and a strictly controlled crowdsourced annotation procedure, we show that it is feasible to build a high-quality corpus comprising more than 50K instances, an order of magnitude larger than previous resources. Crucial ingredients of crowdsourcing were the selection of crowdworkers, clear and comprehensive instructions, and an interface that breaks down the task into small, manageable steps. Analysis of the resulting corpus revealed strong effects of genre on idiom distribution, providing new evidence for existing theories on what influences idiom usage. The corpus also contains rich metadata, and is made publicly available.

What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models
Wietse de Vries | Andreas van Cranenburgh | Malvina Nissim
Findings of the Association for Computational Linguistics: EMNLP 2020

Peeking into the inner workings of BERT has shown that its layers resemble the classical NLP pipeline, with progressively more complex tasks being concentrated in later layers. To investigate to what extent these results also hold for a language other than English, we probe a Dutch BERT-based model and the multilingual BERT model for Dutch NLP tasks. In addition, through a deeper analysis of part-of-speech tagging, we show that also within a given task, information is spread over different parts of the network and the pipeline might not be as neat as it seems. Each layer has different specialisations, so that it may be more useful to combine information from different layers, instead of selecting a single one based on the best overall performance.

2019

The Contribution of Embeddings to Sentiment Analysis on YouTube
Moniek Nieuwenhuis | Malvina Nissim
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

Suitable Doesn’t Mean Attractive. Human-Based Evaluation of Automatically Generated Headlines
Michele Cafagna | Lorenzo De Mattei | Davide Bacciu | Malvina Nissim
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

You Write like You Eat: Stylistic Variation as a Predictor of Social Stratification
Angelo Basile | Albert Gatt | Malvina Nissim
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Inspired by Labov’s seminal work on stylistic variation as a function of social stratification, we develop and compare neural models that predict a person’s presumed socio-economic status, obtained through distant supervision, from their writing style on social media. The focus of our work is on identifying the most important stylistic parameters to predict socio-economic group. In particular, we show the effectiveness of morpho-syntactic features as predictors of style, in contrast to lexical features, which are good predictors of topic.

Embeddings Shifts as Proxies for Different Word Use in Italian Newspapers
Michele Cafagna | Lorenzo De Mattei | Malvina Nissim
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)

2018

Source-driven Representations for Hate Speech Detection
Flavio Merenda | Claudia Zaghi | Tommaso Caselli | Malvina Nissim
Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)

The Other Side of the Coin: Unsupervised Disambiguation of Potentially Idiomatic Expressions by Contrasting Senses
Hessel Haagsma | Malvina Nissim | Johan Bos
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

Disambiguation of potentially idiomatic expressions involves determining the sense of a potentially idiomatic expression in a given context, e.g. determining that make hay in ‘Investment banks made hay while takeovers shone.’ is used in a figurative sense. This enables automatic interpretation of idiomatic expressions, which is important for applications like machine translation and sentiment analysis. In this work, we present an unsupervised approach for English that makes use of literalisations of idiom senses to improve disambiguation, which is based on the lexical cohesion graph-based method by Sporleder and Li (2009). Experimental results show that, while literalisation carries novel information, its performance falls short of that of state-of-the-art unsupervised methods.

Discriminator at SemEval-2018 Task 10: Minimally Supervised Discrimination
Artur Kulmizev | Mostafa Abdou | Vinit Ravishankar | Malvina Nissim
Proceedings of the 12th International Workshop on Semantic Evaluation

We participated in the SemEval-2018 shared task on capturing discriminative attributes (Task 10) with a simple system that ranked 8th amongst the 26 teams that took part in the evaluation. Our final score was 0.67, which is competitive with the winning score of 0.75, particularly given that our system is a zero-shot system that requires no training and minimal parameter optimisation. In addition to describing the submitted system, and discussing the implications of the relative success of such a system on this task, we also report on other, more complex models we experimented with.

Proceedings of ACL 2018, Student Research Workshop
Vered Shwartz | Jeniya Tabassum | Rob Voigt | Wanxiang Che | Marie-Catherine de Marneffe | Malvina Nissim
Proceedings of ACL 2018, Student Research Workshop

Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media
Malvina Nissim | Viviana Patti | Barbara Plank | Claudia Wagner
Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media

Bleaching Text: Abstract Features for Cross-lingual Gender Prediction
Rob van der Goot | Nikola Ljubešić | Ian Matroos | Malvina Nissim | Barbara Plank
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Gender prediction has typically focused on lexical and social network features, yielding good performance, but making systems highly language-, topic-, and platform-dependent. Cross-lingual embeddings circumvent some of these limitations, but capture gender-specific style less. We propose an alternative: bleaching text, i.e., transforming lexical strings into more abstract features. This study provides evidence that such features allow for better transfer across languages. Moreover, we present a first study on the ability of humans to perform cross-lingual gender prediction. We find that human predictive power proves similar to that of our bleached models, and both perform better than lexical models.

Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics
Malvina Nissim | Jonathan Berant | Alessandro Lenci
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics

2017

Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)
Roberto Basili | Malvina Nissim | Giorgio Satta
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)

Preface
Roberto Basili | Malvina Nissim | Giorgio Satta
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)

MODAL: A Multilingual Corpus Annotated for Modality
Malvina Nissim | Paola Pietrandrea
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)

Last Words: Sharing Is Caring: The Future of Shared Tasks
Malvina Nissim | Lasha Abzianidze | Kilian Evang | Rob van der Goot | Hessel Haagsma | Barbara Plank | Martijn Wieling
Computational Linguistics, Volume 43, Issue 4 - December 2017

The Power of Character N-grams in Native Language Identification
Artur Kulmizev | Bo Blankers | Johannes Bjerva | Malvina Nissim | Gertjan van Noord | Barbara Plank | Martijn Wieling
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we explore the performance of a linear SVM trained on language independent character features for the NLI Shared Task 2017. Our basic system (GRONINGEN) achieves the best performance (87.56 F1-score) on the evaluation set using only 1-9 character n-grams as features. We compare this against several ensemble and meta-classifiers in order to examine how the linear system fares when combined with other, especially non-linear classifiers. Special emphasis is placed on the topic bias that exists by virtue of the assessment essay prompt distribution.

Predicting Controversial News Using Facebook Reactions
Angelo Basile | Tommaso Caselli | Malvina Nissim
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)

To normalize, or not to normalize: The impact of normalization on Part-of-Speech tagging
Rob van der Goot | Barbara Plank | Malvina Nissim
Proceedings of the 3rd Workshop on Noisy User-generated Text

Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, non-canonical data? To the best of our knowledge, little is known on the actual impact of normalization in a real-world scenario, where gold error detection is not available. We investigate the effect of automatic normalization on POS tagging of tweets. We also compare normalization to strategies that leverage large amounts of unlabeled data kept in its raw form. Our results show that normalization helps, but does not add consistently beyond just word embedding layer initialization. The latter approach yields a tagging model that is competitive with a Twitter state-of-the-art tagger.

2016

Distant supervision for emotion detection using Facebook reactions
Chris Pool | Malvina Nissim
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

We exploit the Facebook reaction feature in a distantly supervised fashion to train a support vector machine classifier for emotion detection, using several feature combinations and combining different Facebook pages. We test our models on existing benchmarks for emotion detection and show that, employing only information that is derived completely automatically, thus without relying on any handcrafted lexicon as is usually done, we can achieve competitive results. The results also show that there is large room for improvement, especially by gearing the collection of Facebook pages to the target domain.

Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)
Malvina Nissim | Viviana Patti | Barbara Plank
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

Leveraging Native Data to Correct Preposition Errors in Learners’ Dutch
Lennart Kloppenburg | Malvina Nissim
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We address the task of automatically correcting preposition errors in learners’ Dutch by modelling preposition usage in native language. Specifically, we build two models exploiting a large corpus of Dutch. The first is a binary model for detecting whether a preposition should be used at all in a given position or not. The second is a multiclass model for selecting the appropriate preposition in case one should be used. The models are tested on native as well as learners’ data. For the latter we exploit a crowdsourcing strategy to elicit native judgements. On native test data the models perform very well, showing that we can model preposition usage appropriately. However, the evaluation on learners’ data shows that while detecting that a given preposition is wrong can be done reasonably well, detecting the absence of a preposition is a lot more difficult. Observing such results and the data we deal with, we envisage various ways of improving performance, and report them in the final section of this article.

2015

Adding Semantics to Data-Driven Paraphrasing
Ellie Pavlick | Johan Bos | Malvina Nissim | Charley Beller | Benjamin Van Durme | Chris Callison-Burch
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Uncovering Noun-Noun Compound Relations by Gamification
Johan Bos | Malvina Nissim
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2014

The Meaning Factory: Formal Semantics for Recognizing Textual Entailment and Determining Semantic Similarity
Johannes Bjerva | Johan Bos | Rob van der Goot | Malvina Nissim
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

A Modular System for Rule-based Text Categorisation
Marco Del Tredici | Malvina Nissim
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We introduce a modular rule-based approach to text categorisation which is more flexible and less time consuming to build than a standard rule-based system because it works with a hierarchical structure and allows for re-usability of rules. When compared to currently more widespread machine learning models on a case study, our modular system shows competitive results, and it has the advantage of reducing manual effort over time, since fewer rules must be written when moving to a (partially) new domain, while annotation of training data is always required in the same amount.

Extracting MWEs from Italian corpora: A case study for refining the POS-pattern methodology
Malvina Nissim | Sara Castagnoli | Francesca Masini
Proceedings of the 10th Workshop on Multiword Expressions (MWE)

2013

Sentiment analysis on Italian tweets
Valerio Basile | Malvina Nissim
Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

A Repository of Variation Patterns for Multiword Expressions
Malvina Nissim | Andrea Zaninello
Proceedings of the 9th Workshop on Multiword Expressions

Cross-linguistic annotation of modality: a data-driven hierarchical model
Malvina Nissim | Paola Pietrandrea | Andrea Sansò | Caterina Mauri
Proceedings of the 9th Joint ISO - ACL SIGSEM Workshop on Interoperable Semantic Annotation

Modelling the Internal Variability of MWEs
Malvina Nissim
Proceedings of the 9th Workshop on Multiword Expressions

2010

Creation of Lexical Resources for a Characterisation of Multiword Expressions in Italian
Andrea Zaninello | Malvina Nissim
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The theoretical characterisation of multiword expressions (MWEs) is tightly connected to their actual occurrences in data and to their representation in lexical resources. We present three lexical resources for Italian MWEs, namely an electronic lexicon, a series of example corpora and a database of MWEs represented around morphosyntactic patterns. These resources are matched against, and created from, a very large web-derived corpus for Italian that spans across registers and domains. We can thus test expressions coded by lexicographers in a dictionary, thereby discarding unattested expressions, revisiting lexicographers' choices on the basis of frequency information, and at the same time creating an example sub-corpus for each entry. We organise MWEs on the basis of the morphosyntactic information obtained from the data in an electronic, flexible knowledge-base containing structured annotation exploitable for multiple purposes. We also suggest further work directions towards characterising MWEs by analysing the data organised in our database through lexico-semantic information available in WordNet or MultiWordNet-like resources, also in the perspective of expanding their set through the extraction of other similar compact expressions.

2009

Automatic identification of semantic relations in Italian complex nominals
Fabio Celli | Malvina Nissim
Proceedings of the Eight International Conference on Computational Semantics

2008

The Italian Particle “ne”: Corpus Construction and Analysis
Malvina Nissim | Sara Perboni
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The Italian particle ne exhibits interesting anaphoric properties that have not yet been explored in depth from a corpus and computational linguistic perspective. We provide: (i) an overview of the phenomenon; (ii) a set of annotation schemes for marking up occurrences of ne; (iii) the description of a corpus annotated for this phenomenon; (iv) a first assessment of the resolution task. We show that the schemes we developed are reliable, and that the actual distribution of partitive and non-partitive uses of ne is inversely proportional to the amount of attention that the two different uses have received in the linguistic literature. As an assessment of the complexity of the resolution task, we find that a recency-based baseline yields an accuracy of less than 30% on both development and test data.

2007

SemEval-2007 Task 08: Metonymy Resolution at SemEval-2007
Katja Markert | Malvina Nissim
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

The Impact of Annotation on the Performance of Protein Tagging in Biomedical Text
Beatrice Alex | Malvina Nissim | Claire Grover
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we discuss five different corpora annotated for protein names. We present several within- and cross-dataset protein tagging experiments showing that different annotation schemes severely affect the portability of statistical protein taggers. By means of a detailed error analysis we identify crucial annotation issues that future annotation projects should take into careful consideration.

An Empirical Approach to the Interpretation of Superlatives
Johan Bos | Malvina Nissim
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

Learning Information Status of Discourse Entities
Malvina Nissim
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

2005

A Framework for Annotating Information Structure in Discourse
Sasha Calhoun | Malvina Nissim | Mark Steedman | Jason Brenier
Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky

Comparing Knowledge Sources for Nominal Anaphora Resolution
Katja Markert | Malvina Nissim
Computational Linguistics, Volume 31, Number 3, September 2005

2004

Using the NITE XML Toolkit on the Switchboard Corpus to Study Syntactic Choice: a Case Study
Jean Carletta | Shipra Dingare | Malvina Nissim | Tatiana Nikitina
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Exploiting Context for Biomedical Entity Recognition: From Syntax to the Web
Jenny Finkel | Shipra Dingare | Huy Nguyen | Malvina Nissim | Christopher Manning | Gail Sinclair
Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP)

An Annotation Scheme for Information Status in Dialogue
Malvina Nissim | Shipra Dingare | Jean Carletta | Mark Steedman
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

Syntactic Features and Word Similarity for Supervised Metonymy Resolution
Malvina Nissim | Katja Markert
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

Using the Web for Nominal Anaphora Resolution
Katja Markert | Malvina Nissim | Natalia Modjeska
Proceedings of the 2003 EACL Workshop on The Computational Treatment of Anaphora

Using the Web in Machine Learning for Other-Anaphora Resolution
Natalia N. Modjeska | Katja Markert | Malvina Nissim
Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing

2002

Towards a Corpus Annotated for Metonymies: the Case of Location Names
Katja Markert | Malvina Nissim
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Metonymy Resolution as a Classification Task
Katja Markert | Malvina Nissim
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

Co-authors

Katja Markert 7

Barbara Plank 7

Lorenzo De Mattei 6

Arianna Bisazza 5

Michele Cafagna 5

Felice Dell’Orletta 5

Rob Van Der Goot 5

Gosse Minnema 5

Martijn Wieling 5

Wietse de Vries 5

Daniel Scalena 4

Andrea Zaninello 4

Shipra Dingare 3

Maria Francis 3

Marco Guerini 3

Hessel Haagsma 3

Matteo Rinaldi 3

Chiara Zanchi 3

Angelo Basile 2

Roberto Basili 2

Elisa Bassignana 2

Johannes Bjerva 2

Jean Carletta 2

Sara Castagnoli 2

Claudia Roberta Combei 2

Elisabetta Fersini 2

Artur Kulmizev 2

Francesca Masini 2

Lucio Messina 2

Alessio Miaschi 2

Natalia N. Modjeska 2

Irene Mondella 2

Daniela Occhipinti 2

Ludovica Pannitto 2

Paola Pietrandrea 2

Giorgio Satta 2

Mark Steedman 2

Silvia Stopponi 2

Vilém Zouhar 2

Mostafa Abdou 1

Gavin Abercrombie 1

Lasha Abzianidze 1

Beatrice Alex 1

Jose M. Alonso-Moral 1

Mohammad Arvan 1

Giuseppe Attanasio 1

Davide Bacciu 1

Martijn Bartelds 1

Valerio Basile 1

Pierpaolo Basile 1

Charley Beller 1

Jonathan Berant 1

Federico Borazio 1

Jasper K. Bos 1

Anouck Braggaar 1

Jason Brenier 1

Sasha Calhoun 1

Chris Callison-Burch 1

Grzegorz Chrupała 1

Mark Cieliebak 1

Elizabeth Clark 1

Leonardo De Cosmo 1

Roberto Dessì 1

Benjamin Van Durme 1

Ondřej Dušek 1

Jenny Rose Finkel 1

Dimitra Gkatzia 1

Mattia Goffetti 1

Javier González Corbelle 1

Claire Grover 1

Ana Guerberof-Arenas 1

Manuela Huerlimann 1

Sandro Iannaccone 1

John Kelleher 1

Khalid Al Khatib 1

Lennart Kloppenburg 1

Filip Klubicka 1

Emiel Krahmer 1

Folkert Leistra 1

Alessandro Lenci 1

Nikola Ljubešić 1

Bernardo Magnini 1

Saad Mahamood 1

Christopher D. Manning 1

Michele Marchi 1

Yevgen Matusevych 1

Caterina Mauri 1

Barbara McGillivray 1

Flavio Merenda 1

M. Silvia Micheli 1

Margot Mieskes 1

Sepideh Mollanorozy 1

Pablo Mosteiro 1

Elio Musacchio 1

Benedetta Muscato 1

Moniek Nieuwenhuis 1

Tatiana Nikitina 1

Natalie Parde 1

Ellie Pavlick 1

Nilo Pedrazzini 1

Saskia Peels-Matthey 1

Federico Pianzola 1

Ondřej Plátek 1

Vinit Ravishankar 1

Levi Remijnse 1

Verena Rieser 1

Gaetana Ruggiero 1

Andrea Sansò 1

Beatrice Savoldi 1

Arjan Schelhaas 1

Vered Shwartz 1

Gail Sinclair 1

Pia Sommerauer 1

Jeniya Tabassum 1

Joel Tetreault 1

Craig Thomson 1

Gerben Timmerman 1

Marco Del Tredici 1

Andreas Van Cranenburgh 1

Hylke Van Der Veen 1

Emiel Van Miltenburg 1

Rik Van Noord 1

Claudia Wagner 1

Marieke Weultjes 1

Claudia Zaghi 1

Leonidas Zotos 1

Evelien de Graaf 1

Marie-Catherine de Marneffe 1

Kees van Deemter 1

Gertjan van Noord 1

Hedderik van Rijn 1

Juliet van Rosendaal 1

Chris van der Lee 1

Venues