Jaap Jumelet - ACL Anthology

Jaap Jumelet

2026

MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs
Jaap Jumelet | Leonie Weissweiler | Joakim Nivre | Arianna Bisazza
Transactions of the Association for Computational Linguistics, Volume 14

We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.

We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.

Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models
Vitalii Hirak | Jaap Jumelet | Arianna Bisazza
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite major advances in multilingual modeling, large quality disparities persist across languages. Besides the obvious impact of uneven training resources, typological properties have also been proposed to determine the intrinsic difficulty of modeling a language. The existing evidence, however, is mostly based on small monolingual language models or bilingual translation models trained from scratch. We expand on this line of work by analyzing two large pre-trained multilingual translation models, NLLB-200 and Tower+, which are state-of-the-art representatives of encoder-decoder and decoder-only machine translation, respectively. Based on a broad set of languages, we find that target language typology drives translation quality of both models, even after controlling for more trivial factors, such as data resourcedness and writing script. Additionally, languages with certain typological properties benefit more from a wider search of the output space, suggesting that such languages could profit from alternative decoding strategies beyond the standard left-to-right beam search. To facilitate further research in this area, we release a set of fine-grained typological properties for 212 languages of the FLORES+ MT evaluation benchmark.

2025

Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models
Francesca Padovani | Jaap Jumelet | Yevgen Matusevych | Arianna Bisazza
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can outperform LMs trained on an equal amount of adult-directed text like Wikipedia. However, it remains unclear whether these results generalize across languages, architectures, and evaluation settings. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in these benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal pair evaluations and regression analysis we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.

TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
Ezgi Başar | Francesca Padovani | Jaap Jumelet | Arianna Bisazza
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge Large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.

Findings of the Third BabyLM Challenge: Accelerating Language Modeling Research with Cognitively Plausible Data
Lucas Charpentier | Leshem Choshen | Ryan Cotterell | Mustafa Omer Gul | Michael Y. Hu | Jing Liu | Jaap Jumelet | Tal Linzen | Aaron Mueller | Candace Ross | Raj Sanjay Shah | Alex Warstadt | Ethan Gotlieb Wilcox | Adina Williams
Proceedings of the First BabyLM Workshop

This report summarizes the findings from the 3rd BabyLM Challenge. The BabyLM Challenge is a shared task aimed at closing the data efficiency gap between human and machine language learners. This year, the challenge was held as part of an expanded BabyLM Workshop that invited paper submissions on topics relevant to the BabyLM effort, including sample-efficient pretraining and cognitive modeling for LMs. For the challenge, we kept the text-only and text–image tracks from previous years, but also introduced a new interaction track, where student models are allowed to learn from feedback from larger teacher models. Furthermore, we introduce a new set of evaluation tasks to assess the “human likeness” of models on a cognitive and linguistic level, limit the total amount of training compute allowed, and measure performance on intermediate checkpoints. We observe that new training objectives and architectures tend to produce the best-performing approaches, and that interaction with teacher models can yield high-quality language models. The strict-small and interaction tracks saw submissions that outperformed the baselines. We do not observe a complete correlation between training FLOPs and performance. This year’s BabyLM Challenge shows that there is still room to innovate in a data-constrained setting, and that community-driven research can yield actionable insights for language modeling.

Proceedings of the First BabyLM Workshop
Lucas Charpentier | Leshem Choshen | Ryan Cotterell | Mustafa Omer Gul | Michael Y. Hu | Jing Liu | Jaap Jumelet | Tal Linzen | Aaron Mueller | Candace Ross | Raj Sanjay Shah | Alex Warstadt | Ethan Gotlieb Wilcox | Adina Williams
Proceedings of the First BabyLM Workshop

2024

Filtered Corpus Training (FiCT) Shows that Language Models Can Generalize from Indirect Evidence
Abhinav Patil | Jaap Jumelet | Yu Ying Chiu | Andy Lapastora | Peter Shen | Lexie Wang | Clevis Willrich | Shane Steinert-Threlkeld
Transactions of the Association for Computational Linguistics, Volume 12

This paper introduces Filtered Corpus Training, a method that trains language models (LMs) on corpora with certain linguistic constructions filtered out from the training data, and uses it to measure the ability of LMs to perform linguistic generalization on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence.

DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
Anna Langedijk | Hosein Mohebbi | Gabriele Sarti | Willem Zuidema | Jaap Jumelet
Findings of the Association for Computational Linguistics: NAACL 2024

In recent years, several interpretability methods have been proposed to interpret the inner workings of Transformer models at different levels of precision and complexity. In this work, we propose a simple but effective technique to analyze encoder-decoder Transformers. Our method, which we name DecoderLens, allows the decoder to cross-attend representations of intermediate encoder activations instead of using the default final encoder output. The method thus maps uninterpretable intermediate vector representations to human-interpretable sequences of words or symbols, shedding new light on the information flow in this popular but understudied class of models. We apply DecoderLens to question answering, logical reasoning, speech recognition and machine translation models, finding that simpler subtasks are solved with high precision by low and intermediate encoder layers.

Do Language Models Exhibit Human-like Structural Priming Effects?
Jaap Jumelet | Willem Zuidema | Arabella Sinclair
Findings of the Association for Computational Linguistics: ACL 2024

We explore which linguistic factors—at the sentence and token level—play an important role in influencing language model predictions, and investigate whether these are reflective of results found in humans and human corpora (Gries and Kootstra, 2017). We make use of the structural priming paradigm—where recent exposure to a structure facilitates processing of the same structure—to investigate where priming effects manifest, and what factors predict them. We find these effects can be explained via the inverse frequency effect found in human priming, where rarer elements within a prime increase priming effects, as well as lexical dependence between prime and target. Our results provide an important piece in the puzzle of understanding how properties within their context affect structural prediction in language models.

Transformer-specific Interpretability
Hosein Mohebbi | Jaap Jumelet | Michael Hanna | Afra Alishahi | Willem Zuidema
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Transformers have emerged as dominant players in various scientific fields, especially NLP. However, their inner workings, like many other neural networks, remain opaque. In spite of the widespread use of model-agnostic interpretability techniques, including gradient-based and occlusion-based, their shortcomings are becoming increasingly apparent for Transformer interpretation, making the field of interpretability more demanding today. In this tutorial, we will present Transformer-specific interpretability methods, a new trending approach, that make use of specific features of the Transformer architecture and are deemed more promising for understanding Transformer-based models. We start by discussing the potential pitfalls and misleading results model-agnostic approaches may produce when interpreting Transformers. Next, we discuss Transformer-specific methods, including those designed to quantify context-mixing interactions among all input pairs (as the fundamental property of the Transformer architecture) and those that combine causal methods with low-level Transformer analysis to identify particular subnetworks within a model that are responsible for specific tasks. By the end of the tutorial, we hope participants will understand the advantages (as well as current limitations) of Transformer-specific interpretability methods, along with how these can be applied to their own research.

Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Yonatan Belinkov | Najoung Kim | Jaap Jumelet | Hosein Mohebbi | Aaron Mueller | Hanjie Chen
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Interpretability of Language Models via Task Spaces
Lucas Weber | Jaap Jumelet | Elia Bruni | Dieuwke Hupkes
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The usual way to interpret language models (LMs) is to test their performance on different benchmarks and subsequently infer their internal processes. In this paper, we present an alternative approach, concentrating on the _quality_ of LM processing, with a focus on their language abilities. To this end, we construct ‘linguistic task spaces’ – representations of an LM’s language conceptualisation – that shed light on the connections LMs draw between language phenomena. Task spaces are based on the interactions of the learning signals from different linguistic phenomena, which we assess via a method we call ‘similarity probing’. To disentangle the learning signals of linguistic phenomena, we further introduce a method called ‘fine-tuning via gradient differentials’ (FTGD). We apply our methods to language models of three different scales and find that larger models generalise better to overarching general concepts for linguistic tasks, making better use of their shared structure. Further, the distributedness of linguistic processing increases with pre-training through increased parameter sharing between related linguistic tasks. The overall generalisation patterns are mostly stable throughout training and not marked by incisive stages, potentially explaining the lack of successful curriculum strategies for LMs.

2023

Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution
Jaap Jumelet | Willem Zuidema
Findings of the Association for Computational Linguistics: EMNLP 2023

We present a setup for training, evaluating and interpreting neural language models, that uses artificial, language-like data. The data is generated using a massive probabilistic grammar (based on state-split PCFGs), that is itself derived from a large natural language corpus, but also provides us complete control over the generative process. We describe and release both grammar and corpus, and test for the naturalness of our generated data. This approach allows us to define closed-form expressions to efficiently compute exact lower bounds on obtainable perplexity using both causal and masked language modelling. Our results show striking differences between neural language modelling architectures and training objectives in how closely they allow approximating the lower bound on perplexity. Our approach also allows us to directly compare learned representations to symbolic rules in the underlying source. We experiment with various techniques for interpreting model behaviour and learning dynamics. With access to the underlying true source, our results show striking differences and outcomes in learning dynamics between different classes of words.

Feature Interactions Reveal Linguistic Structure in Language Models
Jaap Jumelet | Willem Zuidema
Findings of the Association for Computational Linguistics: ACL 2023

We study feature interactions in the context of feature attribution methods for post-hoc interpretability. In interpretability research, getting to grips with feature interactions is increasingly recognised as an important challenge, because interacting features are key to the success of neural networks. Feature interactions allow a model to build up hierarchical representations for its input, and might provide an ideal starting point for the investigation into linguistic structure in language models. However, uncovering the exact role that these interactions play is also difficult, and a diverse range of interaction attribution methods has been proposed. In this paper, we focus on the question of which of these methods most faithfully reflects the inner workings of the target models. We work out a grey box methodology, in which we train models to perfection on a formal language classification task, using PCFGs. We show that under specific configurations, some methods are indeed able to uncover the grammatical rules acquired by a model. Based on these findings we extend our evaluation to a case study on language models, providing novel insights into the linguistic structure that these models have acquired.

ChapGTP, ILLC’s Attempt at Raising a BabyLM: Improving Data Efficiency by Automatic Task Formation
Jaap Jumelet | Michael Hanna | Marianne de Heer Kloots | Anna Langedijk | Charlotte Pouw | Oskar van der Wal
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

Attribution and Alignment: Effects of Local Context Repetition on Utterance Production and Comprehension in Dialogue
Aron Molnar | Jaap Jumelet | Mario Giulianelli | Arabella Sinclair
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

Language models are often used as the backbone of modern dialogue systems. These models are pre-trained on large amounts of written fluent language. Repetition is typically penalised when evaluating language model generations. However, it is a key component of dialogue. Humans use local and partner specific repetitions; these are preferred by human users and lead to more successful communication in dialogue. In this study, we evaluate (a) whether language models produce human-like levels of repetition in dialogue, and (b) what are the processing mechanisms related to lexical re-use they use during comprehension. We believe that such joint analysis of model production and comprehension behaviour can inform the development of cognitively inspired dialogue generation systems.

Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Yonatan Belinkov | Sophie Hao | Jaap Jumelet | Najoung Kim | Arya McCarthy | Hosein Mohebbi
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

2022

Structural Persistence in Language Models: Priming as a Window into Abstract Language Representations
Arabella Sinclair | Jaap Jumelet | Willem Zuidema | Raquel Fernández
Transactions of the Association for Computational Linguistics, Volume 10

We investigate the extent to which modern neural language models are susceptible to structural priming, the phenomenon whereby the structure of a sentence makes the same structure more probable in a follow-up sentence. We explore how priming can be used to study the potential of these models to learn abstract structural information, which is a prerequisite for good performance on tasks that require natural language understanding skills. We introduce a novel metric and release Prime-LM, a large corpus where we control for various linguistic factors that interact with priming strength. We find that Transformer models indeed show evidence of structural priming, but also that the generalizations they learned are to some extent modulated by semantic information. Our experiments also show that the representations acquired by the models may not only encode abstract sequential structure but involve a certain level of hierarchical syntactic information. More generally, our study shows that the priming paradigm is a useful, additional tool for gaining insights into the capacities of language models and opens the door to future priming-based investigations that probe the model’s internal states.

The Birth of Bias: A case study on the evolution of gender bias in an English language model
Oskar Van Der Wal | Jaap Jumelet | Katrin Schulz | Willem Zuidema
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Detecting and mitigating harmful biases in modern language models are widely recognized as crucial, open problems. In this paper, we take a step back and investigate how language models come to be biased in the first place. We use a relatively small language model, using the LSTM architecture trained on an English Wikipedia corpus. With full access to the data and to the model parameters as they change during every step while training, we can map in detail how the representation of gender develops, what patterns in the dataset drive this, and how the model’s internal state relates to the bias in a downstream task (semantic textual similarity). We find that the representation of gender is dynamic and identify different phases during training. Furthermore, we show that gender information is represented increasingly locally in the input embeddings of the model and that, as a consequence, debiasing these can be effective in reducing the downstream bias. Monitoring the training dynamics allows us to detect an asymmetry in how the female and male gender are represented in the input embeddings. This is important, as it may cause naive mitigation strategies to introduce new undesirable biases. We discuss the relevance of the findings for mitigation strategies more generally and the prospects of generalizing our methods to larger language models, the Transformer architecture, other languages and other undesirable biases.

2021

Language Models Use Monotonicity to Assess NPI Licensing
Jaap Jumelet | Milica Denic | Jakub Szymanik | Dieuwke Hupkes | Shane Steinert-Threlkeld
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Language Modelling as a Multi-Task Problem
Lucas Weber | Jaap Jumelet | Elia Bruni | Dieuwke Hupkes
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

In this paper, we propose to study language modelling as a multi-task problem, bringing together three strands of research: multi-task learning, linguistics, and interpretability. Based on hypotheses derived from linguistic theory, we investigate whether language models adhere to learning principles of multi-task learning during training. To showcase the idea, we analyse the generalisation behaviour of language models as they learn the linguistic concept of Negative Polarity Items (NPIs). Our experiments demonstrate that a multi-task setting naturally emerges within the objective of the more general task of language modelling. We argue that this insight is valuable for multi-task learning, linguistics and interpretability research and can lead to exciting new findings in all three domains.

Attention vs non-attention for a Shapley-based explanation method
Tom Kersten | Hugh Mee Wong | Jaap Jumelet | Dieuwke Hupkes
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

The field of explainable AI has recently seen an explosion in the number of explanation methods for highly non-linear deep neural networks. The extent to which such methods – that are often proposed and tested in the domain of computer vision – are appropriate to address the explainability challenges in NLP is yet relatively unexplored. In this work, we consider Contextual Decomposition (CD) – a Shapley-based input feature attribution method that has been shown to work well for recurrent NLP models – and we test the extent to which it is useful for models that contain attention operations. To this end, we extend CD to cover the operations necessary for attention-based models. We then compare how long distance subject-verb relationships are processed by models with and without attention, considering a number of different syntactic structures in two different languages: English and Dutch. Our experiments confirm that CD can successfully be applied for attention-based models as well, providing an alternative Shapley-based attribution method for modern neural networks. In particular, using CD, we show that the English and Dutch models demonstrate similar processing behaviour, but that under the hood there are consistent differences between our attention and non-attention models.

2020

diagNNose: A Library for Neural Activation Analysis
Jaap Jumelet
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

In this paper we introduce diagNNose, an open source library for analysing the activations of deep neural networks. diagNNose contains a wide array of interpretability techniques that provide fundamental insights into the inner workings of neural networks. We demonstrate the functionality of diagNNose with a case study on subject-verb agreement within language models. diagNNose is available at https://github.com/i-machine-think/diagnnose.

2019

Analysing Neural Language Models: Contextual Decomposition Reveals Default Reasoning in Number and Gender Assignment
Jaap Jumelet | Willem Zuidema | Dieuwke Hupkes
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Extensive research has recently shown that recurrent neural language models are able to process a wide range of grammatical phenomena. How these models are able to perform these remarkable feats so well, however, is still an open question. To gain more insight into what information LSTMs base their decisions on, we propose a generalisation of Contextual Decomposition (GCD). In particular, this setup enables us to accurately distil which part of a prediction stems from semantic heuristics, which part truly emanates from syntactic cues and which part arises from the model biases themselves instead. We investigate this technique on tasks pertaining to syntactic agreement and co-reference resolution and discover that the model strongly relies on a default reasoning effect to perform these tasks.

2018

Do Language Models Understand Anything? On the Ability of LSTMs to Understand Negative Polarity Items
Jaap Jumelet | Dieuwke Hupkes
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

In this paper, we attempt to link the inner workings of a neural language model to linguistic theory, focusing on a complex phenomenon well discussed in formal linguistics: (negative) polarity items. We briefly discuss the leading hypotheses about the licensing contexts that allow negative polarity items and evaluate to what extent a neural language model has the ability to correctly process a subset of such constructions. We show that the model finds a relation between the licensing context and the negative polarity item and appears to be aware of the scope of this context, which we extract from a parse tree of the sentence. With this research, we hope to pave the way for other studies linking formal linguistics to deep learning.

Co-authors

Aaron Mueller 3

Francesca Padovani 3

Arabella Sinclair 3

Alex Warstadt 3

Yonatan Belinkov 2

Ryan Cotterell 2

Lucas Georges Gabriel Charpentier 2

Mustafa Omer Gul 2

Michael Hanna 2

Michael Y. Hu 2

Anna Langedijk 2

Raj Sanjay Shah 2

Shane Steinert-Threlkeld 2

Oskar Van Der Wal 2

Ethan Gotlieb Wilcox 2

Adina Williams 2

Afra Alishahi 1

Bastian Bunzeck 1

Julen Etxaniz 1

Raquel Fernández 1

Negar Foroutan 1

Abdellah Fourtassi 1

Diana Galván-Sosa 1

Mario Giulianelli 1

María Grandury 1

Faiz Ghifari Haznitrama 1

Vitalii Hirak 1

Marianne De Heer Kloots 1

Andy Lapastora 1

Mila Marcheva 1

Yevgen Matusevych 1

Arya D. McCarthy 1

Francois Meyer 1

Abhinav Patil 1

Charlotte Pouw 1

Laurent Prévot 1

Pouya Sadeghi 1

Suchir Salhan 1

Gabriele Sarti 1

Katrin Schulz 1

Bhargav Shandilya 1

Jakub Szymanik 1

Nikitas Theodoropoulos 1

Leonie Weissweiler 1

Clevis Willrich 1

Hugh Mee Wong 1

Venues