Dieuwke Hupkes - ACL Anthology

Dieuwke Hupkes

2025

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur | Kartik Choudhary | Venkat Srinik Ramayapally | Sankaran Vaidyanathan | Dieuwke Hupkes
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

The LLM-as-a-judge paradigm offers a potential solution to scalability issues in human evaluation of large language models (LLMs), but there are still many open questions about its strengths, weaknesses, and potential biases. This study investigates thirteen models, ranging in size and family, as ‘judge models’ evaluating answers from nine base and instruction-tuned ‘exam-taker models’. We find that only the best (and largest) models show reasonable alignment with humans, though they still differ with up to 5 points from human-assigned scores. Our research highlights the need for alignment metrics beyond percent agreement, as judges with high agreement can still assign vastly different scores. We also find that smaller models and the lexical metric contains can provide a reasonable signal in ranking the exam-taker models. Further error analysis reveals vulnerabilities in judge models, such as sensitivity to prompt complexity and a bias toward leniency. Our findings show that even the best judge models differ from humans in this fairly sterile setup, indicating that caution is warranted when applying judge models in more complex scenarios.

Compute Optimal Scaling of Skills: Knowledge vs Reasoning
Nicholas Roberts | Niladri S. Chatterji | Sharan Narang | Mike Lewis | Dieuwke Hupkes
Findings of the Association for Computational Linguistics: ACL 2025

Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as ‘compute-optimally’ trading-off parameter count and dataset size, alongside a more recent growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation, and we answer this question in the affirmative: scaling laws are skill-dependent. Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation of different datamixes and find that, also when correcting for datamix differences, knowledge and code exhibit fundamental differences in scaling behaviour. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that a misspecified validation set can impact compute-optimal parameter count by nearly 50%, depending on its skill composition.

Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models
Lovish Madaan | David Esiobu | Pontus Stenetorp | Barbara Plank | Dieuwke Hupkes
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In the recent past, a popular way of evaluating natural language understanding (NLU), was to consider a model’s ability to perform natural language inference (NLI) tasks. In this paper, we investigate if NLI tasks, that are rarely used for LLM evaluation, can still be informative for evaluating LLMs. Focusing on five different NLI benchmarks across six models of different scales, we investigate if they are able to discriminate models of different size and quality and how their accuracies develop during training. Furthermore, we investigate the extent to which the softmax distributions of models align with human distributions in cases where statements are ambiguous or vague. Overall, our results paint a positive picture for the NLI tasks: we find that they are able to discriminate well between models at various stages of training, yet are not (all) saturated. Furthermore, we find that while the similarity of model distributions with human label distributions increases with scale, it is still much higher than the similarity between two populations of humans, making it a potentially interesting statistic to consider.

2024

Interpretability of Language Models via Task Spaces
Lucas Weber | Jaap Jumelet | Elia Bruni | Dieuwke Hupkes
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The usual way to interpret language models (LMs) is to test their performance on different benchmarks and subsequently infer their internal processes.In this paper, we present an alternative approach, concentrating on the _quality_ of LM processing, with a focus on their language abilities.To this end, we construct ‘linguistic task spaces’ – representations of an LM’s language conceptualisation – that shed light on the connections LMs draw between language phenomena.Task spaces are based on the interactions of the learning signals from different linguistic phenomena, which we assess via a method we call ‘similarity probing’.To disentangle the learning signals of linguistic phenomena, we further introduce a method called ‘fine-tuning via gradient differentials’ (FTGD).We apply our methods to language models of three different scales and find that larger models generalise better to overarching general concepts for linguistic tasks, making better use of their shared structure. Further, the distributedness of linguistic processing increases with pre-training through increased parameter sharing between related linguistic tasks. The overall generalisation patterns are mostly stable throughout training and not marked by incisive stages, potentially explaining the lack of successful curriculum strategies for LMs.

From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency
Xenia Ohmer | Elia Bruni | Dieuwke Hupkes
Computational Linguistics, Volume 50, Issue 4 - December 2024

The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what “understanding” means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes—inspired by Fregean senses—of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model’s multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.

Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
Dieuwke Hupkes | Verna Dankers | Khuyagbaatar Batsuren | Amirhossein Kazemnejad | Christos Christodoulopoulos | Mario Giulianelli | Ryan Cotterell
Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP

2023

Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP
Dieuwke Hupkes | Verna Dankers | Khuyagbaatar Batsuren | Koustuv Sinha | Amirhossein Kazemnejad | Christos Christodoulopoulos | Ryan Cotterell | Elia Bruni
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP

Memorisation Cartography: Mapping out the Memorisation-Generalisation Continuum in Neural Machine Translation
Verna Dankers | Ivan Titov | Dieuwke Hupkes
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

When training a neural network, it will quickly memorise some source-target mappings from your dataset but never learn some others. Yet, memorisation is not easily expressed as a binary feature that is good or bad: individual datapoints lie on a memorisation-generalisation continuum. What determines a datapoint’s position on that spectrum, and how does that spectrum influence neural models’ performance? We address these two questions for neural machine translation (NMT) models. We use the counterfactual memorisation metric to (1) build a resource that places 5M NMT datapoints on a memorisation-generalisation map, (2) illustrate how the datapoints’ surface-level characteristics and a models’ per-datum training signals are predictive of memorisation in NMT, (3) and describe the influence that subsets of that map have on NMT systems’ performance.

The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks
Kaiser Sun | Adina Williams | Dieuwke Hupkes
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

NLP models have progressed drastically in recent years, according to numerous datasets proposed to evaluate performance. Questions remain, however, about how particular dataset design choices may impact the conclusions we draw about model capabilities. In this work, we investigate this question in the domain of compositional generalization. We examine the performance of six modeling approaches across 4 datasets, split according to 8 compositional splitting strategies, ranking models by 18 compositional generalization splits in total. Our results show that: i) the datasets, although all designed to evaluate compositional generalization, rank modeling approaches differently; ii) datasets generated by humans align better with each other than with synthetic datasets, or than the latter among themselves; iii) generally, whether datasets are sampled from the same source is more predictive of the resulting model ranking than whether they maintain the same interpretation of compositionality; and iv) specific lexical items in dataset impacts the measurement consistency. Overall, our results demonstrate that much work remains to be done when it comes to assessing whether popular evaluation datasets measure what they intend to measure, and suggests that elucidating more rigorous standards for establishing the validity of evaluation sets could benefit the field.

Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses
Xenia Ohmer | Elia Bruni | Dieuwke Hupkes
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

At the staggering pace with which the capabilities of large language models (LLMs) are increasing, creating future-proof evaluation sets to assess their understanding becomes more and more challenging. In this paper, we propose a novel paradigm for evaluating LLMs which leverages the idea that correct world understanding should be consistent across different (Fregean) senses of the same meaning. Accordingly, we measure understanding not in terms of correctness but by evaluating consistency across multiple senses that are generated by the model itself. We showcase our approach by instantiating a test where the different senses are different languages, hence using multilingual self-consistency as a litmus test for the model’s understanding and simultaneously addressing the important topic of multilingualism. Taking one of the latest versions of ChatGPT as our object of study, we evaluate multilingual consistency for two different tasks across three different languages. We show that its multilingual consistency is still lacking, and that its task and world understanding are thus not language-independent. As our approach does not require any static evaluation corpora in languages other than English, it can easily and cheaply be extended to different languages and tasks and could become an integral part of future benchmarking efforts.

Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning
Lucas Weber | Elia Bruni | Dieuwke Hupkes
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

Finding the best way of adapting pre-trained language models to a task is a big challenge in current NLP. Just like the previous generation of task-tuned models (TT), models that are adapted to tasks via in-context-learning (ICL) or instruction tuning (IT) are robust in some setups, but not in others. Here, we present a detailed analysis of which design choices cause instabilities and inconsistencies in LLM predictions. First, we show how spurious correlations between input distributions and labels – a known issue in TT models – form only a minor problem for prompted models. Then we engage in a systematic, holistic evaluation of different factors that have been found to influence predictions in a prompting setup. We test all possible combinations of a range of factors on both vanilla and instruction-tuned LLMs of different scale, and statistically analyse the results to show which factors are the most influential, the most interactive or the most stable. From our results, we deduce which factors can be used without precautions, should be avoided or handled with care in most settings.

2022

Sparse Interventions in Language Models with Differentiable Masking
Nicola De Cao | Leon Schmid | Dieuwke Hupkes | Ivan Titov
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

There has been a lot of interest in understanding what information is captured by hidden representations of language models (LMs). Typically, interpretation methods i) do not guarantee that the model actually uses the information found to be encoded, and ii) do not discover small subsets of neurons responsible for a considered phenomenon. Inspired by causal mediation analysis, we propose a method that discovers a small subset of neurons within a neural LM responsible for a particular linguistic phenomenon, i.e., subsets causing a change in the corresponding token emission probabilities. We use a differentiable relaxation to approximately search through the combinatorial space. An L₀ regularization term ensures that the search converges to discrete and sparse solutions. We apply our method to analyze subject-verb number agreement and gender bias detection in LSTMs. We observe that it is fast and finds better solutions than alternatives such as REINFORCE and Integrated Gradients. Our experiments confirm that each of these phenomena is mediated through a small subset of neurons that do not play any other discernible role.

Text Characterization Toolkit (TCT)
Daniel Simig | Tianlu Wang | Verna Dankers | Peter Henderson | Khuyagbaatar Batsuren | Dieuwke Hupkes | Mona Diab
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations

We present a tool, Text Characterization Toolkit (TCT), that researchers can use to study characteristics of large datasets. Furthermore, such properties can lead to understanding the influence of such attributes on models’ behaviour. Traditionally, in most NLP research, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that – especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations – deeper results analysis should become the de-facto standard when presenting new models or benchmarks. TCT aims at filling this gap by facilitating such deeper analysis for datasets at scale, where datasets can be for training/development/evaluation. TCT includes both an easy-to-use tool, as well as off-the-shelf scripts that can be used for specific analyses. We also present use-cases from several different domains. TCT is used to predict difficult examples for given well-known trained models; TCT is also used to identify (potentially harmful) biases present in a dataset.

Can Transformers Process Recursive Nested Constructions, Like Humans?
Yair Lakretz | Théo Desbordes | Dieuwke Hupkes | Stanislas Dehaene
Proceedings of the 29th International Conference on Computational Linguistics

Recursive processing is considered a hallmark of human linguistic abilities. A recent study evaluated recursive processing in recurrent neural language models (RNN-LMs) and showed that such models perform below chance level on embedded dependencies within nested constructions – a prototypical example of recursion in natural language. Here, we study if state-of-the-art Transformer LMs do any better. We test eight different Transformer LMs on two different types of nested constructions, which differ in whether the embedded (inner) dependency is short or long range. We find that Transformers achieve near-perfect performance on short-range embedded dependencies, significantly better than previous results reported for RNN-LMs and humans. However, on long-range embedded dependencies, Transformers’ performance sharply drops below chance level. Remarkably, the addition of only three words to the embedded dependency caused Transformers to fall from near-perfect to below-chance performance. Taken together, our results reveal how brittle syntactic processing is in Transformers, compared to humans.

The Paradox of the Compositionality of Natural Language: A Neural Machine Translation Case Study
Verna Dankers | Elia Bruni | Dieuwke Hupkes
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Obtaining human-like performance in NLP is often argued to require compositional generalisation. Whether neural networks exhibit this ability is usually studied by training models on highly compositional synthetic data. However, compositionality in natural language is much more complex than the rigid, arithmetic-like version such data adheres to, and artificial compositionality tests thus do not allow us to determine how neural models deal with more realistic forms of compositionality. In this work, we re-instantiate three compositionality tests from the literature and reformulate them for neural machine translation (NMT).Our results highlight that: i) unfavourably, models trained on more data are more compositional; ii) models are sometimes less compositional than expected, but sometimes more, exemplifying that different levels of compositionality are required, and models are not always able to modulate between them correctly; iii) some of the non-compositional behaviours are mistakes, whereas others reflect the natural variation in data. Apart from an empirical study, our work is a call to action: we should rethink the evaluation of compositionality in neural networks and develop benchmarks using real data to evaluate compositionality on natural language, where composing meaning is not as straightforward as doing the math.

Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Jasmijn Bastings | Yonatan Belinkov | Yanai Elazar | Dieuwke Hupkes | Naomi Saphra | Sarah Wiegreffe
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

The Curious Case of Absolute Position Embeddings
Koustuv Sinha | Amirhossein Kazemnejad | Siva Reddy | Joelle Pineau | Dieuwke Hupkes | Adina Williams
Findings of the Association for Computational Linguistics: EMNLP 2022

Transformer language models encode the notion of word order using positional information. Most commonly, this positional information is represented by absolute position embeddings (APEs), that are learned from the pretraining data. However, in natural language, it is not absolute position that matters, but relative position, and the extent to which APEs can capture this type of information has not been studied. In this work, we observe that models trained with APE over-rely on positional information to the point that they break-down when subjected to sentences with shifted position information. Specifically, when models are subjected to sentences starting from a non-zero position (excluding the effect of priming), they exhibit noticeably degraded performance on zero- to full-shot tasks, across a range of model families and model sizes. Our findings raise questions about the efficacy of APEs to model the relativity of position information, and invite further introspection on the sentence and word order processing strategies employed by these models.

2021

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little
Koustuv Sinha | Robin Jia | Dieuwke Hupkes | Joelle Pineau | Adina Williams | Douwe Kiela
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks—including tasks specifically designed to be challenging for models that ignore word order. Our models perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.

Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Jasmijn Bastings | Yonatan Belinkov | Emmanuel Dupoux | Mario Giulianelli | Dieuwke Hupkes | Yuval Pinter | Hassan Sajjad
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Attention vs non-attention for a Shapley-based explanation method
Tom Kersten | Hugh Mee Wong | Jaap Jumelet | Dieuwke Hupkes
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

The field of explainable AI has recently seen an explosion in the number of explanation methods for highly non-linear deep neural networks. The extent to which such methods – that are often proposed and tested in the domain of computer vision – are appropriate to address the explainability challenges in NLP is yet relatively unexplored. In this work, we consider Contextual Decomposition (CD) – a Shapley-based input feature attribution method that has been shown to work well for recurrent NLP models – and we test the extent to which it is useful for models that contain attention operations. To this end, we extend CD to cover the operations necessary for attention-based models. We then compare how long distance subject-verb relationships are processed by models with and without attention, considering a number of different syntactic structures in two different languages: English and Dutch. Our experiments confirm that CD can successfully be applied for attention-based models as well, providing an alternative Shapley-based attribution method for modern neural networks. In particular, using CD, we show that the English and Dutch models demonstrate similar processing behaviour, but that under the hood there are consistent differences between our attention and non-attention models.

Generalising to German Plural Noun Classes, from the Perspective of a Recurrent Neural Network
Verna Dankers | Anna Langedijk | Kate McCurdy | Adina Williams | Dieuwke Hupkes
Proceedings of the 25th Conference on Computational Natural Language Learning

Inflectional morphology has since long been a useful testing ground for broader questions about generalisation in language and the viability of neural network models as cognitive models of language. Here, in line with that tradition, we explore how recurrent neural networks acquire the complex German plural system and reflect upon how their strategy compares to human generalisation and rule-based models of this system. We perform analyses including behavioural experiments, diagnostic classification, representation analysis and causal interventions, suggesting that the models rely on features that are also key predictors in rule-based models of German plurals. However, the models also display shortcut learning, which is crucial to overcome in search of more cognitively plausible generalisation behaviour.

Co-evolution of language and agents in referential games
Gautier Dagan | Dieuwke Hupkes | Elia Bruni
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Referential games offer a grounded learning environment for neural agents which accounts for the fact that language is functionally used to communicate. However, they do not take into account a second constraint considered to be fundamental for the shape of human language: that it must be learnable by new language learners. Cogswell et al. (2019) introduced cultural transmission within referential games through a changing population of agents to constrain the emerging language to be learnable. However, the resulting languages remain inherently biased by the agents’ underlying capabilities. In this work, we introduce Language Transmission Simulator to model both cultural and architectural evolution in a population of agents. As our core contribution, we empirically show that the optimal situation is to take into account also the learning biases of the language learners and thus let language and agents co-evolve. When we allow the agent population to evolve through architectural evolution, we achieve across the board improvements on all considered metrics and surpass the gains made with cultural transmission. These results stress the importance of studying the underlying agent architecture and pave the way to investigate the co-evolution of language and agent in language emergence studies.

Language Modelling as a Multi-Task Problem
Lucas Weber | Jaap Jumelet | Elia Bruni | Dieuwke Hupkes
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

In this paper, we propose to study language modelling as a multi-task problem, bringing together three strands of research: multi-task learning, linguistics, and interpretability. Based on hypotheses derived from linguistic theory, we investigate whether language models adhere to learning principles of multi-task learning during training. To showcase the idea, we analyse the generalisation behaviour of language models as they learn the linguistic concept of Negative Polarity Items (NPIs). Our experiments demonstrate that a multi-task setting naturally emerges within the objective of the more general task of language modelling. We argue that this insight is valuable for multi-task learning, linguistics and interpretability research and can lead to exciting new findings in all three domains.

Language Models Use Monotonicity to Assess NPI Licensing
Jaap Jumelet | Milica Denic | Jakub Szymanik | Dieuwke Hupkes | Shane Steinert-Threlkeld
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

Location Attention for Extrapolation to Longer Sequences
Yann Dubois | Gautier Dagan | Dieuwke Hupkes | Elia Bruni
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural networks are surprisingly good at interpolating and perform remarkably well when the training set examples resemble those in the test set. However, they are often unable to extrapolate patterns beyond the seen data, even when the abstractions required for such patterns are simple. In this paper, we first review the notion of extrapolation, why it is important and how one could hope to tackle it. We then focus on a specific type of extrapolation which is especially useful for natural language processing: generalization to sequences that are longer than the training ones. We hypothesize that models with a separate content- and location-based attention are more likely to extrapolate than those with common attention mechanisms. We empirically support our claim for recurrent seq2seq models with our proposed attention on variants of the Lookup Table task. This sheds light on some striking failures of neural models for sequences and on possible methods to approaching such issues.

The Grammar of Emergent Languages
Oskar van der Wal | Silvan de Boer | Elia Bruni | Dieuwke Hupkes
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we consider the syntactic properties of languages emerged in referential games, using unsupervised grammar induction (UGI) techniques originally designed to analyse natural language. We show that the considered UGI techniques are appropriate to analyse emergent languages and we then study if the languages that emerge in a typical referential game setup exhibit syntactic structure, and to what extent this depends on the maximum message length and number of symbols that the agents are allowed to use. Our experiments demonstrate that a certain message length and vocabulary size are required for structure to emerge, but they also illustrate that more sophisticated game scenarios are required to obtain syntactic properties more akin to those observed in human language. We argue that UGI techniques should be part of the standard toolkit for analysing emergent languages and release a comprehensive library to facilitate such analysis for future researchers.

Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Afra Alishahi | Yonatan Belinkov | Grzegorz Chrupała | Dieuwke Hupkes | Yuval Pinter | Hassan Sajjad
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Internal and external pressures on language emergence: least effort, object constancy and frequency
Diana Rodríguez Luna | Edoardo Maria Ponti | Dieuwke Hupkes | Elia Bruni
Findings of the Association for Computational Linguistics: EMNLP 2020

In previous work, artificial agents were shown to achieve almost perfect accuracy in referential games where they have to communicate to identify images. Nevertheless, the resulting communication protocols rarely display salient features of natural languages, such as compositionality. In this paper, we propose some realistic sources of pressure on communication that avert this outcome. More specifically, we formalise the principle of least effort through an auxiliary objective. Moreover, we explore several game variants, inspired by the principle of object constancy, in which we alter the frequency, position, and luminosity of the objects in the images. We perform an extensive analysis on their effect through compositionality metrics, diagnostic classifiers, and zero-shot evaluation. Our findings reveal that the proposed sources of pressure result in emerging languages with less redundancy, more focus on high-level conceptual information, and better abilities of generalisation. Overall, our contributions reduce the gap between emergent and natural languages.

2019

Analysing Neural Language Models: Contextual Decomposition Reveals Default Reasoning in Number and Gender Assignment
Jaap Jumelet | Willem Zuidema | Dieuwke Hupkes
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Extensive research has recently shown that recurrent neural language models are able to process a wide range of grammatical phenomena. How these models are able to perform these remarkable feats so well, however, is still an open question. To gain more insight into what information LSTMs base their decisions on, we propose a generalisation of Contextual Decomposition (GCD). In particular, this setup enables us to accurately distil which part of a prediction stems from semantic heuristics, which part truly emanates from syntactic cues and which part arise from the model biases themselves instead. We investigate this technique on tasks pertaining to syntactic agreement and co-reference resolution and discover that the model strongly relies on a default reasoning effect to perform these tasks.

Assessing Incrementality in Sequence-to-Sequence Models
Dennis Ulmer | Dieuwke Hupkes | Elia Bruni
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

Since their inception, encoder-decoder models have successfully been applied to a wide array of problems in computational linguistics. The most recent successes are predominantly due to the use of different variations of attention mechanisms, but their cognitive plausibility is questionable. In particular, because past representations can be revisited at any point in time, attention-centric methods seem to lack an incentive to build up incrementally more informative representations of incoming sentences. This way of processing stands in stark contrast with the way in which humans are believed to process language: continuously and rapidly integrating new information as it is encountered. In this work, we propose three novel metrics to assess the behavior of RNNs with and without an attention mechanism and identify key differences in the way the different model types process sentences.

The emergence of number and syntax units in LSTM language models
Yair Lakretz | German Kruszewski | Theo Desbordes | Dieuwke Hupkes | Stanislas Dehaene | Marco Baroni
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Recent work has shown that LSTMs trained on a generic language modeling objective capture syntax-sensitive generalizations such as long-distance number agreement. We have however no mechanistic understanding of how they accomplish this remarkable feat. Some have conjectured it depends on heuristics that do not truly take hierarchical structure into account. We present here a detailed study of the inner mechanics of number tracking in LSTMs at the single neuron level. We discover that long-distance number information is largely managed by two “number units”. Importantly, the behaviour of these units is partially controlled by other units independently shown to track syntactic structure. We conclude that LSTMs are, to some extent, implementing genuinely syntactic processing mechanisms, paving the way to a more general understanding of grammatical encoding in LSTMs.

Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
Tal Linzen | Grzegorz Chrupała | Yonatan Belinkov | Dieuwke Hupkes
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Transcoding Compositionally: Using Attention to Find More Generalizable Solutions
Kris Korrel | Dieuwke Hupkes | Verna Dankers | Elia Bruni
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

While sequence-to-sequence models have shown remarkable generalization power across several natural language tasks, their construct of solutions are argued to be less compositional than human-like generalization. In this paper, we present seq2attn, a new architecture that is specifically designed to exploit attention to find compositional patterns in the input. In seq2attn, the two standard components of an encoder-decoder model are connected via a transcoder, that modulates the information flow between them. We show that seq2attn can successfully generalize, without requiring any additional supervision, on two tasks which are specifically constructed to challenge the compositional skills of neural networks. The solutions found by the model are highly interpretable, allowing easy analysis of both the types of solutions that are found and potential causes for mistakes. We exploit this opportunity to introduce a new paradigm to test compositionality that studies the extent to which a model overgeneralizes when confronted with exceptions. We show that seq2attn exhibits such overgeneralization to a larger degree than a standard sequence-to-sequence model.

The Fast and the Flexible: Training Neural Networks to Learn to Follow Instructions from Small Data
Rezka Leonandya | Dieuwke Hupkes | Elia Bruni | Germán Kruszewski
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

Learning to follow human instructions is a long-pursued goal in artificial intelligence. The task becomes particularly challenging if no prior knowledge of the employed language is assumed while relying only on a handful of examples to learn from. Work in the past has relied on hand-coded components or manually engineered features to provide strong inductive biases that make learning in such situations possible. In contrast, here we seek to establish whether this knowledge can be acquired automatically by a neural network system through a two phase training procedure: A (slow) offline learning stage where the network learns about the general structure of the task and a (fast) online adaptation phase where the network learns the language of a new given speaker. Controlled experiments show that when the network is exposed to familiar instructions but containing novel words, the model adapts very efficiently to the new vocabulary. Moreover, even for human speakers whose language usage can depart significantly from our artificial training language, our network can still make use of its automatically acquired inductive bias to learn to follow instructions more effectively.

On the Realization of Compositionality in Neural Networks
Joris Baan | Jana Leible | Mitja Nikolaus | David Rau | Dennis Ulmer | Tim Baumgärtner | Dieuwke Hupkes | Elia Bruni
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

We present a detailed comparison of two types of sequence to sequence models trained to conduct a compositional task. The models are architecturally identical at inference time, but differ in the way that they are trained: our baseline model is trained with a task-success signal only, while the other model receives additional supervision on its attention mechanism (Attentive Guidance), which has shown to be an effective method for encouraging more compositional solutions. We first confirm that the models with attentive guidance indeed infer more compositional solutions than the baseline, by training them on the lookup table task presented by Liska et al. (2019). We then do an in-depth analysis of the structural differences between the two model types, focusing in particular on the organisation of the parameter space and the hidden layer activations and find noticeable differences in both these aspects. Guided networks focus more on the components of the input rather than the sequence as a whole and develop small functional groups of neurons with specific purposes that use their gates more selectively. Results from parameter heat maps, component swapping and graph analysis also indicate that guided networks exhibit a more modular structure with a small number of specialized, strongly connected neurons.

2018

Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information
Mario Giulianelli | Jack Harding | Florian Mohnert | Dieuwke Hupkes | Willem Zuidema
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

How do neural language models keep track of number agreement between subject and verb? We show that ‘diagnostic classifiers’, trained to predict number from the internal states of a language model, provide a detailed understanding of how, when, and where this information is represented. Moreover, they give us insight into when and where number information is corrupted in cases where the language model ends up making agreement errors. To demonstrate the causal role played by the representations we find, we then use agreement information to influence the course of the LSTM during the processing of difficult sentences. Results from such an intervention reveal a large increase in the language model’s accuracy. Together, these results show that diagnostic classifiers give us an unrivalled detailed look into the representation of linguistic information in neural models, and demonstrate that this knowledge can be used to improve their performance.

Do Language Models Understand Anything? On the Ability of LSTMs to Understand Negative Polarity Items
Jaap Jumelet | Dieuwke Hupkes
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

In this paper, we attempt to link the inner workings of a neural language model to linguistic theory, focusing on a complex phenomenon well discussed in formal linguistics: (negative) polarity items. We briefly discuss the leading hypotheses about the licensing contexts that allow negative polarity items and evaluate to what extent a neural language model has the ability to correctly process a subset of such constructions. We show that the model finds a relation between the licensing context and the negative polarity item and appears to be aware of the scope of this context, which we extract from a parse tree of the sentence. With this research, we hope to pave the way for other studies linking formal linguistics to deep learning.

Analysing the potential of seq-to-seq models for incremental interpretation in task-oriented dialogue
Dieuwke Hupkes | Sanne Bouwmeester | Raquel Fernández
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

We investigate how encoder-decoder models trained on a synthetic dataset of task-oriented dialogues process disfluencies, such as hesitations and self-corrections. We find that, contrary to earlier results, disfluencies have very little impact on the task success of seq-to-seq models with attention. Using visualisations and diagnostic classifiers, we analyse the representations that are incrementally built by the model, and discover that models develop little to no awareness of the structure of disfluencies. However, adding disfluencies to the data appears to help the model create clearer representations overall, as evidenced by the attention patterns the different models exhibit.

2016

POS-tagging of Historical Dutch
Dieuwke Hupkes | Rens Bod
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a study of the adequacy of current methods that are used for POS-tagging historical Dutch texts, as well as an exploration of the influence of employing different techniques to improve upon the current practice. The main focus of this paper is on (unsupervised) methods that are easily adaptable for different domains without requiring extensive manual input. It was found that modernising the spelling of corpora prior to tagging them with a tagger trained on contemporary Dutch results in a large increase in accuracy, but that spelling normalisation alone is not sufficient to obtain state-of-the-art results. The best results were achieved by training a POS-tagger on a corpus automatically annotated by projecting (automatically assigned) POS-tags via word alignments from a contemporary corpus. This result is promising, as it was reached without including any domain knowledge or context dependencies. We argue that the insights of this study combined with semi-supervised learning techniques for domain adaptation can be used to develop a general-purpose diachronic tagger for Dutch.

Co-authors

Khuyagbaatar Batsuren 3

Mario Giulianelli 3

Amirhossein Kazemnejad 3

Koustuv Sinha 3

Jasmijn Bastings 2

Christos Christodoulopoulos 2

Grzegorz Chrupała 2

Ryan Cotterell 2

Gautier Dagan 2

Stanislas Dehaene 2

Théo Desbordes 2

Germán Kruszewski 2

Joelle Pineau 2

Hassan Sajjad 2

Willem Zuidema 2

Afra Alishahi 1

Tim Baumgärtner 1

Sanne Bouwmeester 1

Niladri S. Chatterji 1

Kartik Choudhary 1

Nicola De Cao 1

Emmanuel Dupoux 1

Raquel Fernández 1

Peter Henderson 1

Anna Langedijk 1

Rezka Leonandya 1

Lovish Madaan 1

Florian Mohnert 1

Sharan Narang 1

Mitja Nikolaus 1

Barbara Plank 1

Edoardo Maria Ponti 1

Venkat Srinik Ramayapally 1

Nicholas Roberts 1

Diana Rodríguez Luna 1

Shane Steinert-Threlkeld 1

Pontus Stenetorp 1

Jakub Szymanik 1

Aman Singh Thakur 1

Sankaran Vaidyanathan 1

Oskar Van Der Wal 1

Sarah Wiegreffe 1

Hugh Mee Wong 1

Silvan de Boer 1

Venues