Joris Baan


2024

pdf bib
Interpreting Predictive Probabilities: Model Confidence or Human Label Variation?
Joris Baan | Raquel Fernández | Barbara Plank | Wilker Aziz
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

With the rise of increasingly powerful and user-facing NLP systems, there is growing interest in assessing whether they have a good _representation of uncertainty_ by evaluating the quality of their predictive distribution over outcomes. We identify two main perspectives that drive starkly different evaluation protocols. The first treats predictive probability as an indication of model confidence; the second as an indication of human label variation. We discuss their merits and limitations, and take the position that both are crucial for trustworthy and fair NLP systems, but that exploiting a single predictive distribution is limiting. We recommend tools and highlight exciting directions towards models with disentangled representations of uncertainty about predictions and uncertainty about human labels.

pdf bib
Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024)
Raúl Vázquez | Hande Celikkanat | Dennis Ulmer | Jörg Tiedemann | Swabha Swayamdipta | Wilker Aziz | Barbara Plank | Joris Baan | Marie-Catherine de Marneffe
Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024)

2023

pdf bib
What Comes Next? Evaluating Uncertainty in Neural Text Generators Against Human Production Variability
Mario Giulianelli | Joris Baan | Wilker Aziz | Raquel Fernández | Barbara Plank
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In Natural Language Generation (NLG) tasks, for any input, multiple communicative goals are plausible, and any goal can be put into words, or produced, in multiple ways. We characterise the extent to which human production varies lexically, syntactically, and semantically across four NLG tasks, connecting human production variability to aleatoric or data uncertainty. We then inspect the space of output strings shaped by a generation system’s predicted probability distribution and decoding algorithm to probe its uncertainty. For each test input, we measure the generator’s calibration to human production variability. Following this instance-level approach, we analyse NLG models and decoding strategies, demonstrating that probing a generator with multiple samples and, when possible, multiple references, provides the level of detail necessary to gain understanding of a model’s representation of uncertainty.

2022

pdf bib
Stop Measuring Calibration When Humans Disagree
Joris Baan | Wilker Aziz | Barbara Plank | Raquel Fernandez
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - including class frequency, ranking and entropy.

2019

pdf bib
On the Realization of Compositionality in Neural Networks
Joris Baan | Jana Leible | Mitja Nikolaus | David Rau | Dennis Ulmer | Tim Baumgärtner | Dieuwke Hupkes | Elia Bruni
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

We present a detailed comparison of two types of sequence to sequence models trained to conduct a compositional task. The models are architecturally identical at inference time, but differ in the way that they are trained: our baseline model is trained with a task-success signal only, while the other model receives additional supervision on its attention mechanism (Attentive Guidance), which has shown to be an effective method for encouraging more compositional solutions. We first confirm that the models with attentive guidance indeed infer more compositional solutions than the baseline, by training them on the lookup table task presented by Liska et al. (2019). We then do an in-depth analysis of the structural differences between the two model types, focusing in particular on the organisation of the parameter space and the hidden layer activations and find noticeable differences in both these aspects. Guided networks focus more on the components of the input rather than the sequence as a whole and develop small functional groups of neurons with specific purposes that use their gates more selectively. Results from parameter heat maps, component swapping and graph analysis also indicate that guided networks exhibit a more modular structure with a small number of specialized, strongly connected neurons.