Exploring Predictive Uncertainty and Calibration in NLP: A Study on the Impact of Method & Data Scarcity

We investigate the problem of determining the predictive confidence (or, conversely, uncertainty) of a neural classifier through the lens of low-resource languages. By training models on sub-sampled datasets in three different languages, we assess the quality of estimates from a wide array of approaches and their dependence on the amount of available data. We find that while approaches based on pre-trained models and ensembles achieve the best results overall, the quality of uncertainty estimates can surprisingly suffer with more data. We also perform a qualitative analysis of uncertainties on sequences, discovering that a model’s total uncertainty seems to be influenced to a large degree by its data uncertainty, not model uncertainty. All model implementations are open-sourced in a software package.


Introduction
In 1877, Italian astronomer Giovanni Schiaparelli described the existence of "canals" on the surface of Mars, a finding that was described by a contemporary as a "very important and perplexing [problem]" (Young, 1895; p. 355). It later turned out that the structures, originally termed canali in Italian, were simply mistranslated, since the word can also refer to (natural) channels of water. By that point, however, the possibility of irrigation on the red planet had already seeped into popular culture, and is still being referenced to this day. In the meantime, translation has become a task that is increasingly performed by neural networks, which, in the face of a word such as canali, might simply fall back on the most likely translation given the training data. And while the error above seems fairly innocuous, there are more safety-critical scenarios in which such ambiguities matter and can potentially have negative real-world consequences. Besides translation, there also exist other language-based problems in which the uncertainty surrounding a model prediction can convey critical information, such as medical analyses (Esteva et al., 2019), legal case data (Frankenreiter and Livermore, 2020) or analyzing job applications (Zimmermann et al., 2016). Determining model confidence, or, conversely, uncertainty, is consequently an important means to instill trust in end users and avert harm (Bhatt et al., 2021; Jacovi et al., 2021). While there exist many works on images (Lakshminarayanan et al., 2017; Snoek et al., 2019) and tabular data (Ruhe et al., 2019; Ulmer et al., 2020; Malinin et al., 2021), the quality of uncertainty estimates provided by neural networks remains underexplored in Natural Language Processing (NLP). In addition, as model underspecification due to insufficient data presents a risk (D'Amour et al., 2020), the increasing interest in less-researched languages with limited resources raises the question of how reliably uncertain predictions can be identified. This lets us
pose the following research questions:

Contributions
1. We address these questions by conducting a comprehensive empirical study of eight different models for uncertainty estimation for classification and evaluate their effectiveness on three languages spanning distinct NLP tasks, involving sequence labeling and classification.
2. We show that while approaches based on pre-trained models and ensembles achieve the best results overall, the quality of uncertainty estimates on OOD data can become worse using more data.
3. In a qualitative analysis, we also discover that a model's total uncertainty seems to mostly consist of its data uncertainty.
4. We make our experimental code and model implementations available open-source in separate repositories, aiding future research in this direction.

Related Work
Notions of Uncertainty In the absence of additional information, the introductory example canali has two valid translations: canals and channels. This is an instance of data or aleatoric uncertainty, describing the irreducible ambiguity and noise in the data generating process. The other notion is model or epistemic uncertainty: When fitting parameters, there remains a degree of incertitude about the optimal values due to finite data. We can usually reduce this uncertainty by amassing more data, for instance by supplying a translation system with other meanings of canali. These two concepts form the basis for uncertainty estimation in Machine Learning (Der Kiureghian and Ditlevsen, 2009; Hüllermeier and Waegeman, 2021).
Uncertainty in NLP Since the literature on uncertainty estimation for image data is extensive, we dedicate this part to related work in Natural Language Processing. There are several examples trying to incorporate uncertainty into models to either increase trustworthiness or performance, for instance in Machine Translation (Glushkova et al., 2021; Wei et al., 2020; Xiao et al., 2020), Summarization (Gidiotis and Tsoumakas, 2021), Information Retrieval (Penha and Hauff, 2021) and Active Learning (Siddhant and Lipton, 2018). To obtain uncertainties, Gan et al. (2017) use Stochastic-Gradient Langevin Dynamics (Welling and Teh, 2011) to draw posterior weight samples for an LSTM. Shelmanov et al. (2021) apply MC Dropout with determinantal point processes to transformers for Natural Language Understanding. Several authors have also highlighted connections of multi-head attention to Bayesian inference (An et al., 2020; Hron et al., 2020). Shen et al. (2020) attempt to transfer the idea of prior networks (Malinin and Gales, 2018; Joo et al., 2020) onto recurrent neural networks. Another line of work investigates uncertainty properties themselves; for instance, Chen and Ji (2022) try to explain uncertainty estimates for BERT and RoBERTa. Another example is given by Xiao and Wang (2021), who use predictive uncertainty to explain hallucination in Language Generation. Xu et al. (2020) similarly use uncertainty as a tool to investigate challenges of neural summarization approaches. Lastly, due to the way that uncertainty estimates are evaluated, investigating distributional shift in NLP is also of interest, for instance through the work of Arora et al. (2021), of Kamath et al. (2020), who focus on question answering, and of Tan et al. (2019) for text classification. The most similar work to ours is the text classification uncertainty benchmark by Van Landeghem et al. (2022); however, they do not consider the impact of data or language, and test a different selection of models.
Calibration Calibration denotes the property of a model's output to accurately reflect the true chance of a correct prediction, i.e. predicting a class with a confidence of 90% should yield the correct prediction for 90% of similar inputs, when repeated. There have been several studies testing this property in modern neural networks (Guo et al., 2017; Nixon et al., 2019; Minderer et al., 2021; Wang et al., 2021b) and proposing ways to improve it (Thulasidasan et al., 2019; Mukhoti et al., 2020; Karandikar et al., 2021; Zhao et al., 2021a; Tian et al., 2021). In NLP, calibration has been explored for pre-trained models (Desai and Durrett, 2020), including on out-of-distribution data (Dan and Roth, 2021), for neural machine translation (Wang et al., 2020) and for question-answering (Jiang et al., 2021). Likewise, authors have proposed several calibration schemes, for instance by focusing on classes of interest (Jagannatha and Yu, 2020), generating synthetic examples for regularization (Kong et al., 2020), using richer input representations (Zhang et al., 2021) and adapting prompts in a zero-shot setting (Zhao et al., 2021b).

Models
We choose a variety of models that cover a range of different approaches based on the two most prominently used architectures in NLP: Long Short-Term Memory networks (LSTMs; Hochreiter and Schmidhuber, 1997) and transformers (Vaswani et al., 2017). Inside the first family, we use the Variational LSTM (Gal and Ghahramani, 2016b) based on MC Dropout (Gal and Ghahramani, 2016a), the Bayesian LSTM (Fortunato et al., 2017) implementing Bayes-by-backprop (Blundell et al., 2015) and the ST-τ LSTM (Wang et al., 2021a), modelling transitions in a finite-state automaton, as well as an ensemble (Lakshminarayanan et al., 2017). In the second family, we count the Variational Transformer (Xiao et al., 2020), also using MC Dropout, the SNGP Transformer (Liu et al., 2022), using a Gaussian Process output layer, and the Deep Deterministic Uncertainty transformer (DDU; Mukhoti et al., 2021), fitting a Gaussian mixture model on extracted features. We elaborate on implementation details in Appendix C.1.

Uncertainty Metrics
We employ the following metrics to quantify confidence or uncertainty. In all cases, lower values indicate lower confidence / certainty and, conversely, higher values mean higher confidence / certainty. The metrics were chosen either due to their frequent use in the literature or because they try to capture uncertainty in a novel way.

Single prediction metrics
We distinguish between metrics suitable for models using only a single prediction (or using the mean of multiple predictions, e.g. for an ensemble). The most straightforward of them is the maximum softmax probability by Hendrycks and Gimpel (2017). A variant of this is the softmax-gap, measuring the difference between the two largest predicted probabilities (Tagasovska and Lopez-Paz, 2019). Another common metric, predictive entropy, involves measuring the Shannon entropy of the output distribution, which is maximized for a uniform prediction: $\mathbb{H}[p(y \mid x)] = -\sum_{k=1}^{K} p(y = k \mid x) \log p(y = k \mid x)$. Lastly, we consider the Dempster-Shafer metric (Sensoy et al., 2018), defined as $K / (K + \sum_{k=1}^{K} \exp(z_k))$, where $z_k$ denotes the logit corresponding to class $k$. It has been shown that probabilities for (ReLU) networks tend to saturate in the limit (Hein et al., 2019; Ulmer and Cinà, 2021), and since this metric considers logits, it might provide more informative estimates on OOD data.
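As an illustration of the single-prediction metrics described above, the following is a minimal sketch in plain Python. The function names are hypothetical and not taken from the released software package, and the functions return the raw scores without any sign flip that might be needed to match a "higher means more confident" convention.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_prob(probs):
    """Maximum softmax probability (a confidence score)."""
    return max(probs)


def softmax_gap(probs):
    """Difference between the two largest predicted probabilities."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def predictive_entropy(probs):
    """Shannon entropy of the output distribution (an uncertainty score)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dempster_shafer(logits):
    """K / (K + sum_k exp(z_k)), computed on logits rather than probabilities."""
    num_classes = len(logits)
    return num_classes / (num_classes + sum(math.exp(z) for z in logits))


logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
scores = {
    "max_prob": max_prob(probs),
    "softmax_gap": softmax_gap(probs),
    "predictive_entropy": predictive_entropy(probs),
    "dempster_shafer": dempster_shafer(logits),
}
```

Because the Dempster-Shafer score is computed on logits, it keeps changing with the logit magnitudes even when the softmax output has saturated.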

Multiple prediction metrics
For some of the included models, we can express uncertainty as some score based on a number of predicted distributions, e.g. from different ensemble members or forward passes for MC Dropout. Here we use the expectation with respect to the weight posterior to express the aggregation of multiple predictions, which will simply be evaluated using the mean of a number of Monte Carlo samples in practice. A simple uncertainty metric on this basis is the predictive variance between predictions for a class, where the expectation is evaluated over multiple sets of parameters, e.g. stemming from different dropout masks. Another possibility lies in using the mutual information between the label and model parameters given the data and input sample, which was introduced by Smith and Gal (2018): $\mathbb{I}(y, \theta \mid \mathcal{D}, x) = \mathbb{H}\big[\mathbb{E}_{p(\theta \mid \mathcal{D})}[p(y \mid x, \theta)]\big] - \mathbb{E}_{p(\theta \mid \mathcal{D})}\big[\mathbb{H}[p(y \mid x, \theta)]\big]$ (1), where $\mathbb{H}$ denotes the Shannon entropy as used for predictive entropy. The two terms of this equation can be identified as the total entropy and the aleatoric uncertainty, respectively. In theory, the remaining epistemic uncertainty of the model, in the form of the mutual information, should be particularly high on OOD inputs.
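The multi-prediction metrics can be sketched as follows, with the weight-posterior expectation replaced by a mean over Monte Carlo samples. This is only an illustrative sketch: the function names are hypothetical, and averaging the per-class variances into one score is an assumption made here, not something the text specifies.

```python
import math


def entropy(probs):
    """Shannon entropy of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_prediction(samples):
    """Monte Carlo estimate of the expected predictive distribution."""
    n = len(samples)
    num_classes = len(samples[0])
    return [sum(s[k] for s in samples) / n for k in range(num_classes)]


def predictive_variance(samples):
    """Variance of the class probabilities across samples.

    Assumption: the per-class variances are averaged into a single score.
    """
    mean = mean_prediction(samples)
    n = len(samples)
    per_class = [
        sum((s[k] - mean[k]) ** 2 for s in samples) / n for k in range(len(mean))
    ]
    return sum(per_class) / len(per_class)


def mutual_information(samples):
    """Total entropy minus expected entropy (the aleatoric term)."""
    total = entropy(mean_prediction(samples))
    aleatoric = sum(entropy(s) for s in samples) / len(samples)
    return total - aleatoric


# Each inner list is one predicted distribution, e.g. from one dropout mask.
agreeing = [[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]]
disagreeing = [[1.0, 0.0], [0.0, 1.0]]
```

When all samples agree, both scores vanish even though the shared prediction may itself be uncertain; when confident samples contradict each other, the mutual information reaches its maximum of log 2 for two classes.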
Model-specific metrics Lastly, DDU by Mukhoti et al. (2021) uses the log-probability of the last layer network activation under a Gaussian Mixture Model fitted on the training set as an additional metric. Since all other models are trained or fine-tuned as classifiers, they are not able to assign log-probabilities to sequences.
Uncertainty for sequences Since some tasks require predictions for every time step of a sequence, we determine the uncertainty of a whole sequence in these cases by taking the mean over all step-wise uncertainties. A more principled approach for sequences is for instance provided by Malinin and Gales (2021), and we leave the extension and exploration of such methods for different uncertainty metrics, models and tasks to future work.
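The sequence-level aggregation amounts to a plain mean over time steps, as in the following sketch. The optional `mask` argument is a hypothetical addition for skipping padded or ignored positions and is not described in the text.

```python
def sequence_uncertainty(step_uncertainties, mask=None):
    """Mean of the step-wise uncertainties of one sequence.

    `mask` (hypothetical) marks which time steps count, e.g. to skip padding.
    """
    if mask is None:
        mask = [1] * len(step_uncertainties)
    kept = [u for u, m in zip(step_uncertainties, mask) if m]
    if not kept:
        raise ValueError("sequence has no unmasked time steps")
    return sum(kept) / len(kept)
```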

Dataset Selection & Creation
In-distribution training sets We choose three different languages, namely English (Clinc Plus; Larson et al., 2019), Danish in the form of the Dan+ dataset (Plank et al., 2020) based on news texts from PAROLE-DK (Bilgram and Keson, 1998), and Finnish (UD Treebank; Haverinen et al., 2014; Pyysalo et al., 2015; Kanerva and Ginter, 2022), corresponding to NLP tasks such as sequence classification, named entity recognition and part-of-speech tagging. An overview of the data used is given in Table 1. We use standardized low-resource languages in the case of Finnish and Danish, and simulate a low-resource setting using English data. Starting with a sufficiently-sized training set and then sub-sampling allows us to create training sets of arbitrary sizes. By using languages from different families, we hope to be able to draw conclusions that generalize beyond a single language. We employ a specific sampling scheme that tries to maintain the sequence length and class distribution of the original corpus, which we explain and verify in Appendix A.2.
Out-of-distribution Test Sets While it is possible to create OOD text by, for instance, withholding classes from the training set or appending text from a different source (Arora et al., 2021), we choose to pick entirely new OOD test sets that are qualitatively different: out-of-scope voice commands by users in Larson et al. (2019), the Twitter split of the Dan+ dataset (Plank et al., 2020), and the Finnish OOD treebank (Kanerva and Ginter, 2022). In similar works for the image domain, OOD test sets are often chosen to be convincingly different from the training distribution, for instance MNIST versus Fashion-MNIST (Nalisnick et al., 2019; van Amersfoort et al., 2021). While there exist a variety of formalizations of types of distributional shift (Moreno-Torres et al., 2012; Wald et al., 2021; Arora et al., 2021; Federici et al., 2021), it is often hard to determine if and what kind of shift is taking place. Winkens et al. (2020) define near OOD as a scenario in which the inlier and outlier distribution are meaningfully related, and far OOD as a case in which they are unrelated. Unfortunately, this distinction is somewhat arbitrary and hard to apply in a language context, where OOD could be defined as anything ranging from a different language or dialect to a different demographic of an author or speaker or a new genre. Therefore, we use a similar methodology to the validation of the sub-sampled training sets to make an argument that the selected OOD splits are sufficiently different in nature from the training splits. The exact procedure, along with some more detailed results, is described in Appendix A.3.

Model Training
Unfortunately, our datasets do not contain enough data to train transformer-based models from scratch. Therefore, we only fully train LSTM-based models, while using pre-trained transformers, namely BERT (English; Devlin et al., 2019), Danish BERT (Hvingelby et al., 2020), and FinBERT (Finnish; Virtanen et al., 2019), for the other approaches. The whole procedure is depicted in Figure 1. The way we optimize models is described in Appendix C.3. We list training hardware and hyperparameter information in Appendix C.2, with the environmental impact described in Appendix C.5.

Evaluation
Apart from evaluating models on task performance, we also evaluate their calibration and uncertainty quality as described below, painting a multi-faceted picture of the reliability of models. In all cases, we use the Almost Stochastic Order test (ASO; del Barrio et al., 2018; Dror et al., 2019) for significance testing, which is elaborated on in Appendix C.1.
Evaluation of Calibration First, we measure the calibration of models using the adaptive calibration error (ACE; Nixon et al., 2019), which is an extension of the expected calibration error (ECE; Naeini et al., 2015; Guo et al., 2017). Furthermore, we use the frequentist measure of coverage (Larry, 2004; Kompa et al., 2021) Leonard et al. (1992): A good model should be less certain for inputs that incur a higher loss. To measure this both on a token and sequence level, we utilize Kendall's τ (Kendall, 1938), which, given two lists of measurements, determines the degree to which they are concordant, that is, to what extent the rankings of elements according to their measured values agree. This is expressed by a value between −1 and 1, with the latter expressing complete concordance. In our case, these measurements correspond to the uncertainty estimate and the actual model loss, either for tokens (Token τ) or sequences (Sequence τ).
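To make the notion of concordance concrete, the sketch below computes Kendall's τ by counting concordant and discordant pairs. It implements the simple variant without tie correction (often called τ-a), which is an assumption; the text does not state which variant or library was used.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau without tie correction (tau-a)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both lists rank this pair the same way
        elif sign < 0:
            discordant += 1  # the lists rank this pair in opposite order
    num_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / num_pairs


# Hypothetical per-token measurements: uncertainty estimates and losses.
uncertainties = [0.1, 0.4, 0.2, 0.9]
losses = [0.05, 0.70, 0.30, 1.20]
token_tau = kendall_tau(uncertainties, losses)
```

For Sequence τ, the same function would be applied to one uncertainty score and one loss per sequence instead of per token.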

RQ1: Uncertainty & Calibration
We present the results from our experiments using the largest training set sizes per dataset in Table 2.
Task Performance Across datasets and models, we can identify several trends: Some of the BERT-based models unsurprisingly perform better than LSTM-based models, which can be explained by their pre-training procedure. We observe worse performance for some LSTM and BERT variants, in particular the Variational, Bayesian and ST-τ LSTM, as well as the SNGP BERT. In accordance with the ML literature (see e.g. Lakshminarayanan et al. (2017); Ovadia et al. (2019)), LSTM ensembles actually perform very strongly, on par with or sometimes better than fine-tuned BERTs.
Calibration We also see BERT models generally achieve lower calibration errors across all metrics measured, which is in line with previous works (Desai and Durrett, 2020; Dan and Roth, 2021). It is interesting to see that the correct prediction is almost always contained in the 0.95 confidence set across all models; however, these numbers have to be interpreted in the context of the set's width: It becomes apparent that, for instance, LSTMs achieve this coverage by spreading probability mass over many classes, while only BERT-based models, LSTM ensembles as well as the Bayesian LSTM (on Danish) and the Variational LSTM (on Finnish) are confidently correct.
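The interplay between coverage and set width can be illustrated with a small sketch. It assumes that the confidence set is built by adding classes in order of decreasing probability until the cumulative mass reaches the chosen level; this construction and the function names are assumptions for illustration, not the authors' confirmed implementation.

```python
def confidence_set(probs, level=0.95):
    """Smallest set of top-ranked classes whose total probability reaches `level`."""
    order = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    chosen, mass = [], 0.0
    for k in order:
        chosen.append(k)
        mass += probs[k]
        if mass >= level:
            break
    return chosen


def coverage_and_width(all_probs, labels, level=0.95):
    """Fraction of gold labels inside the set, and the average set size."""
    sets = [confidence_set(p, level) for p in all_probs]
    coverage = sum(y in s for s, y in zip(sets, labels)) / len(labels)
    width = sum(len(s) for s in sets) / len(sets)
    return coverage, width
```

A model that spreads its probability mass widely produces large sets and therefore covers the gold label almost by construction, which is why coverage is only meaningful when read together with the width.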
Uncertainty Quality LSTM-based models seem to struggle to distinguish in- from out-of-distribution data based on predictive uncertainty. For Danish, only BERTs perform visibly above chance-level. For Finnish, the AUPR results suggest that although some OOD instances are quickly identified as uncertain, many other OOD instances remain undetected among in-distribution samples. For English, OOD samples are detected more effectively, which can be explained by them consisting of unknown voice commands, representing a potential instance of semantic shift, which has been shown to be easier to detect by classifiers (Arora et al., 2021). Furthermore, it is striking that uncertainty and loss on a token-level (Token τ) are only positively correlated for some models, using metrics such as the maximum probability score, softmax gap or the Dempster-Shafer metric, which are all entirely based on the categorical output distributions. On a sequence-level (Sequence τ), the correlation is often negative, meaning that higher uncertainty goes hand in hand with a higher loss. Lastly, it should be noted that different uncertainty metrics yield diverse outcomes: There does not seem to be one superior metric across all experimental settings, as seen by the variety of markers shown in Table 2.

RQ2: Dependence on training data
After presenting the best results for the biggest training set sizes in Table 2, we now continue to analyze the difference between models and metrics in a more fine-grained way. In Figure 2, we show differences for the token-level correlation between a model's loss and its uncertainty measured by Kendall's τ, with arrows indicating the shift from measurements on the in- to the out-of-distribution test set. Here, we see the same trend of more training data having a larger influence on BERT models. Peculiarly, we also observe pre-trained models' uncertainty to correlate less with their losses on the OOD data, while this property stays relatively constant for LSTMs. We can recognize this trend also for the other datasets in Figure 2 and to a lesser degree on a sequence level (Figure 14a in Appendix D.1), albeit with a negative correlation in general in the latter case. In Figures 11 and 12 in Appendix D.1, we show the AUROC and AUPR of different model-uncertainty metric combinations for all datasets and training set sizes. In both cases, we can notice that pre-trained models profit more from an increase in available training data than LSTM-based models that are trained from scratch. This improvement is observed both in task performance and in the model's ability to discern ID from OOD data using its uncertainty, but more so for Danish than for English or Finnish. As in the previous section, we often see that uncertainty metrics of the same model perform quite similarly. These results outline a seeming paradox: Pre-trained and then fine-tuned models (often) perform better on the task at hand, and provide better uncertainty estimates, but only on in-distribution data. Models trained from scratch that have seen less data overall, however, provide more reliable uncertainty estimates on OOD data, but are also worse calibrated (Section 4.1), with the exception of ensembles. This effect appears to be largest on Danish, the dataset containing the least data.

Uncertainty quality over training
Adding another facet to this issue, we plot the development of uncertainty estimate quality over the training for different models in Figure 3. We use LSTMs and DDU BERTs on Dan+ as representatives of the observed differences between pre-trained transformers and models trained from scratch, with more examples given in Appendix D.2. On both a token and sequence level, we can see that the correlation between uncertainty and loss dips for DDU BERT, before increasing again over the course of the training. Most curiously, the highest correlations are achieved with the models using the least training data. Such behavior is also present for LSTMs on a sequence level. We can also see that while the correlation is higher for DDU BERT on in-distribution data (see again Table 2), on OOD data, LSTMs actually more accurately reflect their knowledge using uncertainty. This again corroborates earlier insights from Section 4.1: Pre-trained models seem to provide better uncertainty estimates on in-distribution data, but yield worse results on OOD than LSTMs trained from scratch. Furthermore, the less training data is available, the more indicative predictive uncertainty seems to be of the correctness of a model. We see such behavior also to a lesser extent in the other datasets (see Appendix D.2). Before we offer some potential explanations of this behavior, we try to gain an even more fine-grained understanding by analysing the differences in metrics and models on a token level.

RQ3: Qualitative Analysis
We investigate the development of uncertainty estimates over the course of a single sequence for different datasets, models, and uncertainty metrics.
We showcase two examples in Figure 4, with more examples in Appendix D.3. By looking at the predictive entropy of models in Section 4.3, we can observe multiple things: First of all, there is some degree of agreement between models and their uncertainty: When processing sub-word tokens, uncertainty seems to increase, and the total uncertainty always appears to reduce considerably on punctuation. Interestingly, the highest uncertainty seems to be produced by the DDU and Variational BERT models as well as the ensembles. In Figure 4b, we compare the estimates for predictive entropy and mutual information, the latter of which is supposed to only express model uncertainty. Here, uncertainty is generally low, indicating that a large part of the total uncertainty might actually be of an aleatoric nature (which is the gap between triangle and cross markers of the same color, due to Equation (1)). These insights indicate that while aleatoric uncertainty might be a constant factor for all models, epistemic uncertainty expectedly differs noticeably between them. We use all of these insights to discuss the choice of model next.

Discussion
Our experiments in Section 4 have uncovered interesting nuances about uncertainty estimation in Natural Language Processing. With respect to RQ1, we observe that fine-tuning BERTs and training LSTM ensembles on different languages produces high task scores with low calibration errors and high-quality uncertainty estimates, but only on in-distribution data. On OOD data, uncertainty estimates from fine-tuned models actually become less indicative of potential model loss compared to LSTM-based models. We also find that among the variety of uncertainty metrics proposed, there does not appear to be a superior metric. Differences in Kendall's τ on a token and sequence level suggest that loss and uncertainties fluctuate over the course of a sequence. Answering RQ2, it seems that, paradoxically, more training data decreases the quality of uncertainty estimates on OOD data for pre-trained models. We speculate that fine-tuning models increasingly lets them forget relevant features that would produce higher uncertainty. This might explain why for LSTM-type models, this effect seems to be smaller. Lastly, we conclude about RQ3 that all models' total uncertainty behaves somewhat similarly, potentially due to the strong influence of aleatoric uncertainty. From these insights, we conclude that the approaches using pre-trained models overall give the best trade-off between task performance, uncertainty quality and calibration; however, their failure on OOD samples opens up further directions of research. Ensembles can provide an alternative here in data-scarce settings, when the task is sufficiently learnable without the need for pre-training.

Conclusion
In this work, we explore the current options for uncertainty estimation in NLP on three different languages and tasks, focusing on the impact of available data on the quality of uncertainty scores in a potential low-resource environment. We conclude the following: Fine-tuning pre-trained models produces the best results in terms of task performance, calibration and uncertainty quality, but only on in-distribution data. On out-of-distribution data, LSTM-based models produce more reliable estimates, and could be preferred in cases where pre-trained models might not be available, with LSTM ensembles providing an especially attractive alternative. We discover that more training data seems to decrease the quality of uncertainty estimates on OOD data, and show that the total uncertainty of models often seems to be influenced by their aleatoric uncertainty.
(b) Predictive entropy and mutual information over the sentence "However, the phenomenon lasted for such a short time that Pekka did not have a chance to prove it".
Figure 4: Uncertainty estimates on single sequences, for predictive entropy of different models on Danish (Section 4.3) and predictive entropy and mutual information for multi-prediction models on Finnish (Figure 4b).
Future Work We see our work as groundwork for future research: While uncertainty estimation is a thriving subject in Computer Vision, it remains understudied in NLP. Our experiments highlight that model behavior on language data is not well understood and open several lines for further investigation: One such line is the development of new methods for NLP that a) produce more faithful estimates on OOD data while retaining their ID performance and b) require less training data to do so, in order to be applicable in low-resource settings. Additionally, our qualitative analyses, along with existing works such as Xiao and Wang (2021) and Xu et al. (2020), highlight the potential to use uncertainty to understand model behavior.

Limitations
Even though the experiments test a large array of models and metrics, the collection shown here is by no means exhaustive, and only a selection of popular models or approaches from very different families was considered.
Another glaring shortcoming is the focus on only three European languages: By comparing members of the Uralic, North Germanic and West Germanic families, we only scratch the surface when it comes to the morphological diversity of human language. Further, we only focused on languages with a Latin writing system, as well as specific text domains. This is due to resource constraints and the availability of suitable OOD test sets. We hope that follow-up works will refine our insights on a more representative sample of natural languages.
Lastly, we solely focused on sequence labelling and sequence prediction tasks. Van Landeghem et al. (2022) feature more sequence prediction tasks for English; however, we are looking forward to similar studies on natural language generation and structured prediction tasks as well.

Ethics Statement
We do not foresee any immediate negative ethical consequences of our research.

A Data
A.1 Pre-processing
Tokenization We use the corresponding BERT tokenizer for each language, including for LSTM-based models to ensure compatibility. For English, this corresponds to the original WordPiece tokenizer used by Devlin et al. (2019), while we use the tokenizer of the Danish BERT (Hvingelby et al., 2020) and Finnish BERT (Virtanen et al., 2019) for those languages, respectively.
Tags for Sub-word Tokens For named entity recognition and part-of-speech tagging, we follow Jurafsky and Martin (2022), chapter 11.3.3, to deal with sub-word tokens: For every token that is split into sub-word tokens, we assign the tag only to the first sub-word token, and −100 for the rest, which ignores them for evaluation purposes.
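The tag assignment for sub-word tokens can be sketched as follows. The function name and the input format (one tag per word plus the number of sub-word pieces per word) are hypothetical choices for illustration; only the use of −100 for non-initial pieces comes from the text.

```python
IGNORE_INDEX = -100  # label value that is skipped during evaluation


def align_tags(word_tags, pieces_per_word):
    """Expand word-level tags to sub-word level.

    The first sub-word piece of each word keeps the word's tag; every
    following piece receives IGNORE_INDEX.
    """
    aligned = []
    for tag, num_pieces in zip(word_tags, pieces_per_word):
        aligned.append(tag)
        aligned.extend([IGNORE_INDEX] * (num_pieces - 1))
    return aligned


# Hypothetical example: the second word is split into three pieces.
aligned_tags = align_tags([3, 7], [1, 3])
```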

A.2 Sub-sampling of Training Sets
Since we sub-sample some of the data splits in Table 1, this bears the danger of producing unnatural samples of text. For that reason, we use this appendix to describe the sampling strategies in more detail.
Sub-sampling procedure For sequence classification, sequences are first placed into buckets of the same label, then into sub-buckets of the same length. The sampling procedure then consists of first drawing a label based on the observed label frequencies, after which the draw of a sequence length, proportional to the frequency of this length inside the bucket, determines the final bucket from which a sequence is drawn uniformly. For token classification, in contrast, sequences are grouped by length at the highest level. Inside a bucket, a sequence is not drawn uniformly but with a probability according to the alignment of the sequence's labels with the overall corpus label distribution. This alignment is calculated for each sequence by evaluating the expected log-probability of the sequence's label distribution w.r.t. the label distribution of the corpus (i.e., the cross-entropy). The scores for all same-length sequences in a bucket are then normalized into a [0, 1] interval in order to enable sampling, which is similar to the two-stage procedure used in the sequence classification case.
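A minimal sketch of the sequence classification case is given below. It draws without replacement and recomputes the label and length frequencies from the sequences that remain; both choices, like the function name, are assumptions, since the text does not specify them.

```python
import random
from collections import defaultdict


def subsample(sequences, labels, size, seed=0):
    """Draw `size` (sequence, label) pairs via label and length buckets."""
    rng = random.Random(seed)

    # Bucket by label first, then by sequence length.
    buckets = defaultdict(lambda: defaultdict(list))
    for seq, label in zip(sequences, labels):
        buckets[label][len(seq)].append(seq)

    sample = []
    for _ in range(min(size, len(sequences))):
        # 1) Draw a label proportionally to its (remaining) frequency.
        names = list(buckets)
        weights = [sum(len(b) for b in buckets[n].values()) for n in names]
        label = rng.choices(names, weights=weights)[0]

        # 2) Draw a length proportionally to its frequency for that label.
        lengths = list(buckets[label])
        weights = [len(buckets[label][n]) for n in lengths]
        length = rng.choices(lengths, weights=weights)[0]

        # 3) Draw a sequence uniformly from the final bucket.
        bucket = buckets[label][length]
        sample.append((bucket.pop(rng.randrange(len(bucket))), label))

        # Remove exhausted buckets so that they cannot be drawn again.
        if not bucket:
            del buckets[label][length]
            if not buckets[label]:
                del buckets[label]
    return sample
```

The token classification variant would replace step 3 with a weighted draw, using the normalized cross-entropy scores of the sequences in the bucket as weights.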

Validation of sub-sampled training sets
We take multiple steps to validate the representativeness of our sub-sampled data splits. First, we plot the distributions of the 50 most frequent types in the original corpus in Figure 5, where we see that distributions converge with increasing sample size. Secondly, we plot sentence length distributions in Figure 6, where we also see increasing alignment with sample size. For Sequence and Token Classification tasks, we also plot the class distributions in Figure 7. Lastly, we train an interpolated trigram Kneser-Ney language model (Jelinek, 1980; Ney et al., 1994) with uniform interpolation weights on the original training set using SRILM (Stolcke, 2002) and sub-word tokens produced by the corresponding BERT tokenizer, sub-sample multiple splits and compare their perplexity scores to those of the original corpus in Table 3. While n-gram perplexities of sub-sampled training sets do lie above those of the original data, they are still upper-bounded by the in-distribution test-set perplexities. Furthermore, this verification was not aimed at giving the most precise results, as scoring with an n-gram model can be rather crude. Thus, with all these results, we conclude that our sub-sampling procedure produces sufficiently representative samples of the original data for the different tasks discussed.

A.3 Selection of OOD Test Sets
In this appendix section, we present additional evidence that the OOD test splits shown in Table 1 are sufficiently different from the training data (meaning out-of-distribution) to enable our chosen methodology. To that end, we re-use similar ideas as described in Appendix A.2, but with the opposite goal. In Figure 9, we plot the distribution of sequence lengths of the training set compared with the OOD test set, with the same done for the 25 most frequent types in Figure 10 and class labels in Figure 8. Lastly, we again use an interpolated Kneser-Ney trigram language model to compute the perplexity of the training set compared to the OOD test set in Table 3. In all cases, OOD n-gram perplexities lie far above the training or sub-sampled data perplexities. Except for Finnish, they are also widely different from the test set perplexities. In that exceptional case, an explanation could be the highly agglutinative nature of Finnish, which increases the sparsity of the language despite the subword tokenization.


B Calibration Metrics
Perfect calibration means that the confidence of a neural network corresponds to the percentage of samples with that same predicted probability that actually receive the correct label from the network. Using a predicted label $\hat{y}$ with probability $\hat{p}$, perfect calibration is defined as

$$P(\hat{y} = y \mid \hat{p} = p) = p, \quad \forall p \in [0, 1].$$

The expected calibration error (ECE; Naeini et al., 2015) quantifies the difference between the confidence and the accuracy on a test set by collecting predictions into $M$ bins:

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|,$$

where $N$ is the number of data points and $B_m$ denotes the m-th bin.
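As an illustration of the binning idea, the expected calibration error can be computed along the following lines. This is a minimal sketch that assumes equal-width bins over [0, 1] and uses only the confidence of the predicted label; the function and argument names are hypothetical and not taken from the paper's code.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Bin-weighted average gap between accuracy and mean confidence."""
    num_points = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Assumption: equal-width bins; a confidence of 1.0 falls into the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing to the sum
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, h in members if h) / len(members)
        ece += len(members) / num_points * abs(accuracy - avg_conf)
    return ece
```

A model that is always fully confident but correct only a quarter of the time would, for example, obtain an error of 0.75 under this definition.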
One problem is that ECE is only defined for binary classification and depends strongly on the number of bins chosen. For the former problem, Guo et al. (2017) present a naive extension to multi-class classification that only considers the most likely prediction. In order to consider all classes, Nixon et al. (2019) introduce the static calibration error (SCE) as an extension to multi-class problems:

$$\text{SCE} = \frac{1}{K} \sum_{k=1}^{K} \sum_{m=1}^{M} \frac{N_{mk}}{N} \left| \text{acc}(B_m, k) - \text{conf}(B_m, k) \right|.$$

Here, $K$ is the number of classes, $N_{mk}$ denotes the number of instances of class $k$ in bin $m$, and $\text{acc}(B_m, k)$, $\text{conf}(B_m, k)$ the accuracies and confidences for class label $k$ in bin $m$, respectively. However, we found this error not to be very informative in our case, and therefore omitted the corresponding results.
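Secondly, Nixon et al. (2019) introduce the adaptive calibration error (ACE), which makes sure that every bin contains the same number of predictions. They define a calibration range $r$ by the $\lfloor N/R \rfloor$-th index of the sorted and thresholded predictions. Then, the error is defined as

$$\text{ACE} = \frac{1}{KR} \sum_{k=1}^{K} \sum_{r=1}^{R} \left| \text{acc}(r, k) - \text{conf}(r, k) \right|,$$

where $R$ is the number of calibration ranges and $\text{acc}(r, k)$, $\text{conf}(r, k)$ are the accuracy and confidence for class $k$ in range $r$.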
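The equal-mass ranges used by the adaptive calibration error can be illustrated with the sketch below. It is a simplified reading of the description above: the thresholding step is left out, the last range absorbs any remainder, and all names are hypothetical.

```python
def adaptive_calibration_error(probs, labels, num_ranges=5):
    """Average |accuracy - confidence| over classes and equal-mass ranges."""
    if len(probs) < num_ranges:
        raise ValueError("need at least one prediction per range")
    num_classes = len(probs[0])
    size = len(probs) // num_ranges  # floor(N / R) predictions per range
    total = 0.0
    for k in range(num_classes):
        # Sort the predictions for class k by their predicted probability.
        pairs = sorted((p[k], 1.0 if y == k else 0.0) for p, y in zip(probs, labels))
        for r in range(num_ranges):
            # Assumption: the last range absorbs the remainder if R does not divide N.
            end = (r + 1) * size if r < num_ranges - 1 else len(pairs)
            members = pairs[r * size:end]
            conf = sum(p for p, _ in members) / len(members)
            acc = sum(h for _, h in members) / len(members)
            total += abs(acc - conf)
    return total / (num_classes * num_ranges)
```

Because every range holds the same number of predictions, sparsely populated confidence regions no longer produce noisy per-bin estimates.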

C Implementation Details
Resources All models were implemented in PyTorch (Paszke et al., 2019). BERT models were implemented with the help of HuggingFace's transformers library (Wolf et al., 2020). Linear algebra operations were often implemented using the EinOps package (Rogozhnikov, 2022).
The Bayesian LSTM was developed using the Blitz package (Esposito, 2020) for PyTorch and the SNGP transformer using gpytorch (Gardner et al., 2018). HuggingFace's datasets (Lhoest et al., 2021) were furthermore used for dataset creation and codecarbon (Schmidt et al., 2021) for carbon emissions tracking. Weights & Biases (Biewald, 2020) was used to track and manage hyperparameter searches and experiments. In general, we follow many of the experimental guidelines and suggestions laid out by Ulmer et al. (2022a).
Models For the DDU transformer, we used Principal Component Analysis on the latent representations for Clinc Plus to reduce the memory usage of the Gaussian Discriminant Analysis by reducing dimensionality to 64. We initially also experimented with the DUE transformer by van Amersfoort et al. (2021), but found that it was not trivial to create the inducing points for the Gaussian process output layer in a sequential setting. For the Variational Transformer (Xiao et al., 2020), the authors do not specify exactly how MC Dropout is used. We use the existing dropout layers in the corresponding model, and use a number of forward passes with different dropout masks to make predictions. Since the number of classes is prohibitive for the original formulation of the SNGP transformer, we use the extension proposed by Liu et al. (2022) in Appendix A.1 and only store one Σ−1 matrix for all classes. Furthermore, we update the matrix over the whole training time rather than only during the last epoch (see Appendix C.4). Sequence lengths are set to 35 for LSTM-based and 128 for BERT-based models. All LSTM-based models are trained using 2 layers, with the exception of the vanilla LSTM and the LSTM-ensemble on Clinc Plus with 3 layers. Their hidden and embedding sizes are set to 650. For all models, gradient clipping is set to 10. For models using multiple predictions to compute uncertainty estimates, 10 predictions are used at a time.
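To illustrate how several stochastic predictions can be turned into uncertainty estimates, the sketch below computes predictive entropy and mutual information from a set of sampled class distributions (e.g., 10 forward passes with different dropout masks or 10 ensemble members). The function names are hypothetical, and the decomposition shown is the standard entropy-based one, which is assumed rather than confirmed to match the released code exactly.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def uncertainty_from_samples(sampled_dists):
    """Return (predictive entropy, mutual information) for sampled distributions."""
    num_samples = len(sampled_dists)
    num_classes = len(sampled_dists[0])
    # Monte Carlo estimate of the predictive distribution: average over samples.
    mean_dist = [
        sum(dist[c] for dist in sampled_dists) / num_samples for c in range(num_classes)
    ]
    predictive_entropy = entropy(mean_dist)
    # Expected entropy of the individual predictions.
    expected_entropy = sum(entropy(dist) for dist in sampled_dists) / num_samples
    # Mutual information: the part of the entropy caused by disagreement between samples.
    return predictive_entropy, predictive_entropy - expected_entropy
```

If all sampled predictions agree, the mutual information is zero and the predictive entropy reflects only the ambiguity of the prediction itself.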

C.3 Optimization
To make sure that all models are trained for the same number of steps regardless of the size of the (sub-sampled) training set, we set the training duration to the number of steps corresponding to a number of epochs on the original training set, and refer to these units as epoch-equivalents in the following. Due to the imbalance of classes in Finnish UD and Dan+, all models were trained using loss weights that are inverse to the frequency of a label in the dataset.
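Both ideas can be made concrete with a short sketch. The helper names, the use of a ceiling for the number of batches per epoch, and the choice to leave the weights unnormalized are assumptions made for illustration.

```python
import math
from collections import Counter


def training_steps(original_size, batch_size, epoch_equivalents):
    """Number of optimization steps, independent of the sub-sampled set size."""
    # One epoch-equivalent is one pass over the ORIGINAL training set.
    steps_per_epoch = math.ceil(original_size / batch_size)
    return epoch_equivalents * steps_per_epoch


def inverse_frequency_weights(labels):
    """Loss weight per label, inverse to its relative frequency in the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / count for label, count in counts.items()}
```

For instance, `training_steps(1000, 32, 55)` yields 1760 steps whether the model sees all 1000 sequences or a sub-sample of 100, and a label covering 80% of the data receives a weight of 1.25 while one covering 20% receives a weight of 5.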

Optimization of LSTMs
We adopt different optimization schemes for transformer and LSTM-based models. For LSTMs, we choose stochastic gradient descent with a decaying learning rate schedule: after the equivalent of 14 epochs, the learning rate is multiplied by 0.8695 for every following epoch-equivalent, for 55 epoch-equivalents in total. This corresponds to the setup in Gal and Ghahramani (2016b), modified from the setup in Zaremba et al. (2014).
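A sketch of this schedule is given below. The base learning rate is a free parameter here because it is not stated in this section, and counting epoch-equivalents from 1, with the first decay applied in epoch-equivalent 15, is an assumption about the indexing.

```python
def lstm_learning_rate(base_lr, epoch_equivalent, decay=0.8695, decay_start=14):
    """Learning rate used during the given (1-indexed) epoch-equivalent."""
    # Assumption: the rate stays constant for the first 14 epoch-equivalents
    # and is multiplied by `decay` once for every epoch-equivalent after that.
    num_decays = max(0, epoch_equivalent - decay_start)
    return base_lr * decay ** num_decays


# Hypothetical usage: print the schedule for all 55 epoch-equivalents.
if __name__ == "__main__":
    for epoch in range(1, 56):
        print(epoch, lstm_learning_rate(1.0, epoch))
```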

Optimization of BERTs
We fine-tune BERT models using the shorter duration of 20 epoch-equivalents, corresponding to the NLP experiments in Liu et al. (2022). Adam (Kingma and Ba, 2015) is used for optimization with default parameters β1 = 0.9 and β2 = 0.999 alongside a triangular learning rate schedule, using the first 10% of the training duration as warm-up.
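The shape of such a schedule can be sketched as follows. The linear decay to zero after the warm-up and the function name are assumptions; the peak learning rate is a hyperparameter that is not fixed in this section.

```python
def triangular_learning_rate(peak_lr, step, total_steps, warmup_fraction=0.1):
    """Linear warm-up to `peak_lr`, then linear decay to zero (assumed shape)."""
    warmup_steps = max(1, int(warmup_fraction * total_steps))
    if step < warmup_steps:
        # Warm-up phase: increase linearly from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay phase: decrease linearly from the peak back to 0 at the last step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

With 1000 total steps, the rate peaks at step 100 and has fallen back to half of its peak by step 550.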

C.4 Convergence on Clinc Plus
Here, we briefly address the models missing from the English Clinc Plus experiments. For the ST-τ and Variational LSTM, we could not identify clear reasons why the models did not converge. Even after extensive hyperparameter searches and manual fine-tuning of hyperparameters (including different learning rate schedules and optimizers), we did not find a combination of options that resulted in convergence. We also observed strange behavior for the Bayesian LSTM, which, after reaching a validation accuracy of 0.5, would suddenly return to its initial training performance. This could potentially be explained by the model accidentally escaping a low-loss basin due to a learning rate that is still too high, and thus we changed the model to only be trained for 18 epoch-equivalents and to initiate the learning rate decay after seven epoch-equivalents. It is puzzling that SNGP BERT did not converge on Clinc Plus, since the authors successfully used the dataset in their own work (Liu et al., 2022). We put forth the following explanations: First of all, we observed the model to generally possess a high variance, as demonstrated by the standard deviation on the Danish and Finnish data. Secondly, we make at least two changes to their implementation: Instead of using the mean-field approximation to the predictive distribution, we use the Monte Carlo approximation in order to compute metrics such as mutual information. Also, we update the covariance matrix Σ over the whole training time in order to track the predictive performance for our experiments, and not just during the last epoch.

C.5 Environmental Impact
The carbon efficiency was estimated to be 0.61 kgCO2eq/kWh. 735 hours of computation were performed on a Tesla V100 GPU. This includes hyperparameter search, failed runs, debugging, and discarded runs. As a rough upper bound, we estimate the compute time for a single replication of all experiments to take around 73 hours. To lessen the environmental impact, all models and model predictions are published in the open-source repository. Total emissions are estimated to be 52.45 kgCO2eq. We use direct air capture by Climeworks to offset the emissions (climeworks, 2022). Estimations were conducted using the codecarbon package (Schmidt et al., 2021), a joint effort from authors of Lacoste et al. (2019) and Lottick et al. (2019).

D Additional Results
This section contains additional experimental results and plots that could not be added to the paper due to space constraints. We roughly follow the structure of Section 4.

D.1 Additional Scatter Plots
This section provides some additional scatter plots. For all plots presented here as well as Figure 2, some slight jitter sampled from N(0, 0.01) was added to x- and y-coordinates to increase readability of overlapping points.
Clinc Plus In Figures 11a and 12a, we can see that the Variational BERT model actually degrades in performance as more training data is added, both in terms of task performance and uncertainty quality, while other models stay relatively constant. The same trend can be detected using the sequence-level Kendall's τ for Clinc Plus. We suspect that the smallest training size of 10k examples already provides enough data for models to converge to similar solutions even after adding more data, and that the Variational BERT alone might be prone to overfitting in this case.
Dan+ Results for the Danish dataset are shown in Figures 11b and 12b. It is apparent that LSTM-based models stay mostly constant in their predictive performance, with the largest gains observed for the LSTM ensemble. We can also observe that the DDU and Variational BERT improve both in task performance and uncertainty quality with increasing training data. Interestingly, for the SNGP BERT, uncertainty estimates become more indicative of OOD data with more training samples, but mostly only when using predictive entropy and the maximum probability score. This might indicate that in these cases, the model actually achieves the desired distance-awareness posed by Liu et al. (2022). In Figure 14b, we can see a similar behavior of the SNGP BERT and its metrics w.r.t. the sequence-level correlation. Also, we see that the other BERT models and the LSTM ensemble actually lose uncertainty quality as more data is added.
Finnish UD In Figures 11c and 12c, we see that the AUROC and AUPR scores of different models and metrics stay largely constant across dataset sizes, which could be explained by the larger amount of training data supplied compared to Dan+.
On the token-level correlation between uncertainty and loss in Figure 13, we see the DDU BERT profiting most from more data. On a sequence level, as depicted in Figure 14c, the correlation appears mostly static across training set sizes, with only small gaps between in-distribution and out-of-distribution data.
Overall, it seems that the range of dataset sizes for Dan+ shows the most critical differences between models, while for the dataset sizes used for Finnish UD and Clinc Plus, enough data seems to be supplied for changes to be more minuscule. This result is particularly relevant for low-resource settings, although the dependency on the task cannot be disentangled from these results.

D.2 Additional Uncertainty over Training Plots
We extend the plots from Figure 3 for all tested models and datasets in Figure 15 and Figure 16, showing the correlation of predictive entropy on a token- and sequence-level, respectively. On the token level, we see that the correlation is highest for SNGP-BERT, although the correlation levels for different training set sizes seem harder to differentiate between models, and differences could also be due to variance between model runs. Secondly, on a sequence level, we also see either similar correlation across training set sizes, or higher correlation for lower sizes. In all cases, we observe that some models start with a high correlation that decreases over the training time, as the model fits the in-distribution data better. That corroborates a trend described in Section 4.2, implying that uncertainty estimates become less reliable as the model tries to decrease the loss on the training data.
(a) Predictive entropy over the sentence "@ToniLotjonen @harrikumpulaine It is true that I'd maybe like to see more of such Latvia-Russia type games in these kinds of major sports events. #floorball".

Figure 1: Schematic of our experiments. Training sets are sub-sampled and used to train LSTM-based models and fine-tune transformer-based ones, which are evaluated on in- and out-of-distribution test data.

Figure 2: Scatter plot showing the difference between model performance (measured by macro F1) and the quality of uncertainty estimates on a token-level (measured by Kendall's τ). Shown are different models and uncertainty metrics and several training set sizes of the Dan+ dataset. Arrows indicate changes between the in-distribution and out-of-distribution test set. Best viewed electronically and in color.
Development of sequence-level Kendall's τ.

Figure 3: Development of correlation between uncertainty and loss, shown on the Dan+ OOD test set over the training time using differently-sized training sets. Colored areas indicate the standard deviation over five runs.
Predictive entropy over the sentence "This time in company with Jørn Middelhede, also from Kolding".

Figure 5: Comparing the relative frequency of types in the original and sub-sampled training sets. Shown are the top 20 types in the original training set, compared to sub-sampled training sets of 100 and 1000 sequences for Dan+, Finnish UD and Clinc Plus. It is shown that while the type frequencies differ noticeably for the small dataset, already 1000 sequences suffice to approximate the original frequencies. Numbers, stopwords and the most common punctuation were removed.

Figure 6: Comparing the relative frequency of sequence lengths in the original and sub-sampled training sets. Shown are sequence lengths between 0 and 25 in the original test, compared to OOD test sets for Dan+, Finnish UD, Clinc Plus. Not the whole distribution is shown in all cases, with many of the OOD sentences for Dan+ being very long. For Dan+ and Finnish UD, the sentence length distributions are noticeably different. For Clinc Plus, they are very similar.

Figure 7: Comparing the relative frequency of labels in the original training set, compared to sub-sampled training sets. Shown are frequencies for 100 and 1000 sequences. For Danish, the most frequent label by far is the neutral label indicating that no named entity is present.

Figure 8: Comparison of the relative class frequencies between the original training set and the OOD test set. The proportions stay largely the same for Danish, while they differ more for Finnish.

Figure 9: Comparison of sequence length distribution between the original training set and the OOD test set. For English, the distribution of lengths of voice assistant commands is quite similar, while the differences for Dan+ and Finnish UD are more pronounced.

Figure 10: Comparison of the relative frequencies of the top 25 types in the original training set compared to the OOD test set. Even among the most frequent and therefore usually common tokens, the plots show differences between the in-distribution train and out-of-distribution test set. Numbers, stopwords and the most common punctuation were removed.
Predictive entropy over the sentence "On the contrary, it is one of Russia's few success stories that performs when the rock group Gorky Park begins their Danish tour in the city of the beautiful lakes".
Predictive entropy over the sentence "However, we did not have precise information about what was agreed upon".
Predictive entropy over the sentence "Demonizing hate speech inspires the marginalized, PSYCHOLOGY UNSTABLE (!) Men on the far right to resort to violence against Muslims. This writes Elvir, who....

Figure 17: Further examples for uncertainty estimates on single sequences. Taken from the Dan+ dataset.
Predictive entropy over the sentence "I hope that the procedures done on the person question stop and he gives his body (and mind) time to recover from that poisoning!".
Predictive entropy over the sentence "Maybe the hat or how it got on my head doesn't matter".

Figure 18: Further examples for uncertainty estimates on single sequences. Taken from the Finnish UD dataset.

Table 1: Coverage is based on Datasets. The original and sub-sampled number of sequences for experiments are given on the right.

Table 2: Results on the tested datasets. Task performance is measured by macro F1 and accuracy, calibration by different calibration errors, the coverage percentage and the average prediction set width. For every result, the value on the ID and OOD test set is shown. For English, OOD scores are not available since the OOD set does not contain gold labels, and Token τ is missing due to CLINC being a sequence prediction task. Uncertainty quality is evaluated using its ability to discriminate between ID and OOD data, quantified by AUROC and AUPR. Furthermore, Kendall's τ is measured between the uncertainty and losses on a sequence- and token-level. Displayed are mean and standard deviation over five random seeds, with bolding and underlining indicating almost stochastic dominance with εmin ≤ 0.3 over all other models. For the last section, the best value over uncertainty metrics is given, with symbols indicating the type of metric achieving it: ⃝ Max. probability, △ Predictive entropy, Class variance, Softmax gap, Dempster-Shafer, Mutual information.

Table 3: Results of using an interpolated Kneser-Ney n-gram language model on selected datasets, including sub-sampled training splits and the OOD test set. Scores of sub-sampled training sets were obtained over five different attempts.

Table 4: List of searched hyperparameters. LSTM Ensemble hyperparameters are not searched, but simply copied from the found LSTM hyperparameters.

Table 5: List of used model hyperparameters by dataset.