On the Calibration and Uncertainty of Neural Learning to Rank Models for Conversational Search

According to the Probability Ranking Principle (PRP), ranking documents in decreasing order of their probability of relevance leads to an optimal document ranking for ad-hoc retrieval. The PRP holds when two conditions are met: [C1] the models are well calibrated, and [C2] the probabilities of relevance are reported with certainty. We know however that deep neural networks (DNNs) are often not well calibrated and have several sources of uncertainty, and thus [C1] and [C2] might not be satisfied by neural rankers. Given the success of neural Learning to Rank (LTR) approaches—and here, especially BERT-based approaches—we first analyze under which circumstances deterministic neural rankers are calibrated for conversational search problems. Then, motivated by our findings, we use two techniques to model the uncertainty of neural rankers, leading to the proposed stochastic rankers, which output a predictive distribution of relevance as opposed to point estimates. Our experimental results on the ad-hoc retrieval task of conversation response ranking reveal that (i) BERT-based rankers are not robustly calibrated and stochastic BERT-based rankers yield better calibration; and (ii) uncertainty estimation is beneficial for both risk-aware neural ranking, i.e. taking into account the uncertainty when ranking documents, and predicting unanswerable conversational contexts.


Introduction
According to the Probability Ranking Principle (PRP) (Robertson, 1977), ranking documents in decreasing order of their probability of relevance leads to an optimal document ranking for ad-hoc retrieval, i.e. the standard retrieval task where the user specifies an information need through a query that initiates a search for likely relevant documents (Baeza-Yates et al., 1999). Gordon and Lenk (1991) discussed that for the PRP to hold, ranking models must at least meet the following conditions: [C1] assign well calibrated probabilities of relevance, i.e. if we gather all documents for which the model predicts relevance with a probability of e.g. 30%, the fraction of relevant documents among them should be 30%; and [C2] report certain predictions, i.e. only point estimates such as, for example, an 80% probability of relevance. DNNs have been shown to outperform classic Information Retrieval (IR) ranking models over the past few years in setups where considerable training data is available. It has also been shown that DNNs are not well calibrated in the context of computer vision (Guo et al., 2017). If the same is true for neural L2R models for IR, e.g. transformer-based models for ranking (Nogueira and Cho, 2019), [C1] is not met. Additionally, there are a number of sources of uncertainty in the training process of neural networks (Gal, 2016) that make it unreasonable to assume that neural ranking models fulfill [C2]: parameter uncertainty (different combinations of weights that explain the data equally well), structural uncertainty (which neural architecture to use for neural ranking), and aleatoric uncertainty (noisy data). Given these sources of uncertainty, using point estimate predictions and ranking according to the PRP might not achieve the optimal ranking for retrieval. While the effectiveness benefits of risk-aware models, which take into account the risk, i.e. the uncertainty of the documents' prediction scores, have been shown for non-neural IR approaches, this has not yet been explored for neural L2R models.
In this paper we first analyze the calibration of neural rankers, specifically BERT-based rankers for IR tasks. Then, to model the uncertainty of BERT-based rankers, we propose stochastic neural ranking models (see Figure 1) by applying different techniques to model the uncertainty of DNNs, namely MC Dropout (Gal and Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017), which are agnostic to the particular DNN.
In our experiments, we test models under distributional shift, i.e. where the test data distribution differs from the training data distribution, also referred to as out-of-distribution (OOD) examples (Lee et al., 2018). In real-world settings, inputs are often shifted due to factors such as non-stationarity and sample bias. Additionally, this experimental setup provides a way of measuring whether the DNN "knows what it knows" (Ovadia et al., 2019), e.g. by outputting high uncertainty for OOD examples.
We find that BERT-based rankers are not robustly calibrated. Stochastic BERT-based rankers have 14% less calibration error on average than deterministic BERT-based rankers. Uncertainty estimation from stochastic BERT-based rankers is advantageous for downstream applications, as shown by our experiments on risk-aware neural ranking (2% more effective on average relative to a model without risk-awareness) and on predicting unanswerable conversational contexts (improving classification by 33% on average across all conditions). The source code and data are available at https://github.com/Guzpenha/transformer_rankers/tree/uncertainty_estimation.

Related Work
Calibration and Uncertainty in IR Even though optimally ranking documents according to the PRP (Robertson, 1977) requires the model to be calibrated (Gordon and Lenk, 1991) ([C1]), the calibration of ranking models has received little attention in IR. In contrast, in the machine learning community there have been a number of studies about calibration (Ovadia et al., 2019; Maddox et al., 2019), due to the larger decision-making pipelines DNNs are often part of and the importance of calibration for model interpretability (Thiagarajan et al., 2020). For instance, in the automated medical domain it is important to provide a calibrated confidence measure alongside the prediction of a disease diagnosis, to give clinicians sufficient information (Jiang et al., 2012). Guo et al. (2017) have shown that DNNs are not well calibrated in the context of computer vision, motivating our study of the calibration of neural L2R models.
The second condition ([C2]) for optimal retrieval when ranking according to the PRP (Gordon and Lenk, 1991) is that models report predictions with certainty. While (un)certainty has not been studied for neural L2R models, there are classic approaches in IR that model uncertainty. Such approaches have been mostly inspired by economic theory, treating variance as a measure of uncertainty (Varian, 1999). Following such ideas, non-neural ranking models that take uncertainty into account (i.e. risk-aware models), and thus do not follow the PRP (Robertson, 1977), have been proposed, showing significant effectiveness improvements compared to models that do not model uncertainty. Uncertainty estimation is a difficult task that has other applications in IR besides improving ranking effectiveness: it can be employed to decide between asking clarifying questions and providing a potential answer in conversational search (Aliannejadi et al., 2019); to perform dynamic query reformulation (Lin et al., 2020) for queries where the intent is uncertain; and to predict questions with no correct answers.

Bayesian Neural Networks
Unlike standard algorithms used to train neural networks, e.g. SGD, which fit point estimate weights given the observed data, Bayesian Neural Networks (BNNs) infer a distribution over the weights given the observed data. Denker et al. (1987) contains one of the earliest mentions of placing a probability distribution over the weights of a model. An advantage of the Bayesian treatment of neural networks (MacKay, 1992; Neal, 2012; Blundell et al., 2015) is that they are better at representing the uncertainties that exist in the training procedure. One limitation of BNNs is that they are computationally expensive compared to DNNs. This has led to the development of techniques that scale well and do not require modifications to the neural network architecture and training procedure. Gal and Ghahramani (2016) proposed a way to approximate Bayesian inference by relying on dropout (Srivastava et al., 2014). While dropout is a regularization technique that ignores units with probability p during every training iteration and is disabled at test time, MC Dropout (Gal and Ghahramani, 2016) employs dropout at both training and test time and generates a predictive distribution after a number of forward passes. Lakshminarayanan et al. (2017) proposed an alternative: they employ ensembles of models (Deep Ensembles) to obtain a predictive distribution. Ovadia et al. (2019) showed that Deep Ensembles are able to produce well-calibrated uncertainty estimates that are robust to dataset shift.

Conversational Search
Conversational search is concerned with creating agents that fulfill an information need by means of a mixed-initiative conversation through natural language interaction. A popular approach to conversational search is to model it as an ad-hoc retrieval task: given an ongoing conversation and a large corpus of historic conversations, retrieve the best-suited response from the corpus (this is also known as conversation response ranking (Wu et al., 2017; Penha and Hauff, 2020; Gu et al., 2020; Lu et al., 2020)). This retrieval-based approach does not require task-specific knowledge provided by domain experts (Henderson et al., 2019), and it avoids the difficult task of dialogue generation, which often suffers from uninformative, generic responses (Li et al., 2016a) or responses that are incoherent given the dialogue context (Li et al., 2016b). One of the challenges of conversational search is identifying unanswerable questions, which can trigger, for instance, clarifying questions (Aliannejadi et al., 2019). Identifying unanswerable conversational contexts is one of the applications we employ uncertainty estimation for: intuitively, if the system has high uncertainty in all available responses, there may be no correct response available. In this paper we focus on pointwise BERT for ranking, a competitive approach for the conversation response ranking task.

Method
In this section we introduce the methods used for answering the following research questions: RQ1: How calibrated are deterministic and stochastic BERT-based rankers? RQ2: Are the uncertainty estimates from stochastic BERT-based rankers useful for risk-aware ranking? RQ3: Are the uncertainty estimates obtained from stochastic BERT-based rankers useful for identifying unanswerable queries? We first describe how we measure the calibration of neural rankers ([C1]), followed by our approach for modeling and ranking under uncertainty ([C2]); we then describe how we evaluate robustness to distributional shift.

Measuring Calibration
To evaluate the calibration of neural rankers (RQ1) we resort to the Empirical Calibration Error (ECE) (Naeini et al., 2015). ECE is an intuitive way of measuring to what extent the confidence scores of a neural network align with the true correctness likelihood. It measures the difference between the observed reliability curve (DeGroot and Fienberg, 1983) and the ideal one. More formally, we sort the predictions of the model, divide them into c buckets {B_1, ..., B_c}, and take the weighted average of the difference between the average predicted probability of relevance avg(B_i) and the fraction of relevant documents rel(B_i)/|B_i| in each bucket:

$$ECE = \sum_{i=1}^{c} \frac{|B_i|}{n} \left| avg(B_i) - \frac{rel(B_i)}{|B_i|} \right|,$$

where n is the total number of test examples.
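To make the computation concrete, below is a minimal NumPy sketch of ECE, assuming equal-size buckets over the sorted predictions (the function and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

def ece(probs, labels, c=10):
    """Empirical Calibration Error: sort predictions, split into c buckets, and
    take the bucket-size-weighted average of |avg confidence - fraction relevant|.
    probs, labels: 1-D NumPy arrays of predicted probabilities and 0/1 labels."""
    order = np.argsort(probs)           # sort the predictions
    buckets = np.array_split(order, c)  # divide them into c buckets
    n = len(probs)
    error = 0.0
    for b in buckets:
        avg_conf = probs[b].mean()   # avg(B_i): mean predicted probability of relevance
        frac_rel = labels[b].mean()  # rel(B_i)/|B_i|: fraction of relevant documents
        error += (len(b) / n) * abs(avg_conf - frac_rel)
    return error
```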

Modeling Uncertainty
First we define the ranking problem we focus on, followed by the BERT-based ranker baseline model (BERT). Having set the foundations, we move to the methods we propose to answer RQ2 and RQ3: a stochastic BERT-based ranker to model uncertainty (S-BERT) and a risk-aware BERT-based ranker to take into account uncertainty provided by S-BERT when ranking (RA-BERT).

Conversation Response Ranking
The task of conversation response ranking (Gu et al., 2019; Tao et al., 2019; Henderson et al., 2019; Penha and Hauff, 2020), also known as next utterance selection, concerns retrieving the best response given the dialogue context. We choose this specific task due to the large-scale training data available, suitable for the training of neural L2R models. Formally, let D = {(U_i, R_i, Y_i)}_{i=1}^{N} be a data set consisting of N triplets: dialogue context, response candidates, and response relevance labels. The dialogue context U_i is composed of the previous utterances {u_1, u_2, ..., u_τ} at turn τ of the dialogue. The candidate responses R_i = {r_1, r_2, ..., r_k} are either ground-truth responses or negatively sampled candidates, indicated by the relevance labels Y_i = {y_1, y_2, ..., y_k}. The task is then to learn a ranking function f(·) that is able to generate a ranked list for the set of candidate responses R_i based on their predicted relevance scores f(U_i, r).

Deterministic BERT Ranker
We use BERT for learning the function f(U_i, r), based on the representation of the [CLS] token. The input for BERT is the concatenation of the context U_i and the response r, separated by SEP tokens. This is the equivalent of early adaptations of BERT for ad-hoc retrieval transported to conversation response ranking. Formally, the input sentence to BERT is

$$concat(U_i, r) = u_1 \mid [U] \mid u_2 \mid [T] \mid ... \mid u_\tau \mid [SEP] \mid r,$$

where | indicates the concatenation operation. The utterances from the context U_i are concatenated with special separator tokens [U] and [T] indicating the end of utterances and turns. The response r is concatenated with the context using BERT's standard sentence separator [SEP]. We fine-tune BERT on the target conversational corpus and make predictions as follows:

$$f(U_i, r) = \sigma\big(FFN(BERT_{CLS}(concat(U_i, r)))\big),$$

where BERT_CLS is the pooling operation that extracts the representation of the [CLS] token from the last layer and FFN is a feed-forward network that outputs logits for two classes (relevant and non-relevant). We pass the logits through a softmax transformation σ that gives us a probability of relevance. We use the cross-entropy loss for training. The learned function f(U_i, r) outputs a point estimate and we refer to this model as BERT.
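A minimal sketch of this pointwise ranker with huggingface-transformers follows. The handling of the [U] and [T] markers is simplified here (they are joined in as plain strings rather than registered as special tokens), and all names are illustrative:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# BertForSequenceClassification applies a feed-forward head (the FFN) on the
# [CLS] representation and outputs logits for the two classes.
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
model.eval()

def relevance_score(utterances, response):
    # Join the context utterances with the [U] marker; the tokenizer inserts
    # BERT's standard [SEP] between the context and the response.
    context = " [U] ".join(utterances)
    inputs = tokenizer(context, response, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits               # (1, 2): non-relevant, relevant
    return torch.softmax(logits, dim=-1)[0, 1].item()  # probability of relevance
```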

Stochastic S-BERT Ranker
In order to obtain a predictive distribution R_r = {f(U_i, r)_1, ..., f(U_i, r)_n}, which allows us to extract uncertainty estimates, we rely on two techniques, namely Deep Ensembles (Lakshminarayanan et al., 2017) and MC Dropout (Gal and Ghahramani, 2016). Both techniques scale well and do not require modifications to the architecture or training of BERT.
Using Deep Ensembles (S-BERT_E) We train M models using different random seeds without changing the training data, each with its own set of parameters {θ_m}_{m=1}^{M}, and make predictions with each one of them to generate M predicted values {f(U_i, r; θ_m)}_{m=1}^{M}. The mean of the predicted values is used as the predicted probability of relevance, and their variance gives us a measure of the uncertainty in the prediction:

$$E[R_r] = \frac{1}{M} \sum_{m=1}^{M} f(U_i, r; \theta_m), \qquad var[R_r] = \frac{1}{M} \sum_{m=1}^{M} \big( f(U_i, r; \theta_m) - E[R_r] \big)^2.$$
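A sketch of S-BERT_E's predictive distribution, assuming one scoring function per fine-tuned ensemble member (e.g. wrappers around relevance_score above, one per random seed):

```python
import numpy as np

def ensemble_predict(score_fns, utterances, response):
    # score_fns: M scoring functions, each backed by a model fine-tuned with a
    # different random seed on the same training data.
    scores = np.array([f(utterances, response) for f in score_fns])
    return scores.mean(), scores.var()  # E[R_r] and var[R_r]
```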
Using MC Dropout (S-BERT_D) We train a single model with parameters θ, employ dropout at test time, and generate stochastic predictions of relevance by conducting T forward passes with different sampled dropout masks. As before, the mean of the predicted values is used as the predicted probability of relevance, and their variance gives us a measure of the uncertainty:

$$E[R_r] = \frac{1}{T} \sum_{t=1}^{T} f(U_i, r; \hat{\theta}_t), \qquad var[R_r] = \frac{1}{T} \sum_{t=1}^{T} \big( f(U_i, r; \hat{\theta}_t) - E[R_r] \big)^2.$$
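For S-BERT_D the key implementation detail is keeping dropout active at inference time. A sketch that flips only the Dropout modules back into training mode and aggregates T stochastic forward passes (names are illustrative):

```python
import torch

def mc_dropout_predict(model, inputs, T=10):
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()  # keep sampling dropout masks at test time
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(**inputs).logits, dim=-1)[:, 1]  # P(relevant) per pass
            for _ in range(T)
        ])
    return probs.mean(dim=0), probs.var(dim=0)  # E[R_r] and var[R_r]
```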

Risk-Aware RA-BERT Ranker
Given the predictive distribution R_r, obtained either by Deep Ensembles or MC Dropout, we use the following function to rank responses with risk-awareness:

$$f_{risk}(U_i, r) = E[R_r] - b \cdot var[R_r],$$

where E[R_r] is the mean of the predictive distribution, var[R_r] its variance, and b is a hyperparameter that controls the aversion (b > 0) or predilection (b < 0) towards risk. Unlike Zuccon et al. (2011), we are not combining different runs that encompass different model architectures. We instead take a Bayesian interpretation of the process of generating a predictive distribution from a single model architecture. We refer to the rankers as RA-BERT_D and RA-BERT_E when using the predictive distribution of S-BERT_D and of S-BERT_E respectively.
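Ranking with risk-awareness then amounts to scoring each candidate with this mean-variance trade-off and sorting, as in the sketch below (predict_fn could be a closure over ensemble_predict or mc_dropout_predict from above; b is tuned on validation data):

```python
def rank_responses(candidates, predict_fn, b=0.25):
    # candidates: list of response strings; predict_fn maps a candidate response
    # to (mean, var) of its predictive distribution for the current context.
    scored = [(r, *predict_fn(r)) for r in candidates]
    # b > 0: risk aversion (penalize uncertain responses); b = 0: plain S-BERT mean.
    scored.sort(key=lambda x: x[1] - b * x[2], reverse=True)
    return [r for r, mean, var in scored]
```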

Robustness to Distributional Shift
In order to evaluate whether we can trust the model's calibration and uncertainty estimates, similar to Ovadia et al. (2019) we evaluate how robust the models are to different types of shift in the test data. We do so by training the model in one setting and applying it in a different setting. Specifically, for all three research questions we test the models under two settings, cross-domain and cross negative sampling, which we describe next.

Cross Domain
We train a model using the training set from one domain, the source domain D_S, and evaluate it on the test set of a different domain, the target domain D_T. This is also known as the problem of domain generalization (Gulrajani and Lopez-Paz, 2020).

Cross Negative Sampling
Pointwise L2R models are trained on pairs of query and relevant document and pairs of query and non-relevant document (Lucchese et al., 2017). Selecting the non-relevant documents requires a negative sampling (NS) strategy. For the cross-NS condition, we test models on negative documents that were sampled using a different NS strategy than during training, evaluating the generalization of the models on a shifted distribution of candidate documents. We use three NS strategies, sketched below. In NS_random we randomly select candidate responses from the list of all responses. For NS_classic we retrieve candidate responses with a conventional retrieval model, using the conversational context U_i as the query and all responses r as documents. In NS_sentenceEmb we represent both U_i and all responses with a sentence embedding technique and retrieve candidate responses using a similarity measure.
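The three strategies can be sketched as follows; bm25_index, encoder, and response_index are hypothetical helpers standing in for the BM25 and sentence-embedding components described in the experimental setup:

```python
import random

def ns_random(all_responses, k):
    # NS_random: sample k candidates uniformly from the full response pool.
    return random.sample(all_responses, k)

def ns_classic(context_utterances, bm25_index, k):
    # NS_classic: use the dialogue context as a query against a conventional
    # retrieval model over all responses.
    return bm25_index.search(" ".join(context_utterances), top_k=k)

def ns_sentence_emb(context_utterances, encoder, response_index, k):
    # NS_sentenceEmb: embed the context and retrieve the most similar responses.
    query_vec = encoder.encode(" ".join(context_utterances))
    return response_index.most_similar(query_vec, top_k=k)
```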

Experimental Setup
We consider three large-scale information-seeking conversation datasets that allow the training of neural ranking models for conversation response ranking: MSDialog, MANTiS, and UDC_DSTC8.

Implementation Details
We fine-tune BERT (Devlin et al., 2019) (bert-base-cased) for conversation response ranking using huggingface-transformers (Wolf et al., 2019). We follow recent research in IR that employed fine-tuned BERT for retrieval tasks (Nogueira and Cho, 2019), including conversation response ranking (Penha and Hauff, 2020; Vig and Ramea, 2019; Whang et al., 2019). When training BERT we employ a balanced number of relevant and non-relevant context-response pairs, the latter sampled using BM25 (Robertson and Walker, 1994). The sentence embedding technique we use for cross-NS is sentenceBERT (Reimers and Gurevych, 2019), and we employ dot-product retrieval with FAISS (Johnson et al., 2017). We consider each dataset as a different domain for the cross-domain condition. We use the Adam optimizer (Kingma and Ba, 2014) with lr = 5e-6 and ε = 1e-8, train with a batch size of 6, and fine-tune the model for 1 epoch. This baseline BERT-based ranker setup yields effectiveness comparable to SOTA methods (we obtain 0.834 R_10@1 on UDC_DSTC8 with our baseline BERT model, cf. Table 1).
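In PyTorch terms, the fine-tuning loop amounts to roughly the following sketch (train_loader is assumed to yield tokenized, balanced context-response batches with labels, in which case BertForSequenceClassification computes the cross-entropy loss internally):

```python
from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=5e-6, eps=1e-8)
model.train()
for batch in train_loader:   # batch size 6, a single epoch
    optimizer.zero_grad()
    loss = model(**batch).loss  # cross-entropy over the two classes
    loss.backward()
    optimizer.step()
```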

Evaluation
To evaluate the effectiveness of the neural rankers we resort to a standard evaluation metric in conversation response ranking (Yuan et al., 2019; Gu et al., 2020; Tao et al., 2019): recall at position K with n candidates, R_n@K. To evaluate the calibration of the models, we resort to the Empirical Calibration Error (cf. §3.1, using c = 10). Throughout, we report the test set results for each dataset. To evaluate the quality of the uncertainty estimation we rely on two downstream tasks. The first is to improve conversation response ranking itself via risk-aware ranking (cf. §3.2.4). The second, which fits well with conversation response ranking, is to predict unanswerable conversational contexts. Formally, the task is to predict whether or not there is a correct answer in the candidate list R. In our experiments, for half of the instances we remove the relevant response from the list, setting the label as None Of The Above (NOTA). The other half of the data has the label Answerable (ANSW), indicating that there is a suitable answer in the candidate list; for these instances we remove one of the negative samples instead.

[Figure 2: Calibration of BERT trained on a balanced set of relevant and non-relevant documents, and tested on data with more non-relevant (#-non-rel) than relevant (1 per query) documents. A fully calibrated model is represented by the dotted diagonal: for every bucket of confidence in relevance, the % of relevant documents in that bucket is exactly the confidence. The calibration error is the difference between the curves and the diagonal line.]

[Table 1: Calibration (ECE, lower is better) and effectiveness (R_10@1, higher is better) of BERT for conversation response ranking in cross-domain and cross-NS conditions. All models were trained using NS_BM25. ECE is calculated using a balanced number of relevant and non-relevant documents. Underlined values indicate no distributional shift (D_S = D_T and train NS = test NS).]

[Table 2: Relative decreases of ECE (lower is better) of S-BERT_E and S-BERT_D over BERT. Superscript † denotes significant improvements (95% confidence interval) using Student's t-tests.]

Similar to prior work that used the outputs (logits) of an LSTM-based model to predict NOTA, we use the uncertainties as additional features for the NOTA classifier. The input space with the additional features is fed to a learning algorithm (Random Forest), and we evaluate it with a 5-fold cross-validation procedure using F1-Macro.
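A sketch of this NOTA classification setup with scikit-learn, using dummy arrays in place of the per-candidate means and variances produced by S-BERT:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_contexts, k = 500, 10
mean_scores = rng.random((n_contexts, k))        # E[R]: mean relevance per candidate
var_scores = rng.random((n_contexts, k)) * 0.05  # var[R]: uncertainty per candidate
y = rng.integers(0, 2, n_contexts)               # 1 = NOTA, 0 = ANSW (dummy labels)

X = np.hstack([mean_scores, var_scores])         # uncertainties as additional features
f1 = cross_val_score(RandomForestClassifier(), X, y, cv=5, scoring="f1_macro")
print(f1.mean())
```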

Calibration of Neural Rankers (RQ1)
In order to answer our first research question about the calibration of neural rankers, let us first analyze BERT under standard settings (no distributional shift). Our results show that BERT is both effective and calibrated when there is no distributional shift. In Table 1 we see that when the target data (Test on →) is the same as the source data (Train on ↓), indicated by underlined values, we obtain the highest effectiveness (on average 0.70 R_10@1) and the lowest calibration error (on average 0.036 ECE). When plotting the calibration curves of the model in Figure 2, we observe the curves to be almost diagonal (i.e. near-perfect calibration) when there is an equal number of relevant and non-relevant candidates (#-non-rel = 1). However, when we make the conditions more realistic by having multiple non-relevant candidates for each conversational context, we observe in Figure 2 that the calibration errors start to increase, moving away from the diagonal. Additionally, when we challenge the model in cross-domain and cross-NS settings, the calibration error increases significantly, as evident in Table 1. On average, the ECE is 4.6 times higher for cross-domain and 7.9 times higher for cross-NS. This answers the first part of our first research question: deterministic BERT-based rankers do not have robustly calibrated predictions, failing in scenarios where there is a distributional shift.

[Table 3: Relative improvements (higher is better) of R_10@1 of RA-BERT_E and RA-BERT_D over the mean of stochastic BERT predictions (S-BERT_E and S-BERT_D). Superscript † denotes statistically significant improvements over the S-BERT ranker at 95% confidence interval using Student's t-tests.]
In order to answer the remaining part of RQ1, on how calibrated stochastic BERT-based rankers are, let us consider Table 2. It displays the improvements (relative drop in ECE) over BERT in terms of calibration. S-BERT_E has on average 14% less calibration error than BERT, while S-BERT_D has on average 10% less, answering our first research question: stochastic BERT-based rankers are better calibrated than deterministic BERT-based rankers. We hypothesize that S-BERT_E leads to a lower ECE than S-BERT_D because it better captures the model uncertainty of the training procedure, since it combines different sets of weights that explain the prediction of relevance given the inputs equally well. In the next section we evaluate the effectiveness of these better-calibrated models and of taking uncertainty into account when ranking.

Uncertainty Estimates for Risk-Aware Neural Ranking (RQ2)
In order to evaluate the quality of the uncertainty estimates, we first use them as a measure of risk through risk-aware neural ranking (RA-BERT_D and RA-BERT_E). Figure 3 displays the effectiveness in terms of R_10@1 gains over BERT for the different settings (cross-domain and cross-NS) when varying the risk aversion b.
We note that when b = 0, we use the mean of the predictive distribution and disregard the risk, which is equivalent to S-BERT_D and S-BERT_E. The ensemble-based average S-BERT_E is more effective than the baseline BERT for almost all combinations, and S-BERT_D is equivalent to the baseline. When using b < 0, we rank with risk predilection (the opposite of risk aversion); in all conditions we found that the effectiveness was significantly worse than when b = 0, and thus b < 0 is not displayed in Figure 3.
When increasing the risk aversion (b > 0), we see different effects depending on the combination of domain and NS. For instance, when training on MSDialog and applying to UDC_DSTC8, increasing the risk aversion improves the effectiveness of RA-BERT_E until b reaches 0.25, after which the effectiveness drops, meaning that too much risk aversion is not effective. In order to investigate whether ranking with risk aversion is more effective than using the predictive distribution mean, we select b based on the best value observed on the validation set. Table 3 displays the results of this experiment, showing the improvements of RA-BERT_D and RA-BERT_E over S-BERT_D and S-BERT_E respectively. The results show that only in a few cases (8 out of 30) the best value of b is 0, i.e. cases for which risk aversion is not the best option on the development set. We obtain effectiveness improvements primarily in the cross-NS condition (up to 17.2% improvement of R_10@1), which is the hardest condition (where the models are most ineffective, cf. Table 1). This answers our second research question, indicating that the uncertainties obtained from stochastic neural rankers are useful for risk-aware ranking, especially in the cross-NS setting where the baseline model is quite ineffective. RA-BERT_E is on average 2% more effective than S-BERT_E, while RA-BERT_D is on average 1.7% more effective than S-BERT_D.

Uncertainty Estimates for NOTA Prediction (RQ3)
Besides using the uncertainty estimates for risk-aware ranking, we also employ them for the NOTA (None Of The Above) prediction task. We compare different input spaces for the NOTA classifier. E[R_D] stands for the input space that only uses the mean of the predictive distribution for the k candidate responses in R using S-BERT_D; +var[R_E] uses both E[R_D] and the uncertainties of S-BERT_E for the k candidates; and +var[R_D] uses both the scores E[R_D] and the uncertainties of S-BERT_D. Our results show that the uncertainties from S-BERT_D and from S-BERT_E significantly improve the F1 for NOTA prediction in both cross-domain (Table 4, improvement of 24% on average when using S-BERT_D) and cross-NS settings (Table 5, improvement of 46% on average when using S-BERT_D). We can thus answer our last research question: the uncertainty estimates from stochastic neural rankers do improve the effectiveness of the NOTA prediction task (by an average of 33% across all conditions considered).

Conclusions
In this work we study the calibration and uncertainty estimation of neural rankers, specifically BERT-based rankers. We first show that the deterministic BERT-based ranker is not robustly calibrated for the task of conversation response ranking, and we improve its calibration with two techniques that estimate uncertainty through stochastic neural ranking. We also show the benefits of uncertainty estimation for risk-aware neural ranking and for predicting unanswerable conversational contexts. As future work, it is important to investigate the use of stochastic rankers in other settings, such as other neural L2R architectures, other search and retrieval tasks (Guo et al., 2019; Diaz et al., 2020; Lin et al., 2020), and the ensembling of neural rankers (Zuccon et al., 2011).