Uncertainty Measures in Neural Belief Tracking and the Effects on Dialogue Policy Performance

The ability to identify and resolve uncertainty is crucial for the robustness of a dialogue system. Indeed, this has been confirmed empirically on systems that utilise Bayesian approaches to dialogue belief tracking. However, such systems consider only confidence estimates and have difficulty scaling to more complex settings. Neural dialogue systems, on the other hand, rarely take uncertainties into account. They are therefore overconfident in their decisions and less robust. Moreover, the performance of the tracking task is often evaluated in isolation, without consideration of its effect on the downstream policy optimisation. We propose the use of different uncertainty measures in neural belief tracking. The effects of these measures on the downstream task of policy optimisation are evaluated by adding selected measures of uncertainty to the feature space of the policy and training policies through interaction with a user simulator. Both human and simulated user results show that incorporating these measures leads to improvements both of the performance and of the robustness of the downstream dialogue policy. This highlights the importance of developing neural dialogue belief trackers that take uncertainty into account.


Introduction
In task-oriented dialogue, the system aims to assist the user in obtaining information. This is achieved through a series of interactions between the user and the system. As the conversation progresses, it is the role of the dialogue state tracking module to track the state of the conversation. For example, in a restaurant recommendation system, the state would include the information about the cuisine of the desired restaurant, its area as well as the price range that the user has in mind. It is crucial that this state contains all information necessary for the dialogue policy to make an informed decision for the next action. Policy training optimises decision making in order to complete dialogues successfully.
It has been proposed within the partially observable Markov decision process (POMDP) approach to dialogue modelling to track the distribution over all possible dialogue states, the belief state, instead of a single most-likely candidate. This approach successfully integrates uncertainty to achieve robustness (Williams and Young, 2007; Young et al., 2016). However, such systems do not scale well to complex multi-domain dialogues. On the other hand, discriminative neural approaches to dialogue tracking achieve state-of-the-art performance in the state tracking task. Nevertheless, the state-of-the-art goal accuracy on the popular MultiWOZ multi-domain benchmark is currently only at 60% (Li et al., 2020a). In other words, even the best neural dialogue state trackers at present incorrectly predict the state of the conversation in 40% of the turns. What is particularly problematic is that these models are fully confident about their incorrect predictions.
Unlike neural dialogue state trackers, which predict a single best dialogue state, neural belief trackers produce a belief state (Williams and Young, 2007; Henderson et al., 2013). State-of-the-art neural belief trackers, however, achieve an even lower goal accuracy of approximately 50%, making the more accurate state trackers a preferred approach. High-performing state trackers typically rely on span-prediction approaches, which are unable to produce a distribution over all possible states as they extract information directly from the dialogue.
Ensembles of models are known to yield improved predictive performance as well as a calibrated and rich set of uncertainty estimates (Malinin, 2019; Gal, 2016). Unfortunately, ensemble generation and, especially, inference come at a high computational and memory cost which may be prohibitive. While standard ensemble distillation (Hinton et al., 2015) can be used to compress an ensemble into a single model, information about ensemble diversity, and therefore several uncertainty measures, is lost. Ensemble distribution distillation (EnD 2 ) was recently proposed (see Ryabinin et al., 2021) as an approach to distill an ensemble into a single model which preserves both the ensemble's improved performance and full set of uncertainty measures at low inference cost.
In this work we use EnD 2 to distill an ensemble of neural belief trackers into a single model and incorporate additional uncertainty measures, namely confidence scores, total uncertainty (entropy) and knowledge uncertainty (mutual information), into the belief state of the neural dialogue system. This yields an uncertainty-aware neural belief tracker and allows downstream dialogue policy models to use this information to resolve confusion. To our knowledge, ensemble distillation, especially ensemble distribution distillation, and the derived uncertainty estimates, have not been examined for belief state estimation or any downstream tasks.
We make the following contributions:
1. We present SetSUMBT, a modified SUMBT belief tracking model, which incorporates set similarity for accurate state predictions and produces components essential for policy optimisation.
2. We deploy ensemble distribution distillation to obtain well-calibrated, rich estimates of uncertainty in the dialogue belief tracker. The resulting model produces state-of-the-art results in terms of calibration measures.
3. We demonstrate the effect of adding additional uncertainty measures in the belief state on the downstream dialogue policy models and confirm the effectiveness of these measures both in a simulated environment and in a human trial.

Dialogue Belief Tracking
In statistical approaches to dialogue, one can view the dialogue as a Markov decision process (MDP) (Levin et al., 1998). This MDP maintains a Markov dialogue state in each turn and chooses its next action based on this state. Alternatively, we can model the dialogue state as a latent variable, maintaining a belief state at each turn, as in partially observable Markov decision processes (POMDPs) (Williams and Young, 2007). While attractive in theory, the POMDP model is computationally expensive in practice. Although there are practical implementations, they are limited to single-domain dialogues and their performance falls short of discriminative statistical belief trackers (Williams, 2012). The inherent problem lies in the generative nature of POMDP trackers, where the state generates noisy observations. This becomes an issue, for instance, when the user wants to change the goal of a conversation, e.g., the user wants an Italian instead of a French restaurant. Henderson (2015) has shown empirically that discriminative models capture a change in user goal more accurately.
In discriminative approaches, the state depends on the observation, making it easier for the system to identify a change of the user goal. Traditional discriminative approaches suffer from low robustness, as they depend on static semantic dictionaries for feature extraction (Henderson et al., 2014; Mrkšić et al., 2017b). Integrated approaches, on the other hand, utilise learned token vector representations, leading to more robust state trackers (Mrkšić et al., 2017a; Ramadan et al., 2018). However, highly over-parameterised models such as neural networks, when trained via maximum likelihood on finite data, often yield miscalibrated, over-confident predictions, placing all probability mass on a single outcome (Pleiss et al., 2017). Consequently, belief tracking is reduced to state tracking, losing the benefits of uncertainty management. State-of-the-art approaches to dialogue state tracking redefine the problem as a span-prediction task. These models extract the values directly from the dialogue context (Chao and Lane, 2019) and manage to achieve state-of-the-art results on MultiWOZ (Eric et al., 2020). Span-prediction models at present do not produce probability distributions, so additional work is needed to apply our proposed uncertainty framework to them. Neural belief and state trackers rarely model the correlation between domain-slot pairs, except for works by Hu et al. (2020) and Ye et al. (2021). Due to scalability issues we do not include these approaches in our investigation. We therefore consider the slot-utterance matching belief tracker (SUMBT) (Lee et al., 2019) a better starting point, as it is readily able to produce a belief state distribution.
In theory, well-calibrated belief trackers have an inherent advantage over state tracking, producing uncertainty estimates that lead to more robust downstream policy performance. This raises the question: Is it possible to instil well-calibrated uncertainty estimates in neural belief trackers? And if so, do these estimates have a positive effect on the downstream policy optimisation in practice?
We believe SUMBT is a fitting approach to investigate these questions, as it has been shown that an ensemble of SUMBT models can achieve state-of-the-art goal L2-Error when trained using specialised loss functions aiming at inducing uncertainty in the output.

Ensemble-based Uncertainty Estimation
Consider a classification problem with a set of features $x$ and outcomes $y \in \{\omega_1, \omega_2, ..., \omega_K\}$. In dialogue state tracking, $x$ would be features of the input to the tracker and $y$ would be a dialogue state. Given an ensemble of $M$ models $\{P(y|x, \theta^{(m)})\}_{m=1}^{M}$, the predictive posterior is obtained as follows:
$$P(y|x) = \frac{1}{M} \sum_{m=1}^{M} P(y|x, \theta^{(m)}) \tag{1}$$
Predictions made using the predictive posterior are often better than those of individual models. The entropy $\mathcal{H}[\cdot]$ of the predictive posterior is an estimate of total uncertainty. Ensembles allow decomposing total uncertainty into data and knowledge uncertainty by considering measures of ensemble diversity. Data uncertainty is the uncertainty due to noise, ambiguity and class overlap in the data. Knowledge uncertainty is uncertainty due to a lack of knowledge of the model about a test input (Malinin, 2019; Gal, 2016), i.e., uncertainty due to unfamiliar, anomalous or atypical inputs. Ideally, ensembles should yield consistent predictions on data similar to the training data and diverse predictions on data which is significantly different from the training data. Thus measures of ensemble diversity yield estimates of knowledge uncertainty. These quantities are obtained via the mutual information $\mathcal{I}[y, \theta]$ between predictions and model parameters:
$$\mathcal{I}[y, \theta | x] = \mathcal{H}\Big[\frac{1}{M} \sum_{m=1}^{M} P(y|x, \theta^{(m)})\Big] - \frac{1}{M} \sum_{m=1}^{M} \mathcal{H}\big[P(y|x, \theta^{(m)})\big] \tag{2}$$
The quantity in Equation 2 is a measure of ensemble diversity, and therefore of knowledge uncertainty. It is the difference between the entropy of the predictive posterior (total uncertainty) and the average entropy of each model in the ensemble (data uncertainty).
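The decomposition described above can be illustrated with a minimal sketch in plain Python. The function names and the toy ensemble outputs are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def ensemble_uncertainties(member_probs):
    """Decompose uncertainty for one input.

    member_probs: list of M categorical distributions, one per ensemble member.
    Returns (predictive posterior, total, data, knowledge uncertainty).
    """
    m = len(member_probs)
    k = len(member_probs[0])
    # Predictive posterior: the average of the member distributions.
    posterior = [sum(p[i] for p in member_probs) / m for i in range(k)]
    total = entropy(posterior)                        # total uncertainty
    data = sum(entropy(p) for p in member_probs) / m  # average member entropy
    knowledge = total - data                          # mutual information
    return posterior, total, data, knowledge


# Hypothetical toy outputs of a two-member ensemble over three classes.
agree = [[0.9, 0.05, 0.05], [0.9, 0.05, 0.05]]
disagree = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]]
for name, probs in [("agree", agree), ("disagree", disagree)]:
    _, total, data, knowledge = ensemble_uncertainties(probs)
    print(f"{name}: total={total:.3f} data={data:.3f} knowledge={knowledge:.3f}")
```

When the members agree, the knowledge uncertainty is essentially zero even though the total uncertainty is not; when they confidently disagree, most of the total uncertainty is attributed to knowledge uncertainty.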

Ensemble Distillation
While ensembles provide improved predictive performance and a rich set of uncertainty measures, their practical application is limited by their inference-time computational cost. Ensemble distillation (EnD) (Hinton et al., 2015) can be used to compress an ensemble into a single student model (with parameters φ) by minimising the Kullback-Leibler (KL) divergence between the ensemble predictive posterior and the distilled model predictive posterior, significantly reducing the inference cost. Unfortunately, a significant drawback of this method is that information about ensemble diversity, and therefore knowledge uncertainty, is lost in the process. Ensemble distribution distillation (EnD 2 ) was recently proposed as an approach to distill an ensemble into a single prior network model (Malinin and Gales, 2018), such that the model retains information about ensemble diversity. Prior networks yield a higher-order Dirichlet distribution over categorical output distributions π and thereby emulate ensembles, whose output distributions can be seen as samples from a higher-order distribution. Formally, a prior network is defined as follows:
$$p(\pi | x, \phi) = \mathrm{Dir}(\pi | \alpha), \quad \alpha = e^{z}, \quad z = f(x; \phi) \tag{3}$$
where Dir(·|α) is a Dirichlet distribution with concentration parameters α, and f(·; φ) is a learned function which yields the logits z. The predictive posterior can be obtained in closed form through marginalisation over π, thereby emulating Eq. (1). This yields a softmax output function:
$$P(y = \omega_k | x, \phi) = \mathbb{E}_{p(\pi | x, \phi)}\big[P(y = \omega_k | \pi)\big] = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \tag{4}$$
Closed form estimates of all uncertainty measures are obtained via Eq. (5), which emulates the same underlying mechanics as Eq. (2), as follows (Malinin, 2019):
$$\mathcal{I}[y, \pi | x, \phi] = \mathcal{H}\big[\mathbb{E}_{p(\pi | x, \phi)}[P(y | \pi)]\big] - \mathbb{E}_{p(\pi | x, \phi)}\big[\mathcal{H}[P(y | \pi)]\big] \tag{5}$$
EnD 2 was originally implemented on the CIFAR10, CIFAR100 and TinyImageNet datasets. However, Ryabinin et al. (2021) found scaling to tasks with many classes challenging using the original Dirichlet negative log-likelihood criterion.
They analysed this scaling problem and proposed a new loss function, which minimises the reverse KL-divergence between the model and an intermediate proxy Dirichlet target derived from the ensemble. This loss function was shown to enable EnD 2 on tasks with arbitrary numbers of classes. In this work we use this improved loss function, as detailed in the Appendix Section B.2.
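As an illustration of how a prior network's Dirichlet output gives closed-form uncertainty estimates, the following sketch computes total, data and knowledge uncertainty from a vector of logits. It assumes concentration parameters obtained by exponentiating the logits and relies on the standard identity for the expected entropy of a categorical distribution under a Dirichlet. The helper names and the hand-written `digamma` approximation are illustrative assumptions, not the authors' code.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift upwards using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_uncertainties(logits):
    """Closed-form uncertainties of Dir(pi | alpha) with alpha = exp(logits)."""
    alphas = [math.exp(z) for z in logits]
    alpha0 = sum(alphas)
    mean = [a / alpha0 for a in alphas]  # predictive posterior = softmax(logits)
    total = -sum(p * math.log(p) for p in mean if p > 0.0)
    # Expected entropy of pi under the Dirichlet (expected data uncertainty).
    data = -sum(p * (digamma(a + 1.0) - digamma(alpha0 + 1.0))
                for p, a in zip(mean, alphas))
    knowledge = total - data             # mutual information between y and pi
    return mean, total, data, knowledge


# Same predictive posterior, different sharpness of the Dirichlet.
for logits in ([0.0, 0.0, 0.0], [5.0, 5.0, 5.0]):
    _, total, data, knowledge = dirichlet_uncertainties(logits)
    print(logits, f"total={total:.4f} data={data:.4f} knowledge={knowledge:.4f}")
```

Both inputs give a uniform predictive posterior, so the total uncertainty is identical; the larger logits correspond to a sharper Dirichlet, which the sketch reports as lower knowledge uncertainty.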

Policy Optimisation
In each turn of dialogue, the dialogue policy selects an action to take in order to successfully complete the dialogue. The input to the policy is constructed using the output of the belief state tracker, thus being directly impacted by its richness.
Optimising dialogue policies within the original POMDP framework is not practical for most cases. Therefore, the POMDP is viewed as a continuous MDP whose state space is the belief space. This state space can be discretised, so that tabular reinforcement learning (RL) algorithms can be applied (Gašić et al., 2008). Gaussian process RL can be applied directly on the original belief space (Gašić and Young, 2014). This is also possible using neural approaches with less computational effort (Jurčíček et al., 2011; Weisz et al., 2018). Current state-of-the-art RL algorithms for multi-domain dialogue management (Takanobu et al., 2019; Li et al., 2020b) utilise proximal policy optimisation (Schulman et al., 2017) operating on a single best dialogue state.

Effects of Uncertainty on Downstream Tasks
We take the following steps in order to examine the effects of the additional uncertainty measures in the dialogue belief state:
1. Modify the original SUMBT model to arrive at a competitive baseline. We call this model SetSUMBT.
2. Produce ensembles of SetSUMBT following the work of van Niekerk et al. (2020).
3. Apply EnD and EnD 2 as introduced in Section 2.3.
4. Apply policy optimisation that uses belief states from distilled models.

Neural Belief Tracking Model
We propose a neural belief tracker which one can easily incorporate in a full dialogue system pipeline. We base our tracker on the slot-utterance matching belief tracker (SUMBT), but we make two important changes. First, we ensure our tracker is fully in line with the requirements of the hidden information state (HIS) model for dialogue management by adding user action predictions to our tracker. These are produced neither by the SUMBT model nor by other available neural trackers. However, they are essential for integration into a full dialogue system. Second, in order to improve the understanding ability of the model, we utilise a set of concept description embeddings rather than a single embedding for semantic concepts. We use this set of embeddings for information extraction and prediction, hence we call our model SetSUMBT. In this section we describe each component in detail, as also depicted in Figure 1.
Slot-utterance matching The slot-utterance matching (SUM) component performs the role of language understanding in the SUMBT architecture. The SUM multi-head attention mechanism (Vaswani et al., 2017) attends to the relevant information in the current turn for a specific domain-slot pair. In the process of slot-utterance matching, SUMBT utilises BERT's (Devlin et al., 2019) [CLS] sequence embedding to represent the semantic concepts in the model ontology.
Instead of using the single [CLS] embedding, we make use of the sequence of embeddings for the domain-slot description. We choose to make this expansion, as approaches which utilise a sequence of embeddings outperform approaches based on a single embedding in various natural language processing tasks (Poerner et al., 2020;Choi et al., 2021). We further use RoBERTa as a feature extractor (Liu et al., 2019).
Dialogue context tracking The first of the three components of the HIS model is a representation of the dialogue context (history). In the SUMBT approach, a gated-recurrent unit mechanism tracks the most important information during a dialogue. The resulting context conditioned representations for the domain-slot pairs contain the relevant information from the dialogue history. Similar to the alteration in the SUM component, we represent the dialogue context as a sequence of representations. This sequence, $C_t^s$, represents the dialogue context for domain-slot pair $s$ across turns 1 to $t$, while its dimension is independent of $t$. Besides the above modification, we add a further step where we reduce this sequence of context representations to a single representation $\hat{y}_t^s$. We do this reduction using a learned convolutional pooler, which we call the Set Pooler. See Appendix Section C for more details regarding the implementation.
User goal prediction The second component of the HIS model is the user goal. This is typically the only component that neural tracking models explicitly model as a set of domain-slot-value pairs. Here, we follow the matching network approach (Vinyals et al., 2016) utilised by SUMBT, where the predictive distribution is based on the similarity between the dialogue context and the value candidates. To obtain the similarity between the dialogue context and a value candidate we make use of cosine similarity, $S_{cos}(\cdot, \cdot)$. Based on these similarity scores, we produce a predictive distribution, Equation 6, for the value $v_t^s$ of domain-slot pair $s$ at turn $t$, given the user and system utterances at turn $t$, $u_t^{usr}$ and $u_{t-1}^{sys}$, and the dialogue context representations at turn $t-1$, $C_{t-1}^s$:
$$p(v_t^s = v | u_t^{usr}, u_{t-1}^{sys}, C_{t-1}^s) = \frac{\exp\big(S_{cos}(\hat{y}_t^s, y_v)\big)}{\sum_{v' \in V_s} \exp\big(S_{cos}(\hat{y}_t^s, y_{v'})\big)} \tag{6}$$
where $V_s$ denotes the set of value candidates for domain-slot pair $s$. Contrary to the SUMBT approach, each value candidate is represented by the sequence of value description embeddings from a fixed RoBERTa model. The Set Pooler, with the same parameters used for pooling context representations, reduces this sequence of value description representations to a representation $y_v$ for value $v$.
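The similarity-based prediction can be sketched as follows. The sketch assumes that the predictive distribution is a softmax over the cosine similarities between a pooled context vector and pooled value-candidate vectors; the toy three-dimensional vectors and candidate names are hypothetical and stand in for the Set Pooler outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def value_distribution(context_vec, candidate_vecs):
    """Softmax over cosine similarities to every value candidate.

    candidate_vecs: dict mapping a candidate value to its pooled vector.
    """
    scores = {v: cosine_similarity(context_vec, y)
              for v, y in candidate_vecs.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {v: math.exp(s - max_score) for v, s in scores.items()}
    normaliser = sum(exps.values())
    return {v: e / normaliser for v, e in exps.items()}


# Hypothetical pooled representations for the slot "restaurant-food".
context = [0.9, 0.1, 0.2]
candidates = {
    "italian": [1.0, 0.0, 0.1],
    "french": [0.1, 1.0, 0.0],
    "none": [0.0, 0.1, 1.0],
}
belief = value_distribution(context, candidates)
print(max(belief, key=belief.get), belief)
```

Because cosine similarities are bounded in [-1, 1], a plain softmax over them cannot become arbitrarily peaked; whether a scaling factor is applied in practice is not stated in the text, so none is used here.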
User action prediction To be fully in line with the HIS model, we further require the predicted user actions. In order to predict the user actions, we categorise them into general user actions and user request actions. Further, since our system is a multi-domain system, we include the current active domain in the hidden information state of the system. General user actions include actions such as the user thanking or greeting the system, which do not rely on the dialogue context. Hence, we can infer general user actions from the current user utterance. A user request action is an action indicating that the user is requesting information about an entity. Zhu et al. (2020) show that simple rule-based estimates of these actions lead to poor downstream policy performance. Hence, we propose predicting this information within the belief tracking model.
Since we can infer the general actions from the current user utterance, we use a single turn representation $x_t^0$ to predict such actions. The single turn representation $x_t^0$ is the representation of the RoBERTa sequence token <s>, which is equivalent to the BERT [CLS] representation. From $x_t^0$ the model predicts a distribution over the general user actions $a \in$ {none, thank_you, goodbye}.
The more difficult sub-tasks include active user request and active domain prediction. For user request prediction we utilise the dialogue context representation $\hat{y}_t^s$ for a specific domain-slot pair to predict whether the user has requested information relating to this slot. The model thus outputs the probability of $r_t^s$, which indicates an active request for domain-slot $s$ by the user in turn $t$.
Last, to predict active domains in the dialogue, we incorporate information relating to all slots associated with a specific domain. We do so by performing mean reduction across the context representations of all the slots associated with a domain. The resulting domain representations are used to predict whether a domain is currently being discussed in the dialogue. That is, for active domain $d_t$, $S_d$ the set of slots within domain $d$, and $C_{t-1}^d := \{C_{t-1}^s\}_{s \in S_d}$ the set of context representations for all domain-slot pairs in $S_d$ at turn $t-1$, the active domain distribution is predicted from the mean-reduced representation of the slots in $S_d$.

Figure 1: Architecture of our SetSUMBT model, which takes as input the current user utterance, the latest system utterance, and a domain-slot pair description. The model, further, requires a pre-defined set of plausible value candidates for each domain-slot pair. At each turn, we encode the utterances only once, the Slot-utterance matching and Context tracking components are utilised once for each domain-slot pair. Further, we use the Set Pooler once for each domain-slot pair and once for each value candidate. The Set Pooler used for pooling value candidate and domain-slot context sequences shares the same parameters θ. SetSUMBT outputs a belief state distribution for the relevant domain-slot pair (User goal), a distribution over general actions, and the probability of a user request for the domain-slot pair (User action). The model also outputs the probability of an active domain.
Optimisation For each of the four tasks (user goal prediction, general user action prediction, user request action prediction and active domain prediction), the aim of the model is to predict the correct class. To optimise for these objectives, we minimise the following classification loss functions: $\mathcal{L}_{goal}$, $\mathcal{L}_{general}$, $\mathcal{L}_{request}$ and $\mathcal{L}_{domain}$. During model training we combine the four weighted classification objectives:
$$\mathcal{L} = \alpha_{goal} \mathcal{L}_{goal} + \alpha_{general} \mathcal{L}_{general} + \alpha_{request} \mathcal{L}_{request} + \alpha_{domain} \mathcal{L}_{domain} \tag{10}$$
where $\alpha_x \in (0, 1]$ is the importance of task $x$. In this work, we use the label smoothing classification loss for all sub-tasks as it results in better calibrated predictions, as shown by van Niekerk et al. (2020); see details in Section B.1 of the appendix.
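The weighted combination of the four objectives can be sketched as below. The label smoothing variant shown is the common uniform-smoothing form; the exact formulation used in the paper is in its appendix and is not reproduced here, and the smoothing value, task weights and example probabilities are hypothetical.

```python
import math


def label_smoothing_loss(probs, target, epsilon=0.05):
    """Cross-entropy against a uniformly smoothed one-hot target.

    probs: predicted class probabilities; target: index of the true class;
    epsilon: smoothing mass (hypothetical default, not taken from the paper).
    """
    k = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        q = (1.0 - epsilon) * (1.0 if i == target else 0.0) + epsilon / k
        loss -= q * math.log(max(p, 1e-12))  # clamp to avoid log(0)
    return loss


def combined_loss(task_losses, task_weights):
    """Weighted sum of per-task losses, with every weight in (0, 1]."""
    for task, weight in task_weights.items():
        if not 0.0 < weight <= 1.0:
            raise ValueError(f"weight for task {task!r} must lie in (0, 1]")
    return sum(task_weights[task] * loss for task, loss in task_losses.items())


# Hypothetical single-example predictions for the four sub-tasks.
losses = {
    "goal": label_smoothing_loss([0.7, 0.2, 0.1], target=0),
    "general": label_smoothing_loss([0.9, 0.05, 0.05], target=0),
    "request": label_smoothing_loss([0.2, 0.8], target=1),
    "domain": label_smoothing_loss([0.4, 0.6], target=1),
}
weights = {"goal": 1.0, "general": 0.5, "request": 0.5, "domain": 0.5}
print(combined_loss(losses, weights))
```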

Uncertainty Estimation in SetSUMBT
Similarly to van Niekerk et al. (2020), we construct an ensemble of SetSUMBT models by training each model on one of 10 randomly selected subsets of data. We then distil this ensemble into a single model by adopting ensemble distillation (EnD) and ensemble distribution distillation (EnD 2 ) as described in Section 2.3. We refer to these distilled versions of the SetSUMBT ensemble as EnD-SetSUMBT and EnD 2 -SetSUMBT, respectively.
The SetSUMBT belief tracker tracks the presence and value of each domain-slot pair $s$ as the dialogue progresses. For the sake of scalability of the downstream policy, in the user goal $g$ we do not consider all possible values, but rather the most likely one, $v^s$, for every domain-slot pair $s$ and its associated probability, i.e., the confidence score $h^g_{t,s}$, summarised in the vector $h^g_t$ for all domain-slot pairs:
$$h^g_{t,s} = \max_{v} \, p(v_t^s = v | u_t^{usr}, u_{t-1}^{sys}, C_{t-1}^s), \quad h^g_t = \{h^g_{t,s}\}_{\forall s} \tag{11}$$
For the EnD-SetSUMBT belief tracker, we can also calculate the total uncertainty for each domain-slot pair, given by the entropy, see Section 2.3. We encode that information in $h^{unc}_{t,s}$ for each domain-slot pair $s$ and summarise it in $h^{unc}_t$ for all domain-slot pairs:
$$h^{unc}_{t,s} = \mathcal{H}\big[p(v_t^s | u_t^{usr}, u_{t-1}^{sys}, C_{t-1}^s)\big], \quad h^{unc}_t = \{h^{unc}_{t,s}\}_{\forall s} \tag{12}$$
For the EnD 2 -SetSUMBT belief tracker, we can further include the knowledge uncertainty for each domain-slot pair $s$, given by the mutual information:
$$h^{unc}_{t,s} = \mathcal{I}\big[v_t^s, \pi | u_t^{usr}, u_{t-1}^{sys}, C_{t-1}^s, \phi\big] \tag{13}$$
as per Eq. (5), where π represents the ensemble distribution and φ the model parameters.
In addition, all versions of SetSUMBT include the following vectors/variables: $h^g_t$ is the estimate of the user goal from Eq. (11), $h^{usr}_t$ is the estimate of user actions from Eq. (7-9), $h^{db}_t$ is the database search result, $h^{sys}_{t-1}$ is the system action, $h^{book}_t$ is the set of completed bookings, and $h^{term}_t$ indicates the termination of the dialogue. Together with the uncertainty vector $h^{unc}_t$, these components form the belief state $\{h^g_t, h^{usr}_t, h^{db}_t, h^{sys}_{t-1}, h^{book}_t, h^{term}_t, h^{unc}_t\}$. For a system without uncertainty, all confidences would be rounded to either 0 or 1 and the belief state would not contain the $h^{unc}_t$ vector.
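A minimal sketch of how the per-slot policy features described above could be derived from a slot's belief distribution. The function name, the dictionary layout, the rounding threshold of 0.5 and the example distribution are assumptions for illustration; only the idea of keeping the most likely value with its confidence, optionally adding an entropy feature, and rounding confidences for the no-uncertainty setting comes from the text.

```python
import math


def slot_features(distribution, use_uncertainty=True):
    """Reduce a slot's belief distribution to policy input features.

    distribution: dict mapping candidate values to probabilities.
    With use_uncertainty=False the confidence is rounded to 0 or 1 and no
    entropy feature is produced (the binary, no-uncertainty state).
    """
    value, confidence = max(distribution.items(), key=lambda item: item[1])
    if not use_uncertainty:
        # Assumed rounding rule: confidences of at least 0.5 become 1.
        return {"value": value, "confidence": 1.0 if confidence >= 0.5 else 0.0}
    entropy = -sum(p * math.log(p) for p in distribution.values() if p > 0.0)
    return {"value": value, "confidence": confidence, "entropy": entropy}


# Hypothetical belief over the slot "hotel-area" after an ambiguous utterance.
hotel_area = {"east": 0.55, "centre": 0.40, "none": 0.05}
print(slot_features(hotel_area, use_uncertainty=True))
print(slot_features(hotel_area, use_uncertainty=False))
```

In the binary setting the policy sees the same input for a confidence of 0.55 as for a confidence of 0.99, which is the information loss that the uncertainty-aware state is meant to avoid.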

Policy Optimisation as Downstream Task
For our experiments we optimise the dialogue policy operating on the belief state via RL using the PPO algorithm (Schulman et al., 2017). PPO is an on-policy actor-critic algorithm that is widely applied across different reinforcement learning tasks because of its good performance and simplicity. Similarly to Takanobu et al. (2019), we use supervised learning to pretrain the policy before starting the RL training phase. In order to perform supervised learning we need to map the belief states into system actions as they occur in the corpus. These belief states can either be oracle states taken from the corpus or predictions of our belief tracker that takes corpus dialogues as input. We investigate both options for policy training.
We consider the joint goal accuracy (JGA), L2-Error and expected calibration error (ECE). The JGA of a belief tracking model is the percentage of turns for which the model correctly predicted the value for all domain-slot pairs. The L2-Error is the L2-Norm of the difference between the predicted user goal distribution and the true user goal. Further, the ECE is the average absolute difference between the accuracy and the confidence of a model. In this comparison, we do not consider state tracking approaches, as they do not yield uncertainty estimates. SetSUMBT outperforms SUMBT and SUMBT+LaRL (Lee et al., 2020) in terms of calibration and accuracy. We name the variants of SetSUMBT as follows: CE-SetSUMBT is a calibrated ensemble of SetSUMBT similar to CE-BST, EnD-SetSUMBT is the distilled SetSUMBT model, and EnD 2 -SetSUMBT is the distribution distilled SetSUMBT model.
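The three evaluation measures can be sketched as follows. The binned estimator used for the ECE is the common one and is an assumption here, as are the function names, the number of bins and the toy inputs; the paper's evaluation scripts may differ in detail.

```python
import math


def joint_goal_accuracy(predicted_states, true_states):
    """Fraction of turns in which every domain-slot value is correct."""
    correct = sum(1 for p, t in zip(predicted_states, true_states) if p == t)
    return correct / len(true_states)


def l2_error(distribution, true_value):
    """L2 norm between a predicted distribution and the one-hot true goal."""
    values = set(distribution) | {true_value}
    return math.sqrt(sum(
        (distribution.get(v, 0.0) - (1.0 if v == true_value else 0.0)) ** 2
        for v in values))


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted average of |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 -> last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Hypothetical two-turn example.
predicted = [{"hotel-area": "east"}, {"hotel-area": "centre"}]
gold = [{"hotel-area": "east"}, {"hotel-area": "north"}]
print(joint_goal_accuracy(predicted, gold))
print(l2_error({"east": 0.6, "centre": 0.4}, "east"))
print(expected_calibration_error([0.95, 0.95, 0.55], [True, False, True]))
```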

Runtime efficiency
The single instance of the SetSUMBT tracker processes a dialogue turn in approximately 77.768 ms, whereas an ensemble of 10 models processes a turn in approximately 768.025 ms. These processing times are averaged across the 7372 turns in the MultiWOZ test set, see Appendix Section E for more details. The significant increase in processing time for the ensemble of models makes this approach inappropriate for real time interaction with users on a private device.
Calibration The reliability diagram in Figure 2 illustrates the relationship between the joint goal accuracy and the model confidence. The best calibrated model is the one that is closest to the diagonal, i.e., the one whose confidence for each dialogue state is closest to the achieved accuracy. The best reliability is achieved by CE-BST, and CE-SetSUMBT comes second. Both distillation models (EnD-SetSUMBT and EnD 2 -SetSUMBT) do not deviate greatly from CE-SetSUMBT.

Policy Training on User Simulator
We incorporate SetSUMBT, EnD-SetSUMBT and EnD 2 -SetSUMBT within the Convlab2 (Zhu et al., 2020) task-oriented dialogue environment and compare their performance by training policies which take belief states as inputs.
To investigate the impact of additional uncertainty measures on the dialogue policy we perform interactive learning in a more challenging environment than the original Convlab2 template-based simulator. We add ambiguity to the simulated user utterances in the form of value variations that occur in the MultiWOZ dataset. For example, instead of the user simulator asking for a hotel for "one person", it could also say "It will be just me.". For more information see Appendix Section D.
When policies are trained for large domains, they are typically first pretrained on the corpus in a supervised manner, and then improved using reinforcement learning. We first investigate which states to use for the supervised pretraining (Section 3.3): oracle states, i.e., the dialogue state labels from the MultiWOZ corpus, or estimated belief states, e.g., those predicted by an EnD-SetSUMBT model. We then evaluate the pretrained policies with the simulated user. During the evaluation both policies use an EnD-SetSUMBT model to provide belief states. We observe that the policy pretrained using the oracle state achieves a success rate of 36.50% in the simulated environment compared to the 46.08% success rate achieved by the policy pretrained using EnD-SetSUMBT. Thus, all our following experiments use the predicted belief states of the respective tracking models for the pretraining stage.
Table 2: Performance of the systems in the simulated environment. For each setting we have 5 policies initialised with different random seeds, each evaluated with 1000 dialogues, with their success rates, rewards and numbers of turns averaged.
For each setting of the belief tracker we have four possible belief state settings, i.e., the binary state (no uncertainty), the confidence score state, the confidence score state with additional total uncertainty features and the confidence score state with additional knowledge uncertainty features. For each setting we evaluate the policies through interaction with the user simulator; the results are given in Table 2.
In interaction with the simulator, systems making use of confidence outperform the systems without any uncertainty (significance at p < 0.05). Moreover, the additional total and knowledge uncertainty features always outperform the systems which only use a confidence score (significance at p < 0.05). This indicates that additional measures of uncertainty improve the robustness of the downstream dialogue policy in a challenging environment.
It is interesting to note that the system which makes use of total uncertainty appears to outperform the system that makes use of knowledge uncertainty (significance at p < 0.05). We suspect that this controlled simulated environment has low data uncertainty, so the total uncertainty is overall more informative.

Human Trial
We conduct a human trial, where we compare SetSUMBT as the baseline with EnD-SetSUMBT and EnD 2 -SetSUMBT. For EnD-SetSUMBT, we consider the model that includes both confidence scores and entropy features. For EnD 2 -SetSUMBT, we investigate the model that includes confidence scores and knowledge uncertainty features. For each model we have two variations: one with a binary state corresponding to the most likely state (no uncertainty variation), and one with uncertainty measures (uncertainty variation). For each variation we chose the policy whose performance on the simulated user is closest to the average performance of its respective setting, see Section 4.2.
Subjects are recruited through the Amazon Mechanical Turk platform to interact with our systems via the DialCrowd platform (Lee et al., 2018). Each assignment consists of a dialogue task and two dialogues to perform. The task comprises a set of constraints and goals, for example finding the name and phone number of a guest house in the downtown area. We encourage the subjects to use variants of labels by introducing random value variants in the tasks. The two dialogues are performed in a random order with two variations of the same model, namely the no-uncertainty and the uncertainty variation, as described above. After each dialogue, the subject rates the system as successful if they think they received all the information required and all constraints were met. The subjects also rate each system on a 5-point Likert scale. In total we collected approximately 550 dialogues for each of 6 different systems, 3300 in total. There was a total of 380 subjects who took part in these experiments.
Table 3 shows the performance of the above policies in the human trial. We confirm that each no-uncertainty system is always worse than its uncertainty counterpart (each significant at p < 0.05). It is important to emphasise here that in each pairing, the systems have exactly the same JGA, but their final performance can be very different in terms of success and user rating. This empirically demonstrates the limitations of JGA as a single measure for dialogue state tracking, urging the modelling of uncertainty and utilisation of calibration measures. Finally, we observe that adding additional uncertainty measures improves the policy (each significant at p < 0.05) and the best overall performance is achieved by the system that utilises both knowledge uncertainty and confidence scores (significant at p < 0.05). This suggests that in human interaction there is more data uncertainty, necessitating the knowledge uncertainty to be an explicit part of the model. It is important to note here that a lower average number of turns alone is not necessarily an indicator of the desired behaviour of a system. For example, a system which says goodbye too early may never be successful, but will have a low average number of turns.
Table 3: Performance of the systems evaluated with real users. We have 550 dialogues for each system, with success rates, ratings and the number of turns averaged.

Conclusion
Whilst neural dialogue state trackers may achieve state-of-the-art performance in the isolated dialogue state tracking task, the absence of uncertainty estimates may lead to less robust performance of the downstream dialogue policy. In this work we propose the use of total and knowledge uncertainties, along with confidence scores, to form a dialogue belief state. We moreover describe a model, SetSUMBT, that can produce such a belief state via distillation. Experiments with both simulated and real users confirm that these uncertainty metrics can lead to more robust dialogue policy models. In future work, we will investigate modifying span-based dialogue state trackers to incorporate uncertainty. We will further investigate extending the SetSUMBT model to include the correlation between different domain-slot pairings.

A Example dialogues
Here we provide two example dialogues from our human trial, which illustrate the advantage of using uncertainty. In both Figures 3 and 4, the system makes a second request in order to clarify the user's intent.
User: I need to book a cheap 4 star hotel.
System: Do you have a area preference?
User: I need the hotel to be eastern.
System: Is there a specific area of town you are interested in?
User: The east.

User: Find me a hotel in the city centre.
System: Do you have a price range preference?
User: High
System: It is in the centre. What price range would you like?
User: A highly priced hotel.
System: Everyone seems to enjoy Acorn guest house. It is an expensive hotel. Is there anything else I can help you with?

B Loss Functions
Consider the classification problem with input features $x$, oracle class $y$ and training dataset:
$$\mathcal{D}_{\text{train}} = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$$

B.1 Label Smoothing Loss Function
The label smoothing loss is a regularised variant of the standard negative log-likelihood loss. Here, instead of considering a one-hot target $y_I$, some noise is introduced into the target distribution in the form of:
$$y_{ls} = (1 - \epsilon)\, y_I + \frac{\epsilon}{K}\, \mathbf{1},$$
where $\epsilon$ is the smoothing parameter, $K$ the number of classes, $y_{ls}$ the noisy/smoothed targets and $y_I$ the one-hot representation of the target $y$. The objective is to minimise the KL divergence between the smoothed target $y_{ls}$ and the predictive distribution $P(y \mid x^{(i)}, \phi)$.
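To make the smoothed target concrete, the following is a minimal sketch in plain Python. It assumes the common form of label smoothing, in which a fraction epsilon of the probability mass is spread uniformly over all K classes; the function names, the default value of `epsilon` and the use of Python lists are illustrative assumptions, not details taken from this work.

```python
import math


def smooth_targets(label, num_classes, epsilon):
    """Smoothed target: (1 - epsilon) * one_hot(label) + epsilon / K (assumed form)."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon) + uniform if k == label else uniform
            for k in range(num_classes)]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_k = 0 contribute 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


def label_smoothing_loss(probs, label, epsilon=0.05):
    """KL divergence between the smoothed target and the predicted distribution.

    `probs` is the model's predictive distribution for one example and is
    assumed to be strictly positive (e.g. the output of a softmax).
    """
    targets = smooth_targets(label, len(probs), epsilon)
    return kl_divergence(targets, probs)


if __name__ == "__main__":
    # A confident prediction for class 0 still incurs a small positive loss.
    print(label_smoothing_loss([0.98, 0.01, 0.01], label=0, epsilon=0.1))
```

Because the target is never exactly one-hot, the loss is minimised by a prediction that keeps some mass on the other classes, which discourages overconfident outputs.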

B.2 Distillation Loss Functions
Here we detail the loss functions used for ensemble distillation (EnD) and ensemble distribution distillation (EnD²) in this work. Consider an ensemble $\{\theta^{(1)}, \ldots, \theta^{(M)}\}$ consisting of $M$ models, with predictive posterior $P(y \mid x^{(i)}, \mathcal{D}_{\text{train}})$.
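Standard ensemble distillation (Hinton et al., 2015) is accomplished by minimising the KL divergence between the ensemble's predictive posterior and a student model with parameters $\phi$, averaged over the $N$ training examples:
$$\mathcal{L}_{\text{EnD}}(\phi) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\left[ P(y \mid x^{(i)}, \mathcal{D}_{\text{train}}) \,\|\, P(y \mid x^{(i)}, \phi) \right]$$
Distribution distillation is accomplished using the improved loss function proposed by Ryabinin et al. (2021). Here, we first compute a Proxy Dirichlet Target with Dirichlet concentration parameters $\beta$ from the ensemble. Given this Proxy Dirichlet Target, distribution distillation is performed by minimising the loss of Ryabinin et al. (2021) with respect to the student parameters $\phi$.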
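As an illustration of the standard ensemble distillation (EnD) objective only, the sketch below computes the per-example KL divergence between the ensemble's predictive posterior and a student's prediction. It assumes that the predictive posterior is approximated by the mean of the M members' categorical distributions; the function names and the list-based representation are hypothetical. It does not implement the Proxy Dirichlet Target used for EnD².

```python
import math


def ensemble_mean(member_probs):
    """Average M categorical distributions into one predictive posterior."""
    num_members = len(member_probs)
    num_classes = len(member_probs[0])
    return [sum(p[k] for p in member_probs) / num_members
            for k in range(num_classes)]


def end_loss(member_probs, student_probs):
    """KL(ensemble posterior || student) for a single example.

    `student_probs` is assumed to be strictly positive (e.g. a softmax output).
    """
    teacher = ensemble_mean(member_probs)
    return sum(t * math.log(t / s)
               for t, s in zip(teacher, student_probs) if t > 0.0)


if __name__ == "__main__":
    members = [[0.6, 0.4], [0.8, 0.2]]   # two ensemble members, two classes
    print(end_loss(members, [0.5, 0.5]))  # positive: student differs from the mean
```

Note that the mean discards how much the members disagree with one another, which is the information that distribution distillation is designed to retain.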

C SetSUMBT Implementation Details
Here we provide details regarding the SetSUMBT model configuration and the model training configuration. Table 4 provides details about the configuration of the SetSUMBT model. Tables 5 and 6 provide details regarding the training configurations for both the single model and distillation (EnD) of SetSUMBT. For all SetSUMBT models the Set Pooler consists of a single convolutional layer with padding followed by a mean pooling layer.

D Variations in User Simulator Output
The user simulator used in our experiments consists of a natural language understanding (NLU) module, a rule-based user agent and a template-based natural language generation module, all provided in the ConvLab-2 environment (Zhu et al., 2020). A predefined set of rules simulates the user behaviour based on the predicted semantic system actions, and the resulting user actions are mapped to natural language using a predefined set of templates. To introduce variation into the user simulator utterances, and thus make understanding more difficult for the system, we utilise a set of predefined value variations obtained from the MultiWOZ 2.1 value map. For example, we can map the value expensive in the user action Inform-Restaurant-Price_range-expensive to any of the following options: [high end, high class, high scale, high price, high priced, higher price, fancy, upscale, nice, expensively, luxury].
In our experiments 20% of simulated user actions contain such variations.
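The substitution described above can be sketched as follows. This is an illustrative approximation, not the ConvLab-2 implementation: the dictionary `VALUE_VARIANTS`, the function `vary_value` and the choice to sample the variation independently for each value with probability 0.2 are assumptions made for the example. Only the listed variants of "expensive" are taken from the text.

```python
import random

# Variants of "expensive" as listed in the text; other values would be
# filled in from the MultiWOZ 2.1 value map (not reproduced here).
VALUE_VARIANTS = {
    "expensive": [
        "high end", "high class", "high scale", "high price", "high priced",
        "higher price", "fancy", "upscale", "nice", "expensively", "luxury",
    ],
}


def vary_value(value, rng, variation_rate=0.2):
    """Replace `value` by a random variant with probability `variation_rate`."""
    variants = VALUE_VARIANTS.get(value)
    if variants and rng.random() < variation_rate:
        return rng.choice(variants)
    return value  # unknown values and the remaining cases are left unchanged


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    action = ("Inform", "Restaurant", "Price_range", "expensive")
    varied = action[:3] + (vary_value(action[3], rng),)
    print(varied)
```

The varied value is then passed to the template-based generator in place of the canonical label, so the tracker has to cope with surface forms that it may not have seen as slot values.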

E System Latencies
In this section we provide the processing times per turn for our SetSUMBT model as well as for the systems used in this work. These processing times are averaged across the 7372 turns in the MultiWOZ 2.1 test set. This test is performed on a Google Cloud virtual machine with an Nvidia V100 16GB GPU, 8 n1-standard vCPUs and 30GB of memory. In Table 7 we compare the latencies of a single instance of SetSUMBT against a 10-model ensemble. In Table 8 we compare the latencies of the full dialogue system setups used in this work.
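A per-turn average of this kind can be measured with a simple timing loop. The sketch below is a generic illustration, not the script used for Tables 7 and 8: `process_turn` is a hypothetical callable standing in for one forward pass of the tracker or of the full system, and when timing GPU code the device would additionally need to be synchronised before each clock reading.

```python
import time


def mean_latency(process_turn, turns):
    """Average wall-clock time in seconds that `process_turn` takes per turn."""
    total = 0.0
    for turn in turns:
        start = time.perf_counter()
        process_turn(turn)  # hypothetical stand-in for the system under test
        total += time.perf_counter() - start
    return total / len(turns)


if __name__ == "__main__":
    # Toy stand-in for a model: upper-case each user utterance.
    turns = ["i need a cheap hotel", "in the east please"]
    print(f"{mean_latency(str.upper, turns) * 1000:.4f} ms per turn")
```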