Anti-Overestimation Dialogue Policy Learning for Task-Completion Dialogue System

A dialogue policy module is an essential part of task-completion dialogue systems. Recently, increasing interest has focused on reinforcement learning (RL)-based dialogue policy. Its favorable performance and wise action decisions rely on an accurate estimation of action values. The overestimation problem is a widely known issue of RL since its estimate of the maximum action value is larger than the ground truth, which results in an unstable learning process and suboptimal policy. This problem is detrimental to RL-based dialogue policy learning. To mitigate this problem, this paper proposes a dynamic partial average estimator (DPAV) of the ground truth maximum action value. DPAV calculates the partial average between the predicted maximum action value and minimum action value, where the weights are dynamically adaptive and problem-dependent. We incorporate DPAV into a deep Q-network as the dialogue policy and show that our method can achieve better or comparable results compared to top baselines on three dialogue datasets of different domains with a lower computational load. In addition, we also theoretically prove the convergence and derive the upper and lower bounds of the bias compared with those of other methods.


Introduction
Task-completion dialogue systems are commonly implemented in two schemes. One is end-to-end training, such as (Zhang et al., 2020a). The other is a pipeline framework, which typically consists of four modules that are independently trained, as shown in Figure 1a: natural language understanding (NLU), dialogue state tracker (DST), dialogue policy learning (DPL) and natural language generation (NLG). In this pipeline-style dialogue system, the conversation text from a user is first fed to the NLU module, where the user utterance is parsed into semantic slots for the DST. The DST manages the inputs of each dialogue turn together with the dialogue history. The DST then outputs the current dialogue state embedding to the DPL module, where a dialogue action is taken based on the current dialogue state and knowledge base data. The NLG module maps the selected dialogue action into natural language to converse with the user.
Reinforcement learning (RL) algorithms, specifically Q-learning (Watkins and Dayan, 1992) based algorithms, have become a mainstream method for training the dialogue policy module (Peng et al., 2018; Zhang et al., 2020b). At each step, the policy agent updates its action value estimate as the sum of the observed reward and the estimated maximal action value in the next state. However, this update rule suffers from an overestimation problem (Hasselt, 2010): the estimated maximal action value is mostly larger than the ground truth. The overestimation problem causes the dialogue policy module to have inaccurate action value estimates after training, which misleads the dialogue policy into choosing the wrong dialogue action (see the wrong dialogue action in Figure 2). Some prior studies have tried to address this problem in domains such as video game playing and multi-agent systems, but they either suffered from the underestimation problem (Hasselt, 2010; Lan et al., 2020) or required a heavy computational load, as in ensemble methods (Anschel et al., 2017; Lan et al., 2020; Lee et al., 2021).
In this work, we propose dynamic partial average (DPAV), a novel approach to mitigate the overestimation problem specifically for the task-completion dialogue policy. DPAV utilizes the partial average between the predicted maximal action value and the predicted minimal action value to estimate the ground truth maximum action value, where the weights are dynamically adaptive and problem-dependent. The rationale is that DPAV learns the optimal trade-off between the predicted maximal action value and the predicted minimal action value, so that the dialogue policy learning procedure is more reasonable and stable. Our system not only yields a better dialogue process (see Figure 2), but also has a much lower computational cost than ensemble models. Overall, our main contributions are as follows: (i) This is the first work to investigate and handle the overestimation problem of the reinforcement learning framework for task-completion dialogue systems. (ii) We propose a novel and effective approach, the dynamic partial average (DPAV), which alleviates the overestimation problem with a lower computational load. (iii) We theoretically prove convergence and derive the upper and lower bounds of our method to support its effectiveness.

Related Work
Dialogue Policy. The dialogue policy module makes a dialogue decision given the current state (Zhang et al., 2019). Early methods were rule-based. Since handcrafted rules are non-extensible and resource-consuming (Zhao et al., 2021), deep reinforcement learning (DRL) has become a mainstream method for training dialogue policies (Wu et al., 2019; Zhao et al., 2021). Task-completion dialogue policy learning is often regarded as an RL problem (Zhang et al., 2021).
Overestimation Bias. The value-based algorithm Q-learning, a common unit of the dialogue policy module, suffers from overestimation bias (Thrun and Schwartz, 1993; Hasselt, 2010). Prior studies addressed the problem in multiple ways, including (1) bias compensation with additive pseudo costs and (2) a variety of estimators. Bias-corrected Q-learning (Lee et al., 2013) subtracts a quantity from the target, but this method cannot address the bias from function approximation (Pentaliotis and Wiering, 2021), and bias compensation methods are known to be labor-intensive and time-consuming (Anwar and Patnaik, 2008; Lee and Powell, 2012). Double Q-learning (Hasselt, 2010) trades overestimation bias for underestimation bias using the double estimator. Since underestimation bias is not preferable either (Hasselt, 2010; Lan et al., 2020), Weighted Q-learning (D'Eramo et al., 2016; Zhang et al., 2017) proposes the weighted estimator for the maximal action value, based on a weighted average of the estimated action values. However, the weight computation is only practical in a tabular setting (D'Eramo et al., 2017). Our work differs from the foregoing in that it proposes a new estimator that generalizes to the deep Q-network setting.
Overestimation bias is more problematic in the deep Q-learning network (DQN) algorithm (Fan et al., 2020) due to the function approximation errors of DRL. Polishing the estimation tricks of a single model and using ensemble models are the two mainstream solutions. Double Q-learning was subsequently adapted to neural networks as Double DQN (Van Hasselt et al., 2016), and Duel DQN proposes a new action value estimation scheme (Wang et al., 2016). But these two methods still suffer from the bias of the double estimator and the maximum estimator, respectively. Another approach against overestimation bias is based on the idea of ensembling. Averaged DQN controls the estimation bias by taking the average over the action values of multiple target networks (Anschel et al., 2017). Later, (Lan et al., 2020) claims that an average operation never completely removes the overestimation bias, and proposes Maxmin DQN, which takes the minimum over the maximums of different ensemble units to estimate the maximum action value in a selective process. Then, (Kuznetsov et al., 2020) observes that Maxmin DQN also suffers from underestimation bias and that its bias control is coarse. Recently, the SUNRISE method uses the uncertainty estimates of the ensemble, but it only down-weights the biased estimation (Lee et al., 2021). In this work, the model uses a single value function instead of a combination of multiple value functions and tailors the predicted maximum and minimum of that value function to approximate the optimal action value. Our work does not move towards underestimation and avoids the computational complexity of ensemble models.

Problem Definition
Even though an unbiased estimator does not exist (D'Eramo et al., 2016), the maximum estimator (Watkins and Dayan, 1992) and double estimator (Hasselt, 2010;Van Hasselt et al., 2016) are the most representative among the relevant works.
Maximum estimator (ME). This method is used by deep Q-learning to approximate the ground truth maximum action value of the following state by maximizing over a set of action values Q(s_{t+1}, ·). It represents the target y^{DQN} for taking a possible action a under the state s_{t+1} as:

y^{DQN} = r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \theta^-),   (1)

where r_{t+1} is the reward, γ is the discount value for future rewards and θ⁻ denotes the parameters of the target network. As Smith and Winkler (2006) found, the estimate of ME is larger than the ground truth (i.e., the estimated maximum value of the following state, max_a Q(s_{t+1}, a; θ⁻), is overestimated), which results in the biased loss:

L(\theta) = \mathbb{E}_{(s_t, a_t, r_{t+1}, s_{t+1}) \sim m}\big[(y^{DQN} - Q(s_t, a_t; \theta))^2\big],   (2)

where m is the RL experience replay pool and θ denotes the parameters of the DQN model. Thus, Q(s_t, ·) will not be perfectly accurate after training.

Double estimator (DE). This method (Hasselt, 2010; Van Hasselt et al., 2016) is used by deep Q-learning to address the overestimation problem of ME in DQN. Double DQN (DDQN) has two estimators: one estimator decides the action index while the other evaluates the action value of the selected action. DDQN then uses the evaluated action value to estimate the ground truth maximum action value of state s_{t+1}:

y^{DDQN} = r_{t+1} + \gamma \, Q\big(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta); \theta^-\big).   (3)

However, DE suffers from the underestimation problem and does not guarantee a better estimate than ME (Lan et al., 2020).
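The opposite behaviors of ME and DE can be illustrated with a small simulation, independent of any dialogue system: all hypothetical actions share the same true value, so the ground-truth maximum is known to be zero. The distributions, sample sizes, and trial counts below are illustrative assumptions, not values from the paper.

```python
import random

random.seed(0)

def sample_means(true_means, n_samples):
    """Sample-mean estimate of each action's reward distribution."""
    return [sum(random.gauss(mu, 1.0) for _ in range(n_samples)) / n_samples
            for mu in true_means]

# Toy setting: every action has true value 0, so the true maximum is 0.
true_means = [0.0] * 8
trials = 2000

me_estimates, de_estimates = [], []
for _ in range(trials):
    # Maximum estimator: max over the sample means of a single sample set.
    means_a = sample_means(true_means, n_samples=10)
    me_estimates.append(max(means_a))
    # Double estimator: one set picks the argmax, an independent set evaluates it.
    means_b = sample_means(true_means, n_samples=10)
    best = max(range(len(means_a)), key=lambda i: means_a[i])
    de_estimates.append(means_b[best])

me_bias = sum(me_estimates) / trials  # clearly positive: overestimation
de_bias = sum(de_estimates) / trials  # near zero here; negative in general
print(f"ME bias: {me_bias:+.3f}, DE bias: {de_bias:+.3f}")
```

Note that in this symmetric toy case DE is roughly unbiased; when the true means differ, DE becomes negatively biased, as the surrounding text describes.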

Problem in Dialogue Policy
Q-learning is a common unit of RL-based dialogue policies, and the overestimation bias of ME propagates into the model action values Q(s_t, ·). In dialogue, Q(s_t, ·) represents the dialogue action values, i.e., the expected returns the dialogue system will receive after taking an action under the state s_t. Since Q(s_t, ·) is biased, the dialogue policy cannot issue accurate actions accordingly, which hurts dialogue performance.
Example. We use a dialogue turn to show the negative effects of the overestimation bias. In Figure 1b, the dialogue state tracker module outputs the state embedding; the dialogue policy processes it and, based on the biased action values, predicts the wrong dialogue action B instead of the correct action A.

Dynamic Partial Average
Q-learning suffers from overestimation bias because of the ME (Hasselt, 2010). To reduce the bias, in this work we propose the dynamic partial average (DPAV) estimator. DPAV utilizes the partial average between the predicted maximal action value and the minimal action value to estimate the ground truth maximal action value Q*(s_{t+1}) in the target of the Q-learning update. The DPAV estimator is defined as:

Q_{DPAV}(s_{t+1}) = \lambda_t \min_a Q(s_{t+1}, a) + (1 - \lambda_t) \max_a Q(s_{t+1}, a),   (4)

where λ_t is a float number in [0, 1] that is dynamic in time and problem-dependent, so that DPAV averages between the maximum and minimum of the action values. The weights assigned to the maximum and minimum are not the same, so it is a partial average. The DPAV estimator is deployed in Q-learning as DPAV Q-learning, yielding the action value function update:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t \big( r_{t+1} + \gamma Q_{DPAV}(s_{t+1}) - Q(s_t, a_t) \big),   (5)

where γ is the discount factor for the future action value and α_t is the step size. λ_t in Q_{DPAV}(s_{t+1}) decays at a predefined rate as training progresses:

\lambda_{t+1} = \lambda_t \cdot d,   (6)

where d is the decay rate, set to a fixed value during training. Thus 1 − λ_t gives more and more weight to the maximal action value over the course of training.
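As a concrete illustration, a minimal tabular sketch of Equations 4 to 6 might look as follows; the toy Q-table and all hyper-parameter values here are invented for the example.

```python
def dpav_target(Q, s_next, r, gamma, lam):
    """DPAV estimate of the target (Eq. 4): a partial average of the
    predicted max and min action values of the next state."""
    q_next = Q[s_next]
    q_dpav = lam * min(q_next) + (1.0 - lam) * max(q_next)
    return r + gamma * q_dpav

def dpav_q_update(Q, s, a, r, s_next, alpha, gamma, lam):
    """One DPAV Q-learning step (Eq. 5)."""
    td_target = dpav_target(Q, s_next, r, gamma, lam)
    Q[s][a] += alpha * (td_target - Q[s][a])

# Tiny 2-state, 2-action table; the entries are illustrative.
Q = {0: [0.0, 0.0], 1: [1.0, -1.0]}
lam, decay = 0.5, 0.9          # lambda_0 and decay rate d (Eq. 6)
dpav_q_update(Q, s=0, a=0, r=1.0, s_next=1, alpha=0.1, gamma=0.9, lam=lam)
lam *= decay                   # weight gradually shifts to the max
print(Q[0][0], lam)
```

With λ = 0.5 the partial average of Q(1, ·) = [1.0, −1.0] is exactly 0, so the single update moves Q(0, 0) from 0 towards the target 1.0 by one step of size α.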
To apply DPAV to the complex dialogue policy learning setting, this paper combines it with the deep Q-learning network (DQN) and proposes DPAV DQN. Its loss function is adapted from Equation 2 to:

L(\theta) = \mathbb{E}_{(s_t, a_t, r_{t+1}, s_{t+1}) \sim m}\Big[\big( r_{t+1} + \gamma (\lambda_t \min_a Q(s_{t+1}, a; \theta^-) + (1 - \lambda_t) \max_a Q(s_{t+1}, a; \theta^-)) - Q(s_t, a_t; \theta) \big)^2\Big].   (7)

The algorithm of the dynamic partial average deep Q-learning network is summarized in Algorithm 1. The intuition behind this approach is that the predicted maximal action value overestimates the ground truth, so DPAV uses the predicted minimal action value to shift the estimate towards the ground truth. Because the accuracy of the predicted action values improves during training, DPAV assigns progressively less weight to the predicted minimum to avoid shifting too far towards the small estimate. DPAV reduces the overestimation bias in the target of the training loss, so the loss is less biased. This improves the accuracy of the dialogue action values of the DPAV DQN dialogue policy, which accordingly issues more accurate dialogue actions and improves dialogue performance.
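The loss in Equation 7 can be sketched for a single minibatch as below; the "networks" are stand-in lookup tables rather than the paper's MLPs, and all numbers are illustrative.

```python
def dpav_dqn_loss(batch, q_net, q_target, gamma, lam):
    """MSE loss with the DPAV target (Eq. 7) over a minibatch of
    (s, a, r, s_next) transitions. q_net / q_target map a state to a
    list of action values (online and frozen target network)."""
    losses = []
    for s, a, r, s_next in batch:
        q_next = q_target(s_next)
        y = r + gamma * (lam * min(q_next) + (1.0 - lam) * max(q_next))
        losses.append((y - q_net(s)[a]) ** 2)
    return sum(losses) / len(losses)

# Stand-in "networks": fixed tables instead of MLPs, for illustration only.
q_vals = {"s0": [0.2, 0.5], "s1": [1.0, -0.5]}
loss = dpav_dqn_loss([("s0", 1, 1.0, "s1")],
                     q_net=q_vals.get, q_target=q_vals.get,
                     gamma=0.9, lam=0.4)
print(round(loss, 4))
```

In a real implementation the squared error would be back-propagated through the online network θ while the target network θ⁻ stays frozen between periodic synchronizations.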
Additionally, this method has a lower computational complexity than ensemble models. Even if the latter can trade time complexity for space complexity via parallel computing, they still have high computational complexity in general, as shown in Table 2. Moreover, this method achieves better or comparable performance, as shown in Figure 3. The upper and lower bounds of the DPAV DQN estimation bias are also reasonable compared with those of other methods. A detailed explanation is given in Section 4.3.

Algorithm 1: DPAV DQN
Initialize replay memory D to capacity N, action-value function Q with random weights, and decay rate d
for episode = 1, ..., M do
    Initialise state s_1
    for j = 1, ..., T do
        With probability ϵ select a random action a_j,
        otherwise select a_j = argmax_a Q(s_j, a; θ)
        Execute action a_j in the environment, observe reward r_{j+1} and enter state s_{j+1}
        Store transition (s_j, a_j, r_{j+1}, s_{j+1}) in D, and set s_j = s_{j+1}
        Sample a random minibatch of (s_t, a_t, r_{t+1}, s_{t+1}) from D
        Perform a gradient descent step on the DPAV DQN loss (Equation 7) with respect to θ
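An end-to-end sketch of the loop above on a deliberately tiny toy MDP might look like the following; a Q-table stands in for the neural network, and the MDP, rewards, and hyper-parameters are all invented for illustration.

```python
import random

random.seed(1)

def step(s, a):
    """Toy 2-state MDP: action 1 in state 0 reaches the rewarding state;
    leaving state 1 ends the episode."""
    if s == 0 and a == 1:
        return 1, 1.0, False
    return 0, 0.0, s == 1

Q = {s: [0.0, 0.0] for s in (0, 1)}
D = []                               # replay memory
gamma, alpha, eps = 0.9, 0.1, 0.1
lam, d = 0.5, 0.99                   # initial lambda_0 and decay rate

for episode in range(200):
    s, done = 0, False
    while not done:
        a = random.randrange(2) if random.random() < eps \
            else max(range(2), key=lambda i: Q[s][i])
        s_next, r, done = step(s, a)
        D.append((s, a, r, s_next, done))
        # minibatch update with the DPAV target (table update in place
        # of a gradient step)
        for s_t, a_t, r_t, s_t1, end in random.sample(D, min(8, len(D))):
            y = 0.0 if end else lam * min(Q[s_t1]) + (1 - lam) * max(Q[s_t1])
            Q[s_t][a_t] += alpha * (r_t + gamma * y - Q[s_t][a_t])
        s = s_next
    lam *= d                         # Eq. 6: weight shifts towards the max

greedy_action = max(range(2), key=lambda i: Q[0][i])
print(greedy_action, round(Q[0][1], 2))
```

After training, the greedy policy in state 0 should prefer the rewarding action 1.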

Convergence
In this subsection we show in Theorem 1 that, in the limit, DPAV Q-learning converges to the optimal policy. The proof of this result, which uses Lemma 1 (Singh et al., 2000), is in Appendix A.2.
Theorem 1. In a Markov decision process, the approximate action value function Q as updated by DPAV Q-learning in Equation 5 converges to the optimal action value function q* with probability one if an infinite number of experience tuples of the form (s_t, a_t, r_{t+1}, s_{t+1}) are given by a learning policy for each state-action pair and if the following conditions are satisfied:
1. The Markov decision process is finite (i.e., |S × A × R| < ∞, where S is the set of states, A the set of actions, and R the set of rewards).
2. γ ∈ [0, 1).
3. The step sizes satisfy α_t(s, a) ∈ [0, 1], Σ_t α_t(s, a) = ∞ and Σ_t α_t²(s, a) < ∞ with probability one, where α_t is the step size of a Q-learning update.

Upper and Lower Bound
As shown in (D'Eramo et al., 2016; Hasselt, 2010), in many problems one is interested in the maximum expected value of a set of random variables, μ* = max_i E{X_i}. Without knowledge of the functional form and parameters of the underlying distribution of each variable X_i, it is impossible to find μ* analytically. Given a set of a limited number of samples, S = {S_1, ..., S_M}, S_i corresponds to the subset of samples drawn from the unknown distribution of the random variable X_i. The maximum estimator (Watkins and Dayan, 1992) and the double estimator (Hasselt, 2010) are the most representative methods to estimate μ*. The ME estimate is:

\hat{\mu}^{ME}_* = \max_i \hat{\mu}_i(S),

where \hat{\mu}_i(S) denotes the sample mean of S_i.
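A quick sketch of this setup and the ME estimate follows; the true means, noise level, and sample sizes are invented for illustration.

```python
import random

random.seed(2)

# M random variables with different true means; the quantity of interest
# is mu_* = max_i E{X_i}. All values below are illustrative.
true_means = [0.0, 0.2, 0.5]          # so mu_* = 0.5
S = [[random.gauss(mu, 1.0) for _ in range(50)] for mu in true_means]

# Sample mean of each subset S_i.
sample_means = [sum(s_i) / len(s_i) for s_i in S]

# Maximum estimator: plug the sample means into the max.
mu_me = max(sample_means)
print(round(mu_me, 3))
```

Averaged over many draws of S, this estimate sits above μ* = 0.5, which is the positive bias discussed in the next subsection.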

Bias
We start by recalling the main results about the bias of the Maximum Estimator (ME) and the Double Estimator (DE) reported in (Van Hasselt, 2013). As for the direction of the bias, ME is positively biased, while DE is negatively biased, and the bias of ME admits an upper bound. For the bound of DE, (Van Hasselt, 2013) conjectures a lower bound in terms of M, the number of sample means, and σ_i, the variance of the i-th sample mean. (Note: Lemma 1, used in our convergence proof, was also used to prove the convergence of SARSA (Rummery and Niranjan, 1994) and Double Q-learning (Van Hasselt et al., 2016).) For the bias of the DPAV estimator, we have the following bounds.
Theorem 2. For any given set X of M random variables, the bias of the DPAV estimator has a smaller upper bound than that of ME and a larger lower bound than that of DE.
Explanation. ME uses the maximum of the sample means to estimate the ground truth maximal expected value (MEV), while DPAV takes the partial average over the maximum and minimum of the sample means. The minimum shifts the DPAV estimate towards the ground truth, so the upper bound of the DPAV estimator bias is smaller than that of ME. In the worst case, DE uses the minimum of the sample means to estimate the ground truth; the DPAV estimator mitigates this bias by incorporating the maximum into the partial average, shifting the estimate towards the ground truth. So its lower bound is larger than that of DE.

Variance
Since the MSE of an estimator is the sum of its squared bias and its variance, we should also consider its variance to evaluate its quality. Van Hasselt (2013) proved that the variances of both ME and DE are upper bounded by the sum of the variances of the sample means.

Theorem 3. The variance of the DPAV estimator is upper bounded by the maximal variance of the sample means:

Var\big(\hat{\mu}^{DPAV}_*\big) \le \max_i Var\big(\hat{\mu}_i\big).

Explanation. The DPAV estimator utilizes the partial average between the maximum and minimum of the sample means to estimate the ground truth. The weights assigned to the maximum and minimum lie in (0, 1) and sum to 1. By standard properties of the variance (Casella and Berger, 2021), the estimation variance is smaller than the larger of the variances of the maximum and the minimum of the sample means. Therefore, it is also smaller than the maximal variance over all sample means.
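This variance property can be checked empirically. The sketch below, with assumed standard deviations, sample size, and weight λ, compares the variance of the partial average against the largest sample-mean variance.

```python
import random
import statistics

random.seed(3)

def dpav_estimate(sample_sets, lam):
    """Partial average of the min and max of the sample means."""
    means = [sum(s) / len(s) for s in sample_sets]
    return lam * min(means) + (1.0 - lam) * max(means)

# Sample-mean variances differ across variables; sigmas are assumptions.
sigmas, n = [1.0, 2.0, 3.0], 20
trials = 3000
dpav_vals = []
mean_vals = [[] for _ in sigmas]

for _ in range(trials):
    sets = [[random.gauss(0.0, sg) for _ in range(n)] for sg in sigmas]
    dpav_vals.append(dpav_estimate(sets, lam=0.4))
    for i, s_i in enumerate(sets):
        mean_vals[i].append(sum(s_i) / len(s_i))

var_dpav = statistics.pvariance(dpav_vals)
max_var_mean = max(statistics.pvariance(v) for v in mean_vals)
print(var_dpav <= max_var_mean)
```

In this Gaussian setting the empirical variance of the DPAV estimate stays below the largest sample-mean variance (9/20 here), consistent with the bound in Theorem 3.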

Dataset and Evaluation Metrics
We evaluate the DPAV DQN method and the baselines on three public task-completion dialogue datasets: movie-ticket booking (Li et al., 2016, 2017), restaurant reservation and taxi ordering. The statistics of the datasets are given in Table 1 (see Appendix B.1 for details). The evaluation metrics are success rate and averaged reward. Success rate is the ratio of the number of tasks successfully completed by the dialogue system in evaluation to the total number of dialogues in the test set. Averaged reward is the average of the cumulative rewards obtained by the dialogue system over the dialogues of the test set.
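Both metrics reduce to simple averages over test dialogues; a sketch with hypothetical dialogue records, each summarized as a success flag and a cumulative reward:

```python
# Hypothetical test-dialogue records: (success_flag, cumulative_reward).
dialogues = [(True, 62.0), (False, -45.0), (True, 58.0), (True, 66.0)]

# Success rate: fraction of dialogues completed successfully.
success_rate = sum(ok for ok, _ in dialogues) / len(dialogues)

# Averaged reward: mean cumulative reward per test dialogue.
averaged_reward = sum(r for _, r in dialogues) / len(dialogues)

print(success_rate, averaged_reward)
```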

Baselines
To benchmark our method, we use different DQN variants as the dialogue policy module for comparison: (1) DQN policy is learned with the standard DQN algorithm (Mnih et al., 2015). (2) Duel DQN policy is learned with the duel network structure (Wang et al., 2016). (3) Double DQN policy is trained with the double estimator of Q-learning (Van Hasselt et al., 2016). (4) Averaged DQN policy is trained by averaging over the action values of multiple target networks (Anschel et al., 2017). (5) Maxmin DQN policy uses the minimum over the maximums of different ensemble units to estimate the ground truth maximal action value in a selective process (Lan et al., 2020). (6) SUNRISE policy is trained with weighted Bellman backups from multiple networks (Lee et al., 2021). Our model DPAV uses a single value function, instead of a combination of multiple value functions, to tailor the maximum and the minimum action value.
We conduct two λ searching schemes: neural network (NN) searching and heuristic searching. We also analyze the influence of different initial values λ_0 in heuristic searching. So we have the following models in the experiment: (1) LambdaX is the heuristic searching version of DPAV DQN. The floating number X is the initial value λ_0, with range (0, 1), and LambdaX (e.g., Lambda0.5, Lambda0.6) tries different values X for the initial λ_0 in heuristic searching. (2) LambdaNet is the neural network searching version of DPAV DQN. It trains an NN to find a value of λ_t for each dialogue state s_k during training. Here, λ_t denotes the value of λ in training episode t, and s_k represents the dialogue state k sampled from the experience replay buffer of reinforcement learning.

Implementation Details
This work is implemented with the PyTorch toolkit. Compared with the standard DQN algorithm, we replace the loss with the one defined by DPAV DQN in Algorithm 1. For these RL-based dialogue policies, the action value network Q(·) is an MLP with one hidden layer of 80 nodes and ReLU activation. A greedy policy is used in evaluation. All neural networks warm start for 120 episodes using the same rule-based policy before training and are trained with the same hyper-parameters. We follow the default hyper-parameters of the user simulator setting. The discount factor γ for future reward is 0.9, the batch size is 16, and the learning rate is 0.001. The test set size is 100 in the movie domain and 500 in the other domains. All baselines are based on DQN for a fair comparison. We set L = 40 as the maximum number of dialogue turns in all domains. The heuristically searched decay rate d and decay interval of the DPAV estimator are set to (0.75, 15 train iterations) in the movie domain and (0.9965, 30 train iterations) in the other domains. For the specific parameters of each model and the user simulator, we refer to Appendix B.2.
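The decay rate and decay interval can be read as a step schedule for λ_t. A sketch using the movie-domain values above (this reproduces the shape of the schedule, not the exact training code):

```python
def lambda_schedule(lam0, d, decay_interval, train_iters):
    """Heuristic lambda schedule: lambda is multiplied by the decay rate d
    once every `decay_interval` training iterations."""
    lam, schedule = lam0, []
    for it in range(train_iters):
        schedule.append(lam)
        if (it + 1) % decay_interval == 0:
            lam *= d
    return schedule

# Movie-domain setting from above: d = 0.75, decayed every 15 iterations.
# lam0 = 0.5 is one of the heuristic initial values from the experiments.
sched = lambda_schedule(lam0=0.5, d=0.75, decay_interval=15, train_iters=60)
print(sched[0], round(sched[-1], 4))
```

After three decays the weight on the minimum has dropped to 0.5 · 0.75³ ≈ 0.211, so most of the target weight has shifted to the predicted maximum.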

Main Results
The main simulation results are reported in Figure 3, where we evaluate each dialogue policy in terms of success rate and averaged reward. The top two rows of Figure 3 show that DPAV DQN consistently outperforms DQN. The overestimation error in the target Q values propagates into the DQN Q values, while DPAV DQN reduces the overestimation error so that its Q values are less biased. It therefore selects dialogue actions more accurately based on the Q values and achieves a better success rate and averaged reward.
Our DPAV DQN method performs better than the baselines in terms of general performance. Since training starts with the experience pool initialized by the same rule-based dialogue policy, the models' performance in the very first few episodes is similar. After that, performance improves for all models, but much more rapidly for DPAV DQN, which finally converges to a higher success rate and averaged reward. As claimed above, the DPAV estimator reduces the overestimation error propagated into the model Q values and results in better action value estimates. Ensemble model performance relies on the number of networks. With a limited number of networks, as mentioned in the Related Work, Averaged DQN still suffers from overestimation bias, Maxmin DQN has estimation bias from its coarse estimator, and SUNRISE only down-weights the biased estimation. Among the non-ensemble models, Duel DQN suffers from overestimation with the maximum estimator, while Double DQN has underestimation bias (Anschel et al., 2017). These drawbacks propagate the biased loss into the policy models' Q values and hurt the accuracy of the policy models, so their performance (i.e., success rate and averaged reward) cannot improve further after reaching a certain level. The training efficiency and performance of DPAV DQN in this comparison validate the effectiveness of our model. However, in the taxi domain, Duel DQN outperforms the other dialogue policies, and DPAV DQN only slightly improves the results over DQN, though it converges faster. Sometimes there is no explicitly preferable action for a state, so the action values of the state are similar (Thrun and Schwartz, 1993), and the DPAV estimator cannot notably reduce the estimation bias by averaging between the maximum and minimum action values. Still, the DPAV estimator estimates better than the other baselines (except Duel DQN), as shown in the results.
Duel DQN uses the duel network structure to estimate action values (Wang et al., 2016), which helps it recognize the correct action when confronted with confusing states.

Influence of Parameter λ
Intuitively, the optimal λ should seek the best trade-off between the estimated maximum and minimum so that the dialogue policy is trained properly. This is a non-trivial optimization problem because the distribution of action values Q(s_t, ·) at state s_t is constantly updated, and the optimal λ for s_t should be adjusted accordingly. The third row of Figure 3 shows that with neural network (NN) searching, almost every dialogue policy evaluation has zero success rate and fails to converge. Since the distribution and the optimal λ keep changing for the same state s_t, the fixed λ found by the neural network does not work. This validates that the λ for s_t is dynamic and that a fixed λ leads to poor performance.
Calculating the exact ground truth maximal action value is difficult, so existing works use estimators to approximate it (Lan et al., 2020; Lee et al., 2021; Anschel et al., 2017). DPAV DQN uses the DPAV estimator for this approximation. In the heuristic searching version of DPAV DQN, the initial value λ_0 of the DPAV estimator is very important. λ_0 is the initial weight of the minimum; because we eventually give full confidence to our model, the weight of the maximum approaches 1, so the lower bound for λ_0 is 0. Since the model reduces the overestimation of the maximum by shifting the estimate towards the minimum action value, there is a trade-off, and the upper bound for λ_0 is 1. λ_0 is problem-dependent and should be set in the range (0, 1).
As shown in the third row of Figure 3, across the three dialogue datasets (movie, restaurant and taxi), we empirically find that an initial value λ_0 around 0.5 yields good performance, while other heuristic values degrade the dialogue policy performance. Shifting the estimate towards the minimum action value either too much or too little causes estimation bias with respect to the ground truth maximal action value. This validates that λ_0 is problem-dependent and that λ_t should decay to proper values to balance the maximum and the minimum along the training.

Computational Complexity Comparison
All baselines and DPAV DQN use various estimators or estimation tricks to approximate the ground truth maximal action value Q*(s_{t+1}). Given the state s_{t+1} as input, the forward propagation of every model unit has a similar time complexity, up to minor differences (e.g., additions). To facilitate the comparison, we denote the time complexity of the forward propagation of one model unit as O(N), where N is the dimension of the input vector. In this comparison, we suppose each ensemble model has K model units.
Combining the results of Figure 3 and Table 2, DPAV DQN achieves better or comparable performance with a lower time complexity. Although the time complexity of ensemble models can be reduced by parallel computing, this increases the space complexity, so the overall computational cost remains high and resource-consuming.

Results on Maximum Action Value
In reinforcement learning, the action value Q is the expectation of the return R_t, the sum of the discounted rewards:

R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}.

Figure 4 shows learning curves of the averaged maximum action value for the starting state on the test set; in the dialogue context, this value means how much return the dialogue policy assumes it could maximally receive from the starting state.
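The return R_t can be computed directly from a reward sequence. The rewards below mimic the movie-domain shaping described in Appendix B.2 (−1 per turn, large terminal reward) but are otherwise hypothetical.

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1}: the quantity whose expectation
    the action value Q(s, a) estimates."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Per-turn rewards of a hypothetical successful dialogue:
# -1 per turn, then a large terminal success reward.
rewards = [-1, -1, -1, 80]
print(discounted_return(rewards, gamma=0.9))
```

A policy that believes it can finish this dialogue successfully should therefore assign the starting state a large positive maximum action value, while a policy that mostly fails should assign a small or negative one.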
In the first few training epochs, we notice that the averaged maximum action value of DPAV DQN is negative, which is consistent with the averaged reward of its evaluation shown in Figure 3d: at the early training stage, the policy quality is too low to finish most of the dialogues, so the averaged reward is low, and the averaged maximum action value should also be low if the model's Q values are accurate. The values of the other models, however, are inconsistent with and larger than the real averaged reward: the estimation bias of the loss leaves these models with inaccurate Q values, so their maximum action values are larger than the ground truth.
Policy training based on these inaccurate Q values is negatively affected. Using only the maximum estimator (ME) causes overestimation bias and can even lead to worse policy quality, as can be observed from the curves of DQN and Duel DQN in Figure 4. Averaged DQN and Maxmin DQN use ME in their individual units, so the bias leads their Q functions to converge to inaccurate values, which prevents the averaged maximum action values from improving further. SUNRISE down-weights the biased estimation and is trained such that its dialogue policy receives more reward during evaluation (Figure 3d). As shown in Figure 4, the averaged maximal action value of DPAV DQN remains the highest across the three datasets because its model is trained with a less biased loss and receives more return from successful dialogues during evaluation. This also coincides with the averaged reward from the test dialogues in Figure 3d, and empirically validates that DPAV is a better estimator than the others because of its lower estimation bias.

Conclusion
This paper is the first to investigate the negative effects of the overestimation problem in task-completion dialogue systems. We propose the DPAV estimator to mitigate this problem in Q-learning. We also theoretically prove convergence and derive the upper and lower bounds of the estimation bias, compared with those of other methods. The resulting DPAV DQN model is empirically evaluated on three dialogue datasets and achieves better or comparable results with a lower computational load compared to state-of-the-art baselines.

References

Yichi Zhang, Zhijian Ou, Min Hu, and Junlan Feng. 2020a. A probabilistic end-to-end task-oriented dialog model with latent belief states towards semi-supervised learning. In Proceedings of the 2020

A Appendix
A.1 Lemma

Lemma 1 (Hasselt, 2010). Let (β_t, Δ_t, F_t) be a stochastic process, where β_t, Δ_t, F_t : X → R satisfy

Δ_{t+1}(x_t) = (1 − β_t(x_t)) Δ_t(x_t) + β_t(x_t) F_t(x_t),

where x_t ∈ X and t = 0, 1, 2, .... Let P_t be a sequence of increasing σ-fields such that β_0 and Δ_0 are P_0-measurable and β_t, Δ_t and F_{t−1} are P_t-measurable, with t ≥ 1. Assume that the following conditions are satisfied:
1. The set X is finite.
2. β_t(x_t) ∈ [0, 1], Σ_t β_t(x_t) = ∞ and Σ_t β_t²(x_t) < ∞ with probability one, and β_t(x) = 0 for x ≠ x_t.
3. ∥E{F_t | P_t}∥ ≤ κ ∥Δ_t∥ + c_t, where κ ∈ [0, 1) and c_t converges to zero with probability one.
4. V{F_t(x_t) | P_t} ≤ K (1 + ∥Δ_t∥)² for some constant K,
where V{·} denotes the variance and ∥·∥ denotes the maximum norm. Then Δ_t converges to zero with probability one.

A.2 DPAV Q-learning Convergence Proof
Proof. We apply Lemma 1 with X = S × A, Δ_t = Q_t(s_t, a_t) − q*(s_t, a_t), β_t = α_t (β_t is also the step size), P_t = {Q_0, s_0, a_0, α_0, r_1, s_1, ..., s_t, a_t} and

F_t(s_t, a_t) = r_{t+1} + γ [λ_t Q_t(s_{t+1}, a_min) + (1 − λ_t) Q_t(s_{t+1}, a_max)] − q*(s_t, a_t),

where a_max = argmax_{a'} Q_t(s_{t+1}, a') and a_min = argmin_{a''} Q_t(s_{t+1}, a''). The first condition of Lemma 1 is satisfied because |S × A| < ∞. The second condition of Lemma 1 is met by the third condition of Theorem 1. Because the absolute value of the reward is bounded, |r| < ∞, for all t the fourth condition of Lemma 1 is satisfied. It remains to show that the third condition of Lemma 1, on the expected contraction of F_t, holds. We can write

F_t(s_t, a_t) = F_t^Q(s_t, a_t) + γ λ_t [Q_t(s_{t+1}, a_min) − Q_t(s_{t+1}, a_max)],

where F_t^Q(s_t, a_t) = r_{t+1} + γ max_a Q_t(s_{t+1}, a) − q*(s_t, a_t) is the value of F_t for standard Q-learning. Since ∥E{F_t^Q | P_t}∥ ≤ γ ∥Δ_t∥ (Hasselt, 2010), it follows that

∥E{F_t | P_t}∥ ≤ γ ∥Δ_t∥ + γ λ_t |Q_sub|, where Q_sub = Q_t(s_{t+1}, a_min) − Q_t(s_{t+1}, a_max).

Since in DPAV Q-learning λ_t decays as λ_{t+1} = λ_t · d, for t → ∞, given ε > 0, ∃ t_0 : ∀ t ≥ t_0 ⟹ λ_t < ε, i.e., lim_{t→∞} λ_t = 0. Therefore it suffices to note that c_t = γ λ_t Q_sub → 0 w.p.1, since Q_t is bounded. As all the conditions of Lemma 1 are satisfied, it holds that ∀s, a : Q_t(s, a) → q*(s, a) w.p.1.

B Appendix
B.1 Dataset details

Table 1 lists the number of intents, slots and user goals in the three datasets used in the evaluation, and Table 3 shows all annotated dialogue acts and slots in detail. Task-oriented dialogue systems are designed to help users complete a specific goal G. Even though the dialogue system knows nothing about the user goal explicitly, the whole dialogue progresses around this user goal G implicitly. As an example, consider a user goal from the movie domain in which a user asks the dialogue system about the theater and start time of today's showing of Enter the Dragon by Bruce Lee. The user goals are generated from the annotated datasets mentioned in Table 3. The user goals extracted from the same dataset are then aggregated into a user goal set for that task. When running a dialogue, the user simulator (Li et al., 2016) randomly samples a user goal from the user goal set to converse with the dialogue system. Helping the user achieve this goal is the task the dialogue system must complete. In this paper, we use the success rate and averaged reward as our main evaluation criteria. We do not use averaged turns because overestimation bias mainly prevents the dialogue system from completing a task in a dialogue; this shows up directly in success rate and averaged reward and is not directly related to averaged turns. A user goal is viewed as successful if and only if the dialogue system recognizes all constraints provided by the user, informs all information the user wants, and finally books the desired tickets successfully; the dialogue policy then receives a positive reward for success. The averaged reward is the averaged cumulative discounted reward received by the dialogue system per dialogue.

B.2 Implementation details
The size of the experience replay pool is set to 8000 in the movie domain and 10000 in the other domains. The number of target networks in Averaged DQN, Maxmin DQN and SUNRISE is set to 4. The temperature parameter of SUNRISE is set to 2. The target network update period for Averaged DQN is set to 4. In the experiments, we use a user simulator to interact with the dialogue systems. In the movie domain, the dialogue system receives a reward of 2L if the dialogue finishes successfully and −L if it fails; in addition, a fixed reward of −1 is given to the dialogue system for each dialogue turn. In the restaurant and taxi domains, the dialogue system receives a reward of 2L if the dialogue finishes successfully and 0 if it fails, with a fixed per-turn reward of 0. Under this setup, the dialogue datasets for the experiments exhibit variety (Li et al., 2016).

Table 3: The details of the datasets (Li et al., 2016).
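The reward scheme above can be sketched as a small function. Whether the terminal reward replaces or adds to the last per-turn reward is an implementation detail not specified here, so this sketch simply returns one or the other; `turn_reward` and its signature are invented for illustration, with L = 40 as in the experiments.

```python
def turn_reward(domain, success=None, max_turns=40):
    """Per-turn and terminal rewards as described above (L = max_turns).
    success=None means the dialogue is still ongoing (per-turn reward);
    otherwise the terminal reward for success or failure is returned."""
    L = max_turns
    if domain == "movie":
        if success is None:
            return -1                   # fixed per-turn reward
        return 2 * L if success else -L
    else:                               # restaurant and taxi domains
        if success is None:
            return 0                    # no per-turn penalty
        return 2 * L if success else 0

print(turn_reward("movie", success=True), turn_reward("movie", None))
```

The large positive terminal reward (2L = 80) dominates the small per-turn penalties, which is what makes the discounted return of a successful dialogue clearly positive.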