Efficient Dialogue Complementary Policy Learning via Deep Q-network Policy and Episodic Memory Policy

Deep reinforcement learning has shown great potential for training dialogue policies. However, its favorable performance comes at the cost of many rounds of interaction. Most existing dialogue policy methods rely on a single learning system, while the human brain has two specialized learning and memory systems that support finding good solutions without requiring copious examples. Inspired by the human brain, this paper proposes a novel complementary policy learning (CPL) framework, which exploits the complementary advantages of the episodic memory (EM) policy and the deep Q-network (DQN) policy to achieve fast and effective dialogue policy learning. To coordinate the two policies, we propose a confidence controller that controls the complementary timing according to their relative efficacy at different stages. Furthermore, memory connectivity and time pruning are proposed to guarantee the flexible and adaptive generalization of the EM policy in dialogue tasks. Experimental results on three dialogue datasets show that our method significantly outperforms existing methods that rely on a single learning system.


Introduction
Dialogue policy, one of the most critical modules of task-oriented dialogue systems, aims to determine system responses based on current states (Zhang et al., 2019a). One of the earliest approaches is the rule-based policy (Litman and Allen, 1987; Bos et al., 2003). Although this approach often has acceptable performance, handcrafted rules are expensive and non-extensible. Recently, deep reinforcement learning (RL) has become a mainstream method for training dialogue policies (Cuayáhuitl, 2016; Peng et al., 2017, 2018). Since deep RL-based methods learn in an online fashion, a large amount of interaction with real users is required, which is generally infeasible in practical applications (Fatemi et al., 2016; Dhingra et al., 2017; Su et al., 2018; Wu et al., 2019).
Intuitively, the reward-based learning mechanism in deep RL (DRL) coincides with the dopamine-centered regulation in the human brain (O'Reilly et al., 2014). Human brains have two differentially specialized learning and memory systems that collaborate, allowing them to find good solutions without requiring copious examples (McClelland et al., 1995; Norman and O'Reilly, 2003; O'Reilly et al., 2014). However, most DRL-based dialogue policies (Chen et al., 2017b; Peng et al., 2018; Lipton et al., 2018; Wu et al., 2019) rely on a single learning system, which neglects the human brain's memory structure. Consequently, we imitate the human brain and construct an efficient complementary policy learning (CPL) model in which the learning and memory systems cooperate.
Inspired by cognitive neuroscience studies (Sutherland and Rudy, 1989; Daw et al., 2005; Poldrack et al., 2001), which provide evidence that episodic memory (EM) plays a vital role in decision tasks, several works incorporate EM into RL to accelerate learning (Blundell et al., 2016; Young et al., 2018; Pritzel et al., 2017; Lin et al., 2018). Despite the effectiveness of these methods on video game tasks, little research has validated the practical usage of EM in dialogue tasks.
In this paper, we investigate the roles of the EM policy and the DQN policy (a classic representative of DRL-based dialogue policies) in dialogue policy tasks. We observe that the EM policy is similar to the human brain's memory system: it learns efficiently from little data and bridges the interdependency between actions and results from past experience, but it is of limited use in novel situations since it generalizes poorly. The DQN policy is analogous to the human brain's learning system: it effectively extracts and generalizes potential information from a large amount of experience to drive decisions and calibrate the strategies stored in the EM, but its good generalization comes at the cost of learning inefficiency and the demand for massive data. The two policies therefore complement each other. Nevertheless, directly combining the DQN policy with a vanilla EM policy cannot consistently maintain effectiveness in the dialogue policy task. Thus we have the following considerations: (1) A meta-controller should coordinate between the two policies, since over-reliance on the DQN policy may not achieve good performance quickly, while over-reliance on the EM policy generalizes poorly to new situations.
(2) To ensure that the EM policy remains consistently effective in dialogue tasks, a mechanism for generalizing to new situations is needed: the same situation may never be encountered twice in dialogue tasks, and it is impossible to record all situations.
For question (1), we propose a confidence controller that forms a seamless hybridization of the two policies by controlling the complementary timing according to their relative efficacy at different stages. Once CPL is enabled, the EM policy provides diversified guidance for the DQN policy: an extra memory objective (EMO), an example memory action (EMA), and an extra intrinsic reward (EIR). For question (2), we define memory connectivity, which gives the EM policy flexibility and generalization by associating familiar past memories, and time pruning, which prunes outdated memories.
In summary, our main contributions are two-fold: (1) We present a novel CPL framework, which requires neither collected demonstrations nor any experts. Instead, it exploits the complementary strengths of the EM policy and the DQN policy through the confidence controller. To the best of our knowledge, this is the first work to learn a dialogue policy that integrates the learning and memory systems seamlessly and avoids being stuck on a single system. (2) We experimentally demonstrate the effectiveness of our framework and show that EM can be a crucial building block of effective dialogue policy learning. Our model is a first step in that direction, as far as we know.

Related Work
Research on the learning efficiency of dialogue policies is not new. Lipton et al. (2018) showed that pre-filling the replay buffer with a few successful dialogue experiences at the beginning can accelerate learning. Prioritized experience replay improves sample efficiency by increasing the replay probability of experiences with higher temporal-difference errors (Schaul et al., 2016). Peng et al. (2018) proposed a world model to simulate users and integrated planning into policy learning. Much progress has also been made by combining supervised learning (SL) with RL (Henderson et al., 2008). Su et al. (2016) and Williams et al. (2017) proposed using SL to initialize the policy network and then fine-tuning it within the RL process. Chen et al. (2017a,b) and Zhao et al. (2021) incorporated a teacher to guide policy learning; nevertheless, these methods require extra effort to hire or design teacher models. Another line of work proposed efficient policy learning from demonstrations; however, such methods require collecting human demonstrations, and their performance depends on the quality of those demonstrations. In parallel, another solution is to increase the density of meaningful rewards (Lu et al., 2019; Zhao et al., 2020).
Episodic memory has been used outside of dialogue research to improve data efficiency (Lengyel and Dayan, 2007). Blundell et al. (2016) proposed table-based model-free episodic control to learn from past good experiences in a one-shot fashion. Pritzel et al. (2017) proposed neural episodic control, which uses differentiable neural dictionaries to store and look up beneficial memories for decision-making. However, these table-based methods lack good generalization capabilities. Young et al. (2018) proposed an EM module integrated into an RL agent, but its computing time increases with the history length. Building on this, Lin et al. (2018) proposed episodic memory deep Q-networks for high-dimensional video game domains. However, these studies focus on video games; how to use EM effectively in the dialogue domain, and whether it is feasible at all, remains largely unexplored.

Proposed Framework
The CPL framework is described in Figure 1 and mainly includes three modules: (1) The episodic memory policy quickly latches onto familiar experiences from the past to provide auxiliaries for the DQN policy. It includes two operations: writing effectively retains memories while minimizing the retention of obsolete ones, and lookup, with memory connectivity and time pruning, selectively associates relevant memories while casting aside irrelevant or obsolete ones. (2) The DQN policy effectively extracts and generalizes potential information from a large amount of experience to drive decisions and calibrate the strategies stored in the EM. (3) The confidence controller chooses an appropriate time to perform complementary policy learning according to the two policies' relative efficacy at different stages.

Episodic Memory Policy
Episodic memory policy is a memory system based on past experience. It can quickly record and replay the empirical decisions of the dialogue agent and contains two operations, as depicted in Figure 2.

Writing: we adopt an architecture similar to previous EM work (Pritzel et al., 2017) to record past experience. For each action a ∈ A, the EM policy has a separate memory indexed by states and actions, M_a = (H_a, T_a, Q_a). After an episode ends, the EM policy writes each tuple (h, t, Q(s, a)) into the corresponding M_a through a backward replay process according to

Q(h, a) ← R_t, if (h, a) ∉ M_a;   Q(h, a) + α (R_t − Q(h, a)), otherwise,   (1)

where h is the representation of state s, t is the update time, R_t is the discounted return observed from turn t, and α is a learning rate. In theory, each M_a in a vanilla EM grows without bound and would consume a large amount of memory. Therefore, we add the update times T_a and overwrite the least recently updated entry, which minimizes the retention of obsolete memories and limits the size of the memory for each action. This is in line with the observation that the human brain is more likely to forget older memories (Hardt et al., 2013).
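As a concrete illustration, the sketch below implements one plausible reading of the write operation: a per-action memory with fixed capacity, least-recently-updated overwriting, and a backward-replay update with learning rate α. The names (ActionMemory, backward_replay) and the exact-match key test are our assumptions, not the paper's implementation.

```python
import numpy as np

class ActionMemory:
    """Per-action episodic memory M_a = (H_a, T_a, Q_a) with LRU overwrite."""

    def __init__(self, capacity=5000, alpha=0.1):
        self.capacity = capacity
        self.alpha = alpha            # learning rate alpha in Eq. (1)
        self.keys = []                # H_a: state representations h
        self.update_times = []        # T_a: last update time of each entry
        self.q_values = []            # Q_a: stored returns

    def write(self, h, t, q_return):
        # If the key already exists, move its stored value toward the new return.
        for i, k in enumerate(self.keys):
            if np.allclose(k, h):
                self.q_values[i] += self.alpha * (q_return - self.q_values[i])
                self.update_times[i] = t
                return
        if len(self.keys) >= self.capacity:
            # Overwrite the least-recently-updated entry, mimicking the human
            # tendency to forget older memories.
            i = int(np.argmin(self.update_times))
            self.keys[i], self.update_times[i], self.q_values[i] = h, t, q_return
        else:
            self.keys.append(h)
            self.update_times.append(t)
            self.q_values.append(q_return)

def backward_replay(memories, episode, gamma=0.9):
    """After an episode, write discounted returns into each M_a in reverse order.

    `episode` is a list of (h, a, r, t) tuples; `memories` maps action -> ActionMemory.
    """
    g = 0.0
    for h, a, r, t in reversed(episode):
        g = r + gamma * g
        memories[a].write(h, t, g)
```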
Lookup: the query key h is used to look up similar experiences from M_a. For large-scale dialogue tasks, novel states are common, so the lookup methods used in the video game domain are not directly applicable; generalizing familiar experiences to novel situations is essential in our tasks. Therefore, we define memory connectivity C(h ‖ h_i), a divergence between the two probability distributions h and h_i of the states, to look up the M most similar memories. The smaller C(h ‖ h_i), the stronger the connectivity of the memories. Consequently, we use it as the importance weight of the selected memories, W_a = 1 − C(h ‖ h_i). This is in line with the observation that familiar past experiences have profound implications for humans (Carbonell, 1983). To cast aside outdated non-optimal policies, we further propose time pruning for the corresponding entry, a monotonically decreasing function

w_T(t) = max(0, (T − (t − t_u)) / T),

where T is the maximum valid time (set to 15 in this paper), t is the current time, and t_u is the update time of the memory. Therefore, for each M_a, the corresponding Q_M^a obtained by the lookup operation is the connectivity-weighted, time-pruned combination of the M retrieved values,

Q_M^a = Σ_{i=1}^{M} W_{a,i} w_T(t_i) Q(h_i, a) / Σ_{i=1}^{M} W_{a,i} w_T(t_i).

The EM policy selects the action with the maximum Q_M^a as the memory action a_t^M for subsequent auxiliaries.
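Continuing the sketch, the lookup below uses a bounded Jensen-Shannon divergence as a stand-in for the connectivity measure C(h ‖ h_i) (the recovered text specifies only that it compares two state distributions and that W_a = 1 − C), plus a linear decay for time pruning. Both concrete forms, and the normalized weighted average, are our assumptions.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Bounded [0, 1] divergence between two state distributions
    (a stand-in for the paper's connectivity measure C(h || h_i))."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def time_pruning(t_now, t_update, T=15):
    """Monotonically decreasing weight; memories older than T are pruned."""
    return max(0.0, 1.0 - (t_now - t_update) / T)

def lookup(memory, h, t_now, M=5, T=15):
    """Connectivity-weighted, time-pruned Q estimate Q_M^a for query key h.

    `memory` is an ActionMemory from the previous sketch.
    """
    if not memory.keys:
        return None   # no memory for this action -> triggers the exploration reward
    div = [js_divergence(h, k) for k in memory.keys]
    idx = np.argsort(div)[:M]                      # the M most connected memories
    weight_sum, q_sum = 0.0, 0.0
    for i in idx:
        w = (1.0 - div[i]) * time_pruning(t_now, memory.update_times[i], T)
        weight_sum += w
        q_sum += w * memory.q_values[i]
    return q_sum / weight_sum if weight_sum > 0 else None
```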
Overall, the EM policy differs from the DQN policy in that it does not estimate the expected return; rather, it looks up the highest potential return for a given state based on previous memories.

DQN Policy
Task-oriented dialogue policy learning is typically formulated as an MDP problem. We employ the vanilla DQN (Mnih et al., 2015) to train the dialogue policy based on experience from interactions between the agent and users.
At each step, the agent uses ε-greedy exploration to select a DQN action based on the dialogue state s. Afterward, the agent obtains a reward r, observes the corresponding user response, and updates the dialogue state to the next state s′, until the end of the conversation. Finally, we store the experience (s, a, r, s′) in the experience replay buffer D. We optimize the parameters θ by minimizing the mean-squared loss; note that here we only consider the vanilla objective

L(θ) = E_{(s,a,r,s′)∼D} [ (r + γ max_{a′} Q_{θ′}(s′, a′) − Q_θ(s, a))² ],   (5)

where γ ∈ [0, 1] is a discount factor and Q_{θ′} is the target value function that is updated periodically. Q_θ is optimized through back-propagation and mini-batch deep Q-learning.
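A minimal PyTorch sketch of the loss in Eq. (5), assuming batched tensors and a periodically synchronized target network; the framework choice and tensor layout are ours, not the paper's.

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.9):
    """Mean-squared TD loss of Eq. (5).

    `batch` holds tensors (s, a, r, s_next, done) with shapes
    (B, d), (B,), (B,), (B, d), (B,).
    """
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target network Q_{theta'} is held fixed between periodic syncs.
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```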

Confidence Controller
We use the confidence controller to control the complementary timing by judging the confidence of the DQN policy. We use a Dropout Q-Network (Hinton et al., 2012; Srivastava et al., 2014; Chen et al., 2017b) to estimate the confidence of the DQN policy at the t-th turn, c_t^D (lines 7-12 in Algorithm 1). The DQN policy is considered confident when c_t^D exceeds the confidence threshold ξ and its Q_θ is greater than Q_M; otherwise, the EM policy is more confident. When the DQN policy has less confidence, CPL is enabled and the EM policy provides three guidances for the DQN policy:

a) Extra Memory Objective (EMO): the EM policy provides an extra memory objective L(M) that reconciles the loss function of the DQN policy. We propose a new objective function combining the two objectives,

L(θ) = E_{(s_t,a_t,r_t,s_{t+1})∼D} [ (y_t − Q_θ(s_t, a_t))² + λ (Q_M(s_t, a_t) − Q_θ(s_t, a_t))² ],   (6)

where y_t is the TD target of Eq. (5), Q_θ(s_t, a_t) is the same as Q_θ(s, a) in Eq. (5), and Q_M(s_t, a_t) is the Q-value looked up by the EM policy for the same action. We weigh the two policies by adjusting λ, making flexible use of both in the learning process.

b) Example Memory Action (EMA): the memory action a_t^M with the highest potential reward replaces the DQN action a_t^θ for responding.

c) Extra Intrinsic Reward (EIR): the extra intrinsic reward r_t^int is composed of exploitation rewards and exploration rewards that encourage the DQN policy to explore and exploit effectively. If the DQN action a_t^θ is the same as the memory action a_t^M with the highest potential reward, an exploitation reward is provided; if the DQN action a_t^θ does not appear in the corresponding M_a, an exploration reward is provided.
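The sketch below shows one plausible realization of the two pieces introduced here: a Monte-Carlo-dropout confidence estimate c_t^D (our reading of the Dropout Q-Network procedure, using the agreement ratio of N stochastic forward passes) and the combined objective of Eq. (6). Both are assumptions about details the paper leaves implicit.

```python
import torch

def dqn_confidence(q_net, s, n_samples=50):
    """Estimate c_t^D via MC dropout: the fraction of stochastic forward
    passes that agree with the deterministic greedy action (s has shape (1, d))."""
    q_net.eval()
    with torch.no_grad():
        greedy = int(q_net(s).argmax(dim=1).item())
    q_net.train()                       # keep dropout active for sampling
    with torch.no_grad():
        agree = sum(int(q_net(s).argmax(dim=1).item() == greedy)
                    for _ in range(n_samples))
    return agree / n_samples

def cpl_loss(q_net, target_net, batch, q_mem, lam=0.1, gamma=0.9):
    """Eq. (6): the DQN TD loss plus the extra memory objective lambda * L(M).

    `q_mem` is a (B,) tensor of Q_M(s_t, a_t) values looked up by the EM policy.
    """
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    td = (target - q_sa).pow(2)
    mem = (q_mem - q_sa).pow(2)
    return (td + lam * mem).mean()
```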
At each turn, the dialogue state s_t is transmitted to both the DQN policy and the EM policy. The DQN policy first generates a DQN action a_t^θ. Then the confidence controller judges whether the DQN policy has sufficient confidence. When it has less confidence, the EM policy provides auxiliaries for it: EMO, EMA, and EIR. After the episode ends, the memories of the EM policy are updated through a backward replay process. The full procedure of CPL is described in Algorithm 1.
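Putting the pieces together, here is a hedged sketch of a single CPL turn following our reading of Algorithm 1. The `em_policy.best_action` / `em_policy.q_lookup` interface and the `epsilon_greedy` helper are hypothetical, and the reward magnitudes follow Appendix B; `dqn_confidence` is reused from the previous sketch.

```python
import random
import torch

def epsilon_greedy(q_net, s, epsilon=0.2):
    """Hypothetical helper: epsilon-greedy action selection over Q-values."""
    with torch.no_grad():
        q = q_net(s)
    if random.random() < epsilon:
        return random.randrange(q.shape[1])
    return int(q.argmax(dim=1).item())

def cpl_turn(q_net, em_policy, s, t, xi=0.7):
    """One dialogue turn under CPL: returns (action to execute, intrinsic reward)."""
    a_dqn = epsilon_greedy(q_net, s)                 # DQN action a_t^theta
    a_mem, q_mem = em_policy.best_action(s, t)       # memory action a_t^M, max Q_M
    c = dqn_confidence(q_net, s)                     # c_t^D via dropout sampling
    with torch.no_grad():
        q_dqn = float(q_net(s)[0, a_dqn])
    if a_mem is None or (c > xi and q_dqn > q_mem):
        return a_dqn, 0.0                            # DQN policy is confident
    # CPL enabled: EMA replaces the DQN action, EIR shapes the reward.
    if a_dqn == a_mem:
        r_int = 5.0                                  # exploitation reward
    elif em_policy.q_lookup(s, a_dqn, t) is None:
        r_int = 5.0                                  # exploration reward (unseen action)
    else:
        r_int = 0.0
    return a_mem, r_int
```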

Performance Evaluation
We conduct extensive experiments on three public task-oriented dialogue datasets, in both simulation and human evaluation: movie-ticket booking, restaurant reservation, and taxi ordering.

Dataset
The movie-ticket booking task was collected from Amazon Mechanical Turk and manually annotated, and the other two tasks are provided by the Microsoft Dialogue Challenge. Each domain has its own domain-specific intents, slots, and labeled dialogues; the statistics are shown in Table 1. Readers can refer to Appendix A for details of the three domains.

Baselines
To benchmark the performance of our method, we developed several task-oriented dialogue agents as baselines for comparison: • DQN agents are learned with standard DQN using only direct reinforcement learning.
• DQN(K) agents are identical to the DQN agents but accumulate K times more training experiences.
• EPAC agents introduce a human teacher into the training process to teach dialogue policy learning by providing example actions and extra rewards (Chen et al., 2017a).
• S²Agent learns the dialogue policy from demonstrations through policy shaping and reward shaping.
To further analyze the effectiveness of each component of our method, we construct ablation tests: • CPL is our proposed approach, which learns the policy through complementary policy learning.
• CPL w/o EMP is a variant of CPL in which the DQN policy learns with only two guidances (without EMA).
• CPL w/o DQN is a variant of CPL that only uses the EM policy to make quick decisions (without the DQN action).
• CPL w/o W is our proposed method without importance weights.
• CPL w/o T is our proposed method without time pruning.

Implementation Details
For all RL-based agents, the value network Q(·) is an MLP with one hidden layer of 80 nodes and ReLU activations in all three domains. All NN models are warm-started for 100 epochs and trained with the same hyper-parameter settings. ε-greedy exploration is applied, starting from ε = 0.2 and decaying every episode with a decay rate of 0.95. We set the discount factor γ = 0.9. The size of the experience replay buffer is 5000 in the movie domain and 10000 in the other domains. The batch size is 16, and the learning rate is 0.001. We set K to 10 in the movie domain and 50 in the other domains. For a fair comparison, all baselines (except DQN(K)) are based on DQN rather than DQN(K).
For the hyperparameters of the EM policy, up to 5000 memories are stored per action. We perform a backward replay update for each action after the end of each episode. M = 5 unless otherwise indicated. The learning rate α in Eq. (1) is set to 0.1. We fix λ in Eq. (6) at 0.1. The confidence threshold ξ is set to 0.7. N is set to 50. The dropout rate is set to 0.25. Exploration in the EM policy uses ε-greedy with ε = 0.005. The maximal extra intrinsic reward r^int is 5. Appendix B gives detailed information about the user simulator.
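For reference, the settings above can be collected into a single configuration; the dictionary below merely restates them with illustrative key names.

```python
# Consolidated hyperparameters from the Implementation Details section;
# key names are illustrative, not the authors' code.
CPL_CONFIG = {
    "hidden_size": 80,
    "warm_start_epochs": 100,
    "epsilon_start": 0.2,
    "epsilon_decay": 0.95,
    "gamma": 0.9,
    "batch_size": 16,
    "learning_rate": 1e-3,
    "replay_size": {"movie": 5000, "restaurant": 10000, "taxi": 10000},
    "K": {"movie": 10, "restaurant": 50, "taxi": 50},
    "em_capacity_per_action": 5000,
    "M": 5,                       # number of looked-up memories
    "alpha": 0.1,                 # write learning rate in Eq. (1)
    "lambda": 0.1,                # EMO weight in Eq. (6)
    "xi": 0.7,                    # confidence threshold
    "N_dropout_samples": 50,
    "dropout_rate": 0.25,
    "em_epsilon": 0.005,
    "r_int_max": 5,
}
```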

Main Results
The main simulation results are shown in Table 2 and Figure 3. From the results, it is clear that through complementary policy learning, the CPL agents learn much faster and are consistently better than other strong methods in all domains. Figure 3 shows the learning curves of different agents in the three domains. The DQN(K) performs better than the DQN in all domains since it has K − 1 times more experiences than the DQN. With the same number of experiences, EPAC and S²Agent consistently outperform the DQN in all domains, and even with less experience they are still superior to the DQN(K) in the restaurant and taxi domains. However, their performance hardly exceeds the DQN(K) in the movie domain. The reason might be that in the simpler movie domain, where dialogues succeed more easily, simply increasing the number of experiences yields a more obvious efficiency improvement; by contrast, in the relatively complex domains, where successful dialogues are relatively rare, it is difficult to provide clear guidance for agents. The above observations are also confirmed in Table 2. With complementary learning, the CPL agent also alleviates the reward sparsity issue, which is especially obvious in the relatively complex domains. In the restaurant and taxi domains, the average rewards of all baselines are negative, while the CPL agent always learns meaningful positive actions, which are basically induced by the EIR. Moreover, one additional result is observed: although the CPL agents achieve the highest average success rates and rewards, their average turns are longer than those of the CPL w/o EMP agents. We argue that the EMA from the EM policy may be non-optimal, causing the CPL agents to complete user goals in a detour instead of in the most effective way, whereas the CPL w/o EMP agents explore a more efficient path through EMO and EIR.

Training with varying values of M
Intuitively, the value of M has a large impact on dialogue policy learning: M is the number of empirical decisions that the EM policy provides to the DQN policy for reference. Experiments with varying values of M were conducted in the three domains, with the moving average success rate calculated over 300 epochs. Figure 4 shows the moving average success rate of each agent during learning. The agent with a small M value still learns efficiently in the movie domain but performs worst in the other domains. In all domains, the agent with a large M value has inferior learning efficiency. This is because, as M increases, the dialogue agent at first benefits from related memories in many aspects and considers the current state more comprehensively; once M exceeds 9, however, irrelevant episodic memories are chosen simply to fill the quota, which hurts the efficiency and quality of dialogue policy learning. This experimental result also verifies our assumption.

Training with varying values of λ
Similarly, λ affects the performance of dialogue policies by controlling the use of the two policies (the EM policy and the DQN policy) during dialogue policy learning. Therefore, experiments with varying λ values were conducted in the three domains to serve as a reference for CPL practitioners. The moving average success rate of each agent over 300 epochs is shown in Figure 6. Whether the EM policy is excluded entirely or dominates entirely, the performance of the dialogue policy is seriously hurt; performance is best when the DQN policy dominates with the EM policy as an auxiliary.

Ablation Test
We conduct ablation experiments to analyze the effectiveness of each component of the CPL framework. As illustrated in Figure 5, although the average success rate of CPL w/o EMP in the early stage is lower than that of CPL, it finally achieves comparable performance in the three domains. CPL w/o DQN learns rapidly in the early stage, but its later learning is limited when making decisions in novel situations. It can be seen that the EM policy tends to predominate early in the CPL framework, while the DQN policy predominates later. Although both CPL w/o W and CPL w/o T learn faster in the early stage, their performance hardly improves in the later stage: referencing memories aggressively in the early stages is helpful regardless of their relevance and timeliness, but as the dialogue agent improves with training, irrelevant and outdated memories often hurt performance badly. The experiments verify that all four components benefit the CPL to a large extent.

Human Evaluation
To further verify the feasibility of our method in real dialogue scenarios, we recruited 33 real users to interact with different agents on the three tasks without knowing which agent was behind the interface. We collected 50 valid conversations for each agent in each domain. All evaluated agents were trained for 300 epochs. In each conversation, the user randomly selects an agent to communicate with, given a user goal sampled from the corpus. Users have the right to abandon the task and terminate the conversation if they believe the dialogue is unlikely to succeed. At the end of the conversation, in addition to providing feedback on whether the conversation was successful, users also evaluate the naturalness, coherence, and task-completion ability of the agent with a score from 1 to 5. As illustrated in Table 3, CPL and CPL w/o EMP significantly outperform the other agents, and CPL is considered slightly more dilatory than CPL w/o EMP, which is consistent with what we observed in the simulation evaluation.

Conclusion
In this paper, we propose a novel complementary policy learning (CPL) framework that realizes dialogue policy learning in a more effective and faster manner through direct use of the agent's own experience, without any extra cost. The framework exploits the complementary advantages of the EM policy and the DQN policy. Additionally, we propose a confidence controller to coordinate the two policies according to their relative efficacy at different stages. The proposed memory connectivity and time pruning further ensure the flexible and adaptive generalization of the EM policy in dialogue tasks.
The results show that CPL significantly outperforms the baselines in the three domains and that an episodic memory component is a crucial building block of effective dialogue policy learning. To the best of our knowledge, this is the first work to learn a dialogue policy that integrates the learning and memory systems seamlessly and avoids being stuck on a single system. In the future, we plan to extend our method to multi-domain tasks, e.g., MultiWOZ (Budzianowski et al., 2018), and to evaluate it on other dialogue platforms, e.g., PyDial and ConvLab (Lee et al., 2019).

A Appendices
Table 4 lists all annotated dialogue acts and slots in detail. These three datasets are not used to train the dialogue policy model directly but to extract user goals; the movie-ticket booking task is simpler than the other two. For each conversation, the user simulator randomly samples a user goal from the user goal set to interact with the agent. The goal of each agent is to help the user achieve this specific goal.
To verify the effectiveness of the proposed method, we provide both automatic and human evaluations on three criteria: success rate, average turns, and average reward. In the human evaluation, in addition to the above criteria, human users give a rating (1-5) at the end of each conversation according to the naturalness, coherence, and task-completion ability of the agent; specifically, the degree of task fulfillment (2 points), natural responses (1 point), timely and correct responses (1 point), and smoothly steered conversations (1 point). In this paper, we choose the success rate as our main evaluation criterion: a user goal is considered successful if and only if the agent identifies all constraints provided by the user, provides all information the user wants, and finally completes the booking.

B Appendices
The task-oriented dialogue system is designed to assist users in accomplishing a specific goal G. The entire conversation revolves around this user goal G implicitly, while the agent knows nothing about it explicitly.
To make the user goal G clearer, take the movie-ticket booking domain as an example: a user may ask about the theater and start time of today's showing of Enter the Dragon.
The user goals are generated from the annotated datasets mentioned in Section 4.1 and aggregated into a user goal set. Whenever dialogues are run, the user simulator randomly samples one user goal from this set. The intrinsic reward r^int includes exploitation rewards and exploration rewards to encourage the DQN policy to exploit and explore effectively: an exploitation reward of 5 is provided when the DQN action is the same as the memory action with the highest potential reward, and an exploration reward of 5 is provided if no memories correspond to the DQN action in the EM policy. These two rewards never appear at the same time. For the external reward function, in all domains the agent receives a reward of 2L if the dialogue finishes successfully and −L if it fails, where L is the maximum number of turns in each dialogue. A fixed penalty of −1 is given to the agent at each turn to encourage the policy to reach the goal more efficiently. We set L to 40 in all three domains.
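The external reward just described can be written directly as a small function (a sketch; the `done`/`success` flags are assumed interfaces):

```python
def external_reward(done, success, L=40):
    """External reward used in all three domains: 2L on success, -L on failure,
    and a fixed -1 per turn to encourage efficient dialogues."""
    if done:
        return 2 * L if success else -L
    return -1
```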