A Collaborative Multi-agent Reinforcement Learning Framework for Dialog Action Decomposition

Most reinforcement learning methods for dialog policy learning train a centralized agent that selects a predefined joint action concatenating domain name, intent type, and slot name. Because of its large action space, the centralized dialog agent requires a great many user-agent interactions to learn. Moreover, designing the concatenated actions is laborious for engineers and may fail on edge cases. To solve these problems, we model dialog policy learning with a novel multi-agent framework in which each part of the action is handled by a different agent. The framework reduces the labor cost of action templates and shrinks the action space of each agent. Furthermore, we relieve the non-stationarity caused by the changing dynamics of the environment as agents' policies evolve by introducing a joint optimization process that lets agents exchange policy information. Concurrently, an independent experience replay buffer mechanism is integrated to reduce the dependence between sample gradients and improve training efficiency. The effectiveness of the proposed framework is demonstrated in a multi-domain environment with both user-simulator evaluation and human evaluation.


Introduction
Dialog policy optimization is one of the most critical tasks in task-oriented dialog modeling. Recently, reinforcement learning (RL) based methods have shown great potential for dialog policy learning (Peng et al., 2017). However, most of these methods learn a centralized agent over a joint action space of predefined atomic actions (Budzianowski et al., 2018), each the concatenation of a domain name, intent type, and slot name, e.g. 'restaurant-inform-address', or of both atomic actions and the top-k most frequent atomic action combinations (Lee et al., 2019a). Such elaborate concatenated actions may achieve acceptable performance in simple cases, but they remain laborious for engineers and struggle with edge cases in multi-domain or complex scenes. Another drawback of the centralized agent is the exponential growth of its observation and action spaces with the number of domains (Lee et al., 2019b).
To alleviate the large number of user-agent interactions required by the large action space, a hierarchical reinforcement learning framework was proposed to learn a dialog policy that operates at different temporal scales (Peng et al., 2017). It achieved promising results but still faces several challenges. First, the setting requires a rule-based critic to provide the intrinsic reward for the low-level agent. Creating such a critic is not easy, especially in intricate scenarios, and the hand-crafted critic may inadvertently bias the converged optimum. Moreover, the low-level agent's action space, composed of intent and slot, can still be large, especially when there are many intent types and slot names. Drawing on the structural features of dialog actions, we address these problems with a collaborative multi-agent reinforcement learning framework in which the concatenated dialog action space is decomposed into subspaces corresponding to the domain, intent type, and slot name. Each subspace is assigned to a different agent, and the agents cooperate to form the final joint action without any human knowledge: each agent's output is concatenated to the input and passed on to the next agent. To relieve the non-stationarity (Claus and Boutilier, 1998; Hu and Wellman, 2003) caused by unexpected changes in the environment dynamics as the agents' policies evolve, and to reduce the gradient dependence due to non-independent data, we propose a new approach that allows Joint Optimization based on Independent Experience replay buffers for all agents, termed JOIE. Our experiments show that this multi-agent framework significantly reduces the state-action space size and makes exploration more efficient, and that JOIE achieves better performance thanks to the proposed optimization mechanism.
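As a rough illustration with the MultiWOZ statistics reported later in this paper (7 domains, 13 intent types, 28 slot names), the joint space of concatenated actions can contain up to the product of the subspace sizes, while after decomposition no single agent faces more than the largest subspace:

$$|\mathcal{A}_{\text{joint}}| \le |\mathcal{A}^d| \cdot |\mathcal{A}^i| \cdot |\mathcal{A}^s| = 7 \times 13 \times 28 = 2548, \qquad \max\big(|\mathcal{A}^d|, |\mathcal{A}^i|, |\mathcal{A}^s|\big) = 28.$$

Not every domain-intent-slot combination is a valid dialog act, so 2548 is only an upper bound, but the reduction in the per-agent exploration burden is clear.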
To the best of our knowledge, this is the first work to develop a multi-agent RL-based dialog action decomposition framework. Our main contributions are three-fold: • We formulate dialog policy learning in the mathematical framework of collaborative multi-agent reinforcement learning.
• We propose an efficient and effective multi-agent approach that factors the action space and learns each part with a different agent via joint optimization and independent experience replay.
• We validate the effectiveness of the proposed method in a multi-domain task with both user simulators and human users.

Related Work
Many studies have been dedicated to optimizing dialog policy with reinforcement learning, most of which learn a centralized agent that maps the observation to a joint action (Young et al., 2013; Su et al., 2016; Williams et al., 2017; Peng et al., 2018a,b; Lipton et al., 2018; Li et al., 2020a; Zhu et al., 2020; Li et al., 2020b; Wang et al., 2020). For more efficient exploration, Peng et al. (2017) factor the centralized spaces into a hierarchical reinforcement learning paradigm. Meanwhile, cooperative multi-agent reinforcement learning methods have moved from tabular methods to deep learning methods and are widely applied, especially to computer games (Sunehag et al., 2017; Rashid et al., 2018; Jhunjhunwala et al., 2020). Toward multi-agent task-oriented dialog policy, much progress has been made in modeling the interaction as a stochastic collaborative game, where the dialog agent and the user simulator are jointly optimized with their own objectives (Liu and Lane, 2017; Papangelis et al., 2019). Building a user simulator in this way is more flexible. However, different from existing frameworks, our multi-agent framework decomposes concatenated actions to reduce the large action space and improve the performance of dialog agents.

Figure 1: Illustration of the collaborative multi-agent framework for dialog policy learning.

Approach
Different from previous methods that learn a centralized agent or adopt hierarchical RL paradigms, we cast policy learning as a multi-agent RL framework, as shown in Figure 1. It integrates three agents responsible for the domain $a^d$, the intent type $a^i$, and the slot name $a^s$, respectively. They share the reward $r$ and make decisions cooperatively based on the state $s$ from the user. Finally, the concatenated joint action $a$ from the three agents is passed back to the user.

Multi-agent Dialog Policy
Specifically, Agent1 perceives the state $s$ and learns the domain policy $\pi_d$, which selects a domain category $a^d \in \mathcal{A}^d$. Meanwhile, Agent2, equipped with the intent policy $\pi_i$, takes the state $s$ and the selected domain $a^d$ as input and decides the intent type $a^i \in \mathcal{A}^i$. Then, Agent3 receives $s$, $a^d$, and $a^i$ and determines the slot name $a^s \in \mathcal{A}^s$ based on the slot policy $\pi_s$. Here $\mathcal{A}^d$, $\mathcal{A}^i$, and $\mathcal{A}^s$ are the sets of all possible domain names, intent types, and slot names, respectively.
Naturally, we aim to simultaneously optimize all policies so as to maximize the shared cumulative reward. Specifically, Agent1 aims to learn the domain policy $\pi_d$ that maximizes the expected sum of rewards conditioned on $s$ and $a^d$:

$$Q^{\pi_d}(s, a^d) = \mathbb{E}_{\pi_d}\Big[\textstyle\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s,\, a^d_t = a^d\Big],$$

where $r_t$ denotes the reward from the user at turn $t$, and $\gamma \in [0, 1]$ is a discount factor. Similarly, the intent policy $\pi_i$ is trained to maximize $\mathbb{E}_{\pi_i}\big[\sum_{k} \gamma^k r_{t+k} \mid s_t = s, a^d_t = a^d, a^i_t = a^i\big]$, and the slot policy $\pi_s$ further conditions on $a^i$. Following deep Q-learning, the domain policy estimates the optimal Q-function with a neural network parameterized by $\theta_d$ that satisfies the following:

$$Q_{\theta_d}(s, a^d) = \mathbb{E}\big[r + \gamma \max_{a^{d\prime}} Q_{\theta_d^-}(s', a^{d\prime})\big], \quad (1)$$

where $Q_{\theta_d^-}(\cdot)$ is the target state-action value function that is only periodically updated. Similarly, the intent policy estimates the optimal Q-function parameterized by $\theta_i$ that satisfies the following:

$$Q_{\theta_i}(s \,\|\, a^d, a^i) = \mathbb{E}\big[r + \gamma \max_{a^{i\prime}} Q_{\theta_i^-}(s' \,\|\, a^{d\prime}, a^{i\prime})\big], \quad (2)$$

where $Q_{\theta_i^-}(\cdot)$ is the target value function, and $\|$ denotes concatenation. Meanwhile, the slot policy estimates the optimal Q-function parameterized by $\theta_s$ that satisfies the following:

$$Q_{\theta_s}(s \,\|\, a^d \,\|\, a^i, a^s) = \mathbb{E}\big[r + \gamma \max_{a^{s\prime}} Q_{\theta_s^-}(s' \,\|\, a^{d\prime} \,\|\, a^{i\prime}, a^{s\prime})\big]. \quad (3)$$
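For concreteness, the following is a minimal PyTorch sketch of the cascaded decision process described above. It is our illustration rather than the authors' code: the exact way later agents consume earlier actions (here, one-hot vectors concatenated to shared features from a trunk $\phi$) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedQNet(nn.Module):
    """Three Q-heads over a shared trunk phi(s); each later head also
    sees the earlier agents' actions as one-hot vectors (Eqs. 1-3)."""
    def __init__(self, state_dim, n_dom, n_int, n_slot, hidden=100):
        super().__init__()
        # Shared hidden layers phi, as in the JOIE joint optimization.
        self.phi = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.n_dom, self.n_int = n_dom, n_int
        self.q_dom = nn.Linear(hidden, n_dom)                    # Agent1: Q(s, a^d)
        self.q_int = nn.Linear(hidden + n_dom, n_int)            # Agent2: Q(s||a^d, a^i)
        self.q_slot = nn.Linear(hidden + n_dom + n_int, n_slot)  # Agent3: Q(s||a^d||a^i, a^s)

    def forward(self, s):
        """Greedy cascade: each agent acts on the shared features plus
        the previously selected actions, then passes its choice on."""
        h = self.phi(s)
        a_d = self.q_dom(h).argmax(-1)
        d_onehot = F.one_hot(a_d, self.n_dom).float()
        a_i = self.q_int(torch.cat([h, d_onehot], -1)).argmax(-1)
        i_onehot = F.one_hot(a_i, self.n_int).float()
        a_s = self.q_slot(torch.cat([h, d_onehot, i_onehot], -1)).argmax(-1)
        return a_d, a_i, a_s  # concatenated into the joint dialog act
```

In practice an $\epsilon$-greedy wrapper would replace the plain argmax during exploration.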

JOIE for Policy Learning
To alleviate the dependence between gradients caused by non-independent data, the agents maintain independent experience replay buffers, denoted $D_d$, $D_i$, and $D_s$ for the domain policy, the intent policy, and the slot policy, respectively. Consequently, the Q-function $Q_{\theta_d}$ for the domain policy is learned by minimizing the following loss function:

$$\mathcal{L}(\theta_d) = \mathbb{E}_{(s, a^d, r, s') \sim D_d}\Big[\big(r + \gamma \max_{a^{d\prime}} Q_{\theta_d^-}(s', a^{d\prime}) - Q_{\theta_d}(s, a^d)\big)^2\Big]. \quad (4)$$

Similarly, the intent policy minimizes the following loss function:

$$\mathcal{L}(\theta_i) = \mathbb{E}_{(s \,\|\, a^d,\, a^i,\, r,\, s' \,\|\, a^{d\prime}) \sim D_i}\Big[\big(r + \gamma \max_{a^{i\prime}} Q_{\theta_i^-}(s' \,\|\, a^{d\prime}, a^{i\prime}) - Q_{\theta_i}(s \,\|\, a^d, a^i)\big)^2\Big]. \quad (5)$$

Meanwhile, the loss function for the slot policy is:

$$\mathcal{L}(\theta_s) = \mathbb{E}_{(s \,\|\, a^d \,\|\, a^i,\, a^s,\, r,\, s' \,\|\, a^{d\prime} \,\|\, a^{i\prime}) \sim D_s}\Big[\big(r + \gamma \max_{a^{s\prime}} Q_{\theta_s^-}(s' \,\|\, a^{d\prime} \,\|\, a^{i\prime}, a^{s\prime}) - Q_{\theta_s}(s \,\|\, a^d \,\|\, a^i, a^s)\big)^2\Big]. \quad (6)$$

As shown in Figure 1 and Equations 4, 5, and 6, all agents can observe the global state and the previous agents' actions during training. This setting stabilizes training by alleviating the non-stationary environment caused by unexpected changes in the dynamics as the agents' policies evolve. Besides, we propose a joint optimization process that sums the agents' losses over a shared hidden network:

$$\mathcal{L}(\theta) = \mathcal{L}(\theta_d) + \mathcal{L}(\theta_i) + \mathcal{L}(\theta_s). \quad (7)$$

With the joint optimization, the agents do not experience unexpected changes in the environment, because the different agents can exchange policy information through the shared hidden layers $\phi$.
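To make the update concrete, here is a minimal PyTorch-style sketch of one JOIE optimization step under the assumptions above. The buffer helper `buffers[k].sample` and the per-agent Q-heads are hypothetical names, and each sampled input `x` is assumed to already contain the concatenated conditioning (e.g. $s \,\|\, a^d$ for the intent agent), following Equations 4-6.

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, q_target, batch, gamma=0.9):
    """Standard DQN TD loss on one agent's transitions (Eqs. 4-6).
    x already contains the concatenated input, e.g. s||a^d for Agent2;
    done is a float mask (1.0 on terminal transitions)."""
    x, a, r, x_next, done = batch
    q = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_target(x_next).max(1).values
    return F.mse_loss(q, target)

def joie_step(nets, targets, buffers, optimizer, batch_size=32):
    """One JOIE update (Eq. 7): each agent samples from its OWN buffer
    (D_d, D_i, D_s), the three losses are summed, and a single backward
    pass flows through the shared hidden layers phi."""
    loss = sum(td_loss(nets[k], targets[k], buffers[k].sample(batch_size))
               for k in ("domain", "intent", "slot"))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the three heads share the trunk $\phi$, the single backward pass over the summed loss lets gradient information from each agent flow into the shared parameters, which is how the agents exchange policy information.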
A detailed summary of the learning algorithm of the collaborative multi-agent reinforcement learning for dialog policy based on joint optimization and independent experience replay buffer (JOIE) is provided in Algorithm 1 in Appendix D.

Experiments
We compare methods on MultiWOZ (Budzianowski et al., 2018) with a publicly available agenda-based user simulator (Zhu et al., 2020). Details of the user simulator and the implementation are given in Appendices B and C. We first evaluate 2-agent based models, which factor the centralized spaces into two subspaces of domain and joint intent-slot, with domain sizes of 2, 4, and 7 on MultiWOZ. We then compare 3-agent based models, which decompose the action space into three subspaces of domain, intent, and slot. The dataset contains 7 domains, 13 intents, and 28 slots in total. Details of the dataset are provided in Appendix A.

Baseline Agents
We compare JOIE with DQN, Hierarchical DQN (H-DQN), and two multi-agent RL agents. Note that we do not consider methods that use demonstrations, because our motivation is to improve learning in a large action space without human knowledge.
• H-DQN (Peng et al., 2017) is a hierarchical deep RL approach consisting of (1) a top-level agent that selects the domain (sub-goal), and (2) a low-level agent that determines the intent-slot to complete the sub-goal.
• JOIE is our proposed collaborative multi-agent framework, which factors the joint action space and learns each part with a different agent via joint optimization and independent experience replay, as described in Section 3.2.
• VDN (Sunehag et al., 2017) is a multi-agent method that combines the agents' state-action value functions as a simple sum for optimization with shared transitions.
• QMIX (Rashid et al., 2018) is a variant of VDN that adds a mixing network to combine the agents' state-action value functions into a centralized value for optimization (a simplified sketch of both combination schemes follows this list).
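The sketch below contrasts the two value-combination steps. It is deliberately simplified: the full QMIX mixer conditions on the global state through hypernetworks with non-negative weights to enforce monotonicity, all of which is omitted here.

```python
import torch
import torch.nn as nn

# VDN: the joint value is a plain, parameter-free sum of per-agent values.
def vdn_mix(q_values):           # q_values: list of [batch] tensors
    return torch.stack(q_values).sum(0)

# QMIX (simplified): a trainable mixing network combines per-agent values.
class QmixMixer(nn.Module):
    def __init__(self, n_agents, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_agents, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, q_values):  # q_values: [batch, n_agents]
        return self.net(q_values).squeeze(-1)
```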

Main Results
All agents are evaluated on success rate (Succ.), average turn (Turn), and average reward (Reward) at the end of training. The main simulation results are shown in Table 1 and Figures 2 and 3. The proposed JOIE learns much faster and performs consistently better in all cases by a statistically significant margin. Figure 2 shows the learning curves of the 2-agent based models. First, JOIE achieves the best Succ. (0.98 on average) with the highest learning efficiency for all domain sizes. QMIX and VDN adopt an optimization scheme that estimates a combined action value, which was originally designed for partial observability; JOIE drops this step to avoid the extra cost, since we assume the state is fully observed by all agents. Additionally, joint optimization, which relieves the non-stationarity problem, and independent experience replay buffers, which reduce gradient dependence, give JOIE better learning performance.

Results of 2-agent based Models
The improvement is slight for 2 domains but becomes remarkable as the number of domains increases. Besides, the multi-agent based models outperform H-DQN, indicating that the proposed collaborative multi-agent framework, which decomposes the joint action space and assigns each part to a different agent, can alleviate the exploration obstacles brought by the large action space without human knowledge. Finally, DQN is consistently the worst, which is not surprising since it explores and learns over a flat, large action space without any guidance. Note that DQN's performance improves as the number of domains decreases, which shows that the growth of the action space hinders the RL agent's learning speed. Meanwhile, as shown in Table 1, the comparisons on Turn and Reward are consistent with those on Succ. Figure 3 shows the learning curves of the 3-agent based models. JOIE3 learns faster and performs significantly better than VDN3 and QMIX3 by a clear margin, which indicates that the decentralized policy with joint optimization and independent experience replay buffers is more capable and robust for dialog policy learning. JOIE3 factors the concatenated intent-slot action space and assigns the parts to two agents, which further reduces the action space and balances the load on each agent. As a consequence, JOIE3 learns faster than JOIE, which is based on the joint intent-slot action space. Moreover, compared with VDN3's simple-sum centralization, QMIX3's trainable mixing network achieves better performance.

Human Evaluation
User simulators are not sufficient to fully mimic the complexity of real users (Dhingra et al., 2017), so we conduct a human evaluation to further assess the feasibility of JOIE in real scenarios. We deploy the 2-agent and 3-agent based models from Figures 2 and 3, trained on all seven domains for $2.0 \times 10^5$ simulation epochs, to interact with human users. In each evaluation session, a human user is assigned a sampled goal and instructed to communicate with a randomly selected agent to achieve it. Users can end the session at any time if the agent keeps repeating itself or they believe the dialog is going to fail. At the end of each session, users give explicit feedback on whether the dialog succeeded with all user constraints satisfied, and rate the session's quality on a scale from 1 (worst) to 5 (best). We collect 50 dialogs for each agent. The results are listed in Table 2: JOIE in both the 2-agent and 3-agent settings performs consistently better than the other baselines, which is consistent with what we observed in the simulation evaluation.

Conclusion and Future Work
We presented JOIE, a generally applicable collaborative multi-agent framework for dialog policy learning. It factors the action space and learns each part with a different agent via joint optimization and independent experience replay. Simulation results show that the proposed agents are efficient and effective in multi-domain settings with large action spaces. Directions for future work include: (1) extending JOIE to multi-action policies; (2) improving JOIE with demonstrations.

A Dataset

Table 3 lists all annotated dialog domains, intents, and slots of MultiWOZ for each number of domains in detail. Note that we do not count "General" and "Booking" as domains because they cannot define a task independently.

B User simulator
During training, the simulator is initialized with a goal, takes system acts as input, and outputs user acts together with a reward. The reward is −1 for each turn, plus a positive 2 · T for a successful dialog or a negative −T for a failed one, where T (set to 40) is the maximum number of turns in each dialog. A dialog is considered successful only if the agent helps the user simulator accomplish the goal and satisfies all of the user's search constraints (Wang et al., 2020).
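A minimal sketch of this reward schedule follows; whether the per-turn penalty also applies on the terminal turn is our assumption, as the text does not say.

```python
T_MAX = 40  # maximum number of turns per dialog

def turn_reward(done: bool, success: bool, T: int = T_MAX) -> int:
    """Appendix B reward: -1 per turn, plus 2*T on success or -T on
    failure when the dialog ends."""
    r = -1  # per-turn penalty (assumed to apply on the final turn too)
    if done:
        r += 2 * T if success else -T
    return r
```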

C Hyperparameters and Implementation
Let m ∈ {2, 4, 9} be the number of domains. We adopt a 2-layer MLP with 100 hidden dimensions and ReLU activations for all m. Given a state input of dimension 393, DQN's output dimension is m × 364, where 364 is the number of actions concatenating intent and slot. The 2-agent based models with a combined intent-slot action space, i.e. H-DQN, VDN, QMIX, and JOIE, use two networks with output heads of m and 364 dimensions. Note that VDN, QMIX, and JOIE share the input and hidden layers. The 3-agent based models with separate domain, intent, and slot action spaces, i.e. VDN3, QMIX3, and JOIE3, use three output heads of m, 13, and 28 dimensions and share the input and hidden layers. ε-greedy is used for policy exploration. We set the discount factor γ = 0.9. The target networks are updated every 1000 training epochs. To mitigate warm-up issues, we use the rule-based agent of ConvLab (Lee et al., 2019a) to provide experiences at the beginning; the warm_start epoch for all agents is 1000. The learning rate is 0.001 for DQN, 0.0005 for JOIE3, and 0.00005 for the other models. The decay rate and step size are 0.95 and 1000.
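For convenience, the settings above can be gathered in one place. The field names in this recap are ours; the values come from the text.

```python
# Illustrative recap of the Appendix C settings; m is the number of
# domains in each experimental setting.
JOIE_CONFIG = {
    "state_dim": 393,
    "mlp": {"layers": 2, "hidden": 100, "activation": "ReLU"},
    "heads_2agent": "(m, 364)",     # domain head, intent-slot head
    "heads_3agent": "(m, 13, 28)",  # domain, intent, slot heads
    "gamma": 0.9,
    "target_update_epochs": 1000,
    "warm_start_epochs": 1000,      # rule-based ConvLab agent seeds buffers
    "lr": {"DQN": 1e-3, "JOIE3": 5e-4, "others": 5e-5},
    "lr_decay": {"rate": 0.95, "step": 1000},
    "exploration": "epsilon-greedy",
}
```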

D Algorithms
Algorithm 1 outlines the full procedure for training the multi-agent based dialog policies with joint optimization and independent experience replay buffers.
Algorithm 1 JOIE for dialog policy learning
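The algorithm body is not reproduced here. Below is a hedged Python-style sketch of the training loop it describes, reconstructed from Section 3 and Appendix C and reusing the `joie_step` sketch above. The environment API, the buffer helpers, `epsilon_greedy`, and the per-dialog update placement are our assumptions, not the authors' exact procedure.

```python
def train_joie(env, net, target_net, buffers, optimizer,
               epochs=200_000, warm_start=1000, target_every=1000):
    """Hedged reconstruction of the JOIE training loop (Algorithm 1)."""
    for epoch in range(epochs):
        s, done = env.reset(), False
        while not done:
            if epoch < warm_start:
                # Warm start: the rule-based ConvLab agent seeds the buffers.
                a_d, a_i, a_s = env.rule_based_action(s)
            else:
                a_d, a_i, a_s = epsilon_greedy(net, s)
            s_next, r, done = env.step((a_d, a_i, a_s))
            # Independent experience replay: each agent stores its own
            # transition with its own conditioning input (Eqs. 4-6); how
            # the next-step conditioning actions a^{d'}, a^{i'} are formed
            # at training time is an implementation detail glossed over here.
            buffers["domain"].push(s, a_d, r, s_next, done)
            buffers["intent"].push((s, a_d), a_i, r, s_next, done)
            buffers["slot"].push((s, a_d, a_i), a_s, r, s_next, done)
            s = s_next
        # Joint optimization over the shared trunk (Eq. 7).
        joie_step(net.heads, target_net.heads, buffers, optimizer)
        if epoch % target_every == 0:
            target_net.load_state_dict(net.state_dict())
```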