High-Quality Dialogue Diversification by Intermittent Short Extension Ensembles

Many task-oriented dialogue systems use deep reinforcement learning (DRL) to learn policies that respond to the user appropriately and complete the tasks successfully. Training DRL agents with diverse dialogue trajectories prepares them well for rare user requests and unseen situations. One effective diversification method is to let the agent interact with a diverse set of learned user models. However, trajectories created by these artificial user models may contain generation errors, which can quickly propagate into the agent's policy. It is thus important to control the quality of the diversification and resist the noise. In this paper, we propose a novel dialogue diversification method for task-oriented dialogue systems trained in simulators. Our method, Intermittent Short Extension Ensemble (I-SEE), constrains the intensity of interaction with an ensemble of diverse user models and effectively controls the quality of the diversification. Evaluations on the Multiwoz dataset show that I-SEE successfully boosts the performance of several state-of-the-art DRL dialogue agents.


Introduction
Task-oriented dialogue agents assist human users in completing their tasks through multi-round human-agent interactions. Example tasks include booking a movie ticket or reserving a lunch table. Many agents use deep reinforcement learning (DRL) to learn good policies that respond appropriately in the dialogue and succeed in completing the task (Schulman et al., 2017; Takanobu et al., 2020; Peng et al., 2018). Due to the high cost of interactive training with real human users, simulators have been used to replace the humans (Shi et al., 2019). Early user simulators rely on rules hand-crafted by domain experts, who review and summarize dialogue templates from everyday conversations (Li et al., 2016b). Rule-based simulators manage routine task scenarios efficiently, but they cannot easily generate the unconventional and more diverse dialogue acts that mark human spontaneity. Agents learning from these simulators may fail to handle rare queries and unseen situations. For example, in Table 1's left column, the system learns from a rule-based simulator and fails to answer an unexpected question, "In which area of the city is this hotel located?" The user immediately ends the conversation and does not book the hotel.
Increasing dialogue diversity is a long-standing research interest motivated by different needs. In dialogue generation, diversity is desired to avoid dull responses. In simulator construction and reinforcement learning (RL), diversity is desired for training robust agents. Although the uses are slightly different, ideas to improve diversification can be universal. Dialogue diversity can be improved via i) enforcing diversity in the objective functions of neural models, such as maximizing mutual information (Li et al., 2016a; Baheti et al., 2018), ii) perturbing language rules (Niu and Bansal, 2019) or environment parameters (Tobin et al., 2017; Ruiz et al., 2019), iii) randomizing trajectory synthesis (Andrychowicz et al., 2017; Lu et al., 2019), iv) selecting more diverse data contributors (Stasaski et al., 2020), and v) sampling trajectories from a diverse set of environments (Chua et al., 2018; Janner et al., 2019).
Figure 1 illustrates the last idea, which has received much attention recently (Chua et al., 2018; Janner et al., 2019). This line of research diversifies an agent's learning experiences by letting the agent interact with a diverse set of generative user models learned from an expert simulator. The idea involves little manual configuration and is often developed as model-based DRL (MBDRL) (Sutton and Barto, 1998). MBDRL methods alternate between learning an environment model and learning a policy. For a task-oriented dialogue agent, the environment model can be thought of as a user model. It is a dynamic model updated to fit the trajectories the agent has collected so far; the policy is then optimized to maximize the expected long-term rewards within the model. Diversification of the user model is achieved by randomizing the parameter initialization of the neural networks. The agent, which is the policy learner, interacts with an ensemble of randomized user models to gain more diverse learning experiences.
Table 1 (dialogue excerpt and outcome notes):
User: Hi! I would like to book hotel Leverton House.
System: Are you interested in booking hotel Leverton House on the east side?
User: Where is it exactly located? What is the post, phone and type of the hotel? ...
System: ... Would you like to go ahead and book it?
User: Ok, Not sure on the duration of stay or number of people as of now.
System: Booking confirmed. [success]
User: I would also like to book a train from Peterborough to Cambridge arriving at 15:15 on Sunday.
System: Would you like to go ahead and book tickets on TR7721? [correct]
User: Ok. Not sure on the number of people.
System: One ticket booked. [success]
User: Great! Thanks!
System: Have a good day!
Left column. Failure: The system fails to answer the area question from the user and it does not book the hotel for the user.
Middle column. Failure: Even though the system successfully answers all questions from the user and books the hotel, it fails to identify the correct train based on arrival time and does not book the train ticket.
Right column. Success: Both hotel and train ticket have been successfully booked, satisfying all constraints.
However, one issue in this approach is that errors in (user) model learning may quickly propagate into policy learning. Table 1's middle column demonstrates a result from uncontrolled use of the diversified user models. In this example, even though the system successfully answers all questions from the user and books the hotel, the agent recommends two erroneous trains that do not satisfy the user's constraints and fails to book the ticket. This is because noise has been introduced into the training dialogues, which deviate too much from a legitimate real-life conversation.
In this paper, we propose a novel dialogue diversification method, Intermittent Short Extension Ensemble (I-SEE), for task-oriented dialogue agents trained in simulators. First, I-SEE employs neural networks to learn a generative user model by imitating the expert simulator (Torabi et al., 2018). Second, it randomizes the parameter initialization of the neural networks to generate more user models, which are diversified variants of the original expert-built simulator. These randomized user models form an ensemble of diverse simulators, named Diversified User Model Ensemble (DUME). Third, during policy learning, the agent interacts with multiple simulators to obtain diverse training trajectories. In particular, we propose to mix trajectory segments sampled from the expert simulator with trajectory segments sampled from the DUME. This constrains the degree of noise introduced by diversification and keeps the trajectories from diverging too far from the expert simulator. Moreover, we propose to include the DUME trajectories only moderately frequently and for a short horizon. Figure 2 illustrates our idea conceptually. By constraining the degree of diversification, I-SEE effectively controls the training trajectories' quality while preserving their diversity. In Table 1's last (right) example, the I-SEE agent successfully takes the booking task to a logical conclusion by correctly finding the TR7721 train, which satisfies the user's time constraints. We apply I-SEE to a few of the best-performing DRL dialogue methods and evaluate them on the Multiwoz (Budzianowski et al., 2018) dataset. Results show that using DUME and I-SEE in combination significantly improves the performance of these state-of-the-art systems.
Related Work
Seq2Seq dialogue agents are generation methods. They use language models to capture the probability of one utterance given the previous one, and use the learned models to generate new utterances (Vinyals and Le, 2015; Hosseini-Asl et al., 2020). These supervised methods take advantage of deep neural networks and infer effective encoder-decoders from large amounts of sequential training data. Modeling the dialogue states (Campagna et al., 2020) in the Seq2Seq architecture is a major interest in this line of research.
KG-QA dialogue agents enable reasoning and inference with pre-built knowledge graphs (KGs). The KGs can be about commonsense or domain-specific knowledge. A general KG can help make a conversation more interesting and engaging (Moon et al., 2019; Young et al., 2018), while a specific KG can help accomplish the task more efficiently (Madotto et al., 2018, 2020). Methods in this category focus on scaling up the KGs (Madotto et al., 2020) and hopping multiple steps on the KGs (Moon et al., 2019).
Retrieval-based dialogue agents leverage mature techniques in ad hoc retrieval and extend the techniques from individual queries to a session of them. Retrieval-based approaches do not rely on simulators; instead, learning from historical data, such as query logs, is still quite popular. This line of research focuses on revealing a user's mixed-initiative information need by asking back-and-forth questions (Aliannejadi et al., 2019; Yu et al., 2020; Qu et al., 2020; Wang and Ai, 2021). However, when task complexity goes beyond the user's capability, these approaches may face difficulty in finding global solutions to the task goal.
RL-based dialogue agents can be grouped into model-free and model-based methods. Model-free DRL (MFDRL) agents take a pre-built environment/simulator as it is and learn policies via direct interactions with it (Li et al., 2017; Dhingra et al., 2017; Li et al., 2017; Lipton et al., 2018; Su et al., 2018; Wu et al., 2020). In contrast, model-based DRL (MBDRL) agents learn policies from the environment indirectly. MBDRL has two concurrent learning modules, namely model learning and policy learning. The model learning module can be thought of as an additional computational layer between the environment and the agent. This provides opportunities to alter the original environment. MBDRL was originally proposed in robotics and control to speed up direct policy learning by inferring decision rules from past interactions and embedding them in the model. For dialogue agents, this middle layer of model learning acts as a set of simulators (or learned user models) derived from the original expert simulator. Deep Dyna-Q (DDQ) (Peng et al., 2018) is an MBDRL method built upon Dyna (Sutton and Barto, 1998). D3Q (Su et al., 2018) employs generative adversarial networks (GAN) to minimize the difference between trajectories generated from the learned models and those from the original expert simulator, assuming that the expert simulator is the gold standard. Likewise, ADC (Wu et al., 2020) uses double critics to mitigate the impact of poorly generated trajectories and stabilize the agent's performance. Our method belongs to the family of MBDRL, with a focus on diversification.

Diversification in Dialogues
Increasing dialogue diversity is a long-standing research interest. Dialogue diversity can be improved via enforcing diversity in the objective functions of neural models, such as maximizing mutual information (Li et al., 2016a; Baheti et al., 2018), perturbing language rules (Niu and Bansal, 2019) or environment parameters (Tobin et al., 2017; Ruiz et al., 2019), randomizing trajectory synthesis (Andrychowicz et al., 2017; Lu et al., 2019), selecting more diverse data contributors (Stasaski et al., 2020), and sampling trajectories from a diverse set of environments (Chua et al., 2018; Janner et al., 2019). For instance, Campagna et al. augmented dialogue data using domain-independent transition rules and domain-specific ontology (Campagna et al., 2020). Niu and Bansal synthesized more diverse dialogue trajectories by choosing semantic-preserving language perturbations via RL (Niu and Bansal, 2019).

Diversification in DRL
In model-free DRL, diversification can be achieved by domain randomization (Tobin et al., 2017; Ruiz et al., 2019) or hindsight experience replay (Andrychowicz et al., 2017; Lu et al., 2019), without modeling the dynamics of the environment.
In model-based DRL, diversification is done by altering the learned environment/user model; these methods are the closest to our work. For instance, Chua et al. proposed probabilistic ensemble trajectory sampling (PETS) (Chua et al., 2018), which learns an ensemble of environment models and uses them for planning. The follow-up work (Janner et al., 2019) extended PETS with policy learning. Like us, Janner et al. were concerned about the noise added by new trajectories generated from the derived environments. They proposed that the generation of new trajectories from the derived models should start from a beginning state shared with the original environment. These methods are mainly developed for robotics and work in continuous action spaces.
In this paper, we propose to obtain mixed training trajectories by branching from the original trajectory generated by the expert simulator and extending it with new trajectories generated by the derived simulators. Different from (Janner et al., 2019), our method is designed for dialogue agents' discrete action space. In our method, each training trajectory overlaps with the expert trajectory much more than in (Janner et al., 2019). This allows us to obtain smoother transition distributions that better suit a discrete action space. In addition, our method can parameterize the intensity of branching, so that the level of diversification can be controlled and adjusted.

Problem Setup
Task-Oriented Dialogue is the interactive process between a user and a dialogue agent, who work together to accomplish a task. The process begins with the user initiating the dialogue with a task goal in mind. The task goal can have constraints and requests. Constraints are requirements a system response must satisfy, and requests are for missing information the user needs to accomplish the task. For example, a user wants to book tickets for a movie to be played on weekends but does not know the theater's phone number. Here the constraint is time = weekend and the request is phone number = ?. The dialogue ends when both parties say "good-bye" or the user abandons it.
Expert Simulator is the rule-based simulator built by human experts. It is denoted as $M_0$ and describes how a typical user would choose proper dialogue acts as the dialogue unfolds. The state of the expert simulator at time step $t$ is $s^u_t$ and the action is $a^u_t$, selected from an action set $A^u$; an action can either make requests or impose constraints. $M_0$ defines a mapping from $s^u_t$ to $a^u_t$ that describes the patterns and behaviours of human users; it provides feedback to, and converses with, the dialogue agent.
Diversified Simulator (or Diversified User Model) $M_{\phi_*}$ is a trainable user model that learns a parametric mapping from $s^u_t$ to $a^u_t$ with parameters $\phi_*$. It mimics the behavior of the expert simulator $M_0$. With different parameter initializations, we can create a set of diversified user models. This set of diversified simulators is called the Diversified User Model Ensemble (DUME).
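As a small illustration of the goal structure just described, the sketch below represents a task goal as a set of constraints plus a set of requests. The class name `UserGoal`, its methods, and the slot names are hypothetical and chosen only for this example; they are not taken from any dataset schema.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class UserGoal:
    """A task goal: constraints a response must satisfy, plus requested slots."""
    constraints: Dict[str, str]
    requests: Set[str]

    def satisfied_by(self, entity: Dict[str, str]) -> bool:
        """True if the entity matches every constraint slot."""
        return all(entity.get(slot) == value
                   for slot, value in self.constraints.items())

    def unanswered(self, informed: Dict[str, str]) -> Set[str]:
        """Requested slots the agent has not provided yet."""
        return self.requests - set(informed)


# The movie example from the text: constraint time = weekend, request phone number = ?
goal = UserGoal(constraints={"time": "weekend"}, requests={"phone_number"})
showing = {"time": "weekend", "theater": "Hypothetical Cinema"}  # made-up entity
print(goal.satisfied_by(showing))                # True
print(goal.unanswered({}))                       # {'phone_number'}
print(goal.unanswered({"phone_number": "n/a"}))  # set()
```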
Dialogue Agent (DA) is the automatic response generator, which is expected to search the knowledge base, reply to the human users with relevant and correct answers, and make transactions following the user's requests. We use $s^s_t$ and $a^s_t$ to denote the state and action of the dialogue agent at time step $t$. The agent also receives a reward signal $r_t$ as immediate feedback for its action $a^s_t$. Its state transition function $P$ models the probability of its next state given the current state and the actions from both the user and the DA: $s^s_{t+1} = P(s^s_t, a^s_t, a^u_t)$. In the DRL setting, the DA is the policy learner. It learns a policy $\pi$ from a set of dialogue trajectories $\{\tau_*\}$. The goal of the agent is to learn a policy that maximizes the expected cumulative reward $\mathbb{E}_\pi[\sum_t r_t]$ in a task-oriented dialogue.
Interaction Tuple $T$ is the state-action-reward tuple generated when the DA interacts with a simulator or a real user. At the $t$-th dialogue turn, the $t$-th interaction tuple is $T_t = (s^s_t, a^s_t, r_t, s^u_t, a^u_t)$.
Trajectory Segment $\tau^k_j$ is a sequence of interaction tuples generated when the DA interacts with a simulator ($M_0$ or $M_{\phi_i}$) or a real user, from time step $j$ to $k$: $\tau^k_j = [T_j, T_{j+1}, \ldots, T_k]$, where $T_t$, $t \in [j, k]$, is the $t$-th interaction tuple of the segment. As determined by the state transition function $P$, later interaction tuples in $\tau$ depend on the earlier tuples. A base trajectory segment $\tau_0$ is a trajectory that records the interaction between the expert simulator $M_0$ and the DA. A diversified trajectory segment $\tilde{\tau}$ is a trajectory segment that records the interaction between a diversified simulator $M_{\phi_*}$ and the DA. A full trajectory $\tau^{full}_0 = [T_0, \ldots, T_T]$ starts from the beginning of a dialogue, i.e., $j = 0$, and ends at $T$, where $T$ is the entire dialogue's length.
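The definitions above translate naturally into a small data structure. The following is a minimal sketch, assuming states and actions are stored as plain tuples of numbers; the field names and the helper `segment` are illustrative choices, not part of the original formulation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InteractionTuple:
    """One dialogue turn: (agent state, agent action, reward, user state, user action)."""
    agent_state: Tuple[float, ...]
    agent_action: Tuple[int, ...]
    reward: float
    user_state: Tuple[float, ...]
    user_action: Tuple[int, ...]


def segment(trajectory: List[InteractionTuple], j: int, k: int) -> List[InteractionTuple]:
    """Return the trajectory segment [T_j, ..., T_k], both ends inclusive."""
    if not 0 <= j <= k < len(trajectory):
        raise ValueError("need 0 <= j <= k < len(trajectory)")
    return trajectory[j:k + 1]


# Toy full trajectory with six turns (t = 0..5); all values are placeholders.
full = [InteractionTuple((float(t),), (0,), -1.0, (float(t),), (1,)) for t in range(6)]
seg = segment(full, 2, 4)  # turns 2, 3 and 4
```

Freezing the dataclass keeps stored tuples immutable, which is convenient when the same base tuple is shared between a base trajectory and a branched one.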

Proposed Work
Our method aims to provide high-quality diversified training trajectories for task-oriented dialogue agents. We propose to (1) construct an ensemble of diversified user models called DUME and (2) intermittently branch out short trajectories from the base trajectory using DUME and employ the new trajectories in policy learning.
Figure 3 illustrates the proposed system architecture. In our design, the dialogue agent can interact with both the expert simulator and a diversified simulator. Usually the agent starts by interacting with the expert simulator from $t = 0$. At a branching step $t = p$, the agent switches to the diversified simulator and interacts with it until the trajectory ends at $t = T$. The diversified simulator is obtained via imitation learning (from the expert simulator) and randomized neural network initialization. By controlling how frequently the branching is performed and how long a diversified segment is used, we effectively reach a balance between training data diversity and quality.

Constructing Diversified User Model Ensemble (DUME)
To enhance dialogue diversity, we propose to have the agent interact with an ensemble of diverse user models. We use neural networks with different initializations to learn diversified user models from the expert simulator $M_0$, and form the DUME using these learned models.

Learning a single user model
We propose to learn the user models from the expert simulator by behavior cloning (Torabi et al., 2018). For a single user model, we aim to learn a sequential decision-making function that maps $(s^u_1, s^u_2, \ldots, s^u_t, \ldots)$ to $(a^u_1, a^u_2, \ldots, a^u_t, \ldots)$. The training inputs are from the base trajectories $\tau_0$, which include a sequence of user state and user action pairs $(s^u_t, a^u_t)$. The user state at the $t$-th turn is $s^u_t = (G, \{a^s_{t'}\}_{t'=1}^{t-1})$, where $G$ is the user goal, which can include both constraints and requests, and $\{a^s_{t'}\}_{t'=1}^{t-1}$ is the history of the dialogue agent's actions. The user action is $a^u_t = (a^u_{t,1}, \ldots, a^u_{t,|A^u|})$, where the $a^u_{t,i}$ are binary variables indicating whether the $i$-th dialogue act is active at dialogue turn $t$, and $A^u$ is the set of dialogue acts available to the user. The ending of a dialogue is also a special dialogue act.
Here a single user action can contain multiple dialogue acts, for instance, informing the destination and the arrival time in the same dialogue turn when booking a train ticket. This means the number of dialogue acts per user action varies. To allow the flexibility of modeling a varying number of user acts, we propose to break the training trajectory $\tau_0$ (which is a sequence) into individual state-action pairs and formulate the learning as choosing the right dialogue acts at a given state, i.e., learning the mapping from $s^u_t$ to $a^u_t$. The optimization minimizes a loss function $L(\phi)$, where $\phi$ is the model parameter vector, $a_{t,i}$ is the ground-truth indicator of whether the $i$-th dialogue act is taken at time step $t$, and $M_\phi(s^u_t)_i$ estimates the probability of the $i$-th dialogue act being chosen by the user model given $s^u_t$. The learning is performed by a multi-layer perceptron neural network parameterized by $\phi$.
We are aware that the learning of the user models can be done using much more sophisticated methods. For example, we can use more advanced neural network architectures and/or incorporate more information when defining the user states. However, these changes are not the main focus of this paper. The proposed user modeling is sufficient to support our investigation into exploiting the user models to improve diversification.
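One loss consistent with the description above (binary ground-truth indicators and per-act probabilities) is binary cross-entropy summed over dialogue acts. The sketch below assumes that form, averaged over turns; the function names, the clipping constant `eps`, and the dialogue act names are illustrative assumptions rather than the original implementation.

```python
import math
from typing import Iterable, List, Sequence


def multi_hot(active_acts: Iterable[str], act_set: Sequence[str]) -> List[int]:
    """Encode a user action as binary indicators over the dialogue act set."""
    active = set(active_acts)
    return [1 if act in active else 0 for act in act_set]


def bce_loss(probs: Sequence[Sequence[float]],
             targets: Sequence[Sequence[int]],
             eps: float = 1e-12) -> float:
    """Binary cross-entropy, summed over acts and averaged over turns (assumed form)."""
    total = 0.0
    for p_row, a_row in zip(probs, targets):
        for p, a in zip(p_row, a_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= a * math.log(p) + (1 - a) * math.log(1.0 - p)
    return total / len(probs)


# Hypothetical act set; one turn where two acts are active at once.
ACTS = ["inform_destination", "inform_arrival_time", "request_phone", "bye"]
target = multi_hot({"inform_destination", "inform_arrival_time"}, ACTS)
good = bce_loss([[0.9, 0.8, 0.1, 0.1]], [target])  # predictions close to the target
bad = bce_loss([[0.1, 0.2, 0.9, 0.9]], [target])   # predictions far from the target
```

Treating each dialogue act as an independent binary decision is what lets one state map to a variable number of active acts.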

Forming a Diverse Ensemble
We propose to build an ensemble of diversified user models for better diversification. The ensemble, DUME, contains a set of $E$ user models $M_{\phi_1}, M_{\phi_2}, \ldots, M_{\phi_E}$. Each of them is trained with behavior cloning as stated in Section 4.1.1. DUME diversifies the user models by initializing the behavior cloning with different seeds. Each user model is trained using a separate neural network; these neural networks share the same architecture but use different, randomized initial parameters $\phi_j$. Our experiments (Section 5.4) show that the diversity in DUME dramatically increases as $E$ increases.
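The seed-based diversification can be sketched as follows. This is only an illustration of giving every ensemble member its own random initialization: the uniform initialization scheme, the dictionary layout, and the function names are assumptions, the layer sizes follow the 230/200/67 dimensions reported in the implementation details, and the behavior cloning step that would follow is omitted.

```python
import math
import random
from typing import Dict, List

Matrix = List[List[float]]


def init_user_model(seed: int, n_in: int = 230, n_hidden: int = 200,
                    n_out: int = 67) -> Dict[str, Matrix]:
    """Create untrained MLP weights from a model-specific random seed."""
    rng = random.Random(seed)

    def layer(rows: int, cols: int) -> Matrix:
        bound = 1.0 / math.sqrt(cols)  # assumed init range, not from the paper
        return [[rng.uniform(-bound, bound) for _ in range(cols)]
                for _ in range(rows)]

    return {"W1": layer(n_hidden, n_in), "W2": layer(n_out, n_hidden)}


def build_dume(num_models: int, base_seed: int = 0) -> List[Dict[str, Matrix]]:
    """Same architecture for every member, different seed per member."""
    return [init_user_model(base_seed + j) for j in range(num_models)]


dume = build_dume(5)  # E = 5 members, each to be trained separately afterwards
```

Because each member owns a separate `random.Random` instance, the initialization is reproducible per seed and independent of any global random state.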

Policy Learning with I-SEE
One would imagine that the more diversified trajectories are used in training, the more robust the policy would be. An intuitive idea is to interact with the diversified user models $M_{\phi_*}$ from the beginning to the end, without using $M_0$ at all. DDQ (Peng et al., 2018) indeed exploits this design. However, a dialogue trajectory completely generated by $M_{\phi_*}$ suffers from the accumulation of generation errors, because it may deviate too much from what a real conversation looks like.
In this paper, we propose to learn from training trajectories generated from mixed sources. Our idea is to have controlled diversification during policy learning, where some of the learning is done from the original expert simulator and some is done from the diversified user models in DUME. The ratio of the diversified portion can be controlled as a hyper-parameter. The following details our method.

Diversifying the Trajectories
During policy learning, the dialogue agent collects training trajectories generated from the simulators and keeps refining its policy by gradient ascent. Algorithm 1 details the trajectory generation process. To sample a trajectory, the policy learner, i.e., the dialogue agent, interacts with a user model to obtain interaction tuples step by step and stores each individual tuple in a dataset $D$. To obtain an individual interaction tuple, the simulator takes an action based on its own user model, and then the agent performs an action based on the state and its current policy $\pi$. The agent receives rewards and the next state from the simulator. The interaction tuple is stored and is used later to form a full trajectory. This process works the same regardless of whether the agent interacts with the expert simulator or a diversified simulator.
In this work, we propose to diversify the agent's learning experiences by learning from trajectories generated from mixed sources. First, we generate a full base trajectory $\tau^{full}_0 = [T_0, \ldots, T_T]$ from the expert simulator and store all its tuples. Second, we pick a branching tuple $T_p \in \tau^{full}_0$ at a branching point $p \in (0, T)$. Third, from $p$ onward, the trajectory is generated with a diversified user model $M_{\phi_*}$, which would take an action $\tilde{a}^u_p$ different from the expert action $a^u_p$, so the agent would also land in a different state $\tilde{s}^s_{p+1} = P(s^s_p, a^s_p, \tilde{a}^u_p)$. Such interaction with the diversified simulator $M_{\phi_*}$ continues for $H$ steps, resulting in a diversified trajectory segment. The diversified trajectory segment $\tilde{\tau}^{p+H}_p$ records the interaction between $M_{\phi_*}$ and the agent, extending the base trajectory $\tau_0$ from a branching point $p$ and running from $p + 1$ onward. It is denoted as $\tilde{\tau}^{p+H}_p = [\tilde{T}_p, \tilde{T}_{p+1}, \ldots, \tilde{T}_{p+H}]$, where $p$ is the branching point with $p > 0$, and $H$ is the horizon of $\tilde{\tau}_p$. The first interaction tuple in $\tilde{\tau}_p$ is copied from the $p$-th turn in $\tau_0$, i.e., $\tilde{T}_p = T_p$. The full trajectory with diversification is thus $\tilde{\tau}^{full}_p = [T_0, \ldots, T_p, \tilde{T}_{p+1}, \ldots, \tilde{T}_{p+H}]$. Our method generates parts of a dialogue with the diversified simulator and the other parts with the expert simulator. Each training trajectory thus overlaps with the expert trajectory, which yields smoother transition distributions that better suit the discrete action space of a dialogue agent.
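The branching procedure described above can be sketched with toy simulators. In this simplified illustration a turn is reduced to a `(turn index, user act, source)` triple, the two user models are stand-in functions, and the agent policy, rewards, and dialogue termination are ignored; all names are hypothetical.

```python
from typing import Callable, List, Tuple

Turn = Tuple[int, str, str]  # (turn index, user act, source): a stand-in for a full tuple


def rollout(user_model: Callable[[int], str], source: str,
            start: int, steps: int) -> List[Turn]:
    """Generate `steps` consecutive turns with the given user model."""
    return [(t, user_model(t), source) for t in range(start, start + steps)]


def branch(base: List[Turn], p: int, horizon: int,
           diversified_user: Callable[[int], str]) -> List[Turn]:
    """Keep turns 0..p of the base trajectory, then extend for `horizon` turns."""
    last = len(base) - 1
    if not 0 < p < last:
        raise ValueError("branching point must satisfy 0 < p < T")
    shared = base[:p + 1]  # T_0 .. T_p, copied from the expert trajectory
    extension = rollout(diversified_user, "dume", p + 1, horizon)
    return shared + extension


def expert(t: int) -> str:       # toy expert simulator
    return "inform"


def diversified(t: int) -> str:  # toy diversified user model
    return "request"


base = rollout(expert, "expert", 0, 8)  # full base trajectory, turns 0..7
mixed = branch(base, p=3, horizon=5, diversified_user=diversified)
```

The resulting trajectory shares its first `p + 1` turns with the expert trajectory and contains exactly `horizon` diversified turns after the branching point.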

Intermittent, Short Extensions
Further, we control the quality of diversification by using the DUME conservatively: we use the DUME trajectories only for a short horizon and only intermittently, to avoid accumulating generation errors.
Branching Horizon. The hyper-parameter $H$ is the branching horizon, which controls how far a trajectory is generated from DUME. The larger the horizon $H$, the more diverse the resulting trajectory. Setting $H$ too small may cause the policy to be myopic, as actions take time to show effects, whereas setting it too large may result in the accumulation of errors. Our experiments show that using a moderately small $H = 5$ is preferable. An analysis is reported in the experiment section.
Branching Intensity. Another factor that determines the degree of diversification is the intensity of branching. Instead of branching at every single step, our method only intermittently and uniformly forks a diversified trajectory. This is done by setting a diversification ratio $\eta$ between the number of times the agent interacts with DUME and with the expert simulator $M_0$. The diversification ratio is calculated as $\eta = |\{\tilde{T}_i\}| / |\{T_j\}|$, where $\tilde{T}_i$ is a diversified interaction tuple stored in $D_{dvs}$ and $T_j$ is an interaction tuple stored in $D_{base}$. $D_{base}$ and $D_{dvs}$ are collections of individual interaction tuples obtained as in Lines 4-14 of Algo. 2. A larger $\eta$ means the agent's learning is more diversified. Algo. 2 shows the entire I-SEE algorithm.
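To make the role of the diversification ratio concrete, the sketch below assumes η is the number of diversified tuples divided by the number of base tuples, and uses a simple hypothetical rule that permits a new H-step branch only when the ratio would stay at or below η. This scheduling rule is an illustration and is not the branching schedule of Algo. 2.

```python
from typing import List, Tuple

Record = Tuple[str, int, int]  # (source, episode, turn): placeholder for a stored tuple


def diversification_ratio(d_dvs: List[Record], d_base: List[Record]) -> float:
    """Assumed definition: diversified tuple count over base tuple count."""
    if not d_base:
        raise ValueError("the base dataset must not be empty")
    return len(d_dvs) / len(d_base)


def can_branch(d_dvs: List[Record], d_base: List[Record],
               eta: float, horizon: int) -> bool:
    """Allow a branch only if one more H-step extension keeps the ratio <= eta."""
    return bool(d_base) and (len(d_dvs) + horizon) / len(d_base) <= eta


ETA, H, TURNS = 0.2, 5, 10  # eta and H as in the best reported setting; TURNS is made up
d_base: List[Record] = []
d_dvs: List[Record] = []
for episode in range(10):
    # Every episode contributes a base trajectory from the expert simulator.
    d_base.extend(("base", episode, t) for t in range(TURNS))
    # Only some episodes also contribute a short diversified extension.
    if can_branch(d_dvs, d_base, ETA, H):
        d_dvs.extend(("dvs", episode, t) for t in range(H))

ratio = diversification_ratio(d_dvs, d_base)
```

With these toy numbers, branching happens in only a few of the ten episodes, which is the intermittent behaviour the ratio is meant to enforce.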

Experimental Setup
• Dataset. We evaluate the proposed approach on the Multiwoz (Budzianowski et al., 2018) [...] expected to 1) provide correct answers to requested information and 2) complete the booking, if asked.
• Evaluation Metrics. Success is our main metric, which is the success rate over all dialogue tasks tested. A task is successful if and only if 1) all the requested information is provided, and 2) all the booked entities match the user's requirements.
Inform F1 evaluates whether an agent provides the information requested by the user. It is calculated as $F_1 = \frac{2 \cdot Prec \cdot Recall}{Prec + Recall}$, where $Prec$ and $Recall$ are the precision and recall of the information replied by the agent. Match evaluates whether the booked entities satisfy the user's requirements. It scores 1 if the correct entity is booked, otherwise 0. In the case of multiple bookings, the scores are averaged across all bookings. #Turns measures the number of turns a dialogue lasts regardless of its success. The fewer the turns, the better.
• Baselines. We compare the performance of a few top-performing DRL dialogue agents on the Multiwoz dataset under three settings. The settings are 1) the algorithm without diversification, 2) with full and uncontrolled diversification, and 3) with I-SEE. These baseline systems include state-of-the-art MFDRL and MBDRL methods and the best-performing DRL agents on Multiwoz. DQN (Deep Q-Network) (Mnih et al., 2015) is an off-policy MFDRL method, which approximates the value function of state-action pairs with a deep neural network and learns the function using experience replay. PPO (Proximal Policy Optimization) (Schulman et al., 2017) is an on-policy MFDRL algorithm, which optimizes a surrogate objective function that restricts the change of action distributions in a policy update. GDPL (Guided Dialogue Policy Learning) (Takanobu et al., 2019) is the best performer on Multiwoz. It uses inverse RL to reconstruct the reward function and optimizes its policy with PPO. DDQ (Deep Dyna-Q) (Peng et al., 2018) is an MBDRL algorithm designed for task-oriented dialogue agents. DDQ generates complete trajectories from its environment models, which is equivalent to our setting of DQN with full diversification. MADPL (Multi-Agent Dialogue Policy Learning) (Takanobu et al., 2020) is a multi-agent MFDRL method that trains the system and the user simulator simultaneously. It is also a leading performer on Multiwoz.
Footnote 2: Five out of seven domains require querying the database.
• Implementation Details. We use Multiwoz's agenda-based simulator (Zhu et al., 2020) as the expert simulator. The DUME and the policy networks and value networks in the baselines are learned using three-layer multi-layer perceptrons (MLPs). A learned user model has an input dimension of 230 and an output dimension of 67, with a hidden layer of 200 units. The DRL dialogue agents all use an input layer of 553 units. PPO's policy network uses a hidden layer of 200 units and an output of 166. PPO's value network has a hidden layer of 50 units and an output of 1. DQN also uses a hidden layer of 200 units and an output of 166. The I-SEE dialogue agent is trained with a mix of the expert simulator and diversified simulators as presented in the paper and tested with only the expert simulator.
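The evaluation metrics defined in this section can be sketched as follows. This is a simplified illustration, assuming requested and provided information are compared as sets of slot names and each booking is checked against a set of acceptable entity identifiers; the function names and the handling of empty inputs are assumptions, and the benchmark's official evaluator may differ in detail.

```python
from typing import Iterable, Set, Tuple


def inform_f1(provided: Set[str], requested: Set[str]) -> float:
    """F1 between the slots the agent provided and the slots the user requested."""
    true_pos = len(provided & requested)
    if true_pos == 0:
        return 0.0
    prec = true_pos / len(provided)
    recall = true_pos / len(requested)
    return 2 * prec * recall / (prec + recall)


def match_score(bookings: Iterable[Tuple[str, Set[str]]]) -> float:
    """Average of per-booking scores: 1 if the booked entity is acceptable, else 0."""
    items = list(bookings)
    if not items:
        return 1.0  # assumption: no booking requested counts as matched
    return sum(1.0 if booked in valid else 0.0 for booked, valid in items) / len(items)


def is_success(provided: Set[str], requested: Set[str],
               bookings: Iterable[Tuple[str, Set[str]]]) -> bool:
    """Success: all requested information provided and every booking matches."""
    return requested <= provided and match_score(bookings) == 1.0


f1 = inform_f1({"phone", "area"}, {"phone", "postcode"})               # 0.5
m = match_score([("TR7721", {"TR7721"}), ("TR0000", {"TR7721"})])      # 0.5
ok = is_success({"phone", "area"}, {"phone"}, [("TR7721", {"TR7721"})])  # True
```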

Effectiveness
Table 3 presents the experiment results. The proposed method I-SEE outperforms the original algorithms and the full diversification variants for all baselines on the main metric, success, and on the number of turns. The best performance is given by GDPL+I-SEE, with a success rate of 93.2 and only 7.32 dialogue turns on average. Moreover, the I-SEE variants perform the best on Inform F1 for PPO and DQN, and on Match for PPO and GDPL. The improvements are large. These results suggest that diversification in general improves a DRL dialogue agent's effectiveness. However, full and uncontrolled diversification may worsen the performance, whereas a moderate level of diversification, as we propose, is a better choice.

Analysis of I-SEE
To understand why I-SEE works, we investigate the relationship between the degree of diversification and the success rate. GDPL is selected as the baseline system X. We study three I-SEE hyper-parameters that are responsible for the degree of diversification. They are the user model ensemble size $E$, the branching horizon $H$, and the diversification ratio $\eta$. As each of these parameters gets bigger, the degree of diversification increases. We plot the dialogue agent's learning curves w.r.t. the three parameters in Figures 4a, 4b, and 4c, respectively.
We observe that each hyper-parameter has a single optimum at which the success rate is best. As we increase the size of the ensemble with $E = 1, 3, 5, 7, 9$, the degree of diversity increases. Figure 4a shows that initially increasing the diversity helps improve the performance; however, the trend turns downwards after reaching the optimum at $E = 5$. Figures 4b and 4c demonstrate similar trends. In the end, the best combined I-SEE setting is $E = 5$, $H = 5$, and $\eta = 0.2$. This experiment suggests that diversification can only help an agent's learning to a certain extent; diversification beyond that may introduce too much noise into the learning and hurt the agent's performance. Therefore, the degree of diversification must be carefully chosen in practice.

Analysis of DUME
DUME is our collection of trainable diversified user models. We calculate the average pairwise KL-divergence for every two models $M_{\phi_i}, M_{\phi_j} \in$ DUME to directly measure the degree of diversity within DUME. Each user model is run on the same state sequence $\{s^u_1, \ldots, s^u_t, \ldots\}$ and outputs an action sequence $\{a^u_1, \ldots, a^u_t, \ldots\}$. Since each $a^u_t$ may contain multiple dialogue acts, we break down every $a^u_t$ into individual dialogue acts and calculate the distribution over the dialogue act set $A^u$. The mean $\mu$ and standard deviation $\sigma$ of the KL divergences are plotted in Figure 4d. We can see that as DUME grows in size, both $\mu$ and $\sigma$ increase, which means that the differences between the DUME simulators dramatically increase and that they would add much diversity to the agent's learning.
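The diversity measurement described above can be sketched as follows. The sketch assumes each model's output is a list of multi-hot user actions, uses add-one smoothing so that the KL divergence stays finite, and averages over ordered pairs because KL is asymmetric; the smoothing, the pair ordering, and all names are assumptions made for illustration.

```python
import math
from itertools import permutations
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def act_distribution(actions: Sequence[Sequence[int]], num_acts: int,
                     alpha: float = 1.0) -> List[float]:
    """Distribution over dialogue acts from multi-hot actions, with smoothing alpha."""
    counts = [alpha] * num_acts
    for multi_hot in actions:
        for i, active in enumerate(multi_hot):
            counts[i] += active
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def pairwise_kl_stats(dists: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Mean and standard deviation of KL over all ordered pairs of models."""
    values = [kl_divergence(p, q) for p, q in permutations(dists, 2)]
    return mean(values), pstdev(values)


# Two toy models run on the same two states, with three possible dialogue acts.
model_a = [[1, 0, 0], [1, 1, 0]]
model_b = [[0, 0, 1], [0, 1, 1]]
dists = [act_distribution(m, num_acts=3) for m in (model_a, model_b)]
mu, sigma = pairwise_kl_stats(dists)
```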

Conclusion
This paper presents Intermittent Short Extension Ensemble (I-SEE), a DRL diversification method that successfully improves dialogue diversity and policy robustness while maintaining high data quality. I-SEE uses an ensemble of trainable user models to achieve diversity and controls the diversification quality by branching from original dialogue trajectories only for a short horizon and intermittently. Our experiments on Multiwoz show that using I-SEE can significantly improve several state-of-the-art DRL dialogue agents.

Table 3: Dialogue Effectiveness on Multiwoz. X+Dvs shows the improvement w.r.t. a baseline X. X+I-SEE reports the improvements w.r.t. X and X+Dvs, respectively.
Algorithm 2: I-SEE
Output: Dialogue agent's policy $\pi$
1 Initialize an ensemble of $E$ user models;
2 Initialize the dialogue agent policy $\pi$;
3 while the dialogue agent's policy does not converge do
4     $D_{base}, D_{dvs} = \emptyset, \emptyset$;