Retrieve & Memorize: Dialog Policy Learning with Multi-Action Memory

Dialogue policy learning, a subtask that determines the content of the system response and hence the degree of task completion, is essential for task-oriented dialogue systems. However, the unbalanced distribution of system actions in dialogue datasets often makes it difficult to learn to generate the desired actions and responses. In this paper, we propose a retrieve-and-memorize framework to enhance the learning of system actions. Specifically, we first design a neural context-aware retrieval module that retrieves multiple candidate system actions from the training set given a dialogue context. Then, we propose a memory-augmented multi-decoder network that generates the system actions conditioned on the candidate actions, which allows the network to adaptively select key information in the candidate actions and ignore noise. We conduct experiments on the large-scale multi-domain task-oriented dialogue datasets MultiWOZ 2.0 and MultiWOZ 2.1. Experimental results show that our method achieves competitive performance compared with several state-of-the-art models on the context-to-response generation task.


Introduction
Task-oriented dialogue systems communicate with users through natural language conversations to accomplish a wide range of tasks such as restaurant and flight bookings. Recent years have seen a rapid growth of interest in building task-oriented dialogue systems (Budzianowski et al., 2018). Such systems are usually decomposed into several subtasks, including natural language understanding (Gupta et al., 2018), dialogue state tracking (Zhong et al., 2018), system action (dialogue policy) prediction, and response generation (Wen et al., 2015; Chen et al., 2019; Zhao et al., 2019), where system actions can be viewed as a semantic plan of response generation.

[Figure 1: An example of the one-to-many property, where there are multiple appropriate system actions and responses given the same dialogue context. For the user utterance "I need to book a hotel in the east that has 4 stars.", valid system actions include hotel-request-price; hotel-inform-choice, hotel-request-price; and hotel-request-type, with corresponding responses such as "I can help you with that. What is your price range?", "There are 4 that meet your criteria. Is there a price range you are interested in?", and "Can you give me more information about the type of hotel you would like?"]

One of the main challenges for context-to-response generation in task-oriented dialogue systems comes from the intrinsic one-to-many property in conversations. As shown in Figure 1, there can be multiple valid system actions for the same dialogue context, which means that multiple satisfactory system responses can be generated correspondingly. However, in most collected dialogue datasets, each dialogue context has only one reference, which leads to an unbalanced distribution of system actions and responses in multi-domain dialogue datasets (Zhang et al., 2020). Models trained on such unbalanced datasets tend to overfit high-frequency system actions and underfit low-frequency ones.
One line of work focuses on the representation of system actions, which alleviates the imbalance problem to some extent. Chen et al. (2019) reconstruct system actions into a compact graph representation. Zhao et al. (2019) treat system actions as latent variables and use reinforcement learning to optimize them. Wang et al. (2020b) model system action prediction as a sequence generation problem by treating system actions as a sequence of tokens. On the other hand, Zhang et al. (2020) explicitly model the one-to-many property to enrich system action diversity through rule-based multi-action data augmentation. Specifically, they treat system actions that follow the same dialogue state as alternative valid actions and train on them together with the reference system action. However, their data augmentation framework has two shortcomings. First, it enforces a rigid mapping between the dialogue state and system actions. The dialogue state, which consists of information such as the belief state and user actions, is not flexible enough to represent the whole dialogue context and thus limits the diversity of the mapped system actions. Second, they treat the mapped system actions as gold references during training, which may force the model to fit noise in the mapped system actions and ultimately hinder the quality of the generated system actions.
To address the above limitations, we propose to model the one-to-many property more effectively by retrieving multiple candidate system actions and selectively taking the candidates into consideration when generating system actions. We design a retrieve-and-memorize framework that consists of a context-aware neural retrieval module (CARM) and a memory-augmented multi-decoder network (MAMD). Specifically, the context-aware retrieval module uses a pre-trained language model to convert the dialogue history as well as the belief state into a context representation of each sample. Multiple candidate system actions are retrieved based on the distances between the context vector and the representations of other samples in the latent space. These retrieved candidate actions are more diverse and consistent with the dialogue context since they are obtained from a more holistic representation. Instead of treating the candidates on a par with the gold references, we encode them into a memory bank, and the memory-augmented multi-decoder network can dynamically attend to the memory bank during system action generation. Additionally, we employ a random sampling mechanism: during training, the memory bank is filled with randomly sampled system actions with a certain probability, which allows the model to learn to distinguish the quality of the candidates and adaptively adjust its dependence on the candidate actions. We evaluate our model on MultiWOZ (Budzianowski et al., 2018), a large-scale multi-domain dataset for task-oriented dialogue systems. Extensive experiments and analyses demonstrate the effectiveness of our model, and the results show that it significantly outperforms the baseline model. Our main contributions are summarized as follows:

• We propose a context-aware retrieval module that can retrieve multiple appropriate system actions given a dialogue context.
• We propose a memory-augmented multidecoder network that can generate system actions based on multiple candidate actions.
• Our model outperforms several state-of-the-art baselines on a large-scale multi-domain dataset for task-oriented dialogue systems.

Related Work
One line of research focuses on the representation of system actions. A typical approach to encoding system actions is to concatenate the one-hot representations at each level of actions into a flat vector (Wen et al., 2015; Budzianowski et al., 2018). Wang et al. (2020b) propose a co-generation framework to generate system actions and responses sequentially, which achieves a new state of the art in the context-to-response task. Our proposed framework adopts the idea of modeling the belief state and system actions as sequences (Wang et al., 2020b; Liang et al., 2020) and generates the belief state, system actions, and response sequentially to make better use of the intermediate supervision.
Another line of research uses data augmentation to expand the training data. Some works use paraphrasing techniques (Li et al., 2019) to generate user utterances and then expand the training set with the augmented utterances. Zhang et al. (2020) augment system actions with a mapped dialogue state, which consists of the belief state, user action, turn domain, and database search result. Such a mapping is rule-based and requires user actions for the construction of the dialogue state, which takes extra annotation. Both of the above approaches treat the augmented samples as equivalent to the gold ones, which may force the model to fit noise in the augmented data. In this paper, we focus on a better neural retrieval method for alternative system actions, and instead of directly training on the augmented actions, we encode them into a memory bank as auxiliary information.

Methodology
To frame the problem of dialogue policy learning, we use X_t = {U_1, ..., U_{t-1}, R_{t-1}, U_t} to denote the dialogue history at turn t of a multi-turn conversation, where U_i = u_1 u_2 ... u_{m_i} and R_i = r_1 r_2 ... r_{n_i} are the i-th user utterance and system response, respectively. Following previous works (Zhang et al., 2020; Liang et al., 2020), we convert the belief state and system actions from lists of triples to sequences. For example, the belief state "restaurant-food-Chinese, restaurant-price-expensive" is converted to "restaurant [food] Chinese [price] expensive", and the system actions "restaurant-inform-price, restaurant-inform-phone" are converted to "restaurant [inform] price phone". We use B_t = b_1 b_2 ... b_p and A_t = a_1 a_2 ... a_q to represent the current belief state and system actions, respectively. Our goal is to generate the system actions A_t and the system response R_t of turn t based on the dialogue context X_t and belief state B_t.
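The triple-to-sequence conversion described above can be sketched in a few lines; the function name and grouping logic are illustrative, not taken from the released code:

```python
def to_sequence(triples):
    """Flatten (domain, key, value) triples into a token sequence,
    emitting the domain and bracketed key only when they change."""
    seq, prev_domain, prev_key = [], None, None
    for domain, key, value in triples:
        if domain != prev_domain:
            seq.append(domain)
            prev_domain, prev_key = domain, None
        if key != prev_key:
            seq.append(f"[{key}]")
            prev_key = key
        seq.append(value)
    return " ".join(seq)
```

The same routine covers both examples from the text: belief-state triples keep one value per slot, while action triples group several slots under a shared act such as [inform].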
We employ a retrieve-and-memorize framework to generate the system response. First, we use a context-aware retrieve module to retrieve multiple proper candidate system actions from the training set. Then, we encode the candidate actions into a memory bank and propose a memory-augmented module to enhance the action generation.

Context-Aware Retrieval Module
In order to retrieve alternative system actions that are more comprehensive and context-aware, we utilize the powerful pre-trained language model BERT (Devlin et al., 2019) to obtain distributed representations of the dialogue context. We search the training corpus for system actions whose dialogue contexts have similar distributed representations and retrieve them as alternative candidate actions. Concretely, we combine the dialogue history X_t = {U_1, ..., U_{t-1}, R_{t-1}, U_t} and the belief state B_t as the dialogue context and feed the concatenation into a pre-trained BERT encoder:

H = BERT([CLS] ⊕ X_t ⊕ [SEP] ⊕ B_t ⊕ [SEP])

where ⊕ is the concatenation operator, [CLS] is a special token that precedes every input sequence of BERT, and [SEP] is a special token used to separate different parts of the input sequence. The BERT model encodes the input dialogue context into a sequence of hidden states H = {h_CLS, h_1, ..., h_L}. We use h_CLS as the distributed representation of the dialogue context, since h_CLS is expected to capture the information of the whole sequence. We then use the L2 distance to measure the similarity between the distributed representations of different dialogue contexts:

d(i, j) = ||h_CLS^(i) - h_CLS^(j)||_2

Based on the L2 distance, the k most similar dialogue contexts are selected from the training set, and the corresponding system actions constitute a candidate action set {Ā_1, Ā_2, ..., Ā_k}.
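At its core, this retrieval step is a nearest-neighbor search over the [CLS] vectors. A minimal sketch, assuming the vectors have already been produced by the encoder (function names are illustrative):

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two context vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve_actions(query_vec, corpus_vecs, corpus_actions, k):
    """Rank training samples by L2 distance to the query context
    and return the system actions of the k nearest contexts."""
    order = sorted(range(len(corpus_vecs)),
                   key=lambda i: l2_distance(query_vec, corpus_vecs[i]))
    return [corpus_actions[i] for i in order[:k]]
```

In practice, a library such as FAISS or a batched matrix computation would replace the linear scan, but the ranking criterion is the same.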
Pre-training Task Directly applying h_CLS from BERT without fine-tuning or further pre-training may not yield dialogue context representations that correlate well with system actions. A good dialogue context representation should satisfy the property that dialogue contexts with similar semantics are close to each other in the representation space. Therefore, we further pre-train the BERT model with an action prediction task, training a classifier on h_CLS against the action label y, where y ∈ R^D is a one-hot label of system actions (Chen et al., 2019), D is the dimension of the label space, and classifier is a simple linear classifier.
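The action prediction objective can be sketched as follows. This assumes a multi-label binary cross-entropy formulation over the D-dimensional action label space, which is one common choice for action prediction (the exact loss is not spelled out here); all names are illustrative:

```python
import math

def multi_hot(actions, action_vocab):
    """Build the D-dimensional label y: 1.0 for each action present."""
    y = [0.0] * len(action_vocab)
    for act in actions:
        y[action_vocab[act]] = 1.0
    return y

def bce_loss(logits, y):
    """Mean binary cross entropy between classifier logits and y."""
    total = 0.0
    for z, t in zip(logits, y):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(y)
```

During pre-training, the logits would come from the linear classifier applied to h_CLS; afterwards the classifier is discarded and only the fine-tuned encoder is used for retrieval.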

Memory-Augmented Multi-Decoder Network
We propose a memory-augmented multi-decoder network that jointly generates the belief state, system actions, and system response while having access to a memory bank when generating the system actions. Given the retrieved candidate system actions, we encode these candidates into the memory bank and enhance the generation of system actions by querying the memory bank during decoding.

Encoding Module We use bidirectional GRUs (Chung et al., 2014) as our encoders. First, we encode the current user utterance, the previous system response, and the previous belief state separately into hidden states. Then, another encoder is used to encode the candidate system actions into the memory bank M_t = {m_1, ..., m_k}.

Belief State Generation The belief state B_t of turn t is generated based on the current user utterance U_t, the previous system response R_{t-1}, and the previous belief state B_{t-1}. At each time step τ, the belief state decoder Dec_b produces p(b_τ | b_{1:τ-1}), a distribution over the vocabulary, where Attn is an attention function (please refer to the appendix for more details), e(b_{τ-1}) is the embedding of the previous token, h_{τ-1} is the hidden state from the last decoding step, and h_0 = 0. Dec_b is augmented with the copy mechanism (Gu et al., 2016), which can copy tokens from the previous belief state. We use the cross entropy between the ground truth and the output distribution, L_b(θ), as the loss of belief state generation. We collect the hidden states H_b = {h_0, h_1, ..., h_p} of each step and feed them into the action decoder.

Memory-Augmented Action Generation As shown in Figure 3, the system action A_t of turn t is generated based on not only the dialogue history and the current belief state, but also the memory bank which encodes the retrieved candidate system actions.
For the generation of A_t, at each time step we first compute the state s_τ. Then, we use the hidden state h_{τ-1} to query the encoded candidate system actions memory M_t, where W are learnable parameters and the resulting vector v_τ contains information from the memory. We then incorporate v_τ into the generation process, where e(a_{τ-1}) is the embedding of the previous token and e(DB_t) is the embedding of the database search result, which indicates the number of matched entities. Dec_a is the action decoder augmented with the copy mechanism. The cross entropy L_a(θ) between the output distribution and the ground truth is the loss of action generation. We collect the hidden states H_a = {h_0, h_1, ..., h_q} as well and feed them into the system response decoder.

Random Sampling Though the retrieved candidate system actions are considered to be of high quality and suitable given the dialogue context, we would still like our model to avoid taking those candidates for granted and developing an excessive dependence on them. To this end, during training, the memory bank is filled with randomly sampled system actions with probability p, and with the retrieved candidates with probability (1 - p). This allows the model to learn to distinguish good candidates from bad ones.
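The random sampling step itself is simple; a minimal sketch (names are illustrative, not from the released code):

```python
import random

def fill_memory_bank(retrieved, training_actions, p, rng=random):
    """During training, replace the retrieved candidates with actions
    sampled uniformly from the training set with probability p;
    otherwise keep the retrieved candidates."""
    if rng.random() < p:
        return rng.sample(training_actions, k=len(retrieved))
    return retrieved
```

At evaluation time no sampling is performed: the memory bank always holds the retrieved candidates, and the model's learned attention decides how much to trust them.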
Response Generation Lastly, we generate the system response conditioned on the hidden states of the user utterance H_u, the belief state H_b, and the system actions H_a with the response decoder Dec_r. The response generation loss L_r(θ) is the cross entropy between the output and the ground truth.
Objective Function The final objective function is the sum of the belief state loss, the action generation loss, and the response generation loss:

L(θ) = L_b(θ) + L_a(θ) + L_r(θ)

Experiments

Dataset and Metrics
We conduct our experiments primarily on MultiWOZ 2.0 (Budzianowski et al., 2018). It consists of 8438 dialogues spanning several domains and topics. The test and validation sets contain 1000 dialogues each. For automatic evaluation, we use Inform Rate and Success Rate to evaluate dialogue task completion: the former measures whether the system has provided a proper entity, and the latter measures whether it has answered all the requested attributes (Budzianowski et al., 2018). Besides, BLEU (Papineni et al., 2002) is used to measure the fluency of the generated responses. To measure the overall quality, we compute a combined score as (Inform + Success) × 0.5 + BLEU (Mehri et al., 2019).
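The combined score is a direct arithmetic combination of the three metrics:

```python
def combined_score(inform, success, bleu):
    """Combined score of Mehri et al. (2019):
    (Inform + Success) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu
```

For instance, a model with 90.0 Inform, 80.0 Success, and 18.0 BLEU scores (90.0 + 80.0) × 0.5 + 18.0 = 103.0 (illustrative numbers, not results from the paper).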

Implementation Details
Our model is trained on a 12 GB Nvidia GeForce RTX 2080 Ti with a batch size of 80. Our implementation is based on PyTorch (Paszke et al., 2019). We pre-train the BERT model with the open-source library Transformers (Wolf et al., 2020). The dimension of word embeddings is 50 and the hidden size is 100. We use one-layer bidirectional GRUs (Chung et al., 2014) as context encoders and three GRUs augmented with the copy mechanism as decoders. The candidate actions are encoded by another bidirectional GRU. We use the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 0.007. We use greedy search to decode system actions and beam search with a beam size of 5 to decode system responses. We use the ground truth belief states for a fair comparison with other baselines. We train our model for 60 epochs, select the best model on the validation set, and then evaluate it on the test set to get the final results.

Baselines
We compare our model with several strong baselines, including SC-LSTM, HDSA, LaRL, HDNO, LAVA, DAMD, PARG, SimpleTOD, MarCo, and UBAR. Specifically, SC-LSTM and HDSA treat system actions as one-hot vectors, while LaRL, HDNO, and LAVA treat them as latent variables. Besides, HDSA uses BERT to predict system actions. DAMD, PARG, SimpleTOD, MarCo, and UBAR treat the belief state and system actions as sequences and generate them along with the system response. DAMD (aug) denotes DAMD with rule-based multi-action data augmentation for the system actions. Similar to HDSA, MarCo also uses BERT to predict system actions.

Overall Results
As shown in Table 1, our model significantly outperforms the baseline model DAMD in Inform Rate, Success Rate, and especially Combined Score. Moreover, our model achieves the best Combined Score among all the baseline models. We also observe that models that generate system actions as sequences generally perform better, implying that sequences model the inter-relationships among dialogue actions better than one-hot vectors. Finally, our model outperforms all the methods with data augmentation, which shows the effectiveness of our proposed retrieve-and-memorize framework.
We also evaluate our model on MultiWOZ 2.1 (Eric et al., 2020), an updated version of MultiWOZ 2.0. As shown in Table 11, the results are consistent with those on MultiWOZ 2.0 in Table 1.

Performance Across Different Domains
We report the performance of our model on different domains of MultiWOZ 2.0 and compare it with DAMD and DAMD (aug). The results are shown in Figure 4. Our model achieves the best performance across all domains. Moreover, it achieves significant performance improvements in the taxi and attraction domains, which appear less frequently in the training data than the other domains. Our MAMD thus narrows the performance gaps among different domains.

Ablation Study
In this section, we conduct experiments to study the contributions of the proposed context-aware retrieval module and memory-augmented module.

As shown in Table 3, the first group is the baseline directly trained on four types of augmented data, where the augmented actions are treated as equivalent to the gold ones. We observe that the performance drops significantly if the augmented actions are randomly selected, suggesting that the benefit of such data augmentation is strongly subject to the quality of the augmented data. Additionally, the model trained with CARM outperforms the one trained with the rule-based augmentation (Rule), which indicates the higher quality of our context-aware retrieved candidates and the effectiveness of the proposed CARM. Moreover, removing the system action prediction pre-training task in CARM causes a performance drop, which demonstrates the necessity of adapting the pre-trained model to obtain more task-related representations.
The second group in Table 3 shows the results of the model with the memory-augmented (MA) module trained as well as evaluated with various augmented data. First, with MA, our MAMD is much more robust to random noise, only slightly underperforming the baseline. This is because, during training, a model with MA can learn to ignore the noise in the memory and pay less attention to the memory during evaluation. Second, we see more performance gains with MA from both rule-based and context-aware retrieved candidates, which suggests a model with MA can utilize the candidate system actions more effectively. Last but not least, with the random sampling mechanism, the performance of our full model improves further.

[Table 3: Results of the ablation study. Baseline is MAMD without the memory-augmentation component. Random means randomly selected actions, Rule is the rule-based augmentation proposed by DAMD, CARM is the proposed context-aware retrieval module, and w/o Pt means without pre-training before retrieval. MA is the proposed memory-augmentation module, and RS is the proposed random sampling technique.]

Effect of Random Sampling
To further analyze the effect of random sampling, we adjust the random sampling probability during training from 0 (no random sampling; all candidates are from CARM) to 1 (all candidates are randomly sampled), and evaluate MAMD with retrieved candidates and with randomly sampled candidates in the memory bank. As shown in Figure 5, the first thing to notice is that without random sampling, i.e., with the random sampling probability p set to 0, the performance of MAMD with random candidate system actions drops drastically to 66.40. This indicates that MAMD trained with only decent-quality candidates develops an excessive dependence on the candidates and in effect treats them as ground truth actions, which is exactly what we try to avoid by introducing random sampling. Once random sampling is introduced, the performance gap between MAMD evaluated with retrieved actions and with random actions narrows significantly, which suggests MAMD is capable of judging the quality of the candidates in the memory bank.

Effect of the Number of Candidate Actions
To analyze the effect of the number of candidate actions on our proposed modules, we train three model variations with different numbers of candidate actions retrieved by CARM. As shown in Figure 6, both MAMD and MAMD (w/o RS) achieve their best performance with 9 candidate actions. Additionally, both of our models consistently outperform DAMD, which suggests the effectiveness of the memory-augmented module. Moreover, the performance of our full model increases more steadily as the number of candidate actions goes up, while without random sampling the performance is much more unstable across different numbers of candidate actions, which indicates that random sampling brings in some desirable regularization.

[Figure 7: Visualization of the attention from generated system actions to candidate actions. The y-axis is the generated system actions and the x-axis is the candidate system actions. At each decoding step, the generated system actions selectively attend to the candidate actions. (Dialogue ID: MUL0473)]

Visualization and Case Study
An illustrative example is shown in Figure 7, where the current user utterance is "I need a train departing cambridge arriving by 20:30". The action decoder successfully attends to appropriate actions and ignores noisy ones, such as a candidate that requests the leaving time, which has not been provided by the user.

[Table 6: Results of human evaluation on response quality. Reference means the ground truth response. Win, Tie and Lose respectively indicate the proportions in which our model wins over, ties with or loses to its counterpart.]

Table 4 shows an example of candidate system actions that CARM appropriately retrieved but DAMD failed to produce. The user asks the system to provide the postcode and phone number of the attraction, while DAMD returns "[attraction][nooffer][type]".
We also present an example of response generation in Table 5, where the user asks for the price and reference number. DAMD manages to provide the postcode but fails to provide the reference number, while our MAMD model successfully provides both the postcode and the reference number.

Human Evaluation
Finally, we conduct a human study to evaluate our model from the human perspective. We randomly select 30 dialogue sessions (211 dialog turns in total) from the test dataset and have 5 postgraduates as judges to compare two groups of systems: MAMD vs. DAMD and MAMD vs. Reference, in terms of Readability and Completion (Wang et al., 2020b). Completion measures whether a response has correctly answered a user query, including relevance and informativeness. Readability measures the fluency and consistency of the response.
We report the human evaluation results in Table 6, from which we observe that our model outperforms DAMD and beats or ties with Reference nearly 70% of the time in terms of Completion. In Readability, our model ties with both DAMD and Reference more than 92% of the time. This may suggest that the language of the responses lacks diversity and is easy to learn. Overall, our model is superior to DAMD in human evaluation, which demonstrates its competence under a more holistic evaluation than automatic metrics.

Conclusion
In this paper, we proposed a retrieve-and-memorize framework to deal with the unbalanced distribution of system actions in task-oriented dialogue systems. Our framework includes a neural retrieval module that retrieves multiple candidate system actions given a dialogue context, and a memory-augmented multi-decoder network that generates system actions conditioned on the candidate actions. Extensive experiments were conducted on a large-scale multi-domain task-oriented dialogue dataset, and the results demonstrate the effectiveness of our framework. In essence, the whole framework, including its random sampling strategy, can be viewed as an attempt to prevent systems from overfitting skewed dialogue datasets with an unbalanced distribution of system actions.
A.1 Attention Function

The attention used in our decoders is a simple concat attention:

CatAttn(q, H) = Σ_{i=1}^{n} α_i h_i,  α_i = softmax_i(W(q ⊕ h_i))

where ⊕ is the concatenation operator, W represents learnable parameters, H is the sequence of encoded hidden states, and n is the number of hidden states in H.

A.2 Decoder with Copy Mechanism
The decoder used to generate the belief state, system actions, and response is a one-layer GRU augmented with the copy mechanism. Each step of the generation, Dec(c_t, h_{t-1}, H), combines a generation distribution over the vocabulary with a copy distribution over the source, where W_v and W_c are learnable weights and X is the corresponding context of H.
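A minimal sketch of how a copy-augmented output distribution can be formed, in the spirit of Gu et al. (2016). This jointly normalizes generation and copy scores and then adds each source position's copy probability to the vocabulary entry of its token, which is one common variant; all names are illustrative rather than taken from the released code:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def copy_distribution(gen_scores, copy_scores, context_tokens, vocab):
    """Jointly normalize generation scores (one per vocabulary entry)
    and copy scores (one per source position), then route copy mass
    to the vocabulary entry of each source token."""
    joint = softmax(list(gen_scores) + list(copy_scores))
    dist = joint[:len(gen_scores)]
    for pos, tok in enumerate(context_tokens):
        dist[vocab[tok]] += joint[len(gen_scores) + pos]
    return dist
```

Tokens that appear in the source context thus receive probability mass from both the generation path and the copy path, which is what lets the decoder reuse tokens from, e.g., the previous belief state.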

B.1 Hyperparameters
In this section, we report the hyperparameter settings of our model. For MAMD, we adopt the default hyperparameters of DAMD, as shown in Table 7. As for the learning rate, the number of candidate actions, and the random sampling probability, we apply grid search to find the best combination on the development set. It takes about 10 hours to train our model on a single 12 GB Nvidia GeForce RTX 2080 Ti. As for CARM's pre-training task, the hyperparameter setting is shown in Table 8. The candidate system actions are filtered with the following rules:
• Null system actions are removed.
• System actions with different database query results are filtered out.
• System actions that conflict with current belief are filtered, e.g., requesting a slot that is already included in belief states.

C Dataset Details
We provide more information about the MultiWOZ 2.0 dataset. The training set contains 8438 dialogs, 115,424 turns, and 1,520,970 tokens. The average number of turns per dialog is 13.68, and the average number of tokens per turn is 13.18. The numbers of slots and values are 25 and 4510, respectively. The ontology is shown in Table 9. We also count the numbers of system actions across different domains. As shown in Figure 8, the numbers of system actions in the attraction and taxi domains are smaller than in the other domains, showing the unbalanced distribution of system actions at the domain level.

[Table 9: Ontology of MultiWOZ 2.0, listing act types (inform, request, nooffer, recommend, select, offerbook, offerbooked, nobook, bye, greet, reqmore, welcome) and slots (car, address, postcode, phone, internet, parking, type, pricerange, food, stars, area, reference, time, leave, price, arrive, id, stay, day, people, name, destination, departure, department) annotated with the domains in which each appears.]
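The reported per-dialog and per-turn averages follow directly from the raw corpus counts; a quick arithmetic check:

```python
# Raw counts of the MultiWOZ 2.0 training set, as reported above.
dialogs, turns, tokens = 8438, 115424, 1520970

# Averages rounded to two decimal places.
turns_per_dialog = round(turns / dialogs, 2)   # 13.68
tokens_per_turn = round(tokens / turns, 2)     # 13.18
```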

D.1 Results on Development and Test Sets
We report the results of MAMD on the development and test sets of MultiWOZ 2.0 and MultiWOZ 2.1. As shown in Table 10 and Table 11, the results on the development set are generally consistent with those on the test set on both benchmarks.

D.2 Distribution of Generated System Actions
To further analyze the influence of our model on the generation of system actions, we count the occurrences of the generated actions. Recall that each dimension of an action stands for either a domain, a function, or a slot, where the domain defines the domain involved in the conversation, and the function defines the behavior of the system, such as informing the user or requesting certain information. Here we only count the first two dimensions of the actions because the third dimension appears to be less important. As shown in Figure 9, the distribution of system actions generated by DAMD is proportional to the original distribution in the dataset, and DAMD tends to generate fewer actions than the original distribution. After applying the rule-based multi-action data augmentation, DAMD (aug) can generate more diverse system actions than DAMD. Compared with DAMD (aug), MAMD generates even more actions. More importantly, MAMD generates more important actions such as "attraction-inform" and "taxi-inform", which are more relevant to task completion, while DAMD (aug) tends to generate less useful actions such as "general-reqmore" and "general-greet". This phenomenon indicates that the memory-augmented mechanism provides guidance to our model during system action learning. To sum up, our proposed model generates more diverse and valuable actions, which demonstrates the effectiveness of the proposed memory-augmented mechanism.