Phrase-Level Action Reinforcement Learning for Neural Dialog Response Generation

Defining a sophisticated action space for a dialog agent is essential for efficient training with reinforcement learning (RL). Recent work introduces discrete latent variables to use as an action space; however, a limitation is that a global vector can contain entangled information such as dialog act, sentence structure, and content, which sacrifices the flexibility of the response generation. In this paper, we propose phrase-level action reinforcement learning (PHRASERL), which allows the model to flexibly alter the sentence structure and content through sequential action selection. Our model first learns to generate useful phrases during supervised pre-training, and is then further trained to form a response by rearranging the phrases with reinforcement learning. Experiments on the MultiWOZ dataset show that our model achieves competitive results with state-of-the-art models on automatic evaluation metrics, indicating that our phrase-level action space improves flexibility and is effective for solving task-oriented dialogs.


Introduction
Dialog policy optimization is a key research topic for efficiently solving real-world tasks (Rastogi et al., 2020; Lewis et al., 2017). In neural response generation, which has made remarkable progress in recent years (Vinyals and Le, 2015; Li et al., 2016a; Serban et al., 2017; Bao et al., 2019), many methods that apply reinforcement learning (RL) have been proposed (Li et al., 2016b; Peng et al., 2018; Saleh et al., 2019; Zhao et al., 2019). In those studies, one major issue was how to define an action space. Early research proposed a method in which each word of the response is an action (Li et al., 2016b). However, this approach has the shortcoming that the generated responses deviate from natural human language (Zhao et al., 2019). A possible reason is that the action space is huge, making it difficult to optimize with RL. Moreover, rewarding only task accomplishment can cause biased improvement, which leads the model to ignore the comprehensibility of the generated response (Wang et al., 2020a).
To overcome such issues, LaRL (Zhao et al., 2019) was proposed, which uses a discrete global vector to represent dialog acts. In this method, reinforcement learning is performed only on those discrete latent variables, so the policy optimization is achieved without affecting the language generation. However, LaRL depends on a single vector from the beginning to the end of the response generation, even though a response often contains more than one dialog act and content (Wang et al., 2020b). As a result, a static, global vector tends to be an entangled representation of multiple dialog acts, sentence structure, and contents. Therefore, using a global vector as the action space sacrifices the flexibility of the response generation.
To improve the flexibility of the surface realization, we propose phrase-level action reinforcement learning (PHRASERL), in which the model performs action selection in fine-grained semantic units. PHRASERL is based on the neural hidden semi-Markov model (HSMM) decoder (Wiseman et al., 2018), which generates typed text segments from hidden states, and we use those hidden states as an action space. This disentangles the generation process: the policy learns to structure a response as a sequence of hidden states, while each hidden state is trained to represent content or a type of phrase. Intuitively, as described in figure 1, our model learns to generate useful phrases during supervised pre-training, and it is further trained with reinforcement learning to reorder the phrases and form a response. Experimental results on the task-oriented MultiWOZ dataset show that our best performing model outperforms LaRL by a large margin and achieves competitive results with state-of-the-art models in automatic evaluation. Furthermore, PHRASERL maintains a high BLEU score, suggesting that the model is flexible in its output response depending on the context. Finally, we study the phrase generation from hidden states in a case study, and show that the hidden-state action space is capable of generating (1) informative responses, (2) grammatical sentences, and (3) diverse intentions, which can be considered requirements for an effective action space. Our code is available at https://github.com/Alab-NII/PhraseRL.

Related Work
A classical approach to realizing task-oriented dialog systems is the frame-based dialog system (Chen et al., 2017). This model generates a response in a pipeline fashion, splitting the generation process into three modules: natural language understanding, dialog management, and natural language generation. Natural language understanding converts user utterances into a semantic frame, which is considered a dialog state; a popular method is slot filling (Mrkšić et al., 2017). The estimated dialog state is then passed on to dialog management to determine the next action, which is formulated as a partially observable Markov decision process (POMDP) (Young, 2006). The action space is represented with hand-crafted dialog acts (Stolcke et al., 2000) or meaning representations (Balakrishnan et al., 2019). Finally, a natural language generator produces a response, which is often realized with recurrent neural networks (Zhou et al., 2016; Tran and Nguyen, 2017). Our proposed model spans dialog management and natural language generation; however, it does not require any hand-crafted representation.
Past works that applied reinforcement learning to dialog models have shown large performance improvements in task success (Lewis et al., 2017; He et al., 2018). Li et al. (2016b) proposed a dialog generation method using deep reinforcement learning with words as the action space. Although the rewards were carefully designed, it is reported that such models tend to generate incomprehensible responses. Zhao et al. (2019) addressed the problem by using discrete latent variables as the action space. Wang et al. (2020a) extended LaRL and applied a hierarchical reinforcement learning technique to decouple the dialog policy and natural language generation. The model is composed of two policy networks: a high-level policy that acts on latent dialog acts and a low-level policy that acts on words. Because the low-level policy is prone to degeneration, the paper proposes to use a language model discriminator as a reward provider. These models use either words or a global latent variable as the action space; our work stands between the two by using phrases as the action space.

Preliminaries
In this section, we first explain the characteristics and formulation of the HSMM. We then describe the neural HSMM decoder, which will be the backbone of our proposed method.

Hidden Semi-Markov Model (HSMM)
In our work, we consider sentences as sequences of phrases, and a probabilistic model that can represent this is the hidden semi-Markov model (HSMM). The difference between a standard hidden Markov model (HMM) and an HSMM is shown in figure 2. While an HMM emits one observation per hidden state, an HSMM emits a sequence of observations per hidden state. Therefore, if we consider words as observations, hidden states will correspond to phrases.
To represent a variety of sentences with a limited number of hidden states, the HSMM is expected to assign the same sequence of hidden states to similar sentences. Figure 3 shows an example. As can be seen, each hidden state has a type; for instance, hidden states #8 and #11 output the same phrase, while #23 outputs noun phrases and #34 outputs verb phrases for the end of questions. In this way, text segments assigned to a certain hidden state will have similar properties.
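To make the segment-level view concrete, here is a minimal sketch (not the paper's code; the state indices and phrases are invented for illustration) of how an HSMM segmentation of a response looks as data:

```python
# Toy illustration: under an HSMM, a response is a sequence of
# (hidden state, phrase) segments rather than one state per word.
# The state indices and phrases below are invented for illustration.
response_segments = [
    (8,  ["the", "phone", "number", "is"]),  # a state emitting a fixed phrase
    (23, ["01223", "356", "354"]),           # a state emitting an entity value
    (34, ["."]),                             # a state closing the sentence
]

# Flattening the segments recovers the surface response; the per-word
# state sequence repeats each state for the length of its segment.
words  = [w for _, seg in response_segments for w in seg]
states = [z for z, seg in response_segments for _ in seg]
```

Similar sentences would reuse the same state sequence (8, 23, 34) while only the entity values in the middle segment change.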
For our model, we specifically use a conditional HSMM that takes a source input x. For each timestep t ∈ {1, · · · , T}, we denote the observations as y_1, · · · , y_T and the discrete hidden states as z_t ∈ {1, · · · , K}. We additionally introduce two latent variables: the length of the current observation sequence, denoted as l_t ∈ {1, 2, · · · , L}, and a binary variable f_t which represents whether the segment is finished at timestep t. The maximum number of hidden states K and the maximum observation length L are tunable parameters. An HSMM is represented with a joint distribution over the observations and the described latent variables:

p(y, z, l, f | x) = ∏_{t: f_t = 1} p(z_{t+1} | z_t, x) p(l_{t+1} | z_{t+1}) p(y_{t+1:t+l_{t+1}} | z_{t+1}, l_{t+1}, x)   (1)

In other words, an HSMM is the product of three probabilities: the state transition distribution, the length distribution, and the emission distribution.
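The factorization in equation (1) can be sketched with tabular toy distributions; the actual model parameterizes all three distributions with neural networks, and all numbers below are arbitrary:

```python
import math

# Toy sketch of the HSMM factorization in Eq. (1), assuming tabular
# distributions (the paper parameterizes these with neural networks).
K, L = 3, 2  # number of hidden states, maximum segment length

p_init  = [0.5, 0.3, 0.2]            # p(z_1 = k)
p_trans = [[0.1, 0.6, 0.3],          # p(z' | z): rows sum to 1
           [0.4, 0.2, 0.4],
           [0.3, 0.3, 0.4]]
p_len   = [[0.5, 0.5]] * K           # uniform p(l | z), as in the paper

def p_emit(segment, z):
    # Stand-in emission model: a fixed per-token probability
    # (the real model uses a GRU).
    return 0.1 ** len(segment)

def joint_log_prob(segments):
    """log p(y, z, l | x) for a segmentation given as [(z, [tokens]), ...]."""
    logp = 0.0
    prev = None
    for z, seg in segments:
        p_z = p_init[z] if prev is None else p_trans[prev][z]
        logp += math.log(p_z) + math.log(p_len[z][len(seg) - 1])
        logp += math.log(p_emit(seg, z))
        prev = z
    return logp

lp = joint_log_prob([(0, ["hello"]), (1, ["how", "are"])])
```

Each segment contributes one transition term, one length term, and one emission term, mirroring the three factors in equation (1).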

Neural HSMM Decoder
We now introduce the neural HSMM decoder (Wiseman et al., 2018). Figure 4 shows an overview of the decoder model. The aforementioned three distributions can be obtained using trainable parameters. We define the embedding of the input x as x ∈ R^d and that of the hidden state z as z ∈ R^d.

State Transition Distribution
For the state transition distribution p(z_{t+1} | z_t, x), we use a K × K matrix in which each row sums to 1. We define the state transition matrix as

p(z_{t+1} | z_t, x) = softmax( A B + C(x) D(x) )   (2)

where the softmax is taken over each row, A ∈ R^{K×m_1} and B ∈ R^{m_1×K} represent state embeddings, and C : R^d → R^{K×m_2} and D : R^d → R^{m_2×K} are non-linear functions parameterized with neural networks. m_1 and m_2 are tunable parameters.
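A minimal NumPy sketch of this low-rank transition parameterization; random matrices stand in for the learned parameters and for the neural maps C and D, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, m1, m2 = 4, 8, 3, 3  # illustrative sizes; m1, m2 are the low ranks

A = rng.normal(size=(K, m1))   # state embeddings (row side)
B = rng.normal(size=(m1, K))   # state embeddings (column side)

def C(x):  # stand-in for the neural map C : R^d -> R^{K x m2}
    return rng.normal(size=(K, m2))

def D(x):  # stand-in for the neural map D : R^d -> R^{m2 x K}
    return rng.normal(size=(m2, K))

def transition_matrix(x):
    """Row-stochastic K x K matrix giving p(z_{t+1} = j | z_t = i, x)."""
    scores = A @ B + C(x) @ D(x)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)  # row-wise softmax

P = transition_matrix(x=None)
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a distribution
```

The low-rank products A B and C(x) D(x) keep the parameter count of the K × K transition manageable even for large K.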
Length Distribution Wiseman et al. (2018) found that parameterizing the length distribution leads to hidden states that specialize in specific output lengths. To avoid this, we simply use a uniform distribution for the length probability p(l_{t+1} | z_{t+1}).
Emission Distribution For the emission distribution p(y_{t−l_t+1:t} | z_t, l_t, x), we use the product of the token probabilities:

p(y_{t−l_t+1:t} | z_t = k, l_t, x) = [ ∏_{i=t−l_t+1}^{t} p(y_i | y_{t−l_t+1:i−1}, z_t = k, x) ] p(eop | y_{t−l_t+1:t}, z_t = k, x)   (3)

where eop stands for the end-of-phrase token, which indicates the end of emission for each hidden state. We use a gated recurrent unit (GRU) to compute the token probabilities:

h_i = GRU(h_{i−1}, [y_{i−1}; z_k])   (4)

where y_{i−1} is the embedding of the previously generated token and z_k is the embedding of the hidden state k. Finally, the probability of a token w is

p(y_i = w | y_{<i}, z_t = k, x) = softmax(W h_i + b)_w   (5)

Training We assume z, l, f of an HSMM are unobservable, so we train by maximizing the marginal likelihood of the emission y given only the input x. The marginal likelihood of y in HSMMs can be efficiently computed using a dynamic programming algorithm such as the backward algorithm (Murphy, 2002). Using variables β, β*, the backward algorithm can be expressed as

β_t(j) = Σ_k β*_t(k) p(z_{t+1} = k | z_t = j, x)
β*_t(k) = Σ_{l=1}^{L} β_{t+l}(k) p(l_{t+1} = l | z_{t+1} = k) p(y_{t+1:t+l} | z_{t+1} = k, l_{t+1} = l, x)   (6)

where β_T(j) = 1. Finally, from the definition f_0 = 1, the log-marginal likelihood of y is:

ln p(y | x; θ) = ln Σ_k p(z_1 = k) β*_0(k)   (7)

Here, we compute p(z_1 = k) with a linear layer. Since equations (6) and (7) are differentiable, we can optimize θ by maximizing the log-marginal likelihood ln p(y|x; θ) with backpropagation.
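The backward recursion in equations (6) and (7) can be sketched with tabular toy distributions; a constant per-token probability stands in for the GRU emission model, and all numbers are arbitrary:

```python
# Minimal backward-algorithm sketch for the HSMM marginal likelihood.
K, L = 2, 2
p_init  = [0.6, 0.4]
p_trans = [[0.7, 0.3], [0.2, 0.8]]
p_len   = [1.0 / L] * L  # uniform length distribution, as in the paper

def p_emit(y_seg, z):
    return 0.25 ** len(y_seg)  # stand-in emission probability

def marginal_likelihood(y):
    """p(y | x) by summing over all segmentations and state sequences."""
    T = len(y)
    beta      = [[0.0] * K for _ in range(T + 1)]  # beta_t(j)
    beta_star = [[0.0] * K for _ in range(T + 1)]  # beta*_t(k)
    beta[T] = [1.0] * K
    for t in range(T - 1, -1, -1):
        for k in range(K):
            # sum over possible lengths l of the segment starting at t+1
            beta_star[t][k] = sum(
                beta[t + l][k] * p_len[l - 1] * p_emit(y[t:t + l], k)
                for l in range(1, min(L, T - t) + 1))
        for j in range(K):
            # sum over the next hidden state k
            beta[t][j] = sum(beta_star[t][k] * p_trans[j][k] for k in range(K))
    # f_0 = 1: the first segment starts at t = 0, weighted by p(z_1 = k)
    return sum(p_init[k] * beta_star[0][k] for k in range(K))

p = marginal_likelihood(["hi", "there", "!"])
```

In the neural model the same recursion is run in log space on tensors, and gradients flow through it by backpropagation.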

Proposed Method
The original neural HSMM decoder (Wiseman et al., 2018) was proposed as a data-to-text generation method, so the model needs modification to be applied to dialog response generation. In particular, we investigate what the HSMM should condition on; in other words, we determine the input x. However, x must be designed carefully because of a known problem of the neural HSMM decoder, which we explain first. Afterward, we discuss how to improve the response quality by applying reinforcement learning.

Conditional Source for Neural HSMM Decoder
The neural HSMM decoder is based on the assumption that the output phrases from each hidden state are independent of each other. However, if the source x is informative enough to capture the interdependence between the phrases, the RNN decoder may fully depend on the source x and ignore the hidden state z during generation. We call this the interdependence problem. To avoid it, we must use a weak source input that does not contain enough information to precisely predict the target response. For our work, we use contextual information (e.g. dialog history, belief state, database results) as the conditional source x. A common practice is to encode such contextual information with a GRU and a linear layer, which yields continuous embeddings x. However, continuous embeddings can cause the interdependence problem, since they can theoretically carry unlimited information.
To weaken the encoder, we reduce the resolution of the input embeddings x by using discrete embeddings. We define x as an array of M N-way categorical variables:

x = [x_1; x_2; · · · ; x_M]

where each x_n is an N-sized one-hot binary vector and M is the number of variables. To obtain this, the straight-through Gumbel-softmax (Jang et al., 2017) is applied to the conditional source encoded by a GRU and a linear layer. In our experiments in section 6, we compare the results of these discrete embeddings with continuous embeddings.
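A NumPy sketch of the straight-through Gumbel-softmax sampling used to obtain the discrete embeddings; only the forward pass is shown (in a real framework, the backward pass would flow through the relaxed sample), and the sizes M and N are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def st_gumbel_softmax(logits, tau=1.0):
    """Straight-through Gumbel-softmax over one N-way categorical variable.

    The forward pass returns a one-hot vector; in the straight-through
    trick the backward pass would use the soft sample's gradient instead.
    """
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    soft = np.exp(y - y.max())
    soft /= soft.sum()                  # relaxed (differentiable) sample
    hard = np.zeros_like(soft)
    hard[soft.argmax()] = 1.0           # discrete one-hot for the forward pass
    return hard

# M categorical variables of N ways each form the discrete source embedding x.
M, N = 4, 8
x = np.concatenate([st_gumbel_softmax(rng.normal(size=N)) for _ in range(M)])
```

Because x can take only N^M distinct values, the encoder cannot carry enough information to fully predict the target response, which mitigates the interdependence problem.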

Response Generation with Reinforcement Learning
To rearrange the learned hidden states of a neural HSMM decoder, we apply reinforcement learning, and we name this method phrase-level action reinforcement learning (PHRASERL). Here, we consider a Markov decision process with the input context as the state x ∈ S, the hidden states (which represent phrases) as the action space z ∈ A, and the task-success rate as the reward r ∈ R. We define the timesteps of hidden state selection as t = {1, 2, · · · , T}. We consider the combined initial state selection and state transition as the policy π : S → A and apply the REINFORCE algorithm (Williams, 1992). For each reward r_t at timestep t, we use the discounted return G_t = Σ_{k=0}^{T} γ^k r_{t+k} during training. The policy gradient is:

∇_θ J(θ) = E_π [ Σ_t G_t ∇_θ ln π(z_t | x; θ) ]

Note that we do not train the GRU for the emission distribution; we only further train the hidden state transition. For embedding the contexts, we use the pre-trained encoder and do not update it further during this RL step.
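The discounted return used in the REINFORCE update can be computed with a simple backward scan; the reward sequence below is a toy stand-in for the task-success rewards:

```python
# Discounted returns G_t = sum_k gamma^k * r_{t+k}, as used by REINFORCE.
def discounted_returns(rewards, gamma=0.99):
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # accumulate from the end of the episode
        returns.append(g)
    return returns[::-1]

# The policy gradient then weights each log-probability by G_t:
#   grad J(theta) = E[ sum_t G_t * grad log pi(z_t | x; theta) ]
G = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
```

A terminal reward of 1.0 is propagated backward, so earlier action selections receive exponentially discounted credit.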

Task Description
For the experiments, we use the MultiWOZ dataset. MultiWOZ is a large-scale task-oriented dialog dataset, which contains seven domains such as booking restaurants, hotels, and train seats. We specifically use the Dialogue-Context-to-Text Generation task proposed in the original paper. In this task, a model is given an oracle belief state, and its goal is to generate an appropriate and informative response. For evaluation, we use BLEU, Inform Rate, and Success Rate. We also compute the total score, which is used in previous works to compare models on the MultiWOZ dataset. The total score is calculated as BLEU + (Inform + Success)/2.
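The combined metric is a one-liner; the scores passed in below are arbitrary illustrative values, with BLEU on a 0-100 scale as the combined score is usually reported for this dataset:

```python
def total_score(bleu, inform, success):
    """MultiWOZ combined metric: BLEU + (Inform + Success) / 2.

    All three inputs are percentages on a 0-100 scale.
    """
    return bleu + (inform + success) / 2

# Illustrative values only, not results from the paper.
score = total_score(bleu=20.0, inform=90.0, success=80.0)
```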

Model Details
Duplicated Hidden States While increasing the number of hidden states allows for a more expressive latent model, the computational complexity of the neural HSMM decoder increases linearly with the number of hidden states K. To increase the number of hidden states without making the computation heavier, we use the same emission distribution for multiple hidden states, as proposed in Wiseman et al. (2018). For instance, if we set 80 base states duplicated 5 times, K will be 400, and we use z mod 80 as the input to the computation of the emission distribution. This way, the model can utilize a large number of hidden states in the state transition, while it only needs to run the GRU forward pass for a smaller number of states to compute the emission distribution.
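The duplication scheme amounts to a modulo mapping from transition-level states to shared emission states:

```python
# With 80 base states duplicated 5 times, the transition model sees
# K = 400 states, but emission parameters are shared via z mod 80,
# so the GRU only runs for 80 distinct emission states.
BASE_STATES, DUPLICATES = 80, 5
K = BASE_STATES * DUPLICATES

def emission_state(z):
    """Map a transition-level state to its shared emission state."""
    assert 0 <= z < K
    return z % BASE_STATES

# States 5, 85, 165, 245, 325 all reuse the emission distribution of base state 5.
shared = [emission_state(z) for z in (5, 85, 165, 245, 325)]
```

Duplicates are therefore distinguishable only by their transition behavior, which is what the RL policy operates on.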
Training Details We first train the neural HSMM decoder with supervised learning, and then further train it with reinforcement learning as explained in section 4.2. To embed the context information x, we use an MLP layer to encode the oracle belief state and a GRU to encode the dialog history. For comparison, we trained both continuous and discrete embeddings, which we denote as CONT and DISC, respectively. For reinforcement learning, we use the MultiWOZ RL setup proposed in Zhao et al. (2019). For the rewards, we use r_success + r_inform + r_BLEU.
The average loss is computed on the validation dataset after every epoch, and early stopping is performed after 5 consecutive epochs without improvement. To determine the number of hidden states, we tested base state counts from 40 to 120 in steps of 10, and duplication factors of 1, 3, 5, and 7. Consequently, K = 400 (80 base states, duplicated 5 times) produced the best results, and the following evaluations are based on this setting. For the vocabulary set, we substituted words that occurred fewer than 30 times in the dataset with an unknown tag (<unk>). We used beam search with a beam size of 5 for decoding. For more details, refer to Appendix A.
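The vocabulary cutoff can be sketched as follows (the paper uses a threshold of 30; a threshold of 2 and a toy corpus are used here so the effect is visible):

```python
from collections import Counter

# Words occurring fewer than min_count times are replaced with an unknown tag.
def build_vocab(corpus, min_count=2, unk="<unk>"):
    counts = Counter(w for sent in corpus for w in sent)
    return {w for w, c in counts.items() if c >= min_count}, unk

def apply_vocab(sentence, vocab, unk):
    return [w if w in vocab else unk for w in sentence]

corpus = [["i", "want", "a", "hotel"], ["i", "want", "a", "taxi"]]
vocab, unk = build_vocab(corpus, min_count=2)
out = apply_vocab(["i", "want", "a", "restaurant"], vocab, unk)
```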

Baseline and State-of-the-Art Models
Our model is compared with the following models: • Baseline is the model proposed in the original MultiWOZ paper, based on Seq2Seq with attention on the context words.
• Word-Level RL further trains the pre-trained baseline on reinforcement learning with the action space of words. It is known that this model often encounters a degeneration problem, in which the generated sentence will diverge from natural human sentences.
• Latent Action Reinforcement Learning (LaRL) (Zhao et al., 2019) introduces a discrete latent variable between the encoder and decoder to represent a dialog act. Similar to our model, it first trains on supervised pretraining, then it further trains the dialog policy with reinforcement learning.
• Hierarchical Disentangled Self-Attention Network (HDSA) (Chen et al., 2020) introduces hierarchical dialog act. The model is composed of 3 transformer layers which each layer corresponds to each hierarchy of the dialog act. The model switches the self-attention based on the dialog act, which is called the disentangled self-attention.
• SOLOIST (Peng et al., 2020) is a transformerbased auto-regressive language model for taskoriented dialog, pre-trained on large and diverse dialog corpora. The model is fine-tuned on MultiWOZ task.
• MarCo (Wang et al., 2020b) extends the idea of HDSA and considers a hierarchical dialog act. The difference is that it co-generates the dialog act sequence and the response jointly.
• HDNO (Wang et al., 2020a) decouples the dialog policy and natural language generation by applying hierarchical RL. This model uses language model as a reward provider to maintain grammaticality.
Among these models, Word-Level RL, LaRL, and HDNO use reinforcement learning and are therefore considered the main competitors to our PHRASERL. Also, note that HDNO uses rewards from external modules to avoid degeneration, whereas PHRASERL does not use them in this experiment.
Results and Analysis

Automatic Evaluations

Table 1 shows the results of the automatic evaluation of our models. We first see that applying reinforcement learning greatly improves the scores, indicating that the hidden states are an effective action space. We also see that discrete embeddings outperform continuous embeddings on every score. This shows that the model can improve generation performance by alleviating the interdependence problem with a weaker encoder. It can also be observed that the discrete embeddings are significantly effective for the reinforcement learning models. This can also be seen in the reward graph shown in figure 5, where discrete embeddings show a sharper reward increase than continuous embeddings. A possible reason is that discrete embeddings lead the phrase generation to be less diverse and more strongly typed, which makes it easier for the agent to learn the relation between a hidden state and the generated phrases.
We also compare our best model (DISC-RL) with past works; the results are shown in table 2. Our model achieves competitive results with the recently proposed state-of-the-art models, which is also near human performance.
Compared with LaRL, our model achieves a significantly higher BLEU score even after applying RL. A possible reason for LaRL's low BLEU score is that a discrete global vector cannot fully express the diversity of human sentences. On the other hand, PHRASERL can broaden the range of expression by dividing the action space into finer semantic units, which enables it to learn more human-like responses. Additionally, the improvement in Inform Rate and Success Rate can also be attributed to the ability of PHRASERL to flexibly select content.
Nevertheless, compared with state-of-the-art models that do not use RL, our model has a lower BLEU score. This suggests that using fixed phrases and arranging them has a drawback with regard to generating grammatical responses, since it lacks word-level flexibility. However, it is surprising that the model still achieves a competitive score even with such a disadvantage.

Model Analysis
Although our PHRASERL was able to maintain a high BLEU score, this may simply be because the model was rewarded with the BLEU score during training. We therefore also trained the model without using the BLEU score as a reward; the results are shown in table 3. Although there is a slight decrease, the model still largely outperforms Word-Level RL and LaRL. This indicates that our PHRASERL is resistant to degeneration to some extent, even without adding external modules for rewarding grammaticality as in HDNO. Further improvements can be expected by applying external rewards to avoid degeneration, though this remains future work.

Table 4 shows the generated phrases from randomly selected hidden states. Observing the output of DISC, we notice that the common properties of the generated phrases are interpretable: state 303 outputs "verb phrases for the end of a question", state 239 outputs "beginnings of questions", state 103 outputs "features of a facility", state 70 outputs "back-channeling words", and state 325 outputs "noun phrases for domains". This indicates that the hidden states in DISC are strongly typed. Although the phrases of CONT also seem to be typed, some states, such as state 138, have multiple types. This is due to the interdependence problem, because the RNN can recognize which type to use depending on the conditional input x.

Case Study
To verify that the hidden states are a valid and flexible action space, we qualitatively inspect the generated phrases. Figure 6 shows the generated phrases from a user input and possible responses that can be formed by reordering the phrases. The possible responses were generated from hidden states that were reordered by hand. We consider three criteria for a valid and flexible action space on the MultiWOZ dataset: the model must be able to (1) generate informative responses, (2) generate grammatical sentences, and (3) express diverse intentions.

Content We further investigated whether the contents are sufficiently provided by the hidden states. We counted the number of cases in the test set where all the entities in the golden response were contained in the generated phrases. As a result, 83.6% of the cases had sufficient information in the generated phrases, which we consider sufficient because the model may have other valid response options. Therefore, we can conclude that the first criterion has been met.
Grammar Although there remains the possibility of generating ungrammatical sentences, the possible responses in figure 6 show that an appropriate sequence of hidden states will allow generating fluent responses.
Dialog Act As shown in the bottom section of figure 6, the four response samples have different intentions. For instance, the first response asks the user the type of food, while the second response recommends a restaurant and asks for a booking. In particular, the second example contains two different dialog acts (recommend and offer booking), but the model can choose to finish after the first sentence to just recommend. This shows that our PHRASERL has disentangled representations for individual dialog acts. Therefore, we can conclude that the model can flexibly select from several intentions.
Figure 6: A case study of generated phrases of DISC. The second section shows the number of hidden states that output the same phrase. The third section shows possible responses generated from sequences of hidden states ordered by hand. The colors indicate the corresponding phrases.

Finally, we compare generated responses on the MultiWOZ test set. We first see that CONT does not generate a grammatical response. This may be due to the interdependence problem, in which the model tries to output various sentences for the same hidden state sequence. On the other hand, the rest of our models generate grammatical and in-context responses. We also observe that while the models without RL generate plausible responses, the RL models provide more informative responses by including hotel names and phone numbers. The reason is that the model often receives rewards for conveying such information.

Conclusion and Future Work
In conclusion, this paper proposes phrase-level action reinforcement learning for neural response generation. A neural HSMM decoder is introduced to learn hidden states that output typed phrases, and we used them as the action space for reinforcement learning.