WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue

An intelligent dialogue system in a multi-turn setting should not only generate responses of good quality, but should also generate responses that can lead to the long-term success of the dialogue. Although current approaches improve response quality, they overlook the training signals present in the dialogue data. We can leverage these signals to generate weakly supervised training data for learning a dialog policy and a reward estimator, and make the policy take actions (generate responses) that foresee the future direction of a successful (rewarding) conversation. We simulate the dialogue between an agent and a user (modelled similarly to the agent, with a supervised learning objective) and let them interact with each other. The agent uses dynamic blocking to generate ranked diverse responses and exploration-exploitation to select among the Top-K responses. Each simulated state-action pair is evaluated (serving as a weak annotation) with three quality modules: Semantic Relevance, Semantic Coherence and Consistent Flow. Empirical studies with two benchmarks indicate that our model significantly improves response quality and leads to successful conversations under both automatic evaluation and human judgment.


Introduction
Dialog policy for multi-turn dialogue decides the next best action to take on the environment so as to complete the conversation according to various success criteria. Reinforcement learning can help to learn such a policy, where the environment can be users (human or model) and the policy takes actions on the environment from which it gets a reward signal (Fatemi et al., 2016; Peng et al., 2017; Chen et al., 2017; Yarats and Lewis, 2018; Lei et al., 2018; He et al., 2018). Learning a dialogue policy using reinforcement learning can be challenging with human users, since it requires a large set of samples with rewards to train. Since there is a lot of previous work on neural response generation (Gu et al., 2020; Zhang et al., 2019), we can also model the users using any of these encoder-decoder architectures. This makes it possible to simulate conversations in which the simulated user and the agent (policy model) reply to each other (Zhao and Eskenazi, 2016; Dhingra et al., 2016; Shah et al., 2018). The reward signal for policy learning can be as simple as a small constant negative reward at each turn and a large reward at the end (if the goal is completed) to encourage shorter conversations (Takanobu et al., 2019).
However, reward estimation for dialogue is challenging: a small constant negative reward at each turn may lead to ending the conversation prematurely. Instead of handcrafting the reward at the end based on success or failure, it is more useful to evaluate the reward at every turn, guiding the policy to dynamically change actions as per the user's needs and to end the conversation naturally. With the growing complexity of systems across different topics, a more sophisticated reward function is needed to avoid manual intervention in accounting for the different factors behind conversation success.
In this work, we propose a novel model for contextual response generation in multi-turn dialogue. The model includes a turn-level reward estimator, which combines the weak supervision signals obtained from three basic modules: 1) Semantic Coherence, 2) Consistent Flow, and 3) Semantic Relevance. These modules are learned jointly with the response generation model using counterfactual examples obtained from negative sampling. Leveraging the weak supervision signals obtained from these modules, we further update the reward estimator and dialog policy jointly in an alternating fashion, so that each improves the other.
Our proposed approach integrates semantic understanding of utterances using encoder-decoder systems with the power of Reinforcement Learning (RL) to optimize long-term success. We test the proposed approach on two benchmarks: DailyDialog (Li et al., 2017b) and PersonaChat. Experimental results on both datasets indicate that our model significantly outperforms state-of-the-art generation models in terms of both automatic evaluation and human judgment.

Related Work
Open-domain dialogue in a multi-turn setting has been widely explored with different encoder-decoder architectures (Gu et al., 2020; Feng et al., 2021; Kottur et al., 2017; Shah et al., 2018; Shang et al., 2015; Vinyals and Le, 2015; Wu et al., 2019; Zhong et al., 2019). Basic encoder-decoder architectures such as Seq-to-Seq models have been widely extended and modified to address generic responses, context modelling, and grounding by persona/emotion/knowledge (Xing et al., 2017; Zhang et al., 2019).
The dialogue literature widely applies reinforcement learning, including recent work based on deep architectures (Takanobu et al., 2019, 2020; Li et al., 2020; Gordon-Hall et al., 2020a,b). But these task-oriented RL dialogue systems often model the dialogue with limited parameters and dataset-specific assumptions targeted at the task. The datasets include hand-built templates with state, action and reward signals designed by humans for each new domain, making it difficult to extend these systems to open-domain dialogue.
Our goal in this work is to integrate state-of-the-art encoder-decoder architectures like those in Gu et al. (2020) and Csaky and Recski (2020) with reinforcement learning paradigms to efficiently learn a dialogue policy optimized for long-term success in multi-turn dialogue scenarios. We are inspired by the recent works of Takanobu et al. (2019) and Li et al. (2020) to jointly learn the reward function and dialogue policy, reducing the effort and cost of manually labelling conversations to build the reward model. Specifically, we leverage weak supervision, inspired by Chang et al. (2021a,b), to generate the labelled dataset that facilitates this joint learning and the building of the reward estimation model.

Approach
We represent dialog sessions as D = {τ_1, τ_2, τ_3, ..., τ_n}, where each dialog session τ is a trajectory of state-action pairs {s^u_0, a^u_0, s_0, a_0, s^u_1, a^u_1, s_1, a_1, ...}. The user in our case is a simulator which utters a response a^u given the state s^u, denoted as µ(a^u, e^u|s^u), where e^u is a binary signal indicating the end of the dialog session, in which case the response a^u is empty. The dialog policy π_θ(a|s) decides the action a according to the current state s after the agent interacts with the user simulator µ. At each time step, the state given to either dialog party is updated after recording the action uttered by the other party. The reward estimator f evaluates the quality of the response/action uttered by the dialog policy π. The dialog policy π is based on a BERT (Devlin et al., 2019) encoder-decoder model and the reward function f is an MLP model, parameterized by θ and ω respectively. We model the user simulator in exactly the same way as the agent, but train it only with a supervised learning objective.
In the subsequent sections, we introduce the components: action, state, policy, quality modules and reward estimator. Further sections explain the setup we used for weakly supervised learning and, finally, the experimental results.

Action
An action a is the dialogue utterance generated by the encoder-decoder model shown in Figure 1. The model takes the context history (state) as input and outputs a probability distribution over the set of possible actions, denoted as π_θ(a|s) and parameterized by θ. The user simulator generates the action a^u, the policy generates the action a, and the input states for the agent and the user are s and s^u respectively.

State
The state is the past conversation history between an agent and a user, denoted as s_t = {q_1, a_1, q_2, a_2, q_3, a_3, ..., q_t}. The states for an agent and a user are denoted differently as s and s^u respectively. Say the agent utterances are denoted by a's; then with state s = s_t the agent utters a_t. Similarly, the user state is s^u_t = {q_1, a_1, q_2, a_2, q_3, a_3, ..., q_t, a_t} and the user utters q_{t+1}. Each of the utterances is mapped to a fixed-length sentence vector using SBERT (Reimers and Gurevych, 2019).

Figure 1: BERT-based Encoder-Decoder with Semantic Coherence and Relevance. Similarly, the Consistent Flow loss is also calculated using the encoder.

Dialogue Policy
The dialogue policy takes the form of a BERT-based encoder-decoder (i.e. π_θ(a|s)) (Gu et al., 2020), as shown in Figure 1. We use a BERT-based encoder and a transformer decoder, but instead of feeding the utterances at the word level, we feed utterance representations (obtained from SBERT) into the encoder. The encoder takes the previous context history s_t as input and the decoder outputs the response a_t.

User Simulator
We model the user simulator in exactly the same way as the BERT-based encoder-decoder shown in Figure 1. However, the user simulator is trained only with a supervised learning objective on the utterances in the dialog corpus, predicting the user response (Gu et al., 2020).

Conversation Quality Modules
We calculate the reward for each state-action pair (see Section 3.8) and use this signal to train the dialogue policy so that it avoids reaching bad states and reaches a successful end of the conversation between a user and an agent. We leverage the signals from three basic modules, namely Semantic Coherence, Consistent Flow and Semantic Relevance, which are jointly learned with the dialogue policy. For each of the three modules, the data for the positive class is obtained from the source corpus, while the data for the negative class is generated dynamically during training. We describe each of the three modules in the following sections.

Semantic Relevance
We need to filter out utterances that are generated with high confidence by the dialog policy but are semantically irrelevant to the previous context. To quantify this characteristic, we model a general response relevance prediction task which utilizes the sequential relationship of the dialog data fed to the encoder side of the BERT encoder-decoder framework. Since the task of semantic relevance is to match two sequences of conversation, instead of matching the context and response we measure the relevance of two fragments of a dialogue session.
Specifically, given a context c = {q_1, a_1, q_2, a_2, ..., q_m}, we randomly split c into two consecutive pieces c_left and c_right. To obtain negative samples, we replace the left or right part with a piece sampled from the corpus; we additionally generate negative samples by internal shuffling within the left or right part. The whole model is trained as a classifier with corresponding labels y_sr ∈ {0, 1}. Since the individual utterances are fed in after obtaining their vector representations, the aggregated representation of the two pieces is given by E^sr_CLS, over which a non-linear transformation is applied. The score for semantic relevance is given by g(c_left, c_right), and it is trained using the binary cross-entropy loss:

L_sr = −[ y_sr log g(c_left, c_right) + (1 − y_sr) log(1 − g(c_left, c_right)) ]
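The negative-sample construction for this module can be sketched as below. The exact corruption probabilities and piece lengths are our illustrative assumptions, not the paper's verbatim procedure; `corpus` is a hypothetical pool of utterance lists.

```python
import random

def make_relevance_pair(context, corpus, rng=random):
    """Build one training pair for the Semantic Relevance module.

    context: list of utterances [q1, a1, ..., q_m]. Returns
    (left, right, label): label 1 marks a genuine split, label 0 a
    corrupted one (piece swapped in from the corpus, or one side
    internally shuffled).
    """
    split = rng.randrange(1, len(context))       # random split point
    left, right = context[:split], context[split:]
    if rng.random() < 0.5:
        return left, right, 1                    # positive: true halves
    if rng.random() < 0.5:
        # negative: replace one side with a piece sampled from the corpus
        piece = rng.choice(corpus)
        if rng.random() < 0.5:
            left = piece[:len(left)] or piece
        else:
            right = piece[:len(right)] or piece
    else:
        # negative: internal shuffling of one side
        side = left if rng.random() < 0.5 else right
        rng.shuffle(side)
    return left, right, 0

ctx = ["q1", "a1", "q2", "a2", "q3"]
pool = [["u1", "u2", "u3", "u4"]]
left, right, y = make_relevance_pair(ctx, pool)
assert y in (0, 1) and len(left) >= 1 and len(right) >= 1
```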

Semantic Coherence
The generated response should be rewarded only if it is coherent, not merely if it has adequate content. This encourages the model to generate coherent responses while avoiding incoherent ones. Specifically, given a context c = {q_1, a_1, q_2, a_2, ..., q_m}, we randomly select an agent response at time t, denoted a_t, and replace it with a random utterance from the corpus. We also generate incoherent samples by internal shuffling of bi-grams. Incoherent utterances are labelled as y^coh_t = 0 and coherent samples as y^coh_t = 1. The semantic coherence model is also trained as a classifier over each utterance representation obtained at the output of the BERT encoder, as shown in Figure 1. The probability of the t-th utterance being incoherent is

p^coh_t = σ(MLP(E_t)),

where E_t is the encoder output for the t-th utterance, and the loss function is the binary cross-entropy

L_coh = −Σ_t [ (1 − y^coh_t) log p^coh_t + y^coh_t log(1 − p^coh_t) ].
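The bi-gram shuffling used to create incoherent negatives can be sketched as follows. Splitting into adjacent word pairs is our reading of "internal shuffling of bi-grams"; the paper's exact scheme may differ.

```python
import random

def shuffle_bigrams(utterance, rng=random):
    """Create an incoherent variant of an utterance (label y_coh = 0)
    by cutting it into word bigrams and shuffling their order. The
    words themselves are preserved; only local order is destroyed.
    """
    words = utterance.split()
    bigrams = [words[i:i + 2] for i in range(0, len(words), 2)]
    rng.shuffle(bigrams)
    return " ".join(w for bg in bigrams for w in bg)

out = shuffle_bigrams("the cat sat on the mat", random.Random(0))
# Same multiset of words, different order
assert sorted(out.split()) == sorted("the cat sat on the mat".split())
```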

Consistent Flow
We want the agent to continuously add information to keep the conversation moving forward. To measure flowing conversation, we take the cosine similarity between the last two agent utterance embeddings E_{a_{i−1}} and E_{a_i}, denoted g(a_{i−1}, a_i), and we also measure the similarity with a randomly sampled utterance v in place of a_i, given as g(a_{i−1}, v). We would like g(a_{i−1}, a_i) to be larger than g(a_{i−1}, v) by at least a margin ∆, and define the learning objective as a hinge loss:

L_cf = max( 0, ∆ − g(a_{i−1}, a_i) + g(a_{i−1}, v) ).
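This hinge objective can be sketched numerically. The margin value 0.5 below is illustrative; the paper does not state the value of ∆ here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def consistent_flow_loss(e_prev, e_cur, e_rand, margin=0.5):
    """Hinge loss for the Consistent Flow module: the similarity
    g(a_{i-1}, a_i) between consecutive agent utterances should exceed
    the similarity g(a_{i-1}, v) to a randomly sampled utterance v by
    at least the margin."""
    return max(0.0, margin - cosine(e_prev, e_cur) + cosine(e_prev, e_rand))

# Similar consecutive turns and a dissimilar random utterance give zero loss:
loss = consistent_flow_loss([1.0, 0.0], [1.0, 0.1], [0.0, 1.0], margin=0.5)
assert loss == 0.0
```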

Joint Training of Agent and Reward Modules
To initialize the parameters of the agent and the reward modules M = {Semantic Relevance, Semantic Coherence, Consistent Flow}, we use a supervised learning objective, since all state-action pairs obtained from the pre-training corpus are ground truth and can be used as a close approximation for further fine-tuning on other dialog corpora. We use the Gutenberg dialog corpus (Csaky and Recski, 2020) as the pre-training corpus P. Since the agent model in our case is a BERT encoder-decoder parameterized by θ, similar to Gu et al. (2020), the probability of generating the agent's response a is

p_θ(a|s) = Π_{j=1}^{N} p_θ(a_j | a_{<j}, s),

where a_j is the j-th word generated at the output of the decoder, s is the whole context history fed to the encoder, and N is the maximum sequence length of the decoder. The loss function for generating the agent response a is

L_gen = − log p_θ(a|s),

and the joint loss function combines it with the three module losses:

L = L_gen + L_sr + L_coh + L_cf.

The policy π_θ is also parameterized by θ, and the probability of action a is given by π_θ(a|s), just as p_θ(a|s), since the probability distribution is learned only from (s, a) pairs obtained from the corpus of human demonstrations. It is therefore a good approximation to initialize the parameters of the policy π_θ(a|s) with the parameters of p_θ(a|s). Furthermore, we later update the policy π_θ (Step 13 in Algorithm 1) to avoid actions a which do not lead to rewarding conversations.
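The joint pre-training objective above can be sketched schematically. Equal weighting of the four terms is an assumption on our part; the paper does not specify combination weights.

```python
def joint_loss(l_gen, l_sr, l_coh, l_cf, w=(1.0, 1.0, 1.0, 1.0)):
    """Schematic joint objective for pre-training: generation negative
    log-likelihood l_gen = -log p_theta(a|s) plus the Semantic
    Relevance (l_sr), Semantic Coherence (l_coh) and Consistent Flow
    (l_cf) losses, with hypothetical weights w."""
    return w[0] * l_gen + w[1] * l_sr + w[2] * l_coh + w[3] * l_cf

assert joint_loss(2.0, 0.5, 0.25, 0.25) == 3.0
```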

Dialogue Simulation between Agent and User
We set up a simulation between a virtual agent and a user, and let them take turns talking to each other. The simulation starts with a starter utterance obtained from the dialog samples D_H (Step 5 of Algorithm 1) and fed to the agent, which encodes the utterance and generates the response a; the state s^u is then updated with the previous history and fed to the user model to obtain the next response a^u. The response a^u is appended to obtain the updated agent state s. The process is repeated until one of the following conditions occurs: a) the agent starts to produce dull responses like "I don't know"; b) the agent starts to generate the same response in consecutive turns; or c) the conversation reaches the maximum number of turns handled by the agent and user models.
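The roll-out loop and its stopping conditions can be sketched as follows. Here `agent` and `user` are hypothetical callables standing in for the trained policy and user simulator, each mapping a state (list of utterances) to a response string.

```python
DULL = {"i don't know", "i dont know"}

def simulate(agent, user, starter, max_turns=10):
    """Sketch of the agent-user roll-out. Returns the collected
    (state, action) pairs. Stops on: a) a dull agent response,
    b) a consecutively repeated agent response, c) the turn limit.
    """
    state = [starter]
    trajectory = []
    prev_agent = None
    for _ in range(max_turns):
        a = agent(state)
        if a.lower().strip() in DULL:        # a) dull response
            break
        if a == prev_agent:                  # b) consecutive repetition
            break
        trajectory.append((list(state), a))  # record (s, a) pair
        state = state + [a]
        prev_agent = a
        a_u = user(state)                    # user replies given s^u
        state = state + [a_u]
    return trajectory                        # c) turn limit reached

# Toy roll-out with stand-in models:
traj = simulate(lambda s: f"reply-{len(s)}", lambda s: f"user-{len(s)}",
                "hello", max_turns=3)
assert len(traj) == 3
```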

Weakly Supervised Learning Algorithm
Learning with weak supervision is widely used with the rise of data-driven neural approaches (Ratner et al., 2020; Mrkšić et al., 2017; Chang et al., 2020; Bach et al., 2017; Chang et al., 2021a). Our approach follows a similar line of work: we provide noisy text to a pretrained model which incorporates prior knowledge from general-domain text and a small amount of in-domain text (Peng et al., 2020; Chen et al., 2019; Harkous et al., 2020), and use it as a weak annotator similar to Ratner et al. (2020). The primary challenge with synthetic data is the noise introduced during the generation process, and noisy labels tend to bring little to no improvement (Frénay and Verleysen, 2013). To train on such noisy data, we employ a three-step training process: a) pre-training, b) generating data with weighted categories, and c) fine-tuning, similar to Chang et al. (2021a) and Dehghani et al. (2017).
Step 1: Pre-train Generation and Quality Modules Jointly. This step pre-trains the agent jointly with the quality modules as explained in Section 3.6. The quality modules are trained on clean data as well as on automatically generated negative samples obtained by random sampling. These modules are further fine-tuned on dialogues sampled from the target dialogue corpus at each training iteration. Similarly, we initialize the user model by supervised training on the pre-training dialogue corpus, with fine-tuning on the target dialogue corpus (see Steps 2-7 of Algorithm 1). The fine-tuning steps make use of continual learning to avoid catastrophic forgetting (Madotto et al., 2020; Lee, 2017).
Step 2: Generate Weakly Labelled Data with Reward Categories. After the models are initialized with trained parameters, the dialogue simulation between the agent and the user is started (see Section 3.7); they interact with each other and generate synthetic data, with every state-action pair in the sampled dialogues annotated with a score from each quality module. During dialogue simulation, we employ the Dynamic Blocking mechanism (Niu et al., 2020) to generate novel words and paraphrased responses. Specifically, we generate the Top-7 responses at each turn and set the agent to explore 60 percent of the time; the rest of the time it exploits by selecting the response from the top two ranked responses. We then filter the state-action pairs into three reward categories, namely VeryHigh, High and Low. State-action pairs whose scores from each module are greater than or equal to 0.8 are put into the VeryHigh category; state-action pairs whose scores from each module are between 0.6 and 0.8 are put into the High category; all remaining state-action pairs are put into the Low category. Additionally, we include the state-action pairs sampled from the target dialog corpus in Step 1 in the VeryHigh category.
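The filtering rule can be sketched as a small function. Treating the thresholds as applying to all three module scores simultaneously, and sending mixed cases to Low, is our reading of the text.

```python
def reward_category(scores, hi=0.8, lo=0.6):
    """Assign a simulated (s, a) pair to a reward category from its
    per-module scores (Semantic Relevance, Semantic Coherence,
    Consistent Flow): all scores >= 0.8 -> VeryHigh; all scores in
    [0.6, 0.8) -> High; everything else -> Low.
    """
    if all(s >= hi for s in scores):
        return "VeryHigh"
    if all(lo <= s < hi for s in scores):
        return "High"
    return "Low"

assert reward_category([0.9, 0.85, 0.8]) == "VeryHigh"
assert reward_category([0.7, 0.65, 0.79]) == "High"
assert reward_category([0.9, 0.5, 0.7]) == "Low"
```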
Step 3: Update the Reward Estimator and Policy. The reward estimator maximizes the log likelihood of state-action pairs with higher rewards over those with lower ones. Let the reward estimator be f_ω, parameterized by ω, and let H, V and L represent the collections of all state-action pairs of the High, VeryHigh and Low reward categories respectively,
where f models the state-action pairs of the H, V and L categories as a Boltzmann distribution p_ω(s, a) ∝ exp(f_ω(s, a)) (Takanobu et al., 2019). The objective for the reward estimator, in terms of the trajectories obtained from the respective reward categories, is

J_f(ω) = − KL(p_H(s, a) ‖ p_ω(s, a)) − KL(p_V(s, a) ‖ p_ω(s, a)) + KL(p_L(s, a) ‖ p_ω(s, a)).   (9)

Maximizing J_f minimizes the KL divergence between the reward distribution and the state-action pairs of the High and VeryHigh categories, while maximizing it for the Low category; the gradient with respect to ω follows accordingly. Since the dialog policy is required to produce actions at least as good as the High category, i.e. to maximize the entropy-regularized expected reward E_π[R] + H(π), it effectively minimizes the KL divergence between the policy distribution and the Boltzmann distribution:

J_π(θ) = KL( π_θ(a|s) ‖ exp(f_ω(s, a))/Z_ω ) = E_π[ log π_θ(a|s) − f_ω(s, a) ] + log Z_ω,   (11)
where the term log Z_ω is independent of θ, and H(·) denotes the entropy of a model. Using the likelihood-ratio trick, the gradient for the policy is

∇_θ J_π = −E_π[ (f_ω(s, a) − log π_θ(a|s)) ∇_θ log π_θ(a|s) ].   (12)

Hence the reward is r_ω(s, a) = f_ω(s, a) − log π_θ(a|s) for each state-action pair, and the loss function can be rewritten as

J_π(θ) = −E_π[ Σ_k ( f_ω(s_k, a_k) − log π_θ(a_k|s_k) ) ].   (13)

As in Takanobu et al. (2019), the reward estimator f_ω includes a shaping term. Formally, we also include the next state s_{t+1} instead of just (s_t, a_t):

f_ω(s_t, a_t, s_{t+1}) = g_ω(s_t, a_t) + h(s_{t+1}),

where h is an MLP network whose input is the pre-sigmoid scores from each quality module, and g_ω is also an MLP network whose input is the concatenation of E_CLS as the state vector and the SBERT sentence embedding of the action a.
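The policy update of Eq. 12-13 amounts to REINFORCE with reward r_ω(s, a) = f_ω(s, a) − log π_θ(a|s). A numerical sketch on a tabular softmax policy (purely illustrative; the paper's policy is a BERT encoder-decoder and f_ω an MLP):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def policy_gradient_step(logits, f_scores, action, lr=0.1):
    """One gradient-ascent step on the entropy-regularized expected
    reward for a single sampled action:
    grad = (f_w(s,a) - log pi(a|s)) * grad_theta log pi(a|s),
    with d log pi(a|s) / d logit_k = 1{k == a} - pi(k|s)."""
    probs = softmax(logits)
    reward = f_scores[action] - math.log(probs[action])
    return [l + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
            for k, l in enumerate(logits)]

# Repeatedly reinforcing the action with the higher estimated reward
# shifts probability mass towards it:
logits = [0.0, 0.0]
for _ in range(50):
    logits = policy_gradient_step(logits, f_scores=[2.0, 0.0], action=0)
assert softmax(logits)[0] > 0.5
```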

Experiments
We conduct experiments on DailyDialog (Li et al., 2017b) and PersonaChat, and use the Gutenberg Dialogue Dataset (Csaky and Recski, 2020) as a pre-training corpus. We compare our model's performance with baselines on various aspects of response quality.

Datasets
We consider DailyDialog (Li et al., 2017b) and PersonaChat, which are open-domain dialog corpora, to evaluate our system. DailyDialog contains conversations revolving around various topics pertaining to daily life, and PersonaChat contains conversations between people with their respective persona profiles. These dialogues can be of varying length; we limit the maximum length that can be fed to the BERT Encoder-Decoder model to 20 utterances. Since the average length of DailyDialog is 7.9 and that of PersonaChat is 9.4, most of the dialogues fit easily without truncation of the history. For the rest of the dialogues, a sliding window can be used to include the more recent utterances and drop the earliest ones. Since we map the utterances to their corresponding vectors using SBERT, the length of individual utterances is truncated automatically, retaining only the first 512 word pieces of longer utterances. For the pre-training corpus the vocabulary is limited to 100,000, while the vocabularies for DailyDialog and PersonaChat are 25,000 and 32,768 respectively.

Excerpt of Algorithm 1 (simulation and update steps):
Collect dialog samples D_π by executing the dialog policy π and interacting with µ: a^u ∼ µ(·|s^u), a ∼ π(·|s), where s and s^u are updated each time after getting a response from the user and agent respectively.
9: Get weak annotation scores for all (s, a) ∈ D_π from each of the modules M.
10: Filter the (s, a) pairs into {VeryHigh, High, Low} reward categories.
13: Update the policy π_θ by minimizing J_π w.r.t. θ (Eq. 13).
14: end for

Baselines
We select various multi-turn response generation baselines. The baselines that do not include pre-training are: (1) HRED: a hierarchical encoder-decoder framework; (2) VHRED: an extension of HRED that generates responses with latent variables; (3) HRAN: a hierarchical attention mechanism based encoder-decoder framework; (4) ReCoSa: a hierarchical transformer based model (Zhang et al., 2019); (5) SSN: dialogue generation learning with self-supervision signals extracted from utterance order (Wu et al., 2019); (6) Transformer-Auxiliary Tasks: a recent state-of-the-art model learning language generation by joint training of a transformer with auxiliary tasks. Another two baselines, from Csaky and Recski (2020), involve pre-training on the Gutenberg corpus: (1) Transformer: a 50M-parameter version, and (2) GPT-2: a pre-trained model with 117M parameters. The repository (https://github.com/ricsinaruto/gutenberg-dialog) contains these two trained models.

Evaluation Metrics
We evaluate the performance of our model on various aspects of response quality using both automatic and human evaluation. Most automatic metrics correlate poorly with human evaluation (Liu et al., 2016), and the recently proposed metrics (Li et al., 2017a; Tao et al., 2018) are harder to evaluate than perplexity and BLEU (Papineni et al., 2002). Additionally, human evaluation has its inherent limitations of bias, cost and replication difficulty (Tao et al., 2018). Due to this lack of consensus, some works used only automatic metrics (Xing and Fernández, 2018; Xu et al., 2018b), some used only human evaluation (Krause et al., 2017; Fang et al., 2018), while others used both (Shen et al., 2018; Xu et al., 2018a; Baheti et al., 2018; Ram et al., 2018).
We mainly compute automatic metrics using the DIALOG-EVAL repository (https://github.com/ricsinaruto/dialog-eval); it contains 17 different metrics, but we measure only a few of them to facilitate comparison with the published baseline results, following prior work for both automatic and human evaluation. For response content quality we measure BLEU-4 (Papineni et al., 2002) and perplexity (PPL) (Sutskever et al., 2014). We also use the embedding metrics average (AVG), extrema (EXT), and greedy (GRE), which measure the similarity between response and target embeddings. Finally, we measure the informativeness of responses with distinct-1 and distinct-2, calculated as the ratios of distinct unigrams and bigrams.
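The distinct-n computation is simple enough to state exactly: the number of distinct n-grams divided by the total number of n-grams over the generated responses.

```python
def distinct_n(responses, n):
    """Distinct-n: ratio of distinct n-grams to total n-grams over a
    set of generated responses (distinct-1 for unigrams, distinct-2
    for bigrams)."""
    ngrams = []
    for r in responses:
        words = r.split()
        ngrams += [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

resp = ["i like tea", "i like coffee"]
assert distinct_n(resp, 1) == 4 / 6   # {i, like, tea, coffee} over 6 tokens
assert distinct_n(resp, 2) == 3 / 4   # {(i,like), (like,tea), (like,coffee)} over 4
```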
Since our main objective is not only to judge response quality but to predict responses for the long-term success of the dialogue, we explore both single-turn and multi-turn settings. We picked 500 dialogues from the test set and asked 3 native speakers for their judgement. In the first setting, we asked the judges to pick the better response between the one generated by our model and one by a baseline model (pre-trained GPT-2), based on criteria such as answerability and semantics. In the second, multi-turn setting, we used 200 simulated conversations between the RL agent and a user model to judge whole conversations based on the responses uttered by the agent; for each complete end-to-end conversation, we asked the judges to decide which of the simulated conversations was of higher quality. To compare against the RL model, we employed the baseline model to simulate 200 conversations with the same starter utterances used by the RL model. Automatic and human evaluation results are shown in Tables 1 and 2 respectively.

Table 1 reports automatic evaluation metrics for the baselines and the proposed model. Our model outperforms the baselines on most metrics on both datasets. Our main idea is to generate responses for a successful conversation in the long run rather than just optimizing response quality at each turn. This is the main reason why our model outperforms on both distinct-1 and distinct-2, in comparison to the Transformer-auxiliary task model, which is also trained jointly with similar tasks but lacks fine-tuning with the weak supervision signals; this indicates that additional training with weakly labelled data improves generalization performance. Perplexity also improves, since our model generates responses more like humans in order to optimize the conversation in the long run.
Similarly, the embedding metrics also show improvement, though only slightly on average: they capture the sense of a response but suffer from length mismatch, owing to the fact that our model generates more novel words with forward-looking sense. The Distinct-{1,2} scores show improvement because of the large pre-trained vocabulary, which gives the model more flexibility to generate novel words without disturbing the sense of the sentence. We also report results for our model without weak supervision training, namely Our Model w/o Weak Supervision; this model just fine-tunes on DailyDialog (Li et al., 2017b) and PersonaChat without generating the weakly labelled data. Its distinct-1 and distinct-2 scores are clearly lower than those of the proposed model, because it tends to generate repetitive words more frequently. Similarly, its embedding metrics and PPL do not show any improvement over the proposed model, except for the Average embedding metric. However, it performs well on BLEU scores, since it learns to reproduce the responses in the ground truth but is not optimized for a successful conversation in the long run. Table 1 also reports the results of the two baselines pre-trained on the Gutenberg Dialogue Corpus (Csaky and Recski, 2020) and fine-tuned on DailyDialog and PersonaChat respectively. These models improve considerably on BLEU and on distinct-1 and distinct-2 scores, since they benefit from a larger vocabulary and more extensive training for learning language structure, but they lag on the embedding metrics, indicating lower response quality. Table 2 reports the human evaluation results; recall that the objective of our model training is to generate responses for a successful conversation in the long run in the multi-turn scenario.
The evaluation results meet our expectation: the RL system brings a smaller boost in single-turn response quality than it does in the multi-turn setting.

Conclusions
We proposed a weak supervision framework for policy and reward estimation for the long-term success of a dialogue, by simulating the conversation between a virtual agent and a user. Empirical studies on two benchmarks prove the effectiveness of our approach.