[CASPI] Causal-aware Safe Policy Improvement for Task-oriented Dialogue

The recent success of reinforcement learning (RL) in solving complex tasks is often attributed to its capacity to explore and exploit an environment. Sample efficiency is usually not an issue for tasks with cheap simulators to sample data online. On the other hand, task-oriented dialogues (ToD) are usually learnt from offline data collected using human demonstrations. Collecting diverse demonstrations and annotating them is expensive. Unfortunately, RL policies trained on off-policy data are prone to issues of bias and poor generalization, which are further exacerbated by stochasticity in human responses and the non-Markovian nature of the annotated belief state of a dialogue management system. To this end, we propose a batch-RL framework for ToD policy learning: Causal-aware Safe Policy Improvement (CASPI). CASPI includes a mechanism to learn a fine-grained reward that captures the intention behind human responses, and also offers guarantees on the dialogue policy's performance against a baseline. We demonstrate the effectiveness of this framework on the end-to-end dialogue task of the MultiWoz2.0 dataset. The proposed method outperforms the current state of the art. Furthermore, we demonstrate sample efficiency: our method trained on only 20% of the data is comparable to the current state-of-the-art method trained on 100% of the data on two out of three evaluation metrics.


Introduction
Offline task-oriented dialogue systems involve solving the disparate tasks of belief state tracking, dialogue policy management, and response generation. In this work we strive to improve the performance of dialogue policy management.
This work is under review. The code is available at: https://github.com/salesforce/CASPI
arXiv:2103.06370v1 [cs.CL] 10 Mar 2021
Sample efficiency is key for learning an offline task-oriented dialogue system, as access to data is finite and expensive. Recent advancements in off-policy reinforcement learning (batch-RL) methods, which use historical annotated data instead of a simulator, have proven to be sample efficient and help in safe policy improvement for generalizable policies. The effective use of these techniques is hindered by the nature of dialogue policy learning. For example, off-policy learning often requires an estimate of the behaviour policy for a given state of the Markov Decision Process (MDP). In real life, the belief state does not capture the true state of the MDP: latent states such as prosody, among others, induce stochasticity in the agent's response at each turn. Then there is the issue of loss of semantic information from the dialogue act to the generated natural language text. Mere policy imitation of dialogue acts falls short of reasoning about the outcome; rather, it focuses on each constituent of a composite action equally. This is demonstrated in Fig:1. Turns #2 and #3 are rich in semantic information, and Turn #3 is key to the transaction of the booking process, while Turn #4, though of least use to the success of the conversation, gets equal weight as the other semantically rich turns. Worse, such turns appear more often than specifics like Turns #2 and #3, thereby clogging the gradient budget. These relative importances are lost in imitation policy learning.
The main contributions of this work are: we introduce safe policy improvement in a batch reinforcement learning setting for dialogue policy learning, with guarantees on performance; we introduce pairwise causal reward learning to shape a reward that reasons about the intention of the human utterance instead of mimicking the demonstration; and through the use of these two off-policy methods we demonstrate sample efficiency.

Related Works
With the release of the multi-domain, multi-turn MultiWoz2.0 dataset (Budzianowski et al., 2018b), there has been a flurry of recent works, of which (Zhang et al., 2019) uses data augmentation. Rastogi et al. (2019) and Hosseini-Asl et al. (2020) frame dialogue policy learning as a language modeling task. Among the works that use reinforcement learning, Mehri et al. (2019) use supervised learning to bootstrap, followed by RL fine-tuning, whereas (Zhao et al., 2019) uses policy gradient on a latent action space instead of handcrafted ones. To the best of our knowledge, (Jaques et al., 2019) and (Wang et al., 2020) are the only other works that use batch-RL for dialogue policy learning. Recently there has been a proliferation of systems based on large pretrained language models, e.g. (Hosseini-Asl et al., 2020), (Lin et al., 2020), (Chen et al., 2019).
The line of inverse RL used in this work can be traced back to Ziebart et al. (2008), who propose that roll-outs from expert demonstrations should have rewards exponentially higher than any other arbitrary roll-outs. This method requires a normalizing constant that integrates across roll-outs, which is challenging to compute. Christiano et al. (2017) and Thananjeyan et al. (2020) propose to make a relative comparison of two roll-outs, thereby eliminating the need for a normalization constant, and they demonstrate this in an online setting.

Preliminaries
We model task-oriented dialogue as a Markov decision process (MDP) (Sutton & Barto, 2018) with a set of states S and actions A. At time step t, the agent in state s_t performs a composite action a_t as per a target policy π_e(a_t|s_t) on the environment, with transition probabilities to the next state P(s_{t+1}|s_t, a_t), a latent reward function R(s_t, a_t), and a discount factor γ ∈ [0, 1]. The objective is then to optimize the target policy π_e that maximizes the discounted sum of future rewards on the MDP, given the state-action value function

Q^{π_e}(s_t, a_t) = E_{a_{t'}∼π_e, s_{t'}∼P} [ Σ_{t'=t}^{T} γ^{t'−t} R(s_{t'}, a_{t'}) ].

In offline batch-RL the agent does not get to interact with the environment; instead we are provided with offline data D logged by human agents performing actions based on a latent stochastic behaviour policy π_b, where τ_i ∈ D is a rollout of a dialogue, composed of observation-action pairs τ_i = (o_0, a_0, . . . , o_T, a_T). Here o_t is the observation at turn t, composed of o_t = (b_t, u^u_t, u^a_{t−1}), where b_t is the belief state of the agent at turn t, and u^u_t and u^a_{t−1} are the user and agent utterances at time t and t − 1 respectively.
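As a concrete illustration of the discounted return inside the Q-function above (this is not the paper's code; the per-turn rewards are hypothetical numbers):

```python
# Sketch: discounted sum of future rewards from turn t, matching
# Q^{pi_e}(s_t, a_t) = E[ sum_{t'=t}^{T} gamma^(t'-t) R(s_t', a_t') ].
def discounted_return(rewards, t, gamma=0.99):
    """Discounted sum of rewards from turn t to the end of the dialogue."""
    return sum(gamma ** (tp - t) * r for tp, r in enumerate(rewards) if tp >= t)

# Hypothetical per-turn rewards of a single dialogue rollout:
turn_rewards = [0.0, 0.1, 0.8, 0.05]
g0 = discounted_return(turn_rewards, t=0, gamma=1.0)  # undiscounted return: 0.95
```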

Safe policy improvement
Batch-RL entails training a policy on rollouts generated by the latent behaviour policy. Directly optimizing on rollouts generated by another policy leads to large bias in the value function estimation, poor generalization characteristics, and sample inefficiency (Thomas & Brunskill, 2016). Safe policy improvement ensures the new policy's performance is bounded with respect to the old policy, in this case the behaviour policy:

Pr( V^{π_e} ≥ V^{π_b} − ζ ) ≥ 1 − δ,

where V^{π_e} and V^{π_b} are the value functions of the target and behaviour policy respectively, and 1 − δ and ζ are the high-probability and approximation meta-parameters respectively. (Schulman et al., 2015) provide such an update mechanism, (1), whose errors are bounded as long as the constraint of (1) is met:

max_{π_e} E_{s,a∼π_b} [ (π_e(a|s) / π_b(a|s)) Q^{π_b}(s, a) ]  s.t.  E_s [ D_KL(π_b(·|s) || π_e(·|s)) ] ≤ η,   (1)

where D_KL(.||.) is the KL divergence and η is a hyper-parameter.
Use of this update rule requires access to the behaviour policy π_b(a_t|s_t), which is intractable to estimate, and learnt estimates may be biased. Using them to perform bias correction such as importance sampling (Precup, 2000) might lead to a worse policy. Instead, we estimate the behaviour policy conditioned on the belief state b_t rather than on s_t in (1), which results in a stochastic behaviour policy. The belief state b_t is part of the observation o_t at turn t. The actions are stochastic in nature given just the belief state; this is demonstrated in Fig:4. We purport that, given more evidence from the observation o_t (besides b_t), the mode of the policy collapses to a near-deterministic action. To factor this into policy learning, we add an additional loss term. Let G(τ, g) denote the discounted sum of future rewards for a rollout τ with goal g. The policy optimization loss function is then:

L_π = L_sto + L_det.   (3)

We achieve this by doing two forward passes on the policy network: a first pass with only the belief state as input, and a second pass with all the observation information, to get the action distribution. The first pass captures the stochasticity of the policy conditioned only on the belief state b_t, and the second pass collapses the mode given other latent information of the state, such as u^u_t and u^a_t.
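The two-pass idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the action distributions are made-up numbers, and the choice of targets (empirical distribution for the belief-only pass, one-hot demonstrated action for the full-observation pass) is an assumption for the sake of the example.

```python
import math

def cross_entropy(p_target, q_pred, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(p * math.log(q + eps) for p, q in zip(p_target, q_pred))

# Pass 1: belief state only -> actions remain stochastic, so the target is
# the empirical action distribution observed for this belief state.
empirical_actions = [0.5, 0.3, 0.2]   # hypothetical
pred_belief_only  = [0.4, 0.4, 0.2]   # policy queried with b_t alone
l_sto = cross_entropy(empirical_actions, pred_belief_only)

# Pass 2: full observation (belief state + utterances) -> the mode should
# collapse, so the target is the single demonstrated action (one-hot).
demonstrated  = [1.0, 0.0, 0.0]
pred_full_obs = [0.9, 0.05, 0.05]     # policy queried with all of o_t
l_det = cross_entropy(demonstrated, pred_full_obs)

loss = l_sto + l_det  # combined policy loss (reward weighting omitted)
```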

Pairwise causal reward learning
The policy optimization objective introduced in the previous section requires access to a per-time-step reward R(s_t, a_t, g).
To this end, we provide a mechanism to learn a reward that is causally reasoned on the intention of the human demonstrator. Dialogue policy learning is usually accompanied by metrics M to evaluate the performance of the learnt policy. Though these metrics could serve as a proxy for a reward function, using them directly is challenging. These metric functions usually return a score for the entire dialogue. Given the complex state-action space of the dialogue management system, this dialogue-level feedback is under-specified for rewarding an action performed at each turn.
Algorithm 1 CASPI
  Input: Dialogue dataset D and evaluation metric M
  Sub-sample K folds of train and val sets
  for each fold do
    Learn ToD in a supervised setting by optimizing the objective: min −E_{a,s∼D_T} log(π_m(â|s))
    for ∀ epochs do
      Predict on the valset D_V and add the predictions to the dataset D_P for pairwise causal learning: D_P = D_P ∪ {τ | τ ∼ π_m}
    end for
  end for
  repeat
    Sample a pair of rollouts (τ_1, τ_2) ∼ D_P
    Learn the R(.) network by optimizing the objective Eqn:4
  until convergence, using data D_P
  repeat
    Optimize the policy π_e using objective (3)
  until convergence, using data D

To address this under-specified feedback, we adapt the preference learning introduced by (Christiano et al., 2017) from an online to an offline setting. We parametrize the reward at every timestep t as R(s_t, a_t, g). Given a pair of rollouts τ_1, τ_2 ∈ D, with the actions for each state in the rollouts sampled from two different learnt policies π^1_e and π^2_e respectively, let P[τ_1 ≻ τ_2] be the probabilistic measure that captures the preference for policy π^1_e over policy π^2_e. This preference holds when the sums of rewards over the two rollouts are such that:
Σ_{t=0}^{T} R(s_t, a_t, g | (s_t, a_t) ∈ τ_1) > Σ_{t=0}^{T} R(s_t, a_t, g | (s_t, a_t) ∈ τ_2). We henceforth refer to Σ_{t=0}^{T} R(s_t, a_t, g | (s_t, a_t) ∈ τ) as R(τ). The preferential probability can then be represented by:

P[τ_1 ≻ τ_2] = φ(R(τ_1)) / (φ(R(τ_1)) + φ(R(τ_2))).

Here φ(.) can either be exp(.) or the identity 1(.). In our experiments the latter works best. We optimize the reward R(s_t, a_t, g) by minimizing the binary cross-entropy loss between the preference probability and the normalized metric score μ(τ) of a pair of rollouts:

L_R = −E_{(τ_1,τ_2)∼D_P} [ μ(τ_1) log P[τ_1 ≻ τ_2] + μ(τ_2) log P[τ_2 ≻ τ_1] ],   (4)

where μ(τ_1) = M(τ_1) / (M(τ_1) + M(τ_2)). Learning a policy in a sparse-reward MDP is a hard problem (Ecoffet et al., 2019). In online learning, agents can interact with and explore the environment. Agents have the liberty to sample arbitrarily large numbers of rollouts from the environment, and they may still fail (Ecoffet et al., 2019) to learn an effective policy in a sparse-reward MDP with a large state-action space, as the chance of encountering a non-zero reward grows exponentially smaller with the episode length.
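A minimal sketch of this preference model and loss, assuming the Bradley-Terry-style form implied above; the reward values and normalized metric score below are hypothetical numbers, not outputs of the actual system:

```python
import math

def preference(R1, R2, phi=math.exp):
    """P[tau_1 > tau_2] for rollout-level rewards R1, R2; phi is exp or identity."""
    return phi(R1) / (phi(R1) + phi(R2))

def pairwise_loss(R1, R2, mu1, phi=math.exp, eps=1e-12):
    """Binary cross-entropy between preference prob and normalized metric mu1."""
    p = preference(R1, R2, phi)
    return -(mu1 * math.log(p + eps) + (1 - mu1) * math.log(1 - p + eps))

# Rollout tau_1 scores higher on the dialogue-level metric than tau_2:
mu1 = 0.8                             # hypothetical normalized metric score
loss = pairwise_loss(R1=2.0, R2=1.0, mu1=mu1)
```

Note that when the learnt rewards agree with the metric ordering (R1 > R2 while μ(τ_1) > 0.5), the loss is smaller than when they disagree, which is what drives the per-turn reward toward explaining the metric.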
This is exacerbated in offline settings, as we are forced to learn an optimal policy with finite data. Some successes have been seen with guided exploration (Aytar et al., 2018; Nair et al., 2018; Vecerik et al., 2017), where expert demonstrations are used to guide the exploration. This strategy improves the chance of encountering the sparse reward, as it restricts the state-action space to regions where non-zero reward exists.
We observe that the dialogue roll-outs are generated by a latent expert policy. The data (dialogue rollouts) are distributed according to the optimal latent policy and the transition probability. We propose that the predictions made by a policy, while it is in the process of learning to maximize the likelihood of the data, form a good curriculum for exploring the state-action space for pairwise reward learning. This is the key insight of this work.
We formalize this insight into a method depicted in Fig:2 and Algo:1. The (train) dataset is subsampled into K-fold train and val sets. K baseline models are trained to fit the data distribution generated by experts using a cross-entropy loss. While fitting the data distribution, the still-learning K policies are used to predict on their corresponding K-fold valsets at every epoch of training. Each dialogue is scored by the chosen dialogue-level metric. On convergence of the supervised learning process, pairs of dialogue predictions generated by the above process, along with their corresponding metric scores, are used to train for the preferential optimization objective Eqn.4, which in turn learns a fine-grained reward R(a, s, g; θ). The use of K-fold subsampling and K baseline models helps generate stochasticity in the generated samples. It also helps to use the data effectively and makes the method sample efficient.
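The K-fold prediction curriculum can be sketched as below. Model training and prediction are stubbed out with placeholder strings, so this only illustrates how the pairwise dataset D_P is assembled across folds and epochs; the function and variable names are ours, not the paper's:

```python
import random

def build_pairwise_dataset(dialogues, K=3, epochs=2, seed=0):
    """Assemble D_P: per-epoch predictions of K still-learning models
    on their held-out folds (training itself is stubbed out here)."""
    rng = random.Random(seed)
    data = list(dialogues)
    rng.shuffle(data)
    folds = [data[i::K] for i in range(K)]  # K disjoint val folds
    D_P = []
    for k in range(K):
        val = folds[k]  # held-out fold for model k
        for epoch in range(epochs):
            # stand-in for "predict on the valset with policy pi_m":
            D_P.extend((dlg, "pred(model=%d,epoch=%d)" % (k, epoch))
                       for dlg in val)
    return D_P

D_P = build_pairwise_dataset(["d%d" % i for i in range(9)], K=3, epochs=2)
# Every dialogue appears once per epoch of its fold's model: 9 * 2 entries.
```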

Sample weights for policy optimization
The learnt reward is akin to a sample weight for each instance of the data, which helps redistribute the gradient update budget among the samples based on their contribution to the overall success of the task-oriented dialogue system. To this end, we propose that the learnt reward can be used as a sample weight in any existing ToD system to reap the benefit of the sample efficiency it brings. We demonstrate this by adapting two existing ToD systems with the learnt reward; more on this in the next section, 4.1. We believe our pairwise causal reward learning and the associated sample-weighted improvement are independent of the model architecture used for learning task-oriented dialogue systems. As argued above, our approach can provide sample weights for any existing method. To this end, we choose two ToD methods at the extremes of the model architecture spectrum: 1) one uses a lightweight custom model, and 2) the other uses a large, standard, pretrained out-of-the-box universal language model. We demonstrate the ease of integrating CASPI with these methods, and the resulting improvement in performance and sample efficiency.
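A sketch of the sample-weighting idea, assuming a reward-weighted negative log-likelihood (this form is our illustration, not the paper's exact loss; the probabilities and rewards are made-up numbers echoing the Turn#2/#3 example from the introduction):

```python
import math

def weighted_nll(turn_probs, turn_rewards, eps=1e-12):
    """Reward-weighted negative log-likelihood over the turns of a dialogue."""
    return sum(-r * math.log(p + eps) for p, r in zip(turn_probs, turn_rewards))

probs   = [0.9, 0.6, 0.7, 0.95]   # policy prob. of the demonstrated act per turn
rewards = [0.05, 0.4, 0.5, 0.05]  # learnt R(s_t, a_t, g): Turns #2/#3 dominate
loss = weighted_nll(probs, rewards)
```

With uniform rewards of 1 this reduces to the plain supervised loss; the learnt reward instead shifts the gradient budget toward the turns that matter for dialogue success.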

CASPI(DAMD)
In this setting, we use the neural model proposed by (Zhang et al., 2019), without their key contribution of data augmentation, as the baseline for our experiments. DAMD is composed of three seq2seq generative models using GRUs, one each for the belief state, dialogue act, and response generation modules. An attention layer is then used to attend over the outputs of the seq2seq models with the context vector of the previous turn for a copy-over mechanism. The outputs are then used as representations for predicting the series of tokens for their respective modules. For more details on the model architecture and parameter settings, refer to (Zhang et al., 2019). In this setting we use both the stochastic loss L_sto and the deterministic loss L_det on the dialogue act. For DST and response generation, we retain the cross-entropy loss as-is from DAMD (Zhang et al., 2019).
Table 3. Comparison of results for end-to-end Multiwoz2.0 in the low-resource setting

CASPI(MINTL)
On the other extreme of model complexity, we use the task-oriented dialogue model MinTL (Lin et al., 2020). MinTL uses a large pretrained language model, BART (Lewis et al., 2019). BART uses a standard encoder-decoder transformer architecture with a bidirectional encoder and an autoregressive decoder. It is pre-trained on the task of denoising corrupt documents, using a cross-entropy loss between the decoder output and the original document. For reward learning, we use three single bi-LSTM layers, one each to encode the goal, belief state, and dialogue act or response sequences at each dialogue turn for each of the sampled roll-out pairs τ_1 and τ_2. The three encoded representations are concatenated and fed through a couple of feed-forward layers before making a bounded reward prediction R(s_t, a_t, g) for each turn using a sigmoid function. The per-turn rewards are summed to form a global reward R(τ) for the roll-out τ. Using a pair of dialogue rewards R(τ_1) and R(τ_2), we compute the probabilistic preference between the roll-outs, P[τ_1 ≻ τ_2], either by standard normalization or by a softmax function. The output is optimized using the cross-entropy loss described in Eqn:4.

Dataset
We evaluate our proposed method on the Multi-domain Wizard-of-Oz (MultiWoz) dataset (Budzianowski et al., 2018b). It is a large-scale, multi-domain, task-oriented dataset generated by human-to-human conversation, where one participant plays the role of the user while the other plays the agent. The conversations are between a tourist and a clerk at an information center, and span 7 domains: attraction, hospital, hotel, police, restaurant, taxi, and train. Each dialogue is generated by users with a defined goal which may cover 1-5 domains, with a maximum of 13 turns in a conversation. The dataset has 10438 dialogues, split into 8438 dialogues for the training set and 1000 dialogues each for the validation and test sets.

Preprocessing
We represent DB results as one-hot vectors, as proposed by (Budzianowski et al., 2018a). To reduce surface-level variability in the responses, we use the domain-adaptive delexicalization preprocessing proposed in (Wen et al., 2016). As proposed in (Zhang et al., 2019), we generate delexicalized responses with placeholders for specific values, which can be filled with information from the DST and the database.

EVALUATION
Since the focus of this work is the sample efficiency of dialogue policy learning, we use the context-to-response generation task of Multiwoz2.0 (Budzianowski et al., 2018b) and its evaluation metrics to measure the quality of the response as our primary objective; for completeness, we also evaluate the performance of our method on the end-to-end dialogue modeling task. Both of these settings use three evaluation metrics: 1) inform rate - the fraction of dialogues in which the system has provided the correct entity, 2) success rate - the fraction of dialogues in which the system has answered all the requested information, and 3) BLEU (Papineni et al., 2002) - which measures the fluency of the generated response. We also report the combined score, (Inform + Success) × 0.5 + BLEU, proposed by Mehri et al. (2019). All CASPI numbers reported in this work are the median of 5 runs with different seeds.

TRAINING
For the metric M used in pairwise causal reward learning, we use:

M = Inform + Success + λ · BLEU.

This is very similar to the combined score used in evaluation, and the two are equivalent (up to a constant factor) when λ = 2. We introduce the hyperparameter λ to normalize the achievable scale of BLEU. We observe that the success rate, if used as-is, results in a non-Markovian and stochastic per-turn reward function, since the reward of the current state depends on the performance of future states. Hence, we also use a soft version of the metric, M_soft, where the success rate measures the fraction of requested information provided in a dialogue. We refer to the original metric that uses the discrete variant of the success rate as M_hard. The choice of action in the reward function R(s_t, a_t, g) can be either the dialogue act or the generated response; we refer to the corresponding variants of the metric as M(act) and M(resp). To demonstrate the versatility of the method in adapting to different metrics, we use all the discussed variants.
Table 4. Sample efficiency study of CASPI(DAMD) on the context-to-response generation task of MultiWoz2.0
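A sketch of this metric, under our reconstructed assumption that M = Inform + Success + λ·BLEU (the exact form is inferred from the stated equivalence with the combined score at λ = 2; the scores below are hypothetical):

```python
# Assumed form of the training metric vs. the evaluation combined score.
def metric_M(inform, success, bleu, lam=2.0):
    return inform + success + lam * bleu

def combined_score(inform, success, bleu):
    return (inform + success) * 0.5 + bleu

m = metric_M(0.9, 0.8, 0.18, lam=2.0)
c = combined_score(0.9, 0.8, 0.18)
# With lambda = 2, M is exactly twice the combined score, so both induce
# the same ranking over pairs of rollouts.
```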

Baselines
DAMD: Introduced by (Zhang et al., 2019), DAMD is a domain-aware multi-decoder network. The method also exploits the stochastic nature of the dialogue act by using a data-augmentation technique called multi-action data augmentation. DAMD with data augmentation is denoted here as DAMD + multiaction.
HDSA (Chen et al., 2019) proposes to use a hierarchical graph representation for the dialogue act. It uses a pre-trained 12-layer BERT model (Devlin et al., 2019) to represent the dialogue act. The predicted dialogue act is transformed into the hierarchical graph structure using a disentangled self-attention model, a 3-layer self-attention model (Vaswani et al., 2017).
SOLOIST (Peng et al., 2020) and SimpleTOD (Hosseini-Asl et al., 2020) use pretrained GPT-2-based methods. These methods are trained on turn-level data without generated belief state and system act in the dialogue history.
MinTL-BART (Lin et al., 2020) introduced the Levenshtein belief spans framework, which predicts only the incremental change in dialogue state per turn. It leverages pretrained T5 and BART (Lewis et al., 2019) as backbones for the model architecture.
HDNO, proposed by (Wang et al., 2020), is a dialogue policy learning method for the context-to-response generation task of Multiwoz2.0 (Budzianowski et al., 2018b). It exploits the hierarchical nature of the dialogue act and response generation tasks by proposing an option-based framework of hierarchical RL and a variational model to learn a latent dialogue act that corresponds to the natural language response. Unlike our method, HDNO, though it highlights the risk of using a sparse metric function such as success rate as the reward, resorts to shaping a proxy reward function: it uses a Markov language model as the proxy reward, learnt independently of the metric function. Our method refrains from reward shaping and is independent of the nature of any under-specified metric function.
Since we learn fine grained turn specific credit assignment, our solution can adapt to other metric function as long as the pairwise reward network is rich enough to factorize them.

Result
We first compare our method against the current state-of-the-art methods on the context-to-response generation task defined by MultiWoz2.0 (Budzianowski et al., 2018b). The results are tabulated in Table:1. We use the CASPI adaptation of DAMD, CASPI(DAMD), for this task. CASPI(DAMD) performs better than the other methods on three of the four performance criteria, i.e., success rate, inform rate, and combined score. HDSA (Chen et al., 2019) has a better BLEU score. This rich expressiveness of natural language by HDSA stems from its use of the large 12-layer BERT model (Devlin et al., 2018).
Secondly, we compare both adaptations of our method, CASPI(DAMD) and CASPI(MinTL), on the end-to-end dialogue task defined by MultiWoz2.0 (Budzianowski et al., 2018b). The results are tabulated in Table:2. CASPI(DAMD), with its lightweight model architecture and no pretraining on any external corpus, is able to outperform all previous methods on all evaluation criteria. This goes to show that using CASPI to shepherd the gradient update process, via sample weights for each dialogue turn, leads to a model that is well aligned with the true objective of the task. CASPI(MinTL), with its robust pretrained model, outperforms CASPI(DAMD) by a large margin. This goes to show the ease of adapting existing methods with CASPI.

Sample Efficiency
Inverse reinforcement learning, coupled with off-policy policy learning and evaluation, is proven to be sample efficient (Thomas & Brunskill, 2016). We argue that CASPI is competitive with other sample efficiency techniques, such as the data augmentation and transfer learning performed by (Zhang et al., 2019) and (Lin et al., 2020) respectively. To test this hypothesis, we evaluate our method against the baselines in a low-sample-complexity regime. For the experimental setup, we adopt the low-resource testing strategy from (Lin et al., 2020). We train our model on 5%, 10%, and 20% of the training data and compare with other baselines on the end-to-end dialogue and context-to-response generation tasks; Tables 3 and 4 list the results. On the end-to-end task, CASPI(MinTL) trained on only 10% of the data is able to outperform the previous state-of-the-art method, MinTL, trained on 100% of the data on two of the three performance metrics. On the context-to-response generation task, CASPI(DAMD) trained on 75% of the data is able to match the 100%-data performance of HDNO. This goes to show that having the right reward function to guide the budget of the gradient update process toward the true objective is important in extremely low-resource settings.

Human Evaluation
Automatic evaluation metrics have their own biases. The true objective of ToD is the human experience while interacting with the dialogue system, which automatic evaluation metrics might fall short of capturing. To this end we conduct a human evaluation of the quality of the generated responses. We define quality by the following criteria: 1) Appropriateness: Are the generated responses appropriate for the given context in the dialogue turn?
2) Fluency: Are the generated responses coherent and comprehensible?
A dialogue turn in the test set is randomly picked. The human evaluators were shown the context leading up to the turn. The predictions for the turn by different models were anonymized and displayed to the evaluators, as illustrated in Fig:6. The human evaluators were asked to give a score between 1 and 5 for appropriateness and fluency, with a score of 5 being the best and 1 the worst. 100 randomly selected dialogue turns were presented to 10 participants. We report the mean and variance of the scores. We compare our model's performance against MinTL (Lin et al., 2020), SimpleTOD (Hosseini-Asl et al., 2020), and DAMD (Zhang et al., 2019). Fig:7 shows the results of the evaluation. CASPI(MinTL) outperforms all other models on the appropriateness score, while the fluency scores of CASPI(MinTL), MinTL, and SimpleTOD are comparable to each other.

Human in the loop training
In the previous section we argued that automatic dialogue evaluation metrics are biased and do not truly reflect the human objective, yet in our method we use these very same dialogue evaluation metrics to learn the reward R(s_t, a_t, g). To bridge this gap, we performed the following human-in-the-loop (HITL) experiment. We first trained a pair of CASPI(MinTL) models with different seeds on 5% of the Multiwoz2.0 dataset. We then used this pair of models to predict on 0.5% of the Multiwoz2.0 train data (40 dialogues) and had a human score these pairs of generated responses relative to each other. We then trained for the reward R(s_t, a_t, g) using pairwise causal reward learning as described in Sec:3.3, where examples in the mini-batch are randomly sampled either from the human-scored examples or from those scored by the automatic evaluation metric, as shown in Fig:5. We then trained a fresh CASPI(MinTL) model on the original 5% of the data and the learnt R(s_t, a_t, g). We performed a human evaluation of the trained model on 24 dialogues from the test set using 3 participants. Fig:8 shows the performance.
Figure: Example of the reward learning process.
Though CASPI(MinTL), using just 5% of the data, outperforms DAMD trained on 100% of the data on 2 out of the 3 automatic evaluation metrics shown in Table:2 and 3, it performs poorly on the human appropriateness score. With the HITL scores in the reward learning, we see a boost in performance on both human evaluation criteria: appropriateness and fluency. The 5%-data CASPI(MinTL)'s human appropriateness score is now comparable to that of 100%-data DAMD. This goes to show the versatility of pairwise causal reward learning: with a sufficiently rich neural network, it can generalize to unknown dialogue evaluation criteria.

REWARDS
In this section we qualitatively analyze the results of pairwise causal reward learning. Fig:9 shows the same conversation between a tourist and an information center agent that we introduced earlier; now we have the reward R(s_t, a_t, g) that pairwise causal reward learning has predicted against each turn. We observe that Turn#3 received the highest reward; retrospectively, we realize that this is the turn in which the transaction happens, which is the crucial and risk-averse turn of the dialogue, and which is captured by the success rate of the automatic evaluation metric. Turn#2 gets the next-best reward, as it captures crucial information needed for the transaction in Turn#3 to happen. Turn#4 gets a reward an order of magnitude lower than Turns#2 and #3 because, niceties aside, it does not contribute much to the success of the conversation. It should be noted that a turn like Turn#4 typically appears in almost all conversations, and in supervised learning it would receive the highest share of the gradient. The learnt reward redistributes the gradient budget in a way that is aligned with the success of the dialogue objective.
Figure 10. Example of agent behaviour in the low-sample regime.

TYPE OF AGENTS
In this section we analyze the types of behaviour CASPI agents sometimes exhibit, especially when trained in the low-sample regime.
Greedy agent: In certain domains, the agent has a tendency to book a service before it has gathered all the required information, or before the user has requested or agreed to book the service. The first example in Fig:10 demonstrates this behaviour. Here the user has requested a taxi; before enough information, such as the destination or time of departure, is gathered, the agent books the taxi. This happens because there are gaps in the automatic evaluation metrics. A low BLEU score with relatively high inform and success rates might indicate greedy agent behaviour. Other reasons for a low BLEU score include lack of diversity in the responses or malformed responses.
Cautious agent: The agent tends to be cautious by providing long-winded replies packed with more information than needed. The agent tends to do this so as not to run the risk of losing reward through the inform rate. This behaviour is demonstrated in the second example in Fig:10. These subtle behaviours demonstrate gaps in the automatic evaluation metrics. They could be weeded out using the human-in-the-loop approach described in Sec:5.3.

Thoughts for future work
The appropriate choice of metric to evaluate a rollout is crucial for learning the intention of the user. A poor choice of metric may lead to inherited bias and the possibility of reward hacking by the policy. One option to mitigate this would be to use humans, or a hybrid of metrics and humans, to choose between a pair of rollouts. On the flip side, the use of humans might be expensive and, in some cases, defeat the optimization for sample complexity this work strived for. We leave this thought for future work to ponder.

Conclusion
In this work we introduced a fine-grained reward learning process that uses an under-specified metric function and expert demonstrations to efficiently learn task-oriented dialogue. We demonstrated the efficacy of our method on the MultiWoz2.0 dataset by outperforming the existing state-of-the-art method with only 10% of the data. We believe the method is generic and can be extended to other NLP tasks.