DSPM-NLG: A Dual Supervised Pre-trained Model for Few-shot Natural Language Generation in Task-oriented Dialogue System

In few-shot settings, fully conveying the semantic information of the dialogue act is a crucial challenge for Natural Language Generation (NLG) in task-oriented dialogue systems. It is noteworthy that NLG and Spoken Language Understanding (SLU) form a natural dual problem pair: if the SLU module can restore the response generated by the NLG module to the corresponding dialogue act, this demonstrates that the response effectively conveys the semantic information of the dialogue act. Based on this idea, a novel Dual Supervised Pre-trained Model for few-shot Natural Language Generation (DSPM-NLG) is proposed to regularize the pre-training process. We adopt a joint model with a dual supervised framework to learn the dual correlation between NLG and SLU from a probabilistic perspective. In addition, a slot-masked strategy is designed to enable the model to focus more effectively on the key slot-value pairs. DSPM-NLG is continuously trained on publicly available, large-scale labeled data, allowing it to gain a thorough understanding of the duality between the two tasks and to enhance the pre-trained model's semantic-control and generalization abilities. Experimental results illustrate that our proposed model demonstrates exceptional performance on the few-shot benchmark dataset, outperforming the previous state-of-the-art results.


Introduction
Task-oriented dialogue systems have been demonstrated to be effective in helping users accomplish various tasks across multiple domains, such as airline ticket booking and restaurant and hotel reservations.

Figure 1: NLG and SLU are two complementary components that form a natural duality. While NLG is the process of generating a response in natural language based on a structured semantic representation (in green), SLU is the act of transforming natural language into a structured semantic representation (in blue). The figure's example system response is "Just make sure, you are looking for an inexpensive hotel in the center area."
A complete task-oriented dialogue system typically consists of four components (Zhang et al., 2020): spoken language understanding (SLU), dialogue state tracking (DST), dialogue policy learning (DPL), and natural language generation (NLG). The NLG module aims to convert the dialogue act generated by DPL into natural language, which can be abstracted as a semantically conditioned language generation task. As depicted in Figure 1, the generated utterance should be sufficient to convey the semantic information of the dialogue act, as well as being fluent, natural, and human-like so as to engage users' attention. As the primary module for user interaction, NLG plays a crucial role in the performance of dialogue systems.
Recently, pre-trained models have revolutionized the field of natural language processing. The introduction of pre-trained models such as GPT-2 (Radford et al., 2019) to the NLG task has resulted in a significant improvement in overall performance (Budzianowski and Vulić, 2019; Wu et al., 2019; Hosseini-Asl et al., 2020; Ham et al., 2020; Yang et al., 2020; Peng et al., 2021). Despite their superior performance on simple domains, these models require a great deal of high-quality labeled data and are challenging to generalize to specific domains.
Nevertheless, acquiring large amounts of domain-specific labeled data in practical scenarios is cost-prohibitive. It is essential that an NLG module be able to generalize effectively with limited domain-specific labeled data in few-shot settings. Recently, a few-shot learning paradigm has emerged that utilizes existing large-scale annotated data to train a pre-trained model such as GPT-2 (Radford et al., 2019), which is subsequently fine-tuned with only a few domain-specific labeled examples to adapt to target domains. The paradigm thereby narrows the gap between pre-trained models and downstream tasks. For instance, Peng et al. (2020) adopted this paradigm and achieved state-of-the-art performance for few-shot NLG. However, in few-shot settings, NLG models are prone to omitting important slot-value pairs, making it difficult to fully convey the semantic information of the dialogue act.
To go beyond this limitation, we explore further enhancing the semantic-control ability of the pre-trained model. It is noteworthy that NLG and SLU form a natural dual problem pair, as illustrated in Figure 1. Ideally, the response generated by the NLG module can be restored to the corresponding dialogue act by the SLU module. The two dual tasks are intrinsically connected through their joint probabilistic correlation. Moreover, SLU can provide an additional supervision signal for NLG, so that the NLG model better focuses on the key slot-value pairs in the dialogue acts. Thus, we explicitly exploit the dual correlation between NLG and SLU to regularize the pre-training process and improve the semantic-control ability of the pre-trained model.
In this paper, we propose a dual supervised pre-trained model for few-shot Natural Language Generation (DSPM-NLG). DSPM-NLG consists of two primary stages: dual supervised pre-training and fine-tuning. In the pre-training stage, the framework of dual supervised learning is introduced to learn the explicit joint probabilistic correlation between NLG and SLU from existing large-scale annotated data. Moreover, a slot-masked strategy is designed that selects the key slot information detected by SLU, thereby constraining the NLG module to focus more on the slot-value pairs in the dialogue act. In the fine-tuning stage, the pre-trained model is fine-tuned with only a few domain-specific labels for adaptation. Experiments demonstrate that the semantic-control and generalization abilities of DSPM-NLG are significantly improved. In general, the major contributions of this paper are described below:
• We propose a novel pre-trained framework for NLG based on dual supervised learning, which explicitly exploits the probabilistic correlation between NLG and SLU to regularize the pre-training process.
• We design a slot-masked strategy that contributes to constraining the NLG module to focus more on the key slot-value pairs contained in the dialogue act.
• We carry out extensive ablation experiments to demonstrate the advantages of the framework. The experimental results demonstrate that our model outperforms the existing state-of-the-art results on the few-shot benchmark dataset.

Related Work
Existing NLG models can be mainly summarized into two major categories. (1) Template-based NLG models (Langkilde and Knight, 1998; Stent et al., 2004) generate responses according to manually developed rules. These models generate responses that convey the semantic information of certain predefined dialogue acts. Nevertheless, handcrafted templates struggle to cover potentially unforeseen dialogue acts, and the generated responses are not always natural.
(2) Statistical NLG models (Wen et al., 2015; Dušek and Jurčíček, 2016; Tran and Nguyen, 2017; Su et al., 2018; Gao et al., 2019; Zhu et al., 2019; Wolf et al., 2019b; Su et al., 2020b,a) generate responses by training on massive annotated data. With the rise of the attention mechanism, more approaches have been proposed, e.g., hierarchical attention networks (Su et al., 2018; Zhu et al., 2019; Chen et al., 2019). Some NLG works then adopted a multi-task learning framework to improve performance (Su et al., 2020b,a). In particular, some scholars exploit the relationship between SLU and NLG to improve the performance of both tasks (Su et al., 2019, 2020a; Zhu et al., 2020; Tseng et al., 2020; Chang et al., 2021). Subsequently, many works introduced pre-trained models (Budzianowski and Vulić, 2019; Edunov et al., 2019; Dai et al., 2019; Ham et al., 2020; Brown et al., 2020; Kale and Rastogi, 2020; Madotto et al., 2020) such as GPT-2, greatly improving the overall performance of NLG. Recently, to deal with the challenge of few-shot learning, data augmentation has been widely applied to NLG. Peng et al. (2020) proposed the SC-GPT model: they pre-train GPT on a large-scale NLG corpus collected from publicly available dialogue datasets and then fine-tune the model on the target domain with a few training instances. Xu et al. (2021) proposed a data augmentation approach that constructs dialogue acts and responses from open-domain dialogues and applies the new data to SC-GPT.
Compared with previous work, we explore the duality between SLU and NLG in the pre-training stage. The differences between the proposed model and previous methods are mainly reflected in two aspects. First, dual supervised learning is applied only in pre-training. Thus, in few-shot settings, our model does not require any SLU-annotated data and adds no extra computation in the fine-tuning and inference stages. It is worth mentioning that our model also avoids error transfer between SLU and NLG at inference time. Second, in the pre-training stage, we collect a large amount of labeled data for SLU and NLG. Training on this large amount of labeled data gives the pre-trained model a strong semantic-control ability, rather than merely learning the relationship between the two tasks in some specific domains to improve the performance of both tasks.

Background
Dual Supervised Learning Framework. The overall architecture of dual supervised learning is shown in Figure 2. Assume the dual tasks of NLG and SLU: the primal NLG task takes a sample from the semantics space X as input and maps it to the natural language space Y, learning a mapping function f(x; θ_{x→y}) parameterized by θ_{x→y}. In contrast, the dual SLU task takes a sample from the natural language space Y as input and maps it to the semantics space X, learning a mapping function g(y; θ_{y→x}) parameterized by θ_{y→x}, where x ∈ X and y ∈ Y. The joint probabilistic duality can be computed as follows:

P(x, y) = P(x)P(y|x) = P(y)P(x|y),  (1)

where P(x), P(y) denote the marginal distributions and P(y|x), P(x|y) are the conditional probabilities.

For any x ∈ X, y ∈ Y, ideally, the conditional distributions of the primal and dual tasks should satisfy the following equality:

P(x)P(y|x; θ_{x→y}) = P(y)P(x|y; θ_{y→x}),  (2)

where θ_{x→y} and θ_{y→x} are the learnable parameters of the models. The core idea of dual supervised learning is to jointly model the two dual tasks by minimizing their loss functions while incorporating the probabilistic duality constraint. A total of three objectives are optimized. The primal NLG task obtains the maximum likelihood estimate of y_i from the labeled input x_i:

min_{θ_{x→y}} (1/M) Σ_{i=1..M} l_NLG(f(x_i; θ_{x→y}), y_i),  (3)

The dual SLU task obtains the maximum likelihood estimate of x_i from the dual input y_i:

min_{θ_{y→x}} (1/M) Σ_{i=1..M} l_SLU(g(y_i; θ_{y→x}), x_i),  (4)

and the probabilistic duality constraint is incorporated:

s.t. P(x)P(y|x; θ_{x→y}) = P(y)P(x|y; θ_{y→x}),  (5)

where l_NLG, l_SLU are loss functions, M is the number of samples, and s.t. denotes the constraint.
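The duality constraint above simply restates that both factorizations of a valid joint distribution must agree. The short sketch below checks this on a toy joint distribution over two dialogue acts and two responses; the numbers are hypothetical and serve only to illustrate that P(x)P(y|x) = P(y)P(x|y) = P(x, y).

```python
# Toy joint distribution P(x, y): hypothetical numbers for illustration only.
joint = {
    ("da1", "resp1"): 0.4, ("da1", "resp2"): 0.1,
    ("da2", "resp1"): 0.2, ("da2", "resp2"): 0.3,
}

def marginal_x(x):
    """P(x): sum the joint over all y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    """P(y): sum the joint over all x."""
    return sum(p for (_, yi), p in joint.items() if yi == y)

def cond_y_given_x(y, x):
    """P(y|x) = P(x, y) / P(x)."""
    return joint[(x, y)] / marginal_x(x)

def cond_x_given_y(x, y):
    """P(x|y) = P(x, y) / P(y)."""
    return joint[(x, y)] / marginal_y(y)

# Both factorizations recover the joint probability for every (x, y) pair.
for (x, y), p_xy in joint.items():
    lhs = marginal_x(x) * cond_y_given_x(y, x)   # P(x)P(y|x)
    rhs = marginal_y(y) * cond_x_given_y(x, y)   # P(y)P(x|y)
    assert abs(lhs - p_xy) < 1e-12 and abs(rhs - p_xy) < 1e-12
```

During training, the constraint is enforced only approximately, since the two models parameterize P(y|x; θ_{x→y}) and P(x|y; θ_{y→x}) independently.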

Task Definition
The goal of NLG is to generate a natural language response containing the dialogue act's semantic information. A dialogue act (DA) includes a type of system action and a set of slot-value pairs; the formal definition of a DA is:

DA = [A, (s_1 = v_1), ..., (s_k = v_k)],

where A indicates the type of system action, such as confirm, inform, request, etc.; k is the number of slot-value pairs, which varies across dialogue acts; and the slot-value pairs indicate the critical structured semantic information of the dialogue act.
The formal definition of NLG is as follows: given a DA consisting of a system action and k slot-value pairs, a response Y = [y_1, y_2, ..., y_n] is generated by the NLG model, where n is the response length. For example, for the DA [confirm, (price range = inexpensive)], the corresponding response is "just to make sure, you are looking for an inexpensive hotel". The format of the SLU labels is as follows: in the utterance "just to make sure, you are looking for an inexpensive hotel", the word "inexpensive" is labeled "B-hotel-pricerange" and every other word is labeled "O"; "B-hotel-pricerange" and "O" are called slots. There is a one-to-one correspondence between a slot and a word.
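As a concrete illustration of these two formats, the sketch below linearizes a DA into a text sequence and assigns one BIO slot label per token. Both helper functions are hypothetical, written only to mirror the definitions above, not taken from the paper's code.

```python
def linearize_da(action, slot_values):
    """Flatten a dialogue act [A, (s1 = v1), ...] into a text sequence."""
    pairs = ", ".join(f"{s} = {v}" for s, v in slot_values)
    return f"{action} ( {pairs} )"

def bio_labels(tokens, value_tokens, slot_name):
    """Assign one slot label per token: 'B-<slot>' on value words, 'O' elsewhere."""
    return [f"B-{slot_name}" if tok in value_tokens else "O" for tok in tokens]

da = linearize_da("confirm", [("pricerange", "inexpensive")])
tokens = "just to make sure , you are looking for an inexpensive hotel".split()
labels = bio_labels(tokens, {"inexpensive"}, "hotel-pricerange")
# Only "inexpensive" receives a valued slot label; every other token gets "O",
# preserving the one-to-one correspondence between slots and words.
```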

Proposed Model
This section introduces the proposed DSPM-NLG model. The training procedure of DSPM-NLG mainly includes the dual supervised pre-training and fine-tuning stages. The overall architecture of DSPM-NLG is shown in Figure 3.

Dual Supervised Pre-training Stage
We inherit the GPT-2 model (Radford et al., 2019) as the original pre-trained model in our approach. GPT-2 is a powerful language model that can be used for several downstream tasks. To enhance the generalization and semantic-control abilities of the pre-trained model, we continuously train GPT-2 on existing large-scale, high-quality annotation pairs (DA, response, slots)1. The pre-training dataset includes annotated training pairs from the MultiWOZ dataset (Eric et al., 2019) and the Schema-Guided Dialogue dataset (Rastogi et al., 2020). The total size of the dual supervised pre-training dataset is approximately 470k samples.
Encoder At the pre-training stage, the DA is pre-processed as a text sequence D. Meanwhile, the response Y is pre-processed by appending a special start token [BOS] and an end token [EOS]. The input of our model is the concatenated sequence

S = [D; [BOS], y_1, ..., y_n, [EOS]],

where m is the length of the DA and n is the length of the response. The output of the last hidden layer is

H = [h_1, ..., h_{m+n}],

where h_{m+n} denotes the final hidden state of the [EOS] token. In pre-training, the loss value is only computed for the hidden-layer outputs corresponding to Y. The probabilistic duality constraint

P(x)P(y|x; θ_{x→y}) = P(y)P(x|y; θ_{y→x})

is incorporated. The new loss value of NLG is computed as:

ℓ_NLG-dual = ℓ_NLG + λ_{x→y} ℓ_duality,

where λ_{x→y} is a hyper-parameter and ℓ_duality denotes the regularization term. The regularization term is computed as:

ℓ_duality = (log P̂(x) + log P(y|x; θ_{x→y}) − log P̂(y) − log P(x|y; θ_{y→x}))².

Note that the true marginal distributions P(x) and P(y) are difficult to obtain. As an alternative, we replace them with empirical marginal distributions P̂(x) and P̂(y): P̂(y) is calculated by GPT-2 (a language model), and P̂(x) is estimated from the statistics of the percentage of each slot in the collected labeled data. The purpose of the regularization term is to minimize the gap between P̂(x)P(y|x; θ_{x→y}) and P̂(y)P(x|y; θ_{y→x}). Thus, dual supervised learning enhances the supervised learning process through the dual structure between NLG and SLU. The final NLG loss function is formulated as:

L = (1/M) Σ_{i=1..M} [ℓ_NLG(f(x_i; θ_{x→y}), y_i) + λ_{x→y} ℓ_duality(x_i, y_i)],

where M is the number of samples. The regularization term ℓ_duality is different from the SVM or L1 regularization terms, which depend only on the model; ℓ_duality in dual supervised learning is both model- and data-dependent. During pre-training, each training sample contributes to the regularization term, and the probability distribution of SLU contributes to regularizing the NLG model.
Slot-masked Strategy The slots follow the beginning-inside-outside (BIO) annotation standard (Athiwaratkun et al., 2020) in the SLU task. For example, in the utterance "just to make sure, you are looking for an inexpensive hotel", the word "inexpensive" is labeled "B-hotel-pricerange" and all other words are labeled "O". We find that most slot labels in SLU are the non-value slot "O": according to our statistics, the number of non-value slot labels ("O") is more than ten times that of the valued slots (e.g., "B-hotel-pricerange"). Yet it is the valued slots (not the "O" slots) that contain the critical semantic information and are of great significance. Therefore, a slot-masked strategy is designed to select the vital slots detected by SLU: when calculating the loss value, the model considers only the valued slots, which makes it focus better on the key slots detected by SLU.
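A minimal sketch of the slot-masked loss, assuming a token-level negative log-likelihood in which "O"-labeled positions are masked out; the per-token probabilities below are hypothetical model outputs, not values from the paper.

```python
import math

def slot_masked_nll(token_probs, labels):
    """Mean negative log-likelihood over valued (non-"O") slot labels only.

    token_probs[i] is the probability the model assigns to the gold label
    at position i; positions labeled "O" are masked out of the loss.
    """
    kept = [p for p, label in zip(token_probs, labels) if label != "O"]
    if not kept:
        return 0.0  # no valued slots in this utterance
    return -sum(math.log(p) for p in kept) / len(kept)

labels = ["O", "O", "B-hotel-pricerange", "O"]
probs = [0.9, 0.8, 0.5, 0.7]
loss = slot_masked_nll(probs, labels)  # only position 2 contributes
```

Masking the dominant "O" class this way keeps the supervision signal concentrated on the slot-value pairs that carry the dialogue act's semantics.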

Fine-tuning Stage
We fine-tune DSPM-NLG on limited amounts of domain-specific labeled data for adaptation. The fine-tuning procedure follows standard supervised learning for NLG in few-shot settings. The loss value of NLG is computed as:

ℓ_NLG = −(1/M) Σ_{i=1..M} log P(y_i | x_i; θ_{x→y}).

It is worth mentioning that dual supervised learning is not applied in the fine-tuning stage, which avoids error transfer between SLU and NLG.

Automatic Metrics In this paper, we follow previous evaluation metrics to evaluate the quality of the generated responses: the BLEU score and the slot error rate (ERR) (Wen et al., 2015). The BLEU score evaluates the fluency and naturalness of the generated response, and ERR evaluates whether the generated response contains the semantic information of the dialogue act: ERR = (m_slot + r_slot)/k, where k is the number of slots in a dialogue act, and m_slot and r_slot denote the number of missing and redundant slots in the given realization, respectively.
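The ERR definition above can be sketched directly. The slot matching here is a naive substring check rather than the matching used in the actual evaluation scripts, and r_slot is fixed at zero since only the gold slot values are available in this toy setup.

```python
def slot_error_rate(response, gold_values):
    """ERR = (m_slot + r_slot) / k for one dialogue act.

    m_slot: gold slot values not realized in the response (substring check).
    r_slot: redundant slots; set to 0 here because detecting extra slots
    would require the full slot ontology, which this sketch omits.
    """
    k = len(gold_values)
    m_slot = sum(1 for v in gold_values if v.lower() not in response.lower())
    r_slot = 0
    return (m_slot + r_slot) / k

resp = "just to make sure, you are looking for an inexpensive hotel"
err = slot_error_rate(resp, ["inexpensive"])  # all slots realized -> ERR = 0
```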

Experimental Setup
Human Evaluation We conduct human evaluations of the different models. We randomly select 100 responses generated by each model for human evaluation in the restaurant domain. Three workers are invited to independently rate the responses generated by each model according to the rules of Peng et al. (2020). The workers are required to judge each response from 1 (bad) to 3 (good) in terms of informativeness and naturalness. Finally, we adopt the average score given by the three workers as the final score of each response.
2 See Appendix A for more details of the two datasets.

GPT-2: The pre-trained GPT-2 (Radford et al., 2019) is directly fine-tuned on the domain-specific labeled data.

Model information
SC-GPT (strong baseline): Peng et al. (2020) regard the structured dialogue act as a sequence of tokens and feed the sequence to the generation model. We apply the obtained annotated data to SC-GPT as a strong baseline system.

Results and Analysis
We compare our model with previous state-of-the-art models. The overall results of the NLG experiments on the FEWSHOTWOZ dataset are shown in Table 1. Although the strong baseline model achieves solid results, our model outperforms the previous state-of-the-art performance in most domains. On FEWSHOTWOZ, compared with the SC-GPT baseline, DSPM-NLG obtains a 3.82% absolute improvement in BLEU and a 2.76% absolute reduction in ERR in the restaurant domain. As shown in Table 2, the DSPM-NLG model also achieves better performance on the human evaluation metrics, and the results show the same trend as the automatic metrics. The BLEU results on FEWSHOTSGD are shown in Table 3. The results demonstrate that DSPM-NLG reaches stable performance and brings practical value to real-world applications. More importantly, we would like to explore the reasons for the improved performance of DSPM-NLG 3. Therefore, extensive ablation experiments are conducted to analyze the effectiveness of the proposed model.

Ablation Study
We provide integrated analysis of the critical components of DSPM-NLG to gain detailed insights.

Effect of jointly modeling NLG and SLU. From the results, JM-NLG performs better than SC-GPT in some domains. In the pre-training stage, JM-NLG adopts a multi-task learning network that jointly trains the two tasks. The loss function of JM-NLG not only learns the implicit correlations between the tasks but also provides additional supervision signals, which constrain the joint model to better generate the slot-value pairs of the dialogue act. However, the model takes advantage only of the implicit association between the two tasks; thus, the improvement of JM-NLG is slight.
Effect of the dual supervised pre-trained model. The experimental results show that, compared with the baseline models, DSPM-NLG-sm significantly improves both BLEU and ERR in most domains. The main reason is that the dual supervised learning framework models the explicit joint probabilistic correlation between SLU and NLG. In the pre-training stage, the pre-trained model is continuously trained on large-scale datasets annotated with dialogue acts, responses, and slots, which helps the dual supervised learning framework learn the duality between SLU and NLG; moreover, the objective function can be better optimized with large amounts of data. The results reveal that the dual structure strengthens the supervised learning process.

3 The parameter settings of the DSPM-NLG model are recorded in Appendix B.
Effect of the slot-masked strategy. To further verify the effectiveness of the designed slot-masked strategy, a statistical analysis is performed on the pre-training dataset for the SLU task. We find that the number of non-value slot labels ("O") is more than ten times that of the valued slots. Although the loss function of SLU assigns a small loss value to the "O"-labeled slots, when the number of "O" slots is large, they may have a negative impact on the model. The slot-masked strategy masks the "O"-labeled slots and selects the valued slot information; therefore, the performance of JM-NLG and DSPM-NLG is further improved. In multi-task learning, the loss value of SLU has a significant impact on model performance, so JM-NLG achieves good performance, and we expected a considerable enhancement for DSPM-NLG as well. However, the experimental results show that the performance improvement of DSPM-NLG is limited. To explain this, we note that the dual regularization term is related to the loss value of SLU, and the value of the hyper-parameter λ in the regularization term is generally small. Although the strategy is reasonable and feasible, the impact of the slot-masked strategy on DSPM-NLG is not significant.

In-depth Analysis
The generalizability and semantic controllability learned by the pre-trained model are critical to the performance of the model in the fine-tuning stage for few-shot learning. Next, experiments are conducted to analyze the generalization and semantic-control abilities learned by DSPM-NLG.
Generalizability In the restaurant domain, we split the test set into two subsets: seen dialogue acts (DAs) and unseen dialogue acts. Dialogue acts that appear in the training set are called seen DAs; otherwise, they are marked as unseen DAs. The performance on unseen DAs reflects the generalization ability of the model well. The performance of the different models is compared on the seen and unseen DAs, as shown in Table 4. On both subsets, DSPM-NLG yields higher BLEU and lower ERR, performing consistently better than SC-GPT and JM-NLG. Moreover, the improvement is more obvious on the unseen subset. The experiments demonstrate that DSPM-NLG has a strong generalization ability.
Controllability (1) We compare the generated responses of the different models. (2) We analyze the performance of the different models on ERR. As shown in Figure 5, we select several cases from the FEWSHOTWOZ test set to analyze the differences in generated responses between our method and the baseline models. We find that these NLG models make three types of errors in conveying dialogue acts: wrong slot-value pairs, redundant slot-value pairs, and omissive slot-value pairs. In the first two cases, SC-GPT generates wrong and redundant slot-value pairs, respectively. The frequency of the word "restaurant" in the dataset is relatively high, and the SC-GPT baseline learns more about this data feature than about the semantic structure of dialogue acts. Consequently, in the baseline model, "cafes" is mislabeled as "restaurants", and "accessories" and "pricerange" are redundant. DSPM-NLG correctly conveys the semantic information of the dialogue act. This further indicates that DSPM-NLG is capable of constraining the NLG task with the semantic information detected by SLU, so that our model conveys dialogue acts more accurately. In the fourth case, the baseline model misses a slot-value pair; for the slots "goodformeal" and "address", our model generates them accurately. We think the main reason may be that the key slot information detected by SLU can supervise whether the generated response contains the slot-value pairs of the dialogue act, and the slot-masked strategy can accurately select the key slot information detected by SLU to restrict the slots that need to be generated. These results indicate the correctness of exploring the dual correlation between SLU and NLG.
To further quantitatively analyze the three types of errors (wrong, redundant, omissive) in conveying dialogue acts, we counted the percentage of each error type in the restaurant domain for SC-GPT and DSPM-NLG. The results are shown in Table 5. We found that SC-GPT is prone to omitting important slot-value pairs contained in dialogue acts. In particular, when the number of slot-value pairs in a dialogue act is greater than 4, omission errors are more serious. Compared with the baseline model, the three types of errors of the DSPM-NLG model are reduced by 1.55%, 2.33%, and 7.75%, respectively. The experimental results reflect that our model effectively alleviates the three types of errors in conveying dialogue acts. In particular, for omissive slot-value pairs, the error rate of DSPM-NLG drops significantly. The main reason may be that the joint probability between SLU and NLG constrains the model to accurately convey the semantic information of the dialogue act. In addition, the slot-masked strategy contributes to the reduction of wrong slot-value pairs. When these errors are reduced, ERR decreases and the BLEU score improves. The experimental results demonstrate that the DSPM-NLG model has a stronger semantic-control ability than the baseline model.

Effects of λ. In the dual supervised learning framework, the setting of the Lagrange parameter λ greatly affects the model. Therefore, a sensitivity analysis of λ is conducted. As shown in Table 6, we vary λ and report the performance for each value. From the results, λ = 0.1 is the optimal value for obtaining the best performance on this dataset. When λ = 0, the training of the model is the standard supervised learning process. We can see that, within a relatively large interval of λ, the performance of dual supervised learning is stronger than that of standard supervised learning.

Conclusion
In this paper, we proposed a novel dual supervised pre-trained model for NLG. We explored the duality between SLU and NLG from the perspective of joint probability in the pre-training stage. A slot-masked strategy is designed to constrain the DSPM-NLG model to focus on the slot-value pairs in dialogue acts. Thus, the proposed model endows the NLG module with strong semantic-control and generalization abilities. Experiments on two benchmark datasets show significant improvements over previous state-of-the-art models in both automatic and human evaluations.

Limitations
In the pre-training stage, the performance of DSPM-NLG depends on a large amount of annotated data. Despite the improved results, the annotated data is directly obtained from existing publicly available datasets, which has two main limitations: limited data volume and lack of data diversity. This limits scalability when dealing with complex tasks. When the volume and diversity of the annotated data are rich enough, DSPM-NLG can fully learn the joint probability and mapping between the dual tasks, and compared with the baseline model, its semantic controllability and generalization ability will improve more significantly.

B Experiment Setup
Using the Huggingface Transformers public library (Wolf et al., 2019a), we implement our model in PyTorch. The GPT-2-Medium model with 24 layers and 16 attention heads is chosen as the backbone, and byte pair encoding (Sennrich et al., 2015) is used for tokenization. The model uses Adam (Kingma and Ba, 2014) as the optimizer with an initial learning rate of 5e-5 and a scheduler with linear warm-up to adjust the learning rate. We set the maximum sequence length to 80 and the batch size to 8. The GPU used for training is an NVIDIA Quadro RTX 8000-64G. In the pre-training stage, we jointly train GPT-2 on SLU and NLG until observing no obvious improvement in validation loss, or for up to 20 epochs, and we save the model parameters for the fine-tuning stage.
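For reference, the hyper-parameters listed above can be collected into a single configuration dictionary; the key names are illustrative and not taken from any released code.

```python
# Training configuration reported in this appendix (key names are illustrative).
config = {
    "backbone": "gpt2-medium",      # 24 layers, 16 attention heads
    "tokenization": "byte pair encoding",
    "optimizer": "Adam",
    "learning_rate": 5e-5,          # initial value, linear warm-up schedule
    "max_seq_length": 80,
    "batch_size": 8,
    "max_pretrain_epochs": 20,      # or early stop on validation loss
}
```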

Figure 2: Illustration of dual supervised learning.

Figure 4: Experimental results of our models and baseline models under different training data sizes.

Figure 5: Examples of generated responses from different models on FEWSHOTWOZ. The three types of errors correspond to three colors (better viewed in color): blue text means Wrong, green text denotes Redundant, and red text indicates Omissive.

Table 6 (data): DSPM-NLG — λ = 0: BLEU 34.08, ERR 6.08; λ = 0.1: BLEU 38.72, ERR 3.76; λ = 0.01: BLEU 35.73, ERR 4.63; λ = 0.001: BLEU 34.6, ERR 5.75.

where h_i ∈ H_y denotes the final hidden state of the i-th token in H_y. For the NLG task, we utilize the final hidden states H_y to generate responses, and the probability distribution P(y′ | x; θ_{x→y}) of the generated tokens is calculated by:

P(y′ | x; θ_{x→y}) = softmax(h_i W_U + b_U),

where f(x; θ_{x→y}) is the mapping function for NLG; W_U ∈ R^{d×|U|} and b_U ∈ R^{|U|} are the weight matrix and bias vector, respectively; d is the dimension of the hidden state vector; |U| is the length of the vocabulary; and θ_{x→y} is the learnable parameter of the model. For the SLU task, we feed the final hidden states H_y into another trainable linear layer, which predicts the slot of the corresponding input token. The probability distribution P(x′ | y; θ_{y→x}) of the slots is then calculated by:

P(x′ | y; θ_{y→x}) = softmax(h_i W_S + b_S),

where g(y; θ_{y→x}) is the mapping function for SLU; W_S ∈ R^{d×|S|} and b_S ∈ R^{|S|} are the weight matrix and bias vector, respectively; |S| is the number of slot labels; and θ_{y→x} is the learnable parameter of the model.

Loss Function In this section, we introduce the joint training procedure with dual supervised learning; l_NLG and l_SLU are the loss functions from which the loss values of NLG and SLU are computed.

Table 1: Experimental results of our models and baseline models. The experimental results with the highest value are bolded. "JM-NLG" adopts a multi-task learning method to jointly model NLG and SLU in pre-training.

Table 2: Human evaluation results. The DSPM-NLG model also achieves better performance on the human evaluation metrics, and the results show the same trend as the automatic evaluation metrics.

Table 3: Results in BLEU on the FEWSHOTSGD dataset. The DSPM-NLG model was pre-trained using only the MultiWOZ dataset.

Table 4: Experimental results of our models and baseline models on the seen and unseen dialogue acts.

Table 5: Statistics of generated responses for the three types of errors in conveying dialogue acts.

Table 6: Validation BLEU and ERR for different values of λ.

Table 7: Data statistics of the two datasets.