UniGDD: A Unified Generative Framework for Goal-Oriented Document-Grounded Dialogue

The goal-oriented document-grounded dialogue aims at responding to the user query based on the dialogue context and supporting document. Existing studies tackle this problem by decomposing it into two sub-tasks: knowledge identification and response generation. However, such pipeline methods would unavoidably suffer from the error propagation issue. This paper proposes to unify these two sub-tasks via sequentially generating the grounding knowledge and the response. We further develop a prompt-connected multi-task learning strategy to model the characteristics and connections of different tasks and introduce linear temperature scheduling to reduce the negative effect of irrelevant document information. Experimental results demonstrate the effectiveness of our framework.


Introduction
Recent years have seen significant progress in goal-oriented dialogues (Loshchilov and Hutter, 2017; Wen et al., 2017; Wu et al., 2019; Hosseini-Asl et al., 2020), which aim at assisting end users in accomplishing certain goals via natural language interactions. However, due to the lack of external knowledge, most goal-oriented dialogue systems are restricted to providing information that can only be handled by given databases or APIs (Kim et al., 2020) and completing certain tasks in a specific domain such as restaurant booking. To address this challenge, goal-oriented document-grounded dialogue has been proposed to leverage external documents as the knowledge source to assist the dialogue system in satisfying users' diverse information needs (Feng et al., 2020; Wu et al., 2021).
[Figure 1: An example of the goal-oriented document-grounded dialogue problem. Panels: Dialogue Context; Supporting Document. Text shown in the figure: "I would like to renew my Driving School License, when is the right time to do so?" (user query); "Renewal of a Driving School License must be performed between 30 and 60 days before the expiration date as seen on your license."; "Each time you renew your license, it is renewed for two years."]
As shown in Figure 1, the goal-oriented document-grounded dialogue problem is commonly formulated as a sequential process including two sub-tasks: knowledge identification (KI) and response generation (RG) (Feng, 2021). Given the dialogue context and supporting document, knowledge identification aims to identify a text span in the document as the grounding knowledge for the next agent response, which is often formulated as a conversational reading comprehension task (Feng, 2021; Wu et al., 2021). Response generation then aims at generating a proper agent response according to the dialogue context and the selected knowledge. Therefore, one straightforward solution for this problem is to use two models to conduct KI and RG in a pipeline manner (Daheim et al., 2021; Kim et al., 2021; Xu et al., 2021; Chen et al., 2021). However, such pipeline methods fail to capture the interdependence between KI and RG. As a result, error propagation is a serious problem. The problem is more pronounced in low-resource scenarios, where accurate knowledge identification is difficult due to limited data, making it harder to generate appropriate responses.
To address the aforementioned issue, we propose a Unified generative framework for Goal-oriented Document-grounded Dialogue (UniGDD). Given the dialogue context and associated document, instead of treating KI and RG as two separate processes, we tackle them jointly by sequentially generating the grounding knowledge and the agent response. Therefore, the inherent dependencies between these two sub-tasks can be naturally modeled. On one hand, the generation of the agent response depends not only on the dialogue context and external document but also on the identified knowledge, forcing the model to focus on the specific knowledge. On the other hand, the generation of the grounding knowledge receives the supervision signal from the agent response during training, leading to more accurate knowledge identification.
Although KI and RG can be unified with the proposed generative method, they have different characteristics. Generating the grounding knowledge is similar to copying appropriate sentences from the document, while generating the response needs more effort to make the response coherent with the dialogue and consistent with the grounding knowledge. Therefore, in addition to the main task that uses the concatenation of the grounding knowledge and response as the target sequence, we introduce the generation of the grounding knowledge and the generation of the response as two auxiliary tasks in the same framework, forcing the model to capture their characteristics and perform well on each of them. Moreover, inspired by the recent success in prompt learning for pre-trained models (Li and Liang, 2021; Lester et al., 2021), we design prompts for these three tasks to guide the model on what to generate for each task. These prompts naturally connect the tasks by indicating to the model that each auxiliary task aims to generate a part of the target sequence of the main task. Through this prompt-connected multi-task learning strategy, the model can capture the characteristics of different tasks as well as exploit the connections between them.
In addition, for a particular user query in the goal-oriented dialogue, the selected knowledge and generated response need to be specific, while the generation conditions on a relatively long document. Thus, much information in the input document is irrelevant. To tackle this problem, we introduce linear temperature scheduling to make the attention distribution to the input document gradually sharper during the training process in order to enable the model to learn to pay more attention to the relevant content.
Our contributions are summarized as follows: (1) We propose a unified generative framework for the goal-oriented document-grounded dialogue. (2) We develop a prompt-connected multi-task learning strategy to exploit the characteristics and connections of different tasks and introduce linear temperature scheduling to enable the model to pay more attention to relevant information. (3) Our framework advances state-of-the-art methods on the concerned task, especially in low-resource scenarios.

Our UniGDD framework
UniGDD is a multi-task generative framework for the goal-oriented document-grounded dialogue problem.
Main Task Given the dialogue context $C = (u_1, a_1, \ldots, u_{t-1}, a_{t-1}, u_t)$ and grounding document $D$, where $u_i$ is the i-th user utterance and $a_i$ is the i-th agent utterance, our main task aims to generate the target sequence $Y = (k_t, a_t)$, where $k_t$ is the grounding knowledge from $D$ and $a_t$ is the response to $u_t$. Specifically, for the example in Figure 1, the input consists of the task prompt, the dialogue context, and the supporting document, and the output consists of the grounding knowledge followed by the agent response. We use different special tokens to identify different elements in the input and output. For example, we add "<user>" in front of each user utterance, "<agent>" in front of each agent utterance, and "<grounding>" in front of the grounding knowledge. The prompt "generate <grounding> then <agent>:" is added to the dialogue context and supporting document to form the input and guide the model to generate the grounding knowledge and the response in order. The input-to-target generation can be modeled with a pre-trained encoder-decoder model $M : (C, D, TP) \rightarrow (k_t, a_t)$ such as T5 (Raffel et al., 2020), where $TP$ is the task prompt.
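As a minimal sketch of how the main-task input and target could be assembled: the special tokens and the prompt string below are quoted from the description above, while the order in which the pieces are concatenated, the "<document>" marker, and the helper names are assumptions made for illustration only. The example strings are toy values loosely based on Figure 1.

```python
# Sketch of main-task sequence construction.
# From the text: "<user>", "<agent>", "<grounding>" and the prompt string.
# Assumed here: concatenation order, the "<document>" marker, helper names.
MAIN_PROMPT = "generate <grounding> then <agent>:"


def format_context(turns):
    """Prefix each utterance with its speaker token.

    turns: list of (speaker, utterance) pairs, speaker in {"user", "agent"}.
    """
    return " ".join(f"<{speaker}> {utterance}" for speaker, utterance in turns)


def build_main_example(turns, document, grounding, response):
    """Return (source, target) strings for the main task."""
    source = f"{MAIN_PROMPT} {format_context(turns)} <document> {document}"
    target = f"<grounding> {grounding} <agent> {response}"
    return source, target


# Toy values (hypothetical, not the exact strings of Figure 1).
source, target = build_main_example(
    turns=[("user", "When is the right time to renew my license?")],
    document="Renewal must be performed between 30 and 60 days before expiration.",
    grounding="Renewal must be performed between 30 and 60 days before expiration.",
    response="You should renew 30 to 60 days before it expires.",
)
```

The target keeps the grounding knowledge first and the response second, so a left-to-right decoder conditions the response on the knowledge it has just generated.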

Prompt-Connected Multi-Task Learning
We introduce two auxiliary tasks to steer our framework to model the respective characteristics of knowledge identification and response generation. Given the dialogue context $C$ and grounding document $D$, these two tasks aim to generate the grounding knowledge $k_t$ and the response $a_t$, respectively, with the same model $M$. As depicted in Figure 2, we construct the prompts "generate <grounding>:" and "generate <agent>:" for them. These prompts indicate to the model that the goals of the two auxiliary tasks are to generate the first part and the second part of the target sequence of the main task, respectively. As a result, the connections between different tasks are naturally modeled. Instead of using discrete language phrases, we randomly initialize the embeddings of the special tokens in the prompts and train them end-to-end to better encode the characteristics and connections of these tasks.
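The relationship between the three tasks can be sketched as follows. The three prompt strings are quoted from the text; the record layout, the function name, and the choice to keep the "<grounding>" and "<agent>" markers in the auxiliary targets are assumptions for illustration.

```python
# Sketch: one dialogue instance expands into three training examples.
# Prompts are from the text; everything else is a hypothetical layout.
PROMPTS = {
    "main": "generate <grounding> then <agent>:",
    "grounding": "generate <grounding>:",
    "response": "generate <agent>:",
}


def build_multitask_examples(context, document, grounding, response):
    """Return the main-task example and the two auxiliary-task examples."""
    targets = {
        "main": f"<grounding> {grounding} <agent> {response}",
        "grounding": f"<grounding> {grounding}",
        "response": f"<agent> {response}",
    }
    return [
        {
            "task": task,
            "source": f"{prompt} {context} {document}",
            "target": targets[task],
        }
        for task, prompt in PROMPTS.items()
    ]


examples = build_multitask_examples(
    context="<user> toy question",
    document="toy document",
    grounding="toy span",
    response="toy answer",
)
```

Under these assumptions, concatenating the two auxiliary targets reproduces the main-task target, which is the connection the prompts are meant to signal.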
Linear Temperature Scheduling For a specific user query in the dialogue, much of the document content is irrelevant. To force the model to pay less attention to the irrelevant parts, we propose a linear temperature scheduling strategy that makes the cross-attention distribution gradually sharper during the training process. Specifically, we design the softmax function in the cross-attention module of each decoder layer as follows:
$$a_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}, \qquad \tau = \tau_s + (\tau_e - \tau_s) \cdot \frac{S_c}{S_{total}}$$
where $a_i$ is the attention weight for the i-th input token, $z_i$ is the logit for the i-th input token, $S_c$ is the current training step, $S_{total}$ is the total number of training steps, $\tau_s$ and $\tau_e$ are the starting and ending temperatures respectively, $\tau_e < \tau_s$, and $0 < \tau_e < 1$. Compared with the original cross-attention module, the ending temperature $0 < \tau_e < 1$ leads to a sharper attention distribution, giving more attention weight to the relevant content.
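A minimal sketch of the schedule, assuming the temperature is interpolated linearly from the starting to the ending value over training and divides the cross-attention logits before the softmax. The function names and the default ending temperature of 0.7 are illustrative choices, not values confirmed by the paper.

```python
import math


def linear_temperature(step, total_steps, tau_start=1.0, tau_end=0.7):
    """Interpolate linearly from tau_start (step 0) to tau_end (last step)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return tau_start + (tau_end - tau_start) * progress


def tempered_softmax(logits, tau):
    """Softmax over logits divided by tau; tau < 1 sharpens the distribution."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
early = tempered_softmax(logits, linear_temperature(0, 100, tau_end=0.5))
late = tempered_softmax(logits, linear_temperature(100, 100, tau_end=0.5))
```

With these toy logits, the largest attention weight is higher at the end of training than at the start, which is the sharpening effect described above.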
Training The model is trained with a maximum likelihood objective. Given the training example $e = (C, D, TP, Y)$, the objective $L_\theta$ is defined as
$$L_\theta = -\sum_{i=1}^{n} \log P_\theta(Y_i \mid Y_{<i}, C, D, TP)$$
where $\theta$ denotes the model parameters, $TP$ is the task prompt, $Y$ is the target sequence, and $n$ is the length of $Y$. We mix the data of the main task and two auxiliary tasks for training.
Inference After training, for each pair of dialogue context and document $(C, D)$, we generate the target sequence of the main task to obtain the grounding knowledge $k_t$ and the response $a_t$.
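The two steps above can be sketched in a few lines. The first function computes the negative log-likelihood of one target sequence from per-token probabilities; the second recovers the grounding knowledge and the response from a generated main-task sequence. The output layout "<grounding> ... <agent> ..." and both function names are assumptions for illustration.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of a target sequence.

    token_probs: the model probability assigned to each gold target token,
    conditioned on the previous target tokens, context, document, and prompt.
    """
    return -sum(math.log(p) for p in token_probs)


def split_main_output(generated):
    """Split an assumed '<grounding> k_t <agent> a_t' string into (k_t, a_t)."""
    body = generated.replace("<grounding>", "", 1)
    knowledge, _, response = body.partition("<agent>")
    return knowledge.strip(), response.strip()


loss = sequence_nll([0.5, 0.5])
knowledge, response = split_main_output("<grounding> toy span <agent> toy answer")
```

If the generated text contains no "<agent>" marker, this sketch returns an empty response rather than raising an error; a real system would need to decide how to handle such outputs.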

Experimental Setup
Dataset We conduct experiments on the goal-oriented document-grounded dialogue dataset Doc2Dial (Feng, 2021), which is adopted by the DialDoc21 shared task. It contains 3,474 dialogues with 44,149 turns for training and 661 dialogues with 8,539 turns for evaluation.
Evaluation Metrics Following Feng (2021), we use Exact Match (EM) and token-level F1 for knowledge identification and BLEU (Papineni et al., 2002; Post, 2018) for response generation.
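For reference, Exact Match and token-level F1 for span prediction are commonly computed as in the sketch below. The normalization used here (lowercasing and whitespace collapsing) is an assumption; the official evaluation script may normalize text differently, so scores from this sketch are not guaranteed to match reported numbers.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only a perfectly identified span, while F1 gives partial credit when the predicted grounding overlaps the reference.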
Baselines For knowledge identification, we compare UniGDD with several strong baselines, including BERTQA (Devlin et al., 2019), BERT-PR (Daheim et al., 2021), RoBERTa-PR (Daheim et al., 2021), Multi-Sentence (Wu et al., 2021), and DIALKI (Wu et al., 2021). These models formulate knowledge identification as a machine reading comprehension task and extract the grounding span from the document. For response generation, we compare UniGDD with several pipeline methods, including DIALKI+BART (Wu et al., 2021), which uses DIALKI to conduct knowledge identification followed by BART (Lewis et al., 2020) to conduct response generation, and RoBERTa-PR+BART (Daheim et al., 2021). We also build a strong baseline model, RoBERTa+T5, which uses the same pre-trained generative model as ours.
Implementation Details We report results of UniGDD with two model sizes: UniGDD-base and UniGDD-large, which are initialized with pre-trained T5-base and T5-large models (Raffel et al., 2020), respectively. We adopt the implementation from Hugging Face Transformers (Wolf et al., 2020). We set the maximum input length to 2560 tokens and truncate any longer sequence. For training, we use the AdamW (Loshchilov and Hutter, 2019) optimizer with an initial learning rate of $10^{-4}$ and a linear learning rate decay scheduler. We train for 10 epochs for single-task learning and 5 epochs for multi-task learning. For decoding, we use beam search with a beam size of 2. For linear temperature scheduling, we set the starting temperature $\tau_s = 1$ and choose the best ending temperature from {0.5, 0.6, 0.7, 0.8, 0.9}. For our constructed baseline RoBERTa+T5 for response generation, we use RoBERTa-large and T5-base and adopt the implementation from the DialDoc21 shared task.
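Two of the settings above can be illustrated without any library dependency. This sketch assumes the linear decay starts at the initial learning rate with no warmup and reaches zero at the final step, and that over-long inputs are truncated from the right; neither detail is stated in the text, and the function names are hypothetical.

```python
MAX_INPUT_LENGTH = 2560  # maximum input length from the text
INITIAL_LR = 1e-4  # initial learning rate from the text


def linear_decay_lr(step, total_steps, initial_lr=INITIAL_LR):
    """Learning rate after `step` updates (assumed: no warmup, decay to 0)."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return initial_lr * remaining


def truncate(token_ids, max_length=MAX_INPUT_LENGTH):
    """Keep the first max_length tokens (assumed: truncation from the right)."""
    return token_ids[:max_length]
```

Because the prompt and dialogue context are short compared with the document, right truncation under this assumption would mainly cut the tail of long documents.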

Results
The results on knowledge identification and response generation are shown in Table 1 and Table 2, respectively. Our UniGDD framework outperforms all the baselines on the two sub-tasks. On the knowledge identification task, UniGDD-base obtains results comparable to previous state-of-the-art methods. With a larger model size, UniGDD-large achieves new state-of-the-art performance. On the response generation task, UniGDD obtains a marked improvement over all pipeline methods. This supports our assumption that our unified generative framework can alleviate the error propagation problem of pipeline approaches.
Effect of Prompt-Connected Multi-task Learning (PCMTL) and Linear Temperature Scheduling (LTS) To verify the effectiveness of PCMTL and LTS, we first remove PCMTL (i.e., training with the main task only), and the performance of UniGDD-base on two tasks decreases.
Effect of Connected Prompts (CP) To examine whether CP can capture the connections of different tasks, we use an alternative approach that employs task-independent prompts "<Task1>:", "<Task2>:", and "<Task3>:" to specify each task for comparison. As in the case of CP, we randomly initialize the embeddings of these three special tokens. With these prompts, UniGDD-base obtains 64.9 EM, 76.2 F1, and 42.3 BLEU, which is worse than the results obtained with CP. This indicates that CP enables the model to take advantage of the connections between the three tasks.
Low-Resource Setting To evaluate the model in low-resource scenarios, we randomly shuffle the training set and then take 1/32, 1/16, 1/8, and 1/4 of the data for training. Figure 3 shows the results of UniGDD-base and the best-performing pipeline baseline RoBERTa-large+T5-base on the four low-resource training splits. Generally, our framework performs substantially better than the pipeline method on both tasks. Particularly, when there is only 1/32 training data, UniGDD-base obtains more than 20 and 10 absolute points improvement over the pipeline approach on EM and BLEU, respectively.
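The low-resource splits can be sketched as follows, assuming a single seeded shuffle from which each fraction takes a prefix, so the smaller subsets are nested inside the larger ones. The seed, the nesting, the rounding, and whether sampling is done over dialogues or turns are all assumptions not specified in the text.

```python
import random


def low_resource_splits(examples, fractions=(1 / 32, 1 / 16, 1 / 8, 1 / 4), seed=0):
    """Shuffle once, then take a prefix of the shuffled data per fraction."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return {
        fraction: shuffled[: int(len(shuffled) * fraction)]
        for fraction in fractions
    }


# Toy data: 3200 integer ids standing in for training examples.
splits = low_resource_splits(range(3200))
```

Taking prefixes of one shuffle keeps the comparison across fractions consistent, since every larger split contains all the data of the smaller ones.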
Case Study Figure 4 shows a real case including the dialogue context, supporting document, and the responses generated by the pipeline method and our proposed UniGDD framework. It can be observed that our framework identifies accurate knowledge from the supporting document and thus provides a proper and informative response about the reasons for the problem the user encounters. In contrast, the pipeline method only gives a relatively general response that is not suitable in this case.

[Figure 4]
Dialogue Context:
User: I filled out all of the information in the Retirement Estimator and it took a long time. When I came back from answering the door, all of the information was gone. What happened?
Agent: Oh that's too bad. Were you gone for a long time?
User: Yes I guess I was.
Supporting Document: …… How Long Can You Stay On Each Page? For security reasons, there are time limits for viewing each page. You will receive a warning after 25 minutes without doing anything, and you will be able to extend your time on the page. After the third warning on a page, you must move to another page. If you do not, your time will run out and your work on that page will be lost.
RoBERTa-large+T5-base: Do you have any more questions about the Retirement Estimator?
UniGDD-base: For security reasons, there are time limits for viewing each page. You will receive a warning after 25 minutes without doing anything and you will be able to extend your time on the page.
Ground Truth: For reasons of security, there are time limits for viewing each page.

Human Evaluation
We randomly sample 100 evaluation instances. For each instance, given the dialogue context and grounding document, three human annotators are asked to conduct a pairwise comparison between the response generated by UniGDD-base and the one generated by the pipeline baseline RoBERTa-large+T5-base in terms of two aspects: (1) Relevance: which response is more relevant and appropriate to the user query? (2) Informativeness: which response is more informative? Results are shown in Table 3. Compared with the pipeline method, our framework can reduce error propagation, resulting in more relevant and appropriate responses. Moreover, our framework has a clear advantage over the baseline in terms of Informativeness since it can utilize rich document context during the generation.

                  Win   Tie   Lose
Relevance          26    64     10
Informativeness    23    69      8

Table 3: UniGDD-base vs RoBERTa-large+T5-base. The numbers indicate how many instances there are in each case.

Conclusion
Our UniGDD framework unifies knowledge identification and response generation and models their characteristics via a multi-task generative modeling strategy. Both automatic evaluation and human evaluation demonstrate the effectiveness of our framework.