HyKnow: End-to-End Task-Oriented Dialog Modeling with Hybrid Knowledge Management

Task-oriented dialog (TOD) systems typically manage structured knowledge (e.g., ontologies and databases) to guide goal-oriented conversations. However, they fall short of handling dialog turns grounded on unstructured knowledge (e.g., reviews and documents). In this paper, we formulate a task of modeling TOD grounded on both structured and unstructured knowledge. To address this task, we propose a TOD system with hybrid knowledge management, HyKnow. It extends the belief state to manage both structured and unstructured knowledge, and is the first end-to-end model that jointly optimizes dialog modeling grounded on these two kinds of knowledge. We conduct experiments on a modified version of the MultiWOZ 2.1 dataset, where dialogs are grounded on hybrid knowledge. Experimental results show that HyKnow achieves strong end-to-end performance compared to existing TOD systems. It also outperforms pipeline knowledge management schemes, with higher unstructured knowledge retrieval accuracy.


Introduction
Recently, Task-Oriented Dialog (TOD) systems (Mehri et al., 2019; Zhang et al., 2020a,b; Le et al., 2020; Hosseini-Asl et al., 2020; Peng et al., 2020; Li et al., 2021) have achieved promising performance on accomplishing user goals. These systems typically query structured knowledge such as tables and databases based on the user goals, and use the query results to guide the generation of system responses, as shown in the first dialog turn in Fig. 1.
However, real-world task-oriented conversations often step into dialog turns that are grounded on unstructured knowledge (Feng et al., 2020), such as passages and documents. For example, as the second dialog turn in Fig. 1 shows, the user asks about customers' favorite food at Pizza Hut, which is grounded on the customer reviews of this restaurant. Current TOD systems fall short of handling such dialog turns since they cannot utilize relevant unstructured knowledge. This deficiency may interrupt the dialog process, causing difficulties in tracking user goals and generating system responses.

[Figure 1: Illustration of task-oriented dialog modeling with hybrid knowledge management. Words in red and blue illustrate the new domain-slot-value triple and the topic of the user utterance that we introduce into the belief state, respectively. Words in yellow illustrate the topics of documents that we extract through preprocessing.]
In this work, we consider incorporating a wider variety of domain knowledge into TOD systems. Therefore, we define a task of modeling TOD whose turns involve either structured or unstructured knowledge. In turns involving structured knowledge, the system needs to track the user goals as triples and use them to perform database queries, whose results are used to generate the system response. In turns involving unstructured knowledge, the system manages a document base to retrieve relevant references for generating the response.
To address our defined task, we propose a task-oriented dialog system with Hybrid Knowledge management (HyKnow). This model extends the belief state to handle TODs grounded on hybrid knowledge, and further uses the extended belief state to perform both database query and document retrieval, whose outputs are then used to generate the final response. We consider two implementations of our system, with different schemes of extended belief state decoding. Both implementations use an end-to-end multi-stage sequence-to-sequence (Seq2Seq) framework (Lei et al., 2018; Liang et al., 2020; Zhang et al., 2020a,b), where dialog modeling grounded on the two kinds of knowledge can be jointly optimized. We evaluate our system on the modified version of the MultiWOZ 2.1 dataset (Kim et al., 2020), where dialogs are grounded on hybrid knowledge. Experimental results show that HyKnow outperforms existing TOD systems that do not leverage large pretrained language models, whether or not they add extra unstructured knowledge management. It also achieves higher accuracy in unstructured knowledge retrieval than pipeline knowledge management schemes.
Our contributions are summarized as follows:

• We formulate a task of modeling TOD grounded on both structured and unstructured knowledge, to incorporate more domain knowledge into TOD systems.

• We propose a TOD system, HyKnow, to address the proposed task. It extends the belief state to manage hybrid knowledge, and is the first end-to-end model to jointly optimize dialog modeling grounded on the two kinds of knowledge.


Related Work

Traditional TOD systems manage structured knowledge to track user goals, i.e., the belief states, through the conversation (Williams et al., 2013; Henderson et al., 2014). The states are converted into a representation of constraints based on different schemes to query the databases (El Asri et al., 2017; Budzianowski et al., 2018; Rastogi et al., 2020; Zhu et al., 2020). The entry matching results are then used to generate the system response.
With the development of intelligent assistants, systems should command massive external knowledge to better accomplish complicated user goals and improve user satisfaction. To this end, some researchers (Zhao et al., 2017; Yu et al., 2017; Akasaki and Kaji, 2017) equip the system with chatting capability to address both task and non-task content in TODs. Other studies apply knowledge graphs (Liao et al., 2019) or tables queried via SQL (Yu et al., 2019) to enrich the knowledge of TOD systems. However, all these studies are still limited to dialog modeling grounded on structured knowledge.
A couple of recent studies integrate unstructured knowledge into TOD modeling. Kim et al. (2020) introduce knowledge snippets to answer follow-up questions beyond the coverage of databases. Feng et al. (2020) formulate document-grounded dialog for information-seeking tasks. However, these studies focus only on dialog turns grounded on unstructured knowledge. In this paper, we aim to fill the gap of managing domain-specific knowledge with various sources and structures in traditional TOD systems.

Task Definition
In this section, we introduce our formulation of modeling TOD grounded on hybrid knowledge. In particular, we assume that each dialog turn in TOD is grounded on either structured or unstructured knowledge. We formulate the modeling of the two kinds of dialog turns separately.
In turns that are grounded on structured knowledge, the system needs to track user goals, i.e., the belief state, as domain-slot-value triples, and then query a database (DB) to guide response generation. Specifically, we denote the user utterance and the system response at turn $t$ as $U_t$ and $R_t$, respectively. Given the dialog context $C_t = [U_{t-k}, R_{t-k}, \dots, U_t]$ and the previous belief state $B_{t-1}$, the system needs to generate the current belief state $B_t$, which is formulated as $B_t = f_b^{(s)}(C_t, B_{t-1})$. Then the system performs a DB query based on $B_t$ to get the matching result $m_t$. In this paper, we follow Budzianowski et al. (2018) to represent $m_t$ as a vector indicating the number of matched entities and whether the booking is available. Afterwards, the system generates the response $R_t$, formulated as $R_t = f_r^{(s)}(C_t, B_t, m_t)$.

In turns that are grounded on unstructured knowledge, the system manages a document base to guide response generation, which contains lists of documents characterized by different domains and entities, as shown in Fig. 1. Specifically, given the dialog context $C_t$, the system first retrieves a relevant document $D_t$ from the document base, formulated as $D_t = f_d^{(u)}(C_t)$. Then the system generates the response $R_t$ based on $C_t$ and the retrieved $D_t$, which is formulated as $R_t = f_r^{(u)}(C_t, D_t)$. Note that the original belief state is not updated in the unstructured knowledge-grounded turns, namely $B_t = B_{t-1}$. However, in this paper, we introduce an extra belief state extension to facilitate document retrieval.


HyKnow

[Figure 2: Overview of HyKnow. Solid arrows denote the input/output of the encoders or decoders. Dashed arrows denote the knowledge operations. $C_t$, $m_t$, $D_t$ and $R_t$ represent turn $t$'s dialog context, DB query result, relevant document and system response. $\tilde{B}_t$ and $h^{\tilde{B}_t}_{enc}$ denote the extended belief state and its hidden states at turn $t$. The decoding of $\tilde{B}_t$ (orange dashed box) is implemented in two different ways: (a) using a single decoder to generate the whole state, and (b) using two decoders to generate the domain-slot-value (DSV) triples and the topic separately.]

Fig. 2 shows an overview of our proposed system HyKnow with end-to-end sequence-to-sequence (Seq2Seq) implementations. It addresses our proposed task in three steps. First, it performs extended belief tracking to track user goals through dialog turns that involve hybrid knowledge. Second, it performs hybrid knowledge operations based on the extended belief state, to search the structured and unstructured knowledge relevant to the user goals. Finally, it uses the extended belief state and the relevant knowledge to perform knowledge-grounded response generation.
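To make the formulation concrete, the following Python sketch spells out the two turn types as function interfaces (a minimal illustration under our own naming; f_b, f_r, f_d and the db object are stand-ins for the models and database described above, not an API from the paper):

```python
from typing import Dict, List, Tuple

BeliefState = Dict[str, str]  # e.g. {"hotel-area": "north"}

def structured_turn(context: List[str], prev_state: BeliefState,
                    db, f_b, f_r) -> Tuple[BeliefState, str]:
    """Turn grounded on structured knowledge:
    B_t = f_b(C_t, B_{t-1});  m_t = DB(B_t);  R_t = f_r(C_t, B_t, m_t)."""
    state = f_b(context, prev_state)           # belief tracking
    match_vec = db.query(state)                # DB query result m_t
    response = f_r(context, state, match_vec)  # response generation
    return state, response

def unstructured_turn(context: List[str], prev_state: BeliefState,
                      f_d, f_r) -> Tuple[BeliefState, str]:
    """Turn grounded on unstructured knowledge:
    D_t = f_d(C_t);  R_t = f_r(C_t, D_t);  belief state stays B_{t-1}."""
    document = f_d(context)                    # document retrieval
    response = f_r(context, document)          # response grounded on D_t
    return prev_state, response                # original state unchanged
```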

Extended Belief Tracking
Belief State Extension. We define an extended belief state $\tilde{B}_t$ that is applicable to tracking user goals in TODs grounded on both structured and unstructured knowledge. Specifically, in turns that are grounded on structured knowledge, $\tilde{B}_t$ is the same as the original $B_t$, which describes user goals as domain-slot-value triples. In turns that are grounded on unstructured knowledge, $\tilde{B}_t$ has an additional slot ruk to indicate that the current dialog turn requires unstructured knowledge. The prefix and value of the slot ruk represent the involved domain and entity, e.g., restaurant-ruk: Pizza Hut, colored in red in Fig. 1. We denote the combination of the original and newly introduced domain-slot-value triples as $DSV_t$. In addition, the topic of $U_t$ is abstracted in $\tilde{B}_t$ as a word sequence $T_t$ in each unstructured knowledge-grounded turn, e.g., favorite, colored in blue in Fig. 1.
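As an illustration, the extended belief state for the two turn types in Fig. 1 could be serialized as follows (a hypothetical representation; the paper does not prescribe a concrete data format, and the triple values here are invented for the example):

```python
# Extended belief state in a structured-knowledge turn:
# identical to the original belief state (DSV triples only).
b_structured = {
    "dsv": {"restaurant-food": "pizza", "restaurant-area": "south"},
    "topic": None,  # no topic: this turn needs no unstructured knowledge
}

# Extended belief state in an unstructured-knowledge turn:
# the ruk slot marks the involved domain/entity, plus a topic sequence T_t.
b_unstructured = {
    "dsv": {"restaurant-food": "pizza", "restaurant-area": "south",
            "restaurant-ruk": "Pizza Hut"},  # new domain-ruk: entity triple
    "topic": "favorite",                     # topic of the user utterance
}
```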
Extended Belief State Decoding. Following the Seq2Seq framework, we first use the context encoder to encode the dialog context $C_t$, whose last output is used as the initial hidden state of the decoders. Based on the hidden states of the context encoder $h^{C_t}_{enc}$ and those of the previous extended belief state $h^{\tilde{B}_{t-1}}_{enc}$, we decode the current extended belief state $\tilde{B}_t$.

Since $DSV_t$ and $T_t$ are grounded on quite different vocabularies, we consider implementing the decoding of $\tilde{B}_t$ in two ways: (a) using a single belief state decoder to generate the whole $\tilde{B}_t$, and (b) using a DSV decoder and a topic decoder to generate $DSV_t$ and $T_t$ separately. Each implementation has its own advantages over the other. In the single-decoder implementation, the decoding of $DSV_t$ and $T_t$ can be jointly optimized via shared parameters:
$$\tilde{B}_t = \mathrm{Decoder}^{(b)}\big(h^{C_t}_{enc}, h^{\tilde{B}_{t-1}}_{enc}\big) \quad (1)$$
In the multi-decoder implementation, the decoding of $DSV_t$ and $T_t$ is fitted to two smaller decoding spaces (vocabularies), so the generation of $\tilde{B}_t$ decomposes into two simpler decoding processes:
$$DSV_t = \mathrm{Decoder}^{(dsv)}\big(h^{C_t}_{enc}, h^{\tilde{B}_{t-1}}_{enc}\big), \qquad T_t = \mathrm{Decoder}^{(t)}\big(h^{C_t}_{enc}, h^{\tilde{B}_{t-1}}_{enc}\big) \quad (2)$$
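The two decoding schemes can be sketched in PyTorch as follows (a minimal sketch that omits the attention and copy mechanism described later; module names and sizes are our own choices):

```python
import torch.nn as nn

class SingleBeliefDecoder(nn.Module):
    """Scheme (a): one GRU decoder emits the whole extended belief state
    (DSV triples followed by the topic) as a single token sequence."""
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, h0):
        # tokens: (batch, len) gold prefix for teacher forcing
        # h0: (1, batch, hidden) initial state from the context encoder
        h, _ = self.gru(self.emb(tokens), h0)
        return self.out(h)  # logits over the (shared) vocabulary

class MultiBeliefDecoder(nn.Module):
    """Scheme (b): two decoders with separate, smaller vocabularies,
    one for the DSV triples and one for the topic sequence."""
    def __init__(self, dsv_vocab: int, topic_vocab: int, hidden: int = 128):
        super().__init__()
        self.dsv_dec = SingleBeliefDecoder(dsv_vocab, hidden)
        self.topic_dec = SingleBeliefDecoder(topic_vocab, hidden)

    def forward(self, dsv_tokens, topic_tokens, h0):
        return self.dsv_dec(dsv_tokens, h0), self.topic_dec(topic_tokens, h0)
```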

Hybrid Knowledge Operations
Based on the extended belief state $\tilde{B}_t$, we conduct both the DB query and document retrieval to get the query result $m_t$ and the relevant document $D_t$, which are used to guide response generation. In the DB query operation, we simply match the original triples in $\tilde{B}_t$ against the DB entries. In the document retrieval operation, we first preprocess the document base to extract the topic of each document as its retrieval index, e.g., vegetarian and favorite, colored in yellow in Fig. 1. Then we use the extended part of $\tilde{B}_t$ to match the domain, entity and extracted topic of each document, and select the best-matched one as $D_t$.
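A minimal sketch of this topic-matching retrieval, assuming each document has been pre-indexed by its domain, entity and extracted topic, and using a simple overlap score (the exact matching rule here is our assumption, not the paper's):

```python
def retrieve_document(ext_state: dict, doc_index: list) -> str:
    """Pick the document whose (domain, entity, topic) index best matches
    the extended part of the belief state."""
    domain = ext_state["ruk_domain"]          # e.g. "restaurant"
    entity = ext_state["ruk_entity"]          # e.g. "Pizza Hut"
    topic_words = set(ext_state["topic"].split())

    best_doc, best_score = None, -1.0
    for doc in doc_index:  # doc: {"domain", "entity", "topic", "text"}
        score = 0.0
        score += 1.0 if doc["domain"] == domain else 0.0
        score += 1.0 if doc["entity"] == entity else 0.0
        # overlap between the state's topic and the document's topic index
        doc_topic = set(doc["topic"].split())
        score += len(topic_words & doc_topic) / max(len(doc_topic), 1)
        if score > best_score:
            best_doc, best_score = doc["text"], score
    return best_doc
```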

Knowledge-Grounded Response Generation
We generate the system response based on the dialog context $C_t$, the extended belief state $\tilde{B}_t$, and the outputs of the hybrid knowledge operations, $m_t$ and $D_t$. We first use the same context encoder as in Sec. 4.1 to encode $C_t$. Moreover, we use the belief state encoder and the document encoder to encode $\tilde{B}_t$ and $D_t$ into hidden states $h^{\tilde{B}_t}_{enc}$ and $h^{D_t}_{enc}$, respectively. Based on the hidden states of all the encoders and the vector $m_t$, we use the response decoder to generate the system response $R_t$, formulated as:
$$h^{\tilde{B}_t}_{enc} = \mathrm{Encoder}^{(b)}(\tilde{B}_t), \qquad h^{D_t}_{enc} = \mathrm{Encoder}^{(d)}(D_t)$$
$$R_t = \mathrm{Decoder}^{(r)}\big(h^{C_t}_{enc}, h^{\tilde{B}_t}_{enc}, h^{D_t}_{enc}, m_t\big)$$
where $\mathrm{Encoder}^{(b)}$ and $\mathrm{Encoder}^{(d)}$ denote the belief state encoder and the document encoder.
Following previous TOD systems with Seq2Seq architectures (Lei et al., 2018; Liang et al., 2020; Zhang et al., 2020a,b), we use one-layer bidirectional GRUs as encoders and standard GRUs as decoders. We also apply global attention (Bahdanau et al., 2015) and the copy mechanism (Gu et al., 2016) in all the Seq2Seq processes, to improve the context-awareness of decoding $\tilde{B}_t$ and $R_t$.
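For reference, a minimal PyTorch sketch of the encoder side under these choices; the attention shown is a dot-product variant of global attention for brevity (the paper cites Bahdanau-style attention), and the copy mechanism is omitted:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """One-layer bidirectional GRU encoder, as in the Seq2Seq setup above."""
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True,
                          bidirectional=True)
        self.proj = nn.Linear(2 * hidden, hidden)  # fuse the two directions

    def forward(self, tokens):
        outputs, h_n = self.gru(self.emb(tokens))  # (B, T, 2H), (2, B, H)
        outputs = self.proj(outputs)               # (B, T, H)
        # Concatenate last forward/backward states to init the decoders.
        h0 = self.proj(torch.cat([h_n[0], h_n[1]], dim=-1)).unsqueeze(0)
        return outputs, h0                         # (B, T, H), (1, B, H)

def attend(dec_state, enc_outputs):
    """Global attention over encoder hidden states (dot-product scoring)."""
    # dec_state: (B, H); enc_outputs: (B, T, H)
    scores = torch.bmm(enc_outputs, dec_state.unsqueeze(-1)).squeeze(-1)
    weights = torch.softmax(scores, dim=-1)                       # (B, T)
    return torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)  # (B, H)
```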

Model Training
HyKnow is optimized through supervised training. Specifically, each dialog turn in the training data is initially labeled with the original belief state and the relevant document. We extend the belief state label based on the domain, entity and extracted topic of the relevant document. Then the extended belief state label and the reference response are used to calculate cross-entropy losses against the generated $\tilde{B}_t$ and $R_t$, respectively. We sum the two losses and perform gradient descent in each turn to optimize the model parameters. In our paper, the dialog context $C_t$ is set as the concatenation of the previous system response $R_{t-1}$ and the current user utterance $U_t$.
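The per-turn objective can be sketched as the sum of two token-level cross-entropy losses (illustrative tensor shapes; the padding handling is our assumption):

```python
import torch.nn.functional as F

def turn_loss(belief_logits, belief_gold, resp_logits, resp_gold, pad_id=0):
    """Sum of the two cross-entropy losses used to train HyKnow end-to-end.
    *_logits: (batch, seq_len, vocab); *_gold: (batch, seq_len) token ids."""
    loss_b = F.cross_entropy(belief_logits.transpose(1, 2), belief_gold,
                             ignore_index=pad_id)  # extended belief state loss
    loss_r = F.cross_entropy(resp_logits.transpose(1, 2), resp_gold,
                             ignore_index=pad_id)  # response generation loss
    return loss_b + loss_r
```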

Experimental Settings

Dataset
We evaluate our proposed system on the modified MultiWOZ 2.1 dataset (Kim et al., 2020), where dialog turns grounded on unstructured knowledge (i.e., documents) are newly inserted into the original dialogs.

Baselines
We compare HyKnow with 1) existing end-to-end (E2E) TOD models and dialog state tracking (DST) models, to show the benefits of incorporating unstructured knowledge management into TOD modeling. We also compare HyKnow with 2) unstructured knowledge management models, to investigate our system's document retrieval performance. For the comparison with pipeline systems that have hybrid knowledge management, we further consider combinations of 1) and 2) as baselines.

E2E TOD Models and DST Models. We consider three baseline E2E TOD models with different types of structures: UniConv (Le et al., 2020) uses a structured fusion (Mehri et al., 2019) design, LABES-S2S (Zhang et al., 2020a) uses a multi-stage Seq2Seq (Lei et al., 2018) architecture, and SimpleTOD (Hosseini-Asl et al., 2020) is based on a single auto-regressive language model initialized from GPT-2 (Radford et al., 2019). All three E2E models manage only structured knowledge (a database) in their TOD modeling. In addition to E2E TOD models, we also compare HyKnow with existing DST models in the belief tracking evaluation. Specifically, we use TRADE (Wu et al., 2019) and TripPy (Heck et al., 2020) as two DST baselines, which are representative BERT-free and BERT-based DST models, respectively.
Unstructured Knowledge Management Models. We first compare our system with Beyond Domain APIs (BDA) (Kim et al., 2020). This baseline uses two BERT-based (Devlin et al., 2019) classification modules to detect unstructured knowledge-grounded dialog turns and to retrieve relevant documents, respectively. Moreover, we use the standard information retrieval (IR) systems TF-IDF (Manning et al., 2008) and BM25 (Robertson and Zaragoza, 2009) as two further baselines.
Combinations. We combine the unstructured knowledge management model BDA with every DST or E2E TOD model. Specifically, BDA detects dialog turns that are grounded on unstructured knowledge, and uses a fine-tuned GPT-2 to generate responses in these turns, based on the dialog context and retrieved documents. The DST or E2E TOD model handles the remaining dialog turns, which are grounded on structured knowledge.
Note that TripPy and SimpleTOD use large-scale pretrained language models (LMs) to improve their dialog modeling performance, which requires large model sizes and computing resources. For fair comparison, we distinguish them from other light-weight models in our experiments.

Results and Analysis
We test our system's performance under both the single-decoder and multi-decoder belief state decoding implementations, denoted as HyKnow (Single) and HyKnow (Multiple), respectively. Both implementations of HyKnow lead to the same conclusions when compared with the baseline models, as described in detail below.

End-to-End Evaluation

Table 1 shows the results of the end-to-end (E2E) evaluation, where we evaluate the task completion rate and language quality of system responses. For task completion, we measure whether the system provides correct entities (Inform rate) and answers all the requested information (Success rate) in a dialog, following Budzianowski et al. (2018). For language quality, we adopt the commonly used metrics BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE-L (Lin, 2004). Moreover, we use the Combined score, computed as (Inform + Success) × 0.5 + BLEU, for overall evaluation (Mehri et al., 2019).

We find that HyKnow achieves a better task completion rate than the light-weight E2E TOD models, and is comparable with SimpleTOD, which uses the large-scale pretrained GPT-2. It also generates responses with better language quality than all the E2E models. This is because our extended belief state can distinguish whether a dialog turn is grounded on structured or unstructured knowledge, which avoids confusion between handling the two kinds of turns. In addition, we manage the document base to provide relevant references for response generation, which guides our system to give more appropriate responses in turns that are grounded on unstructured knowledge.

We also observe that HyKnow outperforms the combinations of BDA and light-weight E2E TOD models. This indicates that our end-to-end framework has advantages over the pipeline structures of the combination models. In particular, dialog modeling grounded on structured and unstructured knowledge is integrated in a uniform Seq2Seq architecture in our system, where the two are jointly optimized toward overall better performance. Although HyKnow does not significantly outperform the combination of BDA and SimpleTOD, our system has lower deployment cost since it is trained end-to-end.
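For reference, the Combined score used in Table 1 reduces to a one-line computation (the numbers below are placeholders, not results from the paper):

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    """Overall E2E metric: (Inform + Success) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu

# Example with placeholder percentages:
print(combined_score(inform=80.0, success=70.0, bleu=18.0))  # -> 93.0
```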

Context-to-Response Generation
We also conduct evaluations on context-to-response (C2R) generation, where systems directly use the oracle belief state and knowledge to generate the response. The experimental results are shown in Table 2, where we reach the same conclusions as in the E2E evaluation (Table 1). This again shows our system's superiority in TOD modeling grounded on hybrid knowledge. Additionally, we observe that HyKnow's performance gap between the E2E and C2R evaluations is smaller than that of the baseline models, reflected in the smaller variation of the Combined score. This shows that the belief state and knowledge provided by our system are probably closer to the oracle and may give stronger guidance for generating a response.

Knowledge Management
To further investigate our system's end-to-end performance, we evaluate the intermediate knowledge management. In particular, we evaluate structured and unstructured knowledge management separately, on the original and the newly inserted dialog turns. In the original turns grounded on structured knowledge, we evaluate the belief tracking performance, which directly determines the database query accuracy. Specifically, we use Joint Goal accuracy (Henderson et al., 2014) to measure whether the belief state is predicted correctly in a dialog turn. In the newly inserted turns grounded on unstructured knowledge, we adopt the standard information retrieval metrics R@1 and MRR@5 to evaluate the document retrieval performance. Tables 3 and 4 show our evaluation results of belief tracking and document retrieval, respectively.
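For clarity, the two retrieval metrics can be computed per turn as follows (standard formulations; ranked_ids and gold_id are illustrative variable names):

```python
def recall_at_1(ranked_ids: list, gold_id: str) -> float:
    """R@1: 1 if the top-ranked document is the labeled one, else 0."""
    return 1.0 if ranked_ids and ranked_ids[0] == gold_id else 0.0

def mrr_at_5(ranked_ids: list, gold_id: str) -> float:
    """MRR@5: reciprocal rank of the gold document within the top 5."""
    for rank, doc_id in enumerate(ranked_ids[:5], start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0

# Corpus-level scores average these over all unstructured-knowledge turns.
```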
In terms of belief tracking, HyKnow outperforms the light-weight DST/E2E models. This is because our extended belief tracking can distinguish the newly inserted turns from the original turns (via the slot ruk), which improves our system's awareness of when to update the original triples in the belief state. HyKnow also achieves better belief tracking than the combinations of BDA and a light-weight DST/E2E model. This is because error propagation in belief state updating is eliminated in our system compared to the pipeline framework: a pipeline system either updates the belief state or retrieves a document in one turn, whereas HyKnow can perform both operations by the nature of its E2E design. Although HyKnow's belief tracking performance is not as good as that of TripPy and SimpleTOD, our system does not use large-scale pretrained BERT or GPT-2 and is thus computationally cheaper (see Appendix D for details on model size comparison).

In the document retrieval evaluation, we find that HyKnow outperforms the unstructured knowledge management models, especially on the R@1 metric. This shows that our system's document retrieval scheme with topic matching has higher accuracy than the classifier-based BDA and the standard information retrieval (IR) systems. Specifically, HyKnow retrieves documents based on highly condensed semantic information, i.e., the topic, which reduces the complexity of the retrieval process. This makes HyKnow's retrieval scheme more concise and effective than the baseline models, which directly calculate the relevance of the dialog context to every document's content.

Single vs. Multiple Decoders
We then compare our two implementations of extended belief state decoding. We compute the vocabularies of the DSV triples, the topic, and their combination (709, 166 and 862 words, respectively), and observe that the last approximately equals the sum of the former two. This confirms our assumption in Sec. 4.1 that DSV triples and topics are grounded on quite different vocabularies, which motivates the multi-decoder implementation of belief state decoding.
However, we find that HyKnow (Single) outperforms HyKnow (Multiple) in both the E2E and knowledge management evaluations, as shown in Tables 1, 3 and 4. This shows that the decoding of DSV triples and topics benefits from joint optimization via shared parameters, even though they are grounded on quite different vocabularies. The superiority of joint optimization further implies that structured and unstructured knowledge management in TOD modeling are positively correlated, since they commonly involve task-specific domain knowledge and entities. Therefore, the two kinds of knowledge management can learn from each other through joint training, and achieve overall better performance than when trained separately.

Ablation Study
We ablate the joint optimization of structured and unstructured knowledge-grounded TOD modeling to investigate its role in our framework, denoted as w/o Joint Optim in Tables 1, 3 and 4. Specifically, we train two HyKnow models separately on the original and the newly inserted dialog turns, and use them to handle TOD grounded on structured and unstructured knowledge, respectively. To determine which model should be used in a given turn, the oracle label of the slot ruk is used to judge which knowledge type the current dialog turn is grounded on, as sketched below.
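A minimal sketch of this oracle routing (the model objects and their generate method are illustrative names, not the paper's API):

```python
def respond_wo_joint_optim(context, prev_state, oracle_ruk: bool,
                           model_structured, model_unstructured):
    """Ablation (w/o Joint Optim): route each turn to one of two separately
    trained HyKnow models, using the oracle ruk label as the switch."""
    model = model_unstructured if oracle_ruk else model_structured
    return model.generate(context, prev_state)
```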
We observe that removing joint optimization brings evident performance declines to HyKnow in the end-to-end evaluation (Table 1). This suggests that joint optimization plays a significant role in HyKnow's end-to-end performance, where TOD modeling grounded on the two kinds of knowledge can benefit each other through shared parameters. The ablation of joint optimization also causes performance declines in HyKnow's knowledge management (Tables 3 and 4). This again indicates that the two kinds of knowledge management are positively correlated and benefit from joint training.

Between Structured and Unstructured Knowledge
In this section, we investigate how the newly inserted dialog turns (grounded on unstructured knowledge) affect systems' E2E performance on the original dialog turns (grounded on structured knowledge). Specifically, we evaluate systems' E2E performance on both the original and the modified MultiWOZ 2.1 test sets. This evaluation is conducted only on the original dialog turns, unlike the E2E evaluation conducted on all turns (Table 1). Table 5 shows the results of this experiment, where we compare HyKnow (Single) with the strong combination models. We find that all models' performance degrades when transferred from the original to the modified test set. This indicates that the inserted turns grounded on new knowledge may interrupt the original dialogs, which complicates the dialog process and causes difficulties in modeling the original turns.
However, we observe that HyKnow (Single) suffers a smaller reduction than the baseline combination models. This shows that our system is more resistant to the interruptions of newly inserted turns, which benefits from our end-to-end modeling. Specifically, HyKnow jointly optimizes dialog modeling of the original and newly inserted turns in a uniform end-to-end framework. This unified modeling improves our system's flexibility in switching between the two kinds of turns, and thus makes it more competent at handling the complicated dialog process.

Human Evaluation
There is still a gap between the results of automatic metrics and the real E2E performance of TOD systems. Therefore, we conduct a human evaluation to test our system's E2E performance more adequately. In particular, we compare HyKnow (Single) with the strong E2E baseline SimpleTOD and its combination with BDA.
We conduct human evaluation separately on the two types of dialog turns (original and newly inserted). Specifically, we sample fifty dialog turns of each type and ask judges to evaluate each turn's system response on three aspects. Coherence (Cohe.) measures how well the response coheres with the dialog context. Informativeness (Info.) measures how well the response provides sufficient information to meet the user's requests. Correctness (Corr.) measures how well the information in the response is consistent with the ground-truth knowledge, i.e., the relevant DB entries or documents. All three aspects are scored on a Likert scale of 1-3, denoting bad, so-so and good, respectively.

Table 6 shows our human evaluation results. In the original dialog turns, HyKnow (Single) scores close to SimpleTOD and its combination with BDA on all three aspects. This indicates that our proposed light-weight system is comparable with the large GPT-2 based models in managing structured knowledge to generate responses. In addition, our model outperforms the two baselines in the newly inserted dialog turns. Specifically, HyKnow (Single) generates responses with significantly better informativeness and correctness than SimpleTOD. This again shows that managing unstructured knowledge is beneficial for generating appropriate responses. Compared to the combination of SimpleTOD and BDA, the responses generated by HyKnow (Single) also achieve much better correctness, which benefits from our model's higher document retrieval accuracy (as shown in Table 4).

Case Study
An example dialog segment ($U_1$, $B_1$, $R_1$, $U_2$) and the corresponding output results of each model ($B_2$, $D_2$, $R_2$) are presented in Table 7:

U1: Hello, I would like to find a hotel that has WiFi in the north part of the town.
B1: hotel-area: north, hotel-internet: yes
R1: The Arbury Lodge Guesthouse is one of 12 options for you. Shall I make a reservation for you?
U2: Do they provide Italian breakfast?

Without access to the unstructured document base, SimpleTOD misunderstands the user query, and instead recognizes the term "Italian" in the user utterance as a constraint to update the belief state. As a result, the system makes an inappropriate recommendation. When combined with BDA, SimpleTOD predicts the correct belief state, but fails to find the relevant document, thus providing a wrong answer. This is because the wrong document's content shares many common words with the dialog context, e.g., "Italian" and "breakfast", which mislead BDA's retrieval process. In contrast, HyKnow gives a proper response with accurate information, as it identifies the entity ("Arbury Lodge Guesthouse") and captures the topic ("breakfast") to avoid being misled by common words in document retrieval.

Conclusion
In this paper, we define a task of modeling TOD with access to both structured and unstructured knowledge. To address this task, we propose a TOD system, HyKnow, which uses an E2E framework to jointly optimize TOD modeling grounded on the two kinds of knowledge.
In the experiments, HyKnow shows strong performance in modeling TOD with hybrid knowledge management, compared to existing TOD systems and their pipeline extensions. For future work, we plan to incorporate large-scale pretrained language models into our proposed system to further enhance its performance. We also plan to evaluate our system in different scenarios where dialogs are grounded on hybrid knowledge.


Appendix D: Model Size Comparison

Table 8 shows the model sizes of our proposed HyKnow and some baseline models. We find that HyKnow has a model size comparable to those of the light-weight baseline models, which do not leverage pretrained language models (LMs), and much smaller than those of TripPy and SimpleTOD, which use pretrained BERT and GPT-2, respectively. Therefore, HyKnow requires much less computational resources than TripPy and SimpleTOD.