Retrieval-Generation Alignment for End-to-End Task-Oriented Dialogue System

Developing an efficient retriever to retrieve knowledge from a large-scale knowledge base (KB) is critical for task-oriented dialogue systems to effectively handle localized and specialized tasks. However, widely used generative models such as T5 and ChatGPT often struggle to differentiate subtle differences among the retrieved KB records when generating responses, resulting in suboptimal quality of generated responses. In this paper, we propose the application of maximal marginal likelihood to train a perceptive retriever by utilizing signals from response generation for supervision. In addition, our approach goes beyond considering solely retrieved entities and incorporates various meta knowledge to guide the generator, thus improving the utilization of knowledge. We evaluate our approach on three task-oriented dialogue datasets using T5 and ChatGPT as the backbone models. The results demonstrate that when combined with meta knowledge, the response generator can effectively leverage high-quality knowledge records from the retriever and enhance the quality of generated responses. The code and models of this paper are available at https://github.com/shenwzh3/MK-TOD.


Introduction
Task-oriented dialogue systems (TOD) assist users in accomplishing daily tasks such as finding restaurants, scheduling appointments, and navigating traffic by leveraging external knowledge bases. Among them, pipeline systems (Henderson et al., 2014; Hosseini-Asl et al., 2020) involve several intermediate stages, such as dialogue state tracking and system policy learning, for retrieving knowledge and generating responses. In contrast, end-to-end task-oriented dialogue systems (E2E-TOD) (Wu et al., 2022; Tian et al., 2022) have attracted increasing attention for their ability to directly generate responses based on the knowledge base without intermediate annotations. Although the end-to-end paradigm appears more compatible with practical scenarios and large-scale language models, it poses challenges in acquiring and utilizing external knowledge, as no belief state is provided for knowledge retrieval.

Figure 1: A demonstrative case in E2E-TOD. The table displays retrieved entities sorted by retrieval order. The correct entity is highlighted in blue, but the response generator mistakenly selects the false entity, highlighted in red, leading to an erroneous response.
Retrieval-augmented generation (Lewis et al., 2020; Ren et al., 2021; Singh et al., 2021) has demonstrated success in various knowledge-intensive tasks by employing a held-out dense retriever to retrieve knowledge and then using the retrieved knowledge to generate results. Q-TOD (Tian et al., 2022) applies this approach to E2E-TOD and significantly outperforms previous methods that combine knowledge retrieval and response generation into a single model (Madotto et al., 2018; Qin et al., 2020; Raghu et al., 2021). However, our preliminary study in Section 5.3 shows that under this framework, the correlation between the performance of the knowledge retriever and that of the response generator is relatively weak, meaning that simply improving the retriever may not lead to a better generator. We characterize this phenomenon as the misalignment between the retrieval and generation processes in E2E-TOD systems. This misalignment poses a bottleneck for current dialogue systems, as improvements in the retriever component do not necessarily translate to enhanced generation quality.
Through qualitative analysis, we hypothesize that the misalignment between retrieval and generation is attributed to the homogeneity of retrieved knowledge entities. As illustrated in Figure 1, the retrieved entities exhibit a high degree of similarity, with only minor variations in their values. Consequently, since the response generator is trained on reference responses that predominantly consist of language tokens rather than knowledge-related tokens, it struggles to differentiate between similar entities and may inadvertently select inappropriate entities for response generation.
In this paper, we introduce Meta Knowledge for end-to-end Task-Oriented Dialogue systems (MK-TOD) as a solution to the retrieval-generation misalignment. MK-TOD aims to correlate the performance of the knowledge retriever and the response generator for improved system performance. To enhance the knowledge retriever, we propose the application of maximum marginal likelihood (Singh et al., 2021) for progressive retriever updating during the training of the response generator. To enable the response generator to distinguish between entities, we explore several methods for utilizing retrieval-related meta knowledge. Here, meta knowledge refers to various information about the retrieved entities, such as retrieval order, retrieval confidence, and co-occurrence rate. We propose three approaches for incorporating the meta knowledge: adding special prefix tokens, using prompts, and applying contrastive learning. Additionally, we investigate the introduction of negative knowledge during the generator's training to enhance its discriminative ability.
We apply MK-TOD to several backbone models, including T5 (Raffel et al., 2020) and the large language model ChatGPT (OpenAI, 2022). We compare MK-TOD with other E2E-TOD systems on three benchmark datasets, namely SMD, CamRest, and MWOZ (Eric et al., 2017; Wen et al., 2017; Eric et al., 2020). The empirical results demonstrate the superiority of our proposed system over the current state-of-the-art systems of similar model scales. Additionally, our system effectively enhances the performance of ChatGPT on E2E-TOD with in-context learning. Furthermore, through comprehensive analysis, we find that our meta-knowledge approach successfully alleviates the misalignment between the retriever and the generator, empowering the generator to better differentiate between similar entities during response generation.
Related Works

End-to-End Task-Oriented Dialogue

The existing work on the usage of external knowledge in end-to-end task-oriented dialogue systems can be divided into three categories. The first category takes the whole knowledge base as the model input and conducts knowledge selection and response generation in one single model. For instance, Mem2seq (Madotto et al., 2018), KB-Retriever (Qin et al., 2019), GLMP (Wu et al., 2019) and CDNET (Raghu et al., 2021) employ memory networks for querying knowledge. UnifiedSKG (Xie et al., 2022) directly concatenates entities as the input of Transformers. The second category directly encodes knowledge into model parameters. GPT-KE (Madotto et al., 2020) pre-trains the model on augmented dialogue data to embed the knowledge base, while ECO (Huang et al., 2022) applies tri-constraints on top of GPT-KE to ensure entity consistency. The third category uses an individual retriever to retrieve knowledge. For example, Q-TOD (Tian et al., 2022) decouples the dialogue system into a retriever and a generator and uses the generator to produce a query for knowledge retrieval. DialoKG (Rony et al., 2022) inputs the flattened records to a graph neural network to select entities. MAKER (Wan et al., 2023) introduces multi-grained retrieval with both entity and attribute selection. As mentioned earlier, although the retrieve-then-generate framework has been one of the most successful paradigms to date, it can lead to misalignment between the retriever and the generator in end-to-end task-oriented dialogue systems.

Retrieval-Augmented Generation
With the success of the dual-encoder neural retriever (Karpukhin et al., 2020), the retrieval-augmented generation framework has been widely applied to various knowledge-intensive tasks. This framework uses a retriever to retrieve knowledge from a knowledge base and feeds the retrieval results into a generator to produce the answer. Among them, RAG (Lewis et al., 2020) generates the answer based on each entity. FiD (Izacard and Grave, 2021) encodes each retrieved knowledge record like RAG and fuses their hidden states in the decoder. FiD-KD (Izacard and Grave, 2022) and EMDR2 (Singh et al., 2021) are both based on the FiD framework but use different retriever training methods: FiD-KD uses knowledge distillation, while EMDR2 uses marginal likelihood. REPLUG (Shi et al., 2023) applies the method of RAG to large language models but only updates the retriever during training.

Figure 2: The MK-TOD framework comprises a knowledge retriever and a response generator. Given the dialogue context, the retriever retrieves entities from the knowledge base. Each entity is concatenated with its corresponding meta knowledge and subsequently input into the generator to generate the response. The optimization process involves two likelihoods: the normal text generation likelihood and the marginal likelihood.

Methodology
The framework of our proposed MK-TOD is depicted in Figure 2. MK-TOD consists of a retriever and a response generator. In each dialogue turn, the retriever retrieves a set of relevant entities, which are then combined with retrieval-related meta knowledge and the dialogue context. The generator utilizes this information to generate a response for the current turn. In the following sections, we first introduce the notations and provide an overview of our method, and then delve into two crucial components: maximum marginal likelihood and meta knowledge.

Notations
Given a dialogue D = {u_1, r_1, ..., u_T, r_T} of T turns, where u_t and r_t are the user utterance and system response of the t-th turn, respectively, we use c_t to denote the dialogue context of the t-th turn, where c_t = {u_1, r_1, ..., u_{t-1}, r_{t-1}, u_t}. An external knowledge base (KB) is provided in the form of a set of entities, i.e., K = {e_1, e_2, ..., e_B}, where each entity e_i consists of several attribute-value pairs and B is the size of the knowledge base. End-to-end task-oriented dialogue systems take the dialogue context c_t and knowledge base K as input and generate an informative natural language response r_t.

System Overview
The retriever module comprises a context encoder and an entity encoder. The context encoder transforms the current dialogue context c_t into a vector representation h_{c_t}. The entity encoder concatenates the attribute-value pairs of each entity as plain text and encodes it into a vector representation h_{e_i}. The matching score s_{t,i} is computed by taking the dot product between h_{c_t} and h_{e_i}. The top-K entities with the highest scores are then selected as the candidate entities E_t for the current dialogue turn. Furthermore, if meta knowledge is utilized, each entity in E_t is augmented with its corresponding meta knowledge.
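The dot-product scoring and top-K selection described above can be sketched as follows. The toy vectors stand in for the encoder outputs; the vector values and entity contents are illustrative assumptions, not values from the paper:

```python
import numpy as np

def top_k_entities(h_ct, entity_vecs, k):
    """Score each entity by the dot product s_{t,i} = h_ct . h_ei and
    return the indices of the top-k entities, highest score first."""
    scores = entity_vecs @ h_ct
    order = np.argsort(-scores)  # descending by matching score
    return order[:k], scores

# Toy 3-d vectors standing in for the BERT encoder outputs.
h_ct = np.array([1.0, 0.0, 1.0])
entity_vecs = np.array([
    [0.9, 0.1, 0.8],   # entity 0: strong match (score 1.7)
    [0.1, 0.9, 0.0],   # entity 1: weak match   (score 0.1)
    [0.5, 0.2, 0.6],   # entity 2: medium match (score 1.1)
])
top, scores = top_k_entities(h_ct, entity_vecs, k=2)
print(top.tolist())  # [0, 2]
```

In the actual system the candidate set E_t returned this way is what gets augmented with meta knowledge before generation.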
The generator takes the retrieved entities E_t and dialogue context c_t as inputs to generate the final system response r_t. The probability of generating the response r_t given the entities E_t and dialogue context c_t can be calculated as follows:

p(r_t \mid c_t, E_t; \theta) = \prod_{j=1}^{|r_t|} p(r_{t,j} \mid r_{t,<j}, c_t, E_t; \theta), (1)

where \theta denotes the parameters of the generator.

Similar to most text generation tasks, we incorporate the negative log-likelihood (NLL) loss as a training objective to train the generator:

L_{nll} = -\log p(r_t \mid c_t, E_t; \theta). (2)

Maximum Marginal Likelihood
Due to the absence of retrieval labels for training the retriever, we rely on supervision signals from the generator. However, since gradients cannot be backpropagated through the NLL loss in Eq. (2) to the retriever, we propose updating the retriever's parameters by maximizing the marginal likelihood (MML) of the response r_t. The marginal likelihood offers a Bayesian perspective to compute p(r_t | c_t, K) by marginalizing the likelihood over all the entities in the knowledge base:

p(r_t \mid c_t, K) = \sum_{i=1}^{B} q(e_i \mid c_t; \phi) p(r_t \mid c_t, e_i; \theta), (3)

where \phi denotes the parameters of the retriever and q(e_i | c_t; \phi) is the retrieval probability of entity e_i. Note that computing q(e_i | c_t; \phi) for all entities in the entire knowledge base would incur an unaffordable computational cost for Eq. (3). Therefore, following the approach of EMDR2 (Singh et al., 2021), we compute q(e_i | c_t; \phi) over the retrieved entities E_t instead of the entire knowledge base K:

p(r_t \mid c_t, E_t) = \sum_{e_{t,i} \in E_t} q(e_{t,i} \mid c_t; \phi) p(r_t \mid c_t, e_{t,i}; \theta), (4)

where q(e_{t,i} | c_t; \phi) is implemented as follows:

q(e_{t,i} \mid c_t; \phi) = \frac{\exp(s_{t,i})}{\sum_{e_{t,j} \in E_t} \exp(s_{t,j})}. (5)

The loss function for the marginal likelihood is defined as follows:

L_{mml} = -\log \sum_{e_{t,i} \in E_t} q(e_{t,i} \mid c_t; \phi) p(r_t \mid c_t, e_{t,i}; \theta). (6)

By incorporating q(e_{t,i} | c_t; \phi), we can propagate gradients back to the retriever and update its parameters. The overall training loss function for MK-TOD is defined as follows:

L = \alpha L_{nll} + \beta L_{mml}, (7)

where \alpha and \beta are hyperparameters.
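Assuming the per-entity generation likelihoods p(r_t | c_t, e_i; θ) are available, the arithmetic of the marginal likelihood loss can be sketched as follows. In the real system the softmax over retrieval scores is computed inside an autodiff framework so that gradients flow back to the retriever; this numpy version only illustrates the computation:

```python
import numpy as np

def mml_loss(retrieval_scores, gen_loglikes):
    """Marginal likelihood loss over the retrieved entities E_t:
    q(e_i | c_t) is a softmax over the retrieval scores s_{t,i},
    and the loss is -log sum_i q(e_i | c_t) * p(r_t | c_t, e_i)."""
    s = np.asarray(retrieval_scores, dtype=float)
    q = np.exp(s - s.max())
    q = q / q.sum()                                    # retrieval distribution
    p = np.exp(np.asarray(gen_loglikes, dtype=float))  # p(r_t | c_t, e_i)
    return float(-np.log((q * p).sum()))

# Two retrieved entities: equal retrieval scores, different likelihoods.
print(round(mml_loss([0.0, 0.0], [np.log(0.5), np.log(0.25)]), 4))  # 0.9808
```

Intuitively, the loss decreases when the retriever assigns higher scores to entities under which the generator finds the reference response more likely, which is exactly the supervision signal that aligns the two modules.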

Meta Knowledge
We introduce the concept of retrieval-related meta knowledge, which encompasses various information about the retrieved entities to guide the generator and enhance the alignment between retrieval and generation.Three key factors are considered in the meta knowledge: retrieval order, retrieval confidence, and co-occurrence.
Retrieval order: The retriever evaluates entities based on their matching with the dialogue context, prioritizing those that exhibit a higher degree of matching. We incorporate the retrieval order of each entity as a part of the meta knowledge.
Retrieval confidence: To provide more retrieval information, we categorize retrieved entities into low-confidence, middle-confidence, and high-confidence groups based on their retrieval scores. The thresholds for categorizing entities are hyperparameters. Retrieval confidence, in conjunction with retrieval order, enables the generator to disregard entities with low confidence but high retrieval order.
Co-occurrence: Entities that have already appeared in the dialogue context are more likely to be relevant for future responses. Thus, we inform the generator about the occurrence of entities in the dialogue context through meta knowledge.
To implement the above meta knowledge in our system, we design three approaches: prefix, prompt, and contrastive learning.

Prefix
In this approach, we create a mapping function that assigns special tokens representing meta knowledge to each entity. For instance, an entity ranked second in retrieval order, with middle retrieval confidence, and not yet mentioned in the context would be mapped to the set of <2nd-entity>, <mid-confidence>, <new-entity>. These prefix tokens are then concatenated with the corresponding entity and input to the generator during both training and inference stages.
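A minimal sketch of such a mapping function, covering the three meta-knowledge slots (retrieval order, confidence level, co-occurrence). The function name and any tokens beyond the examples given in the text are illustrative assumptions:

```python
def prefix_tokens(order, confidence, seen_before):
    """Map an entity's meta knowledge onto special prefix tokens.
    `order` is the 1-based retrieval rank, `confidence` is one of
    "low" / "mid" / "high", and `seen_before` marks co-occurrence
    with the dialogue context."""
    ordinal = {1: "1st", 2: "2nd", 3: "3rd"}.get(order, f"{order}th")
    return [
        f"<{ordinal}-entity>",
        f"<{confidence}-confidence>",
        "<old-entity>" if seen_before else "<new-entity>",
    ]

print(prefix_tokens(2, "mid", False))
# ['<2nd-entity>', '<mid-confidence>', '<new-entity>']
```

The returned tokens would be prepended to the flattened entity text before it is fed to the generator.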

Prompt
To fully leverage the generator's language modeling capability, we explore using prompts to incorporate meta knowledge. Here, we design a mapping function that assigns each entity a set of prompts, which are natural language sentences representing the meta knowledge. For example, a prompt can be "This is the top-1 recalled entity with low confidence". Similar to the prefix approach, these prompts are concatenated with the corresponding entities and fed into the generator.

Contrastive Learning
We can also train the generator to distinguish between entities by employing contrastive learning. In this approach, we select a subset of entities from the retrieved entities E_t based on their retrieval order, forming a positive entity set E*_t. For each entity e_{t,i} in E*_t, we compute its length-normalized log-likelihood of generating the response:

g(e_{t,i}) = \frac{1}{|r_t|} \log p(r_t \mid c_t, e_{t,i}; \theta),

where |r_t| is the length of r_t. Additionally, we calculate the log-likelihood of generating the response without any entity as the baseline likelihood:

g(\emptyset) = \frac{1}{|r_t|} \log p(r_t \mid c_t; \theta).

We employ a pairwise margin ranking loss that encourages the likelihood of each positive entity to exceed the baseline likelihood by a certain margin:

L_{ctr} = \sum_{e_{t,i} \in E*_t} \max(0, \lambda - g(e_{t,i}) + g(\emptyset)),

where \lambda is a hyperparameter. We then add this loss term, weighted by a hyperparameter, to the loss function of MK-TOD.

Negative Entity
Inspired by negative sampling in information retrieval (Karpukhin et al., 2020), we also consider incorporating negative entities into the generator.
The negative entity, denoted as e^-_t \notin E_t, is chosen as the entity with the lowest retrieval score from K. Special meta knowledge is designed for the negative entity as well. Note that the negative entity is different from the baseline likelihood in the above contrastive learning (Section 3.4.3).

Model Inference
During inference, we first retrieve entities using the retriever. Then, we prepend each entity with its corresponding meta knowledge. Finally, we concatenate the entities with the dialogue context and input the resulting sequence to the generator to generate the final response. Notably, we do not include negative entities during inference.
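The inference-time input construction can be sketched as follows. Pairing each entity with a pre-rendered meta-knowledge string and joining everything with whitespace are simplifying assumptions; the actual flattening format may differ:

```python
def build_generator_input(context, entities_with_meta):
    """Prepend each retrieved entity with its meta knowledge, then
    concatenate all entities with the dialogue context. Negative
    entities are excluded at inference time, so only the retrieved
    (meta, entity) pairs are passed in."""
    entity_part = " ".join(f"{meta} {entity}" for meta, entity in entities_with_meta)
    return f"{entity_part} {context}"

# Hypothetical example: one retrieved entity with its prefix tokens.
print(build_generator_input(
    "[user] any hotel ?",
    [("<1st-entity>", "name | a and b guest house")],
))
```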

Dataset
We evaluate MK-TOD on three task-oriented dialogue datasets: MultiWOZ 2.1 (MWOZ) (Eric et al., 2020), CamRest (Wen et al., 2017), and Stanford Multi-Domain (SMD) (Eric et al., 2017). We compare the methods under two different settings of the knowledge bases. First, each dialogue has a small session-level knowledge base associated with the user goal, which is referred to as the condensed knowledge base. Second, Q-TOD (Tian et al., 2022) proposes to construct a dataset-level large-scale knowledge base for MWOZ and CamRest by merging all the condensed knowledge bases. There are 223 and 112 entities in the large-scale knowledge bases for MWOZ and CamRest, respectively. The large-scale setting imposes more challenges on the retrieval and utilization of knowledge. Other detailed statistics of these datasets are shown in Appendix A.
For all three datasets, we employ BLEU (Papineni et al., 2002) and Entity F1 (Eric et al., 2017) as the metrics to evaluate the quality of generated responses. Entity F1 assesses the presence of accurate knowledge in the responses by calculating the micro-averaged precision and recall scores of attribute values. Additionally, for experiments conducted on large-scale knowledge bases, we introduce Recall@K as a performance metric for the retriever. Recall@K measures the percentage of gold entities appearing in the retrieved entities.
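Recall@K as defined above takes only a few lines to compute; the entity IDs below are made up for illustration:

```python
def recall_at_k(gold_ids, retrieved_ids, k):
    """Recall@K: the fraction of turns whose gold entity appears
    among the top-k retrieved entities for that turn."""
    hits = sum(1 for g, r in zip(gold_ids, retrieved_ids) if g in r[:k])
    return hits / len(gold_ids)

# Hypothetical gold and retrieved entity IDs for three dialogue turns.
gold = ["e1", "e7", "e3"]
retrieved = [["e1", "e2"], ["e5", "e7"], ["e4", "e2"]]
print(round(recall_at_k(gold, retrieved, k=2), 3))  # 0.667
```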

Implementation Details
We utilize BERT (Devlin et al., 2019) as the context encoder and entity encoder for the retriever. As for the generator, we employ T5 (Raffel et al., 2020) and ChatGPT (OpenAI, 2022). Note that ChatGPT is not fine-tuned but instead undergoes in-context learning using our datasets. The retriever for ChatGPT is directly copied from the retriever trained with T5 using MML. All experiments are performed on a single 24G NVIDIA RTX 3090 GPU, and the best checkpoints are selected based on Entity F1 scores on the validation set. Hyperparameter settings are listed in Appendix E.
Consistent with previous studies (Qin et al., 2019), we initialize the retriever through pre-training with distant supervision to prevent collapsed representations. Additional details on the pre-training process can be found in Appendix D.

Baseline Methods
We include several strong baselines for comparison.
Large language models: Large language models (LLMs), such as ChatGPT (OpenAI, 2022), have demonstrated remarkable capabilities in engaging in dialogues with humans. We establish a baseline LLM utilizing ChatGPT as the response generator by leveraging the gpt-3.5-turbo API. To enhance its performance in TOD tasks, we integrate our knowledge retriever with the system.

Results and Analysis
In this section, we present the overall results obtained with both large-scale and condensed knowledge bases. In addition, we demonstrate the phenomenon of retrieval-generation misalignment and conduct ablation studies.

Overall Results with Large-Scale KBs
Comparing our method with others in the setting of retrieving knowledge from a large-scale knowledge base aligns better with real-world TOD scenarios. Therefore, we begin by comparing our proposed MK-TOD approach with baselines in the context of large-scale knowledge bases. The results of this comparison are displayed in Table 1.
The upper part of Table 1 shows the results of methods employing a fine-tuned response generator. "Ours_prefix", "Ours_prompt", and "Ours_ctr" denote our method implementing meta knowledge with prefix, prompt, and contrastive learning techniques, respectively. "Base" and "Large" following the method names indicate the use of T5-Base or T5-Large as the response generator. We observe that Q-TOD's retriever outperforms ours due to its utilization of an additional query generator, which helps generate more accurate retrieval queries. However, even when employing T5-Base and a relatively weaker retriever, our method still surpasses Q-TOD in terms of BLEU and Entity F1 by a significant margin. This indicates that our method utilizes the retrieved knowledge more effectively than Q-TOD.
The bottom part of Table 1 presents the results of methods employing ChatGPT. Since ChatGPT is not fine-tunable, we did not apply contrastive learning for meta knowledge. According to the results, we find that relying solely on in-context learning does not enable ChatGPT to perform as well as the fine-tuned methods in the context of E2E-TOD. However, our proposed approach outperforms the baseline. Additionally, our prefix method for implementing meta knowledge yields only marginal improvement or even performs worse than vanilla ChatGPT. We attribute this to ChatGPT's limited ability to learn the special prefix tokens representing meta knowledge from a limited number of in-context demonstrations and concise explanatory text. In contrast, our prompt method significantly enhances its performance.

Overall Results with Condensed KBs
To make a comprehensive comparison with previous methods, we also follow the setting of previous works and conduct evaluations on the condensed knowledge bases. The results in Table 2 indicate that our proposed method surpasses the baselines on MWOZ and SMD at the same model scale, validating the efficacy of our approach. However, among the three meta knowledge implementations, it is challenging to determine a clear preference, as the fine-tuned generator tends to learn all of them.
For the evaluation with ChatGPT on the condensed knowledge bases, we can still observe a performance gain when ChatGPT is enhanced with our proposed meta knowledge. Besides, the performance gain is more significant than that on the large-scale knowledge bases, suggesting that ChatGPT has a higher demand for retrieval quality.

Retrieval-Generation Misalignment
To investigate the influence of retrieval performance on the E2E-TOD generator, we select six retrievers on MWOZ with a large-scale knowledge base. The details of these retrievers can be found in Appendix G. We then use different generators to generate responses based on the retrieval results. As generators, we choose Q-TOD, FiD (Izacard and Grave, 2021), and ChatGPT. The Entity F1 scores of these generators are depicted in Figures 3(a) and 3(b) as the retrieval performance varies with different retrievers.
The solid lines in the figures show that the performance of the generators does not consistently align with the retrieval performance. Furthermore, the performance of Q-TOD and FiD with oracle entities is even worse than with a weak retriever. We refer to this phenomenon as retrieval-generation misalignment. In contrast, the dashed lines, which depict the results of the generators with our proposed meta knowledge, exhibit greater consistency between the retrieval performance and the generators. This indicates that our proposed method mitigates the misalignment issue. The correlation coefficients shown in parentheses next to the method names further confirm this observation.

Ablation Study
We assess the impact of maximum marginal likelihood, various types of meta knowledge, and the inclusion of negative samples. Unless otherwise specified, the ablation study is performed on the MWOZ dataset with T5-Base as the generator, considering computational resource constraints.

Maximum Marginal Likelihood
Table 3 presents the impact of the maximum marginal likelihood (MML) loss. The methods labeled "w/o MML" utilize the warmed-up retriever described in Section 4.2, without joint training with the response generator. The results demonstrate that maximum marginal likelihood enables further enhancement of the retriever during training. Consequently, the improved retrievers lead to better final generated responses.

Types of Meta Knowledge
We compare different types of meta knowledge, and the results are presented in Table 4. The findings indicate that using a single type of meta knowledge yields inferior performance compared to combining all three types. Furthermore, an interesting observation emerges when using the prefix: the retrieval order outperforms the other types of meta knowledge. In contrast, when using the prompt, the results are reversed. We attribute this phenomenon to the design of the prefix and prompt. Representing meta knowledge with a prefix introduces higher diversity in ranking order, since a distinct prefix is assigned to each ranking order. This increased diversity enables the generator to better distinguish the recalled entities, whereas the distinction between retrieval confidence and co-occurrence in the prefix setting is less obvious. In contrast, when representing meta knowledge with a prompt, the retrieval order becomes less diverse.

Table 4: Ablation study of different types of meta knowledge on MWOZ with condensed and large-scale knowledge bases. "order", "conf", "cooccur" and "all" mean using only retrieval order, retrieval confidence, co-occurrence, or all types of meta knowledge, respectively.

Negative Samples
We conduct an investigation into the impact of negative entities on the performance of dialogue systems. The results presented in Table 5 demonstrate that the inclusion of negative entities significantly improves the performance of dialogue systems when applied to T5-Base. This performance enhancement can be attributed to two main factors. Firstly, the presence of negative entities facilitates easier entity distinction for the generator, enabling it to learn more effectively. Secondly, the introduction of negative entities aids in training the retriever through the MML loss in Equation (6). This concept is somewhat analogous to the motivation behind incorporating negative samples in knowledge retrieval tasks (Karpukhin et al., 2020). However, when applied to ChatGPT, negative entities do not contribute to model performance. The reason is that ChatGPT cannot be fine-tuned, meaning that solely adding negative entities to the in-context demonstrations does not effectively teach ChatGPT to differentiate between entities. Consequently, we opt not to include negative entities when employing our method with ChatGPT.

Behavior of Generator
We examine how the generator utilizes the retrieved knowledge with the assistance of meta knowledge on the MWOZ test set. For our model and the baseline, both with T5-Large, we gather all responses that contain entities and analyze the percentage of retrieved entities that correctly appear in the responses with respect to retrieval order and confidence. As illustrated in Figure 4, our generator exhibits a higher propensity than the baseline to utilize entities with both a high retrieval order and high confidence. We also assessed the retrieval results on the MWOZ test set and found that our retriever recalls 80.69% of the gold entities at the top-1 retrieval position, which directly correlates with the system-wide performance enhancement. This observation suggests that our proposed meta knowledge helps the generator develop an inductive bias to prioritize entities that are highlighted by the retriever.

Conclusion
This paper addresses the retrieval-generation misalignment in end-to-end task-oriented dialogue systems by introducing maximum marginal likelihood to train a perceptive retriever that leverages signals from response generation. To enable the response generator to better distinguish between entities, we explore several methods for incorporating retrieval-related meta knowledge. We also propose incorporating negative entities to enhance the generator's discriminative capability. Experimental results demonstrate that, when combined with meta knowledge, the response generator effectively leverages high-quality retrieval knowledge, leading to enhanced quality of the generated responses. Through analysis, we observe that previous retrieval-augmented generator models suffer from severe retrieval-generation misalignment, while our method mitigates this misalignment.

Limitations
There are three potential limitations of this paper that warrant consideration. Firstly, the employment of the marginal likelihood method necessitates computing the likelihood for each retrieved entity, resulting in higher computational resource requirements compared to solely using the negative log-likelihood (NLL). Secondly, despite the various comparisons and ablation studies conducted in this paper, certain aspects of our proposed meta knowledge remain unexplored, such as the combined utilization of prompt and contrastive learning, as well as the utilization of retrieval order alongside co-occurrence. Lastly, the theoretical rationale behind the contribution of our proposed meta knowledge to task-oriented dialogue (TOD) is not thoroughly discussed.

G Different Retrievers for Section 5.3

In Section 5.3, we investigated the retrieval-generation misalignment by introducing several retrievers with different performances. The details of these retrievers are as follows.
BM25: The BM25 retriever computes the BM25 score between the dialogue context and each entity.
Frequency: This is a rule-based method. For each entity in the knowledge base, we compute the number of its attribute values that appear in the context. We then take the entities with the most attribute values appearing in the dialogue context as the recalled entities.
Pre-train: This retriever is the pre-trained retriever introduced in Appendix D.
Ours: This is the retriever introduced in our method (Ours_prompt (Base)).
Q-TOD: This is the retriever of Q-TOD.
Oracle: This method uses the condensed knowledge base as the retrieved entity set.The gold entity, which must appear in the condensed knowledge base, is marked as the top-1 retrieved entity with high retrieval confidence, while other entities are marked with low retrieval confidence.
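The rule-based Frequency retriever above can be sketched as follows, assuming entities are dictionaries of attribute-value pairs and matching is done by case-insensitive substring lookup (both simplifying assumptions):

```python
def frequency_retrieve(context, kb, k):
    """Rank entities by how many of their attribute values occur in
    the dialogue context, and return the top-k entities."""
    def overlap(entity):
        return sum(str(v).lower() in context.lower() for v in entity.values())
    return sorted(kb, key=overlap, reverse=True)[:k]

# Toy knowledge base with two guesthouses (illustrative records).
kb = [
    {"name": "a and b guest house", "area": "east"},
    {"name": "acorn guest house", "area": "north"},
]
ctx = "I want a hotel in the east called a and b guest house"
print(frequency_retrieve(ctx, kb, k=1)[0]["name"])  # a and b guest house
```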
We show the performance (Recall@7) of these retrievers in Table 9. "The top-3 recalled:" Fourthly recalled entity "The top-4 recalled:" Fifthly recalled entity "The top-5 recalled:" The entities recalled behind the 5th entity and the easy negative entity "The negative entity recalled:" Retrieval Confidence Entity with retrieval score >= 0.75 "with high confidence:" Entity with retrieval score < 0.75 and >= 0.25 "with middle confidence:" Entity with retrieval score < 0.25 and the easy negative entity "with low confidence:" Co-occurrence Relation Entity existed in the dialogue context "existed in history:" Entity not existed in the dialoglue context and the easy negative entity "newly recalled:" You answer questions like a customer service.There is a knowledge base for each question.Each record of knowledge base is accompanied by three tags.
The first tag indicates whether this entity appeared before.<new-entity> means this is a new entity, and <old-entity> means this entity appeared before.
The second tag indicates the authenticity of the third tag.There are three types <low-confidence>, <mid-confidence> and <high-confidence> indicating low, middle, high retrieval confidence respectively.Higher retrieval confidence mean the entity is potentially more related to the user goal.
The third tag indicates its importance to the dialogue.<nth-entity> means it is the nth important entity in the knowledge base, for example, <1th-entity> is the top-1 important and <other-entity> means it is not important.The max length of your response is 50 words.
Next are some demonstrations. That's all of knowledge base. The dialogue is as follow: [user] yes , i am looking for a place to stay tonight . the hotel should be like a guesthouse in looks and style . ideally , i ' d like one in the moderate price range , please .
[answer] is there a specific area you would like to stay in ? also , do you need internet and / or free parking ? That's all of example 1 ...... {Add the test sample here} You answer questions like a customer service. There is a knowledge base for each question.
The max length of your response is 50 words.
Next are some demonstrations. That's all of knowledge base. The dialogue is as follow: [user] yes , i am looking for a place to stay tonight . the hotel should be like a guesthouse in looks and style . ideally , i ' d like one in the moderate price range , please .
[answer] is there a specific area you would like to stay in ? also , do you need internet and / or free parking ? That's all of example 1 ...... {Add the test sample here}

Figure 3: Entity F1 scores of generators ((a) FiD&Q-TOD and (b) ChatGPT) as the retrieval performance varies with different retrievers. Bracketed numbers following model names refer to the correlation coefficients between retrieval performance and Entity F1.

Figure 4: The percentage of samples utilizing the entities to generate responses with respect to (a) retrieval order and (b) retrieval preference.
There are three special tokens [assistant], [user] and [answer]. [assistant] leads the response of the customer service, [user] leads what user say and [answer] leads the Ground Truth answer of the example. You answer questions like a customer service of hotel reservation with a knowledge base. Knowledge base is in the form of : address | area | internet | name | parking | phone | postcode | pricerange | stars | type. Knowledge base is as follow. First one : 124 tenison road | east | yes | a and b guest house | 01223315702 | cb12dp | moderate | 4 star | guesthouse <1st-entity> <low-confidence> <new-entity> Next one : ...... Next one : ......

Figure 5: The input prompt and demonstration for ChatGPT with meta knowledge as the prefix.
There are three special tokens [assistant], [user] and [answer]. [assistant] leads the response of the customer service, [user] leads what user say and [answer] leads the Ground Truth answer of the example. You answer questions like a customer service of hotel reservation with a knowledge base. Knowledge base is in the form of : address | area | internet | name | parking | phone | postcode | pricerange | stars | type. Knowledge base is as follow. First one : 124 tenison road | east | yes | a and b guest house | 01223315702 | cb12dp | moderate | 4 star | guesthouse This is a new entity. It has low possibility that this entity is top-1 important. Next one : ...... Next one : ......

Figure 6: The input prompt and demonstration for ChatGPT with meta knowledge as prompt.

Table 1: Overall results of E2E-TOD systems with large-scale knowledge bases on MWOZ and CamRest, where " * " means that we directly use the retriever co-trained with T5-Base using MML.

Table 2: Overall results of E2E-TOD systems with condensed knowledge bases on MWOZ, SMD, and CamRest. The best results are highlighted in bold, and the second-best results are underlined.

Table 3: Ablation study of the MML loss.

Table 5: Ablation study of negative entities.

Table 7: The number of retrieved entities under different settings.

Table 8: Overall results of E2E-TOD systems with large-scale knowledge bases on MWOZ and CamRest.

Table 9: The performance (Recall@7) of different retrievers for the retrieval-generation misalignment study.

Table 10: The mapping rules from different types of meta knowledge to the prefix token.

Table 11: The mapping rules from different types of meta knowledge to the prompt for T5.
Entity not existing in the dialogue context and the easy negative entity: "This is a new entity."

Table 12: The mapping rules from different types of meta knowledge to the prompt for ChatGPT.

Table 13: Hyperparameter settings of our system.