Controllable Factuality in Document-Grounded Dialog Systems Using a Noisy Channel Model

In this work, we present a model for document-grounded response generation in dialog that is decomposed into two components according to Bayes' theorem. One component is a traditional ungrounded response generation model, and the other models the reconstruction of the grounding document based on the dialog context and the generated response. We propose different approximate decoding schemes and evaluate our approach on multiple open-domain and task-oriented document-grounded dialog datasets. Our experiments show that the model is more factual in terms of automatic factuality metrics than the baseline model. Furthermore, we outline how introducing scaling factors between the components allows for controlling the tradeoff between factuality and fluency in the model output. Finally, we compare our approach to a recently proposed method to control factuality in grounded dialog, CTRL (arXiv:2107.06963), and show that both approaches can be combined to achieve additional improvements.


Introduction
Recently, document-grounded dialog systems have seen an increase in popularity. Not only are they used to extend task-oriented systems beyond the narrow scope of fixed databases and APIs (Feng et al., 2020; Kim et al., 2020, 2021; Cohen et al., 2022), but also to ground open-domain conversations in information from the web (Zhou et al., 2018; Dinan et al., 2019; Komeili et al., 2022) or in persona descriptions that make dialog agents more interesting and engaging (Zhang et al., 2018). In any of these applications, the conversational system has to use the information from the document and blend it into the response (Roller et al., 2020). This means that the response should, on the one hand, remain fluent, grammatically correct, and coherent with the previous utterances in the dialog, and, on the other hand, correctly reflect the information from the document. This entails that no information is altered and that no new information is added unless it is immediately verifiable. This is especially crucial whenever a user relies on the system to satisfy an information need (Santhanam et al., 2021).

Code available at: https://github.com/ndaheim/noisy_channel_model

* Work done while at RWTH Aachen University, Chair for Human Language Technology and Pattern Recognition
While previous work has shown that retrieving relevant information is a crucial step for task-oriented (Kim et al., 2020) and open-domain document-grounded dialog systems and a potential mitigator of inconsistencies (Shuster et al., 2021), there is ample evidence that grounded response generation models may still fail to produce factual responses, even when the correct information is contained in their grounding document. In general, generating outputs that are both fluent and correct remains an open problem, not only in dialog systems but in natural language generation as a whole (Cao et al., 2018; Maynez et al., 2020; Roller et al., 2021; Ji et al., 2022), which potentially limits industry adoption. Models have been found to contradict themselves (Shuster et al., 2022b) or their grounding, and to add additional information that might be harmful and is not verifiable (Shuster et al., 2022a; Ji et al., 2022). Recently, different mitigation strategies have been proposed. Cohen et al. (2022), for example, use learned discriminators to choose from an n-best list, while Rashkin et al. (2021) introduce special control tokens (Keskar et al., 2019) to encourage lexical overlap and entailment between grounding and response. However, discriminators based on estimating human judgments require additional data for training (Cohen et al., 2022), which is costly to obtain. Furthermore, discriminating based on a single attribute risks harming other relevant properties. For example, discriminating based on factuality can lead to responses that mostly repeat their grounding information (Cohen et al., 2022) and might introduce a loss of fluency and dialog coherence. In general, these two goals might conflict, and their relative importance depends on the task at hand. While chit-chat lives from engagement and might tolerate inconsistencies, factuality is service-critical in task-oriented systems.
In this work, we present a probabilistic model that inherently combines both of these goals. By factorizing the model according to Bayes' theorem, we obtain one component that models each goal explicitly. Hence, introducing scaling factors allows for controlling the tradeoff between them. Furthermore, additional unlabeled dialog data can easily be integrated to train one of the components. As directly decoding the model is intractable, we present different approximate decoding schemes for reranking and online decoding that yield significant gains in terms of automatic factuality metrics on several datasets.

Document-grounded dialog systems
There has been significant work on document-grounded dialog systems in recent years. A large number of datasets have been proposed for open-domain dialog in order to facilitate engaging conversations about a variety of topics, such as movies (Zhou et al., 2018), Wikipedia knowledge (Dinan et al., 2019; Dziri et al., 2022a), personal attributes of the agent (Zhang et al., 2018; Dinan et al., 2020), or arbitrary information from the internet (Komeili et al., 2022). Similarly, different task-oriented dialog datasets for information-seeking conversations have been proposed (Kim et al., 2020, 2021; Feng, 2021). Different works have dealt with the problem of document retrieval, for example with batching hard negatives (He et al., 2021) or efficient document retrieval (Thulke et al., 2021), as well as with identifying (Feng, 2021) or rephrasing salient passages within documents (Shuster et al., 2022a). Finally, there is also significant work on generating grounded responses using this information, for example in low-resource scenarios (Zhao et al., 2020) or with an emphasis on faithful generation, which we explore in the following section.

Hallucination in language generation and dialog
The problem of hallucination, which one might define as generated information that is not grounded in the document, the dialog context, or common sense, has recently received plenty of attention in neural language generation (Ji et al., 2022), for example in summarization (Cao et al., 2018; Maynez et al., 2020) and dialog systems (Roller et al., 2021). Hence, different mitigation strategies that aim to increase the faithfulness of responses have been proposed. Along with mitigation strategies, metrics and benchmarks for evaluating factuality have been proposed. For example, Q² (Honovich et al., 2021) performs question-answering-based matching between response and grounding, and BEGIN (Dziri et al., 2022c) provides a benchmark for metric evaluation.
Recently, Dziri et al. (2022b) also show that current grounded datasets contain ground-truth responses that further encourage hallucination by being insufficiently grounded, and Dziri et al. (2022a) propose FaithDial, a filtered version of Wizard-of-Wikipedia (Dinan et al., 2019), to mitigate this.

Noisy Channel Modeling in NLP
Given an input sequence x_1^T and an output sequence y_1^N, the noisy channel approach (Shannon, 1948) models the posterior probability of y_1^N given x_1^T via Bayes' rule as

p(y_1^N | x_1^T) ∝ p(x_1^T | y_1^N) · p(y_1^N).

For a long time, such models have been the dominant way of performing Automatic Speech Recognition (ASR) and Machine Translation (MT) (Brown et al., 1993). In ASR, p(x_1^T | y_1^N) models the acoustic channel (Bahl et al., 1983) and is often called the channel model. With the advent of deep learning, discriminative approaches have become popular in both fields and achieve state-of-the-art results (Graves et al., 2006, 2013; Vaswani et al., 2017; Gulati et al., 2020). Nevertheless, the noisy channel approach has recently been explored again for MT (Yu et al., 2017; Yee et al., 2019; Yu et al., 2020; Jean and Cho, 2020; Subramanian et al., 2021), text classification (Min et al., 2022), style transfer (Thulke et al., 2022), and task-oriented dialog systems that are not document-grounded (Liu et al., 2021).
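To make the decision rule above concrete, the following toy sketch applies it to a spelling-correction-style choice in log space. All probabilities are invented numbers for illustration only; a real system would obtain them from trained channel and prior models.

```python
import math

def noisy_channel_decode(candidates, log_p_channel, log_p_prior):
    """Pick y* = argmax_y [log p(x | y) + log p(y)]."""
    return max(candidates, key=lambda y: log_p_channel[y] + log_p_prior[y])

candidates = ["their", "there"]
# Toy channel likelihoods p(x | y) of the observed input given each candidate.
log_p_channel = {"their": math.log(0.2), "there": math.log(0.6)}
# Toy prior probabilities p(y) of each candidate.
log_p_prior = {"their": math.log(0.5), "there": math.log(0.3)}

best = noisy_channel_decode(candidates, log_p_channel, log_p_prior)
```

With these numbers the channel evidence outweighs the prior, so "there" is selected (0.6 · 0.3 > 0.2 · 0.5).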

Grounded Response Generation
The goal of dialog systems is to find an appropriate system response u_{T+1} conditioned on a sequence of previous turns u_1^T := (u_1, ..., u_t, ..., u_T) taken by different interlocutors, where each turn u_t = ([u_t]_1, ..., [u_t]_{N_t}) is a sequence of N_t tokens from the model vocabulary V, prepended with the start-of-sequence symbol [u_{T+1}]_0 := ⟨sos⟩. This is usually done by means of a probabilistic language generation model that models the posterior distribution of the response given the context and is locally normalized, such that the response is generated autoregressively according to

p(u_{T+1} | u_1^T) = ∏_{n=1}^{N_{T+1}} p([u_{T+1}]_n | [u_{T+1}]_0^{n-1}, u_1^T).    (1)

In document-grounded dialog systems, the response is additionally grounded in information that defines parts of its meaning and is given in the form of unstructured text. For example, in a restaurant booking setting, text on websites might state whether dogs can be brought, or in a chit-chat system the grounding information might define the agent's persona. While in the latter case the grounding is known at test time as a property of the agent, it is usually unknown in the former case. Then, the dialog system has to decide on relevant grounding documents, which are obtained from a document base D. In this case, retrieval models that model the distribution p(d | u_1^T, D) in order to rank documents d ∈ D may be employed. By introducing d as a latent variable, the posterior distribution of the response given u_1^T and D is given as follows:

p(u_{T+1} | u_1^T, D) = Σ_{d∈D} p(d | u_1^T, D) · p(u_{T+1} | d, u_1^T, D) ≈ Σ_{d∈D} p(d | u_1^T, D) · p(u_{T+1} | d, u_1^T).
For large D, the sum is then approximated by either top-k (Lewis et al., 2020b; Thulke et al., 2021) or maximum approximation. Furthermore, due to the input length restriction of current language generation models, the dependency on D is usually dropped, as outlined in the last step. Using maximum approximation, the model becomes

p(u_{T+1} | u_1^T, D) ≈ p(d̂ | u_1^T, D) · p(u_{T+1} | d̂, u_1^T),

where d̂ = argmax_{d∈D} p(d | u_1^T, D), i.e. we maximize just the retrieval probability, since the simultaneous maximization over both components is intractable. This results in a widely-used two-step approach (e.g. Kim et al., 2020; He et al., 2021), where retrieval using p(d | u_1^T, D) is followed by a response generation model that uses the retrieved document d̂. Then, since the retrieval probability is constant during response generation, we may decide for a response according to the decision rule

û_{T+1} = argmax_{u_{T+1}} p(u_{T+1} | d̂, u_1^T).    (2)

In line with recent work (Yee et al., 2019; Liu et al., 2021), we refer to the model used in Equation (2) as the direct model in the following. While comparatively simple to use, the model cannot directly use data without grounding annotations and has been observed to produce a significant number of incorrect outputs, even with ground-truth grounding (Santhanam et al., 2021; Cohen et al., 2022).
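The two-step approach can be sketched as follows: pick the highest-scoring document under the retrieval model, then generate the response conditioned only on that document. The retrieval scores and the generation function below are toy stand-ins, not the actual retriever or BART decoder.

```python
def two_step_decode(retrieval_scores, generate):
    """Two-step approach: d_hat = argmax_d p(d | context, D), then
    decode the response with the direct model conditioned on d_hat."""
    d_hat = max(retrieval_scores, key=retrieval_scores.get)  # step 1: retrieval
    return d_hat, generate(d_hat)                            # step 2: generation

# Toy retrieval distribution p(d | u_1^T, D) over a two-document base.
retrieval_scores = {"doc_hours": 0.7, "doc_parking": 0.3}

def toy_generate(doc):
    # Stand-in for beam search with the direct model p(u_{T+1} | d_hat, u_1^T).
    canned = {"doc_hours": "We are open from 9am to 5pm.",
              "doc_parking": "Parking is available on site."}
    return canned[doc]

d_hat, response = two_step_decode(retrieval_scores, toy_generate)
```

Note that the generation step never sees the non-selected document, which is exactly the maximum approximation described above.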

A Noisy Channel Approach
We may use the following equivalent decision rule, where the direct model is factorized according to Bayes' theorem:

û_{T+1} = argmax_{u_{T+1}} { p(d̂ | u_{T+1}, u_1^T) [channel model] · p(u_{T+1} | u_1^T) [response generation model] }.    (3)

The first component can be formulated as a language generation model over the document tokens,

p(d̂ | u_{T+1}, u_1^T) = ∏_{m=1}^{M} p(d̂_m | d̂_1^{m-1}, u_{T+1}, u_1^T),    (4)

and can be interpreted as favouring responses that allow reconstructing the grounding based on the generated response. Hence, one would expect accurate responses to receive higher probability estimates. The second component is an ungrounded response generation model as in Equation (1) and favours fluent responses regardless of the grounding. Furthermore, it can be trained on large amounts of additional dialogues without grounding annotations. In line with previous work on similar models, we refer to the first component as the channel model. Introducing scaling factors between both components then allows for weighting the two objectives of correctness and fluency in order to control the outputs of the model.
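The effect of the scaling factors can be illustrated with a toy log-linear combination of the two component scores. The hypotheses and their log-probabilities below are invented for illustration; only the scoring scheme mirrors the scaled decision rule.

```python
def channel_score(log_p_channel, log_p_lm, lam1, lam2):
    # lam1 scales the channel model (factuality),
    # lam2 scales the ungrounded response model (fluency).
    return lam1 * log_p_channel + lam2 * log_p_lm

hypotheses = {
    # hypothesis: (log p(d | u, ctx), log p(u | ctx)) -- toy values
    "copies the document verbatim": (-0.5, -6.0),
    "fluent but weakly grounded":   (-4.0, -1.5),
}

def pick(lam1, lam2):
    return max(hypotheses,
               key=lambda u: channel_score(*hypotheses[u], lam1, lam2))

factual_choice = pick(lam1=1.5, lam2=0.2)  # channel model dominates
fluent_choice = pick(lam1=0.2, lam2=1.5)   # response model dominates
```

Scaling the channel model up selects the verbatim, well-grounded hypothesis; scaling the ungrounded model up flips the decision to the fluent one, which is the controllability the text describes.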
We also note that one might arrive at a similar model by the following computation, starting from the document-grounded dialog model without maximum approximation:

p(u_{T+1} | u_1^T, D) = Σ_{d∈D} p(d | u_1^T, D) · p(u_{T+1} | d, u_1^T) ≈ p(u_{T+1} | u_1^T) · Σ_{d∈D} p(d | u_{T+1}, u_1^T).    (5)

Nevertheless, we leave the exploration of this model to future work.
Decoding the Noisy Channel Model

Since the channel model has to be evaluated for each hypothesis generated by the response generation model, decoding the model even with beam search is intractable, as k · |V| many hypotheses would need to be scored at each iteration with a beam size of k. Therefore, we derive two algorithms to approximately decode the model. First, we introduce reranking, where the noisy channel model is used to score a set of full candidate responses. Then, we introduce online decoding, where the noisy channel model is used to score partial responses during beam search. In both cases, we resort to a proposal model q to generate candidates. Recall that we have the following relationship between the direct and the noisy channel model:

p(u_{T+1} | d̂, u_1^T) = p(d̂ | u_{T+1}, u_1^T) · p(u_{T+1} | u_1^T) / p(d̂ | u_1^T).    (6)

During decoding with the maximum-approximated model introduced in the previous section, p(d̂ | u_1^T) is constant and may be dropped. This makes the direct model the natural choice for a proposal model, as we would obtain the same results as with the noisy channel model given the true distributions.
In reranking, the direct model generates a set of full responses U_{T+1}, from which we decide according to the noisy channel model as follows:

û_{T+1} = argmax_{u_{T+1} ∈ U_{T+1}} { λ1 · log p(d̂ | u_{T+1}, u_1^T) + λ2 · log p(u_{T+1} | u_1^T) },    (7)

where λ1 ∈ R≥0 and λ2 ∈ R≥0 are scaling factors.
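The reranking decision rule can be sketched as a function over an n-best list; the optional third factor anticipates adding the direct model score, as discussed next. The n-best entries and all log-probabilities are toy values.

```python
def rerank(nbest, lam1, lam2, lam_direct=0.0):
    """nbest: list of (response, log_p_channel, log_p_lm, log_p_direct).
    Returns the response maximizing the scaled noisy channel score,
    optionally including the direct model as a third factor."""
    def score(hyp):
        _, lp_ch, lp_lm, lp_dir = hyp
        return lam1 * lp_ch + lam2 * lp_lm + lam_direct * lp_dir
    return max(nbest, key=score)[0]

nbest = [
    # (response, log p(d|u,ctx), log p(u|ctx), log p(u|d,ctx)) -- toy values
    ("generic response",  -5.0, -1.0, -1.2),
    ("grounded response", -1.0, -2.5, -2.0),
]

best = rerank(nbest, lam1=0.5, lam2=0.2)
```

With the reranking factors used in the experiments (λ1 = 0.5, λ2 = 0.2), the well-grounded candidate wins despite its lower fluency score.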
One might also add the direct model as an additional factor without additional computational effort (its probabilities are already calculated during beam search), which has been shown to be beneficial in earlier work (Liu et al., 2021). We note that this resembles the use of discriminators to select responses, but does not require additional annotation effort beyond the grounding annotations.

Since reranking is limited by the hypotheses generated by the proposal model, we propose an online decoding algorithm in which the noisy channel model is used during beam search. Since the channel model p(d̂ | u_{T+1}, u_1^T) depends on the final hypothesis, which is not available during search, we approximate it using a model p(d̂ | [u_{T+1}]_0^n, u_1^T) that only depends on partially generated hypotheses, similar to Liu et al. (2021). At each step, we score the k hypotheses in our beam using the noisy channel model. Since scoring all k · |V| possible extensions is infeasible, we select the best k extensions using only their direct model score. The noisy channel model score for each of the k partial hypotheses [u_{T+1}]_0^n up to length n is then calculated as follows:

λ1 · log p(d̂ | [u_{T+1}]_0^n, u_1^T) + λ2 · log p([u_{T+1}]_0^n | u_1^T).    (8)

Again, we might add the direct model as an additional factor to the score. The algorithm requires locally-normalized models and a channel model trained on partial responses, which we obtain by truncating responses in training according to a uniform distribution over their length.

Algorithm 1: Pseudocode for Liu et al.'s (2021) (-) and our (+) decoding algorithm.
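A single step of the online decoding scheme can be sketched as follows. For simplicity, this toy version extends each beam hypothesis with only its best direct-model continuation before rescoring with Equation (8); the stand-in scoring functions below are invented, and hypotheses are tuples of tokens.

```python
def online_step(beam, extend_direct, channel_lp, lm_lp,
                lam1=0.6, lam2=0.4, k=2):
    # 1) best extension per hypothesis, chosen by the direct model only
    extended = [extend_direct(hyp) for hyp in beam]
    # 2) rescore partial hypotheses with the noisy channel score (Eq. 8)
    #    and prune back to the beam size k
    extended.sort(key=lambda h: lam1 * channel_lp(h) + lam2 * lm_lp(h),
                  reverse=True)
    return extended[:k]

# Toy stand-ins for the model components:
def extend_direct(hyp):
    return hyp + ("open",) if "is" in hyp else hyp + ("hello",)

def channel_lp(hyp):
    # Partial-response channel model: rewards mentioning document content.
    return 0.0 if "open" in hyp else -3.0

def lm_lp(hyp):
    # Ungrounded response model: mild length penalty.
    return -0.1 * len(hyp)

beam = [("museum", "is"), ("well", "uh")]
new_beam = online_step(beam, extend_direct, channel_lp, lm_lp, k=1)
```

The grounded partial hypothesis survives the pruning step because the channel model score dominates with λ1 = 0.6, the factor used for online decoding in the experiments.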
Lastly, we experiment with the algorithm proposed by Liu et al. (2021), which uses the direct model to generate k1 extensions to each of the k2 hypotheses in the beam. The k1 · k2 new hypotheses are then pruned back to size k2 using the noisy channel model. We outline both algorithms in Algorithm 1.
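The expand-and-prune variant of Liu et al. (2021) can be sketched in a few lines; the proposal and noisy channel scoring functions here are toy stand-ins (k1 = 2 hard-coded continuations), and hypotheses are token tuples.

```python
def expand_and_prune(beam, propose_k1, nc_score, k2):
    """One step of Liu et al.'s (2021) scheme: the direct model proposes
    k1 continuations per hypothesis; the k1 * k2 candidates are pruned
    back to k2 with the noisy channel score."""
    candidates = [ext for hyp in beam for ext in propose_k1(hyp)]
    candidates.sort(key=nc_score, reverse=True)
    return candidates[:k2]

def propose_k1(hyp):
    # Stand-in for k1 = 2 direct-model continuations.
    return [hyp + (tok,) for tok in ("open", "closed")]

def nc_score(hyp):
    # Toy noisy channel score preferring the grounded token "open".
    return hyp.count("open") - 0.1 * len(hyp)

beam = [("it", "is"), ("maybe",)]
new_beam = expand_and_prune(beam, propose_k1, nc_score, k2=2)
```

Note the contrast to the previous sketch: here the noisy channel model chooses among several continuations per hypothesis, at the cost of k1 times more channel model evaluations per step.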

Experiments
We evaluate our model on multiple English document-grounded dialog datasets by comparing it to the direct modeling objective and the CTRL model (Keskar et al., 2019) presented by Rashkin et al. (2021), where the input is augmented with a sequence of control tokens c_1^M. Thus, the model becomes p(u_{T+1} | c_1^M, d̂, u_1^T), which we factorize in line with Section 4. We omit the "objective voice" token since, for example, Personachat specifically targets conversations where the system responds in first person. The datasets and metrics are described in Sections 5.1 and 5.2, respectively. In all our experiments, we finetune the BART-large (Lewis et al., 2020a) checkpoint provided as part of the huggingface transformers library (Wolf et al., 2020), which we also use to implement our experiments. To determine the scaling factors for the noisy channel model, we run a hyperparameter sweep over λ_i ∈ {0.1, 0.2, ..., 2.0} on the validation sets and choose the parameters that perform best in terms of Q². After seeing similar trends on all datasets, we use λ1 = 0.6, λ2 = 0.4 for online decoding and λ1 = 0.5, λ2 = 0.2 for reranking in all experiments. Furthermore, we always use our proposed online decoding algorithm (see Section 6.3 for a comparison). The results obtained with these experiments are discussed in Section 6.
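The scaling-factor sweep amounts to a grid search maximizing validation Q². The sketch below uses an invented surrogate objective in place of actual decoding and Q² scoring, chosen so it peaks at the online decoding values reported above.

```python
def sweep(evaluate, grid):
    """Grid search over (lam1, lam2) pairs, keeping the best by the
    supplied evaluation function (validation Q^2 in the experiments)."""
    return max(((l1, l2) for l1 in grid for l2 in grid),
               key=lambda pair: evaluate(*pair))

grid = [round(0.1 * i, 1) for i in range(1, 21)]  # 0.1, 0.2, ..., 2.0

def toy_q2(lam1, lam2):
    # Invented surrogate for validation Q^2, peaking at (0.6, 0.4).
    return -((lam1 - 0.6) ** 2 + (lam2 - 0.4) ** 2)

best_pair = sweep(toy_q2, grid)
```

In practice each grid point requires decoding the full validation set, which is why the factors were tuned once and then reused across datasets.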

Datasets
This section gives a brief overview of the different datasets used in our experiments which capture a variety of settings, for example task-oriented and open-domain dialogs grounded in documents of varying lengths.Dataset statistics can be found in Appendix A.3.
Personachat Personachat (Zhang et al., 2018) is a crowdsourced open-domain dialog dataset, where dialogs are grounded in persona descriptions that consist of five short sentences.In our experiments, we use the self configuration and evaluate on the validation split.
Wizard-of-Wikipedia Wizard-of-Wikipedia (WoW) (Dinan et al., 2019) is a crowdsourced open-domain dialog dataset, where turns are grounded in Wikipedia articles. Only the wizard can access the grounding in order to teach the apprentice, but may also choose not to use any grounding. We evaluate on the subset of grounded wizard turns.

DSTC9 DSTC9 Track 1 is an extension of the MultiWoz 2.1 dataset (Eric et al., 2020), where turns require information beyond the existing API structure, which was collected from FAQ documents. The test set contains conversations about a new location and a held-out domain, as well as transcripts of spoken conversations.

Doc2Dial Doc2dial (Feng et al., 2020) is a goal-oriented dialog dataset, where the agent provides a user with information from public government service websites. The grounding annotations are given on different levels. We use a concatenation of the annotated spans as grounding instead of taking entire paragraphs.
FaithDial Based on the observation that current dialog datasets contain insufficiently grounded annotations that encourage hallucination (Dziri et al., 2022b), Dziri et al. (2022a) release FaithDial, an edited version of Wizard-of-Wikipedia that contains significantly fewer hallucinations. For this, crowdworkers edited 44% of the grounded training responses from seeker-initiated conversations and all grounded responses in the validation and test sets.

Evaluation Metrics
In line with the shared tasks on some of the datasets we use for evaluation (Kim et al., 2020, 2021; Feng et al., 2020), we use the sacrebleu (sBLEU) (Post, 2018) implementation of BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) to assess our model generations with word-overlap-based metrics. In addition, we use BERTScore (Zhang et al., 2020) and the token-level F1-score between u_{T+1} and d, as well as the recently proposed Q² metric (Honovich et al., 2021), to evaluate the factual consistency of our models.
Q² is a model-based metric that generates questions based on the response and matches the answers derived from the response and the grounding using an NLI model; it has shown strong correlation with human judgments in system-level evaluation on WoW.
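The token-level F1-score between response and grounding can be computed as below, counting overlap with multiplicity in the style of SQuAD evaluation. The whitespace tokenization and lowercasing here are simplifying assumptions; the exact preprocessing used in the experiments may differ.

```python
from collections import Counter

def token_f1(response, grounding):
    """Token-level F1 between a response and its grounding document."""
    r_toks = response.lower().split()
    g_toks = grounding.lower().split()
    # Overlap with multiplicity via a multiset intersection.
    overlap = sum((Counter(r_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(r_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

score = token_f1("the museum opens at 11am",
                 "the museum opens at 11am on wednesdays")
```

Here the response is fully contained in the grounding (precision 1.0) but covers only five of its seven tokens (recall 5/7), giving F1 = 5/6.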

Retrieval
In addition to experiments that use ground-truth grounding, we also experiment with using the outputs of retrieval models, since the grounding is usually not known at inference time. For retrieval, we use two architectures. First, a Bi-Encoder (Bromley et al., 1993), where a dialog encoder and a document encoder map u_1^T and each d ∈ D, respectively, to fixed-size dense vectors of the same dimension. The grounding document is determined by nearest neighbor search, i.e. the decision is made for the document whose vector is closest to the dialog vector in the embedding space. In our experiments, the weights of the dialog and document encoders are shared and trained using the triplet loss criterion.
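The Bi-Encoder retrieval step reduces to a nearest neighbor search over precomputed document vectors. The three-dimensional vectors below are invented stand-ins for the encoder outputs; a real system would use high-dimensional embeddings and an approximate nearest neighbor index.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(context_vec, doc_vecs):
    """Nearest neighbor search by inner product over the document index."""
    return max(doc_vecs, key=lambda d: dot(context_vec, doc_vecs[d]))

# Toy precomputed document embeddings (one vector per document).
doc_vecs = {
    "opening_hours": [0.9, 0.1, 0.0],
    "pet_policy":    [0.0, 0.2, 0.9],
}
# Toy encoding of a dialog context asking about opening times.
context_vec = [0.8, 0.0, 0.1]

d_hat = retrieve(context_vec, doc_vecs)
```

Because documents are embedded independently of the context, the document vectors can be indexed once offline, which is what makes the Bi-Encoder efficient for large D compared to the Cross-Encoder discussed next.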
Second, we employ a Cross-Encoder, which provides strong performance across a variety of tasks but remains too inefficient to be used in practice with large D (Reimers and Gurevych, 2019; Humeau et al., 2020; Karpukhin et al., 2020; Thulke et al., 2021). In the Cross-Encoder, dialog context and document are concatenated as the input to a Transformer model that subsequently performs relevance classification, such that the document with the highest score is retrieved.

Results
Table 1 shows the results obtained with our model using our proposed online decoding algorithm and reranking, with no training data beyond the corresponding training set. We identify the following trends: 1) Our model consistently outperforms the direct modeling objective in terms of all automated factuality metrics.
2) There is no clear trend in terms of word-overlap-based metrics, where our model and the direct model show comparable performance.
3) CTRL gives larger improvements in terms of Q² on all datasets but DSTC9. Nevertheless, the additional control tokens may be seen as a data filtering method that adds new information to the training data that is not available to our model.

4) Adding control tokens to our model, i.e. combining CTRL with the noisy channel approach, gives further improvements in terms of factuality metrics.
5) The improvements obtained on the unseen set of Wizard-of-Wikipedia indicate that the model is also able to generalize appropriately to new information.
Overall, the results indicate that by scaling the channel model contribution appropriately, the faithfulness of responses can indeed be improved in comparison to the direct model. In the following, we present further results to understand how the model behaves under different scaling factors (Section 6.1), uncertain retrieval (Section 6.2), different compute budgets (Section 6.3), and the presence of additional data (Section 6.4), before concluding the section with a qualitative analysis (Section 6.5).

Controllability
Table 2 shows results obtained with noisy channel online decoding with different scaling factors.
In addition to the previously mentioned metrics, we report perplexity as a proxy for fluency (Dinan et al., 2020) and the longest common subsequence (LCS) of u_{T+1} and d, normalized by the length of u_{T+1}, to indicate how much of the response is copied from the grounding. We can see that with a higher channel model factor, the model copies more from the grounding and produces more factual responses. However, these responses have a higher perplexity, as judged by a response generation model trained on all datasets. Thus, by varying the scaling factors one can control how much is copied from the grounding and how much importance is given to retaining a response that fits well into the dialog. CTRL copies significantly more from the grounding than our model. All models, including the direct model, copy more than the ground-truth responses, which is consistent across all datasets. We note that this may not be desirable in open-domain dialog settings.
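The copying measure above can be computed with standard dynamic programming. The token-level granularity and whitespace tokenization in this sketch are assumptions; the measure itself, LCS length divided by response length, follows the description in the text.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def normalized_lcs(response, grounding):
    """LCS of response and grounding, normalized by response length."""
    r, g = response.split(), grounding.split()
    return lcs_len(r, g) / len(r)

ratio = normalized_lcs(
    "it opens at 11am on wednesdays",
    "opening hours are from 11am to 5pm on wednesdays only")
```

A ratio near 1 means the response is almost entirely copied from the grounding; here half of the response tokens ("11am on wednesdays") appear in order in the grounding, giving 0.5.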

Retrieval
Table 3 shows results obtained with the noisy channel model and the direct model under uncertain retrieval. We limit our study to DSTC9, since setting up competitive retrievers for each dataset is non-trivial; for example, on doc2dial a QA-based pipeline is often used (Daheim et al., 2021; Feng, 2021). For both the Bi-Encoder and the better performing Cross-Encoder retrieval model, the noisy channel model produces more factual responses than the direct model. This shows that the model remains effective in a realistic setting.

Online Decoding by compute budget
Figure 1 shows a comparison of our proposed decoding algorithm, the algorithm outlined by Liu et al. (2021), and standard beam search using the direct model. We can see that the noisy channel model with our decoding algorithm consistently gives the best performance in terms of Q². Furthermore, for the same compute budget, the direct model even performs better on Wizard-of-Wikipedia than the decoding algorithm proposed by Liu et al. (2021), which takes a significant number of beams for improvements to show. In general, a larger beam size improves the performance of the noisy channel model but not the direct model.

Additional data
In order to study the effect of additional training data on the response generation model, we train this component on all target datasets as well as MultiWoZ 2.1 (Eric et al., 2020), Taskmaster-1 (Byrne et al., 2019), TopicalChat (Gopalakrishnan et al., 2019), and CMU DoG (Zhou et al., 2018). However, as shown in Table 4, we do not see consistent improvements, but rather dataset-specific ones. For example, Q² tends to be better with more data in reranking but not in online decoding, which also did not change in our experiments with different scaling factors.

Qualitative Analysis
When comparing the generation outputs of all models, we make the following observations: 1) The outputs of both the noisy channel model and CTRL appear more faithful to the grounding and more specific. For example, on DSTC9 and doc2dial the direct model sometimes leaves out important details, such as that an ID card needs to be shown when someone picks up a ticket at a train station, which both the noisy channel model and CTRL incorporate into the response. Furthermore, the direct model appears to generate generic responses more often, such as "do you have a pet?" when the topic in Personachat is "dog" or "cat".
2) In general, the qualitative analysis supports the finding from our automatic evaluation that a higher channel model factor leads to more of the grounding being copied into the response, which however can come at a loss of fluency and coherence.
3) A higher response generation model factor leads to more abstractiveness and a better-connected response that, for example, contains follow-up questions more often. On the other hand, too high a factor also led to hallucinations in our experiments.

4) When comparing CTRL and the noisy channel model, the main difference appears to be that the responses of the noisy channel model (with a suitable factor) seem more connected to the previous turns than those of CTRL, especially on a chit-chat task like Personachat. Nevertheless, with a higher channel model factor the generations become more similar to those of CTRL. The combination of CTRL and the noisy channel model might be especially suited for task-oriented dialog, where faithfulness is crucial, whereas for some open-domain settings we think that our model may be more suitable without control tokens.
Finally, some example outputs can be found in the Appendix.

Conclusion
In this paper, we present a model for response generation in document-grounded dialog that explicitly optimizes for faithfulness and fluency. The model decomposes the posterior distribution of the response given context and grounding into two components according to Bayes' theorem and thus, by introducing scaling factors, allows for encouraging more correct or more fluent responses. Since decoding the model directly is intractable, we derive and compare different approximate decoding schemes that use reranking or online decoding. We compare the model to directly modeling the posterior distribution of the response given context and grounding, and to a variant of CTRL that was recently proposed to encourage faithfulness in grounded response generation, which we again factorize into two components. An evaluation on five different open-domain and task-oriented dialog datasets shows improvements in terms of factuality over both models. Furthermore, we highlight how the scaling factors can be used to control how much the model copies from the grounding and how much weight is given to a well-connected response, and we investigate the influence of additional training data on one of the model components. Lastly, we show that our model also gives improvements when dealing with uncertain document retrieval.

Future Work
In future work, we would like to explore the noisy channel model for document-grounded response generation without maximum approximation, such as the one we present in Equation (5).

Limitations
The main limitation of our model is that the approximate decoding schemes introduce significant computational overhead in comparison to the direct model and CTRL, amounting to a factor of up to 10 for online decoding with a large beam size using a not yet fully optimized implementation. Furthermore, decoding is considerably more complex, and the scaling factors need to be tuned, which implies significant additional computation and results in a larger carbon footprint. In addition, our noisy channel model has a significantly higher number of parameters than the baseline model. Finally, we mainly rely on automatic metrics to assess the faithfulness of the proposed approaches and leave out a broader evaluation of the general quality of generated responses, except for a small-scale qualitative study (cf. Section 6.5).

Broader Impact
In general, generative dialog systems are a promising field of research and can be less restricted in the topics they can deal with than rule-based models or approaches that use predefined dialog flows, which require a lot of handcrafting and possibly experts to write appropriate responses. Nevertheless, such language-generation-based approaches bear the danger of repeating harmful content and biases that may have been present in the training data, or of generating inappropriate responses in general. Furthermore, some applications require faithful responses by law or are at least service-critical. While our model can improve the faithfulness of responses, faithfulness cannot be guaranteed. Therefore, we would not recommend using the model in these applications. While document-grounded dialog systems in general can be used to ground system responses in helpful and correct real-world information, these systems could potentially also be misused to ground dialog systems in misinformation or other harmful documents.
Figure 1: Q² and sBLEU for the noisy channel model using our decoding (o), the algorithm proposed by Liu et al. (2021) (*), where k1 · k2 is the effective beam size, and direct model beam search (|), by compute budget.

Table 1: Main results of our model compared to the direct model and CTRL (Rashkin et al., 2021). We use our online decoding algorithm and all results are within an effective beam size of 30.

Table 2: Q² and perplexity by ratio of factors on PersonaChat with beam size 10.

Table 3: Results on the outputs of a Bi-Encoder and Cross-Encoder retrieval model on the DSTC9 test set.

Table 4: Results with the noisy channel model and additional response generation model training data.
eral Ministry of Education and Research BMBF (03ZU1106DA). The German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE. The European Union under the Horizon Europe grant No. 101070351 (SERMAS).

… as they have access to the payment card for identification. Is there anything else I can help you with?
Online Decoding: Yes as long as they have access to the payment card for identification. Is there anything else I can help you with today?
CTRL: Yes as long as he has access to the payment card for identification. Is there anything else I can help you with?
CTRL Online: Yes as long as they have access to the payment card for identification. Is there anything else I can help you with?

… the founder of the Chanel brand's first name is Coco. Are you familiar with the brand?
Direct Model: Have you ever heard of Coco Chanel? She was the founder of the Chanel brand.
Reranking: Have you heard of Coco Chanel? She was the founder and namesake of the Chanel brand
Online Decoding: I love Coco Chanel. She was the founder and namesake of the Chanel brand.
CTRL: She was the founder and namesake of the Chanel brand.
CTRL Online: Coco Chanel was the founder and namesake of the Chanel brand

Grounding: The giant panda ("Ailuropoda melanoleuca", "black and white cat-foot"; "big bear cat"), also known as panda bear or simply panda, is a bear native to south central China.
uT: -
Ground truth: Hi! I love giant pandas, they are my favourite animal. they're also called "big bear cat" sometimes? how about you?
Direct Model: Have you ever seen a giant panda? They are native to south central China.
Reranking: My favorite animal is the giant panda. They are native to south central China.
Online Decoding: Giant pandas are so cute! They are native to south central China.
CTRL: The giant panda is a bear native to south central China.
CTRL Online: The giant panda is a bear native to south central China.

Grounding: React (sometimes styled React.js or ReactJS) is a JavaScript library for building user interfaces.
uT: I'm not sure I've heard of it before, what is it?
It is a JavaScript library for building user interfaces.
React (sometimes styled React.js or ReactJS) is a JavaScript library for building user interfaces.

Grounding: David Suchet stars as the eponymous detective, Agatha Christie's fictional Hercule Poirot.
uT: No do you know what's it about?
It's about a detective named Hercule Poirot, played by David Suchet.
David Suchet stars as the eponymous detective, Agatha Christie's fictional Hercule Poirot.

Grounding: my friends are all skateboarders. i've a broken arm. i am a skateboarder. my favourite skater is rob dryden.
uT: oh no! what is wrong with it? are you a pro skateboarder
λ1/λ2 = 0.5: i broke my arm trying to skateboard with my friend rob dryden
λ1/λ2 = 1.4: i broke my arm and my friends are all skateboarders

Grounding: i like to work on vintage cars. i own two vintage mustangs. i like to go to country concerts on weekends. my favorite music is country. i have two dogs.
uT: that's amazing. i'm currently studying spanish.
λ1/λ2 = 1.4: i love to go to country concerts on the weekends with my dogs

Grounding: i don't drink soda. i've a boyfriend. i live in a house. i like to eat hamburgers and french fries.
uT: i try to work and stay busy, i am in between jobs right now.
do you like hamburgers and french fries? i love them.

Grounding: i volunteer my time with a local bunny rescue. my favorite color is navy blue. i work fulltime in a shipyard. i donate a part of my salary to charity each month.
uT: wow that's exciting good for you. i'm learning about computers, programming