q2d: Turning Questions into Dialogs to Teach Models How to Search

One of the exciting capabilities of recent language models for dialog is their ability to independently search for relevant information to ground a given dialog response. However, obtaining training data to teach models how to issue search queries is time- and resource-consuming. In this work, we propose q2d: an automatic data generation pipeline that generates information-seeking dialogs from questions. We prompt a large language model (PaLM) to create conversational versions of question answering datasets, and use it to improve query generation models that communicate with external search APIs to ground dialog responses. Unlike previous approaches, which relied on human-written dialogs with search queries, our method allows us to automatically generate query-based grounded dialogs with better control and scale. Our experiments demonstrate that: (1) for query generation on the QReCC dataset, models trained on our synthetically generated data achieve 90%-97% of the performance of models trained on the human-generated data; (2) we can successfully generate data for training dialog models in new domains without any existing dialog data, as demonstrated on the multi-hop MuSiQue and Bamboogle QA datasets; and (3) we perform a thorough analysis of the generated dialogs, showing that humans find them of high quality and struggle to distinguish them from human-written dialogs.


Introduction
Recent dialog generation models, such as LaMDA (Thoppilan et al., 2022), BlenderBot3 (Shuster et al., 2022b) and Sparrow (Glaese et al., 2022), use an external search API to generate grounded and factually accurate responses (Parisi et al., 2022). This is important for providing reliable and consistent answers (Shuster et al., 2022a), especially when discussing entities and asking related questions with anaphora.
Figure 1: Left: Our q2d method starts from an existing query or question and prompts a few-shot language model to transform it into a dialog. We filter out cases where the intent of the generated dialog differs from the intent of the initial query, and apply additional filters. Right: We take a question from the QReCC dataset (surrounded by a rectangle) and generate an information-seeking dialog with q2d. By starting with a query and generating a dialog, we create a {dialog → query} dataset, which is used to train and evaluate query generation models that communicate with an external search API to generate factual responses.
To do this, these models use a query generation component that is trained on dialog-to-search-query datasets. When the model is triggered with a dialog turn that requires search, it generates a query that is used to obtain a search result, which is then used to generate a grounded response. This allows the model to provide relevant information about the world in its responses to user queries. For example, a model trained in 2021 should be able to provide a factual response to the question "How old is Joe Biden?" even in 2023. In a conversation, one might discuss an entity (e.g., "Joe Biden") and later ask a related question (e.g., "How old is he?") with anaphora. In order to provide reliable and consistent answers, it is necessary to generate a decontextualized query (e.g., "How old is Joe Biden") for a search engine.
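To make the moving parts concrete, here is a minimal Python sketch of this dialog-to-query-to-response loop. All three components are stubs: `generate_query` stands for a trained query generation model, and `search` and `generate_response` stand for the external search API and the grounded dialog model; none of these names come from the paper.

```python
def generate_query(dialog: list[str]) -> str:
    """Map a dialog to a decontextualized search query, e.g.
    [..., "Joe Biden was elected in 2020.", "How old is he?"]
    -> "How old is Joe Biden". A fine-tuned seq2seq model in practice."""
    raise NotImplementedError

def search(query: str) -> str:
    """Call an external search API and return an evidence snippet."""
    raise NotImplementedError

def generate_response(dialog: list[str], evidence: str) -> str:
    """Generate a dialog response grounded in the retrieved evidence."""
    raise NotImplementedError

def grounded_turn(dialog: list[str]) -> str:
    # Decontextualize first, so the search result stays fresh even if
    # the model's parametric knowledge is outdated.
    query = generate_query(dialog)
    evidence = search(query)
    return generate_response(dialog, evidence)
```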
Using APIs also decouples language and reasoning from knowledge (Borgeaud et al., 2021; Parisi et al., 2022), which can help prevent errors caused by outdated information being stored in the model's parameters. For example, if a model trained at the end of 2021 is asked "How old is the current president?", it may produce the incorrect query "How old is Donald Trump" if its parameters are outdated, or it may provide factually inconsistent responses (a.k.a. "hallucinations").
Query generation datasets have been created using human annotators, limiting them in scale, control, and quality (Komeili et al., 2021). As a result, when a new domain is introduced, a significant amount of human effort is required to create a new query generation dataset for that domain (Gupta et al., 2021; Dziri et al., 2021). The fact that language models often generate hallucinations (Zhao et al., 2020; Maynez et al., 2020; Lee et al., 2018), especially in new domains or dialogs that differ from the training data (Nie et al., 2020; Honovich et al., 2021, 2022), highlights the need for more effective query generation datasets that will foster more grounded and factually consistent models.
In this work, we propose a data generation pipeline to improve grounded dialog models with access to search engines. To create a dialog-to-search-queries dataset for training the query generation component in such models, we reverse the process, starting from a search query and generating an information-seeking dialog that corresponds to that query. Our automatic pipeline, shown in Figure 1, begins with a search query or question, and prompts a large language model (PaLM; Chowdhery et al., 2022) to generate a conversational dialog that conveys the information need implied by the given query. For example, in Figure 1 we take the question "Who played Ardra on star trek the next generation?" from the Natural Questions dataset (Kwiatkowski et al., 2019) and generate a dialog with a similar intent: the correct answer to the original question ("Marta DuBois") is also a correct response to the generated dialog. This process allows us to leverage existing question-answering datasets, which are widely available for different domains, and extend them by generating dialogs that preserve the original information need while controlling the dialog domain and style.
To assess whether the automatically generated dialogs can replace human-generated dialogs, we experiment with QReCC NQ (Anantha et al., 2020), a human-curated dialog dataset. We generate a training set of the same size as the original dataset, but with synthetic dialogs, and use it to train a query generation model. The resulting model obtains 90%-95% of the performance of models trained on the human-generated training data, using the same metrics used to evaluate QReCC (Anantha et al., 2020).
Other than training query generation models, our approach is also useful for training the dialog generation models themselves when no dialog data is available for a new domain. We demonstrate this on the domain of multi-hop question answering, where we first show that existing dialog models struggle to perform well on a domain-specific challenge set. We then generate synthetic dialog data from the MuSiQue (Trivedi et al., 2021) multi-hop QA dataset, and show that training a dialog model on this data improves performance.
We provide a thorough analysis of the quality of the generated datasets, demonstrating that they (a) look natural: humans struggle to distinguish the synthetic dialogs from natural ones; (b) are factual: generated and human-annotated answers perform similarly in query generation; and (c) are correct: dataset labels are accurate, and strict filtering improves results.
To conclude, our main contributions are:

1. We introduce q2d: an automatic method to generate information-seeking dialogs from questions using large language models.
2. We show that our method is beneficial for training query generation and dialog generation models, including in new domains such as multi-hop QA.
3. We provide a thorough analysis showing that the synthetically generated dialogs are natural, factual, and correct.
2 Generating Dialogs from Questions

In this section, we describe our automatic method, called q2d, for generating dialogs from questions, and the properties of datasets produced by this method. Our goal is to reduce the effort associated with creating training datasets for query generation, and to improve query-generation-based dialog models with a high-quality training dataset.
Algorithm 1 Generate Dialogs from Questions

    Input: question-answer pairs (q, a) ∈ Q; few-shot model M_fs; few-shot examples S_q, S_d; instructions I, I_r
    for each (q, a) ∈ Q do
        dialogue ← M_fs(S_q, I, q)
        q′ ← M_fs(S_d, I_r, dialogue)
        if filter(dialogue, q, q′, a) then
            dataset.add((dialogue, q, a))
    Output: query generation dataset

Query generation can start by extracting queries from existing dialogs. However, our approach is unique in that it begins with factual queries or questions, allowing us to leverage existing resources. Any question-answering dataset, query dataset, or queries used in popular web search services or dialog model logs can be used with our algorithm. The algorithm is described in Algorithm 1 and consists of three main steps (a Python sketch follows the list below):

1. Starting from a query or question from the set Q, we use a few-shot model M_fs (specifically, PaLM) and instructions I to generate a dialog given the query. The few-shot prompts can be manually written to adapt to different conversation styles, or sampled from an existing dialog dataset.
2. Using the same few-shot examples in reverse, S_d and I_r, we generate a query q′ based on the generated dialog.
3. Filtering: we filter out dialogs with a different intent, or dialogs where the answer is already contained in the dialog. We elaborate on the different filters below.
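A minimal Python sketch of this loop, under the following assumptions: `llm` is a generic few-shot completion helper (PaLM in the paper; any comparable LLM API could be substituted), the prompt layouts are simplified illustrations rather than the paper's actual prompts (Appendix A.3), and `passes_filters` is sketched after the next paragraph.

```python
def q2d(qa_pairs, llm, S_q, S_d, I, I_r):
    """qa_pairs: (question, answer) tuples, e.g. from Natural Questions."""
    dataset = []
    for q, answer in qa_pairs:
        # Step 1: question -> dialog, using few-shot examples S_q.
        dialog = llm(f"{I}\n{S_q}\nQuestion: {q}\nDialog:")
        # Step 2: dialog -> reversed query q', using the examples in reverse.
        q_rev = llm(f"{I_r}\n{S_d}\nDialog: {dialog}\nQuery:")
        # Step 3: keep only intent-preserving, non-leaking dialogs.
        if passes_filters(dialog, q, q_rev, answer):
            dataset.append((dialog, q, answer))
    return dataset
```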
Filtering. In this part we attempt to filter out (dialog, query) samples that would not be beneficial for training or testing. We do this in three steps, elaborated below. We stress that many more filtering strategies are possible, and exploring them is left for future work. First, we filter out dialogs whose intent differs from the original query by measuring the similarity between the query and its reversed version using SBERT similarity (sim(q, q′)) and comparing it to a threshold (T_query). If the similarity is below the threshold, the generated query is considered to have a different intent and the dialog is filtered. Appendix A, Section A.2 shows several examples of dialogs, original and reversed queries, and SBERT semantic similarities. Second, we filter out cases where the answer is included in the dialog by measuring the n-gram overlap between the dialog and the answer using the Rouge metric (Lin, 2004). If the overlap is above a threshold (T_answer), the answer is entailed in the dialog and the example is filtered. For example, if the final answer ("Marta DuBois") had already been written in the dialog as the person playing Ardra, the final question ("Who played Ardra") would not make sense. Finally, we filter out cases where the last turn of the dialog is similar (>80%) to the original question, using SBERT similarity. These cases include situations where no anaphora is required.
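The three filters can be sketched as follows with the real `sentence-transformers` and `rouge-score` packages. The query and last-turn thresholds follow the text; the exact answer-overlap threshold `t_answer` is not given numerically here, so the value below is an assumption.

```python
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer

sbert = SentenceTransformer("all-mpnet-base-v2")
rouge = rouge_scorer.RougeScorer(["rouge1"])

def sim(a: str, b: str) -> float:
    """SBERT cosine similarity between two texts."""
    return util.cos_sim(sbert.encode(a), sbert.encode(b)).item()

def passes_filters(dialog: str, q: str, q_rev: str, answer: str,
                   t_query: float = 0.99,   # very strict, per Sections 2 and 5.3
                   t_answer: float = 0.8,   # assumed value, not stated in the text
                   t_last: float = 0.8) -> bool:
    # Filter 1: intent drift: the reversed query must match the original.
    if sim(q, q_rev) < t_query:
        return False
    # Filter 2: answer leakage: fraction of answer tokens already in the dialog.
    if rouge.score(answer, dialog)["rouge1"].recall > t_answer:
        return False
    # Filter 3: the last turn should not simply restate the original question.
    last_turn = dialog.strip().splitlines()[-1]
    if sim(q, last_turn) > t_last:
        return False
    return True
```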
In this work, we use PaLM (Chowdhery et al., 2022), a large language model with 540B parameters, as the few-shot language model for generating dialogs, with a temperature of 0.6. We provide fully working code with GPT-3 (Brown et al., 2020) for reproducibility. The set of prompts and instructions can be found in Appendix A, Section A.3. For the similarity metric (sim), we use the all-mpnet-base-v2 model from Sentence Transformers, with a similarity threshold of T_query = 0.999. This threshold is justified through human evaluation and ablation studies for the filtering in Section 5.3.

3 Replacing Human-Annotated with Auto-Generated Data

In this section, we evaluate the extent to which our automatically generated dataset can replace the human-annotated dataset. We use the QReCC (Anantha et al., 2020) NQ dataset, which contains (dialog, query) pairs, and automatically generate dialogs from the natural questions. This allows us to create an automatically generated train set of the same size and compare it to the human-annotated dataset. An example of a human-generated dialog compared to an automatically generated dialog is shown in Figure 2. We use the version of the dataset where the intermediate questions are contextualized, rather than decontextualized. For example, the second and third user turns in the figure are contextualized versions of the decontextualized questions "Who directed the film, The Vikings?" and "Was the film The Vikings based on a novel?".

Dataset Generation. To generate our dataset, we use our q2d method as described in Section 2. For the few-shot examples of queries and dialogs (S_q and S_d), we sample 15 examples from QReCC that fit within the maximum input sequence length.

Metrics and Models. Our metrics are the same as those used in the QReCC dataset, comparing the original and generated queries. These include Rouge-1 Recall (Lin, 2004) for measuring the unigram overlap between two texts, and SBERT embedding semantic similarity for comparing the semantic content of two sentences (the same metric as in §2). We also use Recall@10 to compare the retrieved URLs for the ground-truth query and the generated query. We conduct experiments using an open-source T5-3B model (Raffel et al., 2020): in its original form (referred to as 'None'), fine-tuned on the natural QReCC training data, and fine-tuned on the auto-generated QReCC dataset, contrasting the results. We use a batch size of 32, an Adam optimizer, a learning rate of 0.0001, and fine-tune for 10,000 steps.
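A sketch of the two text-based metrics (Rouge-1 recall between gold and generated queries, and SBERT similarity), reusing the same packages as above; Recall@10 additionally requires a retrieval backend, so it is omitted here.

```python
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer

sbert = SentenceTransformer("all-mpnet-base-v2")
rouge = rouge_scorer.RougeScorer(["rouge1"])

def query_metrics(gold_query: str, generated_query: str) -> dict:
    return {
        "rouge1_recall": rouge.score(gold_query, generated_query)["rouge1"].recall,
        "sbert_similarity": util.cos_sim(
            sbert.encode(gold_query), sbert.encode(generated_query)).item(),
    }

# Example usage:
print(query_metrics("who played ardra on star trek the next generation",
                    "who played ardra on star trek"))
```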
Results. Results are presented in Table 1. We observe that by replacing human-annotated data with auto-generated data, we reach 90%-95% of the results with a training set of the same size and the same model, demonstrating the efficacy of our q2d approach in minimizing annotation labor and producing synthetic training data that is nearly as effective as human-annotated data.
4 Extending Query Generation: Multi-Hop QA

This section shows that our method is effective as a benchmark and training signal that generalizes to human-annotated data. It is also flexible, able to adapt and improve for specific styles of dialog even without annotated data. It allows us to create dialogs similar to a target domain and provide a fully labeled query-generation dataset. The generated data is useful for training and evaluation, as well as for exploring model performance in new scenarios. We demonstrate this using multi-hop question answering.

Manual Dialog Construction. We define a challenging test set for multi-hop dialogs by annotating the Bamboogle dataset (Press et al., 2022), which consists of 125 multi-hop human-constructed questions. We create dialogs that ask the same questions, with the user as the information seeker and the assistant as the information provider. The assistant should help the user obtain the information they are seeking, clarify any questions, and move the conversation forward, without trying to mimic human-to-human interaction. An example from the generated dataset is presented in Figure 3. Full instructions, examples, and annotated data can be found in Appendix A, Section A.4, including examples with model predictions.
Dataset Generation. We use our q2d method as described in Section 2 to generate dialogs that ask multi-hop questions, using the MuSiQue dataset (Trivedi et al., 2021) as the base for the questions (Q). MuSiQue is a challenging multi-hop QA dataset that is partially auto-generated, so we generate dialogs from partially generated questions. This illustrates how we can use automatically generated data to improve on human-annotated data. We use seven few-shot examples (S_q and S_d). As a result, we generate 3K train samples and 480 test samples. Full prompts, instructions and examples are available in Appendix A, Section A.1.
Metrics. The metrics used in this work are the same as those described in the previous section: Rouge-1 Recall, SBERT embedding semantic similarity, and Recall@10.
Models. We evaluate several state-of-the-art language and dialog models: PaLM 540B (Chowdhery et al., 2022), Flan-U-PaLM 540B (Chung et al., 2022), T5-3B (Raffel et al., 2020), BlenderBot3-3B (Shuster et al., 2022b), and WizInt Search Engine FiD (Lewis et al., 2019). These models are used in a zero-shot setting, except for T5, which is fine-tuned on the auto-generated MuSiQue dialogs in the same manner presented in Section 3. BlenderBot3 and WizInt are publicly available in ParlAI (Miller et al., 2017); exact details and versions are described in Appendix A, Section A.7. More details on the instructions for zero-shot models can be found in Appendix A, Section A.3.
Results. Query generation results are presented in Table 2. Qualitative examples with T5 model predictions are available in Appendix A, Section A.1. The T5 model improves performance on the human-curated Bamboogle test set by 14%-59% after fine-tuning on the auto-generated MuSiQue multi-hop dialogs. We show examples of this in Appendix A, Section A.6. This improvement also correlates with improvements on the auto-generated test set, indicating the effectiveness of our method for creating evaluation data. To conclude, our results show that our datasets are effective as a benchmark for query generation, as well as training data that generalizes to both auto-generated and human-annotated test sets.
Producing a Partially Decomposed Query. Given a multi-hop dialog, query generation models may resolve partial information. For example, if a dialog asks "How old is the current US president?", a query generation model may produce "How old is Joe Biden?", which is correct at the time but may become outdated in the future, or it may produce hallucinations. To prevent this, we can make two query generation calls (first to discover the current US president and then their age), decouple knowledge from execution (Borgeaud et al., 2021; Parisi et al., 2022), periodically update the model's weights, or disallow the model from making partial resolutions. This helps ensure that the generated query remains accurate and relevant over time. The fine-tuning technique described in this section uses the last approach, avoiding assumptions about the current president's age or identity.
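As an illustration of the two-call alternative, the sketch below decomposes the query instead of partially resolving it; `search` is the same hypothetical search-API helper as in the earlier sketch.

```python
def answer_age_of_current_president(search) -> str:
    # Hop 1: resolve the entity at query time, not from stale parameters.
    president = search("who is the current US president")
    # Hop 2: ask about the resolved entity.
    return search(f"how old is {president}")
```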

5 Intrinsic Evaluation: Naturalness, Factuality and Correctness
In this section we perform a thorough analysis of the generated dialogs, focusing on the QReCC NQ dataset, which contains human-annotated dialogs, and evaluate their naturalness (§5.1), factuality (§5.2) and correctness (§5.3).

5.1 Naturalness: Humans Struggle to Distinguish Synthetic Dialogs from Natural Ones
We define a human-evaluation task to distinguish between naturally generated dialogs and auto-generated dialogs. We sample 100 annotated dialogs from QReCC NQ (Anantha et al., 2020) and mix them with 100 dialogs we generated. The annotators, who are not authors of the paper and have a STEM degree, were asked to mark 1 if the dialog seemed to be generated by a machine, and 0 otherwise. The labels were hidden. We use three annotators for each sample and select their majority vote as the final answer. The results show that the majority vote achieved a success rate of 50.5%, while random chance is 50%. All individual annotators achieved between 50%-55% on this task. In 26% of the cases there is full agreement between all three annotators; when all agreed, the result improves to 51.9%, which is still close to random chance. These results indicate that humans struggle to differentiate between natural and auto-generated dialogs. This suggests that the auto-generated dialogs are of high quality, are similar to human annotations, and can be used in place of human-generated dialogs in certain situations, saving time and resources.

5.2 Factuality: Generated and Human-Annotated Answers Perform Similarly in Query Generation
The q2d method generates a dialog by starting with a query and generating a series of related questions and answers. However, since the intermediate answers are generated by a large language model, there is a chance that they may be factually incorrect. This raises the following questions:
(1) Are the intermediate answers factually correct?
(2) How does the factuality of the generated answers affect the results of downstream tasks?

To investigate, we replace all human-annotated answers in the QReCC NQ training split with PaLM-generated answers. To produce PaLM answers, we use a few-shot prompt, where the input is the original dialog ending in a question, and the output is the PaLM response. An example is provided in Figure 4.
Intermediate Answers Factuality According to Automatic Metrics and Human Raters. To answer the first question, we evaluate the factual correctness of the generated answers using an NLI (Dagan et al., 2005) model presented by Honovich et al. (2021). We take the question ("q"), the response ("r"), which may be the ground-truth annotated response or the generated response, and the Wikipedia document summary ("d") available in the QReCC dataset. We construct the following NLI instance: "premise: {d} hypothesis: The answer to the question {q} is {r}", and produce NLI scores for the ground-truth responses vs. the generated responses. Figure 4 illustrates this process. The average NLI score for the human responses is 62%, and for the PaLM responses it is 38%. However, this measure is biased towards the human responses, since we measure it against the Wikipedia document that was used to generate the ground-truth answer. PaLM might also produce a correct answer that is simply not written in the same exact words as in Wikipedia. To test this, we conducted an annotation task with an annotator who is not one of the paper's authors. The annotator was presented with 50 samples of a dialog, a query, and two options, A and B: one option was the original answer and the other was the generated answer. The annotator's task was to mark 0/1 for each answer, indicating whether it was factual and relevant to the question. PaLM responses were marked as correct in 82% of the cases, compared to 93% for the human responses. This result indicates the factuality and relevancy of the generated responses.
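A hedged sketch of the NLI-based factuality check. The premise/hypothesis template is quoted from the text; the specific checkpoint below (an off-the-shelf MNLI classifier) is an assumption standing in for the NLI model of Honovich et al. (2021).

```python
from transformers import pipeline

# Assumption: any MNLI-style classifier can stand in here; the paper uses
# the NLI model of Honovich et al. (2021).
nli = pipeline("text-classification", model="roberta-large-mnli")

def factuality_score(d: str, q: str, r: str) -> float:
    """Score how well response r to question q is entailed by document d."""
    hypothesis = f"The answer to the question {q} is {r}"
    out = nli([{"text": d, "text_pair": hypothesis}])[0]
    return out["score"] if out["label"] == "ENTAILMENT" else 0.0
```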
For Query Generation, Generated Answers Perform Similarly to Human-Annotated Ones. To answer the second question, we replace all of the human-annotated answers with automatically generated answers, yielding a semi-auto-generated training set with the same structure and the same annotated questions, but with PaLM-generated answers. We then train a T5-3B (Raffel et al., 2020) model on the human-annotated and on the semi-auto-generated versions and compare the results. For example, in Figure 4, the semi-auto-generated dialog is the one with the answers on the right side. We train in the same way as presented in Section 3. The result is 86.6% Rouge-1 Recall with the semi-auto-generated training set, only a small drop (1.5%) from the results of the model trained on the natural data, indicating that although PaLM sometimes (<48%) produces non-factual responses, this has only a negligible effect on the query generation task.
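A minimal sketch of how the semi-auto-generated training set is constructed: user questions are kept verbatim, and each assistant answer is regenerated by the LLM conditioned on the dialog so far. `llm` and `prompt_prefix` are the same hypothetical helpers as in the earlier sketches.

```python
def replace_answers(turns: list[str], llm, prompt_prefix: str) -> list[str]:
    """turns alternate [user_q1, gold_a1, user_q2, gold_a2, ...]."""
    new_turns = []
    for i, turn in enumerate(turns):
        if i % 2 == 0:
            new_turns.append(turn)          # keep the human-written question
        else:
            context = "\n".join(new_turns)  # dialog so far, ends in a question
            new_turns.append(llm(f"{prompt_prefix}\n{context}\nAssistant:"))
    return new_turns
```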

5.3 Correctness: Generated Dataset Labels Are Accurate, and Strict Filtering Improves Results
Our main filter measures the similarity between the original query and the reversed query, sim(q, q′), and compares it to a threshold T_query. We measure its effect in human-evaluation and automatic ablation studies. Both experiments indicate the label correctness for the task of predicting the query from a dialog, and the value of a stricter filtering threshold.
Humans Find that Dialogs Generated from Queries Have the Same Intent. We define a human annotation task to determine whether the dialogs are intent-preserving. Annotators were asked to mark 1 if the dialog is intent-preserving, and 0 otherwise. We use three annotators for each sample, and select their majority vote as the final answer. We follow the notation suggested by Groenendijk and Stokhof (1984) for entailment between questions: an interrogative q entails another d iff every proposition that answers q answers d as well (Jiang and de Marneffe, 2022). Here, q stands for a question and d stands for an information-seeking dialog. We defined eight SBERT semantic similarity score buckets, with 15 samples in each, covering all similarities between 0 and 100. Results are presented in Figure 5. All three annotators agree in 88% of the cases. The proportion of intent-preserving annotations grows with the SBERT semantic similarity score, with a strong gain between 0.95 and 1, the only bucket with 100% intent-preserving annotations. Accordingly, in the filtering step we only keep samples whose generated queries are very similar to the original query (≥0.99).

Table 3: Reversed-queries similarity filter. The similarity is measured between the original query q and the reversed query q′ predicted with the few-shot model, q′ ← M_fs(S_d, I_r, dialogue). The higher the filter threshold (stricter filter), the better the results.
Strict Filtering Leads to Higher-Quality Data, Resulting in Improved Downstream Results. We measure different thresholds on an evaluation set of 1,000 instances generated from other train queries. We also add another filtering method based on an NLI (Dagan et al., 2005) model: given a dialog "d" and a question "q", we construct the NLI sample "premise: {d} hypothesis: The dialog asks the question {q}", with different thresholds. Results are presented in Table 3, where we report the Rouge-1 recall on the evaluation set. We see that performance increases as the reversed-similarity threshold rises, with a clear trade-off against the filtering proportion. The more data we generate, the stricter the filtering we can apply, yielding higher-quality data that leads to better results. We produced four options for the NLI-based method, with thresholds ranging from 0.65 to 0.82; above that, it filtered too much data (below the goal of 13K). The maximum performance for the 0.82 threshold group is 70%, much lower than the alternative reversed-queries filter.
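The threshold ablation can be reproduced with a simple sweep over precomputed reverse-query similarities; the threshold grid below is illustrative, and the downstream Rouge numbers in Table 3 additionally require retraining the query generation model on each filtered set.

```python
def sweep_filter_thresholds(samples, thresholds=(0.6, 0.8, 0.9, 0.95, 0.99)):
    """samples: dicts with a precomputed 'sim_q_qrev' similarity field."""
    for t in thresholds:
        kept = [s for s in samples if s["sim_q_qrev"] >= t]
        print(f"T_query={t:.2f}: kept {len(kept)}/{len(samples)} "
              f"({100 * len(kept) / len(samples):.1f}%)")
```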

6 Related Work
Our work relates to data generation, query generation for search-based models, and information retrieval datasets.
Data Generation. Several works have used large language models for data generation. Dai et al. (2022b) apply this technique to information retrieval, creating retrievers based on generated data, where queries are generated given a document. Their method involves round-consistency filtering using a large language model, similar to back-translation. In the context of dialog generation, Dialog Inpainting (Dai et al., 2022a) starts from a document and generates a dialog. Our approach focuses on generating dialogs from queries, which allows us to leverage the availability of existing QA datasets. This enables us to create information-seeking dialogs with the same intent as the original questions, along with automatically generated labels for the queries and answers.
Search-Based Query Generation. Dialog models like LaMDA and BlenderBot use search APIs to generate factual responses. Training and evaluation data for such models is obtained mostly through human annotation. Previous works (Shuster et al., 2022b; Thoppilan et al., 2022; Komeili et al., 2021) evaluated only the end-to-end dialog response, without evaluating the generated query. The evaluation was primarily based on the automated metrics of perplexity and F1, or on human annotations assessing whether the model response is sensible, specific, and interesting (SSI), or whether it is correct, engaging, and consistent. The evaluated dialogs were general, not necessarily information-seeking. The focus of this paper is the query generation task for information-seeking dialogs, with a concrete question and an expected response.
Question Rewriting. Works like QReCC (Anantha et al., 2020), Question Answering in Context (QuAC) (Choi et al., 2018), the TREC Conversational Assistance Track (CAsT) (Dalton et al., 2020), and CANARD (Elgohary et al., 2019) in the information retrieval domain use human-annotated data that mostly contains follow-up dialogs: questions followed by answers. Our work focuses on the application of dialog models like LaMDA and BlenderBot, which often involve less formal language and more human-like conversations.
The need for a variety of query generation datasets has motivated us to develop an automatic method for generating dialogs for the query generation task, with a range of different styles and skills required.

7 Conclusions
We introduced q2d, a data generation pipeline that produces dialogs based on questions. We demonstrated that our method can replace human-annotated data for training query-generation models, and can create effective, natural, factual, and accurate evaluation and training data in new domains, even when no existing dialog data is available.

A.4 Manual Dialog Construction

Instructions for the annotator: "Your task is to write a dialog given a question. You may use the internet in order to add factual information to the dialog. The dialog intent should be understandable yet challenging. The responses should be 'assistant-like', not trying to mimic human emotions."

A.5 Human Evaluation Instructions
Instructions for the intent-preserving task: "You will receive a dialog and a query, and need to label 1 if the dialog has the same intent as the query. If you are unsure, you can check whether both intents return the same Google search results."
Instructions for the natural vs. auto-generated task: "Can you guess which of the following dialogs is natural and which is auto-generated by a machine? There are 200 dialogs. 100 are synthetic, and 100 are natural. Enter 1 for Synthetic and 0 for Natural in the "synthetic?" field."

Figure 2: An example of a human-annotated dialog from QReCC and an automatically generated dialog produced for the same question.

Figure 3: An example of an auto-generated dialog, where we take a multi-hop question from MuSiQue and use q2d to generate a dialog in a conversational style with the same intent.

Figure 4: Illustration of the response factuality evaluation. For each turn, we produce a response with PaLM and compare the generated response to the human-annotated response. We use an NLI model to score whether the response answers the question ("Hypothesis: The answer to the question {q} is {r}") according to the Wikipedia document d used by the human annotator in the ground-truth response generation ("Premise: {d}"). The first PaLM response receives a lower score because it misses the mention of 'Cornel Wilde' that appears in the document summary.

Figure 5: Intent-preserving annotation task results. The proportion of samples annotated as intent-preserving increases with the semantic similarity score.

Figure 6: Examples from the human-annotated dialogs for Bamboogle. The model predictions above/below the line are those of T5 before/after fine-tuning on MuSiQue dialogs.

Figure 7: Examples from the auto-generated dialogs for MuSiQue. The model predictions above/below the line are those of T5 before/after fine-tuning on MuSiQue dialogs.

Table 1: Results on the human-annotated QReCC NQ test set, experimenting with replacing the human-annotated data with automatically generated data produced with the q2d method. Bold shows the percentage of performance for a model trained with auto-generated data relative to a model trained with human-annotated data. Training on the automatically generated data achieves 90%-95% of the results of the model trained on the human-annotated data.

Table 4: Random Examples from Auto-Generated QReCC

Dialog:
User: who plays the lead role in wish upon a star
Assistant: The cast of Wish Upon a Star include Katherine Heigl, Danielle Panabaker, and Mary Page Keller.
User: who plays haley
Question: who plays haley in wish upon a star?
Answer: Danielle Harris

Dialog:
User: where is assam
Assistant: Assam is a state in India
User: what is the capital
Assistant: The capital of Assam is Dispur
User: who lead the first revolt against british in the state
Question: who lead the first revolt against british in assam?
Answer: Gomdhar Konwar

Dialog:
User: when did last man standing air
Assistant: Last Man Standing premiered on ABC on October 11, 2011.
User: what is the main character's name
Assistant: Tim Allen stars as Mike Baxter, the central character of the series.
User: what are the other character's names
Assistant: The other characters are Mike's wife and three daughters.
User: who plays cami harris
Question: who plays cami harris on last man standing?
Answer: Sarah Gilman

Dialog:
User: who were the confederates in the civil war
Assistant: The Confederates in the American Civil War were the Confederate States of America, or the South.
User: who fought them
Question: who fought the confederates in the american civil war?
Answer: The Union

Dialog:
User: when did adele webber first come on grey's anatomy
Assistant: The character of Adele Webber was first introduced in the episode "Save Me".
User: who plays her
Question: who plays adele webber on grey's anatomy?
Answer: Loretta Devine

Dialog:
User: what did the effigy mounds look like
Assistant: The effigy mounds are earthen mounds built in the shapes of animals, humans, and other objects.
User: when were they discovered
Assistant: The first effigy mounds were discovered in the 1700s but were not identified as such until the late 1800s.
User: when were most of them built
Question: when were most of the effigy mounds built?
Answer: 350 - 1300 CE

Table 5: Random Examples from Auto-Generated MuSiQue