MultiDoc2Dial: Modeling Dialogues Grounded in Multiple Documents

We propose MultiDoc2Dial, a new task and dataset for modeling goal-oriented dialogues grounded in multiple documents. Most previous work treats document-grounded dialogue modeling as a machine reading comprehension task based on a single given document or passage. In this work, we aim to address more realistic scenarios where a goal-oriented information-seeking conversation involves multiple topics and hence is grounded in different documents. To facilitate such a task, we introduce a new dataset that contains dialogues grounded in multiple documents from four different domains. We also explore modeling the dialogue-based and document-based contexts in the dataset. We present strong baseline approaches and various experimental results, aiming to support further research efforts on this task.


Introduction
With the recent advancements in NLP, there has been a surge of research interest and effort in developing conversational systems for various domains. Among the important tasks in the field are conversational question answering and document-grounded dialogue modeling. Prior work typically formulates the task as a machine reading comprehension task, assuming the associated document or text snippet is given, such as QuAC (Choi et al., 2018), ShARC (Saeidi et al., 2018), CoQA (Reddy et al., 2019), OR-QuAC (Qu et al., 2020a) and Doc2Dial (Feng et al., 2020b). However, such a task setup neglects the common real-life scenarios where a goal-oriented conversation could correspond to several sub-goals that are addressed in different documents. In this work, we propose a new task and dataset, MultiDoc2Dial, on modeling goal-oriented dialogues that are grounded in multiple documents.
We illustrate the proposed task in Figure 1. It includes a goal-oriented dialogue with four segments on the left and three relevant documents on the right. Each dialogue segment indicates that all turns within it are grounded in the same document, e.g., turns from A3 to A7 in Seg-2 are all grounded in Doc-2. The blue dashed lines connect a dialogue turn with its corresponding relevant span in a document. The red dotted lines with arrows indicate that the dialogue flow shifts among the grounding documents through the conversation, i.e., Doc-1 → Doc-2 → Doc-1 → Doc-3. This example highlights certain challenges in dynamically identifying the relevant grounding content among different documents in a conversation. For instance, agent response A2 mentions 'insured' as an important condition based on a span in Doc-1. However, there are no more details about 'insured' in Doc-1.
To further discuss 'insured', the conversation naturally switches to another document, Doc-2. Another challenge is handling deep dialogue context. For instance, to respond to U4 or U6 in Seg-2, the agent needs to understand the context of 'disability benefit' mentioned in Seg-1. There are also cases such as Seg-4, where turn U10 starts a new question related to Doc-3 that seems independent of the previous segments. The task goal in this example is fairly simple, yet it reveals realistic expectations of document-grounded dialogue modeling that are yet to be met.
To the best of our knowledge, there is no existing task or dataset that addresses the scenarios where the grounding documents of goal-oriented dialogues are unknown and dynamic. To facilitate the study in this direction, we introduce a new dataset that contains conversations grounded in multiple documents. To construct dialogue flows that involve multiple documents, we derive a new approach from the data collection pipeline proposed in Feng et al. (2020a). The newly composed dialogue flows have multiple dialogue segments, where two adjacent segments are grounded in different documents.

[Figure 1: Excerpts of the grounding documents from ssa.gov referenced by the example dialogue, covering Social Security credits, the basics about disability benefits, and benefits for qualifying spouses and children.]
Inspired by recent advances in open-domain question answering (Guu et al., 2020; Karpukhin et al., 2020; Lewis et al., 2020; Khattab et al., 2020a), we develop baseline models based on the retriever-reader architecture (Karpukhin et al., 2020; Lewis et al., 2020). Compared to existing open retrieval QA and open retrieval conversational QA tasks (Qu et al., 2020b), our dataset contains more complex and diverse dialogue scenarios based on diversified documents from multiple domains. To model the interconnected contexts of dialogues and documents, we utilize document-based and dialogue-based structure information. For the former, we segment a document into passages while maintaining its hierarchical contextual information. For the latter, in addition to combining the current turn and dialogue history (Qu et al., 2020b, 2021), we also experiment with different ways to encode the current turn separately, based on the intuition that the latest turn, with a change in topic, could be semantically distant from the dialogue history. We also explore different retriever settings in our experiments.
We propose two tasks for modeling dialogues grounded in multiple documents: one is to generate the grounding document span; the other is to generate the agent response given the current turn, the dialogue history and a set of documents. We evaluate the performance of the retriever and generator in baseline models trained on the MultiDoc2Dial dataset.
We summarize our contributions as follows: • We propose a novel task and dataset, called MultiDoc2Dial, on modeling goal-oriented dialogues that are grounded in multiple documents from various domains. We aim to challenge recent advances in dialogue modeling with more realistic scenarios that are hardly addressed in prior work.

Data
We present MultiDoc2Dial, a new dataset that contains 4796 conversations with an average of 14 turns grounded in 488 documents from four domains. The dataset is constructed based on the Doc2Dial dataset V1.0.1 1 . MultiDoc2Dial shares the same set of annotations as Doc2Dial. For document data, it includes HTML mark-ups such as list, title and document section information, as shown in Figure 1. For dialogue data, each dialogue turn is annotated with role, dialogue act, human-generated utterance and the grounding span with document information. Each dialogue contains one or more segments, where each segment indicates that all turns within it are grounded in the same document. For instance, the dialogue in Figure 1 has four segments that are grounded in three documents. We exclude the 'irrelevant' scenarios in Doc2Dial where the user question is unanswerable, and leave them for future work. We also filter out dialogues in which we identify more than four noisy turns. There is a total of 61078 dialogue turns in the MultiDoc2Dial dataset, consisting of 38% user questions, 12% agent follow-up questions and the rest responding turns. Table 2 shows the statistics of the dataset by domain, including the number of dialogues with two segments (two-seg), more than two segments (>two-seg), and no segmentation (single).
To create the data, we derive a new data construction approach from the pipelined framework of Feng et al. (2020a). We first create dialogue flows that correspond to multiple documents, and then re-collect the utterances for certain turns based on the dialogue scenes in a given flow via crowdsourcing. We aim to reuse previous turns from the Doc2Dial dataset wherever possible and collect new turns when necessary to compose the new dialogues.

Dialogue Flow
To construct dialogue flows grounded in multiple documents, we need to split the existing dialogues into segments and recompose them. The main idea is to identify the position where the previous topic can possibly end and then find a segment with a new topic that is grounded in a different document, for which we utilize both document-based and dialogue-based structure knowledge.

Dialogue Segmentation To segment dialogues, we identify all the candidate splitting positions based on dialogue act and turn index. Intuitively, we aim to maintain the semantic coherence of a dialogue segment (Mele et al., 2020). Thus, we only split after an agent turn with the dialogue act 'responding with an answer' whose next turn is not 'asking a follow-up question'. We randomly select a number of splitting positions per existing dialogue and obtain 2 to 4 segments per dialogue.
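The splitting heuristic above can be sketched as a short function. The turn fields ("role", "da") and the dialogue-act labels are hypothetical names for illustration; the dataset's actual annotation scheme may differ.

```python
# Minimal sketch: a dialogue may be split only after an agent turn whose
# dialogue act is "respond-answer", provided the next turn is not a
# follow-up question. Field names and act labels are illustrative.

def candidate_splits(turns):
    """Return indices i such that the dialogue may be split after turns[i]."""
    positions = []
    for i in range(len(turns) - 1):
        cur, nxt = turns[i], turns[i + 1]
        if (cur["role"] == "agent"
                and cur["da"] == "respond-answer"
                and nxt["da"] != "follow-up-question"):
            positions.append(i)
    return positions
```

A random subset of these candidate positions would then be chosen so that each dialogue yields 2 to 4 segments.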
Document Transition To simulate the document-level topic shift between dialogue segments, we identify different types of grounding document transitions in a dialogue, including (1) the following grounding document is closely related to the preceding grounding document, such as Seg-1 and Seg-2 in Figure 1; and (2) the two documents are not necessarily closely related, such as Seg-3 and Seg-4. For the former case, we exploit document-based structure knowledge of a domain to determine the semantic proximity of document pairs, including (1) document-level hierarchical structure indicated by the website URLs, e.g., Doc-2 and Doc-3 share one parent topic; and (2) hyperlinks between pages, such as the hyperlink of 'insured' in Doc-1 pointing to Doc-2. For the latter case, we randomly select document pairs from the same domain if they do not belong to the former case.
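The URL-based proximity check can be sketched as follows. The helper names and the example paths are illustrative assumptions, not taken from the dataset; the paper only states that documents sharing a parent topic in the URL hierarchy are treated as closely related.

```python
# Sketch: two documents are "closely related" if their URLs share the same
# parent path, i.e., they sit under one parent topic in the site hierarchy.
from itertools import combinations
from urllib.parse import urlparse

def parent_path(url):
    """Parent segment of a URL path, e.g. /benefits/disability -> /benefits."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[0]

def related_pairs(urls):
    """All unordered pairs of URLs that share a parent path."""
    return [(a, b) for a, b in combinations(urls, 2)
            if parent_path(a) == parent_path(b)]
```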
Re-composition Last, we combine multiple dialogue segments to form a new dialogue flow based on the following rules: (1) a dialogue segment can only appear in one new dialogue flow; (2) the grounding documents of two adjacent dialogue segments must be different; (3) we keep the new dialogue flows between 6 and 20 turns by filtering out shorter dialogues and discarding the later turns of longer ones.

Table 3: Data statistics of document passages and dialogue data based on splits. The average length is based on the number of tokens.

Data Collection
After we re-compose dialogue flows with multiple segments, we need to re-write certain dialogue turns since some of the original turns could be under-specified when taken out of the previous context, especially when they are re-positioned at the beginning of a dialogue segment. For instance, if we use Seg-3 in a new dialogue context, then we expect U8 to be enhanced with necessary background such as "I am qualified for disability benefit. My wife is currently unemployed. I want to know what benefit she gets from me" based on Doc-1.
To collect the rewriting of a given dialogue turn, we provide context information including up to four preceding turns, the succeeding turn, the associated document title information and the grounding span in a document section. We ask crowdsourced contributors to rewrite the utterance to fit the given context, adding necessary background information and removing irrelevant or contradicting content. For quality control, we also insert various template-based placeholders for the crowd to modify accordingly; a task is rejected if they fail to modify the placeholders. We collect over 6000 turns for improving the multi-segmented dialogues. The task was performed by 30 qualified contributors from appen.com. More information about the crowdsourcing task can be found in Appendix A.

Tasks
We propose two tasks for evaluation on the MultiDoc2Dial dataset.

Task I: Grounding Span Prediction
This task aims to predict the grounding document span for the next agent response. The input includes (1) the current user turn, (2) the dialogue history, and (3) the entire set of documents from all domains. The target output is a grounding text span from one document that is relevant to the next agent response. Training a dialogue system to provide fine-grained grounding information can be an important step for improving the interpretability and trustworthiness of neural-model-based conversational systems (Carvalho et al., 2019).

Task II: Agent Response Generation
In this task, we aim to generate the next agent response, which may involve asking a follow-up question or providing an answer to a user question. Again, the input includes (1) the current user turn, (2) the dialogue history and (3) the entire set of documents from all domains. The target output is the next agent response in natural language. This task is considered more difficult than Task I since agent utterances vary in style and are not directly extracted from document content. A related task is to simulate user utterances, which could be even more challenging since user utterances are more diversified in style and content. We leave this task for future work.

Model
We formulate the tasks of predicting the next agent turn as an end-to-end generation task, inspired by recent developments in the retriever-reader architecture (Karpukhin et al., 2020; Lewis et al., 2020). We consider Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) as the base model. It includes a retriever component and a generative reader component. The retriever aims to retrieve the most relevant document passages given a dialogue query; the generator, a pre-trained seq2seq model in our case, takes the combined dialogue query and top-n document passages as input and generates the target output.
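Concretely, the RAG-Token variant used here marginalizes over the top-k retrieved passages z per generated token, following the formulation of Lewis et al. (2020):

```latex
p_{\text{RAG-Token}}(y \mid x) \;\approx\; \prod_{i=1}^{N} \sum_{z \in \text{top-}k\left(p_\eta(\cdot \mid x)\right)} p_\eta(z \mid x)\; p_\theta\!\left(y_i \mid x, z, y_{1:i-1}\right)
```

where x is the dialogue query, p_η the retriever distribution over passages, and p_θ the seq2seq generator. This per-token marginalization is what allows the generator to draw content from multiple passages within one response.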
Since we need to deal with the contextual information of dialogue turns and document passages, we further investigate how to utilize dialogue-based and document-based structure information for the retriever component.

Document-based Structure
Previous approaches for open retrieval question answering typically split long documents into smaller text passages with a sliding window. In this work, we investigate two ways of segmenting document content: (1) we split a document with a sliding window of N tokens; (2) we utilize document structure information indicated by mark-up tags in the HTML files. Inspired by the document tree structures used in Feng et al. (2020a); Wan et al. (2021), we segment a document based on its original paragraphs, indicated by mark-up tags such as <p> or <ul>, and then attach the hierarchical titles to each paragraph to form a passage, e.g., adding 'The Basics about Disability Benefits / Benefits for Your Children / Qualification' to the last paragraph of [Doc-1] in Figure 1.
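The two segmentation strategies can be sketched as below. The input representation (a list of (title path, paragraph) pairs) and the " / " and " // " separators are simplifying assumptions for illustration; the actual passages are derived from the HTML mark-up.

```python
# Sketch of the two passage-segmentation strategies.

def token_passages(tokens, window=100):
    """Token-based: split every `window` tokens (non-overlapping windows)."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

def structure_passages(sections):
    """Structure-based: prepend the hierarchical titles to each paragraph.

    `sections` is a list of (list_of_titles, paragraph_text) pairs, a
    simplified stand-in for the parsed HTML document tree.
    """
    passages = []
    for titles, text in sections:
        prefix = " / ".join(titles)
        passages.append(f"{prefix} // {text}" if prefix else text)
    return passages
```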

Dialogue-based Structure
For dialogues with multiple topic-based segments, a turn can be lexically and semantically more distant from previous turns when the topic shifts at that turn (Arguello and Rosé, 2006). Thus, we also consider incorporating the retrieval results based only on the current turn, in addition to the retrieval results based on the combination of the current turn and history (Qu et al., 2020b, 2021). To obtain the representation of the current turn, we experiment with two approaches for the BERT-based question encoder in the RAG model: one based on the common [CLS] token embedding, and one based on pooled token embeddings (Choi et al., 2021).
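The two turn representations can be sketched in numpy. The shapes and token_type ids are illustrative; in the actual model these would come from the BERT encoder's outputs.

```python
# Sketch of the two current-turn representations: the [CLS] embedding is the
# first token's vector; the pooled variant averages the embeddings of tokens
# whose token_type_id marks the current turn.
import numpy as np

def cls_embedding(token_embs):
    """[CLS] representation: the embedding of the first token."""
    return token_embs[0]

def pooled_current_turn(token_embs, token_type_ids, current_type=1):
    """Average pooling over tokens belonging to the current turn."""
    mask = np.asarray(token_type_ids) == current_type
    return token_embs[mask].mean(axis=0)
```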

Experiments
We evaluate the proposed approaches on the two tasks for predicting the next agent turn, i.e., grounding span prediction and agent response generation. Given the current turn, the dialogue history and all available documents in the dataset, we evaluate the generated text along with the intermediate retrieval results. We split the data into train/validation/test sets as shown in Table 3. The ratio between the train and validation/test sets is close to 5 : 1. Half of the dialogues in the validation/test sets are grounded in documents "unseen" in the train set. All experiments were run on one V100 GPU with half-precision (FP16) training. More details about the experimental settings and hyperparameters are reported in Appendix B.1 and B.2.

Baseline Approaches
Our baseline approaches are based on RAG models (Lewis et al., 2020). For the retriever, we use the DPR bi-encoder pre-trained on the Natural Questions dataset 2 . It contains a question encoder for encoding the dialogue query and a context encoder for encoding document passages. We also fine-tune the DPR bi-encoder using the train and validation sets. For the generator, we use BART-large pre-trained on the CNN dataset. To train the retriever and generator end-to-end, we use the RAG-Token model, which allows the generator to select content from multiple documents. We found RAG-Token to perform better than RAG-Sequence, both intuitively and experimentally, for our task, because the MultiDoc2Dial dataset contains many longer agent responses, with an average of 22 tokens, that might span multiple passages. We experiment with different retrievers, including BM25 and multiple DPR variants.

Implementations
Fine-tuning DPR To fine-tune DPR, we select positive and negative examples using the train set.
For positive examples, we use the grounding annotations, which include the reference document passage information.
For negative examples, we obtain one hard negative per question by using the grounding text as the query and taking the top-1 BM25 retrieval result; we obtain 10 regular negatives by using the dialogue query (up to 128 tokens) and sampling from the passages ranked 15 to 25 by BM25. We use gradient checkpointing to support a large batch size, which we set to 128. This also allows for more in-batch negatives, as suggested in Karpukhin et al. (2020); Lewis et al. (2020).
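The negative-selection recipe can be sketched as follows, under the assumption that `ranked_by_grounding` and `ranked_by_query` are passage ids sorted by BM25 score (best first); the function name and exact rank window handling are illustrative.

```python
# Sketch of DPR negative selection: the top-1 BM25 hit for the grounding text
# serves as the hard negative, and 10 of the passages ranked 15-25 for the
# dialogue query serve as regular negatives. The gold passage is excluded.

def select_negatives(ranked_by_grounding, ranked_by_query, positive_id):
    hard = next(pid for pid in ranked_by_grounding if pid != positive_id)
    regular = [pid for pid in ranked_by_query[14:25]  # ranks 15..25
               if pid != positive_id][:10]
    return hard, regular
```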

Document Index
To create the document index, we segment documents into passages in the two ways described in Section 3.1. One is to split a document every one hundred tokens, the same as the default setting of the RAG implementation by Huggingface; we denote the token-segmented documents as D token . The other is to split a document based on document sections; we denote the structure-segmented documents as D struct . We use Maximum Inner Product Search (MIPS) with Faiss to find the top-k documents, using the IndexFlatIP 3 indexing method.
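For illustration, the exact search performed by Faiss's IndexFlatIP can be reproduced in numpy: compute the inner product of the query against every passage vector and keep the k largest scores.

```python
# Numpy sketch of exact MIPS (what IndexFlatIP does, without the Faiss
# data structures): score every passage by inner product, return top-k.
import numpy as np

def mips_top_k(query, passage_matrix, k):
    scores = passage_matrix @ query        # inner product with each passage
    top = np.argsort(-scores)[:k]          # indices of the k largest scores
    return top, scores[top]
```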

Dialogue Query Embedding
We combine the current turn and history with [SEP] in between as one dialogue query. The query is truncated if longer than the maximum source length. To obtain the embedding representation of a dialogue turn, we consider the common [CLS] token embedding and pooled token embeddings. For the latter, we utilize token_type_ids to distinguish the current turn from the history, then apply average pooling to turn the token embeddings into a fixed-length sequence vector.

Retriever Settings
We experiment with the following variations for the retriever components.
• D token -nq / D struct -nq: uses the original pre-trained DPR bi-encoder. The corresponding document index is based on token/structure-segmented passages. We consider these setups as baselines.
• D token -ft / D struct -ft: uses fine-tuned DPR biencoder. The document index is based on token/structure-segmented passages.
• D token -rr-cls-ft / D struct -rr-cls-ft: uses the fine-tuned DPR bi-encoders. We combine the retrieval results of the entire dialogue query and of the current turn alone, and select the top-k unique passages. The representation of the current turn is based on the [CLS] token embedding. The document index is based on token/structure-segmented passages.
• D token -rr-pl-ft / D struct -rr-pl-ft: uses the fine-tuned DPR bi-encoders. We combine the retrieval results of the entire dialogue query and of the current turn alone, and select the top-k unique passages. The representation of the current turn is based on pooled token embeddings. The document index is based on token/structure-segmented passages.
In addition, we experiment with BM25 (Trotman et al., 2014), noted as D token -bm25 / D struct -bm25, where BM25 is used for retrieving the top-k passages, following the experimental setup in Lewis et al. (2020).
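The combination step in the rr-* variants can be sketched as below. The interleaving order is an assumption for illustration; the paper only states that the two result lists are combined and the top-k unique passages are kept.

```python
# Sketch of the rr-* retriever variants: interleave the passages retrieved
# for the full dialogue query with those retrieved for the current turn
# alone, keeping the first k unique passage ids.

def combine_rankings(full_query_hits, current_turn_hits, k):
    merged, seen = [], set()
    for pair in zip(full_query_hits, current_turn_hits):
        for pid in pair:
            if pid not in seen:
                seen.add(pid)
                merged.append(pid)
            if len(merged) == k:
                return merged
    return merged
```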

Evaluation Metrics
We evaluate the passage retrieval results and the generated text for both tasks. For retrieval, we compute recall (@k), which measures the fraction of times the correct document is found in the top-k predictions. We evaluate text generation output based on token-level F1 score (F1), Exact Match (EM) (Rajpurkar et al., 2016) and SacreBLEU score (BL) (Post, 2018).
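Minimal reference implementations of recall@k and token-level F1 are shown below; the whitespace tokenization is a simplification of the standard SQuAD evaluation script, which also lowercases and strips punctuation.

```python
# recall@k: does the gold passage appear in the top-k predictions?
# token-level F1: bag-of-token overlap between prediction and reference.
from collections import Counter

def recall_at_k(ranked_ids, gold_id, k):
    return float(gold_id in ranked_ids[:k])

def token_f1(prediction, reference):
    p, r = prediction.split(), reference.split()
    common = sum((Counter(p) & Counter(r)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)
```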

Passage Retrieval Results
We first evaluate the performance of BM25, DPR and fine-tuned DPR on the passage retrieval task. The query is the combination of the current turn and the dialogue history from latest to earliest turn, up to 128 tokens. Table 5 presents the retrieval results on the validation set. BM25 performs better than DPR-nq but worse than DPR-ft, and shows almost no difference between D token and D struct . DPR-ft shows significant improvement over DPR-nq. Both DPR-nq and DPR-ft seem to benefit from the document-based structure, as they perform better on D struct than on D token . Table 4 and Table 6 present the evaluation results on the test and validation sets for the two tasks, respectively. All numbers in the tables are the mean of three runs with different random seeds. We omit the standard deviations as they suggest low variance in our experiments. Even though BM25 outperforms DPR-nq in Table 5, it performs much worse than DPR-nq on the generation tasks, as shown in Table 4 and Table 6. RAG models with different DPR-based retrievers generally perform better with D struct than with D token on the generation tasks, consistent with the DPR-based retrieval results in Table 5. RAG models with DPR-ft show improvement over those with DPR-nq for both D struct and D token , which confirms the importance of positive and negative examples even in small quantity (Karpukhin et al., 2020; Khattab et al., 2020b). We also see that the retrieval performance gap between D token and D struct is reduced after training the fine-tuned question encoder in RAG. Overall, the retrieval performances for the two tasks seem comparable, but the generation metric scores for Task II are much lower than for Task I, as agent responses are free-form natural language.

Generation Results
We also experiment with a simple way to re-rank the retrieved passages of the entire query based on the results retrieved using only the current turn, with the two types of current-turn embeddings described earlier. As shown in Table 4 and Table 6, the re-ranking is not very effective, and the difference between the two kinds of encodings is insignificant. In addition, we evaluate the baselines on unseen domains, where we train the models using data from three source domains and test on one unseen target domain. For more experimental results on the domain adaptation setup, please see Appendix B.3.

Qualitative Analysis
To understand the challenges in dialogues grounded in multiple documents and to evaluate data quality, we randomly select dialogue queries from the validation set and examine them along with the corresponding passages retrieved by our model. We observe certain ambiguities in the dialogue-based and document-based contexts, summarized as follows.
• Ambiguity in dialogue queries: when a user question is under-specified or asks about a higher-level topic, it is likely to be relevant to multiple document passages. For instance, query U8 in [Seg-3] could be linked to different types of benefits described in multiple documents on ssa.gov.
• Ambiguity in document content: when certain passages in different documents cover very similar topics, they could be duplicates or differ in context. For instance, the same question could be addressed both on an FAQ page and in another article concerning a different specific criterion.

Human evaluations on ambiguity in questions
We ask human experts to identify whether a dialogue inquiry turn is ambiguous based on its dialogue history and the associated documents. We consider a dialogue query 'ambiguous' if it is likely to be relevant to a broad range of domain knowledge; otherwise 'unambiguous'. First, we ask the annotators to annotate the queries based on their understanding of the dialogue context and the domain knowledge. Second, we reveal the reference document passage and the other most relevant document passage retrieved by our models, and ask them to annotate the queries again. We randomly select 100 dialogue inquiry turns and assign them to two experts with 20% overlap. The Cohen's kappa agreement score is 0.85. In the first setting, 15% of the turns are labeled 'ambiguous'.
In the second setting, after revealing the relevant passages, 20% of the turns are considered 'ambiguous'. Such ambiguities are generally inherent to open retrieval settings (Zhu et al., 2021). In practice, a conversational agent would need to ask follow-up questions for clarification, based on the dialogue history and the retrieved passages, in order to provide a fairer and more informative answer. The current version of the MultiDoc2Dial dataset can be further enhanced by adding agent turns that ask clarification questions at different levels, such as the highest title (topic) level, the sub-title (subtopic) level or an even finer level, which we leave for future work.
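The Cohen's kappa score reported above can be computed directly from its definition: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement from the marginal label frequencies. A minimal implementation:

```python
# Cohen's kappa for two annotators labeling the same items.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[label] * cb[label] for label in ca) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```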

Document-grounded Dialogue and Conversational Question Answering
Our work is closely related to recent work on document-grounded dialogue and conversational machine reading comprehension tasks, such as Doc2Dial (Feng et al., 2020b), ShARC (Saeidi et al., 2018) and DoQA (Campos et al., 2020). The primary goal of these works is to provide an answer or a dialogue response based on a single given document or text snippet. In contrast to their closed-book setting, our task is in an open-book setting, which aims to address more realistic scenarios in goal-oriented dialogues where the associated document content is unknown and likely spans more than one document. Our work builds on Doc2Dial, a goal-oriented dialogue modeling task based on a single document. Our dataset shares the same set of documents and annotation scheme as Doc2Dial. However, we aim to further the challenge by dealing with cases where the document-level topic shifts through a dialogue. Thus, we propose a new data construction approach, new tasks, and baseline approaches in an end-to-end setting.

Open Domain Question Answering
Our proposed data and tasks are also related to open domain question answering (Chen et al., 2017; Kwiatkowski et al., 2019; Qu et al., 2020b; Mao et al., 2020; Zhu et al., 2021; Izacard and Grave, 2021; Yu et al., 2021). In particular, our work is closely related to the recently proposed open-retrieval conversational question answering (OR-CQA) setting for the QuAC dataset, i.e., OR-QuAC (Qu et al., 2020b). To the best of our knowledge, the search queries for these tasks are mostly created based on Wikipedia articles. In OR-QuAC (Qu et al., 2020b), all turns in one conversation are grounded in passages from the Wikipedia page of a given entity, i.e., no multiple documents are involved in a dialogue. In contrast, our task proposes to model dialogues grounded in multiple documents, where the documents are of diverse writing styles from four real user-facing websites. In addition to the difference in document data, MultiDoc2Dial provides more types of dialogue queries based on a richer set of dialogue acts. Table 1 compares several of the most related datasets and tasks along different aspects, including whether the setup is open-book, whether the dialogues are goal-oriented, whether the grounding is annotated, whether the associated text is a full document, and whether each dialogue corresponds to multiple documents. Our work is the only one that covers all these characteristics.

Discourse Segmentation
This work is also largely related to discourse segmentation tasks (Arguello and Rosé, 2006; Zhang and Zhou, 2019; Mele et al., 2020; Xing and Carenini, 2021), which aim to identify topic changes in a dialogue. This is an important task for modeling goal-oriented dialogues in general. Some papers, such as Arguello and Rosé (2006); Hsueh et al. (2006); Xing and Carenini (2021), focus on modeling and predicting segmentation; others use explicit segmentation as the input to downstream dialogue modeling tasks on a machine reading comprehension dataset, ShARC. Our task is closely related to the latter, albeit in an open-book setting and with an end-to-end modeling approach that encodes the dialogue segmentation information implicitly. We leave more explicit dialogue segmentation modeling for future work.

Conclusion and Future Work
We introduced MultiDoc2Dial, a new task and dataset dealing with goal-oriented dialogues that have multiple sub-goals corresponding to different documents. We proposed two tasks for predicting the next agent turn, which we formulate as generation tasks. We presented strong baseline approaches based on the retriever-reader architecture and experimented with different variants of neural retrievers. For future work, we aim to address the ambiguity in open-book dialogue modeling.

Ethical Consideration
One primary motivation of this paper is to provide data instances that simulate how real human users converse with agents to seek information. Such data is essential for training neural models that power conversational systems assisting end users in accessing information in real-life domains, such as social benefit websites. However, such datasets are largely unavailable for research and development. Since we create the dataset via crowdsourcing, one potential ethical concern is that the data could be biased or distant from real user queries. To address such concerns, we identify qualified contributors with different backgrounds and train them via several rounds of tasks. In addition, we provide various examples in the instructions and give feedback to contributors during the crowdsourcing task.

A Data Construction
For crowdsourcing, we filter out less qualified contributors by adding template-based placeholders to the writing task for detecting poor performance, which seems effective. Most contributors are able to improve either the user query or the agent response, and sometimes both, based on the document context. However, we do find that the writings from the original Doc2Dial dataset appear a bit more natural, with more personalized information. We suspect that it is easier to add such information when the conversation is built from scratch based on a single document and the contributors write the dialogue history themselves. We provide feedback to the crowd accordingly to address the issue. We observe that the crowd contributors hardly reject any task. For quality control, we also manually review and re-collect data during the data collection process. For the instructions, interface, rules and examples for data collection via crowdsourcing, please see Figures 2a and 2b.

B Experiments
The implementation is in PyTorch. For fine-tuning RAG (Lewis et al., 2020), we follow the example 4 from HuggingFace and fine-tune it on our dataset for 16 epochs. For fine-tuning DPR (Karpukhin et al., 2020), we train DPR-nq on our dataset using facebookresearch/DPR 5 . Then, we integrate the fine-tuned bi-encoder into the RAG model facebook/rag-token-nq 6 . For the pre-trained DPR, we use the bi-encoder model trained on the NQ dataset only, from the Facebook DPR checkpoint 7 . We train the models for 10 epochs and evaluate using the last checkpoint.

B.1 Hyperparameters for fine-tuning DPR
We fine-tune DPR for 50 epochs with a batch size of 128, using gradient checkpointing to support the large batch size. We use a learning rate of 2e-05 with Adam, linear scheduling with warmup, and a dropout rate of 0.1. We set the maximum encoder sequence length to 128, consistent with the RAG model. We also use one additional BM25 negative passage per question in addition to in-batch negatives.


B.3 Experiment Results
Fine-tuned DPR For fine-tuning DPR, we experiment with different ways of obtaining hard negative examples. One uses the grounding of a dialogue query (grounding) as the query; the other uses the combined dialogue utterances (query) as the query. The results turn out comparable, as shown in Table 9.
Domain Adaptation Setup We also experiment with a domain adaptation setup, where the train and validation splits are based on all data from three domains and the test split is based on one unseen domain. The domain information is considered when retrieving relevant passages. The ratio of the number of examples in train and validation is 5 : 1. The retrieval results are comparable between the source domains and the target unseen domain, in line with Table 5. However, the EM and F1 scores in Table 10 are much lower compared to the setup without any unseen domain in Table 4 and Table 6, which confirms that the domain adaptation setup is indeed challenging.