End-to-End Learning of Flowchart Grounded Task-Oriented Dialogs

We propose a novel problem within end-to-end learning of task oriented dialogs (TOD), in which the dialog system mimics a troubleshooting agent who helps a user by diagnosing their problem (e.g., car not starting). Such dialogs are grounded in domain-specific flowcharts, which the agent is supposed to follow during the conversation. Our task exposes novel technical challenges for neural TOD, such as grounding an utterance to the flowchart without explicit annotation, referring to additional manual pages when the user asks a clarification question, and the ability to follow unseen flowcharts at test time. We release a dataset (FLODIAL) consisting of 2,738 dialogs grounded on 12 different troubleshooting flowcharts. We also design a neural model, FLONET, which uses a retrieval-augmented generation architecture to train the dialog agent. Our experiments find that FLONET can do zero-shot transfer to unseen flowcharts, and sets a strong baseline for future research.


Introduction
Task oriented dialog (TOD) systems (Bordes and Weston, 2017) converse with users to help them with specific tasks such as calendar enquiry (Eric et al., 2017), restaurant reservation (Henderson et al., 2014), and tourist package recommendation (El Asri et al., 2017). These dialog systems (e.g., restaurant reservation system) are trained using past human-to-human dialogs and associated knowledge sources (e.g., a KB of restaurants).
Most existing TOD systems are conversational recommender systems that gather user requirements in the form of attributes (such as cuisine and location), query a KB, and generate recommendations based on the retrieved results (e.g., a restaurant and its phone number). While there have been recent efforts (e.g., Feng et al., 2020) to study non-recommendation TOD, several important tasks such as troubleshooting are still unexplored. Troubleshooting is a common task handled by customer support agents. It involves understanding a user's problem, narrowing down the root cause and providing a solution. Figure 1 shows an example dialog between an agent and a user troubleshooting a car problem. Support agents typically follow a flowchart (utterances A1, A4 in our example) to diagnose user problems, but may refer to supplementary knowledge sources like FAQs (A3) if the user asks a clarification question (U3).
In this paper, we propose the novel task of end-to-end learning of a TOD system that troubleshoots a user's problems by using a flowchart and a corpus of FAQs. Our task exposes novel research challenges for TOD system design. First, the system must learn to ground each utterance in the flowchart without explicit supervision. Second, when required, the agent must refer to additional knowledge in the corpus of FAQs to issue clarifications and add details not present in the flowchart. Third, it must learn the general skill of following a flowchart, which is tested in a zero-shot transfer setting with unseen flowcharts shown at test time.
Before collecting a dataset for the task, we first analyze a sample of 100 in-house troubleshooting dialogs with a human customer service agent. Table 1 summarizes the statistics on common utterances in such conversations. This analysis reaffirms the importance of supplementary knowledge (T3). We crowdsource the first version of our dataset, FLODIAL (Flowchart Grounded Dialogs), with these utterance types: problem description (T1), flowchart following (T2), use of supplementary knowledge in the form of FAQs (T3), and closing utterances (T8). FLODIAL has 2,738 dialogs grounded on 12 different flowcharts.

[Figure 1: An example troubleshooting dialog. The user reports that their Acura ILX will not start and the starter does not spin. Following the flowchart (nodes N1-N3), the agent asks whether the battery voltage reads more than 12; the user digresses with clarification questions about checking the voltage with a voltmeter, which the agent answers from the FAQs; the agent then diagnoses a dead battery and suggests a jump start.]
Since this is a new task, existing end-to-end TOD models are not directly applicable to it. We design a baseline network named FLONET, which follows the retrieval augmented generation framework (Lewis et al., 2020) and generates agent responses in two steps. First, relevant information from the flowchart and the FAQ corpus is retrieved based on the dialog history. Then, an encoder-decoder generates the agent response from this retrieved information and the dialog history. We evaluate FLONET in two different settings: (1) the Seen Flowcharts (S-Flo) setting, tested on flowcharts seen at train time, and (2) the Unseen Flowcharts (U-Flo) setting, which evaluates FLONET's zero-shot transfer ability on flowcharts unseen at train time.

To summarize, the main contributions of this paper are:
1. We propose the novel problem of end-to-end learning of flowchart grounded task oriented dialog.
2. We collect a new flowchart grounded task-oriented dialog dataset, FLODIAL (https://dair-iitd.github.io/FloDial).
3. We propose a baseline solution, FLONET (https://github.com/dair-iitd/FloNet), for the proposed problem and evaluate it in the seen flowchart and unseen flowchart settings.

We release all our resources for further research on the task.

Related Work
Dialog systems can be broadly divided into two types: task oriented (TOD) (Williams and Young, 2007; Bordes and Weston, 2017) and open domain dialog systems (Vinyals and Le, 2015; Serban et al., 2016). Task oriented dialog systems can further be divided into end-to-end (Bordes and Weston, 2017; Raghu et al., 2021; Gangi Reddy et al., 2019) and traditional slot filling approaches (Williams and Young, 2007). Slot filling approaches require dialog state annotations in dialog transcripts. Our work falls under end-to-end approaches, which do not require any such intermediate annotations.
We first briefly discuss existing TOD datasets, then review approaches for collecting dialog datasets, and finally discuss dialog systems related to FLONET.

Dialog Datasets: Existing TOD datasets can be grouped based on the type of knowledge source on which the dialogs are grounded. Most existing datasets are for the recommendation task and grounded on structured KBs. Some notable KB-grounded datasets are MultiWOZ (Budzianowski et al., 2018), the Stanford multi-domain dataset (Eric et al., 2017), CamRest (Wen et al., 2016), Frames (El Asri et al., 2017), schema guided dialogs (Rastogi et al., 2020) and Taskmaster-1 (Byrne et al., 2019). Recent work augments MultiWOZ with utterances grounded on FAQs. The dialogs in datasets such as ShARC (Saeidi et al., 2018) and doc2dial (Feng et al., 2020) are grounded on snippets from unstructured text documents. To the best of our knowledge, FLODIAL is the first TOD dataset that is grounded on flowcharts and FAQs.

Dialog Data Collection: Crowdsourcing frameworks for creating dialog datasets can be broadly grouped into three types. (1) The Wizard-of-Oz framework (Kelley, 1984) pairs up two crowd-workers who play the roles of user and agent while conversing. The user is provided with a goal and the agent is given the knowledge necessary to achieve the goal. (2) The self-dialogs framework (Byrne et al., 2019) requires a single crowd-worker to write the entire dialog by playing both user and agent. (3) The dialog paraphrasing framework (Shah et al., 2018) systematically generates a dialog outline (user and agent utterances) and crowdsources paraphrases for each utterance to construct a dialog. We follow this framework for collecting FLODIAL, as it gives us adequate control over the dialog flow so that we can incorporate the various utterance types in Table 1.
Dialog Systems: Large scale pre-trained language models such as GPT2 (Radford et al., 2019) have been used for response generation in both open domain (Wolf et al., 2019; Zhang et al., 2020; Zhao et al., 2020) and TOD systems (Ham et al., 2020; Hosseini-Asl et al., 2020). A major challenge is GPT2's limit on input size: in our setting, it is difficult to feed a long input (flowchart, dialog history, FAQ corpus) to GPT2. We overcome this by following the retrieval augmented generation paradigm (Lewis et al., 2020); we are probably the first to apply it to a dialog setting.
The task of zero-shot response generation requires a model to generalize to new domains with just domain descriptors and no training dialogs. Existing approaches (Zhao and Eskenazi, 2018; Rastogi et al., 2020) model slots and intents as domain descriptors. We model flowcharts as domain descriptors and expect the system to generalize to new flowcharts unseen during training.

The FLODIAL Dataset
FLODIAL is a corpus of troubleshooting dialogs between a user and an agent, collected using Amazon Mechanical Turk (AMT). The dataset is accompanied by two knowledge sources over which the dialogs are grounded: (1) a set of troubleshooting flowcharts and (2) a set of FAQs which contain supplementary information about the domain not present in the flowchart. Both are in English.
The data collection process uses the dialog paraphrasing framework (Shah et al., 2018) and is illustrated in Figure 2. At a high level, we first systematically construct an outline for each dialog, then decompose the outline into multiple AMT paraphrasing tasks, and finally stitch the dialog together using the collected paraphrases. Our data collection process has the following advantages: (i) systematic outline construction guarantees coverage of all paths in the flowchart and the desired balance of utterance types in dialogs, (ii) the process ensures the annotated labels are always correct, and (iii) it provides diversity in the paraphrases collected.

Flowcharts and FAQs
We identify 12 flowcharts covering laptop and car troubleshooting problems, such as an overheating laptop, car won't start and car brake failure. The flowcharts encode agent questions as decision nodes and user responses as edges. The agent follows the flowchart based on user responses to reach a terminal node (e.g., node N4 in Figure 1b) which contains the solution. We refer to the sequence of nodes and edges from the root to a terminal node as a path in the flowchart. One such path is shown in Figure 1b.
The flowcharts usually contain precise instructions with no elaboration. For example, node N4 in Figure 1b just says "does the battery read over 12V?" but does not provide details such as "which instrument is needed to measure the battery voltage?" or "how does one measure the battery voltage using a voltmeter?". For each flowchart, we collect supplementary FAQs that contain details such as step-by-step instructions for a process (e.g., "how to jump start your car?") and other common doubts (e.g., "where is the ignition coil located?"). A few example FAQs are shown in Figure 1b.

Dialog Outline Construction
We systematically iterate over paths in the flowchart, and for each path we construct multiple outlines. Each outline consists of 3 major parts: problem description, flowchart path traversal and closing. We now discuss each part in detail.

Problem Description: The problem description is the first utterance in a dialog. It contains (1) the primary issue faced by the user, (2) secondary information, and (3) other information that may not be relevant for troubleshooting. The primary issue is phrased using the title of the flowchart. For example, for Figure 2 it will be "car won't start".
The secondary information is any other information that may help in troubleshooting the primary issue. For example, the user could say that the starter is not cranking. This secondary information is populated by sampling a random (node, edge) pair from the sampled flowchart path. For example, (N1, no) is populated as secondary information in Figure 2. By adding this to the problem description, we mimic the setting where an agent may need to skip a few nodes when following the flowchart, based on information already present in the dialog history.

Flowchart Path Traversal: After the problem description, we consider each (node, edge) pair in the given flowchart path. For each (node, edge) pair we toss a coin to decide if the pair should be represented as a simple exchange or as a complex exchange. A simple exchange is one where the agent asks the question in the node and the user responds with the answer in the edge. (N2, No) in Figure 2c is constructed as a simple exchange. Complex exchanges use at least four utterances to represent the information in the (node, edge) pair, e.g., (N3, No) in Figure 2c. Complex exchanges can be of two types: user-initiated digressions and agent digressions. The example illustrates a user digression, where the user asks for clarifications to understand the agent question before responding with an answer. An agent digression is similar, except that the agent proactively breaks a complex question into a sequence of simple ones. An example agent digression for (N3, No) would be when the agent first asks "Do you know how to measure the voltage of a car battery using a voltmeter?". If the user responds "no", the agent describes the procedure to measure the voltage, and then requests the user to check if the voltage is greater than 12V.
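The coin-toss expansion of a flowchart path into an outline can be sketched as follows. This is an illustrative reconstruction, not the released pipeline: the function name, the tuple encoding and the placeholder digression text are all assumptions; in the actual process the digression utterances come from crowdsourced paraphrases.

```python
import random

def build_outline(path, p_complex=0.5, rng=None):
    """Expand a flowchart path [(node_question, edge_answer), ...] into a
    dialog outline of (speaker, utterance) tuples.

    Each (node, edge) pair becomes either a simple exchange (question +
    answer) or a complex exchange with a placeholder user digression,
    chosen by a coin toss with probability `p_complex`."""
    rng = rng or random.Random(0)
    outline = []
    for node_question, edge_answer in path:
        if rng.random() < p_complex:
            # Complex exchange: the user digresses before answering.
            outline.append(("agent", node_question))
            outline.append(("user", "<clarification question>"))
            outline.append(("agent", "<clarification answer>"))
            outline.append(("user", edge_answer))
        else:
            # Simple exchange: question followed directly by the answer.
            outline.append(("agent", node_question))
            outline.append(("user", edge_answer))
    return outline
```

Setting `p_complex` to 0 or 1 forces all-simple or all-complex outlines, which is convenient for controlling the balance of utterance types.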
Closing: Closing contains the solution suggested by the agent followed by one or more exchanges to gracefully terminate the dialog. Typically, the users thank the agent and the agent terminates the dialog by acknowledging the user.

AMT Paraphrasing Tasks
We crowdsource paraphrases of each utterance in a dialog outline. Utterances corresponding to each component (problem description, node-edge pairs in the flowchart path, and closing) are paraphrased separately and then stitched together to construct a dialog. We define four types of paraphrasing tasks: non-contextual, contextual, problem description and closing tasks. In the non-contextual task, a single utterance from the outline is provided to the crowd workers to paraphrase. We requested the workers to provide two paraphrases for each utterance to improve diversity among paraphrases (Jiang et al., 2017; Yaghoub-Zadeh-Fard et al., 2019). In the contextual task, workers are asked to paraphrase in the context of a specific previously collected paraphrase. Problem description tasks ask the worker to describe the troubleshooting problem using the primary issue and secondary information as discussed in Section 3.2. In the closing task, the worker gracefully terminates the dialog in the context of a troubleshooting solution collected from a non-contextual task. Examples of the four types of tasks can be seen in Figure 2b. As most user responses in a flowchart are yes/no, we design the yes/no paraphrasing task based on a study by Rossen-Knill et al. (1997). We add specific rules in the tasks for workers to follow when paraphrasing a yes/no user response. An example (in blue outline) is shown in Figure 2b.

Dialog Construction
We generate around 110 outlines for each flowchart by equally dividing them amongst the paths in the flowchart. We generate a total of 1,369 outlines and then collect paraphrases of the constructed outline components. Finally the component paraphrases are stitched together to construct 1,369 dialogs as shown in Figure 2c.
The paraphrases corresponding to an outline component are interchangeable across dialogs. We take advantage of this and generate an additional 1,369 dialogs by randomly interchanging paraphrases without breaking semantics. Our final set has 2,738 dialogs with an average of 15.56 utterances per dialog. The agent and user utterances have an average of 14.95 and 16.17 words, respectively.

Paraphrase Cleaning
To avoid error propagation, we manually verify all paraphrases and correct errors in grammar, spelling and polarity. This step took an author approximately 5 minutes per dialog. An example of a polarity error is when the question "Do you have an open circuit?" was paraphrased as "Do you have a closed circuit?" by a crowd worker. Such paraphrases invert the semantics of the (yes/no) edges from the given node and would break the correctness of a dialog if not corrected. About 6% of the utterances were recollected as they violated instructions.

Task Definition & Baseline System
In this section, we define the problem of learning flowchart grounded task oriented dialogs in an end-to-end manner without the use of intermediate labels. We then describe our proposed baseline model, FLONET, which retrieves necessary knowledge from flowchart/FAQs and generates the agent response using the retrieved knowledge.

Task Definition
We represent a dialog d between a user u and an agent a as a sequence of utterances {c^u_1, c^a_1, c^u_2, c^a_2, ..., c^u_m, c^a_m}, where m denotes the number of exchanges in the dialog. Let F = (N, E) be the flowchart over which the dialog d is grounded, where the set of nodes N represents the agent questions and the edges E represent the user responses. The number of outgoing edges from a node depends on the number of possible user responses to the agent question associated with the node. Let Q = {q_l : a_l}_{l=1}^{L} be the set of frequently asked question-answer pairs (FAQs) associated with the flowchart F. Our objective is to learn a next response predictor, which takes (1) the dialog history h = {c^u_1, c^a_1, ..., c^u_i}, (2) a flowchart F, and (3) a set of FAQs Q as input, and predicts the next agent response y = c^a_i = y_1 y_2 ... y_T.
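As a concrete picture of the task inputs, the flowchart F, FAQ set Q and dialog history h might be encoded as plain Python structures like the following. The node identifiers, edge labels and utterances are illustrative assumptions modeled on the Figure 1 example, not taken from the released dataset:

```python
# Hypothetical encoding of the task inputs; names and text are illustrative.
flowchart = {
    "N1": {"utterance": "Does the starter crank?",
           "edges": {"yes": "N2", "no": "N3"}},
    "N3": {"utterance": "Does the battery read over 12V?",
           "edges": {"yes": "N4", "no": "N5"}},
}
faqs = [
    {"q": "How do I check the car battery voltage?",
     "a": "You can check the car battery voltage using a voltmeter."},
]
history = [
    ("user", "My car is not starting. The starter doesn't spin."),
    ("agent", "Does the battery read over 12V?"),
    ("user", "How do I check the car battery voltage?"),
]
# A next-response predictor maps (history, flowchart, faqs) to the
# next agent utterance y = y_1 ... y_T.
```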

Baseline System: FLONET
Since this is a novel task, existing TOD architectures do not directly apply to it. We design a baseline architecture named FLONET for predicting the agent responses in flowchart grounded dialogs. FLONET is trained in an end-to-end manner without the need for any intermediate annotations such as (a) whether the given agent utterance is grounded on a flowchart node or FAQs, or (b) the specific flowchart node or FAQ on which the agent utterance is grounded. FLONET follows the retrieval augmented generation framework (RAG) (Lewis et al., 2020; Guu et al., 2020), which first retrieves the knowledge necessary to generate the response and then generates the agent response one word at a time using the retrieved knowledge. The framework consists of two main components, a retriever and a generator. The retriever p_η(z|h) outputs a distribution over all documents based on the dialog history h. The flowchart F and FAQs Q are represented as documents (discussed further in Section 4.2.1). The generator p_θ(y_t|h, z, y_{1:t-1}) generates the agent response y word by word using the dialog history h and a retrieved document z. We generate the response using the RAG-Sequence model:

    p(y|h) ≈ Σ_{z ∈ top-k(p_η(·|h))} p_η(z|h) ∏_{t=1}^{T} p_θ(y_t | h, z, y_{1:t-1})    (1)

The overall network is trained by minimizing the negative log-likelihood of the response given by Equation 1. Following Lewis et al. (2020), we marginalize over all the documents using a top-k approximation. We use the top-5 documents in our training implementation due to memory constraints. During inference, only the top-1 document is used, because the dialog's agent responses need to be grounded on only one flowchart node or FAQ. This is unlike RAG, where multiple documents extracted from Wikipedia can contribute to the expected output. See Appendix A.3 for further details.
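The top-k marginalization of RAG-Sequence can be sketched in a few lines; `retriever_probs` and `gen_token_prob` below are stand-ins for the trained retriever p_η and the GPT2 generator p_θ, which are assumed, not implemented here:

```python
import math

def rag_sequence_log_prob(history, response, docs, retriever_probs, gen_token_prob):
    """Log p(y|h) under RAG-Sequence with a top-k approximation.

    retriever_probs: {doc_id: p_eta(z|h)} over the top-k documents only.
    gen_token_prob(history, doc, prefix, token): p_theta(y_t | h, z, y_{1:t-1}).
    """
    total = 0.0
    for doc_id, p_z in retriever_probs.items():
        # Generation probability of the whole response given this document.
        log_p_y = sum(
            math.log(gen_token_prob(history, docs[doc_id], response[:t], response[t]))
            for t in range(len(response))
        )
        # Weight by the retriever's probability and sum over documents.
        total += p_z * math.exp(log_p_y)
    return math.log(total)
```

Training minimizes the negative of this quantity over the dataset; at inference, FLONET restricts the sum to the single top document.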

Retrievable Documents
The retrievable document set includes all flowchart nodes and all FAQ QA pairs associated with the flowchart. In the original RAG model, each (Wikipedia) document had a single dense embedding, based on which a document was retrieved and used. However, in our setting, the content of a flowchart node will typically not be explicitly mentioned in the dialog history. Instead, the right node is best determined from the flowchart structure, i.e., the path to that node, as expressed in the dialog history. Similarly, for FAQs, a QA pair will typically be matched on the question, while the answer will be used in the subsequent dialog.
Consequently, we represent each document as a key-value pair. The document-key is used by the retriever to compute p η (z|h) and the document-value is used by the generator during response generation. We construct a document for each node in F and for each FAQ in Q. The document-key of a flowchart node is the sequence of utterances corresponding to the nodes and edges in the path from the root. Its document-value is the agent utterance associated with it. For a FAQ, the document-key and value are the question and answer, respectively.
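This key-value construction can be sketched as below, assuming the dict-based flowchart encoding (node → utterance plus outgoing edges) used for illustration earlier; the encoding and field names are assumptions, not the released format:

```python
def build_documents(flowchart, faqs, root="N1"):
    """Build retrievable key-value documents from a flowchart and FAQs.

    flowchart: {node: {"utterance": str, "edges": {answer: child_node}}}.
    For a flowchart node, the key is the root-to-node path verbalized as
    alternating agent utterances and user answers; the value is the node's
    own agent utterance. For an FAQ, key = question, value = answer."""
    docs = {}

    def walk(node, path_text):
        utt = flowchart[node]["utterance"]
        docs[node] = {"key": list(path_text), "value": utt}
        for answer, child in flowchart[node].get("edges", {}).items():
            if child in flowchart:  # skip terminal children not listed
                walk(child, path_text + [utt, answer])

    walk(root, [])
    for i, faq in enumerate(faqs):
        docs[f"faq-{i}"] = {"key": [faq["q"]], "value": faq["a"]}
    return docs
```

The retriever then encodes only the `key` fields, while the generator consumes the matched `value`.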

Retriever & Generator
The retriever scores each document z based on the dialog history. The dialog history is encoded using a hierarchical recurrent encoder (Sordoni et al., 2015). The encoder computes a dense representation φ_h(h) of the history. The document-key is also encoded using a hierarchical recurrent encoder to compute its vector representation φ_z(z). Each document is assigned a score equal to the negative of the Euclidean distance between φ_h(h) and φ_z(z). The top-k scores are then passed through a softmax layer to compute p_η(z|h). We use GPT2 as the generator p_θ(y|h, z); it receives a separate input for each retrieved document z. The input to GPT2 is constructed by concatenating all the utterances in the dialog history along with the document-value. The GPT2 input is described in detail in Appendix A.2. The response is decoded using beam search.
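The retriever's scoring step (negative Euclidean distance followed by a softmax over the top-k) can be sketched with plain Python lists standing in for the encoder outputs φ_h(h) and φ_z(z); the hierarchical encoders themselves are assumed:

```python
import math

def retriever_probs(h_vec, doc_vecs, k=5):
    """Compute p_eta(z|h) over the top-k documents.

    h_vec: encoded dialog history phi_h(h), as a list of floats.
    doc_vecs: {doc_id: phi_z(z)} encoded document-keys.
    Scores are negative Euclidean distances; a softmax over the top-k
    scores yields the retrieval distribution."""
    def neg_dist(a, b):
        return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    scores = {d: neg_dist(h_vec, v) for d, v in doc_vecs.items()}
    topk = sorted(scores, key=scores.get, reverse=True)[:k]
    # Numerically stable softmax over the top-k scores only.
    mx = max(scores[d] for d in topk)
    exps = {d: math.exp(scores[d] - mx) for d in topk}
    z = sum(exps.values())
    return {d: e / z for d, e in exps.items()}
```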

Pre-training
To provide a good initialization to the retriever and the generator, we pre-train both the components separately. For each dialog history and response pair (h, y) in our dataset, we first identify the document over which the response is grounded using weak supervision (Zhao et al., 2020). The document whose document-value has the highest BLEU score (Papineni et al., 2002) w.r.t. the response y is labeled as the pseudo grounded document.
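The pseudo-labeling step can be sketched as below. The paper scores each document-value against the response with BLEU; to keep the sketch dependency-free, this version substitutes a simple unigram-F1 overlap as a stand-in scorer, which is an assumption:

```python
def pseudo_grounding(response, docs):
    """Pick the pseudo grounded document for a response.

    docs: {doc_id: {"value": str, ...}}. The document whose value has the
    highest lexical overlap with the response is labeled as the pseudo
    grounded document (the paper uses BLEU; unigram F1 is a stand-in)."""
    def overlap(a, b):
        a, b = a.lower().split(), b.lower().split()
        common = sum(min(a.count(w), b.count(w)) for w in set(a))
        return 0.0 if not common else 2 * common / (len(a) + len(b))

    return max(docs, key=lambda d: overlap(response, docs[d]["value"]))
```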
The retriever is pre-trained using a contrastive loss (Hadsell et al., 2006) by using the pseudo grounded document as the positive example and any other random document as a negative example. The generator is pre-trained by minimizing the negative log likelihood of the response given the dialog history and the document-value of the pseudo grounded document. Following Wolf et al. (2019), we add a next-utterance classification loss to the negative log likelihood loss. The classification loss is applied on the output of a linear classification layer which receives the last hidden state of the generator and outputs the probability of a given utterance being the correct agent response. We use randomly sampled incorrect utterances as negative examples to train the generator based classifier.
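The retriever's contrastive pre-training loss (Hadsell et al., 2006) can be sketched as follows, with plain-list vectors standing in for the encoder outputs and an assumed margin of 1.0:

```python
import math

def contrastive_loss(h_vec, pos_vec, neg_vec, margin=1.0):
    """Hadsell-style contrastive loss on retriever encodings.

    Pulls the history embedding toward the pseudo grounded document-key
    (pos_vec) and pushes it at least `margin` away from a random negative
    document-key (neg_vec). The margin value is an assumption."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    pos_term = dist(h_vec, pos_vec) ** 2                  # attract positive
    neg_term = max(0.0, margin - dist(h_vec, neg_vec)) ** 2  # repel negative
    return pos_term + neg_term
```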

Data Split
We create two different splits of the dialogs in FLO-DIAL. The S-Flo split is used for evaluating the ability of FLONET to generate responses by following flowchart and FAQs. The U-Flo split is used to study the ability of FLONET to generalize to flowcharts unseen during train in a zero-shot flowchart grounded response generation setting.
To generate the S-Flo split, we divide the dialogs associated with each flowchart as follows: 66% for the train set, 17% for the validation set and 17% for the test set. We randomly select a path in the flowchart and push all the dialogs that follow that path into the same set. To generate the U-Flo split, we group all dialogs associated with 8 flowcharts as the train set, all dialogs from 2 flowcharts as the validation set, and the remaining 2 as the test set. The U-Flo split thus has mutually exclusive sets of flowcharts across train, validation and test. Some statistics of the dataset splits are shown in Table 2.

Evaluation Metrics
We measure the ability to generate responses using two standard metrics: BLEU and perplexity. As FLODIAL contains labels of the document (flowchart node or FAQ) on which each agent response is grounded, we use recall@1 (R@1) to measure the retriever performance. We also compute a task-specific metric called success rate (SR), measured as the fraction of dialogs for which an algorithm retrieved the correct flowchart node/FAQ for all the agent utterances in the dialog.
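The success rate metric can be sketched as follows, assuming each dialog is reduced to a list of (retrieved document, gold document) pairs, one per agent utterance; this reduction is an illustrative assumption:

```python
def success_rate(dialogs):
    """Fraction of dialogs in which every agent turn retrieved the
    correct flowchart node / FAQ.

    dialogs: list of dialogs, each a list of (retrieved_id, gold_id)
    pairs, one pair per agent utterance."""
    successful = sum(all(r == g for r, g in dialog) for dialog in dialogs)
    return successful / len(dialogs)
```

Unlike R@1, which credits each utterance independently, SR gives credit only when a whole dialog is retrieved correctly, which is why it is much stricter.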
We also perform a human evaluation of the responses generated by FLONET and 3 other variants, with judges rating each response on a Likert scale (Likert, 1932).

Results
We report the performance of our baseline FLONET on both the S-Flo and U-Flo splits of FLODIAL. We also report numbers for two simple variants of FLONET: TF-IDF + GPT2, which replaces the trained retriever with a TF-IDF based retriever, and FLONET (No PT), which is trained without the pre-training step. Table 4 reports the performance of the respective retrievers. TF-IDF + GPT2 has reasonable response prediction performance in the S-Flo setting, but poor U-Flo performance. The poor generalization is due to the TF-IDF retriever's low R@1. This forces the generator to memorize the knowledge necessary to generate a response, rather than inferring it from the retrieved documents.
FLONET achieves a marginal improvement over the No PT variant on S-Flo, and a two point jump in BLEU in the U-Flo setting. This shows that the heuristic pre-training contributes to the overall system performance of FLONET. The success rates of the various systems are reported in Table 4. The success rate achieved by the FLONET retriever is quite low in both settings; we hope further research on the dataset improves it.
The oracle ret. + GPT2 row in Table 3 is approximated by assuming a perfect retriever and training GPT2 with the ground truth document. The gap in BLEU represents the value of annotation for our task, and the performance gain a better retriever may help achieve.
We also compare the performance of FLONET on the two data splits. We find that while the numbers are understandably worse in the U-Flo setting, the zero-shot transferred FLONET is still better than TF-IDF + GPT2's S-Flo performance. This suggests that the model has acquired some general skill of following a flowchart, even though there is significant scope for further improvement.

Human Evaluation: We randomly sample 75 context-response pairs each from the S-Flo and U-Flo test sets and collect two sets of judgements for each pair. As we evaluate 4 systems, we collect a total of 1,200 labels from the judges. We report the human evaluation results in Table 5. We find that FLONET's relevance scores are better than the baselines for both S-Flo and U-Flo.
Knowledge Sources: To understand the contribution of each knowledge source towards response generation, we trained 3 variants of FLONET: (i) using only the dialog history (DH), (ii) using the dialog history and the flowchart (DH + FC), and (iii) using the dialog history, flowchart and FAQs (DH + FC + FAQ). The performance is summarized in Table 6. The S-Flo trend shows that both knowledge sources contribute to the overall performance. The U-Flo numbers show that, unsurprisingly, the knowledge sources are essential for generalization to new settings, with a more than 13-point increase in BLEU.

Analysis & Research Challenges
We now investigate FLONET errors, with the goal of identifying new research challenges posed by FLODIAL. We first manually inspect the output of the generator, given the retrieved document. We find that, by and large, the generator has learned its tasks well: deciding whether and how to use the retrieved document in generating a response. We attribute FLONET's errors primarily to the retriever. This is also apparent from Recall@1 in Table 4, which shows that FLONET makes retrieval errors for 18.6% and 33.9% of test examples in S-Flo and U-Flo, respectively.

To further diagnose retriever errors, we split them into two categories based on whether the correct retrieval is a flowchart node or a FAQ (digression). For the former case, Table 7 reports the nature of the error; there, retrieved sibling means the retrieved node and the correct node are siblings in the flowchart. We notice that for a large fraction of errors, the retriever returns a sibling node. This suggests that FLONET could not adequately ground the user response to the given agent question. More surprising are the 30-45% of errors that are not even in the immediate neighborhood of the true node. A much larger value for U-Flo here also suggests poor retriever generalization. Since retriever performance in this task is closely tied to the ability to follow a flowchart path, this leads to the following research question: how can a model incorporate flowchart structure for better retriever performance?

Table 8 analyzes retrieval errors on digressions. We find that the retriever gets a decent Recall@1 for user digressions but has a rather low performance for agent digressions. Moreover, the BLEU scores suggest that the generator has memorized some common digressions in S-Flo, but naturally these do not generalize to U-Flo. This yields a fairly challenging research question: how do we improve retriever performance on agent digressions?
Finally, the challenge of zero-shot generalization to unseen flowcharts gets to the core ability of following a conversation flow, leading to the key research question: how do we improve performance in the unseen flowchart setting in FLODIAL?

Conclusion
We define the novel problem of end-to-end learning of flowchart grounded task oriented dialog (TOD) for a troubleshooting scenario. We collect a new flowchart grounded TOD dataset (FLODIAL), which contains 2,738 dialogs grounded on 12 different flowcharts and 138 FAQs. We propose the first baseline solution (FLONET) for our novel task using retrieval-augmented generation. We outline novel technical challenges for TOD research identified in our work. We release FLODIAL and all our resources for use by the research community.
Acknowledgements

This work was supported by grants from Bloomberg and 1MG, a Visvesvaraya faculty award by the Govt. of India, and the Jai Gupta chair fellowship by IIT Delhi. We thank Morris Rosenthal for providing us with permission to use the flowcharts from www.ifitjams.com. We also thank the IIT Delhi HPC facility for computational resources.

Ethics Impact Statement
Crowd Worker Compensation: The crowd workers were compensated with approximately 2.5 USD for creating paraphrases for a dialog with 15 utterances. On average, the crowd workers spent a little less than a minute on each paraphrase. Potentially, a worker can paraphrase 4 dialogs in an hour and be compensated with 10 USD.

Intellectual Property: The flowcharts used in our data collection process are based on the flowcharts from www.ifitjams.com. We used these flowcharts after receiving written permission from their creator, Morris Rosenthal. We include an attribution to Morris Rosenthal in the acknowledgements section.

Privacy: We now briefly describe each task used for data collection and show how the task design ensures the collected data will not contain any sensitive personal information (SPI). We would like to emphasise that the authors of the paper meticulously went over each data point collected and removed the ones that did not comply with the rules in the task description.
Problem Description Task: Each participant was provided with an artificial scenario which includes a car/laptop model, a model year, and a car/laptop related problem that they are facing. They were requested to paraphrase this information into a natural language utterance. Since the scenario was provided by us, there was almost no room for providing SPI. Paraphrases deviating from the provided details were rejected.
Paraphrasing Task: The participants were requested to create paraphrases of a given sentence. This task has no room for providing SPI.
Closing Task: The participants were asked to close a conversation between a human agent and a car/laptop user. In this task, the user and the human agent refer to each other using a second-person pronoun (e.g., I hope this was helpful to you, I am happy that this solved your problem). This task also does not involve providing any SPI.

A.1 Training Details
FLONET is trained in two phases: pre-train and fine-tune. In the pre-train phase, we use recall@1 and BLEU as the early stopping criteria for the retriever and generator respectively. The recall is computed using the weakly supervised labels. The hyperparameters (embedding size, lr_R, lr_G, dropout) that achieved the best validation numbers in the pre-train stage were (100, 1E-4, 6.25E-5, 0.01) and (200, 1E-4, 6.25E-5, 0) for S-Flo and U-Flo respectively. The hyperparameters that achieved the best BLEU in the fine-tune phase were (100, 1E-4, 2.5E-6, 0) and (200, 1E-5, 2.5E-7, 0) for S-Flo and U-Flo respectively. The BLEU scores on the held-out validation set were 16.62 and 9.72 for S-Flo and U-Flo respectively. We use the AdamW optimizer (Kingma and Ba, 2014) for training, and beam search for decoding with a beam width of 5 and a maximum decode length of 60 tokens. All experiments were run on a single Nvidia V100 GPU with 32GB of memory. The S-Flo retriever, U-Flo retriever and generator have 3M, 23M and 117M trainable parameters respectively. Thus, FLONET has a total of 120M trainable parameters for S-Flo and 140M for U-Flo. FLONET has an average runtime of approximately 7 hours (80 mins per epoch) for S-Flo and 8 hours (82 mins per epoch) for U-Flo.

A.2 GPT2 Input
FLONET uses GPT2 as the generator. Following Wolf et al. (2019), our input is constructed as shown in Figure 3. During inference, GPT2 takes the retrieved document concatenated with the sequence of utterances from the dialog history as input and predicts the agent response.
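A rough sketch of this input construction is given below. The special tokens `<doc>`, `<user>` and `<agent>` are illustrative placeholders, not necessarily the exact delimiters used in our implementation.

```python
def build_gpt2_input(document, history):
    """Concatenate the retrieved document with the dialog history.
    history is a list of (speaker, utterance) pairs, oldest first."""
    parts = ["<doc> " + document]
    for speaker, utterance in history:
        tag = "<user>" if speaker == "user" else "<agent>"
        parts.append(tag + " " + utterance)
    parts.append("<agent>")  # cue the model to generate the agent's reply
    return " ".join(parts)
```

The resulting string is tokenized and fed to GPT2, which decodes the next agent response as a continuation.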

A.3 GPT2 Inference
We experimented with various inference settings and used the one that performed best on the validation set. We tried two decoding techniques: nucleus sampling (Holtzman et al., 2020) with top-p of 0.9, and beam search with a beam width of 5. We also experimented with the number of top-k retrieved documents used. With Top-1 decoding, we use only the top retrieved document to generate response candidates. With Top-5, we take the top 5 retrieved documents and generate candidate responses from each document. Each candidate response is scored as the product of the probability of generating the candidate given the retrieved document, $\prod_{t=1}^{T} p_\theta(y_t|h, z, y_{1:t-1})$, and the probability of the retrieved document, $p_\eta(z|h)$. Lastly, we experimented with response length normalization to avoid favouring shorter sequences: the probability of each candidate is given by $\big(\prod_{t=1}^{T} p_\theta(y_t|h, z, y_{1:t-1})\big)^{1/T}$, where $T$ is the length of the candidate. The validation and test BLEU scores of the various settings on the S-Flo split are shown in Table 9. We see that beam search on the top-1 document with length normalization resulted in the best validation BLEU.

A.4 Sample Generated Responses
Tables 11 and 10 show responses generated by various systems on examples from the U-Flo and S-Flo test sets respectively. In Table 11, we see that FLONET and FLONET (No PT) generate responses similar to the gold response, as they were able to generalize to unseen flowcharts.
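The candidate scoring and length normalization used during inference (A.3 above) can be sketched in log space as follows; the function and argument names are ours, for illustration only.

```python
import math

def score_candidate(token_logprobs, doc_logprob, length_normalize=True):
    """Score one candidate response generated from one retrieved document.

    token_logprobs: per-token log p(y_t | h, z, y_1:t-1) for the candidate.
    doc_logprob:    log p(z | h) for the retrieved document.
    """
    gen = sum(token_logprobs)       # log of the product over t = 1..T
    if length_normalize:
        gen /= len(token_logprobs)  # (prod_t p)^(1/T) in log space
    return gen + doc_logprob        # product with p(z | h) in log space
```

In the Top-5 setting, every candidate from every retrieved document would be scored this way and the highest-scoring candidate returned.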

A.5 Example Dialogs from FLODIAL
Three randomly selected dialogs from FLODIAL are shown in Table 12. The first dialog is grounded on the wireless network troubleshooting flowchart, the second on the steering problems flowchart, and the last on the car won't start flowchart.
A.6 AMT Tasks
Figures 4, 5, 6 and 7 show the instructions and examples provided to crowd-workers for the problem description, non-contextual paraphrasing, contextual paraphrasing and closing tasks respectively.

A.7 FAQ Collection
For each flowchart, we collected a set of associated FAQs in two steps: question collection and answer collection.
In the question collection step, we displayed each flowchart node question to novice (laptop or car) users and asked them whether they understood it. If not, they were requested to come up with a follow-up question that would help them better understand the node question. We aggregated all the follow-up questions to create the set of FAQs associated with that flowchart. In the answer collection step, an in-house (laptop or car) expert provided answers for the questions collected in the first step.
User: I've had my Infiniti QX60 for 6 years now. It's recently been making awful noises. I got stopped by the police on my way to work because the noise was so loud. Please help me fix this.
Agent: The sound that is coming from your car, is it a clunking sound?
User: I do hear a clunking sound coming from the car.
Gold Response: Is it only when you drive over a bump or a dip that the car makes a noise?
TF-IDF + GPT2: Are the ticking noises heard only when the car is moving?
FLONET (No PT): Does the car make a noise only when you drive over bumps or dips on the road?
FLONET: Does the car make the noise only when you drive it over a bump or a dip on the road?

Agent: When the charger isn't plugged in, can the laptop power up?
User: I can see that it can start up in that case.
Agent: Do you find that the battery in your laptop quickly drains of power?
User: The battery drains too fast.
Agent: There will be a manufacture date on the laptop's battery. Please check if it is older than a year and let me know.
User: How do I check that date on my battery?
Gold Response: In order to locate the manufacturing date, you need to remove the battery and find a sticker on it with a multi-field part number. The fourth field in the number identifies the date.
TF-IDF + GPT2: You can use a multimeter to measure the voltage of the AC power adapter. The voltage should be the same as the voltage printed on the adapter. I am glad I could help. Have
FLONET (No PT): Remove the battery and you will find a sticker on it with a multi-field part number.
FLONET: Please remove the battery and you will find a sticker on it with a multi-field part number. The fourth field in the part number corresponds to the manufacturing date.

Task 1 - Problem Description
Figure 4: Instructions provided to AMT workers for the problem description task.
Task 2 - Non-Contextual Paraphrasing
Figure 5: Instructions provided to AMT workers for the non-contextual paraphrasing task.
Task 3 - Contextual Paraphrasing
Figure 6: Instructions provided to AMT workers for the contextual paraphrasing task.
Task 4 - Closing
Figure 7: Instructions provided to AMT workers for the closing task.