Smoothing Dialogue States for Open Conversational Machine Reading

Conversational machine reading (CMR) requires machines to communicate with humans through multi-turn interactions between two salient dialogue states: decision making and question generation. In open CMR settings, the more realistic scenario, the retrieved background knowledge is noisy, which poses severe challenges for information transmission. Existing studies commonly train independent or pipeline systems for the two subtasks. However, these methods rely on hard-label decisions to activate question generation, which eventually hinders model performance. In this work, we propose an effective gating strategy that smooths the two dialogue states in a single decoder and bridges decision making and question generation to provide a richer dialogue state reference. Experiments on the OR-ShARC dataset show the effectiveness of our method, which achieves new state-of-the-art results.


Introduction
The ultimate goal of multi-turn dialogue is to enable the machine to interact with human beings and solve practical problems (Zaib et al., 2020; Huang et al., 2020; Fan et al., 2020; Gu et al., 2021). It usually adopts the form of question answering (QA) according to the user's query along with the dialogue context (Sun et al., 2019; Reddy et al., 2019; Choi et al., 2018). The machine may also actively ask questions for confirmation (Wu et al., 2018; Cai et al., 2019; Zhang et al., 2020b; Gu et al., 2020).
In the classic spoken language understanding tasks (Tur and De Mori, 2011; Zhang et al., 2020a; Ren et al., 2018; Qin et al., 2021), specific slots and intentions are usually defined. Following these predefined patterns, the machine interacts with people according to the dialogue states and completes specific tasks, such as ordering meals (Liu et al., 2013) and air tickets (Price, 1990). In real-world scenarios, annotating data such as intents and slots is expensive. Inspired by studies of reading comprehension (Rajpurkar et al., 2016, 2018; Zhang et al., 2020c), a more general task has emerged, conversational machine reading (CMR) (Saeidi et al., 2018): given the inquiry, the machine is required to retrieve relevant supporting rule documents, judge whether the goal is satisfied according to the dialogue context, and make decisions or ask clarification questions.
A variety of methods have been proposed for the CMR task, including 1) sequential models that encode all the elements and model the matching relationships with attention mechanisms (Zhong and Zettlemoyer, 2019; Lawrence et al., 2019; Verma et al., 2020; Gao et al., 2020a,b); 2) graph-based methods that capture the discourse structures of the rule texts and user scenario for better interactions (Ouyang et al., 2021). However, two challenges have been neglected: 1) Open-retrieval of supporting evidence. The existing methods above assume that the relevant rule documents are given before the system interacts with users, which is a closed-book style. In real-world applications, the machines are often required to retrieve supporting information to respond to incoming high-level queries in an interactive manner, which results in an open-retrieval setting. The comparison of the closed-book setting and open-retrieval setting is shown in Figure 1.
2) The gap between decision making and question generation. Existing CMR studies generally regard CMR as two separate tasks and design independent systems. Only the result of decision making is fed to the question generation module. As a result, the question generation module knows nothing about the actual conversation states, which leads to poorly generated questions. There are even cases where the decision making result improves but the question generation quality decreases, as reported in previous studies (Ouyang et al., 2021).
[Figure 1: Previous studies generally regard CMR as two separate tasks and design independent systems. Technically, only the result of decision making will be fed to the question generation module, thus there is a gap between the dialogue states of decision making and question generation. To reduce the information gap, our model bridges the information transition between the two salient dialogue states and benefits from a richer rule reference through open-retrieval (a).]
In this work, we design an end-to-end system by Open-retrieval of Supporting evidence and bridging deCision mAking and question geneRation (OSCAR), 1 to bridge the information transition between the two salient dialogue states of decision making and question generation, while benefiting from a richer rule reference through open retrieval. In summary, our contributions are threefold:
1) For the task, we investigate the open-retrieval setting for CMR. We bridge decision making and question generation for the challenging CMR task, which is the first such attempt to the best of our knowledge.
2) For the technique, we design an end-to-end framework where the dialogue states for decision making are employed for question generation, in contrast to the independent models or pipeline systems in previous studies. Besides, a variety of strategies are empirically studied for smoothing the two dialogue states in only one decoder.
3) Experiments on the ShARC dataset show the effectiveness of our model, which achieves new state-of-the-art results. A series of analyses show the contributing factors.
1 Our source code is available at https://github.com/ozyyshr/OSCAR.

Related Work
Most of the current conversation-based reading comprehension tasks are formed as either span-based QA (Reddy et al., 2019; Choi et al., 2018) or multi-choice tasks (Sun et al., 2019; Cui et al., 2020), both of which neglect the vital process of question generation for confirmation during the human-machine interaction. In this work, we are interested in building a machine that can not only make the right decisions but also raise questions when necessary. The related task is called conversational machine reading (Saeidi et al., 2018), which consists of two separate subtasks: decision making and question generation. Compared with conversation-based reading comprehension tasks, the CMR task is more challenging as it involves rule documents, scenarios, asking clarification questions, and making a final decision.
Existing works (Zhong and Zettlemoyer, 2019; Lawrence et al., 2019; Verma et al., 2020; Gao et al., 2020a,b; Ouyang et al., 2021) have made progress in modeling the matching relationships between the rule document and other elements such as user scenarios and questions. These studies are based on the hypothesis that the supporting information for answering the question is provided, which does not hold in real-world applications. Therefore, we are motivated to investigate the open-retrieval settings (Qu et al., 2020), where the retrieved background knowledge would be noisy. Gao et al. (2021) make an initial attempt at open retrieval for CMR. However, as in previous studies, the common solution is to train independent or pipeline systems for the two subtasks without considering the information flow between decision making and question generation, which eventually hinders model performance. Compared to existing methods, our method makes the first attempt to bridge the gap between decision making and question generation by smoothing the two dialogue states in only one decoder. In addition, we improve the retrieval process by taking advantage of both the traditional TF-IDF method and the latest dense passage retrieval model (Karpukhin et al., 2020).
[Figure 2: The overall structure of our model OSCAR. The left part introduces the retrieval and tagging process for rule documents, which is then fed into the encoder together with other necessary information. Panels: Rule Retrieval, Decision Making, Question Generation (BART-base, follow-up question). Example input: Ques: "Am I entitled to the National Minimum Wage?" Scen: "I am not following a European Union programme." Retrieved rule: "The following are not entitled to the National Minimum Wage: higher students on a work placement up to 1 year; workers on government pre-apprenticeships schemes; people on the European Union programmes; people working on a Jobcentre Plus Work trial for 6 weeks."]

Open-retrieval Setting for CMR
In the CMR task, each example is formed as a tuple $\{R, U_s, U_q, C\}$, where $R$ denotes the rule texts, $U_s$ and $U_q$ are user scenarios and user questions, respectively, and $C$ represents the dialogue history. For open-retrieval CMR, $R$ is a subset retrieved from a large candidate corpus $D$. The goal is to train a discriminator $F(\cdot, \cdot)$ for decision making, and a generator $G(\cdot, \cdot)$ on $\{R, U_s, U_q, C\}$ for question generation.

Model
Our model is composed of three main modules: retriever, encoder, and decoder. The retriever is employed to retrieve the related rule texts for the given user scenario and question. The encoder takes the tuple $\{R, U_s, U_q, C\}$ as the input, encodes the elements into vectors, and captures the contextualized representations. The decoder makes a decision, or generates a question once the decision is "Inquire". Figure 2 overviews the model architecture; we elaborate the details in the following parts.

Retrieval
To obtain the supporting rules, we construct the query by concatenating the user question and user scenario. The retriever calculates the semantic matching score between the query and the candidate rule texts from the pre-defined corpus and returns the top-k candidates. In this work, we employ TF-IDF and DPR (Karpukhin et al., 2020) in our retrieval, which are representatives for sparse and dense retrieval methods. TF-IDF stands for term frequency-inverse document frequency, which is used to reflect how relevant a term is in a given document. DPR is a dense passage retrieval model that calculates the semantic matching using dense vectors, and it uses embedding functions that can be trained for specific tasks.
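The sparse half of this retrieval step can be illustrated with a minimal sketch. It assumes whitespace tokenization, a smoothed IDF, and cosine similarity as the matching score; the function names (`build_tfidf`, `retrieve`) and the toy corpus are hypothetical and are not taken from the OSCAR code base.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
    return text.lower().split()


def build_tfidf(corpus):
    """Return IDF weights and one sparse TF-IDF vector (a dict) per rule text."""
    docs = [Counter(tokenize(d)) for d in corpus]
    n = len(docs)
    doc_freq = Counter(term for d in docs for term in d)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vecs = [{t: tf * idf[t] for t, tf in d.items()} for d in docs]
    return idf, vecs


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, scenario, corpus, k=1):
    """Concatenate question and scenario into one query; return top-k rule indices."""
    idf, vecs = build_tfidf(corpus)
    query = Counter(tokenize(question + " " + scenario))
    qvec = {t: tf * idf[t] for t, tf in query.items() if t in idf}
    scored = sorted(((cosine(qvec, v), i) for i, v in enumerate(vecs)), reverse=True)
    return [i for _, i in scored[:k]]


corpus = [
    "people on the european union programmes are not entitled to the national minimum wage",
    "loan funds may be used for machinery and equipment",
    "you can get statutory maternity leave if your baby is born early",
]
top = retrieve("am i entitled to the national minimum wage",
               "i am not following a european union programme", corpus, k=1)
```

A dense retriever such as DPR would replace the sparse vectors with learned embeddings while keeping the same query construction and top-k interface.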

Graph Encoder
One of the major challenges of CMR is interpreting rule texts, which have complex logical structures between various inner rule conditions. According to Rhetorical Structure Theory (RST) of discourse parsing (Mann and Thompson, 1988), we utilize a pre-trained discourse parser (Shi and Huang, 2019) 2 to break the rule text into clause-like units called elementary discourse units (EDUs), which extracts the in-line rule conditions from the rule texts.
Embedding We employ a pre-trained language model (PrLM) as the backbone of the encoder. As shown in the figure, the input of our model includes the rule document, which has already been parsed into EDUs with explicit discourse relation tagging, the user's initial question, the user scenario, and the dialog history. Instead of inserting a [CLS] token before each rule condition to get a sentence-level representation, we use [RULE], which has been shown to enhance performance. The sequence is organized with a [RULE] token preceding each EDU, followed by the user question, the user scenario (preceded by [CLS]), and the dialog history. Then we feed the sequence to the PrLM to obtain the contextualized representation.
Interaction To explicitly model the discourse structure among the rule conditions, we first annotate the discourse relationships between the rule conditions and employ a relational graph convolutional network following Ouyang et al. (2021) by regarding the rule conditions as the vertices. The graph is formed as a Levi graph (Levi, 1942) that regards the relation edges as additional vertices. For each pair of vertices, there are six types of possible edges derived from the discourse parsing, namely, default-in, default-out, reverse-in, reverse-out, self, and global. Furthermore, to build the relationship with the background user scenario, we add an extra global vertex of the user scenario that connects all the other vertices. As a result, there are three types of vertices, including the rule conditions, discourse relations, and the global scenario vertex. For rule condition and user scenario vertices, we fetch the contextualized representation of the special tokens [RULE] and [CLS] before the corresponding sequences, respectively. Relation vertices are initialized with a conventional embedding layer, whose representations are obtained through a lookup table.
2 This discourse parser gives a state-of-the-art performance on STAC so far. There are 16 discourse relations according to STAC (Asher et al., 2016), including comment, clarification-question, elaboration, acknowledgment, continuation, explanation, conditional, question-answer, alternation, question-elaboration, result, background, narration, correction, parallel, and contrast.
For each rule document that is composed of multiple rule conditions, i.e., EDUs, let $h_p$ denote the initial representation of every node $v_p$. The graph-based information flow process (Eq. 1) updates each node from its neighbors, where $N_r(v_p)$ denotes the neighbors of node $v_p$ under relation $r$ and $c_{p,r}$ is the number of those nodes. $w_r^{(l)}$ denotes the trainable parameters of layer $l$. The last-layer output of the discourse graph (Eq. 2) additionally involves $W_{r,g}^{(l)}$, a learnable parameter under relation type $r$ of the $l$-th layer. The last-layer hidden states for all the vertices, $r_p^{(l+1)}$, are used as the graph representation for the rule document. For all the $k$ rule documents from the retriever, we concatenate $r_p^{(l+1)}$ for each rule document, and finally have $r = \{r_1, r_2, \ldots, r_m\}$, where $m$ is the total number of vertices among those rule documents.
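For orientation, the symbols defined above match the standard relational graph convolution update, so a plausible form of the two layers can be written down. This is only a sketch under assumptions: the activation $\sigma$, the exact placement of the gate $g_q^{(l)}$, and the absence of a separate self-loop term are guesses consistent with the definitions of $N_r(v_p)$, $c_{p,r}$, $w_r^{(l)}$, and $W_{r,g}^{(l)}$, not a statement of the original equations.

```latex
% Assumed form of the neighbor aggregation (standard R-GCN update):
h_p^{(l+1)} = \sigma\Big( \sum_{r} \sum_{v_q \in N_r(v_p)}
    \frac{1}{c_{p,r}} \, w_r^{(l)} h_q^{(l)} \Big)

% Assumed gated variant for the last layer, with a relation-specific gate:
g_q^{(l)} = \operatorname{sigmoid}\big( h_q^{(l)} W_{r,g}^{(l)} \big), \qquad
r_p^{(l+1)} = \sigma\Big( \sum_{r} \sum_{v_q \in N_r(v_p)}
    g_q^{(l)} \, \frac{1}{c_{p,r}} \, w_r^{(l)} h_q^{(l)} \Big)
```

Under this reading, the normalizer $1/c_{p,r}$ keeps the aggregated message on the same scale regardless of how many neighbors a vertex has under a given relation.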

Double-channel Decoder
Before decoding, we first accumulate all the available information through a self-attention layer (Vaswani et al., 2017b) by allowing all the rule conditions and other elements to attend to each other. Let $[r_1, r_2, \ldots, r_m; u_q; u_s; h_1, h_2, \ldots, h_n]$ denote all the representations, where $r_i$ is the representation of the discourse graph, $u_q$, $u_s$, and $h_i$ stand for the representations of the user question, user scenario, and dialog history, respectively, and $n$ is the number of history QAs. After encoding, the output is represented as:
$$H_c = [\tilde{r}_1, \tilde{r}_2, \ldots, \tilde{r}_m; \tilde{u}_q, \tilde{u}_s; \tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_n], \tag{3}$$
which is then used for the decoder.
Decision Making Similar to existing works (Zhong and Zettlemoyer, 2019; Gao et al., 2020a,b), we apply an entailment-driven approach for decision making. A linear transformation tracks the fulfillment state of each rule condition among Entailment, Contradiction, and Unmentioned, producing $f_i$, the score predicted for the three labels of the $i$-th condition. This prediction is trained via a cross-entropy loss for multi-class classification, where $r$ is the ground-truth state of fulfillment.
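The entailment step described above can be sketched as a linear layer over each condition representation followed by a softmax cross-entropy. The snippet is a minimal pure-Python illustration; the parameter names `weight` and `bias`, the toy dimensions, and the averaging over conditions are assumptions rather than details confirmed by the paper.

```python
import math

LABELS = ("Entailment", "Contradiction", "Unmentioned")


def linear(x, weight, bias):
    # One output score per label: f = W x + b.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entailment_loss(cond_reprs, gold, weight, bias):
    """Mean cross-entropy over rule conditions; gold holds label indices."""
    loss = 0.0
    for x, r in zip(cond_reprs, gold):
        f = linear(x, weight, bias)          # scores for the three labels
        loss -= math.log(softmax(f)[r])      # negative log-likelihood of the gold state
    return loss / len(cond_reprs)


# Toy check: with all-zero parameters every label gets probability 1/3.
zero_w = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0, 0.0]
uniform_loss = entailment_loss([[1.0, 2.0]], [0], zero_w, zero_b)
```

With untrained (zero) parameters the loss equals $\log 3$, the entropy of a uniform guess over the three fulfillment states, which is a convenient sanity check before training.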
After obtaining the state of every rule condition, we give a final decision among Yes, No, Inquire, and Irrelevant by attention over the conditions, where $\alpha_i$ is the attention weight for the $i$-th decision and $z$ holds the scores for the four possible states. The decision is trained with a corresponding classification loss, and the overall loss for decision making combines it with the entailment loss, balanced by the parameter $\lambda$.
Question Generation If the decision is Inquire, the machine needs to ask a follow-up question for further clarification. Question generation in this part is mainly based on the uncovered information in the rule document, which is then rephrased into a question. We predict the position of an under-specified span within a rule document in a supervised way. Following Devlin et al. (2019), our model learns a start vector $w_s \in \mathbb{R}^d$ and an end vector $w_e \in \mathbb{R}^d$ to indicate the start and end positions of the desired span by scoring each token $t_{k,i}$, the $i$-th token in the $k$-th rule sentence. The ground-truth span labels are generated by calculating the edit distance between the rule span and the follow-up questions. Intuitively, the shortest rule span with the minimum edit distance is selected as the under-specified span.
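The span-labeling heuristic above (minimum edit distance, ties broken by the shortest span) can be sketched as an exhaustive search over contiguous token spans. Token-level Levenshtein distance and whitespace tokenization are assumptions made for this illustration; the original preprocessing may differ.

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def best_span(rule_tokens, question_tokens):
    """Return (start, end) of the shortest span with minimum edit distance."""
    best = None
    for start in range(len(rule_tokens)):
        for end in range(start + 1, len(rule_tokens) + 1):
            span = rule_tokens[start:end]
            # Compare by distance first, then by span length.
            key = (edit_distance(span, question_tokens), end - start)
            if best is None or key < best[0]:
                best = (key, start, end)
    return best[1], best[2]


rule = "loan funds may be used for machinery and equipment".split()
question = "will it be used for machinery and equipment".split()
start, end = best_span(rule, question)
selected = " ".join(rule[start:end])
```

The search is quadratic in the number of rule tokens, which is acceptable for short rule texts; the selected span boundaries then serve as supervision for the start and end vectors.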
Existing studies deal with decision making and question generation independently (Zhong and Zettlemoyer, 2019; Lawrence et al., 2019; Verma et al., 2020; Gao et al., 2020a,b), and use hard-label decisions to activate question generation. These methods inevitably suffer from error propagation if the model makes the wrong decision. For example, if the predicted decision is not "Inquire", the question generation module will not be activated, even in cases where a question should have been asked. Open-retrieval CMR, which involves multiple rule texts, brings even more diverse rule conditions as a reference, which benefits the generation of meaningful questions.
Therefore, we concatenate the rule document and the predicted span to form an input sequence $x$. We feed $x$ to the BART encoder and obtain the encoded representation $H_e$. To take advantage of the contextual states of the overall interaction of the dialogue states, we explore two alternative smoothing strategies that produce a fused representation $H$. $H$ is then passed to the BART decoder to generate the follow-up question. At the $i$-th time step, $H$ is used to generate the target token $y_i$, where $\theta$ denotes all the trainable parameters and $W_d$ and $W_w$ are projection matrices. The question generation objective is computed over the target tokens, and the overall loss function for end-to-end training combines the decision making and question generation objectives.
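To make the idea of smoothing concrete, the sketch below contrasts plain concatenation with an element-wise sigmoid gate, one common way to realize a gating mechanism. It assumes the two states have already been aligned to the same shape (for example by an attention step); the per-dimension parameters `w_e`, `w_c`, and `bias` are hypothetical and do not reproduce the exact formulation used in OSCAR.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def concat_fusion(seq_e, seq_c):
    """Concatenate the two state sequences along the sequence axis."""
    return seq_e + seq_c


def gated_fusion(h_e, h_c, w_e, w_c, bias):
    """Element-wise gate: g = sigmoid(w_e*h_e + w_c*h_c + b), H = g*h_e + (1-g)*h_c."""
    fused = []
    for e, c, we, wc, b in zip(h_e, h_c, w_e, w_c, bias):
        g = sigmoid(we * e + wc * c + b)   # how much of the generator state to keep
        fused.append(g * e + (1.0 - g) * c)
    return fused


# With zero parameters the gate is 0.5, so the result is the plain average.
balanced = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
# A large positive bias opens the gate and passes the generator state through.
open_gate = gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0], [50.0, 50.0])
```

The gate lets the model decide, per dimension, how much decision-making context to mix into the generator state, whereas concatenation leaves that filtering entirely to the decoder's attention.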

Datasets
For the evaluation of the open-retrieval setting, we adopt the OR-ShARC dataset (Gao et al., 2021), which is a revision of the current CMR benchmark, ShARC (Saeidi et al., 2018). The original dataset contains up to 948 dialog trees crawled from government websites. Those dialog trees are then flattened into 32,436 examples consisting of utterance_id, tree_id, rule document, initial question, user scenario, dialog history, evidence, and the decision. The update in OR-ShARC is the removal of the gold rule text for each sample. Instead, all rule texts used in the ShARC dataset serve as the supporting knowledge sources for retrieval. There are 651 rules in total. Since the test set of ShARC is not public, the train, dev, and test sets are further manually split, with sizes of 17,936, 1,105, and 2,373, respectively. For the dev and test sets, around 50% of the samples ask questions on rule texts used in training (seen), while the rest contain questions on unseen (new) rule texts. The rationale behind the seen and unseen splits for the validation and test sets is that the two cases mimic real usage scenarios: users may ask questions about rule text that 1) exists in the training data (i.e., dialog history, scenario) or 2) is completely newly added.

Evaluation Metrics
For the decision-making subtask, ShARC evaluates the Micro- and Macro-Accuracy of the classification results. For question generation, the main metric is F1_BLEU proposed in Gao et al. (2021), which calculates the BLEU scores for question generation when the predicted decision is "Inquire".
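As a small illustration of the two decision-making metrics, the sketch below computes micro accuracy as the overall fraction of correct decisions and macro accuracy as the unweighted mean of per-class accuracies. This follows the usual definitions; the function name and the toy labels are made up for the example.

```python
from collections import defaultdict


def micro_macro_accuracy(gold, pred):
    """Return (micro, macro) accuracy over decision labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        correct[g] += int(g == p)
    micro = sum(correct.values()) / len(gold)                        # pooled over all examples
    macro = sum(correct[c] / total[c] for c in total) / len(total)   # mean over gold classes
    return micro, macro


gold = ["Yes", "Yes", "Yes", "Yes", "No", "Inquire"]
pred = ["Yes", "Yes", "Yes", "Yes", "Yes", "Inquire"]
micro, macro = micro_macro_accuracy(gold, pred)
```

In this toy case micro accuracy is 5/6 while macro accuracy is 2/3, showing how macro accuracy penalizes a model that ignores a rare class.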

Implementation Details
Following the current state-of-the-art MUDERN model (Gao et al., 2021) for open CMR, we employ BART as our backbone model, and the BART model serves as our baseline in the following sections. For open retrieval with DPR, we fine-tune DPR on our task following the same training process as the official implementation, with the same data format stated in the DPR GitHub repository. Since the data processing requires hard negatives (hard_negative_ctxs), we construct them using the most relevant rule documents (excluding the gold one) selected by TF-IDF and leave negative_ctxs empty. For discourse parsing, we keep all the default parameters of the original discourse relation parser, whose F1 score reaches 55. The dimension of hidden states is 768 for both the encoder and decoder. Training uses Adam (Kingma and Ba, 2015) for 5 epochs with a learning rate of 5e-5. We also use gradient clipping with a maximum gradient norm of 2 and a total batch size of 16. The parameter λ in the decision making objective is set to 3.0. For the BART-based decoder for question generation, the beam size is set to 10 for inference. We report results averaged over five runs with different random seeds, together with their deviations.
Table 1 shows the results of OSCAR and all the baseline models for the End-to-End task on the dev and test sets with respect to the evaluation metrics mentioned above. The evaluation results indicate that OSCAR outperforms the baselines on all of the metrics. In particular, it outperforms the public state-of-the-art model MUDERN by 1.3% in Micro Acc. and 1.1% in Macro Acc. for the decision making stage on the test set. The question generation quality is also substantially improved by our approach. Specifically, F1_BLEU1 and F1_BLEU4 are increased by 2.0% and 1.5% on the test set, respectively.
Since the dev set and test set have a 50% split of user questions between seen and unseen rule documents, as described in the Datasets section, we analyze the performance of the proposed framework over seen and unseen rules by adding a comparison of question generation on the seen and unseen splits, as shown in Table 2. The results show consistent gains for both the seen and unseen splits.

Comparison of Open-Retrieval Methods
We compare two typical retrieval methods, TF-IDF and Dense Passage Retrieval (DPR), which are, respectively, a widely used traditional model based on sparse vector spaces and a recent dense-vector-based model for open-domain retrieval. We also present the results of TF-IDF+DPR (denoted DPR++) following Karpukhin et al. (2020), using a linear combination of their scores as the new ranking function.
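The linear combination behind DPR++ can be sketched as follows. The interpolation weight, the dictionary-based score tables, and the rule identifiers are illustrative assumptions; in practice the two score types live on different scales, so the weight would need tuning on the dev set.

```python
def hybrid_rank(sparse_scores, dense_scores, weight=1.0, k=5):
    """Rank rule ids by dense score + weight * sparse score; return the top k."""
    doc_ids = set(sparse_scores) | set(dense_scores)
    combined = {
        doc: dense_scores.get(doc, 0.0) + weight * sparse_scores.get(doc, 0.0)
        for doc in doc_ids
    }
    # Sort by descending combined score; break ties by id for determinism.
    return sorted(combined, key=lambda d: (-combined[d], d))[:k]


tfidf_scores = {"r1": 0.9, "r2": 0.1, "r3": 0.4}   # hypothetical sparse scores
dpr_scores = {"r1": 0.2, "r2": 0.8, "r3": 0.3}     # hypothetical dense scores

top2_hybrid = hybrid_rank(tfidf_scores, dpr_scores, weight=1.0, k=2)
top2_dense_only = hybrid_rank(tfidf_scores, dpr_scores, weight=0.0, k=2)
```

Setting the weight to zero recovers the dense-only ranking, which makes it easy to compare the hybrid against each individual retriever with the same code path.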
The overall results are presented in Table 3. We see that TF-IDF performs better than DPR, and combining TF-IDF and DPR (DPR++) yields substantial improvements. To investigate the reasons, we collect the detailed results on the seen and unseen subsets of the dev and test sets, from which we observe that TF-IDF generally works well on both the seen and unseen sets, while DPR degrades on the unseen set. The most plausible reason is that DPR is trained on the training set, so it gives better results only on the seen subsets, which share the same rule texts for retrieval with the training set. DPR may therefore easily suffer from over-fitting, which results in relatively weak scores on the unseen sets. Given these complementary merits, combining the two methods takes advantage of both and achieves the best results.

Decision Making
Using TF-IDF + DPR retrieval, we compare our model with the previous SOTA model MUDERN (Gao et al., 2021) in the open-retrieval setting. According to the results in Table 1, we observe that our method achieves better performance than DISCERN, which indicates that the graph-like discourse modeling works well in the open-retrieval setting in general.

Question Generation
Overall Results We first compare the vanilla question generation with our method with encoder states. Table 7 shows the results, which verify that both the sequential states (SS) and graph states (GS) from the encoding process contribute to the overall performance, as removing either of them causes a performance drop on both F1_BLEU1 and F1_BLEU4. In particular, when removing GS/SS, those two metrics drop by a large margin, which shows their contributions. The results indicate that bridging the gap between decision making and question generation is necessary. 5

Smoothing Strategies
We explore the performance of different strategies for fusing the contextual states into the BART decoder. The results are shown in Table 8, from which we see that the gating mechanism yields the best performance. The most plausible reason is that the gates help filter the critical information.
Upper-bound Evaluation To further investigate how the encoder states help generation, we construct a "gold" dataset as the upper-bound evaluation, in which we replace the reference span with the ground-truth span, selected as the span of the rule text that has the minimum edit distance to the to-be-asked follow-up question, in contrast to the original span that is predicted by our model. Interestingly, the BLEU-1 and BLEU-4 scores drop from 90.64 → 89.23 and 89.61 → 85.81, respectively, after aggregating the DM states on the constructed dataset. Compared with the experiments on the original dataset, the performance gap shows that using embeddings from the decision making stage compensates for the information loss caused by the span prediction stage and helps mitigate error propagation.
5 Our method is also applicable to other generation architectures such as T5 (Raffel et al., 2020). For the reference of interested readers, we tried to employ T5 as our backbone, achieving better performance: 53.7/45.0 for dev and 52.5/43.7 for test (F1_BLEU1/F1_BLEU4).

Closed-book Evaluation
Besides the open-retrieval task, our end-to-end unified modeling method is also applicable to the traditional CMR task. We conduct comparisons on the original ShARC question generation task with provided rule documents to evaluate the performance. Results in Table 9 show the obvious advantage on the open-retrieval task, indicating the strong ability to extract key information from noisy documents.

Case Study
To explore the generation quality intuitively, we randomly collect and summarize error cases of the baseline and our models for comparison. Results of a few typical examples are presented in Figure 3. We evaluate the examples in terms of three aspects, namely, factualness, succinctness, and informativeness. The differences between the generations of OSCAR and the baseline are highlighted in green, while the blue words indicate the correct generations. One can easily observe that our generation outperforms the baseline model regarding factualness, succinctness, and informativeness. This might be because incorporating features from the decision making stage fills the gap in the information provided for question generation.

[Figure 3: Generation examples compared in terms of factualness, succinctness (does not contain redundant information), and informativeness (covers the most important content). Succinctness example: for the rule "In general, loan funds may be used for normal operating expenses, machinery and equipment, minor real estate repairs or improvements, and refinancing debt.", the generated questions are "Will it be used for machinery and equipment?" and "Will it be used for expenses, machinery and equipment?". Informativeness example: for the rule "The eligible items include: (1) medical, veterinary and scientific equipment (2) ambulances (3) goods for disabled people (4) motor vehicles for medical use.", the generated questions are "Is the item goods for disabled people?" and "Is it for disabled people?".]

Conclusion
In this paper, we study conversational machine reading based on open-retrieval of supporting rule documents, and present a novel end-to-end framework, OSCAR, to enhance question generation by referring to the rich contextualized dialogue states that involve the interactions between rule conditions, user scenario, initial question, and dialogue history. OSCAR consists of three main modules, retriever, encoder, and decoder, as a unified model. Experiments on OR-ShARC show its effectiveness by achieving a new state-of-the-art result. Case studies show that OSCAR can generate high-quality questions compared with the previous widely used pipeline systems.