Dialogue Graph Modeling for Conversational Machine Reading

Conversational Machine Reading (CMR) aims at answering questions through multi-turn interaction. The machine must answer a user's question through interaction with the user, based on a given rule document, user scenario, and dialogue history, and ask clarification questions when necessary. In this paper, we propose a dialogue graph modeling framework to improve the machine's understanding and reasoning ability on the CMR task. It involves three types of graph. Specifically, a Discourse Graph is designed to explicitly learn and extract the discourse relations among rule texts as well as the extra knowledge in the scenario; a Decoupling Graph captures the local and contextualized connections within the rule texts; and finally a Global Graph fuses all the information together to reply to the user, with the final decision being either "Yes/No/Irrelevant" or asking a follow-up question to clarify.


Introduction
Training machines to understand documents is the major goal of machine reading comprehension (MRC) (Hermann et al., 2015; Hill et al., 2015; Rajpurkar et al., 2016; Nguyen et al., 2016; Joshi et al., 2017; Rajpurkar et al., 2018; Choi et al., 2018; Zhang et al., 2018; Reddy et al., 2019; Zhang et al., 2021). In particular, in the recent and challenging conversational machine reading (CMR) task, the machine is required to read and interpret a given rule document and user scenario, ask clarification questions, and then make a final decision (Saeidi et al., 2018). An example is shown in Figure 1.

Figure 1: An example dialog from the ShARC dataset (Saeidi et al., 2018). At each turn, the machine can give a decision regarding the initial question posed by the user. If the decision is "Inquire", the machine asks a clarification question to help with decision making. The corresponding rule text and question are marked in the same color in the figure.

The user posts the scenario and asks a question concerning whether the loan meets his or her needs. Since the user does not know the rule text, the information he or she provides may not be sufficient for the machine to decide. Therefore, a series of follow-up questions are asked by the machine until it can finally draw a conclusion.
The major challenges of conversational machine reading include interpreting the rule document and reasoning with the background knowledge, i.e., the provided rule document, user scenario, and input question. Existing works (Zhong and Zettlemoyer, 2019; Lawrence et al., 2019; Verma et al., 2020; Gao et al., 2020a,b) have made progress in improving reasoning ability by modeling the interactions between the rule document and scenarios. They commonly divide the problem into two sub-tasks, i.e., rule-entailment decision making and question generation. For the first sub-task, the machine makes a decision among "Yes", "No", "Inquire", and "Irrelevant" at each turn, given the rule text, initial question, user scenario, and dialog history. "Yes/No" gives a definite answer to the initial question, while "Irrelevant" means the question cannot be answered from the given rule text. Finally, if the information provided so far is insufficient to decide, an "Inquire" decision is made and the second sub-task is triggered: the machine asks a corresponding question to fill the information gap.
In particular, Zhong and Zettlemoyer (2019) proposed the Entailment-driven Extract and Edit network, which jointly extracts a set of decision rules from the procedural text while reasoning about the entailment state of each rule. Gao et al. (2020a) proposed EMT, which uses a recurrent entity network to track rule fulfillment for further use. Despite the effectiveness of these approaches, they typically use the sequential rule text as a whole while neglecting the salient discourse structure of the rule units, even though rule documents are commonly formed from a series of possible items or conditions that the conversation should satisfy before decisions can be made confidently. The current state-of-the-art model DISCERN (Gao et al., 2020b) segments the rule text into elementary discourse units (EDUs) for a better understanding of the logical structure of rule texts and explicitly builds the dependency between entailment states and decisions, which verified the great potential of modeling discourse information. However, this work only takes advantage of the segmentation instead of truly capturing the discourse structure.
In summary, the aforementioned methods have the following drawbacks. First, existing models pay very little attention to the discourse structure that reflects the inner dependencies between rule units. Second, existing methods do not dig deeply enough into the relationships between the rule document and the user scenario.
In this work, we propose a dialogue graph modeling (DGM) framework to tackle the above problems. We first convert the entire rule document with discourse annotations into a discourse graph, which represents both the rule EDUs and the discourse relations as vertices. The user scenario representation is also injected as a special global vertex, to bridge the interactions and capture the inherent dependency between the rule document and the user scenario. In addition, a rule graph is designed to represent the internal structure of the rule text in both a local and a contextualized way. Finally, a global graph connects all the information together, forming the representation of the given texts. The representations from the discourse and rule graphs are then fused to make the model decision.
Experimental results show that our proposed model outperforms the baseline models on the evaluation metrics and achieves new state-of-the-art results on the ShARC benchmark (Saeidi et al., 2018). Specifically, we are the first to explicitly model the relationships among rules and the knowledge base with Graph Convolutional Networks (GCNs) (Schlichtkrull et al., 2018). In addition, our model offers strong interpretability by modeling the process in an intuitive way.

Related Work
Compared with traditional triplet-based MRC tasks that aim to answer questions by reading a given document (Hermann et al., 2015; Hill et al., 2015; Rajpurkar et al., 2016; Nguyen et al., 2016; Joshi et al., 2017; Rajpurkar et al., 2018), the CMR task considered here (Saeidi et al., 2018) is more challenging in that it involves rule documents, scenarios, asking clarification questions, and making a final decision. The major differences lie in two aspects: 1) the machine is required to formulate follow-up questions for clarification until it is confident enough to make a decision, and 2) the machine has to draw a question-related conclusion by interpreting a set of complex decision rules, instead of simply extracting the answer from the text.
Existing works (Zhong and Zettlemoyer, 2019; Lawrence et al., 2019; Verma et al., 2020; Gao et al., 2020a,b) have made progress in improving reasoning ability by modeling the interactions between the rule document and scenarios. As a widely used strategy, existing models commonly extract the rule document into individual rule items and track rule fulfillment across dialogue states. As indicated in Gao et al. (2020b), improving the rule document representation remains a key factor in overall model performance, because rule documents are formed from a series of implicit, separable, and possibly interrelated rule items or conditions that the conversation should satisfy before decisions can be made confidently. However, Gao et al. (2020b) only considered segmenting the discourse and neglected the discourse structure, i.e., the inner relationships between the EDUs. Compared with existing methods, ours differs in the following aspects:
1. We employ GCNs to explicitly model the discourse structure as well as the discourse relations of the rule document.
2. We bridge the gap between the user scenario and the rule document by injecting the scenario as a special graph node in the discourse graph.
3. We decouple the complex rule document with a decoupling graph, to capture the local representation of each rule item and its global interaction with the rest of the rule items.

Model
As illustrated in Figure 2, our model mainly consists of three parts to generate the final answer.
1. First, the rule document is segmented into rule EDUs, which are then passed to an open tagging tool for discourse relation annotation.
2. Taking the rule document (with discourse annotations) and the user scenario as input, we construct a Discourse Graph to learn the discourse information of the rules as well as the extra information inherent in the user scenario. We then formulate a Decoupling Graph to learn the local and contextualized information within the raw rule document. Finally, a Global Graph takes as input the rule-EDU representations (combining those produced by the Discourse Graph and the Decoupling Graph), the initial question, the user scenario, and the dialog history, and maps each rule EDU to an entailment state. Armed with these rule fulfillment states, we make a decision among "Yes", "No", "Inquire", and "Irrelevant".
3. Once the decision is "Inquire", our model generates a follow-up question to clarify the under-specified rule span in the rule document.

Preprocessing
EDU Segmentation In this step, we need to separate the rule text into several units, each containing exactly one condition. This is relatively easy when bullets exist in the rule text. In most cases, however, a single rule text contains several conditions without any bullet notation. Here we follow DISCERN (Gao et al., 2020b) in adopting a discourse segmenter to break the rule text into EDUs.
Discourse Relation Unlike EDU segmentation, which concerns only constituency-based logical structures, discourse relations may hold between non-adjacent segmented EDUs. There are 16 discourse relations in total: comment, clarification-question, elaboration, acknowledgement, continuation, explanation, conditional, question-answer, alternation, question-elaboration, result, background, narration, correction, parallel, and contrast. We follow Shi and Huang (2019) to identify the relations between EDUs. This discourse parsing model sequentially decides the dependency links between EDUs and the corresponding relation types, using a structured representation of each EDU, and achieves the state-of-the-art F1 score on the STAC corpus.
Encoding We select the pre-trained language model (PrLM) ELECTRA (Clark et al., 2020) for encoding. As shown in the figure, the input of our model includes the rule text, already parsed into EDUs with explicit discourse relation tags, the user's initial question, the user scenario, and the dialog history. Instead of inserting a [CLS] token before each sentence to obtain sentence-level representations, we use [RULE], which has been shown to enhance performance (Lee et al., 2020). Note that we also insert [SEP] between every two adjacent utterances.
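As a concrete illustration, the input assembly described above can be sketched as follows. The helper name and whitespace tokenization are purely illustrative; in practice the special tokens must be registered with the ELECTRA tokenizer and the text subword-tokenized.

```python
def build_input(edus, question, scenario, history):
    """Assemble the flat input sequence fed to the PrLM encoder.

    A [RULE] marker precedes each rule EDU (its encoding later serves
    as that EDU's sentence-level representation), and [SEP] separates
    every two adjacent utterances. Whitespace splitting stands in for
    real subword tokenization here.
    """
    tokens = ["[CLS]"]
    for edu in edus:
        tokens += ["[RULE]"] + edu.split()
    tokens += ["[SEP]"] + question.split()
    tokens += ["[SEP]"] + scenario.split()
    for turn in history:
        tokens += ["[SEP]"] + turn.split()
    tokens.append("[SEP]")
    return tokens
```

The positions of the [RULE] tokens are remembered so their final hidden states can be gathered as EDU representations.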

Decision Making based on Graph Modeling
Discourse Graph We first construct the discourse graph as a Levi graph (Levi, 1942), which turns the labeled edges into additional vertices. Suppose G = (V, E, R) is the graph constructed in the following way: if utterance U_1 is the continuation of another utterance U_2, we add a directed edge e = (U_1, U_2) whose relation R is assigned to Continuation. The corresponding Levi graph can be expressed as G_L = (V_L, E_L, R_L), where V_L = V ∪ E and E_L is the set of edges of the form (U_1, Continuation) and (Continuation, U_2). As for R_L, previous works such as (Marcheggiani and Titov, 2017; Beck et al., 2018) designed three types of edges, R_L = {default, reverse, self}, to enhance information flow. In our setting, we extend them into six types: default-in, default-out, reverse-in, reverse-out, self, and global, corresponding to the direction of the edges in and out of the relation vertices. An example of constructing a Levi graph is shown in Figure 3. In order to propagate discourse information throughout the knowledge base, we add a global vertex representing the user scenario and connect it with all the other vertices.

Figure 2: The overall structure of our proposed model. With segmented EDUs and tagged relations, the inputs, including the user's initial question, user scenario, and dialog history, are embedded and passed through graph modeling to make the final decision. If the decision is "Inquire", the question generation stage is activated and uses the under-specified span of the rule text to generate a follow-up question.
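The Levi-graph construction with the six edge types can be sketched as follows. Vertex naming and the exact in/out labeling convention are assumptions for illustration:

```python
def build_levi_graph(edges, scenario="SCENARIO"):
    """Convert labeled discourse edges into a Levi graph.

    Each labeled edge (u, rel, v) becomes a fresh relation vertex;
    default-in/default-out follow the original direction through that
    vertex, reverse-in/reverse-out run the opposite way, every vertex
    gets a self loop, and the scenario vertex is connected to all
    other vertices with global edges.
    """
    vertices, typed = {scenario}, []
    for k, (u, rel, v) in enumerate(edges):
        r = f"{rel}_{k}"                      # one relation vertex per labeled edge
        vertices.update([u, v, r])
        typed.append((u, r, "default-in"))    # u -> relation vertex
        typed.append((r, v, "default-out"))   # relation vertex -> v
        typed.append((v, r, "reverse-in"))    # reversed counterparts
        typed.append((r, u, "reverse-out"))
    for x in vertices:
        typed.append((x, x, "self"))
        if x != scenario:
            typed.append((scenario, x, "global"))
    return vertices, typed
```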
We use a relational graph convolutional network (R-GCN) (Schlichtkrull et al., 2018) to implement the discourse graph, as the traditional GCN cannot handle multi-relational graphs. For utterance and scenario vertices, we employ the encodings of [RULE] and [CLS] from Section 3.1; for relation vertices, we look up an embedding table to obtain the initial representation. Given the initial representation $h_i^{(0)}$ of every node $v_i$, the feed-forward or message-passing process can be written as

$$h_i^{(l+1)} = \mathrm{ReLU}\Big(\sum_{r \in R_L} \sum_{v_j \in N_r(v_i)} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)}\Big),$$

where $N_r(v_i)$ denotes the neighbors of node $v_i$ under relation $r$, $c_{i,r} = |N_r(v_i)|$ is simply the number of those nodes, and $W_r^{(l)}$ is the learnable parameter matrix of layer $l$.
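A minimal numpy sketch of one such relational message-passing step (the tanh nonlinearity and dense adjacency format are assumptions; a real implementation would use sparse ops and learned weights):

```python
import numpy as np

def rgcn_layer(H, adj, W, activation=np.tanh):
    """One relational-GCN message-passing step (Schlichtkrull et al., 2018).

    H:   (n, d) node states h_i
    adj: dict relation r -> (n, n) 0/1 matrix, adj[r][i, j] = 1 iff
         v_j is in N_r(v_i)
    W:   dict relation r -> (d, d) weight matrix W_r
    Messages from neighbors under each relation are averaged
    (normalized by c_{i,r} = |N_r(v_i)|) and summed over relations.
    """
    n, d = H.shape
    out = np.zeros((n, d))
    for r, A in adj.items():
        deg = A.sum(axis=1, keepdims=True)       # c_{i,r} per node
        norm = np.divide(A, np.maximum(deg, 1))  # avoid division by zero
        out += norm @ H @ W[r]
    return activation(out)
```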
Because the 16 relations cannot be treated equally (e.g., the relation Contrast is much more informative than the relation Continuation), we introduce the gating mechanism of Marcheggiani and Titov (2017). The basic idea is to compute a value between 0 and 1 that controls information passing.
Finally, the forward process of the gated GCN can be represented as

$$h_i^{(l+1)} = \mathrm{ReLU}\Big(\sum_{r \in R_L} \sum_{v_j \in N_r(v_i)} g_{j,r}^{(l)} \cdot \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)}\Big), \qquad g_{j,r}^{(l)} = \sigma\big(h_j^{(l)} \cdot \hat{w}_r^{(l)} + b_r^{(l)}\big),$$

where $\sigma$ is the sigmoid function and $\hat{w}_r^{(l)}$, $b_r^{(l)}$ are the gate parameters.

Rule Graph. The rule graph aims at digging out information from the rule text in both an internal and an external way. Each token of a rule EDU is represented as a vertex in the graph, and the graph is expressed with adjacency matrices. Two types of matrices, $M_l$ and $M_c$, are introduced, standing for local and contextualized information respectively:

$$M_l[i,j] = \mathbb{1}[I_i = I_j], \qquad M_c[i,j] = \mathbb{1}[I_i \neq I_j],$$

where $I_i$ is the index of the EDU that token $i$ belongs to. The information contained in the rule text is thus decoupled into two separate aspects. We encode the graph with multi-head self-attention; denoting the length of the whole rule text as $s$ and the embedding dimension as $d$, we arrive at the representation

$$G_i = \mathrm{MHSA}(E, M_i), \quad i \in \{l, c\},$$
where $G_i \in \mathbb{R}^{s \times d}$ and $E$ is the embedding result from the PrLM mentioned in Section 3.1; MHSA denotes multi-head self-attention (Vaswani et al., 2017). After sufficient interaction inside the rule EDUs, we fuse the information (Liu et al., 2020) of the two decoupled graphs above in a gated manner, considering both the original and the graph-encoded representations of the rule text:

$$\bar{G} = \mathrm{FC}([G_l; G_c]), \qquad g = \sigma\big(\mathrm{FC}([E; \bar{G}])\big), \qquad C = g \odot E + (1 - g) \odot \bar{G},$$

where FC is a fully-connected layer and $C \in \mathbb{R}^{s \times d}$. We take the positions of the original [RULE] tokens in $C$ as the updated rule-EDU representations, denoted $c_i$.
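Under this reading, the two decoupling masks and the gated fusion can be sketched as follows. The exact mask definition (self-attention allowed in both masks on the diagonal) and the fusion parameterization are assumptions consistent with the description, not taken verbatim from the paper:

```python
import numpy as np

def decoupling_masks(edu_ids):
    """M_l connects tokens within the same EDU (local); M_c connects
    tokens across different EDUs (contextualized). edu_ids[i] = I_i,
    the index of the EDU containing token i."""
    ids = np.asarray(edu_ids)
    same = (ids[:, None] == ids[None, :]).astype(float)
    M_l, M_c = same, 1.0 - same
    np.fill_diagonal(M_c, 1.0)   # each token may still attend to itself
    return M_l, M_c

def gated_fusion(E, G, Wg, bg):
    """Fuse the PrLM encoding E with a graph encoding G via a sigmoid
    gate computed by a fully-connected layer over their concatenation.
    E, G: (s, d); Wg: (2d, d); bg: (d,). Returns C of shape (s, d)."""
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([E, G], axis=-1) @ Wg + bg)))
    return g * E + (1.0 - g) * G
```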
Global Graph. The global graph attends over all the information available so far in a systematic way. A transformer encoder (Vaswani et al., 2017) is adopted, allowing all the rule EDUs and user-provided information to attend to each other. Let $[r_1, r_2, \ldots; u_q; u_s; h_1, h_2, \ldots]$ denote all the representations, where $r_i$ is the combined sentence-level representation from the discourse and rule graphs, and $u_q$, $u_s$, and $h_i$ stand for the representations of the user question, user scenario, and dialog history, respectively. After transformer encoding, the result is $[\tilde{r}_1, \tilde{r}_2, \ldots; \tilde{u}_q, \tilde{u}_s; \tilde{h}_1, \tilde{h}_2, \ldots]$.
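The all-to-all interaction can be illustrated with a toy single-head stand-in for the transformer encoder (no learned projections or feed-forward sublayer, which a real encoder would have):

```python
import numpy as np

def self_attention(X):
    """Minimal single-head attention over the stacked representations
    X = [r_1; ...; u_q; u_s; h_1; ...] of shape (n, d): every EDU,
    question, scenario, and history vector attends to every other."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # rows sum to 1
    return A @ X
```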
Decision Making. Similar to existing works (Zhong and Zettlemoyer, 2019; Gao et al., 2020a,b), we apply an entailment-driven approach to decision making. We use a linear transformation to track the fulfillment state of each rule EDU, i.e., the model predicts one of Entailment, Contradiction, and Unmentioned for each EDU:

$$f_i = W_e \tilde{r}_i + b_e \in \mathbb{R}^3,$$

where $f_i$ is the score predicted for the three labels. This prediction is trained via a cross-entropy loss for multi-class classification:

$$\mathcal{L}_{entail} = -\sum_i \log \mathrm{softmax}(f_i)_{r_i},$$

where $r_i$ is the ground-truth state of fulfillment. After obtaining the state of every rule EDU, we give the final decision among Yes, No, Inquire, and Irrelevant via self-attention:

$$\alpha_i = \mathrm{softmax}_i(w_\alpha^\top \tilde{r}_i), \qquad z = W_z \sum_i \alpha_i \tilde{r}_i \in \mathbb{R}^4,$$

where $z$ holds the scores for all four possible decisions. The corresponding training loss in cross-entropy form is

$$\mathcal{L}_{dec} = -\log \mathrm{softmax}(z)_y,$$

where $y$ is the gold decision.

Table 1: End-to-End results on the ShARC dataset.

Model                                  Micro Acc.  Macro Acc.  BLEU1  BLEU4
NMT (Saeidi et al., 2018)              44.8        42.8        34.0   7.8
CM (Saeidi et al., 2018)               61.9        68.9        54.4   34.4
BERTQA (Zhong and Zettlemoyer, 2019)   68
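The two prediction heads described here can be sketched as follows. The self-attention parameterization (a single scoring vector) is an assumed form consistent with the description; biases are omitted for brevity:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decide(R, We, Wa, Wz):
    """Entailment tracking and final decision.

    R:  (n, d) EDU representations after the global graph.
    We: (d, 3) linear map giving per-EDU scores f_i over
        {Entailment, Contradiction, Unmentioned}.
    Wa: (d,)   self-attention scoring vector over EDUs.
    Wz: (d, 4) map from the pooled summary to decision scores z over
        {Yes, No, Inquire, Irrelevant}.
    Both heads are trained with cross-entropy losses.
    """
    f = R @ We                  # per-EDU entailment scores
    alpha = softmax(R @ Wa)     # attention weights over EDUs
    z = (alpha @ R) @ Wz        # 4-way decision scores
    return f, z
```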

Question Generation via Span Extraction
If the decision is "Inquire", the machine needs to ask a follow-up question for further clarification. Question generation here is mainly based on the uncovered information in the rule text, which is then rephrased into a question. Following Gao et al. (2020b), we predict the position of an under-specified span within the rule text in a supervised way. Following the BERTQA approach (Devlin et al., 2019), we learn a start vector $w_s \in \mathbb{R}^d$ and an end vector $w_e \in \mathbb{R}^d$ to indicate the start and end positions of the desired span:

$$s^{start}_{k,i} = w_s^\top t_{k,i}, \qquad s^{end}_{k,i} = w_e^\top t_{k,i},$$

where $t_{k,i}$ denotes the $i$-th token in the $k$-th rule sentence. The ground-truth span labels are generated by computing edit distances: intuitively, the span with the minimum edit distance is selected as the under-specified span. Finally, we concatenate the rule text and the predicted span into an input sequence to fine-tune UniLM (Dong et al., 2019) and generate the follow-up question. Evaluation. For the decision-making sub-task, ShARC evaluates micro- and macro-accuracy. If both the predicted and ground-truth decisions are "Inquire", the BLEU score (Papineni et al., 2002) is evaluated on the follow-up question generation sub-task.
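The span scoring can be sketched as follows. The decoding rule (pick the start/end pair maximizing the summed scores subject to start ≤ end) is a common convention, assumed here rather than taken from the paper:

```python
import numpy as np

def extract_span(T, w_s, w_e):
    """Score start/end positions of the under-specified span.

    T: (s, d) token representations t_{k,i} of one rule sentence;
    w_s, w_e: (d,) learned start/end vectors. Returns the (start, end)
    index pair with the highest combined score, start <= end.
    """
    start_scores = T @ w_s
    end_scores = T @ w_e
    best, span = -np.inf, (0, 0)
    for i in range(len(T)):
        for j in range(i, len(T)):       # enforce start <= end
            if start_scores[i] + end_scores[j] > best:
                best, span = start_scores[i] + end_scores[j], (i, j)
    return span
```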
Implementation Details. For rule EDU relation prediction, we keep all the default parameters, achieving an F1 score of 55%. In the decision-making stage, we fine-tune an ELECTRA-based model. The dimension of the hidden states is 1024 for both the encoder and decoder. Training uses Adam (Kingma and Ba, 2014) for 5 epochs with the learning rate set to 5e-5. We also use gradient clipping with a maximum gradient norm of 2 and a total batch size of 16. In the question generation stage, for the sake of consistency, we also use an ELECTRA-based model for span extraction. For UniLM, we fine-tune with a batch size of 16 and a learning rate of 2e-5; the beam size is set to 10 for inference. Table 1 shows the results for the End-to-End task on the ShARC dataset. It is worth noting that DGM outperforms the existing models on most of the metrics. In addition, we analyze the class-wise classification accuracy of our model in Table 2. The results demonstrate that our model is far better at judging whether the user's requirements are fulfilled. Table 3 shows an ablation study of our Dialogue Graph Model on the development set of ShARC. Both Macro Acc. and Micro Acc. decrease for w/o Discourse Graph and w/o Decoupling Graph. In particular, these two metrics drop by a large margin when the Discourse Graph is removed, which shows the effectiveness of graph modeling in CMR.