Towards End-to-End Open Conversational Machine Reading

In the open-retrieval conversational machine reading (OR-CMR) task, machines are required to perform multi-turn question answering given a dialogue history and a textual knowledge base. Existing works generally approach the problem's two successive sub-tasks with two independent modules: hard-label decision making followed by question generation aided by various entailment reasoning methods. Such cascaded modeling is vulnerable to error propagation and prevents the two sub-tasks from being consistently optimized. In this work, we instead model OR-CMR as a unified text-to-text task in a fully end-to-end style. Experiments on the ShARC and OR-ShARC datasets show the effectiveness of our proposed end-to-end framework on both sub-tasks by a large margin, achieving new state-of-the-art results. Further ablation studies show that our framework generalizes to different backbone models.


Introduction
In a multi-turn dialogue comprehension scenario, machines are expected to answer high-level questions through interactions with human beings until enough information is gathered to derive a satisfying answer (Zhu et al., 2018; Zhang et al., 2018; Zaib et al., 2020; Huang et al., 2020; Fan et al., 2020; Gu et al., 2021). As a specific and challenging dialogue comprehension task, conversational machine reading (CMR) (Saeidi et al., 2018) requires machines to understand the user's initial setting and the dialogue history before the machine itself is able to give a final answer or inquire for more clarification according to rule texts (see Figure 1).
In terms of the acquisition of rule texts, which are the main reference for tackling CMR, there is a closed-book setting, where the rule texts are all given, and correspondingly an open-retrieval setting, where the rule texts need to be retrieved from a knowledge base (Gao et al., 2021) (see Figure 1). In terms of problem objectives, current approaches generally divide the targets into two categories: decision making and question generation (Zhong and Zettlemoyer, 2019; Lawrence et al., 2019; Verma et al., 2020; Gao et al., 2020b,c). For the decision making sub-task, the machine is required to either give a decision that directly answers the user question, which concludes the dialogue, or decide to ask for clarification, which continues it. For the question generation sub-task, the machine is required to generate the clarifying questions that are essential to the eventual final decision. Following this line of approaching the CMR task, a variety of works have been proposed, mainly based on modeling the matching of elementary conditions (Henaff et al., 2017; Zhong and Zettlemoyer, 2019; Lawrence et al., 2019; Verma et al., 2020; Gao et al., 2020b,c; Ouyang et al., 2021; Zhang et al., 2021) in either a sequential encoding or a graph-based manner.
However, by tackling the CMR task as two divided sub-tasks, the corresponding division between optimizing the decision making sub-task and optimizing the question generation sub-task may cause problems such as error propagation, hindering further performance gains. Ouyang et al. (2021) have shown that transferring some knowledge between the training of the two sub-tasks is beneficial for performance. However, closing the gap between the two sub-tasks to achieve end-to-end optimization of the CMR task still needs further and more comprehensive attempts.
In this work, we propose a completely Unified end-to-end framework for Conversational Machine Reading tasks (UNICMR) to tackle this divided-optimization challenge by formulating the CMR/OR-CMR task as a single text-to-text task. Our contributions are summarized as follows: (i) We completely unify the two sub-tasks of OR-CMR into a single task in terms of optimization, achieving a fully end-to-end optimization paradigm.
(ii) Experimental results on the OR-ShARC dataset and the ShARC dev set show the effectiveness of our method, especially on the question generation sub-task with a relatively small number of parameters. Furthermore, our method achieves new state-of-the-art results on all sub-tasks.
(iii) Through further ablation studies, we show that our proposed framework largely advances decision making performance and reduces error propagation, thus boosting question generation performance. We also show that our proposed framework can generalize to different backbone models, and qualitative analysis including a case study further verifies its effectiveness.

Conversational Machine Reading
The mainstream of research on the conversation-based reading comprehension task focuses on either decision making (Choi et al., 2018; Reddy et al., 2019; Sun et al., 2019; Tao et al., 2019; Cui et al., 2020; Yang et al., 2020) or follow-up utterance generation (Wu et al., 2019; Bi et al., 2019; Ren et al., 2019; Gao et al., 2020a). However, decision-making-centered approaches neglect cultivating the machine's capability to reduce the information gap through clarifying interactions, while question-generation-centered approaches neglect the machine's capability to concentrate on target-oriented information and make vital decisions. In contrast, our work focuses on a more challenging conversation-based reading comprehension task called conversational machine reading (CMR) (Saeidi et al., 2018; Gao et al., 2021), which requires machines to make decisions and generate clarifying questions in a dialogue given rule texts and user scenarios.

Open-Retrieval CMR
Most current studies on CMR concentrate on the closed-book setting, where the essential reference for the final decision, a piece of rule text corresponding to each dialogue, is given (Zhong and Zettlemoyer, 2019; Verma et al., 2020; Gao et al., 2020b,c). A typical benchmark is ShARC (Saeidi et al., 2018). However, in a more realistic and more challenging setting, the machine is required to retrieve rule texts based on different scenarios. Similar to the open-domain question answering setting, where supporting texts are retrieved from external documents to answer factoid questions (Moldovan et al., 2000; Voorhees and Tice, 2000), the open-retrieval conversational machine reading (OR-CMR) task requires the machine to retrieve useful information from a given knowledge base composed of rule texts. In contrast to most previous works on CMR, we focus on OR-CMR in pursuit of this more realistic and more challenging setting.

Joint Optimization of CMR
Existing studies generally approach the conversational machine reading task by separating it into two sub-tasks (Zhong and Zettlemoyer, 2019; Verma et al., 2020; Gao et al., 2020b,c): decision making and question generation. Accordingly, they focus on different methods to extract the fulfillment of rule-related conditions and conduct explicit entailment reasoning to track the conditions in the dialogues, including applying attention mechanisms to the sequentially encoded user setups and dialogue (Zhong and Zettlemoyer, 2019; Lawrence et al., 2019; Verma et al., 2020; Gao et al., 2020b,c) and extracting discourse structures for better fulfillment matching (Ouyang et al., 2021). However, a major challenge emerges from the division between optimizing the decision making sub-task and optimizing the question generation sub-task. Zhang et al. (2021) took an initial step towards mitigating this division by feeding the encoded hidden states from decision making into the question generation module. However, it still lacks optimization synergy and relies on separate feature extraction steps, including entailment reasoning. In contrast, our work unifies the two sub-tasks into one, enabling end-to-end joint model optimization on both the decision making and question generation targets.

Problem Formulation
As shown by Figure 1, in the traditional CMR task, the machine is given: a user scenario S, a user initial question Q, a gold rule text R, and a dialogue history D := {(q_1, a_1), (q_2, a_2), . . . , (q_n, a_n)} consisting of n follow-up question-answer pairs. The machine is required to perform two sub-tasks:
• Decision Making. The machine makes a decision to either answer the user's initial question with Yes or No, or give Inquire, which activates the second sub-task to generate an inquiry question for more clarification.
• Question Generation. The machine generates an inquiry question aimed at the essential clarifications needed to answer the user's initial question.
Beyond CMR, open-retrieval conversational machine reading (OR-CMR) (Gao et al., 2021) further mimics the more challenging open-retrieval scenario, which is the focus of this work. As shown by Figure 1, the difference between CMR and OR-CMR lies in the rule text part R. In CMR, the machine is provided with a gold rule text in a closed-book style, while in OR-CMR, the machine is instead given a knowledge base B containing rule texts. Under the OR-CMR setting, the machine therefore needs to first retrieve m rule texts R_1, R_2, . . . , R_m to complete the input for the same downstream decision making and question generation sub-tasks.
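To make the formulation concrete, the following is a minimal sketch of an OR-CMR instance as a Python data structure; all field names are our own illustration, not part of any dataset schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ORCMRInstance:
    """One OR-CMR example; field names are illustrative."""
    scenario: str                    # user scenario S
    question: str                    # user initial question Q
    history: List[Tuple[str, str]]   # dialogue history D: (follow-up question, answer) pairs
    # In CMR a single gold rule text R is given; in OR-CMR the model
    # instead retrieves m rule texts R_1..R_m from the knowledge base B.
    rule_texts: List[str] = field(default_factory=list)

# The target is either a final decision ("Yes"/"No") or "Inquire"
# plus a generated clarifying question.
```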

Framework
Our model is composed of two main modules: a retriever and a text-to-text encoder-decoder model.
The retriever retrieves rule texts R_1, R_2, . . . , R_m from the given knowledge base B. The text-to-text encoder-decoder model then takes the preprocessed textual input and generates the textual answer directly as a whole. Subsequent extraction methods are applied to obtain the predictions for the decision making and question generation sub-tasks respectively.

Retriever
To obtain the rule texts, the user scenario S and the user initial question Q are concatenated as the input query to the retriever. Our retriever employs the TF-IDF-based method of MUDERN (Gao et al., 2021), which takes bigram features into account and scores the similarity between rule texts and queries as bag-of-words vectors weighted by the TF-IDF model. The top-scored m rule texts R_1, R_2, . . . , R_m are then passed to the following text-to-text encoder-decoder model.
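A minimal sketch of such a retriever using scikit-learn follows; the exact tokenization and weighting in MUDERN's implementation may differ, so this only illustrates bigram-aware, TF-IDF-weighted bag-of-words scoring:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_retriever(knowledge_base):
    # Unigram + bigram bag-of-words vectors weighted by TF-IDF.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    rule_matrix = vectorizer.fit_transform(knowledge_base)

    def retrieve(scenario, question, m=5):
        # The user scenario S and initial question Q are concatenated as the query.
        query_vec = vectorizer.transform([scenario + " " + question])
        scores = cosine_similarity(query_vec, rule_matrix).ravel()
        top = scores.argsort()[::-1][:m]  # indices of the m top-scored rules
        return [knowledge_base[i] for i in top]

    return retrieve
```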

Text-to-Text Encoder-Decoder
One of the major challenges of the CMR and OR-CMR tasks is the division of sub-task optimizations. Motivated by T5 (Raffel et al., 2020), which reformulates several traditional NLP tasks as a unified text-to-text generation task, we unify the two sub-tasks by formulating the input and output of our encoder-decoder model as follows.

Input Formulation
Discourse Segmentation. We employ the discourse segmentation approach of Shi and Huang (2019) to parse the retrieved rule texts into explicit conditions for the model. After discourse segmentation, each retrieved R_i is parsed into N_i elementary discourse units (EDUs): EDU_{i,1}, EDU_{i,2}, . . . , EDU_{i,N_i}. The formulation of the final input I is shown in the setting part of Figure 2.
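A minimal sketch of the input linearization follows. The marker tokens [FUA], [SN], and [EDU] appear in Figure 2, but the exact field order and template here are assumptions of this sketch:

```python
def build_input(question, scenario, history, segmented_rules):
    """Linearize everything into one input string I.

    `segmented_rules` is a list of EDU lists, one per retrieved rule text.
    """
    parts = [question, scenario]
    for q, a in history:  # dialogue history D
        parts.append(q + " [FUA] " + a)
    for edus in segmented_rules:
        # [SN] separates rule texts; [EDU] prefixes each discourse unit.
        parts.append("[SN] " + " ".join("[EDU] " + e for e in edus))
    return " ".join(parts)
```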

Output Formulation
The output of the text-to-text encoder-decoder is a sequence of textual tokens O := {o_1, o_2, . . . , o_k}, where the length k is determined by the model itself but bounded by the maximum generation length hyperparameter. To extract the predictions for the decision making and question generation sub-tasks respectively, we take the first output token o_1 as the model's decision prediction, and the following tokens {o_2, . . . , o_k} as the generated follow-up question, which is only meaningful when o_1 represents the Inquire decision.
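This extraction is straightforward; a minimal sketch (assuming the decision is rendered as a literal word at the start of the decoded string):

```python
def parse_output(generated_text):
    """Split the generated sequence into (decision, follow-up question)."""
    tokens = generated_text.split()
    decision = tokens[0] if tokens else ""
    if decision == "Inquire":
        follow_up = " ".join(tokens[1:])  # generated clarifying question
    else:
        follow_up = None  # Yes/No concludes the dialogue
    return decision, follow_up
```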

Training Objective
In the training stage, the labels Y := {y_1, y_2, . . . , y_k} are formulated as: {Yes token, [EOS]}, {No token, [EOS]}, or {Inquire token, follow-up question tokens, [EOS]}. The training objective is the standard sequence-to-sequence negative log-likelihood:

L(θ) = − Σ_{t=1}^{k} log P_θ(y_t | y_{<t}, I),

where I is the input to our encoder-decoder model and θ denotes all the parameters of our model.
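A minimal training-step sketch with Hugging Face Transformers is shown below; the target construction mirrors the three label forms above, and the tokenizer appends the [EOS] (</s>) token automatically. The exact prompts and hyperparameters used in UNICMR are not specified here:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def training_step(input_text, decision, follow_up_question=None):
    # Target is "Yes", "No", or "Inquire <follow-up question>";
    # the tokenizer appends the [EOS] token automatically.
    target = decision if follow_up_question is None \
        else decision + " " + follow_up_question
    enc = tokenizer(input_text, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt").input_ids
    # The returned loss is the token-level cross-entropy
    # -sum_t log P(y_t | y_<t, I), matching the objective above.
    loss = model(**enc, labels=labels).loss
    loss.backward()
    return loss.item()
```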

Experiments
Evaluation Metrics. For the decision making sub-task, the evaluation metrics are the Micro- and Macro-Accuracy of the decisions. For the question generation sub-task, we adopt F1_BLEU (Gao et al., 2021), which computes an F1 score whose precision is the BLEU (Papineni et al., 2002) over samples where the predicted decision is Inquire and whose recall is the BLEU over samples where the ground-truth decision is Inquire.
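A minimal sketch of the metric under this reading of the definition follows; the reference implementation in Gao et al. (2021) may differ in details such as the BLEU variant and smoothing, which we leave as an injected `bleu` scorer:

```python
def f1_bleu(examples, bleu):
    """Compute F1_BLEU over a list of examples.

    Each example dict holds pred_decision, gold_decision, and the
    predicted/gold questions (empty strings when absent); bleu(hyp, ref)
    is any sentence-level BLEU scorer.
    """
    pred = [e for e in examples if e["pred_decision"] == "Inquire"]
    gold = [e for e in examples if e["gold_decision"] == "Inquire"]
    # Precision: average BLEU over samples where the model predicted Inquire;
    # a prediction with no gold question scores 0 here.
    p = sum(bleu(e.get("pred_q", ""), e.get("gold_q", "")) for e in pred) / max(len(pred), 1)
    # Recall: average BLEU over samples whose ground truth is Inquire.
    r = sum(bleu(e.get("pred_q", ""), e.get("gold_q", "")) for e in gold) / max(len(gold), 1)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0
```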

Quantitative Results
The effectiveness of our proposed method is verified on both the OR-ShARC and the original ShARC datasets. In addition, we compare the number of parameters with related studies. Tables 1-3 present our main experimental results, and we discuss our findings below.

Decision Making and Question Generation Performance on OR-ShARC
Referring to our results reported in Table 1, our large unified model achieves new SOTA question generation performance on both the dev and test sets by a large margin. In terms of decision making results, our large model lags behind on the dev set but prevails on the test set, maintaining stable and consistent performance when transferring from the dev set to the test set.

Performance on ShARC.
As a reference, the performance of UNICMR_large together with the current SOTA model OSCAR on the dev set of ShARC is reported in Table 2. Note that, in contrast with OR-ShARC (Gao et al., 2021), the ShARC benchmark (Saeidi et al., 2018) is in the closed-book setting, and its question generation sub-task is evaluated with BLEU. Based on the results in Table 2, UNICMR_large achieves new SOTA performance on the dev set by a large margin for both the decision making and the question generation sub-tasks. This shows that our unified method is also effective in the closed-book setting.

Comparison of Model Parameter Numbers
We have approximated the total parameter counts of current high-performance models, as shown in Table 3. By this comparison, our UNICMR_large (based on T5-large) uses around 770M parameters yet generally prevails over the current SOTA model OSCAR, which uses around 1,100M parameters. Our UNICMR_base (based on T5-base) uses 220M parameters but prevails over models like DISCERN, which uses around 330M parameters, and achieves performance close to OSCAR in terms of question generation. These observations verify that our method of unifying the optimization of the two sub-tasks is effective, enabling each sub-task to benefit from the optimization of the other.

Number of Retrieved Rule Texts
The model performance under different choices of the number of retrieved rule texts is shown in Table 7 in Appendix B and visualized in Figure 3. We observe that, in general, increasing the number of retrieved rule texts provides more information, which helps the model, while also introducing more noise, which harms it.

Maximum Generation Length
The model performance under different choices of the maximum generation length is shown in Table 8 in Appendix B and visualized in Figure 4.
In terms of both decision making and question generation, a redundant maximum generation length does not affect the model's performance, but an insufficient one limits it. This means the model learns well the difference between the forms of the answers and generates answers of suitable length accordingly, which verifies the feasibility of our end-to-end framework design.
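In Hugging Face terms, this cap is simply the max_length argument of generate(); a sketch reusing the model and tokenizer from the training snippet above (the value 64 is illustrative, not the paper's setting):

```python
# input_text is the formulated input I from build_input(...).
enc = tokenizer(input_text, return_tensors="pt", truncation=True)
output_ids = model.generate(
    enc.input_ids,
    max_length=64,  # generous cap; Yes/No answers still stop early at [EOS]
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```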

Figure 5: Classwise accuracy on the dev set at each epoch.

Generation Quality Gain Across Training
The classwise accuracy evaluated during training of the decision making sub-task is shown in Figure 5. Given the initial gap between the accuracy for Inquire and the accuracy for the other decisions, our model tends to predict the decision as Inquire and generate a question when not yet well fine-tuned. This is due to the gap between the length of the answer Yes/No and the length of the answer "Inquire + generated question", together with the innate property of the pre-trained T5 generation model, which is biased towards longer answers before fine-tuning. As training continues, the accuracy for Yes and No gradually catches up with that for Inquire, though it remains slightly lower. This observation shows both the existence of this bias in our backbone model and the effectiveness of our training, which largely reduces it. It also suggests future improvements via more targeted training to eliminate the bias and to lessen the discontinuity between the output lengths for Yes/No and for "Inquire + generated question".

Contribution of the Retriever Module
To quantify the contribution of the retriever module, we conducted an additional experiment in which OR-ShARC is turned into a closed-book setting (see Closed-Book in Table 4). We also replaced the TF-IDF retriever of UNICMR_large with the DPR++ retriever for reference (see w/ DPR++ in Table 4), and report the performance of UNICMR_large without any retriever (see w/o Retriever in Table 4). The results verify that using retrieval is beneficial, reducing the gap between the challenging open-retrieval task and the closed-book task with gold rule texts.

Discussions of Performance Improvement
To further investigate the source of our method's performance improvement, we compare the question generation performance of models trained only on the question generation sub-task (the QG-only settings in Table 5) and T5-large-based UNICMR (UNICMR in Table 5). The results indicate that:
(i) In terms of F1_BLEU, UNICMR has dominantly higher performance than the other, separately trained models.
(ii) In terms of BLEU, UNICMR is not the best, which shows that its F1_BLEU dominance stems in part from the reduction of error propagation.
(iii) For the T5-large backbone, UNICMR achieves higher BLEU than the QG-only partial-evaluation setting, which means that UNICMR's integration of decision making labels in training is effective.
Overall, as shown in Table 5, UNICMR's unified training format advances the performance over training T5 separately.

Generalizability on Different Backbone Models
Replacing the T5-large backbone with BART-base and repeating the same experiments (see the same settings with BART-base as the backbone model in Table 5 and Table 6) leads to the same general conclusions. This shows that the effectiveness of UNICMR's unified format generalizes well to different end-to-end architectures.

Error Analysis and Case Study
To reveal more insights into UNICMR, we randomly collect test samples and conduct error analysis (see Figure 6) and a case study (see Figure 7 in Appendix A). The ground truth answers are indicated in red, the TF-IDF scores in green, and the predictions of UNICMR_large in blue. The retrieved rule texts are shown in descending order of TF-IDF score.
Error Analysis.

Conclusion
In this paper, we study the open-retrieval setting of the conversational machine reading task and propose a novel framework that is the first to unify the optimization of the two sub-tasks, achieving optimization synergy. With a retriever module and a parameter-efficient text-to-text encoder-decoder module, we achieve new SOTA results on both the CMR and OR-CMR benchmarks. Further experiments show that our unified training form with end-to-end optimization largely contributes to the advanced decision making performance and reduces error propagation, boosting question generation performance. We also show that our framework generalizes well to other backbone models, and further qualitative analysis verifies its effectiveness.

Limitations
Under the challenging open-retrieval setting, a retriever is required to find the related rule texts. However, the performance of our model may be hindered by the noise introduced by irrelevant rule texts from the retrieval. To address this deficiency, it would be beneficial to develop additional filtering methods that alleviate the influence of irrelevant rule texts.

A Case Study
To reveal more insights into our framework, we randomly collect test samples and conduct a case study (see Figure 7). The ground truth answers are indicated in red, the TF-IDF scores in green, and the predictions of UNICMR_large in blue. The retrieved rule texts are shown in descending order of TF-IDF score. For the analysis of these cases, please refer to Section 6.7.

B Hyperparameter-Related Experiments
In this section, additional experiments on the hyperparameter m (number of retrieved rule texts) and the maximum generation length hyperparameter are conducted, with results shown in Tables 7-10. In Table 7, m is varied to compare our model's performance on the OR-ShARC test set and its seen and unseen divisions respectively. In Table 8, the maximum generation length of the backbone encoder-decoder model is varied to compare performance on the same sets. The corresponding dev-set performance for these two experiments is shown in Table 9 and Table 10 respectively. Note that in these experiments, all other hyperparameters remain the same unless explicitly stated.

Figure 1: CMR and OR-CMR task overview.

Figure 3: Evaluation performance of our model under different numbers of retrieved rule texts on the test set.

Figure 4: Evaluation performance of our model under different maximum generation lengths on the test set.

Figure 2: Overview of our unified text-to-text formulation compared with existing cascaded models. The input (setting) concatenates the user question, user scenario, dialogue history (follow-up question-answer pairs joined with [FUA] markers), and the discourse-segmented retrieved rule texts (marked with [SN] and [EDU] tokens). Our output format is Answer: Yes, Answer: No, or Answer: Inquire followed by the generated inquiry question, whereas existing models use separate decision making, question generation, and entailment reasoning modules connected by activation signals and entailment states.

Datasets. Our training and evaluation are based on the OR-ShARC dataset (Gao et al., 2021). The original ShARC dataset (Saeidi et al., 2018) contains 948 dialogue trees, which are flattened into 32,436 examples with entries composed of rule documents, user setups, dialogue history, evidence, and decisions. Derived from ShARC, OR-ShARC modifies the initial questions to be self-contained and independent of the gold rule texts; the gold rule texts are then removed to form the knowledge base B of 651 rules. The train and dev sets of ShARC are further split into train, dev, and test sets with sizes 17,936, 1,105, and 2,373, respectively. The dev and test sets each satisfy that around 50% of the examples ask questions based on rule texts used in training (seen), while the remainder ask about rule texts unseen in training. This feature of the datasets aims to mimic a more realistic scenario where the user may ask questions about information that the machine has or has never encountered (Gao et al., 2021).

Table 1: Results on the validation and test sets of OR-ShARC. The first block presents the results of public models with the DPR++ retrieval method, and the second block reports the results of TF-IDF retrieval-based public models and our model. We report averages with standard deviations over 3 random seeds. The numbers in parentheses (↑) indicate the improvement over the previous state-of-the-art model.

Table 2: Results on the validation set of ShARC (with large models). Note that the test set of ShARC is not public, hence only the evaluation on the dev set is conducted.

The T5 checkpoints are from https://huggingface.co/t5-base and https://huggingface.co/t5-large, respectively. The average training run time for UNICMR_large is approximately 12 hours on one GPU; the average inference run time is approximately 10 minutes on the dev set and 21 minutes on the test set with one GPU.

Table 3: Comparison of the approximate number of parameters of current models.

Table 4: Results of UNICMR_large with different retriever module settings on the dev and test sets of the OR-ShARC benchmark. For the Closed-Book setting, OR-ShARC is turned into a closed-book setting by providing the gold rule texts. For the w/ DPR++ setting, the TF-IDF retriever is replaced with the DPR++ retriever. For the w/o Retriever setting, OR-ShARC is approached without rule texts.

Table 5: Question generation performance of UNICMR compared with models trained only on the question generation sub-task on OR-ShARC. For the QG-only whole-evaluation setting, we use all samples, assigning an empty generated question to samples with Yes/No decisions. For the QG-only partial-evaluation setting, we use only the samples with inquiry questions. The results are divided into two parts, one using T5-large and one using BART-base as the backbone model.

From the results in Figure 7, it can be seen that UNICMR_large generates more suitable inquiries in terms of exactness and performs excellently in terms of robustness to noisy retrieved rules, i.e., the model can filter noisy retrieved rule texts to extract the unsatisfied conditions. This suggests that our fully end-to-end framework enables accurate focus on the target conditions, and that the implicit feature engineering of UNICMR is powerful enough to filter noisy retrievals regardless of retriever quality.