Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question Answering

One of the main challenges in conversational question answering (CQA) is to resolve the conversational dependency, such as anaphora and ellipsis. However, existing approaches do not explicitly train QA models on how to resolve the dependency, and thus these models are limited in understanding human dialogues. In this paper, we propose a novel framework, ExCorD (Explicit guidance on how to resolve Conversational Dependency) to enhance the abilities of QA models in comprehending conversational context. ExCorD first generates self-contained questions that can be understood without the conversation history, then trains a QA model with the pairs of original and self-contained questions using a consistency-based regularizer. In our experiments, we demonstrate that ExCorD significantly improves the QA models’ performance by up to 1.2 F1 on QuAC, and 5.2 F1 on CANARD, while addressing the limitations of the existing approaches.


Introduction
Conversational question answering (CQA) involves modeling the information-seeking process of humans in a dialogue. Unlike single-turn question answering (QA) tasks (Rajpurkar et al., 2016; Kwiatkowski et al., 2019), CQA is a multi-turn QA task, where questions in a dialogue are context-dependent; hence they need to be understood with the conversation history (Reddy et al., 2019). As illustrated in Figure 1, to answer the current question "Was he close with anyone else?," a model should resolve the conversational dependency, such as anaphora and ellipsis, based on the conversation history.
Figure 1: Owing to linguistic phenomena in human conversations, such as anaphora and ellipsis, the current question $q_3$ should be understood based on the conversation history: $q_1$, $a_1$, $q_2$, and $a_2$. Question $q_3$ can be reformulated as a self-contained question $\tilde{q}_3$ via a question rewriting (QR) process.
A line of research in CQA proposes the end-to-end approach, where a single QA model jointly encodes the evidence document, the current question, and the whole conversation history (Yeh and Chen, 2019; Qu et al., 2019a). In this approach, models are required to automatically learn to resolve conversational dependencies. However, existing models struggle to do so without explicit guidance on how to resolve these dependencies. In the example presented in Figure 1, models are trained without explicit signals that "he" refers to "Leonardo da Vinci," and that "anyone else" can be elaborated as "other than his pupils, Salai and Melzi."
Another line of research proposes a pipeline approach that decomposes the CQA task into question rewriting (QR) and QA, to reduce the complexity of the task (Vakulenko et al., 2020). Based on the conversation history, QR models first generate self-contained questions by rewriting the original questions, such that the self-contained questions can be understood without the conversation history. For instance, the current question $q_3$ is reformulated as the self-contained question $\tilde{q}_3$ by a QR model in Figure 1. After rewriting the question, QA models are asked to answer the self-contained questions rather than the original questions. In this approach, QA models are trained to answer relatively simple questions whose dependencies have already been resolved by QR models. This limits the reasoning abilities of QA models for the CQA task and makes QA models reliant on QR models.
In this paper, we emphasize that QA models can be enhanced by using both types of questions with explicit guidance on how to resolve the conversational dependency. Accordingly, we propose EXCORD (Explicit guidance on how to Resolve Conversational Dependency), a novel training framework for the CQA task. In this framework, we first generate self-contained questions using QR models. We then pair the self-contained questions with the original questions, and jointly encode them to train QA models with consistency regularization (Laine and Aila, 2016;Xie et al., 2019). Specifically, when original questions are given, we encourage QA models to yield similar answers to those when self-contained questions are given. This training strategy helps QA models to better understand the conversational context, while circumventing the limitations of previous approaches.
To demonstrate the effectiveness of EXCORD, we conduct extensive experiments on three CQA benchmarks. In the experiments, our framework significantly outperforms the existing approaches by up to 1.2 F1 on QuAC and by 5.2 F1 on CANARD (Elgohary et al., 2019). In addition, we find that our framework is also effective on CoQA (Reddy et al., 2019), a dataset that does not have self-contained questions generated by human annotators. This indicates that the proposed framework can be adopted on various CQA datasets in future work. We summarize the contributions of this work as follows:
• We identify the limitations of previous approaches and propose a unified framework to address these. Our novel framework improves QA models by incorporating QR models, while reducing the reliance on them.
• Our framework encourages QA models to learn how to resolve the conversational dependency via consistency regularization. To the best of our knowledge, our work is the first to apply the consistency training framework to the CQA task.
• We demonstrate the effectiveness of our framework on three CQA benchmarks. Our framework is model-agnostic and systematically improves the performance of QA models.

Task Formulation
In CQA, a single instance is a dialogue, which consists of an evidence document $d$, a list of questions $q = [q_1, ..., q_T]$, and a list of answers for the questions $a = [a_1, ..., a_T]$, where $T$ represents the number of turns in the dialogue. For the $t$-th turn, the question $q_t$ and the conversation history $H_t = [(q_1, a_1), ..., (q_{t-1}, a_{t-1})]$ are given, and a model should extract the answer from the evidence document as:

$$\hat{a}_t = \arg\max_{a_t} P(a_t \mid d, q_t, H_t) \tag{1}$$

where $P(\cdot)$ represents a likelihood function over all the spans in the evidence document, and $\hat{a}_t$ is the predicted answer. Unlike single-turn QA, since the current question $q_t$ is dependent on the conversation history $H_t$, it is important to effectively encode the conversation history and resolve the conversational dependency in CQA.
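The span selection in the task formulation can be illustrated with a minimal sketch. It assumes, as is common for BERT-style extractive readers but not stated in this section, that a span's score factorizes into a start score and an end score; the function name `best_span` and the default length limit are illustrative choices (30 matches the maximum answer length reported in Appendix B).

```python
from typing import List, Tuple


def best_span(start_scores: List[float],
              end_scores: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score.

    Assumption: the score of a span is start_scores[s] + end_scores[e],
    and a span may cover at most max_answer_len tokens.
    """
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start, within the limit.
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

In a conversational setting, the scores would come from a model that has encoded the document together with the current question and the history; the search over spans itself is unchanged.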

End-to-end Approach
A naive approach in solving CQA is to train a model in an end-to-end manner (Figure 2a). Since standard QA models generally are ineffective in the CQA task, most studies attempt to develop a QA model structure or mechanism for encoding the conversation history effectively (Yeh and Chen, 2019; Qu et al., 2019a,b). Although these efforts improved performance on the CQA benchmarks, existing models remain limited in understanding conversational context. In this paper, we emphasize that QA models can be further improved with explicit guidance by using self-contained questions effectively.

Figure 2: Overview of the end-to-end approach, the pipeline approach, and ours. In the end-to-end approach, QA models are asked to answer the original questions based on the conversation history. In the pipeline approach, the self-contained questions are generated by a QR model, and then QA models answer them. Standard QA models are commonly used in this approach; however, conversational QA models that encode the history can be adopted (the dotted line in Figure (b)). In ours, the original and self-contained questions are jointly encoded to train QA models with the consistency loss.

Pipeline Approach
Recent studies decompose the task into two sub-tasks to reduce the complexity of the CQA task. The first sub-task, question rewriting, involves generating self-contained questions by reformulating the original questions. Neural QR models are commonly used to obtain self-contained questions (Lin et al., 2020; Vakulenko et al., 2020). The QR models are trained on the CANARD dataset (Elgohary et al., 2019), which consists of 40K pairs of original QuAC questions and their self-contained versions that are generated by human annotators.
After generating the self-contained questions, the next sub-task, question answering, is carried out. Since it is assumed that the dependencies in the questions have already been resolved by QR models, existing works usually use standard QA models (not specialized to CQA); however, conversational QA models can also be used (the dotted line in Figure 2b). We formulate the process of predicting the answer in the pipeline approach as:

$$P(a_t \mid d, q_t, H_t) \approx P^{\text{rewr}}(\tilde{q}_t \mid q_t, H_t) \cdot P^{\text{read}}(a_t \mid d, \tilde{q}_t) \tag{2}$$

where $P^{\text{rewr}}(\cdot)$ and $P^{\text{read}}(\cdot)$ are the likelihood functions of QR and QA models, respectively, and $\tilde{q}_t$ is a self-contained question rewritten by the QR model. The main limitation of the pipeline approach is that QA models are never trained on the original questions, which limits their abilities to understand the conversational context. Moreover, this approach makes QA models dependent on QR models; hence QA models suffer from the error propagation from QR models. On the other hand, our framework enhances QA models' reasoning abilities for CQA by jointly utilizing original and self-contained questions. In addition, QA models in our framework do not rely on QR models at inference time and thus do not suffer from error propagation.
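The two-stage inference of the pipeline approach can be sketched as follows. This is a minimal illustration, not the implementation of any cited system: `rewrite` and `read` are hypothetical callables standing in for a QR model and a standard QA model.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # previous (question, answer) turns


def pipeline_answer(question: str,
                    history: History,
                    document: str,
                    rewrite: Callable[[str, History], str],
                    read: Callable[[str, str], str]) -> str:
    """Answer a conversational question with a rewrite-then-read pipeline.

    rewrite: hypothetical QR model, maps (question, history) to a
             self-contained question.
    read:    hypothetical QA model, maps (document, question) to an answer.
    """
    # Stage 1: resolve the conversational dependency with the QR model.
    self_contained = rewrite(question, history)
    # Stage 2: the reader only sees the rewritten question, so any
    # rewriting error is passed on to the answer prediction.
    return read(document, self_contained)
```

The sketch makes the dependency explicit: the reader never receives the original question or the history, which is why a faulty rewrite cannot be corrected downstream.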

EXCORD: Explicit Guidance on Resolving Conversational Dependency
We introduce a unified framework that jointly encodes the original and self-contained questions as illustrated in Figure 2c. Our framework consists of two stages: (1) generating self-contained questions using a QR model ( §3.1) and (2) training a QA model with the original and self-contained questions via consistency regularization ( §3.2).

Question Rewriting
Similar to the pipeline approach, we utilize a QR model to obtain self-contained questions. We use the obtained questions for explicit guidance in the next stage. As shown in Equation 2, the QR task is to generate a self-contained question given an original question and a conversation history. Following Lin et al. (2020), we adopt a T5-based sequence generator (Raffel et al., 2020) as our QR model, which achieves performance comparable to that of humans in QR. For training and evaluating the QR model, we use the CANARD dataset following previous works on QR (Lin et al., 2020; Vakulenko et al., 2020). During inference, we utilize the top-k random sampling decoding based on beam search with the adjustment of the softmax temperature (Fan et al., 2018; Xie et al., 2019).
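To make the decoding strategy concrete, the sketch below shows a single step of top-k random sampling with a softmax temperature. It is a simplified illustration only: it does not include the beam search component mentioned above, and the function name `sample_top_k` and its parameters are hypothetical.

```python
import math
import random
from typing import List, Optional


def sample_top_k(logits: List[float],
                 k: int,
                 temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token index from the k highest-scoring candidates.

    Assumes k >= 1 and temperature > 0. A temperature below 1 sharpens
    the distribution; a temperature above 1 flattens it.
    """
    rng = rng or random.Random()
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights (shifted by the max for stability).
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

Sampling among several high-scoring candidates, rather than always taking the single best token, yields more varied rewrites of the same question.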

Consistency Regularization
Our goal is to enhance the QA model's ability to understand conversational context. Accordingly, we use consistency regularization (Laine and Aila, 2016; Xie et al., 2019), which enforces a model to make consistent predictions in response to transformations to the inputs. We encourage the model's predicted answers from the original questions to be similar to those from the self-contained questions (§3.1). Our consistency loss is defined as:

$$\mathcal{L}_{\text{cons}} = \mathrm{KL}\big(P^{\text{read}}_{\theta}(a_t \mid d, q_t, H_t) \,\|\, P^{\text{read}}_{\bar{\theta}}(a_t \mid d, \tilde{q}_t, H_t)\big) \tag{3}$$

where $\mathrm{KL}(\cdot)$ represents the Kullback-Leibler divergence function between two probability distributions, $\theta$ denotes the model's parameters, and $\bar{\theta}$ denotes a fixed copy of $\theta$.
With the consistency loss, QA models are regularized to make consistent predictions, regardless of whether the given question is self-contained or not. In order to output an answer distribution that is closer to $P^{\text{read}}_{\bar{\theta}}(a_t \mid d, \tilde{q}_t, H_t)$, QA models should treat original questions as if they were rewritten into self-contained questions by referring to the conversation history. Through this process, our consistency regularization method serves as explicit guidance that encourages QA models to resolve the conversational dependency. In our framework, $P^{\text{read}}_{\theta}(a_t \mid \cdot)$ is the answer span distribution over all evidence document tokens. In contrast to Asai and Hajishirzi (2020), by using all probability values in the answer distributions, the signals of self-contained questions can be effectively propagated to the QA model. In addition to using all probability values, we also sharpened the target distribution $P^{\text{read}}_{\bar{\theta}}(a_t \mid d, \tilde{q}_t, H_t)$ by adjusting the temperature (Xie et al., 2019) to strengthen the QA model's training signal.
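Finally, we calculate the final loss as:

$$\mathcal{L} = \mathcal{L}_{\text{orig}} + \lambda_1 \mathcal{L}_{\text{self}} + \lambda_2 \mathcal{L}_{\text{cons}}$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters and $\mathcal{L}_{\text{cons}}$ is the consistency loss. $\mathcal{L}_{\text{orig}}$ and $\mathcal{L}_{\text{self}}$ are calculated by the negative log-likelihood between the predicted answers and gold standards given the original and self-contained questions, respectively.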
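The interplay of the three loss terms can be illustrated with a small numerical sketch in plain Python. It rests on several simplifying assumptions that are not taken from the paper: a single distribution over token positions stands in for the answer span distribution, the KL direction is from the original-question distribution to the sharpened self-contained one, and the default values of `lambda_2` and `temperature` are merely picked from the search ranges reported in Appendix B. All function names are hypothetical. In actual training, the target would be computed with the fixed parameter copy so that no gradient flows through it.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; temperature < 1 sharpens the output."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def nll(probs: List[float], gold_index: int, eps: float = 1e-12) -> float:
    """Negative log-likelihood of the gold position."""
    return -math.log(probs[gold_index] + eps)


def excord_style_loss(orig_logits: List[float],
                      self_logits: List[float],
                      gold_index: int,
                      lambda_1: float = 0.5,
                      lambda_2: float = 0.5,
                      temperature: float = 0.9) -> float:
    """Combine the two QA losses with a consistency term (illustrative)."""
    p_orig = softmax(orig_logits)   # prediction given the original question
    p_self = softmax(self_logits)   # prediction given the rewritten question
    # Sharpened target; treated as a constant (no gradient) in real training.
    target = softmax(self_logits, temperature)
    loss_orig = nll(p_orig, gold_index)
    loss_self = nll(p_self, gold_index)
    loss_cons = kl_divergence(p_orig, target)
    return loss_orig + lambda_1 * loss_self + lambda_2 * loss_cons
```

When both inputs produce identical logits and the temperature is 1.0, the consistency term vanishes and only the two negative log-likelihood terms remain, which is a quick sanity check for any implementation of this kind.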
Comparison with previous works Consistency training has mainly been studied as a method for regularizing model predictions to be invariant to small noise injected into the input samples (Sajjadi et al., 2016; Laine and Aila, 2016; Miyato et al., 2016; Xie et al., 2019). The intuition behind consistency training is to push noisy inputs closer towards their original versions. Therefore, only the original parameters (i.e., $\theta$) are updated, while the copied model parameters (i.e., $\bar{\theta}$) are fixed. In contrast to the original concept of consistency training, our goal is to go in the opposite direction and update the original parameters. Thus, we fix the parameters $\bar{\theta}$ with self-contained questions, and solely update $\theta$ for each training step as shown in Equation 3.

Experiments
In this section, we describe our experimental setup and compare our framework to baseline approaches (i.e., the end-to-end and pipeline approaches).

Datasets
QuAC QuAC comprises 100k QA pairs in information-seeking dialogues, where a student asks questions based on a topic with background information provided, and a teacher provides the answers in the form of text spans in Wikipedia documents. Since the test set is only available in the QuAC challenge (footnote 5), we evaluate models on the development set. For validation, we use a subset of the original training set of QuAC, which consists of questions that correspond to the self-contained questions in CANARD's development set. The remaining data is used for training.
CANARD CANARD (Elgohary et al., 2019) consists of 31K, 3K, and 5K QA pairs for training, development, and test sets, respectively. The questions in CANARD are generated by rewriting a subset of the original questions in QuAC. We use the training and development sets for training and validating QR models, and the test set for evaluating QA models.
CoQA CoQA (Reddy et al., 2019) consists of 127K QA pairs and evidence documents in seven domains. In terms of the question distribution, CoQA significantly differs from QuAC (see §5.3). We use CoQA to test the transferability of EXCORD, where a QR model trained on CANARD generates the self-contained questions in a zero-shot manner. Subsequently, we train a QA model by using the original and synthetic questions. Similar to QuAC, the test set of CoQA is solely available in the CoQA challenge (footnote 6). Therefore, we randomly sample 5% of the QA dialogues in the training set and adopt them as our development set.

Metrics
Following the QuAC evaluation protocol, we use F1, HEQ-Q, and HEQ-D for QuAC and CANARD. HEQ-Q measures whether a model finds more accurate answers than humans (or the same answers) for a given question. HEQ-D measures the same thing, but for a given dialog instead of a question. For CoQA, we report the F1 scores for each domain (children's story, literature from Project Gutenberg, middle and high school English exams, news articles from CNN, Wikipedia) and the overall F1 score, as suggested by Reddy et al. (2019).
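The metrics above can be sketched in a few lines. This is a simplified illustration: the official evaluation scripts additionally normalize text (for example punctuation and articles) and handle multiple reference answers, which is omitted here. Reading HEQ-D as "every question in the dialog matches or exceeds human F1" is our interpretation of the description above, and all function names are hypothetical.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def heq_q(model_f1s: List[float], human_f1s: List[float]) -> float:
    """Fraction of questions where the model matches or exceeds human F1."""
    hits = sum(m >= h for m, h in zip(model_f1s, human_f1s))
    return hits / len(model_f1s)


def heq_d(dialog_model_f1s: List[List[float]],
          dialog_human_f1s: List[List[float]]) -> float:
    """Fraction of dialogs where every question matches or exceeds human F1."""
    hits = sum(all(m >= h for m, h in zip(ms, hs))
               for ms, hs in zip(dialog_model_f1s, dialog_human_f1s))
    return hits / len(dialog_model_f1s)
```

Because HEQ-D requires success on every turn of a dialog, it is a much stricter criterion than HEQ-Q and is typically far lower.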

QA models
Note that the baseline approaches and our framework do not limit the structure of QA models. For a fair comparison of the baseline approaches and EXCORD, we test the same QA models in all approaches. The selected QA models are commonly used and have been proven to be effective in CQA.
Footnote 5: https://quac.ai/
Footnote 6: https://stanfordnlp.github.io/coqa/
BERT BERT is a contextualized word representation model that is pretrained on large corpora. BERT also works well on CQA datasets, although it is not designed for CQA. It receives the evidence document, current question, and conversation history of the previous turn as input.
BERT+HAE BERT+HAE is a BERT-based QA model with a CQA-specific module. Following Qu et al. (2019a), we add the history answer embedding (HAE) to BERT's word embeddings. HAE encodes the information of the answer spans from the previous questions.
RoBERTa RoBERTa improves BERT by using pretraining techniques to obtain robustly optimized weights on larger corpora. In our experiments, we found that RoBERTa performs well in CQA, achieving performance comparable to the previous SOTA model, HAM (Qu et al., 2019b), on QuAC. Thus, we adopt RoBERTa as our main baseline model owing to its simplicity and effectiveness. It receives the same input as BERT, unless otherwise specified.

Implementation Details
The CANARD training set provides 31,527 self-contained questions from the original QuAC questions. Therefore, we can obtain 31,527 pairs of original and self-contained questions without question rewriting. For the rest of the original questions, we automatically generate self-contained questions by using our QR model. Finally, we obtain 83,568 question pairs and use them in our consistency training. We denote the original questions, self-contained questions generated by humans, and self-contained questions generated by a QR model as $Q$, $\tilde{Q}_{\text{human}}$, and $\tilde{Q}_{\text{syn}}$, respectively. Additional implementation details are described in Appendix B.
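The pairing procedure can be sketched as follows. The function and argument names are hypothetical, and `rewrite` stands in for a QR model that is assumed to already have access to the relevant conversation history. If the 83,568 pairs include the 31,527 human rewrites, as the description suggests, then 83,568 - 31,527 = 52,041 pairs rely on synthetic rewrites.

```python
from typing import Callable, Dict, List, Tuple


def build_question_pairs(original_questions: Dict[str, str],
                         human_rewrites: Dict[str, str],
                         rewrite: Callable[[str], str]
                         ) -> List[Tuple[str, str, str]]:
    """Pair each original question with a self-contained version.

    original_questions: question id -> original question text.
    human_rewrites:     question id -> human-written rewrite, where available.
    rewrite:            hypothetical QR model wrapper used for the rest.
    Returns (original, self_contained, source) triples, where source is
    "human" or "synthetic".
    """
    pairs = []
    for qid, question in original_questions.items():
        if qid in human_rewrites:
            # Prefer the human rewrite when one exists.
            pairs.append((question, human_rewrites[qid], "human"))
        else:
            # Otherwise fall back to an automatically generated rewrite.
            pairs.append((question, rewrite(question), "synthetic"))
    return pairs
```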

Results
Table 1 presents the performance comparison of the baseline approaches to our framework on QuAC and CANARD. Compared to the end-to-end approach, EXCORD consistently improves the performance of QA models on both datasets. These improvements are significant: EXCORD improves the performance of RoBERTa by 1.2 and 2.3 absolute F1 and of BERT by 1.2 and 5.2 absolute F1 on QuAC and CANARD, respectively. From these results, we conclude that consistency training with original and self-contained questions enhances the ability of QA models to understand the conversational context. On QuAC, the pipeline approach underperforms the end-to-end approach for all baseline models. This indicates that training a QA model solely with self-contained questions is ineffective when human rewrites are not given at the inference phase. On the other hand, EXCORD improves QA models by using both types of questions. As presented in Table 1, our framework significantly outperforms the baseline approaches on QuAC.
On CANARD, the pipeline approach is significantly more effective than the end-to-end approach. Since QA models are trained with self-contained questions in the pipeline approach, they perform well on CANARD questions. Nevertheless, EXCORD still outperforms the pipeline approach in most cases. Compared to the pipeline approach, our framework improves the performance of RoBERTa by 1.2 absolute F1.

Analysis and Discussion
We elaborate on analyses regarding component ablation and transferability. We also describe a case study carried out to highlight the differences between our approach and the baseline approaches.

Ablation Study
In this section, we explore in detail the factors contributing to this improvement: (1) using self-contained questions that are rewritten by humans ($\tilde{Q}_{\text{human}}$) as additional data, (2) using self-contained questions that are synthetically generated by the QR model ($\tilde{Q}_{\text{syn}}$), and (3) training a QA model with our consistency framework. In Table 2, we present the performance gaps when each component is removed from our framework. We use RoBERTa on QuAC in this experiment.
Table 2: Effect of self-contained questions and our consistency framework. We use RoBERTa in this experiment.
We first explore the effects of $\tilde{Q}_{\text{human}}$ and $\tilde{Q}_{\text{syn}}$. As shown in Table 2, excluding $\tilde{Q}_{\text{human}}$ degrades the performance of RoBERTa in our framework. Although automatically generated, $\tilde{Q}_{\text{syn}}$ also contributes to the performance improvement. Therefore, both types of self-contained questions are useful in our framework.
To investigate the effect of our framework, we simply add $\tilde{Q}_{\text{human}}$ and $\tilde{Q}_{\text{syn}}$ to the original questions ($Q_{\text{orig}}$) as additional training data, which we call Question Augment (question data augmentation). We find that Question Augment slightly improves the performance of RoBERTa on CANARD, whereas it degrades the performance on QuAC. This shows that simply augmenting the questions is ineffective and does not guarantee improvement. On the other hand, our consistency training approach significantly improves performance, showing that EXCORD is a more effective way of utilizing self-contained questions.

Case Study
We analyze several cases that the baseline approaches answered incorrectly, but our framework answered correctly. We also explore how our framework improves the reasoning ability of QA models, compared to the baseline approaches. These cases are obtained from the development set of QuAC.

Table 3: Error analysis for predictions of RoBERTa models that are trained with the baseline approaches and EXCORD. In the first case, the QA model trained with the end-to-end approach fails to resolve the conversational dependency. The QR model in the second case misunderstands the "my," and generates an unnatural question, triggering an incorrect prediction.
Error case #1
Title: Montgomery Clift
Section Title: Film career
Document $d$: · · · His second movie was The Search. Clift was unhappy with the quality of the script, and edited it himself. The movie was awarded a screenwriting Academy Award for the credited writers. · · ·
$q_1$: When did Clift start his film career?
$a_1$: His first movie role was opposite John Wayne in Red River, which was shot in 1946 and released in 1948.
Current Question $q_2$: Was the film a success?
Human Rewrite $r_2$: Was Montgomery Clift's film Red River a success?
Golden Answer: CANNOTANSWER
Prediction of End-to-End: The movie was awarded a screenwriting Academy Award for the credited writers.
The first case in Table 3 shows the predictions of the two RoBERTa models trained in the end-to-end approach and our framework, respectively. Note that "the film" in the current question does not refer to "The Search" (red box) in the document $d$, but to "Red River" (blue box) in $a_1$. When trained in the end-to-end approach, the model failed to comprehend the conversational context and misunderstood what "the film" refers to, resulting in an incorrect prediction. On the other hand, when trained in EXCORD, the model predicted the correct answer because EXCORD enhances its ability to resolve conversational dependency.
In the second case, we compare the pipeline approach to EXCORD. In this case, the QR model misunderstood "my" in the current question as a pronoun and replaced it with the band's name, "Train's." Consequently, the QA model received the erroneous self-contained question, resulting in an incorrect prediction. On the other hand, the QA model trained in our framework predicted the correct answer based on the original question q 6 .

Transferability
We train a QR model to rewrite QuAC questions into CANARD questions. Then, self-contained questions can be generated for the samples that do not have human rewrites. This results in the improvement of QA models' performance on QuAC and CANARD ( §4.5). However, it is questionable whether the QR model can successfully rewrite questions when the original questions significantly differ from those in QuAC. To answer this, we test our framework on another CQA dataset, CoQA. We first analyze how the question distributions of QuAC and CoQA differ. We found that question types in QuAC and CoQA are significantly different, such that QR models could suffer from the gap of question distributions between two datasets. (See details in Appendix A).
To test the transferability of EXCORD, we compare the end-to-end approach to our framework on the CoQA dataset. Using a QR model trained on  CANARD, we generate the self-contained questions for CoQA and train QA models with our framework. As presented in Table 4, our framework performs well on CoQA. The improvement in BERT is 0.5 based on the overall F1, and the performance of RoBERTa is also improved by an overall F1 of 0.6. Improvements are also consistent in most of the documents' domains. Therefore, we conclude that our framework can be simply extended to other datasets and improve QA performance even when question distributions are significantly different. We plan to improve the transferability of our framework by fine-tuning QR models on target datasets in future work.

Related Work
Conversational Question Answering Recently, several works introduced CQA datasets such as QuAC and CoQA (Reddy et al., 2019). We classify the methods proposed for these datasets into two approaches: (1) end-to-end and (2) pipeline. Most works based on the end-to-end approach focused on developing a model structure (Zhu et al., 2018; Ohsugi et al., 2019; Qu et al., 2019a,b) or a training strategy, such as multi-task learning with rationale tagging (Ju et al., 2019), specialized for the CQA task or datasets. Several works demonstrated the effectiveness of the flow mechanism in CQA (Yeh and Chen, 2019).
With the advent of a dataset consisting of self-contained questions rewritten by human annotators (Elgohary et al., 2019), the pipeline approach has recently drawn attention as a promising method for CQA (Vakulenko et al., 2020). The approach is particularly useful for the open-domain CQA or passage-retrieval (PR) tasks (Dalton et al., 2019; Ren et al., 2020; Anantha et al., 2020; Qu et al., 2020) since self-contained questions can be fed into existing non-conversational search engines such as BM25. Note that our framework can be used jointly with the pipeline approach in the open-domain setting because our framework can improve QA models' ability to find the answers from the retrieved documents. We will test our framework in the open-domain setting in future work.
Question Rewriting QR has been studied for augmenting training data (Buck et al., 2018; Sun et al., 2018; Zhu et al., 2019) or clarifying ambiguous questions (Min et al., 2020). In CQA, QR can be viewed as a task of simplifying difficult questions that include anaphora and ellipsis in a conversation. Elgohary et al. (2019) first proposed the question rewriting task as a sub-task of CQA and the CANARD dataset for the task, which consists of pairs of original and self-contained questions that are generated by human annotators. Vakulenko et al. (2020) used a coreference-based model (Lee et al., 2018) and GPT-2 (Radford et al., 2019) as QR models and tested the models in the QR and PR tasks. Lin et al. (2020) conducted the QR task using T5 (Raffel et al., 2020) and achieved performance comparable to humans on CANARD. Following Lin et al. (2020), we use T5 in our experiments to generate high-quality questions for enhancing QA models.
Consistency Training Consistency regularization (Laine and Aila, 2016; Sajjadi et al., 2016) has been mainly explored in the context of semi-supervised learning (SSL) (Chapelle et al., 2009; Oliver et al., 2018), which has been adopted in the textual domain as well (Miyato et al., 2016; Clark et al., 2018; Xie et al., 2020). However, the consistency training framework is also applicable when only the labeled samples are available (Miyato et al., 2018; Jiang et al., 2019; Asai and Hajishirzi, 2020). Consistency regularization requires adding noise to the sample, which can be either discrete (Xie et al., 2020; Asai and Hajishirzi, 2020) or continuous (Miyato et al., 2016; Jiang et al., 2019). Existing works regularize the predictions for the perturbed samples to match those for the originals. On the other hand, our method encourages the model's predictions for the original questions to be similar to those for the rewritten questions, i.e., synthetic ones.

Conclusion
We propose a consistency training framework for conversational question answering, which enhances QA models' abilities to understand conversational context. Our framework leverages both the original and self-contained questions for explicit guidance on how to resolve conversational dependency. In our experiments, we demonstrate that our framework significantly improves the QA model's performance on QuAC and CANARD, compared to the existing approaches. In addition, we verified that our framework can be extended to CoQA. In future work, the transferability of our framework can be further improved by fine-tuning the QR model on target datasets. Furthermore, future work would include applying our framework to the open-domain setting.

Table 5: Comparison of questions in QuAC and CoQA. On the left side (QuAC), we can observe several question types that are frequently used in QuAC: unanswerable question ($q_1$) and "Anything else?" question ($q_2$). The current question $q_3$ refers to the previous answer (green box) and the background information (blue box). On the other hand, on the right side (CoQA), the current question $q_4$ omits the question word that is used in the previous question (yellow box).

QuAC
Title: Scott Walker (politician)
Section Title: Education
$q_1$: What kind of education did Scott Walker have?
$a_1$: CANNOTANSWER
$q_2$: Are there any other interesting aspects about this article?
$a_2$: signed a law to fund evaluation of the reading skills of kindergartners as part of an initiative to ensure that students are reading at or above grade level
Current Question $q_3$: What other programs did he sign?
Self-contained Question $\tilde{q}_3$: What other programs did Scott Walker sign other than a law to fund evaluation -

CoQA
$q_1$: Is the US dollar on a decimal system?
$a_1$: U.S. dollar is based upon a decimal system of values.
$q_2$: What country's dollar is not?
$a_2$: Unlike the Spanish milled dollar the U.S. dollar is based upon a decimal system of values.
$q_3$: What is a mill?
$a_3$: In addition to the dollar the coinage act officially established monetary units of mill or one-thousandth of a dollar
Current Question $q_4$: And a cent?
Self-contained Question $\tilde{q}_4$: What is a cent?

A Comparison of Questions in QuAC and CoQA
Before testing the transferability of EXCORD (§5.3), we compare the question distribution of QuAC to that of CoQA. The types of questions are significantly different due to the difference in task setups. When questions were generated in QuAC, evidence documents were provided solely to answerers, but not to questioners. This setup prevented questioners from referring to the evidence documents, which encouraged the questioners to ask natural and information-seeking questions. By contrast, when creating CoQA, questioners and answerers shared the same evidence documents. Examples of QuAC and CoQA are presented in Table 5 and the categorization of question types in Table 6. The results are as follows: (1) QuAC has more non-factoid questions. Approximately half of QuAC questions are non-factoid, whereas more than 60% of questions in CoQA can be answered with either entities or noun phrases. (2) "Anything else?" questions are more frequently observed in QuAC. When questioners cannot find what to ask, they use "Anything else?" questions to seek new topics and continue the conversation. In CoQA, questioners rarely used the "Anything else?" question (2.8%) since they did not need to seek new topics. This type of question is observed in Table 5 ($q_2$ on the left side). (3) CoQA has few unanswerable questions. Since questioners and answerers share the evidence documents when creating CoQA, only 1.3% of its questions are unanswerable. However, approximately 20% of questions in QuAC are unanswerable.

B Hyperparameters
Our implementation is based on PyTorch. We implemented BERT using the Transformers library. We implemented the T5-based QR model using the Transformers library and adopted the same QR model in the pipeline approach and EXCORD. We use a single 24GB GPU (RTX TITAN) for the experiments.
We measured the F1 scores on the development set every 4k training steps, and adopted the best-performing models. We trained QA models with the AdamW optimizer and a learning rate of 3e-5. We set the maximum input sequence length to 512 and the maximum answer length to 30. We set the maximum query length to 128 for all approaches since self-contained questions are usually longer than original questions. We use a batch size of 12 for BERT and RoBERTa in all baseline approaches. For EXCORD, we set the coefficient $\lambda_1$ for the QA loss on rewritten questions to 0.5. We also search the coefficient $\lambda_2$ for the consistency loss within the range of [0.7, 0.5] and the softmax temperature within the range of [1.0, 0.9] (Xie et al., 2019).