Ditch the Gold Standard: Re-evaluating Conversational Question Answering

Conversational question answering aims to provide natural-language answers to users in information-seeking conversations. Existing conversational QA benchmarks compare models with pre-collected human-human conversations, using ground-truth answers provided in conversational history. It remains unclear whether we can rely on this static evaluation for model development and whether current systems can generalize well to real-world human-machine conversations. In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems, where human evaluators converse with models and judge the correctness of their answers. We find that the distribution of human-machine conversations differs drastically from that of human-human conversations, and there is a disagreement between human and gold-history evaluation in terms of model ranking. We further investigate how to improve automatic evaluations, and propose a question rewriting mechanism based on predicted history, which better correlates with human judgments. Finally, we analyze the impact of various modeling strategies and discuss future directions towards building better conversational question answering systems.


Introduction
Conversational question answering (CQA) aims to build machines to answer questions in conversations and has the promise to revolutionize the way humans interact with machines for information seeking. With recent development of large-scale English CQA datasets (Choi et al., 2018; Saeidi et al., 2018; Reddy et al., 2019; Campos et al., 2020), rapid progress has been made in better modeling of conversational QA systems.
Current CQA datasets are collected by crowdsourcing human-human conversations, where the questioner asks questions about a specific topic, and the answerer provides answers based on an evidence passage and the conversational history. When evaluating CQA systems, a set of held-out conversations is used to ask models questions in turn. Since the evaluation builds on pre-collected conversations, the gold history of the conversation is always provided, regardless of models' actual predictions (Figure 1). Although current systems achieve near-human F1 scores on this static evaluation, it is questionable whether this can faithfully reflect models' true performance in real-world applications. To what extent do human-machine conversations deviate from human-human conversations? What will happen if models have no access to ground-truth answers in a conversation?
To answer these questions and better understand the performance of CQA systems, we carry out the first large-scale human evaluation with four state-of-the-art models on the QuAC dataset (Choi et al., 2018) by having human evaluators converse with the models and judge the correctness of their answers. We collected 1,446 human-machine conversations in total, with 15,059 question-answer pairs. Through careful analysis, we notice a significant distribution shift from human-human conversations and identify a clear inconsistency of model performance between the current evaluation protocol and human judgments.
This finding motivates us to improve the automatic evaluation such that it is better aligned with human evaluation. Mandya et al. (2020) and Siblini et al. (2021) identify a similar issue in gold-history evaluation and propose to use models' own predictions for automatic evaluation. However, predicted-history evaluation poses another challenge: since all the questions have been collected beforehand, using predicted history will invalidate some of the questions because of changes in the conversational history (see Figure 1 for an example).
arXiv:2112.08812v1 [cs.CL] 16 Dec 2021
[Figure 1: an example conversation on the topic "Spandau Ballet (English pop band)", opening with the question "What was the band's first success album at the international level?", shown in three panels: human evaluation, automatic evaluation w/ gold history, and automatic evaluation w/ predicted history.]
Following this intuition, we propose a question rewriting mechanism, which automatically detects and rewrites invalid questions with predicted history (Figure 4). We use a coreference resolution model (Lee et al., 2018) to detect inconsistency of coreference in question text conditioned on predicted history and gold history, and then rewrite those questions by substituting with correct mentions, so that the questions are resolvable in the predicted context. Compared to predicted-history evaluation, we find that incorporating this rewriting mechanism aligns better with human evaluation.
Finally, we also investigate the impact of different modeling strategies based on human evaluation. We find that both accurately detecting unanswerable questions and explicitly modeling question dependencies in conversations are crucial for model performance. Equipped with all the insights, we discuss directions for CQA modeling. We release our human evaluation dataset and hope that our findings can shed light on future development of better conversational QA systems.

Evaluation of Conversational QA
In each episode of CQA evaluation, there is an evidence passage $P$, a (human) questioner $H$ that has no access to $P$, and a model $M$ that has access to $P$. The questioner asks questions about $P$ and the model answers them based on $P$ and the conversational history thus far (see an example in Figure 1). Formally, for the $i$-th turn, the human asks a question based on the previous conversation,
$$Q_i \sim H(Q_1, A_1, \ldots, Q_{i-1}, A_{i-1}),$$
and then the model answers it based on both the history and the passage,
$$A_i \sim M(P, Q_1, A_1, \ldots, Q_{i-1}, A_{i-1}, Q_i),$$
where $Q_i$ and $A_i$ represent the question and the answer at the $i$-th turn. If the question is unanswerable from $P$, we simply denote $A_i$ as CANNOT ANSWER. The model $M$ is then evaluated by the correctness of answers.
Evaluating CQA systems requires a human in the loop and is hence expensive. Instead, current CQA benchmarks use automatic evaluation with gold history (Auto-Gold). For example, the QuAC dataset (Choi et al., 2018) collects a set of human-human conversations for automatic evaluation. For each passage, one annotator asks questions without seeing the passage, while the other annotator provides the answers. Denote the collected questions and answers as $Q^*_i$ and $A^*_i$. In gold-history evaluation, the model is queried with the pre-collected questions $Q^*_i$ and the gold answers as history:
$$A_i \sim M(P, Q^*_1, A^*_1, \ldots, Q^*_{i-1}, A^*_{i-1}, Q^*_i),$$
and we evaluate the model by comparing $A_i$ to $A^*_i$ (measured by word-level F1). This process does not require human effort but cannot truly reflect the distribution of human-machine conversations, because the human questioner may ask different questions based on different model predictions.
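The word-level F1 used to compare a predicted answer with a gold answer can be sketched as follows. This is a minimal illustration that assumes lowercased whitespace tokenization; any further normalization or multi-reference handling in the official QuAC scorer is not reproduced, and `word_f1` is a hypothetical helper name.

```python
from collections import Counter


def word_f1(prediction: str, gold: str) -> float:
    """Word-level F1 between two answer strings (simplified sketch)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because the score depends only on token overlap with the stored gold span, a correct answer phrased differently from the gold span can still receive a low score.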
In this work, we choose the QuAC dataset (Choi et al., 2018) as our primary evaluation because it is closer to real-world information-seeking conversations, where the questioner cannot see the evidence passage. It prevents the questioner from asking questions that simply overlap with the passage and encourages unanswerable questions. QuAC also adopts extractive question answering that restricts the answer to a span of text, which is generally considered easier to evaluate.

Models
For human evaluation and analysis, we choose the following four CQA models with different model architectures and training strategies:
BERT. A simple BERT baseline, which concatenates the passage, the previous two turns of question-answer pairs, and the question as the input and predicts the answer as in Devlin et al. (2019).
GraphFlow. Chen et al. (2020) propose a recurrent graph neural network on top of BERT embeddings to model the dependencies between the question, the history and the passage.
HAM. Qu et al. (2019) propose a history attention mechanism (HAM) to softly select the most relevant previous turns.
ExCorD. Kim et al. (2021) train a question rewriting model on CANARD (Elgohary et al., 2019) to generate context-independent questions, and then use both the original and the generated questions to train the QA model. This model achieves the current state-of-the-art on QuAC (67.7% F1).
For all the models except BERT, we use the original implementations for a direct comparison.

Human Evaluation
In this section, we carry out a large-scale human evaluation with the four models discussed above.

Conversation Collection
We collect human-machine conversations using 100 passages from the QuAC development set on Amazon Mechanical Turk. We also design a set of qualification questions to make sure that the annotators fully understand our annotation guideline. For each model and each passage, we collect three conversations from three different annotators.
We collect each conversation in two steps: (1) The annotator has no access to the passage and asks questions. The model extracts the answer span from the passage or returns CANNOT ANSWER in a human-machine conversation interface. We provide the title, the section title, the background of the passage, and the first question from QuAC as a prompt to annotators. Annotators are required to ask at least 8 and at most 12 questions. We encourage context-dependent questions, but also allow open questions like "What else is interesting" if asking a follow-up question is difficult.
(2) After the conversation ends, the annotator is shown the passage and asked to check whether the model predictions are correct or not. We noticed that the annotators are biased when evaluating the correctness of answers. For questions to which the model answered CANNOT ANSWER, annotators tend to mark the answer as incorrect without checking if the question is answerable. Additionally, for answers with the correct types (for example, a date as an answer to "When was it?"), annotators tend to mark it as correct without verifying from the passage. Therefore, we asked another group of annotators to verify question answerability and correctness.

Answer Validation
For each collected conversation, we ask two additional annotators to validate the annotations. First, each annotator reads the passage before seeing the conversation. Then, the annotator sees the question (and question only) and selects whether the question is (a) ungrammatical, (b) unanswerable, or (c) answerable. If the annotator chooses "answerable", the interface then reveals the answer and asks about its correctness. If the answer is "incorrect", the annotator selects the correct answer span from the passage. We discard all questions that both annotators find "ungrammatical" and the correctness is taken as the majority of the 3 annotations.
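The aggregation rule described above can be sketched as follows. This is only an illustration under stated assumptions: each of the three annotations (the original annotator plus the two validators) is reduced to a single boolean correctness vote, and `aggregate_correctness` is a hypothetical helper name rather than part of any released code.

```python
from typing import Optional, Sequence


def aggregate_correctness(
    correct_votes: Sequence[bool],
    ungrammatical_flags: Sequence[bool],
) -> Optional[bool]:
    """Return the majority correctness label, or None if the question is discarded.

    correct_votes: three correctness votes (assumed already reduced to booleans).
    ungrammatical_flags: one flag per validator (two validators).
    """
    if len(correct_votes) != 3 or len(ungrammatical_flags) != 2:
        raise ValueError("expected 3 correctness votes and 2 validator flags")
    # Discard the question only when both validators mark it ungrammatical.
    if all(ungrammatical_flags):
        return None
    # Majority of three votes.
    return sum(correct_votes) >= 2
```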
In total, we collected 1,446 human-machine conversations and 15,059 question-answer pairs. We release this collection as an important source that complements existing CQA datasets. The numbers of conversations and question-answer pairs collected for each model are shown in Table 1. The data distribution of this collection is very different from the original QuAC dataset (human-human conversations): we see more open questions and unanswerable questions, because model mistakes make the conversation flow less fluent and because models cannot provide feedback to questioners as human answerers do (more analysis in §6.2).

Annotator Agreement
Deciding the correctness of answers is challenging even for humans in some cases, especially when questions are short and ambiguous. We measure annotators' agreement and calculate the Fleiss' Kappa (Fleiss, 1971) on the agreement between annotators in the validation phase. We achieve κ = 0.598 (moderate agreement) of overall annotation agreement. Focusing on answerability annotation, we have κ = 0.679 (substantial agreement).
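For reference, Fleiss' Kappa can be computed from a table of per-item category counts as in the sketch below. It assumes every item is rated by the same number of annotators; the function name and input layout are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        return float("nan")  # undefined when all ratings fall in one category
    return (p_observed - p_expected) / (1.0 - p_expected)
```

For example, `fleiss_kappa([[2, 0], [0, 2]])` corresponds to two raters agreeing on both items, and `fleiss_kappa([[1, 1], [1, 1]])` to two raters disagreeing on both.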

Disagreements between Human and Gold-history Evaluation
We now compare the results from our human evaluation and automatic evaluation with gold history. Note that the two sets of numbers are not directly comparable: (1) the human evaluation reports accuracy, while the automatic evaluation reports F1 scores; (2) the absolute numbers of human evaluation are much higher than those of automatic evaluations. In automatic evaluations, the gold answers cannot capture all possible correct answers to open-ended questions or questions with multiple answers. However, the annotators can evaluate the correctness of answers easily in human evaluations. Nevertheless, we can compare relative rankings between different models. Figure 2 shows different trends between human evaluation and gold-history evaluation (Auto-Gold). Current standard evaluation cannot reflect model performance in human-machine conversations: (1) Human evaluation and Auto-Gold rank BERT and GraphFlow differently; (2) The gap between HAM and ExCorD is significant (>2% F1) in the automatic evaluation but the two models perform similarly in human evaluation.

Strategies for Automatic Evaluation
The inconsistency between human evaluation and gold-history evaluation suggests that we need better ways to evaluate and develop our CQA models. When deployed in realistic scenarios, the models would never have access to the ground truth (gold answers) in previous turns and are only exposed to the conversational history and the passage. Intuitively, we can simply replace gold answers by the predicted answers of models, and we name this predicted-history evaluation (Auto-Pred). Formally, the model makes predictions based on the questions and its own answers:
$$A_i \sim M(P, Q^*_1, A_1, \ldots, Q^*_{i-1}, A_{i-1}, Q^*_i).$$
This evaluation has been suggested by several recent works (Mandya et al., 2020; Siblini et al., 2021), which reported a significant performance drop using predicted history. We observe the same performance degradation, shown in Table 2. However, another issue naturally arises with predicted history: each $Q^*_i$ was written by the dataset annotators based on $(Q^*_1, A^*_1, \ldots, Q^*_{i-1}, A^*_{i-1})$, and it may become unnatural or invalid when the history is changed to $(Q^*_1, A_1, \ldots, Q^*_{i-1}, A_{i-1})$.
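The difference between gold-history and predicted-history evaluation comes down to what is appended to the history after each turn. The loop below is a minimal sketch of that difference; the `Model` callable signature and the `run_conversation` name are assumptions made for illustration and do not correspond to any particular system's interface.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]            # (question, answer) pairs so far
Model = Callable[[str, History, str], str]  # (passage, history, question) -> answer


def run_conversation(
    model: Model,
    passage: str,
    questions: List[str],
    gold_answers: List[str],
    use_predicted_history: bool,
) -> List[str]:
    """Ask the pre-collected questions in order and return the model's answers."""
    history: History = []
    predictions: List[str] = []
    for question, gold in zip(questions, gold_answers):
        prediction = model(passage, history, question)
        predictions.append(prediction)
        # Auto-Gold stores the gold answer; Auto-Pred stores the model's own answer.
        history.append((question, prediction if use_predicted_history else gold))
    return predictions
```

With `use_predicted_history=True`, an early mistake stays in the history for every later turn, which is how a pre-collected follow-up question can end up referring to something the model never said.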

Predicted History Invalidates Questions
We examined 100 QuAC conversations with the best-performing model (ExCorD) and identified three categories of invalid questions caused by predicted history. We find that 23% of the questions become invalid after using the predicted history. We summarize the types of invalid questions as follows (see detailed examples in Figure 3):
• Unresolved coreference (44.0%). The question becomes invalid for containing either a pronoun or a definite noun phrase that refers to an entity unresolvable without the gold history.
• Incoherence (39.1%). The question is incoherent with the conversation flow (e.g., mentioning an entity non-existent in predicted history). While humans may still answer the question using the passage, this leads to an unnatural conversation and a train-test discrepancy for models.
• Correct answer changed (16.9%). The answer to this question with the predicted history changes from when it is based on the gold history.
We further analyze the reasons for the biggest "unresolved coreference" category and find that the model either gives an incorrect answer to the previous question ("incorrect prediction", 39.8%), or the model predicts a different (yet correct) answer to an open question ("open question", 37.0%), or the model returns CANNOT ANSWER incorrectly ("no prediction", 9.5%), or the gold answer is longer than prediction and the next question depends on the extra part ("extra gold information", 13.6%). Invalid questions result in compounding errors, which may further affect how the model interprets the following questions.

Evaluation with Question Rewriting
Among all the invalid question categories, "unresolved coreference" questions are the most critical ones. They lead to incorrect interpretations of questions and hence wrong answers. We propose to improve our evaluation by incorporating a state-of-the-art coreference resolution system (Lee et al., 2018) to automatically detect invalid questions categorized as "unresolved coreference". We make the assumption that if the coreference model resolves mentions in $Q^*_i$ differently between using gold history $(Q^*_1, A^*_1, \ldots, A^*_{i-1}, Q^*_i)$ and predicted history $(Q^*_1, A_1, \ldots, A_{i-1}, Q^*_i)$, then $Q^*_i$ is identified as having an unresolved coreference issue.
Figure 4: An example of question rewriting. We rewrite the second question with the referent in the gold history, because predicted and gold history have different coreference results. We do not rewrite the third question as coreference results are the same.
Detecting invalid questions. The inputs to the coreference model for $Q^*_i$ are the following:
$$S^*_i = [BG; Q^*_1; A^*_1; \ldots; Q^*_{i-1}; A^*_{i-1}; Q^*_i],$$
$$S_i = [BG; Q^*_1; A_1; \ldots; Q^*_{i-1}; A_{i-1}; Q^*_i],$$
where BG is the background, and $S^*_i$ and $S_i$ denote the inputs for gold and predicted history. We are only interested in the entities mentioned in the current question $Q^*_i$, and we filter out named entities (e.g., the National Football League) because they can be understood without coreference resolution. After the coreference model returns entity cluster information given $S^*_i$ and $S_i$, we extract a list of entities $E^* = \{e^*_1, \ldots, e^*_{|E^*|}\}$ and $E = \{e_1, \ldots, e_{|E|}\}$. We say $Q^*_i$ is valid only if $E^* = E$, that is, $|E^*| = |E|$ and $e^*_j = e_j$ for every $j$, assuming $e^*_j$ and $e_j$ have a shared mention in $Q^*_i$. We determine whether $e^*_j = e_j$ by checking if $\mathrm{F1}(s^*_j, s_j) > 0$, where $s^*_j$ and $s_j$ are the first mentions of $e^*_j$ and $e_j$ respectively, and F1 is the word-level F1 score, i.e., $e^*_j = e_j$ as long as their first mentions have word overlap.
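The validity check can be sketched as below. The sketch assumes the coreference model has already been run and that, for each entity mentioned in the current question, the first mention of its cluster is available under gold history and under predicted history, aligned by position. All function names are hypothetical, and tokenization is simplified to lowercased whitespace splitting.

```python
from typing import Sequence


def first_mentions_match(gold_mention: str, predicted_mention: str) -> bool:
    """Word-level F1 > 0 holds exactly when the two mentions share a word."""
    gold_words = set(gold_mention.lower().split())
    predicted_words = set(predicted_mention.lower().split())
    return bool(gold_words & predicted_words)


def question_is_valid(
    gold_first_mentions: Sequence[str],
    predicted_first_mentions: Sequence[str],
) -> bool:
    """A question is valid only if both histories yield the same entity list."""
    if len(gold_first_mentions) != len(predicted_first_mentions):
        return False
    return all(
        first_mentions_match(gold, predicted)
        for gold, predicted in zip(gold_first_mentions, predicted_first_mentions)
    )
```

The overlap criterion is deliberately loose: two first mentions that share any single word are treated as the same entity, so this check can miss invalid questions as well as flag valid ones.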

Question rewriting through entity substitution.
Our first strategy is to substitute the entity names in $Q^*_i$ with entities in $E^*$, if $Q^*_i$ is invalid. The rewritten question, instead of the original one, will be used in the conversation history and fed into the model. We denote this evaluation method as rewritten-question evaluation (Auto-Rewrite), and Figure 4 illustrates a concrete example of Auto-Rewrite. Our algorithm rewrites ∼12% of the questions for all of the models evaluated.
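Entity substitution itself amounts to replacing mention spans in the question with the referent found under gold history. The following sketch assumes the mention spans are given as character offsets together with the replacement text; `rewrite_question` and its span format are illustrative choices and not the interface of any specific coreference toolkit.

```python
from typing import List, Tuple


def rewrite_question(question: str, substitutions: List[Tuple[int, int, str]]) -> str:
    """Replace each character span [start, end) with its gold-history referent."""
    rewritten = question
    # Apply right-to-left so earlier offsets stay valid after each replacement.
    for start, end, referent in sorted(substitutions, reverse=True):
        rewritten = rewritten[:start] + referent + rewritten[end:]
    return rewritten


# Example modeled on Figure 4: "it" (characters 19-21) is replaced by "Parade".
example = rewrite_question("What songs were in it?", [(19, 21, '"Parade"')])
```

Plain span substitution does not adjust the surrounding grammar, which is consistent with the ungrammatical rewrites reported in the error analysis.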
To analyze how well Auto-Rewrite does in detecting and rewriting questions, we manually check 100 conversations of ExCorD from the QuAC development set. We find that Auto-Rewrite can detect invalid questions with a precision of 72% and a recall of 72%. We notice that the coreference model sometimes detects the pronoun of the main subject in the passage as unresolvable, even though it shows up in almost every question. This issue causes a low precision but is not a critical problem in our case: whether or not the pronoun of the main subject is rewritten does not affect models' predictions much. An example of a question correctly detected and rewritten by Auto-Rewrite is shown in Figure 5.
Among all correctly detected invalid questions, we further check the quality of rewriting, and Auto-Rewrite gives a correct context-independent question for 68% of the questions. The most common error is that the rewritten question is ungrammatical. For example, using the gold history of "... Dee Dee claimed that Spector once pulled a gun on him", the original question "Did they arrest him for doing this?" was rewritten to "Did they arrest Phillip Harvey Spector for doing pulled?" While this causes an ungrammatical question (and a distribution shift during testing), it is still better than putting an invalid question in the flow.
Question replacement using CANARD. In addition to automatically rewriting questions, we also attempted replacing the invalid questions with a human-written context-independent question provided in the CANARD dataset (Elgohary et al., 2019). We denote this strategy as replaced-question evaluation (Auto-Replace). Since collecting context-independent questions is expensive, Auto-Replace is limited to evaluating models trained with QuAC; thus we do not treat this as a general method for CQA evaluation.

Automatic vs Human Evaluation
In this section, we compare human evaluation results with all the automatic evaluations we have introduced: gold-history evaluation (Auto-Gold), predicted-history evaluation (Auto-Pred), and our proposed Auto-Rewrite and Auto-Replace. We first explain how we compare different evaluation results and then discuss the findings.

Agreement Metrics
Model performance and rankings. We first consider using model performance reported by different evaluation methods. Considering that the numbers of automatic and human evaluations are not directly comparable, we also calculate models' rankings and compare whether the rankings are consistent between automatic and human evaluations. Model performance is reported in Table 2. In human evaluation, GraphFlow < BERT < HAM ≈ ExCorD; in Auto-Gold, BERT < GraphFlow < HAM < ExCorD; in other automatic evaluations, GraphFlow < BERT < HAM < ExCorD.
Statistics of unanswerable questions. The percentage of unanswerable questions is an important aspect of conversations. Automatic evaluations using static datasets have a fixed number of unanswerable questions, while in human evaluation, the percentage of unanswerable questions asked by human annotators varies with different models. The statistics of unanswerable questions are shown in Table 3.
Pairwise agreement. For a more fine-grained evaluation, we perform a passage-level comparison for every pair of models. More specifically, for every single passage we use one automatic metric to decide whether model A outperforms model B (or vice versa) and examine the percentage of passages that the automatic metric agrees with human evaluation. For example, if the pairwise agreement of BERT/ExCorD between human evaluation and Auto-Gold is 52%, it means that Auto-Gold and human evaluation agree on 52% passages in terms of which model is better. Higher agreement means the automatic evaluation is closer to human evaluation. Figure 6 shows the results of pairwise agreement.
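The pairwise agreement between one automatic metric and human evaluation can be sketched as follows for a single pair of models. The per-passage score dictionaries and the treatment of ties (a tie agrees only with a tie) are assumptions made for this illustration; the source does not specify how ties are handled.

```python
from typing import Dict


def _preference(score_a: float, score_b: float) -> int:
    """+1 if model A is better, -1 if model B is better, 0 for a tie."""
    return (score_a > score_b) - (score_a < score_b)


def pairwise_agreement(
    human_a: Dict[str, float],
    human_b: Dict[str, float],
    auto_a: Dict[str, float],
    auto_b: Dict[str, float],
) -> float:
    """Fraction of shared passages where both evaluations prefer the same model."""
    passages = human_a.keys() & human_b.keys() & auto_a.keys() & auto_b.keys()
    if not passages:
        raise ValueError("no passages shared by all four score tables")
    agreements = sum(
        _preference(human_a[p], human_b[p]) == _preference(auto_a[p], auto_b[p])
        for p in passages
    )
    return agreements / len(passages)
```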

Key Findings
Automatic evaluations have a significant distribution shift from human evaluation. We draw this conclusion from the three following points.
• Human evaluation shows a much higher model performance than all automatic evaluations, as shown in Table 2. Two reasons cause this large discrepancy: (a) Many CQA questions have multiple possible answers, and it is hard for the static dataset in automatic evaluations to capture all the answers. This is not an issue in human evaluation because all answers are judged by human evaluators. (b) There are more unanswerable questions and open questions in human evaluation (reason discussed in the next paragraph), which are relatively easy.
• Human evaluation has a much higher unanswerable question rate, as shown in Table 3. The reason is that in human-human data collection, the answers are usually correct and the questioners can ask follow-up questions building on the high-quality conversation; in human-machine interactions, since the models can make mistakes, the conversation flow is less fluent and it is harder to ask follow-up questions. Thus, questioners chatting with models tend to ask more open or unanswerable questions. This also suggests that current CQA models are far from perfect.
• All automatic evaluation methods have a pairwise agreement lower than 70% with human evaluation, as demonstrated in Figure 6.
Auto-Rewrite is closer to human evaluation. First, we can clearly see that among all automatic evaluations, Auto-Gold deviates the most from the human evaluation. From Table 2, only Auto-Gold shows different rankings from human evaluation, while Auto-Pred, Auto-Rewrite, and Auto-Replace show consistent rankings to human judgments.
In Figure 6, we see that Auto-Gold has the lowest agreement with human evaluation; among the others, Auto-Rewrite agrees better with human evaluation for most model pairs. Surprisingly, Auto-Rewrite is even better than Auto-Replace, which uses human-written context-independent questions, in most cases. After checking the Auto-Replace conversations, we found that human-written context-independent questions are usually much longer than original QuAC questions and unnatural in the conversational context, which leads to out-of-domain challenges for CQA models. This shows that our rewriting strategy can better reflect the real-world performance of conversational QA systems.

Towards Better Conversational QA
With insights drawn from human evaluation and comparison with automatic evaluations, we discuss the impact of different modeling strategies, as well as future directions towards better CQA systems.
Modeling question dependencies on conversational context. When we focus on answerable questions (Table 2), we notice that GraphFlow, HAM and ExCorD perform much better than BERT. We compare the modeling differences of the four systems in Figure 7, and identify that all three better systems explicitly model the question dependencies on the conversation history and the passage: both GraphFlow and HAM highlight repeated mentions in questions and conversation history by special embeddings (turn marker and PosHAE) and use an attention mechanism to select the most relevant part from the context; ExCorD adopts a question rewriting module that generates context-independent questions given the history and passage. All those designs help models better understand the question in a conversational context. Figure 8 gives an example where GraphFlow, HAM and ExCorD resolved the question from a long conversation history while BERT failed.
Unanswerable question detection. Table 4 demonstrates models' performance in detecting unanswerable questions. We notice that GraphFlow predicts far fewer unanswerable questions than the other three models, and has a high precision and a low recall in unanswerable detection. This is because GraphFlow uses a separate network for predicting unanswerable questions, which is harder to calibrate, while the other models jointly predict unanswerable questions and answer spans. This behavior has two effects: (a) GraphFlow's overall performance is dragged down by its poor unanswerable detection result (Table 2). (b) In human evaluation, annotators ask fewer unanswerable questions with GraphFlow (Table 3): when the model outputs more answers, regardless of correctness, the human questioner has a higher chance to ask passage-related follow-up questions. Both suggest that how well the model detects unanswerable questions significantly affects its performance and the flow in human-machine conversations.
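Precision and recall of unanswerable detection can be computed as in the sketch below, treating CANNOT ANSWER as the positive class. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
from typing import Sequence, Tuple

NO_ANSWER = "CANNOT ANSWER"


def unanswerable_precision_recall(
    predictions: Sequence[str],
    gold_unanswerable: Sequence[bool],
) -> Tuple[float, float]:
    """Precision and recall of predicting CANNOT ANSWER on unanswerable questions."""
    predicted = [p == NO_ANSWER for p in predictions]
    true_positives = sum(p and g for p, g in zip(predicted, gold_unanswerable))
    n_predicted = sum(predicted)
    n_gold = sum(gold_unanswerable)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    return precision, recall
```

A model that rarely outputs CANNOT ANSWER, but is right when it does, lands in the high-precision, low-recall regime described above.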
Optimizing towards the new testing protocols. Most existing works on CQA modeling focus on optimizing towards Auto-Gold evaluation. Since Auto-Gold has a large gap from real-world evaluation, more effort is needed in optimizing towards human evaluation, or towards Auto-Rewrite, which better reflects human evaluation. One potential direction is to improve models' robustness given noisy conversation history, which simulates the inaccurate history in real conversations that consists of models' own predictions. In fact, prior works (Mandya et al., 2020; Siblini et al., 2021) that used predicted history in training showed that it benefits the models in predicted-history evaluation.

Related Work
Conversational question answering. In recent years, several conversational question answering datasets have emerged, such as QuAC (Choi et al., 2018), CoQA (Reddy et al., 2019), and DoQA (Campos et al., 2020). Different from single-turn QA datasets (Rajpurkar et al., 2016), CQA requires the model to understand the question in the context of conversational history. There have been many methods proposed to improve CQA performance (Ohsugi et al., 2019; Chen et al., 2020; Qu et al., 2019; Kim et al., 2021) and significant improvement has been made on CQA benchmarks.
Besides text-based CQA tasks, there also exist CQA benchmarks that require other forms of modeling ability, such as combining textual evidence with background knowledge (Saeidi et al., 2018), utilizing structured knowledge bases (Saha et al., 2018; Guo et al., 2018), as well as CQA in other modalities (Das et al., 2017).
Evaluation with predicted history. Only recently has it been noticed that the current method of evaluating CQA models is flawed. Mandya et al. (2020) and Siblini et al. (2021) point out that using gold answers in history is not consistent with the real-world scenario and propose to use predicted history for evaluation. Different from prior work, in this paper, we conduct a large-scale human evaluation to support our claims, identify the issues with predicted history, and propose rewriting questions to further mitigate the gap to human evaluation.
Re-evaluation of evaluation strategies. In recent years, the NLP community has also cautiously re-evaluated and identified flaws in many other popular automated evaluation strategies (Liu et al., 2016; Reiter, 2018), and has proposed new evaluation protocols to mitigate the problems and align more with how humans would evaluate language systems in a real-world setting (Ghazarian et al., 2019; Gehrmann et al., 2021).

Conclusion
In this work, we carry out the first large-scale human evaluation on CQA systems. We show that current standard automatic evaluation with gold history cannot reflect models' performance in human evaluation, and that human-machine conversations have a large distribution shift from static CQA datasets of human-human conversations. To tackle these problems, we propose to use predicted history together with rewriting of invalid questions for evaluation, which reduces the gap between automatic evaluations and the real-world human evaluation. We also use the human evaluation results to analyze current CQA systems and identify promising directions for future development.