Coreference Reasoning in Machine Reading Comprehension

Coreference resolution is essential for natural language understanding and has long been studied in NLP. In recent years, as the format of Question Answering (QA) became a standard for machine reading comprehension (MRC), there have been data collection efforts, e.g., Dasigi et al. (2019), that attempt to evaluate the ability of MRC models to reason about coreference. However, as we show, coreference reasoning in MRC is a greater challenge than earlier thought; MRC datasets do not reflect the natural distribution and, consequently, the challenges of coreference reasoning. Specifically, success on these datasets does not reflect a model's proficiency in coreference reasoning. We propose a methodology for creating MRC datasets that better reflect the challenges of coreference reasoning and use it to create a sample evaluation set. The results on our dataset show that state-of-the-art models still struggle with these phenomena. Furthermore, we develop an effective way to use naturally occurring coreference phenomena from existing coreference resolution datasets when training MRC models. This allows us to show an improvement in the coreference reasoning abilities of state-of-the-art models.


Introduction
Machine reading comprehension is the ability to read and understand given passages and answer questions about them. Coreference resolution is the task of finding different expressions that refer to the same real-world entity. The tasks of coreference resolution and machine reading comprehension have moved closer to each other: converting coreference-related datasets into an MRC format improves the performance on some coreference-related datasets (Wu et al., 2020b; Aralikatte et al., 2019). There are also various reading comprehension datasets on which the model is required to perform coreference reasoning to answer some of the questions, e.g., DROP (Dua et al., 2019), DuoRC (Saha et al., 2018), and MultiRC (Khashabi et al., 2018).

1 The code and the resulting dataset are available at https://github.com/UKPLab/coref-reasoning-in-qa.
Quoref (Dasigi et al., 2019) is a dataset that is particularly designed for evaluating the coreference understanding of MRC models. Figure 1 shows a QA sample from Quoref in which the model needs to resolve the coreference relation between "his" and "John Motteux" to answer the question. Recent large pre-trained language models have reached high performance on Quoref. However, our results and analyses suggest that this dataset contains artifacts and does not reflect the natural distribution, and therefore the challenges, of coreference reasoning. As a result, high performance on Quoref does not necessarily reflect the coreference reasoning capabilities of the examined models, and answering questions that require coreference reasoning might be a greater challenge than current scores suggest.
In this paper, we propose two solutions to address this issue. First, we propose a methodology for creating MRC datasets that better reflect the coreference reasoning challenge. We release a sample challenging evaluation set containing 200 examples by asking an annotator to create new question-answer pairs using our methodology, based on existing passages in Quoref. We show that this dataset contains fewer annotation artifacts, and that its distribution of biases is closer to that of a coreference resolution dataset. The performance of state-of-the-art models on Quoref considerably drops on our evaluation set, suggesting that (1) coreference reasoning is still an open problem for MRC models, and (2) our methodology opens a promising direction for creating future challenging MRC datasets.
Second, we propose to directly use coreference resolution datasets for training MRC models to improve their coreference reasoning. We automatically create a question whose answer is a coreferring expression m1, using the BART model (Lewis et al., 2020). We then consider this question, m1's antecedent, and the corresponding document as a new (question, answer, context) tuple. This data helps the model learn to resolve the coreference relation between m1 and its antecedent in order to answer the question. We show that incorporating this additional data improves the performance of state-of-the-art models on our new evaluation set.
Our main contributions are as follows:
• We show that Quoref does not reflect the natural challenges of coreference reasoning and propose a methodology for creating MRC datasets that better reflect this challenge.
• We release a sample challenging dataset that is manually created by an annotator using our methodology. The results of state-of-the-art MRC models on our evaluation set show that, despite the high performance of MRC models on Quoref, answering questions based on coreference reasoning is still an open challenge.
• We propose an approach to use existing coreference resolution datasets for training MRC models. We show that, while coreference resolution and MRC datasets are independent and belong to different domains, our approach improves the coreference reasoning of state-of-the-art MRC models.
2 Related Work

Artifacts in NLP datasets
One of the known drawbacks of many NLP datasets is that they contain artifacts. Models tend to exploit these easy-to-learn patterns in the early stages of training (Arpit et al., 2017; Utama et al., 2020b), and therefore, they may not focus on learning harder patterns of the data that are useful for solving the underlying task. As a result, overfitting to dataset-specific artifacts limits the robustness and generalization of NLP models. There are two general approaches to tackle such artifacts: (1) adversarial filtering of biased examples, i.e., examples that contain artifacts, and (2) debiasing methods. In the first approach, potentially biased examples are discarded from the dataset, either after the dataset creation (Zellers et al., 2018; Yang et al., 2018a; Le Bras et al., 2020; Bartolo et al., 2020), or while creating the dataset (Dua et al., 2019; Nie et al., 2020).
In the second approach, examples that contain artifacts are first recognized, and this knowledge is then used in the training objective to either skip or down-weight biased examples (He et al., 2019; Clark et al., 2019a), or to regularize the confidence of the model on those examples (Utama et al., 2020a). The use of this information in the training objective improves the robustness of the model on adversarial datasets (He et al., 2019; Clark et al., 2019a; Utama et al., 2020a), i.e., datasets that contain counterexamples in which relying on the bias results in an incorrect prediction. In addition, it can also improve in-domain performance as well as generalization across various datasets that represent the same task (Wu et al., 2020a; Utama et al., 2020b).
While there is an emerging trend of including adversarial models in data collection, their effectiveness has not yet been compared with that of debiasing methods, e.g., whether adversarial data collection is still beneficial when debiasing methods are used, or vice versa.

Joint QA and Coreference Reasoning
There are a few studies on the joint understanding of coreference relations and reading comprehension. Wu et al. (2020b) propose to formulate coreference resolution as a span-prediction task by generating a query for each mention using the surrounding context, thus converting coreference resolution to a reading comprehension problem. They leverage the plethora of existing MRC datasets for data augmentation and improve the generalization of the coreference model. In parallel to Wu et al. (2020b), Aralikatte et al. (2019) also cast ellipsis and coreference resolution as reading comprehension tasks. They leverage existing neural architectures designed for MRC for ellipsis resolution and outperform the previous best results. In a similar direction, Hou (2020) proposes to cast bridging anaphora resolution as question answering and presents a question answering framework for this task. However, none of the above works investigate the impact of using coreference data on QA. Dua et al. (2020) use Amazon Mechanical Turk workers to annotate the coreference chains of the answers in the passages of Quoref for 2,000 QA pairs. They then use this additional coreference annotation for training a model on Quoref. They show that including these additional coreference annotations improves the overall performance on Quoref. The method proposed by Dua et al. (2020) requires annotating additional coreference relations for every new coreference-aware QA dataset. In contrast, our approach uses existing coreference resolution datasets, and therefore applies to any new QA dataset without introducing any additional cost.

How Well Does Quoref Present Coreference Reasoning?
For creating the Quoref dataset, annotators first identify coreferring expressions and then ask questions that connect two coreferring expressions. Dasigi et al. (2019) use a BERT-base model (Devlin et al., 2019) that is fine-tuned on the SQuAD dataset (Rajpurkar et al., 2016) as an adversarial model to exclude QA samples that the adversarial model can already answer. The goal of using this adversarial model is to avoid including question-answer pairs that can be solved using surface cues. They claim that most examples in Quoref cannot be answered without coreference reasoning. If we fine-tune a RoBERTa-large model on Quoref, it achieves an F1 score of 78, while the estimated human performance is around 93 F1 (Dasigi et al., 2019). This high performance, given that RoBERTa can only predict contiguous span answers while Quoref also contains discontinuous answers, indicates that either (1) Quoref presents coreference-aware QA very well, so that the model can properly learn coreference reasoning from the training data, (2) pre-trained transformer-based models have already learned coreference reasoning during their pre-training, e.g., as suggested by Tenney et al. (2019) and Clark et al. (2019b), or (3) coreference reasoning is not necessarily required for solving most examples.
In this section, we investigate whether Quoref contains the known artifacts of QA datasets, and therefore, whether models can solve some of the QA pairs without performing coreference reasoning. Figure 2 shows such an example, where simple lexical cues are enough to answer the question despite the fact that the coreferring expressions "Frankie" and "his" were included in the corresponding context. We investigate five artifacts (biases) as follows:
• Random named entity: the majority of answers in Quoref are person names. To evaluate this artifact, we randomly select a PERSON named entity from the context as the answer.
• Wh-word (Weissenborn et al., 2017): to recognize the QA pairs that can be answered by only using the interrogative adverbs of the question, we train a model on a variation of the training dataset in which questions only contain interrogative adverbs.
• Empty question (Sugawara et al., 2020): to recognize QA pairs that are answerable without considering the question, we train a QA model only on the contexts and without questions.
• Semantic overlap (Jia and Liang, 2017): for this artifact, we report the ratio of the QA pairs whose answers lie in the sentence of the context that has the highest semantic similarity to the question. We use Sentence-BERT (Reimers and Gurevych, 2019) to find the most similar sentence.
• Short distance reasoning: for this bias, we train a model only using the sentence of the context that is the most similar to the question, instead of the whole context. We exclude the question-answer pairs in which the most similar sentence does not contain the answer. This model will not learn to perform coreference reasoning when the related coreferring pairs are not in the same sentence.
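The semantic overlap check above can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: it substitutes a simple bag-of-words cosine for Sentence-BERT, and the function names are ours.

```python
import re
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words Counter over lowercased word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def has_semantic_overlap_bias(question, answer, context_sentences):
    """True if the answer lies in the context sentence most similar to the question.

    A stand-in for the paper's Sentence-BERT similarity: here, the most
    similar sentence is chosen by bag-of-words cosine.
    """
    q = bow(question)
    best = max(context_sentences, key=lambda s: cosine(q, bow(s)))
    return answer.lower() in best.lower()
```

An example with a true and a distractor sentence shows the intended behavior: the bias fires only when the answer sits in the question's most similar sentence.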
For the wh-word, empty question, and short distance reasoning biases, we use the TASE model (Segal et al., 2020) to learn the bias. Biased examples are then those that can be correctly solved by these bias models. We only change the training data for biased example detection, if necessary, and the development set is unchanged. The Quoref column in Table 1 shows the proportion of biased examples in Quoref.

We also investigate whether these biases have similar ratios in a coreference resolution dataset. We use the CoNLL-2012 coreference resolution dataset (Pradhan et al., 2012a) and convert it to a reading comprehension format, i.e., CoNLL bart in Section 5. This data contains question-answer pairs in which the question is created based on a coreferring expression in CoNLL-2012, and the answer is its closest antecedent. We split this data into training and test sets and train the bias models on the training split. The CoNLL bart column in Table 1 shows the bias proportions on this data.
As we see, short distance reasoning is the most prominent bias in the Quoref dataset. However, the ratio of such biased examples is only around 10% in CoNLL-2012. Therefore, apart from the examples that can be solved without coreference reasoning, the difficulty of the required coreference reasoning in the remaining examples is also not comparable with that of naturally occurring coreference relations in a coreference resolution dataset.
As a result, high performance on Quoref does not necessarily indicate that the model is adept at performing coreference reasoning.
There is a growing trend in using adversarial models for data creation to make the dataset more challenging or to discard examples that can be solved using surface cues (Bartolo et al., 2020; Nie et al., 2020; Yang et al., 2018a; Zellers et al., 2018; Yang et al., 2018b; Dua et al., 2019; Dasigi et al., 2019).
Quoref is also created using an adversarial data collection method to discard examples that can be solved using simple lexical cues. The assumption is that it is hard to avoid simple lexical cues by which the model can answer questions without coreference reasoning. Therefore, an adversarial model (A) is used to discard examples that contain such lexical cues. While this adversarial filtering removes examples that are easy to solve by A, it does not ensure that the remaining examples do not contain shortcuts that are not explored by A.
First, the adversarial model in Quoref is trained on another dataset, i.e., SQuAD. Thus, the failure of A on Quoref examples may be due to (1) Quoref having different lexical cues than those in SQuAD, or (2) domain shift. Second, and more importantly, as argued by Dunietz et al. (2020), making the task challenging by focusing on examples that are more difficult for existing models is not a solution for more useful reading comprehension. 7 We instead propose a methodology for creating question-answer pairs as follows:
• Annotators should create a question that connects the referring expression m1 to its antecedent m2 so that (1) m2 is more informative than m1, 8 and (2) m1 and m2 reside in different sentences.
• Candidate passages for creating QA pairs are selected according to their number of named entities and pronouns. The number of distinct named entities is an indicator of the number of entities in the text; therefore, there would be more candidate entities for resolving referring expressions. The number of pronouns indicates that there are enough candidate m1s that have more informative antecedents.
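The passage selection heuristic in the second bullet can be sketched as follows. This is an illustrative sketch with crude stand-ins of our own choosing, since the paper does not specify the exact tooling: a capitalization heuristic replaces a named entity tagger, and a small hand-written set replaces a proper pronoun lexicon.

```python
import re

# A minimal pronoun inventory; a full implementation would use a POS tagger.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them",
            "his", "hers", "its", "their", "theirs"}

def passage_score(passage):
    """Score a candidate passage by (distinct named entities, pronoun count).

    Capitalized tokens that are not sentence-initial serve as a crude
    stand-in for a named entity tagger.
    """
    tokens = re.findall(r"\w+", passage)
    pronouns = sum(1 for t in tokens if t.lower() in PRONOUNS)
    # tokens that open a sentence are excluded from the entity heuristic
    sentence_starts = {m.group(1) for m in re.finditer(r"(?:^|[.!?]\s+)(\w+)", passage)}
    entities = {t for t in tokens if t[0].isupper() and t not in sentence_starts}
    return len(entities), pronouns

def select_passages(passages, k=2):
    """Return the k passages with the most distinct entities, then most pronouns."""
    return sorted(passages, key=passage_score, reverse=True)[:k]
```

A passage mentioning several people and pronouns would then outrank one that mostly describes weather or scenery, matching the intuition behind the guideline.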
We provide this guideline to a student who manually creates new question-answer pairs based on existing passages in Quoref. Table 2 presents examples from our dataset, and Table 3 shows the ratios of the examined biases on our dataset:

Table 3:
Bias                        Ours
random named entity          3.03
wh-word                     13.64
empty question              11.62
semantic overlap            24.50
short-distance reasoning    35.35

By comparing Table 3 and Table 1, we observe that the examined biases are less strong in our dataset, and that their distribution is closer to those in CoNLL-2012. As we will see in Table 5, the performance of state-of-the-art Quoref models drops by more than 10 points, i.e., 13-18 points, on our challenge dataset. 9

7 As put by Dunietz et al. (2020): "the dominant MRC research paradigm is like trying to become a professional sprinter by glancing around the gym and adopting any exercises that look hard".
8 Proper names are more informative than common nouns, and common nouns are more informative than pronouns (Lee et al., 2013).

Improving Coreference Reasoning
While we do not have access to many coreference annotations for the task of coreference-aware MRC, there are various datasets for the task of coreference resolution. Coreference resolution datasets contain the annotation of expressions that refer to the same entity. In this paper, we hypothesize that we can directly use coreference resolution corpora to improve the coreference reasoning of MRC models. We propose an effective approach to convert coreference annotations into QA pairs so that models learn to perform coreference resolution by answering those questions. In our experiments, we use the CoNLL-2012 dataset (Pradhan et al., 2012b), which is the largest dataset annotated with coreference information.

9 We examine 50 randomly selected examples from our challenge set, and they were all answerable by a human.

Coreference-to-QA Conversion
The existing approach to convert coreference annotations into (question, context, answer) tuples, which is used to improve coreference resolution performance (Wu et al., 2020b; Aralikatte et al., 2019), is to use the sentence of the anaphor as a declarative query and its closest antecedent as the answer. The format of these queries is not compatible with questions in MRC datasets, and therefore, the impact of this data on MRC models may be limited. In this work, we instead generate questions from those declarative queries using an automatic question generation model. We use the BART model (Lewis et al., 2020), which is one of the state-of-the-art text generation models. Below we explain the details of each of these two approaches for creating QA data from CoNLL-2012. Table 4 shows examples from both approaches.

CoNLL dec : similar to Aralikatte et al. (2019), we choose a sentence that contains an anaphor as a declarative query, the closest non-pronominal antecedent of that anaphor as the answer, and the corresponding document of the expressions as the context. We remove the tuples in which the anaphor and its antecedent are identical. The reasons are that (1) Quoref already contains many examples in which the coreference relation is between two mentions with the same string, and (2) even after removing such examples, CoNLL dec contains around four times more QA pairs than the Quoref training data.
CoNLL bart : we use a fine-tuned BART model (Lewis et al., 2020) released by Durmus et al. (2020) for question generation and apply it to the declarative queries in CoNLL dec . The BART model specifies potential answers by masking noun phrases or named entities in the query and then generates a question for each masked text span. We only keep questions whose answer, i.e., the masked expression, is a coreferring expression, and replace that answer with its closest non-pronominal antecedent. We also only keep questions in which the masked expression and its antecedent are not identical. Such QA pairs force the model to resolve the coreference relation between the two coreferring expressions to answer the generated questions. The examples from Table 4 include:

Generated question: who was at huntingdon because she needed care?
Answer: My mother

Passage in CoNLL: The angel also held a large chain in his hand [...] The angel tied the dragon with the chain for 1000 years.
Coreferring mentions: [a large chain, the chain]
Declarative query: The angel tied the dragon with <ref> the chain </ref> for 1000 years.
Generated question: what did the angel tie the dragon with for 1000 years?
Answer: a large chain

Experimental Setups
We use two recent models from the Quoref leaderboard: RoBERTa and TASE (Segal et al., 2020), of which TASE has the state-of-the-art results. We use RoBERTa-large from HuggingFace (Wolf et al., 2020). TASE casts MRC as a sequence tagging problem to handle questions with multi-span answers: it assigns a tag to every token of the context indicating whether the token is a part of the answer. We use the TASE IO +SSE setup, which is a combination of their multi-span architecture and single-span extraction with IO tagging. We use the same configuration and hyper-parameters for TASE IO +SSE as described in Segal et al. (2020). We train all models for two epochs in all experiments. 11 For evaluation, we use the F1 score, which counts the number of shared words between predictions and gold answers.
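The word-overlap F1 described above can be sketched as follows. This follows the common SQuAD-style token F1; the official evaluation scripts additionally normalize articles and punctuation, which is omitted here for brevity.

```python
from collections import Counter

def token_f1(prediction, gold):
    """Word-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # multiset intersection: each shared word counted at most min(freqs) times
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For instance, predicting only the first name of a two-word gold answer gives precision 1 and recall 0.5, i.e., an F1 of 2/3.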
Training Strategies. To include the additional training data that we create from CoNLL-2012 using our coreference-to-QA conversion methods, we use two different strategies:
• Joint: we concatenate the training examples from Quoref and the CoNLL-to-QA converted datasets. Therefore, the model is jointly trained on the examples from both datasets.

11 The only difference between TASE in our experiments and the reported results in Segal et al. (2020) is the number of training epochs. For a fair comparison, we train all models for the same number of iterations.
• Transfer: Since the CoNLL-to-QA data is automatically created and is noisy, we also examine a sequential fine-tuning setting in which we first train the model on the CoNLL-to-QA converted data, and then fine-tune it on Quoref.

Data
We evaluate all the models on four different QA datasets.
• Quoref : the official development and test sets of Quoref, i.e., Quoref dev and Quoref test , respectively.
• Our challenge set: our new evaluation set described in Section 4.
• Contrast set: the evaluation set by Gardner et al. (2020) that is created based on the official Quoref test set. For creating this evaluation set, the authors manually performed small but meaningful perturbations to the test examples in a way that changes the gold label. This dataset is constructed to evaluate whether models' decision boundaries align with the true decision boundaries when they are measured around the same point.
• MultiRC: the Multi-Sentence Reading Comprehension dataset (Khashabi et al., 2018) is created in a way that answering its questions requires a more complex understanding of multiple sentences. Therefore, coreference reasoning can be one of the sources for improving the performance on this dataset. Note that MultiRC is from a different domain than the rest of the evaluation sets. The size of each training dataset is reported in Table 6, as we use it for investigating whether the resulting performance changes are due to using more training data or due to using coreference-aware additional data.
The language of all the datasets is English. Table 5 presents the results of evaluating the impact of using coreference annotations to improve coreference reasoning in MRC. We report the results for both of the examined state-of-the-art models, i.e., TASE and RoBERTa-large, using both training settings: (1) training the model jointly on Quoref and CoNLL-to-QA converted data (Joint), and (2) pre-training the model on CoNLL-to-QA data first and fine-tuning it on Quoref (Transfer).

Results
Baseline represents the results of the examined models that are only trained on Quoref. CoNLL bart represents the results of the models that are only trained on the CoNLL bart data. Transfer-SQuAD reports the results of the sequential training when the model is first trained on the SQuAD training dataset (Rajpurkar et al., 2016) and is then finetuned on Quoref. Based on the results of Table 5, we make the following observations. First, the most successful setting for improving coreference reasoning, i.e., improving the performance on our challenge evaluation set, is Transfer-CoNLL bart . Pre-training the TASE model on CoNLL bart improves its performance on all of the examined evaluation sets. However, it only improves the performance of RoBERTa on our challenge set.
Second, SQuAD contains well-formed QA pairs while CoNLL bart and CoNLL dec contain noisy ones. Also, SQuAD and Quoref are both created based on Wikipedia articles, and therefore have similar domains. In contrast, the genres of the documents in CoNLL-2012 include newswire, broadcast news, broadcast conversations, telephone conversations, weblogs, magazines, and the Bible, which are very different from those in Quoref. As a result, pre-training on SQuAD has a positive impact on the majority of datasets. However, this impact is less pronounced on our challenge dataset, as it requires coreference reasoning while this skill is not well represented in SQuAD examples. Finally, while using the sentence of coreferring mentions as a declarative query (CoNLL dec ) is the common method for converting coreference resolution datasets into a QA format in previous studies, our results show that CoNLL bart has a more positive impact than CoNLL dec .

Analysis
To analyze what kind of examples benefit more from incorporating the coreference data, we split Quoref dev and our dataset into different subsets based on the semantic overlap and short distance reasoning biases, which are the most common types of biases in both datasets.
The semantic overlap column in Table 7 represents the results on the subset of the data in which answers reside in the most similar sentence of the context, and the ¬semantic overlap column contains the rest of the examples in each of the examined datasets. The short reasoning column presents the results on the subset of the data containing examples that can be solved by the short distance reasoning bias model, and ¬ short reasoning presents the results on the rest of the examples. Table 7 shows the performance differences of the TASE and RoBERTa models on these four subsets for each of the two datasets.
Surprisingly, the performance of the baseline models is lower on the semantic overlap subset compared to ¬semantic overlap on Quoref dev . This can indicate that examples in the ¬semantic overlap subset of Quoref dev contain other types of biases that make QA less challenging on this subset.
The addition of the coreference resolution annotations in all four training settings reduces the performance gap of the TASE model on the semantic overlap and ¬semantic overlap subsets for both datasets. Incorporating coreference data for RoBERTa, on the other hand, has a positive impact using the CoNLL bart data and on the harder subsets of our challenge evaluation set, i.e., ¬semantic overlap and ¬short reasoning.
Finally, there is still a large performance gap between short reasoning and ¬ short reasoning subsets. In our coreference-to-QA conversion methods, we consider the closest antecedent of each anaphor as the answer. A promising direction for future work is to also create QA pairs based on longer distance coreferring expressions, e.g., to create two QA pairs based on each anaphor, one in which the answer is the closest antecedent, and the other with the first mention of the entity in the text as the answer.

Conclusions
We show that the high performance of recent models on the Quoref dataset does not necessarily indicate that they are adept at performing coreference reasoning, and that QA based on coreference reasoning is a greater challenge than current scores suggest. We then propose a methodology for creating a dataset that better presents the coreference reasoning challenge for MRC. We provide our methodology to an annotator and create a sample dataset. Our analysis shows that our dataset contains fewer biases compared to Quoref, and the performance of state-of-the-art Quoref models drops considerably on this evaluation set.
To improve the coreference reasoning of QA models, we propose to use coreference resolution datasets to train MRC models. We propose a method to convert coreference annotations into an MRC format. We examine the impact of incorporating this coreference data on the coreference reasoning of QA models using two top-performing QA systems from the Quoref leaderboard. We show that using coreference datasets improves the performance of both examined models on our evaluation set, indicating their improved coreference reasoning. The results on our evaluation set suggest that there is still room for improvement, and that reading comprehension with coreference understanding remains a challenge for existing QA models, especially if the coreference relation is between two distant expressions.

A Additional Statistics about Biased Examples

Table 8 shows the proportion of biased examples in the CoNLL dec set. We can see that the results are similar to those of the CoNLL bart set.
To compare the ratio of biased examples between Quoref dev and our challenge set when considering the same number of examples in both datasets, we randomly sample 10 different subsets from Quoref dev and from our challenge set, with 100 samples in each subset, and compute the ratios in each subset. Figure 3 shows the results. As we see, in this setting the ratio of all bias types in our evaluation set is still lower than those in Quoref dev .

Table 9 shows additional experiments for pre-training the examined models on coreference data. We examine an additional setting for pre-training on both CoNLL dec and CoNLL bart by first training the models on CoNLL dec , then on CoNLL bart , and finally on Quoref (Transfer-CoNLL bart+dec ). By comparing the results of Transfer-CoNLL bart+dec with Transfer-CoNLL bart from Table 5, we observe that pre-training the models on both CoNLL dec and CoNLL bart does not result in any additional advantage compared to only using CoNLL bart .

Table 9: Additional experiments on using the CoNLL dec and CoNLL bart data for pre-training RoBERTa-large and TASE models. Transfer-CoNLL bart+dec refers to the setting in which the model is first trained on CoNLL dec , then on CoNLL bart , and finally on Quoref.
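The subsampling comparison above (10 subsets of 100 examples each) can be sketched as follows. This is a minimal illustration; `biased_flags` and the function name are our own assumptions, with per-example booleans standing in for the bias-model predictions.

```python
import random

def subsample_bias_ratios(biased_flags, n_subsets=10, subset_size=100, seed=0):
    """Estimate a bias ratio from repeated fixed-size subsamples.

    `biased_flags` is a list of booleans, one per example, indicating
    whether a bias model solves that example. Returns one ratio per subset.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_subsets):
        subset = rng.sample(biased_flags, subset_size)
        ratios.append(sum(subset) / subset_size)
    return ratios
```

Running this on each dataset with the same subset size makes the per-subset ratios directly comparable, which is the point of fixing 100 samples per subset.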