Tracing Origins: Coreference-aware Machine Reading Comprehension

Machine reading comprehension is a heavily studied research and test field for evaluating new pre-trained language models (PrLMs) and fine-tuning strategies, and recent studies have enriched pre-trained language models with syntactic, semantic and other linguistic information to improve model performance. In this paper, we imitate the human reading process of connecting anaphoric expressions and explicitly leverage the coreference information of entities to enhance the word embeddings from the pre-trained language model, so as to highlight the coreference mentions that must be identified for coreference-intensive question answering in QUOREF, a relatively new dataset that is specifically designed to evaluate the coreference-related performance of a model. We use two strategies to fine-tune a pre-trained language model: placing an additional encoder layer after the pre-trained language model to focus on the coreference mentions, or constructing a relational graph convolutional network to model the coreference relations. We demonstrate that explicitly incorporating coreference information in the fine-tuning stage performs better than incorporating it when pre-training a language model.


Introduction
Machine reading comprehension (MRC), a task that automatically identifies one or multiple words from a given passage as the context to answer a specific question about that passage, is widely used in information retrieval, search engines, and dialog systems. Several MRC datasets that limit the answer to a single word or multiple words from the passage have been introduced, including TREC (Harman, 1993), SQuAD (Rajpurkar et al., 2018), NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), and QuAC (Choi et al., 2018), and intensive efforts have been made to build new models that surpass human performance on these datasets, including pre-trained language models (Devlin et al., 2019; Yang et al., 2019a) and ensemble models that outperform humans, in particular on SQuAD (Lan et al., 2020; Yamada et al., 2020). More challenging datasets have also been introduced, which require several reasoning steps to answer (Yang et al., 2018; Qi et al., 2021), the understanding of a much larger context (Kočiský et al., 2018), or the understanding of adversarial content and numeric reasoning (Dua et al., 2019).

Context: Frankie Bono, a mentally disturbed hitman from Cleveland, comes back to his hometown in New York City during Christmas week to kill a middle-management mobster, Troiano. ...First he follows his target to select the best possible location, but opts to wait until Troiano isn't being accompanied by his bodyguards. ... Losing his nerve, Frankie calls up his employers to tell them he wants to quit the job. Unsympathetic, the supervisor tells him he has until New Year's Eve to perform the hit.
Question: What is the first name of the person who has until New Year's Eve to perform a hit?
Answer: he -> Frankie
Question: What is the first name of the person who follows their target to select the best possible location?
Answer: he -> Frankie
Table 1: An example from QUOREF: coreference resolution is required to extract the correct answer. We highlight the supporting text in teal color and the related deictic expressions in bold.
Human texts, especially long texts, abound in deictic and anaphoric expressions that refer to entities in the same text. These deictic and anaphoric expressions, in particular, constrain the generalization of models trained without explicit awareness of coreference. The QUOREF dataset (Dasigi et al., 2019) is specifically designed to validate the performance of models in coreferential reasoning, in that "78% of the manually analyzed questions cannot be answered without coreference" (Dasigi et al., 2019). The example in Table 1 shows that the answers to the two questions cannot be directly retrieved from the sentences because the word in the corresponding sentence of the context is the anaphoric pronoun he, and to obtain the correct answers, its antecedent Frankie must be traced. Coreferential reasoning is therefore required to successfully complete the machine reading comprehension task in the SQuAD-style QUOREF dataset.
Pre-trained language models, including BERT (Devlin et al., 2019), RoBERTa and XLNet (Yang et al., 2019b), which are trained with self-supervised language modeling objectives such as masked language modeling, perform rather poorly on the QUOREF dataset. We argue that the reason for this poor performance is that these pre-trained language models do learn background knowledge useful for coreference resolution but may not adequately learn the coreference information required for coreference-intensive reading comprehension tasks. In the human reading process, as shown in an empirical study of first-year English-as-a-second-language students reading expository texts, "anaphoric resolution requires a reader to perform a text-connecting task across textual units by successfully linking an appropriate antecedent (among several prior antecedents) with a specific anaphoric referent" and "students who were not performing well academically were not skilled at resolving anaphors" (Pretorius, 2005); moreover, direct instruction on anaphoric resolution elevated readers' comprehension of the text (Baumann, 1986). In addition, studies on anaphor resolution in adults, using eye movement studies (Duffy and Rayner, 1990; van Gompel et al., 2004), and in children (Joseph et al., 2015) evidenced the two-stage model of anaphor resolution proposed by Garrod and Terras (2000). The first stage is "an initial lexically driven, context-free stage known as bonding, whereby a link between the anaphor and a potential antecedent is made, followed by a later process known as resolution, which resolves the link with respect to the overall discourse context" (Joseph et al., 2015). Pre-trained language models only capture the semantic representations of words and sentences without explicitly performing such text-connecting actions in the specific coreference-intensive reading comprehension task, and thus they do not learn adequate knowledge to solve complex coreference reasoning problems.
Explicitly injecting external knowledge, such as linguistic information and knowledge graph entities, has been shown to be effective in broadening the scope of pre-trained language models' capacity and performance, and such models are often known as X-aware pre-trained language models (Kumar et al., 2021). It is therefore plausible that we may imitate the anaphoric resolution process in humans' two-stage reading comprehension of coreference-intensive materials and explicitly perform the text-connecting task in our fine-tuning stage as the second stage of machine reading comprehension.
As an important tool that captures the anaphoric relationship between words or phrases, coreference resolution, which clusters the mentions of the same entity within a given text, is an active field in natural language processing (Chen et al., 2011; Sangeetha, 2012; Huang et al., 2019; Joshi et al., 2020), with neural networks taking the lead in coreference resolution challenges. Incorporating coreference resolution results in pre-training to obtain coreference-informed pre-trained language models, such as CorefBERT and CorefRoBERTa (Ye et al., 2020), has shown positive improvements on the QUOREF dataset, a dataset specially designed to measure a model's coreference capability, but the performance is still considerably below human performance. In this paper, we make a different attempt to leverage coreference resolution knowledge and complete the anaphoric resolution process in reading comprehension. We propose a fine-tuned coref-aware model that directly instructs the model to learn the coreference information. Our model can be roughly divided into three major components: 1) a pre-trained language model component: we use the contextualized representations from the pre-trained language models as the token embeddings for the downstream reading comprehension tasks; 2) a coreference resolution component: NeuralCoref, an extension to spaCy, is applied here to extract the mention clusters from the context; 3) a coreference enrichment component: we apply three methods to incorporate the coreference knowledge, namely additive attention enhancement, multiplication attention enhancement, and a relation-enhanced graph-attention network with a fusing layer.
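As an illustration of the coreference resolution component, the mention clusters can be obtained with a few lines of spaCy/NeuralCoref code. The snippet below is a minimal sketch; the spaCy model name is an illustrative choice, not our exact configuration.

```python
import spacy
import neuralcoref  # NeuralCoref 4.0, compatible with spaCy 2.1.0

# Load a spaCy English pipeline and attach the NeuralCoref component.
nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

text = ("Losing his nerve, Frankie calls up his employers "
        "to tell them he wants to quit the job.")
doc = nlp(text)

# Each cluster groups the mentions of one entity; cluster.main is the
# most representative mention (typically the antecedent).
if doc._.has_coref:
    for cluster in doc._.coref_clusters:
        print(cluster.main.text, "<-", [m.text for m in cluster.mentions])
```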
In this paper, we show that by simulating the human behavior in explicitly connecting the anaphoric expressions to the antecedent entities and infusing the coreference knowledge our model can surpass that of the pre-trained coreference language models on the QUOREF dataset.
Background and Related Work

Models and Training Strategies
Recent studies on machine reading comprehension mainly rely on neural network approaches. Before the prevalence of pre-trained language models, the main focus was on guiding and fusing the attention between questions and paragraphs within the models in order to obtain better global and attended representations (Huang et al., 2018; Hu et al., 2018; Wang et al., 2018).
However, raw pre-trained language models, lacking the in-domain knowledge, structures and reasoning capabilities required by these datasets, often perform unsatisfactorily on the hard datasets and fall significantly below human performance. Efforts have been made to boost model performance by enriching pre-trained language models with specific syntactic or semantic information. Another trend is to fine-tune the pre-trained language model and add additional layers that incorporate task-specific information for better representation, in particular coreference information (Ouyang et al., 2021; Liu et al., 2021). For questions that have multi-span answers, in other words, where a single answer contains two or more discontinuous entities in the context, the BIO tagging mechanism (B denotes the start token of a span; I denotes the subsequent tokens; O denotes tokens outside any span) is used to identify these answers and improve model performance (Segal et al., 2020), as illustrated in the sketch below.
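To make the BIO scheme concrete, the short sketch below tags a toy multi-span answer; the tokenization and label layout are illustrative and not taken from Segal et al. (2020).

```python
# Toy example of BIO tagging for a multi-span answer:
# the gold answer consists of three discontinuous single-token spans.
tokens = ["Christine", ",", "Andrea", "and", "Annie", "laugh", "."]
answer_spans = [(0, 0), (2, 2), (4, 4)]  # inclusive (start, end) token indices

labels = ["O"] * len(tokens)
for start, end in answer_spans:
    labels[start] = "B"                    # first token of the span
    for i in range(start + 1, end + 1):
        labels[i] = "I"                    # tokens inside the span

print(list(zip(tokens, labels)))
# [('Christine', 'B'), (',', 'O'), ('Andrea', 'B'), ('and', 'O'),
#  ('Annie', 'B'), ('laugh', 'O'), ('.', 'O')]
```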
Recent studies have also explored the possibilities of prompt-based learning in machine reading comprehension, including a new pre-training scheme that recasts question answering as a few-shot span selection task and a new model that fine-tunes prompts with knowledge. The performance of models using prompt-based learning is significantly higher than that of the baseline models, but still below that of the fine-tuned models.

Graph Neural Network in Machine Reading Comprehension
A graph neural network (GNN) captures the relations among the entities in a text by modeling the entities as nodes in a graph and learning the weights via message passing between the nodes (Kipf and Welling, 2017; Velickovic et al., 2018). As the dependencies in natural language text, the relations among entities, and knowledge-base triples can be relatively easily modeled in a graph structure, graph neural networks have been used for numeric reasoning (Ran et al., 2019), for multi-document question answering by connecting mentions of candidate answers (De Cao et al., 2019), and for multi-hop reasoning by adding edges with co-occurrence relations, with contextual sentences as embeddings (Tu et al., 2020), or with a hierarchical paragraph-sentence-entity graph (Fang et al., 2020), but none of them attempted to connect the anaphoric expressions and their antecedents as a coreference resolution strategy in a graph neural network for machine reading comprehension.

Coreference-aware Machine Reading Comprehension
Our model, inspired by the anaphoric connecting behavior in the human reading comprehension process, consists of four parts, namely, a pre-trained language model, a coreference resolution component, a graph encoder and a fusing layer. The context in the machine reading comprehension task is first processed by a coreference resolution model to identify the underlying coreference clusters, which are formed by dividing the entities and anaphoric expressions in the context into disjoint groups on the principle that the mentions of the same entity should be in the same group. We then use the coreference clusters to construct a coreference matrix that labels each individual cluster and identifies each element in the same cluster with the same cluster number. Meanwhile, the context is tokenized by the tokenizer defined in the pre-trained language model and the embeddings for each token are retrieved from that model. We propose three methods for connecting the anaphoric expressions and their antecedent entities: 1) adding the coreference matrix to each attention head in an additional coreference encoder layer; 2) multiplying the coreference matrix with each attention head in the additional coreference encoder layer; 3) constructing a graph neural network based on the coreference matrix, with the edges corresponding to the coreference relations, and then fusing the graph representation from the graph neural network with the embeddings of the context, as shown in Figure 1. The final representations from any one of the three methods are fed into the classifier to calculate the start/end span of the answer.

Figure 1: Coref-aware fine-tuning for machine reading comprehension. The text is tokenized and fed into a pre-trained language model to obtain the embeddings, and into a coreference resolution model to obtain coreference information. Both the embeddings and the coreference information are used in the fine-tuning stage to 1) enhance cross attentions with additive operations; 2) enhance cross attentions with multiplication operations; or 3) construct a coreference graph neural network with the coreference relations as edges.

Coreference Resolution
Coreference resolution is the process that identifies all the expressions referring to the same entity in a text, clusters them together as coreference clusters, and locates their spans. For example, after coreference resolution of the text Losing his nerve, Frankie calls up his employers to tell them he wants to quit the job., we obtain two mention clusters, as shown in Figure 2. As pre-trained language models use subwords in their tokenization while coreference resolution uses words, a mapping is required to establish the relation between the two. For an input sequence X = {x_1, ..., x_n} of length n, the words W = {w_1, ..., w_m} obtained from the coreference tokenization are mapped to the corresponding subwords (tokens) T = {t_1, ..., t_k} from the tokenizer of the pre-trained language model, with each word containing one or more subwords. We then construct a coreference array with the following rule:

A_i = n if the word containing token t_i belongs to the n-th mention cluster in S_m, and A_i = 0 otherwise,

where i is the position of the token in the token array, S_m is the array of all words in the coreference mention clusters, n is the sequence number of the mention cluster and n ≥ 1. Tokens in the same mention cluster have the same sequence number n in the coreference array.

Figure 2: Coreference resolution: the red curves connect the mentions of the same entity and mark the coreference relations. (Image generated from https://huggingface.co/coref/.)
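A minimal sketch of this word-to-subword mapping and coreference-array construction is shown below, assuming a Hugging Face fast tokenizer and mention clusters given as word-index spans; the helper name and data layout are ours, not part of NeuralCoref.

```python
from transformers import AutoTokenizer

def build_coref_array(words, clusters, tokenizer):
    """words: list of word strings from the coreference tokenization.
    clusters: list of mention clusters, each a list of (start, end) word
              spans (inclusive). Cluster k gets label k + 1; 0 means the
              token is not part of any mention."""
    # Word-level cluster labels first.
    word_labels = [0] * len(words)
    for n, cluster in enumerate(clusters, start=1):
        for start, end in cluster:
            for w in range(start, end + 1):
                word_labels[w] = n

    # Project word labels onto subword tokens using word_ids().
    encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    coref_array = []
    for word_id in encoding.word_ids():
        coref_array.append(0 if word_id is None else word_labels[word_id])
    return encoding, coref_array

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
words = "Losing his nerve , Frankie calls up his employers".split()
clusters = [[(1, 1), (4, 4), (7, 7)]]  # {his, Frankie, his}
_, coref_array = build_coref_array(words, clusters, tokenizer)
print(coref_array)
```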

Graph Neural Network
We use the standard relational graph convolutional network (RGCN) (Schlichtkrull et al., 2018) to obtain a graph representation of the context enriched with coreference information. We use the coreference matrix and the word embeddings to construct a directed and labeled graph G = (V, E, R), with nodes (subwords) v_i ∈ V and edges (relations) (v_i, r, v_j) ∈ E, where r ∈ R is one of two relation types (1 indicates a coreference relation and self-loop; 2 indicates a global relation), as shown in Figure 3. The constructed graph is then fed into the RGCN, with differentiable message passing and basis decomposition to reduce the model parameter size and prevent overfitting:

h_i^(l+1) = σ( Σ_{r∈R} Σ_{j∈N_i^r} (1/c_{i,r}) W_r^(l) h_j^(l) + W_0^(l) h_i^(l) ),   with   W_r^(l) = Σ_{b=1}^{B} a_{rb}^(l) V_b^(l),

where h_i^(l) is the hidden state of node v_i in layer l, N_i^r is the set of neighbors of node i under relation r, c_{i,r} is a normalization constant, and the relation-specific weight W_r^(l) is decomposed into B shared basis matrices V_b^(l) with coefficients a_{rb}^(l).
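The following is a minimal sketch of how such a two-relation coreference graph could be built and encoded with DGL's RelGraphConv; the graph-construction helper, relation numbering, and dimensions are illustrative assumptions, not our exact implementation.

```python
import torch
import dgl
from dgl.nn import RelGraphConv

def build_coref_graph(coref_array):
    """Connect tokens that share a cluster label (relation 0 = coreference);
    link each token to the next one as a simple global relation (relation 1)."""
    src, dst, etype = [], [], []
    n = len(coref_array)
    for i in range(n):
        for j in range(n):
            if i != j and coref_array[i] != 0 and coref_array[i] == coref_array[j]:
                src.append(i); dst.append(j); etype.append(0)
    for i in range(n - 1):  # global (sequential) edges
        src.append(i); dst.append(i + 1); etype.append(1)
    g = dgl.graph((torch.tensor(src), torch.tensor(dst)), num_nodes=n)
    return g, torch.tensor(etype)

coref_array = [0, 1, 0, 0, 1, 0, 0, 1, 0]        # toy cluster labels per token
token_emb = torch.randn(len(coref_array), 1024)   # e.g. RoBERTa-large hidden size

g, etypes = build_coref_graph(coref_array)
# Two relation types, basis decomposition with 2 bases, built-in self-loop.
rgcn = RelGraphConv(1024, 1024, num_rels=2, regularizer="basis",
                    num_bases=2, self_loop=True)
node_repr = rgcn(g, token_emb, etypes)            # (num_tokens, 1024)
```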

Coreference-enhanced Attention
In addition to the graph neural network method, we also explore the possibility of using the self-attention mechanism (Vaswani et al., 2017) by explicitly adding an encoder layer and incorporating the coreference information into the attention heads of that layer, so as to guide the model to identify the mentions in the same cluster as the same entity.
We use two methods to fuse the coreference information and the original embeddings from the pre-trained language model: additive attention fusing and dot product (multiplication) attention fusing. Given the coreference array A = {m_1, 0, m_1, m_2, 0, m_2, m_3, 0, m_3, m_1, ...}, where m_n denotes the n-th mention cluster and 0 denotes no mention, the enriched attention for additive attention fusing is formulated as:

Attn_add(Q, K, V) = softmax( (Q W_i^Q)(K W_i^K)^T / √d_k + M_A ) V W_i^V,

where M_A is a coreference matrix constructed from the coreference array A, whose element values are obtained by adding (for the additive model) or multiplying (for the multiplication model) the coreference hyper-parameter coref_weight with the original attention weight when the corresponding element belongs to the coreference array; Q, K and V are the query, key and value respectively, d_k is the dimension of the keys, and W_i denotes the trainable projection parameters. For dot product (multiplication) fusing, it is formulated as:

Attn_mult(Q, K, V) = softmax( ((Q W_i^Q)(K W_i^K)^T / √d_k) ⊙ M_A ) V W_i^V.
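A simplified single-head sketch of the two fusion variants is given below; the construction of the coreference mask and the value of coref_weight are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def coref_attention(q, k, v, coref_array, coref_weight=1.0, mode="add"):
    """Single-head attention with coreference-enhanced scores.
    q, k, v: (seq_len, d_k) projected query/key/value.
    coref_array: cluster label per token (0 = no mention)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (seq_len, seq_len)

    labels = torch.tensor(coref_array)
    same_cluster = (labels.unsqueeze(0) == labels.unsqueeze(1)) & (labels.unsqueeze(0) != 0)

    if mode == "add":
        # Add coref_weight to the scores of coreferent token pairs.
        scores = scores + coref_weight * same_cluster.float()
    else:
        # Scale the scores of coreferent token pairs by coref_weight.
        m_a = torch.where(same_cluster, torch.full_like(scores, coref_weight),
                          torch.ones_like(scores))
        scores = scores * m_a

    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(9, 64)
out = coref_attention(q, k, v, [0, 1, 0, 0, 1, 0, 0, 1, 0], coref_weight=0.5)
```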

Integration
A machine reading comprehension task expects the model to output the start and end positions of the answer. For the RGCN method, we fuse the hidden states of the nodes v_i in the last layer of the RGCN and the embeddings from the pre-trained language model with a fully-connected (FC) layer, and then calculate the start/end positions of the answer:

S = FC([E_prLM; E_gnn]),   P_s = softmax(W_s S),

where E_prLM denotes the embeddings from the pre-trained language model, E_gnn denotes the embeddings from the graph encoder, S denotes the fused text feature, W_s denotes the weight matrix, and P_s denotes the predicted start positions (the end positions are predicted analogously). For the two methods that add one additional encoder layer for additive or multiplication attention enrichment, we directly use the output of that encoder layer for the follow-up processing.
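A hedged sketch of this fusion and span-prediction head is shown below; the layer sizes and the concatenation-based fusion are our assumptions about the FC fusing layer.

```python
import torch
import torch.nn as nn

class CorefSpanHead(nn.Module):
    """Fuse PrLM embeddings with RGCN node states and predict answer spans."""
    def __init__(self, hidden_size=1024):
        super().__init__()
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)   # FC fusing layer
        self.span_logits = nn.Linear(hidden_size, 2)           # start and end

    def forward(self, e_prlm, e_gnn):
        # e_prlm, e_gnn: (batch, seq_len, hidden_size)
        s = torch.tanh(self.fuse(torch.cat([e_prlm, e_gnn], dim=-1)))
        logits = self.span_logits(s)                            # (batch, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

head = CorefSpanHead()
e_prlm = torch.randn(2, 128, 1024)
e_gnn = torch.randn(2, 128, 1024)
start_logits, end_logits = head(e_prlm, e_gnn)
```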
Following the practice of CorefRoBERTa (Ye et al., 2020) in handling multiple answers for the same question, we use cross entropy to calculate the loss for each answer if the question has multiple answers:

L_s = Σ_{i=1}^{n} CE(P_s^(i), y_s^(i)),   L_e = Σ_{i=1}^{n} CE(P_e^(i), y_e^(i)),   L_total = L(E_n, n) + L_s + L_e,

where n denotes the answer count, a hyper-parameter for handling multiple answers; E_n denotes the result of a linear transformation of the embeddings for the answer count, from which the predicted start and end positions are also obtained; y_s^(i) and y_e^(i) are the gold start and end positions of the i-th answer; L(E_n, n) denotes the cross-entropy loss between the transformed embeddings and the answer count; L_s denotes the total loss of the start positions; L_e denotes the total loss of the end positions; and L_total denotes the combined total loss.
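The following sketch shows one way such a multi-answer loss could be computed; the padding convention and the way the answer-count head shares the encoder output are assumptions on our part.

```python
import torch
import torch.nn.functional as F

def multi_answer_loss(start_logits, end_logits, count_logits,
                      gold_starts, gold_ends, gold_count):
    """start_logits, end_logits: (seq_len,) span scores for one example.
    count_logits: (max_answers,) answer-count scores.
    gold_starts, gold_ends: gold positions for up to max_answers answers."""
    loss_s = sum(F.cross_entropy(start_logits.unsqueeze(0), s.unsqueeze(0))
                 for s in gold_starts)
    loss_e = sum(F.cross_entropy(end_logits.unsqueeze(0), e.unsqueeze(0))
                 for e in gold_ends)
    loss_count = F.cross_entropy(count_logits.unsqueeze(0),
                                 gold_count.unsqueeze(0))
    return loss_s + loss_e + loss_count

start_logits, end_logits = torch.randn(128), torch.randn(128)
count_logits = torch.randn(2)                 # answer count capped at 2
gold_starts = torch.tensor([10, 42])
gold_ends = torch.tensor([12, 45])
gold_count = torch.tensor(1)                  # zero-based index: two answers
loss = multi_answer_loss(start_logits, end_logits, count_logits,
                         gold_starts, gold_ends, gold_count)
```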

Model Settings
We developed three models based on the Transformer architecture. The pre-trained RoBERTa-large was used as the base model, which we then fine-tuned with the following three methods: 1) Coref GNN: feeding the coreference information into a graph neural network and then fusing the representations; 2) Coref AddAtt: adding the coreference weights to the self-attention weights; 3) Coref MultiAtt: calculating the dot product of the coreference weights with the self-attention weights. We used the results of CorefRoBERTa (Ye et al., 2020) as our baselines.

Setup
Our coreference resolution was implemented with spaCy (Honnibal and Montani, 2017) and NeuralCoref. NeuralCoref is an extension for spaCy trained on the OntoNotes 5.0 dataset following the training process proposed by Clark and Manning (2016); it identifies the clusters of coreferent mentions in the text. In particular, spaCy 2.1.0 and NeuralCoref 4.0 are used, because the latest spaCy version 3.0+ has compatibility issues with NeuralCoref and extra effort is required to resolve them.
The neural networks were implemented in PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2020). We used the embeddings of the pre-trained language model RoBERTa LARGE, with the relational graph convolutional network implemented in the Deep Graph Library (DGL). We used Adam (Kingma and Ba, 2015) as our optimizer, with the learning rate selected from {1e-5, 2e-5, 3e-5}. We trained each model for {4, 6} epochs and selected the best checkpoints on the development set by Exact Match and F1 scores. All experiments were run on
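For reference, a minimal sketch of the optimizer setup under these hyper-parameters might look as follows; the scheduler choice and the number of steps per epoch are illustrative assumptions rather than our exact training script.

```python
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained("roberta-large")   # plus the coref components

learning_rate = 2e-5          # searched over {1e-5, 2e-5, 3e-5}
num_epochs = 4                # searched over {4, 6}
steps_per_epoch = 1000        # illustrative; depends on batch size and data size

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=num_epochs * steps_per_epoch,
)
```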

Tasks and Datasets
Our evaluation was performed on the QUOREF dataset (Dasigi et al., 2019). The dataset contains a train set with 3,771 paragraphs and 19,399 questions, a validation set with 454 paragraphs and 2,418 questions, and a test set with 477 paragraphs and 2,537 questions.

Results
We quantitatively evaluated the three methods and report the standard metrics: exact match score (EM) and word-level F1-score (F1) (Rajpurkar et al., 2016). As shown in Table 2, compared with the baseline model CorefRoBERTa, the performance of our models improves significantly. In particular, Coref AddAtt performs best, with improvements of 5.08% (Exact Match) and 4.42% (F1) over the baseline model on the QUOREF dev set, and 3.31% (Exact Match) and 3.05% (F1) on the QUOREF test set. Coref GNN and Coref MultiAtt also outperform the baseline model, by 2.80% (Exact Match) and 2.34% (F1), and 2.72% (Exact Match) and 2.46% (F1) respectively, on the test set. Compared with RoBERTa LARGE, which does not use any explicit coreference information in training, and CorefRoBERTa LARGE, which uses the coreference information in pre-training, our models achieve larger improvements, which demonstrates the effectiveness of the explicit coreference instructions in our strategies.

Model Efficiency
As shown in Table 2, compared with RoBERTa LARGE, our methods add only one component that explicitly incorporates the coreference information, and all three methods exhibit considerable improvements over the baselines. Compared with RoBERTa LARGE, which has 354M parameters, Coref AddAtt and Coref MultiAtt add an encoder layer, which introduces over 12M parameters. For the Coref GNN method, we add one hidden layer in the GNN and two linear layers to transform the feature dimensions, around 68.7K parameters in total. Intuitively, with more focus on the coreference clues, the models perform better on a task that requires intensive coreference resolution, since we have explicitly increased the attention weights connecting the words in the same coreference mention cluster. However, the overall performance of the models is also limited by the performance of the coreference component we use, namely NeuralCoref.

Case Studies
To understand the models' performance beyond the automated metrics, we analyzed our predicted answers qualitatively. Table 3 compares representative answers predicted by our models and by CorefRoBERTa LARGE. These examples require the model to precisely locate, among several distracting entities, the entity referred to by the anaphoric expression that directly answers the question. After resolving the anaphoric expression with its antecedents in the context and enhancing the representation with the coreference information by connecting the anaphoric expression to its antecedents, such as the connection from her to Henrietta in the first example and from she to Rihanna in the second example, our model accurately locates the entity name among several names in the context, which CorefRoBERTa LARGE fails to do.
We further explored the effects of the anaphoric connections on the attention weights by comparing the attention weights of the sample in the first row of Table 3 between our Coref AddAtt and the CorefRoBERTa LARGE model, as shown in Figure 4. It is clear that the anaphoric expressions are not connected in the CorefRoBERTa LARGE model, while for Coref AddAtt the varying colors on the left heat-map indicate the connection strength among the anaphoric expressions and evidence the effect of the explicit coreference addition, which smooths and strengthens the attention over anaphoric expressions and contributes to the higher performance of our models.

Table 3: Comparison of the predictions for two questions in the QUOREF dev set. The blue and bold words indicate the mentions in the same coreference cluster obtained from coreference resolution. In the Answers column, Golden indicates the golden answer; CorefR indicates the prediction made by the CorefRoBERTa LARGE model; C AddAtt indicates the prediction made by the Coref AddAtt model.

Example 1
Coref-resolved context (abbreviated): Henrietta take an immediate liking to her, and she asks if Luce can sit by her during the wedding. Rachel arrives with her father and the ceremony begins. As Rachel is walking down the aisle, her eyes wander and she makes eye contact with Luce.
Question: Rachel makes eye contact with a woman sitting next to whom?
Answers: Henrietta (Golden); Rachel (CorefR); Henrietta (C AddAtt)

Example 2
Coref-resolved context (abbreviated): After the song was completed, they wanted to play it to Rihanna, but Blanco was skeptical about the reaction towards the song because of its slow sound. After StarGate played it to her, they called Blanco from London and told him that she liked the song: "She's flippin' out."
Question: Who liked a song?
Answers: Rihanna (Golden); Blanco (CorefR); Rihanna (C AddAtt)

Error Analysis
Despite the improvements made by our model, it still fails to predict the correct answers for some questions. We analyzed and summarized several error cases as follows; Table 4 shows three representative types of errors. The first type of error is caused by the limitations of the coreference resolution component, NeuralCoref, whose performance does not reach 80% in F1 for MUC, B³ or CEAF φ4 (Clark and Manning, 2016). This is evidenced by the failure to resolve the antecedent of the anaphoric expression its to the academy in the first sample, and the failure to cluster the anaphoric expression her with the entity Beyoncé in the second sample, despite the success in resolving the second Gilman to its antecedent Rockwell "Rocky" Gilman. The second type of error is more complicated, involving multi-step reasoning that cannot be handled by simply adding the coreference information. To correctly answer the second question, the model should perform two successive tasks: 1) it should understand that Mathew Knowles is the father of Beyoncé; 2) it should understand the world knowledge that the last name of Beyoncé is the same as her father's, which should be Knowles. This type of error shows that our model performs poorly on questions that require multi-step reasoning. The third type of error is caused by questions that have multiple items in an answer. A hyper-parameter that limits the total number of items in an answer is used in our models and is set to 2 during training, so when the number of items in the answer exceeds 2, our models fail to predict all of them, and the third item Annie is ignored.

Table 4: Representative error cases from the QUOREF dev set. In the Answers column, Golden indicates the golden answer and C AddAtt indicates the prediction made by the Coref AddAtt model.

Example 1
Coref-resolved context (abbreviated): West Point cadet Rockwell "Rocky" Gilman is called before a hearing brought after an influential cadet, Raymond Denmore, Jr., is forced to leave the academy...Denmore's attorney, Lew Proctor, attacking the academy and its Honor Code system, declares that Gilman is unfit and possibly criminally liable.
Question: Who's honor code system does Proctor attack?
Answers: the academy (Golden); West Point (C AddAtt)

Example 2
Coref-resolved context (abbreviated): Following a career hiatus that reignited her creativity, Beyoncé was inspired to create a record with a basis in traditional rhythm and blues that stood apart from contemporary popular music...Severing professional ties with father and manager Mathew Knowles, Beyoncé eschewed the music of her previous releases
Question: What is the last name of the person who went on a career hiatus?

Example 3
Coref-resolved context (abbreviated): When the prosecutor suggests that the crime would have still happened if the owner were a woman, Christine, Andrea, Annie, Janine and the other women who witnessed the crime all laugh and exit the courtroom.
Question: What are the names of the women Janine has to determine are sane or crazy?
Answers: Christine, Andrea, Annie (Golden); Christine, Andrea (C AddAtt)

Conclusion
In this paper, we present intuitive methods to solve coreference-intensive machine reading comprehension tasks by following the human reading process of connecting anaphoric expressions and by explicitly instructing the model to do so. We demonstrate that all three of our fine-tuning methods, Coref GNN, Coref AddAtt and Coref MultiAtt, are superior to pre-trained language models that incorporate the coreference information in the pre-training stage, such as CorefRoBERTa LARGE. As the fine-tuning methods rely on coreference resolution models supplied by other researchers, their performance is also constrained by the accuracy of those coreference resolution models. In addition, questions that require multi-step reasoning, span multiple entities or contain multiple answer items also pose challenges to our models. In the future, with more in-depth study of human reasoning in reading comprehension and more progress in graph neural networks, the GNN-based coreference graph can be enriched with more edge types and diverse structures to leverage more linguistic knowledge and gain better performance.