Focus on what matters: Applying Discourse Coherence Theory to Cross Document Coreference

Performing event and entity coreference resolution across documents vastly increases the number of candidate mentions, making it intractable to do the full n^2 pairwise comparisons. Existing approaches simplify by considering coreference only within document clusters, but this fails to handle inter-cluster coreference, common in many applications. As a result, cross-document coreference algorithms are rarely applied to downstream tasks. We draw on an insight from discourse coherence theory: potential coreferences are constrained by the reader's discourse focus. We model the entities/events in a reader's focus as a neighborhood within a learned latent embedding space which minimizes the distance between mentions and the centroids of their gold coreference clusters. We then use these neighborhoods to sample only hard negatives to train a fine-grained classifier on mention pairs and their local discourse features. Our approach achieves state-of-the-art results for both events and entities on the ECB+, Gun Violence, Football Coreference, and Cross-Domain Cross-Document Coreference corpora. Furthermore, training on multiple corpora improves average performance across all datasets by 17.2 F1 points, leading to a robust coreference resolution model that is now feasible to apply to downstream tasks.


Introduction
Cross-document coreference resolution of entities and events (CDCR) is an increasingly important problem, as downstream tasks that benefit from coreference annotations, such as question answering, information extraction, and summarization, begin interpreting multiple documents simultaneously. Yet the number of candidate mentions across documents makes evaluating the full n^2 pairwise comparisons intractable (Cremisini and Finlayson, 2020). For single-document coreference, the search space is pruned with simple recency-based heuristics, but there is no natural analogue to recency with multiple documents.
Most CDCR systems thus instead cluster the documents and perform the full n^2 comparisons only within each cluster, disregarding inter-cluster coreference (Lee et al., 2012; Yang et al., 2015; Choubey and Huang, 2017; Barhom et al., 2019; Cattan et al., 2020; Yu et al., 2020; Caciularu et al., 2021). This was effective for the ECB+ dataset, on which most CDCR methods have been evaluated, because ECB+ has lexically distinct topics with almost no inter-cluster coreference.
Such document clustering, however, keeps CDCR systems from being generally applicable. Bugert et al. (2020b) shows that inter-cluster coreference makes up the majority of coreference in many applications. Cremisini and Finlayson (2020) note that document clustering methods are also unlikely to generalize well to real data where documents lack the significant lexical differences of ECB+ topics. These issues present a major barrier for the general applicability of CDCR.
Human readers, by contrast, are able to perform coreference resolution with minimal pairwise comparisons. How do they do it? Discourse coherence theory (Grosz, 1977, 1978; Grosz and Sidner, 1986) proposes a simple mechanism: a reader focuses on only a small set of entities/events from their full knowledge. This set, the attentional state, is constructed as entities/events are brought into focus either explicitly by reference or implicitly by their similarity to what has been referenced. Since attentional state is inherently dynamic (entities/events come into and out of focus as discourse progresses), a document-level approach is a poor model of this mechanism.
We propose modeling focus at the mention level using the two-stage approach illustrated in Figure 1. We model attentional state as the set of K nearest neighbors within a latent embedding space for mentions. This space is learned with a distance-based classification loss to construct embeddings that minimize the distance between mentions and the centroid of all mentions which share their reference class.

[Figure 1: A high-level overview of our system: For a particular mention, candidate coreferring mentions are retrieved from a neighborhood surrounding the mention. These candidate pairs are fed to a pairwise classifier specialized for hard negatives fetched from this space. This allows our method to create a high-fidelity coreference graph with minimal pairwise comparison and no a priori assumptions about coreference. We use a bi-encoder for candidate retrieval and a cross-encoder for pairwise classification (Humeau et al., 2020).]
These attentional state neighborhoods aggressively constrain the search space for our second-stage pairwise classifier. This classifier utilizes cross-attention between mention pairs and their local discourse features to capture the features important within an attentional state which are comparison specific (Grosz, 1978). By sampling from attentional state neighborhoods at training time, we train on only hard negatives such as those shown in Table 1. We analyze the contribution of the local discourse features to our approach, providing an explanation for the empirical effectiveness of our classifier and that of earlier work like Caciularu et al. (2021).
Following the recommendations of Bugert et al. (2020a), we evaluate our method on multiple event and entity CDCR corpora, as well as on cross-corpus transfer for event CDCR. Our method achieves state-of-the-art results on the ECB+ corpus for both events (+0.2 F1) and entities (+0.7 F1), the Gun Violence Corpus (+11.3 F1), the Football Coreference Corpus (+13.3 F1), and the Cross-Domain Cross-Document Coreference Corpus (+34.5 F1). We further improve average results by training across all event CDCR corpora, leading to a 17.2 F1 improvement for average performance across all tasks. Our robust model makes it feasible to apply CDCR to a wide variety of downstream tasks, without requiring expensive new coreference annotations to enable fine-tuning on each new corpus. (This has been a huge effort for the few tasks that have attempted it, like multi-hop QA (Dhingra et al., 2018) and multi-document summarization (Falke et al., 2017).)

Related Work
Cross-Document Coreference Many CDCR algorithms use hand-engineered event features to perform classification. Such systems have a low pairwise classification cost and therefore ignore the quadratic scaling and perform no pruning (Bejan and Harabagiu, 2010; Yang et al., 2015; Vossen and Cybulska, 2016; Bugert et al., 2020a). Other such systems choose to include document clustering to increase precision, which can be done with very little tradeoff for the ECB+ corpus (Lee et al., 2012; Cremisini and Finlayson, 2020). Kenyon-Dean et al. (2018) explore an approach that avoids pairwise classification entirely, instead relying purely on representation learning and clustering within an embedding space. They propose a novel distance-based regularization term for their classifier that encourages representations that can be used for clustering. This approach is more scalable than pairwise classification approaches, but its performance lags behind the state-of-the-art as it cannot use pairwise information.
Most recent systems use neural models for pairwise classification (Barhom et al., 2019; Cattan et al., 2020; Meged et al., 2020; Zeng et al., 2020; Yu et al., 2020; Caciularu et al., 2021). These algorithms each use document clustering, a pairwise neural classifier to construct distance matrices within each topic, and agglomerative clustering to compute the final clusters. Innovation has focused on the pairwise classification stage, with variants of document clustering as the only pruning option. Caciularu et al. (2021) set the previous state of the art for both events and entities in ECB+ using a cross-document language model with a large context window to cross-encode and classify a pair of mentions with the full context of their documents.
Other Tasks Lee et al. (2018) introduce the concept of a "coarse-to-fine" approach in single-document entity coreference resolution. The architecture utilizes a bi-linear scoring function to generate a set of likely antecedents, which is then passed through a more expensive classifier which performs higher-order inference on antecedent chains. Our work extends to multiple documents the idea of using a high-recall but low-precision pruning function combined with expensive pairwise classification to balance recall, precision, and runtime efficiency. Wu et al. (2020) use a similar architecture to ours to create a highly scalable system for zero-shot entity linking. Their method treats entity linking as a ranking problem, using a bi-encoder to retrieve possible entity mentions and then re-ranking the candidate mentions using a cross-encoder. Their results confirm that such architectures can deliver state-of-the-art performance while achieving tremendous scale. However, in coreference resolution, mentions can have one, many, or no coreferring mentions, which makes treating it as a ranking problem non-trivial and necessitates the novel training and inference processes we propose.

Model
Our system is trained in multiple stages and evaluated as a single pipeline. First, we train the encoder for the pruning model to define our latent embedding space. Then, we use this model to sample training data for a pairwise classifier which performs binary classification for coreference. Our complete pipeline retrieves candidate pairs from the attentional state, classifies them using the pairwise classifier, and performs a variant of the agglomerative clustering algorithm proposed by Barhom et al. (2019) to form the final clusters, as laid out in Figure 2.

Candidate Retrieval
Encoding Setup We feed the sentences from a window surrounding the mention sentence to a fine-tuned BERT architecture initialized from RoBERTa-large pre-trained weights (Devlin et al., 2019; Liu et al., 2019). A mention is represented as the concatenation of the token-level representations at the boundaries of the mention, following the span boundary representations used by Lee et al. (2017).

[Algorithm 1 (Inference): given mentions M_e, a bi-encoder scorer s(·,·), and a cross-encoder scorer p(·,·): pairs ← nearestNeighborPairs(M_e, s(·,·)); likelyPairs ← scoreAndSort(pairs, p(·,·)); merge clusters greedily over likelyPairs; return C.]
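The span boundary representation described above can be sketched as follows. This is an illustrative reconstruction (using NumPy arrays in place of the actual PyTorch encoder outputs; the function name is ours, not the authors'):

```python
import numpy as np

def mention_representation(token_embeddings: np.ndarray, start: int, end: int) -> np.ndarray:
    """Represent a mention span as the concatenation of the contextual
    embeddings of its first and last tokens, following the span boundary
    representations of Lee et al. (2017).

    token_embeddings: (seq_len, hidden) contextual token embeddings.
    start, end: inclusive token indices of the mention span.
    """
    return np.concatenate([token_embeddings[start], token_embeddings[end]])
```

For a hidden size of h, the resulting mention vector has dimension 2h.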
Optimization Similar to Kenyon-Dean et al. (2018), the network is trained on a multi-class classification problem where the classes are labels assigned to the gold coreference clusters, which are the connected components of the coreference graph. Rather than adding distance-based regularization, we instead optimize the distance metric directly by using the inner product as our scoring function.
Before each epoch, we construct the representation of each mention $y_{m_i}$ with the encoder from the previous epoch. Each gold coreference cluster $c_i$ is represented as the centroid $y_{c_i}$ of its component mentions:

$$y_{c_i} = \frac{1}{|c_i|} \sum_{m_j \in c_i} y_{m_j} \quad (1)$$

The score $s_o$ of a mention $m_i$ for a cluster $c_i$ is simply the inner product between this cluster representation and the mention representation:

$$s_o(m_i, c_i) = y_{m_i} \cdot y_{c_i} \quad (2)$$

Using this scoring function, the model is trained to predict the correct cluster for a mention with respect to sampled negative clusters. We combine random in-batch negative clusters with hard negatives from the top 10 predicted gold clusters for each training sample in the batch, following Gillick et al. (2019). For each mention $m_i$ with true cluster $c$ and negative clusters $B$, the loss is the categorical cross-entropy on the softmax of our score vector:

$$L(m_i) = -s_o(m_i, c) + \log \sum_{c' \in B \cup \{c\}} \exp(s_o(m_i, c')) \quad (3)$$

This loss function can be interpreted intuitively as rewarding embeddings which form separable, dense mention clusters according to their gold coreference labels. The left term in our loss function acts as an attractive component toward the centroid of the gold cluster, while the right term acts as a repulsive component away from the centroids of incorrect clusters. The repulsive component is especially important for singleton clusters, whose centroids are by definition identical to their mention representations.
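The loss can be sketched numerically as follows. This is a minimal NumPy illustration of the cross-entropy over centroid scores (Eq. 3), not the authors' implementation; by convention here the gold cluster's score sits at index 0:

```python
import numpy as np

def centroid_loss(mention: np.ndarray, true_centroid: np.ndarray,
                  negative_centroids: list) -> float:
    """Cross-entropy loss over inner-product scores between a mention
    embedding and candidate cluster centroids: attractive toward the
    gold centroid, repulsive away from negative centroids."""
    scores = np.array([mention @ true_centroid] +
                      [mention @ c for c in negative_centroids])
    # loss = -s(m, c) + logsumexp over all candidate clusters (gold at index 0),
    # computed with the max-shift trick for numerical stability
    m = scores.max()
    return float(-scores[0] + m + np.log(np.exp(scores - m).sum()))
```

A mention embedded near its gold centroid and far from negatives yields a small loss, which is exactly the geometry the bi-encoder is trained toward.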
Inference Unlike previous work using the bi-encoder architecture, our inference task is distinct from our training task. Since our training task requires oracle knowledge of the gold coreference labels, it cannot be performed at inference time. However, since the embedding model is optimized to place all mentions near their centroids, it implicitly places all mentions of the same class close to one another even when that class is unknown. Therefore, the set of K nearest mentions within this space is made up of coreferences and references to highly related entities/events, such as those shown in Table 1, which models an attentional state made up of entities/events explicitly and implicitly in focus (Grosz and Sidner, 1986).
Compared to document clustering, this approach can prune aggressively without disregarding any links. The encoding step scales linearly and old embeddings do not need to be recomputed if new documents are added. Importantly, no pairs are disregarded a priori when we compute the nearest neighbor graph and this efficient computation can scale to millions of points using GPU-enabled nearest neighbor libraries like FAISS (Johnson et al., 2017), which we use for our implementation.
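The retrieval step can be illustrated with a brute-force stand-in. At scale the paper uses FAISS; this NumPy sketch (names ours) shows the same inner-product K-nearest-neighbor logic on a size where brute force is fine:

```python
import numpy as np

def nearest_neighbor_pairs(embeddings: np.ndarray, k: int):
    """For each mention embedding, retrieve the indices of its k closest
    neighbors by inner product (excluding itself), yielding candidate
    mention pairs. In production this index would be a FAISS index."""
    scores = embeddings @ embeddings.T
    np.fill_diagonal(scores, -np.inf)          # a mention is not its own candidate
    neighbors = np.argsort(-scores, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(embeddings)) for j in neighbors[i]]
```

Because every mention contributes its own neighborhood, no pair is excluded a priori; pruning falls out of the learned geometry rather than from document clusters.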

Pairwise Classifier
Classification Setup For pairwise classification, we use a transformer with cross-attention between pairs. This follows prior work demonstrating that such encoders pick up distinctions between classes which previously required custom logic (Yu et al., 2020). Our use of cross-attention is also motivated by discourse coherence theory. Grosz (1978) highlights that, within an attentional state, the importance to coreference of a mention's features depends heavily on the features of the mention it is being compared to.
The cross-encoder is a fine-tuned BERT architecture starting with RoBERTa-large pre-trained weights. For a mention pair $(e_i, e_j)$, we build a pairwise representation by feeding both mentions' context windows to our encoder as a single sequence, where $S_i$ is the sentence in which the mention occurs and $w$ is the maximum number of sentences away from the mention sentence we include as context:

$$S_{i-w}, \ldots, S_i, \ldots, S_{i+w} \; [\mathrm{SEP}] \; S_{j-w}, \ldots, S_j, \ldots, S_{j+w}$$

Each mention is represented as $v_{e_i}$, the concatenation of the representations of its boundary tokens, with the pair of mentions represented as the concatenation of each mention representation and the element-wise multiplication of the two mentions:

$$v_{(i,j)} = [v_{e_i}; v_{e_j}; v_{e_i} \circ v_{e_j}]$$

This vector is fed into a multi-layer perceptron, and we apply the softmax function to get the probability that $e_i$ and $e_j$ are coreferring.
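The pairwise feature construction reduces to a simple concatenation, sketched here (function name ours):

```python
import numpy as np

def pair_representation(v_i: np.ndarray, v_j: np.ndarray) -> np.ndarray:
    """Build the pairwise feature vector fed to the MLP scorer:
    the two mention representations concatenated with their
    element-wise product, [v_i; v_j; v_i * v_j]."""
    return np.concatenate([v_i, v_j, v_i * v_j])
```

The element-wise product term gives the scorer a direct similarity signal per dimension, a common design in pairwise neural classifiers.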
Training Pair Generation We use K nearest neighbors in the bi-encoder embedding space to generate training data for the pairwise classifier. This gives the training data a distribution of positives and negatives similar to what the classifier will likely see at inference time, while also sampling only positive and hard negative pairs. These negatives are those that the bi-encoder was unable to separate clearly in isolation, which makes them prime candidates for more expensive cross-comparison. At training time, the selection of hyperparameter K is used to balance the volume of training data with the difficulty of negative pairs.
Optimization Once the training data has been generated, we simply train the classifier in a binary setup to classify a pair as either coreferring or non-coreferring. As with prior work, we optimize our pairwise classifier using binary cross-entropy loss.

Clustering
At inference time, we use a modified form of the agglomerative clustering algorithm designed by Barhom et al. (2019) to compute clusters, as described in Figure 2. We do not perform mention detection, so our method relies on gold mentions or a separate mention detection step. First, it generates pairs of mentions using K nearest neighbor retrieval within our embedding space. Each of these pairs is run through the trained cross-encoder, and all pairs with a probability of less than 0.5 are removed. Pairs are then sorted by their classification probability and clusters are merged greedily.
Following Barhom et al. (2019), we compute the score between two clusters as the average score between all mention pairs in each cluster. However, since we only compare two clusters that share a local edge, we do this without computing the full pairwise distance matrix.
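The greedy merging step above can be sketched in pure Python. This is a simplified illustration using union-find: it shows only the 0.5 thresholding and the greedy merge order, omitting the average-score cluster comparison of Barhom et al. (2019):

```python
def greedy_cluster(num_mentions: int, scored_pairs):
    """Greedily merge mention clusters from classifier-scored candidate
    pairs (i, j, prob): pairs are processed in descending probability,
    pairs below 0.5 are discarded, and surviving pairs merge the clusters
    of their two mentions via union-find."""
    parent = list(range(num_mentions))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i, j, prob in sorted(scored_pairs, key=lambda t: -t[2]):
        if prob < 0.5:
            break                            # remaining pairs score even lower
        parent[find(i)] = find(j)

    clusters = {}
    for m in range(num_mentions):
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())
```

Singleton mentions that never appear in a surviving pair naturally remain their own clusters.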

Experiments
We perform an empirical study across 3 event and 2 entity English cross-document coreference corpora.

Datasets
Here we briefly cover the properties of each corpus we evaluate on. For a more thorough breakdown of corpus properties for event CDCR, see Bugert et al. (2020a).

Event Coreference Bank Plus (ECB+)
Historically, the ECB+ corpus has been the primary dataset used for evaluating CDCR. This corpus is based on the original Event Coreference Bank corpus from Bejan and Harabagiu (2010), with entity annotations added by Lee et al. (2012) to allow joint modeling and additional documents added by Cybulska and Vossen (2014).

Cross-Domain Cross-Document Coreference Corpus (CD2CR) Ravenscroft et al. (2021) present a dataset which evaluates the ability of CDCR models to work across domains which vary significantly in style and vocabulary. It contains 918 documents, made up of 459 pairs of a scientific paper and a newspaper article covering that paper. These articles cover a variety of topics, but since documents come in automatically discovered pairs, existing evaluations use the gold document pairs. It contains 13,169 links between 3,102 entity mentions.

Evaluation and Results
All models are implemented in PyTorch (Paszke et al., 2019) and optimized with Adam (Kingma and Ba, 2015). Training the whole pipeline takes one day on a single Tesla V100 GPU. For ECB+, we use the data split used by Cybulska and Vossen (2015). For both FCC and GVC, we use the data splits used by Bugert et al. (2020a). For CD2CR, we use the splits used by Ravenscroft et al. (2021).

ECB+ […] (Barhom et al., 2019) and there are no inter-cluster links (Bugert et al., 2020a). Given that document clustering has almost no downside for ECB+ and Caciularu et al. (2021) use a cross-encoder architecture with a much wider context window for classification, we largely credit the increased performance on the ECB+ dataset to the benefits of hard sampling using our attentional state neighborhoods.
GVC & FCC We evaluate the broader applicability of our model for event CDCR by applying it to the FCC and GVC datasets. Each aims to address elements of real-world event CDCR overlooked by ECB+. These datasets only annotate events, preventing joint modeling of events and entities. This negatively impacts Barhom et al. (2019), which was designed as a joint method, but requires no changes to our architecture.
Our approach improves over the state of the art by 11.3 F1 points for the GVC dataset and by 13.1 F1 points for the FCC dataset. It is worth noting that the previous state-of-the-art was split between these datasets, with document clustering benefiting GVC and harming FCC performance. Our approach improves on the results for both datasets without modification, unifying the state-of-the-art under one approach.
CD2CR CD2CR presents a unique challenge with coreference links which span two domains with very different linguistic properties: academic text and science journalism. While one might expect that this linguistic diversity could cause our pruning method to struggle to retrieve pairs across domains, our method proves robust to this challenge with a 34.5 F1 point improvement over the state-of-the-art. This is especially significant as CD2CR previously used a highly corpus-tailored document linking algorithm that relied on data such as DOI matching and author name and affiliation matching, since the document clustering algorithms used for ECB+ are a poor fit for CD2CR due to its within-topic lexical diversity. This highlights how flexible our method is compared to document clustering.

Event Cross-Dataset Evaluation
We evaluate the robustness of our learned models by training and evaluating across the multiple event datasets. Bugert et al. (2020a) propose cross-corpus training as a treatment to produce more generally effective models, since downstream corpora are unlikely to match any specific CDCR corpus. We follow their cross-corpus evaluation and present the results for this cross-evaluation in Table 3.
For models trained on the train split from a single corpus, we see significant performance loss when evaluated on test splits from other corpora, as expected. However, we see vastly improved generalizability with our approach when trained on a single corpus compared to the baseline set by Bugert et al. (2020a).
To evaluate the ability of our model to learn from multiple corpora at once, we train our pipeline on combinations of the event corpora. Interestingly, our performance on both FCC and GVC improves when our model is trained on two of the three datasets. We achieve our best results on FCC when GVC training data is added and our best results on GVC when ECB+ data is added. This signals that there is potential for further improvement of the model trained on all datasets by exploring what causes the performance decrease with the introduction of the third dataset in these two cases.
Most importantly, our model trained across all datasets shows improved generalizability across each dataset, sacrificing 2.9, 5.0, and 4.9 F1 points compared to our state-of-the-art corpus-tailored models for ECB+, GVC, and FCC respectively. This is a 4.27 F1 point decrease on average, compared to 16.7 F1 points for the baseline, suggesting that our model more effectively adapts to the varying feature importance across corpora shown by Bugert et al. (2020a). For use in downstream systems, this model variant makes it feasible to apply CDCR to a variety of downstream corpora without fine-tuning, which is especially important since the majority of downstream tasks lack coreference annotations for fine-tuning.

Analysis
We analyze the components of our model in isolation to explain the sources of our significant performance gains and bottlenecks which still exist.

Candidate Retrieval Isolation
We evaluate our pruning method with alternate classifiers in Table 4. For these experiments, we fetch 5 nearest neighbor pairs for each mention.
We define the upper-bound performance of our pruning method by performing an oracle study where the pruned pairs are passed to a pairwise classifier that has access to gold labels. Despite using only 5 nearest neighbors, the system achieves a recall of 96.3, resulting in an upper-bound F1 of 98.1. Future work can use our pruning method with improved pairwise classification methods without concern, since the pruning method delivers near-perfect results with an oracle pairwise classifier.
We isolate the benefits of our pairwise classification approach by using our pruning model with the pairwise classifiers of Barhom et al. (2019) and the trigger-only variant of Yu et al. (2020). The resulting performance is worse than that of our work, indicating that the pairwise classification model we utilize also plays an important role in our results. Our approach differs from Yu et al. (2020) in its hard negative training approach and local discourse features, leading us to believe these are the primary beneficial factors.

Discourse Context Ablation Study
Both our work and the prior state-of-the-art (Caciularu et al., 2021) utilize discourse features during pairwise comparison, which significantly improves performance compared to just a single sentence of context. However, it is not well understood what features of local discourse are valuable to CDCR. We analyze the contributions of local discourse information through two ablation studies.
We first evaluate the sensitivity of our model to the hyperparameter w, the number of sentences surrounding each mention included as context, by keeping a fixed bi-encoder and training 4 separate cross-encoders from w = 0 to w = 3. Due to our model's 512-token limit, we do not evaluate beyond w = 3. The results of this ablation, shown in Table 6, demonstrate that each increase in window size increases performance, with diminishing returns.
To understand which local discourse features contribute to this improvement, we study three special types of tokens from the surrounding discourse: times, locations, and coreferences. Time and location within a sentence have been used in past work via semantic role labeling (Barhom et al., 2019; Bugert et al., 2020a), and coreferring tokens are intuitively informative as they provide additional information about the same event/entity. By including local discourse, 21%, 11%, and 29% of events and 18%, 9%, and 34% of entities gain access to new time, location, and coreference information respectively. For example, consider the following text: "A strong earthquake struck Indonesia's Aceh province on Tuesday. Many houses were damaged and dozens of villagers were injured."
While the event "damaged" is ambiguous with only the context of a single sentence, it becomes much more specific when contextualized with the previous sentence which contains both a time and a location for the event. We evaluate our system with tokens of these types masked from the local discourse with results reported in Table 5.
For events, both masking time and location (-1.0 F1) and masking coreference (-0.6 F1) in the local discourse significantly harm performance. However, only within-document coreference seems to majorly impact entity resolution (-2.9 F1). Both events and entities are more impacted by masking all entities (-1.3 F1 for events, -3.6 F1 for entities) than they are by masking all events (-1.1 F1 for events, +0.3 F1 for entities), which matches the expectation that the greater degree of polysemy for event tokens makes them less discriminative.

Conclusion and Future Work
In this work, we presented a two-step method for resolving cross-document event and entity coreference inspired by discourse coherence theory. We achieved state-of-the-art results on 3 event and 2 entity CDCR datasets, unifying the previously fractured CDCR space with a single model. We further improve applicability by training across corpora, presenting a model which can be used for downstream tasks that lack coreference annotations for fine-tuning. We demonstrated that our pruning method offers high upper-bound performance and that both stages of our model contribute to our state-of-the-art results. Finally, we explained the contributions of local discourse features when cross-encoding for coreference resolution.

[Table 6: Ablation on cross-encoder context window w, evaluated on ECB+ using B^3.]
We identify three areas of future work: • Using knowledge distillation to further improve scalability. Wu et al. (2020) demonstrate that much of the quality gain from cross-encoding can be transferred to a bi-encoder through knowledge distillation, which has the potential to remove pairwise classification altogether.
• Pairing alternate models for pairwise classification with the bi-encoder candidate pair generator. Our candidate pair generator is unlikely to become a recall bottleneck, so future efforts in CDCR should focus primarily on improving the accuracy of pairwise classification.
• Integrating CDCR into a wider range of tasks. Our work is robust to a wide variety of data, but it is still unknown which cross-document tasks benefit the most from coreference information.

A Full Metrics Report
In Table 7, we present a table of the commonly used metrics for evaluating CDCR systems for each of our corpus-tailored systems for the sake of future comparisons.