Cross-document Coreference Resolution over Predicted Mentions

Coreference resolution has been mostly investigated within a single document scope, showing impressive progress in recent years based on end-to-end models. However, the more challenging task of cross-document (CD) coreference resolution has remained relatively under-explored, with the few recent models applied only to gold mentions. Here, we introduce the first end-to-end model for CD coreference resolution from raw text, which extends the prominent model for within-document coreference to the CD setting. Our model achieves competitive results for event and entity coreference resolution on gold mentions. More importantly, we set the first baseline results, on the standard ECB+ dataset, for CD coreference resolution over predicted mentions. Further, our model is simpler and more efficient than recent CD coreference resolution systems, while not using any external resources.


Introduction
Cross-document (CD) coreference resolution consists of identifying textual mentions across multiple documents that refer to the same concept. For example, consider the following sentences from the ECB+ dataset (Cybulska and Vossen, 2014), where colors represent coreference clusters (for brevity, we omit some clusters):

1. Thieves pulled off a two million euro jewellery heist in central Paris on Monday after smashing their car through the store's front window.
2. Four men drove a 4x4 through the front window of the store on Rue de Castiglione, before making off with the jewellery and watches.
Despite its importance for downstream tasks, CD coreference resolution has been lagging behind the impressive strides made in the scope of a single document (Lee et al., 2017; Joshi et al., 2020; Wu et al., 2020). Further, state-of-the-art models exhibit several shortcomings, such as operating on gold mentions or relying on external resources such as SRL or a paraphrase dataset (Shwartz et al., 2017), preventing them from being applied in realistic settings.

1 https://github.com/ariecattan/coref
To address these limitations, we develop the first end-to-end CD coreference model building upon a prominent within-document (WD) coreference model (Lee et al., 2017) which we extend with recent advances in transformer-based encoders. We address the inherently non-linear nature of the CD setting by combining the WD coreference model with agglomerative clustering that was shown useful in CD models. Our model achieves competitive results on ECB+ over gold mentions and sets baseline results over predicted mentions. Our model is also simpler and substantially more efficient than existing CD coreference systems. Taken together, our work seeks to bridge the gap between WD and CD coreference, driving further research of the latter in realistic settings.

Background
Cross-document coreference Previous works on CD coreference resolution learn a pairwise scorer between mentions and use a clustering approach to form the coreference clusters (Cybulska and Vossen, 2015; Yang et al., 2015; Choubey and Huang, 2017; Kenyon-Dean et al., 2018; Bugert et al., 2020). Barhom et al. (2019) proposed to jointly learn entity and event coreference resolution, leveraging predicate-argument structures. Their model forms the coreference clusters incrementally, while alternating between event and entity coreference. Based on this work, Meged et al. (2020) improved results on event coreference by leveraging a paraphrase resource (Chirps; Shwartz et al., 2017) as distant supervision. Parallel to our work, recent approaches propose to fine-tune BERT on the pairwise coreference scorer (Zeng et al., 2020), where the state-of-the-art on ECB+ is achieved using a cross-document language model (CDLM) on pairs of full documents (Caciularu et al., 2021). Instead of applying BERT to all mention pairs, which is quadratically costly, our work separately encodes each (predicted) mention.

Figure 1: A high-level diagram of our model for cross-document coreference resolution. (1) Extract and score all possible spans, (2) keep the top spans according to s_m(i), (3) score all pairs s(i, j), and (4) cluster the spans using agglomerative clustering.
All the above models suffer from several drawbacks. First, they use only gold mentions and treat entities and events separately. Second, pairwise scores are recomputed after each merging step, which is resource- and time-consuming. Finally, they rely on additional resources, such as semantic role labeling, a within-document coreference resolver, and a paraphrase resource, which limits the applicability of these models in new domains and languages. In contrast, we use no such external resources.
Within-document coreference The e2e-coref WD coreference model (Lee et al., 2017) learns for each span i a distribution over its antecedents.
Considering all possible spans as potential mentions, the scoring function s(i, j) between spans i and j, where j appears before i, has three components: the two mention scores s_m(i) and s_m(j), and a pairwise antecedent score s_a(i, j) for span j being an antecedent of span i.
Each span i is represented by the concatenation of four vectors: the output representations of the span boundary (first and last) tokens (x_FIRST(i), x_LAST(i)), an attention-weighted sum of the token representations x̂_i, and a feature vector φ(i):

g_i = [x_FIRST(i); x_LAST(i); x̂_i; φ(i)]

These span representations g_i are first fed into a mention scorer s_m(·) to keep the λT spans with the highest scores (where T is the number of tokens). Then, the model learns, for each of these spans, to optimize the marginal log-likelihood of its correct antecedents, where the overall pairwise score is:

s(i, j) = s_m(i) + s_m(j) + s_a(i, j)
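As an illustrative sketch (the function and variable names are ours, not from the released code), the span representation and λT pruning described above could be implemented as:

```python
import torch

def span_representation(token_reprs, start, end, width_embed, attn_scorer):
    """Build g_i = [x_first; x_last; x_hat; phi(i)] for one candidate span.

    token_reprs: (seq_len, hidden) encoder outputs.
    attn_scorer: a learned linear layer mapping hidden -> 1 (token attention).
    width_embed: a learned feature vector phi(i), e.g. a span-width embedding.
    """
    x_first, x_last = token_reprs[start], token_reprs[end]
    span_tokens = token_reprs[start:end + 1]                 # (width, hidden)
    attn = torch.softmax(attn_scorer(span_tokens).squeeze(-1), dim=0)
    x_hat = attn @ span_tokens                               # attention-weighted sum
    return torch.cat([x_first, x_last, x_hat, width_embed])

def prune_spans(mention_scores, num_tokens, lam=0.4):
    """Keep the top lambda*T candidate spans by mention score s_m."""
    k = int(lam * num_tokens)
    return torch.topk(mention_scores, k).indices
```

In a full model, `span_representation` would be applied to every candidate span up to a maximum width, and only the indices returned by `prune_spans` would be passed on to the pairwise scorer.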

Model
The overall structure of our model is shown in Figure 1. The major obstacle in applying the e2e-coref model directly in the CD setting is its reliance on textual ordering: it forms coreference chains by linking each mention to an antecedent span appearing before it in the document. This linear clustering method cannot be used in the multiple-document setting since there is no inherent ordering between the documents. Additionally, ECB+ (the main benchmark for CD coreference resolution) is relatively small compared to OntoNotes (Pradhan et al., 2012), making it hard to jointly optimize mention detection and coreference decisions. These challenges have implications for all stages of model development, as elaborated below.
Pre-training To address the small scale of the dataset, we pre-train the mention scorer s m (·) on the gold mention spans, as ECB+ includes singleton annotation. This enables generating good candidate spans from the first epoch, and as we show in Section 4.3, it substantially improves performance.
Training Instead of comparing a mention only to spans that precede it in the text, our pairwise scorer s_a(i, j) compares a mention to all other spans across all the documents. The positive instances for training consist of all pairs of highest-scoring mention spans that belong to the same coreference cluster, while the negative examples are sampled (20x the number of positive pairs) from all other pairs. The overall score is then optimized using the binary cross-entropy loss:

L = − Σ_{(x,z)∈N} [ y · log s(x, z) + (1 − y) · log(1 − s(x, z)) ]

where N corresponds to the set of mention pairs (x, z), and y ∈ {0, 1} is the pair label. Full implementation details are described in Appendix A.1.
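A minimal sketch of this pair construction and loss, under our own naming (not the released code), could look like:

```python
import random
import torch
import torch.nn.functional as F

def build_training_pairs(mentions, cluster_ids, neg_ratio=20):
    """Positives: all pairs of candidate mentions in the same gold cluster.
    Negatives: sampled at neg_ratio times the number of positives."""
    n = len(mentions)
    pos, neg = [], []
    for i in range(n):
        for j in range(i + 1, n):
            (pos if cluster_ids[i] == cluster_ids[j] else neg).append((i, j))
    neg = random.sample(neg, min(len(neg), neg_ratio * len(pos)))
    return pos, neg

def pairwise_loss(scores, labels):
    """Binary cross-entropy over pair scores s(x, z) in (0, 1)."""
    return F.binary_cross_entropy(scores, labels)
```

Here `scores` would come from the pairwise scorer applied to the concatenated span representations, after a sigmoid.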
Notice that the mention scorer s_m(·) is further trained in order to generate better candidates at each training step. When training and evaluating the model over gold mentions, we ignore the span mention scores s_m(·), and the gold mention representations are directly fed into the pairwise scorer s_a(i, j).
Inference At inference time, we score all spans; prune the spans with the lowest scores; score the pairs; and finally form the coreference clusters using agglomerative clustering (with average linkage) over these pairwise scores, following common practice in CD coreference resolution (Yang et al., 2015; Choubey and Huang, 2017; Kenyon-Dean et al., 2018; Barhom et al., 2019). Since the affinity scores s(i, j) are also computed for mention pairs in different documents, the agglomerative clustering can effectively find cross-document coreference clusters.
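The clustering step can be sketched with standard SciPy routines (the function name and the affinity-to-distance conversion are our own illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_mentions(pair_scores, tau=0.5):
    """Form coreference clusters by average-linkage agglomerative clustering.

    pair_scores: (n, n) symmetric matrix of affinity scores s(i, j) in [0, 1],
    computed for all mention pairs, within and across documents.
    tau: stop criterion -- clusters are not merged past this distance.
    Returns an array of cluster labels, one per mention.
    """
    distances = 1.0 - pair_scores          # turn affinities into distances
    np.fill_diagonal(distances, 0.0)
    condensed = squareform(distances, checks=False)
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=tau, criterion="distance")
```

Because the distance matrix covers mention pairs from all documents in a (predicted) document cluster, the resulting clusters are cross-document by construction.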

Experimental setup
Following most recent work, we conduct our experiments on ECB+ (Cybulska and Vossen, 2014), which is the largest dataset that includes both WD and CD coreference annotation (see Appendix A.2). We use the document clustering of Barhom et al. (2019) for pre-processing and apply our coreference model separately on each predicted document cluster. Following Barhom et al. (2019), we present the model's performance on both event and entity coreference resolution. In addition, inspired by Lee et al. (2012), we train our model to perform event and entity coreference jointly, which we term "ALL". This represents a useful scenario when we are interested in finding all the coreference links in a set of documents, without having to distinguish event and entity mentions. Addressing CD coreference with ALL is challenging because (1) the search space is larger than when treating event and entity coreference separately, and (2) models need to make subtle distinctions between event and entity mentions that are lexically similar but do not corefer. For example, the entity voters does not corefer with the event voted.
We apply RoBERTa LARGE (Liu et al., 2019) to encode the documents. Long documents are split into non-overlapping segments of up to 512 wordpiece tokens and are encoded independently. Due to memory constraints, we freeze the output representations from RoBERTa instead of fine-tuning all parameters. For all experiments, we use a single GeForce GTX 1080 Ti 12GB GPU. Training takes 2.5 hours for the most expensive setting (ALL on predicted mentions), while inference over the test set takes 11 minutes.
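The segmentation step is straightforward; a minimal sketch (ignoring sentence boundaries, which a real implementation may respect) is:

```python
def split_into_segments(token_ids, max_len=512):
    """Split a long document into non-overlapping segments of up to
    max_len wordpiece tokens; each segment is then encoded independently."""
    return [token_ids[i:i + max_len]
            for i in range((0), len(token_ids), max_len)]
```

Each segment is passed through the encoder separately, and the per-token outputs are concatenated back into a single sequence of representations for the document.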

Results
Table 1 presents the combined within- and cross-document results of our model, in comparison to previous work on ECB+. We report the results using the standard evaluation metrics MUC, B³, CEAF, and the average F1 of these metrics, called CoNLL F1 (the main evaluation).
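The CoNLL F1 aggregation is simply the unweighted mean of the three metric F1 scores:

```python
def conll_f1(muc_f1, b3_f1, ceaf_f1):
    """CoNLL F1: the unweighted average of the MUC, B-cubed, and CEAF F1 scores."""
    return (muc_f1 + b3_f1 + ceaf_f1) / 3.0
```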
When evaluated on gold mentions, our model achieves competitive results for event (81 F1) and entity (73.1 F1) coreference. In addition, we set baseline results where the model does not distinguish between event and entity mentions at inference time (denoted as the ALL setting). The overall performance on ECB+ obtained using two separate models for event and entity is only negligibly higher (+0.6 F1) than that of our single ALL model.
Our model is the first to enable end-to-end CD coreference on raw text (predicted mentions). As expected, the performance is lower than with gold mentions (e.g., a 26.6 F1 drop in event coreference), indicating the large room for improvement over predicted mentions. It should be noted that beyond mention detection errors, two additional factors contribute to the performance drop when moving to predicted mentions. First, while WD coreference systems typically disregard singletons (mentions appearing only once) when evaluating on raw text, CD coreference models do consider singletons when evaluating on gold mentions on ECB+. We observe that this difference affects the evaluation, explaining about 10 absolute points out of the aforementioned drop of 26.6. The effect of singletons on coreference evaluation is further explored in Cattan et al. (2021). Second, entities are annotated in ECB+ only if they participate in events, making participant detection an additional challenge. This explains the larger performance drop in the entity and ALL settings.

Table 2 presents the CoNLL F1 results of within- and cross-document coreference resolution for both gold and predicted mentions on ECB+. For all settings, results are higher for within-document coreference resolution, showing the need to address the typical challenges of CD coreference resolution.

Table 3 shows the results of our model without document clustering. Here, the performance drop and error reduction are substantially larger for event coreference (-6.2 F1 / 12%) than for entity coreference (-1.3 / 2%) and ALL (-2 / 3.5%). This difference is probably due to the structure of ECB+, which poses a lexical ambiguity challenge for events, while the document clustering step reconstructs the original subtopics almost perfectly, as shown in Barhom et al. (2019).
Further, the higher results on event coreference do not mean that the task is inherently easier than entity coreference. In fact, when ignoring singletons in the evaluation, as done on OntoNotes, the performance on event coreference is lower than on entity coreference (62.1 versus 65.3 CoNLL F1) (Cattan et al., 2021). This happens because event singletons are more common than entity singletons (30% vs. 17%), as shown in Appendix A.2.

Finally, our model is more efficient in both training and inference since the documents are encoded using just one pass of RoBERTa, and the pairwise scores are computed only once using a simple MLP. For comparison, previous models compute pairwise scores at each iteration (Barhom et al., 2019; Meged et al., 2020), or apply a BERT model to every mention pair together with its sentence (Zeng et al., 2020) or full document (Caciularu et al., 2021).

Ablations
To show the importance of each component of our model, we ablate several parts and compute F1 scores on the development set of the ECB+ event dataset. The results are presented in Table 4, using predicted mentions without document clustering. Skipping the pre-training of the mention scorer results in a drop of 3.2 F1 points. Indeed, the relatively small training data in the ECB+ dataset (see Appendix A.2) might not be sufficient when using only end-to-end optimization, and pre-training the mention scorer helps generate good candidate spans from the first epoch.
To analyze the effect of the dynamic pruning, we froze the mention scorer during the pairwise training and kept the same candidate spans throughout training. The significant performance drop (4 F1) reveals that the mention scorer inherently incorporates coreference signal.
Finally, using all negative pairs for training leads to a performance drop of 1.4 points and significantly increases the training time.

Qualitative Analysis
We sampled topics from the development set and manually analyzed the errors of the ALL configuration. The most common errors were due to an over-reliance on lexical similarity. For example, the event "Maurice Cheeks was fired" was wrongly predicted to be coreferent with a similar but different event, "the Sixers fired Jim O'Brien", probably because of related context, as both coached the Philadelphia 76ers. On the other hand, the model sometimes struggles to merge mentions that are lexically different but semantically similar (e.g., "Jim O'Brien was shown the door", "Philadelphia fire coach Jim O'Brien"). The model also seems to struggle with temporal reasoning, in part due to missing information. For example, news articles from different days make different relative references to time, while the publication date of the articles is not always available. As a result, the model missed linking "Today" in one document to "Saturday" in another document.

Conclusion and Discussion
We developed the first end-to-end baseline for CD coreference resolution over predicted mentions. Our simple and efficient model achieves competitive results over gold mentions on both event and entity coreference, while setting baseline results for future models over predicted mentions.
Nonetheless, we note a few limitations of our model that could be addressed in future work. First, following most recent work on cross-document coreference resolution ( §2), our model requires O(n²) pairwise comparisons to form the coreference clusters. While our model is substantially more efficient than previous work ( §4.2), applying it to a large-scale dataset would involve a scalability challenge. Future work may address the scalability issue by using recent approaches for hierarchical clustering on massive datasets (Monath et al., 2019, 2021). Another appealing approach consists of splitting the corpus into subsets of documents, constructing initial coreference clusters (in parallel) on the subsets, then merging meta-clusters from the different sets. We note though that it is currently impossible to test such solutions for more extensive scalability, pointing to the need to collect larger-scale datasets for cross-document coreference. Second, to improve overall performance over predicted mentions, future work may incorporate, explicitly or implicitly, semantic role labeling signals in order to identify event participants for entity prediction, as well as for better event structure matching. Further, dedicated components may be developed for mention detection and coreference linking, which may be jointly optimized.
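The split-then-merge idea above can be sketched abstractly; the function names and the clustering/merging callables here are entirely hypothetical placeholders, not an implemented system:

```python
def split_and_merge(documents, cluster_fn, merge_fn, chunk_size=100):
    """Hypothetical sketch of the split-then-merge strategy: cluster each
    subset of documents independently (parallelizable), then merge the
    resulting partial clusters across subsets into meta-clusters."""
    subsets = [documents[i:i + chunk_size]
               for i in range(0, len(documents), chunk_size)]
    partial_clusters = [cluster_fn(subset) for subset in subsets]
    return merge_fn(partial_clusters)
```

In practice, `merge_fn` might run a second round of agglomerative clustering over cluster-level representations, which is the open question this direction would need to address.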

A.1 Implementation Details
Our model includes 14M parameters and is implemented in PyTorch (Paszke et al., 2019), using HuggingFace's library (Wolf et al., 2020) and the Adam optimizer (Kingma and Ba, 2014). The layers of the model are initialized with the Xavier Glorot method (Glorot and Bengio, 2010). We manually tuned the standard hyperparameters, presented in Table 5, on the event coreference task and keep them unchanged for the entity and ALL settings. Table 6 shows specific parameters, such as the maximum span width, the pruning coefficient λ, and the stop criterion τ for the agglomerative clustering, which we tuned separately for each setting to maximize the CoNLL F1 score on its corresponding development set.

Table 6: Specific hyperparameters for each mention type; λ is the pruning coefficient and τ is the threshold for the agglomerative clustering.

A.2 Dataset
ECB+ is an extended version of the EventCorefBank (ECB) (Bejan and Harabagiu, 2010) and EECB (Lee et al., 2012); its statistics are shown in Table 7. The dataset is composed of 43 topics, where each topic corresponds to a famous news event (e.g., someone checked into rehab). In order to introduce some complexity and to limit the use of lexical features, each topic consists of a collection of texts describing two different event instances of the same event type, called subtopics. For example, the first topic, corresponding to the event "Someone checked into rehab", is composed of mentions of the events "Tara Reid checked into rehab" and "Lindsay Lohan checked into rehab", which are obviously annotated in different coreference clusters. Documents in ECB+ are in English. Since ECB+ is an event-centric dataset, entities are annotated only if they participate in events. In this dataset, event and entity coreference clusters are annotated separately.

4 http://www.newsreader-project.eu/results/data/the-ecb-corpus/

Table 7: ECB+ statistics. The slashed numbers for # Mentions, # Singletons and # Clusters represent event/entity statistics. As recommended by the authors in the release note, we follow the split of Cybulska and Vossen (2015), which uses a curated subset of the dataset.