Scaling Within Document Coreference to Long Texts

State-of-the-art end-to-end coreference resolution models use expensive span representations and antecedent prediction mechanisms. These approaches are expensive both in terms of memory and compute time, and are particularly ill-suited for long documents. In this paper, we propose an approximation to end-to-end models that scales gracefully to documents of any length. Replacing span representations with token representations, we reduce the time/memory complexity via token windows and nearest-neighbor sparsification methods for more efficient antecedent prediction. We show that our approach reduces training and inference time compared to state-of-the-art methods with only a minimal loss in accuracy.


Introduction
Recent advances in coreference resolution (Lee et al., 2018; Joshi et al., 2019, 2020; Wu et al., 2020) have largely been based on the end-to-end model proposed by Lee et al. (2017). However, these models are costly in terms of training and inference time as well as memory requirements, especially for long documents. The large computational cost makes the models infeasible for a typical user to run on large document collections in domains such as blogs, stories, and books. Moreover, reducing the energy use of these models can lower costs for cloud service providers and bring environmental benefits (Strubell et al., 2019; Schwartz et al., 2020).
There are two main computational bottlenecks in using end-to-end coreference models on long documents: (i) span and span-pair representations for all spans in the document are simultaneously considered, and (ii) the coreference decision for a mention requires considering all candidate antecedent spans.
In this paper, we propose an approximation to the end-to-end coreference model (Lee et al., 2017) that scales to long documents by addressing both these bottlenecks. Our proposed approach operates at the token level instead of the span level, removing the quadratic dependence on the number of mention spans in a document and addressing bottleneck (i). We propose token-level scoring functions for the bilinear inference model originally proposed by Lee et al. (2018). To address bottleneck (ii), we use token windows and token-level k-nearest neighbor relationships along with low-rank approximations of the token similarity matrix, thereby improving time/memory efficiency. We also propose an approach to drop token representations from memory, reducing memory requirements while maintaining accuracy.
We evaluate our approach on three coreference datasets: CoNLL-2012 (Pradhan et al., 2012), Litbank (Bamman et al., 2019), and MedMentions (Mohan and Li, 2019), and observe accuracy competitive with state-of-the-art end-to-end coreference models while achieving faster training and inference. Our approach is also more memory efficient and up to 10x faster than the recently proposed memory-based incremental coreference resolution model on Litbank (Toshniwal et al., 2020b). Finally, we demonstrate the scalability of our approach by running it on a novel of two million tokens in 14 minutes while requiring just 12GB of GPU RAM, whereas previous work can only scale to documents of around eleven thousand tokens even with up to 48GB of GPU RAM.
Concurrent to our work, Kirstain et al. (2021) also propose a bilinear token-level scoring function for coreference. The focus of our work, however, is on long documents, and we further introduce a token k-NN graph approximation, a low-rank matrix factorization, and an approach to drop non-essential candidate antecedents to improve memory/time scalability.
Background: End-to-end Within-Document Coreference

End-to-end within-document coreference resolution models jointly discover a set of mentions M in a document D and determine which of the mentions are coreferent. We use D to refer to the ordered set of tokens in the document, D = {x_1, x_2, ..., x_T}. Each mention is a token span s = x_i, ..., x_j. We use x_i to refer to the contextualized embedding of token i (see Section 4.1 for more details on the encoder). The model comprises two jointly trained parts: (a) a mention proposer, and (b) an antecedent predictor. The mention proposer evaluates all spans S in the document and proposes a small set of potential mentions M ⊂ S. The antecedent predictor evaluates the mentions suggested by the mention proposer and produces coreference clusters (chains) C ⊂ P(M), where P(·) is the powerset.
Recent work (Lee et al., 2018; Joshi et al., 2020; Xu and Choi, 2020, inter alia) has built upon the first neural, end-to-end coreference model (Lee et al., 2017). Each of these models introduces two scoring functions, s_m(s) and s_a(m_1, m_2): s_m(s) scores how likely a span s is to be a mention, and s_a(m_1, m_2) scores mention m_1 being an antecedent of mention m_2. These scoring functions are used to define the joint mention proposal and antecedent prediction model for coreference.
Mention proposer: Previous works use a neural network for s_m : S → R. The architecture takes in a mention span and outputs a score. For each mention span s, the model computes a vector representation g_s ∈ R^d, and the scoring function takes this representation as input:

    s_m(s) = w_m^T FFNN_m(g_s),

where g_s is computed as:

    g_s = [x_START(s); x_END(s); x̂_s; φ(s)],

where x_START(s), x_END(s) are the boundary representations of span s, x̂_s is a self-attention representation of span s, and φ(s) encodes the width (number of tokens) of span s. For efficiency, the model selects the top 0.4T scoring mention spans, where T is the number of tokens in the document. We refer to this set of selected mention spans as M, and order the mentions m ∈ M by their start/end offsets.
Antecedent prediction: Previous work has explored several models for antecedent prediction, the most computationally efficient being a bilinear scoring model (Lee et al., 2018):

    s_a^bi(m_1, m_2) = g_{m_1}^T W_a g_{m_2}.

Higher-order inference models, which use deep models to capture coreference relationships between mentions, have also been considered (Lee et al., 2018). We refer the reader to Xu and Choi (2020) for a detailed analysis of higher-order inference models.
The prediction of the antecedent of each mention, which we refer to as inference, is done by backwards chaining: for each mention, we find the highest-scoring antecedent among the mentions appearing earlier in the document and add the mention to the antecedent's cluster. This can be described as finding the connected components of a graph G whose nodes are the mentions M. Coarse-to-fine inference (Lee et al., 2018) and the standard bilinear model differ only in how they construct the adjacency matrix of G. We refer to this adjacency matrix as A, and use A_{i,j} = 1 to indicate an edge between mentions m_i and m_j. The adjacency matrix of the bilinear model can be written as:

    A_{i,j} = 1[i = argmax_{i' < j} s_a^bi(m_{i'}, m_j)].

The adjacency matrix of the higher-order model instead takes the argmax over the argtopk candidates under the bilinear score, rescored with the higher-order model, where k for argtopk is a hyperparameter.
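The backwards-chaining inference above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name and the union-find bookkeeping are our own choices, and the nil antecedent is modeled by only linking when the best antecedent score is positive.

```python
import numpy as np

def cluster_mentions(scores):
    """Backwards-chaining inference: link each mention to its best-scoring
    earlier antecedent, then read clusters off as connected components.

    scores: (M, M) array where scores[i, j] is the score of mention i
    being an antecedent of mention j (only i < j is meaningful)."""
    M = scores.shape[0]
    parent = list(range(M))

    def find(a):  # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for j in range(1, M):
        i = int(np.argmax(scores[:j, j]))   # best earlier antecedent
        if scores[i, j] > 0:                # nil antecedent has score 0
            parent[find(j)] = find(i)       # merge the two clusters

    clusters = {}
    for m in range(M):
        clusters.setdefault(find(m), []).append(m)
    return [sorted(c) for c in clusters.values()]
```

Mentions whose best antecedent score is non-positive start their own (possibly singleton) cluster, matching the nil-antecedent convention.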
End-to-end training: The mention proposal and antecedent prediction models are trained by relaxing the adjacency matrix A, replacing the argmax operation with a softmax (i.e., setting a weighted edge between i and j with weight s(m_i, m_j)). The training objective is to maximize the log-likelihood, under the relaxed adjacency matrix, of a ground-truth adjacency matrix A*, where A*_{i,j} = 1 if m_i and m_j are coreferent and i < j. The argtopk operation is not relaxed. A nil antecedent, which has similarity (s_a) of 0 to every mention span, is incorporated into the training objective. The number of candidate antecedents is also restricted by a hyperparameter (Lee et al., 2018).

Efficient Approximations for End-to-End Coreference

We describe our proposed approach for efficiently approximating the span-based end-to-end coreference model with a token-level model. Our model jointly predicts which tokens are in the same mention spans (i.e., mention proposal) and which tokens are coreferent with one another (i.e., antecedent prediction). By operating at the token level, we remove the dependence on considering quadratically many spans. We show that the structure of our approximation allows for a sparsification technique that reduces the number of antecedent predictions that need to be considered, using k-nearest neighbor relationships between tokens and splitting documents into windows with certain computations made independently for each window. We also describe how low-rank matrix approximations can be used to improve inference efficiency.

Mention Proposer
Observe that computing the set M requires us to evaluate s m (·) for all candidate spans S in the document (which grows roughly quadratically with the number of tokens). Recall that s m (·) is a function of the start and end tokens of each span, producing a score that is high if the pair of tokens likely form a span. This approach can be thought of as having each token t in the document predict whether or not another token u is the last token in a span beginning with t.
We first model, for each token t, whether it is a start (st) or end (en) token of some mention span using a linear model:

    s_st(t) = w_st^T x_t,    s_en(t) = w_en^T x_t.

These terms weigh each token by how likely it is to be part of some mention span.
Following Kirstain et al. (2021), we find that there can be an empirical benefit (described in Section 6) to additionally modelling the relationship between u and t, i.e., whether it is reasonable for the span beginning with t to end in u. To do this we use an asymmetric (bilinear) scoring function:

    s_bi(t, u) = x_t^T W_m x_u.

Further, we restrict the spans to be contiguous and follow the rule-based span criteria of previous work (Lee et al., 2017).
For each token, we predict candidate end tokens for a mention span starting at the given token. We assign each span a score by summing the linear and bilinear token scores, and follow previous work in selecting the top 0.4T scoring spans (mentions). We thus replace the mention scoring mechanism s_m(·) of previous work with an approximation based on token-level scores:

    ŝ_m(s) = s_st(START(s)) + s_en(END(s)) + s_bi(START(s), END(s)).

Rather than having to instantiate a d-dimensional span representation for all |S| spans, our approach simply uses the output token representations from the encoder. This requires O(T) space compared to O(|S|) space. Note that computing ŝ_m(m_i) for all mentions requires at most two matrix multiplications, each with just T rows, leading to a drastic reduction in time and space complexity. As noted by previous work (Toshniwal et al., 2020b), the mention proposal step requires the most memory because of the quadratic dependency. We validate the reduction in time and memory requirements of our token-level mention detection in Sections 4.5 & 4.6.
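A minimal sketch of this token-level mention proposal, assuming hypothetical parameter names (w_st, w_en, W_bi) and, for clarity, a brute-force enumeration of bounded-width spans in place of the paper's rule-based span criteria:

```python
import numpy as np

def propose_mentions(X, w_st, w_en, W_bi, max_width=10, ratio=0.4):
    """Token-level mention proposal (illustrative sketch).

    X: (T, d) encoded token representations.
    Score of span (s, e) = w_st.x_s + w_en.x_e + x_s^T W_bi x_e."""
    T = X.shape[0]
    start = X @ w_st                  # (T,) linear start-token scores
    end = X @ w_en                    # (T,) linear end-token scores
    bi = (X @ W_bi) @ X.T             # (T, T) bilinear start/end compatibility

    spans = [(s, e) for s in range(T)
             for e in range(s, min(s + max_width, T))]
    scored = [(start[s] + end[e] + bi[s, e], (s, e)) for s, e in spans]
    scored.sort(key=lambda x: -x[0])
    k = max(1, int(ratio * T))        # keep the top 0.4T scoring spans
    return sorted(span for _, span in scored[:k])
```

Note that only the (T, d) token matrix and two (T,) score vectors are ever materialized; no per-span representations are built.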
Pretraining for mention detection: Previous work (Wu et al., 2020; Toshniwal et al., 2020b) has shown that pre-training models for mention detection is beneficial, especially when predicting singleton mentions is required (e.g., LitBank (Bamman et al., 2019)). Given the set of ground-truth mentions of a document and the set of proposed mentions M, we minimize a mention detection loss that encourages high scores s_m for gold mentions and low scores for other candidate spans. We use it both as a multi-task objective while training the models and as a pre-training objective. We detect singleton mentions by thresholding the mention score s_m(m_i); the threshold is tuned on the development set according to downstream performance.

Antecedent Scoring
Next, we would like to model coreference relationships between tokens to approximate the span-level scoring function (s a , Eq. 3, 5). We predict for each token, the other tokens with which it is coreferent. These predictions are then aggregated to make span level predictions.
First, we consider approximating the bilinear scoring function (s_a^bi) at the token level. We use bilinear models applied to the encoded token representations, parameterizing four asymmetric similarity functions between the start and end tokens of mention pairs:

    S^ss_{i,j} = x_{START(m_i)}^T W_ss x_{START(m_j)},

with S^se_{i,j}, S^es_{i,j}, and S^ee_{i,j} defined analogously for the remaining start/end combinations. Note that the backwards-chaining property of inference motivates our use of asymmetric functions. We use these similarities to approximate the bilinear antecedent scoring function as:

    ŝ_a^bi(m_i, m_j) = S^ss_{i,j} + S^se_{i,j} + S^es_{i,j} + S^ee_{i,j}.

Observe how ŝ_a^bi(·, ·) reduces the memory requirements compared to s_a^bi(·, ·): we do not need to instantiate the span representations, only the encoded token representations. Computing ŝ_a^bi(m_i, m_j) for all pairs requires at most O(T^2) instead of O(|S|^2) work, and we can compute each of the four similarity matrices in O(T^2) space as a matrix multiplication between matrices of O(T) rows.
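The four-term token-level antecedent score can be sketched as below. This is an illustrative sketch under assumed parameter names (W_ss, W_se, W_es, W_ee for the four bilinear maps), not the authors' code:

```python
import numpy as np

def antecedent_scores(X, mentions, W_ss, W_se, W_es, W_ee):
    """Approximate bilinear antecedent scoring from token representations.

    X: (T, d) token encodings; mentions: list of (start, end) token offsets.
    Returns S where S[i, j] approximates the score of m_i being an
    antecedent of m_j, as a sum of four start/end bilinear terms."""
    st = np.array([m[0] for m in mentions])
    en = np.array([m[1] for m in mentions])
    Xs, Xe = X[st], X[en]              # (M, d) start/end token representations
    # Each term below is an (M, M) similarity matrix; no span vectors needed.
    S = (Xs @ W_ss @ Xs.T + Xs @ W_se @ Xe.T
         + Xe @ W_es @ Xs.T + Xe @ W_ee @ Xe.T)
    return S
```

Each term is computed with dense matrix multiplications over matrices of O(T) rows, matching the complexity argument above.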
For training and inference in our model, we define the adjacency matrix Â using ŝ_a^bi:

    Â_{i,j} = 1[i = argmax_{i' < j} ŝ_a^bi(m_{i'}, m_j)].

Inference can then be done exactly as before, using connected-components based inference.

Token Windows & Sparsifying Antecedent Scoring with k-NN Graphs
We can exploit the backwards-chaining structure of the inference procedure and divide a document into smaller, non-overlapping token windows, reducing the number of tokens that need to be encoded in any one component. We propose mentions independently in each window, and then perform antecedent scoring for each window using the k-NN sparsification described below. By batching the long document into these windows, we never need to store more than the final encoded token representations for the tokens appearing in some entity cluster.
The approximation method presented thus far reduces the complexity of end-to-end coreference approaches from depending on the number of spans to depending on the number of tokens. However, for long documents, scaling quadratically in the number of tokens is still prohibitively expensive, both in time and in memory: computing and storing ŝ_a^bi(·, ·) for all pairs of tokens in the document may become infeasible. We would like to reduce the time and space complexity of this approach.
We propose to approximate the top-scoring pairs of mention spans according to ŝ_a^bi(·, ·) (i.e., further approximating ŝ_a^bi(·, ·)). We do this by only allowing two mentions m_i and m_j to be coreferent if the start/end tokens of m_j are in the k-nearest neighbors of the start/end tokens of m_i. More precisely, we maintain the k nearest neighbors of each token for each of the four similarity functions S^ss, S^se, S^es, S^ee. To align with the inference procedure, we select these k nearest neighbors for each token only from the preceding tokens in the document. We define S^ss_{knn,i,j} to be equal to S^ss_{i,j} when j is among the top-k values of S^ss for token i, and zero otherwise; S^se_knn, S^es_knn, and S^ee_knn are defined analogously. We then build a further approximation of ŝ_a^bi using these sparsified similarities:

    ŝ_a^knn(m_i, m_j) = S^ss_{knn,i,j} + S^se_{knn,i,j} + S^es_{knn,i,j} + S^ee_{knn,i,j}.

Observe that the S_knn matrices can be stored as sparse matrices and therefore scale better to long documents: storing O(4Tk) entries is advantageous over O(T^2).
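The causal top-k sparsification of one similarity matrix can be sketched as follows. This is a dense illustrative version (in practice the surviving T·k entries would be held in a sparse matrix, which is the point of the approximation); the function name and loop structure are our own:

```python
import numpy as np

def knn_sparsify(S, k):
    """Zero out all but the k best-scoring *preceding* entries per column.

    S[i, j] is the similarity between token i and a later token j; only
    i < j is kept, matching the backwards-chaining inference order."""
    T = S.shape[0]
    out = np.zeros_like(S)
    for j in range(1, T):
        prev = S[:j, j]                  # scores of tokens appearing earlier
        top = np.argsort(prev)[-k:]      # indices of the top-k neighbors
        out[top, j] = prev[top]
    return out
```

After sparsification, at most (T-1)·k entries per similarity matrix survive, so four matrices cost O(4Tk) storage rather than O(T^2).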
End-to-end training: We use the same end-to-end training procedure as previous work (Section 2) with our approximated mention proposal and antecedent scoring procedures. We note that the token windows and the k-NN sparsification of the antecedent scoring term do not change training at all; they are applied only at inference time.

Low-dimensional Approximations
Much of the computation time of the k-NN graph approximation model comes from computing the top-k nearest tokens. This bottleneck stems mostly from the high dimensionality of the encoded token representations, which come from transformer-based language models (Joshi et al., 2020).
To produce lower-dimensional embeddings of each token which preserve similarities in the original space, we use low-rank matrix approximation methods, specifically the Nyström method (Williams and Seeger, 2001; Musco and Musco, 2017, inter alia). We wish to approximate the matrices S^ss, S^se, S^es, S^ee. While these are asymmetric, we can consider an equivalent symmetrized version, in which each token appears twice (on the left and right of the bilinear term), in order to apply Nyström. The Nyström method provides a low-rank approximation of a symmetric pairwise similarity matrix S ∈ R^{N×N} by selecting ℓ landmark points uniformly at random among the N rows of S. Let L_i be the one-hot column vector representation of the i-th landmark, and L ∈ R^{N×ℓ} the matrix of such one-hot representations. The approximation of S is given by:

    S̃ = SL (L^T S L)^{-1} L^T S.

The term SL is an ℓ-dimensional embedding of the rows, defined by the similarity of each row with each of the landmarks (ℓ is the reduced dimension). Similarly, (L^T S L)^{-1} L^T S can be thought of as providing an ℓ-dimensional embedding of each column of S, based on the column similarities and the (inverse of the) landmark similarities.
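The Nyström construction above can be sketched with column selection in place of explicit one-hot matrices. This is an illustrative sketch; the use of a pseudo-inverse (for numerical robustness when the landmark block is singular) is our choice:

```python
import numpy as np

def nystrom(S, landmarks):
    """Nystrom low-rank approximation of a symmetric similarity matrix S.

    landmarks: indices of the l landmark points.
    Returns S_tilde = C W^+ C^T, where C = S L holds similarities to the
    landmarks and W = L^T S L is the landmark-landmark similarity block."""
    C = S[:, landmarks]                   # (N, l) row embeddings S L
    W = S[np.ix_(landmarks, landmarks)]   # (l, l) landmark similarities
    return C @ np.linalg.pinv(W) @ C.T    # rank-l reconstruction of S
```

When the similarity matrix has rank at most ℓ and the landmarks span its column space, the reconstruction is exact; otherwise it is a low-rank approximation.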

Limiting Num. of Candidate Antecedents
In the aforementioned approach, the number of candidate antecedents scales with the document length. We would like a mechanism for using a fixed number of candidate antecedents if desired. Other previous work uses entity-level representations to achieve this (Toshniwal et al., 2020b; Xia et al., 2020).
In our work, we operate at the mention level, removing mentions as candidate antecedents. We define a hyperparameter ρ, the maximum number of candidate antecedents kept after processing each window of the document. Our approach removes mentions as candidate antecedents which (1) belong to large coreference clusters and (2) are not frequently selected as antecedents. We achieve this by dropping mentions according to their cluster size |C_m| and their degree Σ_i A_{i,m} in the antecedent graph: mentions from large clusters, and mentions rarely selected as antecedents, are dropped first.
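One plausible realization of this pruning is sketched below. The exact ranking used by the authors is not fully specified here, so the lexicographic drop order (largest cluster first, then fewest incoming antecedent links) is our assumption, as are the function and argument names:

```python
def prune_antecedents(mentions, cluster_sizes, degrees, rho):
    """Keep at most rho candidate antecedents after a window is processed.

    cluster_sizes[m] is |C_m|; degrees[m] is the mention's in-degree in
    the antecedent graph. Mentions from large clusters are dropped first
    (other cluster members remain as representatives), breaking ties by
    dropping rarely-linked mentions."""
    if len(mentions) <= rho:
        return list(mentions)
    # Assumed drop priority: big cluster first, then low in-degree.
    order = sorted(range(len(mentions)),
                   key=lambda m: (cluster_sizes[m], -degrees[m]),
                   reverse=True)
    dropped = set(order[:len(mentions) - rho])
    return [mentions[m] for m in range(len(mentions)) if m not in dropped]
```

The surviving mentions preserve document order, so downstream antecedent scoring is unchanged apart from the reduced candidate set.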

Experiments
In this section, we compare our proposed approach for scalable coreference on long documents to various state-of-the-art methods in terms of accuracy as well as efficiency of training and inference. We perform a detailed scalability analysis, which characterizes the time/memory used by each method as a function of the length of documents. We also report timing results on novels of ∼ 2 million tokens.

Datasets
We evaluate each method on the following datasets:

CoNLL-2012 Shared Task: The CoNLL-2012 shared task (Pradhan et al., 2012) uses v5.0 of the OntoNotes corpus for the task of coreference resolution in English, Chinese, and Arabic. We use only the English portion for our experiments, which contains 2802 training, 343 development, and 348 test documents. The training documents contain on average 454 tokens, with a maximum of 4009 tokens.

Litbank: We also use the Litbank dataset (Bamman et al., 2019), which consists of 210,532 tokens evenly drawn from 100 different English-language literary texts. The average document length in Litbank is much longer (around 2,000 tokens). Following Bamman et al. (2019) and Toshniwal et al. (2020b), we use a 10-fold cross-validation setup with 80% of the data as training data and 10% each as validation and test data. The final evaluation is reported as the average over all 10 test runs. Note that the family of end-to-end approaches we approximate does not predict singletons, as is typically done for Litbank. Mention pretraining is performed as described in Section 3.1.

MedMentions: We also repurpose MedMentions (Mohan and Li, 2019), an existing entity linking dataset in the biomedical domain, for coreference resolution. We treat the entity labels as the ground-truth cluster assignments of each mention for coreference training/analysis. We use the ST21PV subset recommended by Mohan and Li (2019).

Artamène ou le Grand Cyrus (Artamène, or Cyrus the Great): To further assess the scalability of our approach, we run our method on an English translation of this 17th-century French novel, one of the longest books available in English in the public domain (Scudery, 1601). The work contains 1.99 million tokens and over two million sub-tokens. We use this data to illustrate the scalability of our approach to very long documents.

Methods
We compare our end-to-end coreference approximation with and without the token windows and k-NN sparsification (i.e., Ours and Ours (Sp.), Section 3.3). We denote the number of neighbors used in the sparsification as k and the window size as w. We compare these to the methods that they approximate: the bilinear scoring-function-based method (E2E (bi)) (Lee et al., 2017) described in Eq. 3, and the coarse-to-fine higher-order inference approach (E2E (hoi)) (Lee et al., 2018). All models use SpanBERT-large (Joshi et al., 2020) to encode tokens; the encoder parameters are trained along with the coreference-specific model parameters (see Section 4.4 for details). E2E (bi) and E2E (hoi) use additional features such as speaker and genre; we do not use this metadata in our proposed approximation.

Coreference Performance
In Table 1, we report the coreference performance (along with the running time and memory usage) of each method on the three datasets. We observe that our approximate approach achieves comparable performance to the E2E approaches on CoNLL-2012 and MedMentions, while performing slightly worse on Litbank. We hypothesize that token-level representations are effective on these tasks due to the expressiveness of the contextualized embeddings. We observe that the performance of our model is relatively unchanged with and without the sparsification applied.
Recently, Toshniwal et al. (2020b) and Xia et al. (2020) have proposed memory-based models that optimise memory usage; Toshniwal et al. (2020b) additionally improve mention detection with a separate pretraining process. These papers achieve state-of-the-art results on Litbank and focus on reducing the running time and memory usage of coreference models by storing entity representations instead of mention representations in a bounded-memory architecture. We compare the inference running time and coreference performance of our method with theirs in Table 1. We find that our models run 10x faster and are slightly more memory efficient than UMem (Toshniwal et al., 2020b) while matching their performance on Litbank.

Experimental Details
We use the hyperparameter settings of Xu and Choi (2020) in all applicable cases. We use 512 as the segment length. On CoNLL and MedMentions, we train all models for 24 epochs with the maximum number of training sentences set to 3. On Litbank, we train for 120 epochs and take parameters from Toshniwal et al. (2020b). We use 0.4 as the ratio for picking the top-scoring spans (mentions) among all candidate spans.

Inference Time and Memory Usage
In Figure 1, we compare the time and memory used by the end-to-end coreference models and our proposed family of approximations. We select a book at random from the Litbank corpus (Little Women) and report the time and memory used by each method to perform coreference as a function of the number of tokens analyzed. We plot a curve for each method, reporting the statistics until the method runs out of GPU memory (48GB). We cut off the x-axis of the graph where our proposed approach without sparsification runs out of memory; this token-level model only scales up to 24K tokens. We note that Ours (Sp.) is able to run on the entire book, requiring only marginally more memory for greater document lengths. This is in contrast with previous E2E methods, which run out of memory for documents longer than 10^4 tokens.

Training Time and Memory Usage
We report in Table 2 the training time and memory requirements for each of the methods. For each dataset, we train all the methods in focus for the same number of epochs/updates: 24 epochs on CoNLL, 120 epochs on Litbank, and 24 epochs on MedMentions. We observe that our approach greatly reduces GPU memory requirements and is also slightly faster. This gap is wider for datasets containing longer documents, as shown by the numbers on LitBank. Note that the sparsification approximation is simply an inference-time approximation and uses the same trained model as our approach with the k-NN approximation.

Table 2: Training time (hours) and memory (GB) usage. Our approach requires less time and memory than the competing end-to-end approaches.

Scaling to Long Documents
We run Ours (Sp.) on the full text of Artamène, or Cyrus the Great, which has 1.99M tokens (> 2M subtokens). To our knowledge, this is the largest single document a neural within-document coreference system has been applied to. In Figure 2, we show that our approach runs in about 14 minutes. Further, we demonstrate how the hyperparameters of the sparsification can be adjusted depending on the system requirements: the window size parameter can be set to the minimal amount (w=512) to require just 13 GB of GPU RAM. Table 3 suggests that small window sizes are also advantageous in terms of accuracy.

Model Analysis

K-NN Sparsification Performance Analysis
In Table 3, we show the CoNLL F1 as a function of the number of neighbors k and window size w in Ours (Sp.). We observe that we can achieve high-quality results even with a small number of neighbors, providing empirical justification for our approximation. Using just 10 nearest neighbors (k = 10) puts Ours (Sp.) within 99% of the performance of the version of our approach without sparsification. Litbank, however, required a higher value of k due to the presence of long-distance coreference links in literary texts. We note that a reduction in memory can be achieved by dropping antecedents and using sparse matrices; however, this is not as efficient as using dense matrices on the GPU.

Performance Analysis
Comparisons with Baselines: To give a sense of how the proposed approximations behave, we performed a simple analysis of E2E (hoi), Ours, and Ours (Sp.) on the last fold of the Litbank dataset. In the first experiment, we keep only the pronominal mentions in the predicted clusters and evaluate coreference scores. In the second experiment, we keep all mentions containing at least one noun. Table 4 shows the final numbers. The gap in performance between E2E (hoi) and Ours appears equal in both categories, and Ours (Sp.) and Ours have a similar performance gap in both categories as well. Thus our models appear to approximate fairly across categories. Table 5 shows an analysis of the distance between antecedents predicted by each model on the last Litbank fold. Our models have a higher average distance between antecedents, showing that they are capable of identifying long-distance links. Note that antecedent distance does not determine accuracy: a mention linked to any mention in its gold coreference cluster has the same effect.

Effects of model components
We further analyse the effect of other heuristics that went into the model. We use a subtoken strategy (SS) in which we restrict the candidate mentions to align with subtoken starts and ends. As shown in Table 6, SS improved the results on all the datasets. For Litbank, mention pretraining and mention training (MT) also helped significantly; mention training forces the scores of gold mentions to be higher, thereby making it easy to use a threshold for singletons at inference. The bilinear mention (BM) term helped on Litbank and MedMentions.

Related Work
With the growing computational cost of deep learning, NLP researchers have started to focus on more efficient models (Strubell et al., 2019; Schwartz et al., 2020). As coreference is a document-level phenomenon, it is particularly challenging to scale, especially for long documents. While most of the work in coreference has focused on genres of text with short documents, such as news articles and blogs (Pradhan et al., 2012), there has been renewed focus on long text documents such as novels (Bamman et al., 2019; Toshniwal et al., 2020b). Coreference in long text is particularly interesting due to the introduction of long-range anaphora.
Span-based end-to-end coreference systems (Lee et al., 2017, 2018; Joshi et al., 2020; Wu et al., 2020) have been the state of the art in short-document coreference resolution (Lu and Ng, 2020). These systems avoid training a separate mention detector. However, end-to-end coreference models are challenging to scale to long text documents due to their large memory footprint as well as slow training and inference. Thus, research on long-document coreference has so far focused on incremental (memory-based) coreference resolution (Xia et al., 2020; Toshniwal et al., 2020a,b). Memory-based approaches model coreference as online clustering, picking the most similar entity for every new mention while also updating the cluster representations (i.e., entity representations). However, the underlying recurrent nature of these models and the frequent read-write memory operations make them slow. In this work, we focus on the end-to-end coreference system and show gains both in speed and in memory.
We note that the token-level modeling ideas presented in this paper were concurrently introduced by Kirstain et al. (2021). We additionally introduce token windows and k-nearest-neighbor-based sparsification techniques. Furthermore, we provide empirical results on documents of about two million tokens, which we believe to be among the longest documents to which neural coreference models have been applied. We also note that Wu et al. (2020) hold state-of-the-art results on CoNLL (83 F1), using a question-answering cross-encoder style model to perform coreference. However, that method is very computationally expensive and thus difficult to scale to the long documents that are the focus of this paper.

Conclusion
In this paper, we introduce a new approach for performing coreference that scales to long documents. Our approach replaces costly span-based operations with token-level decisions for proposing mentions and determining antecedents. It uses token similarity in the form of k-nearest neighbor graphs, along with processing documents in token windows, to reduce time and memory complexity. We evaluate our proposed approach empirically and demonstrate that it achieves competitive coreference F1 scores while improving time and memory usage. We demonstrate the scalability of our method by applying it to a novel of about two million tokens. We further propose and demonstrate the use of low-rank approximations and the dropping of non-essential candidate antecedents to improve memory/time efficiency.

Broader Impact and Discussion of Ethics
While our model is not tuned for any specific real-world application, our method could be used in sensitive contexts such as legal or health-care settings, and it is essential that any work using our method undertake extensive quality-assurance and robustness testing before deploying it. The datasets used in our work do not contain any sensitive information to the best of our knowledge.
Replicability: As part of our contributions, we will release the code used for training and evaluation in this work, as well as all the trained models at https://github.com/raghavlite/Scalable-Coreference.