Coreference Resolution without Span Representations

The introduction of pretrained language models has reduced many complex task-specific NLP models to simple lightweight layers. An exception to this trend is coreference resolution, where a sophisticated task-specific model is appended to a pretrained transformer encoder. While highly effective, the model has a very large memory footprint – primarily due to dynamically-constructed span and span-pair representations – which hinders the processing of complete documents and the ability to train on multiple instances in a single batch. We introduce a lightweight end-to-end coreference model that removes the dependency on span representations, handcrafted features, and heuristics. Our model performs competitively with the current standard model, while being simpler and more efficient.


Introduction
Until recently, the standard methodology in NLP was to design task-specific models, such as BiDAF for question answering (Seo et al., 2017) and ESIM for natural language inference (Chen et al., 2017). With the introduction of pretraining, many of these models were replaced with simple output layers, effectively fine-tuning the transformer layers below to perform the traditional model's function (Radford et al., 2018). A notable exception to this trend is coreference resolution, where a multi-layer task-specific model (Lee et al., 2017, 2018) is appended to a pretrained model (Joshi et al., 2019, 2020). This model uses intricate span and span-pair representations, a representation refinement mechanism, handcrafted features, pruning heuristics, and more. While the model is highly effective, it comes at a great cost in memory consumption, limiting the number of examples that can be loaded on a large GPU to a single document, which often needs to be truncated or processed in sliding windows.

Can this coreference model be simplified? We present start-to-end (s2e) coreference resolution: a simple coreference model that does not construct span representations. Instead, our model propagates information to the span boundaries (i.e., its start and end tokens) and computes mention and antecedent scores through a series of bilinear functions over their contextualized representations. Our model has a significantly lighter memory footprint, allowing us to process multiple documents in a single batch, with no truncation or sliding windows. We do not use any handcrafted features, priors, or pruning heuristics.
Experiments show that our minimalist approach performs on par with the standard model, despite removing a significant amount of complexity, parameters, and heuristics. Without any hyperparameter tuning, our model achieves 80.3 F1 on the English OntoNotes dataset (Pradhan et al., 2012), with the best comparable baseline reaching 80.2 F1 (Joshi et al., 2020), while consuming less than a third of the memory. These results suggest that transformers can learn even difficult structured prediction tasks such as coreference resolution without investing in complex task-specific architectures.


Background: Coreference Resolution

Coreference resolution is the task of clustering multiple mentions of the same entity within a given text. It is typically modeled by identifying entity mentions (contiguous spans of text), and predicting, for each span q (query), an antecedent mention a that refers to a previously-mentioned entity, or a null-span ε otherwise.

Lee et al. (2017, 2018) introduce coarse-to-fine (c2f), an end-to-end model for coreference resolution that predicts, for each span q, an antecedent probability distribution over the candidate spans c:

P(a = c | q) = exp(f(c, q)) / Σ_{c' ∈ C(q)} exp(f(c', q))

Here, f(c, q) is a function that scores how likely c is to be an antecedent of q. This function is comprised of mention scores f_m(c), f_m(q) (i.e., is the given span a mention?) and a separate antecedent score f_a(c, q):

f(c, q) = f_m(c) + f_m(q) + f_a(c, q)   if c ≠ ε,   and f(ε, q) = 0

Our model (Section 3) follows the scoring function above, but differs in how the individual elements f_m(·) and f_a(·, ·) are computed. We now describe how f_m and f_a are implemented in the c2f model.
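To make the scoring scheme concrete, the antecedent distribution for a single query span can be sketched in a few lines (a NumPy toy, not the paper's code; the scores below are made up):

```python
import numpy as np

def antecedent_distribution(f_scores):
    """Softmax over candidate antecedents for one query span q.

    f_scores holds f(c, q) for each candidate c; the null antecedent
    epsilon has a fixed score of 0 and is prepended at index 0.
    """
    scores = np.concatenate(([0.0], f_scores))  # f(epsilon, q) = 0
    exp = np.exp(scores - scores.max())         # numerically stable softmax
    return exp / exp.sum()

# Toy scores f(c, q) = f_m(c) + f_m(q) + f_a(c, q) for three candidates
p = antecedent_distribution(np.array([2.0, -1.0, 0.5]))
```

At inference time, the highest-probability entry (including the null antecedent at index 0) decides whether q joins an existing cluster.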

Scoring Mentions
In the c2f model, the mention score f_m(q) is derived from a vector representation v_q of the span q (analogously, f_m(c) is computed from v_c). Let x_i be the contextualized representation of the i-th token produced by the underlying encoder. Every span representation is a concatenation of four elements: the representations of the span's start and end tokens x_{q_s}, x_{q_e}, a weighted average of the span's tokens x̂_q computed via self-attentive pooling, and a feature vector φ(q) that represents the span's length:

v_q = [x_{q_s}; x_{q_e}; x̂_q; φ(q)]

The mention score f_m(q) is then computed from the span representation v_q:

f_m(q) = v_m · ReLU(W_m v_q)

where W_m and v_m are learned parameters. Then, span representations are enhanced with more global information through a refinement process that interpolates each span representation with a weighted average of its candidate antecedents. More recently, Xu and Choi (2020) demonstrated that this span refinement technique, as well as other modifications to it (e.g., entity equalization (Kantor and Globerson, 2019)), do not improve performance.
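The c2f span representation and mention score can be sketched as follows (a NumPy illustration with toy dimensions; `w_attn` and `length_emb` stand in for the learned self-attention scorer and span-width embedding table, and all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 8                      # toy values: n tokens, hidden size d
x = rng.normal(size=(n, d))       # contextualized token representations x_i

def span_representation(x, s, e, w_attn, length_emb):
    """c2f-style span vector: [x_start; x_end; pooled tokens; length feature]."""
    tokens = x[s:e + 1]
    alpha = np.exp(tokens @ w_attn)
    alpha /= alpha.sum()                 # self-attentive pooling weights
    pooled = alpha @ tokens              # weighted average of span tokens
    phi = length_emb[e - s]              # span-length feature vector
    return np.concatenate([x[s], x[e], pooled, phi])

w_attn = rng.normal(size=d)
length_emb = rng.normal(size=(n, 4))     # width -> 4-dim feature (toy)
v_q = span_representation(x, 2, 5, w_attn, length_emb)

# Mention score f_m(q) = v_m . ReLU(W_m v_q)
W_m = rng.normal(size=(16, 3 * d + 4))
v_m = rng.normal(size=16)
f_m = v_m @ np.maximum(W_m @ v_q, 0.0)
```

Note how v_q grows with every span considered: holding one such vector per candidate span is exactly the memory cost the s2e model avoids.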

Scoring Antecedents
The antecedent score f_a(c, q) is derived from a vector representation of the span pair v_{(c,q)}. This, in turn, is a function of the individual span representations v_c and v_q, as well as a vector of handcrafted features φ(c, q), such as the distance between the spans c and q, the document's genre, and whether c and q were said/written by the same speaker:

v_{(c,q)} = [v_c; v_q; v_c ∘ v_q; φ(c, q)]

The antecedent score f_a(c, q) is parameterized with W_a and v_a as follows:

f_a(c, q) = v_a · ReLU(W_a v_{(c,q)})

Pruning Holding the vector representation of every possible span in memory has a space complexity of O(n^2 d) (where n is the number of input tokens, and d is the model's hidden dimension). This problem becomes even more acute when considering the space of span pairs (O(n^4 d)). Since this is not feasible, candidate mentions and antecedents are pruned through a variety of model-based and heuristic methods. Specifically, mention spans are limited to a certain maximum length ℓ. The remaining mentions are then ranked according to their scores f_m(·), and only the top λn are retained, while avoiding overlapping spans. Antecedents (span pairs) are further pruned using a lightweight antecedent scoring function (which is added to the overall antecedent score), retaining only a constant number of antecedent candidates c for each target mention q.
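The mention-pruning step can be sketched as a greedy top-λn selection that skips spans crossing an already-kept span (a simplified NumPy sketch of the c2f heuristic; function names and the toy inputs are illustrative):

```python
import numpy as np

def crosses(a, b):
    """True if spans a=(s1,e1) and b=(s2,e2) partially overlap (cross)."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 <= e1 < e2 or s2 < s1 <= e2 < e1

def prune_mentions(spans, scores, n_tokens, lam=0.4):
    """Keep the top lambda*n spans by mention score f_m, skipping any
    span that crosses an already-kept span."""
    k = max(1, int(lam * n_tokens))
    kept = []
    for i in np.argsort(scores)[::-1]:      # best-scoring first
        if len(kept) == k:
            break
        if not any(crosses(spans[i], spans[j]) for j in kept):
            kept.append(i)
    return sorted(kept)

spans = [(0, 1), (1, 3), (2, 2), (4, 6)]
scores = np.array([0.9, 0.8, 0.7, 0.1])
kept = prune_mentions(spans, scores, n_tokens=7, lam=0.4)
```

Here span (1, 3) is dropped despite its high score because it crosses the higher-scoring span (0, 1).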
Training For each remaining span q, the training objective optimizes the marginal log-likelihood of all of its unpruned gold antecedents ĉ, as there may be multiple mentions referring to the same entity:

log Σ_{ĉ ∈ C(q) ∩ GOLD(q)} P(a = ĉ | q)

Processing Long Documents Due to the c2f model's high memory consumption and the limited sequence length of most pretrained transformers, documents are often split into segments of a few hundred tokens each (Joshi et al., 2019). Recent work on efficient transformers (Beltagy et al., 2020) has been able to shift towards processing complete documents, albeit with a smaller model (base) and only one training example per batch.
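The marginal log-likelihood objective for one query span can be sketched directly from the softmax above (a NumPy toy; the scores and gold mask are made up for illustration):

```python
import numpy as np

def marginal_nll(f_scores, gold_mask):
    """Negative marginal log-likelihood for one query span q.

    f_scores: f(c, q) for the null antecedent (index 0, fixed at 0) and
    each candidate; gold_mask marks all unpruned gold antecedents
    (index 0 counts as gold when q starts a new entity or all of its
    gold antecedents were pruned).
    """
    log_z = np.log(np.sum(np.exp(f_scores)))              # log-partition
    log_marginal = np.log(np.sum(np.exp(f_scores[gold_mask])))
    return -(log_marginal - log_z)

scores = np.array([0.0, 1.5, -0.3, 0.7])     # epsilon + three candidates
gold = np.array([False, True, False, True])  # two gold antecedents
loss = marginal_nll(scores, gold)
```

Summing probability mass over all gold antecedents, rather than picking one, is what lets the model remain agnostic about which prior mention of the entity q links to.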

Model
We present start-to-end (s2e) coreference resolution, a simpler and more efficient alternative to c2f (Section 2). Our model utilizes only the endpoints of a span (rather than all span tokens) to compute the mention and antecedent scores f_m(·) and f_a(·, ·), without constructing span or span-pair representations; instead, we rely on a combination of lightweight bilinear functions between pairs of endpoint token representations. Furthermore, our model does not use any handcrafted features, does not prune antecedents, and prunes mention candidates solely based on their mention score f_m(q).
Our computation begins by extracting a start and end token representation from the contextualized representation x_i of each token in the sequence:

m^s_i = GeLU(W^s_m x_i)        m^e_i = GeLU(W^e_m x_i)

We then compute each mention score as a biaffine product over the start and end tokens' representations, similar to Dozat and Manning (2017):

f_m(q) = v_s · m^s_{q_s} + v_e · m^e_{q_e} + m^s_{q_s} · B_m · m^e_{q_e}

The first two factors measure how likely the span's start/end token q_s/q_e is a beginning/ending of an entity mention. The third measures whether those tokens are the boundary points of the same entity mention. The vectors v_s, v_e and the matrix B_m are the trainable parameters of our mention scoring function f_m. We efficiently compute mention scores for all possible spans while masking spans that exceed a certain length ℓ. We then retain only the top-scoring λn mention candidates to avoid O(n^4) complexity when computing antecedents.

Similarly, we extract start and end token representations for the antecedent scoring function f_a:

a^s_i = GeLU(W^s_a x_i)        a^e_i = GeLU(W^e_a x_i)

Then, we sum over four bilinear functions:

f_a(c, q) = a^s_{c_s} · B^{ss}_a · a^s_{q_s}
          + a^s_{c_s} · B^{se}_a · a^e_{q_e}
          + a^e_{c_e} · B^{es}_a · a^s_{q_s}
          + a^e_{c_e} · B^{ee}_a · a^e_{q_e}

Each component measures the compatibility of the spans c and q by an interaction between different boundary tokens of each span. The first component compares the start representations of c and q, while the fourth component compares the end representations. The second and third facilitate a cross-comparison of the start token of span c with the end token of span q, and vice versa. Figure 1 (bottom) illustrates these interactions.

This calculation is equivalent to computing a single bilinear transformation between the concatenations of each span's boundary token representations:

f_a(c, q) = [a^s_{c_s}; a^e_{c_e}] · B_a · [a^s_{q_s}; a^e_{q_e}]

However, computing the four factors directly bypasses the need to create n^2 explicit span representations. Thus, we avoid a theoretical space complexity of O(n^2 d), while keeping it equivalent to that of a transformer layer, namely O(n^2 + nd).
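The equivalence between the factored score and the single bilinear form over concatenated boundary representations can be checked numerically (a NumPy sketch with toy dimensions; variable names are illustrative, not from the released code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Start/end representations of the boundary tokens of spans c and q
a_cs, a_ce, a_qs, a_qe = rng.normal(size=(4, d))

# Four bilinear blocks; together they make up B_a
B_ss, B_se, B_es, B_ee = rng.normal(size=(4, d, d))

# Factored form: sum of four bilinear interactions between boundary tokens
f_factored = (a_cs @ B_ss @ a_qs + a_cs @ B_se @ a_qe
              + a_ce @ B_es @ a_qs + a_ce @ B_ee @ a_qe)

# Single bilinear form over the concatenated boundary representations
B_a = np.block([[B_ss, B_se], [B_es, B_ee]])
f_concat = np.concatenate([a_cs, a_ce]) @ B_a @ np.concatenate([a_qs, a_qe])
```

The factored form never materializes the concatenated 2d-dimensional span vectors, which is precisely why the O(n^2 d) span-representation cost disappears.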

Experiments
Dataset We train and evaluate on two datasets: the document-level English OntoNotes 5.0 dataset (Pradhan et al., 2012), and the GAP coreference dataset (Webster et al., 2018). The OntoNotes dataset contains speaker metadata, which the baselines use through a handcrafted feature that indicates whether two spans were uttered by the same speaker. Instead, we insert the speaker's name into the text every time the speaker changes, making the metadata available to any model.

Pretrained Model
We use Longformer-Large (Beltagy et al., 2020) as our underlying pretrained model, since it is able to process long documents without resorting to sliding windows or truncation.
Baseline We consider Joshi et al.'s (2019) expansion to the c2f model as our baseline. Specifically, we use the implementation of Xu and Choi (2020) with minor adaptations for supporting Longformer. We do not use higher-order inference, as Xu and Choi (2020) demonstrate that it does not improve performance. We experiment with the baseline model over three pretrained models: Longformer-Base, Longformer-Large, and SpanBERT-Large (Beltagy et al., 2020; Joshi et al., 2020).

Table 1: Performance on the test set of the English OntoNotes 5.0 dataset. c2f refers to the coarse-to-fine approach of Lee et al. (2017, 2018), as ported to pretrained transformers by Joshi et al. (2019).
Hyperparameters All models use the same hyperparameters as the baseline. The only hyperparameters we change are the maximum sequence length and batch size, which we enlarge to fit as many tokens as possible into a 32GB GPU. For our model, we use dynamic batching with 5,000 max tokens, which allows us to fit an average of 5-6 documents in every training batch. The baseline, however, has a much higher memory footprint, and is barely able to fit a single example with Longformer-Base (max 4,096 tokens). When combined with SpanBERT-Large or Longformer-Large, the baseline must resort to sliding windows to process the full document (512 and 2,048 tokens, respectively).
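The dynamic batching described above can be sketched as greedy packing under a token budget (the exact batching logic is not specified in the paper, so this is an assumed implementation; document lengths are made up):

```python
def dynamic_batches(doc_lengths, max_tokens=5000):
    """Greedily pack documents into batches under a total token budget.

    A document longer than the budget still gets its own batch.
    """
    batches, current, used = [], [], 0
    for i, n in enumerate(doc_lengths):
        if current and used + n > max_tokens:
            batches.append(current)     # flush the full batch
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        batches.append(current)
    return batches

batches = dynamic_batches([1200, 900, 2500, 800, 3000, 400])
```

With a 5,000-token budget, the six toy documents above pack into two batches of three documents each, matching the 5-6 documents per batch reported for OntoNotes.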
Performance Table 1 and Table 2 show that our model performs on par with the baseline, suggesting that span and span-pair representations are not essential for a high-performing coreference resolution architecture, while there are potential gains to be had from optimizing with larger batches.
Efficiency We also compare our model's memory usage to that of the baseline on the OntoNotes development set. Table 3 shows that our implementation is at least three times more memory efficient than the baseline. This improvement results from a combination of three factors: (1) our model's lighter architecture, which does not need to construct span or span-pair representations, (2) our simplified framework, which does not use sliding windows, and (3) our implementation, which was written from scratch, and might thus be more (or less) efficient than the original.

Related Work
Recent work on memory-efficient coreference resolution sacrifices speed and parallelism for guarantees on memory consumption. Xia et al. (2020) and Toshniwal et al. (2020) present variants of the c2f model (Lee et al., 2017, 2018) that use an iterative process to maintain a fixed number of span representations at all times. Specifically, spans are processed sequentially, either joining existing clusters or forming new ones, and an eviction mechanism ensures that the number of clusters remains constant. While these approaches constrain the space complexity, their sequential nature slows down computation and slightly deteriorates performance. Our approach alleviates the large memory footprint of c2f while maintaining fast parallel processing and high performance. CorefQA (Wu et al., 2020) proposes an alternative solution by casting coreference resolution as extractive question answering: it first detects potential mentions, and then creates a dedicated query for each one, yielding a pseudo-question-answering instance per candidate mention. This method significantly improves performance, but at the cost of processing hundreds of individual context-question-answer instances for a single document, substantially increasing execution time. Our work provides a simple alternative, which scales well in terms of both speed and memory.

Conclusion
We introduce a new model for coreference resolution, suggesting a lightweight alternative to the sophisticated model that has dominated the task over the past few years. Our model is competitive with the baseline, while being simpler and more efficient. This finding once again demonstrates the spectacular ability of deep pretrained transformers to model complex natural language phenomena.