Word-Level Coreference Resolution

Recent coreference resolution models rely heavily on span representations to find coreference links between word spans. As the number of spans is O(n^2) in the length of text and the number of potential links is O(n^4), various pruning techniques are necessary to make this approach computationally feasible. We propose instead to consider coreference links between individual words rather than word spans, and then to reconstruct the word spans. This reduces the complexity of the coreference model to O(n^2) and allows it to consider all potential mentions without pruning any of them out. We also demonstrate that, with these changes, SpanBERT for coreference resolution is significantly outperformed by RoBERTa. While being highly efficient, our model performs competitively with recent coreference resolution systems on the OntoNotes benchmark.


Introduction
Coreference resolution is used in various natural language processing pipelines, such as machine translation (Ohtani et al., 2019; Miculicich Werlen and Popescu-Belis, 2017), question answering (Dhingra et al., 2018) and text simplification (Wilkens et al., 2020). As coreference resolution is itself a building block of larger systems, it is crucial that coreference resolution models aim not only to achieve higher benchmark scores but also to be time- and memory-efficient.
Most recent coreference resolution models for English use the mention ranking approach: the highest ranking antecedent of each span is predicted to be coreferent to that span. As this approach presents a serious computational challenge of O(n^4) complexity in the document length, Lee et al. (2017) keep only the M top-scoring spans, while Lee et al. (2018) extend this idea by keeping only the top K antecedents for each span based on an easily computable score. Their C2F-COREF model, which showed 72.6 F1 on the OntoNotes 5.0 shared task (Pradhan et al., 2012), was later improved by Joshi et al. (2020), who introduced SpanBERT to obtain better span representations (79.6 F1), and by Xu and Choi (2020), who proposed a novel way of higher-order coreference resolution (80.2 F1).
The current state-of-the-art system built by Wu et al. (2020) uses a different approach, formulating the task as a machine reading comprehension problem. While they were able to achieve an impressive 83.1 F1 on the OntoNotes 5.0 shared task, their model is particularly computationally expensive as it requires a full transformer pass to score each span's antecedents.
To make coreference resolution models more compact and allow for their easier incorporation into larger pipelines, we propose to separate the task of coreference resolution from span extraction and to solve it on the word level, lowering the complexity of the model to O(n^2). Span extraction is performed separately, and only for those words that are found to be coreferent to some other words. An additional benefit of this approach is that there is no need for a complex span representation, which is usually concatenation-based. Overall, this allows us to build a lightweight model that performs competitively with other mention ranking approaches and achieves 81.0 F1 on the OntoNotes benchmark.


Related work

End-to-end coreference resolution

The model proposed by Lee et al. (2018) aims to learn a probability distribution P over all possible antecedent spans Y for each span i:

P(y) = exp(s(i, y)) / Σ_{y' ∈ Y(i)} exp(s(i, y'))

Here s(i, j) is the pairwise coreference score of spans i and j, while Y(i) contains spans to the left of i and a special dummy antecedent ε with a fixed score s(i, ε) = 0 for all i. The pairwise coreference score s(i, j) of spans i and j is a sum of the following scores:

1. s_m(i), whether i is a mention;
2. s_m(j), whether j is a mention;
3. s_c(i, j), the coarse coreference score of i and j;
4. s_a(i, j), the fine coreference score of i and j.
They are calculated as follows:

s_m(i) = FFNN_m(g_i)
s_c(i, j) = g_i^T W_c g_j
s_a(i, j) = FFNN_a([g_i, g_j, g_i ∘ g_j, φ(i, j)])

where g_i is the vector representation of span i and φ is a vector of pairwise features, such as the distance between spans, whether they are from the same speaker, etc. The span representation g_i is initialized as a concatenation of the contextual embeddings of the start and end tokens, the weighted sum x̂_i of all the tokens in the span, and a feature vector (learnable width and genre embeddings):

g_i = [x_{START(i)}, x_{END(i)}, x̂_i, φ(i)]

The weights used to calculate x̂_i are obtained using an attention mechanism (Bahdanau et al., 2014). The model also updates the span representations with weighted sums of their antecedent representations so that cluster information is available for a second iteration of calculating s_a. However, Xu and Choi (2020) have demonstrated that the influence of this step on performance ranges from negative to marginal.
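As a rough sketch of the concatenation-based span representation described above (all tensor names are ours, and the attention query is a fixed stand-in for the learned head-finding attention):

```python
import torch

def span_representation(x, start, end, phi):
    """Assemble g_i = [x_start, x_end, x_hat, phi(i)] for one span.

    x:          (seq_len, d) contextual subtoken embeddings
    start, end: inclusive span boundaries
    phi:        (d_phi,) feature vector (width/genre embeddings, assumed given)
    """
    tokens = x[start:end + 1]                     # (w, d) tokens inside the span
    # head-finding attention over the span; a learned query vector in the real
    # model, replaced here by an all-ones vector for illustration
    query = tokens.new_ones(tokens.shape[1])
    alpha = torch.softmax(tokens @ query, dim=0)  # (w,) attention weights
    x_hat = alpha @ tokens                        # attention-weighted sum
    return torch.cat([x[start], x[end], x_hat, phi])
```

The resulting vector is roughly three times the size of a single token embedding plus the feature vector, which is one source of the memory cost that the word-level model avoids.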

Coreference resolution without span representations
Recently, Kirstain et al. (2021) have proposed a modification of the mention ranking model which does not use span representations. Instead, they compute start and end representations of each subtoken x in the sequence as separate nonlinear projections of its contextual embedding:

m_s = GeLU(W_s x),   m_e = GeLU(W_e x)

The resulting vectors are used to compute mention and antecedent scores for each span without the need to construct explicit span representations. This approach performs competitively with other mention ranking approaches while being more memory efficient. However, the theoretical size of the antecedent score matrix is still O(n^4), so the authors need to prune the resulting mentions. In this paper, we present an approach that does not exceed quadratic complexity and needs no mention pruning, while retaining comparable performance.

Token representation
After obtaining contextual embeddings of all the subtokens of a text, we compute token representations T as weighted sums of their respective subtokens. The weights are obtained by applying the softmax function to the raw scores of the subtokens of each token. The scores are calculated by multiplying the matrix of subtoken embeddings X by a matrix of learnable weights W_a:

A = X · W_a

No mention score is computed for the resulting token representations, and none of them is subsequently pruned out.
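A minimal sketch of this pooling step, assuming a precomputed subtoken-to-token index map (the function and variable names are ours, not the released implementation's):

```python
import torch

def token_representations(x, token_ids, w_a):
    """Pool subtoken embeddings into token representations.

    x:         (n_subtokens, d) contextual subtoken embeddings
    token_ids: (n_subtokens,) index of the token each subtoken belongs to
    w_a:       (d, 1) learnable scoring weights
    """
    scores = (x @ w_a).squeeze(-1)        # raw scores, one per subtoken
    n_tokens = int(token_ids.max()) + 1
    reps = []
    for t in range(n_tokens):
        mask = token_ids == t
        # softmax is normalized within each token's own subtokens
        alpha = torch.softmax(scores[mask], dim=0)
        reps.append(alpha @ x[mask])      # weighted sum of the subtokens
    return torch.stack(reps)              # (n_tokens, d)
```

The Python loop is for clarity only; the same computation can be batched with a scatter-softmax in practice.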

Coarse-to-fine antecedent pruning
Following Lee et al. (2018), we first use a bilinear scoring function to compute the k most likely antecedents for each token:

s_c(i, j) = T_i^T W_c T_j

This further reduces the computational complexity of the model.
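The coarse step can be sketched as follows (a simplified illustration; batching and device details of the actual model may differ):

```python
import torch

def coarse_topk_antecedents(t, w_c, k):
    """Bilinear coarse scores s_c(i, j) = T_i^T W_c T_j, keeping the
    top-k candidate antecedents to the left of each token.

    t:   (n, d) token representations
    w_c: (d, d) learnable bilinear weights
    """
    scores = t @ w_c @ t.T                         # (n, n) all pairwise scores
    n = t.shape[0]
    # an antecedent j must precede the token i; mask out everything else
    mask = torch.arange(n)[None, :] >= torch.arange(n)[:, None]
    scores = scores.masked_fill(mask, float("-inf"))
    top_scores, top_idx = scores.topk(min(k, n), dim=1)
    return top_scores, top_idx
```

Only these n × k candidate pairs are passed to the fine scoring stage, which keeps the expensive part of the model linear in k rather than quadratic in n.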
Then we build a matrix of n × k pairs, where each pair is represented as a concatenation of the two token embeddings, their element-wise product, and feature embeddings φ (distance, genre, and same/different speaker embeddings). The fine antecedent score is obtained with a feed-forward neural network:

s_a(i, j) = FFNN([T_i, T_j, T_i ∘ T_j, φ(i, j)])

The resulting coreference score is defined as the sum of the two scores:

s(i, j) = s_c(i, j) + s_a(i, j)

The candidate antecedent with the highest positive score is taken as the predicted antecedent of each token. If there are no candidates with positive coreference scores, the token is concluded to have no antecedents. Our model does not use higher-order inference, as Xu and Choi (2020) have shown that it has a marginal impact on performance. It is also computationally expensive, since it repeats the most memory-intensive part of the computation during each iteration.
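A sketch of the fine scoring stage over the pruned candidates (module and parameter names are hypothetical; the feature embeddings φ are assumed precomputed):

```python
import torch
import torch.nn as nn

class FineScorer(nn.Module):
    """Fine antecedent scoring over (token, candidate) pairs."""

    def __init__(self, d, d_phi, hidden=64):
        super().__init__()
        # input: [T_i, T_j, T_i * T_j, phi] -> scalar score
        self.ffnn = nn.Sequential(
            nn.Linear(3 * d + d_phi, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, t, top_idx, phi, coarse_scores):
        """t: (n, d); top_idx: (n, k); phi: (n, k, d_phi);
        coarse_scores: (n, k). Returns total scores s_c + s_a."""
        anteced = t[top_idx]                        # (n, k, d)
        mention = t[:, None, :].expand_as(anteced)  # (n, k, d)
        pairs = torch.cat([mention, anteced, mention * anteced, phi], dim=-1)
        fine = self.ffnn(pairs).squeeze(-1)         # (n, k)
        return coarse_scores + fine
```

At inference, each token takes the argmax over its k candidates if the best total score is positive, and otherwise starts no coreference link.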

Span extraction
The tokens that are found to be coreferent to some other tokens are further passed to the span extraction module. For each token, the module reconstructs the span by predicting the most probable start and end tokens in the same sentence. To reconstruct a span headed by a token, all tokens in the same sentence are concatenated with this head token and then passed through a feed-forward neural network followed by a convolution block with two output channels (for start and end scores) and a kernel size of three. The intuition captured by this approach is that the best span boundary is located between a token that is likely to be in the span and a token that is unlikely to be in the span. During inference, tokens to the right of the head token are not considered as potential start tokens and tokens to the left are not considered as potential end tokens.
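The span extraction head described above can be sketched as follows (layer sizes are hypothetical; only the two-channel convolution with kernel size three and the boundary constraints follow the text):

```python
import torch
import torch.nn as nn

class SpanExtractor(nn.Module):
    """Predict start/end boundaries of the span headed by a given token."""

    def __init__(self, d, hidden=64):
        super().__init__()
        self.ffnn = nn.Sequential(nn.Linear(2 * d, hidden), nn.ReLU())
        # two output channels: start scores and end scores, kernel size 3
        self.conv = nn.Conv1d(hidden, 2, kernel_size=3, padding=1)

    def forward(self, sent, head_idx):
        """sent: (m, d) token representations of one sentence."""
        head = sent[head_idx].expand_as(sent)            # broadcast head token
        h = self.ffnn(torch.cat([sent, head], dim=-1))   # (m, hidden)
        scores = self.conv(h.T[None]).squeeze(0)         # (2, m)
        start_scores, end_scores = scores[0], scores[1]
        # the span cannot start after or end before its head token
        idx = torch.arange(sent.shape[0])
        start_scores = start_scores.masked_fill(idx > head_idx, float("-inf"))
        end_scores = end_scores.masked_fill(idx < head_idx, float("-inf"))
        return int(start_scores.argmax()), int(end_scores.argmax())
```

The convolution looks at a token together with its two neighbors, which matches the intuition that a boundary lies between a likely in-span token and a likely out-of-span token.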

Training
To train our model, we first need to transform the OntoNotes 5.0 training dataset to link individual words rather than word spans. To do that, we use the syntax information available in the dataset to reduce each span to its syntactic head. We define a span's head as the only word in the span that depends on a word outside the span or is the head of the sentence. If the number of such words is not exactly one, we choose the rightmost word of the span as its head. This allows us to build two training datasets: 1) a word-level coreference dataset to train the coreference module and 2) a word-to-span dataset to train the span extraction module.
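The head-selection rule can be sketched in a few lines (assuming a `heads` array derived from the dataset's syntactic annotation, with -1 marking the head of the sentence):

```python
def span_head(span, heads):
    """Reduce a span to its syntactic head.

    span:  (start, end) inclusive word indices
    heads: heads[i] is the index of word i's syntactic head,
           or -1 if word i is the head of its sentence
    """
    start, end = span
    # words whose head lies outside the span, or which head the sentence
    candidates = [
        i for i in range(start, end + 1)
        if heads[i] == -1 or not start <= heads[i] <= end
    ]
    # exactly one such word: that is the head; otherwise fall back
    # to the rightmost word of the span
    return candidates[0] if len(candidates) == 1 else end
```

For example, with `heads = [2, 2, -1, 2]`, the span (0, 2) reduces to word 2 (its only outward-pointing word), while the span (0, 1) has two outward-pointing words and falls back to its rightmost word.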
Following Lee et al. (2017), the coreference module uses negative log marginal likelihood (NLML) as its base loss function, since the exact antecedents of gold spans are unknown and only the final gold clustering is available:

L_nlml = -log ∏_i Σ_{ŷ ∈ Y(i) ∩ GOLD(i)} P(ŷ)

We propose to use binary cross-entropy as an additional regularization factor:

L = L_nlml + α · L_bce

This encourages the model to output higher coreference scores for all coreferent mentions, as the pairs are classified independently of each other. We use the value of α = 0.5 to prioritize the NLML loss over the BCE loss. The span extraction module is trained using cross-entropy loss over the start and end scores of all words in the same sentence.
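As a sketch of the combined objective for a single token (we assume the dummy antecedent occupies index 0 and that the BCE term covers only the real candidates; both are our assumptions, not stated in the text):

```python
import torch
import torch.nn.functional as F

def coref_loss(scores, gold, alpha=0.5):
    """NLML + alpha * BCE loss for one token.

    scores: (k + 1,) coreference scores; index 0 is the dummy
            antecedent with fixed score 0
    gold:   (k + 1,) binary mask of gold antecedents; the dummy is
            marked gold iff the token has no real antecedent
    """
    log_probs = F.log_softmax(scores, dim=0)
    # NLML: marginalize over all gold antecedents of the token
    nlml = -torch.logsumexp(log_probs[gold.bool()], dim=0)
    # BCE: classify each real candidate pair independently
    bce = F.binary_cross_entropy_with_logits(scores[1:], gold[1:].float())
    return nlml + alpha * bce
```

In training, this per-token loss is summed over all tokens and added to the span extraction loss.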
We jointly train the coreference and span extraction modules by summing their losses.

Experiments
We use the OntoNotes 5.0 test dataset to evaluate our model and compare its performance with the results reported by the authors of other mention ranking models. The development portion is used to assess the impact of our architectural decisions.

Experimental setup
We implement our model with the PyTorch framework (Paszke et al., 2019). We use the Hugging Face Transformers (Wolf et al., 2020) implementations of BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), SpanBERT (Joshi et al., 2020) and Longformer (Beltagy et al., 2020). All the models were used in their large variants. We also replicate the span-level model by Joshi et al. (2020) to use it as a baseline for comparison.
We do not perform hyperparameter tuning and mostly use the same hyperparameters as the ones used in the independent model by Joshi et al. (2020), except that we also linearly decay the learning rates.
Our models were trained for 20 epochs on a 48 GB NVIDIA Quadro RTX 8000. Training took 5 hours (except for the Longformer-based model, which needed 17 hours). Evaluating the final model requires 4 GB of GPU memory.

Selecting the best model
The results in Table 2 compare our word-level models built on different underlying transformers with the span-level baseline.

Table 2: Model comparisons on the OntoNotes 5.0 development dataset (best out of 20 epochs). WL F1 is the word-level CoNLL-2012 F1 score, i.e. the coreference metric on the word-level dataset; SA is the span extraction accuracy, i.e. the percentage of correctly predicted spans; SL F1 is the span-level CoNLL-2012 F1 score, the basic coreference metric.
Longformer yields even higher scores for word-level models, though RoBERTa does not improve the performance of the span-level model. The addition of binary cross-entropy as an extra regularization factor has a slight positive effect on performance. As it introduces zero overhead at inference time, we keep it in our final model.
All the underlying transformer models, with the exception of BERT, reach approximately the same span extraction quality. While this allows the approach to compete with span-level coreference models, there is still room for improvement, as can be seen from the gap between the "pure" word-level coreference scores and the span-level scores.

Efficiency
As one can see from Table 3, the word-level approach needs to consider 14x fewer mentions and 222x fewer mention pairs than the span-level approach:

Table 3:
     Mentions    Mention pairs     SBC
WL   163,104     62,803,841        475,208
SL   2,263,299   13,970,813,822    n/a

This allows us to keep all the potential mentions, while span-level approaches have to rely on pruning techniques, such as scoring the mentions and selecting only the top-scoring portion of them for further consideration. Such pruning is a compromise between efficiency and accuracy, so removing the need for it benefits both. Moreover, our representation of a mention does not require concatenating start, end and content vectors, which further reduces the memory requirements of the model.

Table 4 contains actual efficiency measurements of our baseline (JOSHI-REPLICA) before and after disabling higher-order inference and switching to word-level coreference resolution. While the baseline relies on aggressive mention pruning and keeps only the top λn mentions, where λ = 0.4 and n is the number of words in the input text, it still requires more time and memory than its word-level counterpart, which considers all possible mentions.
As a reference, we also provide time and memory measurements for other mention ranking models. However, they should not be directly compared, as there are other factors influencing the running time, such as the choice of a framework, the degree of mention pruning, and code quality.

Test results
Table 1 demonstrates that our model performs competitively with other mention ranking approaches. It can be seen that it has a higher recall than span-level models, because it does not need any mention pruning techniques to be computationally feasible and considers all the potential antecedents for each token in the text.

Conclusion
We introduce a word-level coreference model that, while being more efficient, performs competitively with recent systems. This allows for easier incorporation of coreference models into larger natural language processing pipelines. In addition, the separation of the coreference resolution and span extraction tasks makes the model's results more interpretable. If necessary, the span prediction module can be replaced by a syntax-aware system to account for discontinuous spans. Such task separation also makes the coreference module independent of the way coreferent spans are marked up in the training dataset, which will potentially simplify using various data sources to build robust coreference resolution systems.