Moving on from OntoNotes: Coreference Resolution Model Transfer

Academic neural models for coreference resolution (coref) are typically trained on a single dataset, OntoNotes, and model improvements are benchmarked on that same dataset. However, real-world applications of coref depend on the annotation guidelines and the domain of the target dataset, which often differ from those of OntoNotes. We aim to quantify the transferability of coref models based on the number of annotated documents available in the target dataset. We examine eleven target datasets and find that continued training is consistently effective and especially beneficial when there are few target documents. We establish new benchmarks across several datasets, including state-of-the-art results on PreCo.


Introduction
Starting with neurally-learned features (Clark and Manning, 2016a,b), end-to-end neural models for coreference resolution (Lee et al., 2017, 2018) have been developed and imbued with the benefits of contextualized language modeling (Joshi et al., 2019, 2020) and additional pretraining. At the same time, the number of parameters used in these models has likewise increased, which raises questions of overfitting our research to a specific benchmark. Several studies show that models fully trained on preexisting large datasets do not transfer well to new domains (Aktaş et al., 2020; Bamman et al., 2020; Timmapathini et al., 2021), and that rule-based baselines can still be superior (Poot and van Cranenburgh, 2020). Furthermore, while there has been prior work analyzing fully-trained models at the mention-pair level, e.g., for gender bias (Rudinger et al., 2018; Webster et al., 2018; Zhao et al., 2019), there has not been a comprehensive comparison analyzing model transfer across datasets for document-level coreference resolution.
We bridge the current gap in understanding between the strength of pretrained models and the value of annotated target data, in light of the strong few-shot capabilities demonstrated by pretrained language models (Brown et al., 2020; Schick and Schütze, 2020). While transfer in other NLP tasks has been studied in depth, transfer in coreference resolution has scarcely been studied, despite recent models containing hundreds of millions of parameters. We investigate model transfer across datasets with continued training, in which a fully-trained model on a source dataset is further trained on a small number of target dataset examples (Sennrich et al., 2016; Khayrallah et al., 2018). 1 We contribute the first study of neural coreference resolution transfer across five datasets spanning different domains, annotation guidelines, and languages. We find evidence that OntoNotes, a widely-used dataset for benchmarking coreference resolution, is no better for model transfer than the freely-available PreCo. Furthermore, we establish modern benchmarks on several understudied datasets, including state-of-the-art results on PreCo and LitBank.

Coreference Resolution
Entity coreference resolution is the task of finding clusters of mentions within a document that all refer to the same entity. It remains a difficult challenge in NLP due to factors like ambiguity (Poesio and Artstein, 2008) and dependence on real-world knowledge (Levesque et al., 2012). Because coreference is a document-level phenomenon, annotation is also challenging: whether two mentions corefer can depend on context spanning the entire document.
Despite these challenges, there are several large annotated datasets for coreference resolution. The predominant one used for model benchmarking is OntoNotes 5.0 (Weischedel et al., 2013). In general, however, the annotation guidelines for coreference resolution datasets differ from one another based on the goals or needs of the particular dataset, resulting in differences in what counts as a mention, how singleton clusters 2 are handled, and which types of links are annotated. Yet OntoNotes has emerged as the most widely-used benchmark for the full task, and widely-used publicly available models are trained on this dataset (Manning et al., 2014; Gardner et al., 2018).
However, OntoNotes-based models may not always be appropriate. OntoNotes is a collection of several thousand documents across just seven genres from the 2000s (or earlier), and many datasets fall outside the scope of those genres or that time period. In addition, it does not annotate singleton mentions, even though most other datasets do. In modeling OntoNotes, genre-specific and speaker-related features are needed to improve on the state-of-the-art, both of which are idiosyncrasies of the OntoNotes dataset. It is unclear how well these models transfer to a new target dataset, or whether it would be preferable to retrain entirely on target data.
Prior work on domain adaptation for coreference resolution has focused on a single dataset, often with non-neural models. Yang et al. (2012) use an adaptive ensemble which adjusts its members per document. Meanwhile, Zhao and Ng (2014) use an active learning approach and find that a feature-based coreference resolution model can be adapted to be on par with one trained from scratch while using far less data. Aktaş et al. (2020) adapt the neural model of Lee et al. (2018) to Twitter by carefully selecting genres of OntoNotes to train from, which requires retraining a model on source data selected based on the target domain.
While these studies offer insight into single datasets, we aim to set broader expectations and guidelines for effectively using new data for model adaptation, both in terms of quantity and allocation between training and model selection.

Methods
We investigate our research question across five datasets with seven different initialization methods and vary the training set size for each model.

2 An entity cluster with only one mention.

Continued Training
We adopt the formulation of continued training from Luong and Manning (2015), in which a model is first trained on a source dataset until convergence. This fully-trained model is then used to initialize a second model, which is trained on a target dataset.
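To make the procedure concrete, the following is a minimal sketch of continued training under this formulation; the train_epoch and evaluate helpers are hypothetical stand-ins for task-specific training and dev-set evaluation, not the actual implementation.

```python
import copy
import torch

def continued_training(model, source_ckpt_path, target_train, target_dev,
                       train_epoch, evaluate, max_epochs=100, patience=10):
    # Initialize from the converged source-dataset model, not random weights.
    model.load_state_dict(torch.load(source_ckpt_path))
    best_f1, best_state, since_best = -1.0, None, 0
    for _ in range(max_epochs):
        train_epoch(model, target_train)
        dev_f1 = evaluate(model, target_dev)
        if dev_f1 > best_f1:
            best_f1, since_best = dev_f1, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            since_best += 1
            if since_best >= patience:
                break  # early stopping on target dev F1
    model.load_state_dict(best_state)
    return model
```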
This general framework has been used successfully for other tasks where annotation guidelines or domains shift significantly between datasets, such as syntactic parsing (Joshi et al., 2018), semantic parsing (Fan et al., 2017; Lialin et al., 2021), and neural machine translation (Luong and Manning, 2015; Khayrallah et al., 2018). In addition, continued training can be staggered at different granularities (Gururangan et al., 2020) or at different mixture rates of in-domain and out-of-domain data (Xu et al., 2021).

Incremental Coreference Model
End-to-end models for coreference resolution broadly have four parts: a text encoder, a scorer for mention detection, a scorer for mention pair linking, and an algorithm for decoding clusters. The incremental coreference (ICOREF) model (Xia et al., 2020) is a constant-memory adaptation of the end-to-end neural coreference resolution model (Lee et al., 2017) with improvements from subsequent work incorporating stronger encoders (Joshi et al., 2019, 2020). By creating explicit clusters and performing mention-cluster linking instead of mention-pair linking, ICOREF removes the need for a decoding step. This model is conceptually similar to Toshniwal et al. (2020), which likewise aims to better accommodate longer documents with limited memory. We choose this model for its competitive performance against the line of end-to-end neural coreference resolution models (Joshi et al., 2019) and its memory efficiency, which allows for experiments on longer documents. We use this model architecture for all of our experiments.
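The following toy sketch illustrates the incremental mention-cluster linking loop described above; the scorer and the fixed interpolation weight are illustrative stand-ins for the model's learned networks.

```python
import numpy as np

def incremental_clustering(mention_embs, score_cluster, alpha=0.5):
    """mention_embs: mention representations in document order (np.ndarray).
    score_cluster(x, c) -> float stands in for the learned scorer s_c."""
    cluster_embs, assignments = [], []
    for x in mention_embs:
        scores = [score_cluster(x, c) for c in cluster_embs]
        if scores and max(scores) > 0:
            j = int(np.argmax(scores))
            # Merge into the best cluster; in the full model, alpha is
            # predicted from x and the cluster rather than fixed.
            cluster_embs[j] = alpha * x + (1 - alpha) * cluster_embs[j]
        else:
            cluster_embs.append(x)  # no positive score: open a new cluster
            j = len(cluster_embs) - 1
        assignments.append(j)
    return assignments

# e.g. incremental_clustering([np.ones(4), np.ones(4), -np.ones(4)],
#                             score_cluster=lambda x, c: float(x @ c) - 1.0)
```

Because each mention is linked to a cluster as it arrives, no separate decoding step over mention pairs is needed.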
However, ICOREF, like the models before it, was designed around OntoNotes. As a result, we need to ignore its genre-specific embeddings and accommodate singletons by modifying the training objective. These details, along with more specifics about the model's algorithm and training objective, are described in Appendix A.

Data
We explore several source and target datasets, as shown in Table 1. Where datasets are small, evaluation is performed via k-fold cross-validation (following the original authors).
OntoNotes 5.0 (Weischedel et al., 2013) is a dataset spanning several genres including telephone conversations, newswire, newsgroups, broadcast news, broadcast conversations, weblogs, and religious text. The dataset contains annotations of syntactic parse trees, named entities, semantic roles, and coreference. Notably, however, it does not annotate singleton mentions, though it does link events. Finally, the release includes data in English (en), Chinese (zh), and Arabic, which we refer to using superscripts.
PreCo (Chen et al., 2018) is a dataset consisting of reading comprehension passages used in test questions. The authors argue that because its vocabulary is smaller than that of OntoNotes, it is more controllable for studying train-test overlap. While they detail many ways in which their annotation scheme differs from OntoNotes, we note that they annotate singleton mentions and do not annotate events. Furthermore, this corpus is sufficiently large that it is possible to train a general-purpose coreference resolution model. Finally, because the official test set has not been released, we refer to the official "dev" set as our test set, and use a separate 500 training examples as our "dev" set.
LitBank (Bamman et al., 2020) is an annotated dataset of the first 2,000 words, on average, of 100 books. While they annotate singleton mentions, they limit mentions to those which can be assigned an ACE category. Furthermore, due to the small corpus size and high variability in writing style, they evaluate using k-fold cross-validation.
QBCoref (Guha et al., 2015) is a set of 400 quiz bowl 3 literature questions that are annotated for coreference resolution. This dataset also annotates singletons, and it considers only a small set of mention types. The documents are short and dense with (nested) entity mentions, as well as terminology specific to literature questions.
ARRAU (Uryupina et al., 2020) is the second release 4 of ARRAU, a corpus first created by Poesio and Artstein (2008) that spans several genres. The fine-grained annotations mark the explicit type of coreference, and the dataset also includes phenomena like singleton mentions and non-referential mentions. For this paper, we study only the coarsest-grained coreference resolution sets of the RST subcorpus, which is a subset of the Penn Treebank (PTB) newswire documents and therefore uses the same splits as PTB (Poesio et al., 2018). Since OntoNotes also includes sections of PTB, this dataset overlaps with OntoNotes. However, we can still use ARRAU to study annotation transfer.

Source models
ICOREF has three trained components: an encoder, a mention scorer, and a mention linker. We experiment with initializing either the encoder alone or the full model.
Pretrained encoders For these models, we initialize only the encoder with a pretrained one and randomly initialize the rest of the model. Joshi et al. (2020) trained the SPANBERT encoder on a collection of English data with a span boundary objective aimed at improving span representations. In addition, they finetune SPANBERT by training a coreference resolution system on OntoNotes (Joshi et al., 2019), which they release separately. We name this finetuned encoder SPANBERT-ON. Conneau et al. (2020) trained XLM-R, a cross-lingual encoder, on web-crawled text in 100 languages. It is effective at cross-lingual transfer, including for coreference linking (Xia et al., 2021). We use the "large" size of each model, except for one experiment with the "base" size of SPANBERT-ON.
Trained models Alternatively, we can initialize the full model. TRANSFER (ON) is the downloadable model from Xia et al. (2020). In addition, we train models on PreCo with SpanBERT-large (TRANSFER (PC)) and on OntoNotes en with XLM-R (TRANSFER (EN)). 5 We also train a variant of each model with gold mention boundaries, which skips the mention scorer.

[Table of source models: Base (B) and Large (L) designate encoder size. For the first four models, we initialize only the encoder. For the TRANSFER models, we perform continued training and initialize the full model with one that has already been trained on a source dataset: OntoNotes (on), PreCo (pc), or OntoNotes en (en).]
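To make the two initialization strategies concrete, here is a sketch contrasting them; the module names (encoder, mention_scorer, linker) and sizes are illustrative, not the actual ICOREF attribute names.

```python
import torch
import torch.nn as nn

class CorefModel(nn.Module):
    """Toy stand-in with the three trained components named in the text."""
    def __init__(self, encoder, hidden=768):
        super().__init__()
        self.encoder = encoder                     # e.g. SpanBERT or XLM-R
        self.mention_scorer = nn.Linear(hidden, 1)
        self.linker = nn.Bilinear(hidden, hidden, 1)

def init_encoder_only(pretrained_encoder):
    # "Pretrained encoders": pretrained encoder, randomly initialized rest.
    return CorefModel(pretrained_encoder)

def init_full_model(encoder, source_ckpt_path):
    # TRANSFER setting: every component starts from a checkpoint already
    # trained to convergence on the source dataset.
    model = CorefModel(encoder)
    model.load_state_dict(torch.load(source_ckpt_path))
    return model
```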

Experiments and Results
For a single source model and target dataset, we train several models using a different number of input training examples. The exact details for training set sizes and preprocessing are in Appendix B, while training details and hardware are in Appendix C. We evaluate coreference using the average F1 of MUC, B^3, and CEAF_φ4, following prior work (Pradhan et al., 2012).
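For reference, the sketch below shows one simplified way to compute B^3 and the final averaged score; MUC and CEAF_φ4 are omitted for brevity, and in practice the official CoNLL reference scorer should be used.

```python
def b_cubed(gold_clusters, pred_clusters):
    """Clusters are collections of frozensets of mention spans,
    e.g. [frozenset({(0, 1), (5, 6)})]."""
    gold_of = {m: g for g in gold_clusters for m in g}
    pred_of = {m: p for p in pred_clusters for m in p}
    # Precision averages per-mention overlap over predicted mentions,
    # recall over gold mentions (Bagga and Baldwin, 1998).
    prec = sum(len(p & gold_of.get(m, frozenset())) / len(p)
               for m, p in pred_of.items()) / max(len(pred_of), 1)
    rec = sum(len(g & pred_of.get(m, frozenset())) / len(g)
              for m, g in gold_of.items()) / max(len(gold_of), 1)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def conll_f1(muc_f1, b3_f1, ceaf_f1):
    # The benchmark score is the unweighted mean of the three metric F1s.
    return (muc_f1 + b3_f1 + ceaf_f1) / 3
```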

How effective is continued training for domain adaptation?
Continued training Figure 1 shows that it is always beneficial to perform continued training on a source model, even when there is a large amount of target data. However, the differences are most pronounced in low-resource settings (with 10 fully-annotated documents), where it is still possible to adapt a strong model to perform non-randomly. Brown et al. (2020) find that pretrained language models can effectively learn a broad suite of sentence-level understanding, translation, and question-answering tasks with just a few examples. We corroborate their findings for a document-level information extraction task, since our models, based on strong pretrained encoders, perform well with just 5 or 10 training documents.
OntoNotes vs. PreCo We find that OntoNotes (TRANSFER (ON)), despite being the benchmark dataset, is on par with (or worse than) PreCo (TRANSFER (PC)) as a pretraining dataset. One possibility is that because PreCo annotates singletons, it is closer to the target datasets that also annotate singletons. This is evident when we compare the mention detection accuracy of the two models in low-data settings (e.g., LitBank or QBCoref at 5 examples). However, we subsequently explore the case when all models are given gold mention boundaries in training and test, which effectively evaluates just the linker. Even then, we find that PreCo outperforms OntoNotes, by an even wider margin, on QBCoref, LitBank, and even ARRAU RST. This suggests that PreCo might be a better pretraining dataset than OntoNotes.

Figure 2: The expected test F1 (and standard deviation) on the PreCo dataset for a given number of training documents and 20 sampled subsets of dev documents, for two models described in Section 3.4. The number of runs matching the best full-dev checkpoint is in the lower right. This shows that, for a given model, the dev set size has relatively little impact.
Model size and pretraining The publicly available models use the "base" and "large" encoders. While even larger encoders exist, coreference models using them are rare. For future model development, one may need to decide between using a publicly available small model and retraining a large one from scratch. To simulate this, we compare a small encoder finetuned on OntoNotes, SPANBERT-ON (B), with SPANBERT (L), which has not been trained on the task. This is also a realistic setting when there are hardware or compute limitations.
In all datasets, we see that there is a benefit to having some pretraining. When there is not much training data, the smaller (finetuned) encoder outperforms the larger encoder without finetuning. However, with enough data, the large model appears to surpass the smaller one. Nonetheless, there exist scenarios where continued training of a smaller model is desirable.
Cross-lingual transfer The gap in performance in low-data conditions (and the high initial starting point) shows that transfer via continued training is effective. In this case, it also provides more evidence for XLM-R's cross-lingual transfer ability.
In the end-to-end setting, the continued training method consistently outperforms the base encoder. This is less clear when gold mentions are given. We noticed that some models did not converge, but only when training with XLM-R (L). The reason for the high variability with OntoNotes zh warrants further study.
New benchmarks For each dataset, Table 2 shows the test scores of our best model compared to prior work. For PreCo, we directly evaluate the fully-trained model without continued training, as the full dataset is sufficiently large. Since some of these datasets are understudied, we present these primarily as stronger baselines for future work.

How should we allocate annotated documents?
In Figure 1, the experiments for each dataset used the same dev set for model selection. Meanwhile, we observe that adding even a few more training examples can lead to improved performance. Yet for some datasets, like PreCo, the size of the dev set used for model selection greatly outnumbers the number of training documents. Here, we instead allocate fewer documents to model selection.
We compare 20 models for PreCo, trained with different numbers of examples, using SPANBERT-ON (L) and TRANSFER (ON). We train each model for 60 epochs and make predictions on all 500 dev examples. Next, for each dev set size, we sample a subset of the full predictions and determine, post hoc, the checkpoint at which training would have stopped had we used that sampled subset for model selection. We sample 20 such subsets and compute the expected score and standard deviation for each model, along with how frequently the subset agreed with the full dev set. Figure 2 summarizes the results, showing remarkable stability in expectation even with tiny dev sets, often less than a couple of points behind using the full dev set. Given a fixed budget of documents or annotations, these results suggest allocating as many documents as possible to training, leaving behind only a small set for model selection.
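The sampling procedure can be sketched as follows, assuming per-checkpoint, per-document dev F1 values are available (a simplification, since the official metrics aggregate at the corpus level rather than per document).

```python
import random

def simulate_dev_subsets(scores, subset_size, n_samples=20, seed=0):
    """scores: dict checkpoint_id -> list of per-document dev F1 values.
    Returns the checkpoint each sampled dev subset would have selected,
    plus the rate of agreement with the full dev set."""
    rng = random.Random(seed)
    n_docs = len(next(iter(scores.values())))
    full_best = max(scores, key=lambda c: sum(scores[c]))
    picks = []
    for _ in range(n_samples):
        docs = rng.sample(range(n_docs), subset_size)
        # Pick the checkpoint that maximizes F1 on this sampled subset only.
        picks.append(max(scores, key=lambda c: sum(scores[c][d] for d in docs)))
    agreement = sum(p == full_best for p in picks) / n_samples
    return picks, agreement
```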

How much do the source models forget?
To measure the degree of catastrophic forgetting, we revisit the source dataset of each TRANSFER model and measure its performance. 6 In Figure 3, we see that catastrophic forgetting is especially pronounced with just a few examples.
We hypothesize that this is due to easy-to-learn changes between the annotation guidelines that are incompatible between the two datasets, such as the annotation of certain entity types. Two pairs, (OntoNotes en → OntoNotes zh) and (PreCo → ARRAU RST), are less affected by continued training. For OntoNotes, the same guidelines are used for all languages. Meanwhile, PreCo and ARRAU RST are more similar in annotation guidelines than any other pair, since they both include singletons. On the other hand, (OntoNotes → ARRAU RST) shows a substantial drop in performance despite the two datasets containing overlapping documents.

Conclusion
In this work, we comprehensively examined the transferability of neural coreference resolution models. We explored several different model initialization methods across five datasets, each with a varying number of training examples, to demonstrate the effectiveness of continued training. Additionally, this method results in improved performance over prior work on these datasets. Furthermore, we found that other datasets, like PreCo, can be leveraged to pretrain coreference resolution models, suggesting a viable alternative to OntoNotes for model development and benchmarking. We also found that, given a fixed set of annotated examples, very few should be allocated to model selection. This study and its set of benchmarks should serve as a reference for coreference resolution model adaptation, especially in scenarios where annotation is expensive or data is scarce.

A Model Details

A.1 Clustering Algorithm and Training Objective

If $\max_{c_j \in C} s_c(x_i, c_j) \le 0$, a new cluster, $c_{\text{new}} = \{x_i\}$ with embedding $x_i$, is created and added to $C$. Otherwise, $x_i$ is merged into the top-scoring $c_j$, with the new embedding $c_j \leftarrow \alpha x_i + (1 - \alpha) c_j$, where $\alpha$ is a learned function of $x_i$ and $c_j$.
The training objective aims to minimize $-\sum_{x_i \in X} \log P(c^*_{x_i} \mid x_i)$, where $c^*_{x_i}$ is the correct cluster, determined as the cluster containing the most recent antecedent of $x_i$. If no such antecedent exists, then the correct cluster is the dummy cluster $\varepsilon$, and $s_c(x_i, \varepsilon) = 0$. Letting $\bar{C} = C \cup \{\varepsilon\}$, the probability can then be computed as a softmax over cluster scores: $P(c_j \mid x_i) = \exp(s_c(x_i, c_j)) \,/\, \sum_{c' \in \bar{C}} \exp(s_c(x_i, c'))$.
In this work, we instead optimize for all antecedents of $x_i$, $\mathrm{Ant}(x_i)$, rather than only the most recent one, minimizing $-\sum_{x_i \in X} \log \sum_{c \in C^*(x_i)} P(c \mid x_i)$, where $C^*(x_i)$ denotes the clusters containing any mention in $\mathrm{Ant}(x_i)$ (or $\{\varepsilon\}$ when $\mathrm{Ant}(x_i)$ is empty). Finally, $s_a$ usually incorporates a genre embedding determined by the genre of the document. We retain that small set of parameters but assume all documents have the same genre. The only model for which this is not the case is the directly downloaded model, as it was trained for best performance on OntoNotes.
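A sketch of this marginalized loss for a single mention, assuming raw scores $s_c(x_i, c_j)$ over the current clusters and the fixed dummy score of 0.

```python
import torch

def cluster_nll(scores, gold_cluster_ids):
    """scores: 1-D tensor of s_c(x_i, c_j) over current clusters for one
    mention x_i. gold_cluster_ids: indices of clusters containing any
    antecedent of x_i; empty means the dummy (new-cluster) choice is correct."""
    full = torch.cat([scores, scores.new_zeros(1)])  # append epsilon, score 0
    log_probs = torch.log_softmax(full, dim=0)
    if not gold_cluster_ids:
        gold_cluster_ids = [len(scores)]  # index of the dummy cluster
    # Marginalize over all correct clusters before taking the negative log.
    return -torch.logsumexp(log_probs[gold_cluster_ids], dim=0)
```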

A.2 Singleton Mentions
For most datasets and many downstream tasks, we want to include singleton entity mentions in the output predictions. For OntoNotes, all singleton mentions are removed in postprocessing. We could add an auxiliary objective that maximizes $s_m(x_i)$ if $x_i$ is an entity mention (Zhang et al., 2018) and prune out singleton mentions with $s_m(x_i) < 0$ in postprocessing. Instead, we present a model reformulation that is similar to the choices made by Toshniwal et al. (2020).
Instead of taking the top $kn$ spans at span pruning, we prune to the top $kn$ spans from the set $\{x_i \in X : s_m(x_i) > 0\}$ (which may have fewer than $kn$ elements). This is both more efficient and easier to optimize. The training objective is now to minimize $s_m(x_i)$ if $x_i$ is not an entity mention, and to maximize $s_m(x_i) + s_a(x_i, c_j)$ if it is. This latter term is identical to $s_c(x_i, c_j)$ from the previous model.
We can interpret this change as now modeling the joint distribution of whether $x_i$ is an entity mention (a binary random variable $M$) and which entity cluster ($E$) it would best match (via $s_a$). We can decompose the joint probability as

$P(M, E \mid x_i) = P(M \mid x_i)\, P(E \mid M, x_i).$ (1)

This can be further split into its components:

$\log P(M, E \mid x_i) = \begin{cases} \log P(M{=}1 \mid x_i) + \log P(E \mid M{=}1, x_i) & \text{if } x_i \text{ is a mention} \\ \log P(M{=}0 \mid x_i) & \text{otherwise.} \end{cases}$ (2)

The $M = 1$ objective is the same as training without singleton mentions (as in OntoNotes), while the $M = 0$ term accounts for singletons. Note that if we know $M = 0$, then we always make the correct "cluster" decision by ignoring $x_i$ for the remainder of the algorithm, which allows for this simplification. This is different from simply adding an objective maximizing $P(M)$, since that would incorrectly handle cases where $M = 0$. In practice, however, we found that this makes no difference in performance on the task, though pruning spans earlier resulted in a substantially faster model.
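A sketch of the pruning step and of one reasonable instantiation of the mention objective (here, a binary cross-entropy over $s_m$; the paper's exact loss formulation may differ).

```python
import torch
import torch.nn.functional as F

def prune_spans(span_scores, k, n_tokens):
    """Keep at most k * n_tokens candidate spans, all with s_m > 0."""
    keep = (span_scores > 0).nonzero(as_tuple=True)[0]
    budget = int(k * n_tokens)
    if keep.numel() > budget:
        keep = keep[span_scores[keep].topk(budget).indices]
    return keep

def mention_loss(s_m, is_mention):
    # Push s_m down for non-mentions and up for true mentions; the s_a
    # term for true mentions is handled by the cluster loss sketched above.
    return F.binary_cross_entropy_with_logits(s_m, is_mention.float())
```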

B Dataset Preprocessing
We use the scripts from Joshi et al. (2019) to convert all documents into sentence-separated, subtokenized segments of at most 512 subtokens. For all English datasets, we use the SpanBERT tokenizer, while we use the XLM-R tokenizer for the cross-lingual experiments.
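A sketch of the sentence-preserving segmentation, assuming a HuggingFace tokenizer; it omits details of the original scripts, such as speaker metadata or handling sentences longer than the budget.

```python
from transformers import AutoTokenizer  # assumes the transformers library

def segment_document(sentences, tokenizer, max_len=512):
    """Pack whole sentences into segments of at most max_len subtokens
    (assumes no single sentence exceeds the budget)."""
    segments, current = [], []
    for sent in sentences:
        subtoks = tokenizer.tokenize(sent)
        if current and len(current) + len(subtoks) > max_len:
            segments.append(current)  # sentence would overflow: new segment
            current = []
        current.extend(subtoks)
    if current:
        segments.append(current)
    return segments

# e.g. segment_document(sents, AutoTokenizer.from_pretrained("xlm-roberta-large"))
```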
For QBCoref, we split the dataset into five splits after shuffling the initial dataset. For LitBank, we use the published splits (Bamman et al., 2020). In ARRAU RST, several mentions (markables) are split into discontinuous spans. Correctly modeling split spans is an active area of ongoing work (Yu et al., 2020, 2021). Since we use ARRAU RST primarily for intrinsic comparisons, we defer to the minimum span if a mention is split. This means we replaced a subset of markables, listed in Table 3. In addition, a small number of markables do not have an annotated coreference cluster, and a couple of split markables failed to reduce because no minimum span is annotated. These two phenomena did not affect the test set. Nonetheless, the model's inability to address split markables affects comparability against prior work. Table 4 shows the number of training examples we use for each dataset. Since we only shuffle once initially, larger training sets are always supersets of smaller ones.

C Training Details
We follow the same hyperparameters used by Xia et al. (2020). We use k = 0.4 to select the top 0.4n spans, and use learning rates of 2e-4 for the non-encoder parameters (with Adam) and 1e-5 for the encoder (with AdamW). For all models, we finetune the full encoder. We use gradient clipping at 10 and train for up to 100 epochs with a patience of 10 for early stopping, as determined by dev F1. For OntoNotes en, we consider spans up to width 30, while we use 15 for PreCo and ARRAU, 20 for LitBank and QBCoref, and 50 for OntoNotes zh. These choices are based on prior work or the statistics of the training set; increasing the values would affect runtime (with marginal gains in performance).
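A sketch of the two-optimizer setup with gradient clipping, assuming the encoder parameters are reachable via model.encoder (an illustrative attribute name).

```python
import torch

def build_optimizers(model):
    enc_params = list(model.encoder.parameters())
    enc_ids = {id(p) for p in enc_params}
    task_params = [p for p in model.parameters() if id(p) not in enc_ids]
    # AdamW at 1e-5 for the finetuned encoder; Adam at 2e-4 for the rest.
    return (torch.optim.AdamW(enc_params, lr=1e-5),
            torch.optim.Adam(task_params, lr=2e-4))

def clip_and_step(model, optimizers, max_norm=10.0):
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    for opt in optimizers:
        opt.step()
        opt.zero_grad()
```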
Each model was trained on a single 24GB NVIDIA Quadro RTX 6000 for between 20 minutes and 16 hours, depending on the number of training examples. Due to the cost of training over 300 models, each model was trained only once, with the exception of a couple of OntoNotes zh models with gold mentions, which converged poorly (by predicting all mentions as singletons) and were retrained. Since the goal of this work is to show general trends, similar conclusions can still be drawn (non-converging models aside) even if some points are far from their expected F1.