On Generalization in Coreference Resolution

While coreference resolution is defined independently of dataset domain, most models for performing coreference resolution do not transfer well to unseen domains. We consolidate a set of 8 coreference resolution datasets targeting different domains to evaluate the off-the-shelf performance of models. We then mix three datasets for training; even though their domain, annotation guidelines, and metadata differ, we propose a method for jointly training a single model on this heterogeneous data mixture by using data augmentation to account for annotation differences and sampling to balance the data quantities. We find that in a zero-shot setting, models trained on a single dataset transfer poorly while joint training yields improved overall performance, leading to better generalization in coreference resolution models. This work contributes a new benchmark for robust coreference resolution and multiple new state-of-the-art results.


Introduction
Coreference resolution is a core component of the NLP pipeline, as determining which mentions in text refer to the same entity is used for a wide variety of downstream tasks like knowledge extraction (Li et al., 2020), question answering (Dhingra et al., 2018), and dialog systems (Gao et al., 2019). As these tasks span many domains, we need coreference models to generalize well.
Meanwhile, models for coreference resolution have improved due to neural architectures with millions of parameters and the emergence of pretrained encoders. However, model generalization across domains has always been a challenge (Yang et al., 2012; Zhao and Ng, 2014; Poot and van Cranenburgh, 2020; Aktaş et al., 2020). Since these models are usually engineered for a single dataset, they capture idiosyncrasies inherent in that dataset. As an example, OntoNotes (Weischedel et al., 2013), a widely-used general-purpose dataset, provides metadata like the document genre and speaker information. However, such metadata cannot be assumed to be available more broadly, especially if the input is raw text (Wiseman et al., 2016).
Furthermore, while there are datasets aimed at capturing a broad set of genres (Weischedel et al., 2013; Poesio et al., 2018; Zhu et al., 2021), they are not mutually compatible due to differences in annotation guidelines. For example, some datasets do not annotate singleton clusters (clusters with a single mention). Ideally, we would like a coreference model to be robust across all the standard datasets. In this work, we consolidate 8 datasets spanning multiple domains, document lengths, and annotation guidelines. We use them to evaluate the off-the-shelf performance of models trained on a single dataset. While these models perform well within-domain (e.g., a new state-of-the-art of 79.3 F1 on LitBank), they still perform poorly out-of-domain.
To address poor out-of-domain performance, we propose joint training for coreference resolution, which is challenging due to the incompatible training procedures for different datasets. Among other things, we need to address (unannotated) singleton clusters, as OntoNotes does not include singleton annotations. We propose a data augmentation process that adds predicted singletons, or pseudo-singletons, into the training data to match the other datasets, which have gold singleton annotations.
Concretely, we contribute a benchmark for coreference to highlight the disparity in model performance and track generalization. We find joint training highly effective and show that including more datasets is almost "free", as performance on any single dataset is only minimally affected by joint training. We find that our data augmentation method of adding pseudo-singletons is also effective. With all of these extensions, we increase the macro average F1 across all datasets by 9.5 points and achieve a new state-of-the-art on LitBank and WikiCoref.

We organize our datasets into three types. Training datasets (Sec. 2.1) are large in terms of number of tokens and clusters and more suitable for training. Evaluation datasets (Sec. 2.2) are out-of-domain compared to our training sets and are entirely held out. Analysis datasets (Sec. 2.3) contain annotations aimed at probing specific phenomena. Table 1 lists the full statistics.

Training Datasets
OntoNotes 5.0 (ON) (Weischedel et al., 2013) is a collection of news-like, web, and religious texts spanning seven distinct genres. Some genres are transcripts (phone conversations and news). As the primary training and evaluation set for developing coreference resolution models, many features specific to this corpus are tightly integrated into publicly released models. For example, the metadata includes information on the document genre and the speaker of every token (for spoken transcripts). Notably, it does not contain singleton annotations.
LitBank (LB) (Bamman et al., 2020) is a set of public domain works of literature drawn from Project Gutenberg. On average, coreference in the first 2,000 tokens of each work is fully annotated for six entity types (people, facilities, locations, geopolitical entities, organizations, and vehicles). We only use the first cross-validation fold of LitBank, which we call LB 0.
PreCo (PC) (Chen et al., 2018) contains documents from reading comprehension examinations, each fully annotated for coreference resolution. Notably, the corpus is the largest such dataset released.

Evaluation Datasets
Character Identification (CI) (Zhou and Choi, 2018) has multiparty conversations derived from TV show transcripts. Each scene in an episode is considered a separate document. This character-centric dataset only annotates mentions of people.
WikiCoref (WC) (Ghaddar and Langlais, 2016) contains documents from English Wikipedia. The corpus samples and annotates documents of different lengths, from 209 to 9,869 tokens.
Quiz Bowl Coreference (QBC) (Guha et al., 2015) contains questions from Quiz Bowl, a trivia competition. These paragraph-long questions are dense with entities. Only certain entity types (titles, authors, characters, and answers) are annotated.

Analysis Datasets
Gendered Ambiguous Pronouns (GAP) (Webster et al., 2018) is a corpus of ambiguous pronoun-name pairs derived from Wikipedia. While only pronoun-name pairs are annotated, they are provided alongside their full-document context. This corpus has been previously used to study gender bias in coreference resolution systems.

Winograd Schema Challenge (WSC) (Levesque et al., 2012) is a challenge dataset for measuring common sense in AI systems. Unlike the other datasets, each document contains one or two sentences with a multiple-choice question. We manually align the multiple choices to the text and remove 2 of the 273 examples due to plurals.

Baselines
We first evaluate a recent system (Xu and Choi, 2020) which extends a mention-ranking model (Lee et al., 2018) by making modifications in the decoding step. We find disappointing out-of-domain performance and difficulties with the longer documents present in LB 0 and WC (Appendix B.1). To overcome this issue, we study the longdoc model of Toshniwal et al. (2020), an entity-ranking model designed for long documents that reported strong results on both OntoNotes and LitBank.
The original longdoc model uses a pretrained SpanBERT (Joshi et al., 2020) encoder, which we replace with Longformer-large (Beltagy et al., 2020) as it can incorporate longer context. We retrain the longdoc model and finetune the Longformer encoder for each dataset, which proves to be competitive for coreference. For OntoNotes, we train with and without metadata: (a) the genre embedding, and (b) speaker identity, which is introduced as part of the text as in Wu et al. (2020).
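As an illustration of option (b), speaker identity can be folded into the raw text before encoding. The sketch below is a minimal illustration of that idea; the exact formatting used in the paper and in Wu et al. (2020) may differ, so treat the function and its output format as assumptions.

```python
# Hypothetical sketch: interleave speaker names with their utterances so that
# a text-only encoder sees speaker identity. The exact format is an assumption.
def add_speaker_tokens(utterances):
    """utterances: list of (speaker, text) pairs -> single document string."""
    return " ".join(f"{speaker}: {text}" for speaker, text in utterances)

# Example:
#   add_speaker_tokens([("Chandler", "Okay, I don't sound like that.")])
#   -> "Chandler: Okay, I don't sound like that."
```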

Joint Training
With copious amounts of text in OntoNotes, PreCo, and LitBank, we can train a joint model on the combined dataset. However, naively combining them is impractical, as the annotation guidelines between the datasets are misaligned (OntoNotes does not annotate singletons and uses metadata) and because there are substantially more documents in PreCo.
Augmenting Singletons Since OntoNotes does not annotate for singletons, our training objective for OntoNotes differs from that of PreCo and LitBank. To address this, we introduce pseudo-singletons: silver mentions derived by first training a mention detector on OntoNotes and then selecting the top-scoring mentions outside the gold mentions. We experiment with adding 30K, 60K, and 90K pseudo-singletons (in total, there are 156K gold mentions). We find adding 60K to be the best fit for OntoNotes-only training, and 30K the best for joint training (Appendix B.3).

Downsampling PreCo has 36K training documents, compared to 2.8K and 80 training documents for OntoNotes and LitBank respectively. A naive dataset-agnostic sampling strategy would mostly sample PreCo documents. To address this issue, we downsample OntoNotes and PreCo to 1K documents per epoch. Downsampling to 0.5K documents per epoch led to slightly worse performance (Appendix B.4).

Metadata Embeddings For the joint model to be applicable to unknown domains, we avoid using any domain or dataset-identity embeddings, including the OntoNotes genre embedding. We do make use of speaker identity in the joint model because: (a) it is possible to obtain in conversational and dialog data, and (b) it does not affect other datasets that are known to be single-speaker at test time.
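To make the singleton augmentation above concrete, here is a minimal sketch of how pseudo-singletons might be added, assuming a trained mention detector that scores candidate spans (as in the mention proposal stage of Appendix A.1). The document interface (candidate_spans, clusters, doc_id) and mention_scorer are illustrative names, not the released code.

```python
# Minimal sketch of pseudo-singleton augmentation under the assumptions above.
def add_pseudo_singletons(documents, mention_scorer, budget=30_000):
    """Add the `budget` top-scoring non-gold spans as singleton clusters."""
    candidates = []  # (score, doc_id, span) triples over the whole corpus
    for doc in documents:
        gold = {span for cluster in doc.clusters for span in cluster}
        for span in doc.candidate_spans():  # spans of <= 20 subword tokens
            if span not in gold:
                candidates.append((mention_scorer(doc, span), doc.doc_id, span))
    # Keep the highest-scoring non-gold spans, e.g., 30K for joint training.
    candidates.sort(key=lambda t: t[0], reverse=True)
    chosen = {}
    for _, doc_id, span in candidates[:budget]:
        chosen.setdefault(doc_id, []).append(span)
    for doc in documents:
        for span in chosen.get(doc.doc_id, []):
            doc.clusters.append([span])  # each pseudo-singleton is its own cluster
    return documents
```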

Results
Table 2 shows the results for all our models on all 8 datasets. We report each dataset's associated metric (e.g., CoNLL F1) and a macro average across all eight datasets to compare generalizability.
Among the longdoc baseline models trained on one of OntoNotes, PreCo, or LitBank, we observe a sharp drop in out-of-domain evaluations. The LitBank model is generally substantially worse than the models trained on OntoNotes and PreCo, likely due to both a smaller training set and a larger domain shift. Interestingly, the LitBank model performs the best among all models on QBC, which can be attributed to both LB and QBC being restricted to a similar set of markable entity types. Meanwhile, all OntoNotes-only models perform well on WC and GAP, possibly due to the more diverse set of genres within ON and because WC also does not contain singletons.
For models trained on OntoNotes, we find that the addition of speaker tokens leads to an almost 9 point increase on CI, which is a conversational dataset, but has little impact on non-conversational evaluations. Surprisingly, the addition of genre embeddings has almost no impact on the overall evaluation. Finally, the addition of pseudo-singletons leads to consistent, significant gains across almost all the evaluations, including OntoNotes.
The joint models, which are trained on a combination of OntoNotes, LitBank, and PreCo, suffer only a small drop in performance on OntoNotes and PreCo, and achieve the best performance for LitBank. Like the results observed when training with only OntoNotes, we see a significant performance gain with pseudo-singletons in joint training as well, which justifies our intuition that they can bridge the annotation gap. The "Joint + PS 30K" model also achieves the state of the art for WC.

Analysis
Impact of Singletons Singletons are known to artificially boost the coreference metrics (Kübler and Zhekova, 2011), and their utility for downstream applications is arguable. To determine the impact of singletons on final scores, we present separate results for singleton and non-singleton clusters in QBC in Table 4. For non-singleton clusters we use the standard CoNLL F1, but for singleton clusters the CoNLL score is undefined, and hence we use the vanilla F1-score.
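For concreteness, the singleton score can be computed as a mention-level F1 between the predicted and gold singleton sets. This is a minimal sketch under that assumption; the paper does not specify the exact implementation.

```python
# Mention-level F1 over singleton clusters (CoNLL F1 is undefined for them).
# Spans are assumed to be hashable, e.g., (start, end) token offsets.
def singleton_f1(pred_singletons, gold_singletons):
    pred, gold = set(pred_singletons), set(gold_singletons)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision, recall = tp / len(pred), tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```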
The poor performance of ON models for singletons is expected, as singletons are not seen during training. Adding pseudo-singletons improves the performance of both the ON and the joint model for singletons. Interestingly, adding pseudo-singletons also leads to a small improvement for non-singleton clusters.
The PC model has the best performance for non-singleton clusters, while the LB 0 model, which performs the best in the overall QBC evaluation, has the worst performance for non-singleton clusters. This means that the gains of the LB 0 model can be almost entirely attributed to its superior mention detection performance, which can be explained by the fact that both LB and QBC are restricted to a similar set of markable entity types.

Related work
Joint training is commonly used in NLP for training robust models, usually aided by learning dataset, language, or domain embeddings (e.g., Stymne et al. (2018) for parsing; Kobus et al. (2017) and Tan et al. (2019) for machine translation). This is essentially what models for OntoNotes already do with genre embeddings (Lee et al., 2017). Unlike prior work, our test domains are unseen, so we cannot learn test-domain embeddings.

For coreference resolution, Aralikatte et al. (2019) augment annotations using relation extraction systems to better incorporate world knowledge, a step towards generalization. Subramanian and Roth (2019) use adversarial training to target names, with improvements on GAP. Moosavi and Strube (2018) incorporate linguistic features to improve generalization to WC. Recently, Zhu et al. (2021) proposed the OntoGUM dataset, which consists of multiple genres. However, compared to the datasets used in our work, OntoGUM is much smaller and is restricted to a single annotation scheme. To the best of our knowledge, our work is the first to evaluate generalization at scale.

The absence of singletons in OntoNotes has previously been addressed through new data annotations, leading to the creation of the ARRAU (Poesio et al., 2018) and PreCo (Chen et al., 2018) corpora. While we include PreCo in this work, ARRAU contains additional challenges, like split antecedents, that further increase heterogeneity, and its domain overlaps with OntoNotes. Pipeline models for coreference resolution that first detect mentions naturally leave behind unclustered mentions as singletons, although understanding singletons can also improve performance (Recasens et al., 2013).
Recent end-to-end neural models have been evaluated on OntoNotes, and therefore conflate "not a mention" with "is a singleton" (Lee et al., 2017, 2018; Kantor and Globerson, 2019; Wu et al., 2020). For datasets with singletons, this has been addressed explicitly through a cluster-based model (Toshniwal et al., 2020; Yu et al., 2020). For those without, singletons can be implicitly accounted for with auxiliary objectives (Zhang et al., 2018; Swayamdipta et al., 2018). We go one step further by augmenting with pseudo-singletons, so that the training objective is identical regardless of whether the training set contains annotated singletons.

Conclusion
Our eight-dataset benchmark highlights disparities in coreference resolution model performance and tracks cross-domain generalization. Our work begins to address cross-domain gaps, first by handling differences in singleton annotation via data augmentation with pseudo-singletons, and second by training a single model jointly on multiple datasets. This approach produces promising improvements in generalization, as well as new state-of-the-art results on multiple datasets. We hope that future work will continue to use this benchmark to measure progress towards truly general-purpose coreference resolution.

A Model and Training Details
A.1 Model

Our model follows the typical coreference pipeline of encoding the document, followed by mention proposal, and finally mention clustering. The model is architecturally the same as Toshniwal et al. (2020), and so we re-present their formulation throughout this section. However, we use the Longformer encoder as it accommodates longer documents. Otherwise, the model is identical to Toshniwal et al. (2020) in terms of model size and weight dimensions. We next briefly explain the mention proposal and mention clustering stages.
Mention Proposal Given a document D, we score all mentions of length ≤ 20 subword tokens and choose the K = 0.4 × |D| top spans among them. This is an initial pruning step that speeds up the model and reduces memory usage. Let X(K) = {x_i}_{i=1}^{K} represent the top-K candidate mention spans and s_m(x_i) be a learned scoring function for span x_i, which represents how likely a span is to be an entity mention. s_m is trained to assign a positive score to gold mentions (any mention in a gold cluster), and a negative score otherwise.
The training objective only uses spans in X(K), i.e., the loss is computed after pruning. During inference, we can therefore further prune down to {x_i : x_i ∈ X(K), s_m(x_i) ≥ 0}, which we then pass into the clustering step. During training, we use teacher forcing and only pass the gold mentions among the top-K mentions to the clustering step.
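The two pruning stages can be summarized in a few lines. The sketch below illustrates the logic just described; the function signature and tensor layout are assumptions, not the authors' released code.

```python
import torch

# Illustrative sketch of the top-K and score-threshold pruning stages.
def propose_mentions(span_scores: torch.Tensor, num_tokens: int,
                     training: bool) -> torch.Tensor:
    """span_scores: s_m scores for all candidate spans of <= 20 subword tokens.
    Returns indices of the spans passed on to the clustering step."""
    k = min(int(0.4 * num_tokens), span_scores.numel())  # K = 0.4 * |D|
    top_scores, top_idx = span_scores.topk(k)
    if training:
        # The loss is computed only over the top-K spans; the caller then
        # passes just the gold mentions among them (teacher forcing).
        return top_idx
    # At inference, keep only spans the scorer marks as mentions: s_m(x_i) >= 0.
    return top_idx[top_scores >= 0]
```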

Mention Clustering
The entity-based model tracks M entities (initially M = 0). Let E = (e_m)_{m=1}^{M} represent the M entities. For ease of notation, we overload the terms x_i and e_j to also denote their respective representations.
The model decides whether the span x_i refers to any of the entities in E as follows: s_c(x_i, e_j) = f_c([x_i; e_j; x_i ⊙ e_j; g(x_i, e_j)]), where ⊙ represents the element-wise product and f_c(·) corresponds to a learned feedforward neural network. The term g(x_i, e_j) corresponds to a concatenation of feature embeddings that includes embeddings for (a) the number of mentions in e_j, and (b) the number of tokens between x_i and the last mention of e_j. Let s_c^top = max_j s_c(x_i, e_j) and e_top be the corresponding entity. If s_c^top > 0 then x_i is considered to refer to e_top, and e_top is updated accordingly. Otherwise, we initiate a new cluster: E = E ∪ {x_i}. During training, we use teacher forcing, i.e., the clustering decisions are based on the ground truth.
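The scoring function above can be sketched as a small module. This is a minimal illustration of s_c; the hidden sizes and the feedforward layout are assumptions for exposition, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the clustering score s_c(x_i, e_j) described above.
class ClusterScorer(nn.Module):
    def __init__(self, hidden: int = 3072, feat: int = 20):
        super().__init__()
        # f_c: feedforward network over [x_i; e_j; x_i (*) e_j; g(x_i, e_j)]
        self.f_c = nn.Sequential(
            nn.Linear(3 * hidden + 2 * feat, 1024),
            nn.ReLU(),
            nn.Linear(1024, 1),
        )

    def forward(self, x_i: torch.Tensor, entities: torch.Tensor,
                feats: torch.Tensor) -> torch.Tensor:
        # x_i: (hidden,) span representation; entities: (M, hidden);
        # feats: (M, 2 * feat) = g(x_i, e_j), embeddings for the cluster size
        # of e_j and the token distance from x_i to the last mention of e_j.
        x = x_i.expand_as(entities)
        pair = torch.cat([x, entities, x * entities, feats], dim=-1)
        return self.f_c(pair).squeeze(-1)  # (M,) scores s_c(x_i, e_j)
```

The decision rule then merges x_i into the highest-scoring entity if that score is positive, and starts a new cluster otherwise.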

A.2 Training
We train all the models for 100K gradient steps with a batch size of 1 document. Only the LB-only models are trained for 8K gradient steps, which corresponds to 100 epochs for LB. The models are evaluated a total of 20 times (every 5K training steps), except for the LB-only models, which are evaluated every 400 steps. We use early stopping with a patience of 5, i.e., training stops if the validation performance does not improve for 5 consecutive evaluations.
We use the full context size of 4096 tokens for Longformer-large. All training documents used in this work, except one ON document, fit in a single context window. For optimization, we use AdamW with a weight decay of 0.01 and an initial learning rate of 1e-5 for the Longformer encoder, and Adam with an initial learning rate of 3e-4 for the rest of the model parameters. The learning rate is decayed linearly throughout training.
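The two-optimizer setup above might look as follows. The `encoder` and `head_params` handles are assumed names for the Longformer encoder and the remaining model parameters; the scheduler choice is one standard way to realize the linear decay described.

```python
import torch

# Sketch of the two-optimizer configuration under the stated assumptions.
def build_optimizers(encoder, head_params, total_steps: int = 100_000):
    enc_opt = torch.optim.AdamW(encoder.parameters(), lr=1e-5, weight_decay=0.01)
    task_opt = torch.optim.Adam(head_params, lr=3e-4)
    # Both learning rates decay linearly to zero over the 100K gradient steps.
    schedulers = [
        torch.optim.lr_scheduler.LinearLR(
            opt, start_factor=1.0, end_factor=0.0, total_iters=total_steps)
        for opt in (enc_opt, task_opt)
    ]
    return (enc_opt, task_opt), schedulers
```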

B.1 Xu and Choi (2020) Baselines
We run the off-the-shelf model on the test sets of ON, LB 0, PC, and QBC. LB 0 requires a 24GB GPU, while WC runs out of memory even on that hardware. The model shows strong in-domain performance with 80.2 on ON. However, out-of-domain performance is weak: 57.2 on LB 0, 49.3 on PC, and 37.6 on QBC. These are roughly on par with the ON longdoc models.

B.2 LitBank Cross-Validation Results
Table 7 presents the results for all the cross-validation splits of LitBank. The overall performance of 79.3 CoNLL F1 is state of the art for LitBank, outperforming the previous state of the art of 76.5 by Toshniwal et al. (2020). Note that in this work, the joint model outperformed this baseline model (78.2 vs. 77.2) on split 0 (LB 0). However, training 10 joint models contradicts the purpose of this work, which is to create a single, generalizable model. Realistically, we recommend jointly training with the entirety of LitBank.

B.3 Singleton Results for OntoNotes
For ON-only models, we tune the number of pseudo-singletons sampled over {30K, 60K, 90K}. Table 8 shows that 60K pseudo-singletons is the best choice based on validation set results on ON.

B.4 Downsampling and Singleton Results for Joint
In preliminary experiments, we sample 500 documents per epoch from ON and PC. Table 5 shows the results, confirming that 1K is slightly better than 500. Using more examples (e.g., 5K PC documents) begins to hurt performance on LB, likely due to data imbalance.
For the 1K downsampling setting, we tune the number of pseudo-singletons sampled over {30K, 60K, 90K}. We find 30K to be the best choice based on validation set results.
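A minimal sketch of the per-epoch downsampling is given below. The corpus sizes follow the paper (ON 2.8K, PC 36K, LB 80 training documents); the function itself and its interface are illustrative assumptions.

```python
import random

# Hypothetical sketch of dataset-balanced per-epoch sampling.
def sample_epoch(on_docs, pc_docs, lb_docs, cap=1000, seed=None):
    """Cap ON and PC at `cap` documents per epoch; always use all of LitBank."""
    rng = random.Random(seed)
    epoch = (rng.sample(on_docs, min(cap, len(on_docs)))
             + rng.sample(pc_docs, min(cap, len(pc_docs)))
             + list(lb_docs))  # LB has only 80 training documents
    rng.shuffle(epoch)
    return epoch
```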

B.5 Results with Gold Mentions
In Table 6, we report the results with gold mentions for the training and evaluation sets. The analysis sets are skipped as they are only partially annotated. We find that joint training is also helpful in this setting, as the results mirror our findings with predicted mentions. In particular, this shows that it is not just a failure to predict mentions that prevents ON models from performing well on LB, PC, and QBC.

C Compute Resources
Given that we are finetuning the Longformer model and using a maximum context size of 4096 tokens, the memory requirements of the model are quite large, even though the cluster-ranking paradigm is considered memory efficient (Xia et al., 2020). We were able to train the PreCo-only model on a 12 GB GPU in 20 hrs (even the longest PreCo documents are shorter than 2048 tokens with the Longformer tokenization). All other models were trained on GPUs with 24GB memory or more (Titan RTX and A6000). On an A6000, the LB-only models can be trained within 4 hrs, the ON-only models within 16 hrs, and the joint models within 20 hrs.
a creature "burning bright, in the forests of the night," . ..(2)QBC This author's non fiction works . . .another work , a plague strikes secluded valley where teenage boys have been evacuated . . .name this author of Nip the Buds, Shoot the Kids . . .

Table 1: Statistics of datasets. Datasets with k indicate that prior work uses k-fold cross-validation; we record the splits used in this work. Datasets with p are partially annotated, so we do not include cluster details.

Table 2: Performance of each model on 8 datasets, measured by CoNLL F1 (Pradhan et al., 2012), except for GAP (F1) and WSC (accuracy). Some models use speaker (S) features, genre (G) features, or pseudo-singletons (PS).

Table 6: Results of all the models with gold mentions. Some models use speaker (S) features, genre (G) features, or pseudo-singletons (PS). The metric for the training and evaluation datasets is the CoNLL F-score. We skip the analysis datasets because they lack the set of true gold mentions.

Table 7: LitBank cross-validation results.

Table 8: Validation and test results for the ON-only model trained with different amounts of pseudo-singletons (PS).