OntoGUM: Evaluating Contextualized SOTA Coreference Resolution on 12 More Genres

SOTA coreference resolution produces increasingly impressive scores on the OntoNotes benchmark. However, the lack of comparable data following the same scheme for more genres makes it difficult to evaluate generalizability to open-domain data. This paper provides a dataset and a comprehensive evaluation showing that the latest neural LM-based end-to-end systems degrade very substantially out of domain. We make an OntoNotes-like coreference dataset called OntoGUM publicly available. It is converted from GUM, an English corpus covering 12 genres, using deterministic rules, which we evaluate. Thanks to the rich syntactic and discourse annotations in GUM, we are able to create the largest human-annotated coreference corpus following the OntoNotes guidelines, and the first to be evaluated for consistency with the OntoNotes scheme. Out-of-domain evaluation across 12 genres shows roughly 15-20% degradation for both deterministic and deep learning systems, indicating a lack of generalizability or covert overfitting in existing coreference resolution models.


Introduction
Coreference resolution is the task of grouping referring expressions that point to the same entity, such as noun phrases and the pronouns that refer to them. The task entails detecting correct mention or 'markable' boundaries and creating a link with previous mentions, or antecedents. A coreference chain is a series of decisions which groups the markables into clusters. As a key component in Natural Language Understanding (NLU), the task can benefit a series of downstream applications such as Entity Linking, Dialogue Systems, Machine Translation, Summarization, and more (Poesio et al., 2016).
In recent years, deep learning models have achieved high scores in coreference resolution. The end-to-end approach (Lee et al., 2017, 2018), which jointly scores mention detection and resolution, currently not only beats earlier rule-based and statistical methods but also outperforms other deep learning approaches (Wiseman et al., 2016; Clark and Manning, 2016a,b). Additionally, language models trained on billions of words significantly improve performance by providing rich word- and context-level information for classifiers (Lee et al., 2018; Joshi et al., 2019a,b).
However, scores on the identity coreference layer of the benchmark OntoNotes dataset (Pradhan et al., 2013) do not reflect the generalizability of these systems. Moosavi and Strube (2017) pointed out that lexicalized coreference resolution models, including neural models using word embeddings, face a covert overfitting problem because of a large overlap between the vocabulary of coreferring mentions in the OntoNotes training and evaluation sets. This suggests that higher scores on OntoNotes-test may not indicate a better solution to the coreference resolution task.
To investigate the generalization problem of neural models, several projects have tested other datasets consistent with the OntoNotes scheme. Moosavi and Strube (2018) conducted out-of-domain evaluation on WikiCoref (Ghaddar and Langlais, 2016), a small dataset employing the same coreference definitions. Results showed that neural models (with fixed embeddings) do not achieve performance comparable to that on OntoNotes (16.8% degradation in score). More recently, the e2e model using BERT (Joshi et al., 2019b) showed gains on the GAP corpus (Webster et al., 2018) using contextualized embeddings; however, GAP only contains name-pronoun coreference, a very specific subset of coreference, and is limited in domain to a single source: Wikipedia.
Though previous work has already identified the overfitting problem, it also has three main shortcomings. First, the scale of out-of-domain evalua-  showing that the overfitting problem is not overcome by contextualized language models.
• We give a genre-by-genre analysis for two popular systems, revealing relative strengths and weaknesses of current approaches and the range of easier/more difficult targets for coreference resolution.

Related Work
OntoNotes and similar corpora. OntoNotes is a human-annotated corpus with documents annotated with multiple layers of linguistic information including syntax, propositions, named entities, word sense, and within-document coreference (Weischedel et al., 2011; Pradhan et al., 2013). It covers three languages: English, Chinese, and Arabic. The English subcorpus has 3,493 documents and ∼1.6 million words. WikiCoref, which is annotated for anaphoric relations, has 30 documents from English Wikipedia (Ghaddar and Langlais, 2016), containing 7,955 mentions in 1,785 chains, following OntoNotes guidelines.
GUM. The Georgetown University Multilayer (GUM) corpus (Zeldes, 2017) is an open-source corpus of richly annotated texts from 12 text types, including 168 documents and over 150K tokens. GUM originally covers more coreference phenomena than OntoNotes, following more exhaustive guidelines, and it also contains rich syntactic, semantic and discourse annotations, which allow us to create the OntoGUM dataset described below. We also note that due to its smaller size (currently about 10% the size of the OntoNotes coreference dataset), it is not possible to train SOTA neural approaches directly on this dataset while maintaining strong performance.
Other corpora. As mentioned above, GAP is a gender-balanced labeled corpus of ambiguous pronoun-name pairs, used for out-of-domain evaluation but limited in coreferent types and genre. Several other comprehensive coreference datasets exist as well, such as ARRAU (Poesio et al., 2018) and PreCo (Chen et al., 2018), but these corpora cannot be used for out-of-domain evaluation because they do not follow the OntoNotes scheme. Their conversion has not been attempted to date.
Coreference resolution systems. Prior to the introduction of deep learning systems, the coreference task was approached using deterministic linguistic rules (Lee et al., 2013; Recasens et al., 2013) and statistical approaches (Durrett and Klein, 2013, 2014). More recently, three types of neural models achieved SOTA performance on this task: 1) ranking the candidate mention pairs (Wiseman et al., 2015; Clark and Manning, 2016a), 2) modeling global features of entity clusters (Clark and Manning, 2015, 2016b; Wiseman et al., 2016), and 3) end-to-end (e2e) approaches with joint loss for mention detection and coreferent pair scoring (Lee et al., 2017, 2018; Fei et al., 2019). The e2e method has become the dominant one, gaining the best scores on OntoNotes. To investigate differences between deterministic and deep learning models on unseen data, we evaluate the two approaches on OntoGUM.

Dataset Conversion
GUM's annotation scheme subsumes all markables and coreference chains annotated in OntoNotes, meaning we do not need human annotation to recognize additional mentions in the conversion process, though mention boundaries differ subtly (e.g. for appositions and verbal mentions). Since GUM has gold syntax trees, we were able to process the entire conversion automatically. Additionally, most coreference evaluations use gold speaker information in OntoNotes, which is available in GUM (for fiction, reddit and spoken data) and could be assembled automatically as well.
The conversion is divided into two parts: removing coreference relations not included in the OntoNotes scheme, and removing or adjusting markables. For coreference relation deletion, we cut chains by removing expletive cataphora and by identifying the definiteness of nominal markables, since indefinites cannot be anaphors in OntoNotes. In addition to modifying existing mention clusters, we also remove particular coreference relations and mention spans that are not included in OntoNotes: noun-noun compounding (only included in OntoNotes for proper-name modifiers), bridging anaphora, copula predicates, nested entities ('i-within-i', i.e. single mentions containing coreferring pronouns), and singletons. We note that singletons are removed as the final step, in order to catch singletons generated during the conversion process. We also contract verbal markable spans to their head verb, and merge appositive constructions into single mentions, following the OntoNotes guidelines.
To evaluate conversion accuracy, three annotators, including an original OntoNotes project member, conducted an agreement study on 3 documents, containing 2,500 tokens and 371 output mentions. Compared against a re-annotation from scratch based on the OntoNotes guidelines, the conversion achieves a span detection score of ∼96 and a CoNLL coreference score of ∼92, approximately the same as human agreement scores on OntoNotes.
After adjudication, the conversion was found to make only 8/371 errors, in addition to 2 errors due to mistakes in the original GUM data, meaning that degradation due to conversion errors is marginal, and consistency should be close to the variability in OntoNotes itself.
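The ordering of the conversion steps described above can be illustrated with a minimal sketch. It is not the released conversion code: the `Mention` class, its `definite` and `kind` fields, and both function names are hypothetical, and only three of the rules are modeled (dropping expletives, cutting a chain at an indefinite nominal, and removing singletons last).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention record; field names are assumptions for illustration."""
    start: int               # index of the first token (inclusive)
    end: int                 # index of the last token (inclusive)
    definite: bool           # assumed to be derived from gold syntax
    kind: str = "nominal"    # assumed labels: "nominal", "pronoun", "expletive"


def cut_at_indefinites(cluster):
    """Split a chain wherever a non-initial indefinite nominal occurs,
    since an indefinite cannot be an anaphor in the OntoNotes scheme."""
    ordered = sorted(cluster, key=lambda m: (m.start, m.end))
    chains, current = [], []
    for mention in ordered:
        if mention.kind == "nominal" and not mention.definite and current:
            chains.append(current)   # close the chain before the indefinite
            current = []
        current.append(mention)
    if current:
        chains.append(current)
    return chains


def convert(clusters):
    """Apply the simplified rules, removing singletons as the final step."""
    result = []
    for cluster in clusters:
        kept = [m for m in cluster if m.kind != "expletive"]
        result.extend(cut_at_indefinites(kept))
    # Singletons go last so that chains reduced to one mention
    # by the earlier steps are also caught.
    return [chain for chain in result if len(chain) > 1]
```

Running the singleton filter last matters in this sketch: a two-mention chain whose second mention is indefinite is cut into two one-mention chains, and both are then discarded.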

Experiments
We evaluate two systems on the 12 OntoGUM genres, using the official CoNLL-2012 scorer (Pradhan et al., 2012, 2014). The primary score is the average F1 of three metrics: MUC, B³, and CEAFφ4.
Deterministic coreference model. We first run the deterministic system (dcoref, part of Stanford CoreNLP, Manning et al. 2014) on the OntoGUM benchmark, as it remains a popular option for off-the-shelf coreference resolution. As a rule-based system, it does not require training data, so we directly test it on OntoGUM's test set. However, POS tags, lemmas, and named-entity (NER) information are predicted by CoreNLP, which does have a domain bias favoring newswire. The system's multi-sieve structure and token-level features such as gender and number remain unchanged. We expect that the linguistic rules will function similarly across datasets and genres, notwithstanding biases of the tools providing input features to those rules.
SOTA neural model. Combining the e2e approach with a contextualized LM and span masking is the current SOTA on OntoNotes. The system uses the pretrained SpanBERT-large model, finetuned on the OntoNotes training set. Hyperparameters are identical to those used in the evaluation on the OntoNotes test set, to ensure comparable results between the benchmarks. We note that while we choose the SOTA system as a 'best case scenario', most off-the-shelf neural NLP toolkits (e.g. spaCy) actually use somewhat simpler e2e models than SpanBERT-large, due to memory/performance constraints.
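As a small illustration of the primary score used in this section, the sketch below averages the F1 values of the three metrics. It does not reimplement the official scorer: the function names are hypothetical, and the per-metric precision and recall values are assumed to come from the scorer's output.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def conll_score(muc_f1, b3_f1, ceaf_f1):
    """Unweighted average of the MUC, B-cubed, and CEAF-phi4 F1 scores."""
    return (muc_f1 + b3_f1 + ceaf_f1) / 3
```

For instance, F1 values of 70, 60, and 50 on the three metrics would give a CoNLL score of 60.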

Results
OntoGUM vs. OntoNotes. The last rows in each half of Table 2 give overall results for the systems on each benchmark. e2e+SpanBERT encounters a substantial degradation of 15 points (19%) on OntoGUM, likely due to lower test set lexical and stylistic overlap, including novel mention pairs. We note that its average score of 64.6 is somewhat optimistic, especially given that the system receives access to gold speaker information wherever available (including in fc, cn and it, some of the better scoring genres), which is usually unrealistic. dcoref, assumed to be more stable across genres, also sees losses on OntoGUM of over 18 points (30%). We believe at least part of the degradation may be due to mention detection, which is trained on different domains for both systems (see the last three columns in the table). These results suggest that input data from CoreNLP degrades substantially on OntoGUM, or that some types of coreferent expressions in OntoGUM are linguistically distinct from those in OntoNotes, or both, making OntoGUM a challenging benchmark for systems developed using OntoNotes.
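The relation between the absolute and relative degradation reported for e2e+SpanBERT can be checked with a few lines of arithmetic. In this sketch the in-domain score is not taken from the text directly: it is inferred as the OntoGUM score plus the reported drop, and the 15-point figure is treated as exact, both of which are assumptions.

```python
def relative_drop(in_domain, out_of_domain):
    """Degradation as a fraction of the in-domain score."""
    return (in_domain - out_of_domain) / in_domain


ontogum_e2e = 64.6      # average CoNLL score on OntoGUM, as reported
absolute_drop = 15.0    # reported degradation in points, assumed exact here
ontonotes_e2e = ontogum_e2e + absolute_drop   # inferred, not stated in the text

percent = 100 * relative_drop(ontonotes_e2e, ontogum_e2e)
print(f"relative degradation: {percent:.1f}%")
```

Under these assumptions the relative drop comes out just under 19%, consistent with the figure given above.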
Comparing genres. Both systems degrade more on specific genres. For example, while vl (with gold speaker information) fares well for both systems, neither does well on tx, and even the SOTA system scores around or well below 60 for the nw, wh and tx genres. The strong result might be surprising for vl, which contains transcripts of spontaneous unedited speech from YouTube Creative Commons vlogs quite unlike OntoNotes data; conversely, the weak result is less expected for carefully edited texts which are somewhat similar to data in OntoNotes: OntoNotes contains roughly 30% newswire text, and it is not immediately clear that GUM's nw section, which comes from recent Wikinews articles, differs much in genre. Examples (1)-(2) illustrate incorrectly predicted coreference chains from both sources and the type of language they contain.
These examples show that errors occur readily even in quite characteristic news writing, while genre disparity by itself does not guarantee low performance, as in the case of the vlogs, whose language is markedly different. In sum, these observations suggest that accurate coreference for downstream applications cannot be expected in some common well-edited genres, despite the prevalence of news data in OntoNotes (albeit specifically from the Wall Street Journal, around 1990). This motivates the use of OntoGUM as a test set for future benchmarking, in order to give the NLP community a realistic idea of the range of performance we may see on contemporary data 'in the wild'.
We also suspect that a high prevalence of pronouns and the availability of gold speaker information lead to better scores. Table 3 ranks genres by their e2e CoNLL score, and gives the proportions of pronouns, as well as score rankings for span detection. Because pronouns are usually easier to detect and pair than nouns (Durrett and Klein, 2013), more pronouns usually means higher scores. On genres with more than 50% pronouns and gold speakers (vl, it, cn, sp, fc) e2e gets much higher results, while genres with few pronouns (<30%) have lower scores (ac, vy, nw). This diversity over 12 genres supports the usefulness of OntoGUM, which can evaluate the generalizability of coreference systems.
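A pronoun proportion of the kind discussed above could be computed along the following lines. This is only a sketch under stated assumptions: it takes the proportion to be the share of mentions whose head carries a Penn Treebank pronoun tag, the function names are hypothetical, and the two toy genres and their tags are invented for illustration and are not OntoGUM data.

```python
# Penn Treebank tags for personal and possessive pronouns.
PRONOUN_TAGS = {"PRP", "PRP$"}


def pronoun_proportion(head_tags):
    """Share of mentions whose head token carries a pronoun tag."""
    if not head_tags:
        return 0.0
    return sum(tag in PRONOUN_TAGS for tag in head_tags) / len(head_tags)


def rank_by_pronouns(tags_by_genre):
    """Genre names sorted from most to least pronoun-heavy."""
    return sorted(
        tags_by_genre,
        key=lambda genre: pronoun_proportion(tags_by_genre[genre]),
        reverse=True,
    )


# Invented toy input: one POS tag per mention head, grouped by genre.
toy = {
    "toy_written": ["NN", "NNP", "PRP", "NNS"],
    "toy_spoken": ["PRP", "PRP", "NN", "PRP$"],
}
ranking = rank_by_pronouns(toy)
```

Comparing such a ranking with the ranking by CoNLL score is one simple way to examine the relationship suggested in this paragraph.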

Conclusion
This paper presented OntoGUM, the largest open, gold coreference dataset following the OntoNotes scheme, adding several new genres (including more spoken data) to the OntoNotes family. The corpus is automatically converted from GUM by modifying the existing markable spans and coreference relations using multi-layer annotations, such as dependency trees. Results showed a lack of generalizability of existing systems, especially in genres low in pronouns and lacking speaker information.
We suspect that at least part of the success of SOTA approaches is due to correct mention detection and high matching scores in genres rich in pronouns, and more so with gold speaker information. Success for other types of mentions in OntoNotes data appears to be much more sensitive to lexical features, performing well on the benchmark test set with high lexical overlap to the training data, but degrading very substantially outside of it, even on newswire texts from our OntoGUM data. This supports use of this challenging dataset for future work, which we hope will benefit evaluations of systems targeting the OntoNotes standard.