Longtonotes: OntoNotes with Longer Coreference Chains

OntoNotes has served as the most important benchmark for coreference resolution. However, for ease of annotation, several long documents in OntoNotes were split into smaller parts. In this work, we build a corpus of coreference-annotated documents of significantly longer length than what is currently available. We do so by providing an accurate, manually-curated merging of annotations from documents that were split into multiple parts in the original OntoNotes annotation process. The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths, the longest of which are up to 8x the length of documents in OntoNotes, and 2x those in Litbank. We evaluate state-of-the-art neural coreference systems on this new corpus, analyze how model architectures, hyperparameters, and document length affect the performance and efficiency of the models, and demonstrate areas of improvement in long-document coreference modelling revealed by our new corpus.


Introduction
Coreference resolution is an important problem in discourse with applications in knowledge-base construction (Luan et al., 2018), question-answering (Reddy et al., 2019) and reading assistants (Azab et al., 2013; Head et al., 2021). In many such settings, the documents of interest are significantly longer and/or from a wider variety of domains than the currently available corpora with coreference annotation (Pradhan et al., 2013; Bamman et al., 2019; Mohan and Li, 2019; Cohen et al., 2017).
The OntoNotes corpus (Pradhan et al., 2013) is perhaps the most widely used benchmark for coreference (Lee et al., 2013a; Durrett and Klein, 2013; Sachan et al., 2015; Wiseman et al., 2016; Lee et al., 2017; Joshi et al., 2020; Toshniwal et al., 2020b; Thirukovalluru et al., 2021; Kirstain et al., 2021). The construction process for OntoNotes, however, resulted in documents with an artificially reduced length. For ease of annotation, longer documents were split into smaller parts and each part was annotated separately and treated as an independent document (Pradhan et al., 2013). The result is a corpus in which certain genres, such as broadcast conversation (bc), have greatly reduced length compared to their original form (Figure 1). As a result, the long, bursty spread of coreference chains in these documents is missing from the evaluation benchmark.

Figure 1: Long documents in genres such as broadcast conversations (bc) were split into smaller parts in OntoNotes. Our proposed dataset, LongtoNotes, restores documents to their original form, revealing dramatic increases in length in certain genres.
In this work, we present an extension to the OntoNotes corpus, called LongtoNotes. LongtoNotes combines the coreference annotations from the various parts of the same document, yielding full-document coreference annotation. A carefully trained annotation team merged the coreference annotations following the annotation guidelines laid out in the original OntoNotes corpus (§3). The resulting LongtoNotes dataset has an average document length that is over 40% longer than the standard OntoNotes benchmark. Furthermore, LongtoNotes sees a 25% increase in the average size of coreference chains. While other datasets such as Litbank (Bamman et al., 2019) and CRAFT (Cohen et al., 2017) focus on long documents in specialized domains, LongtoNotes comprises documents in multiple genres (Table 2).
To illustrate the usefulness of LongtoNotes, we evaluate state-of-the-art coreference resolution models (Kirstain et al., 2021; Toshniwal et al., 2020b; Joshi et al., 2020) on the corpus and analyze their performance as a function of document length (§4.2). We show that model architecture decisions and hyperparameters that support long-range dependencies have the greatest impact on coreference performance; importantly, these differences are only revealed by LongtoNotes and are not seen in OntoNotes (§4.3). LongtoNotes also presents a challenge in scaling coreference models, as prediction time and memory requirements increase substantially on the long documents (§4.4).

Our Contribution: LongtoNotes
We present LongtoNotes, a corpus that extends the English coreference annotation in the OntoNotes Release 5.0 corpus (Pradhan et al., 2013) to provide annotations for longer documents. In the original English OntoNotes corpus, genres such as broadcast conversations (bc) and telephone conversations (tc) contain long documents that were divided into smaller parts to facilitate easier annotation. LongtoNotes is constructed by collecting annotations that combine within-part coreference chains into coreference chains over the entire long document. The annotation procedure, in which annotators merge coreference chains, is described and analyzed in Section 3.
The divided parts of a long document in OntoNotes are all assigned to the same partition (train/dev/test). This allows LongtoNotes to maintain the same train/dev/test partition, at the document level, as OntoNotes (Table 1). While the content of each partition remains the same, the number of documents changes because the divided parts are merged into a single annotated text in LongtoNotes. We refer to LongtoNotes_s as the subset of LongtoNotes comprising only the merged documents (i.e., documents merged by the annotators).

Length of Documents in LongtoNotes
The average number of tokens per document (rounded to the nearest integer) in LongtoNotes is 674, ~44% higher than in OntoNotes (466). Table 2 shows the changes in document length by genre. We observe that the genre with the longest documents is broadcast conversation, with 4071 tokens per document, a dramatic increase from the length of the divided parts in OntoNotes, which had 511 tokens per document in the same genre. The number of coreference chains and the number of mentions per chain grow as well. The long documents that were split into multiple parts during the original OntoNotes annotation are not evenly distributed among the genres of text present in the corpus. In particular, the broadcast news (bn) and newswire (nw) categories consist exclusively of short non-split documents, which were not affected by the LongtoNotes merging process. A list of the documents merged in LongtoNotes is provided in Table 10 (Appendix).

Number of Coreference Chains
As a consequence of the increase in document length, LongtoNotes presents a higher number of coreference chains per document (16) compared to OntoNotes (12). Figure 2 shows the length and number of coreference chains for each document in the two corpora. As expected, the number of chains grows with document length, as shown in Figure 3.

Number of Mentions per Chain
The number of mentions per coreference chain in LongtoNotes is over 30% larger than in OntoNotes. This is primarily a consequence of the longer documents and the increase in the number of coreference chains per document. Mentions per chain increase with document length. The increase in mentions per chain is highest for the broadcast conversation (bc) genre, at 87%, while for the pivot (pt) genre (Old Testament and New Testament text) it is only 30%, as it has shorter documents.

Distances to the Antecedents
For each coreference chain, we analyzed the distance between the mentions and their antecedents. The largest distance from a mention to its antecedent more than doubled in LongtoNotes when compared to OntoNotes, from 4,885 to 11,473 tokens. Figure 4 shows a detailed breakdown of the mention-to-antecedent distances. There are no mentions that are more than 5K tokens distant from their antecedent in OntoNotes; there are 178 such mentions in LongtoNotes.
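To make the measurement concrete, the following is a minimal sketch (our own illustration, not the authors' analysis code; the mention representation and the choice of measuring distance between mention start tokens are assumptions):

```python
# Illustrative sketch of how mention-to-antecedent distances can be measured;
# mention offsets and their granularity are assumptions, not the paper's exact procedure.
from typing import List, Tuple

Mention = Tuple[int, int]  # (start_token, end_token) offsets within a document


def antecedent_distances(chain: List[Mention]) -> List[int]:
    """Token distance from each mention to its closest preceding mention
    (treated as its antecedent) within the same coreference chain."""
    ordered = sorted(chain)
    return [curr[0] - prev[0] for prev, curr in zip(ordered, ordered[1:])]


def max_antecedent_distance(chains: List[List[Mention]]) -> int:
    """Largest mention-to-antecedent distance over all chains of a document."""
    return max(
        (d for chain in chains for d in antecedent_distances(chain)), default=0
    )
```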

Comparison with other Datasets
We compare LongtoNotes to other coreference datasets in terms of the number of documents, total number of tokens, and document length (Table 3). Litbank is a popular long-document coreference dataset with a high tokens-per-document ratio. However, it consists of only 100 documents, which makes model development challenging, and it focuses only on the literary domain. Other datasets containing long documents (e.g., WikiCoref) are also very small in size. On the other hand, datasets consisting of a larger number of texts tend to contain shorter documents (e.g., PreCo). Thus, by building LongtoNotes, we address the scarcity of a multi-genre corpus with a collection of long documents containing long-range coreference dependencies.
In concurrent work, Gupta et al. (2023) present a generalised annotation platform for coreference with simplified guidelines for users. In the future, such a tool could be used to more easily annotate documents of increased length.

Annotation Procedure & Quality
In this section, we describe and assess the annotation procedure used to build LongtoNotes.

Annotation Task
The annotators merge the coreference annotations in a sequential fashion. That is, they combine annotations from the second split part of an OntoNotes document into the first part, then the third part into the combined first two parts, and so on. Precisely, to build LongtoNotes, annotators successively merge chains in the current part i + 1 of the document with one of the chains in the previous parts 1, . . . , i. We reformulate this annotation process as a question-answering task, in which we ask annotators a series of questions (the same coreference-determining question for different mentions), using our own annotation tool designed for this task (Figure 5). We display parts 1, . . . , i with color-coded mention spans. We then show a highlighted concept (a coreference chain in part i + 1) and ask the question: The highlighted concept below refers to which concept in the above paragraphs? The annotators select one of the colour-coded chains from parts 1, . . . , i from a list of answers, or they can specify that the highlighted concept in part i + 1 does not refer to any concept in parts 1, . . . , i (i.e., a new chain emerging in part i + 1). The list of answers consists of the merged chains formed in the previous iterations. The annotation tool proceeds with a question for each coreference chain in order (sorted by the first token offset of the first mention in the chain). The annotation of all parts of a document comprises an annotation task; that is, a single annotator is tasked with answering the multiple-choice question for each coreference chain in each part of a document. At the end of each part, annotators are shown a summary page that allows them to review, modify, and confirm the decisions made in the considered part. A screenshot of the summary page is provided in Figure 9 in the Appendix.
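As a concrete, purely illustrative sketch of this procedure, the snippet below shows how the per-part question list could be assembled; the data structures and names are hypothetical and not those of our actual tool:

```python
# Hypothetical reconstruction of how the tool's multiple-choice questions could be
# built for part i+1; representations are our own, not the tool's internals.
from typing import Dict, List, Tuple

Mention = Tuple[int, int]   # (start_token, end_token) offsets
Chain = List[Mention]       # one coreference chain


def questions_for_part(merged_chains: Dict[str, Chain],
                       part_chains: List[Chain]) -> List[dict]:
    """One question per chain of part i+1, ordered by the first token offset of the
    chain's first mention; the answer options are the chains merged in previous
    iterations plus an explicit 'New chain' choice."""
    ordered = sorted(part_chains, key=lambda chain: min(m[0] for m in chain))
    options = list(merged_chains.keys()) + ["New chain"]
    return [{"highlighted_chain": chain, "options": options} for chain in ordered]
```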
From Annotations to Coreference Labels The annotations collected in this way are then converted into coreference labels for the merged parts of a document. The answers to the questions tell us the antecedent link between two coreference chains. These links are used to relabel all mentions in the two chains with the same coreference label, resulting in the LongtoNotes dataset.
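A minimal sketch of this relabelling step (our own illustration, not released conversion code; the chain ids and the link representation are assumptions):

```python
# Annotator answers are treated as links from a chain to its antecedent chain,
# or None for a 'New chain' answer; linked chains are merged with union-find.
from typing import Dict, Optional


def merge_links(links: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Map every within-part chain id to a document-level cluster id; all mentions of
    chains that share a cluster id receive the same coreference label."""
    parent: Dict[str, str] = {}

    def find(c: str) -> str:
        parent.setdefault(c, c)
        while parent[c] != c:
            grandparent = parent.setdefault(parent[c], parent[c])
            parent[c] = grandparent        # path compression
            c = grandparent
        return c

    for chain, antecedent in links.items():
        if antecedent is not None:
            parent[find(chain)] = find(antecedent)   # union the two chains

    roots = sorted({find(c) for c in links})
    return {c: roots.index(find(c)) for c in links}


# Example: if the answer for part-2 chain "p2_c3" is part-1 chain "p1_c0", both end up
# with the same cluster id, so their mentions are relabelled with one coreference label.
print(merge_links({"p1_c0": None, "p2_c3": "p1_c0", "p2_c4": None}))
# -> {'p1_c0': 0, 'p2_c3': 0, 'p2_c4': 1}
```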
Singletons Existing OntoNotes coreference annotation does not include singletons. Considering all parts of a document together might allow mentions that were considered singletons in a specific part to be assigned to a coreference chain. To understand how often a single part of a document contains singletons that have coreferent mentions in other parts, we manually analysed 500 mentions spread across 10 parts over three randomly selected long documents. We found only 17 instances (~0.03%) where singletons could be merged with coreference chains in different parts of the same document. Given that such singletons would constitute only a small percentage of mentions, we decided it was appropriate to omit them from the annotation process to reduce the complexity of annotation. To merge this small number of singleton mentions, our annotators would have had to label over 50% more mentions per document. We further discuss this in Appendix A.4.

Annotators and Training
We hired and trained a team of three annotators for the aforementioned task. The annotators were university-level English majors from India and were closely supervised by an expert with experience in similar annotation projects. The annotation team was paid a fair wage of approximately 15 USD per hour for the work. We held several hour-long training sessions outlining the annotation task, the setup of the problem, and the OntoNotes annotation guidelines. We reviewed example cases of difficult annotation decisions and collaboratively worked through example annotations. We then ran a pilot annotation study with a small number of documents (approx. 5% of the total documents). For these documents, we also provided our own annotations to support the training of the annotators and to assess eventual annotation quality. We calculated the inter-annotator agreement between the annotators and ourselves. After a few rounds of training, we achieved an inter-annotator agreement score (strict match, defined in the next subsection) of over 95%, which confirmed the annotators' understanding of the task, and we proceeded with the annotation.
After the satisfactory pilot annotation study, the tasks were assigned to the annotators in five batches of 60 documents each. For 10% of the tasks, all three annotators provided annotations; for the remaining 90%, a single annotator was used. For the documents with multiple annotators, we used majority voting to settle disagreements. If all annotators disagreed on a specific case, we selected Annotator 1's decision over the others (analysis in Appendix B).

Measuring Quality of Annotation
We would like to ensure that LongtoNotes maintains the high quality standards of OntoNotes. Thus, we compute various metrics of agreement between a pair of annotators. We consider (1) question-answering agreement (i.e., how similar the annotations made using the annotation tool are), and (2) coreference label agreement (i.e., agreement at the level of the resulting coreference annotation).
Assume that annotator $j$ receives a chain $C_i$ from part $i+1$; the annotator links it either to a New chain or to a chain from their (annotator-specific) set of available chains. Let us call $D_i^{(j)}$ the linking decision of the $j$-th annotator, which consists of a pair $(C_i, A_i^{(j)})$, where $A_i^{(j)}$ is the selected antecedent chain. We consider the following question-answering metrics: (i) Strict Decision Matching: the fraction of decisions on which two annotators agreed on merging two chains and there is an exact match between the merged chains, i.e., the average of $\mathbb{1}[A_i^{(1)} = A_i^{(2)}]$ over decisions. (ii) Jaccard Decision Match: the Jaccard similarity between the merged chains $C_i \cup A_i^{(1)}$ and $C_i \cup A_i^{(2)}$ (in terms of their mentions), averaged over decisions. (iii) New Chain Agreement: the number of times both annotators select a New chain divided by the number of times at least one selects a New chain.
(iv) Not New Chain Agreement: the number of times both annotators agree on a choice other than a New chain divided by the number of times at least one annotator selects a choice other than a New chain.
(v) Krippendorff's alpha: Krippendorff's alpha (Krippendorff, 2011) is a reliability coefficient measuring inter-annotator agreement. We compute Krippendorff's alpha using strict decision match as the coding for agreement. Table 4 presents the results for these metrics. We observed that, on average, annotators agreed with each other on over 90% of their decisions, except when Not New Chain Agreement is considered. Removing New chains significantly reduces the total number of decisions to be compared, hence the lower score on Not New Chain Agreement. We found that Annotator 1 agreed most with the experts; hence Annotator 1's decisions were preferred over the others in case of disagreement between all three annotators.
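For concreteness, below is a small, simplified sketch of the first few metrics (our own illustration; the decision encoding, the "NEW" marker, and the mention-set representation are assumptions, and Krippendorff's alpha is omitted since it is typically computed with an existing library):

```python
# Hedged sketch of the question-answering agreement metrics described above.
from typing import Dict, FrozenSet, Tuple

Decision = str                              # id of the selected antecedent chain, or "NEW"
MentionSet = FrozenSet[Tuple[int, int]]     # mentions of the chain an annotator merged into


def strict_match(d1: Dict[str, Decision], d2: Dict[str, Decision]) -> float:
    """Fraction of shared decisions on which both annotators gave exactly the same answer."""
    keys = d1.keys() & d2.keys()
    return sum(d1[k] == d2[k] for k in keys) / len(keys)


def jaccard_match(m1: Dict[str, MentionSet], m2: Dict[str, MentionSet]) -> float:
    """Average Jaccard similarity between the merged chains chosen by the two annotators."""
    keys = m1.keys() & m2.keys()
    sims = [len(m1[k] & m2[k]) / len(m1[k] | m2[k]) for k in keys]
    return sum(sims) / len(sims)


def new_chain_agreement(d1: Dict[str, Decision], d2: Dict[str, Decision]) -> float:
    """Times both annotators chose 'New chain', divided by times at least one chose it."""
    keys = d1.keys() & d2.keys()
    both = sum(d1[k] == "NEW" and d2[k] == "NEW" for k in keys)
    either = sum(d1[k] == "NEW" or d2[k] == "NEW" for k in keys)
    return both / either if either else 1.0
```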

Where are disagreements found in annotation?
We would like to understand what kinds of mentions lead to disagreement between annotators. We examined the part of speech of the mentions in all chain assignments on which the annotators disagreed. We found that 8% of the mentions within the disagreed chain assignments were pronouns, 8% were verbs, and 9% were common nouns. The proportion of proper-noun disagreements was lower, at just 5%. When considering different genres, we observed that genres with longer documents, such as broadcast conversation (bc), had more disagreed mentions that were pronouns compared with genres with shorter documents, such as pivot (pt). As expected, the number of disagreements in general increased with the size of the documents. However, we found that the number of disagreements was small even for long-document genres such as broadcast conversation (bc). See Appendix B.

Table 4: Annotation Quality Assessment. We report the average of each metric over all pairs of annotators.

Time Taken per Annotation
We also recorded the time taken for each annotation. The time taken per annotation increases with document length (Appendix Fig. 10). This is expected, as more chains create more options to choose from, and longer documents demand more reading and attention. In total, our annotation process took 400 hours.

Pitfalls of Automatically Merging Chains
To show the importance of our human-based annotation process, we investigate whether the annotators' decisions could have been replicated using off-the-shelf automatic tools. We performed two experiments: (i) a simple greedy rule-based string-matching system (described in Appendix A.5) and (ii) the Stanford rule-based coreference system, each used to merge chains across the various parts. We use the merged chains to calculate the CoNLL F1 score against the annotations produced by our annotators. We found that our string-matching system achieved a CoNLL F1 score of only 61%, while the Stanford coreference system reached a score of only 69%. The low scores compared to the annotators' agreement (which is over 90%) underline the complexity of the task and the need for human annotation.

Empirical Analysis with LongtoNotes
We hope to show that LongtoNotes can facilitate the empirical analysis of coreference models in ways that were not possible with the original OntoNotes. We are interested in the following empirical questions, using OntoNotes (Pradhan et al., 2013) and our proposed LongtoNotes and LongtoNotes_s: • How does the length of documents play a role in the empirical performance of models?
• Does the empirical accuracy of models depend on different hyperparameters in LongtoNotes and Ontonotes?
• Does LongtoNotes reveal properties about the efficiency/scalability of models not present in Ontonotes?

Models
Much of the recent work on coreference can be organized into three categories: span-based representations (Lee et al., 2017; Joshi et al., 2020), token-wise representations (Thirukovalluru et al., 2021; Kirstain et al., 2021), and memory networks / incremental models (Toshniwal et al., 2020b,a). We consider one approach from each of the three categories.
Span-based representation We used the Joshi et al. (2020) implementation of the higher-order coreference resolution model (Lee et al., 2018) with SpanBERT. Here, the documents were divided into non-overlapping segments of 384 tokens. We used SpanBERT Base as our model due to memory constraints. The number of training sentences was set to 3. We set the maximum number of top antecedents to K = 50. We used Adam (Kingma and Ba, 2014) as our optimiser with a learning rate of 2e-4.
Memory networks We used SpanBERT Large with a sequence length of 512 tokens. Following Toshniwal et al. (2020b), an endpoint-based mention detector was trained first and then used for coreference resolution. The number of training sentences was set to 5, 10, and 20.
We find that, to achieve good accuracy with hyperparameters such as the learning rate and warmup size, we need to maintain a number of steps per epoch consistent with OntoNotes when training on LongtoNotes. A detailed analysis is presented in Appendix Section C.

Token-wise representation We used the token-wise model of Kirstain et al. (2021); the effect of its maximum sequence length (384 vs. 4096 tokens) is analyzed in §4.3.
Length Analysis - Number of Tokens We break down the performance of the span-based model by the number of tokens in each document and compare the performance of the model depending on the training set. Figure 2 shows that the majority of the documents in the OntoNotes dataset fall within a token length of 2000 per document. We create two splits of LongtoNotes_s, one containing documents longer than 2000 tokens and the other containing documents shorter than 2000 tokens. Table 5 shows that for shorter documents (fewer than 2000 tokens), the SpanBERT model trained on OntoNotes performs better, but the trend reverses for longer documents (more than 2000 tokens), on which the model trained on LongtoNotes outperforms the model trained on OntoNotes by +1%. Table 7 displays the change in F1 score with the increase in the number of clusters per document. The SpanBERT Base model trained on LongtoNotes outperforms the same model trained on OntoNotes (+0.6%) when the number of clusters is more than 40. Note that 40 is selected based on the cluster distribution shown in Table 2.

Hyperparameters & Document Length
Each model has a set of hyperparameters that would seemingly lead to variation in performance with respect to document length. We consider the performance of the models on LongtoNotes as a function of these hyperparameters.
Span-based model hyperparameters We consider two hyperparameters: the number of antecedents to use, K, and the maximum number of sentences used in each training example. We found that upon varying K over 10, 25, and 50, there was only a small difference in the results for the models trained on both OntoNotes and LongtoNotes (increasing K led to only minor improvements). The results are summarized in Table 8. We could not go beyond K = 50 due to our GPU memory limitations; however, going beyond 50 might further help for longer documents. Furthermore, we found that the number-of-sentences parameter used to create training batches does not play a significant role in performance either (Figure 8).
Token-wise model hyperparameters Reducing the sequence length at test time from 4096 to 384 leads to a drop in F1, as seen in Figure 6. We observed that the longer sequence length (4096) helps more for LongtoNotes_s, which contains longer sequences than OntoNotes, as is evident in Figure 6. Furthermore, we analyzed performance on two genres: magazine (mz), which has 6x longer sequences in LongtoNotes than in OntoNotes, versus pivot (pt), which has just 1.4x longer documents. As observed in Figure 7 (and Appendix Table 15), when the documents are long, as in magazine (mz), there is a significant increase in performance with the longer sequence length, but the effect is negligible for pivot (pt), where the document sizes are almost the same in the two corpora.
Memory model hyperparameters We consider two hyperparameters: the memory size, which denotes the maximum number of active antecedents that can be considered, and the maximum number of sentences used in training. We show that doubling the size of the memory leads to an increase of 0.8 points of CoNLL F1 on the LongtoNotes dataset (Appendix Table 14). Figure 8 demonstrates that there is no significant improvement in the performance of the model with an increase in the number of training sentences.

Model Efficiency
We compare the prediction time for the span-based model on the longest and on average-length documents in LongtoNotes and OntoNotes in Table 9. We observe that there is a significant jump in the running time and memory required to scale the model to the long documents in LongtoNotes; this jump is much smaller on OntoNotes. This suggests that our proposed dataset is better suited for assessing the scaling properties of coreference methods.

Conclusion
In this paper, we introduced LongtoNotes, a dataset that merges the coreference annotations of documents that were split into multiple independently-annotated parts in the original OntoNotes dataset. LongtoNotes has longer documents and coreference chains than the original OntoNotes dataset. Using LongtoNotes, we demonstrate that scaling current approaches to long documents poses significant challenges, both in terms of achieving better performance and in terms of scalability. We demonstrate the merits of using LongtoNotes as an evaluation benchmark for coreference resolution and encourage future work to do so.

Limitations
Our dataset comprises solely English texts, and our analysis therefore applies only to the English language. OntoNotes, however, also contains Arabic and Chinese annotations; those languages were not considered in our study due to the limited expertise of the annotators. Since our models are not tuned for any specific real-world application, the methods should not be used directly in highly sensitive contexts such as legal or health-care settings, and any work building on our methods must undertake extensive quality-assurance and robustness testing before using them.

Ethical Considerations
The annotation was performed with a data annotation service, which ensured that the annotators were paid fair compensation of 15 USD per hour. The annotation process did not solicit any sensitive information from the annotators.

Appendix A Dataset and Annotation Details
A.1 Annotation tool
Fig. 9 shows our tool's summary page.

A.2 Comparison with OntoNotes
A detailed genre-wise comparison of the documents from the OntoNotes dataset which were merged in LongtoNotes is presented in Table 10. It can be seen that categories like bn and nw are completely missing in LongtoNotes_s, while pt is partially missing.


A.3 Dataset selection decision
Due to budget constraints and the expertise of our team and annotators in English only (some training of annotators is required to ensure data quality), we only considered the English portion of the OntoNotes dataset in our work. We think that the dataset can be extended to Arabic and Chinese too, but we leave this for future work.

A.4 Annotating singletons
While manually analysing the singletons, we observed that almost all NPs can be thought of as mentions, and all those NPs that are not part of any chain can be thought of as singletons. Our analysis suggests that over 50% of mentions are not annotated by OntoNotes and could qualify as singletons. To annotate all the singletons, an annotator would need to go through all of them, discard the ones that do not abide by the OntoNotes rules, and then decide whether to merge each singleton into some chain or with another singleton. In our analysis, the number of such mergeable singletons is very low, and the effort was not worth the small improvement over the current annotations. We therefore decided to ignore all singletons in our study.
A.5 Greedy rule-based matching system
We use a greedy string-matching system in which we take all the mentions in a chain of the current part i + 1 and analyse their part of speech as provided in the OntoNotes dataset. We take the first noun (NN or NP) present in each chain and look for overlapping mentions in the chains of all previous parts 1, . . . , i. We merge two chains if there is a strict overlap with any of the mentions in a given chain. If there are no strict overlaps, we move to the next noun in the given chain and repeat the process. If we find no strict overlap with any mention in any chain of another part, we keep the chain independent (the same as assigning "None of the below" in our annotation tool). We repeat the process for all chains in a given document and update the merged chains after every part.
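For reference, a hedged sketch of this greedy matcher (our own reconstruction for illustration; the exact mention representation and matching details of the system we ran may differ):

```python
# Illustrative reconstruction of the greedy rule-based merger; the mention
# representation with surface text and POS tags is an assumption.
from typing import Dict, List, Optional, Tuple

Mention = Tuple[str, str]    # (surface_text, pos_tag) as provided in OntoNotes
Chain = List[Mention]


def merge_target(chain: Chain, merged: Dict[str, Chain]) -> Optional[str]:
    """Return the id of a previously merged chain to attach this chain to, or None
    to keep it independent ('None of the below' in the annotation tool)."""
    nouns = [text for text, pos in chain if pos.startswith(("NN", "NP"))]
    for noun in nouns:                                        # first noun, then the next, ...
        for chain_id, prev_chain in merged.items():
            if any(noun == text for text, _ in prev_chain):   # strict string overlap
                return chain_id
    return None


def greedy_merge(parts: List[List[Chain]]) -> Dict[str, Chain]:
    """Process parts 1..n in order, updating the set of merged chains after each part."""
    merged: Dict[str, Chain] = {}
    for p, chains in enumerate(parts, start=1):
        for c, chain in enumerate(chains):
            target = merge_target(chain, merged)
            if target is None:
                merged[f"part{p}_chain{c}"] = list(chain)
            else:
                merged[target].extend(chain)
    return merged
```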

B Annotation Disagreement Analysis
B.1 Genre-wise disagreement analysis
Table 11 presents the genre-wise disagreement analysis for strict decision matching. Genres with longer documents, like bc and mz, have more disagreements compared to genres with smaller document lengths, like tc and pt. The trend is very similar for new-chain assignments, where genres with longer documents have more disagreements over new-chain assignments. The numbers are presented in Table 13. Figure 11 shows the cases (in black) in which the annotators disagreed for each part-of-speech category (shown as big coloured bubbles). The size of the bubbles is representative of their occurrence in the dataset, suggesting there are more pronominal mentions in the dataset than nouns or proper nouns.

B.2.1 Genre-wise disagreement analysis
In general, annotators disagree more on pronouns than on proper nouns, and the trend is consistent across genres, as shown in Table 12.

C.1 MUC, B^3 and CEAFE scores
Tables 16, 17 and 18 present the MUC (Vilain et al., 1995), B^3 (Bagga and Baldwin, 1998) and CEAFE (Luo, 2005) scores for SpanBERT Base (Lee et al., 2017). Scores with the shorter sequence length (384) are lower for all models; the difference is higher when the documents are longer (as seen in the mz genre) than when the documents are shorter (as seen in pt). We also compare memory sizes for the memory model (Toshniwal et al., 2020b) and find that the larger-memory version achieves better results on each dataset.