Event Coreference Data (Almost) for Free: Mining Hyperlinks from Online News

Cross-document event coreference resolution (CDCR) is the task of identifying which event mentions refer to the same events throughout a collection of documents. Annotating CDCR data is an arduous and expensive process, which explains why existing corpora are small and lack domain coverage. To overcome this bottleneck, we automatically extract event coreference data from hyperlinks in online news: when referring to a significant real-world event, writers often add a hyperlink to another article covering this event. We demonstrate that collecting hyperlinks which point to the same article(s) produces extensive and high-quality CDCR data, and create HyperCoref, a corpus of 2M documents and 2.7M silver-standard event mentions. We evaluate a state-of-the-art system on three CDCR corpora and find that models trained on small subsets of HyperCoref are highly competitive, with performance similar to models trained on gold-standard data. With our work, we free CDCR research from depending on costly human-annotated training data and open up possibilities for research beyond English CDCR, as our data extraction approach can be easily adapted to other languages.


Introduction
Cross-document event coreference resolution (CDCR) is the task of identifying and clustering mentions of real-world events in a given collection of documents. For example, CDCR systems need to decide whether the sentences "On Monday, Lindsay Lohan checked into rehab in Malibu" and "Ms. Lohan entered a rehab facility" from two different documents refer to the same event, by taking temporal, spatial and other cues from the document contexts into account. CDCR is a vital preprocessing step for downstream multi-document tasks such as question answering or fact checking. Annotating CDCR data is laborious and expensive, requiring expert annotation which can span weeks for several hundred documents (Cybulska and Vossen, 2014b; Vossen et al., 2018). Crowdsourcing has been proposed as an alternative; however, it requires extensive training of annotators (Bornstein et al., 2020) or post-processing by experts (Bugert et al., 2020, 2021), precluding large-scale studies. Annotating CDCR data in a different language requires great effort, since language-specific guidelines (Minard et al., 2016) and enough annotators with proficiency in the target language are required. As a consequence, CDCR corpora had to compromise on size, domain coverage, the density of annotated mentions and coreference links, and language coverage. This data bottleneck is problematic for three reasons. Firstly, recent state-of-the-art CDCR systems are based on pretrained language models (Devlin et al., 2019; Liu et al., 2019) fine-tuned with supervised learning on human-annotated corpora (Zeng et al., 2020; Yu et al., 2020; Caciularu et al., 2021; Cattan et al., 2021). Achieving top results with such models still requires high-quality labeled data for fine-tuning (Gururangan et al., 2020), yet the size of current corpora is insufficient for training large systems.
Secondly, because corpora need to make compromises on domain coverage, the domain coverage of all existing CDCR corpora is limited even when combined. This holds back research on open-domain CDCR systems which could increase the (currently limited) applicability of CDCR in downstream tasks (Bugert et al., 2021). Thirdly, because test splits consist only of a few hundred documents (far less than what downstream applications may require), scalability to large corpora is a problem which could not be tackled so far (Bugert et al., 2021).
To overcome this gap, we leverage hyperlinks in online news articles: When referencing a significant real-world event in the body of an article, writers often add a hyperlink to a different article covering this event. We conjecture that by collecting hyperlinks which point to the same article(s) and interpreting anchor texts as mention spans, high-quality cross-document event coreference links can be retrieved quickly and in large quantity (see Figure 1). To this end, we devise a data extraction pipeline which mines such datasets automatically from Common Crawl (https://commoncrawl.org/) and apply it to create the HYPERCOREF corpus, consisting of 40 news outlets with over 2M mentions in total, far exceeding the size of existing CDCR corpora. HYPERCOREF achieves broader coverage of event types than manually annotated corpora. In experiments with a state-of-the-art CDCR model (Cattan et al., 2021), we evaluate the relation between the amount of gold training data and test performance across three CDCR corpora: ECB+, FCC-T, and GVC. We make the remarkable observation that models trained entirely on silver-standard data from HYPERCOREF perform on a similar level as models trained on gold-standard data (between 4 pp. CoNLL F1 worse and 4 pp. better, depending on the corpus at hand). Overall, our findings lift the dependency on gold data for training CDCR systems and pave the way for large, robust and potentially multilingual systems, as our data extraction approach can be easily adapted to any language found on the web. Our contributions are: C1. a novel approach for acquiring silver-standard cross-document event coreference links from hyperlinks; C2. HYPERCOREF, a large corpus created with this approach, and its analysis in comparison to gold-standard CDCR corpora; C3. out-of-domain transfer experiments with a state-of-the-art CDCR system, certifying the quality of this data.

Fundamentals
We define cross-document event coreference resolution (CDCR) and its relation to hyperlinks in news.
Task Definition CDCR consists of two steps: (1) identifying mentions of real-world or hypothetical events in a collection of documents (event mention detection), and (2) recognizing which of these mentions refer to the same events, thereby producing a cross-document clustering of mentions (coreference resolution). Event mentions are commonly defined by four components: their action (checked into), participants (Lindsay Lohan), time (on Monday) and location (rehab in Malibu) (Cybulska and Vossen, 2014a). Actions are the centerpiece of event mentions, and their token span is the main representative of an event mention in text.
We also refer to this span as a mention span.
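This four-component structure can be made concrete as a small data type. The following is a minimal, hypothetical Python sketch; the class and field names are our own, not taken from any existing CDCR toolkit:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EventMention:
    """One event mention, following the four-component view of
    Cybulska and Vossen (2014a). Field names are illustrative."""
    action: str                       # token span; main representative in text
    participants: List[str] = field(default_factory=list)
    time: Optional[str] = None
    location: Optional[str] = None
    doc_id: str = ""

# The running example: two coreferent mentions from different documents.
m1 = EventMention("checked into", ["Lindsay Lohan"], "on Monday",
                  "rehab in Malibu", doc_id="doc_a")
m2 = EventMention("entered", ["Ms. Lohan"], doc_id="doc_b")
# A CDCR system must cluster m1 and m2 despite differing surface forms
# and the missing time/location components in m2.
```

Note how m2 leaves most components unfilled; resolving such mentions requires cues from the surrounding document context.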
Hyperlinks in News To establish the context of a recent news development, news journalists make reference to other events which have caused, influenced or are otherwise related to the recent newsworthy event. In online news, such references are often marked with a hyperlink to another article which covers the referenced event in greater detail. These hyperlinks can (with some margin of error) be interpreted as cross-document coreference links: The hyperlink's anchor text (its clickable text region) corresponds to an event mention's action, and the target URL identifies the referenced event. A pair of hyperlinks which point to the same URL but are located in different articles then corresponds to two event mentions connected by a CDCR link. This is exemplified in Figure 1. We propose to collect CDCR data by mining hyperlinks from online news. In the next section, we explain our pipeline for creating such data and the key issues one needs to overcome in the process.
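The core idea can be sketched in a few lines: hyperlinks are grouped by target URL, and each group becomes one silver coreference cluster with anchor texts as mention spans. The tuple layout and example URLs below are illustrative assumptions, not the paper's actual data schema:

```python
from collections import defaultdict

# Each hyperlink: (source document id, anchor text, normalized target URL).
hyperlinks = [
    ("doc_a", "checked into rehab", "news.example.com/lohan-rehab"),
    ("doc_b", "entered a rehab facility", "news.example.com/lohan-rehab"),
    ("doc_c", "won the final", "news.example.com/cup-final"),
]

def cluster_by_target(links):
    """Hyperlinks pointing to the same URL form one silver event
    cluster; their anchor texts serve as the mention spans."""
    clusters = defaultdict(list)
    for doc, anchor, url in links:
        clusters[url].append((doc, anchor))
    return dict(clusters)

clusters = cluster_by_target(hyperlinks)
# clusters["news.example.com/lohan-rehab"] holds two coreferent mentions
# from doc_a and doc_b.
```

In the actual pipeline (Section 3.1), URLs are normalized and heavily filtered before this grouping step.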

Data Extraction
Following an explanation of our data extraction pipeline, we describe its application for creating the HYPERCOREF corpus which we then compare to expert-annotated CDCR corpora.

Pipeline
We apply our pipeline to one news outlet at a time for greater computational efficiency. The following steps are visualized in Figure 2. Given a news outlet:

1. We source documents from Common Crawl (CC), a public repository of crawled web pages. We retrieve a list of all pages available for this news outlet from the CC index.
2. We download these web pages, remove excess markup, and detect publication dates with the newspaper3k framework (https://pypi.org/project/newspaper3k/).
3. We drop pages without in- or outgoing hyperlinks. A large portion of pages tend to be duplicates; we deduplicate pages based on their textual content in a two-stage approach, using locality-sensitive hashing (Leskovec et al., 2020, ch. 3) followed by clustering of pages by their TF-IDF cosine similarity.
4. We extract hyperlinks, remove their URL query strings, and harmonize the boundaries of their anchor text (trimming whitespace, excluding punctuation).
5. The removal of hyperlinks which refer to non-eventive content such as overview pages, product reviews, or affiliate products plays a key role. We apply a series of filtering steps: (a) Links whose target domain (amazon.com, facebook.com, etc.) mismatches the article's domain are removed. (b) We assume that the majority of links do refer to eventive content and that their URLs share syntactic similarities. We build a prefix tree from all URLs and retain only those links whose URLs are part of the 90% most frequent prefixes. (c) Links to pages with a high indegree or from pages with a high outdegree are removed. (d) A set of handwritten rules targeting URLs and anchor texts is applied, which for example removes URLs containing /tag/ or /category/, or anchor texts such as "click here".
6. Groups of hyperlinks sharing the same anchor text and target URL are removed entirely. This eliminates any remaining links to hub pages or "read more"-type links appearing out of context on multiple pages.
All pipeline steps except for the handwritten rules in step 5d) are language-independent, hence the pipeline can be easily applied to news in any language found on the web. We limit ourselves to English news in this work since the majority of gold-standard CDCR corpora are in English. In several test runs with Dutch, German, and French news we observed results of similar quality to those created from English news.
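Step 5b above can be illustrated with a simplified sketch: instead of a full URL prefix tree, we approximate each URL's "shape" by its first path segment and keep only links whose shape belongs to the most frequent prefixes that together cover 90% of all links. This is a rough approximation of the described filtering, with invented example URLs:

```python
from collections import Counter
from urllib.parse import urlparse

def first_segment(url):
    """Crude URL shape: the first path segment (stand-in for a prefix tree)."""
    path = urlparse(url).path.strip("/")
    return path.split("/")[0] if path else ""

def keep_frequent_prefixes(urls, coverage=0.9):
    """Keep only URLs whose prefix is among the most frequent ones,
    together covering `coverage` of all links; rare URL shapes
    (tag pages, tools, etc.) are dropped."""
    counts = Counter(first_segment(u) for u in urls)
    total = sum(counts.values())
    kept, covered = set(), 0
    for prefix, n in counts.most_common():
        if covered / total >= coverage:
            break
        kept.add(prefix)
        covered += n
    return [u for u in urls if first_segment(u) in kept]

urls = [f"https://news.example.com/news/story-{i}" for i in range(9)]
urls += ["https://news.example.com/tag/politics"]
filtered = keep_frequent_prefixes(urls)
# the single /tag/ link falls outside the 90% most frequent prefixes
```

The real pipeline operates on full URL prefixes rather than a single segment, but the coverage-based cutoff works the same way.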

The HYPERCOREF Corpus
We apply our pipeline to 40 news outlets from English-speaking countries to produce HYPERCOREF, a corpus consisting of 2M documents, 0.8M hyperlink clusters (= events) and 2.7M hyperlinks (= event mentions). The most recent documents included stem from the October 2020 crawl of CC. We compare HYPERCOREF to three CDCR corpora: EventCorefBank+ (ECB+) (Cybulska and Vossen, 2014b), a corpus containing news articles from a broad selection of 45 topics; the Football Coreference Corpus (FCC-T) (Bugert et al., 2020, 2021), which annotates football match events in sports news; and the Gun Violence Corpus (GVC) (Vossen et al., 2018), which annotates fine-grained gun violence events in news. A size comparison is shown in Table 1, demonstrating that HYPERCOREF is several orders of magnitude larger than expert-annotated corpora. Additional analysis (see Appendix A.1) reveals that the distribution of cluster sizes is comparable between HYPERCOREF and expert-annotated corpora; however, anchor texts tend to be longer phrases or even entire sentences, as opposed to the minimum span annotations of action triggers pursued for the creation of ECB+. To keep further analysis and experiments manageable, the remainder of this work focuses on ABC News (abcnews.go.com) and BBC News (bbc.com), two large and reputable news outlets.

Event Types
We heuristically determine the event types contained in each corpus by performing word sense disambiguation (WSD) on the syntactic heads of mentions against WordNet (Miller, 1995), choosing the most frequent sense and counting word sense occurrences. Table 2 shows the top word senses of three (sub-)corpora. Compared to ECB+, ABC contains a greater proportion of reporting events (cf. Pustejovsky et al. (2003)) and mentions using light verbs, which are challenging for coreference resolution (Hovy et al., 2013; Choubey and Huang, 2017). BBC consists of events from the sports domain. We count the number of unique verbal word senses of mentions to estimate the event type coverage per corpus (see Table 1). HYPERCOREF exhibits considerably broader coverage than previous CDCR corpora.
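The counting heuristic behind this analysis can be sketched as follows. We substitute a tiny hand-made sense inventory for WordNet; all entries below are invented for illustration, whereas the actual analysis looks up the most frequent WordNet sense of each syntactic head:

```python
from collections import Counter

# Toy stand-in for WordNet: lemma -> senses ordered from most to
# least frequent. Entries are invented for illustration.
SENSES = {
    "say": ["say.v.01"],
    "win": ["win.v.01", "acquire.v.05"],
    "shoot": ["shoot.v.01", "photograph.v.01"],
}

def event_type_profile(mention_heads):
    """Assign each mention head its most frequent sense and count
    sense occurrences across the corpus."""
    counts = Counter()
    for head in mention_heads:
        senses = SENSES.get(head)
        if senses:
            counts[senses[0]] += 1  # most-frequent-sense heuristic
    return counts

profile = event_type_profile(["say", "win", "say", "shoot", "win", "say"])
# profile.most_common(1) -> [("say.v.01", 3)]
```

The number of distinct keys in such a profile is the unique-sense count used to estimate event type coverage in Table 1.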

Qualitative Analysis
We manually analyze a total of 300 hyperlinks from the ABC and BBC subcorpora to gain a better understanding of the retrieved data. 70% of these links are accompanied by a plausible event mention in the same sentence. The remaining 30% refer to topically similar but unrelated events (see A1 in Table 3) or refer to non-event content such as health guides. We analyze if and where the four event components (see Section 2) are found in sentences containing plausible event mentions. For 66% of these links, anchor texts contain the event action.
In the remaining 34% of cases, writers oftentimes marked event participants, times, or locations instead, to emphasize these aspects (see A1, B2 in Table 3). Although such hyperlinks contradict the common definition of an event mention, we decided against filtering them out, since doing so may also have removed event mentions recognizable exclusively by their participants, time or location. For the subset of links where anchor texts contain the event action, 74% exhibit verbal actions (the remainder being predominantly nominal). Overall, HYPERCOREF qualifies as CDCR data, though with inherently noisy clusters and imperfect mention spans. We evaluate the use of HYPERCOREF for training a CDCR system in the next section.

Experiments
Annotating gold-standard CDCR corpora is a laborious and expensive process, raising the question of to what extent such data can be replaced with cheaper-to-obtain silver-standard (i.e., automatically generated) data for training CDCR models. We investigate this question by evaluating the state-of-the-art CDCR system of Cattan et al. (2021) on three gold corpora (ECB+, FCC-T, GVC) and the ABC News and BBC News subcorpora of the silver HYPERCOREF corpus. We first describe the aforementioned CDCR system and explain how we prepare each corpus for CDCR experiments, then report our results obtained for the coreference resolution and mention detection tasks.

Evaluation Scenarios We evaluate scarcity of gold-standard data in three scenarios of increasing difficulty: in S gg , a model has full access to gold-annotated train and dev splits, with an optional equal amount of silver mentions from HYPERCOREF used during training. In S sg , the dev split remains gold but training data is replaced with silver data. Finally, S ss tests out-of-domain transfer using entirely silver train and dev splits.

SOTA System
Data Preparation For ECB+, we use the official splits and filtered sentences specified in the corpus documentation. For FCC-T and GVC, we use the splits of Bugert et al. (2021). For HYPERCOREF, we first discard all documents which closely resemble documents from the dev or test splits of any of the three gold-standard corpora, so as to guarantee unbiased evaluation later on. We only use clusters consisting of 2 to 10 mentions to strike a balance between cluster sizes and a large enough variance in events. To conform a given hyperlink anchor text span to the minimum span annotation of gold-standard corpora (see Section 3.3), we dependency parse the surrounding sentences with CoreNLP (Manning et al., 2014) and choose the syntactic head of the anchor text (including any tokens connected via compound or flat relations) as the mention span. To keep the training times of the Cattan et al. (2021) system manageable, we limit HYPERCOREF training data to 25k event mentions in the S ss and S sg scenarios. For S ss , we use development splits consisting of 1.7k mentions (ABC) and 4.2k mentions (BBC), which corresponds to 5% of all available mentions for these corpora. Additional training and setup details are reported in Appendix A.3; for further details on the CDCR system itself, please refer to the original publication.
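The reduction of an anchor text to a minimum span can be sketched over a toy dependency parse: pick the one span token whose syntactic head lies outside the span, then add span tokens attached to it via compound or flat relations. The token-tuple format below is our own simplification, not CoreNLP's actual output format:

```python
def span_head(tokens, span):
    """Conform an anchor-text span to a minimum span: keep the token
    whose head lies outside the span, plus span tokens attached to it
    via compound/flat relations.
    tokens: list of (text, head_index, relation); head_index -1 = root."""
    span = set(span)
    head = next(i for i in span if tokens[i][1] not in span)
    keep = {head} | {i for i in span
                     if tokens[i][1] == head and tokens[i][2] in ("compound", "flat")}
    return sorted(keep)

# "Lindsay Lohan checked into rehab", anchor text = "checked into rehab"
sent = [("Lindsay", 1, "compound"), ("Lohan", 2, "nsubj"),
        ("checked", -1, "root"), ("into", 4, "case"), ("rehab", 2, "obl")]
verbal = span_head(sent, [2, 3, 4])   # keeps just "checked"

# "World Cup final" as a full anchor text
nominal = [("World", 1, "compound"), ("Cup", 2, "compound"), ("final", -1, "root")]
compound_kept = span_head(nominal, [0, 1, 2])  # keeps "Cup final"
```

This sketch only follows direct compound/flat attachments to the head; longer compound chains would need a transitive traversal.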

Coreference Resolution Experiments
We evaluate event coreference resolution performance in-domain on each CDCR corpus, as well as across corpora to measure out-of-domain robustness. Achieving comparable coreference resolution results between the ECB+, FCC-T and GVC corpora requires using gold event mention spans due to non-exhaustive event mention annotations in FCC-T and GVC (Bugert et al., 2021). We therefore do not use the mention detection mechanism of Cattan et al. (2021) in this set of experiments. We evaluate CDCR with the CoNLL F1 metric (Pradhan et al., 2012).
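For reference, CoNLL F1 averages the F1 scores of the MUC, B³, and CEAF-e coreference metrics. The B³ component is easy to sketch; below is a self-contained implementation over mention-to-cluster mappings (the input format is our own choice):

```python
from collections import defaultdict

def b_cubed(pred, gold):
    """B³ precision/recall/F1. `pred` and `gold` map each mention id
    to a cluster id; both must cover the same mention set."""
    def score(a, b):
        clusters_a, clusters_b = defaultdict(set), defaultdict(set)
        for m, c in a.items():
            clusters_a[c].add(m)
        for m, c in b.items():
            clusters_b[c].add(m)
        # per-mention overlap of its a-cluster with its b-cluster,
        # normalized by the a-cluster's size
        return sum(len(clusters_a[a[m]] & clusters_b[b[m]]) / len(clusters_a[a[m]])
                   for m in a) / len(a)
    p, r = score(pred, gold), score(gold, pred)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {"a": 1, "b": 1, "c": 2, "d": 2}
pred = {"a": 1, "b": 1, "c": 1, "d": 2}  # "c" wrongly merged into cluster 1
p, r, f1 = b_cubed(pred, gold)
```

In practice, reference scorer implementations should be used for reported numbers; this sketch only illustrates how one component of CoNLL F1 is computed.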
Baselines We report two common CDCR baselines. The lemma baseline clusters all event mentions with the same head lemma together. lemma-δ is a trainable variant of lemma which restricts merging to document pairs which exceed a TF-IDF cosine similarity of δ (Upadhyay et al., 2016). We train δ on the gold development split of each respective corpus.
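Both baselines can be sketched compactly. For brevity, the sketch below uses plain term-count cosine instead of TF-IDF and realizes the lemma-δ constraint as connected components over sufficiently similar document pairs; the input formats are our own assumptions:

```python
import math
from collections import Counter, defaultdict

def cosine(doc_a, doc_b):
    """Term-count cosine similarity (the paper uses TF-IDF)."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lemma_delta(mentions, docs, delta=-1.0):
    """mentions: list of (doc_id, head_lemma); docs: doc_id -> text.
    With the default delta, any similarity passes, yielding the plain
    lemma baseline; a larger delta restricts merges to similar documents."""
    groups = defaultdict(list)
    for i, (_, lemma) in enumerate(mentions):
        groups[lemma].append(i)
    clusters = []
    for ids in groups.values():
        remaining = set(ids)
        while remaining:  # connected components under the similarity constraint
            comp, frontier = set(), [remaining.pop()]
            while frontier:
                cur = frontier.pop()
                comp.add(cur)
                for j in list(remaining):
                    if cosine(docs[mentions[cur][0]], docs[mentions[j][0]]) > delta:
                        remaining.remove(j)
                        frontier.append(j)
            clusters.append(sorted(comp))
    return clusters

docs = {"d1": "lohan rehab malibu", "d2": "lohan rehab facility",
        "d3": "football cup final"}
mentions = [("d1", "check"), ("d2", "check"), ("d3", "check")]
loose = lemma_delta(mentions, docs)             # plain lemma baseline
tight = lemma_delta(mentions, docs, delta=0.5)  # lemma-delta variant
```

With δ = 0.5, the mention in d3 is separated from the others despite the shared lemma, because its document shares no terms with d1 or d2.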
Results Tables 4a and 4b show results of the state-of-the-art system and baseline results for each data scarcity scenario on each corpus. The complete results for all metrics, including the link-based entity-aware coreference metric (LEA) (Moosavi and Strube, 2016), are reported in Appendix A.4.
In S gg , the model trained on ECB+ generalizes best. This is due to the broad domain coverage of ECB+, which includes sports and gun violence, the two topics on which FCC-T and GVC specialize. The performance of S gg models trained on an equal amount of gold and silver data from either ABC or BBC is mixed: test performance on individual corpora is at times higher, but aggregated performance across corpora declines. Looking at the most difficult scenario, S ss , the performance of models trained and optimized entirely on silver HYPERCOREF data is highly competitive with the in-domain performance of S gg models. The strong results of the BBC model on FCC-T can be attributed to the large portion of football news in the BBC subcorpus (see Section 3.2), yet performance on GVC is similarly strong. Performance increases further in the S sg scenario, where gold dev sets are used for early stopping and for choosing the clustering threshold τ.
The HYPERCOREF models outperform both baselines in the S ss and S sg scenarios. The lemma-δ baseline adapts to the distribution of lexically similar mentions in clusters of documents with similar content. This is a highly corpus-dependent property, explaining the baseline's strong in-domain performance (which surpasses the Cattan et al. (2021) system in the S sg scenario on two occasions) while leading to significantly worse performance on unseen test corpora.
The most likely explanation for why S gg models trained on mixed gold and silver data perform worse than S sg and even S ss models is that HYPERCOREF data is most helpful when used in large quantities. Using small subsets limits the diversity of training events and bears a greater risk of overfitting to noise.
Error Analysis We sample 10 test clusters from each gold-standard corpus and manually compare the predictions of (1) both S ss models, (2) of the S sg models optimized on the respective corpus and (3) of the respective in-domain S gg model without augmentation. On ECB+, we make the common observation (Upadhyay et al., 2016;Barhom et al., 2019;Bugert et al., 2021) that the in-domain model primarily matches event actions with similar surface form. S ss and S gg models are more liberal with merging paraphrases (such as "revealed" or "unveiled") but overmerge unrelated mentions more frequently as a result. Compared to ECB+, FCC-T exhibits greater ambiguity of event actions ("win", "draw", "final" can refer to many different sports matches). The in-domain model rarely clusters such mentions, opposed to the S ss and S sg models which merged such mentions if nearby participant mentions were compatible. Our GVC analysis mirrors these findings, with the BBC S ss and S sg models performing noticeably better merges than other models, particularly for clusters with varied actions ("went off", "shooting", "discharged") where a mention's context matters. In summary, models trained on HYPERCOREF exhibit greater context sensitivity.

Mention Detection Experiments
We additionally investigate the usefulness of HYPERCOREF for event mention detection. Of the three CDCR corpora studied, ECB+ is the only corpus allowing general evaluation of event mention detection (Bugert et al., 2021): in ECB+, mentions were annotated exhaustively per sentence, as opposed to FCC-T and GVC where only mentions of specific event types were annotated. The silver event mentions found in HYPERCOREF are similarly incomplete, as it is unlikely that each event in a given sentence is marked with a separate hyperlink. We therefore expect low recall for models trained on HYPERCOREF when applied to ECB+.
To measure event mention detection performance, we train the full CDCR system of Cattan et al. (2021) (training mode "e2e") for the three previously mentioned data scarcity scenarios and test these models on ECB+. Included in the comparison is Reimers (2018), a mention detection system using a BiLSTM-CRF architecture (Huang et al., 2015).

Table 4: Coreference resolution and mention detection performance for three scenarios: models learned predominantly on gold data (S gg ), predominantly on silver data (S sg ), or entirely on silver data (S ss ). Results on corpora unseen during model optimization are marked in gray. We report the mean of three independent trials; for Reimers (2018), the mean of 25 independent trials. Part (c) reports event mention detection performance on ECB+ as mention P/R/F1 on one meta-document of the entire test split.

The results are shown in Table 4c. All models incorporating HYPERCOREF data exhibit lower precision and much lower recall than the S gg model trained entirely on gold data, partially confirming our expectations. At the same time, Reimers (2018) significantly outperforms Cattan et al. (2021), which most likely stems from its CRF component, which determines optimal labels over entire token sequences instead of scoring individual spans as done by Cattan et al. While the absolute numbers in the S ss scenario are decent given the circumstances, the differences in mention span definition between HYPERCOREF and ECB+ are evidently too significant for event mention detection to benefit from HYPERCOREF.
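The reported mention detection scores reduce to simple set operations over predicted and gold spans. The sketch below assumes exact-match comparison over (document, start, end) triples, which is our simplification of the evaluation protocol:

```python
def mention_prf(pred_spans, gold_spans):
    """Exact-match mention detection precision/recall/F1; spans are
    (doc_id, start, end) triples over the concatenated test documents."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# A sparse predictor (few hyperlink-derived spans) against denser gold spans:
pred = [("d1", 0, 1), ("d1", 5, 6)]
gold = [("d1", 0, 1), ("d1", 5, 6), ("d1", 9, 10), ("d2", 3, 4)]
p, r, f1 = mention_prf(pred, gold)
```

In this toy setup precision is perfect but recall is only 0.5, mirroring the high-precision, low-recall behavior discussed above for models trained on hyperlink data.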

Discussion and Future Work
Expensive data annotation constitutes a bottleneck for research on scalable, open-domain CDCR systems. Addressing this gap, we found that hyperlinks extracted from online news have tremendous potential when used as proxy training data for CDCR: A model trained on hyperlinks from BBC sports news outperforms in-domain models trained on FCC-T (sports) and GVC (gun violence), and a model trained on ECB+ performs only 4 pp. CoNLL F1 better than a model trained on hyperlinks from ABC News (see Table 4a). This demonstrates that hyperlink data can offer both sufficient depth and breadth in domains to enable the development of domain-focused and open-domain CDCR models. The proposed data extraction pipeline only requires scraped online news, basic NLP tools and several handwritten filtering rules. Obtaining CDCR training data of sufficient quality is therefore vastly cheaper with this approach than with traditional means of annotation, since no annotation guidelines or trained annotators are required. The approach is language-independent save for a set of filtering rules (see Section 3.1). Rules for a particular language can be created by a single individual proficient in that language (or alternatively, by a motivated researcher using machine translation). Hence, we significantly lower the entry hurdle for future CDCR research on languages other than English, for which little to no gold-standard training data exists.
Nevertheless, there are downsides to our proposed approach. By nature, it cannot be applied to text types without hyperlinks such as works of fiction (Sims et al., 2019), e-mail conversations (Dakle et al., 2020) or dialogue (Eisenberg and Sheriff, 2020). Data scraped from the web is inherently noisy, and while filtering steps mitigate this issue, noise from imperfect markup cleaning, hyperlinks not referring to eventive content, or other issues remains. Similarly, hyperlink anchor texts oftentimes resemble the mention spans of event mentions as defined in traditional annotation guidelines, but this is not guaranteed (see Section 3.3). Future work may investigate human-in-the-loop annotation (Wang et al., 2021) to resolve such cases on a subset of the data. Training event mention detection systems entirely on hyperlink data is possible (see Section 4.2), however with a deficit in precision and particularly recall, since hyperlinks appear less frequently in news data than event mentions do in gold-standard corpora. Doing so can nonetheless be an effective fallback solution for languages for which no gold-standard mention detection corpus is available.

Future Work
The experiments performed in this work mainly serve the purpose of characterizing the quality of the HYPERCOREF corpus. Future work may explore how to use this data to full effect to maximize CDCR performance. In particular, the system of Cattan et al. (2021) does not perform document-level inference and does not make use of document publication dates for temporal inference. More advanced CDCR models, such as the recently introduced cross-document language model of Caciularu et al. (2021), may therefore benefit even more from HYPERCOREF. Of the scenarios we tested HYPERCOREF in, training a system on HYPERCOREF and optimizing its hyperparameters on gold-standard data produced the best results. Better results may be possible with transfer learning (Pruksachatkun et al., 2020; Vu et al., 2020). We observed favorable performance when training on only 2 out of 40 news outlets from HYPERCOREF (amounting to just 2% of all mentions available). Future work may exploit the entirety of HYPERCOREF for training large, truly open-domain CDCR models, potentially including additional languages beyond English by adapting our pipeline.

Related Work
Harvesting NLP datasets from the web has a long history. Sil et al. (2010) extract sequences of verb constructions from webpages to learn common preconditions of actions and events, and Chambers and Jurafsky (2008) extract narrative chains of events from the Gigaword corpus (Graff et al., 2005). However, these works extract event knowledge from raw newswire text, omitting signals from hyperlinks. The WikiLinks corpus (Singh et al., 2012) is an entity coreference corpus created by collecting hyperlinks to Wikipedia pages from a web crawl. While large in volume, the dataset does not target events and suffers from low mention ambiguity, since a mention-pair string identity baseline can reach 82% F1. As part of the Wikification task (Roth et al., 2014; Peng et al., 2016), links in the body of Wikipedia articles and their anchors were collected to produce multilingual entity coreference resolution data.
Closest to our approach is Wikipedia Event Coreference (WEC) (Eirew et al., 2021), a recent CDCR corpus created from Wikipedia articles on real-world events and cross-page links. While Wikipedia-based corpora are easier to create than newswire corpora (such as HYPERCOREF) due to standardized markup, their key downside is that encyclopedic text lacks the temporal and spatial anchoring present in newswire (as in "Today, the White House announced" or "Ed Sheeran is coming to town"), which considerably lessens their usefulness for event-related tasks. Furthermore, compared to HYPERCOREF, event coverage in WEC is limited to events which the Wikipedia community deemed significant enough to warrant a dedicated article (for each separate language). This excludes high-frequency, high-ambiguity events (resignations, stocks surging, arrests, etc.) which are most challenging to resolve for CDCR systems and therefore crucial to have as training data. These corpus differences have a direct impact on results, with our broad-coverage ABC S ss model outscoring the identical RoBERTa-based architecture trained by Eirew et al. (2021). Recently, Choubey and Huang (2021) investigated automated retrieval of annotations for within-document event coreference resolution. Their method is applicable to plaintext news articles, but requires a database of lexical paraphrases for mention identification and a discourse parsing system for filtering. It is unclear whether this method can be successfully applied to CDCR, since the employed discourse-based filtering rules may not be transferable to the cross-document case.
To the best of our knowledge, we are the first to propose a cheap and high-quality data extraction approach specifically for cross-document event extraction and coreference which does not depend on pre-existing resources, combining past work on event extraction from raw newswire text and mining of hyperlinks.

Conclusion
To overcome the prevalent data bottleneck of the CDCR task, we proposed a new method for cheaply and automatically collecting silver-standard data from hyperlinks in online news. We used this approach to create HYPERCOREF, a large dataset with over 2M mentions, and showed that a system trained on a subset of this dataset achieves performance equivalent to the same system trained on expert-annotated corpora. Our data collection approach opens up many avenues for future work, particularly for languages where gold-standard CDCR data is currently scarce or non-existent.

A.1 Additional Corpus Statistics
A complete size comparison between the three gold-standard corpora ECB+, FCC-T, and GVC and silver-standard corpora HYPERCOREF and WEC (Eirew et al., 2021) is shown in Table 10. For HYPERCOREF, the number of documents shown corresponds to the number of documents containing at least one event mention. The corpus contains an additional 850k documents which only appear as hyperlink targets (i.e., these documents are the seminal documents describing each event cluster) and which are kept for reference or for future experiments.
The distribution of cluster sizes in data retrieved with our pipeline resembles that of gold-standard corpora (see Figure 3). Figure 4 demonstrates that hyperlink anchor texts consist of considerably longer token spans than event mentions annotated in gold-standard corpora. This is due to the fact that hyperlink anchor texts are often phrases or entire sentences, as opposed to minimum span annotations (see Figure 6). To provide rough insights into the proportion of nominalized vs. verbal event mentions in each corpus, we determine the coarse-grained part-of-speech tag of mention heads for each corpus (Figure 5) using CoreNLP. The majority of ECB+ consists of verbal mentions, whereas FCC-T and GVC mostly contain nominalized mentions and a certain amount of adjectival mentions. HYPERCOREF subcorpora again exhibit properties similar to gold-standard corpora. The most frequent WordNet synsets of all event mentions in each gold-standard corpus and of all hyperlink anchor texts in HYPERCOREF subcorpora are reported in Table 11.

Table 5 reports our classification of 300 HYPERCOREF mentions as to whether their hyperlink anchor text or surrounding sentence refers to event-like content. Invalid event references account for cases where a hyperlink qualifies as an event mention but the referenced article covers a different event, and for hyperlinks which do refer to the correct event but are not related to their surrounding document.

A.2 Qualitative Analysis Details
Given a hyperlink anchor text, we analyze where mentions of action, participants, time and location are located in the surrounding document (see Table 8). Actions are predominantly found inside the anchor text and the major participants mostly in the surrounding sentence, while the times and locations of events need to be determined from the document context more frequently.
As demonstrated in Table 7, the vast majority of the analyzed event mentions in HYPERCOREF references past events.
We examine the relation between the event(s) mentioned by the sentence surrounding a hyperlink and the main event reported in the article referenced by this hyperlink. Table 6 shows that the majority of links refer to the main event, while a smaller proportion references a subevent of the article's main event. Several such examples are A1, B1, B2 in Table 9. Note that B1 contains mentions of the main event (the interview) and a sub-event (Biden saying he takes responsibility), with the hyperlink being placed on the sub-event mention. We also observed the opposite case, in which the event mentioned in the sentence containing a hyperlink encompasses the event in the referenced article. An example is A2 in Table 9. Here, a hyperlink refers to a mass shooting, with the referenced article reporting on a recent aspect of this crime (the indictment of the offender). Such cases tend to happen when writers merely provide context on a topic, rather than citing a specific incident.
Significant real-world events need little information to be recognizable: two prime examples are the 2001 terrorist attacks on New York's World Trade Center, which are oftentimes referenced only by their date ("9/11"), and the US Independence Day, which is celebrated on the 4th of July every year and is therefore uniquely recognizable by its date. We observed similar cases in HYPERCOREF during a pre-study: In Table 9, example F1 refers to a motor race which took place in 2019 in Canada involving driver Sebastian Vettel, yet there is no explicit lexical trigger for the event action since readers will infer the action from the document context. Instead, "Canada", being the location of the event, takes over the role of the lexical trigger. Event mentions of this kind have so far only been considered for historically significant events (such as 9/11 or World War II) (Cybulska and Vossen, 2014a), hence it would be vital to retain these in HYPERCOREF. A different case is example F2: in this sentence, "in Canada" refers to the country of a race event, but was chosen as the anchor text for reasons of emphasis. Yet another special case is shown in example B3 which cites a football trainer's previous employment period. Leading a team until a certain date implicitly references a contract expiry event, which here is marked with a hyperlink to an article discussing said event.
Identifying such edge cases in order to correct anchor texts which are misplaced (from a linguistic point of view), as in example F2, while keeping cases like examples F1 and B3 unchanged, may improve the quality of the data considerably. However, we expect reliable identification of these cases to be very difficult. Given the risk of introducing biases in the data through machine-learning-based filtering techniques, as well as the computational cost of applying such a solution to large volumes of data, we decided against filtering techniques going beyond the rule-based approaches described in Section 3.1. The detection and correct resolution of the previously mentioned action-less event mentions in particular may, however, pose an interesting topic for future research.

Having faced out-of-memory errors when training on topics larger than roughly 50 documents, we used the KaHyPar framework (Schlag, 2020) to partition HYPERCOREF data into pseudo-topics of 50 documents each (where the number of hyperlinks lost in the partitioning process is minimized).
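To illustrate the partitioning objective, the sketch below shows a much simpler greedy scheme than the one KaHyPar implements (it is NOT the KaHyPar algorithm, and `greedy_partition` is a hypothetical helper): it grows groups of up to 50 documents along hyperlink edges, so that most hyperlinks end up inside a partition rather than being cut.

```python
from collections import defaultdict, deque

def greedy_partition(doc_ids, links, max_size=50):
    """Greedy BFS partitioning sketch (not KaHyPar): grow groups of up to
    max_size documents along hyperlink edges so that few links are cut.

    doc_ids: iterable of document ids.
    links:   list of (doc_a, doc_b) hyperlink pairs between documents.
    Returns a list of partitions (lists of document ids).
    """
    adj = defaultdict(set)
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)

    unassigned = set(doc_ids)
    partitions = []
    while unassigned:
        # Start a new partition from the smallest unassigned document id
        # and grow it breadth-first along hyperlink edges.
        group, queue = [], deque([min(unassigned)])
        while queue and len(group) < max_size:
            doc = queue.popleft()
            if doc not in unassigned:
                continue  # already assigned via a duplicate queue entry
            unassigned.remove(doc)
            group.append(doc)
            queue.extend(n for n in adj[doc] if n in unassigned)
        partitions.append(group)
    return partitions
```

A real hypergraph partitioner balances partition sizes and minimizes the cut globally; this greedy variant only approximates that behavior, but makes the trade-off (partition size cap vs. lost hyperlinks) concrete.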

Coreference Resolution
The following information only concerns coreference resolution experiments.
There are diminishing returns in generating all possible coreferring mention pairs at training time, particularly in corpora with large clusters such as FCC-T, which lead to the generation of many similar pairs (Bugert et al., 2021). To address this, we sample at most 6·√n coreferring mention pairs for a given cluster of n mentions. Regarding the population of non-coreferring pairs, Cattan et al. train with up to 20 times as many non-coreferring pairs per topic as there are coreferring pairs. We reduced this ratio to 15 to speed up the training process. GVC has the unique property of consisting of a single topic and many small clusters, leading to a highly skewed ratio of non-coreferring pairs to coreferring pairs in the training split (GVC: 498:1, ECB+: 20:1, FCC-T: 19:1). After observing strong model bias towards predictions of the non-coreferring class on the GVC development set using a ratio of 15, we reduced the ratio of non-coreferring to coreferring training pairs to 5 for all S_sg experiments (partially) trained on GVC.
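The pair-sampling scheme above can be sketched as follows (a minimal re-implementation under our own assumptions, not the code used in the experiments; `sample_training_pairs` is a hypothetical helper):

```python
import math
import random

def sample_training_pairs(clusters, neg_ratio=15, pos_cap_factor=6, seed=0):
    """Sample training pairs: at most pos_cap_factor * sqrt(n) coreferring
    pairs per cluster of n mentions, and at most neg_ratio times as many
    non-coreferring pairs as there are coreferring ones.

    clusters: list of coreference clusters, each a list of mention ids.
    Returns (coreferring_pairs, non_coreferring_pairs).
    """
    rng = random.Random(seed)

    # Positive (coreferring) pairs, capped per cluster at 6 * sqrt(n).
    positives = []
    for cluster in clusters:
        pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
        cap = int(pos_cap_factor * math.sqrt(len(cluster)))
        positives.extend(rng.sample(pairs, min(cap, len(pairs))))

    # Negative (non-coreferring) pairs: mentions from different clusters,
    # downsampled to the desired ratio.
    mentions = [(m, ci) for ci, cluster in enumerate(clusters) for m in cluster]
    candidates = [
        (m1, m2)
        for i, (m1, c1) in enumerate(mentions)
        for (m2, c2) in mentions[i + 1:]
        if c1 != c2
    ]
    k = min(neg_ratio * len(positives), len(candidates))
    return positives, rng.sample(candidates, k)
```

Note that enumerating all cross-cluster candidates is quadratic in the number of mentions; for corpora the size of HYPERCOREF one would sample negatives lazily instead, but the cap and ratio logic stays the same.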
Analogous to the experiments of Cattan et al., all coreference resolution experiment results are reported using "predicted topics" following Barhom et al. (2019). This entails preclustering the set of test documents using TF-IDF, generating separate event coreference clusters within each of these document clusters, merging the predictions for each document cluster into a single meta-document, followed by the computation of coreference resolution metrics from this meta-document. For FCC-T and GVC, we use document preclusterings created by Bugert et al. (2021) in the above manner.
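The "predicted topics" preclustering step can be illustrated with a small self-contained sketch. This is our own simplified single-link variant over plain TF-IDF vectors, not the exact implementation of Barhom et al. (2019); `precluster` and the threshold value are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors (term -> weight) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def precluster(docs, threshold=0.15):
    """Greedy single-pass document preclustering on TF-IDF cosine similarity:
    a document joins the cluster containing its most similar document,
    provided that similarity exceeds the threshold; otherwise it starts
    a new cluster. Returns clusters as lists of document indices."""
    vectors = tfidf_vectors(docs)
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            sim = max(cosine(vec, vectors[j]) for j in c)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None:
            best.append(i)
        else:
            clusters.append([i])
    return clusters
```

Within each resulting document cluster, coreference clusters are then predicted independently before the predictions are merged into a single meta-document for scoring, as described above.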
We include singletons throughout our experiments. For the lemma and lemma-δ baselines, we use the implementation from Bugert et al. (2021).
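For reference, the core of the lemma baseline can be sketched in a few lines (a simplified version under our own assumptions, not the implementation of Bugert et al. (2021), which additionally restricts merges using document clustering; the lemma-δ variant further applies a document-similarity threshold):

```python
from collections import defaultdict

def lemma_baseline(mentions):
    """Lemma baseline sketch: all event mentions sharing the same head
    lemma are merged into one coreference cluster.

    mentions: list of (mention_id, head_lemma) tuples.
    Returns a list of clusters (lists of mention ids), including singletons.
    """
    clusters = defaultdict(list)
    for mention_id, head_lemma in mentions:
        clusters[head_lemma.lower()].append(mention_id)
    return list(clusters.values())
```

Despite its simplicity, this baseline is known to be competitive on corpora with lexically repetitive mentions, which is why it is a standard point of comparison.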
Apart from the differences mentioned above, please note that our in-domain ECB+ results in the S_gg scenario differ from the results reported by Cattan et al. (2021), since their model was trained on event and entity annotations. To ensure comparability with FCC-T and GVC (which do not offer entity coreference annotations), we only make use of event annotations for ECB+.

A.4 Full Experiment Results
We measure coreference resolution performance with the MUC (Vilain et al., 1995), CEAF_e (Luo, 2005), B^3 (Bagga and Baldwin, 1998), CoNLL F1 (Pradhan et al., 2012) and LEA (Moosavi and Strube, 2016) metrics. We use the scorer implementation from https://github.com/ns-moosavi/coval. Tables 12 to 16 report the full P/R/F1 scores of the coreference resolution experiments reported in Section 4.1 for each of these metrics, respectively.

The examples from Table 9 referenced throughout this appendix are reproduced below, each with the sentence containing the hyperlink and, where available, a sentence from the referenced article:

A1
Sentence containing hyperlink: Sharapova remains an obstacle she has yet to overcome, however, with the Russian having won all three of their previous matches. Halep draws hope from the fact that she has improved each time, pushing the former world number one to three sets in Madrid recently.

B1
Referenced article: In their first interview since announcing his candidacy, former Vice President Joe Biden and Dr. Jill Biden sat down with ABC's "Good Morning America" co-anchor Robin Roberts and addressed issues from Biden's past that have drawn criticism.
Sentence containing hyperlink: In an interview with ABC's "Good Morning America" co-anchor Robin Roberts Monday, Biden said he takes responsibility for Hill's treatment in 1991 when she testified before the Senate Judiciary committee during Supreme Court Justice Clarence Thomas's confirmation hearing.

B2
Sentence containing hyperlink: Biden also tried to position himself as the antithesis of President Donald Trump.

A2
Referenced article: Patrick Crusius, the alleged gunman in the El Paso shooting, has been indicted for capital murder by a grand jury in Texas.
Sentence containing hyperlink: Sen. Ted Cruz, R-Texas, said there have been "too damn many" mass shootings in Texas, but claimed that gun control proposals from Democrats would not have stopped the recent mass shootings in his home state.

F1
Referenced article: Lewis Hamilton secured a record-breaking seventh win at the 2019 Canadian Grand Prix, after a penalty for Sebastian Vettel, who finished first on the road, demoted the German to second in the standings.
Sentence containing hyperlink: Vettel, on other hand, could not muster a smile. Since Canada, Sebastian seems to have been struggling more and more, and at Silverstone those woes deepened further.

F2
Referenced article: n/a
Sentence containing hyperlink: A hint of tension between the Force India drivers had been seen at the previous race in Canada, when Perez had refused to let Ocon past to try and attack Ricciardo.

B3
Referenced article: The Ghana Football Association (GFA) says it has parted company with national team coach Kwesi Appiah by mutual consent.