Event Linking: Grounding Event Mentions to Wikipedia

Comprehending an article requires understanding its constituent events. However, the context in which an event is mentioned often lacks the details of that event. A question arises: how can the reader obtain more knowledge about this particular event beyond what is provided by the local context of the article? This work defines Event Linking, a new natural language understanding task at the event level. Event linking tries to link an event mention appearing in an article to the most appropriate Wikipedia page. This page is expected to provide rich knowledge about what the event mention refers to. To standardize research in this new direction, our contribution is four-fold. First, this is the first work in the community that formally defines the Event Linking task. Second, we collect a dataset for this new task. Specifically, we automatically gather the training set from Wikipedia, and then create two evaluation sets: one from the Wikipedia domain, reporting in-domain performance, and a second from the real-world news domain, to evaluate out-of-domain performance. Third, we retrain and evaluate two state-of-the-art (SOTA) entity linking models, showing the challenges of event linking, and we propose an event-specific linking system, EVELINK, to set a competitive baseline for the new task. Fourth, we conduct a detailed and insightful analysis to help understand the task and the limitations of the current model. Overall, as our analysis shows, Event Linking is a challenging and essential task requiring more effort from the community.


Introduction
Grounding is a process of disambiguation and knowledge acquisition, and is an important task for natural language understanding. Entity linking, grounding entity mentions to a knowledge base (usually Wikipedia) (Bunescu and Pasca, 2006; Mihalcea and Csomai, 2007; Ratinov et al., 2011; Gupta et al., 2017; Wu et al., 2020), has been shown to be important in natural language understanding tasks such as question answering, recommendation systems, and dialogue generation. Despite the significant progress brought by entity linking, we argue that grounding entities alone may not provide enough background knowledge to support text understanding. Consider the example in Figure 1: an entity linking model links the entity "Boston" to the Wikipedia page "Boston", which introduces the history and culture of the city. The information we can get from the page "Boston" is irrelevant to the local context. To truly understand this sentence, we need to link the event triggered by the verb "detonated" to the Wikipedia page "Boston Marathon Bombing". We call this process of grounding events Event Linking.

1 Data and code are available at: http://cogcomp.org/page/publication_view/996.

Figure 1: Examples of Event linking and Entity linking. The left side is the local context ("Two homemade pressure cooker bombs detonated in Boston, killing 3 people and injuring hundreds of others. The police released the images of two suspects."), and the right side contains Wikipedia pages ("Boston": "Boston, officially the City of Boston, is the capital and most populous city of the Commonwealth of Massachusetts in the United States ..."; "Boston Marathon Bombing": "The Boston Marathon bombing was a domestic terrorist attack that took place during the annual Boston Marathon on April 15, 2013. ..."). The entity linking model connects the entity "Boston" to the Wikipedia page "Boston", while the event linking model links the event "detonated" to the Wikipedia page "Boston Marathon Bombing", which is more relevant to the local context.
In this paper, we formulate this Event Linking task for the first time, analyze the differences and challenges of the new task, and carefully design a benchmark dataset for it. We automatically collect training data from the hyperlinks in Wikipedia, and create two evaluation sets to evaluate both in-domain and out-of-domain performance. For in-domain evaluation, the test data also comes from hyperlinks in Wikipedia. To prevent models from overfitting, the test data is balanced between hard cases and easy cases, determined by whether the event is seen in training and by the similarity between the surface forms of event mentions and Wikipedia titles. For out-of-domain evaluation, we annotate real-world news articles spanning 20 years, collected from The New York Times. Considering the sparsity of events in Wikipedia, we also add "Nil" annotations to the test data, indicating that those events do not exist in Wikipedia and the model therefore needs to tag them as "Nil".
Technically, we propose an event linking model, EVELINK, that uses the entities in the local context as arguments of the event structure to better represent the event mention. EVELINK outperforms two SOTA entity linking models, BLINK (Wu et al., 2020) and GENRE (Cao et al., 2021), and achieves strong performance on the event linking test set, especially on seen events and easy cases. A detailed error analysis shows the difficulties of the new task and the limitations of the current model.
To conclude, our contributions are four-fold: (i) We formulate the task of Event Linking. (ii) We collect training data for this task, and design both in-domain and out-of-domain test data, with a balanced ratio of hard and easy cases to ensure dataset quality. (iii) Our proposed approach, EVELINK, shows promising performance in experiments, setting a competitive baseline for future work. (iv) Our in-depth analysis provides a better understanding of this new problem, the challenges in different domains, and the new approach.

Grounding Events in Wikipedia
Given an article and an event mention m in it, event linking tries to find a title t, from all English Wikipedia titles (around 5M titles), that provides the best explanation of m. An event mention is defined as a verb or nominal that refers to an event. A correct title is defined as follows: as long as a Wikipedia page is about the event, or any subsection of the page introduces the event, we regard its title as correct. In this paper, all models assume gold event mentions are given. For each event mention, a system is expected to label it with the correct Wikipedia page, or with a "Nil" tag if the event does not exist in Wikipedia. Accuracy is adopted as the official evaluation metric.
Figure 2: Example of event mentions that only exist in a subsection of a Wikipedia page. The local context is "Part of the Manhattan Bridge will be closed so that its roadway can be rebuilt." The event "rebuilt" does not have its own page, but is mentioned in the "Reconstruction" subsection of the Wikipedia page "Manhattan Bridge" ("The Manhattan Bridge is a suspension bridge that crosses the East River in New York City, ..."; Reconstruction: "... The Brooklyn-bound roadway on the upper level was closed from 1993 to 1996 so that side of the bridge could be repaired. ...").

Figure 3: Example of hierarchical events. The event "draft" of Tom Brady is mentioned in the page "Tom Brady", and is a sub-event of "2000 NFL Draft" ("The 2000 NFL Draft was the procedure by which National Football League teams selected amateur U.S. college football players. It is officially known as the NFL Annual Player Selection Meeting."), which is in turn a sub-event of "National Football League Draft" ("The National Football League Draft, also called the NFL Draft or (officially) the Player Selection Meeting, is an annual event which serves as the league's most common source of player recruitment. ..."). Annotators prefer the specific page, while the Wikipedia hyperlink points to the general one.

Event Linking vs. Entity Linking. Relatedness: (i) Both link an object (an event or an entity) from an article to Wikipedia; (ii) Some events, such as "World War II", are also entities; in such cases, the two tasks coincide. Distinctions: (i) Entities are mostly consecutive text spans. Events, in contrast, are more structured objects, consisting of a trigger and a set of arguments. An event trigger is usually a general verb, which may not refer to a specific event on its own without the event arguments. The more complex structure of events makes event linking a more challenging task that requires a deeper understanding of the local context; (ii) Unlike entities, which have large coverage in Wikipedia, many events do not have a record in Wikipedia. Considering this sparsity, we require models to tag event mentions that do not exist in Wikipedia as "Nil".
Why Event Linking? Except for some events that are also entities, events are generally information units of larger granularity. As shown in Figure 1, a better comprehension of events, such as through linking to Wikipedia, is expected to facilitate text understanding even more.
Challenges specific to Event Linking. (i) The correct title for some event mentions may not be unique. The same event can be introduced in several pages. For example, "Invasion of Poland" and "Occupation of Poland (1939-1945)" both introduce the event in which the German Army invaded Poland in 1939. How to decide the ground truth set and how to evaluate in this situation are not trivial. (ii) Events may only exist in a subsection of a Wikipedia page. Only a limited number of famous events have their own pages, while many other less well-known events exist only in subsections of other pages. Considering the example in Figure 2, the event "rebuilt" of the Manhattan Bridge does not have its own Wikipedia page, but it is mentioned in the "Reconstruction" subsection of the page "Manhattan Bridge". Linking these events requires a model to understand the whole page instead of just encoding the first paragraph.
(iii) Events have a hierarchical structure. Events at larger granularity consist of many sub-events, and these sub-events may have their own Wikipedia pages, or may only be mentioned in the pages of the larger events. Ideally, the model should always link the event mention to the most appropriate page: if a sub-event page exists, link to the sub-event page; otherwise, link to the page of the larger event. However, the term "appropriate" can be unclear because of the event hierarchy. As Figure 3 shows, the Wikipedia page "Tom Brady" is the most specific to the event "drafted". On the other hand, the draft of Tom Brady is a sub-event of "2000 NFL Draft", which is in turn a sub-event of "National Football League Draft". Annotators prefer to link this event to "Tom Brady", while Wikipedia hyperlinks link it to "National Football League Draft". The hierarchy of events makes the standard for the correct title inconsistent.
Related Work

Entity Linking. As described in the previous section, entity linking has been extensively studied for many years (Bunescu and Pasca, 2006; Mihalcea and Csomai, 2007; Ratinov et al., 2011; Gupta et al., 2017; Wu et al., 2020; Cao et al., 2021). Though both entity linking and event linking can be regarded as tasks that link document contents to a knowledge base, we argue that entity linking is more about linking a text span, while event linking is more about linking an event structure centered on a predicate, which is more challenging because the predicate span is usually a general verb. In the experiment section, we show that simply retraining an entity linking model on event linking data without considering the event structure does not perform well. Humeau et al. (2019) and Wu et al. (2020) use a bi-encoder/cross-encoder architecture to train the candidate generation/ranking models, respectively, for entity linking. Considering the structure of events that entities do not have, EVELINK extends their model by adding structure information to the event mention representation.

"Event Linking". We note that the term "event linking" has been used before in the literature (Nothman et al., 2012; Krause et al., 2016). However, these works essentially perform cross-document event coreference: determining whether a given event mention refers to another event mention (in the same or another document). We, on the other hand, link an event mention to a Wikipedia concept with a different purpose: acquiring external knowledge about the event, which is often beyond what we can obtain from the local context. Our definition of event linking can not only improve the understanding of the article, but also pave the way for the intensively studied event coreference and other event relation identification problems.

Data. Eirew et al. (2021) collect training data from Wikipedia hyperlinks for event coreference, while we use similar methods to collect data for event linking.
In this work, we use the FIGER type of a title to identify event titles, while Eirew et al. (2021) use the Wikipedia infobox. There also exist other event knowledge bases, such as EventKG (Gottschalk and Demidova, 2018). Because we use hyperlinks in Wikipedia as the training data source, and we do not limit the candidate space to event titles only, in this work we focus on linking event mentions to Wikipedia, and the candidate space is all Wikipedia titles.

Data Construction
We collect training data and in-domain test data from Wikipedia automatically, and manually annotate a test set in the news domain for out-of-domain evaluation purpose. Table 1 lists some data examples, and Table 2 shows detailed statistics.

Wikipedia
We first collect all hyperlinks (hypertext, title) in Wikipedia text, each of which links a hypertext to a Wikipedia title. Then, we map the FreeBase types of Wikipedia titles to FIGER types (Ling and Weld, 2012), and all titles with the type "Event" are regarded as event titles. All the hypertexts linked to these event titles are regarded as event mentions. Because the same event mention in one Wikipedia page is hyperlinked only once, and editors tend to hyperlink more nominal mentions than verb mentions, verb mentions are highly limited in Wikipedia. To balance the sizes of verbs and nominals, we use the SpaCy part-of-speech model 2 to keep all verb mentions, and sample an equal number of nominal mentions. To prevent models from overfitting, we design hard and easy cases for verbs and nominals:

Verbs: We classify each verb mention mainly by whether the surface form (S) of the verb is seen in the training data, and whether the gold event title (T) is seen in the training data. If both S and T are seen in the training data, we call it a Seen Event. If T is seen in the training data but S is new, we call it an Unseen Form. If T is never seen in the training data, we call it an Unseen Event. Under this setting, "Seen Event" cases are regarded as easy, and the other two as hard. Because of the limited number of verb mentions, all event titles with 5 or fewer verb mentions are used for "Unseen Event".
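The verb bucketing above can be sketched as a lookup against sets built from the training split. This is a minimal sketch; the exact matching criteria (casing, lemmatization of the surface form) are assumptions here, not specified by the paper.

```python
def classify_verb_mention(surface, title, seen_surfaces, seen_titles):
    """Bucket a verb mention by what the training data has seen.

    seen_surfaces / seen_titles are sets of surface forms and gold titles
    observed in the training split (exact string matching is assumed).
    """
    if title not in seen_titles:
        return "Unseen Event"   # gold title never appears in training: hard
    if surface not in seen_surfaces:
        return "Unseen Form"    # title is known, but this trigger word is new: hard
    return "Seen Event"         # both are known: easy case
```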
Nominals: We classify each nominal mention mainly by the similarity between its surface form and the gold title. We calculate the Jaccard similarity between the nominal mention and the gold title over the 3-grams of their surface forms. If the similarity is lower than 0.1, we regard it as a hard nominal; otherwise, it is an easy nominal. We then sample equal numbers of hard and easy cases.
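The hard/easy split for nominals can be computed as below. The paper does not pin down the exact tokenization, so this sketch assumes lowercased character 3-grams; the 0.1 threshold is the one stated above.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard_3gram(mention, title):
    """Jaccard similarity between the character 3-gram sets of two strings."""
    a, b = char_ngrams(mention), char_ngrams(title)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_hard_nominal(mention, title, threshold=0.1):
    """A nominal is a hard case when its surface form barely overlaps the gold title."""
    return jaccard_3gram(mention, title) < threshold
```

For example, the mention "attack" shares no 3-grams with the title "Boston Marathon bombing" and is a hard case, while the mention "bombing" overlaps heavily and is an easy case.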

New York Times
We sample 2,500 lead paragraphs from The New York Times Annotated Corpus (Sandhaus, 2008), which contains New York Times articles from 1987 to 2006. We first use an off-the-shelf verb and nominal SRL model 3 to extract event mention candidates, and then use Amazon Mechanical Turk to annotate the corresponding Wikipedia title of each predicted mention candidate. To ensure the quality of the annotation, we design our annotation process in two rounds:

First round. Annotators first answer whether they think the predicted mention is an event. If they think it is an event, they then need to find the corresponding Wikipedia title; otherwise, they submit "Nil". Each mention is annotated by three annotators. If all of them submit "Nil", we include this event mention as a "Nil" example in the final test data. To prevent annotators from simply submitting "Nil", 10% of the event mentions are relatively easy cases from the Wiki data for which we know the answers. We randomly insert them into the input data on Mechanical Turk (i.e., annotators are unaware of this) to evaluate each annotator's accuracy. Only annotations from annotators with an accuracy higher than 90% are accepted.
Second round. This round verifies the annotated results of the first round. Each mention with its annotated title is verified by another three annotators. They need to read the page and decide whether it introduces the mention. If the majority vote is "yes", we include it in the final test data. Because of majority voting, some annotations that not all annotators agree on are included. The inter-annotator agreement is 63.74 in Fleiss' kappa.

Domain Analysis
Event linking in the news domain is more challenging than in the Wikipedia domain for the following reasons: (i) News articles describe an event at a different granularity than Wikipedia does, usually with more details. For example, here is a piece of news about "Iraq_War": "A contractor working for the American firm Kellogg Brown & Root was wounded in a mortar attack in Baghdad." The event "wounded" here is a very small event within the Iraq War, but it is what daily news reports. On the other hand, an event mention that links to "Iraq_War" in the Wikipedia domain reads: "When touring in Europe, the US went to war in Iraq." The different granularity in representing events makes the task slightly different in the two domains: event linking in the Wikipedia domain is more like event coreference, while event linking in the news domain is mixed with more sub-event relation extraction.
(ii) As analyzed in Section 2, event linking is challenging because some event mentions may only exist in a subsection of the correct page, and the correct title is not consistent because of the event hierarchy. These problems, however, mainly arise in the news domain. First, mentions hyperlinked in Wikipedia usually have their own pages instead of existing only in subsections; in the news domain, by contrast, we annotate events that only exist in subsections of a Wikipedia page. Second, in the Wikipedia domain, the gold title of the same event mention is usually consistent. For example, all the event mentions "drafted" of football players link to "National Football League Draft" instead of the page of the specific player. However, the annotation standard of NYT is not always consistent with Wikipedia hyperlinks. For example, annotators would link event mentions about a sports player's draft to the page of the specific player instead of the general concept page "National Football League Draft". These problems make data annotation and model evaluation in the news domain very challenging.
For the reasons above, we believe that, for some cases in the news domain, the correct answer is a set of titles instead of just one title, and ranking the annotated title in second place may simply mean that the top-ranked one is also correct. To relax the evaluation metric for the news domain, we also report Accuracy@5: if the annotated title is ranked among the top 5 candidates, we count the prediction as correct.
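The relaxed metric is straightforward to compute over ranked candidate lists; a minimal sketch:

```python
def accuracy_at_k(ranked_predictions, gold_titles, k=5):
    """Fraction of mentions whose gold title appears in the top-k candidates.

    ranked_predictions: one candidate-title list per mention, best first.
    gold_titles: the annotated title for each mention.
    Standard accuracy is the special case k=1.
    """
    hits = sum(gold in preds[:k]
               for preds, gold in zip(ranked_predictions, gold_titles))
    return hits / len(gold_titles)
```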

Model
In this section, we propose EVELINK as the first event linking model. We first introduce the representation of event mentions and event titles in Section 5.1, and then introduce the model architecture in Section 5.2.

Event Representation
A key difference between entities and events is that the context of an entity is more diverse than the context of an event. For example, when the entity "China" is mentioned in a sentence, it is unclear what other entities or events are likely to be mentioned together with it. However, if a verb like "invade" is used to express the event "Battle of France" in a sentence, it is very likely that entities like "Germany", "Italy" and "France" will also be mentioned. This shows that an event is defined by its arguments, and these arguments, with a large chance, will also be mentioned in the local context, because the verb itself cannot refer to any specific event. Given this observation, we expect the entities in the local context of the event mention to overlap with the entities in the correct Wikipedia page, so these entities can be used to help the model better represent events. To embed this entity information explicitly into the event representation, we use a method similar to how Vyas and Ballesteros (2021) embed entity attribute information into entity representations.
Event mentions: To represent event mentions in the local context, we first use an off-the-shelf Named Entity Recognition model 4 trained on the 18-type OntoNotes dataset (Weischedel et al., 2013) to extract the entities around the event. We simply define the context window as the 500 characters around the event mention. After predicting all the entities e_i with their types t_i, we represent the event mention as:

r_m = [CLS] ctxt_l [M_s] m [M_e] ctxt_r [SEP] [t_1s] e_1 [t_1e] ... [t_ns] e_n [t_ne] [SEP]

where m, ctxt_l, ctxt_r, and e_i are the tokens of the event mention, the context to the left of the mention, the context to the right of the mention, and the predicted entities, respectively. [M_s] and [M_e] are special tokens that tag the start and end of the event mention, and [t_is] and [t_ie] are special tokens that tag the start and end of an entity whose type is t_i. r_m is the final representation of the event mention.
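The mention representation above can be assembled as a plain string before tokenization. The sketch below is illustrative: the special-token spellings (e.g., "[GPE_s]") are assumptions, and the real model would register them as additional vocabulary items in the BERT tokenizer.

```python
def build_mention_input(ctxt_l, mention, ctxt_r, entities):
    """Assemble the event-mention input sequence described above.

    entities: list of (span_text, type) pairs predicted by NER,
    e.g. [("Boston", "GPE")]. Each entity span is wrapped in
    type-specific start/end markers.
    """
    parts = ["[CLS]", ctxt_l, "[M_s]", mention, "[M_e]", ctxt_r, "[SEP]"]
    for span, etype in entities:
        parts += [f"[{etype}_s]", span, f"[{etype}_e]"]
    parts.append("[SEP]")
    return " ".join(p for p in parts if p)
```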
Title: To represent Wikipedia titles, since important entities are already hyperlinked in the page contents, we take the first ten hyperlinked spans as entities, and represent the title as:

r_t = [CLS] title h_1 ... h_10 [TITLE] description [SEP]

where title, h_i, and description are the tokens of the title, the hyperlinked spans, and the content of the Wikipedia page. We simply take the first 2,000 characters as the description. [TITLE] is a special token separating the title part from the description. r_t is the final representation of the Wikipedia title.

Model Architecture
Similar to Wu et al. (2020), we first use a bi-encoder architecture to efficiently generate candidates, and then a cross-encoder architecture, which requires more computation, to rank the candidates.

Candidate Generation. We use a bi-encoder architecture to train the candidate generation model. We use two independent BERT transformers (Devlin et al., 2019) to encode the representations of event mentions r_m and Wikipedia titles r_t, and use the outputs of the two [CLS] tokens in r_m and r_t as the event mention vector v_m and the title vector v_t. We then maximize the dot product between the event mention vector v_m and the correct title vector v_t in a batch with randomly selected negatives. At inference time, the representations of all titles are cached; for each event mention, we compute the dot products between its representation and the representations of all titles, and the titles with the highest scores become candidates.
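At inference time, candidate generation therefore reduces to a maximum-dot-product search over the cached title vectors. A minimal pure-Python sketch of this scoring step (the real system produces the vectors with BERT and would use an optimized vector index rather than a linear scan):

```python
def top_k_candidates(mention_vec, title_vecs, titles, k=100):
    """Score each cached title vector against one mention vector by dot
    product and return the k highest-scoring titles, best first."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    scored = [(dot(mention_vec, tv), name) for tv, name in zip(title_vecs, titles)]
    scored.sort(key=lambda pair: -pair[0])
    return [name for _, name in scored[:k]]
```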
Candidate Ranking. For each event mention, we take the top 30 candidates from the candidate generation model as the training data for the ranking model, and use a cross-encoder architecture to train it. We concatenate the representations of the event mention r_m and a title r_t, use one BERT transformer to encode the concatenated representation, and use the output of the [CLS] token as the final vector v. We then score each candidate by the dot product between its vector v and an additional linear layer W, and maximize the score of the correct title.
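The ranking step can be sketched as follows. The `encode` function stands in for the BERT cross-encoder (mapping the concatenated mention-title text to its [CLS] vector), and `W` is the weight vector of the final linear layer; both names are illustrative assumptions.

```python
def rank_candidates(mention_repr, candidate_reprs, encode, W):
    """Cross-encoder ranking sketch: jointly encode the mention with each
    candidate title representation, score the [CLS] vector against a
    linear layer W, and sort candidates by descending score."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    scored = [(dot(encode(mention_repr + " " + c), W), c) for c in candidate_reprs]
    scored.sort(key=lambda pair: -pair[0])
    return [c for _, c in scored]
```

Note that, unlike the bi-encoder, every mention-candidate pair requires a full transformer forward pass here, which is why ranking is restricted to the 30 generated candidates.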

Experiments
In this section, we evaluate the in-domain performance on the Wiki test set and the out-of-domain performance on the NYT test set, and conduct an error analysis. Implementation details are in Appendix A.
Baselines. Since there is no existing event linking system, we compare with previous entity linking systems. In this paper, we mainly compare our system with two SOTA entity linking models, BLINK (Wu et al., 2020) and GENRE (Cao et al., 2021). To make a fair comparison, BLINK and GENRE have the following two setups: BLINK/GENRE-Entity: Since a large portion of event mentions are nominals, which are also a kind of entity, it is interesting to see how a SOTA entity linking system performs on event linking. Therefore, we directly test the BLINK/GENRE models pretrained for entity linking. Please note that the entity linking training data contains 9 million examples, which is much larger than our event linking training data (66k).
BLINK/GENRE-Event: It adopts the same algorithm as the original BLINK/GENRE system, but is trained on our event linking training set.
For all the experiments, BLINK-Entity retrieves 10 candidates from candidate generation, and both BLINK-Event and EVELINK retrieve 100 candidates. These numbers are tuned on the dev data. GENRE is a generation model, which does not use the same pipeline of candidate generation and ranking; we follow its original setting and use beam search with 5 beams.
Besides SOTA entity linking systems, we also evaluate the performance of BM25, GloVe vector cosine similarity between event mentions and titles (Pennington et al., 2014), and a prior distribution. Because event mentions are limited in Wikipedia, to fairly estimate the prior distribution of the event titles, we only evaluate event mentions that appear at least 10 times in Wikipedia.
In-domain experiment on Wikipedia. We evaluate EVELINK on the Wikipedia test set as the in-domain performance. We report the recall of candidate generation in Table 3, and the accuracy of candidate ranking in Table 4. As shown in Table 3 and Table 4, EVELINK outperforms the baseline models by a large margin: 6.94 points in recall and 6.38 points in accuracy. EVELINK also achieves high performance on seen verbs and easy nominals (around 90 accuracy), but relatively low performance on the other hard cases, which leaves large room for future work.
Out-of-domain experiment on News. We evaluate EVELINK on the NYT test set as the out-of-domain performance. In Table 5 and Table 6, we evaluate accuracy on the event mentions that exist in Wikipedia, the same setting as the experiments in the Wikipedia domain, and the accuracy drops significantly, from 82.87 to 15.73. Even if we accept 5 predictions instead of just one to address the multiple-correct-titles problem, the Accuracy@5 is 29.13, which is still low. A detailed error analysis is in Section 6. In Table 7, we evaluate the accuracy on all event mentions, including Nil. Because we have no Nil examples in the training and development data, we simply predict "Nil" for all event mentions whose prediction probability is lower than 50%, and leave better solutions to future work. GENRE is not tested on Nil mentions because it is unclear how to obtain its prediction probability.
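The Nil back-off described above amounts to thresholding the top prediction's probability; a minimal sketch, using the 50% cut-off stated in the text:

```python
def predict_with_nil(ranked_titles, probabilities, threshold=0.5):
    """Return the top-ranked title, or "Nil" when the model's confidence
    in it falls below the threshold (the paper simply uses 50%)."""
    if not ranked_titles or probabilities[0] < threshold:
        return "Nil"
    return ranked_titles[0]
```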
Analysis. We investigate the following questions: Q1: Where does the gain come from, compared with the BLINK system? We conduct an ablation study in Table 8. Explicitly adding entities to the event representation boosts the performance by 10.14 accuracy points on the Wiki test data and 2.73 on the NYT data. Adding entity types further improves the performance by 1.48 accuracy points on Wiki and 4.03 on NYT. Q2: What are the error patterns of EVELINK? We collect several error patterns that are common to both domains, and patterns that appear mostly in the news domain. Error patterns in both domains: (i) Repeating events. Among the errors, we find many repeating events, like award ceremonies or sports games, that happen every few years, and the model usually cannot find the correct year of the event if the year is not explicitly mentioned in the context. For example: "In 1995, his debut season, Biddiscombe made two appearances, ... The following year he earned a Rising Star nomination for his performance ..." In this example, the gold event is "1996 AFL Rising Star", and the prediction is "1998 AFL Rising Star", even though there is a temporal hint (the year following 1995 is 1996) indicating that the correct answer should be the award in 1996. There are many similar errors when linking awards or games, which shows that deeper temporal understanding is necessary for future work.
(ii) Unrelated context. EVELINK relies on the surrounding entities to link event mentions; however, the context is not always related, and the surrounding entities then cannot help the linking. For example: "Returning to his country at the end of the conflict and another begun, Barinaga rejected an offer from Athletic Bilbao, moving to Real Madrid instead."
In this example, the gold event is "World War II", but the prediction is "1939-40 La Liga". All the entities, like "Barinaga", "Athletic Bilbao" and "Real Madrid", are about football, which is unrelated to the war. To link to the correct page, the model needs to know about the second conflict of Barinaga's country, which indicates that the local context alone may not be enough. Error patterns specific to the news domain: (i) Subsection events. Some events do not have their own pages and are only introduced in subsections of other pages. For example: "The Philippine government lifted its five-year ban on the return of Imelda Marcos today and said the widow of the late President Ferdinand Marcos was free to come home from exile in the United States."
In this example, the return of Imelda Marcos is introduced in the subsection "Return from exile (1991-present)" of the page "Imelda Marcos". However, we only use the first 2,000 characters of the page contents to represent the title "Imelda Marcos", which contain no information about the return from exile. A document-level representation may be a potential solution for future work.
(ii) Sub-events. Some events are sub-events of other, larger events. For example: "Stepping in at the 11th hour, Hillary Rodham Clinton will campaign in Florida on Saturday for her brother, Hugh Rodham, in his bid for a United States Senate seat."
This event is a sub-event of "1994 United States Senate election in Florida", which has different event arguments, so the names in the local context do not overlap with the names on the page.
In this work, we discuss many challenges of the task in different domains, but EVELINK cannot address all of them. We leave them to future work.

Conclusion
In this work, we formulate Event Linking, a challenging but essential task, with a carefully designed Wikipedia dataset and an NYT test set, and propose an event linking model, EVELINK, as a baseline for future work.

Limitations
In this section, we discuss limitations of our work.
• We only focus on event linking to English Wikipedia in this work. We leave multilingual event linking to future works.
• The performance of EVELINK on hard cases is still low, for example, on events that only exist in a subsection of a Wikipedia page.
• In this work, we simply predict all mentions with a prediction probability lower than 50% as "Nil". We leave better solutions to future work.