Discovering Black Lives Matter Events in the United States: Shared Task 3, CASE 2021

Evaluations of how well state-of-the-art event detection systems capture the spatio-temporal distribution of events on the ground are performed infrequently. Yet the ability to both (1) extract events “in the wild” from text and (2) properly evaluate event detection systems has the potential to support a wide variety of tasks, such as monitoring the activity of socio-political movements, examining media coverage and public support of these movements, and informing policy decisions. Therefore, we study the performance of the best event detection systems on detecting Black Lives Matter (BLM) events from tweets and news articles. The murder of George Floyd, an unarmed Black man, at the hands of police officers received global attention throughout the second half of 2020. Protests against police violence emerged worldwide, and the BLM movement, which was once mostly relegated to the United States, was now seeing activity globally. This shared task asks participants to identify BLM-related events from large unstructured data sources, using systems pretrained to extract socio-political events from text. We evaluate several metrics, assessing each system’s ability to identify protest events both temporally and spatially. Results show that identifying daily protest counts is an easier task than classifying spatial and temporal protest trends simultaneously, with maximum performance of 0.745 and 0.210 (Pearson r), respectively. Additionally, all baselines and participant systems suffered from low recall, with a maximum recall of 5.08.


Introduction
Typically, performance evaluations of automated event coding engines are carried out with respect to benchmarks made of annotated linguistic units (e.g., clause, sentence, or document). While this is crucial in order to factorize the individual linguistic subtasks composing the event extraction process, it does not estimate the overall usability of machine-coded event data sets for micro-level modelling of social processes, particularly in the domain of socio-political and armed conflict, where spatial analysis has become standard.
The complex dynamics of the Black Lives Matter movement and its varied coverage by news outlets and social media make it a particularly relevant use case for assessing the capability of automated Event Extraction systems to model socio-political processes. Task 3, "Discovering Black Lives Matter Events" 1, organized in the context of the Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE) 2021 workshop, aims to do so by challenging Event Extraction (EE) engines to extract a collection of protest events from two heterogeneous text collections (i.e., news and social media) and then measuring a number of spatio-temporal correlation coefficients against a curated Gold Standard data set of protest incidents from the BLM movement.
During May and June of 2020, protests occurred across the globe in response to the murder of George Floyd, an unarmed Black man, by Derek Chauvin, a white police officer. In the U.S., the number of locations holding demonstrations related to this murder exceeded that of any other demonstration in U.S. history (Putnam et al., 2020). These events were more often than not associated with the Black Lives Matter (BLM) movement, either (1) directly through organizing or (2) indirectly through the slogan "Black Lives Matter" or shared political agendas such as police abolition and protests against police violence towards Black communities. Since its inception in 2013, the Black Lives Matter movement, a loose network of affiliated organizations, has organized demonstrations around a large number of police shootings and killings and sought to raise awareness of systematic violence against Black communities. While support for Black Lives Matter has varied over its lifetime (Horowitz, 2020), the work done over the past years laid the foundation for the global response seen in the wake of George Floyd's murder.
This task is the third in a series of tasks at the CASE 2021 workshop (Hürriyetoglu et al., 2021b). The first task is concerned with protest news detection at multiple text resolutions (e.g., the document and sentence level) and in multiple languages: English, Hindi, Portuguese, and Spanish (Hürriyetoglu et al., 2021a). Teams that participated in Task 1 were invited to participate in this third task: "Discovering Black Lives Matter Events in the United States". This is an evaluation-only task, where all models are (1) trained on the data supplied in Task 1, (2) applied to the news and social media data (i.e., New York Times and Twitter data), and (3) evaluated on a manually curated Gold Standard BLM protest event list. Each team's system is compared to simple baselines in order to properly evaluate its accuracy.

Related Work
Summary measures such as precision, recall, and F1 are limited in their capacity to inform about the quality of the predictions of an automated system (Derczynski, 2016; Yacouby and Axman, 2020). Moreover, evaluating the capabilities of a system for detecting socio-political events from text requires additional metrics, such as the spatio-temporal correlation between the system output and the actual distribution of the events (Wang et al., 2016; Althaus et al., 2021).
Several studies have focused on assessing the correlation of machine-coded event data sets with Gold Standards based on disaggregated event counts, for example Ward et al. (2013) and Schrodt and Analytics (2015). Hammond and Weidmann (2014) applied disaggregation of event incidents across PRIO-GRID geographical cells (Tollefsen et al., 2012) to assess how well the Global Database of Events, Language and Tone (GDELT) data approximate the spatio-temporal pattern of conflicts. Zavarella et al. (2020) adapted this method to administrative units for measuring the impact of event de-duplication on increasing correlation with the Armed Conflict Location and Event Data (ACLED) data sets for a number of conflicts in Africa. In this report, we describe an evaluation task, which we refer to as Task 3, and provide a detailed analysis of the capabilities of the best performing systems on Task 1 (Hürriyetoglu et al., 2021a) in this respect. We believe this effort will shed light on system performance beyond precision, recall, and F1.

Data
The goal of this task is to evaluate the performance of automatic event detection systems on modeling the spatial and temporal pattern of a social protest movement. We evaluate the capability of participant systems to reproduce a manually curated BLM-related protest event data set, by detecting BLM event reports, enriched with location and date attributes, from a news corpus collection, a Twitter collection, and from the union of the two.

Training Data
Because this task is a usability analysis, no training data were provided. The event definition applied for coding the reference event data set is the same as the one adopted for Shared Task 1 (Hürriyetoglu et al., 2021a), and any data utilized for Task 1 and Task 2, such as the one from Hürriyetoglu et al. (2021), or any additional data could be used to build a system or model to run on the input data.

Input Data
We provide two types of input data. The first is a generic, not topic-filtered collection of all news items (Title and Lead Paragraph) from the New York Times for the target time range May 25th - June 30th. The second is a collection of Black Lives Matter related tweets (Giorgi et al., 2020).

New York Times
The New York Times (NYT) data set consists of 5,347 articles published between May 25 and June 30, 2020. The data associated with each article include the published date, print headline, lead paragraph, web URL, authors, and an abstract, among other metadata. This is a general set of NYT articles (i.e., articles may or may not be related to BLM), unlike the Twitter data set, which only contains tweets related to BLM or counter protests (e.g., All Lives Matter and Blue Lives Matter).

Twitter
We used an open source data set of tweets containing keywords related to Black Lives Matter and the counter protests: All Lives Matter and Blue Lives Matter. While this data set contains tweets dating back to the origins of the Black Lives Matter movement, the tweets used in this task are limited to the date range May 25, 2020 (the date of George Floyd's murder) to June 30, 2020. These tweets were pulled in real time using the Twitter API's keyword matching with the following three keywords: BlackLivesMatter, AllLivesMatter, and BlueLivesMatter. This data set consists of 30,160,837 tweets. Participants were given full access to each tweet's metadata (including the tweet's text), which could include URLs, location information, and dates.

Gold Standard Data
For the Gold Standard data (i.e., the BLM events list we wish to automatically detect) we considered two online sources of Black Lives Matter protest events: Creosote Maps 2 and Race and Policing 3 . Starting with these two data sets, we first checked whether the source URL link was still active. If not, we referenced other data sets for the event in question: Wikipedia (a list of George Floyd protests in and outside of the U.S.) and the New York Times. If a valid article matching the protest date and location was not found, we performed a Google search for the specific event. If still nothing was found, the event was removed from the data set. If at any point we discovered a valid URL for the event, we ran a validation check. This check asked: (1) is the source a tweet or Facebook post; (2) does the source describe an upcoming event; (3) is the source irrelevant to the protest at the location; (4) does the source have enough information; and (5) is the source not accessible because of a paywall. If the source passed this check, we then scraped the source for the publication date and for days of the week mentioned in the article text. If the publication date and the day of the week did not match, we inferred the date of the protest from the mention of the day of the week closest to the publication date. Finally, we manually checked the scraped or inferred dates and recorded the result as the event date.
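The date inference step can be illustrated with a minimal sketch. It assumes that "closest to the publication date" means the most recent date on or before publication that falls on the mentioned weekday; the function name `infer_event_date` and this backward-looking rule are illustrative assumptions, not a description of the organizers' scraper.

```python
from datetime import date, timedelta

# Weekday names in the order used by date.weekday() (Monday == 0).
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def infer_event_date(publication_date: date, mentioned_weekday: str) -> date:
    """Return the latest date on or before publication that is the given weekday.

    Assumption: articles report on protests that already happened, so we only
    look backwards from the publication date (at most six days).
    """
    target = WEEKDAYS.index(mentioned_weekday.strip().lower())
    days_back = (publication_date.weekday() - target) % 7
    return publication_date - timedelta(days=days_back)


# An article published on Wednesday, June 3, 2020 that mentions "Friday"
# is mapped to the preceding Friday.
event_day = infer_event_date(date(2020, 6, 3), "Friday")
```

In the procedure described above, dates obtained this way were still checked manually before being recorded.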
In the end, this produced 3,463 distinct U.S. events between May 25 and June 30, 2020 with date, city, and state information. Of these events, only 537 (approximately 15% of the events) occurred after the first week of June. To compensate for the lack of coverage across all of June, we used the open source data set from the Crowd Counting Consortium (CCC) 4 . From our original data set of 3,463 events, 754 events also occurred in the CCC data, matching on (1) URL or (2) both date and city. We then combined the two data sets (i.e., the CCC events with our original list) and removed duplicates. This resulted in 7,976 protest events in our final Gold Standard data. The U.S. map in Figure 1 shows the spatial distribution of these events (yellow dots).
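The matching rule used for combining the two lists (same URL, or same date and city) can be sketched as follows. The dictionary keys `url`, `date`, and `city`, the case-insensitive city comparison, and the function name `merge_event_lists` are assumptions made for illustration only.

```python
def merge_event_lists(original, ccc):
    """Append CCC events that match no already-kept event.

    Two events are treated as the same protest if they share a URL or share
    both date and city (hypothetical field names, case-insensitive city).
    """
    seen_urls = {e["url"] for e in original if e.get("url")}
    seen_date_city = {(e["date"], e["city"].lower()) for e in original}
    merged = list(original)
    for event in ccc:
        key = (event["date"], event["city"].lower())
        same_url = bool(event.get("url")) and event["url"] in seen_urls
        if same_url or key in seen_date_city:
            continue  # duplicate of an event that is already in the list
        merged.append(event)
        # Register the new event so later CCC records are compared against it.
        if event.get("url"):
            seen_urls.add(event["url"])
        seen_date_city.add(key)
    return merged


original_events = [{"url": "u1", "date": "2020-05-30", "city": "Austin"}]
ccc_events = [
    {"url": "u1", "date": "2020-05-31", "city": "Dallas"},   # same URL
    {"url": "u2", "date": "2020-05-30", "city": "austin"},   # same date and city
    {"url": "u3", "date": "2020-06-10", "city": "Austin"},   # new event
]
gold = merge_event_lists(original_events, ccc_events)
```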

Evaluation
System performance is evaluated by computing correlation coefficients on event counts aggregated on cell-days, using uniform grid cells with sides of approximately 55 kilometers from the PRIO-GRID data set (Tollefsen et al., 2012). We use these analytical measures as a proxy for the spatio-temporal pattern of the BLM protest movement.
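As a rough illustration of cell-day aggregation, the sketch below bins event coordinates into a regular 0.5-degree grid, which is about 55 km per side near the equator. The `(row, col)` index is a simplified stand-in rather than the official PRIO-GRID cell identifier, and the actual evaluation joined events with PRIO-GRID shapefiles; all function names here are hypothetical.

```python
import math
from collections import Counter
from datetime import date

CELL_SIZE_DEG = 0.5  # assumed cell resolution in decimal degrees


def grid_cell(lat, lon, size=CELL_SIZE_DEG):
    """Map a coordinate pair to a (row, col) index on a regular global grid."""
    row = math.floor((lat + 90.0) / size)
    col = math.floor((lon + 180.0) / size)
    return (row, col)


def cell_day_counts(events):
    """Count events per (cell, day); `events` yields (day, lat, lon) tuples."""
    counts = Counter()
    for day, lat, lon in events:
        counts[(grid_cell(lat, lon), day)] += 1
    return counts


def daily_active_cells(counts):
    """Number of cells with at least one event, per day."""
    active = Counter()
    for (_cell, day) in counts:
        active[day] += 1
    return active


events = [
    (date(2020, 5, 30), 44.9778, -93.2650),  # Minneapolis
    (date(2020, 5, 30), 44.9800, -93.2700),  # Minneapolis, same cell
    (date(2020, 5, 30), 40.7128, -74.0060),  # New York City
]
counts = cell_day_counts(events)
active = daily_active_cells(counts)
```

The per-cell-day counts feed the spatial and temporal analysis, while the per-day number of active cells feeds the temporal analysis.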

Data Normalization
In order to be joined with PRIO-GRID shapefiles, the string-like location information in the system output data had to be normalized to coordinate pairs. To do this we used the OpenStreetMap Nominatim search API 5 . For structured location name representations (i.e., city, state, country) we used a parametric search; otherwise we used free-form query strings. We note that geographical coordinate conversion from Nominatim places the event at the geographical centroid of the polygon of the assigned administrative unit. In our evaluation, we discarded the system output event records with no source location information or whose string-like location attribute returned null results from the Nominatim API.
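The distinction between parametric and free-form queries can be sketched by building the request URLs without sending them. The endpoint and the `city`, `state`, `country`, `q`, `format`, and `limit` parameters are part of the public Nominatim search API; the helper name `build_search_url` and the choice of `limit=1` are assumptions for this example, not the organizers' code.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(city=None, state=None, country=None, free_text=None):
    """Build a Nominatim search URL, or return None if there is no location.

    Structured fields take precedence (parametric search); a free-form
    string is used only when no structured field is available.
    """
    structured = {"city": city, "state": state, "country": country}
    params = {name: value for name, value in structured.items() if value}
    if not params:
        if not free_text:
            return None  # record without location information is discarded
        params = {"q": free_text}
    params["format"] = "json"
    params["limit"] = 1  # assumption: keep only the top-ranked match
    return NOMINATIM_SEARCH + "?" + urlencode(params)


structured_url = build_search_url(city="Minneapolis", state="Minnesota",
                                  country="United States")
free_form_url = build_search_url(free_text="Brooklyn Bridge, New York")
```

A record would also be dropped when the JSON response for its URL is an empty list, which corresponds to the null results mentioned above.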

Metrics
We use the cell-day counts for two different analyses: the correlation with the total daily "protest cell" counts (i.e., time trends alone) and the event counts for each cell-day (i.e., spatial and temporal trends together).
Temporal Trends The first analysis considers only the total number of "activated" cells (i.e., cells for which at least one Protest event was recorded) per day, in the system output and the Gold Standard data set. This time series analysis is sufficient to estimate how well the automatic systems capture the time trends of the protest movement. However, it does not measure the accuracy of system data in estimating the spatial variation of the target process.
Spatial and Temporal Trends For this purpose, we also measure the correlation coefficients on the absolute event counts with respect to the Gold Standard, over each single cell-day.
For both analyses, we use two types of correlation coefficients to assess the relationship between the variables: the Pearson coefficient r and Spearman's rank correlation coefficient ρ. Moreover, we use Root Mean Squared Error (RMSE) to measure the absolute value of the error in estimating cell/event counts from the Gold Standard.
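The three measures can be written out in a few lines of plain Python. This is a minimal sketch of the standard definitions, assuming tied values receive average ranks for Spearman's ρ and that missing cell-days count as zero; the helper names are hypothetical and this is not the evaluation script used for the task.

```python
import math


def aligned_counts(system, gold, keys):
    """Turn two count dictionaries into equal-length vectors (missing = 0)."""
    return [system.get(k, 0) for k in keys], [gold.get(k, 0) for k in keys]


def pearson_r(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(x), average_ranks(y))


def rmse(x, y):
    """Root mean squared error between two equal-length count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

For the temporal analysis the vectors hold one value per day (number of activated cells); for the spatial and temporal analysis they hold one value per cell-day (number of events).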

Baseline
As a baseline, we used the output from NEXUS, a state-of-the-art engine for event detection from news (Tanev et al., 2008) that has been used in the area of security and disaster management 6 . We denote this system as Baseline throughout. NEXUS is based on a blend of rule-based cascaded grammars for detecting event slots (i.e., perpetrator, various types of affected people, infrastructure and vehicle targets, and weapons used), and a combination of keyword-based and statistical classifiers for detecting event classes. The dictionaries underlying the extraction grammars of the system have been learned using weakly supervised lexical learning on generic news corpora. No learning was performed on domain corpora about protest movements or related themes. Details on the full NEXUS taxonomy of event categories can be found in Atkinson et al. (2017). For this task, we keep only the events belonging to the following type set: Disorder/Protest/Mutiny, Boycott/Strike, Public Demonstration, Riot/Turmoil, Sabotage/Impede, Mutiny. NEXUS performs event geocoding by (1) matching populated place names from the GeoNames gazetteer 7 in the news item; (2) resolving them into unique location entities via disambiguation heuristics (Pouliquen et al., 2006); and (3) selecting a single main event location based on the text proximity with the matched event components (see the slots above) in the news article. In order to mitigate the lack of geographical context in the tweet body when processing the Twitter data, we ran NEXUS on an enriched text, which included the string value of the full name field in the Place child object of the tweet, whenever that was available 8 . This resulted in a small fraction of 32,085 tweets with geographical information (out of the roughly 30 million tweets originally sampled). For the sake of comparison, we shared this subset of tweets with participants, together with the assigned location.

Nexus Deduplication
This system, developed by the Task organizers and denoted NexusDdpl, is an extension of the Baseline system in which event deduplication has been integrated as a post-processing module. The algorithm uses two metrics: the geographical distance between two event points and the semantic distance between them. The semantic distance is computed using the cosine between the projections of the sentence embeddings of the texts of the event records. LASER embeddings (Schwenk and Douze, 2017) were used for that purpose. Twitter data were cleaned of hashtags, URLs, and account names, as these have a negative impact on the semantic similarity measure. To be considered duplicates, two events must have both distance measures under fixed thresholds, which were set to 2 km for spatial distance, 0.20 for semantic distance on NYT data, and 0.30 for semantic distance on Twitter data. The thresholds differ between data sets because Twitter data are noisier than NYT data, with higher variation in text size and style when describing a single event. As such, a looser threshold was required. When applying the algorithm to the combination of both data sets, a compromise threshold of 0.35 was used.
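The two-threshold duplicate test can be sketched as below. The sketch assumes great-circle (haversine) distance for the spatial metric, cosine distance (one minus cosine similarity) for the semantic metric, and a greedy pass that keeps the first event of each duplicate group; the embeddings are passed in as plain vectors, so no LASER call is made. The field names and the greedy strategy are illustrative assumptions, not the NexusDdpl implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def is_duplicate(e1, e2, max_km=2.0, max_semantic=0.20):
    """Both distances must fall under their thresholds (NYT values by default)."""
    near = haversine_km(e1["lat"], e1["lon"], e2["lat"], e2["lon"]) < max_km
    similar = cosine_distance(e1["embedding"], e2["embedding"]) < max_semantic
    return near and similar


def deduplicate(events, max_km=2.0, max_semantic=0.20):
    """Greedy pass: keep an event only if it duplicates no event kept so far."""
    kept = []
    for event in events:
        if not any(is_duplicate(event, k, max_km, max_semantic) for k in kept):
            kept.append(event)
    return kept


records = [
    {"lat": 44.98, "lon": -93.26, "embedding": [1.0, 0.0]},
    {"lat": 44.98, "lon": -93.26, "embedding": [0.99, 0.1]},  # near-identical text
    {"lat": 44.98, "lon": -93.26, "embedding": [0.0, 1.0]},   # different text
    {"lat": 45.98, "lon": -93.26, "embedding": [1.0, 0.0]},   # about 111 km away
]
unique_records = deduplicate(records)
```

Passing `max_semantic=0.30` would correspond to the looser Twitter setting described above.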

Team Systems
Four teams participated in this event: DaDeFrNi, EventMiner, Handshakes, and NoConflict. We briefly describe the systems below and ask the reader to refer to their system papers for additional details.
DaDeFrNi This team considered two slightly different procedures for this task. For the NYT data set, they first extracted geo-entities from each article using the Python library geography, which was used to classify each entity into one of three categories: "city", "country", and "region". For cases where an article contained the name of a city but did not provide any region or country reference, DaDeFrNi retrieved the necessary information by checking the city name against a worldwide cities database. When the name of a city was associated with several locations, they selected the city with the highest population, along with its corresponding "region" and "country". For the Twitter data set, given the large size of the data, the above procedure was computationally expensive. Thus, they used the Python library spaCy (Honnibal et al., 2020) to retrieve NER/GPE entities, given its much smaller computational cost. The complete system details can be found in Ignazio Re et al. (2021).
EventMiner Team EventMiner's approach for Task 3 is mainly based on transformer models (Hettiarachchi et al., 2021). This approach involved three steps: (1) event document identification, (2) location detail extraction, and (3) event filtering to identify the spatial and temporal pattern of the targeted social protest movement. Event documents are identified using the winning solution submitted to CASE 2021 Task 1-Subtask 1: event document classification (Hettiarachchi et al., 2021). Next, the location details in tweets that describe events are extracted. Since this team only focused on the Twitter corpus, they used tweet metadata to extract location details. However, since the majority of the tweets are not geotagged, they also used a NER approach to extract the location details mentioned in the text. For NER, a transformer model is fine-tuned for token classification using the data set released with the WNUT 2017 Shared Task on Novel and Emerging Entity Recognition (Derczynski et al., 2017). The BERTweet model is used since it is pretrained on tweets (Nguyen et al., 2020). To convert the location details into a unique format and fill in the missing details (e.g., region, country), locations are geocoded using the GeoPy library 9 . For the final step, event tweets with location details are grouped based on their creation dates and locations, and the groups with fewer tweets are removed, assuming that important events generate a high number of tweets. Three systems were submitted. For the first system, denoted by †, only the new events are included (i.e., events with locations that were identified on the previous day are removed). The second system, † †, includes all the extracted events (i.e., no filtering as in †). Finally, the third system, † † †, further filters the events from † to include U.S. events only. Please see Hettiarachchi et al. (2021) for more details.
Handshakes This model is a pretrained XLM-RoBERTa model, fine-tuned on the multi-language article data from Task 1 Subtask 1 and sentence data from Subtask 2, with a classification head that predicts whether the input text describes a protest or not. The team made use of the location data provided in the data sets, where available. Please see Kalyan et al. (2021) for further details.
NoConflict Team NoConflict used their protest event sentence classification model from the winning submission of the English version of Task 1 Subtask 2. Their model is based on a RoBERTa (Liu et al., 2019) backbone with a second pretraining stage (Gururangan et al., 2020) done on the POLUSA data set (Gebhard and Hamborg, 2020) before being fine-tuned on Subtask 2 data. For the NYT data set, they first filtered the articles based on the section name. They then ran their model on the abstract of each article to identify those containing protest events. For each remaining article, they ran a transformer-based (Vaswani et al., 2017) named entity recognition model from spaCy (Honnibal et al., 2020) to identify the location and date of the events. They converted the location to an absolute location using the Geocoder library and converted the date of the event to an absolute date based on the article's publication date. If the relative location or date was unavailable, they defaulted to those included in the metadata. The event sentence classification system details can be found in Hu and Stoehr (2021). Three systems were submitted for the NYT data, denoted , , and . Each system used a set of manually curated keywords applied to different parts of each data point. These rules are included in the Appendix. For the Twitter data set, Team NoConflict ran their model on the full text of each tweet to identify protest events. For each potential event tweet, they identified the location and time based on the metadata of the tweet itself and of the main tweet if it is a retweet.

Results
Table 1 shows the Pearson r, Spearman correlation coefficient ρ, and Root Mean Squared Error (RMSE) for the total daily protest cell counts of the Baseline and participant systems, over the 35-day target time range. When a run for both source types exists for a system, we also evaluate the union of the two event sets (noted as "Merged" in the Tables). Here, the correlations are between the total number of cells per day where the system found an event and the number of cells where an event happened according to the Gold Standard (i.e., temporal patterns and not spatial patterns). These correlation measures are tolerant to errors in geocoding (as long as the events are located in the U.S.) and evaluate the capability of the system to detect protest events in the news and social media, independent of their location. We see the following: (1) NoConflict surpasses the Baseline with the NYT, Twitter, and Merged data in both Pearson r and Spearman ρ, and (2) EventMiner and HandShakes surpass the Baseline with Twitter data in Pearson r (both systems have lower Spearman ρ than the Baseline). Additionally, NoConflict surpasses the NexusDdpl system (using NYT, Twitter, and Merged data), and the HandShakes system surpasses the NexusDdpl system using Twitter data.
Table 2 reports Pearson r, Spearman correlation coefficient ρ, and Root Mean Squared Error (RMSE) over cell-day event counts of the Baseline and participant systems with respect to the Gold Standard, for the 35-day time range. Here the variables range over the whole set of PRIO-GRID cells included in the U.S. territory and thus show the correlation of event numbers across geo-cells, evaluating the systems' geolocation capabilities. NoConflict (NYT ) had the highest Pearson r and lowest RMSE across all systems, as well as the highest Spearman ρ (with the Merged data). Using Twitter data alone, the Baseline and NexusDdpl systems outperformed all others in terms of Pearson r, although NexusDdpl had the higher Spearman ρ of the two. When looking at both correlation metrics simultaneously, no system is above the NexusDdpl baseline.
In Figure 2 we plot the time series of total daily protest cells for the best performing instance of each system on New York Times (left) and Twitter (right) data, respectively. We see that the systems evaluated on the NYT data fail to pick up both the variation in the temporal patterns (i.e., a large number of protests in late May and early June, which gradually declines with weekly spikes) and the magnitude of the events (i.e., most systems pick up fewer than 100 events per day). Systems evaluated on Twitter data pick up more events in late May and early June, but still fail to capture the magnitude of the events.
A more lenient representation of the agreement with the Gold Standard is shown in Table 3. Here we report the confusion matrix between grid cells that the Gold Standard and system runs code as experiencing at least one protest event. Only a few of the cells classified as Protest by the Gold Standard are detected by the automatic systems, which, on the other hand, incorrectly classify several additional cells as Protest.
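The cell-level confusion matrix and the derived precision, recall, and F1 can be sketched with set operations. The function name `cell_confusion`, the representation of cells as hashable identifiers, and the `n_cells` argument (total number of grid cells considered) are assumptions for illustration; the values are returned as fractions rather than percentages.

```python
def cell_confusion(gold_cells, system_cells, n_cells):
    """Confusion counts and summary scores over sets of activated grid cells."""
    gold, system = set(gold_cells), set(system_cells)
    tp = len(gold & system)   # cells active in both
    fp = len(system - gold)   # cells active only in the system output
    fn = len(gold - system)   # cells active only in the Gold Standard
    tn = n_cells - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


gold_cells = {(1, 1), (1, 2), (2, 2), (3, 3)}
system_cells = {(1, 1), (9, 9)}
scores = cell_confusion(gold_cells, system_cells, n_cells=100)
```

In this toy example one of four Gold Standard cells is recovered, which mirrors the pattern of reasonable precision and low recall described above.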

Conclusions
The goal of the "Discovering Black Lives Matter Events" Shared Task was to explore novel performance evaluations of pretrained event detection systems. These systems were applied to large, noisy, multi-modal text data sets (i.e., news articles and social media data) related to a specific protest movement, namely, Black Lives Matter. Thus, the systems are being evaluated out-of-domain in terms of both data type (i.e., the systems are trained on news data and evaluated on both news and social media) and protest movement context (i.e., the training data are not necessarily related to BLM). Systems are evaluated on their ability to identify both events across time and their distribution across space. This evaluation scenario proved difficult for all systems participating in the shared task. A major problem, as shown in Table 3, is the systems' low recall. No system was able to outperform the NexusDdpl baseline in both precision and recall together. The only system that outperformed the baseline in either recall or F1 is DaDeFrNi (Ignazio Re et al., 2021), with a recall of 5.08 and F1 of 8.86. On the other hand, two systems surpass the baseline in precision: EventMiner (Hettiarachchi et al., 2021) and NoConflict (Hu and Stoehr, 2021), with precisions of 56.0 and 73.6, respectively. The low recall at this year's shared task may well be due to the low coverage of protest events of the highly diffused BLM movement in both the NYT and Twitter corpora, so the upper bound of the recall may turn out not to be much higher than the system performance. One possible explanation for this is that a significant part of the BLM events in the Gold Standard are located in small towns, for which the NYT has limited coverage and which, due to their small scale, were also not in the focus of social media. NexusDdpl turned out to score quite high in terms of both event detection accuracy and geocoding correlation.
While no single system outperformed all others in tracking both temporal and spatial trends, NoConflict had a clear advantage (i.e., the highest scoring system in 2 out of 3 metrics) in terms of tracking daily events.
[...] Table 2), so the NexusDdpl system was omitted.
Table 3: Confusion matrix of grid cells experiencing at least one Protest event (true) versus inactive cells (false), for the Gold Standard, Baseline and participant systems. Unless denoted by a superscript, all systems use the "merged" version (i.e., both NYT and Twitter data sets) except for HandShakes system which uses only Twitter data.