Corpus-Level Evaluation for Event QA: The IndiaPoliceEvents Corpus Covering the 2002 Gujarat Violence

Automated event extraction in social science applications often requires corpus-level evaluations: for example, aggregating text predictions across metadata fields, or computing unbiased estimates of recall. We combine corpus-level evaluation requirements with a real-world, social science setting and introduce the IndiaPoliceEvents corpus--all 21,391 sentences from 1,257 English-language Times of India articles about events in the state of Gujarat during March 2002. Our trained annotators read and label every document for mentions of police activity events, allowing for unbiased recall evaluations. In contrast to other datasets with structured event representations, we gather annotations by posing natural questions, and evaluate off-the-shelf models for three different tasks: sentence classification, document ranking, and temporal aggregation of target events. We present baseline results from zero-shot BERT-based models fine-tuned on natural language inference and passage retrieval tasks. Our novel corpus-level evaluations and annotation approach can guide creation of similar social-science-oriented resources in the future.


Introduction
Understanding the actions taken by political actors is at the heart of political science research: How do actors respond to contested elections (Daxecker et al., 2019)? How many people attend protests (Chenoweth and Lewis, 2013)? Which religious groups are engaged in violence (Brathwaite and Park, 2018)? Why do some governments try to prevent anti-minority riots while others do not (Wilkinson, 2006)? In the absence of official records, social scientists often turn to news data to extract the actions of actors and surrounding events. These news-based event datasets are often constructed by hand, requiring large investments of time and money and limiting the number of researchers who can undertake data collection efforts.

* Indicates joint first-authorship.

[Figure 1 caption: B. To answer these questions, domain experts use natural language to define semantic event classes of interest. C. Our INDIAPOLICEEVENTS dataset: humans annotate every sentence in the corpus in order to evaluate whether a system achieves full recall of relevant events. In production, computational models run B's queries to classify or rank sentences or documents, which are aggregated to answer A.]
Automated extraction of political events and actors is already prominent in social science (Schrodt et al., 1994; King and Lowe, 2003; Hanna, 2014; Hammond and Weidmann, 2014; Boschee et al., 2015; Beieler et al., 2016; Osorio and Reyes, 2017) and is increasingly promising given recent gains in information extraction (IE), the automatic conversion of unstructured text to structured datasets (Grishman, 1997; McCallum, 2005; Grishman, 2019). While social scientists and IE researchers have overlapping interests in evaluating event extraction systems, social scientists have particular needs that have so far been under-addressed by the computer science IE research community. Figure 1A shows a common goal of social scientists: answering aggregate substantive questions from corpora such as "Over time, when did police fail to act?", which could be measured by, for example, the daily count of newspapers mentioning the event class over time. For these types of questions, social scientists predominantly want very high recall methods, because the events of interest are often sparse or their substantive conclusions depend on identifying every event in a corpus. 1 In contrast to this corpus-level focus, much of current IE research has focused on distinct subtasks such as entity linking, relation extraction, and coreference resolution. 2 Furthermore, all widely used event datasets (e.g. ACE, FrameNet, ERE, or KBP; Aguilar et al. 2014) are typically curated at the ontology level (attempting to cover a selected set of event types) but pay little attention to the corpus level: annotated documents are not necessarily a substantively meaningful sample of the broader corpora from which they are drawn. We try to address these evaluation shortcomings in this paper.
In addition to corpus-level recall, social scientists are often interested in using off-the-shelf models that are easily extensible to their domain questions. Fortunately, recent NLP research has seen a paradigm shift from structured semantic and event representations (Abend and Rappoport, 2017; Aguilar et al., 2014), which are limited by their predefined schemas, to directly using natural language to encode semantic arguments (QA-SRL; He et al. 2015; Stanovsky et al. 2016; FitzGerald et al. 2018; Roit et al. 2020) and events (Levy et al., 2017; Liu et al., 2020; Du and Cardie, 2020). In this paper, we also use natural language questions to annotate and model the event classes in our dataset, not only facilitating ease of annotation but also allowing for the evaluation of zero-shot natural language inference and information retrieval models for these tasks.

1 In some studies, researchers rely on an assumption that events are missing at random, but others depend on knowing whether an event occurred at least once.

2 The first five Message Understanding Conferences (MUC) required participants to submit complete systems to fill event templates; however, starting with MUC-6 and subsequent ACE and KBP tasks, information extraction was broken into distinct modules (Grishman and Sundheim, 1996; Grishman, 2019).
To address these social science desiderata, we present the INDIAPOLICEEVENTS corpus, 3 which has the following useful properties:

• Social science relevance. Our dataset consists of all 21,391 sentences from all 1,257 Times of India articles about events in the state of Gujarat during March 2002, a period that is of deep interest to political scientists due to widespread Hindu-Muslim violence (Dhattiwala and Biggs, 2012; Berenschot, 2012; Basu, 2015). We focus on the actions of a single entity type, police, because of extensive substantive research on police actions during the Gujarat violence (Varadarajan, 2002). Our choice of location, actors, and event types is motivated by Wilkinson (2006), political science work which created a hand-coded event dataset from newspapers about communal violence events in India from 1950-1995.

• Corpus-level full recall. Unlike most previous event evaluation datasets, our annotators read every document in our corpus (that matches a loose spatiotemporal filter; §4.1). This requires substantially more annotation work than a more targeted filter for selecting documents to annotate (e.g. matching via keywords), but it eliminates a potential source of evaluation bias present in alternative document retrieval data collection approaches (Grossman et al., 2016), and allows for full-recall evaluation of end-to-end event extraction systems.

• Document-level context. Our annotators read the context of an entire document to provide answers for each question on each sentence. We then aggregate these sentence-level answers to make document-level inferences. This allows us to accurately label sentences with anaphora or context-specific meaning.

• Natural language event specification and zero-shot model evaluation. In constructing our dataset, we gather annotations via a natural question-answer format because it allows for easily specifying constraints on arguments (e.g. police being the agent).
Additionally, it allows for specifying event predicates not covered by the ontologies of current structured semantic representations, or involving hard-to-specify semantic phenomena, e.g. "Did police fail to act?", capturing cases where political actors do not take an action, which is very important to political scientists (e.g. Wilkinson (2006)). This format also allows us to evaluate zero-shot natural language inference and information retrieval models.

• High-quality annotators who provide uncertainty explanations. We hire and train political science undergraduate students as annotators to ensure quality control, retraining annotators over a period of several months with training videos, two hour-long live meetings, and individual annotator feedback before producing our final dataset. Our annotators also provide free-text explanations for instances in which they are uncertain about the answer. These rationales are important given the recent attention to propagating annotator uncertainty in downstream NLP tasks (Dumitrache et al., 2018; Paun et al., 2018; Pavlick and Kwiatkowski, 2019; Keith et al., 2020) and social scientists' interest in quantifying uncertainty (King, 1989; Wallach, 2018).
In the remainder of this paper, we use our dataset for three levels of evaluation: sentence-level classification, ranking of documents to reduce manual reading time, and constructing temporal aggregates useful to social scientists ( §3). We describe in detail our annotation and dataset creation process ( §4), provide baseline models ( §5), and evaluate their performance on all three tasks ( §6).

Related Work
NLP and IR for police activity. Natural Language Processing (NLP) and Information Retrieval (IR) have been used for analysis of other police activity, such as identifying victims of police fatalities from news articles (Keith et al., 2017; Nguyen and Nguyen, 2018; Sarwar and Allan, 2019); extracting eyewitness event types from Twitter, including police activity and shootings (Doggett and Cantarero, 2016); detecting dialogue acts in police stops (Prabhakaran et al., 2018); and computational analysis of the degree of respect in police officers' language (Voigt et al., 2017).

Political event extraction. Automated event extraction in social science is generally performed using dictionary methods and a set of substantively motivated event types and actor categories (Schrodt et al., 1994; Gerner et al., 2002; Beieler et al., 2016; Boschee, 2016; Radford, 2016; Brathwaite and Park, 2018; Liang et al., 2018). Other work uses supervised learning to infer events such as conflict or cooperation (Beieler, 2016) and protests (Hanna, 2017). While some have attempted to induce event types without supervision (O'Connor et al., 2013; Huang et al., 2016), most social science applications of event extraction require substantial human input, either through constructing keyword lists or annotating texts to train classifiers.
Recall-focused IR. TREC's total-recall track (Grossman et al., 2016) is inspired by real-world recall-focused applications from law, medicine, and oversight (McDonald et al., 2018). However, the track's datasets are not typically focused on events and assume documents are collected through interacting with a system. Other work has focused on methods for truncating ranked lists that minimize the risk of viewing non-relevant documents (Arampatzis et al., 2009;Lien et al., 2019), but this line of work does not evaluate on semantic retrieval of event classes.

Three Levels of Tasks
In order to answer substantive social science questions, for example "Does variation in party control of state government affect whether police failed to intervene in ethnic conflict?" (Wilkinson, 2006), social scientists often need to gather counts of events (e.g. "police failed to intervene") from text when official government records are lacking. Ideally, a social scientist could use automatic information extraction methods (Cowie and Lehnert, 1996; McCallum, 2005; Grishman, 2019) to transform unstructured text into a structured database that would be useful in a quantitative analysis. Yet even state-of-the-art information extraction systems often give less than perfect accuracy, so social scientists must still manually analyze large portions of their corpus in order to extract events of interest. This quantitative research process motivates the following three tasks, which our dataset can be used to evaluate:

Task 1: Sentence classification. Although social science corpora typically consist of documents, it would be useful for a system to classify sentences that contain events of interest. 4 Highlighting relevant sentences could, for semi-automated systems, reduce a social scientist's reading time, and, for fully-automated systems, provide sentence-level evidence of the automated method's validity, a crucial aspect of research in text-as-data (Grimmer and Stewart, 2013) and the broader social sciences (Drost et al., 2011). INDIAPOLICEEVENTS allows for evaluation of sentence-level precision, recall, and F1 (§6.1).
Task 2: Document ranking. For semi-automated systems, social scientists must navigate the tradeoff between recall and manual reading time. Social scientists may rely on IR methods which present ranked lists of relevant documents (Baeza-Yates et al., 1999; Schütze et al., 2008). However, our informal interviews with social scientists suggest they want to know at what point they have read enough documents to achieve very high (95-100%) recall. In creating INDIAPOLICEEVENTS, annotators read every single sentence in the corpus, which allows for full evaluations of average precision and our newly proposed metric: the proportion of the corpus that would have to be read to achieve Recall=X (PropRead@RecallX) (§6.2). 5

Task 3: Substantive temporal aggregates. For social scientists, NLP and IR methods are used in service of answering substantive questions from text. In addressing our running example "Did differences in party control of state government affect whether police failed to intervene in ethnic conflict?", a social scientist could measure how many news articles 6 discuss "police failing to intervene" each day for a given temporal span. In this setting, it would be helpful to know whether changes in model performance at the sentence or document level result in significant differences at this aggregate level. We design INDIAPOLICEEVENTS with the capability of evaluating these meaningful corpus-level temporal aggregates, such as the mean absolute error and Spearman rank correlation coefficient between per-day event counts of computational models and ground truth annotations (§6.3).
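The PropRead@RecallX metric follows directly from its definition. The following is a minimal illustrative sketch of the metric (not the paper's evaluation code): documents are read in descending model-score order until the target fraction of positives is found.

```python
def prop_read_at_recall(scores, labels, target_recall=0.95):
    """Fraction of a corpus that must be read, in descending score
    order, before `target_recall` of the positive documents are found."""
    total_pos = sum(labels)
    # read documents from highest to lowest model score
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    found = 0
    for num_read, (_, label) in enumerate(ranked, start=1):
        found += label
        if found >= target_recall * total_pos:
            return num_read / len(ranked)
    return 1.0
```

A perfect ranker over a sparse class drives the metric toward the class's prevalence itself; a ranker that puts a single positive document last forces reading the entire corpus.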

Corpus selection
We curate our corpus with a substantively motivated specification: it is restricted to a single authoritative news source, over a defined span of time, with articles that mention one of two locations involved in or related to the 2002 Gujarat violence.

5 We do not address the problem of estimating recall when gold-standard labels are only known for the subset of documents read so far, but INDIAPOLICEEVENTS could be used to evaluate that task in future work.

6 Counts of news articles are often used in social science as a proxy for the true measure of the event, e.g. Nielsen (2013); Chadefaux (2014).
From the website of Times of India, an English language newspaper of record in India, we first download all news articles published in March 2002. 7 During this period, widespread communal violence occurred in India, following the death of 59 Hindu pilgrims in a train fire in the state of Gujarat. In the subsequent months, reprisal attacks were directed at mostly Muslim victims across the state (Human Rights Watch, 2002;Subramanian, 2007). In creating our annotations, we specifically focus on the actions of police during these events, since a large body of evidence points to the importance of police intervention and non-intervention in quelling or permitting ethnic violence (Human Rights Watch, 2002;Wilkinson, 2006;Subramanian, 2007). We focus on the first month of the violence in order to fit within our annotation budget. This month saw the greatest levels of violence, though violence continued for a period of months afterward.
Our final corpus consists of the subset of scraped documents published in March 2002 that include either the name of the state (Gujarat) or a city related to the beginning of the violence (Ayodhya). 8 Selecting on geographical and temporal metadata is a high-recall way to filter the corpus without biasing the dataset by filtering on topic- or event-related keywords, thus giving a better view of the true recall of an event extraction method.
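The spatiotemporal filter described above can be sketched as a simple predicate over article metadata; the `date` and `text` field names and the ISO date format are illustrative assumptions, not the paper's actual pipeline.

```python
def passes_filter(article):
    """Keep articles published in March 2002 that mention the state
    (Gujarat) or the city tied to the start of the violence (Ayodhya).
    Field names and date format are assumptions for illustration."""
    text = article["text"].lower()
    return (article["date"].startswith("2002-03")
            and ("gujarat" in text or "ayodhya" in text))
```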

Annotations via natural language
To collect annotations, we give annotators an entire document for context, and then ask them natural language questions about semantic event classes anchored on the actions of police for each sentence in that document:

• KILL: "Did police kill someone?" Lethal police violence is an important subject for social scientists (Subramanian, 2007). Example sentence: "In Vadodara, one person was killed in police firing on a mob in the Fatehganj area."

• ARREST: "Did police arrest someone?" Knowing when and where police made arrests and who was arrested is an important part of understanding police response to communal violence. Example sentence: "Police officials said nearly 2,537 people have so far been rounded up in the state."

• FAIL TO ACT: "Did police fail to intervene?" In the 2002 Gujarat violence, police were often accused of failing to prevent violence or allowing it to happen. Knowing when police were present but did not act is important for understanding the extent of this phenomenon and its potential causes (Wilkinson, 2006). Example sentence: "The news items [...] suggest inaction by the police force [...] to deal with this situation."

• FORCE: "Did police use force or violence?" Political scientists are interested not only in when police kill but in the level of force they use. Example sentence: "Trouble broke out in Halad [...] where the police had to open fire at a violent mob."

• ANY ACTION: "Did police do anything?" We collect annotations on all police activities so that social scientists could, in the future, label more fine-grained event classes. Example sentence: "In the heart of the city's Golwad area, the army is maintaining a vigil over mounting tension following [...]"

Figure 2 shows the interface annotators see. 9 See Appendix §A for exact annotation instructions and per-question agreement rates.

[Figure 2: We present annotators with a highlighted sentence (blue) and its document context. Their task is to click a check-mark for the event-focused questions for which there is a positive answer in the highlighted sentence.]
Following the guidelines of Pustejovsky and Stubbs (2012), we first assign each document to two annotators, followed by an adjudication round in which items with disagreement are given to an additional annotator to resolve and create the gold standard. For annotators, we select undergraduate students majoring in political science (as opposed to crowdworkers) in order to approximate the domain expertise of social scientists. 10 We initially recruited and selected 12 students. After a pilot study and two rounds of training, in which we provided individual feedback to annotators via email, we selected 8 final annotators based on their performance. Each student annotated around 330 documents (∼5,500 sentences) using the interface described in the Appendix, Figure A2. Table 1 shows the prevalence of the event classes after the adjudication round. Note that some of the classes are relatively rare: of all documents, only roughly 4% have KILL and 7% have FORCE. Our annotators had fairly high inter-annotator agreement for KILL and ARREST, with Krippendorff's alpha values of 0.75 and 0.71 respectively. Other questions, such as FAIL TO ACT and "Did police use other force?", had lower agreement (α < 0.4), indicating more difficulty and ambiguity. Full agreement rates are shown in the Appendix, Table A1.

Annotation uncertainty explanations
We also collect free-text annotation uncertainty explanations in order to analyze instances that annotators found difficult or ambiguous. For each sentence presented to annotators, we ask "If you found this example difficult or ambiguous please explain why" and ask them to provide a short written response in a provided text box. This follows recent work that has emphasized the importance of annotator disagreement, not necessarily as error in annotation but as ambiguity inherent to natural language and a potentially useful signal for downstream analyses (Dumitrache et al., 2018; Paun et al., 2018; Pavlick and Kwiatkowski, 2019; Keith et al., 2020).
Annotators remarked on several types of text they were uncertain about: agents of actions who were not explicitly mentioned but implicitly police, named entities whose status as police is ambiguous, confusion about what precisely constitutes an "arrest", and confusion arising from the lack of specific cultural knowledge (e.g., around the Indian crowd-control tactic of "lathi charging"). In the appendix, see Table A3 for examples and Table A2 for a categorization of free text responses.

Baseline Models
We test several baseline models, all requiring no annotation (and thus most realistic for the social science use case), and assess their performance on INDIAPOLICEEVENTS.
Keyword matching. Boolean keyword queries are a very common social science approach to document classification (e.g. Nielsen (2013)), since they are simple, transparent, and widely supported in user software. We use conjunctive normal form rules, where inferring an event class for a sentence requires matching any term from a police keyword list (including both common nouns and names of major police and security institutions), as well as an event keyword. To construct the keyword lists, a domain expert coauthor first manually generates a list of seed keywords for the semantic categories police, kill, arrest, intervention, and force. To address lexical coverage, we then expand the keywords through word2vec (Mikolov et al., 2013) nearest neighbors, filtered to semantically equivalent words by the domain expert. 11 This process is repeated using WordNet synonym set lookup (Miller, 1995), resulting in 217 keywords total; see appendix (§C.2) for details.

11 We train word2vec on every article in the Times of India from 2002 (the same corpus as our dataset, 69,000 articles) plus another 100,000 articles from The Hindu, another English-language newspaper in India. We inspect each keyword's 20 nearest neighbors with highest cosine similarity.
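The conjunctive-normal-form rule can be illustrated as follows. The keyword sets here are tiny toy samples for illustration, not the paper's 217-term expanded lists; the rule structure (a police term AND an event term) is what matters.

```python
# Toy keyword sets for illustration only; the real lists are built
# from expert seeds expanded with word2vec and WordNet.
POLICE_TERMS = {"police", "constable", "jawan", "crpf"}
KILL_TERMS = {"killed", "kill", "firing", "shot dead"}

def sentence_matches_kill(sentence):
    """CNF rule: (any police term) AND (any KILL event term)."""
    lowered = sentence.lower()
    has_police = any(term in lowered for term in POLICE_TERMS)
    has_event = any(term in lowered for term in KILL_TERMS)
    return has_police and has_event
```

Note that naive substring matching like this over-matches (e.g. "kill" inside "skill"); a production rule set would match on token boundaries.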
RoBERTa+MNLI. Given two input sentences, a premise and a hypothesis, the task of natural language inference (NLI) is to predict whether the premise entails or contradicts the hypothesis or does neither (neutral) (Bowman et al., 2015; Williams et al., 2018). Previous work has shown the promise of NLI transfer learning for events (Sarwar and Allan, 2020). The model takes a sentence and a declarative form of an event class question as input (§C.1), and we use its predicted probability of entailment as the probability of the event class. For document ranking, we create a document score by taking the maximum predicted probability over sentences. Future experiments could vary the amount of text (sentence vs. passage vs. document) used as input to the model.
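The sentence-to-document aggregation can be sketched as below. Here `entail_prob` stands in for the RoBERTa+MNLI entailment probability (any callable mapping a (sentence, hypothesis) pair to a probability); the threshold value is an assumption for illustration.

```python
def score_document(sentences, hypothesis, entail_prob, threshold=0.5):
    """Zero-shot scoring sketch: per-sentence entailment probabilities
    yield Task 1 labels (thresholded) and a Task 2 document ranking
    score (max over sentences)."""
    probs = [entail_prob(s, hypothesis) for s in sentences]
    sentence_labels = [p >= threshold for p in probs]  # Task 1
    doc_score = max(probs)                             # Task 2
    return sentence_labels, doc_score
```

Taking the max means a single strongly entailing sentence is enough to rank the whole document highly, which matches the full-recall goal of the retrieval task.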

BM25+RM3.
Weighted term matching between a query and document is a strong competitor to neural ranking methods (Craswell et al., 2020; Lin, 2019), via, for example, BM25 scoring with RM3 query expansion (Lavrenko and Croft, 2001). With the Anserini BM25 implementation (Yang et al., 2018a), we set k1 = 0.9 and b = 0.4, and conduct RM3 expansion of the query to terms found in the top k = 10 BM25-retrieved documents, following Lin (2019)'s hyperparameter settings. As the input query, this set of models uses the natural language questions described in §4.2. Appendix Table A5 contains full results.
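As a rough sketch of the underlying BM25 score with the k1/b settings above (not the Anserini implementation, and omitting RM3 expansion):

```python
import math
from collections import Counter

def bm25(query, doc, corpus, k1=0.9, b=0.4):
    """Minimal BM25 over tokenized documents; `corpus` is a list of
    token lists used for document frequencies and average length."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]
        # term-frequency saturation (k1) and length normalization (b)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score
```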

Results
We report the performance of the baseline models ( §5) on our three tasks in Table 2 and in Figure 3.

Task 1: Sentence classification.
For sentence classification, 12 Table 2 shows that the keyword matching method slightly outperforms RoBERTa+MNLI on F1 for ANY ACTION and FORCE, which we suspect is due to the keyword method having better access to synonyms of "police" (e.g. "jawan", "RPF") particular to the Times of India via its word2vec expansion. However, RoBERTa+MNLI achieves a higher F1 score on KILL, ARREST, and FAIL TO ACT. Further controlled experiments are needed to understand how the concreteness of the event class, the importance of identifying events' agents, and the formulation of the query (e.g. "Did police use force or violence?" vs. "Were police violent?") affect the results of contextualized language models. Table 2 also shows poor performance of our keyword matching method on FAIL TO ACT (F1=0.05); a large-scale contextual language model seems better able to distinguish the semantics of this event class (F1=0.48). The Task 1 plot in Figure 3 shows that RoBERTa+MNLI has higher recall than the keyword method for every event class. If social scientists plan to use these sentence classification methods in a semi-automated fashion (as we suggest in §1), selecting models like RoBERTa+MNLI that achieve higher recall may be important.

12 We do not evaluate sentences with fewer than 5 tokens, as many of these sentences are due to sentence segmentation errors. After this filtering, the number of remaining sentences we evaluate on is 18,645.

Task 2: Document ranking.
For Task 2, we report average precision and a new metric: the proportion of documents that would have to be read to achieve recall equal to X (PropRead@RecallX). We use X = 0.95 because social scientists typically use 95% cutoffs for significance and sampling error. 13 Table 2 shows that RoBERTa+MNLI outperforms both BM25 and ELECTRA+MS MARCO on both average precision and PropRead@Recall95 across all event classes. We hypothesize this is because natural language inference is a task much more aligned with the semantic-oriented precision at which we want to rank documents. In contrast, the MS MARCO dataset is constructed for a much higher-level information need, and documents that are "relevant" could potentially not entail the semantic event class of interest. As Figure 3 shows, if a social scientist were presented with a ranked list of documents from RoBERTa+MNLI, they would only have to read 5% of the entire corpus to achieve 95% recall on KILL. RoBERTa+MNLI also does well on ARREST and FORCE, with 0.17 and 0.20 PropRead@Recall95 respectively. There is consistently more difficulty across all models for ANY ACTION and FAIL TO ACT. We speculate this is because ANY ACTION is the class with the greatest prevalence, and thus it is more difficult to achieve higher recall.

13 We leave to future work estimating recall on a corpus without ground truth.

Task 3: Substantive temporal aggregates. While the overall temporal trend is broadly consistent across the three methods, the decreased accuracy of the automated methods could lead to attenuation bias if they were used as input to statistical models. A qualitative examination of the extracted events also reveals the need for future work in temporal linking models: most of the events after March 25 are describing events from earlier in March that were being reported in the context of investigations into the violence.
Table 2 shows that for all event classes except FORCE, RoBERTa+MNLI has a higher Spearman's ρ 14 between the predicted and gold-standard document counts. The Task 1 vs. 3 plot in Figure 3 shows an approximately linear relationship between the F1 scores of sentence-level models and Spearman's ρ, suggesting that NLP research focused on sentence-level models could be of use to social scientists who care about corpus-level evaluation.
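The Task 3 aggregate metrics can be computed directly from per-day document counts. A plain-Python sketch follows (in practice a library routine such as SciPy's `spearmanr` would be used); the rank correlation uses average ranks for ties.

```python
def mean_absolute_error(pred_counts, gold_counts):
    """MAE between predicted and gold per-day event counts."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gold_counts)) / len(gold_counts)

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks,
    with ties assigned their average rank."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank for a tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```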

Qualitative error analysis.
We manually analyze the false positives and false negatives of our best-performing baseline model, RoBERTa+MNLI. Some false positives are due to lexical semantic misunderstandings: the model often mistakes "shot" for KILL, and assigns high probability to negative FORCE sentences such as "The police escorting the vehicle fired into the air and dispersed the mob." The model also has difficulty identifying the police as agents: for example, it assigns high probability to the negative KILL sentence "Four persons have been killed and five are injured." and the FORCE sentence "One person was injured and rushed to the SSG hospital"; if one reads the preceding context of both of these sentences, it is clear that police are the agents of the actions.

Discussion and Future Work
The dataset, tasks, and evaluations we present in this work are driven by the needs of social scientists: we assess the performance of zero-shot models on metrics important to applied researchers, including recall against a fully annotated corpus and performance at temporally aggregated levels. We find cause for optimism for social scientists using BERT-style pre-trained models on their tasks. These models could potentially be used in place of social scientists' existing keyword-based classifiers, although we caution that accuracy is far from perfect and applied researchers will need to extensively validate model outputs. Even with imperfect classification accuracy, we believe these zero-shot models show promise for decreasing human annotation effort by reducing the proportion of the corpus read to achieve a specific recall level (the metric we call PropRead@RecallX).

Future work can extend our dataset creation process to new semantic event classes, such as protests, communal violence itself, and other forms of participation in political and social activity. Additional annotated datasets could allow researchers to generalize the performance of zero-shot language models to new domains and event classes. Finally, tasks such as temporal and geographic linking, event deduplication and coreference, and identifying hypothetical events are unsolved but are major obstacles for applied social scientists working with automatically extracted events.

Acknowledgments
For helpful comments, we're grateful to Aidan Milliff, the UMass NLP group, and anonymous ACL reviewers. This work was funded by a Kaggle Open Data Science grant, and additionally: AH: a National Science Foundation Graduate Research Fellowship; KK: a Bloomberg Data Science PhD Fellowship; SS: the Center for Intelligent Information Retrieval (CIIR); BO: National Science Foundation IIS-1845576 and IIS-1814955.

Impact

To ensure the replicability of our work and to further research into event extraction systems for social science research, we are making the text of the news articles available to researchers alongside our annotations. While all articles were obtained from a public website without login credentials, the applicability of copyright restrictions is relevant to address.
We believe the research benefits and the limited harms to the copyright holders justify this use, due to the four criteria considered in the fair use doctrine in U.S. copyright law (U.S. Copyright Office, 2021): (1) the non-commercial, nonprofit educational purpose of our use of the text, (2) the factual nature of the news reports, (3) the limited substitutability of our dataset for the original news site, 15 and (4) our expectation that our limited corpus will not harm the market for readers of the news site.
The issue of copyright status within NLP-oriented corpora is of increasing interest. Sag (2021) investigates BooksCorpus, a previously poorly documented corpus widely used for training language models, finding it contains large amounts of copyrighted work and highlighting how current data curation practices in machine learning (and adjacent) communities need improvement (Paullada et al., 2020; Jo and Gebru, 2020).
We also acknowledge the sensitivities around this period of violence in India. Its significance motivates computational work to enable more effective study of it and related episodes, but our news-derived data on its own, in the absence of deeper qualitative work, does not permit us to draw new substantive conclusions about the causes and consequences of the violence in Gujarat in 2002. We defer to the large scholarly and journalistic literature on the violence; see references in §1 and §4.

We provide details on our annotation process here, including the semantic event class definitions we provided to annotators, the per-class agreement statistics, statistics on the time it took to annotate, and further qualitative analysis of the annotations. All results reported in this appendix correspond to responses to the annotation questions (shown in Table A1), which are slightly different from the main semantic event classes reported in the main paper, as described in Table A1's caption and Footnote 9.

A.1 Annotator Instructions
To train our annotators, we provided them with the question for each semantic event class, a short description to clarify the question, and an example positive sentence (Figure A1). We met with the annotators as a group to talk through the document and then gave them a training round with documents we had previously annotated. Based on that training round, we added frequently asked questions to the instructions document, provided individual feedback to annotators, and then began the production annotation process on our corpus. Figure A2 shows a stylized version of the custom interface we built using the Prodigy annotation tool (Montani and Honnibal, 2018). Annotators are presented with an entire document, with sentences sequentially highlighted. For each highlighted sentence, they are asked each of the questions. If the sentence contains a positive answer to the question(s), they select the corresponding box(es) and advance to the next sentence.

A.3 Multi-sentence labels
We record whether annotators report using information from other sentences in the document to annotate the current sentence. Specifically, we provide a checkbox in the interface with the label "I used information from other sentences to answer the question". We collected this information in order to understand the number of sentences that could be classified on their own and how many needed broader document context. We caution that we left the interpretation of the sentence up to each annotator and did not train them or compare their usage of this label as we did with other labels. We do not use the labels in our analysis but provide them in our dataset to potentially help future research.

A.4 Annotator Agreements
We calculated inter-annotator agreement, both raw agreement and Krippendorff's alpha, for all annotators on our corpus (Table A1). Because the event classes are rare in our corpus, we prefer Krippendorff's alpha over raw agreement, which is inflated by the large number of zeros in our data. After half of the documents were annotated, we calculated agreement to check for annotators with high disagreement. We found one annotator with high disagreement on the KILL class and provided updated instructions for them. We used the final agreement rates to select the three annotators with the highest agreement rates with the full set of annotators to serve as our adjudicators in the final round of annotations. Figure A3 shows the distribution of the time that annotators took to annotate each document.
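For two annotators and nominal (here, binary) labels with no missing data, Krippendorff's alpha can be computed directly from the coincidence matrix as 1 - D_observed / D_expected. The following is a minimal sketch, not the paper's exact implementation, and the example data below is invented to show why we prefer alpha over raw agreement when positives are rare:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(labels_a, labels_b):
    """Krippendorff's alpha for two annotators, nominal labels, no missing
    data: alpha = 1 - D_observed / D_expected."""
    assert len(labels_a) == len(labels_b)
    # Coincidence matrix: each unit contributes both ordered pairs of values.
    coincidences = Counter()
    for a, b in zip(labels_a, labels_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    n = 2 * len(labels_a)  # total pairable values
    marginals = Counter()
    for (a, _), c in coincidences.items():
        marginals[a] += c
    d_observed = sum(c for (a, b), c in coincidences.items() if a != b) / n
    d_expected = sum(marginals[a] * marginals[b]
                     for a, b in permutations(marginals, 2)) / (n * (n - 1))
    return 1.0 - d_observed / d_expected

def raw_agreement(labels_a, labels_b):
    """Fraction of units on which the two annotators agree exactly."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
```

For instance, if one annotator labels one of ten sentences positive and the other labels none, raw agreement is 0.9 while alpha is 0.0, illustrating the inflation from rare classes noted above.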

B Properties of Annotated Data
B.1 Event locality

Figure A4 shows the label density for each event class across different sections of a document. To compute label density, we partition the sentences in each document into ten equal, ordered sections, where the first section corresponds to the earliest position in a document. We then count the number of positive events in each section. Figure A4 shows that, except for ARREST, label density is not concentrated in the initial sections of a document. In information retrieval and news summarization, the first k tokens are typically assumed to be a good approximation of the document representation (Dai and Callan, 2019); our dataset seems to present contradictory evidence.
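The density computation can be sketched as follows; this is an illustrative reimplementation (function name and input format are our own), assuming each document is represented as a sequence of binary sentence labels for one event class:

```python
def label_density_by_decile(doc_labels, n_sections=10):
    """doc_labels: list of per-document binary label sequences (1 = positive
    sentence for one event class). Returns the count of positive labels
    falling in each of n_sections equal, ordered sections of a document."""
    counts = [0] * n_sections
    for labels in doc_labels:
        n = len(labels)
        for i, y in enumerate(labels):
            # Map sentence position i to a section index in [0, n_sections).
            section = min(i * n_sections // n, n_sections - 1)
            counts[section] += y
    return counts
```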

B.2 Analysis of Free-Text Explanations
We analyzed the free-text explanations given by annotators, grouping them into non-exclusive categories. The most common categories of annotator explanations are shown in Table A2. Table A3 shows a selection of sentences from our corpus that illustrate challenging annotation decisions for our annotators or interesting ambiguity in the sentences.

• Hotkeys: you can select categories using the number keys (1-6), accept the example with the "a" key, and reject the example with the "x" key.
• Make sure you save your work when you're done (command-S or disk icon on the upper left).

(1) Did police kill someone?
Description: Click the checkbox if the sentence indicates police were responsible for killing anyone.
Example: Two people died due to police firing and another three were injured from the shooting (2) Did police arrest someone?
Description: Click the checkbox if the sentence indicates police arrested anyone.
Example: Police arrested ten people yesterday. "Over two dozen people were arrested at the protest." (3) Did police fail to act or not intervene?
Description: Click the checkbox if the sentence indicates police were present in any capacity but stood by and did not respond to any events that were unfolding.
Example: On Saturday, the police observed the conflict but did not intervene.
(4) Did police use other force or violence?
Description: Click the checkbox if the sentence indicates police used any other type of force towards others. This could include beating, shooting, shoving etc.
Example: Police beat innocent bystanders.
(5) Did police do or say something else (not included above)?
Description: Click the checkbox if the sentence indicates police did or said anything else, not mentioned above.
Example: Police reported that the incident happened at 2:59am.
(6) I used information from other sentences to answer the question.

Description:
Click this if you had to rely on information from other sentences to answer the question.
Example: "Yesterday, the police arrested 100 protesters. Even the secretary of the BJP was not spared." Recognizing that sentence 2 concerns an arrest relies on information from sentence 1.   Table A1: Sentence-level agreement, Krippendorff, and support for each question answered by annotators. "All" refers to all 20,527 annotated sentences in the corpus and "(1+)" refers to a subset of the corpus that excludes sentences that both annotators agree do not have police actions. Questions 1, 2, and 4 map to KILL, ARREST, and FAIL TO ACT, respectively; FORCE is defined as (1 OR 3), and ANY ACTION is (1 OR 2 OR 3 OR 5), as described in Footnote 9. sions for our annotators or interesting ambiguity in the sentences.
C Modeling Details

C.1 Declarative versions of questions
We use the following declarative versions of event class labels as input to RoBERTa+MNLI:
• KILL: "Police killed someone."
• ARREST: "Police arrested someone."
• FAIL TO ACT: "Police failed to intervene."
• FORCE: "Police used violence."
• ANY ACTION: "Police did something."

C.2 Keyword Approach
We report here the terms used in keyword matching. These terms were generated using subject matter expertise and expanded using WordNet and a custom word2vec model trained on the complete set of Times of India articles from 2002 and 100,000 additional articles from the Indian newspaper The Hindu. The expanded set was filtered using subject matter expertise. We report the keywords for the following categories:

Police: police, policemen, cop, cops, constables, jawan, jawans, grp, cid, rpf, stf, bsf, dcp, dsp, ssp, sho, cisf, dgp.
Kill: kill, kills, killed, killing, lynch, lynched, lynching, annihilate, annihilating, annihilated, annihilates, drown, drowning, drowned, drowns, massacre, massacring, massacred, massacres, slaughter, slaughtering, slaughtered, slaughterers.

Figure A2: Illustration of the dataset annotation interface given an example document. In practice, annotators view and label all sentences in the document, but this figure highlights three informative sentence examples. (A) An example document with post-hoc numbering of sentences. (B) The user sees a bold sentence in its context and is then asked a series of yes/no questions about the bold sentence. For this example, the annotators do not check any boxes (the answer to all the questions is no). (C) For this bold sentence, annotators check the boxes (yes answers) for the questions "Did police kill someone?" and "Did police use other force or violence?" (D) For this bold sentence, annotators check the box for "Did police use other force or violence?", select "I used information from other sentences to answer the question," and provide a free-text explanation for why they thought the example was difficult.

Author-assigned category: Count
No explicit agent: 63
Agent may not be police: 57
Police are mentioned but not agents: 37
Hypothetical or future events: 35
Failing to act vs. acting and failing: 30
True ambiguity in language: 27
Ambiguity in "arrest": 26
Total free text explanations: 311
Total sentences with explanations: 299
Total sentences with police activity*: 2,783
Total sentences in corpus: 21,391

Annotators flagged ambiguity in "rounded up" vs. "arrested".

"One of them who was on duty on December 16 even recollected how he and another colleague had to burst the tear gas shell themselves as the constables deliberately looked the other way." Annotators flagged this sentence as two separate police agencies acted and failed to act.

"Police on Friday lathicharged ram sevaks who attempted to rush towards the make-shift temple in the disputed site here giving some anxious moments to security forces." "Lathi charges" are an Indian riot control tactic that our United States-based annotators were not familiar with.

"Meanwhile the district administration has tightened the security in and around the temple city." Many annotators flagged sentences where police are implicitly the agents.

Table A3: Example sentences illustrating several of the challenges of annotating the documents or in applying existing models. We provide our own commentary on why the sentences are difficult.

Intervention: intervene, intervening, intervened, intervenes, intervention, interfere, interfering, interfered, interferes, stand by, standing by, stood by, stands by, abstain, abstaining, abstained, abstains.
• KILL: If a police keyword AND a kill keyword appear in the same piece of text, classify it as a positive.
• ARREST: If a police keyword AND an arrest keyword appear in the same piece of text, classify it as a positive.
• FAIL TO ACT: If a police keyword AND an intervention keyword appear in the same piece of text, classify it as a positive. (This is a very simple rule-based method; we leave to future work the development of a keyword-based method that more adequately captures the negation semantics of "did not intervene.")
• FORCE: If a police keyword AND a force keyword appear in the same piece of text, classify it as a positive.
• ANY ACTION: If a police keyword appears in a piece of text, classify it as a positive.
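These conjunctive rules can be sketched in a few lines. The keyword sets below are illustrative subsets only (the arrest and force lists are not fully reproduced in this appendix), and the function names are our own:

```python
import re

# Illustrative keyword subsets; the released lists are longer than shown here.
KEYWORDS = {
    "police": {"police", "policemen", "cop", "cops", "constables", "jawan"},
    "kill": {"kill", "kills", "killed", "killing", "lynched", "massacred"},
    "arrest": {"arrest", "arrests", "arrested", "arresting"},   # illustrative
    "intervention": {"intervene", "intervened", "interfere", "stood by"},
    "force": {"beat", "shot", "lathicharged", "shoved"},        # illustrative
}

def tokens(text):
    """Lowercased word tokens of a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(text):
    """Apply the conjunctive keyword rules to one piece of text."""
    toks = tokens(text)
    def has(cat):
        # Multi-word keywords (e.g. "stood by") are matched as substrings.
        return any((kw in toks) if " " not in kw else (kw in text.lower())
                   for kw in KEYWORDS[cat])
    police = has("police")
    return {
        "KILL": police and has("kill"),
        "ARREST": police and has("arrest"),
        "FAIL_TO_ACT": police and has("intervention"),
        "FORCE": police and has("force"),
        "ANY_ACTION": police,
    }
```

For example, "Police arrested ten people yesterday." triggers the ARREST and ANY ACTION rules but not KILL.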
For IR, there are two standard architectures for scoring passages and queries: cross-encoders in which the architecture performs full attention over the pair and bi-encoders in which the passage and query are each mapped independently into a dense vector space (Luan et al., 2020). We chose a model with a cross-encoder architecture since these have been shown to consistently have higher performance (Thakur et al., 2021).

D Results
This section provides additional results beyond those included in the main paper, including results for document-level models, variants on the BM25 model, mean absolute error results to complement the Spearman correlations presented in the main paper, and the temporally aggregated results for all of the semantic event classes.

D.1 Document Level F1
To complement the sentence-level F1 metrics in the main paper, we present the document-level metrics in Table A4.

D.2 BM25 and Variants
In addition to the standard BM25 model reported in the paper, we tested several variants, including automatic term expansion using RM3 and manual term expansion using the same keywords from our keyword method. The results are shown in Table  A5.
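For reference, standard Okapi BM25 scoring can be sketched in pure Python; this is a generic illustration (the k1/b defaults here are common values, not necessarily the settings used in our experiments), and manual term expansion amounts to simply extending the query term list:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=0.9, b=0.4):
    """Score each tokenized document in `docs` against `query_terms` with
    Okapi BM25. k1 and b are illustrative defaults."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency for each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```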

D.3 Spearman and MAE results
In the main paper, we report Spearman correlations between the daily counts of gold-standard events identified by our annotators and the daily counts predicted by our models. In Table A6 we also report the mean absolute error in daily event counts between the gold standard and our two models for each event class. We prefer Spearman correlations over MAE because the correlation is normalized between -1 and 1, while MAE tends to be higher for high-prevalence event classes.
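Both metrics are standard; a minimal pure-Python sketch (our own helper names, with tie-aware average ranks for Spearman) makes the scale difference concrete, since MAE grows with the raw count magnitudes while Spearman depends only on ranks:

```python
def _ranks(xs):
    """1-based average ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def mae(xs, ys):
    """Mean absolute error between two equal-length count series."""
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)
```

A prediction that is a perfect monotone rescaling of the gold counts gets Spearman 1.0 but a large MAE.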

D.4 Temporal Aggregates for All Event Classes
We report the temporal aggregate comparisons for all event classes in Figure A5 to supplement the figure in the main text showing results for the FAIL TO ACT class.

E Prototype span-based annotation schema
Before arriving at the annotations via natural language described in Section 4.2, we first attempted to gather span-based text annotations in order to collect more fine-grained details about police activity. In these prototype rounds, we first asked annotators to highlight spans in the text that answered "What action did police do?" Then, given the action text-span they highlighted, we asked them to highlight spans for the following questions: "Police did the action using what?" "Police did the action towards whom?" "Where did the action occur?" "When did the action occur?" "Why did the action occur?" There were several major barriers to this annotation schema that caused us to abandon the span-based annotation approach in favor of our current approach: pre-selecting semantic event classes of interest and having annotators give sentence classification labels. First, we were unable to resolve discrepancies in how much text annotators should highlight for given spans. Following the "argument reduction criterion" of Stanovsky et al. (2016), we asked annotators to "highlight as much as you need to answer the question but not more. If you can exclude a word from the highlighting without changing the answer to the question, you should exclude it." For example, in the text "Police suddenly attacked protesters with sticks," we expected annotators to highlight "suddenly attacked" rather than just "attacked," because the former is a slightly different action. However, this criterion did not succeed in improving annotator agreement on span extents.
Furthermore, with span-based annotations, it was difficult to decide how to properly aggregate police actions (e.g., how do we automatically separate "suddenly attacked" from "attacked" from "did not attack"?). Had we been committed to span-based annotations, we may have had to develop much longer, more detailed guidelines, such as those from the Richer Event Description project (O'Gorman et al., 2016). We believe this approach, which requires more work in developing guidelines and training annotators in them, is less easily extensible to new problems and social science domains. Finally, in a training round, the action text spans that annotators did select were not substantively interesting enough to be worth the additional cost and effort on the part of annotators.