MINION: a Large-Scale and Diverse Dataset for Multilingual Event Detection

Event Detection (ED) is the task of identifying and classifying trigger words of event mentions in text. Despite considerable research efforts in recent years for English text, the task of ED in other languages has been significantly less explored. Switching to non-English languages, important research questions for ED include how well existing ED models perform on different languages, how challenging ED is in other languages, and how well ED knowledge and annotation can be transferred across languages. To answer those questions, it is crucial to obtain multilingual ED datasets that provide consistent event annotation for multiple languages. There exist some multilingual ED datasets; however, they tend to cover a handful of languages and mainly focus on popular ones. Many languages are not covered in existing multilingual ED datasets. In addition, the current datasets are often small and not accessible to the public. To overcome those shortcomings, we introduce a new large-scale multilingual dataset for ED (called MINION) that consistently annotates events for 8 different languages; 5 of them have not been supported by existing multilingual datasets. We also perform extensive experiments and analysis to demonstrate the challenges and transferability of ED across languages in MINION that in all call for more research effort in this area. We will release the dataset to promote future research on multilingual ED.


Introduction
Event Detection (ED) is one of the critical steps for an Event Extraction system in Information Extraction (IE) that aims to recognize mentions of events in text, i.e., changes of state of real-world entities. Specifically, an ED system identifies the word(s) that most clearly refer to the occurrence of an event, i.e., the event trigger, and also detects the type of event that is evoked by the event trigger. For instance, in the sentence "The city was reportedly struck by F16 missiles.", the word "struck" is the trigger for an ATTACK event. An ED model can be incorporated into other IE pipelines to facilitate the extraction of information related to events and entities, thereby supporting various downstream applications such as knowledge base construction, question answering and text summarization.
Due to its importance, ED has been extensively studied in the IE and NLP community over the past decade. Existing methods for ED range from feature-based models (Ahn, 2006; Liao and Grishman, 2010; Miwa et al., 2014a) to advanced deep learning methods (Nguyen and Grishman, 2015; Chen et al., 2015; Sha et al., 2018; Wang et al., 2019; Yang et al., 2019; Cui et al., 2020; Lai et al., 2020; Pouran Ben Veyseh et al., 2021b). As such, the creation of large annotated datasets for ED, e.g., ACE 2005 (Walker et al., 2006), has been critical to measuring progress and to the growing development of ED research. However, a majority of current datasets for ED only provide annotation for texts in a single language (i.e., monolingual datasets). For instance, the recent challenging datasets for ED, e.g., MAVEN, RAMS (Ebner et al., 2020), or CySecED (Man et al., 2020), are all proposed for English documents only. In addition, there are a few existing datasets that include ED annotation for multiple languages (multilingual datasets), e.g., ACE 2005 (Walker et al., 2006), TAC KBP (Mitamura et al., 2016, 2017), and TempEval-2 (Verhagen et al., 2010). However, those multilingual datasets only cover a handful of languages (i.e., 3 languages in ACE 2005 and TAC KBP, and 6 languages in TempEval-2), mainly focusing on popular languages such as English, Chinese, Arabic, and Spanish, and leaving many other languages unexplored for ED. For instance, Turkish and Polish are not covered in existing multilingual datasets for ED. We also note that existing ED datasets tend to employ different annotation schemas and guidelines, which prevents combining current datasets to create a larger one. In all, the limited coverage of languages and the annotation discrepancies in current monolingual/multilingual ED datasets hinder comprehensive studies of the challenges of ED in diverse languages. They also limit thorough evaluations of the multilingual generalization of ED models. Finally, we note that the major multilingual datasets for ED, e.g., ACE 2005 and TAC KBP, are not publicly accessible due to the licenses of the documents involved, thus further impeding research effort in this area.
To address such issues, our goal is to introduce a new Multilingual Event Detection dataset (called MINION) to support multilingual research for ED. In particular, we provide a large-scale dataset that manually annotates event triggers for 8 typologically different languages, i.e., English, Spanish, Portuguese, Polish, Turkish, Hindi, Japanese and Korean. Among them, the five languages Portuguese, Polish, Turkish, Hindi, and Japanese are not covered in existing popular datasets for multilingual ED (i.e., ACE 2005, TAC KBP, and TempEval-2). To facilitate public release and sharing of the dataset, we employ event articles from Wikipedia for annotation in the 8 languages. In addition, to improve the quality of the data, we inherit the annotation schema and guideline of ACE 2005, the well-designed and widely used dataset for ED research. In total, our MINION dataset involves more than 50K annotated event triggers, which is much larger than existing multilingual ED datasets (i.e., fewer than 11K and 27K triggers in ACE 2005 and TempEval-2, respectively). We expect that the significantly larger size, more diverse set of languages, and public texts in MINION can help accelerate ED research and extend it to a larger population.
Given the proposed dataset, we conduct a thorough analysis of MINION using state-of-the-art (SOTA) models for ED. In particular, we first study the challenges of ED in different languages using monolingual evaluations where ED models are trained and tested in the same language. Our experiments suggest that the performance of existing ED models is not yet satisfactory in multiple languages and that model performance on non-English languages is in general poorer than that for English. We also show that current pre-trained language models for specific languages (i.e., monolingual models) are less effective for ED models than multilingual pre-trained language models, e.g., mBERT (Devlin et al., 2019). In all, our findings highlight greater challenges of ED for non-English languages that should be further pursued in future research.
In addition, our MINION dataset also facilitates zero-shot cross-lingual transfer learning experiments that serve to reveal the transferability of ED knowledge and annotation across languages. In these experiments, ED models are trained on English data (the source language), but tested on other target languages. Our results in this setting demonstrate a wide range of cross-lingual performance for different target languages in MINION, which introduces a diverse set of languages and data for ED research. Finally, we report extensive analysis on MINION to provide further data insights for future ED research, including challenges of data annotation, language differences, and cross-dataset evaluation.

Data Annotation
Our dataset MINION follows the same definition of events as the annotation guideline in ACE 2005 (Walker et al., 2006). Specifically, an event is defined as an occurrence that results in the change of state of a real world entity. Moreover, an event mention is evoked by an event trigger which most clearly describes the occurrence of the event. While event triggers are mostly single words, we also allow multi-word event triggers to better accommodate ED annotation in multiple languages. For instance, the phrasal verb "tayin etmek" with two words in Turkish, meaning "appoint", is necessary to express the event type Start-Position.
We also inherit the annotation schema/ontology (i.e., to define event types for annotation) and guideline of ACE 2005 to benefit from its well-designed documentation and to be consistent with most prior ED research. However, to improve the quality of the annotated data, we prune some event sub-types from the original ACE 2005 ontology in our dataset. In particular, event sub-types that have very similar meanings in some languages are not included in our final ontology. This promotes the distinction between event labels and avoids confusion for annotators, helping them provide high-quality data in different languages. For instance, the event sub-types Convict and Sentence are very similar in Turkish (i.e., both Convict and Sentence can be translated as Mahkum etmek in Turkish), and are thus removed from our ontology. In addition, we also exclude event sub-types in ACE 2005 that are not frequent in our collected data from Wikipedia (more details on data collection later), e.g., Nominate and Declare-Bankruptcy. Finally, 16 event sub-types (for 8 event types) are preserved in the final event schema for our dataset. We provide detailed explanations and sample sentences for the event types in our dataset in Appendix A.

Candidate Selection
As mentioned in the introduction, we aim to annotate ED data for 8 languages, i.e., English, Spanish, Portuguese, Polish, Turkish, Hindi, Japanese and Korean. These languages are selected due to their diversity in terms of typology and their novelty w.r.t. existing multilingual ED datasets, which can be helpful for multilingual model development and generalization evaluation. To collect text data for annotation in each language, we employ the articles of the language-specific editions of Wikipedia. Specifically, for each language, we obtain its latest dump of Wikipedia articles (the dumps were downloaded in May 2021), then process the dump with the parser WikiExtractor (Attardi, 2015) to extract textual and meta data for articles. To increase the likelihood of encountering event mentions for effective annotation, we utilize the articles that are classified under one of the sub-categories of the Event category in Wikipedia. In particular, we focus on the six sub-categories Economy, Politics, Technology, Crimes, Nature, and Military due to their relevance to the event types in our ontology. Note that we map these (sub)categories in English to the corresponding (sub)categories in other languages using the provided links in Wikipedia. Afterward, to split the texts into sentences and tokens, we leverage the multilingual toolkit Trankit (Nguyen et al., 2021a) that has demonstrated state-of-the-art performance for such tasks in our languages.
Given a Wikipedia article, one approach for ED annotation is to ask the annotators to annotate the entire document for event triggers at once. However, as Wikipedia articles tend to be long, this approach might be overwhelming for annotators, thus potentially limiting the annotation quality. To this end, motivated by the annotation with 5-sentence windows in the RAMS dataset (Ebner et al., 2020), we split each article into segments of 5 sentences that are annotated separately by annotators. In this way, annotators only need to process a shorter context at a time, which improves their attention and the accuracy of the annotated data. This annotation approach is also supported by a large amount of prior ED research where a majority of previous ED models have employed context information in single sentences to deliver high extraction performance for the event types in ACE 2005 (Nguyen and Grishman, 2015, 2018; Wang et al., 2019; Yang et al., 2019; Cui et al., 2020), including models for multiple languages (M'hamdi et al., 2019; Ahmad et al., 2021; Nguyen et al., 2021b).
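The segmentation step described above can be illustrated with a minimal sketch. It assumes the article has already been split into sentences (e.g., by a toolkit such as Trankit); the function name split_into_segments and the drop_incomplete flag are hypothetical and are not taken from the dataset's actual pipeline. Dropping a trailing window with fewer than 5 sentences is an assumption made here so that every segment has exactly 5 sentences.

```python
from typing import List


def split_into_segments(
    sentences: List[str], window: int = 5, drop_incomplete: bool = True
) -> List[List[str]]:
    """Group sentences into consecutive, non-overlapping windows.

    drop_incomplete is an assumption of this sketch: a trailing window
    with fewer than `window` sentences is discarded when it is True.
    """
    segments = [
        sentences[i:i + window] for i in range(0, len(sentences), window)
    ]
    if drop_incomplete and segments and len(segments[-1]) < window:
        segments.pop()
    return segments


# Toy article with 12 pre-split sentences.
article = [f"Sentence {i}." for i in range(1, 13)]
segments = split_into_segments(article)
all_segments = split_into_segments(article, drop_incomplete=False)
```

With the toy article of 12 sentences, the default call yields two full 5-sentence segments, while drop_incomplete=False additionally keeps the final 2-sentence remainder.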

Annotation Process
To annotate the produced article segments, we hire annotators from upwork.com, a crowd-sourcing platform with freelance annotators across the globe. In particular, our annotator candidate pool for each language of interest involves native speakers of the language who also have experience with related data annotation projects (e.g., for named entity recognition), an approval rate higher than 95%, and fluency in English. This information is provided by annotator profiles in Upwork. In the next step, the candidates are trained for ED annotation using the English annotation guideline and examples for the designed event schema in our dataset (i.e., inherited from ACE 2005). Finally, we ask the candidates to take an annotation test designed for ED in English, and only candidates with passing results are officially selected as annotators of our multilingual ED dataset. Overall, we recruit several annotators for each language of interest as shown in Table 2. To prepare for the actual annotation, the annotators for each language work together to produce a translation of the English annotation guideline/examples, where language-specific annotation rules are discussed and included in the translated guideline to form a common annotation perception for the language. The translated guideline and examples are also verified by our language experts to avoid any potential conflicts and issues.
Finally, given the language-specific guidelines, the annotators for each language independently annotate a chunk of article segments for that language. The breakdown numbers of annotated text segments for each language and Wikipedia subcategory in our MINION dataset are shown in Table 3. As such, 20% of the annotated text segments for each language are selected for co-annotation by the annotators to measure inter-annotator agreement (IAA) scores while the remaining 80% are distributed to annotators for separate annotation. [...] neau, 2006) for the IAA scores of each language in our dataset. After independent annotation, the annotators resolve the conflict cases to produce the final version of our MINION dataset. Overall, our dataset demonstrates high agreement scores for all the 8 languages, thus providing a high-quality dataset for multilingual ED.

Subcategory   English  Spanish  Portuguese  Polish  Turkish  Hindi  Japanese  Korean
Economy         1,095      112         168     315      297    189       199     250
Politics        3,202      308         772   1,270    1,233    349       232     248
Technology      2,171      189         400     712      815    295       312     249
Crimes            893       78         220     152      118     95        80      73
Nature          1,195      398         705     455      398    245       299

Data Analysis
The main statistics for our MINION dataset are provided in Table 3. This table shows that for a majority of languages, there are multiple event triggers in a text segment, thereby introducing a challenge for ED models due to the overlap of event context.
In addition, the table shows that text segments in some languages are more replete with event mentions than those in other languages. Specifically, comparing Polish and English text segments, the density of event mentions in Polish is almost twice that of English. Finally, Figure 1 shows the distributions of the 8 event types for the 8 languages in our dataset. As can be seen, the languages in our dataset exhibit different levels of discrepancy regarding the distributions over event types. As such, the divergence in trigger density and type distribution between languages suggests further challenges that robust ED models should handle to perform well across languages in MINION.

Annotation Challenges
Despite the high inter-annotator agreement scores, there are some conflicts between our annotators during the annotation process due to the ambiguity of event triggers, especially in the multilingual setting. This section highlights some of the key ambiguities/conflicts that we encounter during our analysis of annotation results from the annotators. Note that all of these conflicts have been resolved by the annotators in the final version of our dataset.
Language-Specific Challenges: Despite the common notion of events in different languages, each language might have its own exceptions regarding how an event trigger should be annotated, causing confusion and conflicts for our annotators in the annotation process. One exception concerns the necessity to include event arguments in the annotation of an event trigger in some languages. For example, in the Polish sentence "Samolot się rozbił" (translated as "The plane crashed itself"), some annotators believe that the meaning of the verb "rozbił" (i.e., crashed) is incomplete if its argument word "się" (i.e., itself) is not associated. As such, annotating both the verb and its argument (i.e., "się rozbił") is necessary to express an event in this case. However, other annotators suggest that only annotating the word "rozbił" is sufficient. Our annotators have decided to annotate event triggers along with the arguments necessary to achieve their complete meanings in such cases.
Background Knowledge: Background knowledge is sometimes important to correctly recognize an event trigger in input text. In such cases, the annotators might make conflicting event annotation decisions for a word as their levels of background knowledge are different. For instance, in the sentence "The match was canceled in the memory of victims of Katyn crime", some annotators annotate the word "crime" as a Die event trigger as they know that "crime" is referring to a mass execution event. However, some annotators do not consider "crime" as an event trigger as they are not aware of the execution event. Eventually, we have decided to annotate the text segments based only on the information presented in the input texts to resolve conflicts and avoid inconsistency.

Table 3: Statistics of the MINION dataset. Seg. represents text segments. All annotated segments consist of 5 sentences and their lengths (Avg. Length) are computed in terms of number of tokens. "Challenging Type" indicates the type whose event trigger annotation involves the largest disagreement between annotators in each language.

Experiments
This section aims to study the challenges of ED for the 8 languages in our MINION dataset. As such, we evaluate the performance of state-of-the-art (SOTA) ED models in the monolingual setting where models are trained and tested on the annotated data of the same language. To prepare for the experiments, we randomly split the annotated data for each language in MINION into separate training/development/test sets with the ratio of 80/10/10 (respectively). As MINION allows multi-word event triggers to accommodate language-specific characteristics in multiple languages, we model the ED task in our dataset as a sequence labeling problem. Concretely, given an input text segment $D = [w_1, w_2, \ldots, w_n]$ with $n$ words, ED models need to predict the label sequence $Y = [y_1, y_2, \ldots, y_n]$ where $y_i$ indicates the label for the word $w_i \in D$ using the BIO tagging schema.
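As an illustration of the BIO formulation, the following minimal sketch converts trigger annotations into a label sequence. The span representation (start index, exclusive end index, event type) and the function name spans_to_bio are assumptions made for this example; they do not describe the released data format. The two examples reuse the English sentence from the introduction and the two-word Turkish trigger "tayin etmek".

```python
from typing import List, Tuple

# Hypothetical span format: (start index, end index exclusive, event type).
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, triggers: List[Span]) -> List[str]:
    """Encode trigger spans as one BIO label per token."""
    labels = ["O"] * num_tokens
    for start, end, event_type in triggers:
        labels[start] = f"B-{event_type}"      # first token of the trigger
        for i in range(start + 1, end):
            labels[i] = f"I-{event_type}"      # remaining tokens, if any
    return labels


# Single-word trigger: "struck" evokes an Attack event.
tokens = ["The", "city", "was", "reportedly", "struck", "by", "F16", "missiles", "."]
labels = spans_to_bio(len(tokens), [(4, 5, "Attack")])

# Multi-word trigger: the Turkish phrasal verb "tayin etmek" (Start-Position).
multi_labels = spans_to_bio(2, [(0, 2, "Start-Position")])
```

Here "struck" receives the label B-Attack and all other tokens receive O, while the two Turkish tokens receive B-Start-Position and I-Start-Position.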
To this end, following prior work on multilingual ED and cross-lingual ED (M'hamdi et al., 2019), we examine the following representative SOTA models for sequence-labeling ED: (1) Transformer: A pre-trained transformer-based language model (PTLM), e.g., mBERT (Devlin et al., 2019), is augmented with a feed-forward network to predict a label for each word in the input text; (2) Transformer+CRF: This model also employs a PTLM as in the Transformer model; however, a Conditional Random Field (CRF) layer is additionally introduced as the final layer to predict the label sequence $Y$; (3) Transformer+BiLSTM: This model extends the Transformer model by injecting a bidirectional Long Short-Term Memory network (BiLSTM) between the PTLM and the feed-forward network to further abstract the representation vectors; and (4) Transformer+BiLSTM+CRF: This model is similar to the Transformer+BiLSTM model except that a CRF layer is employed at the end for label sequence prediction. As such, to implement the models, we explore two SOTA multilingual PTLMs, i.e., mBERT (Devlin et al., 2019) and XLMR (Conneau et al., 2020) (their base versions), for text encoding. In the model notation, we replace the prefix "Transformer" with "mBERT" or "XLMR" depending on the actual PTLM used (e.g., mBERT, mBERT+CRF, mBERT+BiLSTM). Following prior work (M'hamdi et al., 2019), in the experiments, we evaluate the models using precision, recall and F1 scores for correctly predicting event trigger boundaries and types in text.
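The evaluation criterion can be sketched as follows, assuming micro-averaged scoring in which a predicted trigger counts as correct only if both its boundaries and its type match a gold trigger exactly. This is an illustrative sketch, not the evaluation script used in the experiments; the function names bio_to_spans and precision_recall_f1 are hypothetical, and ignoring stray I- tags that do not continue a trigger is a simplifying assumption.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, event type)


def bio_to_spans(labels: List[str]) -> Set[Span]:
    """Decode BIO labels into trigger spans; stray I- tags are ignored."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes the last span
        if label.startswith("I-") and start is not None and label[2:] == etype:
            continue                            # still inside the open trigger
        if start is not None:
            spans.add((start, i, etype))        # close the open trigger
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]         # open a new trigger
    return spans


def precision_recall_f1(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged scores over exact (boundary, type) matches."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = bio_to_spans(g), bio_to_spans(p)
        tp += len(gs & ps)
        fp += len(ps - gs)
        fn += len(gs - ps)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# One trigger matches exactly; the other has the right boundary but wrong type.
gold = [["O", "B-Attack", "O", "B-Die"]]
pred = [["O", "B-Attack", "O", "B-Injure"]]
p, r, f1 = precision_recall_f1(gold, pred)
```

In this toy case there is one true positive, one false positive and one false negative, so precision, recall and F1 are all 0.5.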
Our fine-tuning process suggests similar values of hyper-parameters for the models across languages in MINION. In particular, for English, we use one layer for BiLSTM modules with 300 dimensions for the hidden states (for Transformer+BiLSTM and Transformer+BiLSTM+CRF). For feed-forward networks, we employ 2 layers with 200 dimensions for the hidden vectors. The learning rate is set to 1e-4 for the Adam optimizer and a batch size of 8 is employed during training.
Monolingual Performance: The performance of the four ED models on the test data of each language is presented in Tables 4 (for mBERT) and 5 (for XLMR). There are several observations from these tables. First, the best average F1 score of the models over different languages is 72.31% (achieved by the XLMR model). This performance is still considerably lower than that of a perfect model, thus suggesting significant challenges of ED in multiple languages and calling for more research effort in this area. Second, the performance of the models for non-English languages is significantly worse than for English. This difference further highlights the necessity of more research on ED for non-English languages. Finally, the superior performance of XLMR over other models in almost all languages indicates better effectiveness of the multilingual PTLM XLMR for ED in different languages (compared to mBERT). It also implies that traditional BiLSTM and CRF layers for sequence labeling are less necessary for multilingual ED when a PTLM is employed for text encoding. As such, in the following experiments, we employ Transformer as the main ED model for further analysis.
Monolingual PTLMs: To better understand the benefits of multilingual PTLMs (i.e., mBERT and XLMR) for multilingual ED, we further evaluate the performance of the Transformer model when monolingual language-specific PTLMs are leveraged to encode input texts (i.e., replacing mBERT and XLMR). Accordingly, for monolingual language-specific PTLMs, we consider both BERT-based and RoBERTa-based models for comprehensiveness. Tables 6 (for BERT) and 7 (for RoBERTa) report the monolingual performance of Transformer when monolingual language-specific PTLMs are employed. Note that we only show ED performance for languages where monolingual PTLMs are publicly available. As can be seen, compared to multilingual PTLMs, monolingual PTLMs (based on BERT or RoBERTa) improve the performance of Transformer for English. However, for other languages, monolingual PTLMs are on par with (for BERT-based models) or significantly worse than (for RoBERTa-based models) multilingual PTLMs for ED, thus demonstrating the general advantage of multilingual PTLMs for ED. In addition, this suggests that future work can explore methods to improve monolingual language-specific PTLMs for ED in different languages.

Language                          P      R      F1
English (Devlin et al., 2019)     78.12  81.61  79.83
Spanish (Cañete et al., 2020)     72.73  62.25  67.08
Portuguese (Souza et al., 2020)

Cross-lingual Performance: To understand the transferability of ED knowledge and annotation across languages, we explore the cross-lingual evaluation setting where models are trained on English data (the source language) and directly evaluated on test data of other target languages in MINION. As such, we report the cross-lingual performance of Transformer with both mBERT and XLMR as the PTLMs in Table 8. Note that we inherit the same hyper-parameters selected for Transformer in the fine-tuning process of monolingual experiments for consistency.
Compared to the monolingual performance counterparts of mBERT and XLMR in Tables 4 and 5, it is clear that the performance of Transformer in non-English languages decreases significantly in the cross-lingual evaluation, i.e., the average performance loss due to cross-lingual evaluation is 15.2% for both mBERT and XLMR. We also observe a wide range of cross-lingual performance for the target languages in Table 8, thus suggesting the diverse nature of the data and languages in MINION to support robust model development for ED. Among the target languages, Portuguese exhibits the smallest performance difference between monolingual and cross-lingual settings while the largest performance loss with cross-lingual transfer occurs in Japanese, Turkish, Korean, and Hindi. One possible reason for such performance loss is the difference in language structure: Japanese, Turkish, Korean, and Hindi follow the Subject-Object-Verb word order while English and the other languages in our dataset use the Subject-Verb-Object order. Another reason can be linked to different patterns/distributions of event triggers in different languages. For instance, some languages tend to mention events using verbs (e.g., in English 78% of the triggers are verbs) while other languages might use more diverse parts of speech to express event triggers (e.g., in Japanese only 63% of triggers are verbs). Also, Section 4 provides an additional explanation regarding the diversity of event triggers in different languages. In all, the cross-lingual performance in our MINION dataset demonstrates the challenges of transferring ED knowledge across languages that can be further studied in future work.

Analysis
This section provides additional analysis to better understand the multilingual ED task in MINION.
Cross-dataset Evaluation: As the event ontology in MINION is inherited and pruned from the ACE 2005 dataset, it is helpful to learn how the annotated events in MINION are different from those in ACE 2005. To this end, we propose to evaluate model performance in the cross-dataset setting: models are trained on the English data of ACE 2005 and evaluated on test data of different languages in MINION. In particular, we utilize the standard data split from prior work (Nguyen and Grishman, 2015; Chen et al., 2015; Wang et al., 2019); only triggers of event sub-types in our MINION dataset are retained for compatibility between the two datasets. Due to its superior performance in previous experiments, we employ the Transformer model with XLMR in this experiment. The hyper-parameters for the model are tuned on the development data of ACE 2005. Table 9 shows the model performance in the cross-dataset evaluation.
Compared to the corresponding cross-lingual performance of MINION in Table 8, it is clear that the performance on MINION is significantly worse when the model is trained on ACE 2005 data. As such, a possible explanation for this performance loss is the domain difference between ACE 2005 and MINION, i.e., MINION involves Wikipedia articles while ACE 2005 is based on news articles, conversational telephone speech, and others. In addition, as the size of the English training data in MINION (i.e., over 14K triggers) is significantly larger than that of ACE 2005 (i.e., less than 6K triggers), the training data in MINION might cover more event patterns to produce better performance for ED models. Future work can explore this cross-dataset evaluation setting to build more robust models for ED.
Trigger Diversity in Different Languages: To understand how events are expressed in different languages, we explore the ratio of unique trigger words over the total number of event triggers for an event sub-type (called the unique ratio). Figure 2 shows the averages of unique ratios over event sub-types for different languages in our MINION dataset. The diagram shows that English is relatively simpler than other languages for ED as its diversity of event triggers for event types is the lowest among all the considered languages. Korean, Turkish, and Japanese are the languages that exhibit the largest diversity of event triggers. This further helps to explain the worst cross-lingual performance of models from English to Korean, Turkish, and Japanese in Table 8.
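The unique ratio can be made concrete with a minimal sketch. It assumes the triggers of one language are available as (trigger text, event sub-type) pairs; the function name average_unique_ratio is hypothetical, and lowercasing trigger text before counting is an assumption of this example rather than a detail stated above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def average_unique_ratio(triggers: Iterable[Tuple[str, str]]) -> float:
    """Average, over event sub-types, of (#unique triggers / #triggers)."""
    by_subtype: Dict[str, List[str]] = defaultdict(list)
    for text, subtype in triggers:
        by_subtype[subtype].append(text.lower())  # assumption: case-insensitive
    if not by_subtype:
        raise ValueError("no triggers provided")
    ratios = [len(set(words)) / len(words) for words in by_subtype.values()]
    return sum(ratios) / len(ratios)


# Toy data: Attack has 3 unique triggers out of 4, Die has 2 out of 2.
toy_triggers = [
    ("struck", "Attack"), ("struck", "Attack"),
    ("bombed", "Attack"), ("attack", "Attack"),
    ("died", "Die"), ("killed", "Die"),
]
ratio = average_unique_ratio(toy_triggers)
```

For the toy data, the per-sub-type ratios are 0.75 and 1.0, so the average unique ratio is 0.875. A higher value means that the same event sub-type is expressed with a wider variety of trigger words.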
Challenging the Supremacy of English for Event Detection: English has been the major language for ED research. In particular, in cross-lingual transfer learning for ED, English has often been considered as a high-resource source language to train ED models to apply to other target languages (M'hamdi et al., 2019; Nguyen et al., 2021b). In this experiment, we argue that English is not necessarily the optimal source language for cross-lingual transfer learning of ED. In particular, using Transformer with XLMR as the base model, we train one model on English training data and another on Spanish training data; the resulting models are evaluated on the test data of the other languages in MINION. To ensure a fair comparison, we use the same size of training data for English and Spanish, i.e., 3,000 annotated text segments randomly sampled in MINION. Table 10 presents the cross-lingual performance of the models. The table demonstrates that using Spanish as the source language can achieve better performance than English for all the target languages in MINION. As such, our findings suggest that choosing appropriate source languages for cross-lingual transfer learning of ED is important and can be further explored in future work.

Related Work
Early attempts for ED have employed feature-based models (Ahn, 2006; Ji and Grishman, 2008; Patwardhan and Riloff, 2009; Liao and Grishman, 2010; Hong et al., 2011; Li et al., 2013; Miwa et al., 2014b; Yang and Mitchell, 2016) while deep learning has recently been proven to be a better approach for ED (Nguyen and Grishman, 2015; Chen et al., 2015; Nguyen et al., 2016; Sha et al., 2018; Yang et al., 2019; Wang et al., 2019; Cui et al., 2020; Lai et al., 2021a; Ngo Trung et al., 2021; Pouran Ben Veyseh et al., 2021a). There have also been recent efforts on creating new datasets for ED for different domains, including biomedical texts (Kim et al., 2009), literary texts (Sims et al., 2019), cybersecurity texts (Satyapanich et al., 2020; Man et al., 2020), Wikipedia texts, fine-grained event types (Le and Nguyen, 2021), and historical texts (Lai et al., 2021b). However, such prior works and datasets for ED are mainly devoted to English, ignoring challenges in many non-English languages. Non-English datasets for ED also exist (Kobyliński and Wasiluk, 2019; Sahoo et al., 2020); however, these datasets are only annotated for one language with divergent ontologies and annotation guidelines, and are thus unable to support comprehensive studies and transferability research for ED on multiple languages. Existing ED datasets that cover multiple languages include ACE 2005 (Walker et al., 2006), TAC KBP (Mitamura et al., 2016, 2017), and TempEval-2 (Verhagen et al., 2010). Among such datasets, ACE 2005 is the most popular dataset used in prior multilingual/cross-lingual ED research (Chen and Ji, 2009; M'hamdi et al., 2019; Ahmad et al., 2021; Nguyen et al., 2021c; Nguyen and Nguyen, 2021). However, such multilingual datasets suffer from the issues of small data size, limited language coverage with greater focus on popular languages, and inaccessibility to the public as discussed in the introduction.
Finally, we also note some prior works that claim event detection datasets for non-English languages (Im et al., 2009; Küçük and Yazici, 2011; Lejeune et al., 2015). However, such datasets are not comparable to our dataset as their event detection task is in fact a sentence classification problem in which the established definition of events with event triggers is not followed and triggers are not annotated.

Conclusion
We introduce a new dataset for ED in 8 typologically different languages. The dataset is significantly larger and covers more and newer languages than prior resources. Specifically, 31,226 text segments from language-specific articles of Wikipedia are manually annotated in the dataset. Our experiments and analysis demonstrate the high quality of the dataset and the multilingual challenges of ED, providing ample room for future research in this direction. In the future, we will extend the dataset to include event argument annotations.