MUSIED: A Benchmark for Event Detection from Multi-Source Heterogeneous Informal Texts

Event detection (ED) identifies and classifies event triggers from unstructured texts, serving as a fundamental task for information extraction. Despite the remarkable progress achieved in the past several years, most research efforts focus on detecting events from formal texts (e.g., news articles, Wikipedia documents, financial announcements). Moreover, the texts in each dataset are either from a single source or from multiple yet relatively homogeneous sources. With massive amounts of user-generated text accumulating on the Web and inside enterprises, identifying meaningful events in these informal texts, usually from multiple heterogeneous sources, has become a problem of significant practical value. As a pioneering exploration that expands event detection to scenarios involving informal and heterogeneous texts, we propose a new large-scale Chinese event detection dataset based on user reviews, text conversations, and phone conversations on a leading e-commerce platform for food service. We carefully investigate the proposed dataset's textual informality and multi-domain heterogeneity by inspecting data samples quantitatively and qualitatively. Extensive experiments with state-of-the-art event detection methods verify the unique challenges posed by these characteristics, indicating that multi-domain informal event detection remains an open problem and requires further efforts. Our benchmark and code are released at https://github.com/myeclipse/MUSIED.


Introduction
Event detection (ED), which aims to identify event triggers and classify them into specific types from unstructured texts, has been widely researched and applied in various downstream tasks (Basile et al., 2014; Cheng and Erk, 2018; Kuhnle et al., 2021). Advanced models have been continuously proposed, ranging from feature-based models (Shasha et al., 2010; Hong et al., 2011; Li et al., 2013) to recent neural-based models (Chen et al., 2015; Nguyen et al., 2016; Chen et al., 2018; Xi et al., 2021; Xiangyu et al., 2021). Despite the significant progress, we find that previous works have the following two limitations in practical scenarios.
1. Current efforts mainly focus on event detection from formal texts. For example, a popular line of works (Li et al., 2013; Chen et al., 2015; Nguyen et al., 2016; Chen et al., 2018; Lou et al., 2021) aims to detect general domain events from news articles (e.g., ACE 2005 (Doddington et al., 2004)) and Wikipedia documents (e.g., MAVEN (Wang et al., 2020b)). Some other explorations involve extracting events from financial announcements (Yang et al., 2018; Zheng et al., 2019; Liang et al., 2021) or cybersecurity articles (Trong et al., 2020), which are also written in a relatively official style. In practical scenarios, however, we usually face the bottleneck of identifying events from informal texts. Compared with formal texts, texts produced in more casual contexts (e.g., online chat and phone conversations) pose unique challenges of long event triggers, high event density, and typo noise, as revealed in our analysis (§ 4.3). Indeed, with vast amounts of user-generated text accumulating on the open Web and in private enterprise systems, extracting meaningful events from these informal texts has become an urgent problem of significant practical value.
2. The targeted event-related texts are either from a single source or from multiple yet homogeneous sources. Most recent datasets (e.g., MAVEN (Wang et al., 2020b), CySecED (Trong et al., 2020), ChFinAnn (Yang et al., 2018), and BRAD (Lai et al., 2021)) are built from an individual data source. The most widely-used ACE 2005 (Doddington et al., 2004) covers six sources, which are, however, relatively homogeneous internet media to some extent. Regarding informal text, end-users can produce it in many different ways, and hence it exhibits more versatile expressing styles. Therefore, multi-source heterogeneity is another difficulty that inherently accompanies textual informality. For example, texts generated via online chat and phone calls in after-sales services may diversify greatly, e.g., in length and style. Unfortunately, current ED works fail to adequately address the issue of multi-source heterogeneity.
To address these two problems, in this paper we expand event detection to scenarios involving informal and heterogeneous texts. We construct a new large-scale Chinese event detection dataset based on Meituan*, the most popular Chinese e-commerce platform for food service, which provides users with multiple ways to feed back on food safety issues (events), such as posting reviews and communicating with after-sale staff. These reviews and conversations yield a large-scale multi-source heterogeneous informal text repository, which contains valuable information about food safety events and hence can serve as a corpus. We collect desensitized data from three typical scenarios: i) users posting reviews, ii) users communicating with after-sale staff through text messages, and iii) users communicating with after-sale staff on the phone. By extracting user reviews, text conversations, and phone conversations, we create a massive dataset consisting of MUlti-Source heterogeneous Informal texts for Event Detection (MUSIED).
* https://about.meituan.com/en

Our contributions can be summarized as follows:
• We expand event detection to scenarios involving informal and heterogeneous texts, for the first time, by carefully curating a new large-scale dataset.
• Extensive experiments with state-of-the-art methods verify the unique challenges posed by textual informality and multi-source heterogeneity characteristics, and indicate multiple promising directions worth pursuing.

Event Detection Definition
We follow the classical settings and terminologies adopted by the ACE 2005 program (Doddington et al., 2004) and MAVEN (Wang et al., 2020b), and specify the vital event terminologies as follows.
Event: a specific occurrence involving participants (location, time, subject, object, etc.).
Event Mention: a phrase or sentence within which an event is described.
Event Trigger: the main word or phrase that most clearly expresses the occurrence of an event.
Event Type: the semantic class of an event.
Event detection aims to identify event trigger words and classify their event types for a given text. Accordingly, ED is conventionally divided into two subtasks: (1) trigger identification, which aims to identify the event triggers; and (2) trigger classification, which aims to classify each recognized trigger into predefined categories. Both subtasks are evaluated with micro precision, recall, and F1 scores. Most recent works (Chen et al., 2015; Nguyen et al., 2016; Chen et al., 2018; Wang et al., 2019b) perform trigger classification directly (adding an additional type "N/A" to be classified at the same time, indicating that the candidate is not a trigger). We also inherit these settings in this paper.
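The micro-averaged evaluation described above can be sketched in a few lines. This is an illustrative stand-in, not the benchmark's official scorer: the function name `micro_prf` and the (span, type) tuple encoding of a trigger are our own assumptions.

```python
def micro_prf(gold, pred):
    """Micro precision/recall/F1 over trigger predictions.

    `gold` and `pred` are collections of ((start, end), event_type)
    tuples; a predicted trigger counts as correct only if both its
    span and its event type match a gold annotation exactly.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact span-and-type matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Under this protocol, a trigger with the right span but the wrong type contributes to neither precision nor recall, which is what makes trigger classification strictly harder than trigger identification.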

Data Collection
We collect data from Meituan, which provides users with multiple channels to feed back on food safety issues (events), among which the three most common are: i) users post reviews of restaurants where they have ordered food; ii) users communicate with after-sale staff through text messages; iii) users communicate with after-sale staff on the phone. First, we collect the user reviews, text conversations, and phone conversations from one week of online service logs. Further, we desensitized and anonymized the private information in the raw data (see § 7 for details). Samples from each scenario are shown in Figure 1 to promote understanding. Note that the phone conversations are speech data, which is transformed into text via an Automatic Speech Recognition (ASR) service (Wang et al., 2019a; Kaur et al., 2021). The collected data may not all involve food safety events (e.g., users may post positive reviews), so we hire annotators to select the reviews and conversations involving food safety incidents. Finally, we retained 4,226 user reviews, 3,767 text conversations, and 3,388 phone conversations, forming a corpus of 11,381 documents in total.

Event Schema Construction
With the assistance of food safety experts, we construct an event schema from the perspective of users. We exemplify using a typical food delivery service scenario shown in Figure 2, where users usually give feedback in terms of: (1) Food quality: poor food quality is the main cause of food safety problems (e.g., food is expired or undercooked). (2) Restaurant: the illegal or improper behaviors of restaurants (e.g., using illegal food additives) may lead to food safety problems. (3) Delivery person: a small but noticeable percentage of food safety problems are caused by the delivery person (e.g., damaging the packaging and polluting the food). (4) Physical feelings: rather than the above causes, users may directly express their physical feelings (e.g., feeling uncomfortable), which suggest the existence of food safety problems. Finally, the schema contains 21 event types and broadly covers the user's feedback about the above cases. Please refer to Appendix A for the full event schema description.

Figure 2: A typical food delivery service scenario.

Annotation Process
Even with a detailed annotation guideline, the annotation process is complicated and error-prone. For accuracy and consistency, we organize a two-stage iterative annotation, following ACE 2005 (Doddington et al., 2004) and MAVEN (Wang et al., 2020b). We recruit 20 annotators with food safety domain knowledge and train them with the guideline. After that, they are given an annotation exercise, and the 9 annotators with accuracy > 90% are selected to perform the formal annotation. At the first stage, each document is annotated by 3 independent annotators. The annotation is finished if and only if the 3 annotators reach an agreement. Otherwise, in the second stage, all 9 annotators and language experts discuss the documents with annotation disagreements together and determine the final results.

Annotation Challenges And Solutions
Candidate Selection
Since Chinese lacks natural delimiters, words must be generated by segmentation toolkits, which might not exactly match the triggers (Zeng et al., 2016; Lin et al., 2018). Also, the informal texts are more diverse, so it would be impractical and inaccurate to select words with specific features, as done for English datasets (Wang et al., 2020b). To address this challenge, we annotate in a character-wise manner, instead of performing word segmentation and word-wise annotation sequentially. In this way, though the trigger candidate set is larger because each possible phrase is regarded as a candidate trigger, we tackle i) the limitation of word boundaries and ii) the error propagation of word segmentation toolkits.
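The character-wise scheme can be illustrated with a small sketch that enumerates every character span (up to a length cap) as a trigger candidate; the function name and the cap are illustrative assumptions, not the authors' exact procedure.

```python
def candidate_spans(sentence, max_len=8):
    """Enumerate all character-level spans of a Chinese sentence as
    trigger candidates, so no word segmenter is needed and no word
    boundary can cut a trigger in half.
    """
    spans = []
    for start in range(len(sentence)):
        # end is exclusive; cap span length at max_len characters
        for end in range(start + 1, min(start + max_len, len(sentence)) + 1):
            spans.append((start, end, sentence[start:end]))
    return spans
```

The candidate set grows roughly as O(n × max_len) per sentence, which is the price paid for avoiding segmentation errors entirely.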
Boundary Confusion
During annotation, we find that triggers are usually followed or surrounded by stop words (such as auxiliary words, modal particles, etc.), especially in phone conversations.
We follow the principle that event triggers should not contain redundant information, as long as they can fully express the event information. For example, we do not annotate the modal particles in the following sentence S1: "臭 (stinky)" and "吃吐 (eat and vomit)" are the triggers of an Abnormalities and an Uncomfortable event, respectively. However, the tokens "的" and "了" following them are modal particles in Chinese and do not express useful information.
S1: The duck intestines were stinky; I ate and vomited. (鸭肠是臭的，把人都吃吐了)

Ambiguous User Expression
Informal user statements are not rigorous and may be insufficient for resolving ambiguities between event types. For example, for the trigger "梆硬 (hard)" in the following sentence S2, some annotators believe the reason for "梆硬 (hard)" is that the chicken is undercooked and consider it a trigger of an Undercooked event, while others think the reason is that the temperature is too low and treat it as a trigger of a Cold event.
S2: I felt that the chicken chop was cold, and the chicken in the chicken roll was also hard. (感觉鸡排冷了，鸡肉卷里的鸡肉也是梆硬的。) The annotators are required to disambiguate by integrating contextual information. For example, considering the context that the user first complains that the chicken chop is cold (i.e., "冷 (cold)"), the annotators tend to believe that the following phrase "梆硬 (hard)" also triggers a Cold event.

Annotation Quality
With this strict annotation process, our dataset is of high quality. For data with annotation disagreement in the first stage, all annotators discuss together and reach agreements (sometimes by voting). Also, we randomly sample 500 documents without annotation disagreement in the first stage and invite different first-stage annotators to re-annotate them. We measure the inter-annotator agreement between two annotators with Cohen's Kappa score. The results for trigger and type annotation are 0.83 and 0.82, respectively, which fall in the near-perfect agreement range of [0.81, 0.99]. Annotated samples are shown in Appendix C.
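The agreement measure can be reproduced with a short sketch of Cohen's Kappa over two annotators' per-token labels; this is a simplified stand-in for the paper's exact computation, with the label scheme ('T' for trigger, 'N' for non-trigger) assumed for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' parallel label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: probability both annotators pick the same label
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Values above 0.81 are conventionally read as near-perfect agreement, which is the range the reported 0.83/0.82 scores fall into.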

Data Size
Following Wang et al. (2020b), we show the main statistics of MUSIED and compare it with the following datasets in Table 1: (1) ACE 2005 (Walker et al., 2006), the most widely-used dataset, covering general domain events; (2) Rich ERE (Mitamura et al., 2015), provided by the TAC KBP competition and containing a series of datasets; (3) MAVEN (Wang et al., 2020b), the largest general domain dataset, constructed from Wikipedia and FrameNet; (4) RAMS (Ebner et al., 2020), which follows the AIDA ontology and uses Reddit articles; (5) BRAD (Lai et al., 2021), which focuses on Black Rebellions events in the African Diaspora; and (6) CySecED (Trong et al., 2020), the largest cybersecurity event dataset.

Data Distribution
The instance number of each event type is shown in Figure 3, which reveals an inherent data imbalance problem. We also display the top 5 event types with their instance numbers and proportions in Appendix B. MUSIED has 21 event types and 35,313 labeled instances, yet "Impurities" (with 13,883 labeled instances) and "Uncomfortable" (with 7,935 labeled instances) alone account for 61.7% of the data. 18 (85.7%) event types have a below-average number of labeled instances, and 6 event types even have fewer than 50 labeled instances. Though this potentially hinders the performance of ED models, the occurrence frequency of event types conforms to the long-tail phenomenon in the real world. We maintain the original distribution of MUSIED, which allows evaluating the ability of ED models in the long-tail scenario.

Analysis of Textual Informality
A key characteristic of MUSIED is that the corpus is composed of informal text.We introduce the features brought by textual informality as follows.

Long Triggers
Our observation shows that users tend to use more casual expressions and longer triggers to express events. For example, in the following sentence S3, the user says his/her two teeth were broken by the hard noodles. The phrase "牙齿都干掉两颗 (two teeth are broken)" triggers an Uncomfortable event and consists of 7 tokens.
S3: The rice is mushy, the noodles are as hard as steel wire, and two teeth are broken. (米饭稀烂，面条跟钢丝条一样硬，牙齿都干掉两颗)
MUSIED contains a much higher proportion of long triggers, as Figure 4 shows. Considering the proportion of triggers consisting of more than 2 tokens, MUSIED is nearly 53 times larger than ACE 2005 English (i.e., 26.97% vs. 0.50%) and 9 times larger than ACE 2005 Chinese (i.e., 26.97% vs. 3.06%). The long trigger phenomenon poses a great challenge to existing ED models.
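The length statistics behind Figure 4 can be sketched as follows; the function name is illustrative, and whether length is counted in characters or segmented tokens is an assumption (character counting shown here).

```python
from collections import Counter

def trigger_length_distribution(triggers):
    """Distribution of trigger lengths, reported as proportions
    (e.g., what fraction of triggers span exactly k characters)."""
    counts = Counter(len(t) for t in triggers)
    total = sum(counts.values())
    return {length: c / total for length, c in sorted(counts.items())}
```

Summing the proportions for lengths greater than 2 yields the long-trigger ratio used in the comparison above (26.97% for MUSIED).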

Multiple Events
Unlike professionals who write articles or documents in a relatively official style, users may hurriedly express multiple events within one sentence. For example, in the following sentence S4, the user reports multiple food-quality-related events, which lead to an Uncomfortable event.
S4: Then I ate his fried rice, because his prawns were not fresh and also undercooked. Then I had his grilled sausages and cured sausages, and they all didn't feel very fresh. After eating, I had diarrhea. (然后我吃了他那个炒饭因为他那个虾不新鲜然后也不熟然后再加上他那个烤肠啊腊肠啊都是感觉不是很新鲜然后吃了之后我我拉肚子) Following previous works (Chen et al., 2015), we compute statistics on sentences with multiple events and find that the proportion of multi-event sentences in MUSIED is much larger than in ACE 2005 (i.e., 36.9% for MUSIED vs. 27.3% for ACE 2005 English vs. 19.3% for ACE 2005 Chinese). The reason lies in that food safety event correlations are closer, and users tend to express the cause and consequence simultaneously.
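The multi-event statistic above can be sketched with a few lines; the function name and the convention of measuring over event-bearing sentences only are illustrative assumptions.

```python
def multi_event_ratio(sentence_event_counts):
    """Proportion of event-bearing sentences that contain more than
    one event mention (the 1/N setting), given per-sentence event
    counts; sentences with no events are excluded from the base.
    """
    with_events = [c for c in sentence_event_counts if c >= 1]
    multi = sum(1 for c in with_events if c > 1)
    return multi / len(with_events)
```

Applied to MUSIED's annotations, this kind of count yields the reported 36.9%, versus 27.3%/19.3% for ACE 2005 English/Chinese.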

Typos
Different from formal texts, which are produced by professionals, user-generated informal texts are less rigorous and may contain typos. The automatic speech recognition service may also produce errors. For example, in the following sentence S5, the user finds the beef rice is sour and expresses a Spoiled event. However, the user types the typo token "搜" (meaning search), which has the same pronunciation ("sou" in Chinese) but a different meaning from the token "馊" (meaning sour).
S5: I ordered beef rice, and it looks search (sour) already. (我点的牛肉饭，看起来都搜(馊)掉了) We compute statistics on the typos using a state-of-the-art spelling error corrector (SEC) (Li et al., 2021). The result shows that 2.2% of sentences contain spelling errors, 0.1% of tokens are typos, and 1.5% of these typos are within triggers. Though the accuracy of the SEC may be limited on our corpus, the result, together with our observation, reveals that the existence of typos is a noticeable problem.

Analysis of Multi-Source Heterogeneity
In this section, we analyze the multi-source heterogeneity from the following perspectives.

General Textual Features
The textual features shift remarkably across the sources of MUSIED. We present the statistics for each source in Appendix B.2, from which we can observe that document size varies significantly (i.e., 1.4 for user reviews vs. 48.1 for text conversations vs. 37.8 for phone conversations, in terms of sentences per document). The reason lies in that conversation with staff is more official, and users tend to provide more complete information. Also, we calculate the average sentence length for each source and further compute the standard deviation of these average sentence lengths. The standard deviation for MUSIED is notably larger than for ACE 2005 (i.e., 5.06 for MUSIED vs. 3.31 for ACE 2005 English vs. 3.87 for ACE 2005 Chinese).
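The cross-source spread of average sentence lengths can be sketched as below; whether the paper uses population or sample standard deviation is not stated, so population standard deviation is assumed here, and the function name is illustrative.

```python
import statistics

def cross_source_length_std(source_sentences):
    """Std. dev. of per-source average sentence lengths: a rough
    scalar measure of cross-source textual heterogeneity.

    `source_sentences` maps a source name to its list of sentences.
    """
    means = [statistics.mean(len(s) for s in sents)
             for sents in source_sentences.values()]
    return statistics.pstdev(means)
```

A larger value means the sources differ more in how long their sentences typically are, which is the sense in which MUSIED's 5.06 exceeds ACE 2005's 3.31/3.87.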

Event Type Distribution and Event Density
The event type distribution and event density vary significantly across the sources of MUSIED. The top 5 event types for each source are shown in Appendix B.1, from which we can easily observe the notable diversity of event type distributions across sources. For a quantitative analysis, we calculate the event type distribution for each source and measure the pairwise Wasserstein distance (Vallender, 1974); MUSIED exhibits a notably larger cross-source disparity than ACE 2005 in terms of the average Wasserstein distance. Also, we compute the average event density for each source and the standard deviation of these averages, which shows that the disparity of event density across MUSIED's sources is more remarkable (i.e., 0.35 for MUSIED vs. 0.17 for ACE 2005 English vs. 0.08 for ACE 2005 Chinese). To sum up, MUSIED exhibits more significant heterogeneity and can effectively support the exploration of ED involving multi-source heterogeneity. Conversely, the limited heterogeneity, together with the data scarcity problem, makes ACE 2005 insufficient for benchmarking relevant research.
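The distance computation can be sketched as follows. Note an assumption: W1 between categorical distributions requires a ground metric, and this sketch treats the event types as points on an ordered support (the same convention as `scipy.stats.wasserstein_distance`); the paper does not specify its exact ground metric.

```python
def wasserstein_1d(p, q):
    """First-order Wasserstein distance between two discrete
    distributions over the same ordered support: the sum of
    absolute differences of the cumulative distributions."""
    assert len(p) == len(q)
    cum_p = cum_q = total = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        total += abs(cum_p - cum_q)
    return total
```

Averaging this distance over all source pairs gives a single heterogeneity score per dataset, which is how the cross-source comparison above is framed.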

Benchmark Settings
We randomly split the annotated documents into train, dev, and test sets with a ratio of 8:1:1. The statistics of the three sets are shown in Table 2.
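The document-level split can be sketched as below; the seed and function name are illustrative, not the released split's actual configuration.

```python
import random

def split_documents(docs, seed=42):
    """Random 8:1:1 train/dev/test split at the document level, so
    that sentences from one document never straddle two sets."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n = len(docs)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (docs[:n_train],
            docs[n_train:n_train + n_dev],
            docs[n_train + n_dev:])
```

Splitting by document rather than by sentence keeps document-level context experiments (§ 6) leakage-free.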

Experimental Settings
Recently, neural-based models have achieved significant progress. Thus, we investigate the following state-of-the-art neural-based methods, which can be roughly divided into two categories.
Sentence-Level Models, which use information within the sentence to extract triggers: DMCNN (Chen et al., 2015), which uses a CNN as the feature extractor and concatenates sentence-level and lexical features; BiLSTM (Hochreiter and Schmidhuber, 1997), which uses a bi-directional long short-term memory network as the encoder; BiLSTM-CRF (Lafferty et al., 2001), which follows the BiLSTM encoder with a conditional random field layer; C-BiLSTM (Zeng et al., 2016), which proposes a convolutional BiLSTM to capture both sentence-level and lexical information; DMBERT (Wang et al., 2019b), which takes BERT as the encoder and adopts the dynamic multi-pooling mechanism; and BERT (Yang et al., 2019), which fine-tunes BERT on the downstream ED task in a sequence labeling manner.
Document-Level Models, which integrate document-level contextual information: HBTNGMA (Chen et al., 2018), which dynamically fuses sentence- and document-level information; and MLBiNet (Lou et al., 2021), which captures the document-level association of events.
The implementation details such as hyperparameters are listed in Appendix D. Following previous works, we report Precision (P), Recall (R), and F1-score (F1) on trigger classification.

Overall Experimental Results
The overall experimental results are shown in Table 3, from which we have the following observations: (1) Sequence labeling methods have advantages over token-level classification models. For example, BiLSTM and BERT achieve 2.9 and 2.9 F1 improvements over DMCNN and DMBERT, respectively. The reason lies in that token-level classification models predict trigger candidates separately, without considering event interdependency, while sequence labeling methods generate representations and make predictions collectively.
(2) BiLSTM-CRF achieves notable improvements over BiLSTM (e.g., 72.8 vs. 70.7 in terms of F1), with the assistance of the CRF layer modeling event correlations. This observation confirms our analysis in § 4.3.2 that modeling event correlations is important for MUSIED, due to the multi-event sentences.
(3) By incorporating document-level contextual information, HBTNGMA gains an absolute improvement of 3.4 F1 over BiLSTM-CRF (i.e., 76.2 vs. 72.8). The experimental result is consistent with our observation of ambiguous user expressions (§ 3.3.2), and clearly indicates the importance of document-level contextual information.

Challenge of Long Triggers
As § 4.3.1 shows, MUSIED contains long triggers due to informal expressions. We report BERT's recall on triggers of different lengths in Table 4, from which we can easily observe a general trend: the longer the trigger, the worse the recall. Existing ED models have difficulty capturing the distribution pattern of long triggers, and this challenge should be further addressed.

Challenge of Multi-Event Sentences
Following Chen et al. (2015), we divide the test set into two parts according to the number of events in a sentence (single event (i.e., 1/1) and multiple events (i.e., 1/N)), and perform evaluation separately. From Table 5 we can observe that all models perform much worse on 1/N, which coincides with previous findings (Chen et al., 2015, 2018).

Challenge of Typos
We apply the state-of-the-art spelling error corrector (SEC) (Li et al., 2021) to the test set, then manually collect the samples that are indeed typos. Further, we retest these corrected samples with BERT, as Table 6 shows. After correction, some mislabeled samples can be fixed and the performance is improved. For example, S5 in § 4.3.3 can be correctly predicted. Another concrete case is shown in Appendix F.2 to promote understanding. Though of great potential to address the typo challenge, our sampling statistics show that the precision of the SEC is quite limited on our corpus (47.8%). One possible reason is that the textual features of our corpus are quite different from those of the SEC's training corpus. We believe that developing a spelling error corrector more suitable for MUSIED and exploring more sophisticated methods, such as incorporating pronunciation features, may help address the challenge.

Analysis of Multi-Source Heterogeneity
Since the different sources of MUSIED have diversified data characteristics, we investigate the multi-source heterogeneity via the following two typical research topics (i.e., multi-domain learning and domain adaptation). Following Pradhan et al. (2013); Ganin and Lempitsky (2015); Chen et al. (2021); Wang et al. (2020a), we treat each source as a single "domain" in the following investigation.

Analysis of Multi-Domain Learning
So far, we have exploited a standard strategy: naively pooling all available data across domains (sources) and discarding the domain information. A shared model is trained to serve all domains. However, the multi-source heterogeneity drives us to explore ways to utilize the domain information. Following Chen and Cardie (2018); Wang et al. (2020a), we select BERT and experiment with the following multi-domain learning strategies: (1) SingleDomain (SD), which trains an individual ED model for each domain separately, using only that domain's training data.
(2) PoolDomain (PD), which is the strategy we have used so far. The model ignores domain information but uses all available training data.
(3) PoolDomain-MultiTask (PDMT), which is similar to PoolDomain, except that we add an auxiliary task that predicts domain labels; the domain information is utilized, though in a simple way. (4) MultiDomain-Shared-Private (MDSP), which uses i) a shared MLP for all domains that extracts generic, domain-invariant features, and ii) a private MLP for each domain that extracts domain-specific characteristics.
We report the performance on each domain and on the overall test set in Table 7, from which we can observe that: (1) The difficulty of event detection varies across domains: text conversations are the easiest, and phone conversations are the hardest. (2) PD outperforms SD, which is consistent with the observations in Chen and Cardie (2018); information sharing between domains may improve the generalization ability of ED models. (3) PDMT gains a slight improvement over PD by utilizing the domain information in a simple multi-task way, demonstrating that domain information can bring effective clues. (4) Further, the MDSP strategy generally outperforms all models (e.g., achieving 76.9 F1). The shared-private framework can effectively capture common language features shared across domains, as well as domain-specific patterns. The above analysis shows that domain information is an effective enhancement, and multi-domain learning deserves more research efforts.

Analysis of Domain Adaptation
Domain adaptation is another key criterion for evaluating ED models. Following Naik and Rose (2020), we investigate the typical unsupervised domain adaptation (UDA) problem and adopt the following strategies: (1) BERT-Naive, which utilizes the labeled source domain dataset and ignores the target domain data. (2) BERT-ADA, which incorporates the adversarial domain adaptation (ADA) framework to construct representations that are predictive for ED but not predictive of the domain.
As Table 8 shows, we select the source and target domain from the three domains in turn, forming six UDA settings. Though the ADA framework is reported to be advantageous (Naik and Rose, 2020), this is not the case on MUSIED: BERT-ADA underperforms BERT-Naive in several settings (e.g., U→T, T→P, and P→T), which indicates that domain adaptation on MUSIED is challenging due to the multi-source heterogeneity, and more research efforts are required. Other DA settings (e.g., semi-supervised DA) can also be effectively supported by MUSIED and should be further investigated.

Related Work
Table 8: Performance of unsupervised domain adaptation on trigger classification (%). A→B denotes that A and B are the source and target domains. U, T, and P denote user reviews, text conversations, and phone conversations, respectively. The performance on both the source domain test set (the In-Domain column) and the target domain test set (the Out-Of-Domain column) is reported.

Most existing works on event detection adopt general domain datasets such as ACE 2005 (Walker et al., 2006), the TAC KBP datasets (Mitamura et al., 2015), and MAVEN (Wang et al., 2020b) as benchmarks. Also, some works present domain-specific datasets and valuable explorations. For example, event extraction from biomedical texts has been extensively researched (Pyysalo et al., 2007; Thompson et al., 2009; Buyko et al., 2010; Nédellec et al., 2013). Sims et al. (2019) present a new dataset of literary events. CASIE (Satyapanich et al., 2020)
and CySecED (Trong et al., 2020) are proposed to facilitate the research of detecting cybersecurity events. Continuous works (Yang et al., 2018; Zheng et al., 2019; Liang et al., 2021) have focused on detecting financial events from Chinese financial announcements (i.e., the ChFinAnn dataset). Lai et al. (2021) present BRAD, which focuses on Black Rebellions events in the African Diaspora.
However, most existing works focus on detecting events from formal texts (e.g., news articles, Wikipedia documents, etc.), and target datasets where the texts are either from a single source (e.g., MAVEN (Wang et al., 2020b), CySecED (Trong et al., 2020), ChFinAnn (Yang et al., 2018)) or from multiple yet homogeneous sources (e.g., ACE 2005 (Doddington et al., 2004)). In this paper, we present a massive multi-source heterogeneous informal text dataset for event detection, for the first time. It is also the first food safety event detection dataset.

Conclusion and Future Work
We have presented MUSIED, a massive multi-source heterogeneous informal text dataset for event detection, based on user reviews, text conversations, and phone conversations from online food services. The extensive evaluation verifies the unique challenges posed by the textual informality and multi-source heterogeneity characteristics. Our in-depth investigations reveal multiple promising directions worth pursuing, including exploiting document-level information, multi-domain learning, and domain adaptation. In the future, we are interested in extending MUSIED to more event-related tasks such as event argument extraction.

Limitations
MUSIED is composed of a Chinese corpus, which might be less friendly to researchers who are unfamiliar with Chinese. However, considering that many non-English datasets have been proposed and have promoted research in related fields (e.g., the Douban Conversation Corpus (Wu et al., 2017) for dialogue systems, DuReader (He et al., 2018) for machine reading comprehension, etc.), we believe that the language barrier does not hinder the contribution of MUSIED to the community. Also, we provide a well-documented homepage and easy-to-use toolkits, including preprocessing scripts, models, and checkpoints, to further reduce the impact of the language barrier.

Ethics Impact
In consideration of ethical concerns, we provide the following detailed description: 1. The corpus is sampled from the logs of a real e-commerce platform, and we strictly desensitized and anonymized the private information.

F.3 Impact of Data Imbalance
As § 4.2 shows, an inherent data imbalance problem exists in MUSIED. To quantitatively investigate its effect, we first rank labels (i.e., event types) by the number of their corresponding training instances and then divide them into several subsets with continuous rankings. Since instances of a specific label may be too few, empirical results on a set of labels yield more robust and convincing conclusions. The first event type alone forms a single subset, and the remaining 20 event types are equally grouped into three subsets. In this way, we finally get a division of four subsets, named Subset-1, Subset-2, Subset-3, and Subset-4, which contain 1, 6, 7, and 7 labels, respectively. As Table 17 shows, we collect the F1-scores of four baselines for each subset, from which we can find that the data imbalance problem significantly hinders performance and results in a degradation (e.g., 88.96 for Subset-1 vs. 64.38 for Subset-2 vs. 59.31 for Subset-3 vs. 49.99 for Subset-4 for BERT). The performance is significantly worse when a label has fewer training instances. Hence, further explorations on handling the data imbalance challenge may be critical for MUSIED.
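The subset construction described above can be sketched as follows; the function name is illustrative, and the split sizes (1, 6, 7, 7) follow the paper's description for its 21 event types.

```python
def frequency_subsets(label_counts):
    """Rank event types by training-instance count; the single most
    frequent type forms Subset-1, and the remaining types are split
    into three near-equal groups by continuous ranking.
    """
    ranked = sorted(label_counts, key=label_counts.get, reverse=True)
    head, rest = ranked[:1], ranked[1:]
    k = len(rest) // 3
    # for 20 remaining labels this yields groups of 6, 7, and 7
    return [head, rest[:k], rest[k:2 * k + 1], rest[2 * k + 1:]]
```

Reporting per-subset F1 then exposes how sharply performance decays from the head of the label distribution to its tail.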

[Figure 1 samples]
Text conversation: 我吃完一直在拉肚子 >> I have been having diarrhea after eating! / 是这笔订单吗？ >> Is this your order? / 对，就是这个订单，就这家店 >> Yes, that is exactly the order and the restaurant. / 亲您别着急，您现在还是身体不舒服吗？ >> Please don't worry. Are you still feeling uncomfortable?
Phone conversation: 啊就是我刚刚点了那个外卖里边儿那个，绿豆芽里边儿有一根头发。 >> Ah, I just ordered takeout, and there is a hair inside the mung bean sprouts. / 是这个[店名]这个订单是吗？ >> Is this the order from [store name]?
User review: 吃完过后，直接拉肚子了，感觉那个鸡排不是很新鲜，像是炸了很多遍的 >> After eating it, I had diarrhea. I felt that the chicken chop was not very fresh, like it had been fried many times.

Figure 3: Instance number of each event type.

Figure 4: Distribution of triggers with different lengths.

[Figure 3 x-axis, the 21 event types: Impurities, Uncomfortable, Abnormalities, Low-quality, Spoiled, Undercooked, Cold, Expired, Damaged, Poor-environment, Recycled-material, Inconsistent-product, Fake, Steal, Non-compliant, Poor-packaging, Unreliable-product, Thaw, Additives, Contraband, Harmful-residues; y-axis: #Instance, 0 to 14,000.]
We can observe that MUSIED is large-scale compared with existing datasets. In terms of the average instance number per event type, MUSIED has a significant advantage over other datasets (e.g., 1,756 for MUSIED vs. 707 for MAVEN vs. 162 for ACE 2005). Thus, MUSIED can stably train and benchmark sophisticated neural-based models.

Table 2: The statistics of splitting MUSIED.

Table 4: BERT's recall on triggers of different lengths.

Table 6: Performance on corrected samples.
Xilun Chen and Claire Cardie. 2018. Multinomial adversarial networks for multi-domain text classification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1226-1240, New Orleans, Louisiana. Association for Computational Linguistics.

Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 167-176.

Yubo Chen, Hang Yang, Kang Liu, Jun Zhao, and Yantao Jia. 2018. Collective event detection via a hierarchical and bias tagging networks with gated multi-level attention mechanisms. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1267-1276, Brussels, Belgium. Association for Computational Linguistics.

Sen Yang, Dawei Feng, Linbo Qiao, Zhigang Kan, and Dongsheng Li. 2019. Exploring pre-trained language models for event extraction and generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5284-5294.

Ying Zeng, Honghui Yang, Yansong Feng, Zheng Wang, and Dongyan Zhao. 2016. A convolution BiLSTM neural network model for Chinese event extraction. In Natural Language Understanding and Intelligent Applications - 5th CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2016, and 24th International Conference on Computer Processing of Oriental Languages, ICCPOL 2016, Kunming, China, December 2-6, 2016, Proceedings, volume 10102 of Lecture Notes in Computer Science, pages 275-287. Springer.

Shun Zheng, Wei Cao, Wei Xu, and Jiang Bian. 2019. Doc2EDAG: An end-to-end document-level framework for Chinese financial event extraction. arXiv preprint arXiv:1904.07535.

Table 17: F1-scores of different models on the four subsets.