Trigger-Argument based Explanation for Event Detection



Introduction
Event Detection (ED) aims at identifying event triggers with specific event types, which is the first and fundamental step for extracting semantic and structural knowledge from plain text (Ahn, 2006; Nguyen and Grishman, 2015). For instance, the event mention "The train driver was beaten over the head by a thug." in Figure 1 comprises an event trigger "beaten" and a set of arguments such as "the train driver", "the head" and "a thug". An ideal ED system is expected to detect "beaten" as an event trigger of the type Bodily_harm. Recently, with the growth of open-source annotated datasets (Walker et al., 2006; Wang et al., 2020) and the development of deep learning technologies, deep approaches have become popular for tackling the ED problem (Nguyen et al., 2016; Wang et al., 2021). Despite their great performance, they remain opaque, making it hard for people to comprehend their inner mechanisms.
Although many works focus on explaining model behavior on natural language processing (NLP) problems, such as text classification (Lei et al., 2016), text matching (Jiang et al., 2021) and machine reading comprehension (Ju et al., 2021), very little progress has been made on interpreting ED models. We identify two major limitations that prevent existing explanation methods from being applied to ED models.
Neglecting event structured knowledge. Existing methods mainly focus on assessing the contributions of individual input units (e.g., words or phrases) to generate explanations for neural networks (Li et al., 2016; Jiang et al., 2021). As shown in Figure 1(a) and (b), both explanations provide insights into which words (e.g., "beaten") or phrases (e.g., "beaten over the head") contribute to the prediction. However, neither of them is suitable for explaining ED models, as an event is represented as a structure comprising an event trigger and a set of arguments. Thus, trigger-argument structures are more sensible clues for explaining ED systems. In Figure 1(c), "beaten" is an ambiguous word that may evoke completely dissimilar events such as Bodily_harm and Competition. In this case, the trigger word "beaten" and its arguments (e.g., "the head" and "a thug", which refer to Body_part and Agent) work together for the prediction Bodily_harm. Thus, how to take advantage of event structure knowledge for ED model explanation is a non-trivial task.
Explanations cannot reflect the decision-making process. Models usually provide important features, i.e., words or phrases selected from an input text, as explanations, but they do not further elaborate on the function of these features, i.e., why models produce the prediction according to them. This makes it challenging to interpret an explanation and connect it to the model prediction. For example, in Figure 1(a) and (b), models may assign high relevance scores to "train driver" or "thug", but it remains unclear why these features lead to the prediction Bodily_harm. In fact, "train driver" and "thug" serve as Victim and Agent, which together compose the Bodily_harm event in which "An Agent injures a Victim" in Figure 1(c). Furthermore, Figure 1(d) provides an example that wrongly classifies Competition as Bodily_harm, because the model takes "Harley" and "John" as Agent and Victim rather than Participant_1 and Participant_2. Thus, explanations that can not only identify important features but also reveal how these features contribute to the prediction are urgently needed.
To address the aforementioned challenges, we propose TAE, a Trigger-Argument based Explanation method, to generate structure-level explanations for ED models. Specifically, TAE focuses on utilizing neuron features of ED models to construct explanations based on trigger-argument knowledge. It has three core sub-modules: Group Modeling aims to divide neurons into different groups, where each group is regarded as an event structure, such that each neuron corresponds to one argument and works together with other neurons in the same event structure to explain the prediction of ED models; Sparsity Modeling aims to compact explanations by designing differentiable masking mechanisms to automatically filter out useless features generated by the group mechanism, the intuition being that a good explanation should be short to read and understand (Miller, 2019); Support Modeling aims to ensure that the explanations generated by the group and sparsity mechanisms are faithful to the predictive model. Note that we utilize FrameNet, a linguistic knowledge base well defined by experts, to help TAE identify event structures and help humans understand the decision-making process. The contributions of this paper are as follows: • We propose a model-agnostic method, called TAE (Trigger-Argument based Explanation), to construct structure-level explanations for Event Detection (ED) systems. To the best of our knowledge, this is the first exploration of explaining ED with structure knowledge.
• TAE adopts three strategies, namely, Group Modeling, Sparsity Modeling and Support Modeling to characterize the trigger-argument based explanations from structuralization, compactness, and faithfulness perspectives.
• We utilize FrameNet (Baker et al., 2006), a well-defined knowledge base, to help complete the event structure in MAVEN. The annotated data is released for further research.
• Experimental results on the large-scale MAVEN and widely-used ACE 2005 datasets show that TAE can generate more faithful and human-understandable explanations.

Related Work
In this section, we review related work on Event Detection and Interpretation Methods. Event Detection. Event detection is a key task for Event Knowledge Graph (Pan et al., 2017) construction. Traditional methods for ED employed feature-based techniques (Ji and Grishman, 2008; Li et al., 2013). These approaches mainly rely on elaborately designed features and NLP tools. Later, advanced deep learning methods were applied to ED, such as convolutional neural networks (Chen et al., 2015) and bidirectional recurrent neural networks (Nguyen et al., 2016), which take advantage of neural networks to learn features automatically. Since pre-trained language models (PLMs) are capable of capturing the meaning of words dynamically by considering their context, they have proven successful on a range of NLP tasks including ED (Tong et al., 2020; Wang et al., 2020, 2021). Although neural networks and PLMs bring incredible performance gains on the ED task, they offer little transparency concerning their inner workings.
Interpretation Methods. There has been growing interest in producing explanations for deep learning systems in recent years, enabling humans to understand their intrinsic mechanisms. In general, the explanations from these methods can typically be categorized as post-hoc explanations, which aim to explain a trained model and reveal how it arrives at a prediction (Lipton, 2016). Among them, gradient-based, attention-based and erasure-based methods are three typical families.
Gradient-based methods are model-aware interpretation methods that use gradients to measure feature importance (Shrikumar et al., 2017). Since a token index is not ordinal, these methods simply sum up the relevance scores of each representation dimension. Because the per-dimension scores can be negative or positive, the summed score may become zero even if the token does contribute to the prediction (Arras et al., 2017).
Attention-based methods attempt to use attention weights as feature importance scores (Vashishth et al., 2019). However, attention is argued not to be an optimal way to identify the attribution for an output, as its validity is still controversial (Bastings and Filippova, 2020).
Erasure-based methods are widely used approaches in which a subset of features is considered irrelevant if it can be removed without affecting the model prediction (Feng et al., 2018). A straightforward approach is to erase each token by replacing it with a predefined value such as zero (Li et al., 2016). However, these erasure methods usually generate explanations by calculating the contribution of individual units to the predictions, which is not suitable for ED, as an event is often only correctly identified via its event structure.
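The token-erasure idea above can be sketched as follows. This is a minimal, illustrative implementation of the general leave-one-out family, not any specific baseline from this paper; `toy_model` is a hypothetical stand-in for a trained ED classifier.

```python
# Sketch of leave-one-out erasure scoring: each token is scored by how much
# the probability of the predicted class drops when that token is replaced
# with a neutral value.
def leave_one_out(model, tokens, target, pad="[PAD]"):
    base = model(tokens)[target]
    # a large drop => the erased token was important for the prediction
    return [base - model(tokens[:i] + [pad] + tokens[i + 1:])[target]
            for i in range(len(tokens))]

def toy_model(tokens):
    # hypothetical classifier: pretends the trigger word alone drives
    # the Bodily_harm probability
    return {"Bodily_harm": 0.9 if "beaten" in tokens else 0.1}

scores = leave_one_out(toy_model, ["driver", "was", "beaten"], "Bodily_harm")
```

Here the trigger "beaten" receives a large score while context tokens receive zero, which illustrates the paper's point: individual-token scores alone cannot expose the trigger-argument structure behind the prediction.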
In this paper, we attempt to generate explanations for ED models by considering the semantic structured knowledge (Chen et al., 2018) entailed in the input at the neuron level, which is complementary to the aforementioned approaches.

Event Detection
An event refers to "a specific occurrence involving one or more participants" in ACE (Automatic Content Extraction). To facilitate the understanding of the ED task, we introduce related terminology as follows: Event Trigger: the main word that most clearly expresses the occurrence of an event.
Event Arguments: the entities involved in an event, serving as its participants.
Event Mention: a phrase or sentence within which an event is described.
For the event mention "The train driver was beaten over the head by a thug", an event extractor is expected to identify a Bodily_harm event triggered by "beaten" and extract the corresponding arguments with different roles, such as "the train driver" (Victim) and "a thug" (Agent). In this paper, instead of explaining full standard event extraction models, we concentrate only on the ED task. That is, for this example, our goal is to explain why ED models classify the event as Bodily_harm or not.

Problem Formulation
ED explanation aims to explain a trained ED model and reveal how the model arrives at its prediction. For an event mention x = {x_1, x_2, ..., x_i, ..., x_n} with n words, a given pre-trained neural network (NN) f maps x to the corresponding label y_j, where y_j ∈ {y_1, y_2, ..., y_m} is the corresponding event type, which has a unique trigger-argument structure F_x ∈ F.
Assume that the NN model f = g(h(x)) can be decomposed into two stages: (1) h(·) maps the input x to the intermediate layer h(x) ∈ R^d, where h_k(x) is the k-th neuron in h(x); and (2) g(·) maps the intermediate layer h(x) to the output g(h(x)), which is the probability of input x being predicted as label y_j by the NN model, as shown in the top part of Figure 2.
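The two-stage decomposition f(x) = g(h(x)) can be illustrated with a toy stand-in; the dimensions and random weights below are purely hypothetical and not the paper's network.

```python
import numpy as np

# Illustrative decomposition f(x) = g(h(x)): h maps the input to a
# d-dimensional hidden layer whose entries are the neurons h_k(x);
# g maps the hidden layer to class probabilities over the m event types.
rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 8))    # input dim 4 -> d = 8 neurons (toy sizes)
W_g = rng.normal(size=(8, 3))    # d = 8 -> m = 3 event types

def h(x):
    return np.tanh(x @ W_h)      # hidden layer; h(x)[k] is the k-th neuron

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def g(hidden):
    return softmax(hidden @ W_g)  # probability of each label y_j

x = rng.normal(size=4)
probs = g(h(x))                   # f(x) = g(h(x))
```

TAE operates on the intermediate layer h(x): it explains which neurons (and which event-structure features they represent) drive the output of g.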
To better understand neurons, recent work attempts to identify the closest features to explain their behavior. The correlation between a neuron and a feature is obtained as follows:

Neu(h_k(x)) = argmax_{F_i ∈ F} ρ(h_k(x), F_i),   (1)

where F are features of the input x, such as keywords, POS tags, and trigger-arguments, Neu(h_k(x)) is the most related feature selected from x to represent h_k(x), and ρ is an arbitrary correlation calculation function (instantiated as IoU in this work).
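The neuron-to-feature selection of Equation 1 can be sketched as follows, instantiating the correlation ρ as IoU over token positions; the feature names and position sets are hypothetical examples, not annotations from the datasets.

```python
# Minimal sketch of Equation 1: pick the feature most correlated with a
# neuron, with the correlation rho instantiated as IoU over token positions.
def iou(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def neu(neuron_positions, features):
    # features: feature name -> set of token positions it covers
    return max(features, key=lambda f: iou(neuron_positions, features[f]))

# hypothetical trigger-argument features and a neuron that fires on
# positions {8, 9}
features = {"Victim": {0, 1, 2}, "Body_part": {6}, "Agent": {8, 9}}
closest = neu({8, 9}, features)   # the feature this neuron best matches
```

In this sketch the neuron is aligned to the Agent argument, which is the kind of trigger-argument-level correspondence TAE builds its explanations from.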

Method
In this paper, we propose TAE, a trigger-argument based explanation method for event detection, which attempts to utilize event structure knowledge to explain model predictions at the neuron level. An overview of our method is shown in Figure 2, which contains three modules: (1) the Group module captures structured knowledge of events; (2) the Sparsity module encourages models to select few but key features in the event structure; (3) the Support module is a fundamental module that guarantees the explanations generated by Group and Sparsity are consistent with the original prediction.
The loss function of a structured explanation for an event is obtained via the optimization problem

L = λ_g L_g + λ_s L_s + λ_sd L_sd,

where L_g, L_s and L_sd are the losses from the group, sparsity and support modules, while λ_g, λ_s and λ_sd are hyper-parameters controlling the weights of the different losses.
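The combined objective is a simple weighted sum, which can be written directly; the weight values below are illustrative placeholders, not hyper-parameters reported in the paper.

```python
# Weighted sum of the group, sparsity and support losses; the lambda
# defaults are illustrative, not tuned values from the paper.
def total_loss(l_g, l_s, l_sd, lam_g=1.0, lam_s=0.1, lam_sd=1.0):
    return lam_g * l_g + lam_s * l_s + lam_sd * l_sd

loss = total_loss(1.0, 2.0, 3.0, lam_g=1.0, lam_s=0.5, lam_sd=2.0)
```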

Group Modeling
The Group module aims to divide neurons into different groups, where each group corresponds to a trigger-argument structure. Some existing works try to aggregate related features according to distance information, such as encouraging highly overlapping regions of an image (Varshneya et al., 2021), or gathering neighboring words to enhance the current word (De Cao et al., 2020; Jiang et al., 2021). However, these methods might not work here, as the arguments of an event type can be scattered across different positions and are usually not adjacent to each other in the input text.
To solve this problem, we propose a group loss objective that constructs event structures by aggregating the neurons corresponding to related arguments. We first use the k-means clustering algorithm (Hartigan and Wong, 1979; Ghorbani et al., 2019) to automatically cluster neurons with the nearest mean into the same group,
where G = {G_1, G_2, ..., G_L} is the set of groups and L is the number of groups. Then, for each individual group G_l, we use the IoU to measure the contribution ϕ(h_i^l(x)) of neuron h_i^l(x) within the group.
ϕ(h_i^l(x)) = IoU(h_i^l(x), F_x),

where F_x is the trigger-argument feature of input x, and h_i^l(·) is the i-th neuron in group G_l. Finally, the group objective L_g minimizes the intra-cluster sum of the distances from each neuron to the labeled feature in the input (Varshneya et al., 2021):

L_g = Σ_{l=1}^{L} Σ_{h_i^l ∈ G_l} (1 − ϕ(h_i^l(x))).

During the training phase, for each batch, we extract the trigger-arguments over the whole batch while calculating L_g. This means that a neuron can learn batch-level data features rather than individual features, which enhances generalization ability.
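The clustering step can be sketched with a plain Lloyd-style k-means over neuron activation vectors; this pure-numpy version stands in for an off-the-shelf implementation, and the activations and group count are illustrative.

```python
import numpy as np

# Sketch of the grouping step: cluster neurons by their activation vectors
# with k-means so that neurons responding to related arguments land in the
# same group.
def kmeans_groups(acts, L, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = acts[rng.choice(len(acts), L, replace=False)]
    for _ in range(iters):
        # squared distance of every neuron to every center
        d = ((acts[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for l in range(L):
            if (labels == l).any():
                centers[l] = acts[labels == l].mean(0)
    return labels

# four toy neurons: two near the origin, two near (5, 5)
acts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
labels = kmeans_groups(acts, L=2)
```

In TAE the points would be real neuron features and each resulting group is treated as one candidate event structure.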

Sparsity Modeling
The Sparsity module aims to produce compact and human-friendly explanations. This is achieved by removing "dead neurons" (Mu and Andreas, 2020), which are useless for model prediction, while keeping only the key information needed to explain predictions.
To this end, following existing work (De Cao et al., 2020), we use a differentiable masking mechanism to filter out useless neuron features. Specifically, for each extracted neuron, a classifier with a sigmoid activation function is added to determine whether the neuron should be masked or not. During the training phase, we directly use the L1 norm (Jiang et al., 2021) to minimize the number of active neurons:

L_s = Σ_{k=1}^{d} |φ(h_k(x))|,   (6)

where φ(·) is the neuron classifier. The straightforward idea is to minimize the number of non-zero positions.
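The masking idea can be sketched as follows: a per-neuron sigmoid classifier scores each neuron, and the L1 norm of the scores is the sparsity loss, pushing most mask values toward zero. The classifier weights and activations below are hypothetical.

```python
import numpy as np

# Sketch of the differentiable masking mechanism with an L1 sparsity loss.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparsity_loss(neurons, w, b):
    mask = sigmoid(neurons * w + b)    # phi(h_k(x)) per neuron, in (0, 1)
    return np.abs(mask).sum(), mask    # L1 norm of the mask, and the mask

# toy activations: two strongly firing neurons, two near-dead ones
neurons = np.array([0.9, 0.05, 0.8, 0.01])
loss, mask = sparsity_loss(neurons, w=10.0, b=-5.0)
```

Minimizing the L1 term drives the weakly activated ("dead") neurons toward a mask value of zero, so only the few key neurons survive into the explanation.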

Support Modeling
The support module aims to ensure the faithfulness of explanations generated by Group and Sparsity.
A desirable interpretable event detection model should satisfy the intuition that a prediction directly depends on the selected features. For an ED model, we choose the neurons in h(x) to generate explanations. Group and Sparsity are utilized to select neuron features µ containing structured and important information. Thus, the goal of Support is to measure whether µ can depict a true profile of how the model works. Specifically, a function h′(·) maps µ to new hidden states h′(µ), and g(·) maps the new hidden states h′(µ) to a new output g(h′(µ)), as shown in the bottom part of Figure 2.
We introduce an optimization objective to guarantee support modeling. Different from existing work that matches the current prediction, we directly require the reconstructed representation to meet the ground-truth distribution³:

L_sd = KL(ŷ ∥ g(h′(µ); θ)),
where ŷ and θ are the ground-truth labels and the trainable parameters respectively, and KL denotes the Kullback-Leibler divergence. Note that h′(·) can be any popular network architecture, such as an LSTM, a Transformer or a PLM. In our setting, to maintain interpretability, we use a simple linear projection and an MLP (multilayer perceptron) to build the network; the computation is much more efficient since we do not need to optimize the whole backbone (Yeh et al., 2020). In addition, in this way the module focuses on learning the neuron behavior instead of sacrificing the performance of the pre-trained models.
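The KL term of the support objective can be sketched directly; the distributions below are illustrative stand-ins for a one-hot ground truth and a prediction reconstructed from the selected neurons.

```python
import numpy as np

# Sketch of the support loss: KL divergence between the ground-truth label
# distribution and the prediction reconstructed from the selected neurons.
def kl_divergence(p, q, eps=1e-12):
    # eps avoids log(0) for zero-probability entries
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float((p * np.log(p / q)).sum())

y_true = [1.0, 0.0, 0.0]       # one-hot ground truth for the event type
y_from_mu = [0.8, 0.1, 0.1]    # g(h'(mu)): prediction from selected neurons
loss = kl_divergence(y_true, y_from_mu)
```

Driving this loss toward zero forces the selected neuron features µ to be sufficient, on their own, to reproduce the correct label distribution, which is the faithfulness guarantee the module provides.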
3 Assume that we extract neurons from the pre-trained model and that the neurons exactly reproduce the current prediction. If we identify the trigger-argument information that is useful for the model decision and remove the useless neurons, a reasonable explanation may meet or exceed the current prediction.
Datasets

MAVEN is a manually annotated dataset for the event detection task without annotated arguments, which contains 168 event types and 4,480 documents. The event types are manually derived from the frames defined in FrameNet (Baker et al., 1998; Guan et al., 2021b,a). To satisfy our needs, we utilize the automatic frame parser SEMAFOR (Das et al., 2014) to parse the MAVEN data. We select the data that have event types in MAVEN, and regard the corresponding frame elements (Guo et al., 2020) as event arguments. Finally, we collect 12,649 event mentions, and randomly split them into train/dev/test sets of sizes 8,469/2,000/2,000.
ACE 2005 is also a manually annotated dataset, which contains 8 event types, 35 argument roles and 599 English documents (Li et al., 2020). We further remove the data without arguments, and finally select 3,014 examples. Since the data size is relatively small, we cannot use it to learn a strong NN model. Therefore, we directly test the models trained on MAVEN on ACE 2005, which also verifies the models' generalization ability.

ED Models
Our TAE is a model-agnostic method for explaining ED models. In this paper, we first select two typical NN models, namely LSTM (Hochreiter and Schmidhuber, 1997), which contains 2 layers with 300 hidden states, and LSTM+CNN (Tan et al., 2015), which also has 2 layers with 300 hidden states. Moreover, we select three PLM-based models which achieve promising performance on ED, including BERT (Devlin et al., 2019), which has 12 layers and 768 hidden states, DMBERT (Wang et al., 2019), which is also based on the BERT-base version with 768 hidden states, and DeBERTa (He et al., 2021), which has 24 layers and 1,536 hidden states.
Table 1 shows the results (P, R, F1) of different models on both datasets in our experiments, where DeBERTa outperforms the other four models with higher F1 scores.

Support Evaluation
We adopt three metrics to evaluate the support degree (i.e., faithfulness): two metrics from prior explanation work, namely the area over the reservation curve (AORC) (DeYoung et al., 2020) and the area over the perturbation curve (AOPC) (Nguyen, 2018), and a newly defined metric called the support score (SUPP). AORC calculates the distance between the original predicted logits and the masked ones when reserving only the top k% of neuron features identified by trigger-arguments:

AORC = (1/N) Σ_{i=1}^{N} (P(ŷ|x_i) − P(k)′′(ŷ|x_i)),

where P(k)′′(ŷ|x) denotes the prediction that reserves the top k% of neuron features. Under this metric, lower AORC scores are better.
The AOPC score calculates the average change in prediction probability for the predicted class over all test data when deleting the top r% of neuron features:
AOPC = (1/N) Σ_{i=1}^{N} (P(ŷ|x_i) − P(r)′′(ŷ|x_i)),

where P(r)′′(ŷ|x) is the prediction after removing the top r% of neuron features, and N denotes the number of examples. In our experiments, r is set to 20. Under this metric, larger scores are better.
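Both faithfulness metrics can be sketched over arrays of class probabilities; the probability values below are illustrative, not results from the experiments.

```python
import numpy as np

# Sketch of the two faithfulness metrics. p_orig is the model's original
# probability for the predicted class; p_reserved keeps only the top-k%
# neurons (several k values); p_deleted removes the top-r% neurons.
def aorc(p_orig, p_reserved):
    # average distance between the original and top-k%-reserved predictions;
    # lower is better (reserving the key neurons should change little)
    return float(np.mean([abs(p_orig - p) for p in p_reserved]))

def aopc(p_orig, p_deleted):
    # average probability drop after deleting the top-r% neurons over N
    # examples; larger is better (deleting key neurons should hurt a lot)
    return float(np.mean(np.asarray(p_orig) - np.asarray(p_deleted)))

score_aorc = aorc(0.9, [0.88, 0.85, 0.80])    # one example, several k values
score_aopc = aopc([0.9, 0.8], [0.3, 0.25])    # N = 2 examples
```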
We propose the SUPP score to verify whether the new prediction g(h′(µ)) is positive with respect to the original one g(h(x)). Under this metric, larger SUPP scores are better.
Automatic support evaluation results are shown in Table 2, from which we make three observations: (1) TAE achieves better performance in most cases across all three metrics on both datasets. For the SUPP metric, all methods achieve positive results, indicating that our method can identify important features and make a positive contribution to model predictions.
(2) Compared to the LSTM- and CNN-based methods, the BERT-based methods achieve significantly better performance. This is perhaps because BERT has already acquired a large amount of general knowledge through training on large-scale data.
(3) Compared with MAVEN, the results on ACE are equally remarkable. Overall, our model achieves very strong results across different types of data and methods, demonstrating that it is a good model-agnostic approach.

Sparsity Evaluation
To evaluate sparsity, we directly report the sparsity score obtained in Equation 6 as the metric, following previous explanation work (Jiang et al., 2021). Under this criterion, the score reflects the degree of sparsity, and lower scores are better. The intuition behind this criterion is that a good explanation should be short enough to understand.
The results are reported in Table 2. We can see that TAE achieves the lowest SPAR values in most cases in the automatic evaluation; for example, the SPAR of our TAE + DeBERTa on ACE and MAVEN is 1.603 and 1.148, while the SPAR of LEAVE-ONE-OUT is 23.90 and 23.55, which indicates that TAE can effectively discover useless neurons.

Group Evaluation
To verify the effectiveness of the group mechanism, we use two metrics for explanations. First, following previous explanation work (Bau et al., 2017; Varshneya et al., 2021), for each predefined group we compute the number of unique trigger-arguments in the group as the interpretability score. Second, for the trigger-argument structure, we average the IoU scores computed in Equation 1 to represent its explanation quality, as in Mu and Andreas (2020).
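The two group-evaluation scores can be sketched directly; the group contents and IoU values below are hypothetical examples.

```python
# Sketch of the two group-evaluation scores: the interpretability score
# counts unique trigger-argument features per group, and the explanation
# quality score averages a group's IoU values.
def interpretability_score(groups):
    # groups: list of lists of trigger-argument names detected in each group
    return sum(len(set(g)) for g in groups)

def quality_score(iou_values):
    return sum(iou_values) / len(iou_values)

# hypothetical detections: duplicates within a group are counted once
groups = [["Victim", "Agent", "Victim"], ["Cause", "Effect"]]
interp = interpretability_score(groups)
```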
Figure 3 shows the comparison of interpretability scores with different numbers of groups. As the group number increases, the number of trigger-arguments detected by the model gradually increases, indicating that the grouping mechanism can improve model interpretability. However, the number of trigger-arguments remains constant after the group number exceeds 50. A major reason is the uneven distribution of the data, which mainly concentrates on 20% of the event types. Note that the maximum group number is limited by the number of event types; for MAVEN, for example, the maximum group number is 168.
Table 3 shows the IoU scores of 8 event types. In this setting, we test each event type separately. The score increases with the group mechanism in most cases, which further proves the effectiveness of the group mechanism.

Analysis on Trigger-Argument
To further verify the effectiveness of trigger-arguments, we introduce the features used in Mu and Andreas (2020) for comparison, such as POS (part-of-speech) tags, most common words, and entities. They suggest that a neuron cannot be regarded as a simple detector (Bau et al., 2017) but may express the meaning of multiple concepts, so they use composition operations such as OR, AND and NOT to expand the neuron behavior. We use the average IoU score over all neurons at different formula lengths as one metric:

SL_i = (1/|h_x|) Σ_{j=1}^{|h_x|} IoU(h_j(x), F),

where SL_i is the IoU score at formula length i, F is the feature set, h_j(·) denotes the j-th neuron feature, and |h_x| is the number of neurons. Under this metric, larger IoU scores are better.
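The formula-length metric can be sketched as follows. For simplicity this sketch restricts composition to OR (set union) and represents neurons and features as sets of token positions; the sets are hypothetical and the full method of Mu and Andreas (2020) also supports AND and NOT.

```python
from itertools import combinations

# Sketch of the formula-length metric: for a maximum composition length
# max_len, average the best IoU each neuron achieves against any
# OR-composition of up to max_len features.
def iou(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def score_at_length(neurons, features, max_len):
    best = []
    for n in neurons:
        candidates = [iou(n, set().union(*combo))
                      for r in range(1, max_len + 1)
                      for combo in combinations(features, r)]
        best.append(max(candidates))
    return sum(best) / len(best)

# one toy neuron covered exactly by the union of two features
neurons = [{0, 1, 2, 3}]
features = [{0, 1}, {2, 3}, {5}]
```

Allowing longer formulas can only improve the best IoU per neuron, which matches the monotone trend discussed for Figure 4.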
Figure 4 shows the results with (w/) and without (w/o) trigger-arguments, from which we obtain two findings: (1) with the help of trigger-arguments, the IoU scores are larger than those w/o trigger-arguments at every formula length. The results demonstrate that trigger-arguments can help generate more faithful explanations than word-level features. This is because each argument expresses a complete meaning, which may span several words rather than an individual word. (2) As the maximum formula length grows, the IoU score keeps increasing. When the formula length exceeds 10, the score no longer changes, indicating that the maximum representation capacity of a neuron is 10 trigger-arguments for the model.
We further perform a qualitative analysis by deleting the arguments with high support scores. As shown in Figure 5, the event mention "The crash sparked a review of helicopter safety" belongs to Causation. The explanation of our TAE model is that "The crash" and "sparked a review of helicopter safety" are two core arguments forming Causation, in which "a Cause causes an Effect". So when we delete the Effect argument ("sparked a review of helicopter safety"), the ED model wrongly classifies the event as Process_start. The same applies to the second event, Attack: when we delete the Victim, the ED model identifies it as Catastrophe. The qualitative results indicate that our proposed TAE can capture trigger-argument structures that are important for model prediction.

Case Study
Figure 6 shows an example of a TAE explanation for the event mention "During the American led coalition offensive in the Persian Gulf War, American, Canadian, British and French aircraft and ground forces attacked retreating Iraqi military personnel attempting to leave Kuwait on the night of February 26-27, 1991, resulting in the destruction of hundreds of vehicles and the deaths of many of their occupants." Given this event mention, 1) Group Modeling divides neurons into different event structures according to the argument information, e.g., neurons are grouped into Military_operation, Attack and Departing; 2) Sparsity Modeling filters useless features such as Depictive and Result to compact the explanations; 3) Support Modeling selects features that are consistent with the prediction; for example, Attack is more faithful compared to Military_operation and Departing. From the above three procedures, we obtain the TAE explanation, which not only contains important features from the text but also reveals why they are important for the final prediction. For instance, "American, Canadian, British and French aircraft and ground forces" and "retreating Iraqi military personnel" respectively refer to the Assailant and the Victim, which work together to characterize the Attack event in which the "Assailant physically attacks the Victim". In addition, with the help of trigger-argument information, the explanation is more helpful for human understanding.

Conclusion
In this paper, we propose a trigger-argument based explanation method, TAE, which exploits event structure to generate structure-level explanations for the event detection (ED) task. TAE focuses on utilizing neuron features of ED models to generate explanations, along with three strategies, namely group modeling, sparsity modeling, and support modeling. We conduct experiments on two ED datasets (i.e., MAVEN and ACE). The results show that TAE achieves better performance than strong baselines on multiple public metrics. In addition, TAE also provides more faithful and human-understandable explanations.
There are several future directions. First, we may look into using explanations to further improve ED classification, as well as applying ED explanations in downstream applications. Second, we plan to explore ED in a multi-modal setting. Third, event relation extraction is still challenging and deserves further investigation. From the practical perspective of event knowledge graphs (EKGs), it is worth investigating high-quality yet efficient methods for constructing EKGs and making use of EKGs to predict future events (Lecue and Pan, 2013; Deng et al., 2020). Furthermore, it may be worthwhile to integrate commonsense knowledge (Speer et al., 2016; Romero et al., 2019; Malaviya et al., 2020) into event knowledge graphs.

Limitations
In this section, we discuss the limitations of TAE. First, our method depends on event structure information obtained from an automatic parser; if the parser is not good enough, it will impact performance. Second, since we focus on leveraging structural information, we restrict the experiments to text-based event explanation. Future work will explore multi-modal event detection explanations and evaluate models on other NLP tasks.

Figure 1: Different explanations. For (a) and (b), features with deeper colors are considered more important by previous work. The usefulness of event triggers and arguments is illustrated in (c) and (d). "arg" refers to "argument". Bodily_harm and Competition are two event types in MAVEN.
Figure 3: Interpretability score for TAE.

Figure 5: Examples of deleting important arguments.

Figure 6: An example of TAE explanation.

Table 1: Model performance on MAVEN and ACE. P and R refer to Precision and Recall respectively.

Table 2: Support and Sparsity evaluation of different methods on ACE 2005 and MAVEN.

Table 3: IoU scores for the 8 event types.