Honey or Poison? Solving the Trigger Curse in Few-shot Event Detection via Causal Intervention

Event detection has long been troubled by the trigger curse: overfitting the trigger harms generalization, while underfitting it hurts detection performance. This problem is even more severe in the few-shot scenario. In this paper, we identify and solve the trigger curse problem in few-shot event detection (FSED) from a causal view. By formulating FSED with a structural causal model (SCM), we find that the trigger is a confounder of the context and the result, which makes previous FSED methods prone to overfitting triggers. To resolve this problem, we propose to intervene on the context via backdoor adjustment during training. Experiments show that our method significantly improves FSED on both the ACE05 and MAVEN datasets.


Introduction
Event detection (ED) aims to identify and classify event triggers in a sentence, e.g., detecting an Attack event triggered by fire in "They killed by hostile fire in Iraqi". Recently, supervised ED approaches have achieved promising performance (Chen et al., 2015; Nguyen and Grishman, 2015; Nguyen et al., 2016; Lin et al., 2018, 2019b; Du and Cardie, 2020; Liu et al., 2020a; Lu et al., 2021), but adapting them to new event types and domains requires a large amount of manually annotated event data, which is expensive. By contrast, few-shot event detection (FSED) aims to build effective event detectors that can detect new events in instances (the query) given only a few labeled instances (the support set). Due to their ability to classify novel types, many few-shot algorithms have been used in FSED, e.g., metric-based methods like the Prototypical Network (Lai et al., 2020; Deng et al., 2020; Cong et al., 2021).
Unfortunately, there has long been a "trigger curse" that troubles the learning of event detection models, especially in the few-shot scenario (Bronstein et al., 2015; Liu et al., 2017; Chen et al., 2018; Liu et al., 2019; Ji et al., 2019).

Figure 1: Illustration of the causal intervention strategy proposed in this paper. The graph includes the event E, the trigger set T, the context set C, the support instance S, the prediction Y and the query instance Q.

For many event types, the triggers are dominated by a few popular words; e.g., the Attack event type is dominated by war, attack, fight, fire, and bomb in ACE05, and we found that the top 5 triggers of each event type cover 78% of event occurrences in ACE05. Due to the trigger curse, event detection models nearly degenerate into trigger matchers: they ignore the majority of contextual information and mainly rely on whether the candidate word matches the dominant triggers. This problem is more severe in FSED: since the given support instances are very sparse and lack diversity, it is much easier to overfit the triggers of the support instances. An intuitive solution to the trigger curse is to erase the trigger information in instances and force the model to focus more on the context. Unfortunately, due to the decisive role of triggers, directly wiping out the trigger information commonly hurts performance (Lu et al., 2019; Liu et al., 2020b). Some previous approaches try to tackle this problem by introducing more diversified context information, such as event argument information (Liu et al., 2017, 2019; Ji et al., 2019) and document-level information (Ji and Grishman, 2008; Liao and Grishman, 2010; Duan et al., 2017; Chen et al., 2018). However, rich context information is commonly not available for FSED, and therefore these methods cannot be directly applied.
In this paper, we revisit the trigger curse in FSED from a causal view. Specifically, we formulate the data distribution of FSED using a trigger-centric structural causal model (SCM) (Pearl et al., 2016) shown in Figure 1(a). Such a trigger-centric formulation is based on the fact that, given the event type, contexts have a much lower impact on triggers than triggers have on contexts. This results in the decisive role of triggers in event extraction, and therefore conventional event extraction approaches commonly follow a trigger-centric procedure (i.e., identifying triggers first and then using triggers as indicators to find arguments in contexts). Furthermore, the case grammar theory in linguistics (Fillmore, 1967) also formulates language under such a trigger/predicate-centric assumption, which has been widely exploited in many NLP tasks like semantic role labeling (Gildea and Jurafsky, 2002) and abstract meaning representation (Banarescu et al., 2013).
From the SCM, we found that T (the trigger set) is a confounder of C (the context set) and Y (the result), and therefore there exists a backdoor path C ← T → Y. The backdoor path explains why previous FSED models disregard contextual information: it misleads the conventional learning procedure into mistakenly regarding the effects of triggers as the effects of contexts. Consequently, the learning criteria of conventional FSED methods are optimized towards spurious correlation, rather than capturing the causality between C and Y. To address this issue, we propose to intervene on the context to block the information flow from trigger to context. Specifically, we apply backdoor adjustment to estimate the interventional distribution, which is then used for optimizing causality. Furthermore, because backdoor adjustment relies on the unknown prior confounder (trigger) distribution, we also propose to estimate it via contextualized word prediction.

Structural Causal Model for FSED
This section describes the structural causal model (SCM) for FSED, illustrated in Figure 1(a). Note that we omit the causal structure of the query for simplicity, since it is the same as that of the support set. Concretely, the SCM formulates the data distribution of FSED: 1) Starting from an event E we want to describe (in Figure 1(a), an Attack event in Iraq).
2) The path E → T indicates the trigger decision process, i.e., selecting words or phrases (fire in Figure 1(a)) that can almost unambiguously express the event occurrence (Doddington et al., 2004). 3) The path E → C ← T indicates that a set of contexts is generated depending on both the event and the trigger: the contexts provide background information and organize that information depending on the trigger. For instance, the context "They killed by hostile [fire] in Iraqi" provides the place, the role, and the consequences of the event, and this information is organized following the structure determined by fire. 4) An event instance is generated by combining one of the contexts in C and one of the triggers in T via the path C → S ← T. 5) Finally, a matching between the query and the support set is generated through S → Y ← Q.
Conventional learning criteria for FSED directly optimize towards the conditional distribution P(Y|S, Q). However, from the SCM, we found that the backdoor path C ← T → Y passes on associations (Pearl et al., 2016) and misleads the learning with spurious correlation. Consequently, a learning procedure towards P(Y|S, Q) will mistakenly regard the effects of triggers as the effects of contexts, and therefore overfit the trigger information.

Causal Intervention for Trigger Curse
Based on the SCM, this section describes how to resolve the trigger curse via causal intervention.

Context Intervention. To block the backdoor path, we intervene on the context C; the new context-intervened SCM is shown in Figure 1(b). Given a support set s, the event set e of s, the context set $\mathcal{C}$ of s, and a query instance q, we optimize the interventional distribution $P(Y \mid do(C=\mathcal{C}), E=e, Q=q)$ rather than $P(Y \mid S=s, Q=q)$, where do(·) denotes the causal intervention operation. By intervening, the learning objective of models changes from optimizing correlation to optimizing causality.

Backdoor Adjustment. Backdoor adjustment is used to estimate the interventional distribution⁴:

$$P(Y \mid do(C=\mathcal{C}), e, q) = \sum_{t \in T}\sum_{s \in S} P(t \mid e)\, P(s \mid \mathcal{C}, t)\, P(Y \mid s, q)$$

where $P(s \mid \mathcal{C}, t)$ denotes the generation of s from the trigger and the contexts: $P(s \mid \mathcal{C}, t) = 1/|\mathcal{C}|$ if and only if the context of s is in $\mathcal{C}$ and the trigger of s is t. $P(Y \mid s, q) \propto \phi(s, q; \theta)$ is the matching model between q and s, parametrized by θ.
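As a minimal sketch (illustrative names, not the authors' implementation), the backdoor adjustment above reduces to a prior-weighted average of matching scores, since P(s|C, t) is uniform over the contexts:

```python
# Hedged sketch of the backdoor adjustment: the interventional distribution
# P(Y | do(C), e, q) is a mixture over candidate triggers t, weighted by the
# prior P(t|e), of per-instance matching scores P(Y | s, q).
# `trigger_prior` and `match_prob` are illustrative placeholders.

def backdoor_adjusted_prob(contexts, trigger_prior, match_prob, query):
    """P(Y | do(C), e, q) = sum_t P(t|e) * (1/|C|) * sum_c P(Y | s=(c,t), q).

    contexts:      list of contexts (the set C)
    trigger_prior: dict mapping trigger -> P(t|e)
    match_prob:    callable (support_instance, query) -> P(Y | s, q)
    """
    total = 0.0
    for trigger, p_t in trigger_prior.items():
        # P(s | C, t) = 1/|C|: each context is combined with trigger t
        inner = sum(match_prob((c, trigger), query) for c in contexts) / len(contexts)
        total += p_t * inner
    return total
```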
Estimating P(t|e) via Contextualized Prediction. The confounder distribution P(t|e) is unknown because E is a hidden variable. Since the event argument information is contained in C, we argue that P(t|e) ∝ M(t|C), where M(·|C) denotes a masked token prediction task (Taylor, 1953) constructed by masking the triggers in the support set. In this paper, we use a masked language model (MLM) to calculate P(t|e). We first generate a set of candidate triggers through the context:

$$\{t_1, t_2, \dots, t_k\} = \text{Top-}k\big(M(\cdot \mid C)\big)$$

where $t_i$ is the i-th predicted token and $t_0$ is the original trigger of the support set instance. Then P(t|e) is estimated from the logits of the MLM, averaged over the masked support contexts:

$$P(t_i \mid e) = \mathrm{softmax}_i(\bar{l}_i)$$

where $\bar{l}_i$ is the averaged logit for the i-th token. To reduce the noise introduced by the MLM, we assign an additional hyperparameter λ ∈ (0, 1) to $t_0$.
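A hedged sketch of this estimation step, assuming the logits have already been averaged over the masked support contexts; the exact interpolation with λ is our reading, not confirmed by the source:

```python
import math

# Illustrative sketch (not the authors' code) of estimating P(t|e):
# candidate triggers come from masked prediction over the support contexts,
# their averaged logits are normalized by a softmax, and the original trigger
# t0 receives an extra interpolation weight lambda in (0, 1).

def estimate_trigger_prior(avg_logits, t0, lam=0.5):
    """avg_logits: dict token -> logit averaged over masked support contexts."""
    exp = {t: math.exp(l) for t, l in avg_logits.items()}
    z = sum(exp.values())
    probs = {t: v / z for t, v in exp.items()}
    # reserve probability mass `lam` for the observed trigger t0 (assumption)
    prior = {t: (1 - lam) * p for t, p in probs.items()}
    prior[t0] = prior.get(t0, 0.0) + lam
    return prior
```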
Optimizing via Representation Learning. Given the interventional distribution, the FSED model can be learned by minimizing a loss function on it:

$$\mathcal{L}(\theta) = -\sum_{q \in \mathcal{Q}} f\Big(\sum_{t \in T}\sum_{s \in S} P(t \mid e)\, P(s \mid \mathcal{C}, t)\, P(Y \mid s, q; \theta)\Big)$$

where $\mathcal{Q}$ is the set of training queries and f is a strictly monotonically increasing function. However, optimizing $\mathcal{L}(\theta)$ requires calculating every $P(Y \mid s, q; \theta)$, which is quite time-consuming. To this end, we propose a surrogate learning criterion $\mathcal{L}_{SG}(\theta)$ that optimizes the causal relation based on representation learning:

$$\mathcal{L}_{SG}(\theta) = -\sum_{q \in \mathcal{Q}} g\Big(\sum_{t \in T}\sum_{s \in S} P(t \mid e)\, P(s \mid \mathcal{C}, t)\, R(s),\ R(q)\Big)$$

4 The proof is shown in the Appendix.

Here R is a representation model that takes s or q as input and outputs a dense representation, and g(·, ·) is a distance metric measuring the similarity between two representations. Such loss functions are widely used in metric-based methods (e.g., Prototypical Networks and Relation Networks). In the Appendix, we prove that $\mathcal{L}_{SG}(\theta)$ is equivalent to $\mathcal{L}(\theta)$.

Experimental Settings

Our evaluation setting follows (2020) in few-shot NER. We randomly sample a few instances as the support set, and all other instances in the test set are used as queries. A support set corresponds to one event type, and all types are evaluated by traversing each event type. Models need to detect the span and type of triggers in a sentence. We also compare the results across settings in Section 4.3. We evaluate all methods using macro-F1 and micro-F1 scores, and micro-F1 is taken as the primary measure.

Table 1: F1 score of 5-shot FSED on the test set. * means fixing the parameters of the encoder when finetuning. ± is the standard deviation of 5 random training rounds.

2) FS-ClusterLoss (Lai et al., 2020), which adds two auxiliary loss functions during training. Furthermore, we compare our method with models finetuned on the support set (Finetune) and pretrained on the training set (Pretrain). BERT base (uncased) is used as the encoder for all models and as the MLM for trigger collection.

Experimental Results
The performance of our method and all baselines is shown in Table 1. We can see that: 1) By intervening on the context in the SCM and using backdoor adjustment during training, our method can effectively learn FSED models. Compared with the original metric-based models, our method achieves 8.7% and 1.6% average micro-F1 improvements with the prototypical network and the relation network, respectively.
2) Causal theory is a promising technique for resolving the trigger curse problem. Notice that FS-LexFree cannot achieve performance competitive with the original FS models, which indicates that trigger information is important and that underfitting triggers hurts detection performance. This verifies that the trigger curse is very challenging and that causal intervention can effectively resolve it.
3) Our method achieves state-of-the-art FSED performance. Compared with the best scores among the baselines, our method gains 7.5%, 1.0%, and 2.0% micro-F1 improvements on the ACE05, MAVEN, and KBP17 datasets, respectively.

Effect on Different Settings
To further demonstrate the effectiveness of the proposed method, we also conduct experiments under different FSED settings: 1) the primary episode-based setting (Episode), i.e., the 5+1-way 5-shot setting of Lai et al. (2020); 2) Episode plus ambiguous instances (Ambiguity), which samples additional negative query instances containing the same words as the triggers in the support set, to verify whether models overfit the triggers.
The performance of different models under these settings is shown in Figure 2. We can see that: 1) Generally speaking, all models achieve better performance on Episode, because correctly recognizing high-frequency triggers is enough for good performance in this setting. Consequently, performance under this setting cannot well represent how FSED is influenced by trigger overfitting. 2) The performance of all models drops on the Ambiguity setting, which suggests that trigger overfitting has a significant impact on FSED. 3) Our method still maintains good performance on Ambiguity, which indicates that it can alleviate the trigger curse problem by optimizing towards the underlying causality.

Case Study
We select ambiguous cases (in Table 2) to better illustrate the effectiveness of our method. For Query 1, FS-Base wrongly detects the word run as a trigger word. In Support set 1, run means nominate.

A Proof of Backdoor Adjustment
We prove the backdoor adjustment for our SCM using the rules of do-calculus (Pearl, 1995). For a causal graph G, let $G_{\overline{X}}$ denote the graph where all incoming edges to node X are removed, and let $G_{\underline{X}}$ denote the graph where all outgoing edges from node X are removed. $\perp\!\!\!\perp_G$ denotes d-separation in G.
D-separation (Pearl, 2014): two (sets of) nodes X and Y are d-separated by a set of nodes Z (i.e., $X \perp\!\!\!\perp_G Y \mid Z$) if all paths between (any node in) X and (any node in) Y are blocked by Z.
The rules of do-calculus are:

Rule 1 (insertion/deletion of observations): $P(y \mid do(x), z, w) = P(y \mid do(x), w)$ if $Y \perp\!\!\!\perp Z \mid X, W$ in $G_{\overline{X}}$.

Rule 2 (action/observation exchange): $P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)$ if $Y \perp\!\!\!\perp Z \mid X, W$ in $G_{\overline{X}\,\underline{Z}}$.

Rule 3 (insertion/deletion of actions): $P(y \mid do(x), do(z), w) = P(y \mid do(x), w)$ if $Y \perp\!\!\!\perp Z \mid X, W$ in $G_{\overline{X}\,\overline{Z(W)}}$,

where Z(W) denotes the set of nodes of Z that are not ancestors of any node of W in $G_{\overline{X}}$.

B Detailed Task Settings
One-way K-Shot Settings. We adopt a one-way K-shot setting in our experiments, in which the support set in an episode contains one event type (called the concerned event), while the query can contain any event type. The model aims to detect triggers of the concerned event in the query, and all event types are evaluated by traversing each event type. The support set in an episode can be formulated as follows:

$$\mathcal{S} = \{E, (S_1, Y_1), \dots, (S_K, Y_K)\}$$

where $\mathcal{S}$ is the support set, E is the concerned event, $S_i = \{s^i_1, s^i_2, \dots, s^i_{n_i}\}$ is the i-th sentence in the support set, $s^i_j$ is the j-th token in $S_i$, $Y_i = \{y^i_1, y^i_2, \dots, y^i_{n_i}\}$ is the set of labels of the tokens in $S_i$, and $y^i_j = 1$ only if $s^i_j$ is the trigger (or part of the trigger) of the concerned event; otherwise $y^i_j = 0$.
The query can be formulated as

$$\mathcal{Q} = \{Q_1, Q_2, \dots, Q_m\}$$

where $\mathcal{Q}$ is the set of queries, $Q_i = \{q^i_1, q^i_2, \dots, q^i_{m_i}\}$ is the i-th query sentence, and $q^i_j$ is the j-th token in $Q_i$. The model is expected to output the triggers of the concerned event in $\mathcal{Q}$:

$$O_Q = \big\{(T^i_1, \dots, T^i_{n_i})\big\}_{i=1}^{m}$$

where $O_Q$ is the set of triggers of the concerned event detected in $\mathcal{Q}$, $T^i_k$ is the k-th trigger of the concerned event in sentence $Q_i$, and $n_i \geq 0$ is the number of triggers of the concerned event in $Q_i$.
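For concreteness, a one-way 5-shot episode under this formulation might look like the following (the dictionary field names are hypothetical, chosen only for illustration; the support sentence is the paper's own example):

```python
# Illustrative one-way K-shot episode for the Attack event type: token-level
# 0/1 labels mark the concerned event's trigger tokens in the support set,
# while query sentences carry no labels at prediction time.
episode = {
    "event": "Attack",
    "support": [
        {"tokens": ["They", "killed", "by", "hostile", "fire", "in", "Iraqi"],
         "labels": [0, 0, 0, 0, 1, 0, 0]},  # "fire" triggers the Attack event
    ],
    "query": [
        {"tokens": ["The", "war", "began", "."]},
    ],
}
```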
Evaluation. We improve the traditional episode evaluation setting by evaluating on the full test set. For each event type in the test set, we randomly sample K instances as the support set, and all other instances are used as queries. Following previous event detection work (Chen et al., 2015), a predicted trigger is correct if its event type and offsets match those of a gold trigger. We evaluate all methods using macro-F1 and micro-F1 scores, and micro-F1 is taken as the primary measure.
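The matching criterion above can be sketched as a small scorer (an illustrative sketch, not the official evaluation script): a predicted trigger counts as correct only when both its offsets and its event type equal those of a gold trigger.

```python
# Minimal trigger F1 scorer. Triggers are (start, end, event_type) tuples;
# a prediction is a true positive only on an exact offset + type match.

def trigger_f1(pred, gold):
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)
    p = tp / len(pred_set) if pred_set else 0.0
    r = tp / len(gold_set) if gold_set else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```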

C Few-shot Event Detection Baselines
We use two metric-based methods in our experiments: the Prototypical Network (Snell et al., 2017) and the Relation Network (Sung et al., 2018), both of which contain an encoder component and a classifier component.
Encoder. We use BERT (Devlin et al., 2019) to encode the support set and the query. Given a sentence $X = \{x_1, x_2, \dots, x_n\}$, BERT encodes the sequence and outputs the representation of each token in X: $R = \{r_1, r_2, \dots, r_n\}$. After obtaining the feature representations of the support set, we calculate the prototype of each category (concerned event and other):

$$p_i = \frac{1}{|R_i|} \sum_{r \in R_i} r$$

where $p_i$ is the prototype of category i and $R_i$ is the set of feature representations of the tokens labeled with y = i in the support set.
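The prototype computation is a per-category mean of token representations; a minimal sketch (plain Python lists stand in for the BERT vectors):

```python
# Average the representations of support tokens belonging to one category.
# token_reps: list of token vectors; labels: parallel list of 0/1 labels.

def category_prototype(token_reps, labels, category):
    """p_i = mean of representations of support tokens labeled with `category`."""
    selected = [r for r, y in zip(token_reps, labels) if y == category]
    dim = len(selected[0])
    return [sum(r[i] for r in selected) / len(selected) for i in range(dim)]
```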
Classifier. The models classify each token in the query based on its similarity to the prototypes.
We first calculate the similarity between each prototype and each token in the query:

$$sim^i_{j,c} = g(p_c, q^i_j)$$

where g(x, y) measures the similarity between x and y, and $q^i_j$ is the representation of the j-th token in the i-th query sentence.
We then calculate the probability distribution over categories for token $q^i_j$:

$$P(y^i_j = c) = \frac{\exp(sim^i_{j,c})}{\sum_{c'} \exp(sim^i_{j,c'})}$$

During training, we use the cross-entropy loss on each token of the query, and the support set and the query are randomly sampled from the training set.
When evaluating, we treat the labels as an IO tagging scheme, and adjacent I tags are considered to belong to the same trigger, so that we can handle triggers with multiple tokens.
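The IO decoding described above can be sketched as follows (illustrative, not the authors' code): consecutive positive tags are merged into one trigger span, so multi-token triggers are recovered.

```python
# Decode per-token 0/1 (O/I) tags into trigger spans; adjacent I tags merge.

def decode_io(tags):
    """tags: per-token 0/1 labels; returns (start, end) spans, end exclusive."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t == 1 and start is None:
            start = i                      # span opens at first I tag
        elif t != 1 and start is not None:
            spans.append((start, i))       # span closes at first O tag
            start = None
    if start is not None:
        spans.append((start, len(tags)))   # span running to sentence end
    return spans
```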
Similarity Functions. For the prototypical network, the similarity in Equation 6 is the Euclidean distance. For the relation network, we calculate similarity using neural networks. Unlike the original paper, we find the following calculation to be more efficient:

$$g(x, y) = F(x \oplus y)$$

where ⊕ denotes vector concatenation and F is a two-layer feed-forward neural network with a ReLU activation on the first layer.
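A plain-Python sketch of this relation-network similarity (the weight matrices are hypothetical stand-ins for learned parameters; a real model would use a deep learning framework):

```python
# g(x, y) = F(x ⊕ y): concatenate the two vectors, apply a hidden layer with
# ReLU, then a linear output layer. W1/b1/W2/b2 stand in for learned weights.

def relation_score(x, y, W1, b1, W2, b2):
    concat = x + y  # vector concatenation (x ⊕ y)
    # first layer with ReLU activation
    h = [max(0.0, sum(w * v for w, v in zip(row, concat)) + b)
         for row, b in zip(W1, b1)]
    # second (linear) layer producing a scalar similarity score
    return sum(w * v for w, v in zip(W2, h)) + b2
```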

D Proof of Loss Function
We prove that $\mathcal{L}_{SG}(\theta)$ is equivalent to $\mathcal{L}(\theta)$, which indicates that minimizing $\mathcal{L}_{SG}(\theta)$ is equivalent to minimizing $\mathcal{L}(\theta)$. First, we define a function $\phi(s, q) \propto P(Y \mid s, q; \theta)$; we then need to prove that

$$g\Big(\sum_{t \in T}\sum_{s \in S} P(t \mid e)\, P(s \mid \mathcal{C}, t)\, r_s,\ q\Big) = f\Big(\sum_{t \in T}\sum_{s \in S} P(t \mid e)\, P(s \mid \mathcal{C}, t)\, \phi(s, q)\Big).$$
From Appendix A, we can obtain the backdoor-adjusted distribution. Here, we assume that the feature representations of the same event type in the support set are close to each other, so that $\big|\sum_s p_s r_s - \sum_s p_s q\big| \approx \sum_s p_s |r_s - q|$ (note that $\sum_s p_s = 1$, so the left-hand side equals $\big|\sum_s p_s r_s - q\big|$).
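The approximation underlying this step can be checked numerically (an illustrative sketch, not the authors' code): by convexity of the norm, $|\sum_s p_s r_s - q|$ is always at most $\sum_s p_s |r_s - q|$, and the two sides nearly coincide when the support representations cluster tightly, as assumed above.

```python
# Compare the two sides of the approximation used in the proof:
# LHS = |sum_s p_s r_s - q|  (distance from the weighted prototype to q)
# RHS = sum_s p_s |r_s - q|  (weighted average of per-instance distances)

def l2(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def weighted_prototype(reps, weights):
    dim = len(reps[0])
    return [sum(w * r[i] for r, w in zip(reps, weights)) for i in range(dim)]

def lhs_rhs(reps, weights, query):
    """Return (LHS, RHS); LHS <= RHS always, near-equal for tight clusters."""
    lhs = l2(weighted_prototype(reps, weights), query)
    rhs = sum(w * l2(r, query) for r, w in zip(reps, weights))
    return lhs, rhs
```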

E Implementation Details
All of our experiments are run on one Nvidia TITAN RTX. Our implementation is based on HuggingFace's Transformers (Wolf et al., 2019) and AllenNLP (Gardner et al., 2018). We tune the hyperparameters based on dev performance. We train each model 5 times with different random seeds, and when evaluating, we sample 4 different support sets.

Metric-based Methods
The hyperparameters are shown in Table 5. During training, the support set and the query are sampled from the training set; the query contains 2 positive instances and 10 negative instances (5 times the number of positive instances). During validation, the support set and the query are sampled from the dev set; the query contains 10 positive instances and 100 negative instances (10 times the number of positive instances). The results on the dev set are shown in Table 3. For FS-Causal, we found that whether backdoor adjustment is applied separately to the support set and the query has an impact, as shown in Table 4. Based on the best results on the dev set, we evaluate on the test set.
Finetuning-based Methods. The hyperparameters are shown in Table 6. For pretraining, we train a supervised event detection model using the training set. For finetuning, we use the support set to finetune the parameters of the event detection model and then detect events in the query.