Trigger is Not Sufficient: Exploiting Frame-aware Knowledge for Implicit Event Argument Extraction

Implicit Event Argument Extraction seeks to identify arguments that play direct or implicit roles in a given event. However, most prior works focus on capturing direct relations between arguments and the event trigger. This lack of reasoning ability brings many challenges to the extraction of implicit arguments. In this work, we present a Frame-aware Event Argument Extraction (FEAE) learning framework to tackle this issue through reasoning in the event frame-level scope. The proposed method leverages related arguments of the expected one as clues to guide the reasoning process. To bridge the gap between the oracle knowledge used in the training phase and the imperfect related arguments available at the test stage, we further introduce a curriculum knowledge distillation strategy to drive a final model that can operate without extra inputs by mimicking the behavior of a well-informed teacher model. Experimental results demonstrate that FEAE obtains new state-of-the-art performance on the RAMS dataset.


Introduction
In this work, we investigate the problem of Implicit Event Argument Extraction (IEAE) (Ebner et al., 2020), which seeks to identify arguments that play specific roles with respect to a given trigger. Unlike previous event argument extraction tasks that only process a single sentence, arguments in IEAE can span multiple sentences. As shown in Figure 1, given a conflict/attack/firearmattack event triggered by the word shooting, an IEAE system is required to extract four corresponding arguments with their roles in brackets: mass murder (target), firearms (instrument), Andrey Shpagonov (attacker), and Tatarstan (place).

Figure 1: Instance of implicit event argument extraction on RAMS. Solid lines link the event trigger, event type, arguments, and argument roles. The dashed line connects two implicitly related arguments that could be inferred from each other.
Mainstream methods for extracting event arguments focus on learning pair-wise information between arguments and the given trigger. Chen et al. (2015a), Nguyen et al. (2016a), Liu et al. (2018), and Sha et al. (2018) cast argument extraction as a relation classification problem to extract pairs of triggers and candidate arguments. Ebner et al. (2020) and Zhang et al. (2020b) use the event trigger as the predicate and leverage semantic role labeling models (Surdeanu et al., 2008; Hajic et al., 2009) to identify arguments. Former state-of-the-art approaches (Du and Cardie, 2020; Zhang et al., 2020a) formulate event argument extraction as a Machine Reading Comprehension (MRC) problem by asking trigger- and role-specific questions. Despite the success of these works in single-sentence event argument extraction, current methods struggle with IEAE due to the following critical issues:

1. Long-range Dependency: Since arguments can span multiple sentences, there exist long-range and cross-sentence dependencies between arguments and the given trigger, which are hard to capture with existing methods.
2. Implicit Arguments: Extracting implicit event arguments requires the ability to reason over event roles, and it is difficult for prior methods to learn these indirect relations.
We attribute these limitations to the fact that current works are mainly designed to capture direct relations between arguments and the given event trigger. This pair-wise learning paradigm lacks the ability to reason effectively. Instead of using only trigger information, we observe that in MRC-based event argument extraction methods, the related arguments, i.e., arguments (and their roles) in the same event other than the required one, can provide information to support reasoning. For example, as shown in Figure 1, if we already know that Andrey Shpagonov plays the attacker role in a firearmattack event, intuitively, firearms could be the instrument of the attacker. Implicit relations may exist between the two arguments, helping to identify firearms. In this manner, arguments corresponding to roles defined in the event frame-level scope can act as clues for reasoning and serve as relay nodes to capture long-range dependencies.
Nevertheless, the importance of related arguments is under-exploited. Liu et al. (2017) model event arguments as supervising attention information to promote trigger extraction. Other work proposes to learn the association of arguments, but their method works on golden-standard candidate spans, which are unavailable in real-world applications. Existing methods could also be extended to incorporate related arguments and their roles by taking such information as inputs. However, since the model is trained with golden-standard arguments, imperfect predicted arguments might introduce noise and hurt performance at the test stage.
In this work, we introduce a Frame-aware Event Argument Extraction (FEAE) learning framework for IEAE. We extend the MRC-based method to allow reasoning in the event frame-level scope by exploiting related arguments and their roles as clues to capture argument-argument dependencies. This method can learn to extract the implicit arguments of an event trigger and handle the long-range dependency problem. To bridge the gap between the unavailable oracle knowledge (Fang et al., 2021) and the imperfect test inputs, we introduce a teacher-student framework that drives a final model able to operate without extra inputs by mimicking the behavior of well-informed teachers. Inspired by curriculum theory (Bengio et al., 2009), we further introduce a curriculum distillation strategy that gradually increases the learning complexity of the student model to make it more compatible with the real situation, thus driving a better model. In summary, our contributions in this work are as follows:

1) We introduce a Frame-aware Event Argument Extraction framework to train models for implicit event argument extraction. Event frame-level knowledge is incorporated to reason over and capture long-range dependencies among triggers and arguments.
2) The proposed model learns to incorporate frame-level knowledge implicitly. Knowledge distillation and curriculum learning are utilized to drive a model that requires no extra tools to produce reasoning clues.
3) Our approach outperforms existing methods significantly and achieves new state-of-the-art performance on the RAMS dataset.

Related Work
Event Argument Extraction (EAE) seeks to extract entities that play specific roles in an event. Methods that learn direct relations between arguments and triggers have achieved significant progress in this field (Chen et al., 2015b; Nguyen et al., 2016b; Zhang et al., 2019; Liu et al., 2018). Recently, there has been a trend to formulate EAE as a Question Answering (QA) problem, and several MRC models report strong performance (Zhang et al., 2020a; Du and Cardie, 2020). These methods leverage role-specific questions to extract the boundaries of the expected arguments.

Implicit Event Argument Extraction (IEAE) is a less studied problem in which arguments can span multiple sentences and appear implicitly. There have been only a few works on IEAE. Ebner et al. (2020) and Zhang et al. (2020b) formulate IEAE as a semantic role labeling task and extract arguments by classifying phrase pairs. These methods only explicitly consider direct relations between triggers and arguments. Other work also considers the relations among arguments; however, it can only deal with the argument linking task, which identifies the role of a given gold argument span, a setting that is not available in realistic situations.

Knowledge Distillation guides a student model to imitate a well-trained teacher model. It was first proposed by Hinton et al. (2015) and has been widely used in the natural language processing (NLP) field (Ruder and Plank, 2018; Gong et al., 2018; Lee et al., 2018; Jiao et al., 2020). In this work, we employ the knowledge distillation training strategy to handle the train-test disparity caused by unavailable oracle knowledge at test time, driving a student model to learn the behavior of a well-informed teacher.

Curriculum Learning is a learning strategy first proposed by Bengio et al. (2009) that trains a neural network better by gradually increasing the complexity of the training data. It is broadly adopted in many NLP domains (Platanios et al., 2019; Huang and Du, 2019; Xu et al., 2020).
In this work, since data with rich related arguments is easier to learn than data without extra inputs, we promote the training of our student model by gradually increasing the learning complexity of the distillation process, i.e., by decreasing the proportion of given arguments.

Method
Our FEAE framework consists of two training steps that drive a model able to utilize frame-level knowledge for IEAE; details are shown in Figure 2. In the single-teacher situation, we first train an MRC-based teacher model M_T with oracle knowledge composed of golden-standard related arguments, so that it exploits frame-aware information and obtains the capacity to reason. Then a student model M_S, which has no access to this oracle information, is trained under the guidance of M_T and is the model used in practice. Our framework can also be extended to multi-teacher circumstances.
In the following subsections, we first give the formulation of our task and describe our MRC-based model. After that, we illustrate the curriculum knowledge distillation strategy that bridges the gap between the training and inference stages.

Task Formulation
We formulate IEAE as a QA problem and leverage an MRC-based model to extract answer spans. For each argument type, the provided information consists of a tuple ⟨q, c⟩, where q and c refer to the question and context, respectively. In practice, the question q should contain information about the trigger, the event type, and the role of the expected argument. We aim to extract a span s in the context that contains the answer to the question.
Formally, given the context C = {w_i}_{i=1}^n consisting of n words and a known event trigger with its corresponding event type, we seek to identify a set of argument tuples {(Y_j^s, Y_j^e, Role_j)}_{j=1}^m, where Y_j^s and Y_j^e are the start and end indices of the j-th argument, respectively, and Role_j is the role of this argument.

Frame-aware Question Generation
The key to MRC-based QA is generating questions that contain information about the text spans to be extracted. We leverage a template-based question generation strategy to acquire meaningful descriptions of the desired event argument. The question template used to extract arguments with the role ArgType contains slots for the trigger, the event type, the expected role, and the related arguments together with their role types in the same event. The slots marked with underlines contain oracle knowledge and are excluded during the test stage. By filling in this template, the MRC-based model can be explicitly aware of the frame-level information and thus make better predictions.
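The exact template wording is given in the paper's figure; a minimal sketch of the idea, with an illustrative (not the paper's) template and hypothetical function names, might look like:

```python
def build_question(trigger, event_type, arg_type, related_args=None):
    """Build a frame-aware question for an MRC-based extractor.

    `related_args` maps role names to known argument spans (oracle
    knowledge during training); the clue clause is omitted at test
    time. The template wording here is illustrative only.
    """
    q = (f"What is the {arg_type} in the {event_type} event "
         f"triggered by '{trigger}'?")
    if related_args:
        clues = ", ".join(f"{span} is the {role}"
                          for role, span in related_args.items())
        q += f" Given that {clues}."
    return q
```

Filling the oracle slots yields the teacher's question, while leaving `related_args` empty produces the event-aware question used by the student at test time.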

MRC-based Argument Extraction
We employ the pre-trained language model BERT (Devlin et al., 2019) as the backbone of our MRC-based argument extraction model. The text input is formulated as:

[CLS] question [SEP] context [SEP]

where [CLS] and [SEP] are special tokens defined in BERT, question refers to the query generated with our template, and context denotes the context words from which arguments are extracted.
This input sequence is then converted into an embedding matrix E and used as the input of the MRC model. We leverage BERT to build a semantic representation for each word in the context. After the encoding stage, we utilize the hidden states from the last BERT layer to represent each token:

H = BERT(E)

This encoding stage deeply fuses the question and the context through multi-head, multi-layer attention. To explicitly inform the model of the location of the trigger word, we further introduce positional embeddings that reflect the relative distance between each word and the trigger. The concatenation of the positional embeddings and the hidden states is then used to produce two probability vectors over the start and end positions:

p_start = softmax(W_start(H ⊕ E_p) / τ), p_end = softmax(W_end(H ⊕ E_p) / τ)

where E_p is the positional embedding matrix, W_start and W_end are learned projections, ⊕ is the concatenation operator, and τ is the softmax temperature. We use the cross-entropy between the predictions and the golden labels as our training criterion. The following two losses are used for training the start and end index predictions:

L_start = CE(p_start, Y_start), L_end = CE(p_end, Y_end)

where Y_start and Y_end are the ground-truth labels for the start and end indices of the desired span, respectively. When no answer exists in the context (a missing role of the event), we point both heads to the [CLS] token. The overall loss of the basic MRC model is formulated as:

L_MRC = L_start + L_end
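The span-prediction head above can be sketched in plain Python; the weight vectors and toy dimensions are hypothetical stand-ins for the learned projections over the concatenated [hidden ; position] features:

```python
import math

def softmax(scores, tau=1.0):
    # temperature-scaled softmax over a vector of scores
    exps = [math.exp(s / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def span_probs(hidden, pos_emb, w_start, w_end, tau=1.0):
    """Per-token start/end probabilities.

    hidden:  per-token hidden states from the encoder
    pos_emb: per-token trigger-distance position embeddings
    w_start, w_end: scoring weights over the concatenated features
    (learned in practice; fixed toy values here).
    """
    feats = [h + p for h, p in zip(hidden, pos_emb)]  # list concat = ⊕
    s_start = [sum(w * f for w, f in zip(w_start, x)) for x in feats]
    s_end = [sum(w * f for w, f in zip(w_end, x)) for x in feats]
    return softmax(s_start, tau), softmax(s_end, tau)

def cross_entropy(probs, gold_index):
    # negative log-likelihood of the gold start (or end) index
    return -math.log(probs[gold_index])
```

A larger temperature τ flattens the distributions, which matters later when the student matches the teacher's soft predictions.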

Teacher-student Framework
Although oracle knowledge about related arguments in the same event can provide clues to assist reasoning in the training stage, this golden-standard information is not available at the test stage in practice. This train-test disparity may lead to a performance drop when noisy, or even unrelated, arguments are used at test time.
To bridge this gap, we adopt the teacher-student framework to drive a model that is capable of reasoning without requiring extra clues. Specifically, as shown in Figure 2 (a), we first input the frame-aware question Q_full, which contains all categories of oracle knowledge, to obtain a well-trained teacher model M_T. M_T is then used to generate hidden states H_T and the span distributions p_start^T and p_end^T. Likewise, a student model M_S, which does not utilize oracle information, produces hidden states H_S and index distributions p_start^S and p_end^S. M_S distills knowledge from M_T by learning to exhibit similar behavior in both the hidden vectors and the prediction distributions:

L_KL = KL(p_start^T, p_start^S) + KL(p_end^T, p_end^S), L_MSE = MSE(H_T, H_S)

where KL and MSE are short for the KL-divergence loss and the mean squared error loss, respectively.
Both the teacher M_T and the student M_S share the same architecture but have separate parameters. The weights of M_T are fixed, and we only optimize the parameters of the student model during the knowledge distillation stage. The overall loss of M_S under our teacher-student framework is formulated as:

L_S = L_MRC + α L_KL + β L_MSE

where α and β are two weight coefficients. Note that the oracle knowledge in the question template, marked with underlines, is not available in a realistic test situation; in this work, we only use it to guide our teacher model to capture frame-aware information during training. As illustrated in Figure 2 (b), at the test stage of our student model M_S, we discard these extra inputs and fill the slots with event-aware context, which consists only of the event trigger, event type, and the expected argument type. Besides, since oracle knowledge is included in the input of the teacher model, during the distillation process we mask out the question part of the text input in both the teacher and student models and only distill knowledge over the context part.
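The combined student objective can be sketched as follows; the direction of the KL term is an assumption, since the text only states that KL and MSE terms are combined with weights α and β:

```python
import math

def kl_div(p_t, p_s, eps=1e-12):
    # KL divergence between teacher and student span distributions
    return sum(pt * math.log((pt + eps) / (ps + eps))
               for pt, ps in zip(p_t, p_s))

def mse(h_t, h_s):
    # mean squared error between teacher and student hidden states
    return sum((a - b) ** 2 for a, b in zip(h_t, h_s)) / len(h_t)

def student_loss(l_ce, p_t, p_s, h_t, h_s, alpha=0.5, beta=2e-3):
    """Overall student objective: cross-entropy on gold spans plus
    alpha-weighted KL and beta-weighted MSE distillation terms.
    Default weights follow the paper's reported hyperparameters."""
    return l_ce + alpha * kl_div(p_t, p_s) + beta * mse(h_t, h_s)
```

When the student exactly matches the teacher, both distillation terms vanish and only the cross-entropy term remains.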
This teacher-student framework can be further extended in a multi-teacher manner, which enables the student model to capture knowledge from multiple perspectives. By providing different combinations of related arguments, a teacher model can learn to focus on different patterns when reasoning. We train four teachers with diverse templates that capture different categories of oracle knowledge among roles, denoted ALL, ALL−1, ALL−2, and NONE, respectively; these templates use arguments in different proportions. Take the knowledge distillation training stage in Figure 2 (a) as an example: there is one expected argument to be extracted and three related arguments. ALL indicates that we fill the input template with all related arguments. ALL−1 denotes that we randomly enumerate the possibilities of two out of the three other arguments and leave one slot unfilled. Questions for ALL−2 and NONE are generated in the same way, with two or all slots left unfilled.
For the multi-teacher situation, we distill knowledge into the student model from the four teachers mentioned above simultaneously. The overall multi-teacher distillation loss is formulated as:

L_multi = Σ_{k=1}^{4} ω_k L_{T_k,S}

where ω_k and L_{T_k,S} are the weighting factor and the loss calculated with the k-th teacher model using Equation 6, respectively.
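The weighted sum over per-teacher losses is straightforward; the default weights below follow the paper's reported configuration for ALL, ALL−1, ALL−2, and NONE:

```python
def multi_teacher_loss(teacher_losses, weights=(0.35, 0.25, 0.25, 0.15)):
    """Weighted sum of the per-teacher distillation losses.

    `teacher_losses` holds the four L_{T_k,S} values; `weights` are
    the omega_k factors, which sum to 1 in the paper's setup."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * l for w, l in zip(weights, teacher_losses))
```

Because the weights sum to one, the multi-teacher loss stays on the same scale as a single-teacher loss, so no re-tuning of α and β is needed.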

Curriculum Distillation
In this subsection, we view the disparity between the training and test stages from the perspective of learning complexity and introduce our curriculum distillation strategy. Clues in the form of related arguments and their roles are explicitly given to the teacher model to promote reasoning. For the student model (and hence the inference stage), there are no golden-standard clues, making it challenging for the model to extract the expected argument by relying on associated ones. Intuitively, the training process of the student model is harder than that of the teacher.
Inspired by the curriculum theory that a machine learning model can be trained better by feeding data in easier-to-harder order, we introduce a curriculum distillation strategy to promote the learning of the student model. We use the proportion of given arguments to measure the complexity of the learning task and of the data points in IEAE. As shown in Figure 2 (c), at the beginning of the distillation stage, we use questions containing oracle knowledge with all related arguments to train the student as a warm-up procedure. We then gradually reduce the proportion of given arguments and finally transition to using no extra arguments, as in a realistic situation. Note that all teacher models are trained with oracle knowledge throughout the whole process.
Details of the curriculum distillation strategy are shown in Algorithm 1. I_ALL and I are two sets of training instances whose questions are built with all golden-standard arguments and with no extra knowledge, respectively. {M_{T_k}}_{k=1}^4 are four well-informed teacher models trained with diverse templates that capture different categories of oracle knowledge, and M_S is the student model. At each training step, we first sample a batch of instances following a Bernoulli distribution in which the probability of selecting an example from I_ALL is a%. Second, we cache the hidden states and the start and end distributions of the four teachers with I_ALL as input.
Finally, we use all cached states from the teacher models to simultaneously distill knowledge into the student network. As training progresses, the value of a gradually decreases from 100 to 0, moving the learning difficulty of the batches from easier to harder. Note that we evaluate the performance of M_S on data without extra arguments in the questions. We apply an early stopping strategy to avoid over-fitting when the F1 score on the development set no longer improves after several iterations.
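The curriculum sampling step can be sketched as follows; the per-epoch schedule uses the values reported in the experiment setup, while the function and variable names are our own:

```python
import random

# per-epoch oracle proportion a (%) from the paper's configuration
SCHEDULE = [100, 70, 40, 30, 20, 10, 0]

def sample_batch(instances_oracle, instances_plain, batch_size, epoch,
                 rng=random):
    """Sample a batch where each instance comes from the oracle set
    (questions filled with all golden related arguments) with
    probability a%, and from the plain set otherwise, so batches
    become harder as training proceeds."""
    a = SCHEDULE[min(epoch, len(SCHEDULE) - 1)] / 100.0
    batch = []
    for _ in range(batch_size):
        pool = instances_oracle if rng.random() < a else instances_plain
        batch.append(rng.choice(pool))
    return batch
```

At epoch 0 every instance carries full oracle clues (warm-up), and by the final epoch the student sees only realistic, clue-free questions.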

Experiment Setup
Dataset. We conduct experiments on the RAMS 1 dataset, which is annotated with 139 event types and 65 corresponding argument roles. Each instance consists of a 5-sentence context around the typed event trigger, in which several typed arguments are to be extracted. The RAMS dataset consists of 7329, 924, and 871 instances in the training, development, and test sets, respectively.

Evaluation and Hyperparameters. An argument is considered correctly identified when the predicted offsets match the golden-standard span. If both the span and the role of an extracted argument match the golden-standard one, the argument is correctly classified. Precision (P), Recall (R), and F measure (F1) are adopted as evaluation metrics. Besides, gold event type information is used in the type-constrained decoding (TCD) setting.
1 https://nlp.jhu.edu/rams/

In our experiments, we adopt BERT-base, which has 12 layers, 768 hidden units, and 12 attention heads per layer, as our MRC model. The batch size is set to 4 and the maximum sequence length to 512. We set the dimension of the trigger position embedding to 76 and the number of epochs to 7. We train the models with an Adam weight decay optimizer with an initial learning rate of 3e-5 and a learning-rate warm-up portion of 10%. The temperature τ is set to 1, and we set α to 0.5 and β to 2e-3 to balance the cross-entropy, KL-divergence, and MSE losses. The proportionality factor a over successive epochs is set to 100, 70, 40, 30, 20, 10, and 0, and the weighting factors {ω_k}_{k=1}^4 for ALL, ALL−1, ALL−2, and NONE are configured as 0.35, 0.25, 0.25, and 0.15, respectively.
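The identification and classification criteria above amount to a micro-averaged exact-match scorer; the tuple layout below is an assumption about how spans and roles are represented:

```python
def span_f1(pred, gold):
    """Micro P/R/F1 for argument classification: a prediction counts
    only if both the (start, end) offsets and the role match a gold
    tuple. `pred` and `gold` are collections of (start, end, role)."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Dropping the role from the tuples before scoring gives the identification variant of the metric.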

Overall Performance
Baselines. Since IEAE is a newly proposed task, there are only a few existing works; to demonstrate the effectiveness of our method, we also adapt several strong methods from the EAE task and report the performance of these baselines and their variants. (1) Ebner's (Ebner et al., 2020) formulates IEAE as a semantic role labeling task. (2) Zhang's (Zhang et al., 2020b) is a two-step head-based model that first predicts the head word of an argument and then expands it to the full span. (3) Student is our base model, which extracts arguments with the MRC framework based on Du and Cardie (2020). (4) Student-SUP is the variant where argument information is explicitly modeled with a supervising attention mechanism based on Liu et al. (2017). (5) Student-GCN is the variant that builds an explicit graph over dependency parses extracted with the Stanford CoreNLP toolkit (http://stanfordnlp.github.io/CoreNLP/) and adopts a multi-hop graph convolutional network for reasoning based on Liu et al. (2018). (6) Student-MKD is a multi-teacher knowledge distillation framework where four student models trained with different random seeds serve as teachers and distill into another student model. (7) Student-DA is the variant that uses questions with different proportions of oracle knowledge as a data augmentation strategy. (8) Student-BAG is the variant that ensembles 5 well-trained student models through a bagging paradigm. (9) Teacher is the variant with the same architecture as the student, trained and tested with oracle knowledge. (10) Teacher-R has the same setting as Teacher but is tested with raw text. (11) Teacher-MT is the variant where answering histories from previous turns are fused into the current question in a multi-turn manner.

From the experimental results shown in Table 1, we can conclude that: (1) MRC-based methods exceed those that directly learn pair-wise relations between event triggers and candidate arguments, providing strong baselines for IEAE. We attribute these improvements to the fact that MRC models can capture relations among arguments implicitly during the encoding stage through the QA framework. These methods also benefit from the prior knowledge contained in the task descriptions.
(2) With the same architecture, Student-SUP, Student-GCN, Student-DA, and FEAE surpass Student, and Teacher, which utilizes oracle knowledge in both the training and test stages, performs best. These results indicate the effectiveness of related arguments and verify our intuition that reasoning in the event frame-level scope contributes to IEAE. (3) The result gaps among Teacher, Teacher-R, and Teacher-MT clearly show that the train-test disparity affects the inference procedure. Compared with Teacher-MT, our FEAE obtains a gain of 6.80 points in F1, indicating the effectiveness of our teacher-student learning strategy. An explanation is that in Teacher-MT, incorrect answers from previous turns may introduce noise and seriously affect the results of subsequent answers, whereas FEAE is trained with golden-standard related arguments and thus alleviates such error accumulation. (4) Student-SUP, which does not require extra NLP tools to build an explicit graph, outperforms Student-GCN, and our method further obtains an improvement of 2.07 absolute points in the argument classification task. These results demonstrate that implicit reasoning is a powerful way to capture the interrelations between arguments; moreover, building explicit reasoning graphs inevitably introduces noise. (5) The improvements of Student-MKD, Student-DA, and Student-BAG are marginal, illustrating that the improvement of our method mainly comes from the knowledge distillation architecture rather than from introducing additional factors. (6) The proposed FEAE outperforms strong baselines and achieves new state-of-the-art results for both argument identification and argument classification. Without using extra inputs, our approach achieves results similar to the one with oracle knowledge. The performance gain clearly indicates that FEAE captures frame-aware information effectively.

Ablation Study.
To investigate the effect of each component, we conduct an ablation study by removing the multi-teacher setting (-multi), curriculum learning (-cl), and the knowledge distillation framework (-kd). When eliminating the multi-teacher setting (-multi), we train the model with oracle knowledge containing all related arguments. Results are shown in Table 2. We observe that: (1) Knowledge distillation brings as much as 1.69 absolute points in F1 for argument classification. By mimicking the behavior of a well-informed teacher, our method can effectively obtain the ability to reason in the event frame-level scope, thus achieving better performance. (2) The curriculum strategy promotes the training of our student model by gradually filling the gap between train and test inputs.
(3) Introducing multiple teachers provides more accurate guidance from different views and enhances the knowledge distillation framework.

Impact of Frame-aware Knowledge. To better understand the impact of frame-aware knowledge, we show results with different teacher settings in Table 3, where we adopt a single-teacher curriculum knowledge distillation strategy. The main difference between these variants is the percentage of oracle knowledge used to train the teachers, as described in Section 3.4. We find that as the percentage of ground-truth related arguments (i.e., the completeness in the event frame-level scope) increases, the student achieves better performance, verifying our assumption that frame-aware knowledge provides essential information for IEAE. FEAE achieves the best results, showing the importance of capturing multi-view guidance.

Performance on Argument Linking. We present the performance of FEAE and the baselines on the argument linking task in Table 4, where ground-truth argument spans are provided and the models are required to identify the role of each span. For our MRC variants, we add the expected argument to the question and apply binary classification on the vector of the [CLS] token to decide whether the argument plays the given role in the event. We find that FEAE yields an 8.3-point improvement in F1 over Ebner's -TCD and also surpasses the other baselines. These results indicate that frame-aware knowledge also contributes to improving argument linking.

Performance breakdown by distance. To test our method's ability to capture long-range dependencies, we list the performance breakdown over different sentence distances between arguments and the given trigger in Table 5. Similar to Zhang et al. (2020b), we observe that all models suffer a performance drop for non-local arguments (where d = ±2 or d = ±1).
Compared with Student, FEAE achieves a gain of more than 4 times when summing the results for d = ±2, and the F1 score even increases by 6 times when d = −2.
To explore the reasons, we sort all argument roles in the d = ±2 cases by their number of occurrences and find that the top five categories are place, recipient, instrument, participant, and attacker, which together cover more than 56% of the total. Intuitively, there are strong semantic associations between these roles and the other roles defined in the frame scope. Since FEAE enables the model to reason with frame-level knowledge, it is natural that our method mitigates the performance degradation in long-range dependency situations.

BERT Attention Analysis
To better understand how FEAE improves the MRC model, we analyze the averaged attention weights between argument role pairs; results are shown in Table 7. It should be noted that averaged attention weights are not numerically comparable across different role pairs. Within a particular pair, however, FEAE tends to have a larger value than the student model, indicating that FEAE learns to reason by paying more attention to the relevant arguments. For example, in the first instance, when looking for place, arguments with the role of damager destroyer intuitively provide clues.

Case Study
In this section, we further illustrate how FEAE alleviates the long-range dependency and implicit argument problems. As shown in Table 6, we give representative examples where the student model misses the correct answers while FEAE is able to find them. For the long-range dependency scenario in E1, it is difficult to identify the argument with the role victim because there are too many words between the argument Armenian and the trigger Genocide. However, there is a strong implicit semantic relationship between killer and victim. FEAE captures such oracle knowledge better than the student model, and thus successfully finds and classifies Armenian as victim. For the implicit argument situation in E2, since there is no direct association between the argument Russian farms and the trigger word immigrating, the student model fails to identify Russian farms. Frame-aware knowledge, however, provides the prior that there is an implicit connection between the argument roles transporter and passenger. Consequently, FEAE successfully recalls the argument Russian farms.

Conclusion and Future Work
In this paper, we exploit frame-aware knowledge for extracting implicit event arguments. Specifically, we introduce a curriculum knowledge distillation strategy, FEAE, to train an MRC model that focuses on frame-aware information to identify implicit arguments. The proposed method leverages a teacher-student framework to avoid the requirement of extra clues and can perform reasoning with guidance in the event frame-level scope. Experiments show that our method surpasses strong state-of-the-art baselines on RAMS and can effectively alleviate the long-range dependency and implicit argument problems.