Learning from a Friend: Improving Event Extraction via Self-Training with Feedback from Abstract Meaning Representation

Data scarcity has been the main factor hindering the progress of event extraction. To overcome this issue, we propose a Self-Training with Feedback (STF) framework that leverages large-scale unlabeled data and acquires feedback for each new event prediction from the unlabeled data by comparing it to the Abstract Meaning Representation (AMR) graph of the same sentence. Specifically, STF consists of (1) a base event extraction model trained on existing event annotations and then applied to large-scale unlabeled corpora to predict new event mentions as pseudo training samples, and (2) a novel scoring model that takes in each newly predicted event trigger, an argument, its argument role, as well as their path in the AMR graph to estimate a compatibility score indicating the correctness of the pseudo label. The compatibility scores further act as feedback to encourage or discourage the model's learning on the pseudo labels during self-training. Experimental results on three benchmark datasets, including ACE05-E, ACE05-E+, and ERE, demonstrate the effectiveness of the STF framework on event extraction, especially event argument extraction, with significant performance gains over the base event extraction models and strong baselines. Our experimental analysis further shows that STF is a generic framework, as it can be applied to improve most, if not all, event extraction models by leveraging large-scale unlabeled data, even when high-quality AMR graph annotations are not available.


Introduction
Event extraction (EE), which aims to identify and classify event triggers and arguments, has been a long-standing challenging problem in natural language processing. Despite the large performance leap brought by advances in deep learning, recent studies (Deng et al., 2021; Wang et al., 2021b) have shown that the scarcity of existing event annotations is the major issue that hinders the progress of EE. For example, in ACE-05, one of the most popular event extraction benchmark datasets, 10 of the 33 event types have fewer than 80 annotations. However, creating event annotations is extremely expensive and time-consuming; for example, it took several linguists over one year to annotate 500 documents with about 5,000 event mentions for ACE-05.
To overcome the data scarcity issue of EE, previous studies (Chen and Ji, 2009; Liao and Grishman, 2011a; Ferguson et al., 2018a) developed self-training methods that allow a trained EE model to learn further by regarding its own predictions on large-scale unlabeled corpora as pseudo labels. However, simply adding the high-confidence event predictions to the training set inevitably introduces noise (Liu et al., 2021; Arazo et al., 2020; Jiang et al., 2018), especially given that the current state-of-the-art performance of event argument extraction is still below a 60% F-score. To tackle this challenge, we introduce a Self-Training with Feedback framework, named STF, which consists of (1) an event extraction model that is first trained on the existing event annotations and then continually updated on the unlabeled corpus with self-training, and (2) a scoring model that evaluates the correctness of the new event predictions (pseudo labels) from the unlabeled corpus; the scores further act as feedback to encourage or discourage the learning of the event extraction model on the pseudo labels during self-training, inspired by the REINFORCE algorithm (Williams, 1992).
Specifically, the event extraction model of our STF framework can be based on any state-of-the-art architecture. In this paper, we choose OneIE (Lin et al., 2020) and AMR-IE (Zhang and Ji, 2021), due to their superior performance and publicly available source code. The scoring model leverages the Abstract Meaning Representation (AMR) (Banarescu et al., 2013), which has been proven to provide rich semantic and structural signals for mapping AMR structures to event predictions (Huang et al., 2016, 2018; Wang et al., 2021b); their compatibility can thus indicate the correctness of each event prediction. The scoring model is a self-attention network that takes in a predicted event trigger, a candidate argument and its argument role, as well as their path in the AMR graph of the whole sentence, and computes a score in the range [-1, 1] based on the compatibility between the AMR and the predicted event structure: -1 means incompatible, 1 means compatible, and 0 means uncertain. Inspired by the REINFORCE algorithm (Williams, 1992), we multiply the compatibility score with the gradient of the EE model computed on each pseudo event label during self-training, so as to (1) encourage the event extraction model to follow the gradient and hence maximize the likelihood of the pseudo label when it is compatible with the AMR structure; (2) negate the gradient and minimize the likelihood of the pseudo label when it is incompatible with the AMR structure; and (3) reduce the magnitude of the gradient when the scoring model is uncertain about the correctness of the pseudo label.
We take AMR 3.0 and part of the New York Times (NYT) 2004 corpus as additional unlabeled corpora to enhance the event extraction model with STF, and evaluate event extraction performance on three public benchmark datasets: ACE05-E, ACE05-E+, and ERE-EN. The experimental results demonstrate that: (1) vanilla self-training barely improves event extraction due to the noise introduced by the pseudo examples, while the proposed STF framework leverages the compatibility scores from the scoring model as feedback and thus makes more robust and efficient use of the pseudo labels; (2) STF is a generic framework that can be applied to improve most, if not all, event extraction models optimized by gradient descent, and it achieves significant improvement over the base event extraction models and strong baselines on event argument extraction across the three public benchmark datasets; (3) when exploiting different unlabeled corpora with gold or system-generated AMR parsing, STF always improves the base event extraction models, demonstrating that it can work with AMR parses of varying quality. Notably, different from previous studies (Huang et al., 2018; Zhang and Ji, 2021; Wang et al., 2021b) that require high-quality AMR graphs as input to the model during both training and inference, STF does not require any AMR graphs during inference, making it more computationally efficient and free from potential errors propagated from AMR parsing.

STF for Event Extraction
The event extraction task consists of three subtasks: event detection, argument identification, and argument role classification. Given an input sentence $W = [w_1, w_2, \ldots, w_N]$, event detection aims to identify the span of an event trigger $\tau_i$ in $W$ and assign it a label $l_{\tau_i} \in \mathcal{T}$, where $\mathcal{T}$ denotes the set of target event types. Argument identification aims to find the span of an argument $\varepsilon_j$ in $W$, and argument role classification further predicts the role $\alpha_{ij} \in \mathcal{A}$ that the argument $\varepsilon_j$ plays in the event $\tau_i$, given the set of target argument roles $\mathcal{A}$.
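To make the formulation concrete, the following is a minimal sketch of the data structures it implies; the class and field names are illustrative and not taken from the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Argument:
    span: Tuple[int, int]   # token offsets of the argument epsilon_j in W
    role: str               # argument role alpha_ij, drawn from the role set A

@dataclass
class EventMention:
    trigger_span: Tuple[int, int]  # token offsets of the trigger tau_i in W
    event_type: str                # label l_tau_i, drawn from the event type set T
    arguments: List[Argument] = field(default_factory=list)

# Example: "Police arrested the suspect" -> a Justice:Arrest-Jail event
mention = EventMention(
    trigger_span=(1, 2), event_type="Justice:Arrest-Jail",
    arguments=[Argument(span=(2, 4), role="Person")],
)
```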
Figure 1 shows the overview of our STF framework, which consists of two training stages. In the first stage, a base event extraction model (Section 2.1) is trained on a labeled dataset. In the second stage, we apply the trained event extraction model to an unlabeled corpus to predict new event mentions. Instead of directly taking the new event predictions as pseudo training examples as in vanilla self-training, we propose a novel scoring model (Section 2.2) to estimate the correctness of each event prediction by measuring its compatibility with the corresponding AMR graph, and then take both the event predictions and their compatibility scores to continue training the base event extraction model, with the scores modulating the gradients computed on the pseudo labels (Section 2.3). After the second training stage, we obtain a new event extraction model and evaluate it on the test set.

Base Event Extraction Model
Our proposed framework can be applied to most, if not all, event extraction models. We select OneIE (Lin et al., 2020) and AMR-IE (Zhang and Ji, 2021) as base models given their state-of-the-art performance on the event extraction task and publicly available source code. Next, we briefly describe the common architecture of the two models and refer readers to the original papers for more details. OneIE and AMR-IE perform event extraction in four steps. First, a language model encoder (Devlin et al., 2019; Liu et al., 2019) computes contextual representations $\mathbf{W}$ for an input sentence $W$. Second, two identification layers take in the contextual representations $\mathbf{W}$: one identifies the spans of event triggers and the other identifies the spans of arguments (i.e., entities). Both are based on a linear classification layer followed by a CRF layer (Lafferty et al., 2001) to capture the dependencies between predicted tags, and they are optimized by minimizing the negative log-likelihood of the gold-standard tag path, denoted as $L_{Tri\_I}$ and $L_{Arg\_I}$ for trigger and argument identification, respectively. Third, for each trigger or argument candidate, we compute its representation by averaging the token representations within the identified span. Each trigger representation is fed into a classification layer to predict its type by minimizing the cross-entropy classification loss $L_{Tri\_C}$. Each pair of trigger and argument representations is concatenated and fed into another classification layer to predict the argument role, which is also optimized with a cross-entropy loss $L_{Arg\_C}$. Finally, both OneIE and AMR-IE learn an additional global feature vector to capture the interactions across subtasks (e.g., a LOC entity cannot be the Attacker of an Attack event) and instances (e.g., the Defendant of a Sentence event can also be an Agent of a Die event). During training, a global feature score is computed for the predicted information graph and the gold annotation, respectively, from their global feature vectors, and the training objective is to minimize the gap between these two scores, denoted as $L_G$. Thus, the overall loss for the base event extraction model is $L_E = L_{Tri\_I} + L_{Arg\_I} + L_{Tri\_C} + L_{Arg\_C} + L_G$. In the first stage of our STF framework, we optimize the base event extraction model on the labeled event mentions $X^L$ based on $L_E$, and the trained model is later used to predict new event mentions for self-training.
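The sketch below illustrates the third and fourth steps at a high level, assuming token-level contextual representations are already available; the helper names are ours, and the plain summation of loss terms follows the description above rather than the exact weighting used in OneIE or AMR-IE.

```python
import torch
import torch.nn as nn

def span_representation(token_reprs: torch.Tensor, span: tuple) -> torch.Tensor:
    """Average the contextual token vectors inside an identified span (step 3)."""
    start, end = span
    return token_reprs[start:end].mean(dim=0)

class RoleClassifier(nn.Module):
    """Argument role classifier over a concatenated (trigger, argument) pair (step 3)."""
    def __init__(self, hidden_dim: int, n_roles: int):
        super().__init__()
        self.linear = nn.Linear(2 * hidden_dim, n_roles)

    def forward(self, trigger_repr: torch.Tensor, argument_repr: torch.Tensor) -> torch.Tensor:
        return self.linear(torch.cat([trigger_repr, argument_repr], dim=-1))

def total_loss(l_tri_i, l_arg_i, l_tri_c, l_arg_c, l_global):
    """Overall loss L_E: CRF negative log-likelihoods for identification,
    cross-entropy losses for classification, and the global-feature score gap,
    summed as described in the text."""
    return l_tri_i + l_arg_i + l_tri_c + l_arg_c + l_global
```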

Scoring Model
In the second stage of STF, we aim to further improve the event extraction model by taking the event mentions predicted from an external unlabeled corpus $X^u$ as pseudo samples for self-training. To mitigate the noise contained in the pseudo samples, we propose a scoring model that evaluates the correctness of each event prediction. Our scoring model takes the AMR graph as a reference, motivated by the observation that an event structure usually shares similar semantics and network topology with the AMR graph of the same sentence, and thus their compatibility can be used to measure the correctness of each event structure. This observation has also been discussed and shown effective in previous studies (Rao et al., 2017; Huang et al., 2018; Zhang and Ji, 2021). However, previous studies directly take AMR graphs as input to the extraction model and thus require AMR graphs during both training and inference, making their performance highly dependent on the quality of AMR parsing. In contrast, our proposed STF only uses AMR graphs as a reference during self-training to measure the correctness of event predictions, making it free from error propagation of AMR parsing during inference.
Given a sentence $W \in X^u$ from the unlabeled corpus and a predicted trigger $\hat{\tau}_i$ and its argument $\hat{\varepsilon}_j$ from $W$, we aim to estimate a correctness score for each pair of trigger and argument predictions based on their compatibility with their path in the AMR graph. Thus, we first apply the state-of-the-art AMR parsing tool (Astudillo et al., 2020) to generate an AMR graph $G$ for $W$, where each node denotes a concept and each edge $e_{ij}$ denotes the relation between two nodes. We follow Huang et al. (2016) and Zhang and Ji (2021) and group the original set of AMR relations into 19 categories, so $e_{ij}$ denotes a particular relation category and $\mathcal{R}$ denotes the set of AMR relation categories. We then identify the nodes $v_i$ and $v_j$ in $G$ corresponding to $\hat{\tau}_i$ and $\hat{\varepsilon}_j$ by node alignment, following Zhang and Ji (2021). Finally, we utilize Breadth-First Search to find the shortest path $p_{ij}$ that connects, and includes, $v_i$ and $v_j$ in $G$. If there is no path between $v_i$ and $v_j$, we add a new edge to connect them and assign other as its relation.
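A minimal sketch of this path-finding step, assuming the parsed AMR graph is available as a plain adjacency list (with edges listed in both directions) and node alignment has already been performed; the function names are illustrative.

```python
from collections import deque

def shortest_amr_path(adjacency, v_start, v_end):
    """Breadth-first search for the shortest path between two AMR nodes.

    adjacency: dict mapping a node id to a list of (neighbor_id, relation)
        pairs, where relation is one of the grouped AMR relation categories.
    Returns the path as an alternating list [node, relation, node, ...],
    or None if the two nodes are not connected.
    """
    queue = deque([[v_start]])
    visited = {v_start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == v_end:
            return path
        for neighbor, relation in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [relation, neighbor])
    return None

def path_or_other(adjacency, v_trigger, v_argument):
    """Fall back to a direct edge with the 'other' relation when no path exists."""
    path = shortest_amr_path(adjacency, v_trigger, v_argument)
    return path if path is not None else [v_trigger, "other", v_argument]
```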
Given a predicted trigger $\hat{\tau}_i$ with its type $\hat{l}_{\tau_i}$, and a predicted argument $\hat{\varepsilon}_j$ with its argument role $\hat{\alpha}_{ij}$, the scoring model estimates their correctness by taking $[\hat{l}_{\tau_i}, p_{ij}, \hat{\alpha}_{ij}]$ as input and outputting a compatibility score. As Figure 1 shows, the scoring model consists of a language model encoder (Devlin et al., 2019; Liu et al., 2019) that encodes the sentence $W$ and obtains contextual representations for the tokens, which are then used to initialize the representation of each node in $p_{ij}$ based on the alignment between the input tokens and the nodes in the AMR graph, following Zhang and Ji (2021). We draw edge representations from the AMR relation embedding matrix $E_{rel}$ and combine them with the node representations to form $H_{p_{ij}}$, a representation of the path $p_{ij}$. We also obtain an event type representation $h_{\tau_i}$ for $\hat{l}_{\tau_i}$ from the event-type embedding matrix $E_{tri}$ and an argument role representation $h_{\alpha_{ij}}$ for $\hat{\alpha}_{ij}$ from the argument role embedding matrix $E_{arg}$. Here, $E_{rel}$, $E_{tri}$, and $E_{arg}$ are all randomly initialized and optimized during training. Finally, we obtain the initial representation $H^{init}_{ij} = [h_{\tau_i}; H_{p_{ij}}; h_{\alpha_{ij}}]$ for the sequence $[\hat{l}_{\tau_i}, p_{ij}, \hat{\alpha}_{ij}]$. To estimate the compatibility between the event trigger and argument prediction and their path in the AMR graph, we apply multi-layer self-attention (Vaswani et al., 2017) over the joint representation $H^{init}_{ij}$ to learn better contextual representations for the sequence, adding position embeddings $E_{pos}$ to $H^{init}_{ij}$ before feeding it into the self-attention layers: $H^{final}_{ij} = \text{SelfAttn}^{(M)}(H^{init}_{ij} + E_{pos})$, where $M$ denotes the number of attention layers. Finally, we compute an overall vector representation $\hat{H}^{final}_{ij}$ from $H^{final}_{ij}$ via average pooling and feed it into a linear layer followed by a Sigmoid function to compute a probability $c_{ij}$ indicating the correctness of the predicted event trigger and argument. We optimize the scoring model with the binary cross-entropy objective $\mathcal{L}_{\psi} = -\left[y_{ij} \log c_{ij} + (1 - y_{ij}) \log (1 - c_{ij})\right]$, where $y_{ij} \in \{0, 1\}$ is a binary label indicating whether the argument role is correct ($y_{ij} = 1$) or not ($y_{ij} = 0$), and $\psi$ denotes the parameters of the scoring model. During training, we take gold triggers and arguments as positive training instances and create negative instances by swapping the argument roles in positive instances with randomly sampled incorrect labels. After training the scoring model, we fix its parameters and apply it during self-training.
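Below is a minimal PyTorch sketch of the scoring model's forward pass under the description above, assuming the path node representations have already been initialized from the sentence encoder; dimensions, layer settings, and names are illustrative rather than the authors' exact implementation, and node and relation vectors are simply concatenated instead of strictly interleaved.

```python
import torch
import torch.nn as nn

class ScoringModel(nn.Module):
    def __init__(self, hidden_dim, n_event_types, n_roles, n_relations,
                 max_len=32, n_layers=2, n_heads=8):
        super().__init__()
        self.event_type_emb = nn.Embedding(n_event_types, hidden_dim)   # E_tri
        self.role_emb = nn.Embedding(n_roles, hidden_dim)                # E_arg
        self.relation_emb = nn.Embedding(n_relations, hidden_dim)        # E_rel
        self.position_emb = nn.Embedding(max_len, hidden_dim)            # E_pos
        layer = nn.TransformerEncoderLayer(hidden_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)            # M self-attention layers
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, event_type_id, path_node_reprs, path_relation_ids, role_id):
        # Assemble the sequence [l_tau_i, p_ij, alpha_ij]: event type,
        # path node / relation vectors, argument role.
        h_type = self.event_type_emb(event_type_id).unsqueeze(1)         # (B, 1, H)
        h_rel = self.relation_emb(path_relation_ids)                     # (B, P-1, H)
        h_path = torch.cat([path_node_reprs, h_rel], dim=1)              # (B, 2P-1, H)
        h_role = self.role_emb(role_id).unsqueeze(1)                     # (B, 1, H)
        h_init = torch.cat([h_type, h_path, h_role], dim=1)              # (B, L, H)

        # Add position embeddings, then apply the self-attention layers.
        positions = torch.arange(h_init.size(1), device=h_init.device)
        h_final = self.encoder(h_init + self.position_emb(positions))    # (B, L, H)

        pooled = h_final.mean(dim=1)                                     # average pooling
        return torch.sigmoid(self.classifier(pooled)).squeeze(-1)        # c_ij in [0, 1]

# Training uses binary cross-entropy against the correctness label y_ij, e.g.:
#   loss = nn.BCELoss()(scores, labels.float())
```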

Self-Training with Feedback
To improve the base event extraction model with self-training, we take the new event predictions $(\hat{\tau}_i, \hat{l}_{\tau_i}, \hat{\varepsilon}_j, \hat{\alpha}_{ij})$ from the unlabeled corpus $X^u$ as pseudo samples to further train the event extraction model. The gradient of the event extraction model on each pseudo sample is computed as $g^{st}_{ij} = \nabla_{\theta} L_E(\hat{\tau}_i, \hat{l}_{\tau_i}, \hat{\varepsilon}_j, \hat{\alpha}_{ij})$, where $\theta$ denotes the parameters of the event extraction model. Note that there can be multiple event predictions in one sentence. Due to prediction errors in the pseudo labels, simply following the gradients $g^{st}_{ij}$ computed on the pseudo labels can hurt the model's performance. Thus, we utilize the correctness score $c_{ij}$ predicted by the scoring model to update the gradients, based on the following motivation: (1) if an event prediction is compatible with the AMR structure, it is likely to be correct and we should encourage the model to learn from the pseudo label; (2) conversely, if an event prediction is incompatible with its AMR structure, it is likely incorrect and we should discourage the model from learning from the pseudo label; (3) if the scoring model is uncertain about the correctness of the event prediction, we should reduce the magnitude of the gradients learned from the pseudo label. Motivated by this, we first design a transformation function $f_c$ to project the correctness score $c_{ij} \in [0, 1]$ into the range $[-1, 1]$, where $-1$ (i.e., $c_{ij} = 0$) indicates incompatible, $1$ (i.e., $c_{ij} = 1$) means compatible, and $0$ (i.e., $c_{ij} = 0.5$) means uncertain. Here, $f_c$ is a linear mapping: $f_c(c_{ij}) = 2c_{ij} - 1$. We then apply the compatibility scores as feedback to update the gradient of the event extraction model on each pseudo sample during self-training: $\hat{g}^{st}_{ij} = f_c(c_{ij}) \cdot g^{st}_{ij}$. To improve the efficiency of self-training, we update the event extraction model on every mini-batch, and to avoid the model diverging, we combine supervised training and self-training, so the overall loss for STF is $L = L_E + \beta \cdot L_{STF}$, where $\beta$ is the combining ratio, $L_E$ is computed on the labeled dataset $X^L$, and $L_{STF}$ is computed on the pseudo-labeled instances from $X^u$.

GradLRE, which builds on policy-gradient reinforcement learning (Sutton et al., 1999), showed improvements over other self-training methods on low-resource relation extraction, a task similar to argument role classification. Appendix C describes the training details for both the baselines and our approach.
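A minimal sketch of the feedback-weighted update described above, assuming the event extraction model exposes per-example losses on pseudo labels and the scorer returns a correctness probability per prediction (both hypothetical interfaces); scaling each pseudo-label loss by $f_c(c_{ij})$ scales its gradient by the same factor, which matches the gradient update above.

```python
import torch

def feedback_weight(c):
    """Map the correctness score c in [0, 1] to a feedback weight in [-1, 1].

    f_c(0) = -1 (incompatible), f_c(0.5) = 0 (uncertain), f_c(1) = 1 (compatible).
    """
    return 2.0 * c - 1.0

def stf_step(ee_model, scorer, labeled_batch, pseudo_batch, beta, optimizer):
    """One STF update combining supervised loss and feedback-weighted pseudo loss."""
    optimizer.zero_grad()

    # Supervised loss L_E on gold annotations.
    supervised_loss = ee_model.loss(labeled_batch)

    # Correctness scores c_ij from the (frozen) scoring model; detached so the
    # scorer receives no gradient here.
    with torch.no_grad():
        scores = scorer(pseudo_batch)                            # shape: (num_pseudo,)

    # One loss per pseudo-labeled trigger-argument prediction, weighted by f_c(c_ij).
    pseudo_losses = ee_model.per_example_loss(pseudo_batch)      # shape: (num_pseudo,)
    stf_loss = (feedback_weight(scores) * pseudo_losses).mean()

    # Overall objective: L = L_E + beta * L_STF.
    (supervised_loss + beta * stf_loss).backward()
    optimizer.step()
```

Note that a negative weight flips the sign of the pseudo-label gradient, so incompatible predictions are actively unlearned rather than merely down-weighted, while weights near zero effectively drop uncertain pseudo labels.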

Evaluation of Scoring Model
We first evaluate the performance of the scoring model by measuring how well it distinguishes correct from incorrect argument role predictions made by an event extraction model. Specifically, we obtain event predictions by running a fully trained event extraction model (i.e., OneIE or AMR-IE) on the validation and test sets of the three benchmark datasets. Based on the gold event annotations, we create a gold binary label (correct or incorrect) for each argument role prediction to indicate its correctness. For each event prediction, we pass it along with the corresponding AMR graph of the source sentence into the scoring model. If the correctness predicted by the scoring model (a prediction is deemed correct when the correctness score $c_{ij} > 0.5$ and incorrect otherwise) agrees with the gold binary label, we treat it as a true prediction for the scoring model; otherwise, a false prediction.
To examine the impact of leveraging AMR on the scoring model's performance, we develop a baseline scoring model that shares the same architecture as our proposed scoring model except that it does not take the AMR graph as input. Specifically, the baseline scoring model only takes the event mention (trigger, argument, and argument role label) to measure the compatibility score; it is essentially an ablation of our scoring model in which the AMR path is absent. As shown in Table 2, our scoring model outperforms the baseline scoring model by 1.4-1.7 F-score on the test sets, demonstrating the effectiveness of the AMR graph in characterizing the correctness of each event prediction.
In Table 1, we can observe that the semantics and structure of AMR paths can be easily mapped to argument role types. Sometimes the event triggers are far from their arguments in the plain text, but the AMR paths between them are short and informative. Another observation is that the scoring model tends to assign positive scores to argument roles that are more compatible with the AMR paths, although sometimes the scores for the gold argument roles are not the highest.

Evaluation of STF on Event Extraction
Table 3 shows the event extraction results of both our approach and strong baselines. For clarity, in the rest of this section, we refer to our proposed framework as STF_AMR and to our proposed framework with the baseline scoring model as STF_w/o_AMR. We can see that both STF_AMR and STF_w/o_AMR improve the performance of the event extraction models on argument role classification, while vanilla self-training and GradLRE barely work, demonstrating the effectiveness of leveraging feedback on the pseudo labels during self-training.
We further analyze why vanilla self-training and GradLRE do not work and notice the following. Due to data scarcity, the base event extraction model (i.e., OneIE) performs poorly on many argument roles (lower than 40% F-score). Thus, the event predictions on unlabeled corpora can be very noisy and inaccurate. The model then suffers from confirmation bias (Tarvainen and Valpola, 2017; Arazo et al., 2020; Pham et al., 2020): it accumulates errors and diverges when it is iteratively trained on such noisy pseudo-labeled examples during self-training. In addition, we notice that with self-training, the event extraction model becomes overconfident about its predictions: the averaged probability of all argument role predictions on the unlabeled dataset is 0.93. In such a case, the predicted probability cannot faithfully reflect the correctness of the predictions, which is referred to as calibration error (Guo et al., 2017; Niculescu-Mizil and Caruana, 2005). Thus, a self-training process that relies on overconfident predictions can become highly biased and diverge from the initial baseline model. In GradLRE, the quality of the reward is highly dependent on the averaged gradient direction computed during the supervised training process. However, due to the scarcity of the training data, the stored gradient direction can be unreliable. In addition, the gradient computed on pseudo-labeled data with high reward is used to update the averaged gradient direction, which can introduce noise into the reward function. As seen in Table 3, the best models from self-training and GradLRE are on par with or worse than the baseline, and both approaches show detrimental effects, with performance declining continuously as training proceeds. By considering the AMR structure, STF_AMR encourages the event extraction models to predict event structures that are more compatible with AMR graphs. This claim is supported by Table 4, which compares the compatibility scores of the model without STF (the OneIE baseline) and the one trained with STF (OneIE+STF_AMR) on the three benchmark datasets. The compatibility scores are measured by the AMR-based scoring models. We can clearly see that the compatibility scores of OneIE+STF_AMR are much higher than those of the base OneIE.
Lastly, we observe that OneIE+STF_AMR outperforms AMR-IE+STF_AMR, even though AMR-IE performs better than the OneIE baseline without STF. We argue the reason is that, even though STF_AMR does not need AMR parsing at inference time, AMR-IE does require AMR graphs at inference time, which causes it to suffer from potential errors in AMR parsing. In contrast, OneIE trained with STF_AMR does not require AMR graphs at inference time, making it free from such error propagation. Figure 2 shows more examples illustrating how the feedback from AMR structures in STF helps to improve event predictions.

Effect of Confidence Threshold
Intuitively, STF can leverage both certain (compatible or incompatible) and uncertain pseudo-labeled examples: when an example is uncertain, the probability $c$ predicted by the scoring model is close to 0.5, so $f_c(c)$ is close to 0, making the gradients computed on this pseudo-labeled example close to 0. To verify this claim, we conduct experiments with STF_AMR by using the probability $c$ predicted by the scoring model to determine certain and uncertain pseudo labels and analyzing their effect on STF_AMR. Note that we do not use the probability from the base event extraction model due to its calibration error (Guo et al., 2017). Specifically, we first select a threshold $s_{st} \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$. For each pseudo example, if the probability $c$ predicted by the scoring model is higher than $s_{st}$ (indicating a confident positive prediction) or lower than $1 - s_{st}$ (indicating a confident negative prediction), we add it for STF_AMR. The higher the threshold $s_{st}$, the more certain the pseudo labels selected for STF_AMR. As Figure 3 shows, STF_AMR can even benefit from the less-confident pseudo-labeled examples with a threshold $s_{st}$ around 0.6, demonstrating that it can make good use of most of the predicted events from the unlabeled corpus for self-training.
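A small sketch of the filtering used in this analysis, with illustrative names: a pseudo-labeled example is kept only when the scoring model is confident in either direction.

```python
def select_confident(pseudo_examples, scores, s_st=0.6):
    """Keep pseudo labels whose scorer probability is confidently high or low.

    scores[i] > s_st        -> confident positive (compatible) prediction
    scores[i] < 1.0 - s_st  -> confident negative (incompatible) prediction
    Everything in between is treated as uncertain and dropped for this analysis.
    """
    return [ex for ex, c in zip(pseudo_examples, scores)
            if c > s_st or c < 1.0 - s_st]
```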

Impact of AMR Parsing
AMR annotations are very expensive and hard to obtain. To show the potential of STF_AMR in scenarios where gold AMR parsing is not available, we conduct experiments using the NYT 2004 corpus as the external unlabeled corpus with system-generated AMR parsing for self-training. As shown in Table 5, with system-based AMR, STF still improves the performance of the base event extraction models on all three benchmark datasets and improves over the variant with the baseline scoring model that does not use AMR. The gap between STF with gold AMR and STF with system AMR is small, demonstrating that STF is robust to potential errors from AMR parsing.

Table 5: Performance comparison between using gold AMR, system-labeled AMR, and not using AMR.

Related Work
Most prior studies have focused on learning supervised models (Ji and Grishman, 2008; McClosky et al., 2011; Li et al., 2013; Chen et al., 2015; Feng et al., 2016; Nguyen et al., 2016; Wadden et al., 2019; Du and Cardie, 2020; Lin et al., 2020; Zhang and Ji, 2021; Wang et al., 2022; Nguyen et al., 2021) based on manually annotated event mentions. However, the performance of event extraction has barely improved in recent years, and one of the main reasons lies in the scarcity and imbalance of existing event annotations. Several self-training and semi-supervised studies have been proposed to automatically enrich the event annotations. Huang and Riloff (2012) use extraction patterns based on nouns that, by definition, play a specific role in an event, to automatically label more data. Li et al. (2014) propose various event inference mechanisms to reveal additional missing event mentions. Huang (2020) and Huang and Ji (2020) propose semi-supervised learning to automatically induce new event types and their corresponding event mentions while also improving performance on old types. Liao and Grishman (2010, 2011b) and Ferguson et al. (2018b) propose techniques to select a more relevant and informative corpus for self-training. None of these studies properly handles the noise introduced by the automatically labeled data. Compared with them, our STF framework leverages a scoring model to estimate the correctness of each pseudo-labeled example, which further guides the gradient-based learning of the event extraction model, allowing it to efficiently mitigate the impact of noisy pseudo-labeled examples.
Self-training has been studied for many years (Yarowsky, 1995; Riloff and Wiebe, 2003; Rosenberg et al., 2005) and widely adopted in many tasks including speech recognition (Kahn et al., 2020; Park et al., 2020), biomedical imaging (You et al., 2022a,b), parsing (McClosky et al., 2006; McClosky and Charniak, 2008), and pre-training (Du et al., 2021). Self-training suffers from inaccurate pseudo labels (Arazo et al., 2019, 2020; Hu et al., 2021a), especially when the teacher model is trained on insufficient and unbalanced datasets. To address this problem, Pham et al. (2020), Wang et al. (2021a), and Hu et al. (2021a) propose to utilize the performance of the student model on held-out labeled data as a meta-learning objective to update the teacher model or improve the pseudo-label generation process. Hu et al. (2021b) leverage the cosine distance between gradients computed on labeled data and pseudo-labeled data as feedback to guide the self-training process. Mehta et al. (2018) and Xu et al. (2021) leverage the spans of named entities as constraints to improve semi-supervised semantic role labeling and syntactic parsing, respectively.

Conclusion
We propose a Self-Training with Feedback (STF) framework to overcome the data scarcity issue of the event extraction task. The STF framework estimates the correctness of each pseudo event prediction based on its compatibility with the corresponding AMR structure, and takes the compatibility score as feedback to guide the learning of the event extraction model on each pseudo label during self-training. We conduct experiments on three public benchmark datasets, including ACE05-E, ACE05-E+, and ERE, and show that STF is effective and general, as it improves the base event extraction models with significant gains. We further demonstrate that STF can improve event extraction models using large-scale unlabeled corpora even without high-quality AMR annotations.

Limitations
Our method utilizes AMR annotations as additional training signals to alleviate the data scarcity problem in the event extraction task. In this problem setup, generally speaking, AMR annotations are more expensive to obtain than event extraction annotations. Nonetheless, in practice, the AMR dataset is much larger than any existing event extraction dataset, and AMR parsers usually achieve higher performance than event extraction models. Leveraging existing resources to improve event extraction without requiring additional annotation cost is thus a feasible and practical direction. Our work has demonstrated the effectiveness of leveraging feedback from AMR to improve event argument extraction. However, it is still under-explored what additional information and tasks can be leveraged as feedback to improve trigger detection.
We did not obtain quantitative results for the alignment between AMR and event graphs. The authors randomly sampled 50 event graphs from ACE05-E and found that 41 are aligned with their AMR graphs based on human judgment. In future work, more systematic studies should be conducted to evaluate this alignment.
There is a large gap between the validation and test sets in terms of label distribution on ACE05-E and ACE05-E+. We observe that performance improvement on the validation set sometimes leads to performance degradation on the test set. Both the validation and test sets are missing certain event trigger types and argument role types. The annotations in the training, validation, and test sets are scarce and highly unbalanced, which causes the low performance of trained models. We argue that a large-scale, more balanced benchmark dataset for event extraction would lead to more solid conclusions and facilitate research.

C Training Details
For all experiments, we use RoBERTa-large, which has 355M parameters, as the language model. We train all of our models on a single A100 GPU.
Base OneIE. We follow the same training process as Lin et al. (2020) to train the OneIE model. We use BertAdam as the optimizer and train the model for 80 epochs with a learning rate and weight decay of 1e-5 for the language encoder and 1e-3 for the other parameters. The batch size is set to 16. We keep all other hyperparameters the same as Lin et al. (2020). For each dataset, we train 3 OneIE models and report the averaged performance.
Base AMR-IE. We follow the same training process as Zhang and Ji (2021) to train the AMR-IE model. We use BertAdam as the optimizer and train the model for 80 epochs with a learning rate and weight decay of 1e-5 for the language encoder and 1e-3 for the other parameters. The batch size is set to 16. We keep all other hyperparameters exactly the same as Zhang and Ji (2021). For each dataset, we train 3 AMR-IE models and report the averaged performance.
Scoring Model. We use BertAdam as the optimizer and train the scoring model for 60 epochs with a learning rate and weight decay of 1e-5 for the language encoder and 1e-4 for the other parameters. The batch size is set to 10. The scoring model contains two self-attention layers. We train 3 scoring models and report the averaged performance.

Self-Training
For self-training, we use SGD as the optimizer and continue training the converged base OneIE model for 30 epochs with batch size 12, a learning rate of 1e-4 and weight decay of 1e-5 for the language encoder, and a learning rate of 1e-3 and weight decay of 5e-5 for all other parameters, except the CRF layers and global features, which are frozen. For self-training, we use 0.9 as the threshold to select confident predictions as pseudo-labeled instances. For all experiments, we train 3 models and report the averaged performance.

Gradient Imitation Reinforcement Learning
For GradLRE, we use BertAdam as the optimizer with batch size 16, a learning rate of 1e-5 and weight decay of 1e-5 for the language encoder, and a learning rate of 1e-3 and weight decay of 1e-3 for the other parameters, to first train the OneIE model for 60 epochs. The standard gradient direction vector is computed by averaging the gradient vectors at each optimization step. Then, following the same training process as the original paper, we perform 10 more epochs of Gradient Imitation Reinforcement Learning and set the threshold for high reward to 0.5. For all experiments, we train 3 models and report the averaged performance.

Self-Training with Feedback from Abstract Meaning Representation

For STF, we first train the OneIE model on the labeled dataset for 10 epochs and continue training it on the mixture of unlabeled and labeled data for 70 more epochs with batch size 10, a learning rate of 1e-4 and weight decay of 1e-5 for the language encoder, and a learning rate of 1e-3 and weight decay of 5e-5 for all other parameters. We leverage a linear scheduler to compute the value of the loss combining ratio β, computed as epoch/70. For all experiments, we train 3 models and report the averaged performance.
Figure 2: Qualitative results of STF. Examples are taken from the development and test splits of ACE05-E. The orange tokens denote event triggers and the blue tokens denote arguments. The AMR paths are between event triggers and arguments. The Base OneIE and STF fields show the predicted argument roles of the two methods, respectively. All the predictions from STF are correct. The compatibility scores are computed by the same scoring model. Note that OneIE and STF do not use the AMR graph at inference time; the AMR graph is shown only to provide intuition.

Table 4: The compatibility scores computed by the scoring models on the development and test sets of the three benchmark datasets.

              ACE05-E        ACE05-E+       ERE-EN
              Dev    Test    Dev    Test    Dev    Test
Base OneIE    70.1   68.4    76.9   61.9    76.4   69.2
+ STF_AMR     72.2   70.8    80.2   64.0    78.0   75.1

Figure 3: Performance change with different thresholds for selecting certain pseudo-labeled examples for self-training.

Table 1: Qualitative results of the compatibility scores. Example sentences include: "Tell that to the family of Margaret Hassan, the school teacher who was brutally tortured and then slaughtered by these same guys, they aren't so bad are they Chris Matthews?" and "It is irritating enough to get sued by Sam Sloan; imagine how irritating it would be to get BEATEN by him because you have done something so egregious that a court is forced to agree with him."


Table 3: Test F1 scores of event trigger classification (Tri-C) and argument role classification (Arg-C) on three benchmark datasets. * denotes methods we re-implemented to fit the event extraction task. Bold denotes the best performance in each local section and underline denotes the best global performance.

Table 6: The 19 groups of AMR relations used in our paper.

Table 7: The statistics of the three benchmarks used in our paper.

Compatibility-Score Based Model Selection

For model selection, we propose a new method called Compatibility-Score Based Model Selection. The data scarcity problem appears not only in the training data of ACE-05, ACE-05+, and ERE-EN but also in the development sets. For example, the ACE-05 development set contains only 603 labeled argument roles across 22 argument role classes, and 7 of these classes have fewer than 10 instances. To alleviate this problem, we propose to leverage part of the large-scale unlabeled dataset as a held-out development set. At the end of each epoch, instead of evaluating the event extraction model on the development set, we run the event extraction model on the unlabeled held-out development set to make event predictions and run the scoring model on those predictions to compute compatibility scores. We use the averaged compatibility score computed over all instances in the unlabeled held-out development set as the model selection criterion. We argue this is another application of the scoring model, since its goal is to evaluate the correctness of event predictions. The size of the unlabeled held-out development set is 2,000.

D Results of Base OneIE and +STF_AMR

We show the F1 scores of Base OneIE and +STF_AMR on the three benchmark datasets with variances denoted. Base OneIE and +STF_AMR have similar variances on all three datasets except ACE05-E+. We leave reducing the variance of argument role classification to future work.

Table 8: Test F1 scores of argument role classification (Arg-C) on three benchmark datasets.