Mask-then-Fill: A Flexible and Effective Data Augmentation Framework for Event Extraction

We present Mask-then-Fill, a flexible and effective data augmentation framework for event extraction. Our approach allows for more flexible manipulation of text and thus can generate more diverse data while keeping the original event structure unchanged as much as possible. Specifically, it first randomly masks out an adjunct sentence fragment and then infills a variable-length text span with a fine-tuned infilling model. The main advantage lies in that it can replace a fragment of arbitrary length in the text with another fragment of variable length, compared to the existing methods which can only replace a single word or a fixed-length fragment. On trigger and argument extraction tasks, the proposed framework is more effective than baseline methods and it demonstrates particularly strong results in the low-resource setting. Our further analysis shows that it achieves a good balance between diversity and distributional similarity.


Introduction
Event Extraction (EE), which aims to extract triggers with specific types and their arguments from unstructured texts, is an important yet challenging task in natural language processing. In recent years, deep learning methods have emerged as one of the most prominent approaches for this task (Nguyen and Nguyen, 2019; Lin et al., 2020; Du and Cardie, 2020; Paolini et al., 2021; Lu et al., 2021; Lou et al., 2022). However, they are notorious for requiring large amounts of labelled data, which limits the scalability of EE models. Annotating data for EE is usually costly and time-consuming, as it requires expert knowledge. One possible solution is to leverage data augmentation (DA) (Simard et al., 1998).
Existing DA methods for NLP can be broadly classified into two types: (1) the first augments training data by modifying existing examples (Sennrich et al., 2016; Şahin and Steedman, 2018; Dai and Adel, 2020; Wei and Zou, 2019), and (2) the second generates new data by estimating a generative process and sampling from it (Anaby-Tavor et al., 2020; Quteineh et al., 2020; Yang et al., 2020; Ye et al., 2022). Since the EE task requires DA methods to generate augmented samples and token-level labels jointly, the second type of DA method is inapplicable here. In this study, we mainly focus on the first type of method.

Figure 1: Visualization of three different data augmentation methods, (1) BackTranslation, (2) Synonym Replacement, and (3) BERT, applied to a sentence containing a "Transport" event. Spans marked with different colors are event triggers and arguments. The parts of the augmented sample that differ from the original are colored in gray. Backtranslation (Xie et al., 2020) translates the input sentence into another language and back to the original. Synonym Replacement (Dai and Adel, 2020) and BERT (Yang et al., 2019) replace words in the sentence. Examples shown include "The cop said Mike left this town yesterday.", "The police said Mike went away from this town yesterday.", and "The cop saw Mike left this town yesterday.".
Applying existing DA methods to the EE task is more challenging than to translation or classification tasks, because we need to augment the training data while keeping the event structure (trigger and arguments) unchanged. Figure 1 presents examples of three different DA methods applied to a sentence containing a "Transport" event. The event is triggered by the word "left" and has three arguments with different roles ("Mike", "this town", and "yesterday"). As shown in Figure 1, it is infeasible to apply sentence-level DA methods such as BackTranslation (Xie et al., 2020), because they may change the event structure (e.g., changing "left" to "went away from"). Previous attempts at DA for such tasks typically use heuristic rules such as synonym replacement (Dai and Adel, 2020; Cai et al., 2020) or context-based word substitution with BERT (Yang et al., 2019). Their idea is to replace adjunct tokens (the tokens in a sentence other than triggers and arguments) with other tokens, which keeps the event structure unchanged as much as possible. However, recent studies (Ding et al., 2020; Yang et al., 2020; Ye et al., 2022) find that such methods provide limited data diversity. In NLP, data diversity is mainly reflected in two aspects: expression diversity and semantic diversity (Zhao et al., 2019). The Synonym Replacement and BackTranslation methods lack semantic diversity, because they can only produce samples with similar semantics. The BERT-based method can only replace words and cannot change the syntax, so it cannot generate samples with a wide variety of expressions. The lack of sufficient diversity may lead to greater overfitting or poor performance due to training on examples that are not representative.
To this end, we present Mask-then-Fill, a flexible and effective data augmentation framework for event extraction. Our approach allows for more flexible manipulation of text and thus can generate more diverse data while keeping the original event structure unchanged as much as possible. Specifically, we first define two types of text fragments in a sentence: event-related fragments (triggers and arguments) and adjunct fragments (e.g., "The police said"). Then, we model DA for the EE task as a Mask-then-Fill process: we first randomly mask out an adjunct sentence fragment and then infill a variable-length text span with a fine-tuned infilling model (T5) (Raffel et al., 2020). The main advantage lies in that it can replace a fragment of arbitrary length in the text with another fragment of variable length, whereas existing methods can only replace a single word or a fixed-length fragment.
To the best of our knowledge, we are the first to augment training data for event extraction via text infilling. We empirically show that the Mask-then-Fill framework improves performance for both classification-based (EEQA) and generation-based (Text2Event) event extraction models on a well-known EE benchmark dataset (ACE2005). In particular, it demonstrates strong results in the low-resource setting. We further investigate the reasons for its effectiveness by introducing two metrics, Affinity and Diversity, and find that the data augmented by our approach have better diversity with smaller distribution shifts, achieving a good balance between diversity and distributional similarity.

Figure 2 example: masking the adjunct fragment in "The police said Mike left this town yesterday." yields the incomplete input "[MASK] Mike left this town yesterday.".

The Mask-then-Fill Framework
Figure 2 presents an overview of the Mask-then-Fill framework. The input sentence contains two types of text fragments: event-related fragments (words with colors) and adjunct fragments (underlined). Our idea is to rewrite a whole adjunct fragment instead of replacing individual words, and the rewritten sentence fragment should fit the context and should not introduce new events. To this end, we model DA for EE as a Mask-then-Fill process: we first randomly mask out an adjunct sentence fragment and then infill a variable-length text span with a fine-tuned infilling model. In the following, we describe the Mask-then-Fill framework in detail.
Mask out an adjunct fragment. Given a prototype sentence x = {x₁, …, x_L} of length L from the training set, we first define the adjunct fragments as the non-overlapping spans of x that do not contain the event triggers and arguments. We then replace one of the adjunct fragments with a [MASK] symbol. The incomplete sentence x̃ is thus a version of x with one fragment replaced by a [MASK] symbol.
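As an illustration, the masking step can be sketched as follows. This is a minimal sketch under our assumptions: events are given as token-offset spans, and each maximal run of tokens outside all event spans is treated as one adjunct fragment.

```python
import random

def mask_adjunct_fragment(tokens, event_spans, rng=random):
    """Replace one adjunct fragment (a maximal run of tokens outside all
    trigger/argument spans) with a single [MASK] symbol."""
    # Mark every token that belongs to a trigger or argument span.
    protected = set()
    for start, end in event_spans:  # spans are [start, end) token offsets
        protected.update(range(start, end))
    # Collect maximal runs of unprotected tokens: the adjunct fragments.
    fragments, run = [], []
    for i in range(len(tokens)):
        if i in protected:
            if run:
                fragments.append((run[0], run[-1] + 1))
                run = []
        else:
            run.append(i)
    if run:
        fragments.append((run[0], run[-1] + 1))
    if not fragments:
        return tokens  # nothing maskable
    # Randomly choose one adjunct fragment and replace it with [MASK].
    s, e = rng.choice(fragments)
    return tokens[:s] + ["[MASK]"] + tokens[e:]
```

For the Figure 2 sentence, masking the fragment "The police said" yields "[MASK] Mike left this town yesterday .".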
Blank Infilling Model. We formulate our blank infilling process as the task of predicting the missing span of text that is consistent with the preceding and subsequent text. Figure 2 gives an example with an incomplete input sentence x̃, where the [MASK] is a placeholder for a blank that has masked out multiple tokens. Our goal is to predict only the missing span y, which will replace the [MASK] token in x̃. The infilling task can therefore be cast as learning p(y | x̃).
To train our infilling model, we fine-tune a pre-trained sequence-to-sequence model, T5 (Raffel et al., 2020), on the Gigaword corpus (Graff et al., 2003), which comes from domains similar to the event extraction dataset ACE2005 adopted in our work. Given a corpus of plain sentences, we first produce a large pool of infilling examples and then train the T5 model on them. For a given complete sentence x from the training corpus, we generate an infilling example x̃ with the following procedure: (1) randomly sample a number of spans l from the range [1, min(10, L)], where L is the sentence length; (2) split the sentence into l spans; (3) randomly select one span to be replaced with a [MASK] symbol. The replaced span is used as the target y. We then fine-tune the T5 model on these infilling examples, yielding a model of the form p_θ(y | x̃).
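The example-construction procedure can be sketched as follows. This is a sketch under our reading of the three steps, with l taken as the number of spans and cut points chosen uniformly at random; the exact splitting strategy in the actual implementation may differ.

```python
import random

def make_infilling_example(tokens, rng=random, max_spans=10):
    """Turn a plain sentence into a (masked input, target span) pair
    for fine-tuning the infilling model."""
    n = len(tokens)
    # (1) Sample the number of spans l from [1, min(10, sentence length)].
    l = rng.randint(1, min(max_spans, n))
    # (2) Split the token sequence into l contiguous spans at random cut points.
    cuts = sorted(rng.sample(range(1, n), l - 1)) if l > 1 else []
    bounds = [0] + cuts + [n]
    spans = [(bounds[i], bounds[i + 1]) for i in range(l)]
    # (3) Replace one randomly chosen span with [MASK]; it becomes the target y.
    s, e = rng.choice(spans)
    masked = tokens[:s] + ["[MASK]"] + tokens[e:]
    target = tokens[s:e]
    return masked, target
```

Concatenating the prefix, the target span, and the suffix always reconstructs the original sentence, which makes the pairs directly usable as seq2seq training data.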
Fill in the blank. Once trained, the infilling model takes the incomplete sentence x̃, containing one missing span, and returns a predicted span y. We then replace the [MASK] token in x̃ with the predicted span y to generate an augmented example. Note that we can produce a large pool of augmented samples using top-k sampling.
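The fill step can be sketched as below. We abstract the fine-tuned T5 model behind a pluggable `infill_fn` (in practice this would wrap a top-k sampling call to the model); the stub span used in the example is purely illustrative.

```python
def fill_blank(masked_tokens, infill_fn, num_samples=1):
    """Replace the single [MASK] token with spans predicted by the
    infilling model, returning one augmented sentence per sample."""
    i = masked_tokens.index("[MASK]")
    augmented = []
    for _ in range(num_samples):
        span = infill_fn(masked_tokens)  # predicted variable-length span y
        augmented.append(masked_tokens[:i] + span + masked_tokens[i + 1:])
    return augmented
```

Because `infill_fn` may return a span of any length, each call can yield an augmented sentence whose length differs from the original, which is exactly the flexibility the framework targets.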

Experimental Setup
Dataset. We empirically evaluate our proposed data augmentation method for event extraction on the ACE2005 corpus¹, with the same train-dev-test split and preprocessing steps as previous works (Zhang et al., 2019; Wadden et al., 2019).
We simulate a low-resource setting by randomly sampling 1,000, 4,000 and 8,000 examples from the training set to create the small, medium, and large training sets (denoted as S, M, L in Table 1, whereas the complete training set is denoted as F).
We only augment the training data and keep the dev and test sets unchanged.

¹ https://catalog.ldc.upenn.edu/LDC2006T06
Evaluation Metrics. Following previous works on event extraction (Du and Cardie, 2020; Lu et al., 2021), we adopt the evaluation criteria defined in Li et al. (2013): (i) An event trigger is correctly identified and classified (Trig-ID+C) if its offsets match a gold trigger and its event type is also correct. (ii) An argument is correctly identified and classified (Arg-ID+C) if its offsets and event type match a gold argument and its event role is also correct.
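These matching criteria can be stated compactly in code. This is a minimal sketch of our reading of the criteria (exact-tuple matching plus micro F1), not the official scorer; the tuple layouts are assumptions.

```python
def trig_id_c(pred, gold_triggers):
    """Trig-ID+C: a predicted trigger (start, end, event_type) is correct
    iff its offsets match a gold trigger and its event type is also correct."""
    return pred in set(gold_triggers)

def arg_id_c(pred, gold_args):
    """Arg-ID+C: a predicted argument (start, end, event_type, role) is correct
    iff its offsets and event type match a gold argument and its role is correct."""
    return pred in set(gold_args)

def f1(preds, golds, correct_fn):
    """Micro F1 over a list of predictions against gold annotations."""
    tp = sum(correct_fn(p, golds) for p in preds)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```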
Event Extraction Models. In our study, we consider two representative models for event extraction:
• Text2Event (Lu et al., 2021) solves the event extraction task by casting it as a SEQ2SEQ generation task. All triggers, arguments, and their labels are generated as natural language words.
• EEQA (Du and Cardie, 2020) formulates the event extraction task as a question answering task. It develops two BERT-based QA models: one for event trigger detection and the other for argument extraction.
Comparison Methods. We compare our proposed data augmentation method, Ours (t5-small), with three baselines: (1) Synonym Replacement, which replaces adjunct tokens with one of their synonyms retrieved from WordNet (Miller, 1992); (2) BERT, which replaces adjunct tokens with context-based substitutions predicted by BERT (Yang et al., 2019); and (3) BackTranslation (Xie et al., 2020), which translates the input sentence into another language and back into the original.
Using our DA method gives the best results for the Text2Event model on 7 out of 8 datasets. For the EEQA model, our method achieves the best results on 6 out of 8 datasets, and the gap between our method and the best results on Trig-S and Arg-L is very small (0.06 and 0.97 points, respectively). In particular, our method demonstrates strong results in the low-resource setting: using our DA gives the Text2Event model performance improvements of 15.14% and 30.07% on Trig-S and Arg-S, respectively.
We also notice that as the amount of data increases, the improvement from all DA methods decreases, and in some cases (the EEQA model on Arg-L and Arg-F) there is even a slight decrease in performance. With more data, the model may overfit if the augmented data are merely similar samples rather than data with large variations. We measure Affinity by computing the difference between the loss of a model trained on the original training set and tested on the original examples, and the loss of the same model tested on the augmented examples. We use the Dist-1/2 metric (Celikyilmaz et al., 2020), commonly used in text generation, to assess the Diversity of the augmented data. We first construct a new test set by generating a new sample for each example in the test set. We then calculate the Affinity and Dist-1/2 scores between the new data set and the original data set, respectively. As shown in Table 2, it is clear that the data augmented by our DA method have better diversity with smaller distribution shifts, achieving a balance between diversity and distributional similarity.

Figure 3 examples, for the original sentence "Furthermore, the United States supported him in the war against Iran.": "In addition, the United States supported him in the war against Iran."; "Furthermore, the United States supported him in the war against Iran."; "Later, the united States supported him in the war against Iran."; "Furthermore, the United States supported iraq in the war against iraq."; "Moreover, the combine States supported him in the war against Iran."; "What is more, the unify DoS supported him in the war against Iran."; "He also called for an end to the war against Iran."; "The U.S. military has been fighting a war against Iran.".
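Dist-n measures the ratio of distinct n-grams to total n-grams over a corpus; higher values indicate more diverse text. A minimal sketch of the metric as we read it (the paper defers exact implementation details to its Appendix):

```python
def dist_n(sentences, n):
    """Dist-n: number of distinct n-grams divided by the total number of
    n-grams over a corpus of tokenized sentences."""
    total, distinct = 0, set()
    for tokens in sentences:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0
```

Dist-1 and Dist-2 are this function with n=1 and n=2, computed over the augmented data set.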

Case Study. Figure 3 presents examples generated by different DA methods. Given a sentence containing an "Attack" event triggered by the word "war", we generate two new samples for each DA method; the parts of a new sample that differ from the original are colored in gray. The synonym replacement based on WordNet cannot avoid introducing words that do not fit the context (e.g., "unify" and "DoS"), while the BERT-based word replacement considers the context better. However, both provide limited diversity. The BackTranslation method performs even worse in terms of data diversity: its generated data differ very little from the original sentence. Finally, compared with the original sentences, the new samples generated by our method are more fluent and more varied in expression and semantics. Our method therefore not only generates data that fit the context better, but also provides better diversity.

Conclusion
In this paper, we present Mask-then-Fill, a flexible and effective data augmentation framework for event extraction. Our approach allows for more flexible manipulation of text and thus can generate more diverse data while keeping the original event structure unchanged. The main advantage lies in that it can replace a fragment of arbitrary length in the text with another fragment of variable length. We empirically show that the Mask-then-Fill framework improves performance for both the EEQA and Text2Event EE models on the ACE2005 dataset, and it demonstrates particularly strong results in the low-resource setting. Our further analysis shows that it achieves a good balance between diversity and distributional similarity.

Limitations
This paper presents a flexible and effective data augmentation framework for event extraction tasks. Here, we note some of the Mask-then-Fill framework's limitations. First, performance gains can be marginal when data is sufficient. We believe this approach has much room for improvement in generating more diverse data. In this work, we select only one adjunct fragment at a time for modification; modifying multiple adjunct fragments in an event mention could further enhance the diversity of the generated data. Second, the current method can only replace one fragment at a time. This makes it easier to control the properties of the generated fragments, such as length or style. It is possible to modify multiple fragments at the same time using existing techniques (Donahue et al., 2020; Du et al., 2022; Chen et al., 2022). That approach is more efficient, but it is prone to generating incoherent augmented samples and thus introducing more noise. A possible way to address this problem is to design sample selection strategies.

Figure 2: Overview of the proposed Mask-then-Fill framework.

Figure 3: Augmented examples of four different DA methods. Given a sentence containing an "Attack" event triggered by the word "war", we generate two new samples for each DA method. The parts of a new sample that differ from the original are colored in gray.

Table 1: Results on trigger extraction and argument extraction using different subsets of the training data. The best results are marked in bold, and the second best is underlined.
Main Results. The main results are presented in Table 1, where we use two EE models (Text2Event and EEQA) to test the performance of different DA methods in both low-resource (S, M, and L) and normal (F) settings. As shown in the table, we observe that Ours (t5-small) achieves the best overall performance among all DA methods on both trigger extraction (F1) and argument extraction (F1).

Table 2: Results on Affinity and Diversity. The best results are marked in bold, and the second best is underlined.