Machine Reading Comprehension as Data Augmentation: A Case Study on Implicit Event Argument Extraction

Implicit event argument extraction (EAE) is a crucial document-level information extraction task that aims to identify event arguments beyond the sentence level. Despite many efforts on this task, the lack of sufficient training data has long impeded progress. In this paper, we take a new perspective on the data sparsity issue faced by implicit EAE, bridging the task with machine reading comprehension (MRC). Specifically, we devise two data augmentation regimes via MRC: 1) implicit knowledge transfer, which enables knowledge transfer from other tasks by building a unified training framework in the MRC formulation, and 2) explicit data augmentation, which explicitly generates new training examples by treating MRC models as an annotator. Extensive experiments justify the effectiveness of our approach: it not only obtains state-of-the-art performance on two benchmarks, but also demonstrates superior results in low-data scenarios.


Introduction
Textual event descriptions may span multiple sentences. Implicit event argument extraction (EAE) (Ebner et al., 2020; Zhang et al., 2020), a crucial task for event information extraction, aims to identify event arguments beyond the sentence level. For example, in a document describing an AirstrikeMissileStrike event (shown in Figure 1), implicit EAE requires a model to recognize the global event argument "Syria", which fulfills the semantic role of PLACE. Note that the argument is one sentence away from the event trigger bombarding.
One key challenge faced by implicit EAE is data sparsity: owing to the complex interdependencies between triggers and arguments, it is expensive to label training data for the task. The existing datasets, which typically contain several dozen documents, are too small to train a model that captures the regularities underlying how event arguments appear in texts (Li et al., 2021). For example, even the state-of-the-art model, trained on the full corpus of RAMS (Ebner et al., 2020), attains only 5% in F1 when an event argument is two sentences away from the trigger (Zhang et al., 2020).
This paper provides a new perspective on the data sparsity issue faced by implicit EAE. Motivated by previous work handling information extraction via machine reading comprehension (MRC) (Levy et al., 2017; Li et al., 2020; Du and Cardie, 2020b), we note that implicit EAE may be particularly akin to MRC, as both are document-level tasks. For example, we may use the prompt question Where does the bombarding event take place? to extract the event argument "Syria", as shown in Figure 1 (Bottom). This formulation suggests new ways to address implicit EAE by leveraging resources from the MRC domain to boost learning.
We devise two complementary data augmentation methods based on MRC for implicit EAE. The first is implicit knowledge transfer, which builds a unified training framework, in the MRC formulation, to facilitate training multiple tasks together. It has two main advantages. First, by framing implicit EAE as MRC, we can directly use sophisticated MRC models to handle the task, which have been shown to excel at capturing document-level clues (Devlin et al., 2019). Second, under this framework, we can leverage datasets from other tasks to boost learning. For example, we show it is possible to transfer knowledge from a wide range of tasks, including SQuAD question answering, FrameNet semantic role labeling, and ACE sentence-level event extraction.
Our second method performs data augmentation in a more explicit way, treating a pre-trained MRC model as an annotator that labels new training instances. For example, we may use the question Who is the attacker in the bombarding event? to query external (unlabeled) documents and regard documents with answers as new training examples annotated with the ATTACKER role. Compared with implicit knowledge transfer, explicit data augmentation generates tangible training examples, which benefits a wide range of previous models for the task (e.g., those based on sequence labeling (Shi and Lin, 2019)). Moreover, we show that explicit data augmentation outperforms implicit knowledge transfer in a zero-shot transfer scenario (§ 6.2).
Extensive experiments on two datasets, RAMS (Ebner et al., 2020) and WikiEvents (Li et al., 2021), justify the effectiveness of our approach. In particular, our method achieves a substantial improvement over previous methods (+3% in F1 on average). It also demonstrates promising results in capturing long-range dependencies. Moreover, equipped with the two data augmentation strategies, our approach fits well with low-data scenarios. For example, on RAMS, with 1% of the training data, our approach obtains over 30% in F1, while previous methods achieve F1 scores below 10%.
Our contributions are summarized as follows:
• We present a new view on the data sparsity issue faced by implicit EAE by bridging it with MRC. Besides being the first work to introduce the MRC formulation to implicit EAE, our work may encourage more studies investigating data augmentation via MRC.
• We propose two novel data augmentation regimes for implicit EAE via MRC: implicit knowledge transfer and explicit data augmentation. Their application scopes are carefully explored with extensive evaluations.
• We set up state-of-the-art performance on two implicit EAE benchmarks. We have released our code at https://github.com/jianliu-ml/DocMRC to facilitate further exploration.
Related Work

Implicit Event Argument Extraction
Implicit EAE has long been studied under the MUC-4 paradigm (MUC, 1992), with a core subtask of extracting all role fillers of an event template (Grishman and Sundheim, 1996; Huang and Riloff, 2011; Du and Cardie, 2020a). This line of work is further extended by studies on implicit semantic role labeling (Ruppenhofer et al., 2009; Moor et al., 2013). Despite many advances, the datasets provided by the above evaluations are relatively small, which has long impeded the study of the task. Recently, Ebner et al. (2020) proposed a new benchmark, annotating over 3,000 documents for implicit EAE, which has inspired many studies. Following that work, Zhang et al. (2020) devise a head-to-region approach, demonstrating very promising results; Gangal and Hovy (2020) investigate to what extent pre-trained language models can benefit learning. Very recently, Li et al. (2021) investigate a generative perspective on the task, achieving state-of-the-art performance. Nevertheless, the currently available datasets are still too small to train a learning-based model to decent performance. In this work, we propose a new perspective to address the data sparsity challenge, by bridging the task with machine reading comprehension (Hermann et al., 2015).

MRC for Information Extraction
Recently, there has been a surge of work addressing information extraction tasks via machine reading comprehension. To name a few: Levy et al. (2017) cast relation extraction as question answering; Li et al. (2020) address named entity recognition via MRC; Du and Cardie (2020b) formulate sentence-level event extraction as MRC. However, most of these works focus on problem reformulation and rarely consider the data issue. By comparison, our work fills this gap by proposing a new perspective that leverages MRC for data augmentation; it is also the first work extending MRC to implicit EAE. Additionally, we show our approach can also boost learning for the sentence-level event extraction task (§ 6.5).

Data Augmentation for EAE
Due to the fine-grained annotation of events, data augmentation for event argument extraction is generally challenging. Existing methods are based on distant supervision (Chen et al., 2017; Yang et al., 2018), leveraging external knowledge bases to generate new training data. However, such works rely on a great deal of expertise specific to a domain or language. The work of Yang et al. (2019) combines entity substitution with pre-trained language models, achieving improved performance, but the newly generated examples can be somewhat rigid since the entities remain the same. Different from previous works, we study the possibility of leveraging MRC for data augmentation. Our method, on the one hand, does not rely on complex domain knowledge, and on the other hand, can generate more diversified training data.

Implicit EAE in MRC Formulations
We formulate implicit EAE as follows: Assume a document D contains a set of events E, each represented by an event trigger (e.g., bombarding in the previous example). The type of an event e ∈ E determines a set of roles its arguments may take, denoted by R_e. For each semantic role r ∈ R_e, implicit EAE requires a model to find an event argument a, which is a textual span in D, resulting in an (r, a) pair. Different from previous methods addressing the task via sequence labeling (Shi and Lin, 2019) or span ranking (Ebner et al., 2020), we propose a new perspective based on MRC.
Query Generation. To frame implicit EAE as MRC, we convert each semantic role r into a question q_r. We devise a template-based method operating in three steps: 1) Role Categorization, in which we categorize r as a person-based, general, or place-based role to select the proper interrogative word (e.g., Who, What, or Where). 2) Trigger Format Conversion, in which we convert verb-based triggers into their noun forms using WordNet (Miller, 1992). 3) Query Realization, in which we use the templates in Table 1 to realize the final question. Consider the previous example in Figure 1. Our method yields the two questions Who is the attacker in the bombarding event? and Where does the bombarding event take place? for the roles ATTACKER and PLACE, respectively.
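The three-step query generation above can be sketched as follows. The role categories, the interrogative-word map, and the noun-form lookup are illustrative stand-ins for the paper's actual templates and WordNet conversion, not its exact resources.

```python
# A minimal sketch of the template-based query generation described above.
ROLE_CATEGORY = {           # step 1: role categorization (assumed mapping)
    "attacker": "person",
    "place": "place",
    "instrument": "general",
}
WH_WORD = {"person": "Who", "place": "Where", "general": "What"}

NOUN_FORM = {"bombard": "bombarding"}  # step 2: stand-in for the WordNet lookup

def generate_query(role: str, trigger: str) -> str:
    """Step 3: realize the final question from a template."""
    category = ROLE_CATEGORY.get(role, "general")
    noun_trigger = NOUN_FORM.get(trigger, trigger)
    if category == "place":
        return f"Where does the {noun_trigger} event take place?"
    return f"{WH_WORD[category]} is the {role} in the {noun_trigger} event?"

print(generate_query("attacker", "bombard"))
# -> Who is the attacker in the bombarding event?
```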

Trigger-Aware Representation Learning. Given the document D and a question q_r, we build a BERT encoder (Devlin et al., 2019) to learn their joint representation. Specifically, we first construct an extended sequence S = [CLS] q_r [SEP] D [SEP] to concatenate q_r and D. Then, considering that D may contain multiple events, we devise trigger-aware embeddings to indicate which event is currently in focus, modifying BERT's segmentation embeddings to mark the location of the event trigger (the trigger's segmentation embeddings are set to 1 instead of 0 as in conventional BERT). Finally, we use the BERT encoder to encode S and take the output of its last hidden layer as the joint representation, denoted by H_{D,q_r} ∈ R^{N×d}, where N is the length of the extended sequence and d is the hidden dimension of BERT.
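The input construction with trigger-aware segment ids can be sketched as follows. Whitespace-split word tokens stand in for a real BERT tokenizer, so the example is illustrative rather than an exact reproduction of the model's preprocessing.

```python
def build_input(question_tokens, doc_tokens, trigger_span):
    """Build S = [CLS] q [SEP] D [SEP] plus trigger-aware segment ids.
    trigger_span is (start, end), inclusive indices of the trigger in doc_tokens."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    offset = 1 + len(question_tokens) + 1  # index where doc_tokens begin in `tokens`
    for i in range(trigger_span[0], trigger_span[1] + 1):
        segment_ids[offset + i] = 1        # mark the trigger, unlike vanilla BERT
    return tokens, segment_ids

tokens, seg = build_input(
    ["where", "does", "the", "bombarding", "event", "take", "place", "?"],
    ["rebels", "bombarding", "the", "city", "in", "syria"],
    trigger_span=(1, 1),
)
print(tokens[seg.index(1)])  # -> bombarding
```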

Argument Extraction as Answer Generation. Based on H_{D,q_r}, we compute two normalized vectors containing the probabilities of the start and end positions of an event argument a over S:

p_start = softmax(H_{D,q_r} w_start),    p_end = softmax(H_{D,q_r} w_end),

where w_start ∈ R^d and w_end ∈ R^d are parameters to be learned. The predicted locations of a correspond to the positions with the largest values in p_start and p_end. For the case a = ∅, i.e., no event argument corresponds to r, we assume the start/end position of a is 0; namely, the leading token [CLS] in S is treated as a no-answer indicator.
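The answer-span head can be sketched in pure Python as follows: two learned weight vectors score every position of S as a start or an end, a softmax turns the scores into distributions, and position 0 ([CLS]) serves as the no-answer indicator. The toy representation H and weight vectors are stand-ins for the BERT outputs and learned parameters.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def span_probs(H, w_start, w_end):
    """H: list of N token vectors (the joint representation of [CLS] q [SEP] D [SEP])."""
    dot = lambda h, w: sum(hi * wi for hi, wi in zip(h, w))
    p_start = softmax([dot(h, w_start) for h in H])
    p_end = softmax([dot(h, w_end) for h in H])
    return p_start, p_end

# Toy joint representation: 5 positions, hidden size 3.
H = [[0.1, 0.2, 0.0], [0.9, 0.1, 0.3], [0.2, 0.8, 0.5], [0.0, 0.4, 0.1], [0.3, 0.3, 0.3]]
p_start, p_end = span_probs(H, w_start=[1.0, 0.0, 0.0], w_end=[0.0, 1.0, 0.0])
start, end = p_start.index(max(p_start)), p_end.index(max(p_end))
# start == 0 or end == 0 would mean the [CLS] no-answer indicator was chosen.
print(start, end)  # -> 1 2
```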
Based on the above formulation, we next detail our two MRC-based data augmentation regimes.

[Figure 2: A unified training framework in the MRC formulation.]

Data Augmentation via MRC
Based on the above bridging of implicit EAE and MRC, we devise two data augmentation regimes: implicit knowledge transfer (§ 4.1) and explicit data augmentation (§ 4.2).

Implicit Knowledge Transfer
As shown in Figure 2 (Left), implicit knowledge transfer builds a unified training framework, which facilitates transferring knowledge from other tasks into implicit EAE. We adopt a pre-training-then-fine-tuning learning paradigm.
Cross-Task Pre-Training. After setting up an MRC model, we first pre-train it using the training data of other tasks. In addition to an MRC dataset, i.e., SQuAD 2.0 (Rajpurkar et al., 2018), we also use corpora from FrameNet semantic role labeling (SRL) (Atkins et al., 2003) and ACE event extraction (EE) for pre-training, framing these tasks as MRC problems in a similar way. The following pre-training objective is adopted:

L_pre = Σ_T Σ_{(D̂, q̂, â) ∈ T} log P(â | D̂, q̂),

where T ranges over each task and (D̂, q̂, â) ranges over each training example in the MRC formulation. P(â | D̂, q̂) denotes the likelihood of predicting â given D̂ and q̂, which equals p_start[â_start] + p_end[â_end], where â_start and â_end denote the gold start and end positions of â in D̂.
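The cross-task objective can be sketched as a negative log-likelihood accumulated over every task's examples in the MRC formulation. The per-example probability follows the additive definition in the text; the example data are illustrative stand-ins for real model outputs.

```python
import math

def pretrain_loss(examples_by_task):
    """Cross-task objective: negative log-likelihood of gold answer spans,
    summed over every task's examples. Each example is
    (p_start, p_end, a_start, a_end); following the definition in the text,
    P(a | D, q) = p_start[a_start] + p_end[a_end]."""
    loss = 0.0
    for task, examples in examples_by_task.items():  # e.g. SQuAD, FrameNet SRL, ACE EE
        for p_start, p_end, a_start, a_end in examples:
            loss -= math.log(p_start[a_start] + p_end[a_end])
    return loss
```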
In-Domain Fine-Tuning. After the pre-training stage converges, we fine-tune the model using in-domain data, with the following training objective:

L_ft = Σ_D Σ_{e ∈ D} Σ_{(r, a)} log P(a | D, e, r),

where D ranges over each document, e ranges over each event instance in D, and (r, a) indicates a role-argument pair. In this way, the knowledge learned from other tasks is implicitly transferred into the implicit EAE task, which largely benefits learning in low-data scenarios (§ 6.1).

Explicit Data Augmentation
One drawback of implicit knowledge transfer is that it cannot generate explicit training data, and therefore it only supports learning in an MRC formulation. We thus propose another data augmentation strategy, which generates explicit examples and can benefit models in any formulation of implicit EAE.
Automatic Data Annotation. As shown in Figure 2 (Right), the core idea of explicit data augmentation is to use the pre-trained MRC model as an annotator, labeling new instances from unlabeled documents. Given a source document D′, the following steps are conducted: 1) Identify all event triggers in D′, using an event detector pre-trained on the in-domain data. 2) For each event trigger e′, enumerate each semantic role r′ determined by the event type, and convert r′ into a question q_{r′}. 3) Use the pre-trained MRC model to predict an answer a′ using q_{r′} as a prompt. 4) If a′ ≠ ∅, construct a new training example (D′, e′, r′, a′). To enhance annotation quality, we only consider answers whose likelihoods are above a threshold λ. Please refer to § 5.1 for implementation details and the statistics of the generated training examples.
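The annotation procedure can be sketched as the following loop. The `detect_triggers`, `roles_of`, `make_query`, and `mrc_answer` callables stand in for the pre-trained components described above; here they are stubbed with toy lambdas so that the control flow (steps 1-4 plus the λ filter) is runnable end to end.

```python
from collections import namedtuple

Trigger = namedtuple("Trigger", "word event_type")
LAMBDA = 0.8  # confidence threshold on the answer likelihood

def annotate(document, detect_triggers, roles_of, make_query, mrc_answer):
    """Steps 1-4 of the annotation procedure, with the quality filter."""
    new_examples = []
    for trigger in detect_triggers(document):                 # 1) find triggers
        for role in roles_of(trigger.event_type):             # 2) enumerate roles
            query = make_query(role, trigger)                 #    ... as questions
            answer, likelihood = mrc_answer(document, query)  # 3) query MRC model
            if answer is not None and likelihood >= LAMBDA:   # 4) keep confident hits
                new_examples.append((document, trigger, role, answer))
    return new_examples

examples = annotate(
    "This was indeed an IS attack ...",
    detect_triggers=lambda d: [Trigger("attack", "Attack")],
    roles_of=lambda event_type: ["attacker", "place"],
    make_query=lambda role, tr: f"Who is the {role} in the {tr.word} event?",
    mrc_answer=lambda d, q: ("IS", 0.9) if "attacker" in q else (None, 0.0),
)
print(examples[0][2:])  # -> ('attacker', 'IS')
```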
Joint Training Strategy. The following objective combines the original training data with the automatically generated data for training:

L = Σ log P(a | D, e, r) + δ Σ log P(a′ | D′, e′, r′),

where the first sum ranges over the original examples, the second over the generated examples, and δ is a weight balancing their contributions. The overall process of explicit data augmentation can be seen as "eliciting" knowledge from a pre-trained MRC model; and because the training set is explicitly expanded, it has the potential to benefit any model (e.g., those based on sequence labeling (Shi and Lin, 2019) or span prediction (Ebner et al., 2020)) proposed for implicit EAE.
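The joint objective can be sketched as a weighted sum of negative log-likelihoods over the two data sources, with δ down-weighting the automatically generated examples. The probability lists are illustrative stand-ins for model predictions.

```python
import math

def joint_loss(gold_probs, generated_probs, delta=0.5):
    """Negative log-likelihood over the original examples plus the
    MRC-generated ones, the latter weighted by delta."""
    loss = -sum(math.log(p) for p in gold_probs)
    loss -= delta * sum(math.log(p) for p in generated_probs)
    return loss
```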

Experimental Setup
Datasets. We conduct our experiments on two implicit EAE benchmarks, RAMS (Ebner et al., 2020) and WikiEvents (Li et al., 2021). RAMS provides 3,993 paragraphs in total, annotated with 139 event types and 65 semantic roles; WikiEvents provides 246 documents, annotated with 50 event types and 59 semantic roles. Table 2 gives the detailed data statistics. For evaluation, we use Precision (P), Recall (R), and F1 score (F1) as metrics. Our experimental results are based on the Exact Match (EM) criterion: only when the predicted argument span exactly matches a gold one do we count it as a correct prediction.
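The Exact Match criterion can be sketched as follows: a predicted argument span counts as correct only if the same (event, role, span) tuple appears in the gold set. The tuple representation is an illustrative assumption, not the benchmarks' exact scoring script.

```python
def f1_exact_match(predicted, gold):
    """predicted/gold: sets of (event, role, span_start, span_end) tuples."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {("e1", "place", 10, 10)}
pred = {("e1", "place", 10, 10), ("e1", "attacker", 3, 4)}
print(round(f1_exact_match(pred, gold), 3))  # P=0.5, R=1.0 -> 0.667
```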
Implementations. In our MRC model, we use a BERT-base-uncased encoder (Devlin et al., 2019) to keep consistent with previous studies (Ebner et al., 2020; Li et al., 2021).
Baseline Models. The following state-of-the-art methods are treated as baselines for comparison:
• BERT-CRF (Shi and Lin, 2019), which combines BERT with Conditional Random Fields (Lafferty et al., 2001), achieving state-of-the-art performance on the sentence-level SRL task.
• SpanSel (Ebner et al., 2020), a method based on span ranking (Lee et al., 2017), which enumerates each possible span in a document to identify the most likely event arguments.
• Head-Expand (Zhang et al., 2020), which extends SpanSel, by first identifying an argument's head, and then its region. It achieves state-of-the-art performance on RAMS.
• BART-Gen (Li et al., 2021), a concurrent work to ours, which adopts a generative perspective on implicit EAE, based on the BART architecture (Lewis et al., 2020).
Our approach is denoted by DocMRC. Following previous work, we adopt two experimental settings on RAMS, where "w/ Type Constraint" and "w/o Type Constraint" indicate whether gold event types are considered. On WikiEvents, the setting that takes co-reference into consideration is denoted WikiEvents (CR). We denote our approach with only in-domain training and with implicit knowledge transfer by "w/ In-Domain" and "w/ Impl. DA", respectively.

Experimental Results
The experimental results justify the effectiveness of our approach. In particular, our approach with implicit knowledge transfer attains the best F1 on the two datasets under different settings, outperforming previous methods by over 3% on average. Moreover, we note that the model with only in-domain training already achieves state-of-the-art performance, suggesting the effectiveness of the problem re-formulation itself. Implicit knowledge transfer further boosts learning, particularly in Recall (+1.5% on average), implying that the knowledge transferred from other tasks enhances the model's generalization. Additionally, we note a large performance drop of BART-Gen in the "w/o Type Constraint" setting, where gold event types are not given, suggesting it depends heavily on correctly predicting the event types. By contrast, our approach does not rely as much on gold event types to extract event arguments.
Table 4 gives the performance of different models on cases with different trigger-argument distances. The results suggest that our approach excels at capturing long-range dependencies. For example, on RAMS, when the event argument is two sentences before the trigger (d = -2), our full approach achieves 21.0% in F1, outperforming previous methods by 3.3%. Nevertheless, there is still much room for improvement.

Impact of Implicit Knowledge Transfer
To better understand the impact of implicit knowledge transfer, we compare the performance of different models in a simulated low-data scenario, where we vary the ratio of in-domain training examples used for fine-tuning. This scenario also covers a zero-shot transfer case with no in-domain training data at all. Figure 3 gives the results.
The results clearly demonstrate the advantage of our implicit knowledge transfer method. For example, with only 1% of in-domain data on the RAMS corpus, our method augmented with implicit knowledge transfer achieves about 30% in F1, while our method with only in-domain training and the other methods achieve less than 10%. Moreover, we note implicit knowledge transfer can even support the zero-shot scenario: it achieves 5.8% and 6.9% in F1 on the two datasets without using any in-domain training data.
Table 5 shows the impact of using different tasks for pre-training. The results show that each task boosts learning, and their impacts are complementary. FrameNet and SQuAD lead to larger improvements than ACE, perhaps because they are large, diversified, and wide-coverage datasets.

Impact of Explicit Data Augmentation
Explicit data augmentation, compared with implicit knowledge transfer, has the advantage of generating tangible training examples. We study its performance regarding 1) zero-shot transfer evaluation and 2) boosting previous models for learning.
Table 6 gives the results of the zero-shot evaluation, where we only use the automatically generated data to train a model. From the results, explicit data augmentation yields better performance than implicit knowledge transfer in the zero-shot scenario. A plausible explanation for its effectiveness is that, by using an MRC model as an annotator, we distill specific knowledge that fits the event ontology to boost learning. Moreover, we note the data generated by explicit data augmentation can also help other models, e.g., BERT-CRF and Head-Expand, address the zero-shot scenario.
Table 7 gives the results of joint training on RAMS. From the results, explicit data augmentation improves the performance of different approaches by 1.0% in F1 on average, demonstrating its effectiveness. Nevertheless, we find explicit data augmentation underperforms implicit knowledge transfer in joint training (44.4% vs. 45.7% in F1). This implies that implicit knowledge transfer may be preferable to the explicit data generation strategy when relatively abundant in-domain training data is available.

Model | F1 | w/ EDA (F1) | ∆F1
BERT-CRF (Shi and Lin, 2019) | 40.3 | 41.5 | +1.2
SpanSel (Ebner et al., 2020) | 40.7 | 41.5 | +0.8
Head-Expand (Zhang et al., 2020) | … | … | …

Table 8 gives three examples generated by our explicit data augmentation method. The results show that our approach can indeed identify global event arguments. For example, in (1), it finds that "Persepolis", which is two sentences away from the event trigger fire, is an event argument fulfilling the role of PLACE.

Error Analysis
Following Zhang et al. (2020), we conduct an error analysis by sampling 100 error cases from the development set of RAMS. We identify four typical errors: 1) Partial Match, which accounts for 16%. For example, the gold annotation of an ATTACKER is "the Palestine solidarity", but our approach predicts "Palestine solidarity". This issue partially derives from inconsistency in human annotation (Ebner et al., 2020). 2) Spurious Semantics, which accounts for 8%. For example, our approach incorrectly predicts that "Japan" fulfills a PLACE role in "... Japan had accepted the terms ...", owing to not fully understanding the sentence semantics. 3) Commonsense, which accounts for 3%. For example, our approach fails to predict that "computer network" fulfills the role of GIVER for acquires given the text "... into the computer network. Someone acquires the information ...". How to master commonsense reasoning is still an open challenge in implicit EAE. 4) Co-reference, which accounts for 4%. Different from RAMS, the WikiEvents dataset has noted this issue and incorporated co-reference into the evaluation, which improves F1 by about 2 points according to Table 3.
Impact on Sentence-Level EAE

Table 9 gives the results of our approach on the ACE sentence-level EAE task. We compare our method with QAEE (Du and Cardie, 2020b), which adopts a fine-grained query generation strategy (we directly use the trigger prediction results of QAEE to ensure comparability). The results justify the effectiveness of our approach. Particularly, with implicit knowledge transfer, our approach outperforms QAEE by 2.4% in F1. Additionally, we show explicit data augmentation can also benefit learning: it leads to +1.7% in F1 for the model based on sequence labeling (Shi and Lin, 2019).
Conclusion

In this paper, we take a new view of the data sparsity challenge faced by implicit EAE. Two MRC-based data augmentation regimes are devised, which implicitly transfer knowledge from related tasks or explicitly generate new training data to boost learning. Extensive experiments justify the effectiveness of our approach. In the future, we will design better question generation methods and apply our approach to other tasks.