Capturing Event Argument Interaction via A Bi-Directional Entity-Level Recurrent Decoder

Capturing interactions among event arguments is an essential step towards robust event argument extraction (EAE). However, existing efforts in this direction suffer from two limitations: 1) The argument role type information of contextual entities is mainly utilized as training signals, ignoring the potential merits of directly adopting it as semantically rich input features; 2) The argument-level sequential semantics, which implies the overall distribution pattern of argument roles over an event mention, is not well characterized. To tackle the above two bottlenecks, we formalize EAE as a Seq2Seq-like learning problem for the first time, where a sentence with a specific event trigger is mapped to a sequence of event argument roles. A neural architecture with a novel Bi-directional Entity-level Recurrent Decoder (BERD) is proposed to generate argument roles by incorporating contextual entities’ argument role predictions, like a word-by-word text generation process, thereby distinguishing implicit argument distribution patterns within an event more accurately.


Introduction
Event argument extraction (EAE), which aims to identify the entities serving as event arguments and classify the roles they play in an event, is a key step towards event extraction (EE). For example, given that the word "fired" triggers an Attack event in the sentence "In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel", EAE needs to identify that "Baghdad", "cameraman", "American tank", and "Palestine Hotel" are arguments with Place, Target, Instrument, and Target as roles, respectively.
Recently, deep learning models have been widely applied to event argument extraction and have achieved significant progress (Chen et al., 2015; Nguyen et al., 2016; Sha et al., 2018; Yang et al., 2019; Wang et al., 2019b; Du and Cardie, 2020). Many efforts have been devoted to improving EAE by better characterizing argument interaction, and they fall into two paradigms. The first one, named inter-event argument interaction in this paper, concentrates on mining information about the target entity (candidate argument) in the context of other event instances (Yu et al., 2011; Nguyen et al., 2016), e.g., the evidence that a Victim argument for the Die event is often the Target argument for the Attack event in the same sentence. The second one is intra-event argument interaction, which exploits the relationship of the target entity with others in the same event instance (Yu et al., 2011; Sha et al., 2016, 2018). We focus on the second paradigm in this paper.
Despite their promising results, existing methods on capturing intra-event argument interaction suffer from two bottlenecks.
(1) The argument role type information of contextual entities is underutilized. As two representative explorations, dBRNN (Sha et al., 2018) uses an intermediate tensor layer to capture latent interaction between candidate arguments; RBPB (Sha et al., 2016) estimates whether two candidate arguments belong to one event or not, and uses the estimates as constraints on a Beam-Search-based prediction algorithm. Generally, these works use the argument role type information of contextual entities as auxiliary supervision signals for training to refine the input representation. However, one intuitive observation is that the argument role types can be utilized straightforwardly as semantically rich input features, like how we use entity type information. To verify this intuition, we conduct an experiment on the ACE 2005 English corpus, in which CNN (Nguyen and Grishman, 2015) is utilized as a baseline. For an entity, we incorporate the ground-truth roles of its contextual arguments into the baseline model's input representation, obtaining the model CNN(w. role type). As expected, CNN(w. role type) outperforms CNN significantly, as shown in Table 1. (In this experiment, we skip the event detection phase and directly assume that all triggers are correctly recognized.) The challenge of this method lies in knowing the ground-truth roles of contextual entities in the inference (or testing) phase. That is one possible reason why existing works do not investigate this direction. Here we can simply use predicted argument roles to approximate the corresponding ground truth for inference. We believe that the noise brought by prediction is tolerable, considering the stimulating effect of using argument roles directly as input.
(2) The distribution pattern of multiple argument roles within an event is not well characterized. For events with many entities, distinguishing the overall appearance patterns of argument roles is essential to make accurate role predictions. In dBRNN (Sha et al., 2018), however, there is no specific design involving constraints or interaction among multiple prediction results, though the argument representation fed into the final classifier is enriched with synthesized information (the tensor layer) from other arguments. RBPB (Sha et al., 2016) explicitly leverages simple correlations inside each argument pair, ignoring more complex interactions in the whole argument sequence. Therefore, we need a more reliable way to learn the sequential semantics of argument roles in an event.
To address the above two challenges, we formalize EAE as a Seq2Seq-like learning problem (Bahdanau et al., 2014) of mapping a sentence with a specific event trigger to a sequence of event argument roles. To fully utilize both left- and right-side argument role information, inspired by the bidirectional decoder for machine translation (Zhang et al., 2018), we propose a neural architecture with a novel Bi-directional Entity-level Recurrent Decoder (BERD) to generate event argument roles entity by entity. The predicted argument role of an entity is fed recurrently into the decoding module for the next or previous entity, like a text generation process. In this way, BERD can identify candidate arguments in a way that is more consistent with the implicit distribution pattern of multiple argument roles within a sentence, similar to text generation models that learn to generate word sequences following certain grammatical rules or text styles.
The contributions of this paper are:
1. We formalize the task of event argument extraction as a Seq2Seq-like learning problem for the first time, where a sentence with a specific event trigger is mapped to a sequence of event argument roles.
2. We propose a novel architecture with a Bi-directional Entity-level Recurrent Decoder (BERD) that is capable of leveraging the argument role predictions of left- and right-side contextual entities and distinguishing the overall distribution pattern of argument roles.
3. Extensive experimental results show that our proposed method outperforms several competitive baselines on the widely-used ACE 2005 dataset. BERD's superiority is more significant given more entities in a sentence.

Problem Formulation
Most previous works formalize EAE as either a word-level sequence labeling problem (Nguyen et al., 2016; Zeng et al., 2016; Yang et al., 2019) or an entity-oriented classic classification problem (Chen et al., 2015; Wang et al., 2019b). We formalize EAE as a Seq2Seq-like learning problem as follows. Let $S = \{w_1, \dots, w_n\}$ be a sentence, where $n$ is the sentence length and $w_i$ is the $i$-th token. Also, let $E = \{e_1, \dots, e_k\}$ be the entity mentions in the sentence, where $k$ is the number of entities. Given that an event triggered by $t \in S$ is detected in the event detection (ED) stage, EAE needs to map the sentence with the event to a sequence of argument roles $R = \{y_1, \dots, y_k\}$, where $y_i$ denotes the argument role that entity $e_i$ plays in the event.
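To make the input and output of this formulation concrete, the following minimal Python sketch encodes the running example from the Introduction. The class and field names (`EAEInstance`, `entities`, `roles`) and the use of inclusive 0-based token spans are illustrative assumptions, not part of the original work.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EAEInstance:
    """One Seq2Seq-like EAE example (hypothetical container for illustration)."""
    tokens: List[str]                # sentence S = (w_1, ..., w_n)
    trigger_index: int               # 0-based position of trigger t in S
    event_type: str                  # event type, assumed known from event detection
    entities: List[Tuple[int, int]]  # entity spans e_1..e_k, inclusive token indices
    roles: List[str]                 # target sequence R = (y_1, ..., y_k)

    def __post_init__(self) -> None:
        # The output sequence has exactly one role per candidate entity.
        if len(self.entities) != len(self.roles):
            raise ValueError("need exactly one role per entity")


tokens = ("In Baghdad , a cameraman died when an American tank "
          "fired on the Palestine Hotel").split()
example = EAEInstance(
    tokens=tokens,
    trigger_index=tokens.index("fired"),
    event_type="Attack",
    entities=[(1, 1), (4, 4), (8, 9), (13, 14)],
    roles=["Place", "Target", "Instrument", "Target"],
)
```

The number of decoding steps is fixed by `len(example.entities)`, which is what distinguishes this setting from token-level generation with a variable-length output.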

The Proposed Approach
We employ an encoder-decoder architecture for the problem defined above, which is similar to most Seq2Seq models in machine translation (Vaswani et al., 2017; Zhang et al., 2018), automatic text summarization (Song et al., 2019; Shi et al., 2021), and speech recognition (Tüske et al.).
Figure 1: The detailed architecture of our proposed approach. The figure depicts a concrete case where a sentence contains an Attack event (triggered by "fired") and 4 candidate arguments $\{e_1, e_2, e_3, e_4\}$. The encoder on the left converts the sentence into intermediate continuous representations. Then the forward decoder and the backward decoder generate the argument role sequences in a left-to-right and a right-to-left manner (denoted by $\overrightarrow{y_i}$ and $\overleftarrow{y_i}$), respectively. A classifier is finally adopted to make the final prediction $y_i$. The forward and backward decoders share the instance feature extractor and generate the same instance representation $x_i$ for the $i$-th entity. The histograms in green and brown denote the probability distributions generated by the forward decoder and the backward decoder, respectively. The orange histograms denote the final predictions. Note that the histograms are for illustration only and do not represent the true probability distributions.
In particular, as Figure 1 shows, our architecture consists of an encoder that converts the sentence $S$ with a specific event trigger into an intermediate vectorized representation and a decoder that generates a sequence of argument roles entity by entity. The decoder is an entity-level recurrent network whose number of decoding steps is fixed and equal to the number of entities in the corresponding sentence. At each decoding step, we feed the prediction result for the previously processed entity into the recurrent unit to make a prediction for the current entity. Since the predicted results for both left- and right-side entities can be valuable information, we further incorporate a bidirectional decoding mechanism that integrates a forward decoding process and a backward decoding process.

Encoder
Given the sentence $S = (w_1, \dots, w_n)$ containing a trigger $t \in S$ and $k$ candidate arguments $E = \{e_1, \dots, e_k\}$, an encoder is adopted to encode the word sequence into a sequence of continuous representations as follows:
$$H = (h_1, \dots, h_n) = F(w_1, \dots, w_n)$$
where $F(\cdot)$ is the neural network that encodes the sentence. In this paper, we select BERT (Devlin et al., 2019) as the encoder. Because the representation $H$ does not contain event type information, which is essential for predicting argument roles, we append a special phrase denoting the event type of $t$ to each input sequence, such as "# ATTACK #".
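The event-type phrase can be illustrated with a small Python sketch. The function name `append_event_type` is hypothetical, and upper-casing the type and placing the phrase at the very end of the word sequence are assumptions of this sketch; only the "# ATTACK #" surface form comes from the text.

```python
from typing import List


def append_event_type(tokens: List[str], event_type: str) -> List[str]:
    """Return a copy of the word sequence with an event-type phrase appended.

    The "# TYPE #" form follows the example in the text; upper-casing and
    appending at the end of the sequence are assumptions of this sketch.
    """
    return tokens + ["#", event_type.upper(), "#"]


sentence = "an American tank fired on the Palestine Hotel".split()
encoder_input = append_event_type(sentence, "Attack")
```

The resulting word sequence would then be tokenized and encoded by the chosen encoder, which is outside the scope of this sketch.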

Decoder
Different from traditional token-level Seq2Seq models, we use a bi-directional entity-level recurrent decoder (BERD) with a classifier to generate a sequence of argument roles entity by entity. BERD consists of a forward and backward recurrent decoder, which exploit the same recurrent unit architecture as follows.

Recurrent Unit
The recurrent unit is designed to explicitly utilize two kinds of information: (1) the instance information, which contains the sentence, event, and candidate argument (denoted by $S$, $t$, $e$); and (2) the contextual argument information, which consists of the argument roles of other entities (denoted by $A$).
The recurrent unit exploits two corresponding feature extractors as follows.
Instance Feature Extractor. Given the representation $H$ generated by the encoder, dynamic multi-pooling (Chen et al., 2015) is applied to extract the max values of three split parts, which are determined by the positions of the event trigger and the candidate argument. The three hidden embeddings are aggregated into an instance feature representation $x$ as follows (written for the case where the trigger precedes the candidate argument):
$$[x_{1,p_t}]_i = \max\{[h_1]_i, \dots, [h_{p_t}]_i\}$$
$$[x_{p_t+1,p_e}]_i = \max\{[h_{p_t+1}]_i, \dots, [h_{p_e}]_i\}$$
$$[x_{p_e+1,n}]_i = \max\{[h_{p_e+1}]_i, \dots, [h_n]_i\}$$
$$x = [x_{1,p_t}; x_{p_t+1,p_e}; x_{p_e+1,n}]$$
where $h_j$ is the $j$-th vector of $H$, $[\cdot]_i$ is the $i$-th value of a vector, and $p_t$, $p_e$ are the positions of trigger $t$ and candidate argument $e$.
Argument Feature Extractor. To incorporate previously generated arguments, we exploit a CNN to encode the instance together with the argument information as follows.
Input. Different from Chen et al. (2015), where the input embedding of each word consists of its word embedding, position embedding, and event type embedding, we additionally append an argument role embedding to the input embedding of each word by looking up the vector $A$, which records the argument role for each token in $S$. In $A$, tokens of previously predicted arguments are assigned the generated labels, tokens of the candidate entity $e$ are assigned a special label "To-Predict", and all other tokens are assigned the label N/A.
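The construction of $A$ for a single decoding step can be sketched as follows. The function name `build_role_vector`, the span representation (inclusive 0-based token indices), and the rule that the candidate's label overrides any overlapping earlier label are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def build_role_vector(n_tokens: int,
                      predicted: Dict[Span, str],
                      candidate: Span) -> List[str]:
    """Build the per-token role vector A for one decoding step."""
    A = ["N/A"] * n_tokens
    # Tokens of previously predicted arguments carry their generated labels.
    for (start, end), role in predicted.items():
        for i in range(start, end + 1):
            A[i] = role
    # Tokens of the current candidate entity carry the special label.
    # Written last, so it wins over overlapping entities (an assumption).
    for i in range(candidate[0], candidate[1] + 1):
        A[i] = "To-Predict"
    return A


# Step 3 of the running example: "Baghdad" and "cameraman" are already
# labeled, and "American tank" (tokens 8-9) is the current candidate.
A = build_role_vector(
    n_tokens=15,
    predicted={(1, 1): "Place", (4, 4): "Target"},
    candidate=(8, 9),
)
```

Each label in `A` would then be mapped to a role embedding and concatenated with the other per-token embeddings.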
Pooling. A max-pooling operation is then applied to extract the argument feature $x^a$.
We concatenate the instance feature representation $x$ and the argument feature representation $x^a$ as the input feature representation for the argument role classifier, and estimate the role that $e$ plays in the event as follows:
$$o = \mathrm{Softmax}(W[x; x^a] + b)$$
where $W$ and $b$ are weight parameters and $o$ is the probability distribution over the role label space. For the sake of simplicity, in the rest of the paper we use $\mathrm{Unit}(S, t, e, A)$ to represent the calculation of the probability distribution $o$ by the recurrent unit with $S$, $t$, $e$, $A$ as inputs.
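The dynamic multi-pooling used by the instance feature extractor can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, positions are 0-based, and the trigger is assumed to precede the candidate argument with no empty segment.

```python
from typing import List

Vector = List[float]


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def dynamic_multi_pooling(H: List[Vector], p_t: int, p_e: int) -> Vector:
    """Concatenate max-pooled features of the three segments of H.

    H holds one hidden vector per token; p_t and p_e are 0-based positions
    of the trigger and the candidate argument. The sketch assumes
    p_t < p_e < len(H) - 1 so that no segment is empty.
    """
    if not 0 <= p_t < p_e < len(H) - 1:
        raise ValueError("sketch assumes trigger before entity, entity not last")
    left = max_pool(H[: p_t + 1])            # tokens up to and including the trigger
    middle = max_pool(H[p_t + 1: p_e + 1])   # tokens after the trigger up to the entity
    right = max_pool(H[p_e + 1:])            # tokens after the entity
    return left + middle + right             # list concatenation


H = [[1.0, 0.0], [0.5, 2.0], [3.0, -1.0], [0.0, 0.0], [-2.0, 4.0]]
x = dynamic_multi_pooling(H, p_t=1, p_e=2)
```

Compared with a single max-pooling over the whole sentence, the three segments keep the pooled features sensitive to where the trigger and the candidate argument are located.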

Forward Decoder
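Given the sentence $S$ with $k$ candidate arguments $E = \{e_1, \dots, e_k\}$, the forward decoder exploits the above recurrent unit and generates the argument role sequence in a left-to-right manner. The conditional probability of the argument role sequence is calculated as follows:
$$P(R \mid E, S, t) = \prod_{i=1}^{k} p(y_i \mid e_i; R_{<i}, S, t)$$
where $R_{<i}$ denotes the role sequence $\{y_1, \dots, y_{i-1}\}$ for $\{e_1, \dots, e_{i-1}\}$. For the $i$-th entity $e_i$, the recurrent unit generates the prediction as follows:
$$\overrightarrow{y_i} = \mathrm{Unit}(S, t, e_i, \overrightarrow{A_i})$$
where $\overrightarrow{y_i}$ denotes the probability distribution over the label space for $e_i$ and $\overrightarrow{A_i}$ denotes the contextual argument information of the $i$-th decoding step, which contains the previously predicted argument roles $R_{<i}$. Then we update $\overrightarrow{A_{i+1}}$ by labeling $e_i$ as $g(\overrightarrow{y_i})$ for the next step $i+1$, where $g(\overrightarrow{y_i})$ denotes the label that has the highest probability under the distribution $\overrightarrow{y_i}$.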
The argument feature extracted by the recurrent units of the forward decoder is denoted as $\overrightarrow{x}^a_i$.
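The greedy left-to-right decoding loop can be sketched as follows. The recurrent unit is replaced by a stand-in callable, and `toy_unit` is a hypothetical scorer invented purely to show how an earlier prediction can change a later one; neither reflects the trained model.

```python
from typing import Callable, Dict, Hashable, List, Sequence

Distribution = Dict[str, float]
# A unit maps (entity, roles already assigned to other entities) to a
# probability distribution over roles; it stands in for Unit(S, t, e, A).
Unit = Callable[[Hashable, Dict[Hashable, str]], Distribution]


def argmax_role(dist: Distribution) -> str:
    """g(y): the label with the highest probability."""
    return max(dist, key=dist.get)


def forward_decode(entities: Sequence[Hashable], unit: Unit) -> List[str]:
    """Greedy left-to-right decoding with one step per entity."""
    context: Dict[Hashable, str] = {}   # plays the role of A_i
    roles: List[str] = []
    for entity in entities:
        dist = unit(entity, context)
        role = argmax_role(dist)
        context[entity] = role          # update the context for step i+1
        roles.append(role)
    return roles


def toy_unit(entity: Hashable, context: Dict[Hashable, str]) -> Distribution:
    """Hypothetical scorer: prefers Destination unless one was already predicted."""
    if "Destination" in context.values():
        return {"Destination": 0.2, "N/A": 0.8}
    return {"Destination": 0.7, "N/A": 0.3}


roles = forward_decode(["e1", "e2", "e3"], toy_unit)
```

Running the same loop over `reversed(entities)` gives the right-to-left counterpart, with the caveat that the resulting list is then in reverse entity order.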

Backward Decoder
The backward decoder is similar to the forward decoder, except that it performs decoding in a right-to-left manner as follows:
$$P(R \mid E, S, t) = \prod_{i=1}^{k} p(y_i \mid e_i; R_{>i}, S, t)$$
where $R_{>i}$ denotes the role sequence $\{y_{i+1}, \dots, y_k\}$ for $\{e_{i+1}, \dots, e_k\}$. The probability distribution over the label space for the $i$-th entity $e_i$ is calculated as follows:
$$\overleftarrow{y_i} = \mathrm{Unit}(S, t, e_i, \overleftarrow{A_i})$$
where $\overleftarrow{A_i}$ denotes the contextual argument information of the $i$-th decoding step, which contains the previously predicted argument roles $R_{>i}$. We update $\overleftarrow{A_{i-1}}$ by labeling $e_i$ as $g(\overleftarrow{y_i})$ for the next step $i-1$. The argument feature extracted by the recurrent units of the backward decoder is denoted as $\overleftarrow{x}^a_i$.

Classifier
To utilize both left- and right-side argument information, a classifier is then adopted to combine the argument features of both decoders and make the final prediction for each entity $e_i$ as follows:
$$y_i = \mathrm{Softmax}(W_c[x_i; \overrightarrow{x}^a_i; \overleftarrow{x}^a_i] + b_c) \quad (10)$$
where $y_i$ denotes the final probability distribution for $e_i$, and $W_c$ and $b_c$ are weight parameters.
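A minimal sketch of the final classification step is given below. It assumes that the classifier input is the concatenation of the shared instance feature with the forward and backward argument features, followed by a linear layer and a softmax; the weights and the tiny feature sizes are made up for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def final_classifier(x: Vector, fwd: Vector, bwd: Vector,
                     W_c: List[Vector], b_c: Vector) -> Vector:
    """Softmax(W_c [x; fwd; bwd] + b_c) for one entity."""
    features = x + fwd + bwd                      # list concatenation
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(W_c, b_c)]
    return softmax(logits)


# Toy sizes: 1-dim instance feature, 1-dim feature per decoder, 2 roles.
W_c = [[0.0, 1.0, 1.0],     # row scoring role 0
       [0.0, -1.0, -1.0]]   # row scoring role 1
b_c = [0.0, 0.0]
y = final_classifier(x=[0.5], fwd=[1.0], bwd=[0.5], W_c=W_c, b_c=b_c)
```

Because both argument features enter the same linear layer, the classifier can weigh evidence from the left and right context jointly instead of averaging two separate predictions.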

Training and Optimization
As seen, the forward decoder and the backward decoder in BERD play two important roles. The first is to yield intermediate argument features for the final classifier, and the second is to make the initial predictions fed into the argument feature extractor. Since the initial predictions of the two decoders are crucial for generating accurate argument features, we need to optimize their own classifiers in addition to the final classifier. We use $\overrightarrow{p}(y_i \mid e_i)$ and $\overleftarrow{p}(y_i \mid e_i)$ to represent the probability of $e_i$ playing role $y_i$ estimated by the forward and backward decoder, respectively. $p(y_i \mid e_i)$ denotes the final estimated probability of $e_i$ playing role $y_i$ by Equation 10. The optimization objective function is defined as follows:
$$J = -\sum_{S \in D} \sum_{t \in S} \sum_{e_i \in E_S} \left[ \alpha \log p(y_i \mid e_i) + \beta \log \overrightarrow{p}(y_i \mid e_i) + \gamma \log \overleftarrow{p}(y_i \mid e_i) \right]$$
where $D$ denotes the training set and $t \in S$ denotes the trigger word detected by the previous event detection model in sentence $S$. $E_S$ represents the entity mentions in $S$. $\alpha$, $\beta$, and $\gamma$ are the weights for the losses of the final classifier, the forward decoder, and the backward decoder, respectively.
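The weighted objective can be illustrated with a short Python sketch for a single event. The function name `berd_loss` is hypothetical; the inputs are the probabilities that each of the three classifiers assigns to the gold role of every entity, and the default weights are the values reported in the Hyperparameters section.

```python
import math
from typing import Sequence


def berd_loss(p_final: Sequence[float],
              p_forward: Sequence[float],
              p_backward: Sequence[float],
              alpha: float = 1.0,
              beta: float = 0.5,
              gamma: float = 0.5) -> float:
    """Weighted negative log-likelihood summed over the entities of one event.

    Each sequence holds, per entity, the probability that the corresponding
    classifier assigns to the gold role (all values must be in (0, 1]).
    """
    if not (len(p_final) == len(p_forward) == len(p_backward)):
        raise ValueError("need one probability per entity from each classifier")
    total = 0.0
    for pf, pfw, pbw in zip(p_final, p_forward, p_backward):
        total -= (alpha * math.log(pf)
                  + beta * math.log(pfw)
                  + gamma * math.log(pbw))
    return total


loss = berd_loss(p_final=[0.9, 0.8], p_forward=[0.7, 0.6], p_backward=[0.8, 0.5])
```

Summing this quantity over all events in the training set gives the full objective; a real implementation would compute it from logits with a framework's cross-entropy for numerical stability.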
During training, we apply the teacher forcing mechanism, where gold argument information is fed into BERD's recurrent units. This enables parallel computation and greatly accelerates the training process. Once the model is trained, we first use the forward decoder with a greedy search to sequentially generate a sequence of argument roles in a left-to-right manner. Then, the backward decoder performs decoding in the same way but in a right-to-left manner. Finally, the classifier combines both left- and right-side argument features and makes a prediction for each entity, as Equation 10 shows.
Following previous work (...; Du and Cardie, 2020), we evaluate our models on the most widely used ACE 2005 dataset, which contains 599 documents annotated with 33 event subtypes and 35 argument roles. We use the same test set containing 40 newswire documents, a development set containing 30 randomly selected documents, and a training set with the remaining 529 documents.
We notice that Wang et al. (2019b) used the TAC KBP dataset, which we cannot access online or acquire from them due to copyright. We believe that experimenting with settings consistent with most related works (e.g., 27 out of 37 top papers used only the ACE 2005 dataset in the last four years) should yield convincing empirical results.

Hyperparameters
We adopt BERT (Devlin et al., 2019) as the encoder and the proposed bi-directional entity-level recurrent decoder as the decoder for the experiments. The hyperparameters used in the experiments are listed below.
BERT. The hyperparameters of BERT are the same as those of the BERT-BASE model. We use a dropout probability of 0.1 on all layers.
Argument Feature Extractor. The dimensions of the word embedding, position embedding, event type embedding, and argument role embedding for each token are 100, 5, 5, and 10, respectively. We utilize 300 convolution kernels of size 3. GloVe embeddings (Pennington et al., 2014) are used to initialize the word embeddings.
Training. Adam with a learning rate of 6e-05, $\beta_1 = 0.9$, $\beta_2 = 0.999$, an L2 weight decay of 0.01, and a learning rate warmup of 0.1 is used for optimization. We set the number of training epochs and the batch size to 40 and 30, respectively. Besides, we apply dropout with a rate of 0.5 on the concatenated feature representations. The loss weights $\alpha$, $\beta$, and $\gamma$ are set to 1.0, 0.5, and 0.5, respectively.
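For reference, the reported settings can be collected into a single configuration object. The dictionary layout and key names below are illustrative assumptions; only the values are taken from the text.

```python
# Hyperparameters as reported in the text; key names are hypothetical.
HYPERPARAMS = {
    "encoder": {
        "model": "BERT-base",
        "dropout": 0.1,
    },
    "argument_feature_extractor": {
        "word_emb_dim": 100,
        "position_emb_dim": 5,
        "event_type_emb_dim": 5,
        "role_emb_dim": 10,
        "num_kernels": 300,
        "kernel_size": 3,
    },
    "training": {
        "optimizer": "Adam",
        "learning_rate": 6e-5,
        "betas": (0.9, 0.999),
        "weight_decay": 0.01,
        "warmup": 0.1,
        "epochs": 40,
        "batch_size": 30,
        "feature_dropout": 0.5,
        "loss_weights": {"alpha": 1.0, "beta": 0.5, "gamma": 0.5},
    },
}
```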

Baselines
We compare our method against the following four baselines. The first two are state-of-the-art models that predict arguments separately without considering argument interaction. We also implement two variants of DMBERT that utilize the latest inter-event and intra-event argument interaction methods, named BERT(Inter) and BERT(Intra), respectively.

1. DMBERT which adopts BERT as the encoder and generates a representation for each entity mention based on the dynamic multi-pooling operation (Wang et al., 2019a). The candidate arguments are predicted separately.
2. HMEAE which utilizes the concept hierarchy of argument roles and utilizes hierarchical modular attention for event argument extraction (Wang et al., 2019b).
3. BERT(Inter) which enhances DMBERT with inter-event argument interaction adopted by Nguyen et al. (2016). The memory matrices are introduced to store dependencies among event triggers and argument roles.
4. BERT(Intra) which incorporates intra-event argument interaction adopted by Sha et al. (2018) into DMBERT. The tensor layer and self-matching attention matrix with the same settings are applied in the experiment.
Following previous work (Wang et al., 2019b), we use a pipelined approach for event extraction and implement DMBERT as event detection model. The same event detection model is used for all the baselines to ensure a fair comparison.
Note that Nguyen et al. (2016) use the last word to represent the entity mention (Sha et al. (2018) do not describe this detail), which may lead to insufficient semantic information and inaccurate evaluation, considering that entity mentions may consist of multiple words and overlap with each other. We instead sum the hidden embeddings of all words when collecting lexical features for each entity mention.

Main Results
The performance of BERD and the baselines is shown in Table 2 (statistically significant with p < 0.05), from which we have several main observations. (1) Compared with the best-performing recent baseline HMEAE, our method BERD achieves an absolute improvement of 1.0 F1, clearly achieving competitive performance.
(2) Incorporation of argument interactions brings significant improvements over vanilla DMBERT. For example, BERT(Intra) gains a 1.5 F1 improvement compared with DMBERT, which has the same architecture except for argument interaction.

... of BERD). (4) Compared with BERT(Inter) and BERT(Intra), our proposed BERD achieves the most significant improvements. We attribute the solid enhancement to BERD's novel Seq2Seq-like architecture that effectively exploits the argument roles of contextual entities.

Effect of Entity Numbers
To further investigate how our method improves performance, we conduct comparison and analysis on effect of entity numbers. Specifically, we first divide the event instances of test set into some subsets based on the number of entities in an event.
Since events with a specific number of entities may be too few, results on subsets covering a range of entity numbers yield more robust and convincing conclusions. To make the number of events in all subsets as balanced as possible, we finally obtain a division into four subsets, whose entity numbers are in the ranges [1, 3], [4, 6], [7, 9], and 10 or more, and whose event quantities account for 28.4%, 28.2%, 25.9%, and 17.5%, respectively.
The performance of all models on the four subsets is shown in Figure 2, from which we can observe a general trend: BERD outperforms the baselines by a larger margin when more entities appear in an event. More entities usually mean more complex contextual information for a candidate argument, which leads to performance degradation. BERD alleviates the degradation better because of its capability of capturing the argument role information of contextual entities. We notice that BERT(Intra) also outperforms DMBERT significantly on Subset-4, which demonstrates the effectiveness of intra-event argument interaction. Note that the performance on Subset-1 is worse than that on Subset-2, which looks like an outlier. The reason is that the performance of the first-stage event detection model on Subset-1 is much poorer (e.g., an F1 score of 32.8 for events with one entity).
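The division into subsets amounts to a simple binning rule, sketched below. The function names are hypothetical, and the labels Subset-1 to Subset-4 are assumed to be numbered in order of increasing entity count.

```python
from collections import Counter
from typing import Dict, Iterable


def entity_count_subset(num_entities: int) -> str:
    """Map the number of entities in an event to one of the four subsets."""
    if num_entities < 1:
        raise ValueError("an event instance needs at least one entity")
    if num_entities <= 3:
        return "Subset-1"
    if num_entities <= 6:
        return "Subset-2"
    if num_entities <= 9:
        return "Subset-3"
    return "Subset-4"


def subset_shares(entity_counts: Iterable[int]) -> Dict[str, float]:
    """Fraction of events that fall into each subset."""
    counts = Counter(entity_count_subset(n) for n in entity_counts)
    total = sum(counts.values())
    return {name: counts[name] / total for name in sorted(counts)}


shares = subset_shares([1, 2, 5, 8, 12, 3, 4, 10])
```

On the real test set, `subset_shares` would be applied to the per-event entity counts to check how balanced a candidate division is.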

Effect of Overlapping Entity Mentions
Though the performance improvement can be easily observed, it is nontrivial to quantitatively verify how BERD captures the distribution pattern of multiple argument roles within an event. In this section, we partly investigate this problem by exploring the effect of overlapping entities. Since usually only one of several overlapping entities serves as an argument, we believe sophisticated EAE models should identify this pattern. Therefore, we divide the test set into two subsets (Subset-O and Subset-N) based on whether an event contains overlapping entity mentions and check all models' performance on these two subsets. Table 3 shows the results, from which we can find that all baselines perform worse on Subset-O. This is a natural result, since multiple overlapping entities usually have similar representations, making the pattern mentioned above challenging to capture. BERD performs well on both Subset-O and Subset-N, and its superiority over the baselines is more significant on Subset-O. We attribute this to BERD's capability of distinguishing argument distribution patterns.
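The split into Subset-O and Subset-N can be sketched as a pairwise span-overlap check. The representation of entity mentions as inclusive token spans and the function names are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def spans_overlap(a: Span, b: Span) -> bool:
    """True if two inclusive spans share at least one token (nesting included)."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_subset(entities: List[Span]) -> str:
    """Assign an event to Subset-O if any two entity mentions overlap."""
    has_overlap = any(spans_overlap(a, b) for a, b in combinations(entities, 2))
    return "Subset-O" if has_overlap else "Subset-N"


# A five-token phrase containing two nested mentions.
nested = [(0, 4), (0, 0), (1, 2)]
disjoint = [(1, 1), (4, 4), (8, 9)]
```

The pairwise check is quadratic in the number of entities, which is unproblematic for sentence-level inputs.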

Effect of the Bidirectional Decoding
To further investigate the effectiveness of the bidirectional decoding process, we exclude the backward decoder or the forward decoder from BERD and obtain two models with only a unidirectional decoder, whose performance is shown in the rows "-w/ Forward Decoder" and "-w/ Backward Decoder" in Table 4. From the results, we can observe that: (1) When decoding with only the forward or the backward decoder, the performance decreases by 1.6 and 1.3 F1, respectively. The results clearly demonstrate the superiority of the bidirectional decoding mechanism. (2) Though the two model variants suffer performance degradation, they still outperform DMBERT significantly, once again verifying that exploiting contextual argument information, even in only one direction, is beneficial to EAE.
Considering that the number of model parameters decreases when the forward or backward decoder is excluded, we build another two model variants with two decoders of the same direction (denoted by "-w/ Forward Decoder x2" and "-w/ Backward Decoder x2"), whose parameter numbers are exactly equal to that of BERD. Table 4 shows that the two enlarged single-direction models perform similarly to their original versions. We can conclude that the improvement comes from the complementarity of the two decoders with different directions, rather than from the increase in model parameters.

Besides, we exclude the recurrent mechanism by preventing argument role predictions of contextual entities from being fed into the decoding module, obtaining another model variant named "-w/o Recurrent Mechanism". The performance degradation clearly shows the value of the recurrent decoding process incorporating argument role information.

Case Study and Error Analysis
To promote understanding of our method, we present three concrete examples in Figure 3. Sentence S1 contains a Transport event triggered by "sailing". DMBERT and BERT(Intra) assign the Destination role to the candidate arguments "the perilous Strait of Gibraltar", "the southern mainland", and "the Canary Islands out in the Atlantic", the first two of which are mislabeled. It is an unusual pattern for a Transport event to contain multiple destinations. DMBERT and BERT(Intra) fail to recognize such patterns, showing that they cannot capture this type of correlation among prediction results well. Our BERD, however, leverages previous predictions to generate argument roles entity by entity in a sentence, successfully avoiding the unusual pattern.
S2 contains a Transport event triggered by "visited", and 4 nested entities exist in the phrase "Ankara police chief Ercument Yilmaz". Since these nested entities share the same sentence context, it is not surprising that DMBERT wrongly assigns all of them the same argument role Artifact. Thanks to the bidirectional entity-level recurrent decoder, our method can recognize the distribution pattern of arguments better and hence correctly identifies these nested entities as false instances. In this case, BERD avoids 3 false-positive predictions compared with DMBERT, confirming the results and analysis of Table 3.
As a qualitative error analysis, the last example S3 demonstrates that incorporating previous predictions may also lead to an error propagation problem. S3 contains a Marry event triggered by "marry". Entity "home" is mislabeled as the Time-Within role by BERD, and this wrong prediction is then used as an argument feature to identify entity "later in this after", whose role is Time-Within. As analyzed in the first case, BERD tends to avoid repetitive roles in a sentence, which leads to this entity being incorrectly predicted as N/A.
S1: Tens of thousands of destitute Africans try to enter Spain illegally each year by crossing the perilous Strait of Gibraltar to reach the southern mainland or by sailing northwest to the Canary Islands out in the Atlantic
Figure 3: Case study. Entities and triggers are highlighted in green and purple, respectively. Each tuple (E, G, P1, P2, P3) denotes the predictions for an entity E with gold label G, where P1, P2, and P3 denote the predictions of DMBERT, BERT(Intra), and BERD, respectively. Incorrect predictions are denoted by a red mark.

Related Work
We have covered research on EAE in Section 1. Here, we mainly introduce the related work that inspires our technical design.
Though our recurrent decoder is entity-level, our bidirectional decoding mechanism is inspired by some bidirectional decoders in token-level Seq2Seq models, e.g., of machine translation (Zhou et al., 2019), speech recognition (Chen et al., 2020) and scene text recognition (Gao et al., 2019).
We formalize the task of EAE as a Seq2Seq-like learning problem instead of a classic classification problem or sequence labeling problem. We have found that there are also some works performing classification or sequence labeling in a Seq2Seq manner in other fields. For example, the multi-label classification task has been formulated as a sequence generation problem to capture the correlations between labels. Daza and Frank (2018) explore an encoder-decoder model for semantic role labeling. We are the first to employ a Seq2Seq-like architecture to solve the EAE task.

Conclusion
We have presented BERD, a neural architecture with a Bi-directional Entity-level Recurrent Decoder that achieves competitive performance on the task of event argument extraction (EAE). One main characteristic that distinguishes our techniques from previous works is that we formalize EAE as a Seq2Seq-like learning problem instead of a classic classification or sequence labeling problem. The novel bidirectional decoding mechanism enables BERD to utilize both the left- and right-side argument predictions effectively to generate a sequence of argument roles that better follows the overall distribution patterns over a sentence.
As pioneering research that introduces the Seq2Seq-like architecture into the EAE task, BERD also faces some open questions. For example, since we use gold argument roles as prediction results during training, how to alleviate the exposure bias problem is worth investigating. We are also interested in incorporating our techniques into more sophisticated models that jointly extract triggers and arguments.