Incorporating Circumstances into Narrative Event Prediction

Narrative event prediction, which aims to discover how events evolve, is essential to modeling sophisticated real-world events. Existing studies focus on mining inter-event relationships while ignoring how the events happened, which we call circumstances. However, we observe that circumstances implicitly indicate event evolution and are significant for narrative event prediction. To incorporate circumstances into narrative event prediction, we propose CircEvent, which adopts multi-head attention to retrieve circumstances at the local and global levels. We also introduce a regularization of attention weights to leverage the alignment between events and local circumstances. The experimental results demonstrate that CircEvent outperforms existing baselines by 12.2%. Further analysis demonstrates the effectiveness of our multi-head attention modules and regularization. Our source code is available at https://github.com/Shichao-Wang/CircEvent.


Introduction
The narrative event chain, similar to the classical notion of the script (Schank and Abelson, 2013), is structured knowledge that captures the relationships between event sequences and their participants in a given scenario. Figure 1 describes a scenario of "going to the restaurant". Modeling the narrative event chain can help AI systems understand sophisticated real-world events and benefits many downstream applications (Han et al., 2021), such as financial analysis (Yang et al., 2019). This paper focuses on modeling the narrative event chain and predicting what will happen next, a task called Multiple Choice Narrative Cloze (MCNC) (Granroth-Wilding and Clark, 2016). As shown in Figure 1, the MCNC evaluation aims to choose the correct event from a set of candidate choices given a sequence of historical events.
Early studies learn event representations with rule-based (Schank and Abelson, 2013), count-based (Chambers and Jurafsky, 2008; Pichotta and Mooney, 2016), and deep learning (Modi and Titov, 2014a,b; Granroth-Wilding and Clark, 2016) methods. Recently, more and more studies attempt to incorporate external knowledge into event representations. Li et al. (2018) build the Narrative Event Evolutionary Graph (NEEG), which describes event evolution principles and patterns. FEEL (Lee and Goldwasser, 2018) introduces a feature-enriched event embedding: besides the subject, predicate, and object, FEEL also considers sentiment and animacy as parts of events. Lee and Goldwasser (2019) regard event embedding learning as a multi-relational problem and capture different relations between event pairs, such as cause and contrast. Zheng et al. (2020b) build a heterogeneous event graph to mine subordinate relations between events and words.
In addition to the previous events that have already happened, particular situations also affect event evolution; we define these as circumstances in this paper. Event circumstances include detailed descriptions of the event situation, such as the weather, the state of the place, and the protagonist's behavior. As the example in Figure 1 shows,

Circumstance 1: Peter finds Jenny is waiting for him.
Circumstance 2: It was rather popular.
Circumstance 3: There were a few people.

existing works tend to predict choice (c) or (d) based on previous events or historical knowledge for the event walk (Peter, restaurant). Given the different circumstances shown in Figure 2, they cannot make a different decision based on the restaurant's environment, such as the situations described in circumstances 2 and 3. Peter is more likely to get seated if the restaurant has only a few customers; he will ask the waiter about available seats if the restaurant is popular, meaning it is crowded. Circumstances such as the crowdedness of the restaurant, the weather, or the protagonist's actions will influence event evolution, yet existing works do not consider them.
In this paper, we propose CircEvent to represent events together with their circumstances. Following previous studies (Chambers and Jurafsky, 2008; Lee and Goldwasser, 2019), events in this paper are also extracted from an unstructured text corpus, and each event belongs to one specific sentence in the text. The extracted event contains only the minimum information about an event, i.e., the subject, predicate, and object. The contextual information, environment descriptions, and semantics, which we discuss as circumstances, are left in the original sentence. We attempt to collect event circumstance information from the unstructured text. However, the unstructured text contains so much information that not all of it contributes to event evolution.
To tackle this challenge, we develop two multi-head attention networks to incorporate an event representation and its circumstances into narrative event prediction at the local and global levels. At the local level, each event comes from a specific sentence containing the most related circumstances; we develop a multi-head attention module to retrieve the local circumstances for events. Moreover, the local circumstances of the context events also contribute to the event representation; we develop another multi-head attention module to obtain the global circumstances by adaptively aggregating the local circumstances of the context.
After circumstance retrieval, we adopt the transformer as the backbone to encode the context events and circumstances. The transformer decoder is used to compute similarity scores for the candidate events; the candidate events are compared implicitly within the transformer decoder, benefiting from its architecture.
Our contributions in this paper are threefold:
1. We propose CircEvent to incorporate event circumstances into narrative event prediction with the transformer architecture.
2. We introduce two multi-head attention modules to retrieve event circumstances from the corpus at the local and global levels.
3. Our proposal outperforms existing baselines by 12.2% on the MCNC task, and further analysis confirms the effectiveness of event circumstances.
Related Work

Narrative Event Representation
In the literature, methods for obtaining event representations can be categorized into two groups: self-contained methods and external-knowledge-enriched methods.
Self-contained event representation works use only the events and their connecting relations. Event-Comp (Granroth-Wilding and Clark, 2016) employed distributed representations, word2vec (Mikolov et al., 2013), to learn word representations of the arguments that appear in an event; the event representation is a linear combination of these argument representations. RoleFactor (Weber et al., 2018) proposed a scalable tensor-based composition model for event representations, which composes event arguments in a hierarchical structure. UniFA-S (Zheng et al., 2020a) adopted a Variational Auto-Encoder architecture (Kingma and Welling, 2014) with a unified fine-tuning method to learn event representations at the intra-event, inter-event, and scenario levels. HeterEvent (Zheng et al., 2020b) proposed a heterogeneous graph neural network that models discontinuous event segments explicitly.

In the external-knowledge-enriched line of research, researchers have injected several kinds of external knowledge into event representations. FEEL (Lee and Goldwasser, 2018) injects sentiment and animacy information into event embeddings. Yang et al. (2019) enrich event representations with news information. EventTransE (Lee and Goldwasser, 2019) regards event embedding learning as a multi-relational problem and incorporates relationships among events into event representation learning. Ding et al. (2019) leverage commonsense knowledge about intent and sentiment, which can be found in knowledge bases such as Event2Mind (Rashkin et al., 2018) and ATOMIC (Sap et al., 2019). In this paper, we attempt to incorporate event circumstances into event representations.

Attention Mechanism in Narrative Event Prediction
Since Bahdanau et al. (2015) first adopted the attention mechanism in neural machine translation, attention has shown its effectiveness in many NLP applications. Many previous works on narrative event prediction (Wang et al., 2017; Li et al., 2018; Lv et al., 2019) also apply an attention mechanism to the context events, as they assume different context events have different weights for choosing the correct subsequent event. Besides, Lv et al. (2019) employ a self-attention mechanism (Lin et al., 2017) to implicitly represent the event chain through diverse event segments within the chain. Zheng et al. (2020b) adopt a graph attention network (Velickovic et al., 2018) to aggregate neighborhood event information. We employ multi-head attention (Vaswani et al., 2017) to extract circumstance representations from the event sentence and to adaptively aggregate circumstances at the global level.

Model
This section introduces our CircEvent neural network in four modules: the event representation, the circumstance representation, the event chain encoder and the prediction module.

Event Representation
Each event consists of three arguments, i.e., a subject, a predicate, and an object, each containing up to n_arg words. We follow Zheng et al. (2020b) and apply a max-pooling and an average-pooling layer over the argument word embeddings, then concatenate the results to obtain the argument embeddings. The subject, predicate, and object representations are denoted as s(e), p(e), o(e) ∈ R^{2d_e}. For the subject, its representation s(e) follows:

s(e) = [max(w_s); avg(w_s)]

where w_s ∈ R^{n_arg×d_e} is the sequence of subject word embeddings. The max, avg, and [·;·] refer to

Figure 4: The circumstance representation module. The local circumstance module is on the left: the vector in yellow represents a single event embedding, and the matrix in blue is the corresponding sentence hidden states. The global circumstance module is on the right: the event matrix in yellow holds the context event embeddings, and the purple matrix holds the local circumstances. Our regularization is applied to the global weights matrix.
the max-pooling, average-pooling, and concatenation operations, respectively. The predicate and object representations are obtained similarly. The event embedding e(e) is a combination of its argument vectors. Formally:

e(e) = f([s(e); p(e); o(e)])

where f(·) is a non-linear function; we employ a dense layer followed by a tanh(·) activation here.
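The event representation above can be sketched as a small PyTorch module. This is a minimal reading of the description, not the authors' released code; the module name and the exact pooling/concatenation order are our assumptions.

```python
import torch
import torch.nn as nn

class EventEmbedding(nn.Module):
    """Sketch of the event representation: max + avg pooling per argument,
    concatenation of subject/predicate/object vectors, then a non-linear
    projection (dense layer followed by tanh)."""

    def __init__(self, d_e: int, d_h: int):
        super().__init__()
        # [s(e); p(e); o(e)] has size 3 * 2*d_e after pooling + concatenation
        self.proj = nn.Linear(3 * 2 * d_e, d_h)

    @staticmethod
    def pool(arg_words: torch.Tensor) -> torch.Tensor:
        # arg_words: (n_arg, d_e) word embeddings of one argument
        return torch.cat([arg_words.max(dim=0).values,
                          arg_words.mean(dim=0)], dim=-1)  # (2*d_e,)

    def forward(self, subj, pred, obj):
        parts = [self.pool(subj), self.pool(pred), self.pool(obj)]
        return torch.tanh(self.proj(torch.cat(parts, dim=-1)))  # (d_h,)
```

Each argument may have a different number of words; the pooling collapses that dimension, so all events end up with a fixed-size embedding.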

Event Circumstance Representation
This section details three methods to get the event circumstance representation c(e).
Local As described in Section 1, the event circumstances can be extracted from the sentence that contains the event. We first adopt a bidirectional recurrent neural network (BRNN) (Schuster and Paliwal, 1997) to obtain contextualized sentence hidden states, and then use multi-head attention (Vaswani et al., 2017) to aggregate the sentence hidden states guided by the corresponding event representation.
Suppose we have a sentence s = [s_1, ..., s_{n_s}] of n_s tokens, represented as a word embedding sequence, where s_i is the word embedding vector of the i-th token. To equip each word with context information, we use a bidirectional recurrent neural network, instantiated as an LSTM (Sak et al., 2014), to encode the word embeddings into hidden states:

h_i = BiLSTM(s_i, h_{i-1})

We stack all the output hidden states to obtain the hidden state matrix H = [h_1; ...; h_{n_s}] ∈ R^{n_s×d_h}. Inspired by Vaswani et al. (2017), we use the event embedding to query circumstances from the contextualized sentence hidden states. The multi-head attention follows:

MultiHead(Q, K, V) = [head_1; ...; head_{n_h}] W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (1)

where Q ∈ R^{n_q×d_h} is the query sequence, K, V ∈ R^{n_l×d_h} are the keys and values, and W^O ∈ R^{n_h·d_h×d_h} together with the per-head projections W_i^Q, W_i^K, W_i^V are learnable parameters; n_q refers to the query sequence length and n_l to the context sequence length. The local event circumstance representation follows:

c_l(e) = MultiHead(e(e), H, H)

where H is the hidden state matrix of the corresponding event sentence. The local circumstance is therefore highly related to the event, and this local circumstance embedding can be used directly as the final circumstance embedding.
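The local circumstance retrieval can be sketched with PyTorch's built-in BiLSTM and multi-head attention. This is an illustrative approximation: `nn.MultiheadAttention` internally parameterizes the per-head projections and averages the returned weights over heads, which may differ in detail from the paper's formulation.

```python
import torch
import torch.nn as nn

class LocalCircumstance(nn.Module):
    """Sketch: BiLSTM over the event's sentence, then multi-head attention
    with the event embedding as the (length-1) query."""

    def __init__(self, d_e: int, d_h: int, n_heads: int = 4):
        super().__init__()
        # Two directions of size d_h // 2 concatenate to hidden size d_h.
        self.bilstm = nn.LSTM(d_e, d_h // 2, bidirectional=True,
                              batch_first=True)
        self.attn = nn.MultiheadAttention(d_h, n_heads, batch_first=True)

    def forward(self, event_emb, sent_words):
        # sent_words: (batch, n_s, d_e); event_emb: (batch, d_h)
        H, _ = self.bilstm(sent_words)            # (batch, n_s, d_h)
        q = event_emb.unsqueeze(1)                # (batch, 1, d_h) query
        c_l, weights = self.attn(q, H, H)         # query circumstances from H
        return c_l.squeeze(1), weights.squeeze(1) # (batch, d_h), (batch, n_s)
```

The returned attention weights over sentence tokens are what Figure 5 visualizes.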
Global Rather than being limited to the local circumstances, we argue that the circumstances of context events also contribute to the event. We adopt another multi-head attention module to obtain the global circumstances from the local circumstances.
The global circumstance computation follows Eq. 1 but with different Q, K, V. The global circumstance embedding follows:

C_g = [c_g(e_1), ..., c_g(e_n)] = MultiHead(E, C_l, C_l)

where E ∈ R^{n×d_h} is the sequence of context event embeddings and C_l = [c_l(e_1), c_l(e_2), ..., c_l(e_n)] ∈ R^{n×d_h} is the sequence of their local circumstances. Under the global circumstance setting, the contribution weights of the context circumstances are obtained via the attention mechanism. However, collecting information from complicated context events with an attention mechanism alone is not reliable and ignores the alignment between events and sentences. Based on this observation, we equip the global circumstance with a regularization.
Global + Regularization We observe that each event belongs to a specific sentence, and a sentence may contain multiple events. This alignment information can be formulated as a binary matrix Y: if the consecutive events i to j belong to the same sentence, the sub-matrix Y_{i:j, i:j} is set to 1. Figure 6c gives a visual example, in which events 2, 3 and events 4, 5 belong to the same sentences, respectively. We apply a regularization to the attention heads, leading each event to aggregate more local circumstances from homologous events, i.e., events extracted from the same sentence. The regularization follows:

A = (1/n_h) Σ_{k=1}^{n_h} A^{(k)}
L_reg = ||A − Y||^2

where A^{(k)} is the attention weight matrix of the k-th head, A_{i,j} is the average attention weight of the i-th event on the j-th sentence, and n_h is the number of heads in the multi-head attention.
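One plausible reading of this regularizer, assuming a squared-error penalty between the head-averaged attention matrix A and the alignment target Y (the exact penalty form is our assumption), can be sketched as:

```python
import numpy as np

def alignment_target(sentence_ids):
    """Binary alignment matrix Y: Y[i, j] = 1 iff events i and j come
    from the same sentence (so diagonal blocks of same-sentence events
    are filled with ones)."""
    ids = np.asarray(sentence_ids)
    return (ids[:, None] == ids[None, :]).astype(float)

def attention_regularizer(head_weights, Y):
    """Squared-error penalty between the head-averaged global attention
    matrix A and the target Y.
    head_weights: (n_h, n, n) per-head global attention matrices."""
    A = head_weights.mean(axis=0)  # average over the n_h heads
    return float(((A - Y) ** 2).sum())
```

For the example in Figure 6c, `alignment_target([0, 1, 1, 2, 2])` puts 1-blocks over events 2, 3 and events 4, 5 (0-indexed as 1, 2 and 3, 4).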

Event Chain Encoder
The encoder network structure follows the Transformer encoder proposed by Vaswani et al. (2017). In the encoder, the event embedding and the circumstance embedding are concatenated to form the encoder inputs x_i = [e(e_i); c(e_i)], where the circumstance embedding c(e) can be either c_l(e) or c_g(e). Formally:

C = TransformerEncoder([x_1, x_2, ..., x_n])

where x_i is the i-th composed embedding and C ∈ R^{n×d_h} is the representation of the event chain.
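The encoder step can be sketched directly with PyTorch's Transformer encoder. Note one assumption: since the paper reports C ∈ R^{n×d_h}, the concatenated inputs are presumably projected back to d_h somewhere; for simplicity this sketch keeps the concatenated width 2·d_h.

```python
import torch
import torch.nn as nn

d_h, n_heads, n_events = 128, 4, 8

# One encoder layer over the composed inputs x_i = [e(e_i); c(e_i)].
layer = nn.TransformerEncoderLayer(2 * d_h, n_heads, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)

e = torch.randn(2, n_events, d_h)  # context event embeddings
c = torch.randn(2, n_events, d_h)  # local or global circumstance embeddings
x = torch.cat([e, c], dim=-1)      # composed encoder inputs
C = encoder(x)                     # event chain representation
```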

Prediction Layer
We adopt the Transformer decoder (Vaswani et al., 2017) as our prediction layer. The candidate events are fed into the decoder as queries, and the decoder contrasts the candidate events against the event chain context. We adopt a dense layer to pool out the similarity scores of the candidate events:

s_i = w_o^T TransformerDecoder(e(c_i), C)

where c_i is the i-th candidate event, e(c_i) is its embedding, s_i is its similarity score, and w_o ∈ R^{d_h} is a learnable vector. There are five candidate events for each chain, and we apply softmax(·) to normalize the scores into the final probability of each choice:

P(e_{c_i} | e_1, e_2, ..., e_n) = exp(s_i) / Σ_j exp(s_j)

where e_{c_i} is the i-th candidate event. We select the candidate with the maximum probability as the predicted event.
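The prediction layer can be sketched as follows. This is an assumed configuration: feeding all five candidates as one decoder target sequence lets self-attention compare them implicitly, which is one way to realize the "implicit contrast" the paper describes.

```python
import torch
import torch.nn as nn

d_h, n_heads, n_cand = 128, 4, 5

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_h, n_heads, batch_first=True),
    num_layers=1)
w_o = nn.Linear(d_h, 1, bias=False)  # plays the role of the vector w_o

chain = torch.randn(1, 8, d_h)       # encoded event chain C (decoder memory)
cand = torch.randn(1, n_cand, d_h)   # five candidate event embeddings
h = decoder(cand, chain)             # candidates attend to chain and each other
scores = w_o(h).squeeze(-1)          # (1, 5) similarity scores s_i
probs = scores.softmax(dim=-1)       # normalized choice probabilities
pred = probs.argmax(dim=-1)          # index of the predicted candidate
```

Because no causal mask is applied to the target sequence, every candidate can attend to every other candidate in the decoder's self-attention.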

Training Objective
Our main training objective is to minimize the cross-entropy loss between the gold event and the predicted distribution:

L(Θ) = − log P(e_{c_g} | e_1, e_2, ..., e_n) + λ ||Θ||^2

where Θ denotes the model parameters, e_{c_g} refers to the ground-truth event, and λ is the L2 regularization factor on the model parameters. Moreover, under the Global + Reg setting, the penalty on the attention weights, L_reg, is also included with a factor α:

L(Θ) = − log P(e_{c_g} | e_1, e_2, ..., e_n) + λ ||Θ||^2 + α L_reg
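The combined objective can be sketched as a small loss function. One assumption here: the λ L2 term is realized through the optimizer's `weight_decay` (as is typical with Adam) rather than as an explicit loss term.

```python
import torch
import torch.nn.functional as F

def circ_event_loss(scores, gold_idx, reg_penalty=0.0, alpha=0.8):
    """Cross-entropy over the candidate scores plus the attention-weight
    penalty (Global + Reg setting). scores: (batch, 5) logits;
    gold_idx: (batch,) index of the ground-truth candidate."""
    ce = F.cross_entropy(scores, gold_idx)  # -log P(e_cg | e_1..e_n)
    return ce + alpha * reg_penalty
```

With `alpha=0.8` as reported in the experiment configuration, a perfectly confident correct prediction and zero penalty would drive this loss toward zero.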

Experiment
In this section, we describe the dataset and the preprocessing pipeline. We evaluate our model on the MCNC task and report the accuracy score.

Dataset
Following Lee and Goldwasser (2019), we extract events from the NYT portion of the Gigaword corpus (Graff and Cieri, 2003). We use Stanford CoreNLP for POS tagging, dependency parsing, and coreference resolution; the extraction pipeline is detailed below. The event chains are split into train, development, and test sets based on the document split provided by Granroth-Wilding and Clark (2016). Detailed dataset statistics are shown in Table 1.

Event Chain Extraction
In this paper, we describe an event as a triplet (pred, subj, obj), denoting the predicate (verb), subject, and object, respectively. We first use the POS tagger, dependency parser, and coreference resolver in Stanford CoreNLP to annotate the raw corpus. Events are extracted by following entities' coreference chains: for each mention in a coreference chain, we retrieve its predicate, subject, and object from the dependency parse tree. We constrain the length of each event argument, i.e., subject, object, and predicate, to n_arg = 15 words. For the sake of compatibility, we use a special token UNK for missing arguments. Take "Peter finds Jenny is waiting for him. He walks into the restaurant." as an example: after the event extraction pipeline, one event chain will contain walk (Peter, restaurant) and another will contain wait (Jenny, Peter).
Candidate Event Generation For each ground-truth event, we follow Lee and Goldwasser (2019) to generate distractor events. We first collect all events to construct a large event pool. A distractor event is randomly sampled from the pool, and then one of its arguments is randomly replaced with the corresponding argument of the ground-truth event. The ground truth and four distractor events are combined and shuffled to serve as the candidate events.
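The candidate generation step can be sketched in a few lines of plain Python. The function name and the uniform choice of which argument slot to replace are our assumptions; the source only specifies that one argument of each sampled distractor is replaced by the ground truth's.

```python
import random

def make_candidates(gold, event_pool, n_distractors=4, seed=0):
    """Sample distractor events from the pool, overwrite one random
    argument of each with the gold event's corresponding argument,
    then shuffle the gold event in among them."""
    rng = random.Random(seed)
    cands = []
    for _ in range(n_distractors):
        event = list(rng.choice(event_pool))  # (pred, subj, obj) triplet
        slot = rng.randrange(3)               # which argument to replace
        event[slot] = gold[slot]
        cands.append(tuple(event))
    cands.append(gold)
    rng.shuffle(cands)
    return cands
```

By construction, every candidate, including each distractor, shares at least one argument with the ground-truth event, which is what makes the choices distractive rather than trivially separable.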

Baselines
We compare our model with the following baselines:
• Event-Comp (Granroth-Wilding and Clark, 2016) is a neural network based on intra-event relationships.
• SGNN (Li et al., 2018) incorporates inter-event information by constructing a narrative event evolutionary graph (NEEG), which describes event evolution patterns.
• SAM-Net (Lv et al., 2019) is an attention-based model that captures event segments implicitly and models the candidate events at the event level and the chain level.
• EventTransE (Lee and Goldwasser, 2019) is a representation learning method that explores discourse relations among events.
• HeterEvent[W+E] (Zheng et al., 2020b) is a representation learning method that adopts a heterogeneous event graph to capture discontinuous event segments explicitly.
• UniFA-S (Zheng et al., 2020a) is a representation learning method based on the variational auto-encoder, which fine-tunes the pre-trained BERT (Devlin et al., 2019) on the NYT corpus and the event chains in multiple steps.
• SAM-Net_Our is our extension of SAM-Net that handles the full event argument words rather than only the headword and removes prepositions from the event arguments for comparability.

Experiment Configuration
Pre-trained GloVe embeddings (Pennington et al., 2014) are used for word embedding, with dimension d_e set to 100. The input sentence length n_s is truncated to 60. For the transformer, the number of attention heads n_h is set to 4, and the numbers of encoder and decoder layers are set to 1. We set the batch size to 128 and use Adam (Kingma and Ba, 2015) to optimize the model parameters.
The learning rate is set to 1e-4, λ to 1e-5, and α to 0.8. The model size d_h is set to 128. All hyper-parameters are searched on the validation set, and training employs an early-stopping strategy based on validation accuracy.

Results
Table 2 reports the performance of the proposed CircEvent model and the baseline models on the MCNC task over the NYT portion of the Gigaword corpus. CircEvent shows outstanding performance and achieves the best accuracy score on the MCNC task. We first zoom in on the comparison among baselines. SAM-Net and our CircEvent are both supervised learning methods for narrative event prediction.

However, the original SAM-Net accepts only the headwords of event arguments, which leads to corrupted events. We therefore re-implement it with the same representation layer used in CircEvent, denoted SAM-Net_Our, which performs the best among existing models. Next, we compare CircEvent with the baselines. CircEvent achieves the best performance on the MCNC task, a clear 12.2% improvement over the best baseline. We discuss the contribution of each part further in Sec 4.5.

Ablation Study
In this part, we perform an ablation study to demonstrate the effectiveness of our network architecture. We divide our model into three parts: Event, Local, and Global, which refer to the event embedding, the local circumstance embedding, and the global circumstance embedding, respectively. We conduct ablation studies based on these parts; the results are shown in Table 3. Our ablation studies cover the following two aspects.

Influence of Additional Circumstances We first study the influence of the additional local circumstances by comparing Event + Local against Event. In the Event + Local experiment, the event embedding is concatenated with the circumstance embedding; in the Event experiment, the event embedding is duplicated and concatenated. The results show that the local circumstances improve the accuracy score by 2.96%, demonstrating that they contain valuable information about the next event.
Similarly, comparing Event + Global with Event demonstrates that the additional global circumstances also contribute to narrative event prediction: with the global circumstance embedding, the accuracy score increases by 2.24%. Comparing the local and global circumstances, the local ones benefit accuracy more. The attention matrix of the global circumstances shows that all context events rely most on the last circumstance sentence, because it is the closest one to the target event. We discuss this further in Sec 4.6.

Influence of Independent Circumstances
The experiments above include the event embedding. In this part, we evaluate the quality of the circumstance embeddings independently: we remove the event embedding from the event chain encoder's input and use only the local or global circumstances to represent the event chain. The results are shown as Local and Global in Table 3. In these two experiments, the event embedding serves only as distant supervision that aggregates the sentence hidden states and the local circumstances. From the results, we conclude that both the local and global circumstances contain valuable information: even with only this distant event embedding information, the Global and Local results outperform those of Event.

Qualitative Analysis
In this section, we provide a qualitative analysis of the local and global circumstances. Figure 5 visualizes the local circumstance attention weights; each row pairs an event with the sentence that contains it. The context events describe men who were judged for illegally harvesting abalone. All of the last four events attend to the word abalone, an important topic in the context that does not appear in the events themselves. However, in the first sentence, abalone is concatenated with May, creating an out-of-vocabulary token; we attribute this attention loss to incorrect sentence tokenization. The homologous events admit (men, _) and harvest (men, abalone), which come from the same sentence, pay attention to different parts of that sentence: the harvest event attends not only to the predicate harvest itself but also to the coordinate verb planned. For the last two homologous events, ban (judge, men) and send (judge, men), the attention weights are remarkably similar; besides the predicates themselves, they also attend to semantically rich words such as largest, illegal, and the topic abalone. Since we use a linear combination to construct the event embedding, we suspect a lack of expressiveness in the event representation leads to these similar heat maps.

Figure 6: The global attention weight matrices for the example in Figure 5. The deeper the shade of red, the higher the attention weight. Figures 6a and 6b show the attention weights without and with the regularization, respectively; Figure 6c formulates the alignment between events and circumstances.
We also visualize the global attention weight matrices in Figure 6. Without our regularization, the attention weights lean toward the last circumstance; thus, the Global and Event + Global results are worse than the Local and Event + Local results, respectively. With our regularization, the global attention weights align with the target matrix, which describes the alignment between the events and the circumstances. The leading diagonal elements have the most prominent weight in each row of the regularized weight matrix, meaning all events attend mainly to their own local circumstances. Meanwhile, events also aggregate information from the context circumstances, especially from homologous events: the homologous events pay more attention to each other than to other events, such as events 2, 3 and events 4, 5 in Figure 6b. These results confirm our intuition that circumstances have a significant influence on narrative event prediction.

Conclusion
This paper develops multi-head attention modules to capture circumstances from event text at the local and global levels, and utilizes the transformer architecture to encode context events and circumstances. The standard evaluation shows that our model achieves the best accuracy score compared with the baselines, and the visual analysis of the attention heat maps shows the effectiveness of circumstances.