From Discourse to Narrative: Knowledge Projection for Event Relation Extraction

Current event-centric knowledge graphs (EventKGs) rely heavily on explicit connectives to mine relations between events. Unfortunately, because connectives are sparse, these methods severely limit the coverage of EventKGs. The lack of high-quality labelled corpora further exacerbates the problem. In this paper, we propose a knowledge projection paradigm for event relation extraction: projecting discourse knowledge to narratives by exploiting the commonalities between them. Specifically, we propose the Multi-tier Knowledge Projection Network (MKPNet), which can leverage multi-tier discourse knowledge effectively for event relation extraction. In this way, the labelled data requirement is significantly reduced, and implicit event relations can be effectively extracted. Intrinsic experimental results show that MKPNet achieves new state-of-the-art performance, and extrinsic experimental results verify the value of the extracted event relations.


Introduction
Event-centric knowledge graphs (EventKGs) model the narratives of the world by representing events and identifying relations between them, which are critical for machine understanding and can benefit many downstream tasks, such as question answering (Costa et al., 2020), news reading (Vossen, 2018), commonsense knowledge acquisition (Zhang et al., 2020a) and so on.
Recently, the semi-automatic construction of EventKGs has gained much attention (Tandon et al., 2015; Rospocher et al., 2016; Gottschalk and Demidova, 2018; Zhang et al., 2020b). These methods extract event knowledge from massive raw corpora with little or no human intervention, which makes them scalable solutions for building large-scale EventKGs. Commonly, each node in an EventKG represents an event, and each edge represents a predefined relation between an event pair. Currently, event relations are mostly extracted based on the explicit connectives between them. For example, in Figure 1, a Reason relation is extracted between $E_2$: "PER orders two hamburgers" and $E_3$: "PER is so hungry" using the explicit connective "because" between them.

Figure 1: The knowledge projection paradigm for event relation extraction. The explicit projection directly projects connectives to event relations, e.g., from "because" to Reason. The implicit projection leverages the discourse knowledge to discover implicit event relations without connectives via MKPNet.
Unfortunately, connective-based approaches face a critical coverage problem due to the sparsity of connectives. That is, a large proportion of event pairs are not linked by explicit connectives but still hold underlying event relations. We denote these as implicit event relations. Furthermore, related events may not even be close to each other in a document. In the example in Figure 1, the implicit relation Reason between $E_1$: "PER goes to the restaurant" and $E_3$: "PER is so hungry" cannot be extracted, both because there is no explicit connective and because the two clauses are not adjacent. The common practice in previous connective-based approaches is to ignore all these implicit instances (Zhang et al., 2020b). As a result, the coverage of EventKGs is significantly undermined. Besides, because the scale of existing event relation corpora is limited, it is also impractical to build effective event relation classifiers via supervised learning.
In this paper, we propose a new paradigm for event relation extraction: knowledge projection. Instead of relying on sparse connectives or building classifiers from scratch, we project discourse knowledge to event narratives by exploiting the anthropological linguistic connections between them. Inspired by Livholts and Tamboukou (2015), Altshuler (2016) and Reyes and Wortham (2017), we observe that discourses and narratives are closely associated, and that their knowledge is shared at different levels: 1) token-level knowledge: discourses and narratives share similar lexical and syntactic structures; 2) semantic-level knowledge: the semantics entailed in discourse pairs and event pairs are analogous, e.g., $E_3$ -Reason→ $E_1$ and $D_3$ -Cause→ $D_1$ in Figure 1; and 3) label-level knowledge: heterogeneous event and discourse relations have the same coarse categories, e.g., both the event relation Reason and the discourse relation Cause are included in the coarse-grained relation Contingency. By exploiting the rich knowledge in manually labelled discourse corpora and projecting it into event relation extraction models, the performance of event relation extraction can be significantly improved, and the data requirement can be dramatically reduced.
Specifically, we design the Multi-tier Knowledge Projection Network (MKPNet), which can leverage multi-tier discourse knowledge effectively for event relation extraction. MKPNet introduces three kinds of adaptors to project knowledge from discourses into narratives: (a) a token adaptor for token-level knowledge projection; (b) a semantic adaptor for semantic-level knowledge projection; (c) a coarse category adaptor for label-level knowledge projection. By sharing the parameters of these three adaptors, the commonalities between discourses and narratives at various levels can be effectively explored. Therefore, we can obtain more general token representations, more accurate semantic representations, and more credible coarse category representations to better predict event relations.
We conduct intrinsic experiments on ASER (Zhang et al., 2020b), one of the representative EventKGs, and extrinsic experiments on the Winograd Schema Challenge (WSC) (Levesque et al., 2012), one of the representative natural language understanding benchmarks. Intrinsic experimental results show that the proposed MKPNet significantly outperforms the state-of-the-art (SotA) baselines, and extrinsic experimental results verify the value of the extracted event relations.
The main contributions of this paper are:
• We propose a new knowledge projection paradigm, which can effectively leverage the commonalities between discourses and narratives for event relation extraction.
• We design MKPNet, which can effectively leverage multi-tier discourse knowledge for event relation extraction via a token adaptor, a semantic adaptor and a coarse category adaptor.
• Our method achieves new SotA event relation extraction performance, and an enriched EventKG is released by extracting both explicit and implicit event relations. We believe it can benefit many downstream NLP tasks.

Figure 2: An overview of MKPNet, which projects discourse knowledge for event relation extraction: (a) token adaptor for token-level knowledge projection, (b) semantic adaptor for semantic-level knowledge projection, and (c) coarse category adaptor for label-level knowledge projection.
Associations between Discourse and Narrative. Recent NLP studies have shown that discourse and narratives closely interact with each other, and that leveraging discourse knowledge significantly benefits narrative analysis tasks such as subevent detection (Aldawsari and Finlayson, 2019) and main event relevant identification (Choubey et al., 2020). Motivated by these observations, this paper leverages discourse knowledge through a knowledge projection paradigm. Thanks to the associations at the token, semantic and coarse category levels, discourse corpora and knowledge can be effectively exploited for event relation extraction.

Multi-tier Knowledge Projection Network for Event Relation Extraction
In this section, we describe how to learn an effective event relation extractor by projecting resource-rich discourse knowledge to the resource-poor narrative task. Specifically, we propose the Multi-tier Knowledge Projection Network (MKPNet), which can effectively leverage multi-tier discourse knowledge for implicit event relation extraction. Figure 2 shows an overview of MKPNet, which uses a token adaptor, a semantic adaptor and a coarse category adaptor to fully exploit discourse knowledge at different levels. In the following, we first describe the neural architecture of MKPNet and then describe the details of the three adaptors.

Neural Architecture of MKPNet
For knowledge projection, we model both event relation extraction (ERE) and discourse relation recognition (DRR) as an instance-pair classification task (Devlin et al., 2019; Kishimoto et al., 2020). For ERE, the input is an event pair such as <$E_1$: "PER goes to the restaurant", $E_3$: "PER is so hungry"> and the output is an event relation such as Reason. For DRR, the input is a clause pair such as <$D_1$: "Tom goes to the restaurant", $D_3$: "he is so hungry"> and the output is a discourse relation such as Cause.
Specifically, MKPNet extends the SotA DRR model, BERT-CLS (Kishimoto et al., 2020), with a VAE-based semantic encoder and a coarse category encoder to model knowledge tier by tier (Pan et al., 2016; Guo et al., 2019; Kang et al., 2020; Li et al., 2020b). It 1) first uses the BERT-based token encoder to encode an instance pair as a token representation $h_{[CLS]}$; 2) then obtains the semantic representation $h_z$ via a VAE-based semantic encoder; 3) predicts the coarse-grained label and embeds it as the coarse category representation $h_{Y^c}$; 4) finally classifies its relation with the guidance of the aggregate instance-pair representation:

$$h_{[CLS]} \oplus h_z \oplus h_{Y^c}$$

where $\oplus$ means the concatenation operation. In this way, the parameters of MKPNet can be grouped as $\{\theta_{BERT}, \theta_{Semantic}, \theta_{Coarse}, \theta_{Fine}\}$, where $\theta_{BERT}$ belongs to the BERT-based token encoder, $\theta_{Semantic}$ to the VAE-based semantic encoder, $\theta_{Coarse}$ to the coarse category encoder and $\theta_{Fine}$ to the final relation classifier layer.
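As a rough illustration of the final classification step, the sketch below concatenates three toy vectors and passes the result through a linear layer with softmax. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the vector sizes, the weight values and the helper names (`aggregate`, `softmax`, `classify`) are all hypothetical.

```python
import math
from typing import List


def aggregate(h_cls: List[float], h_z: List[float], h_yc: List[float]) -> List[float]:
    """Concatenate the token, semantic and coarse category representations."""
    return h_cls + h_z + h_yc


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Linear layer followed by softmax; one weight row per fine-grained label."""
    scores = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]
    return softmax(scores)


# Toy sizes (assumed): 4-dim token vector, 2-dim semantic vector, 3-dim label embedding.
h_cls = [0.1, -0.2, 0.3, 0.0]
h_z = [0.5, 0.5]
h_yc = [1.0, 0.0, 0.0]
h = aggregate(h_cls, h_z, h_yc)

# Two hypothetical fine-grained labels; the weights are arbitrary placeholders.
weights = [[0.1] * len(h), [-0.1] * len(h)]
bias = [0.0, 0.0]
probs = classify(h, weights, bias)
print(len(h), probs)
```

In a trained model the three input vectors would come from the token encoder, the semantic encoder and the coarse label embedding network respectively, and the classifier weights would be the task-specific fine-grained parameters.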

Token Adaptor
Recent studies have shown that similar tasks usually share similar lexical and syntactic structures and therefore lead to similar token representations (Pennington et al., 2014; Peters et al., 2018). The token adaptor tries to improve the token encoding for ERE by sharing the parameters $\theta_{BERT}$ of the BERT-based encoder with DRR. In this way, the encoder becomes more effective because it receives more supervision signals, and more general because it is trained in a multi-task setting.
Specifically, given an event pair <$E_1$, $E_2$>, we represent it as a sequence:

$$[CLS],\ E_1,\ [SEP],\ E_2,\ [SEP]$$

where [CLS] and [SEP] are special tokens. For each token in the input, its representation is constructed by summing the corresponding token, segment and position embeddings. Then, the event pair representation is fed into the BERT architecture (Devlin et al., 2019) and updated by multi-layer Transformer blocks (Vaswani et al., 2017). Finally, we take the hidden state corresponding to the special [CLS] token in the last layer as the token-level event pair representation $h^e_{[CLS]}$. The token-level discourse pair representation $h^d_{[CLS]}$ can be obtained in the same way for DRR. To project the token-level knowledge, we use the same BERT for event pair and discourse pair encoding. During the optimization process, it is fine-tuned using the supervision signals from both ERE and DRR.
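The input packing described above can be sketched as follows. This is a simplified illustration only: it splits on whitespace rather than using BERT's WordPiece tokenizer, and the function name `pack_pair` is hypothetical.

```python
from typing import List, Tuple


def pack_pair(first: str, second: str) -> Tuple[List[str], List[int], List[int]]:
    """Build a BERT-style pair input: tokens, segment ids and position ids."""
    # Assumption: whitespace splitting stands in for the real subword tokenizer.
    first_tokens = first.split()
    second_tokens = second.split()
    tokens = ["[CLS]"] + first_tokens + ["[SEP]"] + second_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the first event and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(first_tokens) + 2) + [1] * (len(second_tokens) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = pack_pair(
    "PER goes to the restaurant", "PER is so hungry"
)
print(tokens)
print(segment_ids)
```

The same packing applies to a discourse clause pair, which is what allows a single shared encoder to serve both ERE and DRR.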

Semantic Adaptor
Narrative and discourse analyses need to accurately represent the deeper semantics of the instance pairs, so the shallow token-level knowledge captured by the BERT-based token encoder is not enough. Moreover, BERT tends to induce a non-smooth anisotropic semantic space, which is adverse for the semantic modelling of large-grained linguistic units (Li et al., 2020a).
To address this issue, we introduce a variational autoencoder-based (VAE-based) semantic encoder to represent the semantics of both events and clauses by transforming the anisotropic semantic distribution into a smooth and isotropic Gaussian distribution (Kingma and Welling, 2014; Rezende et al., 2014; Sohn et al., 2015). To better learn the semantic encoder, the semantic adaptor shares its parameters $\theta_{Semantic}$ between ERE and DRR and trains it using both the classification supervision signals and the KL divergence.

Figure 3: The illustration of the semantic encoder as a directed graph. We use solid lines to denote the generative model $P$, and dashed lines to denote the variational approximation $Q = q(h_z \mid h_{[CLS]}, h_Y)$. Both variational parameters and generative parameters are learned jointly.
Specifically, the VAE is a directed graphical model with a generative model $P$ and a variational model $Q$, which learns the semantic representation $h_z$ of the input within an autoencoder framework. Figure 3 illustrates the graphical representation of the semantic encoder. We assume that there exists a continuous latent variable $h_z \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, where $\mu$ and $\sigma^2$ are the mean and variance of the Gaussian distribution respectively. With this assumption, the original conditional probability of the event/discourse relations can be expressed by marginalizing over $h_z$:

$$p(Y \mid h_{[CLS]}) = \int p(Y \mid h_z, h_{[CLS]})\, p(h_z \mid h_{[CLS]})\, dh_z$$

The posterior approximation is $q(h_z \mid h_{[CLS]}, h_Y)$, where $h_{[CLS]}$ can be $h^e_{[CLS]}$ or $h^d_{[CLS]}$ and $h_Y$ can be $h^e_Y$ or $h^d_Y$ according to the task. We 1) first obtain the input- and output-side representations via the shared BERT-based token encoder and the individual relation embedding networks, i.e., $h_{[CLS]}$ and $h_Y$; 2) then perform a non-linear transformation that projects them onto the semantic space, yielding an intermediate representation that we write as $h'_z$; 3) obtain the above-mentioned Gaussian parameters $\mu$ and $\log\sigma^2$ through linear regression:

$$\mu = W_\mu h'_z + b_\mu, \qquad \log\sigma^2 = W_\sigma h'_z + b_\sigma$$

where $W$ and $b$ are the parameter matrix and bias term respectively; 4) use the reparameterization trick (Kingma and Welling, 2014; Sohn et al., 2015) to get the final semantic representation:

$$h_z = \mu + \sigma \odot \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$ and $h_z$ can be $h^e_z$ or $h^d_z$. The neural model for the prior $p(h_z \mid h_{[CLS]})$ is the same as that for the posterior $q(h_z \mid h_{[CLS]}, h_Y)$, except for the absence of $h_Y$. Besides, the two models have parameters that are independent of each other.
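The linear regression and reparameterization steps can be illustrated with a small pure-Python sketch. The helper names (`linear`, `sample_latent`), the toy dimensions and all weight values are assumptions made for illustration; a real implementation would operate on tensors with learned parameters.

```python
import math
import random
from typing import List


def linear(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Affine map W x + b, with one weight row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sample_latent(mu: List[float], log_var: List[float], rng: random.Random) -> List[float]:
    """Reparameterization trick: h_z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


rng = random.Random(0)
h_prime = [0.2, -0.4, 0.6]  # toy output of the non-linear projection (assumed)

# Hypothetical parameters mapping a 3-dim input to a 2-dim latent space.
W_mu = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_mu = [0.0, 0.0]
W_sigma = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b_sigma = [-2.0, -2.0]

mu = linear(h_prime, W_mu, b_mu)
log_var = linear(h_prime, W_sigma, b_sigma)
h_z = sample_latent(mu, log_var, rng)
print(mu, log_var, h_z)
```

Because the noise enters only through `eps`, the sample stays differentiable with respect to `mu` and `log_var` in an autodiff framework, which is the reason the trick is used.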
During testing, because the output-side representation $h_Y$ is unavailable, we set $h_z$ to be the mean of $p(h_z \mid h_{[CLS]})$ (Zhang et al., 2016a), i.e., $\mu$. During training, we minimize the Kullback-Leibler divergence $KL(P \| Q)$ between the generative model $P$ and the inference model $Q$. Intuitively, the KL divergence connects these two models. To project the semantic-level knowledge, we use the same VAE for both event pairs and discourse pairs. Therefore, the commonalities between event semantics and discourse semantics can be captured more accurately.
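For two diagonal Gaussians the KL divergence has a closed form, which the following sketch computes. It is an illustrative helper (the name `kl_diag_gaussians` is hypothetical), and the order of the arguments determines the direction of the divergence, so it should be matched to whichever direction the training objective specifies.

```python
import math
from typing import List


def kl_diag_gaussians(
    mu_a: List[float], log_var_a: List[float],
    mu_b: List[float], log_var_b: List[float],
) -> float:
    """KL( N(mu_a, var_a) || N(mu_b, var_b) ) for diagonal Gaussians."""
    total = 0.0
    for ma, la, mb, lb in zip(mu_a, log_var_a, mu_b, log_var_b):
        # Per dimension: 0.5 * (log(var_b/var_a) + (var_a + (ma - mb)^2) / var_b - 1)
        total += 0.5 * (lb - la + (math.exp(la) + (ma - mb) ** 2) / math.exp(lb) - 1.0)
    return total


# Identical distributions have zero divergence.
print(kl_diag_gaussians([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))
# Unit-variance Gaussians whose means differ by 1 in each of two dimensions.
print(kl_diag_gaussians([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0]))
```

Parameterizing by log-variance, as in the semantic encoder, keeps the variance positive without any explicit constraint.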

Coarse Category Adaptor
The token adaptor and the semantic adaptor cover the knowledge on the input side well. In addition, we find that ERE and DRR share the same coarse-grained categories: Temporal, Contingency, Comparison and Expansion (Prasad et al., 2008; Zhang et al., 2020b), although they have different fine-grained categories.
To exploit this, we design the coarse category adaptor in a coarse-to-fine framework (Petrov, 2009) to bridge the gap between the heterogeneous fine-grained targets. We share the parameters $\theta_{Coarse}$ of the coarse-grained classifier and the coarse label embedding network to obtain more credible coarse category representations.
Specifically, we first use the token representation $h_{[CLS]}$ and the semantic representation $h_z$ to predict the coarse-grained label $Y^c \in \{\text{Temporal}, \text{Contingency}, \text{Comparison}, \text{Expansion}\}$. After that, we use the coarse label embedding network to obtain the corresponding coarse-grained label embedding $h_{Y^c}$, which is referred to as the coarse category representation.
To project the label-level knowledge, we use the same coarse-grained classifier and the same coarse label embedding network for ERE and DRR. During the optimization process, both event instances and discourse instances can be used to train this coarse category encoder, and the additional supervision signals make it more effective.
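The coarse-to-fine step can be sketched as a classifier over the four shared categories followed by an embedding lookup. Everything numeric here is a placeholder: the embedding vectors, the classifier weights and the function names (`predict_coarse`, `coarse_representation`) are assumptions for illustration, and only the four category names come from the text.

```python
from typing import Dict, List, Tuple

COARSE_LABELS = ["Temporal", "Contingency", "Comparison", "Expansion"]

# Hypothetical 2-dim label embeddings; in a trained model these would be learned.
COARSE_EMBEDDINGS: Dict[str, List[float]] = {
    "Temporal": [1.0, 0.0],
    "Contingency": [0.0, 1.0],
    "Comparison": [-1.0, 0.0],
    "Expansion": [0.0, -1.0],
}


def predict_coarse(features: List[float], weights: List[List[float]], bias: List[float]) -> str:
    """Score each coarse category with a linear layer and return the best one."""
    scores = [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return COARSE_LABELS[best]


def coarse_representation(
    h_cls: List[float], h_z: List[float], weights: List[List[float]], bias: List[float]
) -> Tuple[str, List[float]]:
    """Predict the coarse label from [h_cls ; h_z] and look up its embedding."""
    label = predict_coarse(h_cls + h_z, weights, bias)
    return label, COARSE_EMBEDDINGS[label]


# Toy inputs and weights chosen so that the second category scores highest.
h_cls = [1.0, 0.0]
h_z = [0.5]
weights = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0]
label, h_yc = coarse_representation(h_cls, h_z, weights, bias)
print(label, h_yc)
```

Since both tasks map onto the same four categories, one classifier and one embedding table can be trained on event and discourse instances alike.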

Full Model Training
In this paper, we use multi-task learning (Caruana, 1997) to implement the knowledge projection from discourse to narrative. The expectation is that the correlated tasks (ERE and DRR) help each other learn better by sharing the parameters of the three adaptors. Given the ERE and DRR training datasets, an alternating optimization approach (Dong et al., 2015) is used to optimize MKPNet. The training objective combines the fine-grained objective $L(\theta; Y)$, the coarse-grained objective $L(\theta; Y^c)$ and the KL divergence $KL(P \| Q)$ in the semantic encoder, where $Y$ can be $Y^{im}$ or $Y^d$ according to the task, and $\lambda$ and $\alpha$ are two hyperparameters that balance the terms. Note that in MKPNet, $\{\theta_{BERT}, \theta_{Semantic}, \theta_{Coarse}\}$ are the parameters of the BERT-based token encoder, the VAE-based semantic encoder and the coarse category encoder that are shared between ERE and DRR, while $\{\theta_{Fine}\}$ are the separate parameters of the fine-grained ERE and DRR classifiers.
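One plausible way to combine the three terms and to alternate between the two tasks is sketched below. Two points are assumptions rather than facts from the text: that $\lambda$ weights the coarse-grained objective and $\alpha$ weights the KL term, and that both objectives are cross-entropy losses. The function names are hypothetical.

```python
import math
from itertools import zip_longest
from typing import List, Sequence, Tuple


def cross_entropy(probs: Sequence[float], gold_index: int) -> float:
    """Negative log-likelihood of the gold label (assumed form of both objectives)."""
    return -math.log(probs[gold_index])


def total_loss(fine: float, coarse: float, kl: float, lam: float, alpha: float) -> float:
    """Weighted sum of the three terms.

    Assumption: lam scales the coarse-grained loss and alpha scales the KL term;
    the text does not say which hyperparameter belongs to which term.
    """
    return fine + lam * coarse + alpha * kl


def alternate(ere_batches: Sequence[str], drr_batches: Sequence[str]) -> List[Tuple[str, str]]:
    """Interleave ERE and DRR batches so the shared parameters see both tasks."""
    schedule = []
    for ere, drr in zip_longest(ere_batches, drr_batches):
        if ere is not None:
            schedule.append(("ERE", ere))  # would update shared + ERE-specific parameters
        if drr is not None:
            schedule.append(("DRR", drr))  # would update shared + DRR-specific parameters
    return schedule


loss = total_loss(fine=1.0, coarse=2.0, kl=4.0, lam=0.5, alpha=0.25)
schedule = alternate(["e1", "e2"], ["d1", "d2", "d3"])
print(loss, schedule)
```

In each step only the fine-grained classifier of the current task would receive gradients, while the token encoder, semantic encoder and coarse category encoder are updated by both tasks.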

Experiments
We conduct intrinsic experiments on ASER (Zhang et al., 2020b) to assess the effectiveness of the proposed MKPNet, and extrinsic experiments on WSC (Levesque et al., 2012) to verify the value of the extracted event relations.

Intrinsic Experiments
Datasets. For discourse relation recognition (DRR), we use PDTB 2.0 (Prasad et al., 2008) with the same splits as Ji and Eisenstein (2015): sections 2-20/0-1/21-22 for train/dev/test respectively. For event relation extraction (ERE), because there is no labelled training corpus, we construct a new dataset by removing the connectives of the explicit event relation instances in the ASER core version (footnote 3) and retaining at most 2,200 instances with the highest confidence scores for each category (footnote 4). In this way, we obtain 23,181/1,400/1,400 train/dev/test instances, which we denote as the implicit event relation extraction (IERE) dataset.
Implementation. We implement our model based on pytorch-transformers (Wolf et al., 2020). We use BERT-base and set all hyper-parameters using the default settings of the SotA DRR model (Kishimoto et al., 2020).
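The construction of the IERE dataset described above (drop the connective, keep the most confident instances per category) can be sketched as follows. The record fields (`event_1`, `event_2`, `connective`, `relation`, `confidence`) and the function name are hypothetical and do not reflect the actual ASER data format.

```python
from collections import defaultdict
from typing import Dict, List


def build_implicit_dataset(instances: List[Dict], max_per_category: int) -> List[Dict]:
    """Keep the highest-confidence instances per relation and drop the connective."""
    grouped = defaultdict(list)
    for inst in instances:
        grouped[inst["relation"]].append(inst)

    dataset = []
    for relation in sorted(grouped):
        ranked = sorted(grouped[relation], key=lambda i: i["confidence"], reverse=True)
        for inst in ranked[:max_per_category]:
            # The connective is omitted so the pair resembles an implicit instance.
            dataset.append(
                {"event_1": inst["event_1"], "event_2": inst["event_2"], "relation": relation}
            )
    return dataset


# Toy explicit instances (invented for illustration).
explicit = [
    {"event_1": "a", "event_2": "b", "connective": "because", "relation": "Reason", "confidence": 0.9},
    {"event_1": "c", "event_2": "d", "connective": "because", "relation": "Reason", "confidence": 0.5},
    {"event_1": "e", "event_2": "f", "connective": "since", "relation": "Reason", "confidence": 0.7},
    {"event_1": "g", "event_2": "h", "connective": "but", "relation": "Contrast", "confidence": 0.4},
]
implicit = build_implicit_dataset(explicit, max_per_category=2)
print(implicit)
```

With the paper's setting, `max_per_category` would be 2200; lowering the confidence cut-off enlarges the dataset at the cost of noisier labels.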
Baselines. For ERE, we compare the proposed MKPNet with the following baselines:
• Baselines w/o Discourse Knowledge are trained only on the IERE training set. We choose BERT-CLS as their representative due to its SotA performance.
• Baselines with Discourse Knowledge improve the learning of ERE via transfer learning (Pan and Yang, 2009; Pan et al., 2010) from discourse models, i.e., they first pre-train a parameter prior on PDTB 2.0 and then fine-tune it on IERE. We denote this baseline as BERT-Transfer.
For DRR, we compare the proposed MKPNet with the following baselines:
• Bai and Zhao (2018) is a deep neural network model augmented by text representations of variable granularity, such as the character, sentence and sentence pair levels.
• Kishimoto et al. (2020) is the SotA DRR model, BERT-CLS, which incorporates BERT with one additional output layer.

Overall Results
We can see that:
1. Based on MKPNet, we enrich the original ASER with abundant implicit event relations. Considering the computational complexity, we classify the event pairs co-occurring in the same document
Footnote 3: https://hkust-knowcomp.github.io/ASER
Footnote 4: A higher confidence score means a more credible instance.
2. The proposed MKPNet achieves SotA performance for ERE. MKPNet significantly outperforms BERT-Transfer and achieves 55.86 accuracy and 55.36 F1. MKPNet w/o KP also obtains considerable performance improvements compared with BERT-CLS. We believe this is because MKPNet fully explores the knowledge on different tiers, and modelling knowledge tier by tier is effective.
3. By projecting knowledge at the token level, semantic level and label level, all three adaptors are useful and complement each other. Compared with the full model MKPNet, its four variants show performance declines of different degrees. MKPNet outperforms MKPNet w/o CA by 0.72 accuracy and 0.94 F1, which indicates that our coarse category adaptor successfully bridges the gap between heterogeneous fine-grained targets. MKPNet outperforms MKPNet w/o SA by 0.57 accuracy and 0.44 F1, and we therefore believe that our latent semantic adaptor helps capture the semantic-level commonalities. Finally, there is a significant decline between MKPNet w/o KP and MKPNet w/o SA & CA, which means that the token adaptor is indispensable. The insight behind these observations is that the commonalities between discourses and narratives lie in a hierarchical structure, so projecting them at different levels is effective, and the three adaptors complement each other.
4. The commonalities between discourses and narratives are beneficial for both ERE and DRR. Compared with the baselines w/o discourse knowledge (BERT-CLS and MKPNet w/o KP), both the naive transfer method, BERT-Transfer, and our MKPNet achieve significant performance improvements: BERT-Transfer gains 1.29 accuracy and 1.20 F1 compared to BERT-CLS, and MKPNet gains 1.92 accuracy and 1.84 F1 compared to MKPNet w/o KP. Besides, for DRR, our method MKPNet also substantially outperforms the other baselines and its variant MKPNet w/o KP. These results verify the commonalities between discourse knowledge and narrative knowledge.

Detailed Analysis
Effects of Semantic-level Knowledge and Label-level Knowledge. In these experiments, we compare the performance of our models (MKPNet, MKPNet w/o CA and MKPNet w/o SA) with and without knowledge projection to determine the effects of semantic-level and label-level knowledge. The results indicate that (1) the improvements mainly come from the discourse knowledge rather than the neural architecture; (2) current knowledge projection can be further improved by exploiting more accurate discourse knowledge: MKPNet w/o SA*, which uses golden coarse categories, achieves striking performance (Acc 70.50; F1 70.32).
Tradeoff between Dataset Quality and Size. As described above, the IERE training dataset is constructed using the most confident instances in the ASER core version. We can construct a larger but lower-quality dataset by incorporating more instances with lower confidence, which raises a quality-size tradeoff. To analyze this tradeoff, we construct a set of datasets with different sizes and qualities, and Figure 4 shows the corresponding results of MKPNet on the development set. We can see that size is the main factor for performance improvements at the beginning: every 5,000 additional instances result in a significant improvement (about 2 to 3 F1). When the dataset is large (more than 20,000 instances in our experiments), additional instances no longer improve performance, and the low-quality instances hurt it.

Extrinsic Experiments
The above intrinsic experiments verified the effectiveness of the proposed MKPNet for ERE. In this section, we use the core version of our enriched EventKG, ASER++, and conduct extrinsic experiments on the Winograd Schema Challenge (WSC) (Levesque et al., 2012) to verify the effect of ASER++.
WSC Implementation. WSC is challenging because each schema is a pair of sentences that differ in only one or two words and that contain a referential ambiguity that is resolved in opposite directions in the two sentences. According to Certu et al. (2019), fine-tuning pre-trained language models on WSC-schema style training sets is a robust method to tackle WSC. Therefore, as Figure 5 shows, we transform ASER++ to WSC-schema style training data in the same way as Zhang et al. (2020b) and fine-tune BERT on it, which we refer to as BERT (ASER++). We compare BERT (ASER++) with these baselines:
• Pure Knowledge-based Methods are heuristic rule-based methods, such as Knowledge Hunting (Emami et al., 2018) and String Match (Zhang et al., 2020b).
• Language Model-based Methods use language models trained on large-scale corpora and tuned specifically for the WSC task, such as LM (Trinh and Le, 2018).
• External Knowledge Enhanced Methods are models based on BERT and trained with different external knowledge resources, e.g., WscR (Ng, 2012; Certu et al., 2019).
We implement our model based on pytorch-transformers (Wolf et al., 2020).
Extrinsic Results. Table 5 shows the overall results of the extrinsic experiments. We can see that by fine-tuning BERT on our enriched EventKG, ASER++, the WSC performance can be significantly improved. BERT (ASER++) and BERT (ASER++ & WscR) outperform BERT (ASER) and BERT (ASER & WscR) respectively, which verifies the effectiveness of ASER++ and shows that implicit event relations are beneficial for downstream NLU tasks.

Related Work
Event-centric Knowledge Graphs. Knowledge graphs have evolved from entity-centric ones (Banko et al., 2007; Suchanek et al., 2007; Bollacker et al., 2008; Wu et al., 2012) to event-centric ones. However, the construction of traditional KGs takes domain experts much effort and time, so such KGs, e.g., FrameNet (Baker et al., 1998), are often limited in size and cannot effectively support real-world applications. Recently, many modern and large-scale KGs have been built semi-automatically, focusing on events (Tandon et al., 2015; Rospocher et al., 2016; Gottschalk and Demidova, 2018; Zhang et al., 2020b) and commonsense (Speer et al., 2017; Smith et al., 2018; Huang et al., 2018; Sap et al., 2019). Specifically, Yu et al. (2020) propose an approach to extract entailment relations between eventualities, e.g., "I eat an apple" entails "I eat fruit", and release an event entailment graph (EEG). Different from EEG, this paper focuses on implicit event relations, which are otherwise not extracted due to the absence of connectives and the discontinuity between events.
Knowledge Transfer. Due to the data scarcity problem, many knowledge transfer approaches have been proposed, including multi-task learning (Caruana, 1997), transfer learning (Pan and Yang, 2009; Pan et al., 2010), and knowledge distillation (Hinton et al., 2014). Recently, researchers have become interested in training, sharing, transferring and distilling models layer by layer to fully exploit the knowledge (Pan et al., 2016; Guo et al., 2019; Kang et al., 2020; Li et al., 2020b). In this paper, we propose a knowledge projection method which can project discourse knowledge to narratives on different tiers.

Conclusions
In this paper, we propose a knowledge projection paradigm for event relation extraction and design the Multi-tier Knowledge Projection Network (MKPNet) to leverage multi-tier discourse knowledge. By effectively projecting knowledge from discourses to narratives, MKPNet achieves new state-of-the-art event relation extraction performance, and extrinsic experimental results verify the value of the extracted event relations. For future work, we want to design new data-efficient algorithms to learn effective models using low-quality and heterogeneous knowledge.