Exploiting Document Structures and Cluster Consistencies for Event Coreference Resolution

We study the problem of event coreference resolution (ECR) that seeks to group coreferent event mentions into the same clusters. Deep learning methods have recently been applied to this task to deliver state-of-the-art performance. However, existing deep learning models for ECR are limited in that they cannot exploit important interactions between relevant objects for ECR, e.g., context words and entity mentions, to support the encoding of document-level context. In addition, consistency constraints between golden and predicted clusters of event mentions have not been considered to improve representation learning in prior deep learning models for ECR. This work addresses such limitations by introducing a novel deep learning model for ECR. At the core of our model are document structures to explicitly capture relevant objects for ECR. Our document structures introduce diverse knowledge sources (discourse, syntax, semantics) to compute edges/interactions between structure nodes for document-level representation learning. We also present novel regularization techniques based on consistencies of golden and predicted clusters for event mentions in documents. Extensive experiments show that our model achieves state-of-the-art performance on two benchmark datasets.


Introduction
Event coreference resolution (ECR) is the task of clustering event mentions (i.e., trigger words that evoke an event) in a document such that each cluster represents a unique real-world event. For example, the three event mentions in Figure 1, i.e., "refuse to sign", "raised objections", and "doesn't sign", should be grouped into the same cluster to indicate their coreference to the same event.
A common component in prior ECR models involves a binary classifier that receives a pair of event mentions and predicts their coreference (Lu et al., 2016; Lu and Ng, 2017). To this end, an important step in ECR models is to transform event mention pairs into representation vectors that encode discriminative features for coreference prediction. Early work on ECR has achieved feature representation via feature engineering where multiple features are hand-designed for input event mention pairs (Lu and Ng, 2017). A major problem with feature engineering is the sparsity of the features that limits the generalization to unseen data. Representation learning in deep learning models has recently been introduced to address this issue, leading to more robust methods with better performance for ECR (Nguyen et al., 2016; Choubey and Huang, 2018; Huang et al., 2019; Barhom et al., 2019). However, there are at least two limitations in existing deep learning models for ECR that will be addressed in this work to improve the performance.
First, as event mention pairs for coreference prediction might belong to long-distance sentences in documents, capturing document-level context between the event mentions (i.e., beyond the two sentences that host the event mentions) might present useful information for ECR. As their first limitation, prior deep learning models for ECR have only attempted to encode document-level context via hand-designed features (Kenyon-Dean et al., 2018; Barhom et al., 2019) that still suffer from the feature sparsity issue. In addition, such prior work is unable to exploit ECR-related objects in documents (e.g., entity mentions, context words) and their connections/interactions (possibly beyond sentence boundaries) to aid representation learning. An example of the importance of context words, entity mentions, and their interactions for ECR can be seen in Figure 1.

Figure 1: An example document with coreferential event mentions and coreferential entity mentions: "Donald Trump continued to refuse to sign a relief package agreed in Congress and headed instead to the golf course…. Trump, who is spending the Christmas and New Year holiday at his Mar-a-Lago resort in Florida, raised objections to the $900bn relief bill only after it was passed by Congress last week, having been negotiated by his own treasury secretary Steven Mnuchin… All these folks and their families will suffer if Trump doesn't sign the damn bill."

Here, to decisively determine the coreference of "raised objections" and "doesn't sign", ECR systems should recognize "Trump" and "the $900bn relief bill" as the arguments of "raised objections", and "Trump" and "the damn bill" as the arguments of "doesn't sign". The systems should also be able to realize the coreference relation between the two entity mentions "Trump", and between "the $900bn relief bill" and "the damn bill", to conclude the same identity for the event mentions (i.e., as they involve the same arguments). As such, it is helpful to identify relevant entity mentions and context words and leverage their relations/interactions to improve representation vectors for event mentions in ECR. Motivated by this issue, we propose to form graphs for documents (called document structures) to explicitly capture relevant objects and interactions for ECR that will be consumed to learn representation vectors for event mentions. In particular, context words, entity mentions, and event mentions will serve as the nodes in our document structures due to their intuitive relevance to ECR. Different types of knowledge sources will then be exploited to connect the nodes in the document structures, featuring discourse information (e.g., to connect coreferring entity mentions), syntactic information (e.g., to directly link event mentions and their arguments), and semantic similarity (e.g., to connect words/event mentions with similar meanings). Such rich document structures allow us to model the interactions of relevant objects for ECR beyond the sentence level for document-level context. Using graph convolutional neural networks (GCN) (Kipf and Welling, 2017; Nguyen and Grishman, 2018) for representation learning, we expect the enriched representation vectors from the document structures to further improve the performance of ECR systems. To our knowledge, this is the first time that rich document structures are employed for ECR.
Second, prior deep learning models for ECR fail to leverage consistencies between golden clusters (provided by humans) and predicted clusters (generated by models) to promote representation learning. In particular, it is intuitive that ECR models can achieve better performance if their predicted event clusters are more similar to the golden event clusters in the data. To this end, we propose to obtain different inconsistency measures between golden and predicted clusters that will be incorporated into the overall loss function for minimization. As such, we expect that the consistency/similarity regularization between the two types of clusters can provide useful training signals to improve representation vectors for event mentions in ECR. To our knowledge, this is also the first work to exploit cluster consistency-based regularization for representation learning in ECR. Finally, we conduct extensive experiments for ECR on the KBP benchmark datasets. The experiments demonstrate the benefits of the proposed methods and lead to state-of-the-art performance for ECR.

Related Work
Event coreference resolution is broadly related to work on entity coreference resolution that aims to resolve noun phrases/mentions for entities (Raghunathan et al., 2010; Ng, 2010; Durrett and Klein, 2013; Lee et al., 2017a; Joshi et al., 2019b,a). However, resolving event mentions has been considered a more challenging task than entity coreference resolution due to the more complex structures of event mentions (Yang et al., 2015).
Our work focuses on the within-document setting for ECR where input event mentions are expected to appear in the same input documents; however, we also note prior work on cross-document ECR (Lee et al., 2012a; Adrian Bejan and Harabagiu, 2014; Choubey and Huang, 2017; Kenyon-Dean et al., 2018; Barhom et al., 2019; Cattan et al., 2020). For within-document ECR, previous methods have applied feature-based models for pairwise classifiers (Ahn, 2006; Cybulska and Vossen, 2015; Peng et al., 2016), spectral graph clustering, information propagation (Liu et al., 2014), Markov logic networks (Lu et al., 2016), joint modeling of ECR with event detection (Araki and Mitamura, 2015; Lu et al., 2016; Chen and Ng, 2016; Lu and Ng, 2017), and recent deep learning models (Nguyen et al., 2016; Choubey and Huang, 2018; Huang et al., 2019; Choubey et al., 2020). Compared to previous deep learning work for ECR, our model presents a novel representation learning framework based on document structures to explicitly encode important interactions between relevant objects, and representation regularization to exploit the cluster consistency between golden and predicted clusters for event mentions.

Model
Formally, in ECR, given an input document D = w_1, w_2, ..., w_N (of N words/tokens) with a set of event mentions E = {e_1, e_2, ..., e_|E|}, the goal is to group the event mentions in E into clusters to capture the coreference relation between mentions. Our ECR model consists of four major components: (i) Document Encoder to transform words into representation vectors, (ii) Document Structure to create graphs for documents and learn rich representation vectors for event mentions, (iii) End-to-end Resolution to simultaneously resolve the coreference for the event mentions in E, and (iv) Cluster Consistency Regularization to regularize representation vectors based on consistency constraints between golden and predicted event mention clusters. Figure 2 presents an overview of our model for ECR.

Document Encoder
In the first step, we transform each word w_i ∈ D into a representation vector x_i by feeding D into the pre-trained language model BERT (Devlin et al., 2019). In particular, as BERT might split w_i into several word-pieces, we average the hidden vectors of the word-pieces of w_i in the last layer of BERT to obtain the representation vector x_i for w_i. To handle long documents with BERT, we divide D into segments of 512 consecutive word-pieces that are encoded separately. The resulting sequence X = x_1, x_2, ..., x_N for D is then sent to the next steps for further computation.
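The two encoding steps above (averaging word-piece vectors into word vectors, and segmenting a long document into 512-piece chunks) can be sketched as follows. This is a minimal illustration assuming the word-to-word-piece alignment is already available; the helper names are ours, not the paper's.

```python
import numpy as np

def wordpiece_to_word_vectors(piece_vecs, word_to_pieces):
    """Average the last-layer hidden vectors of each word's word-pieces
    to obtain one representation vector x_i per word w_i."""
    return np.stack([piece_vecs[idx].mean(axis=0) for idx in word_to_pieces])

def segment_long_document(piece_ids, max_len=512):
    """Split a long document into segments of up to 512 consecutive
    word-pieces, to be encoded separately by the encoder."""
    return [piece_ids[s:s + max_len] for s in range(0, len(piece_ids), max_len)]
```

In practice the alignment (`word_to_pieces`) would come from the BERT tokenizer's offset information; here it is simply passed in.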

Document Structure
This component aims to learn representation vectors for the event mentions in E using an interaction graph G = {N, E} for D that facilitates the enrichment of representation vectors for event mentions with relevant objects and interactions at the document level. The nodes and edges in G for our ECR problem are constructed as follows. Nodes: The node set N for our interaction graph G should capture relevant objects for the coreference between event mentions in D. Toward this goal, we consider all the context words (i.e., w_i), event mentions, and entity mentions in D as relevant objects for our ECR problem. For convenience, let M = {m_1, m_2, ..., m_|M|} be the set of entity mentions in D. The node set N for G is thus created by the union of D, E, and M: N = D ∪ E ∪ M. To achieve a fair comparison, we use the predicted event mentions that are provided by (Choubey and Huang, 2018) in the datasets for E. The Stanford CoreNLP toolkit is employed to obtain the entity mentions in M.
Edges: The edges between the nodes in N for G will be represented by an adjacency matrix A = {a_ij}_{i,j=1..|N|} (a_ij ∈ R) in this work. As A will be consumed by Graph Convolutional Networks (GCN) to learn representation vectors for ECR, the value/score a_ij between two nodes n_i and n_j in N is expected to estimate the importance (or the level of interaction) of n_j for the representation computation of n_i. This structure allows n_i and n_j to directly interact and influence the representation computation of each other even if they are sequentially far away from each other in D. As presented in the introduction, we explore three types of information to design the edges E (or compute the interaction scores a_ij) for G in our model: discourse-based, syntax-based, and semantic-based information. Discourse-based Edges: Due to the multiple sentences and event/entity mentions involved in the input document D, we need to understand where such objects span and how they relate to each other to effectively encode document context for ECR. To this end, we propose to exploit three types of discourse information to obtain the interaction graph G, i.e., sentence boundaries, coreference structure, and mention spans for event/entity mentions in D.
Sentence Boundary: Our motivation for this information is that event/entity mentions appearing in the same sentence tend to be more contextually related to each other than those in different sentences. As such, event/entity mentions in the same sentence might involve more helpful information for the representation computation of each other in our problem. To capture this intuition, we compute the sentence boundary-based interaction score a^sent_ij for the nodes n_i and n_j in N, where a^sent_ij = 1 if n_i and n_j are event/entity mentions of the same sentence in D (i.e., n_i, n_j ∈ E ∪ M), and 0 otherwise. We will use a^sent_ij as an input to compute the overall interaction score a_ij for G later.
Entity Coreference Structure: Instead of considering within-sentence information as in a^sent_ij, coreference structure focuses on the connection of entity mentions across sentences to enrich their representations with the contextual information of the coreferring ones. As such, to enable the interaction of representations for coreferring entity mentions, we compute the coreference-based score a^coref_ij for each pair of nodes n_i and n_j to contribute to the overall score a_ij for representation learning. Here, a^coref_ij is set to 1 if n_i and n_j are coreferring entity mentions in D, and 0 otherwise. Note that we also use the Stanford CoreNLP toolkit to determine the coreference of entity mentions in this work.
Mention Span: The sentence boundary and coreference structure scores model interactions of event and entity mentions in D based on discourse structure. To connect event and entity mentions to context words w_i for representation learning, we employ the mention span-based interaction score a^span_ij as another input for a_ij. Here, a^span_ij is only set to 1 (i.e., 0 otherwise) if n_i is a word (n_i ∈ D) in the span of the entity/event mention n_j (n_j ∈ E ∪ M) or vice versa. a^span_ij is important as it helps ground representation vectors of event/entity mentions to the contextual information in D. Syntax-based Edges: We expect the dependency trees of the sentences in D to provide beneficial information to connect the nodes in N for effective representation learning in ECR. For example, dependency trees have been used to retrieve important context words between event mentions and their arguments in prior work (Li et al., 2013; Veyseh et al., 2020a,b). To this end, we propose to employ the dependency relations/connections between the words in D to obtain a syntax-based interaction score a^dep_ij for each pair of nodes n_i and n_j in N, serving as an additional input for a_ij. In particular, by inheriting the graph structures of the dependency trees of the sentences in D, we set a^dep_ij to 1 if n_i and n_j are two words in the same sentence (i.e., n_i, n_j ∈ D) and there is an edge between them in the corresponding dependency tree [1], and 0 otherwise. Semantic-based Edges: This information leverages the semantic similarity of the nodes in N to enrich the overall interaction scores a_ij for G. Our motivation is that a node n_i will contribute more to the representation computation of another node n_j for ECR if n_i is more semantically related to n_j.
In particular, as the representation vectors for the nodes in N have captured the contextual semantics of the words in D, we propose to explore a novel source of semantic information that relies on external knowledge for the words to compute interaction scores between the nodes in N in our document structures for ECR. We expect the external knowledge for the words to provide complementary information to the contextual information in D, thus further enriching the overall interaction scores a_ij for the nodes in N. To this end, we propose to utilize WordNet (Miller, 1995), a rich network of word meanings, to obtain external knowledge for the words in D. The word meanings (i.e., synsets) in WordNet are connected to each other via different semantic relations (e.g., synonyms, hyponyms). In particular, our first step to generate knowledge-based similarity scores involves mapping each word node n_i ∈ D ∩ N to a synset node M_i in WordNet using a Word Sense Disambiguation (WSD) tool. We employ WordNet 3.0 and the state-of-the-art BERT-based WSD model in (Blevins and Zettlemoyer, 2020) to perform the word-synset mapping in this work. Afterward, we compute a knowledge-based similarity score a^struct_ij for each pair of word nodes n_i and n_j in D ∩ N using the structure-based similarity of their linked synsets M_i and M_j in WordNet (i.e., a^struct_ij = 0 if either n_i or n_j is not a word node in D ∩ N). Accordingly, we employ the Lin similarity measure (Lin et al., 1998): a^struct_ij = 2 · IC(LCS(M_i, M_j)) / (IC(M_i) + IC(M_j)). Here, IC and LCS represent the information content of synset nodes and the least common subsumer of two synsets in the WordNet hierarchy (the most specific ancestor node), respectively [2]. Structure Combination: Up to now, five scores have been generated to capture the level of interaction in representation learning for each pair of nodes n_i and n_j in N according to different information sources (i.e., a^sent_ij, a^coref_ij, a^span_ij, a^dep_ij, and a^struct_ij).
For convenience, we group the five scores for each node pair n_i and n_j into a vector d_ij = [a^sent_ij, a^coref_ij, a^span_ij, a^dep_ij, a^struct_ij] of size 5. To combine the scores in d_ij into an overall rich interaction score a_ij for n_i and n_j in G, we use the following normalization:

a_ij = exp(q^T d_ij) / Σ_{j'=1..|N|} exp(q^T d_ij')   (1)

where q is a learnable vector of size 5.

[2] We use the nltk tool to obtain the Lin similarity: https://www.nltk.org/howto/wordnet.html. We tried other WordNet-based similarities available in nltk (e.g., the Wu-Palmer similarity), but the Lin similarity produced the best results in our experiments.

Representation Learning: Given the combined interaction graph G with the adjacency matrix A = {a_ij}_{i,j=1..|N|}, we use GCNs to induce representation vectors for the nodes in N for ECR. In particular, our GCN model takes the initial representation vectors v_i of the nodes n_i ∈ N as the input. Here, the initial representation vector v_i for a word node n_i ∈ D is directly obtained from the BERT-based representation vector x_c ∈ X (i.e., v_i = x_c) of the corresponding word w_c for n_i. In contrast, for event and entity mentions, their initial representation vectors are obtained by max-pooling the contextualized embedding vectors in X that correspond to the words in the event/entity mentions' spans. For convenience, we organize the vectors v_i into the rows of the input matrix H_0 = [v_1, ..., v_|N|]. The GCN model then involves G layers that generate the matrix H_l at the l-th layer for the nodes in N (1 ≤ l ≤ G) via: H_l = ReLU(A H_{l−1} W_l) (W_l is the weight matrix for the l-th layer). The output of the GCN model after G layers is H_G, whose rows H_G = [h_1, ..., h_|N|] serve as more abstract representation vectors for the nodes n_i in the coreference prediction for event mentions. Also, for convenience, let {r_e_1, ..., r_e_|E|} ⊂ H_G be the set of GCN-induced representation vectors for the event mention nodes e_1, ..., e_|E| in E.
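The score combination and the GCN layers can be sketched together as below. Note that the softmax-over-neighbors normalization is one plausible reading of Equation 1 (the original formula is only partially recoverable); the learnable parameters `q` and the layer weights are passed in as plain arrays for illustration.

```python
import numpy as np

def combine_and_gcn(d, H0, q, weights):
    """d: |N| x |N| x 5 tensor holding the five interaction scores d_ij;
    H0: |N| x dim matrix of initial node vectors v_i;
    q: combination vector of size 5; weights: list of GCN weight matrices.
    Builds the adjacency matrix A from q^T d_ij (normalized over neighbors)
    and applies H_l = ReLU(A H_{l-1} W_l) for each layer."""
    scores = d @ q                                    # q^T d_ij for every pair
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = e / e.sum(axis=1, keepdims=True)              # normalize over j
    H = H0
    for W in weights:
        H = np.maximum(A @ H @ W, 0.0)                # ReLU(A H W)
    return A, H
```

In a real model `q` and the `W` matrices would be trained jointly with the rest of the network; this sketch only shows the forward computation.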

End-to-end Coreference Resolution
To facilitate the incorporation of the consistency regularization between golden and predicted clusters into the training process, we perform an end-to-end procedure that seeks to simultaneously resolve the coreference for the event mentions in E in a single process. Motivated by the entity coreference resolution model in (Lee et al., 2017b), we implement the end-to-end resolution via a set of antecedent assignments for the event mentions in E. In particular, we assume that the event mentions in E are enumerated in their order of appearance in D.
As such, our model aims to link each event mention e_i ∈ E to one of its prior event mentions in the set Y_i = {ε, e_1, ..., e_{i−1}} (ε is a dummy antecedent).
Here, a link of e_i to a non-dummy antecedent e_j in Y_i represents a coreference relation between e_i and e_j. In contrast, a dummy assignment for e_i indicates that e_i is not coreferent with any prior event mention. By forming a coreference graph with the e_i as nodes, the non-dummy antecedent assignments for every event mention in E can be utilized to connect coreferent event mentions. The connected components of the coreference graph can then be returned to serve as the predicted event mention clusters in D.
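Turning antecedent links into clusters is a standard connected-components computation; a small union-find sketch (our own helper, not the paper's code) follows.

```python
def clusters_from_antecedents(antecedent):
    """antecedent[i] is the predicted antecedent index for event mention i,
    or None for the dummy antecedent. Returns the predicted clusters as the
    connected components of the resulting coreference graph (union-find)."""
    parent = list(range(len(antecedent)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in enumerate(antecedent):
        if j is not None:                   # non-dummy link: merge clusters
            parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(antecedent)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

For example, antecedents [None, 0, None, 1] (mention 1 links to 0, mention 3 links to 1) yield the clusters {0, 1, 3} and {2}.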
In order to predict the coreferent antecedent y_i ∈ Y_i for an event mention e_i, we compute the distribution over the possible antecedents in Y_i for e_i via:

P(y_i | e_i, Y_i) = exp(s(e_i, y_i)) / Σ_{y' ∈ Y_i} exp(s(e_i, y'))

where s(e_i, e_j) is a score function to determine the coreference likelihood between e_i and e_j in D. To this end, we set s(e_i, ε) = 0 for all e_i ∈ E. Inspired by (Lee et al., 2017b), we obtain the score function s(e_i, e_j) for e_i and e_j by leveraging their GCN-induced representation vectors r_e_i and r_e_j via:

s(e_i, e_j) = s_m(e_i) + s_m(e_j) + s_c(e_i, e_j) + s_a(e_i, e_j)
s_m(e_i) = w_m^T FF_m(r_e_i)
s_c(e_i, e_j) = r_e_i^T W_c r_e_j
s_a(e_i, e_j) = w_a^T FF_c([r_e_i, r_e_j, r_e_i ∘ r_e_j])

where FF_m and FF_c are two-layer feed-forward networks, w_m and w_a are learnable vectors, W_c is a weight matrix, and ∘ is the element-wise multiplication. At inference time, we employ greedy decoding to predict the antecedent ŷ_i for e_i: ŷ_i = argmax_{y_i ∈ Y_i} P(y_i | e_i, Y_i). For training, we use the negative log-likelihood as the loss function in our end-to-end framework: L_pred = −Σ_{e_i ∈ E} log P(y*_i | e_i, Y_i) (y*_i is the golden antecedent for e_i).
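The antecedent distribution and the greedy decoding step can be sketched as follows, assuming the pairwise scores s(e_i, e_j) have already been computed; the fixed score 0 for the dummy antecedent is prepended at index 0.

```python
import numpy as np

def antecedent_distribution(scores):
    """scores[j-1] = s(e_i, e_j) for the prior mentions e_1..e_{i-1};
    the dummy antecedent is given a fixed score of 0. Returns
    P(y_i | e_i, Y_i), a softmax over {dummy, e_1, ..., e_{i-1}}."""
    full = np.concatenate(([0.0], np.asarray(scores, dtype=float)))
    e = np.exp(full - full.max())                 # numerically stable softmax
    return e / e.sum()

def greedy_antecedent(scores):
    """Greedy decoding: argmax over the antecedent distribution;
    index 0 stands for the dummy antecedent (no coreferent antecedent)."""
    return int(np.argmax(antecedent_distribution(scores)))
```

The first event mention in a document has no prior mentions, so its score list is empty and the dummy antecedent is chosen by construction.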

Cluster Consistency Regularization
To further improve representation learning for ECR, we propose to regularize the induced representation vectors of the event mentions in E to explicitly enforce the consistency/similarity between golden and predicted event mention clusters in D. This is based on our motivation that ECR models will perform better if they can produce event mention clusters that are more similar to the golden ones. For convenience, let T = {T_1, T_2, ..., T_|T|} and P = {P_1, P_2, ..., P_|P|} be the golden and predicted sets of event mention clusters for E respectively, i.e., T_i, P_j ⊂ E, and T_1 ∪ T_2 ∪ ... ∪ T_|T| = P_1 ∪ P_2 ∪ ... ∪ P_|P| = E. Also, for each cluster C in T or P, we compute a centroid vector r_C by averaging the representation vectors of its event mention members: r_C = average_{e ∈ C}(r_e). This leads to the centroid vectors {r_T_1, r_T_2, ..., r_T_|T|} and {r_P_1, r_P_2, ..., r_P_|P|} for T and P respectively. We propose the following regularization terms for cluster consistency: Intra-cluster Consistency: This constraint concerns the inner information of each cluster, characterizing the position of each individual event mention relative to its golden and predicted clusters in T and P. In particular, for each event mention e_i ∈ E, we expect its distances to the centroid vectors of the corresponding golden and predicted clusters T_i and P_i (respectively, with T_i ∈ T, P_i ∈ P, e_i ∈ T_i, e_i ∈ P_i) to be similar. As such, we compute the distances between the representation vector r_e_i of e_i and the centroid vectors r_T_i and r_P_i via the squared Euclidean distances ||r_e_i − r_T_i||²_2 and ||r_e_i − r_P_i||²_2.
Afterward, the differences between the two distances for golden and predicted clusters are aggregated over all event mentions into L_inner, which is added to the overall loss function for minimization:

L_inner = Σ_{e_i ∈ E} | ||r_e_i − r_T_i||²_2 − ||r_e_i − r_P_i||²_2 |

Inter-cluster Consistency: In this constraint, we expect that the structure among the clusters T_i in the golden set T is consistent with that of the predicted event cluster set P (i.e., inter-cluster regularization). To implement this idea, we encode the structure of the clusters in a set via the average of the pairwise distances between the centroid vectors of the clusters. In particular, the inter-cluster structure scores for the golden and predicted clusters in T and P are computed via:

s_T = average_{1 ≤ i < j ≤ |T|} ||r_T_i − r_T_j||²_2, s_P = average_{1 ≤ i < j ≤ |P|} ||r_P_i − r_P_j||²_2

The difference between the structure scores for the golden and predicted clusters T and P is then included in the overall loss function for minimization: L_inter = |s_T − s_P|. Inter-set Similarity: This constraint aims to directly promote the similarity between the golden clusters in T and the predicted clusters in P. As such, for the golden and predicted cluster sets T and P, we first obtain the overall centroid vectors u_T and u_P (respectively) by averaging the centroid vectors of their member clusters: u_T = average_{T ∈ T}(r_T) and u_P = average_{P ∈ P}(r_P). The squared Euclidean distance L_sim = ||u_T − u_P||²_2 is then integrated into the overall loss for minimization. Note that L_inner, L_inter, and L_sim will all be zero if the predicted clusters in P are the same as the golden clusters in T.
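The three consistency regularizers can be sketched from the centroid definitions as follows. Since parts of the original formulas are only partially recoverable, the exact aggregation (e.g., sum versus mean for L_inner) is our assumption; the structure follows the text: a per-mention distance gap, a gap of average pairwise centroid distances, and the distance between overall centroids.

```python
import numpy as np

def consistency_losses(R, gold, pred):
    """R: |E| x d matrix of event-mention vectors r_e;
    gold/pred: lists of clusters, each a list of mention indices covering E.
    Returns (L_inner, L_inter, L_sim) built from cluster centroid vectors."""
    cT = [R[c].mean(axis=0) for c in gold]           # golden centroids r_T
    cP = [R[c].mean(axis=0) for c in pred]           # predicted centroids r_P
    memT = {e: k for k, c in enumerate(gold) for e in c}
    memP = {e: k for k, c in enumerate(pred) for e in c}
    sq = lambda v: float(np.sum(v * v))              # squared Euclidean distance

    # L_inner: gap between each mention's distances to its two centroids
    L_inner = sum(abs(sq(R[e] - cT[memT[e]]) - sq(R[e] - cP[memP[e]]))
                  for e in range(len(R)))

    def structure(cents):                            # average pairwise distance
        pairs = [sq(a - b) for i, a in enumerate(cents) for b in cents[i + 1:]]
        return sum(pairs) / len(pairs) if pairs else 0.0

    L_inter = abs(structure(cT) - structure(cP))
    L_sim = sq(np.mean(cT, axis=0) - np.mean(cP, axis=0))
    return L_inner, L_inter, L_sim
```

As noted in the text, all three terms vanish when the predicted clustering matches the golden one exactly.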
To summarize, the overall loss function L to train our ECR model in this work is: L = L_pred + α_inner L_inner + α_inter L_inter + α_sim L_sim, where α_inner, α_inter, and α_sim are trade-off parameters.

Dataset & Hyperparameters
Following prior work (Choubey and Huang, 2018), we train our ECR models on the KBP 2015 dataset and evaluate the models on the KBP 2016 and KBP 2017 datasets for ECR (Mitamura et al., 2016, 2017). In particular, the KBP 2015 dataset includes 360 annotated documents for ECR (181 documents from discussion forums and 179 documents from news articles). We use the same 310 documents from KBP 2015 as in (Choubey and Huang, 2018) for the training data and the remaining 50 documents for the development data. Also, similar to (Choubey and Huang, 2018), the news articles in KBP 2016 (85 documents) and KBP 2017 (83 documents) are leveraged as test datasets. To ensure a fair comparison, we use the predicted event mentions provided by (Choubey and Huang, 2018) in all the datasets. Finally, we report the ECR performance based on the official KBP 2017 scorer (version 1.8) [3]. The scorer employs four coreference scoring measures, i.e., MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), CEAF-e (Luo, 2005), and BLANC (Lee et al., 2012b), together with the unweighted average of their F1 scores (AVG F1).

Performance Evaluation
We compare the proposed model for ECR with document structures and cluster consistency regularization (called StructECR) with prior ECR models in the same evaluation setting, including the joint model between ECR and event detection (Lu and Ng, 2017), the integer linear programming approach in (Choubey and Huang, 2018), and the discourse structure profiling model in (Choubey et al., 2020) (also the model with the best reported performance on the KBP datasets). In addition, we examine the following baselines of StructECR to highlight the benefits of the proposed components: E2E-Only: This variant implements the end-to-end resolution model described in Section 3.3 where all event mentions in a document are resolved simultaneously in a single process. However, different from our full model StructECR, E2E-Only does not include the document structure component with GCN for representation learning, i.e., it directly uses the initial representation vectors v_i (induced from BERT) for the event mentions in the computation of the distribution P(y_i | e_i, Y_i). The cluster consistency regularization in Section 3.4 is also not included in this model. Pairwise: This model is similar to E2E-Only in that it does not apply the document structures and regularization terms in StructECR. In addition, instead of simultaneously resolving event mentions in documents, Pairwise predicts the coreference for every pair of event mentions separately. In particular, the representation vectors v_e_i and v_e_j for two event mentions e_i and e_j (induced from BERT) are combined via [v_e_i, v_e_j, v_e_i ∘ v_e_j]. This vector is then sent into a feed-forward network to produce a distribution over possible coreference labels between e_i and e_j (i.e., two labels for being coreferent or not). The coreference labels for every pair of event mentions are then gathered in a coreference graph among event mentions; the connected components are returned as the event clusters.
Table 1 reports the performance of the ECR models on the KBP 2016 and KBP 2017 datasets. As can be seen from the table, E2E-Only performs comparably to or better than prior state-of-the-art models for ECR, e.g., (Choubey and Huang, 2018) and (Choubey et al., 2020), that employ extensive feature engineering. In addition, the better performance of E2E-Only over Pairwise (for both KBP 2016 and KBP 2017) illustrates the benefits of end-to-end coreference resolution for event mentions in documents. Most importantly, the proposed model StructECR significantly outperforms all the baseline models, with improvements over E2E-Only of 1.94% and 1.26% (AVG F1 scores) on the KBP 2016 and KBP 2017 datasets respectively. This clearly demonstrates the benefits of the proposed ECR model with rich document structures and cluster consistency regularization for representation learning.

Ablation Study
Two major components in the proposed model StructECR involve the document structures and the cluster consistency regularization. This section performs an ablation study to reveal the contribution of these components to the full model. First, for the document structures, we examine the following ablated models: (i) "StructECR - x", where x is one of the five interaction scores used to compute the unified score a_ij for G (i.e., a^sent_ij, a^coref_ij, a^span_ij, a^dep_ij, and a^struct_ij). For example, "StructECR - a^span_ij" implies a variant of StructECR where the span-based interaction score a^span_ij is not included in the computation of the overall score a_ij; (ii) "StructECR - Entity Nodes": this model excludes the entity mention nodes from the interaction graph G in StructECR (i.e., N = D ∪ E only); (iii) "StructECR - GraphCombine": instead of unifying the five interaction scores in d_ij into an overall score a_ij as in Equation 1, this model considers each of the five generated interaction scores as forming a separate interaction graph, thus producing five different graphs. The GCN model is then applied over those five graphs (using the same initial representation vectors v_i for the nodes n_i in N). The outputs of the GCN model for the same node n_i (with different graphs) are then concatenated to compute the final representation vector h_i for n_i; and (iv) "StructECR - Doc Structures": this model removes the GCN model from StructECR. As such, the interaction graph G is not used and the GCN-induced representation vectors h_i are replaced by the initial BERT-induced representation vectors v_i in the computation for end-to-end resolution and consistency regularization.
Second, for the cluster consistency regularization, we evaluate the following ablated models for StructECR: (v) "StructECR - y" (y ∈ {L_inner, L_inter, L_sim}): these models exclude one of the regularization terms for the consistency between golden and predicted clusters from the overall loss function L; and (vi) "StructECR - Regularization": this model completely removes the consistency regularization component from StructECR. Table 2 shows the performance of the models on the development data of the KBP 2015 dataset. As can be seen, the elimination of any component from StructECR significantly hurts the performance, thus clearly demonstrating the benefits of the designed document structures and cluster consistency regularization in StructECR.

Cross-domain Evaluation
To further demonstrate the benefits of the proposed model StructECR, we evaluate StructECR and the baseline models Pairwise and E2E-Only in the cross-domain setting. In this setting, we train the models on one domain (the source domain) and evaluate them on another domain (the target domain). We leverage the KBP 2016 and KBP 2017 datasets for this experiment. In particular, KBP 2016 annotates ECR data for 85 newswire and 84 discussion forum documents (i.e., two domains/genres) while KBP 2017 provides annotated data for ECR on 83 news articles and 84 discussion forum documents. As such, for each dataset, we consider two setups where documents in one domain (i.e., newswire or discussion forum) are used for the source domain, leaving documents in the other domain as the target domain data. We use the same hyper-parameters tuned on the development set of KBP 2015 for the models in this experiment. Table 3 presents the performance of the models. It is clear from the table that StructECR is significantly and substantially better than the baseline models (p < 0.01) over different datasets and settings for the source and target domains, thereby confirming the domain generalization advantages of StructECR for ECR.

Conclusion
We present a novel end-to-end coreference resolution framework for event mentions based on deep learning. The novelty in our model is twofold. First, document structures are introduced to explicitly capture relevant objects and their interactions in documents to aid representation learning. Second, several regularization techniques are proposed to exploit the consistencies between human-provided and machine-generated clusters of event mentions in documents. We perform extensive experiments on two benchmark datasets for ECR to demonstrate the advantages of the proposed model. In the future, we plan to extend our models to related problems in information extraction, e.g., event extraction.