Filling in the Gaps: Efficient Event Coreference Resolution using Graph Autoencoder Networks

We introduce a novel and efficient method for Event Coreference Resolution (ECR) applied to a lower-resourced language domain. By framing ECR as a graph reconstruction task, we are able to combine deep semantic embeddings with structural coreference chain knowledge to create a parameter-efficient family of Graph Autoencoder models (GAE). Our method significantly outperforms classical mention-pair methods on a large Dutch event coreference corpus in terms of overall score, efficiency and training speed. Additionally, we show that our models are consistently able to classify more difficult coreference links and are far more robust in low-data settings when compared to transformer-based mention-pair coreference algorithms.


Introduction
Event coreference resolution (ECR) is a discourse-centered NLP task in which the goal is to determine whether or not two textual events refer to the same real-life or fictional event. While this is a fairly easy task for human readers, it is far more complicated for AI algorithms, which often do not have access to the extra-linguistic knowledge or overview of discourse structure required to successfully connect these events. Nonetheless, ECR, especially in cross-document settings, holds interesting potential for a large variety of practical NLP applications such as summarization (Liu and Lapata, 2019), information extraction (Humphreys et al., 1997) and content-based news recommendation (Vermeulen, 2018).
However, despite the many potential avenues for ECR, the task remains highly understudied for comparatively lower-resourced languages. Furthermore, in spite of significant strides made since the advent of transformer-based coreference systems, a growing number of studies have questioned the effectiveness of such models. It has been suggested that classification decisions are still primarily based on the surface-level lexical similarity between the textual spans of event mentions (Ahmed et al., 2023; De Langhe et al., 2023), while this is far from the only aspect that should inform the classification decision. Concretely, many models assign coreferential links between similar mentions even when they are not coreferent, leading to a significant number of false positive classifications, such as between Examples 1 and 2.
1. The French president Macron met with the American president for the first time today.

2. French President Sarkozy met the American president.
We believe that the fundamental problem with this method stems from the fact that in most cases events are only compared in a pairwise manner and not as part of a larger coreference chain. The evidence that transformer-based coreference resolution is primarily driven by superficial similarity leads us to believe that the current pairwise classification paradigm for transformer-based event coreference is highly inefficient, especially for studies in lower-resourced languages, where the state of the art still often relies on the costly process of fine-tuning large monolingual BERT-like models (De Langhe et al., 2022b).
In this paper we aim to address both the lack of studies in comparatively lower-resourced languages and the more fundamental concerns w.r.t. the task outlined above. We frame ECR as a graph reconstruction task and introduce a family of graph autoencoder models which consistently outperforms traditional transformer-based methods on a large Dutch ECR corpus, both in terms of accuracy and efficiency. Additionally, we introduce a language-agnostic model variant which disregards semantic features entirely and still outperforms transformer-based classification in some situations. Quantitative analysis reveals that the lightweight autoencoder models can consistently classify more difficult mentions (cf. Examples 1 and 2) and are far more robust in low-data settings compared to traditional mention-pair algorithms.
Related Work

Event Coreference Resolution
The primary paradigm for event coreference resolution takes the form of a binary mention-pair approach. This method generates all possible event pairs and reduces the classification to a binary decision (coreferent or not) for each event pair. A large variety of classical machine learning algorithms has been tested within the mention-pair paradigm, such as decision trees (Cybulska and Vossen, 2015), support vector machines (Chen et al., 2015) and standard deep neural networks (Nguyen et al., 2016).
More recent work has focused on the use of LLMs and transformer encoders (Cattan et al., 2021a,b), with span-based architectures attaining the best overall results (Joshi et al., 2020; Lu and Ng, 2021). It should be noted that mention-pair approaches relying on LLMs suffer most from the limitations discussed in Section 1. In an effort to mitigate these issues, some studies have sought to move away from the pairwise computation of coreference by modelling coreference chains as graphs instead. These methods' primary goal is to create a structurally-informed representation of the coreference chains by integrating the overall document (Fan et al., 2022; Tran et al., 2021) or discourse (Huang et al., 2022) structure. Other graph-based methods have focused on commonsense reasoning (Wu et al., 2022).
Research for comparatively lower-resourced languages has generally followed the paradigms and methods described above and has focused on languages such as Chinese (Mitamura et al., 2015), Arabic (NIST, 2005) and Dutch (Minard et al., 2016).

Graph Autoencoders
Graph autoencoder models were introduced by Kipf and Welling (2016b) as an efficient method for graph reconstruction tasks. The original paper introduces both variational graph autoencoder (VGAE) and non-probabilistic graph autoencoder (GAE) networks. The models are parameterized by a 2-layer graph convolutional network (GCN) encoder (Kipf and Welling, 2016a) and a generative inner-product decoder between the latent variables. While initially conceived as lightweight models for citation network prediction tasks, both the VGAE and GAE have been successfully applied to a wide variety of applications such as molecule design (Liu et al., 2018), social network relational learning (Yang et al., 2020) and 3D scene generation (Chattopadhyay et al., 2023). Despite their apparent potential for effectively processing large amounts of graph-structured data, application within the field of NLP has been limited to a number of studies in unsupervised relational learning (Li et al., 2020).

Data
Our data consists of the Dutch ENCORE corpus (De Langhe et al., 2022a), which comprises 12,875 annotated events spread over 1,015 documents sourced from a collection of Dutch (Flemish) newspaper articles. Coreferential relations between events were annotated at both the within-document and cross-document level.

Baseline Coreference Model
Our baseline model consists of the Dutch monolingual BERTje model (de Vries et al., 2019) fine-tuned for cross-document ECR. First, each possible event pair in the data is encoded by concatenating the two events and subsequently feeding these to the BERTje encoder. We use the representation of the classification token [CLS] as the aggregate embedding of each event pair, which is then passed to a softmax-activated classification function. Finally, the results of the text pair classification are passed through a standard agglomerative clustering algorithm (Kenyon-Dean et al., 2018; Barhom et al., 2019) in order to obtain output in the form of coreference chains.
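To illustrate the final step, pairwise decisions must be grouped into chains. The sketch below uses simple connected components via union-find rather than the agglomerative clustering of Kenyon-Dean et al. (2018); the function names are our own and the example mentions are hypothetical:

```python
def cluster_chains(mentions, positive_pairs):
    """Group event mentions into coreference chains by taking the
    connected components of the positively classified pairs.
    (A simplification of the agglomerative clustering used in the paper.)"""
    parent = {m: m for m in mentions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b in positive_pairs:
        union(a, b)

    chains = {}
    for m in mentions:
        chains.setdefault(find(m), []).append(m)
    return list(chains.values())

# Toy example: two positive pairs transitively link e1, e2 and e3
# into one chain; e4 remains a singleton.
chains = cluster_chains(["e1", "e2", "e3", "e4"], [("e1", "e2"), ("e2", "e3")])
```

Note that this hard transitive closure cannot undo a single false-positive link, which is one reason the paper uses agglomerative clustering over pairwise scores instead.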
We also train two parameter-efficient versions of this baseline model using the distilled Dutch language model RobBERTje (Delobelle et al., 2022) and a standard BERTje model trained with bottleneck adapters (Pfeiffer et al., 2020).

Graph Autoencoder Model
We make the assumption that a coreference chain can be represented by an undirected, unweighted graph G = (V, E) with |V| nodes, where each node represents an event and each edge e ∈ E between two nodes denotes a coreferential link between those events. We frame ECR as a graph reconstruction task where a partially masked adjacency matrix A and a node-feature matrix X are used to predict all original edges in the graph. We employ both the VGAE and GAE models discussed in Section 2.2. In the non-probabilistic setting (GAE), the coreference graph is obtained by passing the adjacency matrix A and node-feature matrix X through a graph convolutional network (GCN) encoder and then computing the reconstructed matrix Â from the latent embeddings Z:

Z = GCN(X, A),  Â = σ(ZZ⊤),

where σ(·) denotes the logistic sigmoid. For a detailed overview of the (probabilistic) variational graph autoencoder we refer the reader to the original paper by Kipf and Welling (2016b).
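The decoder side of this reconstruction is simple enough to sketch directly. Below is a minimal, dependency-free version of the inner-product decoder of Kipf and Welling (2016b); the GCN encoder producing the latent vectors is not reimplemented here, and the toy embeddings are invented for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inner_product_decode(Z):
    """Inner-product decoder: the predicted probability of an edge (i, j)
    is sigmoid(z_i . z_j), where Z holds one latent vector per event node."""
    n = len(Z)
    A_hat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(zi * zj for zi, zj in zip(Z[i], Z[j]))
            A_hat[i][j] = sigmoid(dot)
    return A_hat

# Toy latent embeddings: nodes 0 and 1 are close in latent space
# (likely coreferent), node 2 points the opposite way.
Z = [[2.0, 0.0], [2.0, 0.1], [-2.0, 0.0]]
A_hat = inner_product_decode(Z)
```

Because the decoder is symmetric by construction (z_i · z_j = z_j · z_i), the reconstructed graph is undirected, matching the graph formulation above.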
Our experiments are performed in a cross-document setting, meaning that the input adjacency matrix A contains all events in the ENCORE dataset. Following the original approach by Kipf and Welling (2016b), we mask 15% of the edges, 5% to be used for validation and the remaining 10% for testing. An equal number of non-edges is randomly sampled from A to balance the validation and test data.
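This masking scheme can be sketched as follows; the function name, signature and toy graph are our own, not part of the original implementation:

```python
import random

def split_edges(edges, num_nodes, val_frac=0.05, test_frac=0.10, seed=0):
    """Mask 15% of the coreference edges (5% validation, 10% test) and
    sample an equal number of non-edges, following Kipf and Welling (2016b).
    Illustrative sketch only."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    n_val = int(len(edges) * val_frac)
    n_test = int(len(edges) * test_frac)
    val = edges[:n_val]
    test = edges[n_val:n_val + n_test]
    train = edges[n_val + n_test:]

    # Sample as many non-edges as masked edges to balance val/test data.
    edge_set = {frozenset(e) for e in edges}
    non_edges = []
    while len(non_edges) < n_val + n_test:
        i, j = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if i != j and frozenset((i, j)) not in edge_set:
            non_edges.append((i, j))
    return train, val, test, non_edges

# Toy chain graph: 101 nodes, 100 coreference edges.
edges = [(i, i + 1) for i in range(100)]
train, val, test, non_edges = split_edges(edges, num_nodes=101)
```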
We extract the masked edges and non-edges and use them to build the training, validation and test sets for the mention-pair baseline models detailed above, ensuring that both the mention-pair and graph autoencoder models have access to exactly the same data for training, validation and testing. We define the encoder network with a 64-dimensional hidden layer and 32-dimensional latent variables. For all experiments we train for a total of 200 epochs using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001.
We construct node features through Dutch monolingual transformer models by average-pooling the representations of each token in the event span in the models' final hidden layer, resulting in a 768-dimensional feature vector for each node in the graph. For this we use the Dutch BERTje model (de Vries et al., 2019), a Dutch sentence-BERT model (Reimers and Gurevych, 2019) and the Dutch RoBERTa-based RobBERT model (Delobelle et al., 2020). Additionally, we create a second feature set for the BERTje and RobBERT models where each event is represented by the concatenation of the last 4 layers' average-pooled token representations (Devlin et al., 2018). This in turn results in a 3072-dimensional feature vector.
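The two pooling variants can be sketched independently of any specific transformer; the input format below (a list of layers, each a list of per-token vectors) and all function names are hypothetical stand-ins for the hidden states one would extract from BERTje or RobBERT:

```python
def mean_pool(vectors):
    """Average a list of equal-length token vectors into a single vector."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def event_features(span_layers):
    """Build the two node-feature variants from per-layer token embeddings
    of an event span:
    - last-layer variant: mean-pooled final layer (768-dim in the paper)
    - 4-layer variant: concatenation of the last four mean-pooled layers
      (4 x 768 = 3072-dim in the paper)"""
    last = mean_pool(span_layers[-1])
    concat = [x for layer in span_layers[-4:] for x in mean_pool(layer)]
    return last, concat

# Toy example: 12 layers, a 3-token event span, 4-dimensional "hidden states".
layers = [[[float(l + t + d) for d in range(4)] for t in range(3)]
          for l in range(12)]
last, concat = event_features(layers)
```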
Finally, we also evaluate a language-agnostic, featureless model where X is set to the identity matrix with the same dimensions as A, so that each node receives a one-hot feature vector.

Hardware Specifications
The baseline coreference algorithms were trained and evaluated on 2 Tesla V100-SXM2-16GB GPUs. Due to GPU memory constraints, the graph autoencoder models were all trained and evaluated on a single 2.6 GHz 6-core Intel Core i7 CPU.

Results and Discussion
Results from our experiments are reported in Table 1 using the CONLL F1 metric, the average of three commonly used metrics for coreference evaluation: MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998) and CEAF (Luo, 2005). We find that the graph autoencoder models consistently outperform the traditional mention-pair approach. Moreover, we find that the autoencoder approach significantly reduces model size, training time and inference time, even when compared to parameter-efficient transformer-based methods. We note that the VGAE models perform slightly worse than their non-probabilistic counterparts, which is contrary to the findings of Kipf and Welling (2016b). This can be explained by the use of more complex acyclic graph data in the original paper; in that more uncertain context, probabilistic models would likely perform better.
As a means of quantitative error analysis, we report in Figure 1 the average Levenshtein distance between the two event spans of the True Positive (TP) pairs in our test set. Logically, if graph-based models are better able to classify harder (i.e. non-similar) edges, the average Levenshtein distance for predicted TP edges should be higher than for the mention-pair models. For readability's sake we only include results for the best-performing GAE-class models; a more detailed table can be found in the Appendix. We find that the average distance between TP pairs increases for our introduced graph models, indicating that graph-based models can, to some extent, mitigate the pitfalls of mention-pair methodologies as discussed in Section 1.
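The analysis relies on plain character-level edit distance; a minimal stdlib implementation (function names are ours) looks like:

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two event spans,
    using a rolling single-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def avg_tp_distance(tp_pairs):
    """Average Levenshtein distance over the span texts of true-positive pairs."""
    return sum(levenshtein(a, b) for a, b in tp_pairs) / len(tp_pairs)

# Example: lexically dissimilar spans (cf. Examples 1 and 2) yield a
# larger distance than near-identical ones.
d = levenshtein("met with the American president", "met the American president")
```

Under this metric, a model whose true positives have a higher average distance is recovering more links between lexically dissimilar mentions.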

Ablation Studies
We gauge the robustness of the graph-based models in low-data settings by re-running the original experiment while repeatedly reducing the available training data in increments of 10%. Figure 2 shows the CONLL F1 score for each of the models with respect to the available training data size. Here too, only the best-performing GAE-class models are visualized; an overview of all models' performance can be found in the Appendix. Surprisingly, we find that training the model on as little as 5% of the total number of edges in the dataset can already lead to satisfactory results. Logically, featureless models suffer a significant drop in performance when the available training data is reduced. We also find that the overall drop in performance in low-data settings is far greater for the traditional mention-pair model than for the feature-based GAE-class models. Overall, we conclude that the introduced family of models can be a lightweight and stable alternative to traditional mention-pair coreference models, even in settings with very little available training data.

Conclusion
We show that ECR through graph autoencoders significantly outperforms traditional mention-pair approaches in terms of performance, speed and model size in settings where coreference chains are at least partially known. Our method provides a fast and lightweight approach for processing large cross-document collections of event data. Additionally, our analysis shows that combining BERT-like embeddings with structural knowledge of coreference chains mitigates the dependence of mention-pair classification on surface-form lexical similarity. Our ablation experiments reveal that only a very small number of training edges is needed to obtain satisfactory performance.
Future work will explore the possibility of combining mention-pair models with the proposed graph autoencoder approach in a pipeline setting, in order to make it possible to employ graph reconstruction models in settings where all edges in the graph are initially unknown. Additionally, we aim to perform more fine-grained analyses, both quantitative and qualitative, of the types of errors made by graph-based coreference models.
We identify two possible limitations of the work presented above. First, by framing coreference resolution as a graph reconstruction task we assume that at least some coreference links in the cross-document graph are available to train on. However, we note that this issue can in part be mitigated by a simple exact-match heuristic for event spans on unlabeled data. Moreover, in most application settings it is not inconceivable that at least a partial graph is available.
A second limitation stems from the fact that we modelled coreference chains as undirected graphs.
It could be argued that some coreferential relationships such as pronominal anaphora could be more accurately modelled using directed graphs instead.

Figure 1: Average Levenshtein distance for True Positive (TP) classifications across all models.

Figure 2: CONLL F1 performance with respect to the available training data.

Table 1: Results for the cross-document event coreference task. We report the average CONLL score and standard deviation over 3 training runs with different random seed initializations for the GCN weight matrices (GAE/VGAE) and classification heads (mention-pair models). Inference runtime is reported for the entire test set.