Extracting Event Temporal Relations via Hyperbolic Geometry

Detecting events and their evolution through time is a crucial task in natural language understanding. Recent neural approaches to event temporal relation extraction typically map events to embeddings in the Euclidean space and train a classifier to detect temporal relations between event pairs. However, embeddings in the Euclidean space cannot capture richer asymmetric relations such as event temporal relations. We thus propose to embed events into hyperbolic spaces, which are intrinsically suited to modeling hierarchical structures. We introduce two approaches to encode events and their temporal relations in hyperbolic spaces. One approach leverages hyperbolic embeddings to directly infer event relations through simple geometrical operations. In the second, we devise an end-to-end architecture composed of hyperbolic neural units tailored for the temporal relation extraction task. Thorough experimental assessments on widely used datasets show the benefits of revisiting the task in a different geometrical space, resulting in state-of-the-art performance on several standard metrics. Finally, an ablation study and several qualitative analyses highlight the rich event semantics implicitly encoded into hyperbolic spaces.


Introduction
Successful understanding of natural language depends, among other factors, on the capability to accurately detect events and their evolution through time. This has recently led to increasing interest in research on temporal relation extraction (Chambers et al., 2014; Wang et al., 2020) with the aim of understanding events and their temporal orders. Temporal reasoning has been proven beneficial, for example, in understanding narratives (Cheng et al., 2013), answering questions (Ning et al., 2020), or summarizing events (Wang et al., 2018).
However, events occurring in text are not just simple, standalone predicates; they rather form complex and hierarchical structures with different granularity levels (Fig. 1), a characteristic that still challenges existing models and restricts their performance on real-world datasets for temporal relation extraction (Ning et al., 2018a,c). Addressing such challenges requires models not only to recognize accurately the events and their hierarchical and chronological properties, but also to encode them in representations enabling effective temporal reasoning. Although this has prompted the recent development of neural architectures for automatic feature extraction (Ning et al., 2019; Wang et al., 2020; Han et al., 2019b), which achieve better generalization and avoid the costly design of statistical methods leveraging hand-crafted features (Mani et al., 2006; Chambers et al., 2007; Verhagen and Pustejovsky, 2008), the inherent complexity of temporal relations still hinders approaches that rely solely on the scarce availability of annotated data. Some of the intrinsic limitations of these approaches stem from the adopted embedding space. Existing approaches to temporal relation extraction typically operate in the Euclidean space, in which an event is represented as a point. Although Euclidean embeddings exhibit a linear algebraic structure that captures co-occurrence patterns among events, they are not able to reveal richer asymmetric relations, such as event temporal order (e.g., 'event A happens before event B' but not vice versa). Inspired by recent works on learning non-Euclidean embeddings, such as Poincaré embeddings (Nickel and Kiela, 2017; Tifrea et al., 2019), which show superior performance in capturing asymmetric relations between objects, we propose to learn event embeddings in hyperbolic spaces (Ganea et al., 2018b).
Hyperbolic spaces can be viewed as continuous versions of trees, and are thus naturally suited to encoding hierarchical and asymmetric structures. For instance, Sala et al. (2018) showed that hyperbolic spaces with just two dimensions (i.e., the Poincaré disk) can efficiently embed tree structures with arbitrarily low distortion (Sarkar, 2011), while Euclidean spaces cannot achieve comparable distortion even with an unbounded number of dimensions (Linial et al., 1994). Despite the hierarchical properties arising in modelling event relations, there are still very few studies on how to leverage such models for temporal relation extraction. We propose two hyperbolic approaches with different strengths: a lightweight and efficient embedding learning method based on the Poincaré ball model, and an end-to-end deep hyperbolic neural network based on Riemannian optimization.
Our contributions can be summarized as follows:
• We propose an embedding learning approach with a novel angular loss to encode events onto hyperbolic spaces, paired with a simple rule-based classifier to detect event temporal relations. With only 1.5k parameters and about 4 minutes of training, it achieves results on par with far more complex models in the recent literature.
• Alternatively, we propose a hyperbolic neural network architecture for end-to-end extraction of event TempRels.
• We conduct a thorough experimental assessment on MATRES and TCR, with ablation studies and qualitative analyses demonstrating the benefits of tackling the TempRel extraction task in hyperbolic spaces.

Related Work
Our work is related to at least two lines of research: one about event temporal relation extraction and another on hyperbolic neural models.
Event TempRel Extraction In recent years, approaches to TempRel extraction have largely been built on neural models. These models have proven capable of automatically extracting reliable event features for TempRel extraction when provided with high-quality data (Ning et al., 2019), significantly alleviating the required human engineering effort and outperforming earlier feature-engineered methodologies. In particular, Ning et al. (2019) employed an LSTM network (Hochreiter and Schmidhuber, 1997) to encode the textual events, taking into account their global context and feeding their representations into a multi-layer perceptron for TempRel classification. In addition, to improve generalization to unseen event tuples, they simultaneously trained a Siamese network bridging commonsense knowledge across event relations. Similarly, Han et al. (2019a) combined a bidirectional LSTM (BiLSTM) with a structured support vector machine (SSVM), with the BiLSTM extracting the pair of events and the SSVM incorporating structural linguistic constraints across them. Wang et al. (2020) proposed a constrained learning framework, where event pairs are encoded via a BiLSTM and enhanced with commonsense knowledge from ConceptNet (Speer et al., 2017) and TEMPROB (Ning et al., 2018b), while enforcing a set of logical constraints at training time. The aim is to train the model to detect and extract the event relations while regularizing towards consistency on logic converted into differentiable objective functions, similarly to what was proposed in Li et al. (2019).
Hyperbolic Neural Models The aforementioned models are all designed to process data representations in the Euclidean space. However, several studies have shown the inherent limitations of the Euclidean space in representing asymmetric relations and tree-like graphs (Nickel et al., 2014; Bouchard et al., 2015). Hyperbolic spaces, instead, are promising alternatives that have a natural hierarchical structure and can be thought of as continuous versions of trees. This makes them highly suitable and efficient for encoding tree-like networks (Nickel and Kiela, 2017; Tran et al., 2020).
Previous works have explored their use in embedding taxonomies for network link prediction or for modeling lexical entailment. In particular, Nickel and Kiela (2017) proposed to learn word hierarchies through negative-sampling training based on the distance metric of the Poincaré ball. Ganea et al. (2018a) generalized the idea of order embeddings (Vendrov et al., 2016) to the Poincaré ball, leveraging the projected areas to infer data relations. Ganea et al. (2018b) introduced a framework of hyperbolic neural networks composed of neural units that learn and optimize parameters in hyperbolic spaces. Their experiments show that, without increasing the number of parameters, hyperbolic neural networks outperform their Euclidean counterparts on natural language inference and on the detection of noisy prefixes. There have been a few attempts to revisit NLP tasks in the hyperbolic space framework by generalizing hyperbolic neural activation functions for machine translation (Gulcehre et al., 2018), for detecting hierarchical entity types (López and Strube, 2020), and for document classification (Zhang and Gao, 2020). Compared to the above works, our model is the first attempt at devising an end-to-end hyperbolic architecture showing the benefit of addressing event TempRel extraction in hyperbolic spaces.

Preliminaries
In this section, we give a brief introduction to hyperbolic geometry and hyperbolic neural networks.
Hyperbolic geometry. A hyperbolic space is a non-Euclidean space that has the same negative sectional curvature at every point (i.e., a constant negative curvature). Intuitively, constant curvature means that the space keeps the same "curviness" at every point. An example of a constant positive curvature space is a perfect sphere. An example of an n-dimensional hyperbolic space, by contrast, is a hyperboloid embedded in $\mathbb{R}^{n+1}$.
One of the most widely used models of hyperbolic space was proposed by Henri Poincaré. The Poincaré model is an open n-dimensional unit ball $\mathbb{D}^n = \{x \in \mathbb{R}^n \mid \|x\| < 1\}$ equipped with the Riemannian metric tensor

$g^{\mathbb{D}}_x = \lambda_x^2 \, g^E, \qquad \lambda_x = \frac{2}{1 - \|x\|^2},$ (1)

where $x \in \mathbb{D}^n$, $\|\cdot\|$ denotes the Euclidean norm, and $g^E$ denotes the Euclidean metric tensor. A two-dimensional Poincaré model is called a Poincaré disk. A geodesic (i.e., the shortest path between two points) on the Poincaré disk is an arc that is part of a circle perpendicular to the boundary circle.
Based on the metric tensor $g^{\mathbb{D}}_x$, the distance between two points is defined as (Nickel and Kiela, 2017):

$d_{\mathbb{D}}(x, y) = \cosh^{-1}\!\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right).$ (2)

The Poincaré norm is the distance between the origin and a given point:

$\|x\|_{\mathbb{D}} = d_{\mathbb{D}}(\mathbf{0}, x) = 2 \tanh^{-1}(\|x\|).$ (3)

Also, an angle $\angle ABC$ in the Poincaré model can be derived as (Ganea et al., 2018a):

$\cos(\angle ABC) = \frac{\langle z_1, z_2 \rangle}{\|z_1\|\,\|z_2\|},$ (4)

where $z_1, z_2 \in T_B\mathbb{D}^n \setminus \{\mathbf{0}\}$ are the initial tangent vectors of the geodesics connecting B with A and B with C, and $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product. An important property of the Poincaré model is its conformality with the Euclidean space, which means both metrics define the same angles; Eq. (4) is therefore the familiar Euclidean angle formula.
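The formulas above can be checked numerically. A minimal NumPy sketch of the distance, norm, and angle computations (function names are ours, not from the paper):

```python
import numpy as np

def poincare_distance(x, y):
    """Geodesic distance on the Poincare ball (Eq. 2)."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

def poincare_norm(x):
    """Distance from the origin; closed form 2 * artanh(||x||) (Eq. 3)."""
    return 2.0 * np.arctanh(np.linalg.norm(x))

def angle(z1, z2):
    """Hyperbolic angle between initial tangent vectors: by conformality,
    it equals the Euclidean angle (Eq. 4)."""
    cos = np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

Note that `poincare_norm(x)` is exactly `poincare_distance(x, 0)`, since $\cosh(2\tanh^{-1} r) = (1+r^2)/(1-r^2)$.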
An exponential map can be defined at each point $x \in \mathbb{D}^n$ to map any tangent vector $v \in T_x\mathbb{D}^n$ onto the ball (Ganea et al., 2018a):

$\exp_x(v) = x \oplus \left(\tanh\!\left(\frac{\lambda_x \|v\|}{2}\right) \frac{v}{\|v\|}\right).$ (5)

Hyperbolic neural networks. A hyperbolic space is not a vector space, so standard neural operations cannot be applied directly. Ganea et al. (2018b) provide an algebraic setting through Möbius operations: a hyperbolic feed-forward layer computes $y = \varphi^{\otimes}(W \otimes x \oplus b)$, where $\otimes$ is the Möbius matrix product, $\oplus$ is the Möbius addition, and $\varphi^{\otimes}$ is a hyperbolic non-linearity. In hyperbolic multinomial logistic regression (MLR), the prediction probability of a given class $k \in \{1, \ldots, K\}$ is computed as:

$p(y = k \mid x) \propto \exp\!\left(\lambda_{p_k} \|a_k\| \sinh^{-1}\!\left(\frac{2 \langle -p_k \oplus x,\, a_k \rangle}{(1 - \|{-p_k \oplus x}\|^2)\,\|a_k\|}\right)\right),$ (6)

where $x \in \mathbb{D}^n$ is the output vector of the previous layer, and $p_k \in \mathbb{D}^n$ and $a_k \in T_{p_k}\mathbb{D}^n \setminus \{\mathbf{0}\}$ are learnable parameters.
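The Möbius operations and the origin exponential map underlying these layers can be sketched in NumPy for curvature −1 (function names are ours):

```python
import numpy as np

def mobius_add(x, y):
    """Mobius addition on the Poincare ball (curvature -1)."""
    xy = np.dot(x, y)
    nx2, ny2 = np.sum(x ** 2), np.sum(y ** 2)
    num = (1 + 2 * xy + ny2) * x + (1 - nx2) * y
    return num / (1 + 2 * xy + nx2 * ny2)

def exp0(v):
    """Exponential map at the origin: tangent vector -> Poincare ball."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.zeros_like(v)
    return np.tanh(n) * v / n

def mobius_matvec(M, x):
    """Mobius matrix-vector product M (*) x."""
    n = np.linalg.norm(x)
    Mx = M @ x
    nm = np.linalg.norm(Mx)
    if nm < 1e-12 or n < 1e-12:
        return np.zeros(M.shape[0])
    return np.tanh(nm / n * np.arctanh(n)) * Mx / nm
```

Both `mobius_add` and `exp0` always return points strictly inside the unit ball, and the identity matrix acts as the identity under `mobius_matvec`.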

Event Temporal Relation Extraction in the Hyperbolic Space
In this section, we propose two approaches to leverage a hyperbolic space for TempRel extraction. The first approach learns event embeddings that encode temporal order via a hyperbolic space, while the second one is an end-to-end hyperbolic neural network tailored for the TempRel extraction task.

Hyperbolic Event Embedding Learning
We first explore how to learn embeddings of events in a hyperbolic space while preserving their temporal orders. Temporal relations are asymmetric and transitive, exhibiting properties similar to hierarchical relations. Inspired by previous successes of hyperbolic embeddings for word hierarchies (Nickel and Kiela, 2017; Ganea et al., 2018a), we propose to learn event embeddings based on the Poincaré model to capture their temporal relations. For a given text sequence containing an event pair (u, v), we first extract the contextualized embeddings of the event tokens, $e_u$ and $e_v$, from a sequence encoder such as ELMo (Peters et al., 2018) or RoBERTa (Liu et al., 2019), and use the exponential map to project them onto a Poincaré ball. The embeddings are then further projected to a lower-dimensional space through a hyperbolic feed-forward layer:

$s_u = \mathrm{HFFL}(\exp_0(e_u)), \qquad s_v = \mathrm{HFFL}(\exp_0(e_v)),$ (7)

where $\exp_0(\cdot)$ is the exponential map at the origin of a Poincaré ball as defined by Eq. (5), HFFL is a hyperbolic feed-forward layer, and $s_u$ and $s_v$ are the final Poincaré embeddings of events u and v.
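A toy sketch of this projection step, expressing the hyperbolic feed-forward layer through the exp/log maps at the origin (the small weight scale, the omitted bias, and the 768-dimensional input are our simplifying assumptions):

```python
import numpy as np

def exp0(v):  # map a tangent vector onto the Poincare ball
    n = np.linalg.norm(v)
    return np.tanh(n) * v / n if n > 1e-12 else np.zeros_like(v)

def log0(x):  # inverse map, back to the tangent space at the origin
    n = np.linalg.norm(x)
    return np.arctanh(n) * x / n if n > 1e-12 else np.zeros_like(x)

def hffl(W, x):
    """Hyperbolic feed-forward layer sketch (bias omitted): the Mobius
    matrix product and hyperbolic tanh both factor through exp0/log0."""
    h = exp0(W @ log0(x))            # Mobius matrix-vector product
    return exp0(np.tanh(log0(h)))    # hyperbolic non-linearity

rng = np.random.default_rng(0)
e_u = rng.normal(scale=0.05, size=768)     # toy contextual event embedding
W = rng.normal(scale=0.01, size=(2, 768))  # projection to 2-D
s_u = hffl(W, exp0(e_u))
assert np.linalg.norm(s_u) < 1.0           # stays inside the ball
```

The small scales keep `log0` away from the numerically unstable boundary of the ball.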
To encode temporal connections in the embeddings, we want to pull events that have temporal connections close to each other, while pushing events that have no temporal relations far apart. Thus, inspired by Poincaré embeddings (Nickel and Kiela, 2017), we define the first loss term:

$\mathcal{L}_1 = -\sum_{(u,v) \in D} \log \frac{e^{-d_{\mathbb{D}}(s_u, s_v)}}{\sum_{v' \in N(u) \cup \{v\}} e^{-d_{\mathbb{D}}(s_u, s_{v'})}},$ (8)

where $D$ is the set of event pairs that have temporal connections and $N(u)$ is the set of events that have no temporal relation with the event u; $s_u$ and $s_v$ denote the Poincaré embeddings of events u and v, respectively. For example, in the MATRES dataset, event pairs are annotated with one of four relations: BEFORE, AFTER, EQUAL, and VAGUE. We consider the first three relations as temporal connections and regard VAGUE as no relation (Ning et al., 2019). Therefore, the set D contains the event pairs in the training set that have BEFORE, AFTER, or EQUAL labels, while the set N(u) contains the events that are in the same document as u but cannot reach u using only BEFORE, AFTER, or EQUAL edges. It is worth noting that, because we do not encode the relation type explicitly, the order of the input event pair (u, v) matters. We explicitly model the BEFORE relation in the event pair (u, v) (i.e., u happens earlier than v). For event pairs with the AFTER relation, we simply swap the events, since AFTER and BEFORE are reciprocal.
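Assuming the loss follows the Nickel and Kiela (2017) formulation it is inspired by, a sketch for a single positive pair with sampled negatives:

```python
import numpy as np

def poincare_distance(x, y):
    sq = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return np.arccosh(1 + 2 * sq / denom)

def relation_loss(s_u, s_v, negatives):
    """Negative log-softmax over distances: pulls the connected pair
    (u, v) together and pushes the sampled events from N(u) away.
    `negatives` holds embeddings of events with no temporal relation
    to u (our stand-in for the set N(u))."""
    d_pos = poincare_distance(s_u, s_v)
    d_all = [d_pos] + [poincare_distance(s_u, s_n) for s_n in negatives]
    # -log( exp(-d_pos) / sum_i exp(-d_i) )
    return d_pos + np.log(np.sum(np.exp(-np.array(d_all))))
```

The loss decreases monotonically as the positive pair moves closer, which is the behavior the text describes.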
In addition to the first loss term, we introduce a second, novel loss term to enforce an angular property. As shown in the example in Figure 2, we want to make the angle $\angle\theta_1$ of a positive event pair (u, v) smaller. Based on preliminary tests, the angular loss further reinforces the first loss term, which drives the norm of $s_u$ to be larger than the norm of $s_v$, and thus increases performance. Moreover, this angular property helps to distinguish VAGUE pairs, by using a threshold on $\angle\theta_2$ to determine whether to assign an event pair the VAGUE label. We define the second loss term as the degree of $\angle\theta_1$, which is known to be equal to (Ganea et al., 2018a):

$\mathcal{L}_2 = \sum_{(u,v) \in D} \cos^{-1}\!\left(\frac{\langle s_u, s_v \rangle (1 + \|s_u\|^2) - \|s_u\|^2 (1 + \|s_v\|^2)}{\|s_u\|\,\|s_u - s_v\|\,\sqrt{1 + \|s_u\|^2 \|s_v\|^2 - 2\langle s_u, s_v \rangle}}\right).$ (9)

Then, we train the HFFL based on the following objective function:

$\mathcal{L} = \alpha \mathcal{L}_1 + (1 - \alpha)\, \mathcal{L}_2,$ (10)

where $\alpha$ is a hyperparameter balancing the importance of the two loss terms.
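A numerical sketch of the closed-form angle, following the expression in Ganea et al. (2018a). Whether $\theta_1$ corresponds to this angle or to its supplement depends on the orientation convention in Figure 2, so treat this purely as an illustration:

```python
import numpy as np

def geodesic_angle(x, y):
    """Closed-form angle at x between the radial direction through x and
    the geodesic from x to y on the Poincare ball (Ganea et al., 2018a)."""
    xy = np.dot(x, y)
    nx2, ny2 = np.sum(x ** 2), np.sum(y ** 2)
    num = xy * (1 + nx2) - nx2 * (1 + ny2)
    den = (np.linalg.norm(x) * np.linalg.norm(x - y)
           * np.sqrt(1 + nx2 * ny2 - 2 * xy))
    return np.arccos(np.clip(num / den, -1.0, 1.0))
```

For collinear points the formula degenerates as expected: a point between x and the origin yields an angle of π, and a point on the far side of x along the same ray yields 0.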
The training objective pushes $s_u$ towards the boundary while pulling $s_v$ close to the origin. Thus, the norms of $s_u$ and $s_v$ provide key information to determine the temporal order of an event pair, while the angle $\angle\theta_1$ helps to distinguish VAGUE event pairs. We define a score function over these quantities, from which the different types of relations are predicted through a set of simple rules, with a threshold t adjusted on the validation set.
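The exact score function and rule thresholds are not reproduced in this excerpt; the following is a hypothetical decision rule implementing only the stated intuition (norm ordering for BEFORE/AFTER, an angular threshold for VAGUE), with made-up threshold values:

```python
import numpy as np

def predict_relation(s_u, s_v, angle, t=0.9):
    """Hypothetical rule-based classifier: Poincare norms give the
    temporal order, a threshold t on the angle flags VAGUE pairs.
    The thresholds here are placeholders, not the paper's values."""
    if angle > t * np.pi:              # wide angle: no clear relation
        return "VAGUE"
    nu, nv = np.linalg.norm(s_u), np.linalg.norm(s_v)
    if abs(nu - nv) < 1e-3:            # nearly equal norms
        return "EQUAL"
    # larger norm = closer to the boundary = earlier event
    return "BEFORE" if nu > nv else "AFTER"
```

The norm comparison reflects the training objective above, which pushes the earlier event toward the boundary.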

Hyperbolic Neural Network for Temporal Relation Detection
Apart from the aforementioned approach, an alternative is to train an end-to-end hyperbolic neural network with a cross-entropy loss. We propose a hyperbolic neural network model for TempRel detection based on the operations defined on the Poincaré ball. Given an input sequence containing an event pair (u, v), we first obtain its contextualized sentence representations from a pre-trained language model. The representations are denoted as a matrix $B \in \mathbb{R}^{l \times d}$, where l is the sentence length and d is the dimension of the word embeddings. The sentence representations are then projected onto a Poincaré ball, $C = \exp_0(B)$, and fed into a hyperbolic feed-forward layer. The outputs are passed to a Hyperbolic Gated Recurrent Unit (HGRU) to derive the hidden state of each word. A position masking vector $m_u$ (or $m_v$), which has value 1 in the position of an event and 0 otherwise, is applied to retrieve the hidden state representation of the corresponding event. The HGRU on top of the RoBERTa output can further compose the information from the event triggers and their corresponding subjects and objects.
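The masking step amounts to an indicator-weighted selection over the per-token hidden states. A toy sketch with random values (sequence length 5 and hidden size 4 are arbitrary):

```python
import numpy as np

# Toy stand-in for the HGRU output: one hidden state per token.
rng = np.random.default_rng(1)
H = rng.normal(size=(5, 4))

m_u = np.array([0, 1, 0, 0, 0])  # event u trigger is token 1
m_v = np.array([0, 0, 0, 1, 0])  # event v trigger is token 3

h_u = m_u @ H                    # selects the row for event u
h_v = m_v @ H
assert np.allclose(h_u, H[1]) and np.allclose(h_v, H[3])
```

With one-hot masks the matrix product reduces to plain row selection.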
Afterwards, the two event hidden states are combined through a weighted Möbius aggregation. The result $s_{uv}$ is further combined with the distance between the two event hidden states, $d_{\mathbb{D}}(h_u, h_v)$, before applying a hyperbolic non-linear function (Eq. 14). The output o is then passed to a Hyperbolic Multinomial Logistic Regression (HMLR) layer (Eq. 6) to generate the event temporal relation classification result, $\hat{y} = \mathrm{HMLR}(o)$. Figure 3 shows a schematic depiction of the network architecture. Additionally, commonsense knowledge can be incorporated within the HGRU. We follow Ning et al. (2019) and use a Siamese network trained on TEMPROB, discretize its output, and turn the output into categorical embeddings. These categorical embeddings are then projected onto the hyperbolic space and updated with a Riemannian optimizer.
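The final classification layer (Eq. 6) can be sketched in NumPy as follows; the parameter values in the test are arbitrary placeholders, and a real implementation would learn $p_k$ and $a_k$ with a Riemannian optimizer:

```python
import numpy as np

def mobius_add(x, y):
    xy = np.dot(x, y)
    nx2, ny2 = np.sum(x ** 2), np.sum(y ** 2)
    return ((1 + 2 * xy + ny2) * x + (1 - nx2) * y) / (1 + 2 * xy + nx2 * ny2)

def hyperbolic_mlr(x, P, A):
    """Hyperbolic multinomial logistic regression (Ganea et al., 2018b).
    P[k] is the hyperbolic offset p_k, A[k] the tangent normal a_k."""
    logits = []
    for p, a in zip(P, A):
        z = mobius_add(-p, x)
        lam = 2.0 / (1.0 - np.sum(p ** 2))        # conformal factor at p
        na = np.linalg.norm(a)
        logits.append(lam * na * np.arcsinh(
            2 * np.dot(z, a) / ((1 - np.sum(z ** 2)) * na)))
    logits = np.array(logits)
    e = np.exp(logits - logits.max())             # stable softmax
    return e / e.sum()
```

As with a Euclidean softmax, the outputs form a proper probability distribution over the relation classes.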
The commonsense features can be directly combined with the components of Eq. 14.

The Use of Pre-trained Language Model
Both of our proposed methods incorporate pre-trained language models. We investigated several ways to utilize them and include two in this paper. The first follows Ning et al. (2019) and only uses the static output of the pre-trained models. This approach is fast and allows a fair comparison with the models in Ning et al. (2019).
The second approach fine-tunes the pre-trained language models during training on the TempRel extraction objective, which achieves better performance at a higher computational cost.

Experimental Setup
We describe the datasets and the methodologies used in the recent literature for the TempRel extraction tasks, and briefly present the parameter setup of our experiments. A description of the evaluation metrics can be found in Appendix D.
Dataset MATRES (Ning et al., 2018c) is a TempRel dataset composed of news documents.
With its novel multi-axis annotation scheme, MATRES achieves much higher inter-annotator agreement (Styler IV et al., 2014). MATRES consists of documents from three sources: TimeBank (183 documents), AQUAINT (72 documents), and Platinum (20 documents). We follow the official split, in which TimeBank and AQUAINT are used for training while Platinum is used for testing. We further split 20% of the training data into a validation set.
Temporal and Causal Reasoning (TCR) (Ning et al., 2018a) is another dataset that adopts the annotation scheme defined in MATRES. It is a much smaller dataset, with just 25 documents and 2.6K TempRels.

Parameter Setup Based on the results of preliminary experiments, the contextualized embeddings are produced by RoBERTa in both of our proposed approaches. We conducted preliminary experiments to determine the best dimension for the Poincaré embeddings but found that the variations in performance were not statistically significant. Thus, we adopted 2D Poincaré embeddings for the sake of simplicity and ease of visualization. More details about the model architecture and the hyperparameter settings can be found in Appendix A.

Experimental Results
Overall Comparison Existing methodologies for TempRel extraction commonly leverage several auxiliary components, such as external commonsense knowledge and multi-task objectives. Therefore, to better understand the impact of adopting hyperbolic geometry, we conduct additional experiments over ablated versions of the baseline models. In particular, in Table 2, the results for LSTM and LSTM+knowledge are produced by employing the original code provided by the authors. The other results of the compared methods are taken directly from the cited papers.
For consistency and fair comparison with Ning et al. (2019), we test our method with inputs from ELMo (Peters et al., 2018), which has been reported to achieve the best overall results for the models proposed in Ning et al. (2019). We observe that the proposed Poincaré event embedding learning method presented in Section 4.1 outperforms LSTM and its variants, which rely on fairly complex auxiliary features and constraints, on both the MATRES and TCR datasets, and produces more accurate event TempRel detection results even when compared to the JCL-base model. It is worth noticing that the Poincaré event embeddings (static RoBERTa) are trained with a shallow network with just 1.5k parameters, taking only about 4 minutes to train on a single RTX 2080 Ti. When further fine-tuning RoBERTa on MATRES, we observe an additional improvement of 1.8% in F1.
Using our proposed alternative end-to-end HGRU model, HGRU (static ELMo) outperforms both the standard LSTM model and the variant incorporating commonsense knowledge (LSTM+knowledge) on MATRES. This verifies that the hyperbolic-based method is more effective than its Euclidean counterparts. For a fair comparison with the state-of-the-art model (Wang et al., 2020) on this task, we utilize RoBERTa and the auxiliary temporal commonsense knowledge, since Wang et al. (2020) also fine-tune RoBERTa on MATRES and use external commonsense knowledge. The results show that HGRU (RoBERTa) + knowledge outperforms JCL-base (Wang et al., 2020) significantly, by 7% in F1. It even outperforms JCL-all, which further incorporates logic constraints and multi-task learning.
On the TCR dataset, the proposed hyperbolic-based methods show similar improvements over existing methods. As for the difference between the two proposed methods, the Poincaré event embeddings achieve a higher F1 score on TCR. The reason is that HGRU tends to predict more VAGUE labels, but there is no VAGUE relation in TCR. Interestingly, both proposed methods predict fewer VAGUE labels when using static RoBERTa. A detailed breakdown of the results for each temporal relation, along with the training cost, is presented in Appendix B.
Ablation Study To study the impact of the different components of HGRU on event TempRel extraction, we conduct an ablation study.
First, the HGRU layer is removed and the contextual embeddings of events are directly fed into a hyperbolic FFNN (HFFNN), while all the other hyperparameters are kept frozen. The resulting performance of the HFFNN is significantly lower than that of HGRU, which indicates that the temporal information is spread across different time steps of the pre-trained language model output and is better encoded by a recurrent architecture. HGRU w/o $d_{\mathbb{D}}$ shows the impact of the hyperbolic distance feature $d_{\mathbb{D}}(h_u, h_v)$ and its parameter $w_o$ in Eq. 14. The results show that the hyperbolic distance between the hidden states of the two events encodes relevant information for predicting the event temporal relations.
Although Ganea et al. (2018b) reported that mixing hyperbolic neural networks with Euclidean Multinomial Logistic Regression (EMLR) can at times achieve better performance than pure hyperbolic networks, we observe no significant difference on the MATRES dataset. Finally, as discussed earlier, fine-tuning RoBERTa gives better performance than static RoBERTa and ELMo. Additionally, our ablation study on the Poincaré event embeddings shows that, without the angular loss (α = 1, Eq. 10), the model achieves only 62.4 in F1, compared to 77.1 when using it (static RoBERTa).
Case Study Figure 4 shows a set of Poincaré event embeddings resulting from the method proposed in Section 4.1. The events are numbered according to the temporal relations detected by the model (a smaller number denotes an earlier event). Among them, it is worth noting the two temporal paths between the expected approval from the bank (e2) and the final acquisition of shares (e5), i.e., e2 → e5 and e2 → e3 → e4 → e5. The first, direct path is accompanied by a more fine-grained one, specifying the clearance granted by the authorities (e3) to permit (e4) the bank acquisition and the consequent buying of tendered shares (e5). The model has encoded more recent events closer to the origin and events further in the past closer to the border, while simultaneously linking their information at different granularity levels, resulting in a hierarchical structure with asymmetric connections.

Conclusion
In this paper, we proposed to model event temporal relations while overcoming the limitations of Euclidean representations, designing two TempRel extraction methods based on hyperbolic geometry. The first approach highlighted the convenience of learning event embeddings in the Poincaré ball, achieving performance on par with recent methodologies using a simple rule-based classifier. We then designed a hyperbolic neural network, incorporating temporal commonsense, that outperforms state-of-the-art models on the standard datasets. Finally, a qualitative analysis pointed out the inherent advantage of employing hyperbolic spaces to encode asymmetric relations. In the future, we plan to extend our approaches to a wider spectrum of event relations, including causal and sub-event relations.

Figure 1 :
Figure 1: Events annotated with temporal relations from a document excerpt. Arrow lines represent the Before relations, while red dashed lines represent the Vague ones.

Figure 2 :
Figure 2: An illustration of the Poincaré embedding used to encode two events u and v with a known temporal relation. θ1 is the angle of the event pair (u, v), while θ2 is the angle of an event pair (u, v′) resulting from the negative sampling process.
is (e1:leaving) for the US to start a new career, he (e2:said).

Figure 3 :
Figure 3: In the hyperbolic neural architecture for temporal relation extraction, sentence tokens are first associated with standard RoBERTa vectors (in the Euclidean space). They are subsequently mapped into a Poincaré ball and processed using Hyperbolic Feed-Forward Layers (H-FFL) and Hyperbolic GRUs (H-GRU). Then, a masking process ensures that only the event-related vectors are aggregated via Möbius operations, along with their $d_{\mathbb{D}}$ distance and the relevant temporal commonsense, extracted by a Siamese network pre-trained on the TEMPROB knowledge base. Finally, the distribution over event temporal relations is derived using a Hyperbolic Multinomial Logistic Regression (H-MLR), analogous to a traditional softmax layer in the Euclidean space.

Figure 4 :
Figure 4: A document excerpt from the MATRES dataset and the related temporal event embedding generated by the Poincaré embedding method.

Table 1 :
The number of event pairs under each of the four relation classes in the MATRES and TCR datasets.

Table 2 :
Results on MATRES and TCR. Results presented in the top half are either taken directly from the cited papers or produced by employing the original source code supplied by the authors (models denoted by '*'). Results presented in the lower half are generated from our proposed models and their variants. Due to the limited size of TCR, we follow Ning et al. (2019) in using the temporal relations in TCR to test the model trained on MATRES. The statistics of the data used in our experiments are shown in Table 1.

Table 3 :
Ablation experiments on the MATRES dataset.