Extracting Temporal Event Relation with Syntax-guided Graph Transformer

Extracting temporal relations (e.g., before, after, and simultaneous) among events is crucial to natural language understanding. One of the key challenges of this problem is that when the events of interest are far apart in text, the context in between often becomes complicated, making it difficult to resolve their temporal relationship. This paper thus proposes a new Syntax-guided Graph Transformer network (SGT) to mitigate this issue, by (1) explicitly exploiting the connection between two events based on their dependency parsing trees, and (2) automatically locating temporal cues between two events via a novel syntax-guided attention mechanism. Experiments on two benchmark datasets, MATRES and TB-Dense, show that our approach significantly outperforms previous state-of-the-art methods on both end-to-end temporal relation extraction and temporal relation classification. The improvement also proves robust on the contrast set of MATRES. The code is publicly available at https://github.com/VT-NLP/Syntax-Guided-Graph-Transformer.


Introduction
Temporal relationships, e.g., Before, After, and Simultaneous, are important for understanding the process of complex events and reasoning over them. Extracting temporal relationships automatically from text is thus an important component in many downstream applications, such as summarization (Jiang et al., 2011; Ng et al., 2014), dialog understanding and generation (Ritter et al., 2010), reading comprehension (Harabagiu and Bejan, 2005; Ning et al., 2020; Huang et al., 2019) and future event prediction (Lin et al., 2022). While event mentions can often be detected reasonably well (Lin et al., 2020), extracting event-event relationships, especially temporal relationships, remains challenging (Chen et al., 2021).
Figure 1 example. Temporal Relation (e 1 , e 2 ): Before. S1: Now, Lockheed Martin, which (e 1 : bought) an early version of such a computer from the Canadian company D-Wave Systems two years ago, is confident enough in the technology to upgrade it to commercial scale, becoming the first company to (e 2 : use) quantum computing as part of its business.

Recent studies (Han et al., 2019b; Ning et al., 2017; Vashishtha et al., 2019; Wang et al., 2020a) have shown improved performance in temporal relation extraction by leveraging the contextual representations learned from pre-trained language models (Devlin et al., 2018; Liu et al., 2019). However, one remaining challenge of this task is that it requires accurate characterization of the connection between two event mentions and the cues indicating their temporal relationship, especially when the context is wide and complicated. For instance, by manually examining 200 examples of human-annotated temporal relations from the MATRES (Ning et al., 2018) dataset, we find that about 52% of the temporal cues come from the connection between two event mentions (e.g., S1 in Fig. 1), 39% from their surrounding contexts (S2 in Fig. 1) and the remaining 9% from others, e.g., event co-reference or subordinate clause structures (S3 in Fig. 1).

Figure 2: Architecture overview. The tokens highlighted with red and blue in the Input Sentence show the source and target events to be detected. The bold edges in the Input Graph Structure indicate the triples from the dependency path between the source and target event mentions as well as their surrounding context, and are attended to by the syntax-guided attention.
Syntactic features, such as dependency parsing trees, proved effective for characterizing the connection between two event mentions in pre-neural methods (Chambers, 2013; Mirza and Tonelli, 2016). However, how to make use of these features has been under-explored since the adoption of neural methods in this field. This paper closes this gap with a novel Syntax-guided Graph Transformer (SGT) network: in addition to the attention heads of a typical Graph Transformer, we bring in a new attention mechanism that specifically looks at the path from a source node to a target node over dependency parsing trees. SGT thus not only learns event representations as in a typical Graph Transformer, but also provides a way to represent syntactic dependency information between a pair of events (for temporal relation extraction, this means attending to the aforementioned temporal cues). We conduct experiments on two benchmark datasets, MATRES (Ning et al., 2018) and TB-DENSE, on both end-to-end temporal relation extraction and classification, which demonstrate the effectiveness of SGT over previous state-of-the-art methods. Experiments on the contrast set (Gardner et al., 2020) of MATRES further prove the robustness of our approach.

Figure 2 shows the overview of our approach. Given an input sentence s = [w 1 , w 2 , ..., w n ] with n tokens, we aim to detect a set of event mentions {e 1 , e 2 , ...}, where each event mention e i may contain one or multiple tokens, by leveraging the contextual representations learned from a pre-trained BERT (Devlin et al., 2018) encoder. Then, following previous studies (Ning et al., 2017, 2019; Han et al., 2019b; Wang et al., 2020a), we consider each pair of event mentions detected from one or two continuous sentences, and predict their temporal relationship.

Approach
To effectively capture the temporal cues between two event mentions, we build a dependency graph from one or two input sentences and design a new Syntax-guided Graph Transformer network that automatically learns a new contextual representation for each event mention by considering both the triples in which it is locally involved and the triples along the dependency path between the two event mentions within the dependency graph. Finally, the two event mention representations are concatenated to predict their temporal relationship.

Sequence Encoder
Given an input sentence s = [w 1 , w 2 , ..., w n ], we apply the same tokenizer as BERT (Devlin et al., 2018) to get all the subtokens. Then, we feed the sequence of subtokens as input to a pre-trained BERT model to get a contextual representation for each token w i . If a token w i is split into multiple subtokens, we use the contextual representation of the first subtoken to represent w i . To enrich the contextualized representations, for each token we create a one-hot Part-of-Speech (POS) tag vector and concatenate it with the BERT contextual embedding. In this way, we obtain a final representation c i for each w i . These representations will later be used for event mention detection and also serve as the initial representations for our Syntax-guided Graph Transformer network.
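The two pooling steps above (first-subtoken selection, then one-hot POS concatenation) can be sketched as follows. This is an illustrative toy example, not the paper's implementation: the hidden size, the tag inventory, and the alignment indices are all made up for demonstration.

```python
import numpy as np

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADP", "DET"]  # toy tag inventory

def pool_first_subtoken(subtoken_embs, word_to_first_subtoken):
    """Keep only the FIRST subtoken's embedding for each original token."""
    return subtoken_embs[word_to_first_subtoken]  # (n_words, hidden)

def add_pos_onehot(token_embs, pos_tags):
    """Concatenate a one-hot POS vector to each token embedding."""
    onehots = np.zeros((len(pos_tags), len(POS_TAGS)))
    for i, tag in enumerate(pos_tags):
        onehots[i, POS_TAGS.index(tag)] = 1.0
    return np.concatenate([token_embs, onehots], axis=-1)

# Pretend "Lockheed bought it" tokenizes into 4 subtokens,
# with "Lockheed" split into two pieces.
subtoken_embs = np.random.rand(4, 8)   # 4 subtokens, hidden size 8
word_to_first = np.array([0, 2, 3])    # first-subtoken index per word
token_embs = pool_first_subtoken(subtoken_embs, word_to_first)
final = add_pos_onehot(token_embs, ["NOUN", "VERB", "DET"])
print(final.shape)  # (3, 13): hidden size 8 plus 5 POS dimensions
```

The final representation c i is then the concatenation of the first-subtoken BERT vector and the POS one-hot vector, as the shape check above shows.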

Event Detection
To detect event mentions from the sentence, we take the contextual representation of each word as input to a binary linear classifier to determine whether it is an event mention or not, which is optimized by minimizing the following binary cross-entropy loss:

ỹ i = softmax(W eve c i + b eve), L eve = − Σ s∈S Σ i Σ π∈{0,1} α π · y i,π · log ỹ i,π

where L eve denotes the cross-entropy loss for event detection, S is the set of sentences in the training dataset, and α π is a weight coefficient for each class (0 or 1) that mitigates the data imbalance problem, with α 0 + α 1 = 1. y i,π is a binary indicator showing whether π is the same as the ground-truth binary label (y i,π = 1) or not (y i,π = 0), and ỹ i,π denotes the probability of the i-th token in s being predicted with the binary class label π. W eve and b eve are learnable parameters.
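The class-weighted binary cross-entropy described above can be computed as in the following minimal sketch; the probabilities, labels, and the weights α are toy values chosen for illustration, not values from the paper.

```python
import numpy as np

def weighted_bce(probs, labels, alpha):
    """probs[i, pi]: predicted probability of class pi for token i.

    Implements L = -sum_i sum_pi alpha[pi] * 1[pi == y_i] * log probs[i, pi],
    i.e., the indicator selects the gold class and alpha re-weights it.
    """
    loss = 0.0
    for i, y in enumerate(labels):
        for pi in (0, 1):
            indicator = 1.0 if pi == y else 0.0
            loss -= alpha[pi] * indicator * np.log(probs[i, pi])
    return loss

probs = np.array([[0.9, 0.1],   # token 1: almost surely not an event
                  [0.2, 0.8]])  # token 2: likely an event
labels = [0, 1]
alpha = {0: 0.3, 1: 0.7}        # assumed weights; event tokens are rarer
print(round(weighted_bce(probs, labels, alpha), 4))  # 0.1878
```

Because the indicator zeroes out the non-gold class, only the gold-class log-probability of each token contributes, scaled by its class weight.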

Syntax-guided Graph Transformer
As the example sentences in Fig. 1 show, the temporal cues for characterizing the temporal relationship between two event mentions mainly come from their surrounding contexts as well as the connections along their syntactic dependency path. However, a sequence encoder usually fails to capture such information, especially when the context between two event mentions is complicated; we therefore design a new Syntax-guided Graph Transformer (SGT) network.
Given a source event e s and a target event e t detected from one or two continuous sentences, we apply a public dependency parser to parse each sentence into a tree-graph and connect the graphs of two continuous sentences with an arbitrary cross-sentence edge (Peng et al., 2017; Cheng and Miyao, 2017) pointing from the root node of the preceding sentence to the root node of the following one, obtaining a graph G. For each node v i in G, we use N in i and N out i to denote the sets of neighbor triples of v i with in-going and out-going edges, respectively, where each triple contains a dependency relation r ∈ Υ and Υ is the label set of syntactic dependency relations.

Node Representation Initialization For each node v i in graph G, we map it to a particular token w i from the original sentence and obtain a contextual representation c i from the BERT encoder. Then, we learn an initial node representation for each node v i as:

h 0 i = W e c i + b e

where W e and b e are learnable parameters.
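A toy sketch of the graph construction: each dependency arc (head, relation, dependent) becomes a triple, and the two sentence roots are linked by an arbitrary cross-sentence edge. The parser output below is hand-written for illustration rather than produced by a real parser.

```python
from collections import defaultdict

def build_graph(arcs):
    """arcs: list of (head, relation, dependent) triples.

    Returns per-node in-going and out-going neighbor-triple sets
    (N_in and N_out in the paper's notation).
    """
    n_in, n_out = defaultdict(list), defaultdict(list)
    for h, r, d in arcs:
        n_out[h].append((h, r, d))   # edge leaving the head
        n_in[d].append((h, r, d))    # edge entering the dependent
    return n_in, n_out

sent1 = [("bought", "nsubj", "Lockheed"), ("bought", "obj", "computer")]
sent2 = [("use", "obj", "computing")]
# connect the root of the preceding sentence to the root of the following one;
# "next" is a made-up label for the arbitrary cross-sentence edge
cross = [("bought", "next", "use")]

n_in, n_out = build_graph(sent1 + sent2 + cross)
print(n_out["bought"])  # three outgoing triples, incl. the cross-sentence edge
print(n_in["use"])      # [('bought', 'next', 'use')]
```

In practice the arcs would come from a dependency parser such as spaCy, with one root per sentence.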
Graph Multi-head Self-attention Following the Transformer model (Vaswani et al., 2017; Wang et al., 2020b), we adapt multi-head self-attention to learn a contextual representation for each node in the graph G. Each node v i in G is associated with a set of neighbor triples N in i ∪ N out i and a node representation h l−1 i, where l is the index of a layer in our Transformer architecture. To perform self-attention, we first apply a linear transformation to obtain a query vector based on each node v i , and employ another two linear transformations to get the key and value vectors based on the node's neighbor triples:

Q l i = W m q h l−1 i, K l ij = W m k [h l−1 j ; r ij], U l ij = W m u [h l−1 j ; r ij], r ij = W m r r + b m r

where m is the index of a particular head, Q l i denotes the query vector corresponding to node v i , and K l ij and U l ij are a key and value vector, respectively, both learned from a triple (v i , r, v j ). [ ; ] denotes the concatenation operation. r ij denotes the representation of the particular relation r between v i and v j , which is randomly initialized and optimized by the model. W m q , W m k , W m u , W m r and b m r are learnable parameters.
For each node v i , we then perform self-attention over all the neighbor triples in which it is involved, and compute a new context representation with multiple attention heads:

α l ij = softmax j (Q l i · K l ij / √d k), g l,m i = Σ j α l ij U l ij, g l i = W o [g l,1 i ; ... ; g l,M i]

where g l i is the aggregated representation computed over all neighbor triples of node v i with M attention heads at the l-th layer; g l i will later be used to learn the updated representation of node v i . √d k is the scaling factor, with d k denoting the dimension size of each key vector. W o is a learnable parameter.
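A single-head numpy sketch of this graph self-attention: the query comes from the node itself, while keys and values are built from each neighbor triple by concatenating the neighbor representation with a relation embedding. The random matrices stand in for the learned parameters, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 8, 4
W_q = rng.standard_normal((d, d_k))
W_k = rng.standard_normal((2 * d, d_k))   # input is [h_j ; r_ij]
W_u = rng.standard_normal((2 * d, d_k))

def triple_attention(h_i, neighbors):
    """neighbors: list of (h_j, r_ij) pairs, one per neighbor triple of v_i."""
    q = h_i @ W_q
    keys = np.stack([np.concatenate([h_j, r]) @ W_k for h_j, r in neighbors])
    vals = np.stack([np.concatenate([h_j, r]) @ W_u for h_j, r in neighbors])
    scores = keys @ q / np.sqrt(d_k)      # scaled dot-product scores
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                    # softmax over the neighbor triples
    return attn, attn @ vals              # aggregated representation g_i

h_i = rng.standard_normal(d)
neighbors = [(rng.standard_normal(d), rng.standard_normal(d)) for _ in range(3)]
attn, g_i = triple_attention(h_i, neighbors)
print(attn.sum(), g_i.shape)  # attention sums to 1; g_i has shape (4,)
```

With M heads one would run M such computations with separate projection matrices and concatenate the outputs before the final W_o projection.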

Syntax-guided Attention
To automatically find indicative temporal cues for two event mentions from their connection as well as their surrounding contexts, we design a new syntax-guided attention mechanism. For two event nodes v s and v t , we first extract the set of nodes on the dependency path between v s and v t (including v s and v t ), denoted as Θ st . We then collect all the triples from the dependency path between v s and v t as well as the triples in which any node from Θ st is involved, denoted as Φ st . To compute the syntax-guided attention over all the triples in Φ st , we apply three linear transformations to get the query, key and value vectors, where the query vector is obtained from the representations of the two event mentions, and the key and value vectors are computed from the triples in Φ st :

Q̃ l st = W̃ m q [h l−1 s ; h l−1 t], K̃ l ij = W̃ m k [h l−1 j ; r ij], Ũ l ij = W̃ m u [h l−1 j ; r ij]

where m is the index of a particular head, and Q̃ l st , K̃ l ij , Ũ l ij denote the query, key and value vectors, respectively. Given the query vector, we then compute the attention distribution over all triples from Φ st and get an aggregated representation denoting the meaningful temporal features captured from the connection between the two event mentions and their surrounding contexts.
where g̃ l st is the aggregated temporally related information from all the triples in Φ st based on the syntax-guided attention at the l-th layer. W p is a learnable parameter.
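The construction of Θ st and Φ st can be sketched as a path search over the (undirected view of the) dependency tree, followed by filtering the triples that touch a path node. The tiny tree below is hand-made, reusing the worked/retiring example from Fig. 3.

```python
from collections import deque

def dependency_path(arcs, src, tgt):
    """Theta_st: nodes on the shortest src-tgt path, ignoring edge direction."""
    adj = {}
    for h, _, d in arcs:
        adj.setdefault(h, []).append(d)
        adj.setdefault(d, []).append(h)
    prev, queue = {src: None}, deque([src])
    while queue:                      # BFS; a dependency tree has unique paths
        node = queue.popleft()
        if node == tgt:
            break
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    path, node = [], tgt
    while node is not None:           # walk back from target to source
        path.append(node)
        node = prev[node]
    return set(path)

def path_triples(arcs, theta):
    """Phi_st: every triple in which some node from Theta_st is involved."""
    return [t for t in arcs if t[0] in theta or t[2] in theta]

arcs = [("worked", "prep", "Before"), ("Before", "pcomp", "retiring"),
        ("worked", "nsubj", "Lowe"), ("Lowe", "appos", "Mr.")]
theta = dependency_path(arcs, "worked", "retiring")
print(sorted(theta))                   # ['Before', 'retiring', 'worked']
print(len(path_triples(arcs, theta)))  # 3: the appos triple touches no path node
```

Note that the nsubj triple is kept because its head "worked" lies on the path, matching the definition of Φ st above.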
Node Representation Fusion Each event node in graph G receives two representations, learned from the multi-head self-attention and the syntax-guided attention; we therefore fuse the two representations for both the source node v s and the target node v t :

ĥ l s = W f [g l s ; g̃ l st], ĥ l t = W f [g l t ; g̃ l st]

where g l s and g l t denote the context representations learned from the multi-head self-attention for v s and v t , and g̃ l st denotes the representation learned from the triples in Φ st using the syntax-guided attention. ĥ l s and ĥ l t are the fused representations of v s and v t , respectively. W f is a learnable parameter.
For each non-event node v i , which only receives a context representation g l i learned from the multi-head self-attention, we apply a linear projection to get a new node representation ĥ l i . Our Syntax-guided Graph Transformer encoder is composed of a stack of multiple layers, where each layer consists of the two attention mechanisms and the fusion sub-layer. We use a residual connection followed by LayerNorm in each layer to get the final representations of all the nodes:

h l i = LayerNorm(h l−1 i + ĥ l i)
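The fusion sub-layer for an event node can be sketched as below: concatenate the self-attention output g and the syntax-guided output g̃ st, project with W f, then apply the residual connection and LayerNorm. W_f and all sizes are illustrative placeholders, not learned weights.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_f = rng.standard_normal((2 * d, d))

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def fuse(h_prev, g, g_st):
    fused = np.concatenate([g, g_st]) @ W_f   # hat h for an event node
    return layer_norm(h_prev + fused)          # residual + LayerNorm

h_prev, g, g_st = (rng.standard_normal(d) for _ in range(3))
out = fuse(h_prev, g, g_st)
print(out.shape)  # (8,); the output mean is ~0 after LayerNorm
```

Stacking several such layers, each preceded by the two attention computations, yields the full SGT encoder.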

Temporal Relation Prediction
To predict the temporal relation between two event mentions e s and e t , we concatenate the final hidden states of v s and v t obtained from the Syntax-guided Graph Transformer network, and apply a Feed-forward Neural Network (FNN) to predict their relationship:

ỹ st = softmax(FNN([h L s ; h L t]))

where ỹ st denotes the probabilities over all possible temporal relations between event mentions e s and e t . The training objective is to minimize the following cross-entropy loss:

L rel = − Σ (s,t)∈∆ Σ x∈X β x · y st,x · log ỹ st,x

where ∆ denotes the total set of event pairs for temporal relation prediction and X denotes the whole set of relation labels. y st,x is a binary indicator (0 or 1) showing whether x is the same as the ground-truth label (y st,x = 1) or not (y st,x = 0). We also assign a weight β x to each class to mitigate the label imbalance issue.
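A sketch of this prediction head: concatenate the two event representations, apply a linear layer with softmax over the relation labels, and score with the class-weighted cross-entropy. The label set follows the MATRES classes, but the weights β and all matrices are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
LABELS = ["Before", "After", "Simultaneous", "Vague"]
d = 8
W = rng.standard_normal((2 * d, len(LABELS)))  # stand-in for the FNN

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(h_s, h_t):
    """tilde y_st: a distribution over the relation labels."""
    return softmax(np.concatenate([h_s, h_t]) @ W)

def weighted_ce(y_tilde, gold, beta):
    """Class-weighted cross-entropy for a single event pair."""
    return -beta[gold] * np.log(y_tilde[LABELS.index(gold)])

h_s, h_t = rng.standard_normal(d), rng.standard_normal(d)
y_tilde = predict(h_s, h_t)
loss = weighted_ce(y_tilde, "Before",
                   beta={"Before": 0.2, "After": 0.2,
                         "Simultaneous": 0.4, "Vague": 0.2})
print(y_tilde.sum(), loss > 0)  # probabilities sum to 1; loss is positive
```

Up-weighting rare classes such as Simultaneous (β = 0.4 here) is one way to counter the label imbalance the paper describes.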

Experimental Setup
We perform experiments on two public benchmark datasets for temporal relation extraction: (1) TB-DENSE, a densely annotated dataset with 6 types of relations: Before, After, Simultaneous, Includes, Is_Included and Vague.
(2) MATRES (Ning et al., 2018), which annotates verb event mentions along with 4 types of temporal relations: Before, After, Simultaneous and Vague. We use the POS tag information for MATRES provided by Ning et al. (2019). For TB-DENSE, we use spaCy to predict POS tags, based on the Universal POS tag set (https://spacy.io/api/data-formats). For both benchmark datasets, we use the same train/dev/test splits as previous studies (Ning et al., 2017, 2019; Han et al., 2019a,b). Note that, for evaluation, similar to previous work, we disregard the Vague relation in both datasets (in the evaluation phase, we simply remove all ground-truth Vague relation pairs). In addition, we only consider event pairs from adjacent sentences, since also covering event pairs from non-adjacent sentences would require an exponential number of annotations, which is beyond the scope of this study. Table 1 shows statistics of the two datasets and Table 2 shows the label distribution.

Implementation Details For fair comparison with previous baseline approaches, we fine-tune the pre-trained bert-large-cased model and optimize our model with BertAdam. We tune the hyperparameters with grid search: training epochs 10, learning rate ∈ {3e-6, 1e-5}, training batch size ∈ {16, 32}, encoder layer size ∈ {4, 12}, number of heads ∈ {1, 8}. During training, we first optimize the event extraction module for 5 epochs to warm up, and then jointly optimize both the event extraction and temporal relation extraction modules using gold event pairs for another 5 epochs.

Results
We evaluate SGT on two public benchmark datasets under two settings: (1) joint event and temporal relation extraction (Table 3); and (2) temporal relation classification, where the gold event mentions are known beforehand (Table 4). Note that in the "joint" setting, we adopt the same strategy proposed by Han et al. (2019b): we first train the event extraction module, and then jointly optimize both the event extraction and temporal relation extraction modules (using gold event pairs as input to ensure training quality). Overall, we observe that our approach significantly outperforms baseline systems in both settings, with up to 7.9% absolute F-score gain on MATRES and 2.4% on TB-DENSE. From Table 3, we see that our approach achieves better performance on event detection than baseline methods, even though they are based on the same BERT encoder. This is possibly because, during joint training, our approach leverages the dependency parsing trees, which improves the contextual representations of the BERT encoder. In Table 4, unlike other models that are based on larger contextualized embeddings such as RoBERTa, our approach with BERT-base achieves comparable performance, and further surpasses the state-of-the-art baseline methods when using BERT-large embeddings, which demonstrates the effectiveness of our Syntax-guided Graph Transformer network.
Some studies (Ning et al., 2019; Han et al., 2019b; Wang et al., 2020a; Zhou et al., 2020) focus on resolving inconsistency in terms of the symmetry and transitivity of temporal relations. For example, if event A and event B are predicted as Before, and event B and event C are predicted as Before, then predicting event A and event C as Vague or After is considered inconsistent. In contrast, our approach produces consistent predictions, with only a few inconsistent cases when the Simultaneous relation is involved. This analysis also demonstrates that our approach can correctly capture the temporal cues between two event mentions.
We also examine the correctness and robustness of our approach on a contrast set of MATRES (Gardner et al., 2020), which is created with small, meaningful manual perturbations of the original MATRES test set, such as rephrasing a sentence or changing a single word to alter the relation type. The contrast set provides a local view of a model's decision boundary, and can thus be used to more accurately evaluate a model's true linguistic capabilities. Table 5 shows that our approach significantly outperforms the baseline model on both the original test set and the corresponding contrast set. The contrast consistency in Table 5 also indicates how well a model's decision boundary aligns with the actual decision boundary of the test instances; from it we can see that by explicitly capturing temporal cues, our approach is more accurate and robust than the baseline method.

Figure 3 examples.
S1: Before (e 1 : retiring) in 1984, Mr. Lowe (e 2 : worked) as an inspector of schools with the department of education and sciences, and he leaves three sons from a previous marriage.
S2: Mr. Erdogan has long (e 1 : sought) an apology for the raid in May 2010 on the Mavi Marmara, which was part of a flotilla that (e 2 : sought) to break Israel's blockade of Gaza.

Ablation Study
We further conduct ablation studies to compare the performance of our approach with two ablated versions of our method: (1) BERT with Graph Transformer (BERT-GT), for which we remove the syntax-guided attention and only rely on the standard multi-head self-attention to obtain graph-based contextual representations of two event mentions and then predict their relation; and (2) BERT, for which we further remove the Graph Transformer, and only use the pre-trained BERT language model to encode the sentence and predict the temporal relationship of two event mentions based on their contextual representations.

Table 6: Ablation study on MATRES. We use BERT-base as the comparison basis.
Table 6 shows that by adding the Graph Transformer, BERT-GT achieves a 2.0% absolute F-score improvement over the BERT baseline model, demonstrating the benefit of dependency parsing trees for temporal relation prediction. Further adding the new syntax-guided attention to the Graph Transformer brings another 1.8% absolute improvement in F-score, showing the effectiveness of our new Syntax-guided Graph Transformer and the importance of capturing temporal cues from the connection between two event mentions as well as their surrounding contexts.

Figure 3 shows two examples as qualitative analysis. In S1, BERT mistakenly predicts the temporal relation as Before, probably because it is confused by the context word Before. However, by incorporating the dependency graph, especially the triples {worked, prep, Before} and {Before, pcomp, retiring} and the path between the two event mentions, worked→prep→Before→pcomp→retiring, both BERT-GT and our approach correctly determine the relation as After. In S2, both BERT and BERT-GT mistakenly predict the temporal relation as Before, as the context between the two event mentions is very wide and complicated, and the two event mentions are not close within the dependency graph. However, by explicitly considering and understanding the connection between the two event mentions, sought(e 1 )→on→Marmara→was→part→Flotilla→sought(e 2 ), our approach correctly predicts the temporal relation between the two event mentions.

SGT on Temporal Cues
To analyze the source of temporal cues for relation prediction, we randomly sample 100 correct event relation predictions given gold event mentions from MATRES and select the triple that has the highest temporal attention weight from the last layer of the Syntax-guided Graph Transformer network as a temporal cue candidate. We manually evaluate the validity of each temporal cue candidate, and further analyze if the cue is from the dependency path between two event mentions, their surrounding context, or both. Our analysis shows that about 64% of the temporal cues are valid, 37% of them come from the dependency path, 17% are from local context, and the remaining 10% are from both. This verifies our initial observation that most of the temporal cues are from the dependency path between two event mentions as well as their surrounding context. It also demonstrates the effectiveness of our new syntax-guided attention mechanism.

Impact of Wide Context
We further illustrate the impact of context width on both the baseline model and our approach. For a fair comparison, we use three context-width categories: context length < 10, context length between 10 and 20, and context length > 20. As shown in Fig. 4, the first category has 267 pairs, the second 343 pairs, and the third 817 pairs. From our results, we observe that the BERT baseline fails to predict the temporal relation of two event mentions with wide context, while working well when the event mentions are close to each other. Our model performs slightly worse in the second category but is in general very good at predicting the temporal relationship for event mentions with both short and wide context. This also proves the benefit of syntactic parsing trees for the prediction of temporal relationships. For the second category, where the context length is between 10 and 20, the performance of our approach slightly drops for two reasons: (1) the training samples within this range are not as plentiful as in the other two categories; and (2) for most event pairs in this category, the dependency path is very long and there are no explicit temporal indicative features within the context or dependency path, making it more difficult for the model to predict the temporal relationship.

Figure 4: Context width analysis on TB-DENSE. The X axis shows the number of tokens between two event mentions. The left Y axis shows the data distribution of each width category, indicated with blue bars. The right Y axis denotes the micro F-score for each width category.

Remaining Errors
We randomly sample 100 classification errors from the output of our approach and categorize them into four categories, as shown in Figure 5. The first category is due to complex or ambiguous context (54% of the total errors). The second category is due to complicated subordinate clause structure, especially clauses related to quoted or reported speech, e.g., S2 in Figure 5. The third category arises because our approach cannot correctly differentiate actual events from hypothetical and intentional events, while in most cases the temporal relation among hypothetical and intentional events is annotated as Vague. The last category is due to the lack of sufficient annotation: we observe that none of the Simultaneous relations can be correctly predicted on the MATRES dataset, as the percentage of Simultaneous (3.7%) is much lower than that of other relation types. In the TB-DENSE dataset, labels are even more imbalanced: the percentage of the Vague relation is over 50%, while the percentages of Includes, Is_Included and Simultaneous are all less than 4%.
Similar to our approach, several studies (Ling and Weld, 2010; Nikfarjam et al., 2013; Mirza and Tonelli, 2016; Meng et al., 2017; Cheng and Miyao, 2017; Huang et al., 2017) also explored the syntactic path between two events for temporal relation extraction. Different from previous work, our approach considers three important sources of temporal cues: local context, denoting the neighbors of each event node within the dependency graph; the connection of two event mentions, which is based on the dependency path between them; and the rich semantics of concepts and dependency relations, for example, the dependency relation nmod between two event mentions usually indicates a Before relationship. All these indicative features are automatically selected and aggregated with the multi-head self-attention and our new syntax-guided attention mechanism.

Figure 5: Types of remaining errors.
Complex Context (54%) S1: "This is not a Lehman," he (e 1 : said) to the disastrous chain reaction (e 2 : touched) off by the collapse of Lehman Brothers in 2008. (After)
Subordinate Clause (22%) S2: "We were pleased that England and New Zealand knew about it, and we (e 1 : thought) that's where it would stop." He also (e 2 : talked) about his "second job" as the group's cameraman. (Vague)
Hypothetical and Intentional Events (18%) S3: The day before Raymond Roth was (e 1 : pulled) over, his wife, Ivana, showed authorities emails she had discovered that (e 2 : appeared) to detail a plan between him and his son to fake his death. (Vague)
Imbalanced Labels (6%) S4: Microsoft (e 1 : said) it has identified three companies for the China program to (e 2 : run) through June. (Simultaneous)
Our work is also related to the variants of Graph Neural Networks (GNN) (Kipf and Welling, 2016;Veličković et al., 2018;Zhou et al., 2018), especially Graph Transformer (Yun et al., 2019;Hu et al., 2020;Wang et al., 2020b). Different from previous GNNs which aim to capture the context from neighbors of each node within the graph, in our task, we aim to select and capture the most meaningful temporal cues for two event mentions from their connections within the graph as well as their surrounding contexts.

Conclusion
Temporal relationships between events are important for understanding stories described in natural language text, and a main challenge is how to discover and make use of the connection between two event mentions, especially when the event pair is far apart in text. This paper proposes a novel Syntax-guided Graph Transformer (SGT) that represents the connection between an event pair via additional attention heads over dependency parsing trees. Experiments on the benchmark datasets MATRES and TB-DENSE, and on a contrast set of MATRES, show that our approach significantly outperforms previous state-of-the-art methods in a variety of settings, including event detection, temporal relation classification (where events are given), and temporal relation extraction (where events are predicted). In the future, we will investigate the potential of this approach for other relation extraction tasks.