TIMERS: Document-level Temporal Relation Extraction

We present TIMERS, a TIME-, Rhetorical-, and Syntactic-aware model for document-level temporal relation classification in English. Our proposed method leverages rhetorical discourse features and temporal arguments from semantic role labels, in addition to traditional local syntactic features, learned through a Gated Relational-GCN. Extensive experiments show that the proposed model outperforms previous methods by 5-18% on the TDDiscourse, TimeBank-Dense, and MATRES datasets, owing to our discourse-level modeling.


Introduction
Temporal relation extraction (TempRel) is a challenging task that involves determining the temporal order between two events in a text (Pustejovsky et al., 2003). Understanding the temporal ordering of events in a document plays a key role in downstream tasks such as timeline creation (Leeuwenberg and Moens, 2018), time-aware summarization (Noh et al., 2020), temporal question-answering (Ning et al., 2020), and temporal information extraction (Leeuwenberg and Moens, 2019).
Prior work focuses on extracting temporal relations between event pairs (a.k.a. TLINKS) present in the same sentence (intra-sentence TLINKS) or adjacent sentences (inter-sentence TLINKS), mostly ignoring document-level pairs (cross-sentence TLINKS) (Reimers et al., 2016). Past works have used RNNs (Cheng and Miyao, 2017; Meng et al., 2017; Goyal and Durrett, 2019; Ning et al., 2019; Han et al., 2019a,b,c, 2020b) and Transformer networks (Ballesteros et al., 2020; Zhao et al., 2020b) for encoding a few sentences or a short paragraph, but these do not capture long-range dependencies and multi-hop reasoning at the document level. This shortcoming is highlighted by the TDDiscourse dataset (Naik et al., 2019), which was designed to stress global discourse-level challenges, e.g., multi-hop chain reasoning, future or hypothetical events, and reasoning requiring world knowledge.
We propose TIMERS, a TIME-, Rhetorical-, and Syntactic-aware model for document-level temporal relation extraction. TIMERS uses discourse features in the form of connections from Rhetorical Structure Theory (RST) parses (Bhatia et al., 2015) to leverage long-range inter-sentential relationships. It also extends existing contextual embeddings with structural and syntactic dependency-parse connections. Lastly, it uses timex-timex relations, DCT (document creation time)-timex relations, and temporal arguments obtained via sentence-level semantic role labeling. These rhetorical, syntactic, and temporal features are learned through a modified version of Relational Graph Convolutional Networks (R-GCN) (Schlichtkrull et al., 2018) with a gating mechanism (GR-GCN), which models highly relational data in densely connected graphs.
Our main contribution is a document-level model that incorporates these three features to improve temporal relationship extraction. We obtain state-of-the-art performance across three datasets with 5-18% relative improvement, showing improvement for events that require chain reasoning, causal prerequisite links, and future events.

Methodology
Let document D be defined as a sequence of n tokens w_i ∈ W = {w_1, ..., w_n}. The entire document is a list of m sentences V = [v_1, ..., v_m]. Each document has a set of p events E = {e_1, ..., e_p} and q timexes T = {t_1, ..., t_q}, where p, q ≤ n. The creation date of the document is represented by the timestamp t_DCT. We denote the source and target events by e_s and e_t, respectively. The task is to identify the temporal relation y ∈ R between the source and target event in a multi-class classification setup, where R is the set of all possible temporal links (TLINKs).
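As a minimal sketch of the task setup above, the notation can be encoded as plain data structures. The class and field names below are ours for illustration, not the paper's code, and the TLINK label set shown is a representative example (the actual label sets differ across TDDiscourse, TimeBank-Dense, and MATRES).

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative label set R; datasets use their own inventories.
TLINKS = ["BEFORE", "AFTER", "SIMULTANEOUS", "INCLUDES", "IS_INCLUDED"]

@dataclass
class Document:
    tokens: List[str]                 # w_1 .. w_n
    sentences: List[Tuple[int, int]]  # [start, end) token spans for v_1 .. v_m
    events: List[int]                 # token indices of events e_1 .. e_p
    timexes: List[int]                # token indices of timexes t_1 .. t_q
    dct: str                          # document creation timestamp t_DCT

@dataclass
class Instance:
    doc: Document
    source: int  # index into doc.events (e_s)
    target: int  # index into doc.events (e_t)
    label: str   # y ∈ R, one of TLINKS

doc = Document(
    tokens=["Rains", "hit", "Delhi", "before", "floods", "followed", "."],
    sentences=[(0, 7)],
    events=[1, 5],  # "hit", "followed"
    timexes=[],
    dct="1998-02-27",
)
pair = Instance(doc=doc, source=0, target=1, label="BEFORE")
```

One `Instance` corresponds to one multi-class classification decision over a (source, target) event pair.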
To solve this task, our model (Fig. 1) encodes the document through three graphs, described in the following subsections: a syntactic-aware, a time-aware, and a rhetorical-aware graph.

Syntactic-Aware Graph
The syntactic graph captures the document structure and word dependencies. Our syntactic-aware graph (G_SG) contains separate nodes for the document D, each of its sentences v_i ∈ V, and all constituent words w_i ∈ W of each sentence. The edges of the syntactic graph encode five relations: (1) Document-Sentence Affiliation and (2) Sentence-Word Affiliation model the hierarchical structure of the document through a directed edge from the document node to each sentence node and from a sentence node to each word in the sentence.
(3) Sentence-Sentence Adjacency and (4) Word-Word Adjacency preserve sequential ordering through edges between consecutive sentence and word nodes. (5) Word-Word Dependency encodes the syntactic nature of word-level relationships by adding an undirected edge between two word nodes if they share a parent-child relationship in the sentence-level dependency tree. We use BERT to encode each w_i and obtain sentence embeddings v_i by averaging the second-to-last hidden layer of BERT over each sentence's tokens. The document embedding is computed as the average of all sentence embeddings (D = (1/m) Σ_{i=1}^{m} v_i).
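The five edge types above can be sketched with a toy graph builder. Node naming, the relation labels, and the `dep_heads` input format are our illustrative assumptions, not the paper's implementation.

```python
def build_syntactic_graph(sentences, dep_heads):
    """sentences: list of token lists; dep_heads: per-sentence list of
    dependency head indices (-1 for the root). Returns a list of
    ((src, dst), relation) pairs for the five G_SG edge types."""
    edges = []
    doc = "DOC"
    word_id = lambda i, j: f"s{i}_w{j}"
    for i, sent in enumerate(sentences):
        edges.append(((doc, f"s{i}"), "doc-sent"))                # (1)
        if i + 1 < len(sentences):
            edges.append(((f"s{i}", f"s{i+1}"), "sent-adj"))      # (3)
        for j, _tok in enumerate(sent):
            edges.append(((f"s{i}", word_id(i, j)), "sent-word")) # (2)
            if j + 1 < len(sent):
                edges.append(((word_id(i, j), word_id(i, j + 1)),
                              "word-adj"))                        # (4)
            h = dep_heads[i][j]
            if h >= 0:  # (5) undirected dependency edge, stored once
                edges.append(((word_id(i, h), word_id(i, j)), "word-dep"))
    return edges

sents = [["Rains", "hit", "Delhi"], ["Floods", "followed"]]
heads = [[1, -1, 1], [1, -1]]  # "hit" and "followed" are roots
g = build_syntactic_graph(sents, heads)
```

A real implementation would store the dependency edge in both directions (or mark it undirected) and attach BERT-derived features to each node.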

Time-Aware Graph
When events are anchored to a specific time, it becomes easier to infer event relationships from their associated dates and times. The time-aware graph (G_TG) exploits this intuition and propagates relational information among events, timexes, and the Document Creation Time (DCT). The document node D corresponds to the document creation date, while the timexes t_i and events e_i are characterized by their corresponding word nodes in the syntactic graph. We design three types of edge connections: (1) DCT-Timex Association: exploits the ordering of timexes with respect to the document creation time through directed weighted edges from the DCT to each timex.
(2) Timex-Timex Association: captures inherent non-local timeline ordering between timex pairs through a directed weighted edge.
(3) Predicate-Temporal Argument: anchors local temporal relations at the sentence level by connecting each event verb predicate to its temporal argument with a directed edge. The connections formed between temporal entities help propagate information from the source event to the target event while exploring interactions with other events, timexes, the DCT, and temporal arguments. We calculate timestamps for timexes and the DCT from the annotated TimeML format of input documents. The weight of the DCT-timex and timex-timex edges is determined by the temporal order of the entities {After, Before, Simultaneous, None}; we use None when one of the timestamps cannot be anchored in time.
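The three edge types of G_TG can be sketched as follows. Node ids, relation names, and the `(predicate, temporal_arg)` input format are our illustrative assumptions; edge weights (After/Before/Simultaneous/None) would be attached separately from the timestamps.

```python
def build_time_graph(timex_ids, srl_frames):
    """timex_ids: list of timex node ids; srl_frames: list of
    (predicate_node, temporal_arg_node) pairs taken from SRL ARGM-TMP
    spans. Returns (src, dst, relation) triples for G_TG."""
    edges = []
    for t in timex_ids:                     # (1) DCT-Timex Association
        edges.append(("DCT", t, "dct-timex"))
    for a in timex_ids:                     # (2) Timex-Timex Association
        for b in timex_ids:
            if a != b:
                edges.append((a, b, "timex-timex"))
    for pred, tmp in srl_frames:            # (3) Predicate-Temporal Argument
        edges.append((pred, tmp, "pred-tmparg"))
    return edges

g_tg = build_time_graph(["t1", "t2"], [("hit", "t1")])
```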

Rhetorical-Aware Graph
We use discourse features based on Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) to leverage long-range inter-dependencies through a discourse tree. The rhetorical discourse tree of a document contains nodes of phrases, where each phrase (a.k.a. Elementary Discourse Unit, or EDU) is contiguous, adjacent, and non-overlapping. The interdependencies among EDUs are represented by conventional rhetorical relations (Mann, 1987), e.g., Elaboration, Span, Condition, Attribution. Prior work showed that discourse features in the form of RST connections help leverage long-range document-level interactions between phrase units (Bhatia et al., 2015) and identify background/foreground events (Aldawsari et al., 2020).
The Elementary Discourse Unit (EDU), a sub-sentence phrase unit, is the minimal selection unit for discourse segmentation of a document. We generate the document's vector representations at the EDU level, h_i ∈ H = {h_1, ..., h_d}, via the Self-Attentive Span Extractor (SpanExt) of Lee et al. (2017) over the BERT token embeddings. We use the converted dependency version of the tree to build the rhetorical-aware graph (G_DG) by treating every discourse dependency from the i-th EDU to the j-th EDU as a directed edge weighted by the type of the rhetorical relation.
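The self-attentive span pooling used for EDU representations can be sketched in miniature: softmax a learned per-token score over the span and take the weighted sum of token embeddings. This toy version uses plain lists and hand-set scores instead of learned parameters.

```python
import math

def span_extract(token_embs, scores):
    """Toy self-attentive span extractor (after Lee et al., 2017):
    token_embs is a list of equal-length embedding vectors for one EDU,
    scores is the per-token attention logit. Returns the attention-
    weighted sum of the embeddings."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # softmax over the span
    dim = len(token_embs[0])
    return [sum(w * emb[d] for w, emb in zip(weights, token_embs))
            for d in range(dim)]

# With uniform scores the result reduces to the mean of the embeddings.
edu_vec = span_extract([[1.0, 0.0], [3.0, 0.0]], [0.0, 0.0])
```

In the full model the scores come from a learned feed-forward layer over BERT token embeddings, and one such vector is produced per EDU node of G_DG.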

Temporal Relation Extraction
Each graph is instantiated as a gated variant of Relational Graph Convolutional Networks (R-GCN) (Schlichtkrull et al., 2018), which we term the Gated Relational Graph Convolution Network (GR-GCN). GR-GCN propagates messages among the nodes to obtain learned node representations. Fig. 2 shows how the learned representations obtained from the syntactic-aware graph form the input to the time-aware graph. For the time-aware graph, the learned representations of the nodes corresponding to the source event e_s and target event e_t are extracted (O_T). In the case of the rhetorical graph, the span representations of the EDU nodes containing the source and target events are extracted (O_EDU).
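A single GR-GCN update can be sketched as an R-GCN message step (per-relation weights, plus a self-loop) combined with a sigmoid gate that interpolates between the new message and the previous node state. This is a scalar toy with hand-set weights; the paper's exact gating formulation and the per-relation normalization of R-GCN are simplified away here.

```python
import math

def gr_gcn_step(h, edges, w_rel, w_self, w_gate):
    """h: node -> scalar feature; edges: (src, dst, rel) triples;
    w_rel: rel -> scalar relation weight. Returns updated features
    gated between the aggregated message and the old state."""
    new_h = {}
    for v in h:
        msg = w_self * h[v]  # self-loop term
        for src, dst, rel in edges:
            if dst == v:
                msg += w_rel[rel] * h[src]  # relation-specific message
        gate = 1.0 / (1.0 + math.exp(-w_gate * msg))  # sigmoid gate
        new_h[v] = gate * math.tanh(msg) + (1.0 - gate) * h[v]
    return new_h

h2 = gr_gcn_step({"a": 1.0, "b": 0.5},
                 [("a", "b", "dep")],
                 {"dep": 0.5}, 1.0, 1.0)
```

In the real model each node feature is a vector, each `w_rel[rel]` is a relation-specific weight matrix, and the step is applied per graph (G_SG, G_TG, G_DG).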
The outputs corresponding to the source and target nodes learned by G_TG (O_T) and G_DG (O_EDU) are concatenated with the output of a BERT-based context encoder (O_CE), similar to the BERT encoding in Zhao et al. (2020a). This is followed by a Softmax layer to predict temporal relations.

Data
We train and test our proposed model on the TDDMan and TDDAuto subsets of the TDDiscourse corpus (Naik et al., 2019), which was designed to explicitly focus on global discourse-level temporal ordering. We also train and evaluate our method on the MATRES and TimeBank-Dense datasets, both of which primarily consist of local TLINKs that occur in either the same or adjacent sentences. Table 1 reports the data statistics and label distributions. Naik et al. (2019) show the distribution of distances between event pairs for all TLINKs in the TDD test set and note that nearly 53% of TLINKs in the TDD dataset comprise event pairs that are more than 5 sentences apart. Like Cheng and Miyao (2017), we report results on the non-vague labels of TimeBank-Dense. MATRES has no standard validation set; hence, we follow the split used in Ning et al. (2019).

Experimental Settings
Token Encoding: The word-level token representations are obtained by summing the corresponding BERT embeddings from the last 4 layers of a pre-trained BERT-base encoder. Syntactic Dependency Parser: The dependency parse tree of each sentence is obtained via SpaCy to form word-word dependency connections in the syntactic-aware graph. Semantic Role Labeller: We extract semantic role labels using AllenNLP's SRL parser, which internally uses SRL-BERT (Shi and Lin, 2019), to obtain the temporal arguments corresponding to each verb event. Timex Normalization: Timex phrases are treated as a single unit for the purpose of graph construction by average-pooling their BERT tokenized representations. Microsoft Recognizers-Text is employed to normalize timexes and DCT date-time values.
The normalized timex expressions are compared through Allen's interval algebra, where each timex has a start and an end point. The comparison is made on the basis of the endpoints of the timexes, forming an edge from the earlier-ending to the later-ending timex. RST Discourse Parser: We used the shift-reduce discourse parser proposed by Ji and Eisenstein (2014) to build the discourse tree, which is post-processed using the discoursegraphs library (Neumann, 2015) to build the rhetorical dependency graph. Further implementation details can be found in the appendix.
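The endpoint comparison for labeling DCT-timex and timex-timex edges can be sketched as follows. Representing each timex as a (start, end) date pair is our simplification; the full implementation normalizes timexes with Microsoft Recognizers-Text first.

```python
from datetime import date

def timex_edge_label(a, b):
    """a, b: (start, end) tuples of dates, or None when a timex cannot
    be anchored in time. Returns the label of the directed edge a -> b,
    comparing interval endpoints as in Allen's algebra."""
    if a is None or b is None:
        return "None"      # unanchorable timestamp
    if a[1] < b[1]:
        return "Before"    # a ends earlier than b
    if a[1] > b[1]:
        return "After"
    return "Simultaneous"

week = (date(1998, 2, 1), date(1998, 2, 7))
dct = (date(1998, 2, 27), date(1998, 2, 27))
```

For example, `timex_edge_label(week, dct)` yields "Before", matching the rule that the edge runs from the earlier-ending to the later-ending timex.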

Ablation Study
To assess the contribution of the discourse, syntactic, and time-aware graphs, we performed an ablation experiment with different configurations (Table 3). Removing the context encoder significantly degrades performance, indicating that the graph components themselves cannot replace the contextual encoding. Removing any of the graph encoders hurts model performance, motivating the need for all constituent graph components. We also analyzed the relative importance of G_DG, G_SG, and G_TG, represented by color shading in the table. The results show that the syntactic graph is least important for document-level pairs in TDDMan and TDDAuto, which we believe is due to the longer-range dependencies present in these datasets. However, removing the discourse graph leads to the least performance deterioration for the TimeBank-Dense and MATRES datasets, as inter- and intra-sentence pairs do not fully utilize document-level rhetorical relations. TIMERS outperforms the BERT baseline even without G_TG, demonstrating its usefulness in cases where the document creation date or timexes cannot be obtained easily.

Error Analysis
The error analysis results of TIMERS and its ablations on TDDMan are shown in Fig. 3 (the results on TDDAuto are in Appendix Fig. 1). The results provide evidence that the syntactic-aware graph (G_SG) is most important for relations that can be extracted from a single sentence (SE). The time-aware graph (G_TG) plays an important role in improving relationships requiring chain reasoning (multi-hop) and relationships determined by future events. We also note the role of the rhetorical-aware graph (G_DG) in modeling future possibility (FE), hypothetical events (HN), and causal conditions for event occurrences (CP). This can be attributed to rhetorical relational features that capture plausible inter-dependencies such as cause, explanation, and contrast (Lioma et al., 2012). None of the experimented models shows improved performance on TLINK pairs that depend on world knowledge (WK) or event coreference (EC).

Conclusion
This work presents a neural architecture that utilizes local syntactic features, rhetorical discourse features, and temporal arguments from semantic role labels through a Gated Relational-GCN for document-level temporal relation extraction on the TDDiscourse, MATRES, and TimeBank-Dense datasets. Experiments show that TIMERS achieves substantial improvement for events that require chain reasoning and causal prerequisite links. Future work will focus on exploring real-world scenarios in which the temporal extraction task suffers from absent or erroneous event and timex annotations. We believe our proposed methods can be adapted to other languages by overcoming limitations such as dependency parsing, semantic role labeling, and timex normalization for non-English corpora.

Ethics Statement
This work does not collect or release any new data resource. Moreover, the datasets used in our experiments (TDDiscourse, TimeBank-Dense, and MATRES) are publicly available and free to use, and hence do not intrude on user privacy. During the course of this work, no human judgements were used, nor was any user-level data collected, stored, or processed. Our methods do not add to any pre-existing data biases. Potential applications of this work include extracting event timelines from news and contractual documents, and digitizing patient electronic health records. We acknowledge that temporal information extraction finds applications in clinical NLP (Lin et al., 2019; Tourille et al., 2017). Hence, we would like to caution about the shortcomings of the proposed system in terms of misclassifications on event pairs requiring real-world common-sense reasoning and under domain shift.

A.1 Node Connections
We detail the node connections present in each graph of our proposed model along with edge attributes in Table 4.

B Additional Results
We observe from Figure 4 a trend similar to TDDMan, although with stronger support for SS, CR, TI, and FE. This is partly due to the fact that TDDAuto was generated automatically. Table 7 lists the Timex-Timex and DCT-Timex relations used in the time-aware graph G_TG, including the None relation used when one of the timexes cannot be extracted or normalized.