Utilizing Relative Event Time to Enhance Event-Event Temporal Relation Extraction

Event time is one of the most important features for event-event temporal relation extraction. However, explicit event time information in text is sparse. For example, only about 20% of event mentions in TimeBank-Dense have event-time links. In this paper, we propose a joint model for event-event temporal relation classification and an auxiliary task, relative event time prediction, which predicts event times as real numbers. We adopt the Stack-Propagation framework to incorporate the predicted relative event time into temporal relation classification while preserving differentiability. Our experiments on the MATRES dataset show that our model significantly improves over the RoBERTa-based baseline and achieves state-of-the-art performance.


Introduction
Event temporal ordering is an important task for understanding the evolution of events. Event-event temporal relation extraction aims to automatically extract the temporal order of a given pair of events and to further build a temporal graph. Figure 1 illustrates an example sentence and its temporal graph. There are three events in the sentence: said, identified, and run. The temporal relation between said and identified is AFTER, and the temporal relations between said and run and between identified and run are BEFORE.
Neural network based methods have achieved promising improvement for temporal relation extraction (Meng et al., 2017; Meng and Rumshisky, 2018; Cheng et al., 2020; Ballesteros et al., 2020; Wen et al., 2021a). They mostly treat the task as pairwise classification. There are also efforts focusing on global structures, including Markov logic networks and Integer Linear Programming (ILP) based methods (Bramsen et al., 2006; Chambers and Jurafsky, 2008; Yoshikawa et al., 2009; Do et al., 2012; Ning et al., 2017, 2018a; Han et al., 2019). Despite their success, previous work often overlooks event time, an important feature. Conceptually, if we know the exact time information for all events, their temporal relations can be naturally derived. For example, if we know that event A happened on Monday while event B happened on Tuesday in the same week, it is obvious that A happened BEFORE B. However, explicit time arguments can rarely be found in text, especially in news articles (Wen et al., 2021b). Leeuwenberg and Moens (2018) propose to predict a relative timeline and directly compare the relative timestamps of events to derive their temporal relations. Although this approach shows promising performance, the predicted timestamps do not take information from event pairs into account, and they cannot capture the uncertain temporal boundaries of events expressed in text, which is needed to predict relations such as VAGUE.

[Figure 1: An example sentence and its temporal graph. Sentence: "Microsoft (e1, said) it has (e2, identified) three companies for the China program to (e3, run) through June." Graph edges: said AFTER identified; said BEFORE run; identified BEFORE run.]

Footnote 1: The resource for this paper is available at https://github.com/wenhycs/EMNLP2021-Utilizing-Relative-Event-Time-to-Enhance-Event-Event-Temporal-Relation-Extraction
In this paper, we follow the idea of Leeuwenberg and Moens (2018) to predict the relative event time for temporal relation extraction, where the relative time is a real number indicating the relative position of the event on the timeline. Instead of directly comparing relative event times as Leeuwenberg and Moens (2018) do, we treat them as additional features and incorporate them into training the temporal relation classifier. We propose a joint model with the Stack-Propagation framework (Zhang and Weiss, 2016) to connect relative event time prediction and temporal relation extraction, which can exploit the explicit features from the former task for the latter task (Qin et al., 2019). Our model directly uses the output of relative event time prediction as input for temporal relation classification, so that the classification can benefit from both the representations of event pairs and the predicted relative event time. Because we do not break the differentiability between the two tasks, the training objective for one task can also benefit the other. Similarly to Leeuwenberg and Moens (2018), we adopt margin-based optimization to constrain the distance between two relative event time values, given their temporal relation.
Our experiments show that relative event time prediction significantly helps learn better temporal relation extraction, compared to a vanilla RoBERTa-based baseline. We also achieve performance comparable to the state-of-the-art temporal relation extraction approach that uses additional data (Ballesteros et al., 2020).

Approach
In this section, we discuss relative event time prediction for temporal relation extraction. The overall approach is illustrated in Figure 2.

Our Baseline Model
Our baseline is a pairwise classification model based on a pretrained language model. It takes as input a sequence of tokens $X$ of length $n$, obtained after the preprocessing required by the pretrained language model, such as subword tokenization (Devlin et al., 2019) or byte pair encoding (BPE). The input also includes the positions of two event mentions $(e_i, e_j)$ in the text. For simplicity, we only use the start positions of the corresponding event mention spans. We denote the start position of an event mention $e_i$ as $p_i$. The baseline model predicts the temporal relation between the two given event mentions.
The model first computes the contextualized representation for each input token using a pretrained language model (Devlin et al., 2019). We denote the contextualized representations as $H$, where $h_i$ is the contextualized representation for the token at position $i$.
Then we concatenate the representations of the two given event mentions, taken at their start positions $p_i$ and $p_j$. We use a two-layer feed-forward neural network (FFN) with a tanh activation function and a softmax layer to convert this concatenated representation into a probability distribution over temporal relation labels.
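In symbols, the baseline described above can be written as follows. This is a reconstruction from the prose rather than a quotation: the name $r_{ij}$ for the pair representation and the form $P(y \mid e_i, e_j)$ for the output distribution are notation assumed here, while $\mathrm{FFN}_1$ follows the naming used in the experimental setup.

```latex
\begin{align*}
r_{ij} &= [\,h_{p_i};\, h_{p_j}\,] \tag{1}\\
P(y \mid e_i, e_j) &= \mathrm{softmax}\big(\mathrm{FFN}_1(r_{ij})\big)
\end{align*}
```

Here $[\,\cdot\,;\,\cdot\,]$ denotes vector concatenation, so $r_{ij}$ has twice the dimensionality of a single token representation.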

Relative Timestamp Prediction
To better utilize the contextual information for events, we use an auxiliary task, relative event time prediction, to predict the event time as a real number for each event mention given its context, similar to Leeuwenberg and Moens (2018). In contrast to the baseline method above, which takes a pair of representations and predicts the pairwise relation, event time information depends only on the event itself. Therefore, we predict the relative event time by mapping the representation of an event mention $e_i$ from the pretrained language model to a real number $t_i \in (-1, 1)$, using a two-layer feed-forward network with tanh activation functions. Although we may not have explicit time information in the given context, the gold-standard pairwise temporal relations can be considered as incidental supervision to constrain the predicted time.
For example, given two event mentions $e_i$ and $e_j$ with the temporal relation $e_i$ BEFORE $e_j$, their predicted times $t_i$ and $t_j$ should satisfy $t_i < t_j$. Similarly, if their relation is EQUAL, then their predicted times should be as close as possible.
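The mapping from an event representation to a relative time can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the form $t_i = \tanh(\mathrm{FFN}_2(h_{p_i}))$ with one tanh hidden layer, and the helper names `make_ffn` and `predict_relative_time`, the toy dimensions, and the random weights are all hypothetical.

```python
import math
import random


def make_ffn(input_dim, hidden_dim, seed=0):
    """Create illustrative (untrained) weights for a two-layer network with a scalar output."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
    b2 = 0.0
    return w1, b1, w2, b2


def predict_relative_time(h, params):
    """Map one event representation h (a list of floats) to a scalar in (-1, 1)."""
    w1, b1, w2, b2 = params
    # First layer: linear map followed by tanh.
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h)) + b) for row, b in zip(w1, b1)]
    # Second layer: linear map to a scalar, squashed by tanh into (-1, 1).
    return math.tanh(sum(w * z for w, z in zip(w2, hidden)) + b2)


params = make_ffn(input_dim=8, hidden_dim=4)
h_event = [0.5, -1.2, 0.3, 0.0, 2.0, -0.7, 1.1, 0.4]  # toy stand-in for h_{p_i}
t_event = predict_relative_time(h_event, params)
```

With untrained weights the value of `t_event` is arbitrary; only the ordering constraints from the gold relations give the output a temporal meaning.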
Therefore, we use a margin-based optimization method to constrain our predicted relative event time, with different margins for different temporal relations. If $e_i$ is BEFORE $e_j$, the objective increases the distance $(t_j - t_i)$ until it is equal to or larger than 1, which enforces the constraint $t_i < t_j$.
Conversely, if $e_i$ is AFTER $e_j$, it increases the distance $(t_i - t_j)$, which enforces the constraint $t_i > t_j$. If $e_i$ is EQUAL to $e_j$, it instead minimizes the distance $|t_i - t_j|$.
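The three cases described above correspond to a hinge-style loss with margin 1 for BEFORE and AFTER and an absolute difference for EQUAL. The sketch below is one way to write it; the function name `time_margin_loss` is hypothetical, and returning zero for VAGUE is an assumption made here, since the text does not state how VAGUE pairs are constrained.

```python
def time_margin_loss(t_i, t_j, relation, margin=1.0):
    """Margin-based loss on two predicted relative times, given their gold relation."""
    if relation == "BEFORE":
        # Penalise until t_j exceeds t_i by at least the margin.
        return max(0.0, margin - (t_j - t_i))
    if relation == "AFTER":
        # Penalise until t_i exceeds t_j by at least the margin.
        return max(0.0, margin - (t_i - t_j))
    if relation == "EQUAL":
        # Pull the two times together.
        return abs(t_i - t_j)
    if relation == "VAGUE":
        # Assumption: no constraint is applied to VAGUE pairs in this sketch.
        return 0.0
    raise ValueError(f"unknown relation: {relation}")


# Predicted times reported for the first example in Table 4.
t_e1, t_e2, t_e3 = 0.9937, -0.7451, -0.0770
loss_12 = time_margin_loss(t_e1, t_e2, "AFTER")   # gap 1.7388 >= 1, so no penalty
loss_23 = time_margin_loss(t_e2, t_e3, "BEFORE")  # gap 0.6681 < 1, so penalty 0.3319
```

Under this form, a pair that is correctly ordered but closer than the margin still receives a gradient, which keeps BEFORE and AFTER pairs well separated on the timeline.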

Stack-Propagation on Relative Timestamp
After obtaining the relative time for each event, we further incorporate this predicted feature into event-event temporal relation extraction. Since both relative time prediction and temporal relation extraction are based on contextualized representations from the pretrained language model, we adopt the Stack-Propagation framework to connect these two tasks while preserving differentiability. Specifically, besides the event-pair contextualized representation that the baseline method uses for pairwise temporal relation classification, we further feed the predicted relative event times of the two events into the classifier. During training, we use a cross-entropy objective for temporal relation classification. The final training objective is the interpolation of the classification objective and the time prediction objective. Since we keep the classification differentiable with respect to the predicted times, the gradient from the cross-entropy objective can be propagated to the timestamps and jointly train relative time prediction.
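One way to write these three steps in symbols is shown below. It is a sketch under stated assumptions: $\mathrm{FFN}_3$ and $\alpha$ are the names used in the experimental setup, $y^{*}$ (the gold label), $\mathcal{L}_{\mathrm{ce}}$, and $\mathcal{L}_{\mathrm{time}}$ (the margin-based time objective) are notation introduced here, and the additive weighting by $\alpha$ is an assumed form of the interpolation.

```latex
\begin{align*}
P(y \mid e_i, e_j) &= \mathrm{softmax}\big(\mathrm{FFN}_3([\,h_{p_i};\, h_{p_j};\, t_i;\, t_j\,])\big)\\
\mathcal{L}_{\mathrm{ce}} &= -\log P(y^{*} \mid e_i, e_j)\\
\mathcal{L} &= \mathcal{L}_{\mathrm{ce}} + \alpha\, \mathcal{L}_{\mathrm{time}}
\end{align*}
```

Because $t_i$ and $t_j$ enter $\mathrm{FFN}_3$ as continuous values rather than as discretized labels, $\partial \mathcal{L}_{\mathrm{ce}} / \partial t_i$ is well defined, which is what allows the classification loss to update the time predictor.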

Dataset
We conduct our experiments on MATRES (Ning et al., 2018b). MATRES contains refined annotations on TimeBank (Pustejovsky et al., 2003) and TempEval (UzZaman et al., 2013; containing the AQUAINT and Platinum subsets) documents. Following previous work, we use TimeBank and AQUAINT for training and Platinum as the test set. We randomly select 21 documents as the development set. The detailed statistics can be found in Table 1.

Experimental Setup
We use $F_1$ to evaluate our system performance, following previous work, where we consider VAGUE as "no relation". We compare our model with existing systems including:
1) BiLSTM+MAP: A BiLSTM-based joint event and temporal relation extraction model with MAP inference (Han et al., 2019).
2) LSTM+TEMPROB+ILP: An LSTM-based method incorporating pretrained language model embeddings, a commonsense prior (TEMPROB), and ILP.
3) Joint Constrained Learning: A constrained learning based optimization for joint event temporal and hierarchy relation extraction.
4) Self-Training: Multi-task self-training on temporal relation extraction using additional time annotation from ACE2005 and the original TimeBank (Ballesteros et al., 2020).

We use RoBERTa-large as our pretrained language model. Our best model is optimized using AdamW for 30 epochs with a learning rate chosen from {1e-5, 2e-5} for both the pretrained model and the other parameters. We use a linear scheduler with a warmup proportion of 0.1. We set the weight decay to 0.01 and the dropout rate to 0.1 for all parameters. The training batch size is 16. We run our experiments with 5 different random seeds, and choose the learning rate and model with the best averaged performance on the development set for comparison on the test set. The hidden size of $\mathrm{FFN}_1(\cdot)$ is 1024. The hidden size of $\mathrm{FFN}_3(\cdot)$ is 1026 (adding the two predicted times as input). $\alpha$ is set to 1.

Table 2: [...] (Han et al., 2019; Ballesteros et al., 2020). We report our averaged test performance on all metrics over 5 random seeds. Models marked * use additional training resources.
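The evaluation convention of treating VAGUE as "no relation" can be illustrated with a short scoring function. This is a sketch of a common micro-averaged reading of that convention, not the scoring script used for the reported numbers; the function name `f1_ignoring_vague` and the toy labels are hypothetical.

```python
def f1_ignoring_vague(gold, pred, no_relation="VAGUE"):
    """Micro precision, recall, and F1 where the no-relation label never counts as a hit."""
    # A prediction is correct only if it matches gold and is a real relation.
    correct = sum(1 for g, p in zip(gold, pred) if g == p and g != no_relation)
    n_pred = sum(1 for p in pred if p != no_relation)  # relations the system proposed
    n_gold = sum(1 for g in gold if g != no_relation)  # relations in the annotation
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["BEFORE", "AFTER", "VAGUE", "EQUAL", "BEFORE"]
pred = ["BEFORE", "BEFORE", "VAGUE", "VAGUE", "BEFORE"]
precision, recall, f1 = f1_ignoring_vague(gold, pred)  # 2/3, 1/2, 4/7
```

Under this reading, predicting VAGUE for a pair costs recall when the gold label is a real relation, while predicting a real relation for a gold VAGUE pair costs precision.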

Overall Performance
Our overall performance is shown in Table 2. Among the baseline systems, the multi-task self-training method (Ballesteros et al., 2020) achieves the best performance. Our proposed method achieves slightly better performance than their system without introducing additional annotated or raw data, which demonstrates the effectiveness of both the relative timestamp prediction objective and the Stack-Propagation based method for incorporating predicted timestamps.

Table 3: Ablation study results (%) on our proposed method. We report our averaged test performance on all metrics. p < 0.05 for the two-sided heteroscedastic independent t-test between Vanilla RoBERTa and our model.

Ablation Study
We conduct an ablation study on relative event time prediction and Stack-Propagation; the results are shown in Table 3. Adding relative time prediction as an auxiliary task (Multi-Task) helps improve the performance, and incorporating relative event time as features (Stack-Propagation) further boosts the performance. We also compare with training relative time prediction and directly using the relative timestamps to derive the temporal relation labels, similar to Leeuwenberg and Moens (2018). This model has the highest recall because it aggressively classifies relations as BEFORE or AFTER. However, because it cannot handle VAGUE, its precision is low, even compared to the vanilla RoBERTa-based classifier (Vanilla Classifier).

Example 1: He (said, e1) he (discussed, e2) the issue with Mr. Netanyahu during his visit to Israel this week, and that they (agreed, e3) the timing was good for a discussion with the Turkish leader.
  t_e1 = 0.9937, t_e2 = −0.7451, t_e3 = −0.0770
  r(e1,e2) = AFTER, r(e1,e3) = AFTER, r(e2,e3) = BEFORE

Example 2: Traditionally, the (intentionally) funny lines by our presidents have (had, e1) one thing in common: They were self-deprecating. Sure, some presidents have (used, e2) jokes to take jabs at their opponents, but not to the extent of Obama.
  t_e1 = −0.5162, t_e2 = −0.5209
  r(e1,e2) = VAGUE

Table 4: Examples of temporal relation extraction and relative event time prediction results. The first example shows the correlation between predicted relative event time and temporal relations. The second example shows that the model can use event-pair information to predict VAGUE labels.
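The behaviour of the direct-comparison variant in the ablation can be seen with a small decoding function. This is a simplified sketch assumed for illustration, not the exact decoding of Leeuwenberg and Moens (2018); the name `relation_from_times` is hypothetical, and the inputs are the predicted times listed in Table 4.

```python
def relation_from_times(t_i, t_j):
    """Derive a relation label purely by comparing two predicted relative times."""
    if t_i < t_j:
        return "BEFORE"
    if t_i > t_j:
        return "AFTER"
    # Exact ties are practically never produced by a real-valued predictor.
    return "EQUAL"


# First example in Table 4: the comparison agrees with the reported relations.
r_12 = relation_from_times(0.9937, -0.7451)
r_23 = relation_from_times(-0.7451, -0.0770)

# Second example in Table 4: the times differ by only 0.0047, yet a pure
# comparison must still commit to an order and can never output VAGUE.
r_vague_pair = relation_from_times(-0.5162, -0.5209)
```

For the second pair the comparison yields AFTER, whereas the joint model outputs VAGUE. Forced decisions of this kind are consistent with the high recall and low precision reported for the direct-comparison variant.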

Qualitative Analysis
We visualize the distance values of the relative event time for given event pairs and their predicted labels in Figure 3, where a negative value, for example, indicates that the time of the former event is earlier than that of the latter event. BEFORE and AFTER predictions largely correlate with the distance values of the relative event time pairs. VAGUE and EQUAL predictions are centered near 0, which shows that our model can identify event pairs whose predicted relative timestamps are hard to compare and classify them as VAGUE or EQUAL. Table 4 shows two example system outputs. The first case shows that temporal relation extraction correlates with relative event time prediction, while the second case shows that our model can utilize the event-pair representation and classify the relation as VAGUE rather than depending entirely on the predicted timestamps.

Related Work
Earlier efforts on temporal relation extraction focus on global inference using methods such as Integer Linear Programming (Bramsen et al., 2006; Chambers and Jurafsky, 2008; Yoshikawa et al., 2009; Do et al., 2012). Recently, neural network-based methods have also achieved promising improvement (Meng et al., 2017; Meng and Rumshisky, 2018; Ning et al., 2018a; Han et al., 2019; Cheng et al., 2020). In particular, Goyal and Durrett (2019) use an LSTM to encode Timex for temporal relation extraction, and Ballesteros et al. (2020) jointly train with event time argument extraction from ACE2005. The most closely related work is Leeuwenberg and Moens (2018), who propose to predict relative event time and use the comparison of relative timestamps as temporal relations. Our work focuses on jointly training relative time prediction and temporal relation extraction, and utilizes relative timestamps as features via the Stack-Propagation framework.

Conclusions and Future Work
In this paper, we leverage relative event time prediction, which grounds events onto a relative timeline, to help event-event temporal relation extraction. We use the Stack-Propagation framework to explicitly incorporate the predicted timestamps into relation classification. Our experimental results demonstrate the effectiveness of our proposed method. In the future, we will focus on extending relative time prediction to the cross-document setting to support cross-document temporal relation extraction.