Extracting or Guessing? Improving Faithfulness of Event Temporal Relation Extraction

In this paper, we seek to improve the faithfulness of TempRel extraction models from two perspectives. The first perspective is to extract genuinely based on the contextual description. To achieve this, we propose to conduct counterfactual analysis to attenuate the effects of two significant types of training biases: the event trigger bias and the frequent label bias. We also add tense information into event representations to explicitly place an emphasis on the contextual description. The second perspective is to provide proper uncertainty estimation and abstain from extraction when no relation is described in the text. By parameterizing a Dirichlet Prior over the model-predicted categorical distribution, we improve the model's estimates of the correctness likelihood and make TempRel predictions more selective. We also employ temperature scaling to recalibrate the model confidence measure after bias mitigation. Through experimental analysis on MATRES, MATRES-DS, and TDDiscourse, we demonstrate that our model extracts TempRel's and timelines more faithfully than SOTA methods, especially under distribution shifts.


Introduction
Event temporal relation (TEMPREL) extraction is an essential step towards understanding narrative text, such as stories, novels, news, and guideline articles. With a robust temporal relation extractor, one can easily construct a storyline from text and capture the trend of temporally connected event mentions. TEMPREL extraction is also broadly beneficial to various downstream tasks including clinical narrative processing (Jindal and Roth, 2013; Bethard et al., 2016), question answering (Llorens et al., 2015; Meng et al., 2017; Stricker, 2021), and schema induction (Chambers and Jurafsky, 2009; Li et al., 2021).
Most existing TEMPREL extraction models are developed with data-driven machine learning approaches, for which recent studies also incorporate advanced learning and inference techniques such as structured prediction (Ning et al., 2017, 2018b; Han et al., 2019; Tan et al., 2021), graph representation (Mathur et al., 2021; Zhang et al., 2022), data augmentation (Ballesteros et al., 2020; Trong et al., 2022), and indirect supervision (Zhao et al., 2021). These models are prevalently built upon pretrained language models (PLMs) and fine-tuned on a small set of annotated documents, e.g., TimeBank-Dense (Cassidy et al., 2014), MATRES (Ning et al., 2018c), and TDDiscourse (Naik et al., 2019).

Figure 1: Examples of unfaithful extractions. A) I went to e_1:SEE the doctor. However, I was more seriously e_2:SICK. ⟹ e_1 AFTER e_2. B) Microsoft said it has e_3:IDENTIFIED three companies for the China program to run through June. The company also e_4:GIVES each participating startup in the Seattle program $20,000 to create software. ⟹ e_3 BEFORE e_4. BEFORE and AFTER following the arrows denote the TEMPREL's extracted from the sentences.
Though these recent approaches have achieved promising evaluation results on benchmarks, whether they provide faithful extraction remains an unexplored problem. The faithfulness of a relation extraction system is not simply about how much accuracy the system can offer. Instead, a faithful extractor should ensure the validity and reliability of its extraction process. Specifically, when there is a TEMPREL to extract, a faithful extractor should genuinely obtain what is described in the context rather than give trivial guesses based on the surface names of events or the most frequent labels. Moreover, when there is no relation described in the context, the system should selectively abstain from prediction.
We observe that in recent models, biases from prior knowledge in PLMs and statistically skewed training data often lead to unfaithful extractions (see Fig. 1). Example A exhibits a case where the model adheres to the prior knowledge that people usually see the doctor after getting sick, but in this context getting sick is obviously a consequence of seeing the doctor. In Example B, BEFORE is extracted due to statistical biases learned from training data: BEFORE is not only the most frequent TEMPREL between identify and give, but also the most frequent TEMPREL between the first and second event in narrative order (Gee and Grosjean, 1984). However, a closer inspection reveals that the two events in Example B are involved in different programs, one in the China program and the other in the Seattle program. Therefore, the system should abstain from prediction and give VAGUE as output.
In this paper, we seek to improve the faithfulness of TEMPREL extraction models from two perspectives. The first perspective is to guide the model to genuinely extract the described TEMPREL based on a relation-mentioning context. To achieve this goal, we conduct counterfactual analysis (Niu et al., 2021) to capture and attenuate the effects of two typical types of training biases: event bias caused by treating event trigger names as shortcuts for TEMPREL prediction, and label bias that causes the model prediction to lean towards more frequent training labels. We also propose to affix tense information to event mentions to explicitly place an emphasis on the contextual description.
The second perspective is to teach the model to abstain from extraction when no relation is described in the text. To know when to abstain, the model needs a good estimate of the correctness likelihood. By incorporating a Dirichlet Prior (Malinin and Gales, 2018, 2019) in the training phase of current TEMPREL extraction models, we improve the models' predictive uncertainty estimation and make the TEMPREL predictions more selective. Furthermore, since the counterfactual analysis component (from the first perspective) may shift the model-predicted categorical distribution, we also employ temperature scaling (Guo et al., 2017) in inference to allow for a recalibrated confidence measure of the model.
The technical contributions of our work are twofold. First, to the best of our knowledge, this is the first study on the faithfulness issue of event-centric information extraction. Evidently, the development of a faithful TEMPREL extraction system contributes to more robust and reliable machine comprehension of events and narratives. Second, we propose training and inference techniques that can be easily plugged into existing neural TEMPREL extractors and effectively improve model faithfulness by mitigating prediction shortcuts and enhancing the capability of selective prediction.
Our contributions are verified with TEMPREL extraction experiments conducted on MATRES (Ning et al., 2018c), TDDiscourse (Naik et al., 2019), and a distribution-shifted version of MATRES (MATRES-DS). Particularly, we evaluate how precise and selective our TEMPREL extraction method is on in-distribution data, and how well it generalizes under distribution shift. Experimental results demonstrate that the techniques explored within the two aforementioned perspectives bring about promising results in improving the faithfulness of current models. In addition, we also apply our method to the task of timeline construction (Do et al., 2012), showing that faithful TEMPREL extraction greatly benefits the accurate construction of timelines.


Related Work

Event TEMPREL Extraction. Several works, e.g., Ballesteros et al. (2020), enrich the models with auxiliary training tasks to provide complementary supervision signals, while Ning et al. (2018b) and Zhao et al. (2021), among others, bring into play distant supervision from heuristic cues and patterns. Nevertheless, recent data-driven models risk amplifying bias by exacerbating biases present in the pretraining and task training data when making predictions (Zhao et al., 2017). To rectify the models' biases towards prior knowledge in PLMs and shortcuts learned from biased training examples, our work proposes several training and inference techniques, seeking to improve the faithfulness of neural TEMPREL extractors as described in §1.
Bias Mitigation in NLP. Methods for mitigating prediction biases can be categorized into retraining and inference methods (Sun et al., 2019). Retraining methods address the bias in early stages or at its source. For instance, Zhang et al. (2017) mask the entities with special tokens to prevent relation extraction models from learning shortcuts from entity names, whereas several works conduct data augmentation (Park et al., 2018; Alzantot et al., 2018; Jin et al., 2020; Wu et al., 2022) or sample reweighting (Lin et al., 2017; Liu et al., 2021a) to reduce biases in training. However, masking results in the loss of semantic information and performance degradation, and it is costly to manipulate data or find properly unbiased data for temporal reasoning. Directly debiasing the training process may also hinder model generalization on out-of-distribution (OOD) data (Wang et al., 2022). Therefore, inspired by several recent studies on debiasing text classification or entity-centric information extraction (Qian et al., 2021; Nan et al., 2021), our work adopts counterfactual inference to measure and control prediction biases based on automatically generated counterfactual examples.
Selective Prediction. Neural models have become increasingly accurate with the advances of deep learning. In the meantime, however, they should also indicate when their predictions are likely to be inaccurate in real-world scenarios. A series of recent studies have focused on resolving model miscalibration by measuring how closely the model confidences match empirical likelihoods. Among them, computationally expensive Bayesian (Gal and Ghahramani, 2016; Küppers et al., 2021) and non-Bayesian ensemble (Lakshminarayanan et al., 2017; Beluch et al., 2018) methods have been adopted to yield high-quality predictive uncertainty estimates. Other methods use uncertainty reflected in model parameters to assess confidence, including sharpness (Kuleshov et al., 2018) and softmax response (Hendrycks and Gimpel, 2017; Xin et al., 2021). Another class of methods adjusts the model's output probability distribution by altering the training loss function via label smoothing (Szegedy et al., 2016) or a Dirichlet Prior (Malinin and Gales, 2018, 2019). Besides, temperature scaling (Guo et al., 2017) serves as a simple yet effective post-hoc calibration technique. In this paper, we model TEMPREL's with a Dirichlet Prior in learning, and during inference we employ temperature scaling to recalibrate the confidence measure of the model after bias mitigation.

Preliminaries
A document D is represented as a sequence of tokens D = [w_1, ..., e_1, ..., e_2, ..., w_m], where some tokens belong to the set of annotated event triggers, i.e., E_D = {e_1, e_2, ..., e_n}, and the rest are other lexemes. For a pair of events (e_i, e_j), the task of TEMPREL extraction is to predict a relation r from R ∪ {VAGUE}, where R denotes the set of TEMPREL's. An event pair is labeled VAGUE if the text does not express any determinable relation that belongs to R. Let y^(i,j) denote the model-predicted categorical distribution over R.
In order to provide a confidence estimate y that is as close as possible to the true probability, we first describe three separate factors (Malinin and Gales, 2018) that contribute to the predictive uncertainty of an AI system, namely epistemic uncertainty, aleatoric uncertainty, and distributional uncertainty. Epistemic uncertainty refers to the degree of uncertainty in estimating model parameters from training data, whereas aleatoric uncertainty results from the data's innate complexity. Distributional uncertainty arises when the model cannot make accurate predictions due to a lack of familiarity with the test data.
We argue that the way existing TEMPREL extractors handle VAGUE relations is problematic, since they typically merge VAGUE into R. In fact, VAGUE relations are complicated exception cases in the IE task, yet the annotation of such exceptions is never close to exhaustive in benchmarks, or is not given at all (Naik et al., 2019). In this work, we consider VAGUE relations as a source of distributional uncertainty and model them separately. Details are introduced in §4.2.

Methods
In this section, we first present how we obtain event representations and the categorical distribution y in a local classifier for TEMPREL (§4.1). We then introduce the proposed learning and inference techniques to improve model faithfulness from the perspectives of selective prediction (§4.2) and prediction bias mitigation (§4.3), before combining these two techniques with temperature scaling and introducing the OOD detection method in §4.4.

Local Classifier
Given that the context around an event pair (e_i, e_j)¹ has linguistic signals and temporal cues that are beneficial to TEMPREL prediction, the context of (e_i, e_j) considered in our model starts from the sentence before e_i and ends at the sentence after e_j. Inspired by prior work that improves entity representations by prepending entity type information to entity mention spans, we add tense information of events into event trigger representations. Accordingly, we enclose e_i and e_j with "@" and "#" respectively² and prepend their tense information to their spans with "*" and "∧". We provide a detailed example of affixing tense information in Appx. §A.1.
To characterize the event pair (e_i, e_j), we obtain the two events' contextual representations and attention heads from PLMs. The classifier is trained to uncover the context that is critical to both events by multiplying their attentions, before we send the concatenation of the token embeddings and the attention multiplication to a multi-layer perceptron (MLP) with |R| outputs. In this fashion, we obtain the |R|-dimensional logits vector z^(i,j) and the categorical distribution y^(i,j), where the probability of a label r ∈ R is given by the softmax function σ(·):

y^(i,j)_r = σ(z^(i,j))_r = exp(z^(i,j)_r) / Σ_{r'∈R} exp(z^(i,j)_{r'})    (1)
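To make the pair representation concrete, the following minimal NumPy sketch mimics the local classifier: the two trigger embeddings are concatenated with the element-wise product of their attention vectors and passed through a small MLP, and the softmax of Eq. 1 turns the logits into a categorical distribution. All dimensions, weights, and function names here are illustrative stand-ins, not the actual trained parameters.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax (Eq. 1).
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pair_logits(h_i, h_j, a_i, a_j, W1, b1, W2, b2):
    """Sketch of the local classifier: concatenate the two trigger
    embeddings with the element-wise product of their attention
    vectors, then apply a 2-layer MLP with |R| outputs."""
    x = np.concatenate([h_i, h_j, a_i * a_j])
    hidden = np.tanh(W1 @ x + b1)   # hidden layer
    return W2 @ hidden + b2         # logits z^(i,j)

# Toy dimensions: 4-dim embeddings/attentions, |R| = 3 labels.
rng = np.random.default_rng(0)
d, h, n_rel = 4, 8, 3
W1, b1 = rng.normal(size=(h, 3 * d)), np.zeros(h)
W2, b2 = rng.normal(size=(n_rel, h)), np.zeros(n_rel)
h_i, h_j = rng.normal(size=d), rng.normal(size=d)
a_i, a_j = rng.random(d), rng.random(d)

z = pair_logits(h_i, h_j, a_i, a_j, W1, b1, W2, b2)
y = softmax(z)  # categorical distribution y^(i,j)
```

In the actual model the embeddings and attentions come from the PLM encoder rather than a random generator, and training updates W1, W2 end-to-end.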

Parameterization of Dirichlet Prior
As discussed in preliminaries ( §3), VAGUE corresponds to complicated exception cases in inference.
We model them as out-of-distribution (OOD) cases, which are different from in-distribution (ID) data describing the relations in R. The goal of providing a high-quality confidence estimate y requires the model to yield a sharp predicted distribution centered on one of the labels in R when it is confident, and to yield a flat distribution over R for OOD inputs, as shown in Fig. 2. To achieve this goal, we explicitly parameterize a prior distribution over categorical distributions. Because of the tractable analytic properties³ of the Dirichlet distribution

Dir(μ; α) = (Γ(α_0) / ∏_{k=1}^{|R|} Γ(α_k)) ∏_{k=1}^{|R|} μ_k^{α_k − 1},  α_0 = Σ_{k=1}^{|R|} α_k    (2)

we choose to parameterize a sharp and a flat Dirichlet prior over the model-predicted categorical distribution for ID and OOD inputs, respectively. The Dirichlet distribution is parameterized by its concentration parameters α, where α_0 is the precision of the Dirichlet distribution. Higher values of α_0 lead to sharper, more confident predicted distributions.
To attain the aforementioned behaviors, on ID data the model is trained to minimize the KL divergence between a sharp Dirichlet distribution and the distribution predicted by the model:

L_ID = E_{x∼p_ID(x)} [ KL( Dir(μ; α_s) ‖ p(μ | x; θ) ) ]    (3)

where p_ID(x) denotes ID data and α_s denotes the concentration parameters of the sharp Dirichlet distribution. On OOD data, the model minimizes the KL divergence between a flat Dirichlet distribution and the distribution predicted by the model:

L_OOD = E_{x∼p_OOD(x)} [ KL( Dir(μ; α_f) ‖ p(μ | x; θ) ) ]    (4)

where p_OOD(x) denotes OOD data and α_f denotes the concentration parameters of the flat Dirichlet distribution. The total loss of the model is

L = λ_1 L_ID + λ_2 L_OOD    (5)

where the λ's are hyperparameters that balance the influence of each loss. With the parameterization of the Dirichlet prior, the learning process partly enhances the model's faithfulness by outputting confident estimates when the model encounters ID inputs, and equivocal estimates when the context does not express any TEMPREL.
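Since both targets and predictions can be treated as Dirichlet distributions, the KL terms in Eqs. 3 and 4 have a closed form. The standard-library sketch below computes KL(Dir(α) ‖ Dir(β)) and assembles the total loss of Eq. 5 for one example; the concrete concentration values are illustrative, and in practice the predicted concentrations would come from the network (e.g., as exponentiated logits).

```python
import math

def digamma(x):
    """Digamma via recurrence plus an asymptotic series (adequate for x > 0)."""
    result = 0.0
    while x < 6:                      # shift x upward: psi(x) = psi(x+1) - 1/x
        result -= 1.0 / x
        x += 1
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x \
        - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def dirichlet_kl(alpha, beta):
    """Closed-form KL( Dir(alpha) || Dir(beta) )."""
    a0, b0 = sum(alpha), sum(beta)
    kl = math.lgamma(a0) - sum(math.lgamma(a) for a in alpha)
    kl -= math.lgamma(b0) - sum(math.lgamma(b) for b in beta)
    kl += sum((a - b) * (digamma(a) - digamma(a0))
              for a, b in zip(alpha, beta))
    return kl

# Sharp target for an ID example: high precision, mass on gold label 2.
alpha_sharp = [1.0, 1.0, 100.0]
# Flat target for an OOD (VAGUE) example: uniform, low precision.
alpha_flat = [1.0, 1.0, 1.0]
# Model concentrations, e.g. exp of the classifier logits (illustrative).
alpha_pred = [math.exp(0.2), math.exp(-0.1), math.exp(2.5)]

loss_id = dirichlet_kl(alpha_sharp, alpha_pred)   # Eq. 3 term
loss_ood = dirichlet_kl(alpha_flat, alpha_pred)   # Eq. 4 term
total_loss = 1.0 * loss_id + 1.0 * loss_ood       # Eq. 5, lambda_1 = lambda_2 = 1
```

A real implementation would compute these terms batch-wise inside the training loop; the one-example version here only shows the shape of the objective.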

Counterfactual Analysis
After looking into the selective prediction perspective of faithfulness, we now address the other perspective: to mitigate biases from pre-trained knowledge and the task training data during the inference stage.

Figure 3: An overview of our approach to improving model faithfulness. In the training phase, we obtain the model-predicted categorical distribution y with a neural encoder and parameterize a Dirichlet Prior over y. We then conduct counterfactual analysis to distill and mitigate biases during inference, before leveraging temperature scaling to obtain a recalibrated and debiased y.

Given that we have observed two types of biases in existing models, namely the event trigger bias and the frequent label bias, we ask the following questions:
• What will the model prediction be if it sees the full context?
• What will the model prediction be if it sees only the event triggers?
• Will the model predict anything even if it sees nothing?
Inasmuch as we have described the learning process of the model, we know how to obtain the model prediction given the full context and can easily answer the first question. The second and third, however, are hypothetical questions whose answers reflect the confounding biases that we would like to mitigate. With attention masks in recent PLMs (Devlin et al., 2019; Liu et al., 2019; Joshi et al., 2020; Lan et al., 2019), our model can be endowed with this imagination ability effortlessly. By feeding the model a counterfactual instance with the context masked while maintaining the spans of the event triggers, we obtain the model prediction given event trigger names only, which we denote by ỹ. And by feeding an empty (counterfactual) instance, we obtain the model prediction where no textual information is given, which we denote by ȳ. Intuitively, the two terms ỹ and ȳ provide measurements for the trigger bias and the label bias.
Our goal is to use the biases assessed from the model predictions on counterfactual instances to generate a debiased categorical distribution. We remove the event trigger bias and the frequent label bias via element-wise subtraction, which proves simple yet empirically effective (Qian et al., 2021):

y' = y − β_1 ỹ − β_2 ȳ    (6)

where y' denotes the debiased categorical distribution and the β's are independent parameters balancing the terms that represent the biases. We find the optimal values for β_1 and β_2 on different datasets⁴ via grid beam search (Hokamp and Liu, 2017):

(β_1, β_2) = argmax_{β_1, β_2 ∈ [a, b]} ψ(y')    (7)

where ψ is a metric function (e.g., F_1 score) for evaluation, and a, b are the search boundaries.
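A minimal sketch of the debiasing step: Eq. 6 is an element-wise subtraction of the trigger-only and empty-input predictions, and the grid search over the β's is approximated here by an exhaustive sweep with accuracy standing in for the metric ψ. The probability values and step size are toy choices for illustration.

```python
import itertools

def debias(y, y_trigger, y_empty, beta1, beta2):
    """Eq. 6 sketch: subtract the trigger-only and empty-input
    predictions, scaled by beta_1 and beta_2, from the full prediction."""
    return [yf - beta1 * yt - beta2 * ye
            for yf, yt, ye in zip(y, y_trigger, y_empty)]

def argmax(v):
    return max(range(len(v)), key=v.__getitem__)

def grid_search(dev, a=-1.0, b=1.0, step=0.2):
    """Exhaustive grid search for (beta_1, beta_2) in [a, b] maximizing
    a dev metric (accuracy stands in for psi, e.g. F1)."""
    steps = int(round((b - a) / step))
    grid = [a + step * k for k in range(steps + 1)]
    best, best_score = (0.0, 0.0), -1.0
    for b1, b2 in itertools.product(grid, grid):
        score = sum(argmax(debias(y, yt, ye, b1, b2)) == gold
                    for y, yt, ye, gold in dev) / len(dev)
        if score > best_score:
            best, best_score = (b1, b2), score
    return best

# The full prediction leans toward label 0 only because the trigger
# prior and the frequent-label prior do (toy values).
y_full = [0.50, 0.30, 0.20]
y_trig = [0.80, 0.10, 0.10]   # context masked, triggers kept
y_none = [0.60, 0.20, 0.20]   # empty input: frequent-label prior
y_prime = debias(y_full, y_trig, y_none, 0.4, 0.3)
betas = grid_search([(y_full, y_trig, y_none, 1)])
```

After subtraction, the toy example flips from the biased label 0 to label 1, which is the intended effect of Eq. 6.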
In a nutshell, we obtain a debiased categorical distribution by removing the biases distilled via counterfactual inputs, thus encouraging the model to extract genuinely based on the contextual content. Nevertheless, the debiased model is not yet perfect. A minor drawback is that, although it provides predictions with good evaluation results, its confidence estimates may have been shifted by the element-wise subtraction in Eq. 6. We therefore employ temperature scaling as our last step to allow for a recalibrated confidence measure of the model.

Temperature Scaling and OOD Detection
The subtraction operation in Eq. 6 may result in negative values in y'. To provide a proper estimate of the correctness likelihood, we first normalize the probabilities in y', replacing the negative values with a small positive value and clipping the values that are greater than 1:

y'_r ← min( max(y'_r, ϵ), 1 )    (8)

⁴ Using the development splits of the datasets.

where ϵ denotes a small positive number and r ∈ R. We then use the inverse function of softmax to obtain the debiased logits vector z':

z' = σ^{-1}(y') = log y' + c    (9)

where c is an arbitrary constant. In this way we are able to apply temperature scaling (Guo et al., 2017) over z' and get the recalibrated and debiased categorical distribution ŷ:

ŷ = σ(z' / T)    (10)

where T > 0 denotes the temperature⁵.
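The normalization, inverse softmax, and temperature scaling steps above compose into a short pipeline, sketched below with illustrative values of ϵ and T. Note that taking logs of the renormalized probabilities recovers the logits only up to an additive constant, which the final softmax ignores.

```python
import math

def recalibrate(y_prime, T=1.5, eps=1e-6):
    """Sketch of the recalibration pipeline: clamp debiased scores into
    [eps, 1], renormalize, take logs as an inverse softmax (up to a
    constant), and divide the recovered logits by the temperature T."""
    clipped = [min(max(v, eps), 1.0) for v in y_prime]   # clamp step
    total = sum(clipped)
    z = [math.log(v / total) for v in clipped]           # inverse softmax
    scaled = [math.exp(v / T) for v in z]                # temperature scaling
    s = sum(scaled)
    return [v / s for v in scaled]                       # recalibrated y-hat

# Toy debiased distribution with a zeroed-out entry.
y_hat = recalibrate([0.0, 0.2, 0.1], T=1.5)
```

Because dividing logits by T > 1 flattens the distribution monotonically, the argmax label is preserved while overconfidence is reduced.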
To detect OOD inputs, we need to measure the uncertainty of the model predictions. We use the entropy of the final categorical distribution ŷ, which captures the uncertainty encapsulated in the entire distribution:

H[ŷ] = − Σ_{r∈R} ŷ_r log ŷ_r    (11)

On the dev set with VAGUE examples, we find the optimal threshold of H[ŷ] above which the model predictions are considered equivocal and the inputs are treated as OOD.
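The abstention rule can then be sketched as follows; the relation label set and the threshold value are illustrative, and in practice the threshold is tuned on the dev set as described above.

```python
import math

def entropy(probs):
    """Shannon entropy of the final categorical distribution (Eq. 11)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def predict_or_abstain(y_hat, threshold,
                       relations=("BEFORE", "AFTER", "SIMULTANEOUS")):
    """Abstain with VAGUE when the prediction entropy exceeds the
    dev-tuned threshold; otherwise return the argmax relation."""
    if entropy(y_hat) > threshold:
        return "VAGUE"
    return relations[max(range(len(y_hat)), key=y_hat.__getitem__)]
```

A near-uniform distribution over three labels has entropy close to ln 3 ≈ 1.10, so a threshold around 0.7 would route such inputs to VAGUE while keeping confident predictions.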
To sum up, we improve model faithfulness in both the training and inference phases with robust event representations, Dirichlet Prior parameterization, counterfactual analysis, and temperature scaling. The entire workflow is shown in Fig. 3.

Experiments
In this section, we describe the experiments 6 on two tasks: TEMPREL extraction and timeline construction. We first introduce the datasets that we adopt or create for evaluation ( §5.1), followed by the evaluation protocols ( §5.2). Evaluation results are discussed in §5.3 before we provide a detailed ablation study and case study in §5.4 and §5.5.

Datasets
We evaluate using the following datasets, for which statistics are given in Appx. §A.4.
MATRES (Ning et al., 2018c) is a TEMPREL benchmark annotated with the multi-axis scheme, which helps achieve higher inter-annotator agreement (IAA) than previous benchmark datasets (Cassidy et al., 2014; Styler et al., 2014; O'Gorman et al., 2016). Four relations are annotated for the start-time comparison of event pairs in 275 documents, namely BEFORE, AFTER, SIMULTANEOUS, and VAGUE. We train our model on the training set of MATRES, and evaluate on the dev and test sets of MATRES, MATRES-DS, and TDDiscourse, which we introduce next.
MATRES-DS is an evaluation dataset that we created with distribution shifts (DS) relative to MATRES. Since one of our goals is to mitigate the bias of event triggers in the training data, we examine whether our proposed model stays uninfluenced when the distribution of event triggers is altered. We replace frequent triggers in the MATRES dev and test sets that appear within the top 5K frequent lemmas⁷ with their uncommon synonyms, and replace infrequent triggers with their frequent synonyms from the list of frequent lemmas. MATRES-DS thus presents a mismatch between the training and test distributions, or dataset shift (Quinonero-Candela et al., 2008), where distributional uncertainty often arises.
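The construction can be sketched as a simple trigger-substitution pass; the synonym maps below are hypothetical stand-ins for a real lexical resource (e.g., a thesaurus filtered against the 5K frequent-lemma list), and the tokens are illustrative.

```python
def shift_triggers(tokens, trigger_idx, frequent_lemmas,
                   rare_synonym, frequent_synonym):
    """Sketch of the MATRES-DS construction: frequent triggers are
    replaced by uncommon synonyms and infrequent triggers by frequent
    ones; non-trigger tokens are left untouched."""
    out = list(tokens)
    for i in trigger_idx:
        w = tokens[i]
        if w in frequent_lemmas:
            out[i] = rare_synonym.get(w, w)      # frequent -> uncommon
        else:
            out[i] = frequent_synonym.get(w, w)  # infrequent -> frequent
    return out

tokens = ["He", "said", "the", "plan", "transpired", "early"]
shifted = shift_triggers(
    tokens, trigger_idx=[1, 4],
    frequent_lemmas={"said", "plan"},
    rare_synonym={"said": "proclaimed"},
    frequent_synonym={"transpired": "happened"})
```

Only the trigger tokens change, so gold labels carry over unchanged, matching the statement in §A.4 that MATRES-DS shares MATRES's annotation statistics.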
TDDiscourse (Naik et al., 2019) is a dataset for discourse-level event temporal ordering, in which TEMPREL's between global long-distance event pairs are annotated. As another data source with distribution shifts relative to MATRES, we adopt the manually annotated subset of TDDiscourse, namely TDD-man, in our experiments. The TEMPREL set R_T⁸ annotated in TDDiscourse is a superset of the TEMPREL set R_M defined in MATRES. Given that TDD-man serves as evaluation data on which we do not train our model, a relation in R_M ∪ {VAGUE} is predicted for each pair of events in the test set of TDD-man.

Evaluation Protocols
For event TEMPREL extraction, we compare our model with the current and previous SOTA models (Trong et al., 2022; Mathur et al., 2021) trained on MATRES. The models are evaluated not only on how precise and selective their extraction is on ID data (MATRES), but also on their generalizability under distribution shifts (MATRES-DS and TDD-man). We report the micro-F_1 score as an evaluation metric following previous papers. We also report macro-F_1, which reflects the fairness of model predictions, and the expected calibration error (ECE), which approximates the difference in expectation between confidence and accuracy. The definition of ECE is provided in Appx. §A.2.
We also apply our model to the timeline construction task, where the goal is to sort a list of events in a document in chronological order. To construct the timeline, the model first constructs a directed graph G with the predicted non-VAGUE TEMPREL's between every event pair. Then, the edges in G with the lowest confidence scores are removed until G becomes a directed acyclic graph (DAG). Finally, the timeline is generated as the linear ordering of the vertices in the DAG by topological sorting. In this way, we circumvent possible conflicts in the model predictions for timeline construction, and the removal of the least confident edges serves as an examination of the quality of the model-predicted confidence. On the three datasets, we report the accuracy of exact match and the average minimum edit distance between predicted and ground-truth timelines as evaluation metrics.
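The procedure above can be sketched as follows: repeatedly drop the least confident edge until a topological order exists. This is a simplified reading of the protocol, with Kahn's algorithm used both for cycle detection and for the final sort; edge tuples (u, v, conf) encode "u BEFORE v with model confidence conf".

```python
def topo_sort(nodes, edges):
    """Kahn's algorithm; returns None if the graph contains a cycle."""
    indeg = {n: 0 for n in nodes}
    for _, v, _ in edges:
        indeg[v] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    order = []
    while queue:
        u = queue.pop()
        order.append(u)
        for a, b, _ in edges:        # relax edges leaving u
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return order if len(order) == len(nodes) else None

def build_timeline(nodes, edges):
    """Drop the least confident edges until the graph is a DAG, then
    linearize it by topological sorting."""
    edges = sorted(edges, key=lambda e: e[2], reverse=True)
    while topo_sort(nodes, edges) is None:
        edges.pop()                  # remove the lowest-confidence edge
    return topo_sort(nodes, edges)

# Toy example with a confidence-weighted cycle e1 -> e2 -> e3 -> e1.
events = ["e1", "e2", "e3"]
rels = [("e1", "e2", 0.9), ("e2", "e3", 0.8), ("e3", "e1", 0.3)]
timeline = build_timeline(events, rels)
```

Dropping the 0.3-confidence back edge breaks the cycle, and the topological sort yields the timeline e1, e2, e3, mirroring the case study in §5.5.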

Results
In Tab. 1, we report the TEMPREL extraction results. On MATRES, the SOTA model (Trong et al., 2022) still offers the best performance in terms of micro-F_1, whereas our model achieves a comparable macro-F_1 score and a lower calibration error. In contrast, our proposed faithful TEMPREL extractor outperforms the baseline methods on all evaluation metrics under the dataset shifts caused by the replacement of event triggers in MATRES-DS and the longer context distances between global event pairs in TDD-man. Specifically, our model shows a significant gain of 2.0% macro-F_1 and 0.8% micro-F_1 over the SOTA model on MATRES-DS and surpasses the previous SOTA model on TDD-man by 1.0% micro-F_1, not to mention the improvements in confidence calibration. We attribute this superior performance under dataset shifts to the mitigation of biases from prior knowledge and training set statistics, as well as to the techniques we employ to improve predictive uncertainty estimation. For a visual illustration of model calibration, we present in Fig. 4 the reliability diagram that plots the expected sample accuracy as a function of confidence.
Tab. 2 exhibits similar observations: our model outperforms both baselines on timeline construction by a large margin in terms of both metrics. Specifically, under the dataset shifts in MATRES-DS and TDD-man, our model surpasses the best baseline by 11.4% and 14.4% in accuracy, while reducing the minimum edit distance by a relative 11.4% and 28.9%. Evidently, the capabilities of selective prediction and bias mitigation make our model stand out in complex scenarios like timeline construction, whereas the bias and inferior calibration of existing models exacerbate unfaithful extractions when multiple decisions must be made simultaneously.

Ablation Study
To analyze the effect of each model component, we conduct an ablation study where we remove one component at a time (see Tab. 1)⁹. We observe on MATRES that, without the counterfactual analysis component, the model performance drops by 2.4% in micro-F_1 and 1.6% in macro-F_1. Under dataset shifts, the model performance on TDD-man is reduced by 3.8% in micro-F_1 and 2.4% in macro-F_1 without the parameterization of the Dirichlet Prior. Taking away the temperature scaling component only slightly influences the F_1 scores, while severely degrading the model calibration.
From the ablation results in Tab. 2, we notice that temperature scaling has modest effects on the model's performance, while the Dirichlet Prior plays the most important role in faithful timeline construction. It is also noteworthy that tense information considerably helps the model generalize under distribution shifts, in that it provides a useful feature applicable to all domains.

Case Study
As shown in Fig. 5, we provide a case study on timeline construction for three events.

Figure 5: A new Essex County task force began delving Thursday into the e_1:SLAYINGS of 14 people ... officials have been e_2:CAREFUL not to draw any firm conclusions, leaving open the possibility of a serial killer ... "I haven't e_3:SEEN a pattern yet," said Patricia Hurt, the Essex County prosecutor, who created the task force on Tuesday.

The reason why our model predicts AFTER for the third pair is probably the misleading temporal cues in the text, Thursday and Tuesday, while the long distance between the events undermines the confidence of this prediction. When our model builds a directed graph with the three relations, a cycle is identified and the edge with the lowest confidence is removed from the graph; our model thus constructs the correct timeline. Without tense information, the model makes a wrong prediction concerning the second event, whose trigger is an adjective. And without the Dirichlet Prior or temperature scaling, the model calibration becomes noticeably worse.

⁹ We leverage cross-entropy as the training loss when we remove the Dirichlet Prior in the training phase.

Conclusion
In this paper, we investigate improving the faithfulness of event TEMPREL extraction from two perspectives. To enhance the selectiveness of model predictions, we parameterize a Dirichlet Prior over the model-predicted categorical distribution to regularize the model to behave differently on ID and OOD data. To mitigate two types of biases from PLMs and training data, we add tense information to obtain robust event representations and conduct counterfactual analysis to reduce the risk of carrying prediction shortcuts into inference. We also employ temperature scaling to combine the two faithfulness perspectives, recalibrating the confidence measure of the model after bias mitigation. Through experimental analysis on MATRES, MATRES-DS, and TDDiscourse, we demonstrate that our model faithfully extracts event temporal relations and timelines from text and generalizes well under distribution shifts.

Acknowledgments
The authors would like to thank the anonymous ACL ARR reviewers for their insightful feedback on our work. This work was supported by Contract FA8750-19-2-1004 with the US Defense Advanced Research Projects Agency (DARPA). Approved for Public Release, Distribution Unlimited. This research is also based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the BETTER Program. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, the Department of Defense, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. In addition, this work is supported in part by the NSF Grant IIS 2105329, an Amazon Research Award and a Cisco Research Award.

Limitations
As the event representation introduced in our method is augmented with tense information, our method may face limitations when applied to languages other than English, especially tenseless languages and languages with fewer tenses. The training of our models also requires considerable GPU resources, which might produce environmental impacts, though the inference stage does not require many computational resources.

Ethics Statement
There are no direct societal implications of this work. The proposed method attempts to provide high-quality and faithful event TEMPREL extraction and timeline construction. We believe that this work demonstrates the intellectual merits of developing robust event-centric information extraction methods. For any information extraction method, the real-world open-source articles used to extract information may contain societal biases, and extracting event-event relations from articles with such biases may spread the bias into the acquired knowledge. Yet we believe that the proposed method can benefit various downstream NLP/NLU tasks such as event prediction, task-oriented dialogue systems, and risk detection.
A.2 Expected Calibration Error

Predictions are partitioned into M equally spaced confidence bins B_m, where acc(B_m) and conf(B_m) denote the average accuracy and the average confidence of the samples in bin B_m. ECE is then defined as:

ECE = Σ_{m=1}^{M} (|B_m| / n) · |acc(B_m) − conf(B_m)|

where n is the number of samples. The difference between acc and conf for a given bin represents the calibration gap (red bars in reliability diagrams, e.g., Fig. 4). We use ECE as the primary empirical metric to measure model calibration.
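Under this definition, ECE can be computed with a few lines of Python; the binning convention here (left-open intervals, zero confidence assigned to the first bin) is one common choice, not necessarily the exact one used in the experiments.

```python
def ece(confidences, correct, n_bins=10):
    """Expected calibration error: bin predictions by confidence and
    sum the |accuracy - confidence| gaps, weighted by bin size."""
    n = len(confidences)
    total = 0.0
    for m in range(n_bins):
        lo, hi = m / n_bins, (m + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (m == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)    # acc(B_m)
        conf = sum(confidences[i] for i in idx) / len(idx)  # conf(B_m)
        total += len(idx) / n * abs(acc - conf)
    return total

# A perfectly calibrated toy batch, and an overconfident one.
ece_perfect = ece([1.0, 1.0, 1.0, 1.0], [1, 1, 1, 1])
ece_over = ece([0.95] * 4, [1, 1, 0, 1])
```

In the overconfident batch all four predictions land in the (0.9, 1.0] bin with accuracy 0.75 against confidence 0.95, giving a gap of 0.2.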

A.3 Negative Log Likelihood
Negative log likelihood (NLL) is a standard measure of a probabilistic model's quality. It is also referred to as the cross-entropy loss in the context of deep learning. Given a probabilistic model π̂(Y | X) and n samples, NLL is defined as:

NLL = − Σ_{i=1}^{n} log π̂(y_i | X_i)

It is a standard result that, in expectation, NLL is minimized if and only if π̂(Y | X) recovers the ground-truth conditional distribution π(Y | X). The temperature T in temperature scaling is optimized with respect to NLL on the dev sets.
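As a sketch, the dev-set temperature fit can be approximated by a grid search over T minimizing NLL; in practice T is usually fit by gradient descent, and the grid and logits below are illustrative.

```python
import math

def nll(probs_list, labels):
    """Average negative log likelihood of the gold labels."""
    return -sum(math.log(p[y]) for p, y in zip(probs_list, labels)) / len(labels)

def softmax_T(z, T):
    """Temperature-scaled softmax over a logits vector."""
    m = max(z)
    e = [math.exp((v - m) / T) for v in z]
    s = sum(e)
    return [v / s for v in e]

def tune_temperature(logits_list, labels, grid=None):
    """Pick the T in a grid that minimizes dev-set NLL (a grid-search
    stand-in for the gradient-based fit used in practice)."""
    grid = grid or [0.5 + 0.1 * k for k in range(26)]   # T in [0.5, 3.0]
    return min(grid, key=lambda T: nll(
        [softmax_T(z, T) for z in logits_list], labels))

# Overconfident toy logits: one of three confident predictions is wrong,
# so a temperature above 1 lowers the dev NLL.
T_star = tune_temperature(
    [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0]], [0, 0, 0])
```

When a model is systematically overconfident, the NLL-optimal temperature exceeds 1, which is the typical outcome of post-hoc calibration on miscalibrated neural networks.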

A.4 Dataset Statistics
MATRES is composed of 275 news documents, and the train/dev/test split is 183/72/20 documents, in which 6,336/6,404/818 event pairs are annotated, respectively. The same statistics hold for MATRES-DS, since we only change the event triggers in the inputs, not the labels. In TDDiscourse, 4,000/650/1,500 and 32,609/1,435/4,258 TEMPREL's are annotated in the train/dev/test sets of TDD-man and TDD-Auto, respectively.

A.5 Experimental Setup and Hyperparameter Setting
In the training phase, we fine-tune the pre-trained 1024-dimensional Big Bird (Zaheer et al., 2020) to encode the context of event triggers. We obtain the tense information of event triggers with an off-the-shelf tense identifier¹⁰. The parameters of the model are optimized using AMSGrad (Reddi et al., 2018) with the learning rate set to 5 × 10⁻⁶ and the batch size set to 20, and the training process is limited to 40 epochs on a server with an Nvidia A6000 GPU. All experiments are repeated with five different random seeds, and the reported results are their averages. To obtain α_s in Eq. 3, we smooth the target means to redistribute a small amount of probability density to the other corners of the Dirichlet. In our experiments, we set λ_1 = λ_2 = 1 in Eq. 5. On the dev set of TDD-man, the optimal β's of the model in Eq. 6 are β_1 = −0.4 and β_2 = 0.6, where the search bounds a and b are set to −1 and 1.