ECONET: Effective Continual Pretraining of Language Models for Event Temporal Reasoning

While pre-trained language models (PTLMs) have achieved noticeable success on many NLP tasks, they still struggle with tasks that require event temporal reasoning, which is essential for event-centric applications. We present a continual pre-training approach that equips PTLMs with targeted knowledge about event temporal relations. We design self-supervised learning objectives to recover masked-out event and temporal indicators and to discriminate sentences from their corrupted counterparts (where event or temporal indicators are replaced). By further pre-training a PTLM with these objectives jointly, we reinforce its attention to event and temporal information, yielding enhanced capability on event temporal reasoning. This **E**ffective **CON**tinual pre-training framework for **E**vent **T**emporal reasoning (ECONET) improves the PTLMs' fine-tuning performances across five relation extraction and question answering tasks and achieves new or on-par state-of-the-art performances in most of our downstream tasks.


Introduction
Reasoning about event temporal relations is crucial for natural language understanding and facilitates many real-world applications, such as tracking biomedical histories (Sun et al., 2013; Bethard et al., 2015, 2016, 2017), generating stories (Yao et al., 2019; Goldfarb-Tarrant et al., 2020), and forecasting social events (Li et al., 2020; Jin et al., 2020). In this work, we study two prominent event temporal reasoning tasks as shown in Figure 1: event relation extraction (ERE) (Chambers et al., 2014; Ning et al., 2018; O'Gorman et al., 2016; Mostafazadeh et al., 2016), which predicts temporal relations between a pair of events, and machine reading comprehension (MRC) (Ning et al., 2020; Zhou et al., 2019), where a passage and a question about event temporal relations are presented and models need to provide correct answers using the information in the given passage.
Recent approaches leveraging large pre-trained language models (PTLMs) achieved state-of-the-art results on a range of event temporal reasoning tasks (Ning et al., 2020; Pereira et al., 2020; Wang et al., 2020; Zhou et al., 2020c; Han et al., 2019b). Despite the progress, vanilla PTLMs do not focus on capturing event temporal knowledge that can be used to infer event relations. For example, in Figure 1, an annotator of the QA sample can easily infer from the temporal indicator "following" that "transfer" happens BEFORE "preparing the paperwork," but a fine-tuned RoBERTa model predicts that "transfer" has no such relation with the event "preparing the paperwork." Plenty of such cases exist in our error analysis on PTLMs for event temporal relation-related tasks. We hypothesize that this deficiency is caused by the original PTLMs' random masking in pre-training, where temporal indicators and event triggers are under-weighted and hence not attended to well enough for our downstream tasks.

Table 1: The full list of the temporal lexicon. Categories are created based on the authors' domain knowledge and best judgment. *'once' can also be placed into the [past] category due to its second meaning of 'previously', which we exclude to keep words unique.
TacoLM (Zhou et al., 2020a) explored the idea of targeted masking and predicting textual cues of event frequency, duration and typical time, which showed improvements over vanilla PTLMs on related tasks. However, event frequency, duration and time do not directly help machines understand pairwise event temporal relations. Moreover, the mask prediction loss of TacoLM leverages a soft cross-entropy objective, which is manually calibrated with external knowledge and could inadvertently introduce noise in the continual pre-training.
We propose ECONET, a continual pre-training framework combining mask prediction and contrastive loss using our masked samples. Our targeted masking strategy focuses only on event triggers and temporal indicators as shown in Figure 1. This design helps models concentrate on events and temporal cues, and potentially strengthens their ability to understand event temporal relations in downstream tasks. We further pre-train PTLMs with the following objectives jointly: the mask prediction objective trains a generator that recovers the masked temporal indicators or events, and the contrastive loss trains a discriminator that shares the representations with the generator and determines whether a predicted masked token is corrupted or original (Clark et al., 2020). Our experiments demonstrate that ECONET is effective at improving the original PTLMs' performances on event temporal reasoning.
We briefly summarize our contributions. 1) We propose ECONET, a novel continual pre-training framework that integrates targeted masking and contrastive loss for event temporal reasoning. 2) Our training objectives effectively learn from the targeted masked samples and inject richer event temporal knowledge in PTLMs, which leads to stronger fine-tuning performances over five widely used event temporal commonsense tasks. In most target tasks, ECONET achieves SOTA results in comparison with existing methods. 3) Compared with full-scale pre-training, ECONET requires a much smaller amount of training data and can cope with various PTLMs such as BERT and RoBERTa. 4) In-depth analysis shows that ECONET successfully transfers knowledge in terms of textual cues of event triggers and relations into the target tasks, particularly under low-resource settings.

Method
Our proposed method aims at addressing the issue that, in vanilla PTLMs, event triggers and temporal indicators are not adequately attended to for our downstream event reasoning tasks. To achieve this goal, we propose to replace the random masking in PTLMs with a targeted masking strategy designed specifically for event triggers and temporal indicators. We also propose a continual pre-training method with mask prediction and contrastive loss that allows models to effectively learn from the targeted masked samples. The benefits of our method are manifested by stronger fine-tuning performances over downstream ERE and MRC tasks.
Our overall approach ECONET consists of three components. 1) Creating targeted self-supervised training data by masking out temporal indicators and event triggers in the input texts; 2) leveraging mask prediction and contrastive loss to continually train PTLMs, which produces an event temporal knowledge aware language model; 3) fine-tuning the enhanced language model on downstream ERE and MRC datasets. We will discuss each of these components in the following subsections.

Pre-trained Masked Language Models
The current PTLMs such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) follow a random masking strategy. Figure 1 shows such an example where random tokens / words are masked from the input sentences. More formally, let x = [x_1, ..., x_n] be a sequence of input tokens and x_t^m ∈ x^m denote a randomly masked token. The per-sample pre-training objective is to predict the identity x_t of each x_t^m with a cross-entropy loss,

L_MLM = − Σ_{x_t^m ∈ x^m} log P(x_t | h(x)),   (1)

where h(x) is x's transformer-encoded representation.

Figure 2: The proposed generator-discriminator (ECONET) architecture for event temporal reasoning. The upper block is the mask prediction task for temporal indicators and the bottom block is the mask prediction task for events. Both generators and the discriminator share the same representations.
Next, we will discuss the design and creation of targeted masks, training objectives and fine-tuning approaches for different tasks.

Targeted Masks Creation
Temporal Masks. We first compile a lexicon of 40 common temporal indicators listed in Table 1. With the temporal lexicon, we conduct string matches over 20 years of New York Times news articles 2 and obtain over 10 million 1-2 sentence passages that contain at least one temporal indicator. Finally, we replace each of the matched temporal indicators with a mask token. The upper block in Figure 2 shows two examples where "following" and "after" are masked from the original texts.

Event Masks. We build highly accurate event detection models (Han et al., 2019c; Zhang et al., 2021) to automatically label event trigger words in the 10 million passages mentioned above. Similarly, we replace these events with mask tokens. The bottom block in Figure 2 shows two examples where the events "transfer" and "resumed" are masked from the original texts.
2 NYT news articles are public from 1987-2007.
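As an illustration, the string-matching step for temporal masks can be sketched as follows. This is a minimal sketch: the six-word `TEMPORAL_LEXICON` is a hypothetical stand-in for the full 40-indicator lexicon of Table 1, and `<mask>` is a generic placeholder for the PTLM's actual mask token.

```python
import re

# Hypothetical mini lexicon; the paper's full lexicon (Table 1) has 40 indicators.
TEMPORAL_LEXICON = ["before", "after", "during", "following", "until", "once"]

# Match whole words only; try longer indicators first so they win over prefixes.
_pattern = re.compile(
    r"\b(" + "|".join(sorted(TEMPORAL_LEXICON, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def mask_temporal_indicators(passage, mask_token="<mask>"):
    """Replace each matched temporal indicator with a mask token.

    Returns the masked passage and the matched indicators (lower-cased),
    which serve as gold labels for the temporal generator.
    """
    gold = [m.group(0).lower() for m in _pattern.finditer(passage)]
    masked = _pattern.sub(mask_token, passage)
    return masked, gold

masked, gold = mask_temporal_indicators(
    "The transfer resumed following the paperwork.")
```

Note that this toy version masks every match in a passage, whereas the actual pre-training data keeps one mask per sample (see Section 3).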

Generator for Mask Predictions
To learn effectively from the targeted samples, we train two generators with shared representations to recover temporal and event masks.

Temporal Generator. The per-sample temporal mask prediction objective is computed using a cross-entropy loss,

L_T = − Σ_{x_t^T ∈ x^T} log P(x_t | f_T(h_G(x))),   (2)

where x_t^T ∈ x^T is a masked temporal indicator, h_G(x) is x's encoded representation using a transformer, and f_T is a linear layer module that maps the masked token representation into the label space T consisting of the 40 temporal indicators.

Event Generator. The per-sample event mask prediction objective is also computed using a cross-entropy loss,

L_E = − Σ_{x_t^E ∈ x^E} log P(x_t | f_E(h_G(x))),   (3)

where x_t^E ∈ x^E is a masked event trigger, h_G is the shared transformer encoder as in the temporal generator, and f_E is a linear layer module that maps the masked token representation into the label space E, which is the set of all event triggers in the data.
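To make the two generator losses concrete, here is a dependency-free numerical sketch: a shared masked-token representation feeds two task-specific linear heads (standing in for f_T and f_E), each scored with cross-entropy. The weights and dimensions are toy values, not the actual model's.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, gold_index):
    """Per-token mask-prediction loss: -log P(gold | logits)."""
    return -math.log(softmax(logits)[gold_index])

def linear(h, weights, bias):
    """A linear head (f_T or f_E) mapping a hidden vector to label-space logits."""
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weights, bias)]

# Toy shared hidden representation h_G(x) at the masked position.
h = [0.5, -1.0, 2.0]

# Two task-specific heads over different label spaces:
# f_T -> temporal indicators (3 here for brevity; 40 in the paper),
# f_E -> the event-trigger vocabulary (2 here).
W_T, b_T = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5], [0.2, 0.0, -0.4]], [0.0, 0.0, 0.0]
W_E, b_E = [[0.3, 0.1, 0.0], [-0.2, 0.4, 0.1]], [0.1, -0.1]

loss_T = cross_entropy(linear(h, W_T, b_T), gold_index=1)  # one term of L_T
loss_E = cross_entropy(linear(h, W_E, b_E), gold_index=0)  # one term of L_E
```

In the actual model both heads backpropagate into the same transformer encoder, which is what "shared representations" buys: temporal and event supervision update one set of parameters.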

Discriminator for Contrastive Learning
We incorporate a discriminator that provides additional feedback on mask predictions, which helps correct errors made by the generators.

Contrastive Loss. For a masked token x_t, we design a discriminator to predict whether the token recovered by mask prediction is original or corrupted. As shown in Figure 2, "following" and "resumed" are predicted correctly, so they are labeled as original, whereas "during" and "run" are incorrectly predicted and labeled as corrupted. We train the discriminator with a contrastive loss,

L_D = − Σ_{x_t^m ∈ x^m} [ y log f_D(h_D(x)) + (1 − y) log(1 − f_D(h_D(x))) ],   (4)

where y is a binary indicator of whether a mask prediction is correct or not, and f_D is the discriminator's final linear layer. h_D shares the same transformer encoder with h_G.

Perturbed Samples. Our mask predictions focus on temporal and event tokens, which are easier tasks than the original mask predictions in PTLMs. This could weaken the contrastive loss, as training a good discriminator requires relatively balanced original and corrupted samples. To deal with this issue, for r% of the generator's outputs, instead of using the recovered tokens, we replace them with a token randomly sampled from either the temporal lexicon or the event vocabulary. We fix r = 50 to make original and corrupted samples nearly balanced.
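The perturbed-sample construction can be sketched as below. This is a simplified sketch over plain strings (the actual pipeline works on token ids); `make_discriminator_example` is a hypothetical helper name.

```python
import random

def make_discriminator_example(gold_token, generator_token, vocab, r=0.5, rng=random):
    """Build one (token, label) pair for the discriminator.

    With probability r we discard the generator's recovered token and sample a
    random replacement from the temporal lexicon / event vocabulary, so that
    'original' (label 1) and 'corrupted' (label 0) examples stay roughly
    balanced even when the generator is usually right.
    """
    if rng.random() < r:
        # Forced corruption: sample any token other than the gold one.
        token = rng.choice([t for t in vocab if t != gold_token])
    else:
        token = generator_token
    label = 1 if token == gold_token else 0  # original vs. corrupted
    return token, label
```

With r = 0.5, roughly half the discriminator's inputs are forced corruptions regardless of generator accuracy, which is the balancing effect the paper relies on.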

Joint Training
To optimize the combined impact of all components in our model, the final training loss is the weighted sum of the individual losses, L = L_T + αL_E + βL_D, where α and β are hyper-parameters that balance the different training objectives. The temporal and event masked samples are assigned a unique identifier (1 for temporal, 0 for event) so that the model knows which linear layer to feed the transformer output into. Our overall generator-discriminator architecture resembles ELECTRA (Clark et al., 2020). However, our proposed method differs from that work in that 1) we use a targeted masking strategy as opposed to random masks; 2) both generators and the discriminator, i.e. h_G and h_D, share the hidden representations, but we allow task-specific final linear layers f_T, f_E and f_D; 3) we do not train from scratch and instead continue to train the transformer parameters provided by PTLMs.
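The routing-plus-weighting step can be sketched as follows, with per-sample losses reduced to plain floats and the batch represented as a list of dicts (a hypothetical layout, not the actual implementation):

```python
def joint_loss(batch, alpha=1.0, beta=1.0):
    """Weighted joint loss L = L_T + alpha * L_E + beta * L_D over a mixed batch.

    Each sample carries an identifier ('is_temporal': 1 for temporal masks,
    0 for event masks) selecting which generator head produced its
    mask-prediction loss; every sample also contributes a discriminator loss.
    """
    L_T = sum(s["gen_loss"] for s in batch if s["is_temporal"])
    L_E = sum(s["gen_loss"] for s in batch if not s["is_temporal"])
    L_D = sum(s["disc_loss"] for s in batch)
    return L_T + alpha * L_E + beta * L_D
```

Appendix C reports the ranges actually searched for β (1.0, 2.0); α and β simply rescale the gradient contribution of the event generator and the discriminator relative to the temporal generator.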

Fine-tuning on Target Tasks
After training with ECONET, we fine-tune the updated MLM on the downstream tasks. ERE samples can be denoted as [P, e_i, e_j, r_{i,j}], where P is the passage and (e_i, e_j) is a pair of event trigger tokens in P. As Figure 3a shows, we feed (P, e_i, e_j) into an MLM (trained with ECONET). Following the setup of Han et al. (2019a) and Zhang et al. (2021), we concatenate the final event representations v_i, v_j associated with (e_i, e_j) to predict the temporal relation r_{i,j}. The relation classifier is implemented as a multi-layer perceptron (MLP).

MRC/QA samples can be denoted as [P, Q, A], where Q represents a question and A denotes the answers. Figure 3b illustrates an extractive QA task where we feed the concatenated [P, Q] into an MLM. Each token x_i ∈ P has a label, with 1 indicating x_i ∈ A and 0 otherwise. The token classifier, implemented as an MLP, predicts labels for all x_i. Figure 3c illustrates another QA task where A is a candidate answer for the question. We feed the concatenated [P, Q, A] into an MLM and a binary classifier predicts a 0/1 label indicating whether A is a true statement for the given question.
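The ERE relation classifier can be sketched as a one-hidden-layer MLP over the concatenated event representations. Dimensions and weights below are toy values (the real model concatenates high-dimensional RoBERTa vectors), and the three-label scheme is a simplified illustration:

```python
def relation_logits(v_i, v_j, W1, b1, W2, b2):
    """One-hidden-layer MLP over the concatenated event representations
    [v_i; v_j], producing one logit per temporal relation label."""
    x = v_i + v_j  # list concatenation stands in for vector concatenation
    h = [max(0.0, sum(w * z for w, z in zip(row, x)) + b)  # ReLU hidden layer
         for row, b in zip(W1, b1)]
    return [sum(w * z for w, z in zip(row, h)) + b for row, b in zip(W2, b2)]

# Toy event representations (dimension 2 each -> concatenated dimension 4).
v_i, v_j = [0.1, 0.2], [0.3, 0.4]
W1 = [[0.5, -0.5, 0.1, 0.1], [0.2, 0.2, -0.3, 0.4]]
b1 = [0.0, 0.1]
# Three relation labels, e.g. BEFORE / AFTER / VAGUE in a simplified scheme.
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b2 = [0.0, 0.0, 0.0]
logits = relation_logits(v_i, v_j, W1, b1, W2, b2)
```

The extractive-QA head of Figure 3b is the same idea applied per token, and the Figure 3c head is the binary special case with a single output logit.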

Experimental Setup
In this section, we describe details of implementing ECONET, datasets and evaluation metrics, and discuss compared methods reported in Section 4.

Implementation Details
Event Detection Model. As mentioned briefly in Section 2, we train a highly accurate event prediction model to mask event triggers. We experimented with two models using event annotations in TORQUE (Ning et al., 2020) and TB-Dense (Chambers et al., 2014). These two sets of event annotations both follow previous event-centric reasoning research by using a trigger word (often a verb or a noun that most clearly describes the event's occurrence) to represent an event (UzZaman et al., 2013; Glavaš et al., 2014; O'Gorman et al., 2016). In both cases, we fine-tune RoBERTa LARGE on the train set and select models based on performance on the dev set. The primary results shown in Table 2 use TORQUE's annotations, but we conduct additional analysis in Section 4 to show that both models produce comparable results.
Continual Pre-training. We randomly selected only 200K out of the 10 million samples to speed up our experiments and found the results to be comparable to using far more data. We used half of these 200K samples for temporal masking and the other half for event masking. We ensure that none of these sample passages overlap with the target test data. To keep the mask tokens balanced between the two sets of training samples, we masked only one temporal indicator or one event (the one closest to the temporal indicator) per sample. We continued to train BERT and RoBERTa for up to 250K steps with a batch size of 8. The training process takes 25 hours on a single GeForce RTX 2080 GPU with 11G memory. Note that our method requires far fewer samples and is more computationally efficient than full-scale pre-training of language models, which typically requires multiple days of training on multiple large GPUs / TPUs.
For the generator-only models reported in Table 2, we excluded the contrastive loss and trained models with a batch size of 16 to fully utilize GPU memory. We leveraged the dev set of TORQUE to find the best hyper-parameters.

Fine-tuning. Dev set performances were used for early stopping, and average dev performances over three random seeds were used to pick the best hyper-parameters. Note that the test sets for the target tasks were never observed in any of the training processes, and their performances are reported in Table 2. All hyper-parameter search ranges can be found in Appendix C.

Datasets
We evaluate our approach on five datasets concerning temporal ERE and MRC/QA. We briefly describe these data below and list detailed statistics in Appendix A.
ERE Datasets. TB-Dense (Chambers et al., 2014), MATRES (Ning et al., 2018) and RED (O'Gorman et al., 2016) are all ERE datasets. Their samples follow the input format described in Section 2.6 where a pair of event (triggers) together with their context are provided. The task is to predict pairwise event temporal relations. The differences are how temporal relation labels are defined. Both TB-Dense and MATRES leverage a VAGUE label to capture relations that are hard to determine even by humans, which results in denser annotations than RED. RED contains the most fine-grained temporal relations and thus the lowest sample/relation ratio. MATRES only considers start time of events to determine their temporal order, whereas TB-Dense and RED consider start and end time, resulting in lower inter-annotator agreement.
TORQUE (Ning et al., 2020) is an MRC/QA dataset where annotators first identify event triggers in given passages and then ask questions regarding event temporal relations (ordering). Correct answers are event trigger words in passages. TORQUE can be considered as reformulating temporal ERE tasks as an MRC/QA task. Therefore, both ERE datasets and TORQUE are highly correlated with our continual pre-training objectives where targeted masks of both events and temporal relation indicators are incorporated.
MCTACO (Zhou et al., 2019) is another MRC/QA dataset, but it differs from TORQUE in that 1) events are not explicitly identified; 2) answers are statements with true or false labels; 3) questions involve broader temporal commonsense regarding not only temporal ordering, but also event frequency, duration and typical time, which may not be directly helpful for reasoning about temporal relations. For example, knowing how often a pair of events happens doesn't help us figure out which event happens earlier. Since our continual pre-training focuses on temporal relations, MCTACO could be the least compatible dataset in our experiments.

Evaluation Metrics
Three metrics are used to evaluate the fine-tuning performances.
F_1: for TORQUE and MCTACO, we follow the data papers (Ning et al., 2020; Zhou et al., 2019) and report the macro average of each question's F_1 score. For TB-Dense, MATRES and RED, we report standard micro-average F_1 scores to be consistent with the baselines.

Exact-match (EM): for both MRC datasets, EM = 1 if answer predictions match perfectly with the gold annotations; otherwise, EM = 0.

Table 2: [...] Ning et al. (2020), and the numbers are averaged over 3 random seeds. The SOTA performances for MCTACO† are provided by Pereira et al. (2020); TB-Dense†† and MATRES‡ by Zhang et al. (2021); and RED‡‡ by Han et al. (2019b). †, ††, ‡ and ‡‡ only report the best single-model results; to make fair comparisons with these baselines, we report both average and best single-model performances. The TacoLM baseline uses the provided and recommended checkpoint for extrinsic evaluations.
EM-consistency (C): in TORQUE, some questions can be clustered into the same group due to the data collection process. This metric reports the average EM score for a group as opposed to a question in the original EM metrics.
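The EM and EM-consistency metrics can be sketched as below. This is a simplified reading of the metrics (answer sets compared ignoring order); `question_groups` is a hypothetical structure mapping a group id to its (predicted, gold) answer pairs.

```python
def exact_match(pred_answers, gold_answers):
    """EM = 1 iff the predicted answer set matches the gold set exactly."""
    return int(set(pred_answers) == set(gold_answers))

def em_consistency(question_groups):
    """Average per-GROUP exact match: a group scores 1 only if every
    question clustered into it is answered with a perfect exact match.

    question_groups maps group id -> list of (pred, gold) answer-list pairs.
    """
    scores = [int(all(exact_match(p, g) for p, g in qs))
              for qs in question_groups.values()]
    return sum(scores) / len(scores)
```

Consistency is therefore strictly harsher than plain EM: one missed question zeroes out its whole group, which is why the C column in Table 2 is the lowest of the three metrics.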

Compared Methods
We compare several pre-training methods with ECONET: 1) RoBERTa LARGE is the original PTLM, which we fine-tune directly on the target tasks; 2) RoBERTa LARGE + ECONET is our proposed continual pre-training method; 3) RoBERTa LARGE + generator only uses the generator component in continual pre-training; 4) RoBERTa LARGE + random mask keeps the original PTLMs' objectives and replaces the targeted masks in ECONET with randomly masked tokens. The names of the methods for continually pre-training BERT LARGE can be derived by replacing RoBERTa LARGE with BERT LARGE. We also fine-tune the pre-trained TacoLM on the target datasets. The current SOTA systems we compare with are provided by Ning et al. (2020), Pereira et al. (2020), Zhang et al. (2021) and Han et al. (2019b). More details are presented in Section 4.1.

Results and Analysis
As shown in Table 2, we report two baselines. The first one, TacoLM, is a related work that focuses on event duration, frequency and typical time. The second one is the current SOTA result reported to the best of the authors' knowledge. We also report our own implementations of fine-tuning BERT LARGE and RoBERTa LARGE to compare fairly with ECONET. Unless specified otherwise, all gains mentioned in the following sections are in units of absolute percentage.

Comparisons with Existing Systems
TORQUE. The current SOTA system reported in Ning et al. (2020) fine-tunes RoBERTa LARGE, and our own fine-tuned RoBERTa LARGE achieves on-par F_1, EM and C scores. The gains of RoBERTa LARGE + ECONET over the current SOTA performances are 0.9%, 0.5% and 2.3% per the F_1, EM and C metrics.

MCTACO. The current SOTA system ALICE (Pereira et al., 2020) also uses RoBERTa LARGE as the text encoder, but leverages adversarial attacks on input samples. ALICE achieves 79.5% and 56.5% per the F_1 and EM metrics on the test set for the best single model, whereas the best performances for RoBERTa LARGE + ECONET are 76.8% and 54.7% per the F_1 and EM scores, which do not outperform ALICE. This gap can be caused by the fact that the majority of samples in MCTACO reason about event frequency, duration and time, which are not directly related to event temporal relations.

TB-Dense + MATRES. The most recent SOTA system reported in Zhang et al. (2021) uses both BERT LARGE and RoBERTa LARGE as text encoders, but leverages syntactic parsers to build large graphical attention networks on top of PTLMs. RoBERTa LARGE + ECONET's fine-tuning performances are essentially on par with this work without additional parameters. For TB-Dense, our best model outperforms Zhang et al. (2021) by 0.1%, while for MATRES, our best model underperforms by 1.0% per F_1 scores.

RED. The current SOTA system reported in Han et al. (2019b) uses BERT BASE as word representations (no fine-tuning) and a BiLSTM as the feature extractor. The single best model achieves a 34.0% F_1 score, and RoBERTa LARGE + ECONET is 9.8% higher than this baseline.

The Impact of ECONET
Overall Impact. ECONET in general works better than the original RoBERTa LARGE across the 5 different datasets, and the improvements are most salient in TORQUE, with 1.0%, 2.0% and 1.5% gains per the F_1, EM and C scores; in MCTACO, with a 2.4% lift in the EM score; and in TB-Dense and RED, with 2.0% and 3.4% improvements respectively in F_1 scores. We observe that the improvements of ECONET over BERT LARGE are smaller, and ECONET sometimes hurts the fine-tuning performances. We speculate this could be related to BERT being less capable of handling temporal reasoning tasks, but we leave more rigorous investigations to future research.

Impact of Contrastive Loss. Comparing the average performances of continual pre-training with the generator only and with ECONET (generator + discriminator), we observe that the generator alone can improve the performances of RoBERTa LARGE on 3 out of 5 datasets. However, except for TB-Dense, ECONET is able to improve fine-tuning performances further, which shows the effectiveness of using the contrastive loss.

Significance Tests. As the current SOTA models are either not publicly available or under-perform RoBERTa LARGE, we resort to testing the statistical significance of the best single models of ECONET against RoBERTa LARGE. Table 8 in the appendix lists all improvements' p-values per McNemar's test (McNemar, 1947). MATRES appears to be the only one that is not statistically significant.

Impact of Event Models
Event trigger definitions have been consistent in previous event temporal datasets (O'Gorman et al., 2016;Chambers et al., 2014;Ning et al., 2020). Trigger detection models built on TORQUE and TB-Dense both achieve > 92% F 1 scores and > 95% precision scores. For the 100K pre-training data selected for event masks, we found an 84.5% overlap of triggers identified by both models. We further apply ECONET trained on both event mask data to the target tasks and achieve comparable performances shown in Table 10 of the appendix. These results suggest that the impact of different event annotations is minimal and triggers detected in either model can generalize to different tasks.

Additional Ablation Studies
To better understand our proposed model, we experiment with additional continual training methods and compare their fine-tuning performances.

Random Masks. As most target datasets we use are in the news domain, to study the impact of potential domain adaptation, we continue to train PTLMs with the original objective on the same data using random masks. To compare fairly with the generator and ECONET, we only mask 1 token per training sample. The search range of hyper-parameters is the same as in Section 3. As Tables 3 and 11 (appendix) show, continual pre-training with random masks, in general, does not improve and sometimes hurts fine-tuning performances compared with fine-tuning the original PTLMs. We hypothesize that this is caused by masking a smaller fraction of tokens (1 out of ~50 on average) than the original 15%. RoBERTa LARGE + ECONET achieves the best fine-tuning results across the board.

Fine-tuning under Low-resource Settings
In Table 4, we compare the improvements of fine-tuning RoBERTa LARGE + ECONET over RoBERTa LARGE using full and 10% of the training data. Measured by both absolute and relative percentage gains, the majority of the improvements are much more significant under low-resource settings. This suggests that the transfer of event temporal knowledge is more salient when data is scarce. We further show fine-tuning performance comparisons using different ratios of the training data in Figure 6a-6b in the appendix. The results demonstrate that ECONET can outperform RoBERTa LARGE consistently when fine-tuning TORQUE and RED.

Attention Scores on Temporal Indicators
In this section, we attempt to show explicitly how ECONET enhances MLMs' attention to temporal indicators for downstream tasks.

Figure 4: Cumulative attention score comparisons between RoBERTa LARGE and ECONET on TB-Dense test data. All numbers are multiplied by 100 and averaged over 3 random seeds for illustration clarity.

As mentioned in Sec. 2.6, for a particular ERE task (e.g. TB-Dense), we need to predict the temporal relation between a pair of event triggers e_i, e_j ∈ P_{i,j} with associated vector representations v_i^{l,h}, v_j^{l,h}, l ∈ L, h ∈ H in an MLM, where L and H are the number of layers and attention heads respectively. We further use T_m ∈ T to denote a temporal indicator category listed in Table 1, and t_{m,n} ∈ T_m to denote a particular temporal indicator. If we let attn(v_i^{l,h}, v_x^{l,h}) represent the attention score between an event vector and any other hidden vector, we can aggregate the per-layer attention score between e_i and t_{m,n} as

a_{i,t_{m,n}}^l = Σ_{h=1}^{H} attn(v_i^{l,h}, v_{t_{m,n}}^{l,h}).

Similarly, we can compute a_{j,t_{m,n}}^l. The final per-layer attention score for (e_i, e_j) is a_{t_{m,n}}^l = ½ (a_{i,t_{m,n}}^l + a_{j,t_{m,n}}^l). To compute the attention score for the T_m category, we take the average of {a_{t_{m,n}}^l | ∀t_{m,n} ∈ T_m and ∀t_{m,n} ∈ P_{i,j}}. Note that we assume a temporal indicator is a single token to simplify the notation above; for multi-token indicators, we take the average of attn(v_i^{l,h}, v_{x∈t_{m,n}}^{l,h}).

Figure 4 shows the cumulative attention scores for the temporal indicator categories [before], [after] and [during] in ascending order of model layers. We observe that the attention scores for RoBERTa LARGE and ECONET align well on the bottom layers, but ECONET outweighs RoBERTa LARGE in the middle to top layers. Previous research reports that upper layers of pre-trained language models focus more on complex semantics as opposed to shallow surface forms or syntax on the lower layers (Tenney et al., 2019; Jawahar et al., 2019). Thus, our findings provide another piece of evidence that targeted masking is effective at capturing temporal indicators, which could facilitate semantic tasks including temporal reasoning.
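The aggregation above can be sketched as follows. This is a minimal sketch assuming attention weights are available as nested lists `attn_layer[h][q][k]` (head, query position, key position), with summation over heads following the cumulative-score reading:

```python
def event_indicator_score(attn_layer, event_pos, indicator_positions):
    """a^l_{i,t}: one layer's attention from an event token to a temporal
    indicator, summed over heads and, for multi-token indicators, averaged
    over the indicator's token positions.

    attn_layer[h][q][k] is the attention weight from query position q to
    key position k in head h.
    """
    return sum(
        sum(head[event_pos][k] for k in indicator_positions)
        / len(indicator_positions)
        for head in attn_layer
    )

def pair_score(attn_layer, e_i, e_j, indicator_positions):
    """Per-layer score for an event pair: a^l = (a^l_i + a^l_j) / 2."""
    return 0.5 * (event_indicator_score(attn_layer, e_i, indicator_positions)
                  + event_indicator_score(attn_layer, e_j, indicator_positions))

# Toy layer: one head over three token positions [event_i, event_j, indicator].
attn = [[[0.2, 0.3, 0.5],
         [0.1, 0.1, 0.8],
         [0.3, 0.3, 0.4]]]
score = pair_score(attn, e_i=0, e_j=1, indicator_positions=[2])
```

Averaging `pair_score` over all indicators of a category present in the passage then gives the per-category curve plotted in Figure 4.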

Temporal Knowledge Injection
We hypothesize in the introduction that vanilla PTLMs lack special attention to temporal indicators and events, and that our proposed method addresses this issue through a particular mask prediction strategy and a discriminator that is able to distinguish reasonable events and temporal indicators from noise. In this section, we show in more detail how this mechanism works. The heat maps in Figure 5 show the fine-tuning performance differences between 1) RoBERTa LARGE and continual pre-training with random masks (Figure 5a); and 2) RoBERTa LARGE and ECONET (Figure 5b). Each cell shows the difference for a label class in TB-Dense conditional on the sample's input passage containing a temporal indicator in one of the categories specified in Table 1. Categories with fewer than 50 sample matches are excluded from the analysis.
In Figure 5a, the only gains come from VAGUE, which is an undetermined class in TB-Dense to handle unclear pairwise event relations. This shows that continual pre-training with random masks works no better than original PTLMs to leverage existing temporal indicators in the input passage to distinguish positive temporal relations from unclear ones. On the other hand, in Figure 5b, having temporal indicators in general benefits much more for BEFORE, AFTER, IS_INCLUDED labels. The only exception is INCLUDES, but it is a small class with only 4% of the data.
More interestingly, the diagonal cells, i.e. those where the indicator category matches the predicted relation label, show this benefit most directly. Combining these two sets of results, we provide additional evidence that ECONET helps PTLMs better capture temporal indicators and thus results in stronger fine-tuning performances.
Our final analysis attempts to show why the discriminator helps. We feed 1K unused masked samples into the generator of the best ECONET in Table 2 to predict either the masked temporal indicators or the masked events. We then examine the accuracy of the discriminator on correctly and incorrectly predicted masked tokens. As shown in Table 12 of the appendix, the discriminator aligns well with the event generator's predictions. For the temporal generator, the discriminator disagrees substantially (82.2%) with the "incorrect" predictions, i.e. the generator predicts a supposedly wrong indicator, but the discriminator judges it to look original.
To understand why, we randomly selected 50 disagreed samples and found that 12 of these "incorrect" predictions fall into the same temporal indicator group as the original ones and 8 of them belong to related groups in Table 1. More details and examples can be found in Table 13 in the appendix. This suggests that despite being nearly perfect replacements for the original masked indicators, these samples (40% of those inspected) are penalized as wrong predictions when training the generator. The discriminator, by disagreeing with the generator, provides opposing feedback that trains the overall model to better capture indicators with similar temporal signals.

Related Work
Language Model Pretraining. Since the breakthrough of BERT (Devlin et al., 2018), PTLMs have become SOTA models for a variety of NLP applications. There have also been several modifications/improvements built on the original BERT model. RoBERTa (Liu et al., 2019) removes the next sentence prediction in BERT and trains with longer text inputs and more steps. ELECTRA (Clark et al., 2020) proposes a generator-discriminator architecture, and addresses the sample-inefficiency issue in previous PTLMs.
Recent research has explored methods to continually train PTLMs so that they adapt better to downstream tasks. For example, TANDA (Garg et al., 2019) adopts intermediate training on a modified Natural Questions dataset (Kwiatkowski et al., 2019) so that it performs better on the Answer Sentence Selection task. Zhou et al. (2020b) proposed continual training objectives that require a model to distinguish natural sentences from those with concepts randomly shuffled or generated by models, which enables language models to capture large-scale commonsense knowledge.

Event Temporal Reasoning. There has been a surge of attention to event temporal reasoning research recently. Some noticeable ERE datasets include TB-Dense (Chambers et al., 2014), MATRES (Ning et al., 2018) and RED (O'Gorman et al., 2016). Previous SOTA systems on these data leveraged PTLMs and structured learning (Han et al., 2019c; Wang et al., 2020; Zhou et al., 2020c) and substantially improved model performances, though none of them tackled the lack of event temporal knowledge in PTLMs. TORQUE (Ning et al., 2020) and MCTACO (Zhou et al., 2019) are recent MRC datasets that attempt to reason about event temporal relations using natural language rather than the ERE formalism. Zhou et al. (2020a) and Zhao et al. (2020) are two recent works that attempt to incorporate event temporal knowledge in PTLMs. The former focuses on injecting temporal commonsense with targeted event time, frequency and duration masks, while the latter leverages distantly labeled pairwise event temporal relations, masks before/after indicators, and focuses on the ERE application only. Our work differs from them by designing a targeted masking strategy for event triggers and comprehensive temporal indicators, proposing a continual training method with mask prediction and contrastive loss, and applying our framework to a broader range of event temporal reasoning tasks.

Conclusion and Future Work
In summary, we propose a continual training framework with targeted mask prediction and a contrastive loss that enables PTLMs to capture event temporal knowledge. Extensive experimental results show that both the generator and discriminator components help improve fine-tuning performance on five commonly used event temporal reasoning datasets. The improvements of our methods are much more pronounced in low-resource settings, which points to a promising direction for few-shot learning in this research area.

A Dataset Statistics
Table 5 describes basic statistics for the target datasets used in this work. The numbers of train/dev/test samples for TORQUE and MCTACO are question-based. MCTACO provides no training set, so we train on the dev set and report evaluation results on the test set, following Pereira et al. (2020). The numbers of train/dev/test samples for TB-Dense, MATRES and RED refer to (event, event, relation) triplets. MATRES and RED do not provide a standard dev set, so we follow the splits used in Zhang et al. (2021) and Han et al. (2019b). The download link for the (processed) continual pretraining data is provided in the README file of the code package.

B Event Detection Model
As mentioned briefly in Section 2, we train an event prediction model using the event annotations in TORQUE. We fine-tune RoBERTa LARGE on the training set and select models based on dev set performance. The best model achieves > 92% event prediction F1 with > 95% precision after just 1 epoch of training, indicating that this is a highly accurate model.
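For reference, a minimal sketch of how token-level event precision, recall, and F1 can be computed from binary gold and predicted labels. This metric setup is an assumption for illustration; TORQUE's official scorer may differ in detail:

```python
def event_f1(gold, pred):
    """Token-level precision, recall, and F1 for binary event labels
    (1 = event trigger, 0 = not an event)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [1, 0, 1, 1, 0]
pred = [1, 0, 1, 0, 0]
p, r, f = event_f1(gold, pred)
# p = 1.0, r = 2/3, f = 0.8
```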

C Reproduction Checklist
Number of parameters. We continue to train BERT LARGE and RoBERTa LARGE, so the number of parameters is the same as in the original PTLMs: 336M.
Hyper-parameter Search. Due to computation constraints, we had to limit the search range of hyper-parameters for ECONET. For learning rates, we tried (1e-6, 2e-6); for the weight on the contrastive loss (β), we tried (1.0, 2.0).
Best Hyper-parameters. In Table 6 and Table 7, we provide hyper-parameters for our best performing language models using RoBERTa LARGE + ECONET and BERT LARGE + ECONET, along with the best hyper-parameters for fine-tuning them on downstream tasks. For fine-tuning on the target datasets, we conducted a grid search over learning rates in (5e-6, 1e-5) and batch sizes in (2, 4, 6, 12). We fine-tuned all models for 10 epochs with three random seeds (5, 7, 23).
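The fine-tuning grid can be enumerated as below; the ranges mirror those reported, while the training call itself is a hypothetical placeholder:

```python
from itertools import product

# Search ranges reported for fine-tuning on the target datasets.
learning_rates = [5e-6, 1e-5]
batch_sizes = [2, 4, 6, 12]
seeds = [5, 7, 23]

configs = list(product(learning_rates, batch_sizes, seeds))
# 2 learning rates x 4 batch sizes x 3 seeds = 24 runs

# for lr, bs, seed in configs:
#     score = fine_tune(lr=lr, batch_size=bs, seed=seed)  # placeholder, not a real API
```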

Dev Set Performances. We show average dev set performances in Table 9, corresponding to our main results in Table 2.

D Significance Tests.

F Variants of ECONET
We also experimented with a variant of ECONET by first pretraining RoBERTa LARGE + Generator for a few thousand steps and then continuing to pretrain with ECONET. However, this method leads to worse fine-tuning results, which seems to contradict the suggestions in Zhou et al. (2020b) and Clark et al. (2020) that the generator needs to be trained first to obtain a good prediction distribution for the discriminator. We speculate that this is because our temporal and event mask predictions are easier tasks than those in the previous work, which makes the "warm-up steps" for the generator unnecessary. Table 12 shows the alignment between the generator and the discriminator, and Table 13 shows examples of "disagreed" samples between the two.
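The two training schedules compared above can be sketched as follows; the loss values and β are illustrative numbers, not actual training quantities:

```python
def joint_step(generator_loss, discriminator_loss, beta=1.0):
    """Joint ECONET-style objective: mask-prediction loss plus a
    beta-weighted contrastive (discriminator) term from the first step."""
    return generator_loss + beta * discriminator_loss

def warmup_then_joint(step, warmup_steps, g_loss, d_loss, beta=1.0):
    """Variant: train the generator alone for warmup_steps, then add the
    discriminator term (the schedule that underperformed in our experiments)."""
    return g_loss if step < warmup_steps else g_loss + beta * d_loss

# e.g. with beta = 2.0:
assert joint_step(0.5, 0.25, beta=2.0) == 1.0
```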

G Impact of Event Models
Type II. Related Group: 8/50 (16%)
Ex 3. original: in the past; predicted: before
text: Mr. Douglen confessed that Lautenberg, which had won mask, was "a seasoned roach and was ready for this race...
Ex 4. original: previously; predicted: once
text: Under the new legislation enacted by Parliament, divers who mask had access to only 620 miles of the 10,000 miles of Greek coast line will be able to explore ships and "archaeological parks" freely...

Table 13: Categories and examples of highly related "incorrect" temporal indicator predictions by the generator, but labeled as "correct" by the discriminator.