Conditional Generation of Temporally-ordered Event Sequences

Models of narrative schema knowledge have proven useful for a range of event-related tasks, but they typically do not capture the temporal relationships between events. We propose a single model that addresses both temporal ordering, sorting given events into the order they occurred, and event infilling, predicting new events which fit into an existing temporally-ordered sequence. We use a BART-based conditional generation model that can capture both temporality and common event co-occurrence, meaning it can be flexibly applied to different tasks in this space. Our model is trained as a denoising autoencoder: we take temporally-ordered event sequences, shuffle them, delete some events, and then attempt to recover the original event sequence. This task teaches the model to make inferences given incomplete knowledge about the events in an underlying scenario. On the temporal ordering task, we show that our model is able to unscramble event sequences from existing datasets without access to explicitly labeled temporal training data, outperforming both a BERT-based pairwise model and a BERT-based pointer network. On event infilling, human evaluation shows that our model is able to generate events that fit better temporally into the input events when compared to GPT-2 story completion models.


Introduction
This paper proposes a single model of events to support inferences in two seemingly different tasks: (1) temporal event ordering and (2) event infilling, or inferring unseen or unmentioned events occurring as part of a larger scenario. Figure 1 shows an example illustrating these two goals. Unlike prior approaches, we aim to address both with the same model architecture, rather than having to annotate data and build ad-hoc models for each task separately; our goal is to work towards models that capture temporal event knowledge broadly and support a wide range of inferences. We thus need a suitably general modeling framework to capture temporal knowledge about events, which in our case will be a BART-based (Lewis et al., 2020) model we call TemporalBART. Note that classic temporal relation extraction models, which model temporal ordering in context for a particular document, may chiefly learn how to use local discourse cues rather than generalizable event knowledge (Chambers et al., 2014; Ning et al., 2018b).
The goals in this work relate to past work on learning narrative schemas (Mooney and DeJong, 1985; Chambers, 2013; Peng and Roth, 2016; Peng et al., 2017). Our approach particularly follows a recent line of work using distributed representations of schemas (Pichotta and Mooney, 2016; Weber et al., 2018b), which support inferences about events without explicitly materializing a discrete schema library. The target tasks in this work are directly motivated by downstream applications of schema learning. Text generation tasks like story completion rely on understanding what makes narratives plausible and what events might be likely to happen before, after, and between other events (Jain et al., 2017; Yao et al., 2019), motivating our event infilling task. Answering questions about causes, effects, or what might happen next in a scenario requires knowing typical temporal orders of event sequences (Zhou et al., 2019; Ning et al., 2020), motivating our temporal ordering task.
Prior work has not combined traditional event co-occurrence with event temporality as we do.
We propose a conditional generation model to tackle temporal event ordering and event infilling, and train it as a denoising autoencoder over out-of-context temporal event sequences. As shown in Figure 1, the encoder of our TemporalBART model reads a temporally scrambled sequence of a subset of input events, obtained by corrupting a temporally-ordered sequence of events from a corpus. The decoder, which can be viewed as a conditional event language model (Kiyomaru et al., 2019; Bosselut et al., 2019; Madaan et al., 2020), then reconstructs the complete, temporally-ordered event sequence. Such denoising training has been successfully exploited in many applications (Vincent et al., 2010; Lu et al., 2013; Lample et al., 2018; Lewis et al., 2020), and using seq2seq models to reorder and smooth inputs has been explored before (Goyal and Durrett, 2020), but to our knowledge we are the first to apply this in this temporal modeling setting. The conditional generation architecture of our model is flexible enough to address a variety of tasks, including our temporal ordering and event infilling tasks, by either sampling from the model or using it to score sequences. Capitalizing on the success of recent pre-trained encoder-decoder transformers (Lewis et al., 2020; Raffel et al., 2020), our model itself is based on BART, consuming and producing predicate-argument structures rendered in surface order.
Gathering large-scale, high-quality labeled data with temporal annotations is often expensive and requires specially designed annotation schemes (Pustejovsky et al., 2003a; Cassidy et al., 2014; Ning et al., 2018b; Zhao et al., 2021). Here, we instead turn to a corpus of narrative documents, EventsNarratives (Yao and Huang, 2018), and design an automatic method to extract the training data we need. In these documents, discourse order is loosely assumed to reflect temporal order, so events extracted from this text can directly provide training data for our models. This use of automatic annotation allows us to use broad-domain data, giving us a strong domain-independent temporal model (Zhao et al., 2021).
To evaluate how well our proposed models capture temporal knowledge and solve the two targeted tasks, we apply them on out-of-domain test sets in a zero-shot manner. Specifically, for event ordering, we first extract test temporal event sequences from the CaTeRS (Mostafazadeh et al., 2016b) and MCTaco (Zhou et al., 2019) datasets, which include annotations of temporal relations between events. We then compare the performance of our models with two baselines: a BERT-based pairwise model and a BERT-based pointer network. For event infilling, we use the test event sequences from CaTeRS and examine the ability of our models to order unseen events and generate infilled events in comparison with GPT-2 baselines from story generation. Our BART-based models significantly outperform the baseline models on the ordering settings we consider, and human evaluation verifies that our models can generate infilled events that are better temporally-ordered with respect to the input.

Background and Related Work
Learning temporal knowledge to order events and generating new events as part of schemas or stories are two problems that have received significant attention, but in contrast to our work, previous work typically focuses on each in isolation.

Temporal Event Ordering
Closely related to the temporal ordering aspect of this paper is temporal relation extraction, which orders pairs of events in text in document context (Pustejovsky et al., 2003b; Cassidy et al., 2014; Ning et al., 2018b). This problem has been addressed as pairwise classification (Mani et al., 2006; Verhagen et al., 2007; Chambers et al., 2007; Verhagen and Pustejovsky, 2008; Cheng and Miyao, 2017; Tourille et al., 2017; Goyal and Durrett, 2019) or as a structured learning problem to enforce constraints on the output (Do et al., 2012; Ning et al., 2017, 2018a; Leeuwenberg and Moens, 2017; Han et al., 2019a,b). However, even in these latter works, the models focus on pairwise relations. In contrast, our work here views temporal event ordering as a sequence generation problem, which provides models a stronger inductive bias to capture global temporal relations between events. One recent effort (Madaan and Yang, 2020) treats this task as a graph generation problem, and so is able to predict more complex structures, but it focuses solely on ordering and is not suitable for our event infilling goals.

Schema Induction
Schema learning systems are often evaluated on their ability to predict unseen events. Initial work attempted to use statistical methods to derive a library of schematic information (Mooney and DeJong, 1985; Chambers and Jurafsky, 2008; Jans et al., 2012). Another thread exploits event language modeling to learn distributions over events (Pichotta and Mooney, 2016; Peng and Roth, 2016; Weber et al., 2018b), or focuses on learning event representations (Modi, 2016; Weber et al., 2018a) rather than writing down discrete schemas. However, most of this work only models the co-occurrence between events instead of directly considering temporal information, and represents events only as a small tuple of S-V-O headwords. Another line of work instead directly focuses on extracting coherent narratives from "story salads" (Wang et al., 2018) or more broadly generating narratives given predefined scenarios (Wang et al., 2019; Qin et al., 2020). However, without considering temporal ordering, these systems are prone to learn the discourse ordering of events instead of a strong representation of temporal knowledge.

Task Formulation and Model
Our framework involves modeling a conditional distribution P(y | x) over temporal event sequences y = {e_1, · · · , e_l}, which are sequences of events taken out of context (i.e., not represented as spans in a document) that are part of the same scenario, involve shared actors, and are temporally ordered. The input of the model is a (not necessarily temporal) sequence of events x = {e_1, · · · , e_m} that represents incomplete information about the scenario y: a partial set of unordered events. Our model should learn distributions over a true underlying order of events, without obvious gaps in the event sequence, given this incomplete information. By taking events out of context rather than in the context of a document, we encourage the model to encode temporal knowledge between events rather than superficial cues like surface textual order or discourse connectives that might determine their order.
For the definition of events, we follow Chambers and Jurafsky (2008) where an event e is a predicate v e along with its arguments (Palmer et al., 2005).
Our model can be formulated as a denoising autoencoder if x is created as a noised version of y. Specifically, given a temporal event sequence y as defined above, we first corrupt it to get the required input x by performing two transformation functions consecutively (see Figure 2):

Event Shuffling

We first perform a random shuffling of the events in y to produce x. To perfectly reconstruct the original sequence y, the model must capture the temporal relations between events.

Event Deletion
We randomly delete each event in y with probability p to produce x. This denoising scheme is similar to the token deletion transformation in Lewis et al. (2020). To perfectly reconstruct the original event sequence, the model needs to encode schema-like event knowledge so as to generate events not included in the input x and insert them at correct positions. As a result, this denoising can help the model learn event infilling. We train our model to maximize log P (y | x) on this automatically-constructed data.
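The two corruption operations can be sketched as follows. This is a minimal illustration: the function and parameter names are hypothetical, and the default deletion probability is a placeholder rather than the paper's exact value of p.

```python
import random

def corrupt(events, delete_prob=0.15, rng=None):
    """Corrupt a temporally ordered event sequence by (1) shuffling it
    and (2) deleting each event with probability delete_prob."""
    rng = rng or random.Random(0)
    x = list(events)
    rng.shuffle(x)                                      # Event Shuffling
    x = [e for e in x if rng.random() >= delete_prob]   # Event Deletion
    if not x:
        # keep at least one event so the model has some input to condition on
        x = [rng.choice(events)]
    return x
```

Training pairs are then (corrupt(y), y), and the model is trained to maximize log P(y | x) on these automatically constructed examples.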

Model Architecture
To leverage the power of pretrained transformers, we adopt BART (Lewis et al., 2020) as the underlying architecture for our model, and initialize our model with its pretrained weights.
The overall model, shown in Figure 3, takes a corrupted event sequence x = {e_i} as input, and outputs the true event sequence y = {e_j}. To feed the event-based inputs and outputs to BART, we need to represent each event e in a textual format Repr(e). We represent e with the concatenation of its predicate and all arguments. Unlike previous work which only uses the syntactic heads of the predicate and certain arguments (Pichotta and Mooney, 2016; Weber et al., 2018a,b), our approach preserves complex noun phrase arguments and exposes to the model arguments like temporal modifiers. We strike a balance between using enough information to have meaningful event representations and not consuming entire documents (Han et al., 2019a,b), which would result in a model that overly relies on discourse clues. We then consider two variants for input and output:

TemporalBART This model first encodes each event e_i in x as Repr(e_i), and concatenates them, with a special token [E] prepended in front of each event. This special token can help the model identify the boundary between the input events; such placeholder tokens have been used in related tasks like entity tracking in procedural text (Gupta and Durrett, 2019). For the output, we additionally prepend each event's predicate before its representation. This setup not only provides an extra supervision signal that encourages the model to predict ordering on the basis of predicates, but also allows us to post-hoc recover an event sequence by checking the predicate part of the generation.
TemporalBART-indexed This model, depicted in Figure 3, uses the same input and output format as TemporalBART, except that the special token prepended before each event e_i is [Ei] instead of [E]. For the output, if e_j is one of the input events, i.e., e_j = e_i, then the prepended token in the output is also the matching special event token [Ei]. Note that the model is not able to "cheat" using the [Ei] tokens to do the prediction, since the input events are scrambled by the shuffling denoising scheme described in §3.1. Compared to TemporalBART, the use of [Ei] here provides an extra clue for the model to associate input events with output events, which can benefit event ordering. It also provides a potential way to focus only on modeling the ordering of the target sequence, rather than also mixing in generation decisions, many of which copy event arguments and can therefore skew the prediction.1

1 We experiment with this method, denoted "TemporalBART-indexed (tags only)", in Appendix A.

Training details of these BART-based models are described in the Appendix.
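The input serialization for the two variants can be sketched as below. The exact tokenization details are simplified here; the helper names are hypothetical, and events are assumed to arrive as (predicate, arguments) tuples.

```python
def repr_event(predicate, args):
    """Textual representation Repr(e): the predicate concatenated with
    all of its arguments, kept as full phrases (not just headwords)."""
    return " ".join([predicate] + list(args))

def serialize(events, indexed=False):
    """Encoder input: '[E] ...' for TemporalBART, or '[E0] ... [E1] ...'
    for TemporalBART-indexed. events is a list of (predicate, args)."""
    pieces = []
    for i, (pred, args) in enumerate(events):
        tag = f"[E{i}]" if indexed else "[E]"
        pieces.append(f"{tag} {repr_event(pred, args)}")
    return " ".join(pieces)
```

The [Ei] tags are attached to the (shuffled) input order, so the decoder must still infer the temporal order rather than copying the indices.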

Training Data Collection
For our framework, the training data we need is event sequences in temporal order. Note that most text data occurs in discourse order, which is not the same thing: human annotations of temporal relation datasets like TimeBank (Pustejovsky et al., 2003b) show that many events mentioned earlier in the text occur later in time. Existing datasets of temporal relations (Cassidy et al., 2014;Vashishtha et al., 2019) are small-scale, and annotating more data is expensive and prone to low agreement (Ning et al., 2018b). To combat this issue, we instead try to automatically gather the training data we need.
Corpus We use the English-language EventsNarratives corpus (Yao and Huang, 2018), which contains more than 200,000 narrative-structured documents identified from three different source domains: news articles, novels, and blogs. Yao and Huang (2018) use a weakly supervised method to identify narrative texts, which describe a sequence of events in such a way that the discourse order is very likely to reflect the temporal order. This gives us an entry point to collect temporal event sequences automatically from each document. Here we focus on documents in the novel domain as our source for temporal event sequences.
Extracting Temporal Event Sequences To obtain the training event sequences, we first use an SRL model from AllenNLP (Gardner et al., 2017) to extract verbs (events) and their arguments. Then, temporal event sequences are constructed by connecting only events in different sentences, since the relations between events within the same sentence are unclear even in narrative documents. Here, to ensure all the events in a sequence have a strong relation with each other, we only include chains of events that are associated with a common entity (Chambers and Jurafsky, 2008), as determined by checking whether the arguments of two events share some non-stopword tokens.
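A simplified sketch of this chain extraction is shown below. It assumes SRL output is already available as (predicate, arguments) tuples per sentence; the greedy chaining strategy and the tiny stopword list are illustrative simplifications, not the paper's exact procedure.

```python
STOPWORDS = {"the", "a", "an", "to", "of", "in", "and", "it", "is"}

def share_entity(args_a, args_b):
    """The paper's heuristic: two events are linked if their arguments
    share at least one non-stopword token."""
    tokens_a = {t.lower() for arg in args_a for t in arg.split()} - STOPWORDS
    tokens_b = {t.lower() for arg in args_b for t in arg.split()} - STOPWORDS
    return bool(tokens_a & tokens_b)

def extract_chains(sent_events):
    """sent_events: per-sentence lists of (predicate, args) from SRL.
    Greedily connect events across *different* sentences that share a
    common entity; same-sentence pairs are never linked, since their
    relative temporal order is unclear even in narrative text."""
    chains = []
    for s_idx, events in enumerate(sent_events):
        for pred, args in events:
            attached = False
            for chain in chains:
                last_s, _, last_args = chain[-1]
                if last_s < s_idx and share_entity(last_args, args):
                    chain.append((s_idx, pred, args))
                    attached = True
                    break
            if not attached:
                chains.append([(s_idx, pred, args)])
    return [[(p, a) for _, p, a in c] for c in chains]
```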
With this procedure, we are able to collect nearly 2 million temporal event sequences to train on, with nearly 70% of the sequences consisting of three or more events.

Target Task Formulation
Here we describe the two target tasks of our model and how they can be handled as event-based conditional generation problems. A visual of the task formulations is shown in Figure 4.
Temporal Event Ordering Given an unordered set of events {e_i}, this task's goal is to produce the temporal ordering of {e_i}, as shown in Figure 4(a). We ask the model to generate an ordered sequence of events {e_{f(i)}} given the set {e_i}, where f(·) is a mapping function determining the event to put at position i. This is a conditional generation problem that is directly solved by our proposed models.

Event Infilling
The goal of event infilling is to generate inserted events at some pre-selected insertion positions in a seed event sequence (Wang et al., 2020). To simplify the evaluation, here we assume that given an event sequence x = {e_i}, models will only be required to generate one inserted event at one insertion position i*, as shown in Figure 4(b). We first feed {e_i} as the input to our model, then generate one event e* using x_prefix = {e_i | i < i*} as the decoding prefix. To force our models to produce e* ∉ x, we prevent them from generating the input predicates {v_{e_i}} during decoding.
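The constrained decoding step can be sketched as a greedy decode with banned tokens. The scorer below is a stand-in for the BART decoder (the real model operates over subword tokens and uses beam search); all names here are hypothetical.

```python
def infill_event(score_next, prefix, banned, vocab, max_len=10):
    """Greedily generate one event conditioned on a decoding prefix,
    banning the input events' predicates so the model must produce a
    *new* event. score_next(tokens) -> {token: score} stands in for
    the decoder's next-token distribution."""
    out = list(prefix)
    for _ in range(max_len):
        scores = score_next(out)
        # mask out banned predicates, then take the best remaining token
        tok = max((t for t in vocab if t not in banned),
                  key=lambda t: scores.get(t, float("-inf")))
        if tok == "<eos>":
            break
        out.append(tok)
    return out[len(prefix):]   # the newly generated event only
```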

Baselines: Temporal Event Ordering
We compare against two baselines: a state-of-the-art pairwise model used for the in-context temporal ordering task, and a pointer network model that directly models event sequence permutations discriminatively.
BERT-based Pairwise Model + SSVM We follow the architecture of the Deep SSVM model used in Han et al. (2019a) as our first baseline, which tackles event ordering as a pairwise classification problem. This network first exploits a BERT-based model (Devlin et al., 2019) to compute pairwise scores for e_i preceding e_j in the output y. The final output is then obtained by solving an ILP over all the pairwise scores. The overall network is trained with the structured SVM loss so that it learns to make joint predictions satisfying transitivity constraints. To make this baseline more comparable to our models, we take Repr(e_i) prepended with [E] as the event representation instead of using the sentence containing v_{e_i} as in Han et al. (2019a). Detailed formulas are in Appendix B. We denote this baseline as "Pairwise+SSVM" in the evaluations.

BERT-based Pointer Network
This network first follows the BERT-based Pairwise Model + SSVM to extract the vectorized representation U_{p_i} for each e_i, where U is the final BERT-encoded matrix and p_i is the position of the first token of e_i in the input sequence. These event representations are then fed into an LSTM-based pointer network that models the ordering probability by decomposing it in a sequential fashion:

P(y | x) = Π_t P(e_{f(t)} | h_t, U),

where h_t is the decoder hidden state in the pointer network at step t. Compared to the above pairwise baseline, this model has a stronger inductive bias for exploiting global event relations. We train the sequential model with teacher forcing to maximize the probability of the gold ordering. We denote this baseline as "BERT-based PN" in the evaluation section.
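The sequential factorization can be illustrated with a dot-product pointer score; this sketch omits details of the real model (e.g., masking already-selected events and the learned attention parameters) and uses plain Python lists for vectors.

```python
import math

def pointer_order_logprob(event_vecs, decoder_states, order):
    """Sketch of the pointer network factorization:
    log P(order | x) = sum_t log softmax(h_t . U)[order[t]],
    where event_vecs play the role of the BERT event representations
    U_{p_i} and decoder_states the LSTM decoder hidden states h_t."""
    total = 0.0
    for t, h_t in enumerate(decoder_states):
        logits = [sum(h * u for h, u in zip(h_t, u_i)) for u_i in event_vecs]
        log_z = math.log(sum(math.exp(l) for l in logits))
        total += logits[order[t]] - log_z
    return total
```

Training with teacher forcing amounts to maximizing this quantity for the gold ordering.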

Baselines: Event Infilling
HAQAE HAQAE (Weber et al., 2018b) is a vector quantized variational autoencoder which encodes schema knowledge with hierarchical latent variables. Since HAQAE is also an event-level seq2seq autoencoder, we can easily apply it to our setting. During training we follow Weber et al. (2018b) except that we use our narrative event sequences for training and represent each event with the predicate-argument format described in §3.2 so it is more comparable to our BART-based models.
GPT-2 GPT-2 (Radford et al., 2019) is a transformer-based pretrained language model that has been exploited in various generation tasks like story generation (Dathathri et al., 2020; Rashkin et al., 2020). However, one issue with the GPT-2 model is that it can only perform uni-directional generation. To apply GPT-2 to generate an inserted event e*, we first concatenate {Repr(e_i) | e_i ∈ x_prefix} with periods in between, and treat it as the decoding prefix. We then decode until another period is generated, and take the model's output as the text representation of e*. Except where otherwise specified, we use the GPT2-medium pretrained model from HuggingFace's Transformers (Wolf et al., 2020), whose model size is comparable to BART-large.
Infilling GPT-2 To build a stronger GPT-2 baseline that doesn't only condition on the prefix events, we follow the baselines from Qin et al. (2020) to adapt GPT-2 to infilling tasks. Infilling GPT-2 generates the infilling events by "wrapping" the events after the insertion position to the front. That is, the decoding prefix fed to the infilling GPT-2 becomes the concatenation of {Repr(e i ) | i >= i * }, <SEP> and {Repr(e i ) | i < i * }, again with a period appended after each event. The special token <SEP> is used to help the model to differentiate the events before and after the insertion position.
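The wrapped-prefix construction for infilling GPT-2 can be sketched as follows, with events already rendered as Repr(e) strings; the helper name is hypothetical.

```python
def infilling_gpt2_prefix(events, insert_pos):
    """Build the decoding prefix for infilling GPT-2: events at and
    after the insertion position are wrapped to the front, followed by
    <SEP>, then the events before it. A period follows each event."""
    after = " ".join(f"{e}." for e in events[insert_pos:])
    before = " ".join(f"{e}." for e in events[:insert_pos])
    return f"{after} <SEP> {before}".strip()
```

Decoding then continues from this prefix until a period is produced, as with the plain GPT-2 baseline.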

Experimental Setup
All the models used in the evaluation are trained with the temporal event sequences automatically collected from EventsNarratives, except GPT-2, since we want to compare the knowledge already learned by GPT-2 with that of our proposed models. Although we are able to gather millions of sequences, for efficiency, we train on 100,000 sequences unless specified otherwise. For each sequence, we extract 2 distinct permutations from the corruption process. This results in 200,000 training examples in total.
During evaluation, all the models are evaluated on out-of-domain datasets in a zero-shot way, i.e., no fine-tuning is performed on the evaluation sets.

Datasets
We use two out-of-domain English datasets to extract the test temporal event sequences: CaTeRS and MCTaco.

To extract the evaluation data from CaTeRS, we first apply the SRL model used in §3.3 on each story. Then, a directed acyclic graph is constructed with a node being an event e whose predicate v_e can be captured by the SRL model, and an edge (e_i, e_j) indicating e_i happens temporally before e_j. Note that here we treat all types of annotated relations except "IDENTITY", "DURING" and "CAUSE_TO_END" as "BEFORE", as suggested in Mostafazadeh et al. (2016b).

MCTaco (Zhou et al., 2019) is a question answering dataset for temporal commonsense. To extract suitable test data, we focus on questions with the reasoning type of "event ordering" and their positive candidates. Each data point here consists of a sentence describing multiple events {e_c_i}, a question asking what event could happen temporally before/after a particular event e_q ∈ {e_c_i}, and a candidate event e_a. Critically, the question itself tells us whether e_a should happen before/after e_q in the temporal event sequence formed by {e_c_i} ∪ {e_a}. With this annotation, we evaluate our models by first feeding the randomly shuffled {e_c_i} ∪ {e_a} into a model, then checking the ordering between e_a and e_q in the output sequence. Here, we were able to extract 585 test sequences from MCTaco. For each sequence, {e_c_i} and e_a are extracted with the SRL model used in §3.3. For the question, we first use a set of pre-defined regex templates to extract an event e_q and a temporal relation ("before" / "after").
We then match e_q to one of the e_c_i by ROUGE-L score. See Figure 5 for an example of the extracted data: a context about drug-financed guerrillas in Colombia, a question ("What would the guerrillas do if able to seize the country?"), the extracted event sequence e1–e7, and the gold label (e_a AFTER e_q).
Compared to CaTeRS, since the sentences here are from 9 different domains in MultiRC (Khashabi et al., 2018), the types of events are more diverse. The event arguments are also more complex.

Results on CaTeRS
We first examine the temporal ordering results on CaTeRS, shown in Table 1. We compute the pairwise accuracy of the predicted event sequences: how many pairs of events in the output are ordered correctly by a model. Note that the BART-based models can deviate from generating permutations of the input; however, we found that the most probable generated sequences were almost exact permutations of the input or were easily aligned to the input using a heuristic.
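The pairwise accuracy metric can be computed as below; this is a straightforward sketch assuming the predicted sequence has already been aligned to the input events.

```python
from itertools import combinations

def pairwise_accuracy(gold, pred):
    """Fraction of event pairs whose relative order in pred matches
    gold. Events are assumed hashable and present in both sequences."""
    gold_pos = {e: i for i, e in enumerate(gold)}
    pred_pos = {e: i for i, e in enumerate(pred)}
    pairs = list(combinations(gold, 2))
    correct = sum(
        (gold_pos[a] < gold_pos[b]) == (pred_pos[a] < pred_pos[b])
        for a, b in pairs
    )
    return correct / len(pairs)
```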
Our BART-based models outperform the BERT-based pointer network by more than 20 points, a huge margin. One possible reason is that the decoder of BART can condition on the token-level embeddings of the events when generating the output events, whereas in the pointer network, the decoder is only aware of the condensed event embeddings U_{p_i}. Our two BART-based models also outperform the BERT-based pairwise model, both on all sequences and on long sequences.

Results on MCTaco
Results on MCTaco are shown in Table 2. Here, since we only know the gold temporal relation of one pair of events in the input, i.e., e_q and e_a, we compute the average accuracy of predicting the order of e_q and e_a. In addition, since the ratio of before/after questions is significantly unbalanced in MCTaco, with 90% asking about the "after" relationship, we also compute the macro F1 score as our metric (averaging F1 across these two classes). Our two baselines perform worse than just picking the majority label. This is possibly due to the high diversity of events in MCTaco, which makes it much harder to apply a zero-shot model. In contrast, TemporalBART achieves an F1 score about 3 points higher than the Pairwise+SSVM baseline, and TemporalBART-indexed performs best of all.
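The macro F1 computation over the two relation classes can be sketched as follows; this is standard macro-averaging, shown here to make the metric on this unbalanced label distribution concrete.

```python
def macro_f1(gold, pred, labels=("before", "after")):
    """Macro F1 over the before/after classes: per-class F1 is
    averaged so the dominant 'after' class cannot swamp the score."""
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

A majority-class predictor that always outputs "after" gets F1 of 0 on "before", which is exactly what the macro average penalizes.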
In Appendix E, we also show that our models are able to learn temporal phenomena not explicitly annotated in our training data, which is another demonstration of our model's ability to generalize.

Ordering Unseen Events
We evaluate our BART-based models on an additional variant of this ordering problem that better tests their capability as generative models. Recall that previously, BART conditions on the complete (but possibly scrambled) sequence of events. We now consider ordering an event in the decoder that the model does not condition on in the encoder.

Table 3: Comparison of the ability to tackle unseen events between our BART-based models and baselines on CaTeRS. The right columns are computed on test sequences of 3 or more events.
Concretely, for each temporal event sequence in CaTeRS, we randomly select one event e*, and treat the rest of the sequence as the seed input event sequence {e_1, · · · , e_N}. Then we check if a model can correctly determine where to insert e* into the input sequence. Specifically, for both the BART-based models and the GPT-2 baselines, we use the generation probability to rank event sequences {e_1, · · · , e_{i*−1}, e*, e_{i*}, · · · , e_N} for i* between 1 and N + 1 (all possible locations). If a model correctly ranks the gold candidate higher, it indicates that it can model temporal relations between seen events and new unseen events it may generate. The results are shown in Table 3, where we compute the top-1 and top-2 exact match (EM): did the model rank the gold sequence 1st or 2nd highest? Our GPT-2 variants are only slightly better than random. HAQAE, also using an autoencoder framework, performs worse than infilling GPT-2, likely due to the lack of large-scale pretraining and the loss of information when compressing the input into latent variables. Our BART-based models are significantly better, with TemporalBART-indexed showing the benefit of using indexed event markers to help the model capture order. We also perform an ablation of deletion during training (Figure 2). Unsurprisingly for this unseen event evaluation, not deleting events in training (setting p to 0) causes a major drop of 14 EM points. Deletion denoising is evidently critical to modeling new events.
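The ranking procedure can be sketched as follows. The scoring function is a stand-in for a model's sequence log-probability, so the helper names are hypothetical; 0-based insertion positions are used for simplicity.

```python
def rank_insertions(score_seq, seed_events, new_event):
    """Enumerate every way of inserting new_event into seed_events and
    rank the candidate sequences by model score (highest first).
    score_seq(sequence) stands in for the conditional generation
    log-probability assigned by the model."""
    candidates = [
        (i, seed_events[:i] + [new_event] + seed_events[i:])
        for i in range(len(seed_events) + 1)
    ]
    return sorted(candidates, key=lambda c: score_seq(c[1]), reverse=True)

def top_k_em(ranked, gold_pos, k):
    """Exact match at k: is the gold insertion position in the top-k?"""
    return any(pos == gold_pos for pos, _ in ranked[:k])
```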

Event Infilling
Now we turn to temporal event infilling: given a CaTeRS sequence, we remove a random event at index i* and denote the resulting sequence {e_1, · · · , e_N}. We then ask a model to generate one event e* at position i* so that {e_1, · · · , e_{i*−1}, e*, e_{i*}, · · · , e_N} is temporally ordered with the new event.

Table 4: Human evaluation of event infilling (0-2 scale). Data are event sequences from CaTeRS. All models fill in coherent events, but our BART-based output is more temporally ordered with respect to the input events.
We evaluate the quality of the generated (inserted) events by human evaluation on Amazon Mechanical Turk. Specifically, we randomly sample 30 examples from CaTeRS and have 5 raters judge the coherence and temporality (on a scale from 0 to 2) of the inserted event from each model. See Figure 8 in Appendix for our exact prompt. The final scores for each model on coherence and temporality are computed by taking the average of the majority rating on each prediction. Here we only include GPT-2 models as baselines since HAQAE is also using the autoencoder framework, and already performs significantly worse in §5.3.
The results of this evaluation are shown in Table 4. All models achieve reasonable coherence scores. However, in terms of temporality, GPT-2 performs worst, as expected, since it can only condition on partial input event sequences while the other three consider the whole event sequence as input. Both of the BART-based models achieve better performance than infilling GPT-2. The improvements on the temporal score are significant with p < 0.05 according to bootstrap resampling for both TemporalBART models with respect to infilling GPT-2. Figure 6 gives examples of infilled events generated by GPT-2, infilling GPT-2, and TemporalBART. On this specific test example, GPT-2 generates an event generally about the Apple watch, which is less relevant to the input scenario about Mike making a tree. The event generated by infilling GPT-2 is coherent with the scenario, but doesn't occur in the correct order with respect to the input events. The event generated by TemporalBART is the best in terms of coherence and temporality. More examples are in Table 7 of the Appendix. Figure 7 shows that the performance of both our models on the CaTeRS ordering task improves when increasing the amount of narrative training data. This demonstrates that the automatically extracted temporal event sequences are useful and diverse enough to help the models learn temporal knowledge. The TemporalBART-indexed model is effective on surprisingly small amounts of data, but also scales well with data size; however, we observe a plateau in both models, which motivated our decision to use 100k training sequences.

The Effectiveness of Narrative Data
For comparison, we train our TemporalBART-indexed model on 1,266 event sequences gathered from the MATRES dataset, a human-labeled dataset for temporal relation extraction, using the same procedure we applied to CaTeRS. However, Figure 7 shows that the resulting performance, 65.6 on MATRES, is significantly lower than the best number we obtain with narrative data. Even with the same size training set, using narrative data achieves over 7 points of improvement over using MATRES. This suggests that a small-scale human-labeled dataset is not enough for models to learn generalized temporal knowledge, and that even with the same amount of data, narrative data may be a better source for general temporal knowledge.

Conclusion
This work presents a BART-based conditional generation model and a denoising autoencoder framework to learn temporal event knowledge, addressing both temporal ordering and event infilling tasks by pretraining on automatically collected data. Our experiments demonstrate that our model can perform temporal ordering and infilling in a zero-shot manner, without fine-tuning on our target datasets, which suggests that it can also be applied to other settings requiring schematic and temporal event knowledge.

Acknowledgments
Thanks to Mahnaz Koupaee from Stony Brook University for providing directions on our HAQAE baseline and to the members of the UT TAUR lab for helpful discussion, particularly Yasumasa Onoe and Jiacheng Xu for suggestions on the human evaluation. Thanks as well to the anonymous reviewers for their comments. This work is based on research that is in part supported by the Air Force Research Laboratory (AFRL), DARPA, for the KAIROS program under agreement number FA8750-19-2-1003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory (AFRL), DARPA, or the U.S. Government.

A Scoring Orderings with TemporalBART-indexed (tags only)
TemporalBART-indexed (tags only) scores whether an output sequence y is temporally ordered by gathering the generation scores on the special tokens [Ei] only as its final ordering score:

score(y) = \sum_{t \in I} \log P(w_t^y \mid w_{<t}^y, x)    (2)

where {w_t^y} is the text representation of y and I is the set of positions of the special tokens [Ei] in {w_t^y}. This allows us to make a judgment depending only on the predicted temporal order of the events rather than mixing in general token order. In contrast, TemporalBART scores a sequence of events y with the generation probability of the entire text representation of y:

score(y) = \sum_{t} \log P(w_t^y \mid w_{<t}^y, x)    (3)

Since many of the generation decisions here are copying event arguments, the prediction could be largely affected by the correlation of tokens within each argument.
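The difference between Eq. (2) and Eq. (3) can be sketched with a small helper: given the output tokens and their per-position log probabilities from the decoder, the tags-only score sums only the positions holding an [Ei] tag, while the full score sums every position. The function and variable names below are illustrative, not from the released code:

```python
def ordering_score(tokens, log_probs, tags_only=True):
    """Score a candidate event ordering from decoder log probabilities.

    tokens:    the text representation {w_t^y} of the candidate sequence y
    log_probs: log P(w_t^y | w_{<t}^y, x) for each position t
    If tags_only is True, only positions holding an event tag [Ei]
    contribute (Eq. 2); otherwise every token contributes (Eq. 3).
    """
    assert len(tokens) == len(log_probs)
    if tags_only:
        positions = [t for t, tok in enumerate(tokens)
                     if tok.startswith("[E") and tok.endswith("]")]
    else:
        positions = range(len(tokens))
    return sum(log_probs[t] for t in positions)

# Toy example: two event tags with copied argument tokens in between.
tokens = ["[E0]", "he", "buy", "ticket", "[E1]", "he", "watch", "movie"]
log_probs = [-0.1, -2.0, -0.5, -1.5, -0.2, -1.8, -0.4, -1.2]

tags_score = ordering_score(tokens, log_probs, tags_only=True)   # only [E0], [E1]
full_score = ordering_score(tokens, log_probs, tags_only=False)  # all tokens
```

In this toy case the argument tokens carry most of the probability mass, so the full-sequence score is dominated by copying decisions rather than by the predicted event order, which is exactly what the tags-only variant avoids.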

Table 5: Results of TemporalBART-indexed and its tags-only scoring variant on temporal event ordering.

Architecture                         Acc.   Macro F1
TemporalBART-indexed                 74.9   55.1
TemporalBART-indexed (tags only)     76.6   56.4

We evaluate "TemporalBART-indexed (tags only)" on temporal event ordering with the procedure used for the models in Table 2. Table 5 shows that this tags-only variant further boosts the performance of TemporalBART-indexed by 1.3 points of macro F1. This result verifies that this setting helps prevent the ordering scores from being overly affected by the text generation probabilities, which is particularly important for MCTaco, where the arguments of events are more complex.

B Architecture of BERT-based Pairwise Model + SSVM

This network uses a BERT-based model (Devlin et al., 2019) to obtain a vectorized representation for each input event e_i in x. As with the BART-based models, the input to the BERT model is the concatenation of Repr(e_i), with [E] prepended in front of each event. The vectorized representation for e_i is then extracted as U_{p_i}, where U is the final BERT-encoded matrix and p_i is the position of the first token of e_i in the input sequence. Each pair of event representations, U_{p_i} and U_{p_j}, is then fed into a feed-forward function g to compute a score B_{ij} for e_i preceding e_j in the output y:

B_{ij} = g([U_{p_i}; U_{p_j}])

Finally, the final output y is computed by finding the best permutation over all of the pairwise scores by solving an ILP.
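For the 3-to-5-event sequences considered here, the ILP decoding step can be illustrated by brute force: given a matrix B of pairwise scores, where B[i][j] scores event i preceding event j, we search over permutations for the ordering that maximizes the sum of the pairwise scores it implies. This is only a sketch under that small-n assumption; the actual model solves an ILP, which scales far better than enumeration:

```python
from itertools import permutations

def best_ordering(B):
    """Return the permutation of events maximizing the total pairwise score.

    B[i][j] is the model-predicted score for event i preceding event j.
    A permutation's score is the sum of B[i][j] over all pairs (i, j)
    where i is placed before j. Brute force stands in for the ILP solver.
    """
    n = len(B)

    def score(order):
        return sum(B[order[a]][order[b]]
                   for a in range(n) for b in range(a + 1, n))

    return max(permutations(range(n)), key=score)

# Toy pairwise scores that strongly prefer the order 2 -> 0 -> 1.
B = [[0.0, 2.0, -1.0],
     [-2.0, 0.0, -3.0],
     [1.0, 3.0, 0.0]]
best_ordering(B)  # -> (2, 0, 1)
```

The ILP formulation used in practice encodes the same objective with binary precedence variables plus transitivity constraints, so enumeration and the ILP agree on small inputs.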

C Training Details of BART-based Models
We train our BART-based conditional generation models to minimize the negative log likelihood of reconstructing the original event sequence. We set the learning rate to 1e-5 and use polynomial decay scheduling with 500 warm-up steps. Our implementation is based on the HuggingFace Transformers library (Wolf et al., 2020). During evaluation for temporal event ordering, we decode the output event sequences using beam search with a beam size of 4. For the event infilling task, we use nucleus sampling with p set to 0.8.

Figure 8 shows the prompt for the human evaluation described in §5.4, where we ask the MTurk raters to evaluate the coherence and temporality of the generation outputs. To help the raters ignore grammatical issues when making decisions, we first ask them to check grammaticality, then separately judge coherence and temporality.
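The nucleus sampling used for event infilling can be illustrated independently of the model: at each decoding step, the next-token distribution is truncated to the smallest set of highest-probability tokens whose cumulative probability reaches p, then renormalized before sampling. A minimal sketch over a toy distribution (the tokens below are illustrative, not the BART vocabulary):

```python
def nucleus_filter(probs, p=0.8):
    """Truncate a next-token distribution to its top-p nucleus.

    Keeps the smallest set of highest-probability tokens whose cumulative
    probability is >= p, then renormalizes. Returns {token: prob}.
    """
    kept, total = {}, 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    return {tok: pr / total for tok, pr in kept.items()}

# Toy next-token distribution. With p = 0.8, the nucleus keeps
# "walk", "run", and "fly" (cumulative probability 0.875 >= 0.8),
# and the low-probability tail ("teleport") is never sampled.
dist = {"walk": 0.5, "run": 0.25, "fly": 0.125, "teleport": 0.125}
filtered = nucleus_filter(dist, p=0.8)
```

In the actual system this filtering happens inside the decoder's sampling loop at every step; the sketch isolates the single-step truncation so the effect of p is easy to see.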

E Learning Timex Knowledge
The temporal ordering and event infilling tasks correspond to information that we might expect to be encoded by our model pre-training. To test whether our models generalize to slightly more distant temporal phenomena, we examine whether they are able to capture the temporal relationships between timexes. This knowledge has been shown to be hard to learn in temporal relation extraction models (Goyal and Durrett, 2019).

Table 6: The results of temporal event ordering on events anchored with various types of timex. The test data used here are length-3 sequences artificially constructed with "die" events for the "Year" timex, and 3 typical daily events as shown in Figure 9 for the other types of timex. The timexes of type "Year" are randomly sampled from 1000 to 2100. Our BART-based models significantly outperform the GPT-2 and random baselines, showing that they can capture useful timex-related knowledge.

Figure 8: A screenshot of the prompt for the human evaluation described in §5.4, which includes the 3 questions the raters are asked in judging the event infilling outputs from each model. The input events are highlighted in green, and the inserted events in blue.

E.1 Evaluation Setup
The timexes we examine here include years, months, weekdays, 24-hour clock time in "hour:minute" format, and 12-hour clock time in "hour:minute am/pm" format. We evaluate the ability of our models to order events that are anchored with a timex in their arguments. To prepare the test input event sequences for a given type of timex, we first artificially construct a template event sequence of 3 typical daily events that have no temporal order relations. We then randomly sample 3 different timexes, e.g., "June", "May", "July" for "Month", and append each of them to the events in the template sequence with proper prepositions. In the end, 100 examples are created with this process for each type of timex.
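The construction above can be sketched for the "Month" timex type as follows. The template events and prepositions here are illustrative stand-ins for those in Figure 9, and the sampling routine is a hypothetical reconstruction rather than the released generation code:

```python
import random

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# An illustrative 3-event template with no inherent temporal order.
TEMPLATE = ["he eat a sandwich", "she read a book", "they walk the dog"]

def make_month_example(rng):
    """Build one ordering test example anchored with month timexes.

    Samples 3 distinct months, sorts them into chronological order,
    and appends one to each template event with a preposition. The
    returned list is the gold (temporally ordered) event sequence;
    shuffling it yields the model's ordering input.
    """
    months = rng.sample(MONTHS, 3)
    gold = sorted(months, key=MONTHS.index)  # chronological month order
    return [f"{event} in {month}" for event, month in zip(TEMPLATE, gold)]

rng = random.Random(0)
examples = [make_month_example(rng) for _ in range(100)]
```

The other timex types follow the same recipe with different sampling spaces (e.g., years drawn from 1000 to 2100) and a type-appropriate comparison function for the gold order.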
More concrete examples are shown in Figure 9. For the baselines, here we use GPT-2 models to perform the ordering, using the generation probability to rank all permutations of the input events.

E.2 Results
The results are shown in Table 6. First, we examine the results of the GPT-2 models. In general, both the unsupervised GPT-2 (the medium model) and GPT-2 large perform worse than the random baseline, indicating that they have a limited ability to order timexes. Our BART-based models achieve stronger results, with the strongest performance on years. For 12-hour clock time, even though the model has to make a challenging link between the temporal knowledge of "am" and "pm" and numerical comparisons, both of the BART-based models still perform significantly better than the random baseline.

F Examples for Event Infilling
In Table 7, we demonstrate more examples of the infilled events generated by GPT-2, infilling GPT-2, and TemporalBART given the seed event sequences from CaTeRS. In general, while the events output by TemporalBART are coherent and temporally sensible, those from the GPT-2 models are of lower quality in terms of temporality. Note that the nature of the event representation does not necessarily guarantee a grammatical sentence when the event is rendered in surface order.

Table 7: More examples of the infilled events generated by GPT-2, infilling GPT-2, and TemporalBART respectively. Scenarios are the temporally-ordered input events fed into the models, with the events separated by periods and the insertion position specified by [INSERTED EVENT]. The rows under each scenario show the infilled event generated by each model.

GPT-2: A federal judge dismissed a lawsuit Wednesday from the family of an elderly man who died after inhaling a toxic gas during an air show at the Lincoln Memorial.
Infilling GPT-2: He put the card in his wallet.
TemporalBART: He pulled out a small gift card from his wallet.

#3
Scenario: Kim hunkered down at home as a hurricane blew through town. [INSERTED EVENT] She went to sleep hoping all would be well. She hoping all would be well.
GPT-2: It wasn't until his brother and mother returned from vacation that he found out that a hurricane had struck, bringing the death toll from Hurricane Andrew in 2012 up to 24 and leaving many people without electricity for weeks.
TemporalBART: Kim turned the TV to catch the latest news.

#4
Scenario: Tony needed to buy his grandma a birthday present. He went to her favorite bakery. The owner told Tony how to make it himself. [INSERTED EVENT] His elated grandma couldn't tell the difference.
GPT-2: Tony got mad and left the bakery.
Infilling GPT-2: Tony went to the grocery store.
TemporalBART: make a loaf that looked like grandma's.

#5
Scenario: [INSERTED EVENT] He decided to use the batteries in his fire detector. He use the batteries in his fire detector.
GPT-2: I'm an independent developer who's worked for both big and small companies.
Infilling GPT-2: He find a place to charge the fire detector batteries.
TemporalBART: He see the batteries in his alarm clock were dead.