ClarET: Pre-training a Correlation-Aware Context-To-Event Transformer for Event-Centric Generation and Classification

Generating new events given a context of correlated ones plays a crucial role in many event-centric reasoning tasks. Existing works either limit their scope to specific scenarios or overlook event-level correlations. In this paper, we propose to pre-train a general Correlation-aware context-to-Event Transformer (ClarET) for event-centric reasoning. To achieve this, we propose three novel event-centric objectives, i.e., whole event recovering, contrastive event-correlation encoding and prompt-based event locating, which highlight event-level correlations with effective training. The proposed ClarET is applicable to a wide range of event-centric reasoning scenarios, considering its versatility of (i) event-correlation types (e.g., causal, temporal, contrast), (ii) application formulations (i.e., generation and classification), and (iii) reasoning types (e.g., abductive, counterfactual and ending reasoning). Empirical fine-tuning results, as well as zero- and few-shot learning, on 9 benchmarks (5 generation and 4 classification tasks covering 4 reasoning types with diverse event correlations), verify its effectiveness and generalization ability.


Introduction
An 'event', usually a text span composed of a predicate and its arguments (Zhang et al., 2020b), is a fine-grained semantic unit that describes the state of entities/things (e.g., He looks very worried) and how they act (e.g., I grab his arms). Understanding events and modeling their correlations are fundamental to many reasoning tasks, e.g., abductive reasoning, story ending classification and generation, counterfactual reasoning, and script reasoning. For instance, in the left example of Figure 1, to generate the missing event [E] in the given context, it is essential to understand that there are four events ('it tries the knob', [E], 'the creature starts pounding on the door', and '(the creature) to break it down'), and then predict [E] based on the other three events and its correlations to them (i.e., the contrast relation indicated by 'but' and the causal relation by 'so').
Event-aware reasoning has gained much attention and achieved promising success in recent years (Lv et al., 2020; Ding et al., 2019). However, many algorithms are designed to solve only some specific tasks. For example, some works propose to improve unsupervised decoding for counterfactual and abductive reasoning; Huang et al. (2021) and Guan et al. (2019) advance story ending generation via incremental encoding and multi-level graph convolutional networks. Although these works show effectiveness in their corresponding applications, they are limited to specific scenarios and cannot generalize well to a broad scope of reasoning.
Meanwhile, some pioneering works follow a recently arising paradigm to conduct event-based pre-training for those downstream reasoning tasks (Yu et al., 2020; Han et al., 2020a; Lin et al., 2020). However, these solutions have their own limitations: COMeT (Hwang et al., 2021) learns event correlations from a human-curated knowledge graph and thus limits its scalability. Han et al. (2020a) and Lin et al. (2020) only model temporal relations and cannot be expanded to other relations (e.g., causal, contrast). EventBERT is proposed for event-based classifications and is thus inapplicable to generation tasks.
In this work, we propose a general pre-training framework for event-centric reasoning by learning a Correlation-aware context-to-Event Transformer (ClarET) from an event-rich text corpus. We propose three novel self-supervised objectives, dubbed as whole event recovering (WER), contrastive event-correlation encoding and prompt-based event locating, respectively. The first one aims to capture event correlation by recovering a whole event from its masked context. The second one enhances the representation of the masked event in WER by contrasting it with the gold event against the negative ones. The last one is a simplified WER task by providing hints in its prompt and thus facilitates effective learning for WER.
ClarET explicitly models event correlations and contributes to various scenarios. From one aspect, it covers a variety of correlation types (e.g., causal, temporal, contrast) attributed to its correlation-type-agnostic objectives. From another aspect, it is applicable to both generation and classification task formulations by its unified structure. Lastly, it highlights event-level correlations and thus is more effective for diverse event-centric tasks, e.g., abductive, counterfactual and ending reasoning.
To evaluate ClarET, we compare it with strong baselines on 9 diverse benchmarks. While ClarET is continually pre-trained from BART with very limited extra resources, i.e., training on a small subset of the BART-used corpus (i.e., 200M out of 2.2T tokens) within 90 GPU hours (only 0.13% of the 70,000h BART pre-training), it achieves state-of-the-art (SoTA) performance on all 5 generation benchmarks. It also outperforms all unified models on 4 classification benchmarks and achieves competitive, or even better, accuracy than strong discriminative baselines. We further show that ClarET provides a good initialization for downstream tasks via zero- and few-shot learning.

Related Work
Unified Pre-trained Model. A recent trend is to pre-train unified (a.k.a. universal or general) models to boost downstream generation and classification tasks, rather than masked language modeling (MLM) only. GPT is based on auto-regressive language modeling but incompetent in classification due to unidirectional contextualizing. To remedy this, BART trains seq2seq models as a text denoising autoencoder with mask-infilling, etc.; UniLM (Dong et al., 2019) designs advanced self-attention masks in Transformer, leading to a partially auto-regressive MLM; another approach proposes an auto-regressive blank-filling objective based on Transformer, achieved by bi-/uni-directional attention and 2D positional encoding. T5 pre-trains a text-to-text Transformer to recover the masked part of the input by decoding. All these general-purpose pre-trained models focus on relatively short spans masked at random, whereas we focus on masking a whole semantic unit (i.e., event) and propose novel training objectives to circumvent problems in long-span event decoding. Besides, they are also vulnerable to pretrain-finetune inconsistency, leading to inferior event-centric performance.
Event-centric Pre-training. With similar scopes, many works focus on event-centric pre-training to promote event-related tasks, as an 'event' is a self-contained semantic unit and also an entry point to commonsense reasoning. One paradigm is to pre-train on corpora without human labeling. Some methods focus on more specific aspects of events and their correlations. DEER (Han et al., 2020b) performs temporal and event masking predictions for temporal relations. Lin et al. (2021) propose to recover a temporally-disordered or event-missing sequence for temporal and causal relations. Wang et al. (2021) use AMR structure to design contrastive objectives for the event detection task. However, they are not general enough for various event reasoning tasks. In contrast, CoCoLM (Yu et al., 2020) learns an event-level MLM to generalize more. EventBERT states the ineffectiveness of event-level MLM and exploits hard negatives via contrasting, contributing much to downstream multi-choice tasks. However, these methods are only competent in discriminative tasks. The other paradigm is based on supervised pre-training on similar tasks and then performs knowledge transfer, e.g., COMeT (Hwang et al., 2021), UnifiedQA (Khashabi et al., 2020) and UNICORN (Lourie et al., 2021), but they require human-curated data.
Event-rich Corpus. Although raw corpora are viewed as off-the-shelf pre-training resources, a key question is how to mine event-rich examples. Here, 'event-rich' denotes that each example contains various events and entails adequate contexts to support event reasoning via either explicit or implicit event correlation. This is crucial to learning event correlations and reducing unnecessary overheads. Besides human-curated resources (e.g., ATOMIC (Sap et al., 2019) and ConceptNet (Speer et al., 2017)), event-rich corpora are also mined via automatic schemes. ASER (Zhang et al., 2020b) builds an event-based graph, where each node is an event extracted from a text and the relation of an event pair is predicted by a PDTB model. In contrast, EventBERT operates on pure text, so it filters out correlation-scarce contexts and extracts verb-rooted events. Besides, it offers event sampling methods for hard negatives. We adopt this data processing method as both pure-text examples and hard negatives are prerequisites of generic and robust pre-training.

Prerequisite: Event-rich Corpus
In this work, we directly adopt the event-rich data mining and negative sampling methods from EventBERT but focus our contributions on enlarging the application scope of event-centric tasks and overcoming the challenges raised in the new scope.
Event-rich Data Mining. To mine event-rich data from a raw corpus, we employ a story corpus, BOOKCORPUS (Zhu et al., 2015), and take a two-step procedure (i.e., 'filter' and 'extraction'). It filters out correlation-scarce paragraphs according to the existence of connectives (i.e., discourse relation keywords, e.g., however, while). Then, it highlights the event spans in the filtered paragraphs by extracting verb-rooted sub-trees in the dependency trees of the paragraphs. With a filtered paragraph x, we build each example as (x, e), where e is an event mention in x. We obtain 200M tokens (out of 1B in BOOKCORPUS) in 3.9M filtered paragraphs. For clear notation, we denote a text piece as a lower-case letter (e.g., e). It is tokenized into a sequence denoted in bold (e.g., e = [e_1, e_2, ...]), where a letter with subscript t is the t-th token in the sequence.
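The 'filter' step above can be sketched as a keyword check over a paragraph; a minimal, illustrative sketch (the connective list here is a small hypothetical subset, not the actual PDTB-derived keyword set):

```python
import re

# Hypothetical subset of discourse connectives; the real pipeline uses a
# larger PDTB-derived keyword list.
CONNECTIVES = {"however", "while", "but", "so", "then", "because", "although"}

def is_event_rich(paragraph: str, min_connectives: int = 1) -> bool:
    """Keep a paragraph only if it contains enough discourse connectives,
    a rough proxy for explicit event correlations (the 'filter' step)."""
    tokens = re.findall(r"[a-z']+", paragraph.lower())
    return sum(tok in CONNECTIVES for tok in tokens) >= min_connectives

print(is_event_rich("He tried the knob, but it was locked, so he pounded on the door."))  # True
print(is_event_rich("A plain descriptive sentence with no discourse cues."))  # False
```

The subsequent 'extraction' step would then run a dependency parser over kept paragraphs to pull out verb-rooted sub-trees as events.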
Negative Event Sampling. Following EventBERT, we build a pool of events from the whole corpus and then retrieve negative events by three heuristic schemes. Given an event e in (x, e), we sample its negative event, ē, via lexicon-based (20% of the time), PoS-based (60%) or in-domain (20%) retrieval. Consequently, given an event e, we sample M negative events, i.e., {ē_i}^M_{i=1}. Figure 1 (right) shows an integrated instance (x, e, {ē_i}^M_{i=1}) of the event-rich corpus.
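The stated 20%/60%/20% mixture over retrieval schemes can be sketched with a weighted draw (a toy illustration; the scheme names are shorthand, not the paper's code):

```python
import random

def sample_negative_scheme(rng: random.Random) -> str:
    """Pick a retrieval scheme with the stated mixture:
    lexicon-based 20%, PoS-based 60%, in-domain 20%."""
    return rng.choices(
        ["lexicon", "pos", "in-domain"], weights=[0.2, 0.6, 0.2], k=1
    )[0]

rng = random.Random(0)
counts = {"lexicon": 0, "pos": 0, "in-domain": 0}
for _ in range(10_000):
    counts[sample_negative_scheme(rng)] += 1
print(counts)  # roughly 2000 / 6000 / 2000
```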

Pre-training Objectives
We first present whole event recovering as the backbone pre-training objective in §3.2.1. After identifying the incompetence of this simple backbone, we propose two other objectives in §3.2.2 and §3.2.3. An overview of the objectives is shown in Figure 2.

Whole Event Recovering
For the objective of whole event recovering (WER), it is straightforward to leverage an encoder-decoder structure, where a masked context is passed into the encoder and the missing part is generated by decoding. Specifically, given an event e in a paragraph x, we mask out e from x at the encoder side and then generate e at the decoder side, i.e.,

p(e | x/{e}; θ),   (1)

where θ denotes the parameters and x/{e} denotes replacing e in x with one special token [M]. We estimate Eq. (1) by the Transformer sequence-to-sequence (seq2seq) structure. First, we apply the Transformer encoder to x/{e} to obtain contextual embeddings for all tokens in x/{e}:

H = Trans-Enc(x/{e}; θ^(enc)) ∈ R^(n×d),   (2)

where n is the number of tokens in x/{e} and d is the hidden size. Then, the Transformer decoder is employed to predict all tokens e of the event e in a recurrent manner, i.e.,

ỹ_t = Trans-Dec(e_<t, H; θ^(dec)),   (3)

where V denotes the token vocabulary and ỹ_t is the predicted categorical distribution over V. Lastly, the training objective is defined as maximum likelihood estimation. Its loss function is written as

L^(wer) = − Σ_t log ỹ_t[y = e_t],   (4)

where ỹ_t[y = e_t] denotes fetching the probability of the t-th gold token e_t ∈ e from ỹ_t. This objective is similar to the span recovering schema (e.g., Joshi et al., 2020) but differs in that (i) each masked span is an event, i.e., an integrated semantic unit, and thus much longer (up to 22 tokens; see Figure 4 for the length distribution), and (ii) only one event is masked out from the context to facilitate event-correlation modeling between the event and its context.
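The WER loss, i.e., the negative log-likelihood of each gold event token under the decoder's step-wise distribution ỹ_t, can be sketched in a few lines (a toy numeric example with a made-up 4-token vocabulary, not the paper's implementation):

```python
import math

def wer_loss(pred_dists, gold_ids):
    """Whole-event-recovering loss: negative log-likelihood of each gold
    event token under the decoder's per-step categorical distribution."""
    assert len(pred_dists) == len(gold_ids)
    return -sum(math.log(dist[t]) for dist, t in zip(pred_dists, gold_ids))

# Toy vocabulary of size 4; a two-token event with gold ids [2, 0].
pred = [[0.1, 0.1, 0.7, 0.1], [0.6, 0.2, 0.1, 0.1]]
loss = wer_loss(pred, [2, 0])
print(round(loss, 4))  # -(log 0.7 + log 0.6) ≈ 0.8675
```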
Intuitively, the success of Eq. (1) requires capturing correlations between the masked event and the remaining context, but two major problems arise from WER with long event-level masked spans: (1) Implicit Event-correlation: The model recovers an event based solely on token-level co-occurrence, as in a conditional language model (e.g., T5 and BART), regardless of the rich event-level correlations between the events in the context x/{e} and the masked event e. Such a correlation-implicit model would achieve inferior performance on downstream event-centric correlation reasoning tasks.
(2) Learning Difficulty: As the masked event is an integrated, self-contained semantic unit, it is difficult for the conditional generation model to recover the whole event due to a lack of local context. As a result, the model cannot effectively learn from the long masked spans, which has been empirically observed in autoencoding MLM models.
To alleviate the two problems above, we propose two other novel self-supervised objectives in the following. Briefly, we present contrastive event-correlation encoding to enhance correlations between contexts and events, and prompt-based event locating to reduce generation difficulty.

Contrastive Event-correlation Encoding
For the implicit event-correlation problem, an intuitive solution is to explicitly highlight the correlation from the masked context to the missing event at the encoder side. To achieve this, we resort to contrastive learning to enhance the encoder-side representation of the masked event by contrasting it with the embedding of the gold event mention e against those of negative ones ē. Particularly, we first derive the embeddings of e and ē independently via the Transformer encoder in Eq. (2), i.e.,

c = Pool(Trans-Enc([CLS] + e; θ^(enc))),   (5)
c̄ = Pool(Trans-Enc([CLS] + ē; θ^(enc))),   (6)

where [CLS] is a special token prefixed to each event mention, and Pool(·) denotes using the contextual embedding of [CLS] to represent the whole event. Then, we enhance h^[M], the contextual representation of the mask token [M] from Eq. (2), by contrasting it with c against each c̄, i.e.,

L^(ce) = Σ_ē max(0, λ + d(h^[M], c) − d(h^[M], c̄)),   (7)

where d(·, ·) denotes a distance metric between two vectors (the Euclidean distance in this work) and λ is a margin. As a result, the encoder-side correlation-aware representation h^[M] also offers a straightforward pathway to transmit event-level information to decoding, which mitigates the learning difficulty to some extent.
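The margin-based contrast with Euclidean distance can be sketched as follows (a toy illustration with made-up 2-d vectors; the real representations are Transformer hidden states):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(h_m, c_pos, c_negs, margin=0.5):
    """Margin-based contrastive loss sketch: pull the masked-event
    representation h_[M] toward the gold event embedding c and push it
    away from each negative embedding by at least `margin`."""
    d_pos = euclidean(h_m, c_pos)
    return sum(
        max(0.0, margin + d_pos - euclidean(h_m, c_neg)) for c_neg in c_negs
    )

h = [0.0, 0.0]
gold = [0.1, 0.0]                  # close to h: small positive distance
negs = [[1.0, 0.0], [0.0, 0.05]]   # one far (no loss), one too close (penalized)
print(round(contrastive_loss(h, gold, negs), 4))  # → 0.55
```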

Prompt-based Event Locating
As for the learning difficulty problem, we also propose a prompt-based event locating objective to reduce generative difficulty by providing hints in the prompt. The basic idea is to simplify the WER objective into an extractive generation task that locates and copies a candidate/hint from the prompt, which aims at improving learning effectiveness. To this end, we present two prompt-based generation schemas in the following.
Correct Event Selection. Inspired by advances in prompt-based multi-choice question answering, we present the correct event selection schema to select the gold event e against negative ones {ē_i}^M_{i=1} based on the context x/{e}. Given an event-masked paragraph x/{e} suffixed with several candidate events (the gold masked event e mixed with the negatives {ē_i}^M_{i=1}), denoted x̂^(ces), the model aims to generate the masked event e back. We use a random permutation of the candidates to guard against position bias: since all candidates are assigned distinct position embeddings during contextualizing, a fixed position for the gold event would result in a learning shortcut (position bias) that degrades the model. Thus, similar to Eq. (1), we can define its formula as p(e | x̂^(ces); θ).
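Building the correct-event-selection input can be sketched as follows; the exact prompt wording below ("Options:") is illustrative, not the paper's verbatim template:

```python
import random

def build_ces_input(masked_context, gold_event, negatives, rng):
    """Sketch of the correct-event-selection prompt: the masked paragraph
    is suffixed with all candidates in a random order, so the gold event's
    position cannot become a learning shortcut."""
    candidates = [gold_event] + list(negatives)
    rng.shuffle(candidates)  # random permutation avoids position bias
    options = " ".join(f"({i + 1}) {c}" for i, c in enumerate(candidates))
    return f"{masked_context} Options: {options}", candidates

rng = random.Random(42)
prompt, cands = build_ces_input(
    "It tries the knob. [M], so the creature starts pounding on the door.",
    "the door is locked",
    ["he opens the window", "she reads a book"],
    rng,
)
print(prompt)
```

The training target for this input is still the gold event string itself, generated by the decoder.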
Wrong Event Tagging. The other schema is wrong event tagging, which finds the wrong event in a corrupted paragraph, similar to incoherence reasoning. Specifically, we re-write the encoder input as x̂^(wet), obtained by replacing the event e in x with a sampled negative event ē, and the model is asked to generate the wrong event ē. Thus, we can define the formula of this objective as p(ē | x̂^(wet); θ).
Based on the two formulas above, we define the prompt-based event locating objective as

L^(pel) = − log p(e | x̂^(ces); θ) − log p(ē | x̂^(wet); θ).   (8)

Model Pre-training and Fine-tuning
Self-supervised Pre-training. The final loss to pre-train our ClarET is a linear combination of the three losses above from Eq. (4), (7) and (8). We set the margin λ in Eq. (7) to 0.5 without tuning.
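The combination of the three losses can be sketched as follows; the equal default weights are an assumption, since the text only states "linear combination":

```python
def total_loss(l_wer, l_ce, l_pel, weights=(1.0, 1.0, 1.0)):
    """Final pre-training loss as a linear combination of the three
    objectives (WER, contrastive encoding, prompt-based locating).
    Equal weights are an assumption for illustration."""
    return weights[0] * l_wer + weights[1] * l_ce + weights[2] * l_pel

print(total_loss(1.0, 0.5, 0.25))  # → 1.75
```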
Supervised Downstream Fine-tuning. For generation tasks, we simply leverage the formula in Eq. (1) to establish fine-tuning objectives. For discriminative (e.g., multi-choice) tasks, we can either formulate all tasks as generation, as in GPT/T5, or fine-tune with classification heads, as in BART. In pilot experiments, we found the latter achieves better performance and thus adopted it.

Comparing to Similar Works
While we adopt the same data processing as EventBERT and share a similar motivation to learn an event-centric pre-trained model, we expand the scope from 'discriminative-only' in EventBERT to 'unified' via our context-to-event Transformer for a broad spectrum of scenarios. Such an expansion is non-trivial since new challenges arise in the unified formulation. Compared to the inefficient 'event-backfilling and contextualizing' paradigm in EventBERT, our model can explicitly and effectively learn event-level correlations between contexts and events via our novel contrastive and prompt-based objectives. Moreover, COMeT (Bosselut et al., 2019; Hwang et al., 2021) is also a conditional generation model but focuses on triple-level commonsense reasoning: given (head event, relation), it generates tail events. Its motivation is therefore orthogonal to ours, and we focus on a different scope and evaluation formulations.

Experiments
This section begins with descriptions of downstream datasets and experimental setups.
Downstream Datasets. We conduct extensive evaluations on 9 datasets for 9 downstream tasks, i.e., 5 generation and 4 classification tasks. Generation tasks include abductive commonsense reasoning on ART (αNLG), counterfactual story rewriting on TIMETRAVEL, story ending generation, commonsense story generation, and event process completion on APSI (see Appendix C for details).

Table 2: Fine-tuning results on four classification benchmark datasets. We split pre-trained models into discriminative and unified groups, since discriminative models usually outperform unified ones in classification and our ClarET falls into the latter. Previous SoTA discriminative and unified results are wave-underlined and underlined, respectively. See Appendix D.2 for full results.
Pre-training Setups. Instead of learning from scratch, we perform continual pre-training from BART-large. For generation tasks, we use BLEU (Papineni et al., 2002), ROUGE-L (R-L) (Lin, 2004) and BERTScore (BERT) (Zhang et al., 2020c) as the evaluation metrics, while accuracy (ACC) is used for classification tasks. Each fine-tuning runs with seeds 2, 10 and 1234, and we evaluate the best dev model on the test set.

Main Evaluation
Fine-tuning for Generation. As shown in Table 1, our proposed ClarET achieves SoTA performance across all generation tasks. For instance, ClarET increases the ROUGE-L score by 2.3 absolute points for abductive reasoning. The superior performance of ClarET on the benchmarks demonstrates that it can model event-level correlation more effectively via a few steps of continual pre-training and provide a general solution for a variety of event-centric correlation reasoning tasks.
Fine-tuning for Classification. Table 2 lists results on the 4 classification tasks. We find ClarET performs better than all task-specific models and unified pre-trained models, with 2%-4% improvements. It achieves competitive accuracy against strong discriminative models; e.g., the gap between ClarET and EventBERT is ~0.15 for narrative incoherence detection and the story cloze test. However, EventBERT is a RoBERTa-based competitor using the identical pre-training corpus. Its pre-training follows 'event-backfilling and contextualizing' (similar to multi-choice QA), which has a small gap to downstream classification tasks and hence strong performance, but brings two drawbacks. Firstly, its pre-training is slow due to repeated contextualizing over paragraphs, leading to 5.6× longer GPU hours than ours. In addition, its discriminative paradigm limits it specifically to classification, regardless of the wide range of generation tasks. The results show ClarET is on par with the discriminative-only EventBERT on classification. This is non-trivial given the large formulation gap between our generative pre-training objectives and downstream multi-choice-style classification tasks, and is attributed to our effective event-correlation learning. In summary, these results show ClarET serves as a unified pre-trained model for event-centric generation and classification tasks.
Note that MLM-style models are evaluated by an autoregression-like operation.

Quantitative Analysis
Zero-shot Learning. It is essential to verify whether the targeted information was learned and retained by a pre-trained model. Compared to MLM, our generative recovering model is inherently applicable to event-centric multi-choice and generative formulations. For generation tasks, we apply Eq. (1) to generate answers. As shown in Table 3, ClarET achieves the best performance and outperforms DELOREAN (which adapts auto-regression for counterfactual reasoning). For classification tasks, we apply Eq. (1) to each option to compute its perplexity and select the option with the minimum. As shown in Table 4, ClarET surpasses previous models and beats the discriminative-only event-centric model, EventBERT. Besides, the general-purpose pre-trained models perform nearly at random-guess level due to their incompetence in long-span event discrimination.
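The perplexity-based option selection can be sketched as follows (a toy illustration given per-token log-probabilities; in practice these would come from scoring each option with the seq2seq model):

```python
import math

def perplexity(token_logprobs):
    """Token-level perplexity from per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_option(option_logprobs):
    """Zero-shot classification sketch: score each candidate option by the
    perplexity of generating it and pick the option with the minimum."""
    ppls = [perplexity(lp) for lp in option_logprobs]
    return min(range(len(ppls)), key=ppls.__getitem__)

# Toy scores: option 1 has higher log-probs, hence lower perplexity.
scores = [[-2.0, -3.0], [-0.5, -0.7]]
print(pick_option(scores))  # → 1
```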
Few-shot Learning. Since our model reduces pretrain-finetune inconsistency for event-centric tasks and provides a good initialization for downstream fine-tuning, it is also interesting to examine few-shot performance by scaling down the training data. As shown in Figure 3, ClarET achieves similar performance to strong baselines with only 10%-30% of the training data for fine-tuning.
Ablation study. To measure the contribution of each objective to the final fine-tuning results, we conduct an ablation study on both generation and classification in Table 5. The first two ablations drop the two prompt schemas respectively from the prompt-based event locating objective of Eq. (8), which verifies the effectiveness of reducing task difficulty. Then, the third ablation removes contrastive event-correlation encoding and shows a substantial drop, which verifies the significance of explicit event-correlation learning. Next, we keep only the prompt-based event locating objective, making our model a prompt-learning discriminative model (sharing a methodology closer to EventBERT), which however leads to a dramatic decrease. Lastly, when removing all the objectives, our model degenerates to BART-large. Here, 'ePPL', i.e., event perplexity, refers to event-level token perplexity averaged over the dataset.
Comparison with Larger Model. A trend in pre-training models follows the law of 'larger models for better performance', but a crucial research question is how to perform competitively with fewer computation resources. To answer this, we show extra fine-tuning results on the five generation datasets in Table 7 to compare our ClarET (400M parameters) with T5-large (770M) and T5-base (220M). It is observed that (i) at 3× the scale, T5-large notably outperforms T5-base, supporting the above law, and (ii) with almost half the model size, our ClarET performs very competitively to T5-large (even better on 3 out of 5 tasks), verifying the significance of our objectives towards event-related knowledge.

Difficulty of Event Generation. To exhibit the learning difficulty in pre-training (as stated in §3.2.1) and the effectiveness of our novel learning objectives, we conduct another ablation setting in Table 6. It is observed that ClarET achieves better event-level perplexity (ePPL), verifying that the two novel objectives promote event generation and reduce the difficulty of decoding.

Long-span Event Generation. To further check whether ClarET is more competitive on longer-span event generation, we compare it with BART-large and T5-base/-large by the negative log-likelihood ('− log') of Eq. (1). Different from the recovering paradigm of the others, we follow the denoising paradigm to implement BART and calculate its score by considering the masked part in decoding. Figure 4 shows that (1) Line Chart: the gap between ClarET and the others becomes larger as event length increases, since the general-purpose models only consider short-span masking in pre-training, leading to inferior event generation; and (2) Bar Chart: as for the data distribution, although a majority of the data falls into the 6-8 bin, there are still many examples with event length greater than nine.

Natural Language Understanding (NLU).
Our base model, BART-large, is designed for general NLU tasks. To show that our brief event-centric continual pre-training does not interfere with its NLU ability, we conduct fine-tuning experiments on the GLUE benchmark, as in Figure 5. It is observed that, although slightly surpassed by the discriminative RoBERTa, ClarET retains BART's natural language understanding ability.

Case Study and Error Analysis
Case Study. As shown in the first case in Figure 6, we conduct a case study on the generative abductive reasoning task, where the fine-tuned ClarET generates an event semantically close to the gold reference, but BART does not. BART generates only part of the answer and ignores the event correlations from 'They were impressed with my phone', while ClarET completely captures the correlations in the contexts (e.g., 'to buy a phone' and 'They were impressed') and generates a much better result.
Error Analysis and Limitation. The second case in Figure 6 shows that our ClarET is ineffective when the gold event is very complicated. In detail, the model focuses only on 'at the end of the day' to generate '... spent the whole day ...' but ignores very subtle contexts, e.g., 'starting her job ... teacher' and 'they liked her'. To expand on this, we found a problem in long-event decoding via pilot experiments. As shown in Figure 7, the gap in token-level perplexity between ClarET and WER-only gradually diminishes with the token's position in the event. This is because the subsequent tokens in an event can be generated on the basis of previous generations on the decoder side, rather than context-aware representations from the encoder side. When a long span is masked, the model can see previous tokens of the event (i.e., e_<t) during decoding and is inclined to perform the t-th prediction based on e_<t rather than x/{e}, especially for larger t. As a result, the model may 'cheat' in generation and learn decoder-side language modeling rather than context-aware representations. We will explore this problem in the future. Besides, due to limited computation resources, we chose a 400M model size and 90h of continual pre-training, which limits the performance.

Conclusion
We present a novel correlation-aware context-to-event Transformer that learns event-correlation knowledge from a text corpus in a self-supervised manner and benefits various event-centric reasoning scenarios. Besides SoTA fine-tuning results on 5 generation and 4 classification tasks, we conduct zero-/few-shot learning and extensive ablation studies to exhibit our model's effectiveness. Lastly, we find our model is competitive with a twice-larger general-purpose model, reduces the learning difficulty of event generation, and retains the NLU ability of its base model.

A Examples from Mined Pre-training Corpus
Some mined pre-training examples are shown in Table 8. Following the adopted mining method, an example includes a paragraph, events, a selected positive event, connectives of the positive event, and sampled negative events of the positive event.
B More Details

B.1 BART Pre-training Resources
In this section, we analyze BART pre-training resources in terms of text corpora and computation resources.
As for tokens in the BART pre-training corpora, the BART paper claims using the same corpora as RoBERTa, and the T5 paper states that RoBERTa uses a 2.2T-token text corpus. Thus, we adopt '2.2T' as the number in the main paper.
As for BART pre-training computation overheads, the contributor of the official BART code repository said 'We trained for around 11-12 days on 256 gpus.' at https://github.com/pytorch/fairseq/issues/1525, so BART pre-training takes from 67,584 to 73,728 GPU hours. Thus, we use '70,000' as the number in the main paper.

Example 1
Paragraph It was only months later, when she saw her friend's thin gaunt face, her swollen belly and her quiet desperation, that she had come to her senses. Then she had been filled with a combination of burning rage and deep shame. This had endured over the years undiminished.

Positive Event
she had been filled with a combination of burning rage

Connectives when; then
Negative Events
he had been loaded with a lot of vampire venom
she had been trained in the art of gentler speech
I had been blessed with some sort of fire ability
he had been transformed into a piece of living statuary
he had been sculpted from a block of pale marble
I had been circumcised in the age of infantile apathy
...

Example 2 Paragraph
Then, when she turned twenty one at the end of last year, she had decided to act on it. A driver's license was something she needed for her business and the identity papers which went with it were needed for a range of other reasons, such as enrolling Catherine for school at the start of this year.
Positive Event papers which went with it were needed for a range of other reasons

Connectives then; when; and
Negative Events
bookcases that stood against it had opened like a pair of French doors
it had a little bit of magic in it, just for Lizzie
it only gave her a place for a couple of days
which occasionally crossed a small ridge sometimes of gravel, sometimes of sand
that she was going shopping in the city with a couple of other girls
publish that proposal in the paper for three weeks
...

Table 9: Full results of T5-base and T5-large on generation tasks.

B.2 Full Results of T5 Model
The full results of T5-base and T5-large on the five generation tasks are shown in Table 9.

B.3 Connectives in Paragraph
As stated in prior work, connectives (i.e., discourse relation keywords in the contexts) play important roles in expressing correlations among events. Therefore, we also find every possible connective r for each (x, e), where r is a connective in x that immediately links to the verb of e on the parsing tree of x. To leverage the connectives, we also apply the correct event selection schema in the prompt-based event locating objective to r and its negatives {r̄_i}^M_{i=1}, as correct connective selection. Here, each r̄ is randomly sampled from the discourse relations in the PDTB annotation manual (Webber et al., 2019). 20% of the time, we use correct connective selection to replace correct event selection in the prompt-based event locating objective.

C Details of Evaluation Datasets
We detail the nine evaluation datasets in the following. The training example in each dataset is shown in Table 10.
• ART (αNLG). Given two observations in natural language, it aims to generate an explicative hypothesis between them. We follow the official data split  with 169,654/1,532/3,059 in training/dev/test.
• TIMETRAVEL. Given an original story and a counterfactual event, it aims to rewrite the subsequent events to complete a story, which is compatible with the counterfactual event. We follow the official data split  with 98,159/5,613/7,484 in training/dev/test.
• Story Ending Generation. We evaluate the story ending generation based on ROCStories, which aims to generate a story ending for a given story context. We follow the data split (Guan et al., 2019) with 90,000/4,081/4,081 in training/dev/test.
• Commonsense Story Generation. It is based on ROCStories. Given a leading context, it aims to generate a reasonable story. We follow the data split (Guan et al., 2020) with 88,344/4,908/4,909 in training/dev/test.
• APSI. We evaluate event process completion on the APSI dataset, where the goal is to generate a subevent for a given event context. We follow the data split (Zhang et al., 2020a) with 13,501/1,316 in training/test.
• Script Reasoning. Given an event chain, it aims to predict the subsequent event from 5 candidates. We follow the data split (Li et al., 2018) with 140,331/10,000/10,000 in training/dev/test.
• ART (αNLI). Given two observations in natural language, it aims to choose the most explicative hypothesis from 2 candidates. We follow the data split with 169,654/1,532 samples in training/dev.
• ROCStories. We follow (Mori et al., 2020) to use ROCStories for narrative incoherence detection. A random sentence is removed for each five-sentence story, and the goal is to predict the missing position. We follow the data split (Mori et al., 2020) with 78,528/9,816/9,817 in training/dev/test.
• Story Cloze Test. Given a 4-sentence context, it aims to select the right ending from two alternative endings. We follow the data split (Mostafazadeh et al., 2016) with 98,161/1,871/1,871 in training/dev/test.

D Detailed Evaluation Results
We detail the full results on nine evaluation datasets as follows.

D.1 Generation Tasks
Generation tasks include abductive commonsense reasoning (αNLG), counterfactual story generation, story ending generation, commonsense story generation, and event process completion. The detailed results of these generation tasks are shown in Table 11, Table 12, Table 13, Table 14, and Table 15, respectively. ClarET achieves state-of-the-art performance on all five generation tasks. In addition, pre-trained language models show their strong generation ability on story generation tasks, i.e., story ending generation and commonsense story generation.

D.2 Classification Tasks
Classification tasks include script reasoning, abductive commonsense reasoning (αNLI), narrative incoherence detection, and story cloze test. The detailed results of these classification tasks are shown in Table 16, Table 17, Table 18, and Table 19, respectively. Compared with unified language models, ClarET achieves state-of-the-art performance. Although strong discriminative models show their great ability on classification tasks, ClarET still achieves competitive performance.

Table 19: Full results on the story cloze test (ACC).

Discriminative Model
  (Srinivasan et al., 2018): 76.50
  RoBERTa-large: 87.10
  EventBERT: 91.33
Unified Model
  Finetuned Transformer LM (Radford et al., 2018): 86.50
  BART-large: 87.01
  ClarET: 91.18