Event-Centric Question Answering via Contrastive Learning and Invertible Event Transformation

Human reading comprehension often requires reasoning about event semantic relations in narratives, a task represented by Event-centric Question Answering (QA). To address event-centric QA, we propose a novel QA model with contrastive learning and invertible event transformation, called TranCLR. Our proposed model utilizes an invertible transformation matrix to project semantic vectors of events into a common event embedding space, trained with contrastive learning, and thus naturally injects event semantic knowledge into mainstream QA pipelines. The transformation matrix is fine-tuned with the annotated event relation types between events that occur in questions and those in answers, using event-aware question vectors. Experimental results on the Event Semantic Relation Reasoning (ESTER) dataset show significant improvements in both generative and extractive settings compared to existing strong baselines, achieving over 8.4% gain in token-level F1 score and 3.0% gain in Exact Match (EM) score under the multi-answer setting. Qualitative analysis reveals the high quality of the answers generated by TranCLR, demonstrating the feasibility of injecting event knowledge into QA model learning. Our code and models can be found at https://github.com/LuJunru/TranCLR.


Introduction
Since 2019, many large-scale pre-trained language models (PLMs) (Devlin et al., 2018; Raffel et al., 2019; Lu et al., 2020; Pergola et al., 2021b) have been introduced to address Question Answering (QA) tasks, reaching performance on par with humans on entity-centric QA datasets such as SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), and NewsQA (Trischler et al., 2016), in which answers are often entities extracted from text. A rising challenge is to research and develop new PLM-based frameworks tackling more difficult QA settings in real-world scenarios.

Figure 1: An event-centric QA example from the ESTER dataset (Han et al., 2021). All event triggers are highlighted in bold and underlined in the paragraph. The question event trigger and answer event triggers are further highlighted in red colors with different shades. The in-text answer is highlighted in blue.

One direction is to go beyond entity-centric QA and explore QA tasks focusing on information at a higher cognitive level, such as events. The recently introduced Event Semantic Relation Reasoning (ESTER) dataset (Han et al., 2021) facilitates the development of Machine Reading Comprehension (MRC) models for event-centric QA. The dataset contains event-centric question-answer pairs annotated with event semantic relation type labels. Figure 1 shows an example instance from the dataset. The main challenge is to effectively exploit event semantic knowledge to answer event-centric questions. In the example illustrated in Figure 1, an MRC or QA model needs to first understand that the question asks for a potential answer event which holds a conditional relation with the main event 'charges' mentioned in the question. It then needs to identify events in the paragraph which have the conditional relation with the question event trigger 'charges', in this case, 'linked', 'testing' and 'found'. Finally, it needs to generate the answer involving the identified events in natural language. It is easy for humans to understand narratives by constructing a situational logic chain capturing how events evolve and relate to each other in text. Yet, existing QA models only learn shallow semantic cues based on word token statistics gathered from large-scale text corpora (Niven and Kao, 2019), and are not able to grasp high-level concepts such as events. Preliminary experimental results using the pre-trained T5 language model on the ESTER dataset show that there remains a large gap of over 15% between machine and human performance (Han et al., 2021).
Intuitively, it is possible to inject event information through a multi-task learning framework where event-related tasks such as event relation type detection and event embedding learning could be potentially useful to guide the QA model to generate better answers. For example, event relation type detection aims to detect the desired event semantic relation given a question, while event embedding learning aims to push events holding the desired semantic relation closer in the new event embedding space. However, in a QA model, the answer generator is usually built on a PLM in which the original event representations learned by the PLM should be preserved. That is, we want to map event representations onto a new event embedding space in order to inherently capture their semantic relations specified by an input question, but at the same time, we need to keep the original event representations learned by the PLM in order to generate coherent answers. To deal with this dilemma, we propose an invertible transformation operator, which makes it possible to learn new event embeddings without changing the mutual information of any given event pair, making it effective in injecting event information for event-centric QA.
More concretely, to inject event semantic knowledge into QA models, we propose a novel multi-task learning framework, named TranCLR (Fig. 2), combining a general-purpose QA model with an invertible event transformation operator to encode event relations across questions and paragraphs. It builds on the UnifiedQA (Khashabi et al., 2020) model for answer generation, and employs an invertible event transformation operator to project the hidden representations from the UnifiedQA encoder onto a new event embedding space. The transformed representations are then used for (i) contrastive learning and (ii) event relation type classification. The contrastive learning mechanism is adopted to realign the event vectors, strengthening the relations between the events mentioned in questions and the candidate answer events in paragraphs, and improving generalization to out-of-distribution event relations. On the other hand, the event relation type classification is used to further fine-tune the transformation matrix through contextualized question representations. The combination of the transformation operator with contrastive learning and event relation type classification leads the model to focus on the textual and relation features characterizing event occurrences in text, and results in an overall boost in performance on event-centric QA tasks.
Our contributions can be summarized as follows: (1) We introduce a novel multi-task learning framework for event-centric QA, TranCLR, in which we design an invertible event transformation operator and a contrastive learning mechanism, further combined with event relation type classification, to perform better reasoning on event semantic relations; (2) We conduct an experimental assessment on the ESTER dataset showing that TranCLR boosts the performance of QA models compared to strong existing PLM-based QA baselines, achieving over 8.4% and 3.0% gain in the token-level F1 and EM score respectively under the multi-answer setting; (3) Visualization of event-aware token semantic vectors verifies the effectiveness of event knowledge injection. We further show the advantages of our framework tailored for event-centric learning in both zero- and few-shot learning, and its adaptation ability on out-of-domain event-centric questions.

Related work
This work is related to two lines of research: event-centric QA and contrastive learning.
Event-centric QA The growing interest in event understanding has recently led to the development of new resources for event-centric QA and event relation extraction. Souza Costa et al. (2020) proposed EventQA, an event-centric QA dataset for accessing semantic information stored in knowledge graphs. The questions are created via random walks on the EventKG (Gottschalk and Demidova, 2019), then manually translated into natural language. Ning et al. (2020a) modified and converted an event temporal relation extraction dataset, MATRES (Ning et al., 2018), into a reading comprehension format focused on event temporal ordering questions, named TORQUE. Instead of solely focusing on simple arguments or temporal relations, the ESTER dataset (Han et al., 2021) was developed to highlight how events are semantically related in terms of the five most common semantic relations: Causal, Conditional, Counterfactual, Sub-event, and Co-reference. The aforementioned works built dataset baselines with popular entity-based PLMs, and thus left significant performance gaps compared with human performance. Asai and Hajishirzi (2020), Dua et al. (2021) and Shang et al. (2021) leverage features of closely related questions to capture temporal differences in order to deal with certain types of event-centric questions. Compared to the existing works, we target various types of event-centric questions. Therefore, we introduce an invertible event transformation to (i) model the event semantic relations through an auxiliary classification task, and to (ii) realign the event latent representations via contrastive learning in the space of the transformed events.
Contrastive Learning Approaches to contrastive learning for text focus on the generation of positive and negative training pairs from pre-trained language models. For example, Clark et al. (2020) proposed a new pre-training framework named ELECTRA, which defines a new generative training task, Replaced Token Detection (RTD), with the aim of determining whether a token was replaced by the language model. Based on ELECTRA, Meng et al. (2021) designed two new pre-training tasks: Corrective Language Modeling (CLM), aiming at restoring a corrupted sentence, and a contrastive learning-based task in which the positive pairs are made of recovered sentences and the corresponding corrupted sentences. Similarly, Qin et al. (2020) designed another contrastive learning framework, ERICA, for document-level text understanding, via an entity discrimination pre-training task and a relation discrimination pre-training task. Chen et al. (2022) proposed a two-stage framework integrating answer-aware span-based contrastive learning for cross-lingual machine reading comprehension. Wu et al. (2020) and Fang et al. (2020) designed frameworks similar to SimCLR (Chen et al., 2020) to generate sentence representations by applying several data augmentation strategies to create contrastive pairs, such as word deletion and swapping, back-translation and synonym replacement. Yet, Gao et al. (2021) reported that simply using dropout masks twice within a PLM can yield rather reliable positive pairs. Our work adopts a standard contrastive learning framework. The positive and negative pairs of events are composed directly from the different text sections: questions, paragraphs, and answers within the paragraph.

Methodology
In this section, we first define the task of event-centric QA and then present our proposed TranCLR model. We build our model mainly based on the ESTER dataset (Han et al., 2021).

Task Formulation
Event-centric QA can be formulated as question answering centred on the understanding of event semantic relations. The task can be mathematically defined as follows: given a text passage x p and an answerable event-centric question x q , a model is asked to provide one or more answers Ŷ = {ŷ 1 , · · · , ŷ A }, where A denotes the total number of answers to the given question. In the ESTER dataset, event triggers in text passages, questions and answers are annotated as E = {e p 1 , · · · , e p Cp , e q , e a 1 , · · · , e a Ca }, where C p and C a denote the total number of events in the text passage and the answer, respectively. Each question only contains a single event e q . Since answers are parts of the paragraph in ESTER, paragraph event triggers also include answer event triggers. In addition, the relation type between the question event and the answer events, t ∈ T , is also annotated. In the ESTER dataset, there are 5 event semantic relation types: Causal, Conditional, Counterfactual, Sub-event, and Co-reference.

TranCLR
We propose a novel framework for event-centric question answering, called TranCLR, which is a multi-task model via contrastive learning and invertible event transformation. The overall framework is shown in Figure 2. Following the settings in the ESTER work (Han et al., 2021), we adopt T5-large (Raffel et al., 2019) as an encoder-decoder backbone for the generative setting (i.e., answer generation), and RoBERTa-large (Liu et al., 2019) as an encoder for the extractive setting (i.e., answer extraction). The T5-large model is fine-tuned in a universal generative style (Khashabi et al., 2020), and is therefore named UnifiedQA-T5-large. During training, the input sequence consists of the question-answer event relation type label t, the question x q and the passage x p , with ":", "\n", "</s>" and "<s>" special tokens. We use x = {t:x q \nx p } and {<s>t:x q </s></s>x p } to denote the whole input sequence for the generative and the extractive settings, respectively. Let N x be the length of x, and d be the dimension of hidden state vectors; H ∈ R Nx×d is the contextual hidden states of the encoder. The target label for the generative setting is the concatenation of all answers Ŷ = {ŷ 1 , · · · , ŷ A }, with each separated by a ";" special token, while labels for the extractive setting are Ŷ = {ŷ 1 , · · · , ŷ x p }, following the "B-I-O" or "I-O" tagging format. After getting the hidden states H of an input sequence x via the UnifiedQA-T5-large or the RoBERTa-large encoder, we simultaneously train the model with contrastive learning for two tasks: the main QA task and the auxiliary task of event relation type classification. Therefore, our model is designed to maximize the probability p(ŷ|x q , x p , E, t) of the generated answers or the predicted labels given a question x q , the supporting paragraph x p , all event triggers E, and the question-answer event relation type label t. The event relation type t can be considered as a prefix or prompt to the input, as in prompt-based learning.
It is worth noting that the annotated event triggers are only used in training, but not in inference.
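As a concrete illustration, the input formatting described above can be sketched as follows. This is our own minimal sketch with hypothetical function names; in practice the tokenizers insert further special tokens.

```python
def build_generative_input(rel_type, question, passage):
    # "t:question\npassage" -- format used for the UnifiedQA-T5 generative setting
    return f"{rel_type}:{question}\n{passage}"

def build_extractive_input(rel_type, question, passage):
    # "<s>t:question</s></s>passage" -- format used for the RoBERTa extractive setting
    return f"<s>{rel_type}:{question}</s></s>{passage}"

print(build_generative_input("Conditional",
                             "What could lead to the charges against Kopp?",
                             "Kopp has been linked to the shooting ..."))
```

Keeping the relation type label as the leading segment makes it act as a prompt prefix that the encoder sees before the question and passage.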
The key to event-centric QA is to perform reasoning on the semantic relation between the event found in the question and the candidate events in the paired text passage. For example, for the question in Figure 2, "What actions from the law enforcement could lead to the filing of the charges against Kopp?", we would expect the QA model to generate an answer which contains the event(s) that exhibit the Conditional relation with the event charges mentioned in the question. This is somewhat similar to node prediction in knowledge graph embedding learning: given the head event e q in the question and the relation type t, we aim to locate the tail event e p in the text passage to generate the desired answer. Inspired by knowledge embedding learning methods such as TransE (Bordes et al., 2013), we propose to transform event embeddings using a transformation matrix and introduce an auxiliary task for event relation type classification. Using the transformation matrix has two advantages. First, the token-level hidden states are preserved in the original embedding space, which is important for semantic-based QA. Second, the transformed event embeddings allow the identification of common features for more general event relation type classification in the new event embedding space. In what follows, we describe our proposed invertible event transformation operator, contrastive learning, and event relation type classification in more detail.

Invertible Event Transformation
We propose an invertible transformation which aims to map event representations onto a new event embedding space in which the desired event semantic relations are inherently encoded. Let H q|e ∈ R Cq×d , H a|e ∈ R Ca×d and H o|e ∈ R Co×d be the parts of the hidden state vectors H representing the embeddings of the question event, the answer events and the other events in the text passage, where C q refers to the number of event triggers in the question, and the sum of C a and C o is the total number of event triggers in the text passage. Additionally, let M ∈ R d×d be the transformation matrix. H q|e , H a|e and H o|e are mapped onto a new event embedding space by:

H′ q|e = H q|e M + b M ,  H′ a|e = H a|e M + b M ,  H′ o|e = H o|e M + b M ,

where b M ∈ R d is the bias term. The non-singularity (i.e., invertibility) of the random matrix M is guaranteed with high probability (Tao and Vu, 2008), which is confirmed by our experimental results as well. Therefore, we do not need any regularisation terms to guarantee the rank of the transformation matrix during training. Since the linear transformation is invertible, we have the following properties (the proofs can be found in Appendix A).

Property 1. For any event representation e obtained from a PLM, and its transformed new embedding e′, we have S(e′) = S(e) + log(|M|), where S is the entropy of a given event.
Property 1 guarantees that the projected representation of a given event has a smoother distribution, which makes it easier to find a separatrix in a hyperspace in the auxiliary task of event relation type classification, since |M| is usually larger than 1. The distribution of outliers, i.e., low-frequency words, is smoothed by this invertible transformation as well.

Property 2. For any event representation pair e 1 and e 2 obtained from a PLM, and their transformed representations e′ 1 and e′ 2 , we have I(e′ 1 , e′ 2 ) = I(e 1 , e 2 ), where I is the mutual information of the given event pair.
Property 2 guarantees that for any event pair, the projection does not change the mutual information, which represents the event relations encoded in the original PLM. Since the projection is a bijection and invertible, the separatrix learned from the auxiliary task can be converted back to the hidden states to guide the answer generation directly.
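The invertible affine projection and its losslessness can be illustrated with a small numpy sketch. The dimensions and random values here are toy choices of ours, not the paper's actual configuration:

```python
import numpy as np

d = 8                                  # toy hidden size; a PLM would use e.g. 1024
rng = np.random.default_rng(0)

# Random square transformation matrix M and bias b_M. A random real
# matrix is non-singular with high probability (Tao and Vu, 2008),
# so no rank regularisation is needed.
M = rng.normal(size=(d, d))
b = rng.normal(size=d)
assert np.linalg.matrix_rank(M) == d   # invertible in practice

H = rng.normal(size=(5, d))            # toy event hidden states from the encoder
H_new = H @ M + b                      # projected event embeddings H' = H M + b_M

# Because the map is affine with invertible M, the original
# representations can be recovered exactly: no information is lost.
H_back = (H_new - b) @ np.linalg.inv(M)
assert np.allclose(H, H_back)
```

The exact recovery of H from H′ is the mechanism behind Property 2: a bijective map cannot change the mutual information between any pair of event representations.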

Contrastive Learning
After mapping event representations onto a new event embedding space by the aforementioned invertible event transformation, we can form positive event pairs (h i , h j ) by selecting the transformed question event h i from H′ q|e and a transformed answer event h j from H′ a|e . We can also form negative event pairs (h i , h k ) and (h k , h j ) by randomly sampling h k from the transformed vectors of the other events H′ o|e . Let L cl denote the contrastive learning loss:

L cl = Σ (i,j) l cl:(i,j) , with l cl:(i,j) = −log[exp(cos(h i , h j )/τ )/s i ],

where l cl:(i,j) denotes the loss for the positive pair of event vectors h i and h j , cos(·) denotes the cosine similarity function, s i denotes the sum of the cosine similarities of the positive event pair (h i , h j ) and of the negative event pairs (h i , h k ), τ is the temperature hyperparameter adjusting the penalty of negative pairs, and L cl sums the contrastive loss over all possible event pairs in the training set.
TranCLR takes the question event vector and answer event vectors as the source of positive pairs, while taking the other event vectors as the source of negatives. The purpose of contrastive learning is to better employ the event information as a hint for the QA task. Therefore, a good transformation matrix is essential. We introduce an auxiliary event relation type classification task in order to train a better transformation matrix.
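The InfoNCE-style loss described above can be sketched as follows. This is a simplified, unbatched illustration of ours; we assume s i is the sum of exponentiated similarities over the positive pair and the sampled negatives:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_loss(h_q, h_ans, h_others, tau=0.1):
    """Toy InfoNCE-style loss for one transformed question event h_q:
    pull the transformed answer events h_ans closer, push the sampled
    negative events h_others away. tau is the temperature."""
    losses = []
    for h_j in h_ans:                       # each positive pair (h_q, h_j)
        pos = np.exp(cosine(h_q, h_j) / tau)
        neg = sum(np.exp(cosine(h_q, h_k) / tau) for h_k in h_others)
        losses.append(-np.log(pos / (pos + neg)))
    return sum(losses)
```

Minimizing this loss increases the similarity between the question event and its answer events relative to the randomly sampled other events.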

Event Relation Type Classification
As shown in the analysis of n-gram word and token statistics conducted on the ESTER dataset (Han et al., 2021), questions already encode sufficient information to detect the type of event relation referred to. Therefore, the idea is to apply the same transformation matrix used on the event vectors also to the hidden vectors encoding the question, and then use the result for event relation type classification. We first predict the event relation type t by feeding the transformed question vector to a feed-forward layer and a softmax layer. We then define the cross-entropy loss of event relation type classification, denoted as L tc :

L tc = −log p(t|x q ),

where t is the annotated relation type of the question.
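The classification head can be sketched as follows. The variable names, the single-layer head, and the toy shapes are our assumptions, not the paper's released code; the essential point is that the same (M, b M ) used on event vectors also projects the question vector:

```python
import numpy as np

REL_TYPES = ["Causal", "Conditional", "Counterfactual", "Sub-event", "Co-reference"]

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def type_classification_loss(q_vec, M, b, W, gold_idx):
    """Cross-entropy loss L_tc: project the question vector with the
    shared transformation (M, b), score it with a feed-forward layer W
    of shape (d, |T|), apply softmax, and take the negative
    log-probability of the gold relation type."""
    q_new = q_vec @ M + b              # same matrix used for the event vectors
    probs = softmax(q_new @ W)
    return -np.log(probs[gold_idx])
```

Because the gradient of L tc flows through M, this auxiliary task directly shapes the transformation matrix used by the contrastive objective.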

Final Objective Function
For answer generation, the model operates on the hidden state vectors H in the original embedding space encoded by UnifiedQA-T5-large (or RoBERTa-large) to generate (or extract) the answer(s) ŷ. Let L qa denote the loss of the main question answering task:

L qa = −Σ i=1..T log p(ŷ i |x, ŷ <i ),

where T = N a + A − 1 is the total token length of the A ground truth answers separated by A − 1 ";" special tokens under the generative setting, while T = |x p | is the total token length of the supporting paragraph x p under the extractive setting. For the latter, we further extract all tokens marked with "B-I" or "I" predictions as answers. The final loss is defined as:

L = L qa + λ tc L tc + λ cl L cl ,

where λ tc and λ cl are hyperparameters controlling the contribution of the individual loss terms.
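The multi-task objective can be sketched in a few lines. The function names and default weights are placeholders of ours; the actual λ values are hyperparameters reported in the appendix:

```python
def qa_generation_loss(token_log_probs):
    """Generative QA loss L_qa: negative log-likelihood summed over the
    T tokens of the concatenated gold answers ("ans1 ; ans2 ; ...").
    token_log_probs[i] stands for log p(y_i | x, y_<i) from the decoder."""
    return -sum(token_log_probs)

def final_loss(l_qa, l_tc, l_cl, lam_tc=1.0, lam_cl=1.0):
    """Weighted multi-task objective L = L_qa + lam_tc*L_tc + lam_cl*L_cl."""
    return l_qa + lam_tc * l_tc + lam_cl * l_cl
```

All three terms are differentiable with respect to the shared encoder and the transformation matrix, so a single backward pass trains the whole model jointly.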

Experiments
In this section, we will first introduce the experimental setup including the dataset used and the hyperparameter setting, followed by the discussion of experimental results and ablation studies.

Experimental Setup
Dataset We use the event-centric QA dataset, ESTER (Han et al., 2021), for our experiments.
The dataset contains 6k human-annotated event-centric questions with an average length of 10 tokens over 1.9k paragraphs with a maximum of 340 tokens. All event triggers have been marked over the questions, paragraphs and answers. Besides, the dataset provides the event type label for each question from the five common event relation types, Causal, Conditional, Counterfactual, Sub-event, and Co-reference, and collects over 10k event relation pairs. These event relation types contribute to 43.1%, 21.3%, 7.1%, 15.6% and 12.9% of questions, respectively. Most of the questions have 1-2 in-paragraph answers, while Sub-event type questions have more than 3 answers on average. ESTER has been officially split into training, development and test sets, with 4,547, 301 and 1,170 instances, respectively. Table A1 in Appendix B reports the statistics of the 5 event types in ESTER.

Evaluation Metrics We use the same metrics introduced in ESTER (Han et al., 2021): F T 1 , HIT@1 and EM defined for the multi-answer setting. F T 1 calculates unigram-level token overlap between generated answers and the ground truth answers, HIT@1 measures whether the leftmost answer contains a correct event trigger, and Exact Match (EM) checks if any predicted answer matches exactly the corresponding ground truth answer.
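The token-level F1 metric can be sketched as follows. This is our simplified illustration for a single prediction-gold pair, not the official ESTER evaluation script, which additionally handles multiple answers:

```python
from collections import Counter

def token_f1(prediction, gold):
    """Unigram-overlap F1 between one predicted answer and one gold
    answer: precision and recall over (multiset) token overlap."""
    pred_toks, gold_toks = prediction.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the charges were filed", "charges were filed"))
```

An exact match yields 1.0, disjoint token sets yield 0.0, and partial overlaps fall in between.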
Baselines The baselines we use are the seq2seq pipeline built on the UnifiedQA-T5-large model and the RoBERTa-large model introduced in ESTER (Han et al., 2021). Hyperparameter settings for our models can be found in Appendix B.

Table 1: Main results of experiments on the ESTER dataset. (Han et al., 2021) uses "B-I-O" labels for extractive QA, while we found "I-O" labels work better. UnifiedQA-large TranCLR (-*) refer to ablation studies for generative QA, where -prefix, -TC, -CL and -TransM refer to the removal of the event type label prefix, question-answer event type classification, contrastive learning, and the transformation matrix, respectively.
Table 2: Results from various models on 5 different event relation types on the development set. UL is the abbreviation of UnifiedQA-large. UL TranCLR outperforms the UnifiedQA-large baseline significantly in F T 1 across all event relation types. The ablated versions of UL TranCLR show mixed results in HIT@1 and EM.
We show the overall evaluation results in Table 1. Our model achieves impressive results compared with the previous baselines, gaining about 8% improvement in F T 1 under the generative setting, and 3.0% improvement in EM and HIT@1 scores under the extractive setting. We have the following observations: (1) event-based contrastive learning brings a significant gain, enabling better reasoning over the semantic relations between event triggers in questions and candidate answers in text, since the question event and the answer event, although bearing very different semantic meanings, are pushed closer in their projected new event embedding space. This is evident from the drastically improved F T 1 score in the main QA task; (2) both event relation type classification and contrastive learning are indispensable, since their combination achieves more balanced results across all answer evaluation metrics, showing that the auxiliary event relation type classification task leads to a better learned transformation matrix; (3) prompt-based learning using the event relation type as an input prefix is effective, as the additional information can better guide the model to answer questions which are much more difficult than traditional factoid questions; and finally (4) the use of the transformation matrix makes it possible to simultaneously learn representations in both the original embedding space and the new event-centric space, leading to better QA results.

Zero-shot and Few-shot Learning
In this section, we assess the ability of TranCLR for zero-shot and few-shot learning, i.e., without the training data or with very few training instances. With only 500 training instances, TranCLR is able to generate more accurate answers, beating UnifiedQA-large by 3% in EM, demonstrating the benefit of making effective use of event information for better reasoning of event semantic relations. The performance gap, however, gradually diminishes with the increasing size of the training set. Nevertheless, TranCLR is able to generate answers containing more overlapping information with the ground truth given more training instances, evidenced by the increased performance gains compared to UnifiedQA-large, reaching nearly 8% in F T 1 when using the full training set. This shows that our proposed contrastive learning combined with the auxiliary task of event semantic type classification can better capture event semantic knowledge, which guides the decoder to generate answers closer to the ground truth.

Results per Event Relation Type
In Table 2, we provide a detailed comparison of results under various event relation types. We can observe that the results on the 'Causal' type, being the largest category, are much better compared to other event relation types. Our proposed TranCLR achieves the best F T 1 scores across all event relation types compared to the baseline UnifiedQA-large, with increments in the range of 5.4-11.6%. The largest performance improvement of 11.6% is observed on the most difficult 'Sub-event' type, in which questions have more than 3 answers on average. By analyzing the results, we found that it is sometimes quite difficult to distinguish between the 'Conditional' and 'Counterfactual' types. As such, adding the event relation type as a prefix may confuse the model. In terms of the HIT@1 results, TranCLR with prefix only (i.e., -TC&CL) improves upon the baseline by over 5% and nearly 4% for the 'Conditional' and the 'Sub-event' types, respectively. We also observe that using the event relation type as a prefix in prompt-based learning is very effective in boosting the EM scores, especially for the 'Counterfactual' type, in which nearly 4% improvement is obtained compared to UnifiedQA-large. For the 'Sub-event' type, where multiple answers are expected, there is no improvement in EM in our models compared to the baseline.

Table 3: Example answers generated by different models. TranCLR, injected with event knowledge, accurately grasps the news narrative and generates answers that cover 100% of the ground truth content. In contrast, UnifiedQA-T5 without event-related learning is confused by question-related context information in the paragraph.

Visualisation of Event Embeddings
In Figure 4, we visualize the learned event embeddings in the semantic space during training. It can be observed that with the increasing number of training epochs, event triggers are grouped into a few clusters from an evenly distributed initial state. Most irrelevant event nodes are pushed aside as they are used as negative samples in contrastive learning, while question events and answer events are pulled together. The visualization reveals a reasonable process of event knowledge distillation.

Qualitative Analysis
We further perform qualitative analysis in Table 3 over the example illustrated in Figures 1 and 2. All models generate more than one answer. More concretely, TranCLR manages to generate text covering 100% of the ground truth answer, while UnifiedQA-large is unable to generate coherent answers, as the model fails to detect the semantic relations between the question event trigger "charges" and the answer event triggers "link", "testing" and "found".

Generalization Evaluation
To further evaluate the generalization capability of our event knowledge distillation paradigm, we apply the TranCLR model trained on ESTER for zero-shot inference on unseen QA data focusing on event temporal relations, using the TORQUE dataset (Ning et al., 2020b). TORQUE focuses on questions about event temporal relations such as "what happened before/after [some event]?". For each question, the dataset provides a two-sentence supporting passage and passage event annotations. The answers are simply event mentions in the form of words/phrases, rather than longer text spans as in the ESTER dataset. We perform zero-shot inference on TORQUE without fine-tuning the models on its training set. Table 4 reports the results:

Model | F1 | EM
RoBERTa-large | 10.0 | 0.0
RoBERTa-large (Han et al., 2021) | 20.0 | 4.1
RoBERTa-large TranCLR (ESTER) | 28.7 | 15.6

It can be observed from Table 4 that compared with RoBERTa-large not trained on any event QA data (first result row), fine-tuning RoBERTa-large on ESTER (second result row) improves F1 by 10%. Nevertheless, our proposed TranCLR exhibits a significantly better event understanding capability, achieving 8.7% and 11.5% further gains in F1 and EM scores, respectively. It is worth mentioning that the ESTER dataset does not contain any questions about event temporal relations. The results show the strong generalization capabilities of TranCLR and further verify the effectiveness of our proposed framework for event semantic reasoning.

Conclusions
In this paper, we have proposed a novel framework, called TranCLR, to tackle the event-centric QA task on the ESTER dataset (Han et al., 2021). The core idea of TranCLR is to effectively exploit the event knowledge in both questions and context through event-centric contrastive learning and the auxiliary task of event type classification. Our experimental results show superior performance of TranCLR on event-centric QA compared to strong baselines, gaining 8.4% and 3% absolute improvements in F T 1 and EM scores, respectively. Further zero-shot inference and qualitative analysis verify the promising event semantic understanding and reasoning capability of our model.

Limitations
Although we have verified the promising event semantic understanding and reasoning capability of TranCLR trained on ESTER for both in-domain event semantic relations and out-of-domain event temporal relations, it is worth further exploring whether the model indeed captures event semantic relations rather than generating answers by matching spurious patterns. Adversarial attacks could be explored in the future to probe possible backdoors of the model in order to evaluate its robustness (Tan et al., 2021; Pergola et al., 2021a; Bartolo et al., 2021).
Our current work is built on the ESTER dataset, where each question is paired with a single paragraph. In reality, event-centric QA may require gathering evidence scattered over multiple paragraphs and reasoning over more sophisticated event chains or graphs. Such complex event semantic relations are beyond what our proposed event-centric contrastive learning can capture. To develop new methodologies for dealing with more challenging event-centric QA, efforts need to be devoted to building a dataset under a more realistic setting.
Table A2: Additional generated samples from the selected models. In the first case, both models generate overlong answers. In the second case, TranCLR manages to generate one completely correct answer while UnifiedQA-T5-large produces a wrong answer. For the next two cases, TranCLR controls the answer range better in the third one and is able to cover both answers in the fourth one, compared with the UnifiedQA-T5-large baseline. In the last case, only TranCLR generates a related answer.

Figure 2: The overall TranCLR architecture. The input to the encoder is the concatenation of the event relation type, the question, and the paragraph. The resulting hidden vectors are used to provide answers. Simultaneously, the hidden representations are projected through a transformation matrix and used for both contrastive learning and event relation type classification. The contrastive learning mechanism realigns the event vectors to strengthen the relations between the event occurring in the question and candidate answer events in the paragraph, while event relation type classification predicts the event relation type given a transformed question representation.

Figure 3 shows the F T 1 and EM values of TranCLR and the baseline UnifiedQA-large with varying training set sizes. The evaluation code is available at https://github.com/PlusLabNLP/ESTER/tree/master/code. During inference, the event relation type is detected automatically from a given question.

Figure 4: Distributions of events in the semantic space. Best viewed in color. Question events, answer events, other events and the remaining non-event tokens are shown in blue, red, black and gray, respectively.

Table 4: Zero-shot inference results on the TORQUE development set. RoBERTa-large (the first result row) is not trained on any event QA data, while the other models are trained on ESTER only.