Lifelong Knowledge-Enriched Social Event Representation Learning

The ability of humans to symbolically represent social events and situations is crucial for various interactions in everyday life. Several studies in cognitive psychology have established the role of mental state attributions in effectively representing variable aspects of these social events. Past NLP research on learning event representations has often focused on construing syntactic and semantic information from language. However, such approaches fail to consider the importance of pragmatic aspects and the need to consistently incorporate new social situational information without forgetting accumulated experiences. In this work, we propose a representation learning framework that directly addresses these shortcomings by integrating social commonsense knowledge with recent advancements in lifelong language learning. First, we investigate methods to incorporate pragmatic aspects into our social event embeddings by leveraging social commonsense knowledge. Next, we introduce continual learning strategies that allow for incremental consolidation of new knowledge while retaining and promoting efficient usage of prior knowledge. Experimental results on event similarity, reasoning, and paraphrase detection tasks demonstrate the efficacy of our social event embeddings.


Introduction
Everyday life comprises the ways in which people typically act, think, and feel on a daily basis. Our life experiences unfold naturally into temporally extended daily events. Event descriptions can be packaged in various ways depending on several factors, such as the speaker's perspective or the related domain. Interpretation of event descriptions is incomplete without understanding the multiple entities involved in the events, and even more so when the focus is primarily on "social events", i.e., events explaining social situations and interactions.

[Figure 1: Social event embeddings become incrementally richer as knowledge sources (ConceptNet, ATOMIC, SB-SCK) are added. Sample events: S1: Student goes to class; S2: Student goes to wedding; S3: Student takes course; S4: Teacher takes course; S5: Professor teaches subject.]

Therefore, a social event representation model must capture the semantic properties of the event text description and embed salient knowledge that encompasses the implicit pragmatic abilities. Early definitions of pragmatic aspects refer to the use of language in context, comprising the verbal, paralinguistic, and non-verbal elements of language (Adams et al., 2005). Contemporary definitions have expanded beyond purely communicative functions to include the social, emotional, and communicative aspects of language behavior (Adams et al., 2005; Parsons et al., 2017). Moving away from the extensively studied speech acts, we analyze characteristics that reflect how a person behaves in social situations and how social contextual aspects influence linguistic meaning. In the context of event representations, pragmatic properties specifically refer to a human's inferred implicit understanding of event actors' intents, beliefs, and feelings or reactions (Wood, 1976; Hopper and Naremore, 1978).
Understanding the pragmatic implications of social events is non-trivial for machines, as these implications are not explicitly found in the event texts. Prior studies (Ding et al., 2014, 2015; Granroth-Wilding and Clark, 2016; Weber et al., 2018) often extract the syntactic and semantic information from the event descriptions but ignore the pragmatic aspects of language. In this work, we address this shortcoming and aim to (a) disentangle semantic and pragmatic attributes from social event descriptions and (b) encapsulate these attributes into an embedding that can move beyond simple linguistic structures and dispel apparent ambiguities in the real sense of their context and meaning.
Towards this goal, we propose to train our models with social commonsense knowledge about events, focusing specifically on the intents and emotional reactions of people. Such commonsense understanding can be obtained from existing knowledge bases like ConceptNet (Speer et al., 2017) and Event2Mind/ATOMIC (Sap et al., 2019a; Rashkin et al., 2018), or by collecting noisier commonsense knowledge using data mining techniques. As new domain sources emerge, each containing different knowledge assertions, it is essential that the representation models for social events keep evolving with this growing knowledge. Since it is generally infeasible to retrain models from scratch for every new knowledge source, we consider the need to employ prominent continual learning practices (Kirkpatrick et al., 2017; Lopez-Paz and Ranzato, 2017; Asghar et al., 2018; d'Autume et al., 2019) to enable semantic and pragmatic enrichment of social event representations. This problem can be addressed from the perspective of incremental domain adaptation (Asghar et al., 2018; Wulfmeier et al., 2018), which quickly adapts to new domain knowledge without interfering with existing knowledge. Figure 1 presents a sample scenario producing incrementally richer social event embeddings. As the model gains more knowledge from different sources, it learns to discern events based on semantic and pragmatic properties, including social roles. For example, "Student takes course" and "Teacher takes course" have significant lexical and semantic relatedness. However, the social role information changes the meaning, as depicted in Figure 1(d) with the introduction of our in-house dataset (SB-SCK).
In this paper, we develop a lifelong representation learning approach for embedding social events from their free-form textual descriptions. Our model augments a growing set of knowledge obtained from various domain sources to allow for positive knowledge transfer across these domains. Our contributions are as follows: • We propose a continual representation learning approach that integrates both text encoding and lifelong learning techniques to aid better representation of social events.
• We adopt a domain-representative episodic memory replay strategy with text encoding techniques to effectively consolidate the expanding knowledge from several domain sources and generate a semantically & pragmatically enriched social event embedding.
• We evaluate our models primarily on four different tasks: (a) intent-emotion prediction for event texts based on an in-house Lifelong EventRep Corpus, (b) event similarity task using hard similarity dataset (Ding et al., 2019;Weber et al., 2018), (c) paraphrase detection using Twitter URL corpus (Lan et al., 2017), and (d) social commonsense reasoning task using SocialIQA (Sap et al., 2019b) dataset.
Related Work

Social Events Representation Learning
Early work in the domain of events can be traced back to modeling narrative chains. Chambers and Jurafsky (2008, 2009) introduced models for event sequences involving coreference resolution and inferring event schemas. Similar efforts (Balasubramanian et al., 2013; Cheung et al., 2013; Jans et al., 2012) have explored the use of open-domain relations to extract event schemas but suffer from reduced predictive capabilities and increased sparsity. Recent advancements, aimed at addressing the limitations of prior work, compute distributed embeddings of events using word embeddings, recurrent sequence models, and tensor-based composition models (Modi and Titov, 2013; Granroth-Wilding and Clark, 2016; Pichotta and Mooney, 2016; Hu et al., 2017). Specifically, tensor-based methods have demonstrated improved performance by representing events in ways that predict implicit arguments with event knowledge (Cheng and Erk, 2018), combine (subject, predicate, object) triple information (Weber et al., 2018), and reflect thematic fit (Tilk et al., 2016).

Lifelong Learning
Lifelong learning or continual learning approaches can be grouped into regularization-based, data-based, and model-based approaches. Regularization-based approaches (Kirkpatrick et al., 2017; Schwarz et al., 2018; Zenke et al., 2017) minimize significant changes to previously learned representations as parameters are updated for the current task. This is usually implemented as an additional constraint on the objective function based on the sensitivity of parameters. Recent studies under data-based approaches (Kemker and Kanan, 2017; d'Autume et al., 2019; Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019) store previous task data using either a replay memory buffer or a generative model. In the NLP domain, lifelong language learning approaches have investigated the use of memory replay and local adaptation techniques (d'Autume et al., 2019). Finally, model-based approaches allow models to allocate or grow capacity (layers or features) as necessary for the tasks (Rusu et al., 2016; Lee et al., 2016). More recently, Asghar et al. (2018) augmented RNNs with a progressive memory bank, increasing model capacity. The resulting challenges of increased architectural complexity are tackled by hybrid models as in Sodhani et al. (2020). In this paper, we build on ideas from the hybrid models and apply them to our learning task. While previous studies (Ding et al., 2019) have attempted to incorporate commonsense knowledge, this work is one of the first efforts to integrate multi-source knowledge and address it through the lens of incremental domain adaptation.

Problem Formalization
Formally, we assume that our learning framework has access to streams of social commonsense knowledge obtained from n different domains, denoted by D = {D_1, D_2, ..., D_n}. At a particular point in time, we extract knowledge from the current domain D_i and produce an embedding of social events by consolidating the accumulated knowledge across the modeled domains D_{≤i}. Data from each domain source contains source-specific textual descriptions of social situations and their intuitive commonsense information, such as intents and emotions. Training samples drawn from a domain dataset D_i could contain either a significant overlap with, or an entirely new set of knowledge compared to, the previously processed domains D_{1:i−1}. Given such a setup, we aim to generate incrementally richer social event representations using our continual learning framework.
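The streaming setup above can be sketched as a minimal training loop. This is an illustrative skeleton only: `train_on_domain` is a hypothetical placeholder for the actual update step (fine-tuning with replay) described later in the paper.

```python
def lifelong_train(domains, train_on_domain):
    """Process domain datasets D_1..D_n in sequence, consolidating knowledge.

    `domains` is an ordered list of domain datasets; `train_on_domain` stands
    in for the model update (fine-tuning + experience replay) on one domain.
    """
    state = {"seen": []}                     # accumulated knowledge over D_{<=i}
    for D_i in domains:
        state = train_on_domain(state, D_i)  # adapt to the current domain D_i
        state["seen"].append(D_i["name"])    # D_i joins the modeled domains
    return state
```

In the paper's setting the update step would fine-tune the event encoder while replaying samples from episodic memory; here it is deliberately left abstract.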

Datasets
For our representation learning task, we aggregate social commonsense knowledge from various domain sources. This knowledge contains details about pragmatic aspects like intents and emotional reactions. We create a continual learning benchmark based on these commonsense data sources.

Lifelong Social Events Dataset
Different domain sources of social commonsense knowledge used for training our social event representation model are explained as follows.
The ATOMIC dataset consists of inferential knowledge based on 24k short events covering a diverse range of everyday events and motivations. Though ATOMIC provides nine inferential dimensions per event, the scope of this work is limited to intents and emotions as our inferential pragmatic dimensions.
CONCEPTNET knowledge base contains several commonsense assertions.
SB-SCK Since social roles (e.g., student, mother, teacher, worker) provide additional information about the motives and emotions behind actions specified in the events (as shown in Figures 1 and 2), we adopt web-based knowledge mining techniques for capturing this aspect. This dataset was collected as part of our recent work (Vijayaraghavan and Roy, 2021) using the following steps: (a) process texts from Reddit posts containing personal narratives as in (Vijayaraghavan and Roy, 2021); (b) extract propositions from text using OpenIE tools; (c) perform a web search for plausible intents and emotions by attaching purpose clauses (Palmer et al., 2005) and feeling lexical units from FrameNet (Baker et al., 1998); and (d) finally, remove poorly extracted facts using a simple classifier trained on some seed commonsense knowledge. Figure 2 (Left) shows samples from this dataset indicating how the same action could have different social-role-related motivations. We refer to this as Search-based Social Commonsense Knowledge (SB-SCK) data. Figure 2 (Right) presents the data statistics.

From each of the above domain sources, we sample a free-form event text, its paraphrase, intent, emotional reactions, and negative samples of paraphrases, intents, and emotional reactions. Based on the annotated labels for motivation (Maslow's) and emotional reactions (Plutchik) in the STORYCOMMONSENSE data, we run simple K-Means clustering on the open-text intent data. We identify five disjoint clusters in each of the three domains and map them to those categories. For the purposes of our lifelong learning problem, we divide each domain's data into two sets (3 clusters and 2 clusters) and treat them as different subdomains. This results in 6 tasks in our continual learning setup. We refer to this dataset as the Lifelong EventRep Corpus.
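The intent-clustering step can be illustrated with a plain k-means over intent-text embeddings. This is a sketch: the paper does not specify the clustering implementation, and the deterministic initialization and subsequent mapping to Maslow categories are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Cluster intent-text embeddings X (n x d) into k clusters.

    Uses a deterministic farthest-point initialization to keep the sketch
    reproducible; the resulting clusters would then be mapped to Maslow
    motivation categories using STORYCOMMONSENSE annotations.
    """
    centers = [X[0]]
    for _ in range(k - 1):  # pick the point farthest from current centers
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):  # standard Lloyd iterations
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Any off-the-shelf clustering library would serve equally well; the paper only requires five disjoint clusters per domain.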

Paraphrase Datasets
We use random samples of parallel texts from paraphrase datasets like the PARANMT-50M corpus (Wieting and Gimpel, 2017) and the Quora Question Pairs dataset (https://www.kaggle.com/c/quora-question-pairs/data). These paraphrase datasets are primarily used for pretraining our model. We also produce paraphrases of the free-form event texts in our dataset using a back-translation approach (Iyyer et al., 2018), with pretrained English↔German translation models.

Framework
Our goal is to learn distributed representations of social events by incorporating pragmatic aspects of language beyond shallow event semantics. Moving away from conventional supervised multi-task classification-based lifelong learning approaches, we focus on a lifelong representation learning approach that enables us to adapt and sequentially learn a social event embedding model. The motivation for a lifelong learning framework is that the growing knowledge obtained from various domain sources can effectively guide the modeling of complex social events. This involves systematically updating the model by consolidating this expanding knowledge to produce richer embeddings without forgetting previously accumulated knowledge. In this section, we explain the various components of our modeling framework.

Social Event Representation
Given an input event text description, the core idea is to first encode the free-form event text and decompose the ensuing representation into pragmatic (implied emotions and intents) and non-pragmatic (syntactic and semantic information) components. Eventually, we combine these decomposed representations to obtain an overall event representation and apply it in different downstream tasks.

Encoder
The input to our model is a free-form event text description from the i-th domain, x^(i)_j ∈ D_i. This free-form event text contains a sequence of tokens, x = (w_1, ..., w_L). First, we construct a context-dependent token embedding using a context embedding function G : R^{L×d_X} → R^{L×d_H}, where d_X and d_H refer to the embedding and hidden layer dimensions, respectively. Following this encoding step, we apply a pooling or projection function, G_pp : R^{L×d_H} → R^{3×d_H}, that transforms the event text from the context-dependent embedding space into the pragmatic and semantic spaces. More specifically, we produce latent vectors for intents (h_I), reactions (h_R), and non-pragmatic (h_N) information. Finally, we combine the latent vectors h_N, h_I, h_R using a simple feed-forward layer, G_C : R^{3×d_H} → R^{d_H}, to produce a powerful social event representation, h_C, capable of dispelling apparent ambiguities in the true sense of their meaning. Given positive and negative examples of intents, emotional reactions, and paraphrases associated with the input event text, we learn to effectively sharpen each of the embeddings h_I, h_R, and h_C using metric learning methods.
For the sake of brevity, we drop the domain index i and the sample index j in this section. These encoding steps are summarized as:

H = G(x); (h_I, h_R, h_N) = G_pp(H); h_C = G_C([h_N; h_I; h_R]).

We denote this multi-step encoding process resulting in h_I, h_R, h_C as a function G_event. We experiment with the following text embedding techniques as our context embedding function (G):

BiGRU: Using bidirectional GRUs (Chung et al., 2014), we compute the context embedding of the input event text by concatenating the forward (→h_t) and backward (←h_t) hidden states.

BERT: We employ BERT (Devlin et al., 2018), a multi-layer bidirectional Transformer-based encoder, as our context embedding method G.
We fine-tune a BERT model that takes the attribute-augmented event text x = [CLS] m [SEP] w_1, ..., w_L [SEP] as input and outputs a powerful context-dependent event representation H_e. The attribute m ∈ {xIntent, xReact, xNprag} refers to special tokens for intents, reactions, and non-pragmatic aspects.
In our default case, our G_pp function is the output embedding of the [CLS] token associated with the respective attribute-augmented input. In cases where the input event text is not augmented with attribute special tokens, we apply pooling strategies such as attentive pooling (AP) or the mean (MEAN) of all context vectors obtained from the previous encoding step G. We obtain h_I, h_R, h_N based on these techniques. Depending on the type of context embedding function, we refer to our multi-step event text encoder, G_event, as EVENTGRU or EVENTBERT.
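A minimal numpy sketch of the decomposition step, assuming attentive pooling for G_pp and a tanh feed-forward layer for G_C (the actual layers and parameter shapes in the paper may differ; all weights here are illustrative):

```python
import numpy as np

def attentive_pool(H, w):
    """Attentive pooling: softmax-weight the L context vectors H (L x d) by scores H @ w."""
    s = H @ w
    a = np.exp(s - s.max())
    a /= a.sum()
    return a @ H                                   # pooled vector of size d

def event_representation(H, w_I, w_R, w_N, W_C, b_C):
    """G_pp + G_C: decompose into pragmatic/non-pragmatic parts, then recombine."""
    h_I = attentive_pool(H, w_I)                   # intent component
    h_R = attentive_pool(H, w_R)                   # emotional-reaction component
    h_N = attentive_pool(H, w_N)                   # non-pragmatic (semantic) component
    h_C = np.tanh(np.concatenate([h_N, h_I, h_R]) @ W_C + b_C)  # joint embedding
    return h_I, h_R, h_N, h_C
```

Each pooling head has its own score vector, so the three components can attend to different tokens of the same event text.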

Objective Loss
Using a positive example {u^p_I, u^p_R, u^p_C} and N−1 negative examples {u^n_I, u^n_R, u^n_C} of intents, emotions, and paraphrases associated with the event texts, we calculate an N-pair loss, L_v(h, z^p, {z^n_k}_{k=1}^{N−1}), to maximize the similarity between the representations of the positive examples (z^p_v) and the computed embeddings (h_v). Here, z^e_v is computed using a transformation function f_v as:

z^e_v = f_v(u^e_v), where v ∈ {I, R, C} and e ∈ {p, n}.

Thus, our loss function is devised as:

L = β_D (L_I + L_R) + β_E L_C,

where L_I, L_R are used to learn disentangled pragmatic embeddings (intent and emotion), and L_C is intended to jointly embed semantic and pragmatic aspects to produce an overall social event representation. β_D, β_E are loss coefficients that weigh the importance of the disentanglement loss and the overall joint embedding loss. These coefficients are non-negative and sum to 1.
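A common formulation of the N-pair loss (the exact variant used in the paper is not spelled out) and the weighted combination can be sketched as follows; the combination assumes β_D weights both disentanglement terms, consistent with the coefficients summing to 1:

```python
import numpy as np

def n_pair_loss(h, z_pos, z_negs):
    """N-pair loss: log(1 + sum_k exp(h . z_n_k - h . z_p)).

    Pulls the computed embedding h toward the positive target z_pos while
    pushing it away from the N-1 negative targets z_negs ((N-1) x d).
    """
    margins = z_negs @ h - z_pos @ h     # similarity gap for each negative
    return float(np.log1p(np.exp(margins).sum()))

def total_loss(L_I, L_R, L_C, beta_D, beta_E):
    """Weighted combination of disentanglement and joint-embedding losses."""
    assert abs(beta_D + beta_E - 1.0) < 1e-9  # coefficients sum to 1
    return beta_D * (L_I + L_R) + beta_E * L_C
```

An embedding aligned with its positive and far from its negatives yields a loss near zero, which is the behavior metric learning exploits here.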

Continual Learning
Given a never-ending list of social events, once-and-for-all training on a fixed dataset limits the utility of such models in real-world applications. Therefore, we draw ideas from the lifelong learning literature to adapt our models to new data while retaining prior knowledge. First, we implement a data-based approach, a variant of episodic memory replay (EMR) (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019; d'Autume et al., 2019), to mitigate catastrophic forgetting while allowing for beneficial backward knowledge transfer. Next, we combine it with simple vocabulary expansion for incremental domain adaptation.

Domain-Representative Episodic Memory Replay (DR-EMR)
We augment our model explained in Section 5.1 with an episodic memory module to perform sparse experience replay.

Read Operation
The read operation retrieves domain-specific random samples from our episodic memory for experience replay. These samples contain the original event training data and their representations. We choose samples from a previously trained domain at every read step assuming an overall uniform distribution over domains.
Write Operation

In this work, we incorporate the following desirable characteristics into our episodic memory module M: (a) M stores domain-representative samples that best approximate the domain's data distribution, and (b) M's capacity is bounded or at most expands sub-linearly. To achieve this, we adopt the following strategies:

Domain-Representative Sample Selection: Past studies have explored a random writing strategy (d'Autume et al., 2019) and distribution approximation via K-Means clustering of the samples (Rebuffi et al., 2017). In our work, we perform sample selection using the CURE (Clustering Using REpresentatives) algorithm (Guha et al., 2001) to find C representative points. CURE employs hierarchical clustering that remains computationally feasible by adopting random sampling and two-pass clustering. Using event representations obtained after the embedding alignment transformation, we identify the domain-representative samples by computing Euclidean nearest neighbors to the C representative points. We set the value of C based on the memory budget. These domain-representative samples are stored in our episodic memory with their corresponding representations.
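A simplified stand-in for the selection step: CURE itself uses sampled, two-pass hierarchical clustering, so the greedy farthest-point heuristic below merely plays the role of finding C representative points before the nearest-neighbor lookup.

```python
import numpy as np

def representative_points(X, C):
    """Greedy farthest-point sketch standing in for CURE's C representative points."""
    reps = [X[0]]
    for _ in range(C - 1):
        d = np.min([np.linalg.norm(X - r, axis=1) for r in reps], axis=0)
        reps.append(X[d.argmax()])              # most distant remaining sample
    return np.array(reps)

def memory_write_indices(X, C):
    """Indices of the samples nearest (Euclidean) to each representative point."""
    reps = representative_points(X, C)
    d = np.linalg.norm(X[:, None, :] - reps[None, :, :], axis=-1)  # (n, C)
    return d.argmin(axis=0)
```

The selected indices, together with their representations, are what would be written into the episodic memory under the memory budget C.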
Replacement Policy: When the memory becomes full, we follow a simple memory replacement policy that selects an existing memory entry to delete. Specifically, we replace the q-th memory entry, where q = argmax_q softmax(φ(x^(t)_j) · M_q), similar to the idea proposed in Gulcehre et al. (2018).
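The replacement rule can be written down directly. Since softmax is monotonic, the argmax over raw dot products selects the same slot; the softmax is kept here only to mirror the formula.

```python
import numpy as np

def slot_to_replace(phi_x, M):
    """q = argmax_q softmax(phi(x) . M_q): replace the slot most similar to the new sample."""
    scores = M @ phi_x                    # dot product with every memory entry
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                  # softmax over memory slots
    return int(probs.argmax())
```

Evicting the entry most similar to the incoming sample keeps the memory contents diverse under a fixed budget.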
Inspired by Wang et al. (2019), we propose a variant of the alignment model that helps overcome catastrophic forgetting by ensuring minimal distortion to previously computed representational spaces. To accomplish this, we define simple linear transformations G_IA, G_RA, G_CA on top of the outputs of the multi-step encoder function G_event. For simplicity, we drop the subscripts (I, R, C) and denote these transformation functions as G_A. Given a new domain i, we initialize our multi-step event encoder G^i_event and alignment function G^i_A with the last trained parameters G^{i−1}_event and G^{i−1}_A, respectively. The optimization proceeds in two steps: (a) for samples from the current-domain and replay batches (B_dom, B_rep), we optimize the encoder output representations to be closer to their respective ground-truth examples in this linearly transformed space, modifying the N-pair loss to accommodate the alignment function as L_v(G_A(h), z^p, {z^n_k}); and (b) for samples from B_rep, we add an extra alignment constraint (L + L_EA) for each embedding component to align the old and new domain embedding spaces.

Training

Since our model involves metric learning, hard negative mining is an essential step for faster convergence and improved discriminative capabilities. However, selecting overly hard examples too often makes training unstable. Therefore, we choose a hybrid negative mining technique, where we select a few semi-hard negative examples (Hermans et al., 2017) and combine them with random negative samples to effectively train our model. In our work, we define a heuristic objective by weighing samples based on two factors: (i) word overlap or similarity in the embedding space of the event text, and (ii) intent and emotion free-form text or categories based on the STORYCOMMONSENSE data.
More specifically, given an event text as anchor and a positive intent text based on a ground-truth motivation category, we mine negative instances for intent as follows: (a) choose random text samples associated with a motivation category different from that of the positive example but close in embedding space or word overlap; (b) choose random text samples within the same motivation category but with a different emotion category. We repeat this process for drawing negative instances related to emotions. For paraphrases, we consider a few examples with significant word overlap, while the rest are randomly chosen. The N-pair loss helps alleviate the sensitivity of the triplet loss to the choice of hard triplets. Finally, we pre-train our model with paraphrase data and fine-tune it using the examples obtained from hard negative mining for intents, emotions, and paraphrases. For training, the learning rate is set to 0.0001 and the number of training epochs to 20. By default, we use EVENTBERT as our multi-step encoder. We conduct a study assigning different values to the loss coefficients β_D, β_E and explain the results in Section 7.1.2.

[Figure 4: Average accuracy scores (%) over 6 permuted runs for domains that have been observed at each point during the continual learning process.]
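The hybrid mining step can be sketched in embedding space; the band that defines "semi-hard" and the sample counts below are illustrative choices, not values from the paper.

```python
import numpy as np

def hybrid_negatives(anchor, positive, candidates, n_semi=2, n_rand=2,
                     margin=1.0, seed=0):
    """Pick a few semi-hard negatives plus random ones for stable metric learning.

    Semi-hard: farther from the anchor than the positive, but within `margin`
    of that distance (harder negatives destabilize training; easier ones are
    uninformative).
    """
    rng = np.random.default_rng(seed)
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(candidates - anchor, axis=1)
    semi = np.where((d_an > d_ap) & (d_an < d_ap + margin))[0]
    chosen = list(semi[:n_semi])                      # semi-hard portion
    rest = [i for i in range(len(candidates)) if i not in chosen]
    n_rand = min(n_rand, len(rest))
    chosen += list(rng.choice(rest, size=n_rand, replace=False))
    return chosen
```

Mixing the two pools is what keeps the N-pair objective informative without the instability of training on the hardest negatives alone.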

Experiments
We conduct several experiments to study the power of our learned embeddings. Our experiments are designed to answer the following questions: RQ1: How well does our model perform in comparison to other continual learning approaches for intent-emotion prediction?
RQ2: To what extent do our modeling choices impact the results in predicting intents & emotions?
RQ3: Does our model outperform existing stateof-the-art methods in hard similarity task that evaluates the effectiveness of the learned embeddings?
RQ4: Do the learned embeddings demonstrate transfer capability to downstream tasks -Paraphrase detection & Social IQA reasoning?

Intent-Emotion Prediction (RQ1)
The continual learning methods evaluated using our Lifelong EventRep Corpus are given below.
• Base: We simply fine-tune our model on successive tasks from previously trained checkpoint.
• A-GEM: This method (Chaudhry et al., 2019) uses a constraint that enables the projected gradient to decrease the loss on older tasks. We randomly choose 2-3% of samples from all previous tasks to form the constraint.
• EMR: We use randomly stored examples for sparse experience replay.
• A-EMR: A variant of our model that uses random samples for experience replay with alignment constraints.
• DR-EMR: This is our complete model involving domain representative experience replay and alignment constraints.

Empirical Results
After running six permuted sequences of tasks, we calculate the mean performance on the test sets of all observed task domains after time step k, given by AvgAcc = (1/k) Σ_{i=1}^{k} Acc^{(i)}, where Acc^{(i)} is the model accuracy on the test set from domain D_i. Further, we also compute the standard deviation to determine the importance of the order of the tasks during training. Table 1a contains the comparison of last-step AvgAcc scores and standard deviations for predicting intents and emotions. Figure 4 plots the AvgAcc at every step, where D_{1:i} indicates that the model has seen data from i domains, to evaluate our continual learning process.
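The metric can be computed directly from the lower-triangular accuracy matrix of the kind shown in Figure 3 (Right):

```python
import numpy as np

def avg_acc(acc_matrix, k):
    """AvgAcc after step k: mean accuracy over the first k observed domains.

    acc_matrix[t, i] holds accuracy on domain D_{i+1}'s test set after the
    model has trained on domains D_1..D_{t+1} (only the lower triangle
    t >= i is defined).
    """
    return float(np.mean(acc_matrix[k - 1, :k]))
```

Positive backward transfer shows up in this matrix as entries below the diagonal increasing down a column.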
In the absence of any lifelong learning strategy, performance drops significantly for the Base model, and the high standard deviation emphasizes the importance of task order. Compared to the Base model, our results show some improvement in intent and emotion prediction as we introduce methods like A-GEM and EWC. However, we observe a significant performance gain with EMR-based techniques. Our complete model (DR-EMR) outperforms all the other methods, thereby demonstrating the importance of domain-representative sampling and alignment constraints for learning representations that enable effective prediction of intents and emotions.

[Table 1b (partial): hard similarity accuracy — KGEB (Ding et al., 2016): 50.09; NTN + Int (Ding et al., 2019): 58.83; NTN + Int + Senti (Ding et al., 2019): 64.31; EVENTBERT: —]

Moreover, we assess the domain sequences that cause a performance drop. For example, whenever SB-SCK is trained last, the model shows reduced accuracy. This can be ascribed to the interference of noisy knowledge with the previously trained cleaner domains. Training this noisy source earlier leads to positive knowledge transfer and hence improved performance. Despite these odds, our DR-EMR model records the lowest standard deviation, implying reduced sensitivity to training order. Figure 3 (Right) shows the dynamics of our continual learning model. It contains accuracy scores of a single run and displays the lower-triangular values. We see that our approach allows positive backward transfer, i.e., model performance on previous domains gradually increases with new domain knowledge. Our model achieves the best performance (see the bold-faced accuracy scores in Figure 3 (Right)) in most domains after observing all the data (D_{1:6}).

Ablation Study (RQ2)
We analyze different model configurations related to: (a) encoding: EVENTGRU, EVENTBERT; (b) pooling: attribute-augmented input (CLS), mean pooling (MP), and attentive pooling (AP); and (c) sampling: K-Means, CURE. Results of the study are given in Figure 5 (Left). We evaluated different combinations of these strategies but only report the average accuracy scores of configurations having the best strategy in each category combined with the variants in the following category, i.e., for pooling strategies, we only report scores with the EVENTBERT encoding strategy, and so on. From the results, we ascertain that the best configuration comprises an EVENTBERT encoder supported by attentive pooling and CURE-based sample selection. Additionally, we measure the effect of β_E on the prediction of intents. As shown in Figure 5 (Right), the model performs significantly better for lower values of β_E, as more weight is assigned to the disentanglement of pragmatic aspects. However, we observe that a balanced loss function with β_E = 0.5 allows for consistently good performance on both intent-emotion prediction (Figure 5 (Right)) and hard similarity tasks (see Section 7.2). Unlike other hyperparameters, β_E directly determines the relative importance of semantic and pragmatic information in the ensuing event embedding.

Hard Similarity Task (RQ3)
Following Ding et al. (2019), we evaluate our social event representation on an extended dataset of event pairs containing: (a) similar event pairs with minimal lexical overlap (e.g., people admired president / citizens loved leader) and (b) dissimilar event pairs with high lexical overlap (e.g., people admired president / people admired nature). Good performance on this task ensures that similar events are pulled closer to each other than dissimilar events. Combining the hard similarity datasets from (Ding et al., 2019) and (Weber et al., 2018), the total size of this balanced dataset is 2,230 event pairs. Using our joint embedding h_C for an event text and a triplet loss setup, we compute a similarity score between similar and dissimilar pairs. The baselines include the knowledge-graph-based embedding model (KGEB) (Ding et al., 2016), the Neural Tensor Network (NTN), and its variants augmented with ATOMIC-based embeddings (Int, Senti) (Ding et al., 2019). We report the model's accuracy in assigning a higher similarity score to similar pairs than dissimilar pairs. Table 1b shows that our model outperforms the state-of-the-art methods for this task.

Paraphrase Detection (RQ4)
Given a sentence pair, the objective is to detect whether they are paraphrases. For each sentence pair (s_1, s_2), we pass both sentences through our model and obtain their respective h_C vectors (u, v). We concatenate these vectors with their element-wise difference |u − v| and feed the result to a feed-forward layer, optimizing a binary cross-entropy loss. For evaluation, we compare our model against baselines like BERT and ESIM (Chen et al., 2016). Trained on a subset of the data described in Section 4.2, we choose an out-of-domain test set whose samples stem from a dissimilar input distribution. To this end, we select the Twitter URL paraphrasing corpus (Lan et al., 2017), referred to as TwitterPPDB. Table 2b contains the results of our evaluation, which testify to the efficacy of our embeddings.
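The classifier head can be sketched as follows; the paper specifies a feed-forward layer over [u; v; |u − v|] with binary cross-entropy, while the weight shapes here are illustrative.

```python
import numpy as np

def pair_features(u, v):
    """Concatenate the two event embeddings with their element-wise difference."""
    return np.concatenate([u, v, np.abs(u - v)])

def paraphrase_prob(u, v, W, b):
    """Single feed-forward layer + sigmoid producing a paraphrase probability."""
    logit = pair_features(u, v) @ W + b
    return 1.0 / (1.0 + np.exp(-logit))
```

The |u − v| term gives the classifier a direct signal for semantic distance: it is exactly zero when the two embeddings coincide.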

Social IQA Reasoning (RQ4)
We determine the quality of our latent social event representations by evaluating on a social commonsense reasoning benchmark, the SocialIQA dataset (Sap et al., 2019b). Given a context, a question, and three candidate answers, the goal is to select the right answer among the candidates. Following Sap et al. (2019b), the context, question, and candidate answer are concatenated using separator tokens and passed to the BERT model. Additionally, we feed the context to our EVENTBERT model to obtain the three embeddings h_I, h_R, h_C. While the original work computed a score l using the hidden state of the [CLS] token, we introduce a minor modification to this step: l = W_5 tanh(W_1 h_CLS + W_2 h_I + W_3 h_R + W_4 h_C), where W_{1:4} ∈ R^{d_H×d_H} and W_5 ∈ R^{1×d_H} are learnable parameters. Similar to (Sap et al., 2019b), the candidate with the highest normalized score is used as the model's prediction. We fine-tune BERT models using our new scoring function with social event embeddings (denoted "w/") and compare against baselines (like GPT (Radford et al., 2018)) without our event embeddings (denoted "w/o"). Results in Table 2a indicate that a simple enhancement at the penultimate step can offer significant performance gains. Our findings also suggest that our enhanced model performed well on question types like 'wants' and 'effects' that were not explicitly modeled in our embedding model. This confirms that our pragmatics-enriched embeddings lead to improved reasoning capabilities.
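The modified scoring step above can be written down directly; the weight matrices are learnable parameters in the paper, so the values used here are placeholders.

```python
import numpy as np

def answer_score(h_cls, h_I, h_R, h_C, W1, W2, W3, W4, W5):
    """l = W5 tanh(W1 h_CLS + W2 h_I + W3 h_R + W4 h_C), scored per candidate."""
    z = np.tanh(W1 @ h_cls + W2 @ h_I + W3 @ h_R + W4 @ h_C)
    return float(W5 @ z)                 # scalar score for one candidate answer

def predict(scores):
    """Softmax-normalize the candidate scores and return the argmax index."""
    p = np.exp(scores - np.max(scores))
    p /= p.sum()
    return int(np.argmax(p))
```

Scoring each (context, question, answer) triple independently and taking the normalized argmax mirrors the evaluation protocol of Sap et al. (2019b).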

Conclusion
Humans rely upon commonsense knowledge about social contexts to ascribe meaning to everyday events. In this paper, we introduce a lifelong learning approach to effectively embed social events with the help of a growing set of social commonsense knowledge assertions acquired from different domains. First, we leverage social commonsense knowledge to sharpen social event embeddings with semantic and pragmatic attributes. Next, we employ domain-representative episodic memory replay (DR-EMR) to overcome catastrophic forgetting and enable positive knowledge transfer with the emergence of new domain knowledge. By evaluating on a corpus of social events aggregated from multiple sources, we establish that our model is able to outperform several baselines. Experimental results on downstream tasks like event similarity, reasoning, and paraphrase detection demonstrate the capabilities of our social event embeddings. We hope that our work will motivate further exploration into lifelong representation learning of social events and advance the research in inferring pragmatic dimensions from texts.