Benchmarking Commonsense Knowledge Base Population with an Effective Evaluation Dataset

Reasoning over commonsense knowledge bases (CSKBs) whose elements are in the form of free text is an important yet hard task in NLP. While CSKB completion only fills missing links within the domain of the CSKB, CSKB population is alternatively proposed with the goal of reasoning over unseen assertions from external resources. In this task, CSKBs are grounded to a large-scale eventuality (activity, state, and event) graph to discriminate whether novel triples from the eventuality graph are plausible or not. However, existing evaluations of the population task are either inaccurate (automatic evaluation with randomly sampled negative examples) or of small scale (human annotation). In this paper, we benchmark the CSKB population task with a new large-scale dataset by first aligning four popular CSKBs, and then presenting a high-quality human-annotated evaluation set to probe neural models' commonsense reasoning ability. We also propose a novel inductive commonsense reasoning model that reasons over graphs. Experimental results show that generalizing commonsense reasoning to unseen assertions is inherently hard. Models achieving high accuracy during training perform poorly on the evaluation set, with a large gap from human performance. Code and data are publicly available at https://github.com/HKUST-KnowComp/CSKB-Population.


Introduction
Commonsense reasoning is one of the core problems in the field of artificial intelligence. Throughout the development of computational commonsense, commonsense knowledge bases (CSKBs) (Speer et al., 2017; Sap et al., 2019) have been constructed to enhance models' reasoning ability. As human-annotated CSKBs are far from complete due to the scale of crowd-sourcing, reasoning tasks such as CSKB completion (Li et al., 2016; Malaviya et al., 2020; Moghimifar et al., 2021) and population (Fang et al., 2021) are proposed to enrich the missing facts. The CSKB completion task is defined based on the setting of predicting missing links within the CSKB. The population task, on the other hand, grounds commonsense knowledge in CSKBs to large-scale automatically extracted candidates, and requires models to determine whether a candidate triple, (head, relation, tail), is plausible or not, based on information from both the CSKB and the large number of candidates, which essentially form a large-scale graph structure. An illustration of the difference between completion and population is shown in Figure 1.

Figure 1: Comparison between CSKB completion and population. An example of aligning an eventuality graph as candidate commonsense knowledge triples is also provided.
There are two advantages of the population task. First, population can add not only links but also nodes to an existing CSKB, while completion can only add links. The populated CSKB can also help reduce the selection bias problem (Heckman, 1979) from which most machine learning models suffer, and will benefit many downstream applications such as commonsense generation (Bosselut et al., 2019). Second, commonsense knowledge is usually implicit knowledge that requires multi-hop reasoning, while current CSKBs lack such complex graph structures. For example, for ATOMIC (Sap et al., 2019), a human-annotated if-then commonsense knowledge base among daily events and (mental) states, the average number of hops between matched heads and tails in ASER, an automatically extracted knowledge base among activities, states, and events based on discourse relationships, is 2.4 (Zhang et al., 2021). Evidence in Section 4.5 (Table 3) shows similar results for other CSKBs. However, reasoning solely on existing CSKBs can be viewed as a simple triple classification task without complex graph structure (as shown in Table 3, the graphs in CSKBs are much sparser). The population task, which provides a richer graph structure, can explicitly leverage the large-scale corpus to perform commonsense reasoning over multiple hops on the graph.
However, there are two major limitations in the evaluation of the CSKB population task. First, automatic evaluation metrics, which are based on distinguishing ground-truth annotations from automatically sampled negative examples (either a random head or a random tail), are not accurate enough. Instead of directly treating the random samples as negative, solid human annotations are needed to provide hard labels for commonsense triples. Second, the human evaluation in the original paper on CSKB population (Fang et al., 2021) cannot be generally used for benchmarking. They first populated the CSKB and then asked human annotators to annotate a small subset to check whether the populated results are accurate. A better benchmark should be based on random samples from all candidates, and its scale should be large enough to cover diverse events and states.
To effectively and accurately evaluate CSKB population, in this paper we benchmark the task by first proposing a comprehensive dataset aligning four popular CSKBs with a large-scale automatically extracted knowledge graph, and then providing a large-scale human-annotated evaluation set. Four event-centered CSKBs that cover daily events, namely ConceptNet (Speer et al., 2017) (with the event-related relations selected), ATOMIC (Sap et al., 2019), ATOMIC 20 20 (Hwang et al., 2020), and GLUCOSE (Mostafazadeh et al., 2020), are used to constitute the commonsense relations. We align the CSKBs into the same format and ground them to a large-scale eventuality (including activity, state, and event) knowledge graph, ASER (Zhang et al., 2020, 2021). Then, instead of annotating every possible node pair in the graph, which would take an infeasible O(|V|^2) amount of annotation, we sample a large subset of candidate edges grounded in ASER to annotate. In total, 31.7K high-quality triples are annotated as the development and test sets.
To evaluate the commonsense reasoning ability of machine learning models on our benchmark data, we first propose models that learn to perform CSKB population inductively over the knowledge graph. We then conduct extensive evaluations and analysis of the results to demonstrate that CSKB population is a hard task, on which models perform poorly and fall far below human performance on our evaluation set.
We summarize the contributions of the paper as follows: (1) We provide a novel benchmark for CSKB population over new assertions that covers four human-annotated CSKBs, with a large-scale human-annotated evaluation set. (2) We propose a novel inductive commonsense reasoning model that incorporates both semantics and graph structure.
(3) We conduct extensive experiments and evaluations on how different models, commonsense resources for training, and graph structures may influence the commonsense reasoning results.

Commonsense Knowledge Bases
Since the proposal of Cyc (Lenat, 1995) and ConceptNet (Liu and Singh, 2004; Speer et al., 2017), a growing number of large-scale human-annotated CSKBs have been developed (Sap et al., 2019; Bisk et al., 2020; Sakaguchi et al., 2020; Mostafazadeh et al., 2020; Forbes et al., 2020; Lourie et al., 2020; Hwang et al., 2020; Ilievski et al., 2020). While ConceptNet mainly depicts commonsense relations between entities and only a small portion of events, recent important CSKBs have been more devoted to event-centric commonsense knowledge. For example, ATOMIC (Sap et al., 2019) defines 9 social interaction relations, with ~880K annotated triples. ATOMIC 20 20 (Hwang et al., 2020) further unifies the relations with ConceptNet, together with several new relations, to form a larger CSKB containing 16 event-related relations. Another CSKB is GLUCOSE (Mostafazadeh et al., 2020), which extracts sentences from ROC Stories and defines 10 commonsense dimensions to explore the causes and effects of a given base event. In this paper, we select ConceptNet, ATOMIC, ATOMIC 20 20, and GLUCOSE and align them together, because they are all event-centric and relatively more normalized compared to other CSKBs such as SocialChemistry101 (Forbes et al., 2020).

Knowledge Base Completion and Population
Knowledge base (KB) completion is well studied using knowledge base embeddings learned from triples (Bordes et al., 2013; Yang et al., 2015; Sun et al., 2019) and graph neural networks with a scoring-function decoder (Shang et al., 2019). Pre-trained language models are also applied to the completion task (Yao et al., 2019; Wang et al., 2020b), where information from knowledge triples is translated into the input to BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019). Knowledge base population (Ji and Grishman, 2011) typically includes entity linking (Shen et al., 2014) and slot filling (Surdeanu and Ji, 2014) for conventional KBs, and many relation extraction approaches have been proposed (Roth and Yih, 2002; Chan and Roth, 2010; Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Lin et al., 2016; Zeng et al., 2017). Universal schema and matrix factorization can also be used to learn latent features of databases and perform population (Riedel et al., 2013; Verga et al., 2016; Toutanova et al., 2015; McCallum et al., 2017). Besides completion tasks on conventional entity-centric KBs like Freebase (Bollacker et al., 2008), completion tasks on CSKBs are also studied on ConceptNet and ATOMIC. Bi-linear models are used to conduct triple classification on ConceptNet (Li et al., 2016; Saito et al., 2018), and knowledge base embedding models are combined with BERT-based graph densifiers (Malaviya et al., 2020; Wang et al., 2020a). Commonsense generation can also be viewed as a CSKB population problem; however, it requires known heads and relations to acquire more tails, so it does not fit our evaluation. Recently, various prompts have been proposed to change the predicate lexicalization (Jiang et al., 2020; Shin et al., 2020; Zhong et al., 2021), but how to obtain more legitimate heads for probing remains unclear. Our work can benefit these approaches by providing more training examples, mining more commonsense prompts, and supplying more potential heads for generation.

Task Definition
Denote the source CSKB about events as C = {(h, r, t) | h ∈ H, r ∈ R, t ∈ T}, where H, R, and T are the sets of commonsense heads, relations, and tails. Suppose we have another, much larger eventuality (including activity, state, and event) knowledge graph extracted from text, denoted as G = (V, E), where V is the set of all vertices and E is the set of edges. G_c is the graph acquired by aligning C and G into the same format. The goal of CSKB population is to learn a scoring function for a candidate triple (h, r, t), where plausible commonsense triples should be scored higher. The training of CSKB population can inherit the setting of triple classification, where ground-truth examples come from the CSKB C and negative triples are randomly sampled. In the evaluation phase, the model is required to score the triples from G that are not included in C, and its scores are compared with human-annotated labels.
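The interface of the task can be sketched as follows. The scorer here is a hypothetical stand-in for the neural models introduced later, and the fixed threshold is illustrative only, since the actual evaluation is rank-based (AUC) against human labels:

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]          # (head, relation, tail)
Scorer = Callable[[Triple], float]     # higher score = more plausible

def populate(candidates: List[Triple], scorer: Scorer,
             threshold: float = 0.5) -> List[Triple]:
    """Keep candidate triples from the eventuality graph that the
    scorer deems plausible. The threshold is illustrative; the paper
    evaluates by ranking (AUC) rather than a fixed cut-off."""
    return [trip for trip in candidates if scorer(trip) >= threshold]
```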

Selection of CSKBs
As we aim at exploring commonsense relations among general events, we summarize several criteria for selecting CSKBs. First, the CSKB should be well symbolically structured to be generalizable. While the nodes in a CSKB can inevitably be free text in order to represent more diverse semantics, we select knowledge resources where format normalization has been conducted. Second, the commonsense relations should be encoded as (head, relation, tail) triples. To this end, among all CSKB resources, we choose the event-related relations in ConceptNet, ATOMIC, ATOMIC 20 20, and GLUCOSE as the final commonsense resources. For the event-related relations in ConceptNet, the elements are mostly lemmatized predicate-object pairs. In ATOMIC and ATOMIC 20 20, the subjects of eventualities are normalized to the placeholders "PersonX" and "PersonY". The nodes in GLUCOSE are also normalized and syntactically parsed manually, where human-related pronouns are written as "SomeoneA" or "SomeoneB", and object-related pronouns are written as "SomethingA". Other commonsense resources like SocialChemistry101 (Forbes et al., 2020) are not selected, as they include overly loosely-structured events.

Table 2: Overlaps between eventuality graphs and commonsense knowledge graphs. We report the proportion of (h, r, t) triples where both the head and tail can be found in the eventuality graph.
For ConceptNet, we select the event-related relations Causes and HasSubEvent, and filter out triples whose nodes are noun phrases. For ATOMIC, we restrict the events to simple and explicit events that contain neither wildcards nor clauses. As ATOMIC 20 20 itself includes the triples in ATOMIC and ConceptNet, to distinguish different relations we use ATOMIC 20 20 to refer to the new event-related relations annotated in Hwang et al. (2020), which are xReason, HinderedBy, isBefore, and isAfter. In the rest of the paper, ATOMIC( 20 20 ) means the combination of ATOMIC and the new relations in ATOMIC 20 20.

Alignment of CSKBs
To effectively align the four CSKBs, we propose best-effort rules to align the formats of both nodes and edges. First, for the nodes in each CSKB, we normalize the person-centric subjects and objects to "PersonX", "PersonY", "PersonZ", etc., according to the order of their occurrence, and the object-centric subjects and objects to "SomethingA" and "SomethingB". Second, to reduce the semantic overlaps of different relations, we aggregate all commonsense relations to the relations defined in ATOMIC( 20 20 ), as it is comprehensive enough to cover the relations in other resources like GLUCOSE, with some simple alignments listed in Table 1.

ConceptNet.
We select Causes and HasSubEvent from ConceptNet to constitute the event-related relations. As heads and tails in ConceptNet do not contain subjects, we add "PersonX" in front of the original heads and tails to make them complete eventualities.

ATOMIC( 20 20 ).
In ATOMIC and ATOMIC 20 20, heads are structured events with "PersonX" as subjects, while tails are human-written free text where subjects tend to be missing. We add "PersonX" to tails without subjects under agent-driven relations, i.e., relations that investigate causes or effects on "PersonX" himself, and add "PersonY" to tails missing subjects under theme-driven relations, i.e., relations that investigate commonsense causes or effects on other people such as "PersonY".

GLUCOSE.
For GLUCOSE, we leverage the parsed and structured version in this study. We replace the personal pronouns "SomeoneA" and "SomeoneB" with "PersonX" and "PersonY", respectively. Other object-centric placeholders like "Something" are kept unchanged. The relations in GLUCOSE are then converted to ATOMIC relations according to the conversion rule in the original paper (Mostafazadeh et al., 2020). Moreover, gWant, gReact, and gEffect are new relations for the triples in GLUCOSE whose subjects are object-centric. The prefix "g" stands for general, to be distinguished from "x" (for PersonX) and "o" (for PersonY).
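The person normalization used across the sections above can be sketched as follows, assuming whitespace-tokenized eventualities and ignoring coreference; the actual alignment rules in the paper are more elaborate best-effort heuristics:

```python
PLACEHOLDERS = ["PersonX", "PersonY", "PersonZ"]
# Person-centric subject words considered by the paper's rules.
PERSON_WORDS = {"i", "you", "he", "she", "someone", "somebody",
                "guy", "man", "woman"}

def normalize_people(text: str) -> str:
    """Replace person-centric words with PersonX/PersonY/PersonZ in
    order of first occurrence; repeated mentions of the same word
    reuse the same placeholder. Coreference is ignored here."""
    mapping = {}
    out = []
    for tok in text.split():
        key = tok.lower()
        if key in PERSON_WORDS:
            if key not in mapping:
                mapping[key] = PLACEHOLDERS[min(len(mapping), 2)]
            out.append(mapping[key])
        else:
            out.append(tok)
    return " ".join(out)
```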

Selection of the Eventuality KG
Taking the scale and the diversity of relationships of the KG into account, we select two automatically extracted eventuality knowledge graphs as candidates for the population task: Knowlywood (Tandon et al., 2015) and ASER (Zhang et al., 2020).

Table 3: The overall matching statistics for the four CSKBs. The edge column indicates the proportion of edges whose heads and tails can be connected by paths in ASER. The average (in- and out-) degree on ASER norm and C for nodes from the CSKBs is also presented. The statistics on C differ from (Malaviya et al., 2020), as we compute the degree on the aligned CSKB C instead of each individual CSKB.
Both have complex graph structures that are suitable for multi-hop reasoning. We first check how much commonsense knowledge is included in these eventuality graphs, to see whether it is possible to ground a large proportion of commonsense knowledge triples on them. Best-effort alignment rules are designed to align the formats of the CSKBs and the eventuality KGs. For Knowlywood, as the patterns are mostly simple verb-object pairs, we use the v-o pairs directly and add a subject in front of them. For ASER, we aggregate raw personal pronouns like "he" and "she" into the normalized "PersonX". As ASER adopts more complicated patterns for defining eventualities, a more detailed pre-process of the alignment between ASER and the CSKBs is illustrated in Section 4.4. We report the proportion of triples in every CSKB whose head and tail can both be matched to the eventuality graph in Table 2. ASER covers a significantly larger proportion of head-tail pairs in the four CSKBs than Knowlywood. The reason is that ASER is of a much larger scale, and it contains eventualities with more complicated structures such as s-v-o-p-o (s for subject, v for verb, o for object, and p for preposition), whereas Knowlywood mostly covers s-v or s-v-o patterns only. In the end, we select ASER as the eventuality graph for population.

Pre-process of the Eventuality Graph
We introduce the normalization process of ASER, which converts its knowledge among everyday eventualities into a normalized form to be aligned with the CSKBs, as discussed in Section 4.2. Each eventuality in ASER has a subject. We consider singular person-centric words, i.e., "I", "you", "he", "she", "someone", "guy", "man", "woman", "somebody", and replace the concrete personal pronouns in ASER with normalized formats such as "PersonX" and "PersonY". Specifically, for an original ASER edge where both the head and tail share the same person-centric subject, we replace the subject with "PersonX", and the subsequent personal pronouns in the two eventualities with "PersonY" and "PersonZ" according to the order of occurrence, if any. For two neighboring eventualities whose subjects are different person-centric pronouns, we replace one with "PersonX" and the other with "PersonY". In addition, to preserve the complex graph structure in ASER, for every converted edge we duplicate it by replacing "PersonX" with "PersonY" and "PersonY" with "PersonX", so as to preserve the sub-structure of ASER as much as possible. An illustration of the converting process is shown in Figure 2. The normalized version of ASER is denoted as ASER norm.

Figure 2: An example of normalizing ASER. The coral nodes and edges are raw data from ASER, and the blue ones are the normalized graph obtained by converting "he" and "she" to the placeholders "PersonX" and "PersonY".
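The duplication step above amounts to a simultaneous swap of the two placeholders on every converted edge. A minimal sketch (the sentinel character is just a temporary marker so the first replacement is not overwritten):

```python
def swap_xy(text: str) -> str:
    """Swap the "PersonX" and "PersonY" placeholders simultaneously,
    using a sentinel so the first replacement is not clobbered."""
    return (text.replace("PersonX", "\0")
                .replace("PersonY", "PersonX")
                .replace("\0", "PersonY"))

def duplicate_edge(edge):
    """Given a normalized (head, relation, tail) edge, produce the
    role-swapped duplicate that is added alongside the original."""
    head, rel, tail = edge
    return (swap_xy(head), rel, swap_xy(tail))
```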

The Aligned Graph G c
With the pre-processing in Sections 4.2 and 4.4, we can align the CSKBs and ASER into the same format. To demonstrate ASER's coverage of the knowledge in the CSKBs, we present the proportion of heads, tails, and edges that can be found in ASER norm via exact string match in Table 3. For edges, we report the proportion of edges whose heads and tails can be connected by a path in ASER. We also report the average shortest path length in ASER for those matched edges from the CSKBs in the #hops column, showing that ASER can entail such commonsense knowledge within several hops of path reasoning, which builds the foundation of commonsense reasoning on ASER. In addition, the average degree in G_c and C for heads and tails from each CSKB is also presented in the table. The total number of triples for each relation in the CSKBs is presented in Table 4. There are 18 commonsense relations in total in the CSKBs and 15 relations in ASER. More detailed descriptions and examples of the unification are presented in the Appendix (Tables 11, 12, and 14).
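The #hops statistic can be computed with a plain breadth-first search over the matched graph; a minimal sketch, where the adjacency-list representation and directed edges are illustrative assumptions:

```python
from collections import deque

def shortest_hops(adj, src, dst):
    """BFS shortest-path length between two normalized eventualities
    in a directed graph given as {node: [neighbors]}; returns None if
    dst is unreachable from src."""
    if src == dst:
        return 0
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None
```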

Evaluation Set Preparation
For the ground-truth commonsense triples from the CSKBs, we split them into training, development, and test sets with the proportion 8:1:1. Negative examples are sampled by selecting a random head and a random tail from the aligned G_c, such that the ratio of negative to ground-truth triples is 1:1.
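A minimal sketch of this negative sampling, assuming simple (h, r, t) string triples; pairing a random head with a random tail from the aligned graph is one plausible reading of the setup above:

```python
import random

def sample_negatives(cskb, graph_nodes, seed=0):
    """For each ground-truth triple (label 1), build one negative
    triple (label 0) with a random head and a random tail drawn from
    the aligned graph, keeping a 1:1 positive-to-negative ratio."""
    rng = random.Random(seed)
    examples = []
    for (h, r, t) in cskb:
        examples.append(((h, r, t), 1))
        examples.append(((rng.choice(graph_nodes), r,
                          rng.choice(graph_nodes)), 0))
    return examples
```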
To form a diverse evaluation set, we sample 20K triples from the original automatically constructed test set (denoted as "Original Test Set"), 20K from the edges in ASER whose heads come from CSKBs and whose tails are from ASER (denoted as "CSKB head + ASER tail"), and 20K triples in ASER where both heads and tails come from ASER (denoted as "ASER edges"). The detailed methods for selecting candidate triples for annotation are listed in Appendix B.2. The distribution of relations in this evaluation set is the same as in the original test set. The sampled evaluation set is then annotated to acquire ground-truth labels.

Setups
The human annotation is carried out on Amazon Mechanical Turk. Workers are provided with natural-language sentences translated from knowledge triples (e.g., for xReact, an (h, r, t) triple is translated to "If h, then PersonX feels t"). Additionally, following Hwang et al. (2020), each triple is rated on a four-point Likert scale: Always/Often, Sometimes/Likely, Farfetched/Never, and Invalid. Triples receiving the former two labels are treated as Plausible, and otherwise Implausible. Each HIT (task) includes 10 triples with the same relation type, and each sentence is labeled by 5 workers. We take the majority vote among the 5 votes as the final result for each triple. To avoid ambiguity and control quality, we finalize the dataset by selecting triples where workers reach an agreement of at least 4 votes.
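The label aggregation above can be sketched as follows: map the Likert votes to binary labels, keep only triples with at least 4 of 5 agreeing votes, and return the majority label:

```python
from collections import Counter

# Likert labels mapped to the binary Plausible/Implausible decision.
LIKERT_TO_PLAUSIBLE = {
    "Always/Often": True, "Sometimes/Likely": True,
    "Farfetched/Never": False, "Invalid": False,
}

def finalize(votes):
    """Aggregate 5 worker votes; return the majority binary label, or
    None if fewer than 4 of the 5 votes agree (triple discarded)."""
    labels = [LIKERT_TO_PLAUSIBLE[v] for v in votes]
    (label, count), = Counter(labels).most_common(1)
    if count < 4:
        return None
    return label
```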

Quality Control
For strict quality control, we carry out two rounds of qualification tests to select workers, and provide a special training round. First, workers satisfying the following requirements are invited to participate in our qualification tests: 1) at least 1K HITs approved, and 2) an approval rate of at least 95%. Second, a qualification question set including both straightforward and tricky questions is created by experts, who are authors of this paper and have a clear understanding of the task. 760 triples sampled from the original dataset are annotated by the experts. Each worker answers a HIT containing 10 questions from the qualification set, and their answers are compared with the expert annotations. Annotators who correctly answer at least 8 out of 10 questions are selected for the second round. 671 workers participated in the qualification test, among which 141 (21.01%) were selected as our main-round annotators. To further enhance quality, we carry out an extra training round for the main-round annotators. For each relation, annotators are asked to rate 10 tricky triples carefully selected by the experts. A grading report with detailed explanations of every triple is sent to all workers afterward to help them fully understand the annotation task. After filtering, we acquire human-annotated labels for 31,731 triples. The IAA score is 71.51%, calculated as the pairwise agreement proportion, and Fleiss's κ (Fleiss, 1971) is 0.43. We further split the annotated triples into development and test sets with the proportion 2:8. The overall statistics of this evaluation set are presented in Table 5. To acquire human performance, we sample 5% of the triples from the test set and ask the experts introduced above to provide two additional votes per triple. The agreement between the labels acquired by majority voting and the 5+2 annotation labels is used as the final human performance on this task.
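The pairwise-agreement IAA reported above is the proportion of agreeing annotator pairs, averaged over triples; a sketch (Fleiss's κ additionally corrects this raw agreement for chance):

```python
from itertools import combinations

def pairwise_agreement(all_votes):
    """Average, over triples, of the fraction of annotator pairs that
    gave the same label to that triple."""
    per_triple = []
    for votes in all_votes:
        pairs = list(combinations(votes, 2))
        per_triple.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_triple) / len(per_triple)
```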

Experiments
In this section, we introduce the baselines and our proposed model KG-BERTSAGE for the CSKB population task, as well as the experimental setups.

Model
The objective of a population model is to determine the plausibility of an (h, r, t) triple, where nodes can frequently be out of the domain of the training set. Hence, transductive methods based on knowledge base embeddings (Malaviya et al., 2020) are not studied here. We present several ways of encoding triples in an inductive manner.
BERT. The embeddings of h, r, and t are encoded as the embeddings of the [CLS] tokens after feeding them separately as sentences to BERT. More details about the models and the experimental setup are listed in Appendix C.

Setup
We train the population model with a triple classification task, where ground-truth triples come from the original CSKBs and negative examples are randomly sampled from the aligned graph G_c. The model needs to discriminate whether an (h, r, t) triple in the human-annotated evaluation set is plausible or not. For evaluation, we use the AUC score as the evaluation metric, as this commonsense reasoning task is essentially a ranking task that is expected to rank plausible assertions higher than farfetched ones. We use BERT base from the Transformers library (https://transformer.huggingface.co/), with learning rate 5 × 10^-5 and batch size 32 for all models. The statistics of each relation are shown in Table 6. We select the best model individually for each relation based on the corresponding development set. Besides AUC scores for each relation, we also report an overall AUC score computed as the weighted sum of the break-down scores, weighted by the proportion of test examples per relation. This is reasonable, as AUC essentially represents the probability that a positive example is ranked higher than a negative example.
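The metric can be computed directly from its ranking definition; a self-contained sketch of both the per-relation AUC and the relation-weighted overall score described above:

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative,
    with ties counted as 0.5 (the ROC AUC, computed by brute force)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def overall_auc(per_relation):
    """Combine per-relation AUCs weighted by each relation's share of
    test examples; per_relation maps relation -> (auc, n_examples)."""
    total = sum(n for _, n in per_relation.values())
    return sum(a * n for a, n in per_relation.values()) / total
```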

Main Results
The main experimental results are shown in Table 7. KG-BERTSAGE performs the best among all models, as it both encodes an (h, r, t) triple as a whole and takes full advantage of neighboring information in the graph. Moreover, all models fall below human performance by a relatively large margin.

ASER can, on the one hand, provide candidate triples for populating CSKBs, and on the other hand provide graph structure for learning commonsense reasoning. From the average degrees in Table 3, the graph acquired by grounding CSKBs to ASER provides far more neighbor information than using the CSKBs alone. While KG-BERT treats the task directly as a simple triple classification task and takes only the triples as input, it does not explicitly take the graph structure into consideration. KG-BERTSAGE, on the other hand, leverages an additional GraphSAGE layer to aggregate the graph information from ASER, thus achieving better performance. This demonstrates that it is beneficial to incorporate the un-annotated ASER graph structure, where multi-hop paths are grounded between commonsense heads and tails. Though BERTSAGE also incorporates neighboring information, it only leverages the ASER node representations and ignores the complete relational information of triples that KG-BERTSAGE uses. As a result, it does not outperform BERT by much on this task.

Zero-shot Setting
We also investigate the effects of different training CSKBs, as shown in Table 8. Models trained on all CSKBs achieve better performance, both for each individual relation and overall. We can conclude that more high-quality commonsense triples for training, covering diverse dimensions, benefit the performance of such commonsense reasoning. When a model is trained on a single CSKB dataset, some relations are never seen in the training set. As all of the models use BERT to encode relations, the models are inductive and can thus reason about triples of unseen relations in a zero-shot setting. For example, the isBefore and isAfter relations are not present in GLUCOSE, yet after training KG-BERTSAGE on GLUCOSE, it can still achieve fair AUC scores. Though not trained explicitly on the isBefore and isAfter relations, the model can transfer knowledge from other relations and apply it to the unseen ones.

Table 9: AUC scores grouped by the types of the evaluation sets defined in Section 4.6. The latter two groups are harder for neural models to distinguish.

Error Analysis
As defined in Section 4.6, the evaluation set is composed of three parts: edges coming from the original test set (Original Test Set), edges whose heads come from CSKBs and tails from ASER (CSKB head + ASER tail), and edges from the whole ASER graph (ASER edges). The break-down AUC scores of the different groups for all models are shown in Table 9. The performance of all models on the Original Test Set is remarkably better than on the other two groups, as the edges in the original test set are from the same domain as the training examples. The other two groups, which contain more unseen nodes and edges, are harder for the neural models to distinguish. The results show that the simple commonsense reasoning models studied in this paper struggle to generalize to unseen nodes and edges. As a result, to improve performance on the CSKB population task, more attention should be paid to the generalization ability of commonsense reasoning on unseen nodes and edges. Moreover, from a brief inspection of the test set, we found that errors occur on triples that are not logically sound but semantically related. Some examples are presented in Table 10. For the triple (PersonX go to nurse, xEffect, PersonX use to get headache), the head event and tail event are highly related. However, the fact that someone gets a headache should be the reason for, rather than the result of, going to the nurse. More similar errors are presented in the rest of the table. These failures may arise because BERT-based models are not well trained on logical or discourse relations, while still recognizing semantic relatedness patterns.

Conclusion
In this paper, we benchmark the CSKB population task by proposing a dataset aligning four popular CSKBs with the eventuality graph ASER, and provide a high-quality human-annotated evaluation set to test models' reasoning ability. We also propose KG-BERTSAGE, which incorporates both the semantics of knowledge triples and the subgraph structure to conduct reasoning, and achieves the best performance among all counterparts. Experimental results also show that reasoning over unseen triples outside the domain of CSKBs is a hard task on which current models are far from human performance, which poses challenges to the community for future research.

A Additional Details of Commonsense Relations
During human annotation, we translate the symbolic knowledge triples into natural language for annotators to better understand the questions. An (h, r, t) triple, where h, r, and t are the head, relation, and tail, is translated to "if h, then [Description] t". Here, the description placeholder [Description] comes from the rules in Table 11, which are modified from Hwang et al. (2020). These descriptions can also be regarded as definitions of the commonsense relations. Moreover, the definitions of the discourse relations in ASER are presented in Table 12. We also present the statistics of the relation distribution of ASER norm in Table 13. Table 14 demonstrates several examples of unifying the formats of different resources. In ConceptNet and Knowlywood, the nodes are mostly verb or verb-object phrases, and we add a subject "PersonX" in front of each node. For ATOMIC, the main modification concerns the tails, where subjects tend to be missing. We treat agent-driven relations (investigating causes and effects on PersonX) and theme-driven relations (investigating causes and effects on PersonY) differently, and add PersonX or PersonY in front of tails whose subjects are missing. For ASER, rules are used to discriminate PersonX and PersonY on a given edge. Two examples from ASER and ATOMIC demonstrating the differences between PersonX and PersonY are provided in the table. For GLUCOSE, we simply replace SomeoneA with PersonX and SomeoneB with PersonY accordingly. Moreover, all words are lemmatized to normalized forms using the Stanford CoreNLP parser.
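The verbalization rule above can be sketched as follows; the two entries below are a hypothetical fragment of the full [Description] mapping defined in Table 11:

```python
# Hypothetical fragment of the [Description] mapping; the full set of
# relation descriptions is given in Table 11.
DESCRIPTIONS = {
    "xReact": "PersonX feels",
    "xWant": "PersonX wants",
}

def verbalize(h: str, r: str, t: str) -> str:
    """Translate an (h, r, t) triple into the natural-language form
    shown to annotators: "if h, then [Description] t"."""
    return f"if {h}, then {DESCRIPTIONS[r]} {t}"
```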

B.2 Selecting Candidate Triples from ASER
The evaluation set comes from three parts:
1. Original test set: triples from the test sets of the original CSKBs, paired with triples randomly sampled from G c .
2. Randomly sampled examples: triples randomly sampled from the aligned graph.
3. ASER edges: edges sampled from the whole ASER graph.
Instead of randomly sampling negative examples, which may be easy to distinguish, we sample candidate edges from ASER with simple rules that fit the chronological order and syntactic patterns of each commonsense relation, thus providing a harder evaluation set that makes machines concentrate more on commonsense. The discourse relations defined in ASER (Table 12) inherently represent a chronological order, which can be matched to each commonsense relation through alignment rules.

First, for each commonsense relation, we sample the edges in ASER with the same basic chronological and logical meaning. For example, the Result relation from ASER, a discourse relation where the tail is a result of the head, can serve as a candidate for the xEffect commonsense relation, where the tail is the effect or consequence of the head. Alternatively, we can regard (tail, Succession −1 , head), the inversion of (head, Succession, tail), as a candidate xEffect relation, since in Succession the head happens after the tail. By providing candidate triples with the same chronological relation, models need to focus on the subtle commonsense connections within each triple.

Second, we restrict the dependency patterns of the candidate edges. For stative commonsense relations such as xAttr, where the tails are defined to be states, we restrict the tails from ASER to patterns such as s-v-o and s-v-a. This also filters out some triples that are obviously false because they do not actually describe a state. Detailed selection rules for each commonsense relation are defined in Table 15.

Besides the above selected edges, we also sample some edges from ASER that are the reverse of the designated discourse relations. For example, for the commonsense relation xEffect, the rules above select discourse edges with patterns like (head, Result, tail) to constitute a candidate xEffect triple (head, xEffect, tail). In addition, we sample some edges with reverse relations, like (tail, Result, head), to form a candidate edge (head, xEffect, tail), making the annotated edges more diverse.
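The two-step selection above can be sketched as follows. The alignment map and pattern sets are illustrative stand-ins for the full rules in Table 15, and the edge representation is a simplification.

```python
# Sketch of selecting candidate triples from ASER for one commonsense
# relation: (1) keep only edges whose discourse relation is aligned with the
# commonsense relation, (2) for stative relations, additionally restrict the
# dependency pattern of the tail. Both maps below are illustrative stand-ins
# for the actual rules in Table 15.
ALIGNED_DISCOURSE = {
    "xEffect": {"Result", "Succession^-1"},
    "xAttr": {"Conjunction"},  # hypothetical alignment, for illustration only
}
STATIVE_TAIL_PATTERNS = {
    "xAttr": {"s-v-o", "s-v-a"},  # tails must describe a state
}

def candidate_triples(edges, cs_relation):
    """edges: iterable of (head, discourse_rel, tail, tail_pattern) tuples."""
    selected = []
    for head, discourse_rel, tail, tail_pattern in edges:
        # Step 1: chronological/logical alignment of discourse relations.
        if discourse_rel not in ALIGNED_DISCOURSE.get(cs_relation, set()):
            continue
        # Step 2: dependency-pattern restriction for stative relations.
        allowed = STATIVE_TAIL_PATTERNS.get(cs_relation)
        if allowed is not None and tail_pattern not in allowed:
            continue
        selected.append((head, cs_relation, tail))
    return selected
```

Under this sketch, an ASER edge (head, Result, tail) becomes a candidate (head, xEffect, tail), while edges with unmatched discourse relations or disallowed tail patterns are filtered out.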

B.3 Examples of Populated Triples
Examples of the annotations of the populated triples are listed in Table 17. The triples come from the three sources defined in Section B.2.
In the Original Test Set category, the triples are composed of two parts: ground-truth triples from the original CSKBs, and triples randomly sampled from G c .
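The composition of this category can be sketched as follows. Function and variable names are illustrative, not from the paper, and the sampled triples are provisionally labeled 0 here before human annotation decides their plausibility.

```python
import random

# Sketch of composing the "Original Test Set" category: ground-truth triples
# from the CSKBs as positives, plus triples sampled uniformly at random from
# the aligned graph G_c. All names here are illustrative.
def build_original_test_set(cskb_triples, graph_triples, n_sampled, seed=0):
    rng = random.Random(seed)
    positives = [(triple, 1) for triple in cskb_triples]
    # Sample from G_c, excluding triples already present in the CSKBs.
    known = set(cskb_triples)
    pool = [t for t in graph_triples if t not in known]
    sampled = [(triple, 0) for triple in rng.sample(pool, n_sampled)]
    return positives + sampled
```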

C.1 Model Details
For a (h, r, t) triple, we denote the word tokens of h and t as $w^h_1, w^h_2, \cdots, w^h_l$ and $w^t_1, w^t_2, \cdots, w^t_m$, where l and m are the lengths of the corresponding sentences. For the BERT model, "[CLS] $w^h_1 w^h_2 \cdots w^h_l$ [SEP]" is taken as the input to a BERT-base encoder, and the embedding of the [CLS] token is regarded as the final embedding $s_h$ of the head h. The tail t is encoded as $s_t$ in the same way. For the relation r, we feed the name of the relation directly between [CLS] and [SEP], i.e., "[CLS] r [SEP]", and use the embedding of the [CLS] token as the relation embedding $s_r$. As BERT adopts sub-word encoding, the relations, despite being compound symbols, can be split into several meaningful components for BERT to encode. For example, xReact is split into "x" and "react", which conveys both the semantics of "x" (the relation is based on PersonX) and "react" (the reaction to the head event).
For KG-BERT, we encode a (h, r, t) triple by feeding the concatenation of the three elements into BERT. Specifically, "[CLS] $w^h_1 w^h_2 \cdots w^h_l$ [SEP] r [SEP] $w^t_1 w^t_2 \cdots w^t_m$ [SEP]" is fed into BERT, and the embedding of [CLS] is regarded as the final representation of the triple.
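The two input formats above can be sketched as plain strings; actual tokenization and encoding with a BERT model are omitted here.

```python
# Sketch of the two input formats described above. We only build the input
# strings; in practice a BERT tokenizer inserts [CLS]/[SEP] itself and the
# encoder maps each sequence to its [CLS] embedding.
def bert_inputs(head: str, relation: str, tail: str):
    """Separate encoding (BERT baseline): one sequence per element."""
    return [
        f"[CLS] {head} [SEP]",      # [CLS] embedding -> s_h
        f"[CLS] {tail} [SEP]",      # [CLS] embedding -> s_t
        f"[CLS] {relation} [SEP]",  # [CLS] embedding -> s_r
    ]

def kg_bert_input(head: str, relation: str, tail: str) -> str:
    """Joint encoding (KG-BERT): concatenate h, r, t into one sequence."""
    return f"[CLS] {head} [SEP] {relation} [SEP] {tail} [SEP]"
```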
Denote the embedding of a (h, r, t) triple acquired by KG-BERT as KG-BERT(h, r, t), and let N(v) denote a neighboring function that returns the neighbor-relation pairs of a vertex v, e.g., the incoming pairs {(r', u) | (u, r', v) ∈ G} or the outgoing pairs {(r', u) | (v, r', u) ∈ G} (G is ASER in our case; the variants we use are discussed in Appendix C.2). The model KG-BERTSAGE then encodes a (h, r, t) triple as:

KG-BERTSAGE(h, r, t) = [KG-BERT(h, r, t); MEAN_{(r', u) ∈ N(h)} KG-BERT(h, r', u); MEAN_{(r', u) ∈ N(t)} KG-BERT(t, r', u)],

where [·; ·] denotes concatenation and MEAN is the mean aggregator over neighboring edges. Moreover, as the average degree of nodes in ASER is quite high, we follow the idea in GraphSAGE (Hamilton et al., 2017) and conduct uniform sampling on the neighbor set; 4 neighbors are randomly sampled during training.
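A minimal sketch of this aggregation, assuming a stub `kg_bert` function that returns fixed-size vectors and a `neighbors` function as described above (both names are illustrative, and every vertex is assumed to have at least one neighbor):

```python
import random

# Sketch of the KG-BERTSAGE aggregation: the KG-BERT triple embedding is
# concatenated with mean-pooled embeddings of (up to k) uniformly sampled
# neighboring edges of the head and of the tail. kg_bert here is any function
# (h, r, t) -> list[float]; in the paper it is the KG-BERT encoder.
def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def kg_bertsage(h, r, t, neighbors, kg_bert, k=4, seed=0):
    """neighbors(v) -> list of (relation, node) pairs for vertex v."""
    rng = random.Random(seed)

    def aggregate(v):
        pairs = neighbors(v)
        if len(pairs) > k:  # uniform neighbor sampling, as in GraphSAGE
            pairs = rng.sample(pairs, k)
        return mean_pool([kg_bert(v, rel, u) for rel, u in pairs])

    # Concatenation: [triple embedding; head neighborhood; tail neighborhood].
    return kg_bert(h, r, t) + aggregate(h) + aggregate(t)
```

A linear classification layer on top of this concatenated vector would then score the plausibility of the triple.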

C.2 Neighboring Function N
The edges in ASER are directed. We try two kinds of neighboring functions:

N(v) = {(r, u) | (v, r, u) ∈ G},    (1)
N(v) = {(r, u) | (v, r, u) ∈ G or (u, r, v) ∈ G}.    (2)

Equation (1) returns the outgoing edges of vertex v, and Equation (2) returns the bi-directional edges of vertex v. The overall results using the two mechanisms with KG-BERTSAGE are shown in Table 16. By incorporating bi-directional information for each vertex, the performance of CSKB population can be largely improved.

Excerpt of annotation examples (Table 17), in the format head | relation | tail | label:

Original test set
PersonX be diagnose with something | Causes | PersonX be sad | Plau.

Randomly sampled examples
PersonX be patient with ignorance | HinderedBy | PersonY have the right vocabulary | Implau.
PersonX be save money | HasSubEvent | PeopleX can not afford something | Plau.
PersonX decide to order a pizza | xReact | PersonX have just move | Implau.
it be almost christmas | gReact | PersonX be panic | Implau.
PersonX go early in morning | xEffect | PersonX do not have to deal with crowd | Plau.

ASER edges
PersonX have take time to think it over | xReact | PersonX be glad | Plau.
PersonX have a good work-life balance | xIntent | PersonX be happy | Plau.
PersonX be hang out on reddit | oReact | PersonY can not imagine | Implau.