PseudoReasoner: Leveraging Pseudo Labels for Commonsense Knowledge Base Population

Commonsense Knowledge Base (CSKB) Population aims at reasoning over unseen entities and assertions on CSKBs, and is an important yet hard commonsense reasoning task. One challenge is that it requires out-of-domain generalization ability, as the source CSKB for training is of a relatively small scale (1M) while the whole candidate space for population is far larger (200M). We propose PseudoReasoner, a semi-supervised learning framework for CSKB population that uses a teacher model pre-trained on CSKBs to provide pseudo labels on the unlabeled candidate dataset for a student model to learn from. The teacher can be a generative model rather than being restricted to discriminative models as in previous works. In addition, we design a new filtering procedure for pseudo labels based on the influence function and the student model's prediction to further improve performance. The framework improves the backbone model KG-BERT (RoBERTa-large) by 3.3 points on the overall performance and, notably, 5.3 points on the out-of-domain performance, achieving the state-of-the-art. Code and data are available at https://github.com/HKUST-KnowComp/PseudoReasoner.


Introduction
Commonsense knowledge consists of the common agreements most people hold about everyday entities, and it is crucial for intelligent systems to act sensibly in the real world (Davis, 1990; Liu and Singh, 2004). Endowing natural language understanding systems with the ability to perform commonsense reasoning remains an important yet challenging task.
Throughout the development of automated commonsense understanding, the CommonSense Knowledge Base (CSKB) has been an important form of knowledge source for storing commonsense and drawing inferences. With expert-curated relations and human annotations, CSKBs such as ConceptNet (Liu and Singh, 2004), ATOMIC (Sap et al., 2019; Hwang et al., 2021), and GLUCOSE (Mostafazadeh et al., 2020) have been developed to study commonsense regarding properties of objects, causes and effects of events and activities, motivations and emotional trajectories of humans under certain circumstances, and so on.
As those human-annotated CSKBs are sparse and usually of a small scale and coverage, reasoning tasks on CSKBs such as CSKB Completion (Li et al., 2016; Saito et al., 2018; Malaviya et al., 2020) and CSKB Population (Fang et al., 2021b,a) are defined with the goal of either adding new edges/assertions within the training knowledge base (CSKB Completion) or adding new edges/assertions from outside of CSKBs (CSKB Population). A visualized comparison between the two tasks is shown in Figure 1.
Different from CSKB Completion, which adopts a closed-world assumption and assumes all knowledge is in-domain, the population task deals with unseen entities and requires stronger out-of-distribution reasoning ability. In this paper, we study commonsense reasoning in the context of CSKB Population. In this task, four mainstream CSKBs, ConceptNet, ATOMIC, ATOMIC 2020, and GLUCOSE, are aligned together as the labeled dataset. ASER (Zhang et al., 2022), a large-scale eventuality (events, activities, and states) knowledge graph, is aligned with the CSKBs and serves as the unlabeled candidates for populating commonsense knowledge. Human annotations on held-out dev/test sets sampled from both the CSKBs and ASER are provided as the evaluation set.
There are two major challenges that remain unsolved for CSKB Population. First, the scale of the annotated training set (ConceptNet, ATOMIC, and GLUCOSE) is approximately 1M samples, far smaller than the 200M-triple candidate space for population (ASER). Second, as CSKBs inherently provide only ground-truth (positive) examples, the randomly sampled negative examples in the task are less informative and may lead the model to overfit artifacts of the dataset. A supervised learning model finetuned on such an annotated training set struggles to generalize to out-of-domain knowledge, as shown in Fang et al. (2021a) and also in Table 3 of our paper, where the AUC on out-of-domain test sets is over 10 points lower than on the in-domain part.
To address the above challenges, we propose PseudoReasoner, a semi-supervised learning framework that uses a pre-trained commonsense teacher model to automatically label the unlabeled candidates with pseudo labels, such that the student model can be further finetuned on these pseudo labels to improve out-of-domain commonsense reasoning ability. Indeed, pre-trained language models finetuned on commonsense knowledge bases have been shown to perform generalizable commonsense reasoning on downstream tasks to some extent. For example, leveraging commonsense knowledge generated by COMET (Bosselut et al., 2019), a language model finetuned on ATOMIC, can improve performance on commonsense QA (Yang et al., 2020; Shwartz et al., 2020; Bosselut et al., 2021). Different from the text generation paradigm of previous works, here we leverage the commonsense language model as a teacher model for labeling unlabeled candidates. To further improve the quality of pseudo labels, we use both the influence function (Koh and Liang, 2017) and the student model's prediction to select highly confident pseudo examples.
Our contribution is three-fold: 1) We introduce a new way of providing pseudo labels for CSKB Population by leveraging generative commonsense language models.
2) We propose a semi-supervised learning framework with pseudo labels and a special filtering mechanism based on the influence function and the student model's prediction that significantly improves the performance of CSKB Population, especially for out-of-domain knowledge triples.
3) We demonstrate the effectiveness of our framework through extensive experiments on different backbone models and different semi-supervised learning methods. We achieve state-of-the-art performance on this task.
Related Works

Commonsense Reasoning over Knowledge Bases
Commonsense Knowledge Bases (CSKBs) provide rich human-curated commonsense knowledge in the form of head-relation-tail triples and are a natural testbed for commonsense reasoning.
Reasoning tasks defined over CSKBs fall into two main categories: CSKB Completion and CSKB Population. CSKB Completion adopts a closed-world assumption, treating knowledge not included in the knowledge base as incorrect, and thus uses evaluation metrics such as accuracy, MRR, and HITS@k (under the link prediction setting (Li et al., 2016; Malaviya et al., 2020; Saito et al., 2018; Jastrzębski et al., 2018)) or BLEU (under the text generation setting (Bosselut et al., 2019; Hwang et al., 2021)) to evaluate reasoning performance. As vanilla knowledge base completion models such as TransE (Bordes et al., 2013) and ComplEx (Trouillon et al., 2016) do not take node semantics into account and CSKBs are much sparser than standard factual KBs, Malaviya et al. (2020) and Wang et al. (2021) use BERT (Devlin et al., 2019) to encode nodes and propose graph densifiers to address the sparsity issue. CSKB Population, on the other hand, requires the model to reason over not only existing nodes in the CSKB (Fang et al., 2021b,a) but also nodes from other domains (e.g., ASER (Zhang et al., 2022)). The evaluation set is manually annotated to form a binary classification task, and AUC is used as the evaluation metric. As the population task is inherently a classification task, KG-BERT (Yao et al., 2019) and KG-BERTSage (Fang et al., 2021a), a graph-aware model based on KG-BERT, are used to tackle it. While there are other ways of populating commonsense knowledge, such as leveraging conceptualization (He et al., 2022) or generative language models (Bosselut et al., 2021; West et al., 2022), they differ from CSKB Population in several aspects. Conceptualization provides new triples in depth through hierarchical structures of concepts via lexical manipulation, preserving the core semantics, whereas we populate in breadth by exploring more diverse information-extracted triples. For text generation, first, text generation itself does not perform reasoning, while discrimination on a knowledge graph can leverage structural and contextual information to reason. Second, generative models are by nature trained only on positive/plausible examples and do not handle negative/implausible examples. We will show the importance of negative examples for CSKB Population in the experiments.

Pseudo Labels
Pseudo labels are widely used in the semi-supervised learning setting (Iscen et al., 2019; Xie et al., 2020b; Sohn et al., 2020; Pham et al., 2021). In general, pseudo labels alleviate the cost of human annotation and familiarize the student model with out-of-domain or same-distribution-but-unseen data (i.e., data from the large unlabeled dataset) under the guidance of teacher models. They have been used in image classification (Lee et al., 2013; Shi et al., 2018; Yalniz et al., 2019; Xie et al., 2020b; Pham et al., 2021), machine translation (He et al., 2020; Chen et al., 2021), information retrieval (Wang et al., 2022), and word segmentation (Huang et al., 2021). In the simplest form of pseudo labeling, the teacher model that provides pseudo labels is static or iteratively updated using the latest student model. Meta Pseudo Labels (Pham et al., 2021), on the other hand, leverages a meta-learning technique to learn an end-to-end teacher-student network that updates both networks jointly based on the performance of the student network on the labeled dataset, yielding state-of-the-art performance on ImageNet (Deng et al., 2009).

Other Semi-supervised Learning Methods
Most works on semi-supervised learning leverage the idea of consistency training (Zhou et al., 2003; Xie et al., 2020a; Gururangan et al., 2019), which aims to constrain the model to be robust given noised inputs or hidden states. UDA (Unsupervised Data Augmentation) (Xie et al., 2020a) adopts back-translation and TF-IDF word replacement to produce noisy inputs and uses them for consistency training. However, in the setting of CSKB Population, out-of-distribution generalization is the more important problem, as we aim to explore novel commonsense knowledge that is not seen in the training domain, which makes directly finetuning on pseudo labels more effective than consistency training.

Problem Definition
Denote a labeled ground-truth Commonsense Knowledge Base as D_l^+ = {(h, r, t) | h ∈ H_C, r ∈ R, t ∈ T_C}, and a randomly sampled negative dataset from the CSKB as D_l^-. Together they form the labeled dataset D_l = D_l^+ ∪ D_l^-, whose instances are pairs {(h, r, t), y}, where y ∈ {0, 1} is the label of the triple. Triples from D_l^+ are labeled 1 while those from D_l^- are labeled 0. H_C and T_C are the sets of heads and tails in the CSKB, and R is the relation set.
D_u = {(h, r, t) | h ∈ H_u, r ∈ R, t ∈ T_u} is an unlabeled candidate knowledge base with the same format and relation set as the CSKB. It is of a much larger scale than the labeled part and serves as the source for populating commonsense knowledge. H_u and T_u are the sets of heads and tails in the unlabeled KB. The task is defined as: given the labeled commonsense knowledge base D_l as the training source, predict the plausibility of triples from D_u.
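To make this concrete, below is a minimal sketch of the data format described above; the triples shown are illustrative examples, not taken from the actual dataset:

```python
# Labeled CSKB triples carry binary labels; unlabeled ASER-derived
# candidates do not. All triples below are made-up illustrations.
labeled = [
    (("PersonX is hungry", "xWant", "PersonX eats lunch"), 1),    # from D_l^+
    (("PersonX is hungry", "xWant", "PersonX flies a kite"), 0),  # from D_l^-
]
unlabeled = [  # candidates from D_u whose plausibility must be predicted
    ("PersonX is tired", "xEffect", "PersonX takes a nap"),
]
heads = {h for (h, r, t), y in labeled}
print(len(labeled), len(unlabeled), len(heads))
```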
We use the dataset provided by Fang et al. (2021a). Here, the set of heads H_C, tails T_C, and relations R come from the alignment of ConceptNet, ATOMIC, ATOMIC 2020 (newly developed relations), and GLUCOSE, as in the original paper. The unlabeled KB D_u is adapted from ASER, where discourse relations are converted to commonsense relations to serve as candidates for population. The evaluation dataset of 32K triples is sampled from both D_l and D_u and manually annotated. The evaluation set has three categories: Original Test Set, CSKB head + ASER tail, and ASER edges. The first category is sampled from the held-out test split of D_l (both D_l^+ and negative examples from D_l^-) and is thus an in-domain test set, while the latter two are novel assertions outside of D_l and are thus out-of-domain. The statistics and descriptions of the training and evaluation datasets are shown in Tables 1 and 2. The list of commonsense relations, alignment methods, and more detailed statistics of the dataset can be found in Appendix A.

Backbone Models
Considering that the CSKB Population task is by nature triple classification over natural language, we use KG-BERT (Yao et al., 2019) as the backbone model, following Fang et al. (2021a). In detail, a triple (h, r, t) is concatenated and serialized as "[CLS] h_1 ... h_|h| [SEP] [r] [SEP] t_1 ... t_|t| [SEP]". Here, [CLS] and [SEP] are the special tokens in BERT-based models (Devlin et al., 2019): [CLS] is used to represent the whole sentence, and [SEP] is used to separate different sentences. h_1, ..., h_|h| are the tokens of the head h, and t_1, ..., t_|t| are the tokens of the tail t. [r] is registered as a new special token for a certain relation r. After feeding the serialized version of (h, r, t) into a BERT-based masked language model, the representation of the special token [CLS] is regarded as the representation of the whole triple, which is trained to distinguish positive triples from negative triples with a cross-entropy loss. Here x denotes a triple (h, r, t), P_L models the distribution of the labeled dataset D_l, and θ is the set of parameters of KG-BERT. P_θ(y|x) denotes the probability after feeding the model's prediction logits to a softmax under parameter set θ. The optimization objective is then:

θ* = argmin_θ E_{(x,y)∼P_L} [−log P_θ(y|x)].  (1)
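The serialization step can be sketched as follows; the exact token layout is our assumption based on the description above, with a hypothetical relation token [xWant]:

```python
def serialize_triple(head, relation, tail):
    # Assumed KG-BERT input format: "[CLS] head [SEP] [r] [SEP] tail [SEP]",
    # where "[xWant]" etc. are registered as new special tokens.
    return f"[CLS] {head} [SEP] [{relation}] [SEP] {tail} [SEP]"

s = serialize_triple("PersonX is hungry", "xWant", "PersonX eats lunch")
print(s)
```

In a real pipeline this string would be tokenized by the backbone encoder's tokenizer, with each [r] token added to the tokenizer's vocabulary beforehand.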

Methods
In this section, we present the details of the PseudoReasoner framework. A sketch of the model is presented in Figure 2. The procedure of PseudoReasoner can be summarized into the following steps: 1) Train a teacher model and a student model on the labeled dataset D_l (Section 4.1.1).
2) Use the teacher model to predict plausibility scores on triples from the unlabeled D u .
Triples with high/low plausibility scores within pre-defined intervals are given label 1/0 (Section 4.1.2).
3) Filter the pseudo labels with the influence function with respect to the student model, and with the student model's predictions (Section 4.1.3).

Teacher Models
We use a teacher model pre-trained on the labeled dataset to label the unlabeled triples. We define the plausibility score of an unlabeled triple x as α(x), where the higher the score, the more plausible the triple is regarded by the teacher model. We choose two different forms of teacher models as follows:

GPT2 (Radford et al., 2019): As negative sampling in the labeled dataset D_l is noisy, we aim to use an alternative model that avoids the negative part D_l^-. We finetune a (COMET) GPT2 language model, as a representative of the generative family, on the positive part of the labeled dataset, D_l^+, with a text generation task. For an (h, r, t) triple from D_l^+, denote x as the serialized version of the triple, "h_1, ..., h_|h|, [r], t_1, ..., t_|t|". θ_LM denotes the trainable parameters of the GPT2 language model (LM). We minimize the negative log-likelihood of each triple as indicated in Equation (2):

L(x, θ_LM) = −Σ_{i=1}^{|x|} log P_{θ_LM}(x_i | x_1, ..., x_{i−1}).  (2)

Denote the optimized parameters as θ*_LM. Here, the plausibility function is α(x) = −L(x, θ*_LM): the lower the loss, the higher the plausibility score by GPT2. Hence, for the triples from the unlabeled dataset D_u, we score every triple with Equation (2) under θ*_LM.

KG-BERT: Besides GPT2, KG-BERT itself, a discriminative model, can be used as a teacher model. This teacher model learns θ* from the labeled dataset D_l with the cross-entropy loss in Equation (1). For an instance {(h, r, t), y} ∈ D_l, denote x = (h, r, t); we use α(x) = P_θ*(y=1|x) as x's plausibility score.
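The GPT2 scoring rule amounts to negating the negative log-likelihood of the serialized triple. A minimal sketch with hypothetical per-token log-probabilities (a real implementation would obtain these from the finetuned LM; the length normalization here is one common scoring choice, not specified above):

```python
def sequence_nll(token_logprobs):
    # L(x, theta_LM): (length-normalized) negative log-likelihood
    return -sum(token_logprobs) / len(token_logprobs)

def plausibility(token_logprobs):
    # alpha(x) = -L(x, theta*_LM): lower loss => higher plausibility
    return -sequence_nll(token_logprobs)

# hypothetical per-token log-probs from a finetuned LM for two triples
good = [-0.5, -0.3, -0.8, -0.4]   # plausible triple: high token probabilities
bad = [-3.2, -2.9, -4.1, -3.5]    # implausible triple: low token probabilities
print(plausibility(good) > plausibility(bad))  # True
```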

Acquiring Pseudo Labels
Triples whose plausibility scores α(x) fall within [T−_min, T−_max] are labeled as negative, and triples within [T+_min, T+_max] are labeled as positive, where T−_min < T−_max < T+_min < T+_max. The reason we introduce the additional bounds T−_min and T+_max is to filter out triples that GPT2 treats as overly plausible or implausible, reducing potential selection bias. For example, GPT2 has been shown to assign low loss to repetitive patterns regardless of the plausibility of their semantics (Brown et al., 2020). Details of the hyperparameter selection are in Appendix C.
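The two-sided thresholding can be sketched as below; the threshold values are purely illustrative (the actual values are tuned, see Appendix C):

```python
def assign_pseudo_label(alpha, t_neg_min, t_neg_max, t_pos_min, t_pos_max):
    """Map a plausibility score to a pseudo label, or None if the triple
    falls outside both intervals (including suspiciously extreme scores)."""
    if t_neg_min <= alpha <= t_neg_max:
        return 0
    if t_pos_min <= alpha <= t_pos_max:
        return 1
    return None

# illustrative thresholds only; real values are tuned on validation data
labels = [assign_pseudo_label(a, -6.0, -4.0, -2.0, -0.5)
          for a in [-7.0, -5.0, -3.0, -1.0, -0.1]]
print(labels)  # [None, 0, None, 1, None]
```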

Pseudo Label Filters
To further improve the quality of pseudo labels, we propose two filtering mechanisms for better finetuning.

Influence Function. For a pseudo-labeled training example z, we measure how removing z from training would change the loss on a validation set Z_val:

L(Z_val, θ*) − L(Z_val, θ*_{−z}),  (4)

where θ* is trained with z and θ*_{−z} without it. This quantity can be approximated without retraining the model via the influence function (Koh and Liang, 2017):

L(Z_val, θ*) − L(Z_val, θ*_{−z}) ≈ (1/n) I_up,loss(z), where I_up,loss(z) = −∇_θ L(Z_val, θ*)ᵀ H_{θ*}^{-1} ∇_θ L(z, θ*),  (5)

and H_{θ*} = (1/n) Σ_{i=1}^n ∇²_θ L(z_i, θ*) is the Hessian. We linearly approximate I_up,loss with the inverse Hessian-vector product (HVP) introduced in LiSSA (Agarwal et al., 2017), following Koh and Liang (2017). Details about the influence function and the numerical approximation used to calculate it can be found in Appendix B. We filter out those examples with negative influence scores, which are harmful to the generalization of the model.

KG-BERT.
As the student model we use is KG-BERT, when the pseudo labels come from GPT2, we can use the P_θ*(y|x) produced by the optimized KG-BERT as an additional filter for selecting pseudo labels. Specifically, pseudo labels {x = (h, r, t), y} with P_θ*(y|x) > 0.5 are selected. This procedure can be viewed as ensembling GPT2 and KG-BERT.
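The two filters compose into a single keep/drop decision per pseudo example. A sketch, following the sign convention above (negative influence scores are treated as harmful) and the 0.5 probability cutoff:

```python
def keep_pseudo_example(influence_score, student_prob_pos, pseudo_label):
    # Filter 1 (influence function): drop examples with negative influence
    if influence_score < 0:
        return False
    # Filter 2 (KG-BERT): keep only if P(pseudo_label | x) > 0.5,
    # i.e. the student agrees with the teacher's pseudo label
    p_label = student_prob_pos if pseudo_label == 1 else 1 - student_prob_pos
    return p_label > 0.5

print(keep_pseudo_example(0.3, 0.9, 1))   # True: both filters pass
print(keep_pseudo_example(0.3, 0.2, 1))   # False: student disagrees
print(keep_pseudo_example(-0.1, 0.9, 1))  # False: harmful influence
```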

PseudoReasoner Training
The objective function of KG-BERT on the labeled dataset is shown in Equation (1), and the objective function on the filtered pseudo labels can be written as:

min_θ E_{x∼P_U} E_{y∼q(y|x)} [−log P_θ(y|x)].  (6)

Table 3: Main results. We compare supervised baselines KG-BERT (Yao et al., 2019) with four backbone encoders and GPT2 (using LM loss to score triples). For semi-supervised learning (SSL) baselines, we study UDA (Xie et al., 2020a), G-DAUG (Yang et al., 2020), and Noisy-student (Xie et al., 2020b). The backbone encoder for SSL baselines is RoBERTa-large, which performs the best in the supervised setting. The numbers of parameters of the backbone language models are presented as subscripts behind model names. ∆_all and ∆_OOD are the improvements on the "all" AUC and the out-of-domain (OOD) AUC.
Here P_L and P_U are the distributions of the labeled and unlabeled datasets, respectively. q(y|x) is the distribution of pseudo labels, modeled by the teacher model and the filters. After finetuning KG-BERT initialized with θ* on the filtered pseudo labels with Equation (6), we acquire θ*′.
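With hard pseudo labels, Equation (6) reduces to an ordinary cross-entropy over the filtered pseudo-labeled set; a numerically stable numpy sketch:

```python
import numpy as np

def pseudo_label_ce(logits, pseudo_labels):
    """Cross-entropy of Equation (6) with hard pseudo labels q(y|x):
    mean over pseudo-labeled triples of -log P_theta(y|x)."""
    z = logits - logits.max(axis=1, keepdims=True)   # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(pseudo_labels)), pseudo_labels].mean()

logits = np.array([[2.0, -1.0], [0.5, 1.5]])   # student scores for 2 triples
labels = np.array([0, 1])                      # filtered hard pseudo labels
print(round(pseudo_label_ce(logits, labels), 4))  # → 0.1809
```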
For KG-BERT, as it is flexible to adapt to different pre-trained encoders, we use BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021), and BART (Lewis et al., 2020) as the backbone language models. For BERT, RoBERTa, and DeBERTa, we use the embedding of the [CLS] token in KG-BERT as the representation of the whole triple. For BART, we follow the sequence classification approach in the original paper (Lewis et al., 2020) and use the embedding of the end-of-sentence token in the decoder as the representation of the whole triple.
For the semi-supervised learning setting, we use the following baseline models. Unsupervised Data Augmentation (UDA). UDA (Xie et al., 2020a) uses consistency training to constrain the model to provide invariant predictions when noise is added to the input. We adapt UDA to the framework of CSKB Population and use TF-IDF word replacement and back-translation to produce the noised input text fed into the consistency loss. More details are in Appendix D.1. Noisy Student. Noisy student (Xie et al., 2020b) iteratively trains a student model with noise added during training. A teacher model is first trained to provide hard or soft pseudo labels for a student model to finetune on together with the labeled dataset. Soft pseudo labels mean using logit scores after softmax as labels. The student model is then re-used as the teacher model, and a new student model is acquired in each iteration. Details can be found in Appendix D.2. Generative Data Augmentation (G-DAUG). G-DAUG (Yang et al., 2020) uses generative models finetuned on the labeled dataset to synthesize new training examples for data augmentation. We also try replacing COMET with COMET-distill (West et al., 2022), a COMET trained on commonsense knowledge distilled from GPT3, which has better performance and capacity than the vanilla COMET. Details can be found in Appendix D.3.

Experimental Settings
The learning rate for all models is set to 1e-5, and the batch size is 64. We build our codebase on the Huggingface Transformers framework. Early stopping is used: the best checkpoint is selected when the largest validation AUC is achieved. For all experiments, we report the average scores across three different random seeds. The threshold values used for pseudo labeling are detailed in Appendix C.

Table 5: The effects of different filters on pseudo labels when KG-BERT (RoBERTa-large) is the backbone and GPT2-XL is the teacher model.

Results
The main results are shown in Table 3. We compare the results of both supervised learning and semi-supervised learning approaches with our proposed model PseudoReasoner. The "all" column presents the overall AUC across all testing examples and is the main metric of CSKB Population. We also separately present the AUC of the different test set categories: In-domain (Original Test Set) and Out-of-domain (CSKB head + ASER tail, and ASER edges). In the last two columns, we report the improvement from applying different semi-supervised learning approaches under the same backbone model, where ∆_all is the increase in the "all" AUC and ∆_OOD is the increase in AUC on the out-of-domain test sets.
For supervised learning approaches, KG-BERT-based models mostly perform well on the In-domain test set but generalize worse to Out-of-domain test sets compared to COMET (GPT2). As GPT2 is only finetuned on the positive part of the dataset, it suffers less from the bias of negative sampling in D_l^- and generalizes better to new knowledge. However, its weaker In-domain reasoning prevents the overall performance of GPT2-XL from surpassing KG-BERT (RoBERTa-large), even with 4.5 times more parameters.
All semi-supervised learning baselines improve the performance of the backbone KG-BERT (RoBERTa-large), especially on the out-of-domain split. However, the improvement on the In-domain part remains insignificant, and the improvement on the out-of-domain part is not as competitive as our PseudoReasoner. As we use the same codebase and training method to train all semi-supervised learning methods, the main differences between PseudoReasoner and the other SSL methods lie in how the unlabeled dataset is processed. We leave the detailed discussion to Section 6.2.

Analysis and Discussions
In this section, we present an ablation study on model components, comparisons with semi-supervised learning baselines, a diversity analysis, and a discussion of why PseudoReasoner works.

Ablation Study
We study the effects of different teacher models and the filters on pseudo labels.

Teacher Models
We compare four representative teacher models, KG-BERT (RoBERTa-large) and GPT2 (small, medium, and XL), on how the pseudo labels they provide influence model performance. The ablation results are shown in Table 4. Comparing KG-BERT (RoBERTa-large, 340M parameters) with GPT2-small (117M) and GPT2-medium (345M), which are of the same scale of model size, we find that GPT2 performs consistently better than KG-BERT as a teacher model. This is consistent with the out-of-distribution performance of the supervised learning baselines in Table 3, where KG-BERT performs almost 3 points behind GPT2-medium in terms of OOD AUC. This ablation indicates the importance of powerful, generalizable teacher models for pseudo labeling.

Thresholding
We study the sensitivity of the thresholds in Figure 3. In this ablation, for simplicity, we vary T+_max for positive pseudo examples and sample the same number of triples as in the original training set whose α(x) < T+_max, in descending order. We do the same ablation on T−_min for negative pseudo examples. When tuning one threshold, the other thresholds are fixed as in Section 5.2. The pseudo labels under different thresholds are directly used for PseudoReasoner without filtering, and we plot the test set AUC under each threshold. We see that the resulting AUC is stable within certain ranges of T+_max and T−_min. However, when we set T−_min to −∞, meaning no lower threshold is set for negative examples, the performance drops drastically.

Pseudo Label Filter
We conduct experiments with different combinations of filtering mechanisms in Table 5 for KG-BERT (RoBERTa-large). Both filters (influence function and KG-BERT probability) benefit the model performance, while the KG-BERT probability contributes the more substantial improvement. Figure 4 illustrates the KG-BERT plausibility α(x) = P_θ*(y=1|x) of positive/negative pseudo examples provided by GPT2. The positive pseudo examples tend to be scored higher by KG-BERT than the negative ones. Adding the KG-BERT probability as an additional filter on the labels provided by GPT2 resembles an ensembling procedure.

Comparisons with Other Semi-supervised Learning Methods

Computational Cost
The numbers of pseudo examples used for the semi-supervised learning baselines are listed in Table 6. We use roughly the same scale of unfiltered pseudo examples for G-DAUG and PseudoReasoner, while using 3 times more unlabeled examples for UDA as it requires more unlabeled data. Moreover, G-DAUG (COMET-distill) leverages the distilled knowledge from GPT3 (West et al., 2022), which further magnifies the computational cost by orders of magnitude. In all, under the same scale of pseudo labels, PseudoReasoner achieves far better results than UDA and G-DAUG.

Figure 5: Diversity analysis with the proportion of unique uni/bi-grams in the labeled dataset, the pseudo labels, and the filtered pseudo labels. With filtering, the diversity can be significantly improved.

Analysis
In UDA, though robustness can be improved with a consistency loss on noised inputs, no new commonsense knowledge is added to the training procedure, making it hard for the model to acquire novel knowledge reasoning ability. For G-DAUG, the critical part lies in the generation of negative examples. We finetune two separate GPT2 models on D_l^+ and D_l^-, and the one finetuned on D_l^- is used to generate negative examples. Compared with the GPT2 finetuned on D_l^+, the GPT2 finetuned on D_l^- is of relatively lower quality, as triples in D_l^- do not follow specific commonsense patterns. We check the text generation quality of GPT2-XL finetuned on D_l^+ and D_l^-, and find that the BLEU-2 scores on the two corresponding held-out test sets are 0.23 and 0.10, indicating that the generated negative examples are of lower quality.
For Noisy-student, the main differences from our PseudoReasoner are that it uses KG-BERT to provide pseudo labels, and that the pseudo labels are soft. Since soft labels are used, the teacher model has to be a discriminative model such as KG-BERT, which has poorer generalization ability than the GPT2 used in our PseudoReasoner. Moreover, similar to the case of UDA, noise is not a dominant factor in CSKB Population; high-quality novel commonsense knowledge matters more.

Semantic Diversity Analysis
An important contribution of PseudoReasoner is that we extend the knowledge space for training from limited CSKBs to a broader unlabeled resource. We use the proportion of unique uni-grams and bi-grams as an indicator of semantic diversity to measure the extent to which models are exposed to diverse novel knowledge. Figure 5 shows that after filtering with the influence function and KG-BERT probability, the diversity is around 3 to 4 times that of the labeled dataset.
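The diversity measure can be sketched as the fraction of unique n-grams over all n-gram occurrences in a triple set:

```python
def ngram_diversity(sentences, n=2):
    """Proportion of unique n-grams among all n-gram occurrences, used as
    a rough proxy for the semantic diversity of a set of serialized triples."""
    total, seen = 0, set()
    for s in sentences:
        toks = s.split()
        for i in range(len(toks) - n + 1):
            total += 1
            seen.add(tuple(toks[i:i + n]))
    return len(seen) / total if total else 0.0

repetitive = ["PersonX eats lunch"] * 3
diverse = ["PersonX eats lunch", "PersonY reads a book", "PersonZ rides a bike"]
print(ngram_diversity(repetitive), ngram_diversity(diverse))
```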

Relationship with Knowledge Distillation
While knowledge distillation focuses on distilling knowledge from larger models into smaller ones, our method does not require the teacher model to be larger. The teacher GPT2-medium, which is of the same size as the student KG-BERT, works well and is comparable to GPT2-XL.

Conclusion
In this paper, we propose a semi-supervised learning framework for CSKB Population based on pseudo labels. Using a teacher model and a special filtering mechanism on pseudo labels, we achieve the state-of-the-art on CSKB Population in terms of both in-domain and out-of-domain performance. Experiments also show that CSKB Population benefits more from high-quality novel knowledge than from techniques used in other semi-supervised learning methods such as noise injection and consistency training. This work brings a new perspective on improving out-of-domain generalizable commonsense reasoning over CSKBs.

Limitations
The main limitation is that the labeling procedure of the teacher model still requires hard thresholds to determine whether a label is 0 or 1. This limits the method from generalizing across different tasks, as thresholds would have to be tuned separately for each. One direction for future work is to learn the thresholds automatically, or to improve the teacher-student interaction by training them jointly in an end-to-end manner, as in Meta Pseudo Labels (Pham et al., 2021).

A Additional Details on the CSKB Population Benchmark
Commonsense Knowledge Base (CSKB) Population aims at automatically populating commonsense knowledge defined in source CSKBs onto an unlabeled candidate eventuality knowledge graph, ASER. In this section, we provide additional details about the selection of CSKBs and the eventuality knowledge graph, and about the evaluation of CSKB Population.

A.1 CSKB
CSKB Population studies commonsense relations among general events. To form a general aligned CSKB, Fang et al. (2021a) selected ConceptNet (Liu and Singh, 2004) (the event-related relations are selected), ATOMIC (Sap et al., 2019), ATOMIC 2020 (Hwang et al., 2021) (the newly added relations beyond ATOMIC are selected), and GLUCOSE (Mostafazadeh et al., 2020). These four CSKBs are in free-text form and are structured as (h, r, t) triples. Normalization is conducted to align the four CSKBs together. The overall structure of the aligned CSKB is based on the format of ATOMIC, where events are person-centric sentences with PersonX and PersonY as subjects. In ConceptNet, PersonX is prepended to the predicate-object pairs to make them complete sentences. For example, the ConceptNet triple (lie, HasSubEvent, make up story) is converted to (PersonX lies, HasSubEvent, PersonX makes up story). SomeoneA and SomeoneB are converted to PersonX and PersonY accordingly in GLUCOSE. The relationships in GLUCOSE are converted to ATOMIC formats according to the official conversion rules defined in Table 7 of Mostafazadeh et al. (2020).
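The PersonX-prepending step can be sketched as below; the naive "+s" verb inflection is a simplification for illustration, not the actual normalization pipeline:

```python
def conceptnet_to_atomic(head, relation, tail):
    # Prepend PersonX to predicate-object phrases, with a naive third-person
    # inflection of the leading verb (illustrative only; the real pipeline
    # handles irregular verbs and other cases).
    def personify(phrase):
        verb, *rest = phrase.split()
        return " ".join([f"PersonX {verb}s"] + rest)
    return (personify(head), relation, personify(tail))

print(conceptnet_to_atomic("lie", "HasSubEvent", "make up story"))
```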
A summary of the relations studied in the aligned CSKBs is shown in Table 7.

A.2 ASER
The candidate knowledge graph for populating commonsense knowledge, ASER (Zhang et al., 2020, 2022), is a large-scale eventuality-centric knowledge graph that provides explicit discourse relationships between eventualities. The core part of ASER is used, which includes 10M discourse edges among 27M eventualities. An example of an ASER edge is ("I am hungry", Result, "I have lunch"), and such a discourse edge can serve as a potential candidate for a commonsense triple. The detailed conversion rules from discourse relations to commonsense relations are provided in Fang et al. (2021a). These conversion rules bring the total number of unlabeled candidate edges to around 200M.

A.3 Evaluation Set
The ground-truth commonsense triples from the CSKBs are split into train, development, and test sets using the original splits from their respective papers. For example, in ATOMIC, the test split is based on the skeleton words of the heads, so that similar heads fall into the same split. The test set here is denoted as the original test set.
The final evaluation set comprises three parts: one sampled from the automatically constructed original test set above (denoted "Original Test Set"), one sampled from edges whose heads are from CSKBs and whose tails are from ASER (denoted "CSKB head + ASER tail"), and one sampled solely from ASER (denoted "ASER edges"). The final evaluation set contains around 32K triples, which are manually annotated using Amazon Mechanical Turk.

B Influence Function
The influence function gives an efficient approximation of the above quantity. The idea is to compute the change in the optimized parameters, and hence in the validation loss, if $z$ were upweighted by a small $\epsilon$. It introduces the new parameters
$$\theta^*_{\epsilon,z} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta).$$
Cook and Weisberg (1982) give the influence of upweighting $z$ on the optimized parameters $\theta^*$:
$$\mathcal{I}_{\text{up,params}}(z) = \left.\frac{d\theta^*_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\theta^*}^{-1} \nabla_\theta L(z, \theta^*),$$
where $H_{\theta^*} = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta L(z_i, \theta^*)$ is the Hessian matrix. Thus, applying the chain rule, the influence of upweighting $z$ on the validation loss is
$$\mathcal{I}_{\text{up,loss}}(z) = \nabla_\theta L(Z_{\text{val}}, \theta^*)^\top \mathcal{I}_{\text{up,params}}(z) = -\nabla_\theta L(Z_{\text{val}}, \theta^*)^\top H_{\theta^*}^{-1} \nabla_\theta L(z, \theta^*).$$
Since removing $z$ is equivalent to upweighting it by $\epsilon = -\frac{1}{n}$, which is small when $n$ is large, we can linearly approximate $L(Z_{\text{val}}, \theta^*) - L(Z_{\text{val}}, \theta^*_{-z})$ in Equation (4) as $-\epsilon \mathcal{I}_{\text{up,loss}}(z) = \frac{1}{n}\mathcal{I}_{\text{up,loss}}(z)$. In practice, we approximate the inverse Hessian-vector product (HVP) with LiSSA (Agarwal et al., 2017), following Koh and Liang (2017).
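A minimal numeric sketch of this computation for a ridge-regularized linear model, where the gradient and Hessian are available in closed form. The model, data, and regularization strength are illustrative assumptions, not the paper's KG-BERT setting; for large models the inverse Hessian-vector product would be approximated with LiSSA rather than a direct solve.

```python
import numpy as np

def loss_grad(w, x, y, lam=0.1):
    # Gradient of 0.5 * (w.x - y)^2 + 0.5 * lam * ||w||^2 w.r.t. w.
    return (w @ x - y) * x + lam * w

def hessian(X, lam=0.1):
    # Hessian of the average training loss: (1/n) X^T X + lam * I.
    n, d = X.shape
    return X.T @ X / n + lam * np.eye(d)

def influence_on_val_loss(w_star, X_tr, z_x, z_y, X_val, y_val, lam=0.1):
    # I_up,loss(z) = -grad L(Z_val, w*)^T  H^{-1}  grad L(z, w*).
    H = hessian(X_tr, lam)
    g_val = np.mean([loss_grad(w_star, x, y, lam)
                     for x, y in zip(X_val, y_val)], axis=0)
    g_z = loss_grad(w_star, z_x, z_y, lam)
    return float(-g_val @ np.linalg.solve(H, g_z))
```

A positive influence value means upweighting $z$ increases the validation loss, so removing $z$ (i.e., $\epsilon = -\frac{1}{n}$) is expected to decrease it by roughly $\frac{1}{n}\mathcal{I}_{\text{up,loss}}(z)$, which is the criterion for filtering $z$ out.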

C Additional Details on Hyper-parameters
We present additional details on the hyper-parameters used for thresholding the GPT2-XL teacher model and on the backbone models.

C.1 Thresholding in Pseudo Labels
We present the GPT2-XL plausibility scores on the unlabeled $D_u$ in Figure 6
Here, [CLS] and [SEP] are the special tokens in BERT-based models: [CLS] represents the whole sentence, and [SEP] separates different sentences. $h_1, ..., h_{|h|}$ are the tokens of the head $h$, and $t_1, ..., t_{|t|}$ are the tokens of the tail $t$. [r] is registered as a new special token for each relation $r$.
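The serialization can be sketched as follows. The exact ordering of the special tokens is an assumption based on the description above; the released implementation may differ in detail.

```python
# Sketch of the triple serialization described above. The token order
# [CLS] head [r] tail [SEP] is an assumption based on the text.

def serialize_triple(head_tokens, relation, tail_tokens):
    """Serialize a (h, r, t) triple into a KG-BERT-style token sequence."""
    # [r] is a relation-specific special token registered in the vocabulary.
    return ["[CLS]"] + head_tokens + [f"[{relation}]"] + tail_tokens + ["[SEP]"]
```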
Two data augmentation methods, back-translation and TF-IDF word replacement, are introduced for natural language tasks. We use both augmentation methods and compare the results. For each augmentation method, the original serialized unlabeled candidates x are duplicated and augmented; both the original and augmented candidates are then used for consistency training.
For back-translation, we employ OPUS translation models between English and French. Specifically, in each candidate, "PersonX" is replaced with Alice, "PersonY" with Charles, and "PersonZ" with Francisco; the candidates are then translated to French and back to English. Alice, Charles, and Francisco are common first names shared by French and English and remain unchanged after back-translation, ensuring that the personal placeholders PersonX, PersonY, and PersonZ in the unlabeled candidates can be fully recovered afterwards. We also observe that almost all newly registered relation tokens survive the back-translation process, giving KG-BERT the opportunity to get familiar with these tokens during consistency training. For TF-IDF replacement, we set the replacement probability to 0.1 to avoid destroying the main semantics of the original sentences.
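The name-protection step around back-translation can be sketched as below. The French round trip itself (via the OPUS English-French models) is omitted here; only the placeholder substitution is shown.

```python
# Sketch of the person-token protection around back-translation.
# The actual translation round trip (OPUS EN<->FR models) is omitted.

NAME_MAP = {"PersonX": "Alice", "PersonY": "Charles", "PersonZ": "Francisco"}

def protect(text):
    """Replace person placeholders with names stable under translation."""
    for token, name in NAME_MAP.items():
        text = text.replace(token, name)
    return text

def restore(text):
    """Recover the person placeholders after back-translation."""
    for token, name in NAME_MAP.items():
        text = text.replace(name, token)
    return text
```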
We present more hyper-parameter tuning details for UDA here. The objective function of UDA is given in Equation (7), where $p_L$ and $p_U$ are the distributions of the labeled and unlabeled datasets, $y_1$ is the label of $x_1$, $p_\theta$ is the probability predicted by the backbone model with parameters $\theta$, $q(x|x_2)$ is a data augmentation transformation, $\bar{\theta}$ is a fixed copy of the current parameters $\theta$, and CE denotes cross entropy:
$$\min_\theta \; \mathbb{E}_{(x_1, y_1)\sim p_L}\left[\mathrm{CE}\left(y_1, p_\theta(y|x_1)\right)\right] + \lambda\, \mathbb{E}_{x_2\sim p_U}\,\mathbb{E}_{\hat{x}\sim q(x|x_2)}\left[\mathrm{CE}\left(p_{\bar{\theta}}(y|x_2), p_\theta(y|\hat{x})\right)\right]. \quad (7)$$
$\lambda$ controls the weight of the unsupervised consistency loss. Besides $\lambda$, the training process has another parameter $r$, the ratio of unsupervised examples in a batch. We conduct a grid search to find the best combination of $\lambda$ and $r$, as shown in Table 8, when using TF-IDF replacement as the augmentation method. We report the results of the best combinations in the main text of the paper. The original implementation of UDA is based on TensorFlow. We re-implement their code with
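A minimal numeric sketch of the UDA objective, with predicted distributions supplied directly rather than produced by the backbone. In real training, gradients are stopped through the fixed-parameter predictions $p_{\bar{\theta}}$.

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    """Cross entropy between a target distribution and a prediction."""
    return -np.sum(target * np.log(pred + eps), axis=-1)

def uda_loss(y_sup, p_sup, p_unsup_fixed, p_unsup_aug, lam=1.0):
    """Supervised CE plus lambda-weighted consistency CE, as in Equation (7)."""
    sup = cross_entropy(y_sup, p_sup).mean()
    # p_unsup_fixed plays the role of p_theta-bar(y|x2): in actual training
    # no gradient flows through it.
    cons = cross_entropy(p_unsup_fixed, p_unsup_aug).mean()
    return sup + lam * cons
```

With $\lambda = 0$ this reduces to plain supervised training; increasing $\lambda$ raises the weight of the consistency term between predictions on an unlabeled candidate and its augmented copy.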

Figure 1 :
Figure 1: An example of CSKB Population. The coral part (left) and the blue part (right) represent the labeled CSKBs and the unlabeled candidate pool, respectively. Entities in the overlapping part are marked with coral shapes and blue outlines. Reasoning within the CSKB (coral-outlined boxes) belongs to CSKB Completion, while reasoning not limited to the domain of CSKBs belongs to CSKB Population.

Figure 2 :
Figure 2: An end-to-end workflow of PseudoReasoner. The four steps in the figure are elaborated in Section 4.
Filtering out detrimental training examples with the influence function (Koh and Liang, 2017) can boost model performance, as shown in Yang et al. (2020) and Han et al. (2020). A training example $z = ((h, r, t), y)$ will hurt the generalization ability of the model if including $z$ in the training set results in a higher validation loss. Denote $L(Z, \theta)$ as the loss function of dataset $Z$ under the parameter set $\theta$. Then the loss under the training set $Z_{\text{train}}$ is given in Equation (3):
$$L(Z_{\text{train}}, \theta) = \frac{1}{|Z_{\text{train}}|}\sum_{i=1}^{|Z_{\text{train}}|} L(z_i, \theta). \quad (3)$$
Denote $\theta^*$ as the optimized parameters after training the model on $Z_{\text{train}}$, and $\theta^*_{-z}$ as the optimized parameters after training the model on $Z_{\text{train}} - \{z\}$. Denote $Z_{\text{val}}$ as the validation set. The empirical criterion to determine $z$ as a detrimental training example is Equation (4):
$$L(Z_{\text{val}}, \theta^*) - L(Z_{\text{val}}, \theta^*_{-z}) > 0. \quad (4)$$

Figure 3 :
Figure 3: Ablation study on different $T^+_{\max}$ and $T^-_{\min}$.

Figure 4 :
Figure 4: KG-BERT plausibility distribution for positive/negative pseudo labels provided by GPT2.

Filtering out detrimental training examples with the influence function (Koh and Liang, 2017) can boost model performance, as shown in Yang et al. (2020) and Han et al. (2020). A training example $z = ((h, r, t), y)$ will hurt the generalization ability of the model if including $z$ in the training set results in a higher validation loss. Denote $L(Z, \theta)$ as the loss function of dataset $Z$ under the parameter set $\theta$. Then the loss under the training set $Z_{\text{train}}$ of cardinality $n$ is given in Equation (3) in the main body (also shown below):
$$L(Z_{\text{train}}, \theta) = \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta). \quad (3)$$
Denote $\theta^*$ as the optimized parameters after training the model on $Z_{\text{train}}$, and $\theta^*_{-z}$ as the optimized parameters after training the model on $Z_{\text{train}} - \{z\}$. Denote $Z_{\text{val}}$ as the validation set. The empirical criterion to determine $z$ as a detrimental training example is Equation (4) in the main body (also shown below):
$$L(Z_{\text{val}}, \theta^*) - L(Z_{\text{val}}, \theta^*_{-z}) > 0. \quad (4)$$

Figure 6 :
Figure 6: GPT2-XL plausibility distribution (−α(x)) of unlabeled data for the relations Causes, xWant, oEffect, and gReact. Triples within the red dashed lines are treated as negative pseudo examples, and triples within the green dashed lines as positive pseudo examples.
for four representative commonsense relations: Causes, xWant, oEffect, and gReact. We set the thresholds $T^-_{\min} = -4.0$, $T^-_{\max} = -3.7$, $T^+_{\min} = -2.8$, $T^+_{\max} = -2.0$ by roughly observing the data distribution and representative triples in different ranges of plausibility scores. The filtered triples are further sampled such that the number of pseudo examples per relation equals the number of original examples for the corresponding relation. The pseudo examples used in our experiments can be downloaded from our GitHub repository.
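The filtering rule can be sketched as below, using the thresholds reported above; the per-relation subsampling step is omitted.

```python
# Sketch of the threshold-based pseudo-label assignment described above,
# using the reported thresholds; scores are -alpha(x) from the GPT2-XL teacher.

T_NEG_MIN, T_NEG_MAX = -4.0, -3.7
T_POS_MIN, T_POS_MAX = -2.8, -2.0

def assign_pseudo_label(score):
    """Return 1/0 for positive/negative pseudo examples, None to discard."""
    if T_NEG_MIN <= score <= T_NEG_MAX:
        return 0
    if T_POS_MIN <= score <= T_POS_MAX:
        return 1
    return None
```

Scores between the two bands, or in the extreme tails, are discarded rather than pseudo-labeled, which keeps only the candidates the teacher is most decisive about.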

Table 1 :
Statistics of the labeled $D_l$ (CSKB with negative examples) and the unlabeled $D_u$ (processed ASER).

Table 2 :
The details of evaluation set categorization.

Table 3 :
Results on the test set of the CSKB Population benchmark. For supervised learning baselines, we report the result of KG-BERT

Table 6 :
Number of pseudo examples used in experiments for semi-supervised methods. The "Original" row indicates the number of training examples in the original training set.

Table 7 :
Relation distribution statistics for different CSKBs. The table is the same as Table 4 of Fang et al. (2021a).