MICO: A Multi-alternative Contrastive Learning Framework for Commonsense Knowledge Representation

Commonsense reasoning tasks such as commonsense knowledge graph completion and commonsense question answering require powerful representation learning. In this paper, we propose to learn commonsense knowledge representation with MICO, a Multi-alternative contrastive learning framework on COmmonsense knowledge graphs. MICO generates commonsense knowledge representations through contextual interaction between entity nodes and relations with multi-alternative contrastive learning. In MICO, the head and tail entities in an $(h,r,t)$ knowledge triple are converted to two relation-aware sequence pairs (a premise and an alternative) in the form of natural language. Semantic representations generated by MICO can benefit the following two tasks by simply comparing the distance scores between representations: 1) zero-shot commonsense question answering; 2) inductive commonsense knowledge graph completion. Extensive experiments show the effectiveness of our method.


Introduction
Commonsense reasoning is a fundamental problem in artificial intelligence. Recently in the NLP field, much attention has been paid to commonsense reasoning in the following two aspects. First, more commonsense knowledge graphs (CKGs) (Sap et al., 2019a; Fang et al., 2021) were developed to support new types of reasoning tasks, such as commonsense knowledge graph completion (CKGC) (Malaviya et al., 2020). Another way to evaluate machine learning models' commonsense reasoning capabilities is using commonsense question answering (CQA) tasks (Zellers et al., 2018; Sap et al., 2019b; Bisk et al., 2020). Existing approaches to the above problems commonly involve fine-tuning large pre-trained language models, such as BERT (Kenton and Toutanova, 2019), RoBERTa (Liu et al., 2019), and GPT2 (Radford et al., 2019), by either incorporating the entire knowledge base for CKGC (Yao et al., 2019; Bosselut et al., 2019) or injecting the knowledge base to provide background knowledge for zero-shot CQA (Banerjee and Baral, 2020; Bosselut et al., 2021; Ma et al., 2021).
In fact, both CKGC and zero-shot CQA can be formulated in a unified way: a question is constructed from the head entity and relation in a knowledge graph, and the tail entity, regarded as the answer, is then found based on the constructed question. In this way, incorporating the entire knowledge base for CKGC and injecting the KG into pre-trained LMs for zero-shot CQA can be unified as a semantic matching problem, where powerful representation learning for the matching becomes the most important problem. This also means that, after we unify CKGC and CQA in the same way, we can perform zero-shot CQA by simply leveraging a model fine-tuned on the entire CKG for CKGC.
Existing commonsense-related representation learning usually leverages a CKG embedding framework (Malaviya et al., 2020; Wang et al., 2021) or fine-tunes a generative language model (Bosselut et al., 2019). However, these approaches were not aware of the challenges that a typical CKG brings. First, in a typical CKG, such as ConceptNet (Liu and Singh, 2004) and ATOMIC (Sap et al., 2019a), nodes are loosely structured free-form texts, which means that previous embedding methods based on negative sampling cannot support sufficient training because of sparsity. On the other hand, a generative model can only take positive examples for training, so its capability of ruling out negative answers is limited.
In this paper, we propose a new framework called MICO, a Multi-alternative contrastIve learning framework for COmmonsense knowledge representation. The representations can benefit tasks across domains because semantic distances can be easily calculated with unified vector representations. In this way, though many distinct nodes may express similar concepts (Wang et al., 2021), they are still close in the semantic space. Figure 1 shows an example of this advantage on CQA. The entity node PersonX ask if PersonY had seen, which corresponds to the right answer, is not directly connected to the nodes PersonX leaves PersonX's book or PersonX leaves PersonY's book, but these nodes share similar tail entity nodes. Therefore, the right answer can be found by semantic matching, as it is close to the given context and question in the semantic space.
To unify the forms of CKGC and CQA, we follow the idea in COPA (Roemmele et al., 2011), where commonsense causal reasoning is evaluated by selecting the most plausible alternative given a premise. We first convert the knowledge triplets (h, r, t) into sequence pairs (P, A) (P for premise and A for alternative). MICO then encodes the sequence pairs into embeddings and measures their distance with a similarity function, as we assume the representations of related knowledge lie close in the embedding space. Furthermore, we enhance representation learning with a contrastive loss and sufficient sampling over the sparse CKG. Under the contrastive learning framework, the alternative from the same triplet is a positive sample for the premise, while alternatives from other knowledge triplets with different premises are negative samples. MICO also takes the structure of CKGs into consideration, where one head node h may connect to several tail nodes t under the same relation r: MICO dynamically selects a hard alternative from the multiple alternatives of a premise during training.
Experiments on two typical commonsense knowledge graphs and two types of tasks, zero-shot CQA and inductive CKGC, demonstrate the effectiveness of our methodology. Our code is open-sourced.

Related Work

Commonsense Question Answering
Background knowledge is necessary for commonsense question answering tasks, and much research resorts to knowledge bases for it. Work in this direction can be mainly classified into two streams: incorporating a knowledge base for zero-shot CQA (Yang et al., 2019; Banerjee and Baral, 2020; Bosselut et al., 2021; Ma et al., 2021) or retrieving related knowledge from a knowledge base for task-specific CQA (Paul and Frank, 2019; Lin et al., 2019; Feng et al., 2020; Lv et al., 2020; Yasunaga et al., 2021; Xu et al., 2021; Zhang et al., 2021).
Among the works incorporating a knowledge base for zero-shot CQA, COMET-DynaGen (Bosselut et al., 2021) aggregates all paths of commonsense knowledge generated by the commonsense transformer COMET (Bosselut et al., 2019) trained on CKGs. KTL (Banerjee and Baral, 2020) encodes knowledge triplets from CKGs into pre-trained LMs by learning triplet representations, aiming to complete a knowledge triplet given the other two elements. Unlike KTL, we target enhancing relation-aware representation learning in the form of natural language sequence pairs.

CKG Knowledge Representation
Knowledge representation from knowledge graphs (KGs) has significantly progressed and benefited the KG completion task. Typical methods for KG completion are mainly embedding-based, utilizing the structural information observed in the knowledge triplets (Nickel et al., 2011; Bordes et al., 2013; Wang et al., 2014; Trouillon et al., 2016; Toutanova et al., 2015; Sun et al., 2019). Recent research also shows that external information, such as textual descriptions of nodes or relations, can help boost performance on the task, as in ConvE (Dettmers et al., 2018) and ConvTransE (Shang et al., 2019). To transfer the knowledge from pre-trained LMs into knowledge graph completion, KG-BERT (Yao et al., 2019) further utilizes pre-trained LMs to learn context-aware embeddings. However, unlike previous KGs (Miller, 1995; Bollacker et al., 2008), commonsense knowledge graphs, e.g., ConceptNet and ATOMIC, pose unique challenges for the completion task. The nodes in CKGs are non-canonicalized, free-form text, resulting in magnitudes larger and sparser graphs (Malaviya et al., 2020). To address this problem, previous works extract entity and relation representations with pre-trained LMs and graph structure representations with graph neural networks such as GCN (Kipf and Welling, 2017) to enhance generalizability over entity nodes (Malaviya et al., 2020; Wang et al., 2021). Instead of fusing representations from local subgraph structures, in this paper we focus on utilizing the contrast information between knowledge triplet contexts.

Methodology
In this section, we introduce the terminologies and algorithms, and show the framework in Figure 2.

Knowledge Triplets to Sequence Pairs
A commonsense knowledge graph is denoted as G = {V, E, R}, where V is the set of entities, E is the set of edges, and R is the set of relations. A knowledge triplet e ∈ E is composed of (h, r, t), where head entity h and tail entity t are entities from V connected by relation r ∈ R. Each entity comes with a free-form text description.
To convert a knowledge triplet into sequence pairs as inputs to MICO, we substitute the relation with a human-readable language template and connect it to the entities. Typically, relations are represented as specific words or short phrases in the CKG, for example, xWant in ATOMIC and AtLocation in ConceptNet. Following Hwang et al. (2021), we design natural language templates to replace the original relations and connect them to entities, forming context-aware sequences. An example from ATOMIC is shown in Figure 3. The template for xWant is "as a result, PersonX wants". We also design a template for its reverse version $r^{-1}$ so it can be connected to the tail entity to form an additional sequence pair. Details of the substitute templates are listed in Appendix A.1. We denote the newly constructed sequence pair as (P, A). For a premise P, there may be multiple alternatives connected to it, denoted as {A_1, A_2, A_3, ...}.
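As a concrete sketch, the conversion above amounts to a template lookup per relation and direction. The template wording below is illustrative only (the paper's full list is in its Appendix A.1), and the reverse-direction template is a hypothetical example:

```python
# Hypothetical templates for one ATOMIC relation: the forward relation r
# and its reverse r^-1. The paper's exact wording may differ.
TEMPLATES = {
    "xWant": "{h} as a result, PersonX wants",
    "xWant_rev": "{t} before that, it happened that",
}

def triple_to_pairs(h, r, t):
    """Convert a triple (h, r, t) into (premise, alternative) sequence
    pairs for both the forward and the reverse relation direction."""
    forward = (TEMPLATES[r].format(h=h), t)            # (P, A) under r
    backward = (TEMPLATES[r + "_rev"].format(t=t), h)  # (P, A) under r^-1
    return [forward, backward]

pairs = triple_to_pairs("PersonX leaves PersonY's book", "xWant", "to apologize")
```

Each triple thus contributes two training pairs, which doubles the supervision extracted from the sparse graph.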

MICO
MICO is a multi-alternative contrastive learning framework for commonsense knowledge representation that takes knowledge sequence pairs as inputs. Recent research has made great progress in sequence representation learning with contrastive learning (Carlsson et al., 2020; He et al., 2020; Gao et al., 2021), in which positive sequence pairs are considered semantically related and should be close neighbors in the embedding space. MICO follows this idea and minimizes the distance between the premise P and its connected alternative A.
First, a transformer encodes the constructed sequence pairs (P, A) into embeddings to extract initial representations. Specifically, when using BERT as the transformer, BERT-specific start and end tokens are padded to the input sequences:

$P_{tok} = \mathrm{[CLS]}\, P \,\mathrm{[SEP]}, \quad A_{tok} = \mathrm{[CLS]}\, A \,\mathrm{[SEP]}$. (1)

The sequence pairs are thus transformed into token sequences $P_{tok}$ and $A_{tok}$ by the transformer tokenizer. To get the initial representations, a transformer encoder encodes the token sequences as $E_h = \mathrm{Encoder}(P_{tok})$ and $E_t = \mathrm{Encoder}(A_{tok})$, where $E_h$ and $E_t$ are the hidden states of the last layer. The hidden state of the [CLS] token is used as the representation of the input sequence. For the head and tail sequences, the representations are

$s = E_h^{\mathrm{[CLS]}}, \quad g = E_t^{\mathrm{[CLS]}}$. (2)

As we assume related sequence pairs lie close in the embedding space, we use a similarity function to measure the distance between them. The function $f$ can be cosine similarity or dot product:

$\mathrm{sim}(s, g) = f(s, g)$. (3)

For a premise $P$, its paired alternative $A$ is a positive sample, while alternatives paired with other premises in the same batch are negative samples during training. To minimize the semantic distance within the $i$-th sequence pair in a batch, the contrastive loss is

$\ell_i = -\log \frac{\exp(\mathrm{sim}(s_i, g_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(s_i, g_j)/\tau)}$, (4)

where $N$ is the batch size and $\tau$ is the temperature parameter.
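As a runnable sketch of this in-batch contrastive loss, assuming cosine similarity as $f$ and a batch of N premise/alternative embedding pairs (NumPy stands in for the actual deep learning framework):

```python
import numpy as np

def info_nce_loss(s, g, tau=0.07):
    """In-batch contrastive loss: for premise i, alternative g[i] is the
    positive and every g[j] with j != i is a negative.
    s, g: arrays of shape (N, d) holding [CLS] embeddings."""
    s = s / np.linalg.norm(s, axis=1, keepdims=True)  # normalize -> cosine sim
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    logits = (s @ g.T) / tau                          # (N, N) scaled similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))      # -log p(positive), averaged

rng = np.random.default_rng(0)
loss = info_nce_loss(rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```

When premise and alternative embeddings are perfectly aligned, the loss approaches zero; random embeddings give a loss near log N.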
Many research efforts aim to improve representation learning by generating multiple views of the same sample as data augmentation in multi-view contrastive learning (Bachman et al., 2019; Tian et al., 2020; Niu et al., 2022). In a CKG, a sequence head P may have multiple positive tails {A_1, A_2, A_3, ...}. Inspired by multi-view contrastive learning, we propose a multi-alternative framework that utilizes the multiple positive alternatives to improve the learning of commonsense knowledge representation. Specifically, we dynamically sample a hard positive alternative from the multiple alternatives during training.
The representations generated for the premise and its alternatives are $s$ and $\{g^+_1, g^+_2, g^+_3, \dots\}$. Among the multiple positive alternatives, the one with the largest distance to the premise is selected as the hard positive. Because we aim to minimize the semantic distance between the premise and its alternatives, selecting the alternative with the least similarity increases the training loss:

$g^+ = \arg\min_{o=1,\dots,k} \mathrm{sim}(s, g^+_o)$, (5)

where $k$ is the number of candidate alternatives during training. The new contrastive loss for the $i$-th sequence pair during training is

$\ell_i = -\log \frac{\exp(\mathrm{sim}(s_i, g^+_i)/\tau)}{\exp(\mathrm{sim}(s_i, g^+_i)/\tau) + \sum_{j=1}^{N} \sum_{o=1}^{k} \delta_{ij} \exp(\mathrm{sim}(s_i, g^+_{j,o})/\tau)}$, (6)

where $\delta_{ij} \in \{0, 1\}$ is an indicator that equals 1 if $i \neq j$, and $g^+_{j,o}$ is the $o$-th positive tail of the $j$-th sample in the batch.
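The hard-positive selection described above can be sketched as follows, assuming cosine similarity over NumPy embedding vectors: among a premise's k positive alternatives, the least similar one is chosen.

```python
import numpy as np

def hard_positive_index(s, alternatives):
    """s: (d,) premise embedding; alternatives: (k, d) positive tail embeddings.
    Returns the index of the least-similar positive (the hard positive)."""
    s = s / np.linalg.norm(s)
    a = alternatives / np.linalg.norm(alternatives, axis=1, keepdims=True)
    return int(np.argmin(a @ s))  # lowest cosine similarity = hardest positive

idx = hard_positive_index(
    np.array([1.0, 0.0]),
    np.array([[1.0, 0.1], [0.0, 1.0], [0.9, 0.5]]),
)
# the orthogonal second alternative is the hardest positive
```

Selecting the hardest positive per step keeps the loss from collapsing onto the easiest tail when a head node has many neighbors.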

Experiments
In this section, we first introduce the CKGs used as knowledge sources, and then the two kinds of evaluation tasks (zero-shot CQA and inductive CKGC). Finally, we introduce the baseline methods for the two tasks separately.
ConceptNet. ConceptNet has been the most fundamental commonsense knowledge graph over the past decade (Liu and Singh, 2004). CN-100K was built on the knowledge triplets in ConceptNet and first introduced in Li et al. (2016). It contains the Open Mind Common Sense (OMCS) (Singh et al., 2002) entries in ConceptNet5 (Speer et al., 2017). CN-82K is a uniformly sampled version of the CN-100K dataset that contains more unseen entities in the test split (Wang et al., 2021).

Evaluation Tasks
Since CKGC and CQA can be unified into the same form of selecting alternatives given a premise, the commonsense knowledge representations generated by MICO are evaluated on these two tasks.

Zero-shot CQA
The knowledge representation is evaluated on three multiple-choice CQA tasks: COPA (Roemmele et al., 2011), SIQA (Sap et al., 2019b), and CSQA (Talmor et al., 2019). Accuracy is used as the evaluation metric. For each task, the query composed of the context and question is converted into the form of a premise, and the answers are viewed as plausible alternatives. In this way, a multiple-choice question can be solved by selecting the closest representation pair generated by MICO given the query and candidate answers. We denote the representation of the query as $q$ and of the candidate answers as $\{a_1, a_2, \dots, a_m\}$. The answer with the highest score is the predicted answer $i^*$, where

$i^* = \arg\max_{i=1,\dots,m} \mathrm{sim}(q, a_i)$. (7)

COPA. Choice of Plausible Alternatives is a two-way multiple-choice commonsense reasoning task between events. COPA consists of 1,000 questions, 500 in the development set and 500 in the test set. To make the form of relations consistent with the training dataset in natural language, we substitute cause with "The cause for it was that" and effect with "As a result".
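The selection rule in Eq. (7) can be sketched as follows. The toy character-count encoder is purely a stand-in for the MICO encoder, which in the real system produces transformer [CLS] embeddings:

```python
import numpy as np

def toy_encode(text):
    """Stand-in encoder: letter-frequency vector, L2-normalized.
    Illustrative only; MICO uses fine-tuned transformer embeddings."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def select_answer(query, answers):
    """Eq. (7): return the index of the answer most similar to the query."""
    q = toy_encode(query)
    scores = [float(toy_encode(a) @ q) for a in answers]
    return int(np.argmax(scores))

best = select_answer("look for the book", ["search for the book", "zzz zzz"])
```

Because prediction is a single similarity comparison, no task-specific fine-tuning is needed, which is what makes the evaluation zero-shot.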
SIQA. The queries in Social IQA are collected based on ATOMIC. Each question in the dataset describes social interactions and has three crowdsourced candidate answers. The dataset's development and test splits are used for zero-shot evaluation, containing 1,954 and 2,059 questions, respectively.
CSQA. The questions in CommonsenseQA are general questions about concepts from ConceptNet. Each question has five candidate answers. The development set is used as the evaluation set, containing 1,221 questions.

Inductive CKGC
Inductive CKGC is an important task for CKGs because unseen entity nodes are introduced in real-world CKGs from time to time, and many distinct nodes may refer to the same concept due to their free-form text descriptions (Wang et al., 2021). In the inductive CKGC task, at least one of the nodes in each knowledge triplet does not appear in the training dataset. Following Wang et al. (2021), each triplet (h, r, t) is measured in two directions: (h, r, ?) and (t, $r^{-1}$, ?). Inverse relations $r^{-1}$ are added as additional relation types. We use MRR (mean reciprocal rank) and Hits@10 as the evaluation metrics.
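Both metrics can be computed directly from the 1-based rank of the gold entity for each query, as in this short sketch:

```python
import numpy as np

def ckgc_metrics(ranks, k=10):
    """ranks: 1-based rank of the gold tail (or head) for each query.
    Returns (MRR, Hits@k)."""
    ranks = np.asarray(ranks, dtype=float)
    mrr = float(np.mean(1.0 / ranks))   # mean reciprocal rank
    hits = float(np.mean(ranks <= k))   # fraction ranked within top k
    return mrr, hits

mrr, hits10 = ckgc_metrics([1, 2, 15, 4])
```

In the two-direction setup above, the rank list simply contains two entries per triplet, one for (h, r, ?) and one for (t, $r^{-1}$, ?).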

Implementation Details
Our experiments are run on an RTX A6000; each experiment uses a single GPU card. The training batch size is 196 and the max sequence length for training is 32. The learning rate is set to 1e-5 for BERT-base and RoBERTa-base, and to 5e-6 for RoBERTa-large (RoBERTa-L). We use the AdamW (Loshchilov and Hutter, 2018) optimizer. For experiments with the MICO framework, τ is set to 0.07. The validation set is evaluated with the contrastive loss and used to select the best model for further evaluation. The models are trained for 10 epochs and early-stopped when the change in validation loss is within 1%.

Main Results
The main results include MICO on the zero-shot CQA and the inductive CKGC.

Results on Zero-shot CQA
The results on the CQA tasks are shown in Table 2. Systems based on pre-trained language models such as RoBERTa-L and GPT2-L provide strong baselines: simply comparing the language model scores from RoBERTa-L or GPT2-L outperforms random guessing by a large margin. This shows that pre-trained language models encode useful knowledge that can benefit the CQA tasks.
MICO generates knowledge representations encoded with commonsense knowledge by fine-tuning LMs with a self-supervision signal from CKGs. Compared with methods such as Self-talk (Shwartz et al., 2020) and Dou (Dou and Peng, 2022), our method performs better on all the evaluation datasets. Self-talk and Dou utilize pre-trained language models as the knowledge source and mine relevant knowledge that may benefit the CQA tasks; however, such knowledge is still not sufficient. By fine-tuning on CKGs, MICO successfully injects commonsense knowledge into pre-trained LMs and generates meaningful representations that benefit the CQA tasks.
MICO provides an efficient way to inject CKGs into pre-trained LMs. Trained solely on one knowledge source, MICO achieves comparable performance to, or outperforms, KRL (Banerjee and Baral, 2020) and COMET-DynaGen (Bosselut et al., 2021). KRL encodes the elements of a knowledge triplet into embeddings separately and then fuses two of them to predict the third. Compared to KRL, MICO generates knowledge representations for sequence pairs, in which the relation interacts better with the entity nodes as they are concatenated at the contextual level. COMET-DynaGen solves CQA tasks by utilizing clarifications generated by COMET. However, COMET is a generative model and often introduces novel entities (Wang et al., 2021), which may not be related to the query. Compared to COMET-DynaGen, MICO solves the CQA task by simply generating CKG-related representations and comparing their similarity, which also saves the cost of generating multi-step clarifications.
Another finding is that the representations generated by MICO generalize easily to out-of-domain datasets. SIQA achieves the best results when ATOMIC is used as the knowledge source, and CSQA achieves the best results when ConceptNet is used, because SIQA is built on ATOMIC and CSQA on ConceptNet. MICO still benefits COPA, which requires commonsense knowledge but is not closely related to the two knowledge sources. This shows that the knowledge representations generated by MICO can generalize across tasks.

Results on Inductive CKGC
MICO enhances the commonsense representation by the contrast information between knowledge triplets and can generalize to unseen entity nodes.
Results on the inductive CKGC task are shown in Table 3. Previous methods such as ConvE (Dettmers et al., 2018) and RotatE (Sun et al., 2018) rely on relation links between entities to learn entity embeddings, and therefore perform poorly when new entities arrive with no links to existing nodes. Methods such as Malaviya (Malaviya et al., 2020) and InductivE (Wang et al., 2021) apply pre-trained LMs to initialize the node embeddings and then focus on utilizing subgraph structure to improve the generalizability of node features with a GCN. However, CKGs are sparse, with an average node degree of roughly 2 for both graphs. MICO thus focuses on learning the contextual information of node entities, achieving better performance than InductivE on ATOMIC while the opposite holds on ConceptNet. The entity nodes contain 3.93 words on average in ConceptNet and 6.12 words on average in ATOMIC. MICO encodes the textual descriptions of nodes with pre-trained LMs, and longer word sequences result in more distinguishable node features. This may explain why MICO performs better on ATOMIC than on ConceptNet compared to InductivE. InductivE relies on learning the neighboring graph structure with a GCN; however, the entity nodes in ATOMIC are more complex than those in ConceptNet, so capturing the graph structure alone is not enough to learn good commonsense representations.

Ablation Study and Analysis
In this part, we analyze the influence of backbone models, the number of candidate positive tails k, and hard positive selection in MICO. For evaluation on CQA tasks, the results are reported on the development sets of SIQA and CSQA, and on the combined development and test sets of COPA.

Backbone Pre-trained LMs
The results with different backbone models are shown in Table 4. MICO trained on different backbone models shows a consistent pattern on the three commonsense QA tasks. First, MICO trained with CKGs outperforms baseline models without any CKG knowledge. Second, MICO trained with ConceptNet achieves better performance on CSQA, while MICO trained with ATOMIC achieves better performance on COPA and SIQA.

Hyper-parameter k
In this part, we study how the number of positive tails k influences MICO. For simplicity, we study the influence of k on both graphs with BERT-base as the backbone model. The performance on the CQA tasks and inductive CKGC tasks under different k is shown in Table 5 and Table 6. MICO generally performs better on CSQA when trained on ConceptNet and on SIQA when trained on ATOMIC, because the questions in each task are more related to the knowledge in the corresponding CKG.
The performance on inductive CKGC mostly increases as k increases, indicating that a larger k helps the model generalize better in pairing in-domain knowledge sequences. However, for ConceptNet, the performance drops when k is greater than 3. The limited average degree of nodes in ConceptNet may explain this, as a larger k does not introduce new candidate tails; the model instead tends to overfit the seen nodes.

Sampling Strategy
We analyze the influence of selecting a hard positive compared with randomly sampling a positive from the candidate set. The results are shown in Table 7. The experiments are conducted with BERT-base as the backbone model and k = 2. Compared to random sampling, MICO mostly performs better on the three datasets, indicating that hard positives during training benefit the generalization of the representations. The only exception is training on ATOMIC and testing on SIQA; a possible explanation is a distribution gap between the training dataset and SIQA. In general, however, our sampling strategy improves the generalizability of the representations.

Discussion
This section shows that the transformers from MICO construct commonsense representations for CKGs that benefit commonsense knowledge retrieval given queries. An example query from SIQA and the retrieved possible alternatives from ATOMIC are shown in Table 8. We first encode all the alternatives in the CKG with the transformer fine-tuned on ATOMIC and with the original transformer without any fine-tuning. The top-5 retrieved nodes are listed by similarity score in descending order. The transformer fine-tuned on the CKG successfully pairs the query with reasonable alternatives from the CKG, unlike the original pre-trained transformer. Therefore, our method provides an efficient way to collect related knowledge from a CKG and may benefit research that requires retrieved implicit background knowledge to reason over.
However, the representations generated by MICO still have some drawbacks, as shown in the results: "Jordan look for it at library" would be a more reasonable node than "Jordan look for it at home". This shows that the representations still need future work to distinguish detailed concepts.

Conclusion
In this paper, we propose MICO, a multi-alternative contrastive learning framework over commonsense knowledge graphs to learn commonsense knowledge representation. The framework converts knowledge triplets into sequence pairs and learns superior knowledge representations through contrastive learning. The generated representations perform well on zero-shot CQA tasks and inductive CKGC tasks. Furthermore, for CQA tasks, related knowledge can be provided by simply retrieving the commonsense knowledge representations of CKGs.

Acknowledgments

…laboration Fund (BZ2021065). We also thank the support from the UGC Research Matching Grants (RMGS20EG01-D, RMGS20CR11, RMGS20CR12, RMGS20EG19, RMGS20EG21).

Limitations
As shown in the discussion, the commonsense knowledge representation generated by MICO captures the rough meaning of a whole word sequence. For detailed concepts, however, the representation fails to discriminate, since it is extracted from a single token to represent the meaning of the whole sequence. Concepts are key elements in semantics, so future work is still needed to improve the representation.
(Example from Figure 1) Context: Jordan left their book in the library after studying all day. Question: What will Jordan want to do next? Answer A: Ask people if they've seen the book. Answer B: Go home. Answer C: Go to the library and hide the book.

Figure 3: An example of converting a knowledge triplet (h, r, t) to sequence pairs (P, A) from ATOMIC. $r^{-1}$ is the reverse relation of r.

Table 1: Statistics of train, valid, and test sequence pairs from ConceptNet and ATOMIC. Avg Degree is the average number of tail sequences connected to a head sequence, and Avg Words is the average number of words in head and tail sequences.

ATOMIC. ATOMIC (Sap et al., 2019a) contains rich social commonsense knowledge about day-to-day events. The dataset specifies the effects, needs, intents, and attributes of the actors in the events, covering nine relations and 877k knowledge tuples. The dataset built from ATOMIC for CSKG completion was first created in Malaviya et al. (2020). In our experiments, we follow Wang et al. (2021) to use CN-82K and ATOMIC. To better evaluate the generalizability of representations from MICO, we conduct experiments with the inductive splits, in which one of the entity nodes in a knowledge triplet from the valid and test splits does not appear in the training dataset. Statistics of the converted sequence pairs from the original datasets are shown in Table 1.

Table 2: Results on zero-shot CQA tasks. COMET is the commonsense transformer trained on ATOMIC. For MICO, k is set to 2. RoBERTa-L and GPT2-M have comparable parameter sizes. KRL is the knowledge representation method in KTL.

Table 4: Backbone model study on two CKGs with evaluation on CQA tasks. For MICO, k is set to 2 during training.

Table 5 :
Hyper-parameter study of k on two CKGs and evaluation on zero-shot CQA tasks.

Table 6 :
Hyper-parameter study of k on two CKGs and evaluation on inductive CKGC tasks.

Table 8: Comparison of retrieved alternatives from representations extracted by RoBERTa-L fine-tuned on the CKG (ATOMIC) and without the CKG, on a question from the SIQA task. Reasonable alternatives are in boldface.

Table 9: Relation types and relation substitute templates from ATOMIC. rev means the reverse relation.