Do Language Models Perform Generalizable Commonsense Inference?

Inspired by evidence that pretrained language models (LMs) encode commonsense knowledge, recent work has applied LMs to automatically populate commonsense knowledge graphs (CKGs). However, little is understood about how well LMs generalize to multiple CKGs, unseen relations, and novel entities. This paper analyzes the ability of LMs to perform generalizable commonsense inference in terms of three aspects: knowledge capacity, transferability, and induction. Our experiments show that: (1) LMs can adapt to different schemas defined by multiple CKGs but fail to reuse the knowledge to generalize to new relations; (2) adapted LMs generalize well to unseen subjects, but less so to novel objects. Future work should investigate how to improve the transferability and induction of commonsense mining from LMs.


Introduction
Large-scale commonsense knowledge graphs (CKGs), like ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2019), store structured knowledge that can benefit various knowledge-driven applications. Given the usefulness of CKGs, but also their inability to flexibly provide information (Paulheim, 2018), recent work has paid much attention to populating CKGs with commonsense knowledge mined from pretrained language models (LMs) (Wang et al., 2020c; Bosselut et al., 2019). Enhancing the knowledge of CKGs is essential to support reasoning on downstream tasks (Talmor et al., 2019; Wang et al., 2020b; Young et al., 2018).
The task of completing CKGs has typically been posed as commonsense knowledge inference, where the goal is to predict the object of a fact triplet, given its subject and a relation (predicate) (Petroni et al., 2019; Bosselut et al., 2019). Commonsense inference techniques, such as COMET (Bosselut et al., 2019), typically fine-tune an LM, like GPT (Radford et al., 2018), on the training set of a single CKG. While such methods can dynamically enhance the completeness of CKGs, their application so far has been limited to the relation set of the source (training) CKG (Da et al., 2021). In addition, the generated object concepts have been found to be largely biased towards those in the training set (Wang et al., 2020a). It remains unclear to what extent LMs can generalize to multiple CKGs, new relations, and novel objects. To this end, we pose the question: do language models perform generalizable commonsense inference?
To answer this question, we study three aspects of LM generalizability for commonsense inference, namely: knowledge capacity, transferability, and induction. To measure knowledge capacity, we examine whether LMs can be adapted to multiple CKGs simultaneously and tested on each of them. We test transferability by assessing whether first adapting an LM on multiple source CKGs reduces the effort of further adapting it to a new CKG. The inductive power of LMs is measured by varying the overlap between the objects in the training and test splits of a CKG. An overview of our analysis is depicted in Figure 1. Our results show that LMs can infer knowledge for multiple CKGs simultaneously without loss of performance on the target inference task, though the transferability of knowledge across tasks is limited. In addition, we observe that the inductive power of LMs for commonsense inference relies heavily on whether an object is observed during training.

Analysis Setup
To shed light on the LM's generalizability for commonsense inference, we investigate: whether LMs have the capacity to adapt to multiple CKGs (Q1: capacity), whether LMs can reuse the knowledge learned from source CKGs to efficiently adapt to a target CKG (Q2: transferability), and whether LMs can predict unseen objects or mainly repeat observed ones (Q3: induction). In this section, we define the task, describe the CKGs we consider, present our experimental settings, and relate our analysis to prior studies.

Task Formulation
Following Hwang et al. (2020); Da et al. (2021), we formalize commonsense inference as a task of predicting the object of a triplet, given a pair of (subject, relation) as input. The subject s and the object o are both expressed as free-form phrases, while the relation r is a predefined relation type from the CKG. A training example from ConceptNet could have (go to a concert, MotivatedByGoal) as input, and listen to music as output. Assuming that a CKG is given, the goal is to leverage the commonsense triplets in the CKG as training examples to adapt the LM for commonsense inference.
Commonsense Knowledge Graphs
We experiment with three CKGs: (1) ConceptNet (Speer et al., 2017); (2) TupleKB (Dalvi Mishra et al., 2017); and (3) ATOMIC (Sap et al., 2019), which has social commonsense knowledge about causes and effects of everyday events, and mental states (e.g., xIntent) of their participants, and is created by crowdsourcing.
As indicated by Jastrzebski et al. (2018), a large proportion of the subjects in the test set of ConceptNet-100K overlap with its training set, while TupleKB does not provide an official split. Thus, we (re-)split these two datasets to ensure that the subjects of testing triplets do not appear in the training set. This criterion is also consistent with how the ATOMIC dataset is constructed.
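The subject-disjoint re-splitting criterion can be sketched as follows. This is a minimal illustration with toy triplets; the function name, split ratio, and seed are our own assumptions, not taken from the paper:

```python
import random

def subject_disjoint_split(triplets, test_ratio=0.2, seed=0):
    """Split (subject, relation, object) triplets so that no subject
    occurring in the test set also appears in the training set."""
    subjects = sorted({s for s, _, _ in triplets})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_ratio))
    test_subjects = set(subjects[:n_test])
    train = [t for t in triplets if t[0] not in test_subjects]
    test = [t for t in triplets if t[0] in test_subjects]
    return train, test

triplets = [
    ("go to a concert", "MotivatedByGoal", "listen to music"),
    ("go to a concert", "HasPrerequisite", "buy a ticket"),
    ("read a book", "MotivatedByGoal", "learn"),
    ("bake bread", "HasPrerequisite", "knead dough"),
    ("plant a tree", "Causes", "shade"),
]
train, test = subject_disjoint_split(triplets)
# No subject crosses the train/test boundary.
assert {s for s, _, _ in train}.isdisjoint({s for s, _, _ in test})
```

Splitting over unique subjects (rather than over triplets directly) ensures that multi-relation subjects like "go to a concert" land entirely on one side of the split.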

Experimental Settings
Multi-task Learning To answer Q1, we adapt an LM with balanced training data from ConceptNet, TupleKB, and ATOMIC, sampling 8 triplets from each dataset to form one training batch.

Transfer Learning To provide insight into Q2, we adopt transfer learning with a leave-one-out strategy: we adapt an LM on two of the three CKGs, and then further adapt it on the third, target CKG. Moreover, we study the data efficiency of this transfer by down-sampling each training set to x = {1, 20, 50}%, in order to see whether the LM can adapt to the target CKG with less training effort. Fine-tuning on as little as 1% of the training set may suffer from instability, and results may change dramatically given a new split of the training data (Gao et al., 2020). To control for this randomness, we re-sample the 1% training data 5 times with a fixed set of random seeds and report the average performance.

Controlled Low-resource Learning To answer Q3, we design a controlled experiment in which we first split the training set into two disjoint subsets, depending on whether a triplet's object exists in the test set. We denote the subset whose objects appear in the test data as Ω, and sample x = {0, 25, 50, 100}% of the training triplets in Ω for adapting the LM. During evaluation, we also separate the test set into two disjoint subsets, according to whether the objects are seen in the original full training set, and report results on these two test subsets separately for each adapted LM.

Evaluation Protocol For each (subject, relation) pair in the test set, we treat all of its objects as ground-truth references for evaluating the model's inferences.
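The controlled split around Ω can be sketched as follows, with toy data; the function names are illustrative, not from the paper:

```python
import random

def split_by_object_overlap(train_triplets, test_triplets):
    """Partition the training set into Omega (triplets whose object also
    appears as an object in the test set) and the remaining triplets."""
    test_objects = {o for _, _, o in test_triplets}
    omega = [t for t in train_triplets if t[2] in test_objects]
    rest = [t for t in train_triplets if t[2] not in test_objects]
    return omega, rest

def sample_omega(omega, x_percent, seed=0):
    """Keep x% of Omega for adaptation (x = 0, 25, 50, or 100 in the paper)."""
    k = round(len(omega) * x_percent / 100)
    return random.Random(seed).sample(omega, k)

train = [("s1", "r", "apple"), ("s2", "r", "banana"), ("s3", "r", "cherry")]
test = [("t1", "r", "apple"), ("t2", "r", "dates")]
omega, rest = split_by_object_overlap(train, test)
assert omega == [("s1", "r", "apple")]
assert sample_omega(omega, 0) == []
```

At x = 0%, no training triplet shares an object with the test set, so any correct prediction of a test object must be genuinely induced rather than recalled from training.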
We report scores for commonly used automatic evaluation metrics for text generation: BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), which have been shown to be consistent with human judgements (Hwang et al., 2020). During experiments, we observe a high correlation among these different metrics and choose to report METEOR in the main text and the other metrics in the appendix.

Table 1: Methods for using LMs to conduct commonsense inference. "+demo" means prepending a demonstration triplet (s', r, o') before the input tuple.

Connections to Prior Studies
Earlier works (Li et al., 2016; Jastrzebski et al., 2018; Davison et al., 2019) pose the CKG completion task as triplet classification, where the goal is to score the plausibility of a complete triplet. COMET (Bosselut et al., 2019) is the first to cast this task as commonsense inference with LMs. Follow-up contributions utilize COMET as a commonsense provider in various downstream tasks (Ammanabrolu et al., 2021; Chakrabarty et al., 2020), thus providing evidence for LMs' generalization to previously unseen scenarios. Further efforts include Hwang et al. (2020), who show that the quality of the training triplets is a key factor in adapting LMs, and Da et al. (2021), who investigate how to learn COMET in a few-shot setting. Meanwhile, the study by Wang et al. (2020a) indicates the limited generalization of COMET. Ma et al. (2021) also adapt LMs simultaneously on multiple CKGs, albeit with the goal of improving downstream performance rather than CKG inference. In this paper, we aim to provide a more comprehensive study of an LM's generalizability for CKG inference.

Method
While many pretrained LMs are available, we adopt a widely used generative model, GPT2 (Radford et al., 2019), as our baseline LM; the investigation of other generative LMs is orthogonal to our analysis. We experiment with its largest version, GPT2-XL, which contains 48 transformer layers (Vaswani et al., 2017), ensuring sufficient capacity for storing knowledge acquired during pretraining. We introduce our experimental method as follows.
Commonsense Inference with LMs Given a training triplet (s, r, o), we represent s and o as sequences of tokens, x_s and x_o, which is trivial given that they are already expressed as phrases. As for the relation r, we convert it into a natural-language phrase x_r using a template taken from the literature (Davison et al., 2019), e.g., IsA is converted to "is a". This has been shown to facilitate efficient adaptation of LMs (Da et al., 2021). Note that we do not explicitly provide the LMs with information about the source CKG of the triplet as input (e.g., by prepending a related special token to the triplet).
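A minimal sketch of this verbalization step follows. The template strings shown here are illustrative stand-ins; the paper takes its actual templates from Davison et al. (2019):

```python
# Illustrative relation templates (not the exact set used in the paper).
TEMPLATES = {
    "IsA": "is a",
    "UsedFor": "is used for",
    "MotivatedByGoal": "is motivated by the goal",
}

def verbalize(subject, relation, obj=None):
    """Turn a triplet into the natural-language input string (and target)
    fed to the LM. The source CKG is deliberately not encoded in the input."""
    context = f"{subject} {TEMPLATES[relation]}"
    return (context, obj) if obj is not None else context

assert verbalize("go to a concert", "MotivatedByGoal", "listen to music") == (
    "go to a concert is motivated by the goal", "listen to music")
```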

Adapting LMs with Commonsense Knowledge
The training objective for adapting LMs is to maximize the probability of generating the object phrase x_o given the tuple (x_s, x_r). During inference, we use greedy decoding to obtain the predicted object from the adapted LM.
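Supervising only the object tokens can be sketched as below, assuming the common convention (used, e.g., by Hugging Face Transformers) of masking context positions with an ignore index of -100; this implementation detail is our assumption, not stated in the paper:

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_training_example(context_ids, object_ids):
    """The model reads the verbalized (subject, relation) context, but the
    language-modeling loss is computed on the object tokens x_o alone."""
    input_ids = list(context_ids) + list(object_ids)
    labels = [IGNORE_INDEX] * len(context_ids) + list(object_ids)
    return input_ids, labels

# Hypothetical token ids for "go to a concert is motivated by the goal"
# followed by "listen to music".
inp, lab = build_training_example([11, 42, 7], [99, 100])
assert inp == [11, 42, 7, 99, 100]
assert lab == [-100, -100, -100, 99, 100]
```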
Various techniques have been developed for adapting pretrained LMs to downstream tasks (Howard and Ruder, 2018). However, only vanilla Fine-tuning, i.e., updating the whole LM during training, has previously been employed to adapt LMs for commonsense inference (Bosselut et al., 2019; Hwang et al., 2020; Da et al., 2021). To obtain comprehensive results that are not specific to one particular way of fine-tuning, we investigate two further alternatives, each of which has its own advantages in different contexts.

Fine-tuning with Demonstration (FT+demo)
Combining the ideas of fine-tuning and in-context learning (Brown et al., 2020), this technique (Gao et al., 2020) adds a demonstration to each input as additional context and fine-tunes the whole LM as usual. Incorporating demonstrations has been shown to boost performance when the amount of training data is extremely limited. In our case, a demonstration is the top-1 training triplet (s', r, o'), ranked by the cosine similarity between the embedding of the input tuple (s, r) and the embeddings of the training tuples with the same relation type r; the tuple embeddings are given by a pretrained Sentence-BERT (Reimers and Gurevych, 2019). For instance, the demonstration (go to restaurant, UsedFor, eat out) would be added before the input (go to pub, UsedFor). With the demonstration triplets, the LM can learn the schema of the CKG rather than merely memorizing the knowledge in the training data.
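The retrieval step can be sketched as follows. Real tuple embeddings come from Sentence-BERT; this toy example substitutes hand-made 2-d vectors, and the function names are our own:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top1_demonstration(query_embedding, candidates):
    """candidates: ((s, r, o), embedding) pairs sharing the query's relation
    type r; returns the training triplet most similar to the (s, r) query."""
    triplet, _ = max(candidates, key=lambda c: cosine(query_embedding, c[1]))
    return triplet

# Toy embeddings for training tuples with relation UsedFor.
candidates = [
    (("go to restaurant", "UsedFor", "eat out"), [0.9, 0.1]),
    (("hammer", "UsedFor", "drive a nail"), [0.0, 1.0]),
]
# Query embedding for ("go to pub", "UsedFor") points near the first candidate.
demo = top1_demonstration([1.0, 0.0], candidates)
assert demo == ("go to restaurant", "UsedFor", "eat out")
```

Restricting candidates to the same relation type keeps the retrieved demonstration structurally parallel to the input, so it illustrates the schema of r rather than just a related topic.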
Adapter-tuning (AT)
Following Houlsby et al. (2019), this technique inserts small adapter modules into the LM and updates only their parameters during training, which is more parameter-efficient. Each adapter is a two-layer bottleneck network with a skip-connection internally. The parameters of the bottleneck network are initialized close to zero so that the adapter approximates an identity function at the beginning. We compare against two additional baselines, both using GPT2-XL in a zero-shot setting: Zero-shot (ZS) is fed the same input as Fine-tuning, while zero-shot with demonstration (ZS+demo) receives the input plus a demonstration, as in the FT+demo method. By investigating all these methods, we aim to understand the influence of different adaptation techniques on the models' performance. Table 1 summarizes the set of methods considered in this paper.
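The adapter computation can be sketched in plain Python with hypothetical dimensions; zero-initializing the up-projection (a limiting case of "close to zero") makes the module start as an exact identity function:

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def adapter(x, W_down, b_down, W_up, b_up):
    """Two-layer bottleneck with an internal skip connection:
    x + W_up @ ReLU(W_down @ x + b_down) + b_up."""
    h = relu([a + b for a, b in zip(matvec(W_down, x), b_down)])
    out = [a + b for a, b in zip(matvec(W_up, h), b_up)]
    return [a + b for a, b in zip(x, out)]

# Hypothetical sizes: hidden dimension 4, bottleneck dimension 2.
d, bottleneck = 4, 2
W_down = [[0.01] * d for _ in range(bottleneck)]
b_down = [0.0] * bottleneck
W_up = [[0.0] * bottleneck for _ in range(d)]  # zero-initialized up-projection
b_up = [0.0] * d

x = [1.0, -2.0, 3.0, 0.5]
# With a zero up-projection the adapter passes its input through unchanged.
assert adapter(x, W_down, b_down, W_up, b_up) == x
```

Starting at identity means inserting untrained adapters cannot degrade the pretrained LM's behavior; training then gradually moves them away from identity.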

Results and Discussion
Knowledge Capacity (Q1)
The results quantifying the knowledge capacity of LMs for commonsense inference over multiple CKGs, measured by METEOR, are shown in Figure 2; the complete results for the other metrics can be found in the appendix. All adaptation methods perform considerably better than the zero-shot baselines, indicating the benefit of adaptation. There is no clear distinction between the adaptation methods, though FT+demo performs slightly better than the others across CKGs. Most importantly, we find no notable performance drop for any method in the multi-task training setup, despite the limited overlap between these CKGs: only 10.0% of the facts from ATOMIC can be found in ConceptNet (Hwang et al., 2020), while 8.4% of the facts from ConceptNet can be found in TupleKB (Dalvi Mishra et al., 2017). This indicates the considerable capacity of LMs to simultaneously adapt to different CKGs. Nevertheless, the results also reveal that learning different CKGs jointly produces no interference between them, either positive (via knowledge sharing) or negative (due to overfitting).

Transferability (Q2)
Figure 3 shows the obtained results regarding the transferability of LMs. Across different CKGs and for any training data size, we observe no indication that adapting to the source CKGs enhances performance on the target CKG. On the contrary, adapting from source CKGs even hurts the performance of the Adapter-tuning method, revealing that this method overfits to the source CKGs. Overall, we conclude that LMs cannot reuse the knowledge learned from the source CKGs to improve performance on the target CKG, nor to achieve the same performance with less training data. We thus call for future study on developing more effective adaptation methods.

Induction (Q3)
The results in Figure 4 show that even without down-sampling (x = 100%), all methods perform much better on predicting facts that contain seen objects, and their performance degrades further as fewer objects are observed during training. Meanwhile, the performance on facts with unseen objects remains roughly unaffected. This indicates a key limitation of the LMs: they adapt notably better to seen objects. Since the training and test sets do not share subjects, we conclude that the generalizability of the LM largely depends on finding relationships between unseen subjects and observed objects. We thus posit that a novel strategy for adapting LMs while retaining the knowledge acquired during pretraining is necessary for better generalizability. Promising directions include prefix tuning (Li and Liang, 2021) or adding an objective during adaptation that encourages the generation of novel objects.

Conclusion
This work conducted a focused study of three aspects of the generalizability of LMs for commonsense inference: knowledge capacity, transferability, and induction. We experimented with five methods of using a generative LM and three representative CKGs. Despite their capacity to accommodate multiple CKGs, we observed that LMs have limited ability to transfer knowledge across CKGs. Moreover, their adaptation relies heavily on whether the objects to predict are seen during training. These findings further our understanding of LMs' adaptation behavior on commonsense inference, and highlight the need for future work to improve their transferability and induction.

A.2 Implementation Details
The GPT2-XL language model we adopted in this work has 1558M parameters in total. We train all the models on a V100 GPU. As for hyperparameters, we adopt the commonly-used learning rate (1e-5) and batch size (16) for adapting GPT2, except that in the multi-task learning setting, the batch size is 24 (8 samples from each CKG).

A.3 Additional Results
See Table 2 for the full results of all the evaluation metrics considered in this paper.