Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

The common practice for training commonsense models has gone from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in order to train commonsense models. In this work, we investigate an alternative, from-machine-to-corpus-to-machine: general language models author these commonsense knowledge graphs to train commonsense models. Our study leads to a new framework, Symbolic Knowledge Distillation. As with prior art in knowledge distillation (Hinton et al., 2015), our approach uses larger models to teach smaller models. A key difference is that we distill knowledge symbolically, as text, in addition to the neural model. We also distill only one aspect, the commonsense of a general language model teacher, allowing the student to be of a different type: a commonsense model. Altogether, we show that careful prompt engineering and a separately trained critic model allow us to selectively distill high-quality causal commonsense from GPT-3, a general language model. Empirical results demonstrate that, for the first time, a human-authored commonsense knowledge graph is surpassed by our automatically distilled variant in all three criteria: quantity, quality, and diversity. In addition, it results in a neural commonsense model that surpasses the teacher model's commonsense capabilities despite being 100x smaller. We apply this to the ATOMIC resource, and share our new symbolic knowledge graph and commonsense models.


Introduction
Prior works have suggested that pre-trained language models possess limited understanding of commonsense knowledge (Merrill et al., 2021; Talmor et al., 2021; Davis and Marcus, 2017) despite otherwise stellar performance on leaderboards. As a result, symbolic commonsense knowledge graphs (Speer et al., 2017; Sap et al., 2019; Hwang et al., 2021) and corresponding neural representations (Bosselut et al., 2019; Hwang et al., 2021; Zhang et al., 2020b) have supplemented past models with commonsense capabilities. This has enabled diverse downstream applications, including interactive learning through a conversational interface (Arabshahi et al., 2021), persona- and affect-aware conversation models (Kearns et al., 2020), figurative language understanding (Chakrabarty et al., 2020, 2021), storytelling (Ammanabrolu et al., 2021a), and fantasy games (Ammanabrolu et al., 2021b).

[Footnote 1: We will share these resources following the anonymity period; we have permission from OpenAI to release them.]

[Figure 1: Symbolic knowledge distillation extracts the commonsense from the large, general language model GPT-3 into two forms: a large commonsense knowledge graph, ATOMIC 10x, and a compact commonsense model, COMET DIS TIL. The quality of this knowledge can be controlled and improved by adding a critic model, making GPT-3 a stronger teacher.]
The common practice for commonsense knowledge graph construction sees humans spell out as many pieces of knowledge as possible. This pipeline goes from-human-to-corpus-to-machine, with commonsense models trained from human-authored knowledge graphs. Yet, high-quality, human-authored knowledge is expensive to scale, limiting coverage; this motivates an alternative: from-machine-to-corpus-to-machine. Prior efforts toward automatic commonsense knowledge graphs have resulted in considerably lower quality than human-written data (Hwang et al., 2021; Zhang et al., 2020b), which in turn leads to less reliable neural models (Hwang et al., 2021). Broad literature consistently shows machine-authored knowledge graphs underperform human-authored graphs (Etzioni et al., 2011; Mitchell et al., 2015; Bollacker et al., 2008).
In this work, we propose symbolic knowledge distillation, a new conceptual framework towards high-quality automatic knowledge graphs for commonsense, leveraging state-of-the-art models and novel methodology. Most prior art for automatic knowledge graph construction extracts knowledge from raw text (Bhakthavatsalam et al., 2020; Zhang et al., 2020a; Zhou et al., 2020; Zhang et al., 2020b; Li et al., 2020). In contrast, our approach is motivated by knowledge distillation (Hinton et al., 2015), wherein a larger teacher model transfers knowledge to a compact student model (§2.1). Our method differs from prior knowledge distillation in key ways: we distill a symbolic knowledge graph (i.e., generated text) in addition to a neural model, and we distill only a selective aspect of the teacher model. This selectivity allows the student model to be of a different type (a commonsense model) than the teacher (a general language model), enriching the scope of distillation. An added benefit is that knowledge distilled as text is human readable: it can be understood and evaluated. A general language model, GPT-3 in our case, is an imperfect commonsense teacher on its own, and the ability to evaluate distilled knowledge is useful in improving it. We empirically demonstrate that, by training a separate critic model to judge the quality of symbolic generations, a more precise teacher can be defined. Knowledge from this critical teacher is of higher quality, even exceeding human-authored knowledge. Yet even before training a critic, our study makes the unexpected finding that the student model surpasses the commonsense of GPT-3, our knowledge source.
To test symbolic knowledge distillation against the human-to-corpus-to-machine paradigm, we compare with ATOMIC 20 20 (Hwang et al., 2021), a human-authored commonsense knowledge graph. We find that ATOMIC 10x, our machine-generated corpus, exceeds the human-generated corpus in scale, accuracy, and diversity with respect to the 7 commonsense inference types we focus on in this study. The resulting commonsense model, COMET DIS TIL, not only surpasses the human-trained equivalent COMET 20 20, but is also smaller, more efficient, and produces commonsense at a higher accuracy than its own teacher, GPT-3.
Symbolic knowledge distillation offers a promising new role for general language models, as commonsense knowledge sources, and for humans, as small-scale evaluators who train critic models rather than authoring commonsense knowledge. Our work demonstrates that humans and LMs can be effective collaborators for curating commonsense knowledge graphs and training efficient and performant commonsense models.

Overview and Key Findings
Throughout our work, we describe the machine-to-corpus-to-machine methodology of symbolic knowledge distillation. We first go machine-to-corpus (§3), by decoding from GPT-3, then improve our knowledge with a specialized critic model (§4), and finally distill this knowledge into an efficient commonsense model (§5), going corpus-to-machine. Throughout this process, we evaluate against a human knowledge source, comparing our automatic knowledge graph ATOMIC 10x and commonsense model COMET DIS TIL to the human-authored ATOMIC 20 20 and resulting model COMET 20 20 (Hwang et al., 2021).

Symbolic Knowledge Distillation
Our proposed methodology parallels knowledge distillation (Hinton et al., 2015), a method for compressing a large or complicated teacher distribution P_t into a smaller, simpler student distribution P_s. Key to knowledge distillation is the notion of minimizing the cross-entropy between P_t and P_s:

$$\mathrm{CE}(P_t, P_s) \;=\; -\sum_{y \in Y} P_t(y)\,\log P_s(y) \qquad (1)$$

Knowledge is transferred to the student by encouraging it to match the teacher's predictions. Hinton et al. (2015) propose this for classification, where, for each training input, P_t and P_s are model predictions over a label set Y. Typically Y is a tractable set, over which this sum can reasonably be calculated. For distilling the knowledge of generative models, we can instead think of an unconditional language model (e.g., GPT-3) as P_t. This makes Y the set of all strings, over which LMs define probability. Unfortunately, Y is then exponentially large, making the sum in Eq. 1 intractable. Kim and Rush (2016) address this problem by simply taking the mode of P_t over Y, truncating most of the teacher distribution to the most likely sequence and discarding information.

[Figure 2: Example automatically generated ATOMIC triples from our ATOMIC 10x commonsense knowledge graph. Each example includes a generated event, a relation (with its natural language interpretation), and a generated inference.]
Instead, we consider a sampling-based interpretation of the same objective:

$$\mathbb{E}_{y \sim P_t}\!\left[-\log P_s(y)\right] \;\approx\; -\frac{1}{N}\sum_{i=1}^{N} \log P_s(y_i), \quad y_i \sim P_t \qquad (2)$$

which exactly equals the cross-entropy of Eq. 1 in the limit, under pure sampling from P_t. A useful consequence of this framing is that access to the full model distribution is not required: our experiments (§3) use GPT-3, for which the distribution is not available, so our method is applicable while standard knowledge distillation is not.

Yet distilling all knowledge from the teacher may not be desirable; our work is specifically focused on distilling commonsense knowledge from GPT-3. The ideal teacher P_t is a commonsense expert, but GPT-3 can approximate such a teacher, off-the-shelf, via prompting. This ability to select information is one explicit benefit of the sampling-based interpretation in Eq. 2: while Eq. 1 uses continuous logits over existing data, sampling gives discrete control over the transferred information, by selecting which samples are elicited and used. For the general language model GPT-3, we encourage the target domain and quality with prompting and sample truncation (Holtzman et al., 2020). We call this the loose teacher P_t^L: knowledge is generated and transferred from GPT-3, but without critical assessment of correctness (§3).
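To make the sampling-based objective concrete, the sketch below trains a small student LM on samples drawn from a larger teacher LM. It is a minimal illustration only: GPT-2 models stand in for the GPT-3 teacher (which is reachable only through an API), and the prompt is a toy placeholder for the paper's templates.

```python
# Minimal sketch of the sampling-based objective in Eq. 2, with GPT-2
# models as locally runnable stand-ins for the GPT-3 teacher and student.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
teacher = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()
student = GPT2LMHeadModel.from_pretrained("gpt2")
opt = torch.optim.Adam(student.parameters(), lr=1e-5)

# A commonsense-flavored prompt stands in for the paper's templates.
prompt = tok("1. Event: X overcomes evil with good\n2. Event:",
             return_tensors="pt")

# Draw samples y_i ~ P_t with truncated (nucleus) sampling.
with torch.no_grad():
    samples = teacher.generate(**prompt, do_sample=True, top_p=0.9,
                               max_new_tokens=20, num_return_sequences=4,
                               pad_token_id=tok.eos_token_id)

# Training the student with an LM loss on these samples is a Monte Carlo
# estimate of the teacher-student cross-entropy (Eq. 2).
loss = torch.stack([student(input_ids=s.unsqueeze(0),
                            labels=s.unsqueeze(0)).loss
                    for s in samples]).mean()
loss.backward()
opt.step()
opt.zero_grad()
```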
In fact, sampling knowledge via Eq. 2 offers even more control, as generations can be individually interpreted and judged. Given an indicator function A(x) that marks which knowledge x is correct, we can define a stronger teacher model. Using a Product of Experts (Hinton, 2002) between the loose teacher P_t^L and the critic A(x), we define a critical teacher:

$$P_t^{\text{critical}}(x) \;\propto\; P_t^{L}(x)\, A(x) \qquad (3)$$

In practice, A(x) is a text classifier learned from human judgements: 1 for knowledge predicted to be correct and 0 otherwise. Thus, the critic gives control over the correctness and confidence of the knowledge that is transferred (§4).
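Because A(x) is binary, the Product of Experts in Eq. 3 reduces to filtering teacher samples, as in this small sketch. The `critic_prob` callable is a placeholder for any learned acceptability classifier (ours is RoBERTa-based; see §4), and the example inputs are invented for illustration.

```python
# Hedged sketch of the critical teacher in Eq. 3: with a binary critic
# A(x), the Product of Experts amounts to filtering teacher samples.
def critical_teacher(loose_samples, critic_prob, threshold=0.5):
    """Keep samples x with A(x) = 1, i.e. critic confidence >= threshold."""
    return [x for x in loose_samples if critic_prob(x) >= threshold]

# Usage: raising the threshold trades corpus size for accuracy (see section 4).
corpus = critical_teacher(
    ["X buys a gift. X wanted to be kind.",
     "X sleeps. X runs a marathon at the same time."],
    critic_prob=lambda x: 0.9 if "gift" in x else 0.1)
```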

Key Findings
Applying symbolic knowledge distillation in practice results in promising and surprising findings: 1. Learning symbolic knowledge from language models can be framed as a symbolic extension to knowledge distillation. In §2.1, we describe learning commonsense as a symbolic extension to knowledge distillation, with GPT-3 as the knowledge source. We elaborate on this process, with positive results, in §3, §4, and §5.
2. Symbolic knowledge distillation constructs a high quality knowledge graph at scale. Our method naturally yields a machine-generated commonsense knowledge graph, which can achieve impressive quality (§4), beyond that of human-authored data. An effective critic that filters incorrect generated knowledge is key.

3. A critical teacher results in a higher quality student. In §4, we show that making the teacher more critical results in higher quality knowledge, even as it reduces the scale of knowledge transferred. This demonstrates that quality matters, not just quantity: higher quality knowledge results in a higher quality commonsense model in §5, despite smaller-scale data.

4. Critical teacher or not, a student can outperform the knowledge source. In §5, we show the unexpected result that all student models exceed the quality of GPT-3, the knowledge source.
5. Machines can win over humans for automatic knowledge graph construction. In §4 and §5, we show that machine-generated knowledge and the resulting commonsense model can outperform their equivalents that use a human knowledge source. Our symbolic knowledge exceeds the human-authored equivalent in scale, quality, and diversity. The resulting commonsense model achieves the most accurate commonsense KG completions.

Machine-to-Corpus Verbalization
Symbolic knowledge distillation begins by going machine-to-corpus, i.e., generating many commonsense facts, which results in a commonsense knowledge graph. §2.1 frames this as sampling to estimate the knowledge distillation objective: a student commonsense model learns from the generations of a teacher (GPT-3). We start with a loose teacher, transferring knowledge by prompted generation with truncated sampling alone; this is in contrast to the critical teacher (§4), which explicitly judges and filters the generated samples. The loose teacher uses few-shot prompting, as in Brown et al. (2020), with the templates described below. Of the 23 relations in the most recent version, ATOMIC 20 20, we limit our investigation to 7 relations that correspond to causal commonsense knowledge: xAttr (how X is perceived after the event), xReact (how X reacts in response to the event), xEffect (what X does after the event), xIntent (X's intent in the event), xWant (what X wants after the event), xNeed (what X needed for the event to take place), and HinderedBy. We describe how verbalization is applied to ATOMIC data in 2 steps: generating underlying events (heads), then full examples (an inference given an event).

Event Generation
Events are context-free premises in ATOMIC involving PersonX (and sometimes a second PersonY) in various scenarios. These events form the heads of knowledge graph triples. We generate events by filling in the elements of our template:

1. Event: X overcomes evil with good
2. Event: X does not learn from Y
. . .
10. Event: X looks at flowers
11. Event:
The format is simple, as events are generated unconditionally. We use 100 high-quality events from the ATOMIC 20 20 corpus for our prompt, selected to avoid grammatical or logical errors and to minimize semantic overlap. We randomly sample 10 of these seed events for each generation batch, resulting in randomized prompts. We use nucleus sampling (p = 0.9) (Holtzman et al., 2020) and presence/frequency penalties of 0.5 from the GPT-3 interface. We generate 165K unique events using the 175B-parameter Davinci model from Brown et al. (2020); by comparison, human-authored ATOMIC 20 20 contains only 6.2K events.
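To make the event-generation recipe concrete, the sketch below builds one randomized 10-shot prompt and requests a completion. The seed list is a tiny placeholder for the 100 curated events, and the call assumes the legacy OpenAI Completion API of the time; treat it as an illustration, not the authors' exact script.

```python
# Sketch of randomized few-shot event generation (assumed legacy OpenAI API).
import random

import openai

SEED_EVENTS = [  # placeholder; the paper uses 100 curated seed events
    "X overcomes evil with good",
    "X does not learn from Y",
    "X looks at flowers",
]

def event_prompt(n_shots=10):
    shots = random.sample(SEED_EVENTS, min(n_shots, len(SEED_EVENTS)))
    lines = [f"{i + 1}. Event: {e}" for i, e in enumerate(shots)]
    lines.append(f"{len(shots) + 1}. Event:")  # the model completes this slot
    return "\n".join(lines)

response = openai.Completion.create(
    engine="davinci",        # the 175B model used for event generation
    prompt=event_prompt(),
    max_tokens=30,
    top_p=0.9,               # nucleus sampling, as in the paper
    presence_penalty=0.5,
    frequency_penalty=0.5,
)
print(response.choices[0].text.strip())
```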

Inference Generation
Generating ATOMIC inferences requires reasoning about events and relations together. We design verbalization templates for each relation, iterating on the design with small-scale verification by the authors. For example, we prompt the xNeed relation as follows: What needs to be true for this event to take place?
. . .

The language of this template implies the relation-specific task: both "Prerequisites:" and phrases beginning with "for this to happen" suggest the xNeed relation. We also include an xNeed-specific <TASK-PROMPT>. We use 10 few-shot examples for each prompt. For each event/relation pair (165K × 7), we generate 10 inferences with the Curie GPT-3 model and the earlier hyperparameters. Removing duplicate and degenerate generations (e.g., fewer than 3 characters) yields 6.46M ATOMIC-style data triples (examples in Figure 2). We call this ATOMIC 10x, as it contains an order of magnitude more triples than ATOMIC 20 20 for the 7 relations we study.
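The de-duplication step above is simple enough to sketch directly; this is a minimal version, with the tie between lowercasing and the authors' exact normalization being an assumption.

```python
# Minimal sketch of the post-processing described above: drop exact
# duplicates and degenerate (under 3 characters) generated inferences.
def clean_triples(triples, min_chars=3):
    seen, kept = set(), []
    for head, relation, tail in triples:
        tail = tail.strip()
        key = (head, relation, tail.lower())
        if len(tail) < min_chars or key in seen:
            continue  # degenerate or duplicate
        seen.add(key)
        kept.append((head, relation, tail))
    return kept
```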

Evaluating a Generated Commonsense Knowledge Graph
Machine generation enables a large number of unique generations at a much lower cost than human-authored knowledge (Table 1). But what kinds of examples does GPT-3 produce, and how do they differ from knowledge produced by humans? In this section, we conduct an in-depth analysis to answer these questions.

Lexical Differences: Diversity and Uniqueness
Recent work finds that machine generations can be repetitive and lack diversity (Welleck et al., 2020; Holtzman et al., 2020); generated knowledge might thus differ from human-authored knowledge through less creative word choice, lower diversity, or more repetition.
To test this, we begin with lexical diversity (i.e., unique words used; Table 2). While there is variation by relation, the diversity of ATOMIC 10x actually exceeds ATOMIC 20 20 here: 5.2M unique words to 1.5M. In addition, it contains significantly more strictly unique generated inferences (Table 2, unique tails).
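As a simple illustration of these counts, the sketch below computes unique words and strictly unique tails for a list of (head, relation, tail) triples; the paper does not specify its tokenization, so whitespace splitting here is an assumption.

```python
# Sketch of the lexical diversity counts reported above.
def lexical_diversity(triples):
    words = {w for _, _, tail in triples for w in tail.lower().split()}
    tails = {tail.strip().lower() for _, _, tail in triples}
    return {"unique_words": len(words), "unique_tails": len(tails)}
```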
BLEU Soft Uniqueness. Exact match (above) fails to capture the notion of similar text. Following the intuition of Self-BLEU (Zhu et al., 2018), we define soft uniqueness to describe the diversity of generations in a corpus. An inference x is softly unique if:

$$\mathrm{BLEU}(x,\, C \setminus \{x\}) < 0.5$$

where C is the set of inferences for a given input (in our case, event + relation) and 0.5 is an empirical threshold. To find the soft uniqueness of a corpus, we iteratively remove examples until all remaining examples are softly unique, i.e., have low mutual lexical overlap; higher diversity means more such examples (thus a larger softly unique corpus is preferable). Softly unique corpus sizes are given in Table 4 ("Size (div)"). ATOMIC 10x has a smaller fraction of softly unique examples than ATOMIC 20 20, yet it contains many more such examples in absolute terms: 4.38M (of 6.5M total) vs. 560K for ATOMIC 20 20 (of 600K total). A greedy sketch of this filter appears at the end of this subsection.

Model-based Diversity Measurement. Lexical notions of diversity reward differences in surface form, which may not always reflect diversity of information, only of format. Thus, we next study information-theoretic measures of diversity. Intuitively, diverse information should be less predictable, i.e., higher entropy. With GPT-2 XL models finetuned on ATOMIC 20 20 and ATOMIC 10x (§5), we estimate entropy: roughly, how difficult it is for a model to capture the corpus information (Table 3). This is 4 times higher for ATOMIC 10x, suggesting more content from a modeling perspective. We also estimate cross-entropy: how well a model trained on one corpus describes the other. From ATOMIC 10x to ATOMIC 20 20, this is 9.31, only 2 points higher than its entropy, suggesting ATOMIC 20 20 is largely describable with information from ATOMIC 10x. In reverse, the cross-entropy is 41.48, suggesting much of ATOMIC 10x is not captured by ATOMIC 20 20: ATOMIC 10x is surprising given only information from ATOMIC 20 20.
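The following is a greedy sketch of the soft-uniqueness filter described above, using NLTK's sentence-level BLEU as the overlap measure; the paper's exact BLEU configuration and removal order are not specified, so both are assumptions here.

```python
# Greedy sketch of the soft-uniqueness filter: keep an inference only if
# its BLEU against every already-kept inference stays below the threshold.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def softly_unique_subset(inferences, threshold=0.5):
    smooth = SmoothingFunction().method1
    kept = []
    for x in inferences:  # inferences grouped per input (event + relation)
        hyp = x.split()
        if all(sentence_bleu([k.split()], hyp,
                             smoothing_function=smooth) < threshold
               for k in kept):
            kept.append(x)
    return kept
```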
Human Evaluation of Quality. Perhaps most importantly, we study the quality of knowledge in each corpus. We conduct a human evaluation on Amazon Mechanical Turk: 3 annotators rate each triple, resulting in an "accepted", "rejected", or "no judgement" label (Table 4). We find a Fleiss' kappa (Fleiss, 1971) of 40.8, indicating moderate agreement (Landis and Koch, 1977), and 90.5% accuracy agreement. We require that workers meet an Amazon Mechanical Turk qualification for annotation quality based on past commonsense evaluations. We compensate workers $0.17 per task, which we estimate requires 30 seconds. Further details and the task template are in appendix §A.
For the loose teacher, consider the top row of ATOMIC 10x in Table 4 (the other rows add the critic, §4). ATOMIC 10x exceeds ATOMIC 20 20 in scale, but is somewhat less acceptable to human raters, by roughly 8 percentage points. Yet the larger scale of ATOMIC 10x implies a significantly higher number of accurate examples. Increasing the proportion of these is the main objective of the critic (§4).
How do Knowledge Sources Compare? To understand the robustness of our approach, we assess other language models as the knowledge source (i.e., loose teacher): GPT-J (Wang and Komatsuzaki, 2021) and T5-11B adapted for language modelling (Lester et al., 2021). We substitute each for GPT-3 as in §3.2 and §3.3, generating a small-scale corpus to evaluate. We conduct a human evaluation on 1000 examples, as above (Table 4). Both models attain roughly 72% accuracy, 6 points below GPT-3. This suggests strong potential, but higher quality from GPT-3. We explore this further in Appendix B.

Making the Teacher More Critical
Symbolic knowledge distillation requires a strong teacher model to maximize the quality of the generated knowledge graph and the resulting student model (§5). While the loose teacher (GPT-3 alone) results in a viable commonsense knowledge graph, evaluation shows it is not a perfect commonsense teacher. Thus, we multiply in a critic model to filter lower-quality knowledge, correcting the teacher (§2.1). With modest supervision (a small-scale human evaluation), we train a classifier to discriminate unacceptable examples. We multiply this with the loose teacher (§3), creating a critical teacher via a product of experts. In practice, this means filtering ATOMIC 10x to create new corpora that are higher quality, yet still larger in scale than the human-authored ATOMIC 20 20.
Training a knowledge critic. We gather a training set of correct vs. incorrect human judgements on a randomly sampled set of 10K entries of ATOMIC 10x, as in §3.4 but with one annotation per example. We take a random train/dev/test split of 8K/1K/1K. While this step requires human annotation, humans take on the role of high-level supervisors here, critiquing a small number of generations rather than authoring the entire knowledge graph as in previous work. Indeed, the cost and complexity of this step are similar to those of a typical human evaluation, making it far cheaper and easier than eliciting human-authored knowledge as in past work. We train binary classifiers (critics) for human acceptability using RoBERTa-Large (Liu et al., 2019). We find that pretraining on MNLI results in the best model in terms of precision and recall, and we suggest this technique for future studies. We give more detail in Appendix C, including baselines. Our best model vastly improves the accuracy of ATOMIC 10x (Table 4), demonstrating that a small amount of human supervision can consistently help to correct GPT-3's mistakes.
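A hedged sketch of such a critic, initialized from MNLI-pretrained RoBERTa as suggested above, follows; the data handling and verbalization format are illustrative assumptions, not the authors' actual pipeline, and `filter_corpus` is a hypothetical helper that foreshadows the size-accuracy trade-off discussed next.

```python
# Hedged sketch of critic training and filtering with HuggingFace
# Transformers, starting from MNLI-pretrained RoBERTa.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
critic = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large-mnli", num_labels=2, ignore_mismatched_sizes=True)
opt = torch.optim.Adam(critic.parameters(), lr=5e-6)  # lr from Appendix C

def train_step(texts, labels):
    # texts: verbalized triples, e.g.
    # "Alex goes to the store. Before, Alex needed to get dressed."
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    loss = critic(**batch, labels=torch.tensor(labels)).loss
    loss.backward(); opt.step(); opt.zero_grad()
    return loss.item()

@torch.no_grad()
def accept_prob(text):
    batch = tok(text, return_tensors="pt")
    return critic(**batch).logits.softmax(-1)[0, 1].item()

def filter_corpus(texts, cutoff=0.8):
    # Raising the cutoff trades corpus size for accuracy (see below).
    return [t for t in texts if accept_prob(t) >= cutoff]
```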
Size-accuracy trade-off. Using our critic to filter knowledge results in a natural trade-off between size and accuracy. We test several cutoffs for ATOMIC 10x, i.e., the confidence at which the critic rejects examples, and report human-measured accuracy (Accept/Reject column of Table 4) following §3.4, comparing the loose teacher to increasingly critical variants. (The size of ATOMIC 20 20 is given as the number of comparable datapoints, i.e., those with the same relations as ATOMIC 10x.)

What gets filtered out? We qualitatively identify two types of filtered triples: 1) logical misalignments, events and inferences joined in an inconsistent manner; recognizing these requires understanding event-inference interactions, e.g., "X cannot find his shirt" paired with "as a result, X is wearing a shirt"; and 2) awkward phrasings, in which events or inferences are individually incoherent, e.g., "PersonX has a fire in the bath"; the resulting triples are invalid because the event itself is implausible.
To understand what is filtered, we ablate the critic (Table 5): our full model is compared to a random predictor, an event-only model, and an inference-only model. We also compare to an EMAP (Hessel and Lee, 2020) version, i.e., an ensemble of the event-only and inference-only models, without the interactions between event and inference that are needed to catch logical misalignments.
We find GPT-3 produces both independently awkwardly-phrased events/inferences (filtered by the X-only models) and logical misalignments. The classifier, trained on validated knowledge triples, helps in both cases. The EMAP version of our full model (which identifies only awkward phrasings) achieves 87% average precision (AP), and our full model (which additionally identifies logical misalignments) improves to 94% AP.
Does filtering hurt diversity? One concern is that the critic may keep only similar, "safe" examples, lacking novelty. We repeat our diversity analysis (§3.4) for the critical corpora: ATOMIC 10x has a softly unique subset 68% of its size, rising to 80% with the most extreme filtering. One possibility is that GPT-3 gravitates towards common sentence structures for inconsistent knowledge; these would be recognizable to the critic, and removing them would increase both quality and diversity. This surprising result warrants further study.

[Table 6: Model performance on knowledge base completion, measured by human judgement. Inferences are generated on held-out events from ATOMIC 20 20. Models besides GPT-3 use the GPT-2 XL architecture. COMET DIS TIL with a strong critic (+critic high) achieves the highest acceptance rate overall, 87.5.]

Corpus-to-Machine: Distillation
The final step of symbolic knowledge distillation trains a compact model on the generated natural language knowledge graph. Our base model is GPT-2 XL trained on all of ATOMIC 10x; we denote this model COMET DIS TIL. We additionally train the model on critical versions of ATOMIC 10x: +critic low denotes training on the corpus achieving 91.5% accuracy, and +critic high on the 96.4% accuracy corpus. Models are trained for 1 epoch with default parameters using the Huggingface Transformers library (Wolf et al., 2019). We evaluate models on knowledge base completion, i.e., generating inferences for test events, specifically from the ATOMIC 20 20 test set. We use human evaluation following §3.4, on 1000 inputs (event + relation), with results in Table 6. We compare to the GPT-2 XL-based COMET 20 20 model trained on the human-generated ATOMIC 20 20, and to GPT-3 using the same generation method as §3; in effect, comparing the student COMET DIS TIL to the loose teacher GPT-3. We omit the critical teacher (GPT-3 + critic), which is not assured to produce an inference for each input, as the critic may reject all tails for some inputs. We also compare to zero-shot GPT-2 XL (Radford et al., 2019) using the same methodology (Table 6).
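A minimal sketch of this fine-tuning step follows, using era-appropriate Transformers utilities; the training file name is a placeholder, and the setup (one verbalized triple per line, default arguments) is an assumption beyond the one-epoch detail given above.

```python
# Sketch of the distillation step: fine-tune GPT-2 XL for one epoch on
# verbalized ATOMIC 10x triples (TextDataset matches the library version
# contemporary with this work).
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, TextDataset, Trainer,
                          TrainingArguments)

tok = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

train_data = TextDataset(tokenizer=tok, file_path="atomic10x_train.txt",
                         block_size=128)  # one verbalized triple per line
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="comet-distil", num_train_epochs=1),
    data_collator=collator,
    train_dataset=train_data,
)
trainer.train()
```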

How does COMET DIS TIL compare to GPT-3?
In knowledge distillation, the student model often deteriorates in performance compared to its teacher (Hinton et al., 2015; Kim and Rush, 2016). Comparing our base teacher, GPT-3, to the simplest version of COMET DIS TIL (top COMET DIS TIL row of Table 6) surprisingly shows the student surpassing GPT-3, the model that generated its training data. We posit that the superior performance of COMET DIS TIL may have to do with GPT-3's mistakes being filtered out by the verbalization and training of GPT-2, and possibly with COMET DIS TIL's focus on a single commonsense domain while GPT-3 covers a far more general one. We leave further study of this effect for future work.

How does COMET DIS TIL compare to human knowledge?
While COMET DIS TIL without the critic is slightly outperformed by COMET 20 20 in terms of accuracy, this reverses with the critic: for both cutoffs tested, COMET DIS TIL surpasses COMET 20 20, with more filtering resulting in a wider gap.

Usefulness of COMET DIS TIL
For on-demand inference, where a single high-quality inference for a given input event/relation is required, COMET DIS TIL is the best available model: the most performant version surpasses COMET 20 20 by 5 points and GPT-3 by over 10. The critical teacher (GPT-3 + critic) yields a more accurate corpus, but may filter out all inferences for an input, giving no output.
Limits and Future Work
The success of symbolic knowledge distillation is a first step, demonstrating superior performance to human authoring on the commonsense relations tested here. No aspect of our approach is specific to these relations, yet further work is needed to explore the feasibility of generation for other aspects of commonsense and knowledge beyond these relations, e.g., concepts like physical or temporal commonsense. (The slight difference in acceptability for GPT-3 compared to Table 4 is likely due to variance in raters between rounds of evaluation, and a different distribution of events: Table 4 uses generated events, while Table 6 uses held-out events from ATOMIC 20 20.)

Related Work

Extracting Knowledge from LMs
Past work uses models for automatic knowledge graph completion (Bosselut et al., 2019; Hwang et al., 2021; Li et al., 2020). Yet these models are trained on existing resources; ATOMIC 10x is generated without them.

Knowledge Distillation
Kim and Rush (2016) follow a similar formulation to ours (§2.1), but use the mode of the teacher distribution rather than sampling. Our work is unique in distilling specific information (commonsense) from a general language model.

Data Generation
While manual dataset creation is expensive and complex (Schwartz et al., 2017; Agrawal et al., 2018; Tsuchiya, 2018; Bras et al., 2020), crowdsourcing remains the most popular method for building goal-oriented datasets with high quality and coverage.
Past automatic dataset construction mainly uses extractive approaches, e.g., syntactic parsing (Zhang et al., 2020a) or pattern matching (Li et al., 2020) over unstructured text (Lehmann et al., 2015; Buck et al., 2014). These approaches scale, but are noisy and limited in format: ATOMIC-style knowledge will not appear simply in natural text. Some works explore automatic data synthesis and expansion by finetuning LMs on existing labeled data (Anaby-Tavor et al., 2020; Papanikolaou and Pierleoni, 2020; Kumar et al., 2020; Yang et al., 2020), but these are limited by the quality of that data.

Conclusions
We introduce symbolic knowledge distillation, a machine-to-corpus-to-machine pipeline for commonsense that does not require human-authored knowledge, instead using machine generation. Knowledge is transferred from a large, general model to a compact commonsense model through a commonsense corpus, yielding both a commonsense knowledge graph and a commonsense model. Our resulting symbolic knowledge graph has greater scale, diversity, and quality than its human-authored counterpart. Symbolic knowledge distillation offers an alternative to human-authored knowledge in commonsense research.

Acknowledgments
This work was funded in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) (funding reference number 401233309), DARPA MCS program through NIWC Pacific (N66001-19-2-4031), and the Allen Institute for AI.

Ethical Considerations
One aspect of our work with the potential for ethical pitfalls is large-scale generation from pretrained language models in constructing ATOMIC 10x. Recent work (Bender et al., 2021) has highlighted the risks of models trained on massive text resources, as GPT-3 (Brown et al., 2020), which we use for generation, is. Indeed, open generations from pretrained language models can often contain harmful, biased, or offensive content. We argue that this risk is largely mitigated in our work, mainly due to the narrow and constrained nature of our generations. The goal of our work is characterising simple, generic, anonymous situations, specifically in terms of commonsense causes and effects. We ensure generations are focused on these topics through careful prompting, which we found to be quite effective at keeping generations on-topic. As such, the potential for harmful generation is very low; indeed, in a manual inspection of 100 generated examples, we found none that were significantly harmful, besides one that contained adult content.
A related concern is the potential for large models and training sets to enable automated oppression or exploitation, for instance in surveillance or fake-news generation. As above, we argue that the generic, commonsense nature of our data and models makes this concern less relevant here. Our data does not contain information directly related to these harmful domains (e.g., social media or fake-news generation). While our data may assist machines in understanding basic situations, this is unlikely to be useful for harmful applications, given the simplicity of our data and the still-flawed commonsense capabilities of even the most advanced models.
Finally, we note that we ensure fair and generous compensation for all human evaluators we hire through Amazon Mechanical Turk. Based on our estimates of time required per task, we ensure that the effective pay rate is at least $15 per hour.

A Human Evaluation Details
We conduct human evaluations on Amazon Mechanical Turk using the template of Figures 4 and 5. Workers are presented with ATOMIC-style triples, with relations replaced by natural language templates (e.g., HinderedBy becomes "can be hindered by"). 3 annotators rate each triple, with options for acceptability: "always/often", "sometimes/likely", "farfetched/never", "invalid", or "too unfamiliar to judge". The first two are considered "accepted", the next two "rejected", and the final one "no judgement". For reporting acceptance rates and training a critic model, we only distinguish between "accepted" and not "accepted". Workers are compensated $0.17 per task (i.e., completing all questions in the evaluation template of Figures 4 and 5). We estimate an upper bound of 30 seconds to complete a single task, which gives an hourly rate of $20.40. Workers are selected based on an Amazon Mechanical Turk qualification, specifically filtering for workers with high accuracy on past knowledge base triple evaluations. We follow the same setup for all evaluations, besides the number of annotators. This setup is shown to result in consistent and reliable annotations, with inter-annotator agreement given by a Fleiss' kappa (Fleiss, 1971) of 40.8 when evaluating with 3 annotators (§3.4).
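For concreteness, here is a sketch of how the accepted/rejected/no-judgement aggregation and Fleiss' kappa might be computed with statsmodels; the label mapping follows the rules above, while the data-structure choices and toy inputs are ours.

```python
# Sketch of aggregating per-triple annotations (3 raters each) and
# computing Fleiss' kappa over the collapsed 3-way labels.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

ACCEPT = {"always/often", "sometimes/likely"}
REJECT = {"farfetched/never", "invalid"}

def rating_table(annotations):
    """annotations: list of per-triple rating lists.
    Returns an (n_triples, 3) count table:
    accepted / rejected / no judgement."""
    rows = []
    for ratings in annotations:
        acc = sum(r in ACCEPT for r in ratings)
        rej = sum(r in REJECT for r in ratings)
        rows.append([acc, rej, len(ratings) - acc - rej])
    return np.array(rows)

table = rating_table([["always/often", "sometimes/likely", "always/often"],
                      ["invalid", "always/often", "farfetched/never"]])
print(fleiss_kappa(table))
```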

B Using Alternate Models as Knowledge Sources
One natural question that arises from the strong performance of symbolic knowledge distillation is whether other sources of knowledge (i.e., language models) would similarly benefit from this method. In this section, we measure the capacity of other language models to serve as the "loose teacher" that generates the base knowledge of the resulting corpus. We expand our study beyond GPT-3 (the model used in our main experiments) to include 2 contemporary large language models: GPT-J (Wang and Komatsuzaki, 2021) and T5-11B finetuned for language modelling (Lester et al., 2021). For knowledge generation (verbalization), we follow the same procedure as §3, along with simple adjustments to improve quality. As we are investigating the effect of the critic on knowledge precision, we also include ATOMIC 20 20, to probe the usefulness of automatic filtering for human-authored knowledge.
For each knowledge source, we follow the human evaluation setup of §3.4 to obtain quality annotations for 2000 examples, with 1 annotation per example. This follows a similar setup to §4; indeed, we are replicating the earlier critic experiments, but at a smaller scale (2000 annotations vs. 10000) to allow for more knowledge sources. For each knowledge source, we randomly split into 1400/300/300 train, dev, and test sets. We follow §4 to train a critic model for each knowledge source.
We plot different thresholds (% of corpus filtered) against the resulting precision (the proportion of the corpus judged to be "valid" knowledge) in Figure 3, and give numbers at various sizes in Table 7. One striking aspect is that a critic model can raise the precision of any of these knowledge sources to approximately 90% while retaining 30% of the original corpus size. While this discards a significant portion of the generated knowledge, it raises the exciting prospect of using more cost-effective models at large scale to generate strong commonsense corpora like ATOMIC 10x. GPT-J and T5-11B can both be run locally by researchers, unlike GPT-3, which uses a pay-per-generation API. Thus, one can imagine producing a large, high-quality corpus like ATOMIC 10x at lower cost by generating a larger volume of knowledge from such an accessible model and simply filtering to a greater extent.
Another interesting aspect is how the various knowledge sources diverge. Under little to no critical filtering (i.e., corpus size = 1.0), the precision of the knowledge sources is widely spread: before applying a critic, the quality of the knowledge source is very important. Indeed, precision is ordered by cost of generation: human-authored ATOMIC 20 20 has the highest precision while being the most expensive, followed by GPT-3 (used here), which is pay-per-generation, and finally the two publicly available models. Another point of divergence appears under extreme filtering (at approximately 20% of the original corpus size): all knowledge sources but GPT-3 plateau at approximately 90% accuracy, while GPT-3 rises towards 100%. Indeed, this supports our use of GPT-3 in this work as a high-quality automatic knowledge source.

C Critic Model
We train binary classifiers (critics) for human acceptability using RoBERTa-Large (Liu et al., 2019). We perform a small grid search on the validation set, finding a batch size of 128, dropout of 0.1, and an Adam (Kingma and Ba, 2015) learning rate of 5e-6 to be effective. We use early stopping and decay the learning rate when validation performance plateaus, to maximize R@80% on the validation set. We find RoBERTa pretrained on MNLI (Williams et al., 2018) effective, outperforming other options. As well, we substitute randomly sampled names for the person designations "X"/"Y". As a baseline, we include an unsupervised filtration metric inspired by Davison et al. (2019), who propose a model estimate of PMI to score mined commonsense triples; in our case, we use negative log-likelihood (NLL) and token-mean NLL from GPT-3 itself.

[Table 7: Knowledge precision at various corpus sizes (from 100% to 10%) based on filtering by the critic model. Precision is calculated by human annotation of valid or invalid knowledge. We consider 4 knowledge sources, as described in Appendix B. This corresponds to the data plotted in Figure 3.]
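The NLL baselines can be reproduced in miniature as below, scoring text with a local GPT-2 as a stand-in for GPT-3's logprobs; the stand-in model and scoring details are assumptions.

```python
# Sketch of the unsupervised NLL baselines: score triples with a local
# GPT-2 (the paper computes NLL with GPT-3 itself).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def nll_scores(text):
    ids = tok(text, return_tensors="pt").input_ids
    mean_nll = lm(ids, labels=ids).loss.item()  # mean over predicted tokens
    total_nll = mean_nll * (ids.shape[1] - 1)   # undo the mean for total NLL
    return total_nll, mean_nll

# Higher NLL suggests a less plausible triple; a cutoff on either score
# serves as an unsupervised filter, compared against the trained critic.
```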
The validation precision and recall of our best-performing model, the baselines, and the non-optimal hyperparameter configurations are given in Figure 6. After fixing our model, we applied it to the test set (also in Fig. 6), verifying that it generalizes to ATOMIC 10x entries. Overall, our trained critic model is more effective than the baselines at identifying high- and low-quality teacher generations at all levels of precision and recall. This result demonstrates that a small amount of human supervision can consistently help to correct GPT-3's mistakes.

D ATOMIC 10x Generation Prompts
We include example prompts for all of our generations in Tables 8 to 15. Note that elements of the generation prompts are randomized for each batch. For event generation, the few-shot examples and their order are randomly sampled from a seed set of 100 high-quality examples from ATOMIC 20 20 for each batch. For inference generation, the natural names used for PersonX and PersonY are randomly sampled from a small predefined set of names.
[Figures 4 and 5: The Amazon Mechanical Turk annotation template. Workers rate each assertion, comprising Phrase A, a Relation, and Phrase B, as "always/often" (always or quite often true), "sometimes/likely" (sometimes true, true for some people, or likely true), "farfetched/never" (false, farfetched, or unlikely to be true), "invalid" (the assertion makes no sense), or "too unfamiliar to judge". Instructions also cover how to assess "nothing in particular" inferences in context, reporting of prejudiced or inappropriate content, and forgiveness of minor spelling or grammatical errors.]

[Figure 6: Precision vs. recall of our critic model on the human-labelled validation set. The best trained models are labelled, and other hyperparameter settings are shown as faded lines. We also include generation negative log-likelihood (NLL) and token-wise mean NLL as cutoff measures; these perform much worse than the supervised model.]