NovaCOMET: Open Commonsense Foundation Models with Symbolic Knowledge Distillation

We present NovaCOMET, an open commonsense knowledge model that combines the best aspects of knowledge models and general task models. Compared to previous knowledge models, NovaCOMET allows open-format relations, enabling direct application to reasoning tasks; compared to general task models like Flan-T5, it explicitly centers knowledge, enabling superior performance for commonsense reasoning. NovaCOMET leverages the knowledge of opaque proprietary models to create an open knowledge pipeline. First, knowledge is symbolically distilled into NovATOMIC, a publicly released discrete knowledge graph which can be audited, critiqued, and filtered. Next, we train NovaCOMET on NovATOMIC by fine-tuning an open-source pretrained model. NovaCOMET uses an open-format training objective, replacing the fixed relation sets of past knowledge models and enabling arbitrary structures within the data to serve as inputs or outputs. The resulting generation model, optionally augmented with human annotation, matches or exceeds comparable open task models like Flan-T5 on a range of commonsense generation tasks. NovaCOMET serves as a counterexample to the contemporary focus on instruction tuning only, demonstrating a distinct advantage to explicitly modeling commonsense knowledge as well.


Introduction
We present NOVACOMET, an open commonsense knowledge model combining the advantages of both knowledge models and general task models. NOVACOMET models commonsense knowledge with an open format, allowing it to be applied to general reasoning tasks, in contrast to previous knowledge models.

Figure 1: We leverage opaque-but-powerful proprietary LLMs into an open commonsense pipeline by: (i) creating an auditable knowledge base NOVATOMIC that gives fine-grained control over included knowledge; (ii) ensuring the generated knowledge uses a higher-coverage open format, with natural-language queries as relations and flexible mask-filling to allow for more open commonsense use cases; (iii) demonstrating the effectiveness of (i) and (ii) via NOVACOMET's superior performance on a number of tasks, under both automatic and human evaluations.

Compared to tuning models to be open task solvers (e.g., instruction tuning), we find that explicitly modeling knowledge in NOVACOMET also provides a distinct advantage, with NOVACOMET showing similar or superior performance to comparable open task models on a range of commonsense reasoning benchmarks.
For NOVACOMET, we leverage opaque, proprietary models like ChatGPT or GPT-4 (Ouyang et al., 2022; OpenAI, 2023) as the knowledge source in an open commonsense pipeline (Figure 1). Such models have demonstrated remarkable commonsense ability (Bubeck et al., 2023; Bian et al., 2023), yet, closed and opaque, their direct usefulness for studying commonsense is limited. Without information about training or direct access to the model, it is impossible to study where reported gains come from, e.g. the extent of test-set contamination with benchmarks.
In our work, we use these models first to generate an open knowledge base (NOVATOMIC, §2.1), which can be analyzed, improved, and verified against test-set contamination. Next, we train an open commonsense model (NOVACOMET, §2.3) on this knowledge: the underlying data and code will be released along with the model for the study of commonsense. This allows future testing of NOVACOMET (and of other models based on NOVATOMIC) to analyze the training set, essentially allowing us to distill information from a base LLM into an auditable format.
In training NOVACOMET, we also use an open format: compared to previous knowledge models, which use a fixed relation set and training order (head + relation → tail), we use natural language queries as relations and allow masked generation of all aspects of the data. This allows our model to be used in a wide range of general reasoning tasks, thus addressing a significant limitation of prior knowledge models, which are limited to downstream applications capable of effectively leveraging their restricted set of relations. Enabling an open format also allows the knowledge generation to focus on pertinent aspects of the context, rather than forcing the generation of inferences for arbitrary, potentially irrelevant relations.
Following past work on symbolic knowledge distillation (West et al., 2022), we also use NOVATOMIC as the basis for training a plausibility model with human annotations (§2.2), and study how this can improve NOVACOMET (§2.3).
We test NOVACOMET on a range of commonsense generation tasks, and find that it consistently outperforms general task models of comparable size, such as Flan-T5 xxl (Chung et al., 2022a) and T0, on commonsense tasks like abductive infilling and explanation generation. Furthermore, we assess the ability of our plausibility model to handle general commonsense QA tasks and observe that it achieves comparable or superior discriminative performance on a range of tasks. NOVACOMET will serve as an open resource for studying commonsense, and an example of the advantage of explicitly modeling commonsense knowledge in contrast to general task modeling alone.

NOVACOMET: open commonsense models
NOVACOMET is a large-scale, open commonsense model that can handle both explicit knowledge generation and tasks that require commonsense reasoning.
NOVACOMET is trained with symbolic knowledge distillation (West et al., 2021) by combining the commonsense data generated by large language models (§2.1) with high-quality annotations of knowledge plausibility (§2.2). We experiment with multiple methods for combining generated data with plausibility information (indicating how likely a given knowledge is) to train the final model, NOVACOMET (§2.3).

Generating Open Commonsense Data
Following symbolic knowledge distillation (West et al., 2021), we distill large quantities of high-quality knowledge from very large, general foundation models (§2.1.1); we call the resulting dataset NOVATOMIC. One major difference from previous knowledge graphs is that we allow an open relation set, in the form of queries rather than fixed relation tokens. While commonsense knowledge often takes a head, relation, tail format with a fixed set of discrete relations (e.g. X buys a lottery ticket, xWant, to win.), we propose a context, query, inference (CQI) format with natural language queries serving as open relations. We also analyze unique properties of this distilled knowledge in §2.1.2.
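The CQI format described above can be sketched as a simple data structure; the field names here are illustrative, not the released schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of the CQI triple format; names are illustrative.
@dataclass
class CQI:
    context: str    # free-text event or situation (the "head")
    query: str      # a natural-language question, replacing a fixed relation
    inference: str  # the commonsense conclusion (the "tail")

# A fixed-relation triple like (X buys a lottery ticket, xWant, to win)
# becomes an open-format CQI triple:
example = CQI(
    context="X buys a lottery ticket.",
    query="What does X want?",
    inference="X wants to win.",
)
```

Because the query is free text, any question can serve as the relation, rather than a member of a small fixed vocabulary like xWant or xNeed.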

Data Generation
We outline the generation process below, which consists of (1) generating contexts and (2) generating queries/inferences, resulting in our new knowledge base, NOVATOMIC.
Context Generation. First, we have experts generate 21 varied prompts to steer models to generate events or situations that require commonsense knowledge to fully understand (see Appendix B.1 for all prompts used). As variations in prompt wording influence the model's output, we use many different prompts to enhance both diversity and coverage of the generated outputs. Half of the time, we generate contexts in a zero-shot manner, while for the other half, we do one-shot generation with one example drawn from ATOMIC10X (West et al., 2022). In order to reduce generation cost, we generate 20 situations per prompt (batched generation).
We generate the contexts using the GPT-3 (Brown et al., 2020) variant text-davinci-003 (Ouyang et al., 2022) for a total cost of USD $39.56. We set top_p=0.99 and presence_penalty=0.3, lowering the logit values for tokens that have already occurred to promote diversity within each batch. Finally, to allow NOVACOMET to see some diversity of names, we also randomly swap all entities (names or "PersonX/Y/Z") for a name drawn from the 2021 public US social security application name registry with probability 0.5.
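The entity-swap augmentation above can be sketched as follows; `NAMES` is a stand-in for the social security name registry, and the function itself is an illustrative assumption rather than the released code.

```python
import random

# Stand-in for the 2021 US social security application name registry;
# the real pipeline samples from that full list.
NAMES = ["Alisa", "Fatima", "Addilyn", "Noah", "Olivia"]

def swap_entities(context: str, p: float = 0.5,
                  rng: random.Random = random) -> str:
    """With probability p, replace PersonX/Y/Z placeholders with sampled names."""
    if rng.random() >= p:
        return context  # leave the context unchanged
    out = context
    for placeholder in ("PersonX", "PersonY", "PersonZ"):
        if placeholder in out:
            out = out.replace(placeholder, rng.choice(NAMES))
    return out
```

Swapping only half the time preserves the generic PersonX/Y/Z style in part of the data while still exposing the model to concrete names.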
Query/Inference Generation. As no other resource currently has examples of high-quality commonsense inferences in our proposed open format, we develop a set of few-shot examples of 10 contexts (either handwritten or selected from ATOMIC10X or from ROCStories (Mostafazadeh et al., 2016)) with 10 handwritten commonsense query/inference pairs for each (see Appendix B.2 for more details). These query/inference pairs cover a broad swathe of knowledge, including consequences, explanations, reactions, attributes, counterfactuals, etc.
For each context in NOVATOMIC generated in the previous step, we randomly select and permute n ∼ Uniform(1, 10) of the few-shot examples to provide in context after the instructions, and then task the model with generating 10 query/inference pairs for the new context. The rationale for this random selection and permutation of the few-shot examples is to mitigate potential overfitting to a specific selection or ordering of handwritten examples. To support the use case where a desired commonsense query is not known in advance, e.g. when a user simply wants general knowledge for a given context, we also generated half of the commonsense hypotheses without generating a query first (prompt in B.2). At training time (§2.3), we input a NULL value for the query field. We generated all query/inference pairs using default decoding arguments with gpt-3.5-turbo-0301, for a total cost of USD $337.16.
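The few-shot prompt assembly above can be sketched as follows; `INSTRUCTIONS` and the example pool are placeholders for the handwritten materials in Appendix B.2, and the serialization is an assumption.

```python
import random

# Placeholder for the full instructions in Appendix B.3.
INSTRUCTIONS = ("Given a situation, ask and answer ten (10) relevant "
                "questions that require commonsense or a world model.")

def build_prompt(fewshot_pool: list[str], new_context: str,
                 rng: random.Random) -> str:
    # Sample n ~ Uniform(1, 10) examples and shuffle them, so the model
    # cannot overfit to any fixed selection or ordering of examples.
    n = rng.randint(1, 10)
    examples = rng.sample(fewshot_pool, min(n, len(fewshot_pool)))
    rng.shuffle(examples)
    return "\n\n".join([INSTRUCTIONS, *examples, new_context])
```

Each call draws a fresh subset and order, so no two prompts are guaranteed to present the handwritten examples the same way.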

Analysis
Comparison to Previous CSKGs. Table 1 shows comparisons of NOVATOMIC to existing CSKGs, ATOMIC 2020 (Hwang et al., 2020) and ATOMIC10X (West et al., 2022), in dataset statistics and lexical diversity measures. NOVATOMIC contains more diverse unique premises (heads) and hypotheses (tails), as indicated by the higher number and/or percentage of these data entries. NOVATOMIC also has higher lexical variation, as reflected by its significantly more diverse 3-grams.
In particular, as NOVATOMIC breaks out from fixed relation types by using open questions to connect premise and hypothesis, it contains a much more diverse and flexible set of relations, denoted by natural language questions. It is also of note that, based on estimates from West et al. (2022), the total costs of ATOMIC 2020 and ATOMIC10X were approximately USD $40,000 and USD $6,000 respectively, whereas the cost for NOVATOMIC is approximately $400. Though the size of NOVATOMIC is somewhat smaller, the unit cost per example is also significantly lower. Figure 2 shows the composition of open-ended vs. yes/no questions and the most common questions in the dataset. The most common questions are not context-specific (asking about time, weather, or location), although we find that many of the queries do condition specifically on context.

Plausibility Annotation
Next, we collect annotations of CQI data plausibility. Broadly, this follows West et al. (2022) and Howard et al. (2023) in combining generated data with an automatic critic to maximize the quality of a final trained model. In this case, however, we explore multiple ways to incorporate annotations of plausibility into the final model NOVACOMET (§2.3.2).
Our primary means of collecting annotations of plausibility is through Amazon Mechanical Turk. We use a similar annotation template to that of Hwang et al. (2020) (see Appendix C), asking annotators to decide if knowledge is always/often, sometimes/likely, farfetched/never true, or invalid (giving these annotations scores of 3, 2, 1, 0 respectively). We consider any knowledge scored 3 or 2 to be plausible.
In practice, we collect 20k annotations, with 1 annotator per example. For underlying data, we use 16k examples from NOVATOMIC, as well as 2k examples each from ATOMIC10X and ATOMIC 2020, to increase the diversity of annotated knowledge style. While these knowledge graphs have fixed relation sets, we sample natural language queries to replace the discrete relations (e.g. xNeed → What did PersonX need?).
We conduct a human agreement study on a segment of 200 examples for which we elicit 3 annotations each, finding Fleiss κ (Fleiss, 1971) of 0.317, indicating fair agreement (Landis and Koch, 1977).
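The agreement statistic above can be computed from a table of per-item category counts; a minimal implementation of Fleiss κ:

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """ratings[i][k] = number of annotators assigning item i to category k.
    Assumes every item is rated by the same number of annotators."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Per-item observed agreement P_i
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement from marginal category proportions
    p_j = [sum(row[k] for row in ratings) / (n_items * n_raters)
           for k in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

With 3 annotations per example over the 4 plausibility categories, each row of `ratings` would sum to 3.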

Commonsense field masking
Previous knowledge models tend to use a standard head, relation → tail format in training, generating an inference given the situation/concept and one of a set of possible commonsense relations to guide generation.
The goal of NOVACOMET is maximum flexibility in handling commonsense knowledge and tasks, meaning we would like to generate any of these fields from any of the others. For example, we may want to generate a likely query that connects the context and inference, or a context under which the query and inference are correct. To this end, we propose commonsense field masking, wherein we randomly sample subspans of fields to be masked for prediction. The masking proceeds as follows. First, the set of fields (C, Q, I) to be masked is uniformly selected from all options in which at least one field is masked. Second, for each masked field, we randomly (with p=0.5) decide whether to mask the entire field or a subspan. Finally, for those fields where a subspan is masked, we uniformly select the mask length, and which subspan of the given length to mask.
In effect, this gives the final model maximal flexibility at inference time. Users can mask any field, either the full span or a subspan to infill as needed, allowing for use cases beyond simply generating a full inference as in previous commonsense models.
We explore how this can be especially useful in §3.2.
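The masking procedure above can be sketched as follows. The T5-style sentinel serialization is an assumption for illustration; the released format may differ.

```python
import random

FIELDS = ("context", "query", "inference")

def mask_cqi(triple: dict, rng: random.Random):
    # Step 1: pick a nonempty subset of fields to mask. Rejection sampling
    # over coin flips yields a uniform choice over the 7 nonempty subsets.
    while True:
        masked = [f for f in FIELDS if rng.random() < 0.5]
        if masked:
            break
    source, targets = [], []
    sentinel = 0
    for f in FIELDS:
        words = triple[f].split()
        if f not in masked:
            source.append(f"{f}: " + " ".join(words))
            continue
        # Step 2: mask the whole field (p=0.5) or a uniform random subspan.
        if rng.random() < 0.5 or len(words) < 2:
            lo, hi = 0, len(words)
        else:
            length = rng.randint(1, len(words))
            lo = rng.randint(0, len(words) - length)
            hi = lo + length
        tok = f"<extra_id_{sentinel}>"
        sentinel += 1
        source.append(f"{f}: " + " ".join(words[:lo] + [tok] + words[hi:]))
        targets.append(f"{tok} " + " ".join(words[lo:hi]))
    return " ".join(source), " ".join(targets)
```

Masking the query field, for example, asks the model to recover a question connecting the given context and inference.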

NOVACOMET Versions
We consider a variety of methods to use the generation and critique data described above for training.

Quantized Reward Conditioning. Inspired by quantized reward conditioning (Lu et al., 2022), we also consider more closely unifying the critic and generation data. We take a lightweight, one-step approach (as opposed to the full reinforcement learning in Lu et al. 2022) in which we annotate NOVATOMIC with NOVACOMET crit, then train a masked-prediction model that includes plausibility as a conditioning variable for predicting CQI. For annotation with NOVACOMET crit, we greedily decode plausibility, and train a reward-conditional model NOVACOMET rc. When decoding with NOVACOMET rc, we condition on either of the "plausible" labels (2 or 3) from §2.2.
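The reward conditioning above amounts to prepending the critic's quantized plausibility label to each training example; the serialization below is an illustrative assumption, not the released format.

```python
# Sketch of one-step quantized reward conditioning: the plausibility label
# predicted by the critic becomes a conditioning field on the input.
def to_reward_conditioned(cqi_text: str, plausibility: int) -> str:
    assert plausibility in (0, 1, 2, 3)
    return f"plausibility: {plausibility} {cqi_text}"

# At decoding time, condition on one of the "plausible" labels (2 or 3):
prompt = to_reward_conditioned(
    "context: X buys a lottery ticket query: What does X want "
    "inference: <extra_id_0>", 2)
```

Conditioning on "2" versus "3" at decoding time then selects for sometimes/likely versus always/often true generations.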

Model Training
We use the T5X codebase (Roberts et al., 2022) to train NOVACOMET, using the base T5 1.1 xxl (∼11B parameters) checkpoint to initialize all of our experiments. We train all models on v3-128 TPU pods, using a batch size of 128 and a learning rate of 1e-5. For generation models, we train for a fixed 100k training steps, ensuring that loss does not converge or begin increasing. For models that include plausibility prediction as an objective, we stop training when the evaluation loss for plausibility converges, which is often significantly before 100k training steps.

Evaluating Plausibility
We begin by evaluating the performance of our plausibility model NOVACOMET crit. In particular, we aim to understand the ability of this model to provide a useful, absolute plausibility score. We compare the accuracy of our plausibility scores on discriminative commonsense benchmarks to judge their effectiveness.

Models and Baselines
As baselines, we primarily consider popular language models in a roughly similar range of parameter sizes. We include the basic language models LLaMA (Touvron et al., 2023) and PaLM (Chowdhery et al., 2022) (citing performance directly for both), and language models with general task tuning, such as QA tuning for Macaw (Tafjord and Clark, 2021) or instruction tuning for Flan-T5 xxl (Chung et al., 2022b) and T0 (Sanh et al., 2021). We create standard-format prompts that depend on the model. When possible, models are given answer choices as input. This is an advantage over plausibility models like NOVACOMET crit, which are designed to judge answers in isolation, but we include it to maximize baseline quality. To score options for baselines, we use negative log-likelihood, which we found to be the best of a range of options. We cite results for an alternative prompting format for Flan-T5 from Liu et al. (2023), which automatically reformats commonsense questions as statements, then judges plausibility as the likelihood of answering "yes" or "no" to whether the statement is plausible. We note that, while this method performs well, it does not generally apply to Context-Query-Inference (CQI) formatted data, as not all CQI examples can be naturally reformatted into short statements, but we include this baseline for completeness. We also cite results on GPT-3.5, ChatGPT, and GPT-4 from the same paper.
We compare these baselines to NOVACOMET crit, described in §2.3. For this model, we score options based on the probability of predicting 2 or 3 for plausibility (sometimes/likely or always/often true), renormalized against the probability of predicting 1 or 0 (rarely or never true, or invalid).
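The renormalized score above can be sketched directly; `p` is an assumed mapping from plausibility label to the model's probability for that label.

```python
# Probability mass on the "plausible" labels {2, 3}, renormalized against
# the mass on the "implausible" labels {0, 1}.
def plausibility_score(p: dict[int, float]) -> float:
    plausible = p.get(2, 0.0) + p.get(3, 0.0)
    implausible = p.get(0, 0.0) + p.get(1, 0.0)
    return plausible / (plausible + implausible)
```

Renormalizing over the four label tokens discards probability the model places on any other token, yielding a score in [0, 1] usable as an absolute plausibility judgment.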

Results and Discussion
Model scores on various tasks requiring commonsense knowledge can be seen in Table 3. While various models are better at different tasks, NOVACOMET crit is tied for the most combined 1st + 2nd place results (5). Note that the other tied system, Flan-T5 (statements), requires automatically transforming each problem into a yes/no question, a transformation that is not generally applicable to the kinds of Context-Query-Inference style problems we would like to solve when deriving commonsense information from a model.
Looking at cases where NOVACOMET crit fails to get the highest accuracy, it is perhaps unsurprising that PaLM 540B and 62B outperform all other models on HellaSwag, which requires predicting the most likely continuation of a scene description, a task especially well suited to a raw language model. Furthermore, with Physical IQA (PIQA), the focus is on physical commonsense, a subcategory that our base generator seemed to produce less naturally on inspection.
We also note that many baselines (e.g. Macaw, T0) assume access to all answer choices. For our use case (judging knowledge within NOVATOMIC to improve the overall dataset), we judge examples in isolation with no clear contrastive examples. The competitive performance of NOVACOMET crit here, despite such disadvantages, further validates it for this use case.

Evaluating Generation
The central goal of NOVACOMET is generating commonsense knowledge and carrying out commonsense reasoning. In this section, we test the ability of the various versions of NOVACOMET described in §2.3 to do this. Note that we primarily use human evaluation for model generations, following a similar setup to §2.2, with annotation templates available in Appendix C.

Datasets
First, we test the ability of models to generate commonsense knowledge in the format of previous knowledge models. Particularly, we take a sample of ATOMIC 2020 (Hwang et al., 2020) commonsense prompts (head + relation), testing the ability of models to generate a valid tail. Results are included in Table 4.
Next, we test on various downstream benchmarks requiring generative commonsense reasoning. First, we test abductive natural language generation (αNLG) (Bhagavatula et al., 2019), wherein models must abductively fill in the gap in a story between two observations. We also consider two question-answering datasets that require commonsense reasoning: TellMeWhy (Lal et al., 2021), in which models explain events, and Reflect (Zhou et al., 2022), in which models generate ATOMIC-style inferences for dialogues. We report results for all downstream reasoning benchmarks in Table 4. We use a custom annotation template for αNLG, and otherwise use the base CQI template from our annotation in §2.2.

Baselines and Models
For baselines, we include all of the models described in §3.1, as well as T5 xxl (∼11B parameters) finetuned for language modeling (T5-LM) (Raffel et al., 2019). We use straightforward prompts to describe each task and generate directly.
Different datasets demonstrate unique ways to use the commonsense masking of NOVACOMET for generation. For example, for αNLG, we mask between the beginning (o1) and ending (o2) events to form a natural sequence, predicting a hypothesis h that fits between o1 and o2. We found this resulted in much higher quality generations than encoding o1, o2 as context and predicting h as inference.
For the other datasets (Reflect, TellMeWhy, ATOMIC 2020), we can encode examples simply by giving the context and query, then predicting the inference. For all models, we use greedy decoding.
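The two encodings above can be sketched as follows; the serialization and sentinel token are illustrative assumptions.

```python
# aNLG: the hypothesis slot sits between the two observations, so the
# model infills it directly as a masked span.
def anlg_input(o1: str, o2: str) -> str:
    return f"context: {o1} <extra_id_0> {o2}"

# Reflect / TellMeWhy / ATOMIC-2020: context and query are given and the
# inference field is masked instead.
def cqi_input(context: str, query: str) -> str:
    return f"context: {context} query: {query} inference: <extra_id_0>"
```

The αNLG encoding treats the missing hypothesis as an infill inside the context, which matches the flexible masking the model saw during training.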

Results and Discussion
All generation results use human evaluation, presented in Table 4. Note that human evaluation templates are included in the Appendix. We evaluate 100 examples for each system and dataset. For Reflect, TellMeWhy, and ATOMIC 2020, we use the same template as §2.2. For αNLG, we use a template measuring coherence between the generated infill and either or both hypotheses, as well as overall quality. All scores in Table 4 are normalized to a range between 0 and 1.
Note that NOVACOMET models win across the board. Particularly effective is the filtered model NOVACOMET filter-0.99, but so are the reward-conditional models, NOVACOMET rc(2) in particular, conditioned on "2" (likely/sometimes true) rather than "3" (always/often true). It is possible that answers that are always true are somewhat less creative or preferable to humans.
In general, the NOVACOMET models that use plausibility information outperform the basic NOVACOMET base, other than on the TellMeWhy dataset. This demonstrates a particular advantage of distilling discrete data: it can be annotated, and those annotations can improve downstream performance.
Overall, the superior performance of NOVACOMET suggests that explicitly modeling knowledge can provide an advantage, at least for tasks that explicitly require commonsense knowledge and reasoning.

Related Work
Knowledge Generation. Pretrained language models have demonstrated the ability to carry implicit knowledge (Petroni et al., 2019; Dhingra et al., 2022). Such large language models are prompted to generate new knowledge for performing downstream tasks such as text classification (Shin et al., 2020; Puri and Catanzaro, 2019) and commonsense reasoning (Liu et al., 2022b; Trinh and Le, 2018; Davison et al., 2019). We take inspiration from commonsense LMs designed to query commonsense knowledge, such as COMET (Bosselut et al., 2019) and COMET-2020 (Hwang et al., 2021). Domain-specific LMs are also used for knowledge graph completion in specialized domains like biomedicine (Nadkarni et al., 2021).

Knowledge Distillation. As the process of manually creating datasets can be costly and complex, prior studies have explored automated data generation. These prior works mainly focused on extractive approaches, e.g. syntactic parsing (Zhang et al., 2020a) or pattern matching (Li et al., 2020) from unstructured text (Lehmann et al., 2015; Buck et al., 2014). West et al. (2021) proposed filtering out low-quality data using a critic model for symbolic knowledge distillation from larger models. Following this, several works effectively improved upon this approach with iterative distillation (Sclar et al., 2022; Bhagavatula et al., 2023) and self-chat with feedback and conversations with ChatGPT (Xu et al., 2023; Geng et al., 2023; Chiang et al., 2023). SODA (Kim et al., 2023) contextualized social commonsense knowledge from a knowledge graph to distill dialogues from InstructGPT. Sclar et al. (2022) established filters based on length, fidelity, and information bottleneck for distilling reference-free summarization, determining the effectiveness of designing filters for selecting data for the following iteration. Recently, Jung et al. (2023) proposed a framework to learn a high-quality model from a low-quality teacher model to distill a good dataset by summarizing and paraphrasing.

Conclusions
Overall, we introduce NOVACOMET, an open commonsense foundation model.NOVACOMET takes advantage of closed proprietary models, resulting in an open pipeline and resources that are publicly available.NOVACOMET is trained on data generated from these closed proprietary models and augmented with human annotations, resulting in both a high-quality plausibility model and improved generative model.NOVACOMET surpasses other general models of similar size at a range of commonsense knowledge-intensive tasks, demonstrating the existing need for explicit knowledge modeling, even as task-focused methods like instruction tuning grow in popularity.

Limitations
First, we recognize that our line of research requires extensive resources and funding, limiting the broad adoption of our methodology as presented. Particularly, our work relies on both massive generation from proprietary language models (GPT-3.5 turbo) and extensive use of TPU resources. Our hope is that these barriers will only be lowered as proprietary LMs become cheaper and LMs become increasingly efficient to tune and run inference on (Dettmers et al., 2023), lowering the barrier for techniques such as ours.
Second, we recognize that, while we have attempted to test the query-ability of commonsense knowledge via automatic and human evaluations on a number of different tasks, current tasks are largely biased towards certain topics and tend to implicitly define ground truth from certain, fixed perspectives rather than acknowledging the underlying diversity of human perspectives (Santy et al., 2023). This limits our ability to assess whether our models capture genuine human agreement, or only the agreement of a certain portion of the population, something which we hope future work will investigate.

Ethics Statement
Akin to all other machine learning approaches, our model could inadvertently exhibit biases. We acknowledge that the open-format relations gathered from proprietary models may not be representative of all cultures, and may thereby perpetuate the biases that these proprietary large models possess. While generating commonsense knowledge, LLMs may produce unanticipated commonsense inferences, including those that are biased and escape our critic model. Consequently, incorporating these inferences during training can further amplify such biases. We are committed to understanding such biases and improving our critic model. However, our model's central tenet is knowledge, which contrasts with existing public models of similar size and architecture, thereby regulating the toxicity of the model. We ensured that the crowd workers involved in our project were compensated at a rate that met or exceeded the minimum wage requirement, recognizing the value of their contributions to building our model. Comparable to all open models, our model is susceptible to malicious use, and it is our collective responsibility to promote safe open usage. We acutely understand the ethical implications associated with our proposed method and are dedicated to resolving them, aiming to ensure the responsible adaptation of our approach in the community.

Table 6: Comparison of baselines and NOVACOMET base using the automatic metrics BLEU (Papineni et al., 2002) and BERTScore (Zhang et al., 2020b). Automatic metrics do not seem to agree with human evaluation, and show less clear variation.
B.1 Context Generation Prompts

'''
Generate 20 situations that happen sometimes. They should be complex and include multiple parts. (One per line)
1. Situation:
'''

'''
Generate 20 situations that include a person or people. They should be complex and include multiple parts. (One per line)
1. Situation:
'''

'''
Generate 20 everyday situations about PersonX (one per line). It may also involve other entities, such as PersonY. They should be complex and include multiple parts. (One per line)
1. Situation:
'''

B.2 Relation Generation Prompts
Below are the prompts for generating relations. To promote diversity, the number of examples was randomly selected from Uniform(1, 10) and the examples were shuffled. Some contexts come from ROCStories (Mostafazadeh et al., 2016) and West et al. (2022), while others are handwritten. All questions and inferences are hand-written. When prompting gpt-3.5-turbo, we provide the instructions "Given a situation... answer" as the system message, the Context as a user message, and the ten generated queries/inferences as the system response.

B.3 With queries
Given a situation, ask and answer ten (10) relevant questions that require commonsense or a world model. Some examples may include potential consequences, explanations, prerequisites or reactions, attributes, or counterfactuals. The commonsense facts may be about actors, actions, events, or ideas in the passage. The examples should be high-quality and things that are true. Please give a plausible answer at all times instead of just saying that it depends. Only ask questions that will have a relevant, commonsense answer.
Alisa and her family lived in Florida. They heard a hurricane was coming. They decided to evacuate to a relative's house. They arrived and learned from the news that it was a terrible storm.
1. What will happen now? They will wait out the storm.

The robber is probably armed
3. What does the robber have? The robber probably has a getaway car
4. What does the bank teller feel? The bank teller is probably scared
5. What might happen to the robber? The robber could go to jail
6. What does the bank have? The bank might have a security system
7. Before this, did the robber do anything? The robber probably planned this in advance
8. As a consequence, what will happen? After, the robber will have the money
9. What happens before this? The robber tells the bank teller to give them the money
10. How much money does the robber get? A lot of money

The woman enters the elevator
1. What did the woman have to do before? The woman had to press the button for the elevator to come to her floor
2. What is the woman's goal? The woman wants to go to a different floor
3. What will the woman do next? The woman will press the button for the floor she wants to go to
4. What could hinder this situation? The woman wants to take the stairs to be healthy
5. Is she alone? She may or may not be alone, since there could be other people in the elevator.
6. What does the woman see in the elevator? Buttons to different floors
7. What does the woman feel? The woman could feel impatient at having to wait for an elevator
8. As a consequence, what will happen? The woman will arrive at her desired floor
9. What could prevent this from happening? The elevator is out of service
10. Where are elevators located? Multistory buildings

Emma has a big exam tomorrow. She got so stressed, she pulled an all-nighter. She went into class the next day, weary as can be. Her teacher stated that the test is postponed for next week.
1. How does Emma feel about this? Emma is probably relieved
2. Why might Emma be frustrated? Emma could be frustrated because she stayed up all night studying for nothing
3. What is the consequence of the situation? Emma will have more time to study
4. What is the prerequisite for this situation? Emma needed to have a big exam
5. Tell me what Emma will do next. Emma will probably go home and sleep.
6. What did Emma do before this? Emma was studying for her exam
7. Why did the teacher postpone the exam? The teacher may have postponed the exam because not everyone was ready.
8. What is an attribute of Emma? Emma is a procrastinator.
9. What is an attribute of Emma's teacher? flexible

A robber steals from a bank.
1. The robber is probably armed
2. The robber probably has a getaway car
3. The bank teller is probably scared
4. This is illegal
5. The robber could go to jail
6. The bank might have a security system
7. The robber probably planned this in advance
8. After, the robber will have the money
9. Before this happens, the robber tells the bank teller to give them the money
10. The robber might wear a mask

Addilyn and her family lived in Florida. They heard a hurricane was coming. They decided to evacuate to a relative's house. They arrived and learned from the news that it was a terrible storm.
1. They may have left valuables behind
2. They may come back to a destroyed house
3. They were smart to evacuate
4. If they didn't evacuate, they might have died
5. The hurricane was very bad
6. Now, they will wait out the storm
7. They went to their relative's house because they wanted to be in a safe place
8. They wouldn't have fled if they had not heard about the hurricane
9. The relative lives somewhere safe from the hurricane
10. Their relative is kind for letting them stay over

Fatima was assigned a roommate her first year of college. Her roommate asked her to go to a nearby city for a concert. Fatima agreed happily. The show was absolutely exhilarating.
1. Fatima has a roommate
2. Fatima likes music
3. As a result, Fatima and her roommate will be better friends
4. Fatima enjoyed the concert
5. In the future, Fatima may want to go to more concerts
6. Fatima may be more likely to spend time with her roommate
7. Fatima's roommate is considerate
8. Fatima's roommate is probably also a student
9. The roommate thinks that Fatima is cool
10. They got to the concert using a car or public transportation

The woman enters the elevator
1. Before, the woman pushed the button for the elevator
2. The woman is going to a different floor
3. After, the woman will push the button for her floor
4. Then, she will press the button for her desired floor
5. First, the woman will wait for other people to walk out of the elevator
6. The woman might have been impatient if she had to wait for a long time
7. This couldn't happen if the elevator were out of service
8. The woman would not have done this if she wanted to take the stairs to be healthy
9. The woman may have been in a hurry
10. She is in a multi-story building

IMPORTANT:
In this new dataset, some of the guesses may be exact or near copies of one of the observations. This is automatically bad: please respond with "strongly disagree" for all questions. Sometimes certain actions can simply be responded to by doing nothing! Other times, doing nothing in particular is simply a weird or unlikely reaction to something.

New! Please report any prejudiced or inappropriate language: profane or offensive content (NSFW, R-rated material, etc.), or prejudiced assumptions and derogatory language that villainizes people.
rather than simply training on instructions alone.

1 Our resources are available at novacomet.dev

NOVATOMIC is distilled from strong LLMs such as GPT-3 Turbo, which allows auditable transfer of knowledge and prevents issues like data contamination (CITE) that may plague opaque proprietary models. The final trained model learns to fill in any part of the knowledge: beyond simply answering a query [orange], NOVACOMET can, e.g., predict what context [green] would make a query [orange] + inference [purple] likely.

Figure 2: (a) The most frequent question prefixes. (b) The composition of open-ended vs. yes/no questions.

Table 2: Frequent queries in NOVATOMIC. Note that we take the top 100 surface forms and cluster them into semantically related/equivalent groups by hand. Queries above represent the top groups by aggregate count, with indicative labels. See Appendix A for more details.
Generation Model First, we consider the simplest option for producing a commonsense generation model: training directly on NOVATOMIC. NOVACOMET-base is trained only on generation data from §2.1 with the commonsense masking objective (§2.3.1). Plausibility is not used in this model.
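The commonsense masking objective can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's training code: the field names follow the paper's CQI (context, query, inference) structure, the mask token and serialization format are placeholders, and the helper `make_masked_example` is hypothetical. The key idea it shows is that any one field can serve as the target, so the same data trains "answer the query" and "predict the context" directions alike.

```python
# Toy sketch of the open-format commonsense masking objective.
# Field names follow the paper's CQI triples; the mask token and the
# space-joined serialization are illustrative assumptions.

# A hypothetical CQI triple in the style of NOVATOMIC.
triple = {
    "context": "A robber steals from a bank.",
    "query": "What happens after?",
    "inference": "The robber will have the money.",
}

def make_masked_example(triple, mask_field, mask_token="<mask>"):
    """Turn a CQI triple into a (source, target) pair by masking one field.

    The seq2seq model is trained to reconstruct the masked field from the
    remaining ones, so context, query, or inference can each be the output.
    """
    source = " ".join(
        mask_token if field == mask_field else value
        for field, value in triple.items()
    )
    return source, triple[mask_field]

# Mask the inference: the usual "answer the query" direction.
src, tgt = make_masked_example(triple, "inference")

# Mask the context instead: predict what context makes query + inference likely.
src2, tgt2 = make_masked_example(triple, "context")
```

In a real training setup, the masked field would be chosen at random per example so the model learns all directions of the triple.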
Plausibility Model Second, following West et al. (2022), we train a standalone plausibility critic model, NOVACOMET-crit. This is trained to generate a plausibility score from a complete CQI (context, query, inference) knowledge triple, on the annotation set from §2.2. In effect, it returns the probability that a given CQI is plausible.

Filtered Generation Model Following West et al. (2022), we use a simple filtering-based technique for improving generation with plausibility scores. Using NOVACOMET-crit, we calculate the probability of being plausible for all examples in NOVATOMIC, and filter to only those points that meet a minimum probability. We focus on one threshold in particular, 0.99, indicating that NOVACOMET-crit assigns at least 0.99 probability to the given CQI being plausible. We call the resulting model NOVACOMET-filter-0.99; the filtered training set retains over 50% of its original size, indicating NOVATOMIC is already high quality.
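The filtering step above reduces to a one-line threshold over critic scores. In this sketch, `critic_score` is a stand-in for NOVACOMET-crit (its name and signature are assumptions here); the toy scores are fabricated purely to make the example self-contained, and real scores would come from the trained critic.

```python
def filter_by_plausibility(triples, critic_score, threshold=0.99):
    """Keep only the CQI triples the critic deems plausible with p >= threshold.

    `critic_score` maps a triple to the probability that it is plausible;
    it stands in for NOVACOMET-crit in this illustration.
    """
    return [t for t in triples if critic_score(t) >= threshold]

# Toy critic for illustration only: two fake triples with fabricated scores.
toy_scores = {"good": 0.995, "bad": 0.40}
triples = [{"id": "good"}, {"id": "bad"}]

kept = filter_by_plausibility(triples, lambda t: toy_scores[t["id"]])
```

At the paper's 0.99 threshold, only the high-scoring triple survives; applied to NOVATOMIC, this kind of filter retains over half the data.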

Table 4: Human evaluation of various commonsense generation tasks. Note that the basic version of NOVACOMET outperforms baselines consistently, but is outperformed by versions that use plausibility to improve. We find human agreement with Fleiss κ (Fleiss, 1971) of 0.32, 0.44, 0.43, and 0.39 (respective to order in the table), indicating fair to moderate agreement. Note that values in this table are normalized to a [0, 1] range.
Liu et al. (2022a) use dataset cartography to prime the model with challenging examples and enable it to generate more examples with such patterns.

Evaluate the AI's guess. Observations: What happened in between the observations?

Tell us, given the observation pair, how good the AI's guess is on several dimensions. Please note that you might get the same observation pairs multiple times. For each, you will see a different AI's guess, so please read the guess carefully. Please be forgiving of minor spelling or grammar errors, as that's not what's being tested.

(1.1) The AI's guess is a sensical and coherent follow-up event to Observation 1. It leaves no large unexplained information gaps.
(1.2) The AI's guess is a sensical, coherent, and explanatory preceding event to Observation 2. It leaves no large unexplained information gaps.
(1.3) The AI's guess is sensical and coherent when both Observations are looked at together.

Say we were to string the sentences up as a short anecdote...
Please let us know if anything was unclear, if you experienced any issues, or if you have any other feedback for us.
HOWEVER, please note that not all negative content is derogatory, especially if Phrase B is intrinsically what Phrase A means. For example:

criminals / how are they characterized? / committing crime — OK.
↳ This isn't necessarily villainizing people, since "criminal" means "a person who has committed a crime".

homeless / how are they characterized? / being lazy — prejudiced.
↳ There are many reasons a person is rendered homeless. This is a gratuitous prejudice about homelessness.

If the terms are too obscure or you don't know the truth of the fact off the top of your head, it is okay to mark it "too unfamiliar to judge". If you can answer (e.g., based on likelihood), please provide a response. Short phrases may describe objects, object properties, events, actions, etc.

- Sometimes true, or true for some people — or — likely true.
- farfetched/never: False or farfetched, at best — or — unlikely to be true.
- invalid: This assertion makes no sense (i.e., "what does this even mean?!").
- too unfamiliar to judge: Cannot make a fair evaluation; unfamiliar with one or both of the phrases.
Material that people may find disturbing, off-putting, or improper. A couple of NOTES: please be forgiving of spelling or grammatical errors.