BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs' Generation

Large language models (LLMs) such as GPT-3 have demonstrated a strong capability to generate coherent and contextually relevant text. However, amidst their successes, a crucial issue persists: their generated outputs still lack commonsense at times. Moreover, fine-tuning the entire LLM towards more commonsensical outputs is computationally expensive, if not infeasible. In this paper, we present a computation-efficient framework that steers a frozen Pre-Trained Language Model (PTLM) towards more commonsensical generation (i.e., producing a plausible output that incorporates a list of concepts in a meaningful way). Specifically, we first construct a reference-free evaluator that assigns a commonsense score to a sentence by grounding the sentence to a dynamic commonsense knowledge base from four different relational aspects. We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head that guides a fixed PTLM to better satisfy the oracle. We test our framework on a series of GPT-2-, Flan-T5-, and Alpaca-based language models (LMs) on two constrained concept-to-sentence benchmarks. Human evaluation results demonstrate that our method consistently leads to the most commonsensical outputs.


Figure 1 (example outputs):
GPT-3 Davinci-003: The employee watched as the customer prepared their food.
Ours: Several employees are preparing food while a customer waits and watches.
GPT-2: A woman is opening an oyster and then she puts the oyster in her glove.
Alpaca: She opened her hand to reveal an oyster glove.
Ours: A woman in a yellow raincoat and latex gloves opens an oyster with hand.

Figure 1: LMs such as GPT-2 finetuned, Alpaca-7b few-shot, and GPT-3 Davinci-003 fail to incorporate the concepts in a commonsensical way. We highlight the insensible phrases in purple. (c) illustrates that they are also vulnerable to perturbations of the input prompt as simple as the swap of two concept positions. Our system, which uses an auxiliary model to steer a frozen PTLM, generates the most commonsensical outputs.
In this paper, we explore the task of generative commonsense reasoning: a constrained text generation task aiming to generate a plausible sentence given a list of concepts as input. As depicted in Figure 1, language models should generate a sentence that incorporates 'open, hand, oyster, glove' in a meaningful way that aligns with our commonsense. We find that LLMs are unreliable and fail to generate commonsensical outputs when the input concepts get complicated. In another case depicted in Figure 1(c), when we swap the positions of the two input concepts 'customer' and 'employee', LLMs such as Davinci-003 are vulnerable to the change and generate 'employee watched a customer prepare food', which is far from plausible, despite being instructed not to consider the order in which the concepts appear.
Various knowledge-augmented systems have been previously proposed to incorporate external knowledge into the model (Liu et al., 2021; He et al., 2022) for more plausible generation outputs. However, they all require updating model weights at the scale of hundreds of millions of parameters, as in BART (Lewis et al., 2020). As PTLMs continue to evolve and scale up to hundreds of billions of parameters in size, finetuning the entire LM becomes computationally prohibitive for many parties in academia and industry.
In this work, we propose BOOST, a framework to boost the commonsense of PTLMs' generation in a plug-and-play manner (Figure 2). It is inspired by recent developments in controllable generation that use a small auxiliary model to control a PTLM by training on its self-generated samples (Meng et al., 2022). Specifically, to better integrate commonsense knowledge, we first build a scorer that evaluates how commonsensical a sentence is. The commonsense scorer, called O-Scorer, extracts tuples of commonsense-related concepts (e.g., <customers, prepare their food>) from a sentence, and scores the extracted tuples by grounding them to a dynamic commonsense knowledge base (CSKB) (Bosselut et al., 2019; Ghazarian et al., 2023). Next, we use the signal from the O-Scorer to train an auxiliary model that steers the PTLM toward more commonsensical outputs. Note that our training process is generalizable and only requires access to the output probabilities of the PTLM. It is also efficient due to the smaller size of the auxiliary model.
Our contributions are two-fold. First, we propose a reference-free evaluator to assess how commonsensical a sentence is, which performs on par with reference-based metrics such as BERTScore (Zhang et al., 2019) in terms of correlation with human judgment. Second, we extend a controllable generation approach to improve commonsense for black-box PTLMs. Experimental results show that our method consistently results in the most commonsensical outputs.

Overview
Figure 2 provides an overview of our approach, BOOST. During training, BOOST first generates numerous samples $(y_1, \dots, y_N)$ from the PTLM conditioned on the input constraint $x$ (e.g., 'lasso horse cow'). We then construct an oracle to give commonsense scores to all of these self-sampled generations. Next, for each $y_i$ of length $T_i$, we train the auxiliary model, called NADO, which essentially learns to predict the expected commonsense score of the complete sequence $y_i$ given $x$ and an incomplete sequence $y_{i,<t}$ ($t \in [1, 2, \dots, T_i]$). The flow at inference time is illustrated with dashed lines: both the PTLM and NADO take $x$ and the generated sequence (prefix) $y_{<L}$ as input, from which we obtain the final output distribution $q(y \mid x)$. The rest of this section is organized as follows. In §2.2, we first introduce the details of constructing the commonsense scorer. Then, in §2.3, we provide the theory and practice of training the auxiliary model on the PTLM's self-generated data towards the oracle.

Constructing Commonsense Scorer
We use commonsense relation tuples as the intermediate representation of a sentence. Specifically, instead of relying on human annotation, we extract these tuples with few-shot LLMs. We then check whether the extracted tuples are sensible. To this end, we assign each parsed tuple a compatibility score based on its maximum similarity with the many valid answers generated by COMET, a dynamic commonsense knowledge base (CSKB). Scores for all tuples in a target sentence are then aggregated to obtain the sentence-level commonsense score. Figure 3 provides an illustration of our oracle scorer.

Commonsense-Relation Extraction
Tuple Format. We adopt the format of ConceptNet (Speer et al., 2017), a widely used knowledge graph that connects concepts or events with commonsense relations to represent general world knowledge. Specifically, each tuple T contains a head concept/event h (e.g., driller) and a tail concept/event t (e.g., drill a hole), which are connected through a commonsense relation r (e.g., UsedFor). We consider four crucial and prevalent relation types: UsedFor, CapableOf, AtLocation, and PartOf.
Tuple Extraction. We present a labor- and cost-efficient way to extract all tuples from a target sentence, including both commonsensical and nonsensical tuples. LLMs such as GPT-3 and ChatGPT (Brown et al., 2020; Ouyang et al., 2022) have demonstrated a remarkable ability for few-shot in-context learning on semantic parsing tasks (Dong and Lapata, 2016; Dunn et al., 2022). Motivated by such progress, instead of asking human workers to annotate a training set of sentences, we leverage OpenAI's GPT-3.5-Turbo model to parse the relevant tuples. We hand-crafted 9 examples for our few-shot prompt such that the LLM can accurately extract both sensical tuples (e.g., a girl is CapableOf blowing candles) and nonsensical tuples (e.g., horse is CapableOf riding bikes) from the input sentence. The complete instruction and prompt can be found in Appendix A.
However, in practice, using GPT-3.5-Turbo to parse all the sentences needed to train our auxiliary model is costly, and it is unreliable because it depends on the unpredictable traffic of OpenAI's API. To obtain an extractor that can parse about a million sentences at a reasonable cost, we finetune a T5-large model (Raffel et al., 2020) on 6,000 GPT-3.5-annotated sentences for the same task. We show the performance of both tuple extractors in §3.2.

Generative Commonsense Scoring
After extracting relation tuples from a sentence, we need to assess how commonsensical they are. To this end, we follow the compatibility test proposed by Ghazarian et al. (2023) and leverage COMET (Bosselut et al., 2019), a pre-trained generative commonsense transformer that can predict sensible tails given the heads and relations as input. Compared to other fixed and predefined knowledge bases, COMET is dynamic and much more flexible when dealing with original and unseen inputs.
Formally, given a tuple $T_i = (h_i, r_i, t_i)$ and a dynamic CSKB denoted by $C_{dy}$, we query $C_{dy}$ with the head $h_i$ and relation $r_i$ to obtain a diverse list of $k$ conditionally generated tails with beam decoding:

$$\{t^{*}_{j}\}_{j=1}^{k} = C_{dy}(h_i, r_i, \mathrm{beam} = k).$$

The compatibility score of $T_i$ is then the maximum cosine similarity between the extracted tail and the generated tails:

$$\mathrm{COMPAT}(T_i \mid C_{dy}) = \max_{1 \le j \le k} \cos\big(\mathrm{emb}(t_i), \mathrm{emb}(t^{*}_{j})\big), \qquad (1)$$

where $\mathrm{emb}(\cdot)$ is the vector representation from a sentence embedding model (Reimers and Gurevych, 2019). Finally, we need to aggregate the compatibility scores computed from the different tuples extracted from a single sentence. The sentence-level commonsense score is denoted as the O-score. There are two competing rationales: one is that a single nonsensical tuple can result in a nonsensical sentence, while the other is that one mistake will be mitigated by other reasonable tuples. We hence take 1) the minimum and 2) the average of the compatibility scores, and study their correlation with human judgment in §3.3 and Table 2.
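The scoring and aggregation steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the toy 2-d vectors stand in for embeddings that, in the paper, come from a sentence embedding model applied to the extracted tail and to the COMET-generated tails.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def compatibility(tail_vec, generated_tail_vecs):
    """Maximum similarity between the extracted tail and any generated tail."""
    return max(cosine(tail_vec, g) for g in generated_tail_vecs)

def o_score(tuple_scores, mode="mean"):
    """Aggregate tuple-level compatibility scores into a sentence-level score."""
    if mode == "min":
        return min(tuple_scores)
    return sum(tuple_scores) / len(tuple_scores)

# Toy 2-d "embeddings" (hypothetical values, for illustration only).
sensible = compatibility([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
nonsensical = compatibility([1.0, 0.0], [[0.0, 1.0], [0.0, 2.0]])
scores = [sensible, nonsensical]
print(o_score(scores, "mean"), o_score(scores, "min"))
```

With the minimum, one nonsensical tuple drives the sentence score down to that tuple's score; with the mean, the sensible tuple partially offsets it, which mirrors the two rationales in the text.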

Commonsense-Guided Generation
In this subsection, we describe how we use our derived commonsense oracle to steer the PTLM toward more commonsensical outputs through a neurally-decomposed head (NADO). In §2.3.1, we summarize the theoretical solution of Meng et al. (2022) to decompose the sequence-level oracle into token-level guidance with a frozen PTLM, such that when generating the i-th token, the auxiliary neural network modifies the original output logits predicted by the PTLM. Then, in §2.3.2, we leverage this method to generate more commonsensical outputs. Note that our method trains only the additional NADO head, which is much smaller than the PTLM, and does not require access to the parameters inside the PTLM.

Token-Level Guidance with NADO
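Notation. Suppose we have a sub-optimal PTLM $p(y_{t=T'} \mid x, y_{t<T'})$. Our goal is to obtain an optimal auto-regressive model $q$ from $p$ such that $q$ generates outputs that better satisfy the oracle scorer $O$ (for example, $q$'s generated outputs achieve higher O-scores than those of $p$). We now define a predictive function $R^{O}(x, y_{t<T'})$ that predicts the expected O-score of the complete sequence $y$ given the input $x$ and the currently generated tokens $y_{t<T'}$:

$$R^{O}(x, y_{t<T'}) = \mathbb{E}_{y \sim p(y \mid x,\, y_{t<T'})}\big[O(x, y)\big] \qquad (2)$$
$$\phantom{R^{O}(x, y_{t<T'})} = \sum_{y \in Y} p(y \mid x, y_{t<T'})\, O(x, y). \qquad (3)$$

Solution. The unique closed-form solution of the optimal $q$ (namely, the model that generates most commonsensically according to $O$) is:

$$q^{*}(y_{t=T'} \mid x, y_{t<T'}) = \frac{R^{O}(x, y_{t \le T'})}{R^{O}(x, y_{t<T'})}\, p(y_{t=T'} \mid x, y_{t<T'}). \qquad (4)$$

Please refer to Meng et al. (2022) for details of the proof. From Eq. 4 we see that when both $x$ and $y_{t<T'}$ are fixed, the optimal auto-regressive model is factorized into $R^{O}$ and $p$ at step $T'$:

$$q^{*}(y_{t=T'} \mid x, y_{t<T'}) \propto R^{O}(x, y_{t \le T'}) \cdot p(y_{t=T'} \mid x, y_{t<T'}). \qquad (5)$$

Approximation. As we cannot enumerate $Y$, which contains an infinite number of sequences, the well-defined $R^{O}$ is intractable. A neural model called NADO is hence introduced to approximate $R^{O}$ by training on numerous samples from $Y$ generated by $p$.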
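The factorization into the oracle estimate and the base model can be illustrated at a single decoding step. The sketch below is an assumption-laden toy, not the NADO implementation: `guided_distribution` is a hypothetical helper, and the probabilities and auxiliary-head estimates are made-up numbers over a three-token vocabulary.

```python
def guided_distribution(p_next, r_next):
    """Reweight the base next-token distribution and renormalize.

    p_next: dict token -> p(token | x, prefix) from the frozen base model.
    r_next: dict token -> estimated expected oracle score of the prefix
            extended by that token, as predicted by the auxiliary head.
    """
    unnormalized = {tok: p_next[tok] * r_next[tok] for tok in p_next}
    total = sum(unnormalized.values())
    return {tok: weight / total for tok, weight in unnormalized.items()}

# Hypothetical values for the prefix "The ... used his lasso to catch the".
p_next = {"cow": 0.5, "cowboy": 0.3, "horse": 0.2}
r_next = {"cow": 0.2, "cowboy": 0.9, "horse": 0.5}
q_next = guided_distribution(p_next, r_next)
print(max(q_next, key=q_next.get))
```

The base model only needs to expose its output probabilities here, which is why the approach applies to a frozen, black-box PTLM. The explicit renormalization plays the role of the denominator in the closed-form solution whenever the auxiliary estimates are only approximately consistent.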

NADO-Guided Generation
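Given a pre-trained language model $p$ such as GPT-2 or Alpaca, we first ask $p$ to generate numerous samples to obtain an approximation of $Y$ for various input concepts $x \in X$. We then use the oracle $O$ to assign each sample a score, which is used to train the NADO model.
Training. During training, the NADO model takes $x, y$ as input and learns to predict $R^{O}$ for every prefix of $y$, from $R^{O}(x, y_{t \le 0})$ to $R^{O}(x, y_{t \le T})$. Here, $T$ is the complete sequence length, and the sentence-level value $O(x, y)$ is used as the label for all steps, from $t = 0$ to $t = T$. We emphasize that in order for NADO to learn $R^{O}$ successfully, all $(x, y)$ pairs must be self-sampled by the base model $p$ instead of coming from the CommonGen training data.
We use the cross-entropy loss as the objective function. Writing $R_{\theta}$ for NADO's approximation of $R^{O}$, the cross-entropy loss for a particular input $x$ is

$$L_{CE}(x) = \mathbb{E}_{y \sim p(y \mid x)} \Big[ \sum_{T'=0}^{T} \mathrm{CE}\big(R_{\theta}(x, y_{t \le T'}),\, O(x, y)\big) \Big].$$

In practice, we also add a regularization term to the loss.
In order to satisfy the definition that

$$\sum_{y_{t=T'}} R^{O}(x, y_{t \le T'})\, p(y_{t=T'} \mid x, y_{t<T'}) = R^{O}(x, y_{t<T'}),$$

our regularization loss is measured by the KL divergence between the two sides of this identity as predicted by NADO:

$$L_{reg}(x) = \mathbb{E}_{y \sim p(y \mid x)} \Big[ \sum_{T'=1}^{T} f_{KL}\Big( \sum_{y_{t=T'}} R_{\theta}(x, y_{t \le T'})\, p(y_{t=T'} \mid x, y_{t<T'}),\; R_{\theta}(x, y_{t<T'}) \Big) \Big].$$

Then, the final training loss is

$$L(x) = L_{CE}(x) + \lambda\, L_{reg}(x),$$

where $\lambda$ is a hyper-parameter to balance these two terms. In practice, we use grid search and choose the best $\lambda$ from [0.1, 0.5, 1.0].
Inference. At inference time, there are two forward passes, one through the PTLM and one through NADO, as shown in Eq. 5 and Figure 2. The decoding efficiency remains roughly unchanged because the NADO head is much smaller than the base PTLM.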
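The training objective described above (a cross-entropy term against the sentence-level oracle score at every prefix, plus a weighted consistency regularizer) can be sketched for a single sampled sequence as below. Several details are assumptions made for illustration: the function names are hypothetical, the cross entropy uses a soft label in [0, 1], and the regularizer is taken to be a Bernoulli KL divergence in one particular direction.

```python
import math

EPS = 1e-8  # clamp to avoid log(0)

def soft_cross_entropy(pred, label):
    """Binary cross entropy between a prediction and a soft label in [0, 1]."""
    pred = min(max(pred, EPS), 1.0 - EPS)
    return -(label * math.log(pred) + (1.0 - label) * math.log(1.0 - pred))

def bernoulli_kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b) (assumed form)."""
    a = min(max(a, EPS), 1.0 - EPS)
    b = min(max(b, EPS), 1.0 - EPS)
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))

def nado_loss(prefix_preds, expected_next_preds, oracle_score, lam=0.5):
    """Loss for one self-sampled sequence of length T.

    prefix_preds[t]: auxiliary prediction for the prefix of length t, t = 0..T.
    expected_next_preds[t]: base-model-weighted average of the predictions for
        all one-token extensions of the prefix of length t, t = 0..T-1.
    oracle_score: sentence-level score used as the label at every step.
    """
    ce = sum(soft_cross_entropy(r, oracle_score) for r in prefix_preds)
    reg = sum(bernoulli_kl(e, r)
              for e, r in zip(expected_next_preds, prefix_preds))
    return ce + lam * reg

consistent = nado_loss([0.8, 0.8], [0.8], oracle_score=0.8, lam=1.0)
inconsistent = nado_loss([0.8, 0.8], [0.3], oracle_score=0.8, lam=1.0)
print(consistent < inconsistent)
```

When the prediction for a prefix equals the expected prediction over its one-token extensions, the regularizer vanishes and only the cross-entropy term remains.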

Experimental Results for the Oracle
In this section, we show the results of the commonsense scorer described in §2.2. The experiments and results of commonsense-guided generation (§2.3) can be found in §4 and §5.

Tuple Extraction Data
We use the GPT-3.5-Turbo model provided by OpenAI to extract the tuples of 6,000 sentences (at a total cost of $12.4), based on which we train the T5-large-based tuple extractor. Since our goal is to parse all possible commonsense tuples whether they are sensical or not, we need both sensical and less reasonable sentences. To this end, we randomly select 3,000 sentences from the CommonGen (Lin et al., 2020) train split (we consider them as more sensical) and another 3,000 sampled from a basic gpt-2 model (we consider them as less coherent and sensical).

Tuple Extractor Results
Following the rationale in §3.1, we study the benefit brought by augmenting the training data with tuples extracted from less coherent and sensical sentences. Specifically, we compare the following three settings: 1) base: trained on the 3,000 sensical sentences; 2) aug: trained on 1,500 sensical sentences and 1,500 less sensical sentences; 3) all: trained on all 6,000 sentences. We test the model performance on a held-out set of 350 sentences that is a mix of both types. To obtain the gold labels on the test set, we start with the few-shot GPT-3.5's annotation. After that, two human annotators iteratively checked and fixed any errors they found.
For each relation type, we report the average F1 score in Table 1. Here, if the lemmatized tokens in a generated triplet have over 50% overlap with those in the ground-truth triplet, we consider it correct. Otherwise, we consider it wrong. Comparing T5-Large aug with T5-Large base in Table 1, we see improvements across all four relation types. Besides, increasing the training data size also boosts the extractor's performance. We also notice that our extractors perform worse on UsedFor and CapableOf than on AtLocation and PartOf, which is partially due to errors in the training signal (i.e., labels that are inaccurately annotated by GPT-3.5).
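The matching criterion and the F1 computation can be made concrete with a small sketch. The exact overlap definition is not spelled out in the text, so the version below assumes overlap is the fraction of gold tokens that also occur in the prediction; tuples are given as already-lemmatized head and tail tokens for a single relation type, and all names are hypothetical.

```python
def is_match(pred_tokens, gold_tokens, threshold=0.5):
    """True if more than `threshold` of the gold tokens appear in the prediction."""
    gold = set(gold_tokens)
    overlap = len(gold & set(pred_tokens)) / len(gold)
    return overlap > threshold

def f1_score(predicted, gold):
    """F1 over two lists of tuples, each tuple a list of lemmatized tokens."""
    matched_pred = sum(any(is_match(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(is_match(p, g) for p in predicted) for g in gold)
    if matched_pred == 0 or matched_gold == 0:
        return 0.0
    precision = matched_pred / len(predicted)
    recall = matched_gold / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold and predicted tuples for one sentence.
gold = [["girl", "blow", "candle"], ["candle", "cake"]]
predicted = [["girl", "blow", "out", "candle"], ["horse", "ride", "bike"]]
score = f1_score(predicted, gold)
print(score)
```

In this toy case one of two predictions matches a gold tuple and one of two gold tuples is recovered, so precision, recall, and F1 are all 0.5.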

Oracle Commonsense Scorer Results
To compute the machine-generated compatibility score in Eq. 1, we set the beam size k = 128. Meanwhile, we instruct human annotators to evaluate the target sentences on how commonsensical they are. Each sentence is annotated by 3 workers on a scale of 1 (least) to 4 (best). We also ask every annotator to specify which part of the target sentence is nonsensical. We find that explicitly asking the workers to pay detailed attention and point out the erroneous parts helps to increase the inter-annotator agreement (IAA, measured by Spearman's correlation) from 0.56 to 0.67. The final sentence-level commonsense score annotated by humans is the average of the 3 individual ratings.
Table 2 shows the correlations between human ratings and automatic scores. For our proposed O-Score, we report the correlations of taking the minimum (min) and average (mean) of all tuple-level compatibility scores. Taking the average consistently results in higher correlation, reflecting that one mistake of a nonsensical tuple can be mitigated by other sensical ones. Therefore, we use the mean score to train the auxiliary model. We also compare with reference-based metrics such as METEOR (Banerjee and Lavie, 2005) and BERTScore (Zhang et al., 2019). Since there are, on average, 4 references per candidate in the CommonGen dataset, we select the first reference to compute BERTScore-one, and all available references to compute BERTScore-all. We show that our reference-free scorer performs on par with the best reference-based metric, BERTScore-all, and outperforms it when using gold tuples extracted by humans.
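For readers who want to reproduce this kind of analysis, the sketch below computes Spearman's rank correlation between averaged human ratings and automatic scores in plain Python. The ratings and scores are invented for illustration and are not taken from Table 2; the helper names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Three worker ratings (1 to 4) per sentence, averaged as in the text.
worker_ratings = [[4, 4, 3], [2, 1, 2], [3, 3, 4], [1, 2, 1]]
human = [sum(r) / len(r) for r in worker_ratings]
automatic = [0.81, 0.42, 0.77, 0.45]  # hypothetical O-scores
rho = spearman(human, automatic)
print(rho)
```

In practice a library routine such as SciPy's `spearmanr` would normally be used; the hand-rolled version is shown only to keep the example self-contained.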

Data
Training Data. As illustrated in Figure 2, we train our auxiliary model on the PTLM's self-sampled data. For each set of input concepts, we use top-p sampling (p = 0.95) with temperature T = 0.7 to generate N samples. In theory, the larger N is, the more accurately $R^{O}$ can be approximated. In practice, due to limitations in computational resources, we set N to 48 when the base model p is gpt-2, and to 10 for Alpaca. In total, we have 1.5M training instances self-sampled by gpt-2 and 0.3M training instances self-sampled by Alpaca.
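The sampling configuration above (nucleus sampling with p = 0.95 and temperature 0.7) can be illustrated on a toy next-token distribution. This is a generic sketch of top-p sampling, not the authors' data-generation code; the logits and helper names are hypothetical.

```python
import math
import random

def top_p_filter(logits, p=0.95, temperature=0.7):
    """Return the renormalized nucleus distribution as a dict token -> prob."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exp = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exp.values())
    ranked = sorted(((e / total, tok) for tok, e in exp.items()), reverse=True)
    kept, mass = [], 0.0
    for prob, tok in ranked:
        kept.append((tok, prob))
        mass += prob
        if mass >= p:  # smallest prefix whose cumulative mass reaches p
            break
    return {tok: prob / mass for tok, prob in kept}

def sample(dist, rng):
    """Draw one token from a dict token -> probability."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

# Hypothetical logits for one decoding step.
logits = {"cowboy": 3.0, "horse": 2.0, "cow": 1.0, "bike": -4.0}
nucleus = top_p_filter(logits)
samples = [sample(nucleus, random.Random(seed)) for seed in range(48)]
print(sorted(set(samples)))
```

Lowering the temperature sharpens the distribution before the nucleus is cut, so unlikely tokens such as "bike" are removed and never appear in the self-sampled training data.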
Test Data. We test on two different datasets. The first is the CommonGen dev split (Lin et al., 2020), which contains 993 lists of keywords focusing on daily concepts (e.g., <open, hand, oyster, glove>). Each list of keywords is paired with more than one human-written reference. Our second test set is distilled from CSK-PN (Chen et al., 2023), which sources challenging triples from ConceptNet (Speer et al., 2017) and tags them with positive/negative relation labels. We randomly select 993 triples with negative relations from CSK-PN (e.g., <wear sunglasses, at night>). There is no human reference for the second set. To reduce the effect of data leakage in GPT-3 and Alpaca, we randomly shuffled the keywords within each entry.

Experimental Setup
Choice of Base Models. Although our framework does not require fine-tuning PTLMs, it does require access to the PTLM's output distribution. Hence, we cannot apply our method to some popular but closed-source LLMs such as ChatGPT. We choose Alpaca, Flan-T5, and gpt2 instead. In addition, because the pre-trained gpt2 has no instruction-following abilities, we have to train it to learn the task of 'generating a commonsensical sentence given these input concepts'. Specifically, we finetune it on the CommonGen training data for 1 epoch, well before the finetuning converges. We call this process warm up, as the goal is mainly to get the smaller base model on board with our task format. For instruction-following models such as Alpaca, we still add this warm-up process for a fair comparison. In total, we apply our commonsense-guided generation method to 5 different base models: gpt-2-large with warm up, zero-shot Alpaca-7b, few-shot Alpaca-7b, Alpaca-7b with warm up, and zero-shot Flan-T5-large.
Auxiliary Models. The auxiliary $R^{O}$ models are 4-layer transformer decoders with the same dimension and number of heads as the base models. They are 1/9, 1/8, and 1/12 the size of gpt-2-large, Alpaca-7b, and Flan-T5-large, respectively. We train the auxiliary models for 10 epochs with a learning rate of 1e-5 on a single NVIDIA A100 80GB GPU. In comparison, it is not possible to finetune Alpaca-7b using only one 80GB GPU without a memory-saving technique such as LoRA (Hu et al., 2021).

Compared Systems
A*esque Decoding (Lu et al., 2022). A NeuroLogic decoding algorithm that injects constraints into the decoding process with a look-ahead heuristic, which results in more plausible outputs.
Gelato (Zhang et al., 2023). A tractable probabilistic model (TPM) to impose constraints in language models such as gpt2-large. It achieves state-of-the-art (SOTA) performance on constraint satisfaction. Because it is non-trivial to train new TPMs on Alpaca-based models, we use the authors' original TPM, which is trained on the gpt2-large model that is finetuned on CommonGen.
Lex (Meng et al., 2022). The vanilla NADO method trained only with lexical constraints as the sequence-level Boolean oracle. Namely, the scorer returns 1 if all lexical constraints are satisfied, and 0 otherwise.
BOOST (Ours). Our method, which uses the commonsense oracle to train the auxiliary NADO model. We compare two variations: 1) BOOST CS, which uses only the commonsense oracle introduced in §2.2, and 2) BOOST Joint, which multiplies the lexical checking Boolean function (the same as used in Lex) with the commonsense oracle score.
GPT3/ChatGPT. We instruct OpenAI's gpt-3.5-turbo and text-davinci-003 to generate a plausible sentence given the constraints, stating that the keywords do not necessarily have to remain in the same order. Note that these models have likely been trained on our test data already.
For all compared systems, we decode with top-k (k = 30) sampling with a temperature T = 0.7.
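The difference between the Lex oracle and the BOOST Joint oracle described above can be sketched in a few lines. The function names and the commonsense score of 0.7 are hypothetical, and the sentence is assumed to be lemmatized already.

```python
def lexical_oracle(sentence_lemmas, keywords):
    """Boolean oracle: 1 if every keyword appears among the lemmas, else 0."""
    return int(all(k in sentence_lemmas for k in keywords))

def joint_oracle(sentence_lemmas, keywords, commonsense_score):
    """Joint signal: lexical check multiplied by the commonsense score."""
    return lexical_oracle(sentence_lemmas, keywords) * commonsense_score

keywords = ["ant", "eat", "telephone"]
# Lemmas of "An ant eating a dead fly on the telephone."
full = {"an", "ant", "eat", "a", "dead", "fly", "on", "the", "telephone"}
# Lemmas of a sentence that drops one keyword.
partial = {"an", "ant", "eat", "a", "dead", "fly"}

print(joint_oracle(full, keywords, 0.7), joint_oracle(partial, keywords, 0.7))
```

Under the joint signal, a sentence that omits a keyword receives a score of zero no matter how plausible it is, whereas a commonsense-only oracle would still reward it.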

Evaluation Setup
Evaluation Metrics. We use the keyword coverage ratio (after lemmatization) and the O-score as automatic metrics to assess the quality of generated texts. For the CommonGen benchmark, which contains human-written sentences, we also report the n-gram overlap (BLEU-4). Considering that our systems are trained towards a higher O-score, we also conduct human annotation for unbiased evaluation. Specifically, we instruct the MTurkers to evaluate 1) how commonsensical each sentence is on a 1-4 Likert scale, and 2) how much they like the sentence overall (e.g., being interesting and informative). An example questionnaire with the full instructions can be found in Appendix C. We pay the MTurkers $18 per hour, and the annotation process is the same as described in §3.3.
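As a concrete reading of the keyword coverage ratio, the sketch below counts the fraction of input keywords that appear among the lemmas of a generated sentence. It assumes lemmatization has been done upstream; `coverage_ratio` and the example lemmas are illustrative only.

```python
def coverage_ratio(sentence_lemmas, keywords):
    """Fraction of input keywords found among the sentence lemmas."""
    covered = sum(1 for k in keywords if k in sentence_lemmas)
    return covered / len(keywords)

keywords = ["table", "dog", "game", "walk", "fireplace"]
# Lemmas of a hypothetical output "The dog walked around the table."
lemmas = {"the", "dog", "walk", "around", "table"}
ratio = coverage_ratio(lemmas, keywords)
print(ratio)
```

A corpus-level figure would then be the average of this ratio over all test inputs.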
Inter-Group and Intra-Group Comparison. Our human evaluation is relative, meaning that the human evaluators are asked to compare the quality of different machine-generated outputs given the same input constraint. Since we have five base models and each entails a group of systems to compare with, we first conduct human evaluation within each group. Then, we select representative systems for inter-group comparison.

Intra-Group Results
We compile the results on the CommonGen and CSK-PN benchmarks in Table 3. We find that BLEU-4 has a high correlation with the keyword coverage ratio (r = 0.914, measured by Pearson correlation), but has close to zero correlation with human judgment on commonsense (r = −0.08) and overall preference (r = 0.04). We therefore hypothesize that BLEU-4, coverage ratio, and other metrics measuring superficial lexical overlap with the ground truth cannot identify meaningful and commonsensical outputs, at least in our setting. Moreover, in all eight groups of human evaluation, BOOST successfully improves the commonsense level and overall preference. Comparing Flan-T5 with gpt2, we see that our approach is more effective on instruction-tuned models than on similarly sized decoder-only models. In addition, although BOOST Joint achieves slightly lower commonsense ratings than BOOST CS, the latter is a lot worse in keyword coverage, indicating that BOOST CS has a higher risk of generating reasonable sentences without satisfying the input constraints. Hence, in the constrained generation setting, we still consider BOOST Joint the best model.

Inter-Group Results
The inter-group evaluation results are shown in Table 4. Our model BOOST outperforms all baselines, including Davinci-003. We leave the comparison with ChatGPT to §6 as a separate discussion.
Surprisingly, although human-written references are still the most commonsensical, they are less preferred by our annotators compared with Alpaca/BOOST generations. Upon further inspection, we find that the gold references in CommonGen are relatively short and flat (e.g., "The car drove through the snow."), which may also explain why Alpaca warmed up on CommonGen is less preferred than the few-shot setting, where high-quality in-context examples are carefully selected.

Constraint: table, dog, game, walk, fireplace (from CommonGen)
Gelato: A dog is playing a game on a table next to a fireplace.
A* Decoding: A group of people are walking and playing video games at their dining room with fireplaces, tables, and dogs.
Davinci-003: The dog walked around the table playing a game by the fireplace.
BOOST Joint: The dog walked around the table while we played a game by the fireplace.
Reference: The dog plays the game of walking from the table to the fireplace.

Constraint: statue, liberty, alive (from CSK-PN)
A* Decoding: There are still some people who want to see statues of liberty as living creatures.
Alpaca: The Statue of Liberty became alive on a bright and sunny day.
Lex: The statue of Liberty is alive and stands proudly in New York City.
Davinci-003: The Statue of Liberty stands alive and proud.
BOOST Joint: The Statue of Liberty is a symbol of freedom and justice that is alive and well in the hearts of all Americans.

Constraint: ant, eat, telephone (from CSK-PN)
Lex: The ant was eating the phone as if it were a delicious snack.
Davinci-003: The ant was seen eating a telephone.
BOOST Joint: An ant eating a dead fly on the telephone.
BOOST CS: A black ant eating on the side of a brown telephone.

Table 5: Example generations by different systems. Full outputs of all compared models can be found in Table 7 in the Appendix.

Case Study
We show three example generations by our systems and the baselines in Table 5 to further understand the advantage of BOOST. In the first example, the baselines connect different constraints logically, but in a less plausible way (e.g., all concepts are bound to the same object). Our system, on the other hand, describes a scene where people play games while dogs walk around. In the second and third examples, we all know that the Statue of Liberty is not alive and that a telephone is inedible. Instead of directly adding negations, we observe that BOOST tends to provide more context to make its output reasonable. In contrast, the other baselines wrongly accept that the Statue of Liberty can be alive or that an ant can eat a telephone.

Has ChatGPT solved this task?
Our model is rated below ChatGPT in terms of commonsense, but it excels in overall preference. On the CSK-PN eval set, where the gap between our model and ChatGPT is larger, we randomly select 100 pairs of outputs and conduct a pairwise comparison on both commonsense and overall preference. Results can be found in Table 6. Specifically, each pair is first randomly shuffled and then annotated by at least two annotators. If the two annotators disagree, a third annotator is introduced to make the final judgment. Annotators can also provide an optional justification for their choice, which can earn them a small bonus. Analysis of the human feedback reveals that ChatGPT tends to generate sentences with highly common scenarios (e.g., "It is not advisable to wear sunglasses at night as it can impede your vision and increase the risk of accidents."), making the raters less interested. On the other hand, our model tends to provide more creative context (e.g., "Someone wears sunglasses at night to avoid the bright lights of the approaching car."), earning human annotators' overall preference without sacrificing commonsense too much. As one annotator commented, "I am fed up with those sentence with the so-called better commonsense because they are unimpressive". This tendency of ChatGPT results in a higher commonsense rating yet noticeably lower overall preference. In short, we highlight that ChatGPT has not entirely solved the task.
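The pairwise annotation protocol described above (two annotators per pair, with a third deciding when they disagree) can be sketched as follows. The labels and judgments are invented for illustration and do not reproduce Table 6; the function names are hypothetical.

```python
from collections import Counter

def adjudicate(first, second, third=None):
    """Return the agreed label, or the third annotator's label on disagreement."""
    if first == second:
        return first
    if third is None:
        raise ValueError("a third annotation is needed to break the tie")
    return third

def win_rates(judgments):
    """Share of pairs won by each system after adjudication."""
    counts = Counter(adjudicate(*j) for j in judgments)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Each entry holds the preferred system according to 2 or 3 annotators.
judgments = [
    ("ours", "ours"),
    ("ours", "chatgpt", "chatgpt"),
    ("chatgpt", "chatgpt"),
    ("ours", "chatgpt", "ours"),
]
rates = win_rates(judgments)
print(rates)
```

The same routine would be run once for the commonsense question and once for overall preference, since the two can favor different systems.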
The (so far) impossible fair comparison. Last, we would like to list two reasons why evaluating ChatGPT against our model may not be a fair comparison: (1) Test Data Contamination: ChatGPT, which is trained on data up to 2021, has likely been trained on both datasets we tested on, including the test set. (2) Size and Trick Differences: Different from BOOST, ChatGPT is more than a plain language model and benefits largely from RLHF and many engineering tricks unknown to the public. It is also much larger than our largest PTLM, which is Alpaca-7b. Nonetheless, our approach is technically complementary to ChatGPT's language model, too. Unfortunately, due to API limitations, direct verification remains infeasible, as we do not have access to its output logits.

Related Works
Controllable Generation with Frozen PTLMs. There are two major lines of work: modifying the decoding algorithm and guiding PTLMs with auxiliary models. Recently, Lu et al. (2021, 2022) propose NeuroLogic decoding with a look-ahead heuristic, and Qin et al. (2022) propose energy-based constrained decoding. One drawback of this line is that inference is slow due to the large search space. In the other line, Dathathri et al.; Krause et al. (2021); Yang and Klein (2021) guide the generation process with an auxiliary model in a plug-and-play fashion by leveraging statistical principles such as the Bayes rule. Meng et al. (2022) propose to resolve the distributional discrepancy between the training data and the PTLM's generated tokens by training with data directly sampled from the base model. However, mistakes in commonsense are neglected when previous works formulate the whole task as a lexically constrained generation game.
Commonsense Metrics. Zhou et al. (2022) measure the commonsense of dialogue turns by hard and soft matching the relations across each turn to ConceptNet. ACCENT (Ghazarian et al., 2023) is an unsupervised metric that measures the event commonsense of dialogue responses via the ATOMIC knowledge graph (Hwang et al., 2020). Our commonsense oracle is inspired by ACCENT, but we primarily focus on factoid commonsense in a constrained generation setting. A concurrent work is Vera (Liu et al., 2023), a supervised model that learns to estimate the plausibility of statements. In contrast, our metric is unsupervised and neuro-symbolic, and thus more interpretable.

Conclusion
We present BOOST, a framework to boost the commonsense in PTLMs' generation by training an auxiliary model with a commonsense scorer as the oracle. Our O-Scorer is task-agnostic and reference-free, meaning that it is generalizable to many downstream tasks such as dialogue and open-ended text generation. For such applications, one may need to replace the vanilla PTLMs with task-specific models and then train the NADO head. The O-Scorer can also be combined with task-specific guidance.

Limitations
We discuss the limitations of our work. First, our tuple extractor covers only four relation types and can miss many other important relation types such as causal and temporal-order relations. These latter relation types are more sophisticated, such that even LLMs as strong as gpt-3.5-turbo fail at them (Gao et al., 2023; Yuan et al., 2023; Chan et al., 2023; Bang et al., 2023). Second, we find that the cosine similarities of sentence embeddings used in Eq. 1 to compute the compatibility scores sometimes do not align with human judgment. The errors incurred during the generative scoring process are then propagated into the training process of NADO, which negatively affects the quality of the outputs. Last, although the auxiliary models are much smaller than the PTLMs, the number of samples needed to train $R^{O}$ is still large in order to guarantee a good approximation of the closed-form solution derived in Eq. 4.

Ethics Statement
It is known that the generated results by PTLMs could capture the bias reflected in the training data (Sheng et al., 2019;Wallace et al., 2019).Our model BOOST is build upon PTLMs including T5 (Raffel et al., 2020), GPT-2 (Radford et al., 2019), and Alpaca (Taori et al., 2023), which may potentially generate offensive content for certain groups or individuals.We suggest to carefully examine the potential biases before deploying the models to real-world applications.The statue of Liberty is alive and stands proudly in New York City.

Alpaca
The Statue of Liberty became alive on a bright and sunny day.

BOOST CS
The statue of liberty stands alone as a symbol of liberty and awakening alive.

BOOST Joint
The Statue of Liberty is a symbol of freedom and justice that is alive and well in the hearts of all Americans.

ChatGPT
The Statue of Liberty looked alive in the glowing sunset.

Constraint: ant, eat, telephone (from CSK-PN)

A* Decoding
A man is feeding ants to an antennae on top of his head, so they can be eating from the telephone.

Davinci-003
The ant was seen eating a telephone.

Lex
The ant was eating the phone as if it were a delicious snack.

Alpaca
The ant ate the telephone.

BOOST CS
A black ant eating on the side of a brown telephone.

BOOST Joint
An ant eating a dead fly on the telephone.

ChatGPT
The ant tried to eat the speaker of a miniature telephone.

The small plates add dimension and depth to this dish of baked zucchinis and carrots.
(b) Concepts: wear, sunglasses, at night
GPT-2: A young woman wearing a long dress and sunglasses at night.
Alpaca: We wore our sunglasses at night and enjoyed the stars.
Ours: Someone wears sunglasses at night to avoid the bright lights of the approaching car.

Concepts: food, customer, watch, employee, prepare
GPT-2: Two employees watch as customers prepare food in the store.

Figure 2: The process of BOOST to steer a frozen PTLM with an additional neural model and oracle commonsense scorer. The solid lines indicate the training process, while the dashed lines indicate inference. In practice, we combine our commonsense scorer with lexical checking rules, and use the joint signal to train the auxiliary model.

Figure 3: An example of our oracle commonsense scorer. We first extract tuples from a target sentence, and assign each extracted tuple a commonsensical score using COMET (Bosselut et al., 2019), a dynamic commonsense knowledge base. The sentence-level score is then obtained by aggregating tuple-level scores.
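
The extract-score-aggregate pipeline in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `sentence_score`, `toy_scorer`, and the values in `TOY_SCORES` are hypothetical, the lookup table merely stands in for a COMET-based tuple scorer, and the arithmetic mean is an assumed aggregation function.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Tuple

# A commonsense tuple: (head, relation, tail). The four relation names
# come from the paper; everything else here is illustrative.
Triple = Tuple[str, str, str]
RELATIONS = ("IsUsedFor", "AtLocation", "CapableOf", "PartOf")

# Hypothetical tuple-level scores standing in for a COMET-based scorer.
TOY_SCORES: Dict[Triple, float] = {
    ("woman", "CapableOf", "open an oyster"): 0.9,
    ("oyster", "AtLocation", "glove"): 0.1,
}


def toy_scorer(triple: Triple) -> float:
    """Return a made-up plausibility score; unknown tuples get 0.5."""
    return TOY_SCORES.get(triple, 0.5)


def sentence_score(
    triples: List[Triple],
    score_tuple: Callable[[Triple], float],
) -> Optional[float]:
    """Aggregate tuple-level scores into one sentence-level score.

    Assumption: aggregation is the arithmetic mean. Returns None when
    no tuples were extracted, since there is nothing to aggregate.
    """
    if not triples:
        return None
    return mean(score_tuple(t) for t in triples)


triples = [
    ("woman", "CapableOf", "open an oyster"),
    ("oyster", "AtLocation", "glove"),
]
score = sentence_score(triples, toy_scorer)
print(score)
```

A single implausible tuple pulls the sentence-level score down under mean aggregation; a stricter aggregator such as the minimum would penalize it more heavily, which is one reason the choice of aggregator is flagged as an assumption here.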

Table 1: The performance of different tuple extractors, measured by F1-score. The last row indicates the upper bound that our T5 models can achieve.

Table 2: Spearman correlation between human commonsense ratings and six automatic metrics: our O-Score with tuples extracted by T5, GPT-3.5-Turbo, and the gold tuples, plus METEOR and BERTScore.
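
For readers unfamiliar with the statistic, the sketch below shows one way Spearman correlation between human ratings and a metric's scores could be computed: rank both lists (averaging ranks for ties) and take the Pearson correlation of the ranks. The function names and the `human` and `metric` values are hypothetical and are not taken from the paper's data.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human ratings and automatic metric scores for five sentences.
human = [1, 2, 3, 4, 5]
metric = [0.10, 0.35, 0.30, 0.70, 0.90]
rho = spearman(human, metric)
print(round(rho, 3))
```

Because only ranks matter, the metric does not need to be on the same scale as the human ratings; it only needs to order sentences similarly.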

Table 3: Intra-Group evaluation results on two benchmarks: CommonGen (with reference) and CSK-PN (without reference). Here, we define a group as multiple systems under the same setting (i.e., base model) and on the same dataset. We use boldface to denote the best scores within each group, and underlines to denote the second best. Our model BOOST consistently achieves the most commonsensical ratings as annotated by humans. The gap between BOOST and the corresponding Base Model is statistically significant (p < 0.05), as measured by Student's t-test. Note that the human ratings across groups are not directly comparable as they are conducted in separate batches.

Table 4: Inter-Group human ratings. The scores of all models are comparable within the same test benchmark. We color human performance in a grey background, and use boldface/underlines to denote the best/second-best scores among all machines.

Table 6: Which system has better commonsense (CS) and overall human preference? Pair-wise comparison between BOOST and ChatGPT shows that our model earns more overall picks, while ChatGPT is rated higher on commonsense.

Decoding
There are still some people who want to see statues of liberty as living creatures.

Davinci-003
The Statue of Liberty stands alive and proud.

Lex

Table 7: Full results of case study by different systems.

<-- Instruction: -->
Extract tuples (A, B) from the sentence for the relations based on the description below. Do not infer anything. Only extract tuples that explicitly mentioned in the sentence. Put None if there are no tuples to extract.
IsUsedFor: A (an object) is used to do B (a goal).
AtLocation: A is at the location or larger area B.
CapableOf: A (a living) is capable of doing B (an event).
PartOf: A is part of B.
<-- Examples: -->
The runner ran because he wanted to win the car race.