Template Filling for Controllable Commonsense Reasoning

Large-scale sequence-to-sequence models have been shown to be adept at both multiple-choice and open-domain commonsense reasoning tasks. However, current systems do not provide the ability to control the various attributes of the reasoning chain. To enable better controllability, we propose to study commonsense reasoning as a template-filling task (TemplateCSR), where a language model fills reasoning templates with the given constraints as control factors. As an approach to TemplateCSR, we (i) propose a dataset of commonsense reasoning template-expansion pairs and (ii) introduce ITO, a pretrained sequence-to-sequence model using prompts to perform commonsense reasoning across concepts. Our experiments show that our approach outperforms baselines on both generation metrics and factuality metrics. We also present a detailed error analysis of our approach's ability to reliably perform commonsense reasoning.


Introduction
Commonsense reasoning has been studied in both multiple-choice (Tandon et al., 2018; Talmor et al., 2019; Lv et al., 2020) and open-ended knowledge base settings (Lin et al., 2021a). While multiple-choice approaches require a list of answer options, open-ended KB approaches assume that the answer exists in an available knowledge base (KB). Such constraints often limit these systems' applicability in practical settings where control is required (e.g., a web search query with specific conditions).
To complement the existing commonsense reasoning efforts, our work aims to enhance the commonsense reasoning capabilities of natural language processing (NLP) systems by studying template commonsense reasoning (TemplateCSR)¹, where reasoning is achieved by filling templates with restricted template slots rather than selecting answers from a list of candidates or a KB. The TemplateCSR task is challenging because no annotations are available and each template potentially has multiple correct expansions. Moreover, designing templates with slots that satisfy arbitrary constraints, and expanding them, remains an open challenge. For example, given the reasoning template People who smoke are at a risk of {disease}, a system must first constrain the slot to diseases only, and then use the additional constraint of smoking to arrive at the right answer for the slot. In contrast to Language Model (LM) probing approaches (Ribeiro et al., 2020), which test the capabilities of already-trained LMs, we propose a task and train a model specifically for TemplateCSR.

¹ All code and data will be released publicly.

Figure 1: In this example, we show how a commonsense reasoning question can be formulated as two different template-expansion pairs, each focusing on different aspects of reasoning between the concepts smoking and lung cancer. While formulation (1) focuses on the explanation, (2) aims to understand the qualitative relationship between them.
  Commonsense QA Sample. Input: Are people who smoke at higher risk for lung cancer? Output: Yes.
  TemplateCSR (1) Template-Expansion Pair. Input: People who smoke {are/aren't} at risk for {disease} because {reason}? Output: People who smoke are at risk for lung cancer because cigarette smoke contains carcinogens that affect DNA.
  TemplateCSR (2) Template-Expansion Pair. Input: People who smoke are at {higher/lower} risk of {disease}. Output: People who smoke are at higher risk of lung cancer.
Figure 1 shows one such example from the healthcare and well-being domain, where an existing commonsense reasoning query is formulated as different template-expansion pairs with control over different aspects of the reasoning. In the first expansion, the reasoning chain focuses on the relationship between smoking and cancer with the corresponding explanation (reason), while the second chain focuses solely on the qualitative relationship between smoking and cancer.
To address the above-mentioned challenges, our contributions in this paper for TemplateCSR are two-fold. First, we present a dataset of commonsense reasoning templates for the healthcare and well-being domain and their corresponding expansions that are valid completions of the template, which we define as template-expansion pairs (Fass and Wilks, 1983). The slots in the templates are open-ended, are not restricted to any particular categories, and enable controlling the reasoning chain. Given the recent focus on explainable models for reasoning (Wiegreffe and Marasović, 2021), we also augment templates with an optional free-form explanation slot that explains the reasoning connection between various commonsense concepts. Our TemplateCSR dataset comprises about 3600 unique template-expansion pairs collected from diverse sources, and we hope it enables SEQ-TO-SEQ systems to effectively learn to fill commonsense reasoning templates.
Next, we present ITO, a model that formulates the TemplateCSR challenge as a SEQ-TO-SEQ task: given a template with slots for specific concepts, the goal of the model is to produce meaningful completed sentences for the template. The concept in each slot of the template is provided via an instruction (Wei et al., 2022a), which indicates an abstraction of the nature of the slot. The multiple-choice qualifier slot is used to model the relationship between the concepts, and the explanation slot generates a free-form text explanation for the reasoning chain. Specifying each slot in free-form text enables control, allowing commonsense reasoning questions to specify the concepts, the qualitative relationship, and the nature of the explanation.
In our experiments on the TemplateCSR task, ITO outperforms baselines both in terms of generation metrics such as ROUGE and BERTSCORE, and in terms of factual correctness (factuality) metrics such as FACTCC. We also evaluate the factuality of the dataset using expert human judges and present a detailed analysis of model failures. While we still observe factual errors, our approach provides a deeper understanding of the mistakes, enabling alternate ways to build commonsense reasoning systems via templates using SEQ-TO-SEQ models.

Dataset
For our use-case, we create dataset samples of commonsense reasoning templates related to healthcare and well-being. Incorporating NLP systems to aid healthy lifestyles has been an active area of research in the past decade (Liberato et al., 2014; Fadhil and Gabrielli, 2017; Doustmohammadian and Bazhan, 2021; Ahne et al., 2022). Inspired by this line of research, we collect templates that describe a relation between a lifestyle-related commonsense concept and a corresponding health-related concept. In comparison to existing datasets like CommonsenseQA (Talmor et al., 2019), which relies on a fixed set of relationships from a knowledge base (Speer and Havasi, 2013), we do not restrict the relationship types or the number of concepts or hops, making our templates close to open-vocabulary text. We believe that our dataset complements the existing commonsense reasoning datasets in the community, contributing to the diversity of the data.
Based on the efficacy assessments of NLP systems in health- and lifestyle-related settings (Laranjo et al., 2018; Abd-alrazaq et al., 2020; Hoermann et al., 2017), we designed our basic template structure. The basic units for the TemplateCSR task are as follows: 1. concept slot: contains an abstract category of a concept. The concept's abstraction is provided in natural language with an open vocabulary, without fixed class constraints.
In the example shown in Figure 2, people with habit and disease are concept slots.
2. multiple-choice qualifier slot: a word or phrase that describes the nature of the relationship between the concepts. This slot
is typically framed as a multiple-choice slot, where the goal is to pick an option from the choices rather than replacing the text in the template slot. Figure 2 shows an example where the slot higher/lower is one such multiple-choice qualifier slot.
3. explanation slot: this optional field consists of a free-form explanation that explains the reasoning between concepts, typically marked as the reason slot.
Towards this, we collect a set of templates (x) and their corresponding expansions (y) based on this overall schema for commonsense reasoning. In the example shown in Figure 2, the template comprises two concept slots (people with habit and disease). The qualifier slot (higher/lower) specifies how one concept is connected to another in terms of their qualitative relationship. The template also includes an optional explanation slot that specifies in free-form text how smoking is connected to cancer. A valid output for the above-mentioned template is, for instance, people who smoke are at a higher risk for lung cancer because carcinogens in smoke causes DNA damage, where people with habit is replaced by people who smoke, the multiple-choice qualifier slot higher/lower is replaced by higher, the disease slot is replaced by lung cancer, and finally the reason slot is replaced by the explanation of the qualitative relationship, carcinogens in smoke causes DNA damage. This example shows how both template-expansion pairs aim to uncover the relationship between smoking and lung cancer, while also providing the flexibility to additionally constrain the reasoning chain in any way.
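To make the slot taxonomy above concrete, the template structure can be sketched as a small data model; the class and function names here are our own illustration, not part of the released dataset or code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Slot:
    """One template slot; `choices` is set only for multiple-choice qualifier slots."""
    name: str
    choices: Optional[list] = None

def expand(template: str, slots: list, fillers: dict) -> str:
    """Replace each {slot} placeholder with its filler, enforcing that a
    qualifier slot's filler is one of its allowed choices."""
    out = template
    for slot in slots:
        value = fillers[slot.name]
        if slot.choices is not None and value not in slot.choices:
            raise ValueError(f"{value!r} is not a valid choice for slot {slot.name}")
        out = out.replace("{" + slot.name + "}", value, 1)
    return out
```

Under this sketch, the smoking example from Figure 2 expands by supplying fillers for the two concept slots, the qualifier, and the reason; an invalid qualifier value (anything outside higher/lower) is rejected.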
Task Setup: To collect our dataset using crowdsourcing, we use the Amazon Mechanical Turk platform². Each datapoint took ∼120 seconds to annotate, and we paid an average of $15 per hour. Additionally, we used a filtering step to select master annotators with an approval rate of more than 90%. All the turkers were given specific instructions to input only factual information and not opinionated statements. Specifically, the turkers were instructed to use the following sources: CDC³, WebMD⁴, Healthline⁵, and Mayo Clinic⁶. The annotators were also instructed to give a template, and at least two corresponding sentences that match the template. The statistics of the data are as follows: the average sentence length was about 14.57 words, with a mean of 2.4 slots per template. Some qualitative examples from the dataset are given in Table 1. Overall, our dataset contains about 7000 template-sentence pairs with about 3600 unique templates. Once the templates are collected, the authors post-process the data to verify each template-expansion pair for correctness and validate that there is no identifying information such as proper names. We then create a standard 70/10/20 train/val/test split.
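The 70/10/20 split described above can be sketched as follows; the function name and the use of a fixed shuffle seed are our assumptions, not details stated in the paper:

```python
import random

def split_dataset(pairs, seed=0, ratios=(0.7, 0.1, 0.2)):
    """Shuffle template-expansion pairs, then slice 70/10/20 into
    train/val/test partitions."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # deterministic shuffle for reproducibility
    n = len(pairs)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]
```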

Model
Early NLP systems often relied on rule-based templatic systems (Riloff, 1996; Brin, 1998; Agichtein and Gravano, 1999; Craven et al., 2000) due to their simplistic nature. Compared to machine learning methods, they were often rigid (Yih, 1997). Despite their rigidity, template-based systems are often easy to comprehend, and lend themselves to easily incorporating domain knowledge (Chiticariu et al., 2013). Our goal is to combine the strengths of both template-based systems and recent advances in pretrained SEQ-TO-SEQ models for the task of commonsense reasoning via template expansion.
In this work, we present ITO (Instruction Fine-tuning for Template-Based Commonsense Reasoning), an approach that models the TemplateCSR task as an instruction fine-tuning task, inspired by recent advances in instruction-based fine-tuning (Wei et al., 2022a). Table 2 shows an example of our task setup for the ITO approach. In comparison to approaches such as Donahue et al. (2020), our approach does not strictly enforce that sentences only fill missing spans of text. Rather, the expanded sentences are allowed to have additional modifications (token addition, deletion, and rewriting). For instance, for the input template {person_at_location} has a {higher/lower} risk of {disease} because {reason_for_risk}, a valid expansion is person who lives in the city has a higher risk of depression due to noise.
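The instruction-style input format shown in Table 2 (one natural-language instruction per slot, with each slot replaced by a [MASK] token in the template) could be constructed roughly as follows; the exact wording and ordinal phrasing are our assumptions based on the Table 2 example:

```python
ORDINALS = ["first", "second", "third", "fourth", "fifth"]

def to_instruction_input(template: str, slot_names: list) -> str:
    """Build an ITO-style input: one instruction sentence per slot describing
    its abstraction, followed by the template with every slot masked."""
    instructions = [
        f"The {ORDINALS[i]} blank is {name}."
        for i, name in enumerate(slot_names)
    ]
    masked = template
    for name in slot_names:
        # replace each {slot} placeholder with the [MASK] token
        masked = masked.replace("{" + name + "}", "[MASK]", 1)
    return " ".join(instructions) + " " + masked
```

The model is then fine-tuned to map this instruction-prefixed input to the full expanded sentence.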

Training
Given a template x ∈ X and its corresponding expansion y ∈ Y, we can train any sequence-to-sequence model that models p_θ(y|x). Towards this, we use a pretrained sequence-to-sequence model to estimate the filled template y for an input x. We model the conditional distribution p_θ(y | x), parameterized by θ, as

p_θ(y | x) = ∏_{i=1}^{M} p_θ(y_i | y_{<i}, x),

where M is the length of y.

Input (Template): The first blank is person_at_location. The second blank is higher/lower. The third blank is disease. The fourth blank is a reason_for_risk. [MASK] has a [MASK] risk of
Output (Expansion): Person who lives in a city has a higher risk of depression because of higher stress due to noise

Inference to Decode Template Expansions
The auto-regressive factorization of SEQ-TO-SEQ p θ allows us to effectively cast the constrained decoding of filling the template as generating the sequence given the input x.For each expansion, we sample y 1 j ∼ p θ (y | x j ).Consequently, we sample y 2 j ∼ p θ (y | x j , y 1 j ), and the token generation process is repeated until we reach the end-symbol.For each symbol, the model has to decide between generating a token to replace the template slot or generate part of the template, while also ensuring the overall generated output sequence is consistent with the constraints given in the template.
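This token-by-token generation process, in its greedy form, can be sketched with a stubbed scoring function; `next_token_scores` is a hypothetical stand-in for the model's next-token distribution, and the end-symbol string is our assumption:

```python
def greedy_decode(next_token_scores, x, eos="</s>", max_len=50):
    """Generate an expansion one token at a time until the end symbol.
    `next_token_scores(x, prefix)` returns a {token: score} dict for the
    next step; greedy decoding takes the argmax at every step."""
    y = []
    for _ in range(max_len):
        scores = next_token_scores(x, y)
        tok = max(scores, key=scores.get)  # greedy: pick the highest-scoring token
        if tok == eos:
            break
        y.append(tok)
    return y
```

In the full model, each step must decide between emitting a slot-filler token and reproducing part of the template, subject to the template's constraints.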

Experiments
In this section, we describe the experimental setup and baselines for our approach. Since our ITO approach is agnostic to the type of pretrained encoder-decoder architecture, we perform experiments on two state-of-the-art SEQ-TO-SEQ models, BART and T5.

Experimental Setup
Metrics: We use the following metrics to evaluate the TemplateCSR task: (i) ROUGE (Lin, 2004) and (ii) BERTSCORE (Zhang et al., 2019). N-gram metrics such as ROUGE are known to be limited, specifically for reasoning tasks. To mitigate this, we use BERTSCORE, which computes a similarity score between the reference and the generated output using contextual embeddings from the BERT (Devlin et al., 2019) model, and correlates better with human judgements.
To perform the evaluation, we compare the generated sentence for the template against the gold annotations in the dataset. We remove the template words from the output and only compare the slot-filler concepts to avoid score inflation due to copying. All the experiments were performed on a cluster of 8 NVIDIA V100 GPUs for about 32 GPU hours.
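The removal of template words before scoring can be sketched as a simple token filter; this is an illustrative approximation of the procedure, not the exact implementation:

```python
def slot_fillers_only(generated: str, template: str) -> list:
    """Drop tokens that already appear in the template so that metrics
    score only the slot-filler concepts, not copied template text."""
    template_tokens = set(template.lower().split())
    return [t for t in generated.lower().split() if t not in template_tokens]
```

ROUGE/BERTSCORE are then computed over these filtered tokens against the similarly filtered gold expansion.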

Models
We follow the same experimental settings across the baseline and our approach for all the models. We initialize all the models with their pretrained weights. We use commonly used encoder-decoder architectures for our experiments: BART-BASE, BART-LARGE, T5-BASE, and T5-LARGE. The model settings are given below: (i) BART-BASE: this pretrained encoder-decoder transformer architecture is based on Lewis et al. (2020)

SPL TOKEN:
We use the special token approach (SPL TOKEN) (Donahue et al., 2020), where we indicate the start and end of each template slot in the input and generate the output sentence. Table 4 shows the baseline setup.
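The SPL TOKEN input construction could look roughly as follows; the `<slot>`/`</slot>` marker names are hypothetical, chosen only to illustrate the boundary-token idea:

```python
def to_special_token_input(template: str, slot_names: list) -> str:
    """Mark each template slot with boundary tokens instead of natural-language
    instructions, in the spirit of Donahue et al. (2020)."""
    out = template
    for name in slot_names:
        out = out.replace("{" + name + "}", f"<slot> {name} </slot>", 1)
    return out
```

In contrast to the ITO instruction format, the slot's abstraction is conveyed only by the marked span itself, with no separate instruction sentence.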

Results
The results across various pretrained encoder-decoder approaches are shown in Table 3. In this table, we see that BART models perform better than T5 models on average. We hypothesize this might be an effect of their pretraining task choices and corresponding datasets. We also observe that INSTRUCTION-based models outperform the SPL TOKEN-based approach. For all of the models and baselines, we used the greedy decoding strategy. Across all the experiments, we found that the ITO approach outperforms the SPL TOKEN approach on both ROUGE and BERTSCORE for all models.
⁷ Implementation adapted from Huggingface (Wolf et al., 2020)

Factual Correctness Evaluation

Expert Evaluation of Dataset: Since our dataset contains health-related commonsense knowledge, we additionally verified our data with domain experts. We selected a subset of 100 data samples from our validation set and verified the factual claims by recruiting two annotators with a Doctor of Medicine (M.D.) degree⁸. Each expert annotator labeled each template-expansion pair as either correct or incorrect. Overall, they found an average of 70% of the samples to be factually correct, with an inter-annotator agreement of 80%.
Model Output Evaluation: To assess the quality of the generated output, we perform an additional factuality evaluation of our best-performing models, the SPL TOKEN and ITO approaches using BART-LARGE. Towards this, we use the FACTCC factuality metric (Kryscinski et al., 2020), which uses entailment classification to predict a binary factuality label between the source document and the generated output.
Computing factuality using the FACTCC metric requires an input source document; i.e., the generated output is compared against the source document for factual correctness. For this evaluation setup, we augmented each generated output y with a source document. Towards this, we use a large-scale retrieval corpus based on Nguyen et al. (2016), and retrieve the top similar document D (Lin et al., 2021b) for each generated template expansion. Using the (D, y) pairs, we compute the factual correctness of our best-performing models.

Model | Type | FACTCC
BART-LARGE | SPL TOKEN | 65.27
BART-LARGE | ITO | 79.88
Table 5: Factual consistency results. In this experiment, we show that our ITO approach outperforms the SPL TOKEN approach in terms of the factuality metric FACTCC, showing its relative effectiveness.

From Table 5, we observe that our ITO approach outperforms the SPL TOKEN approach on factual correctness by ∼14 points in accuracy. Additionally, we perform a human evaluation of factual correctness. For this experiment, three human judges annotated 100 unique samples for correctness, which indicates how many samples were correct from a human perspective. We used our best-performing BART-BASE-ITO model for this evaluation. In this experiment, a sentence generated by the model for a given template was given to each human judge, who was asked to evaluate whether the sentence was correct given the template. The inter-annotator agreement on sentence expansion correctness was substantial, with a Fleiss' Kappa score (Fleiss and Cohen, 1973) of 0.73. From our evaluation, we found that human judges rated about 69% of the sentences as correct given a template, comparable to our FACTCC evaluation numbers. Both the automated and human evaluations suggest that our ITO approach has better factual consistency.
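Fleiss' Kappa, used above to measure inter-annotator agreement, can be computed directly from per-item category counts; a minimal sketch:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for N items rated by n raters into k categories.
    `ratings` is a list of per-item category counts, e.g. [[3, 0], [2, 1]],
    where each row sums to the number of raters n."""
    N = len(ratings)
    n = sum(ratings[0])
    k = len(ratings[0])
    # observed agreement: average of per-item agreement P_i
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N
    # chance agreement: from marginal category proportions p_j
    total = N * n
    p_j = [sum(row[j] for row in ratings) / total for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

A value of 0.73, as reported above, falls in the range conventionally interpreted as substantial agreement.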

Error Analysis
In this section, we analyze in detail how well language models perform the template-expansion task for multi-hop reasoning. To understand the errors in depth, we complement our automated evaluation with a manual error analysis. For this analysis, we randomly select 100 samples from the validation set predictions where the ROUGE scores were low. We observe the following categories of errors that language models exhibit. Table 6 shows the common types of errors and a corresponding example for each type.
Error Type - Correct but not in gold (17%): In several cases, we observe that the output produced by the language model is correct despite not matching the gold answer. This phenomenon is evident when the input template has multiple possible answers. While the gold answer in the example shown in Table 6 (first row) fills the template using smoking, the model generates an answer related to kidney damage. While correct, automated generation metrics such as ROUGE and BERTSCORE score such answers lower.
Error Type - Wrong commonsense concept (8%): In this category of error, the model generates the wrong specification for the given slot. For instance (second row in Table 6), the model mistakenly treats person taking less medication as a socioeconomic condition. This error type gives a more nuanced understanding of which concept categories the model makes the most mistakes on.
Error Type - Generic Explanation (53%): In several cases, the model resorts to generic explanations that are obvious. A generic explanation repeats the same information as the rest of the sentence, thereby providing no new information. In the example shown in Table 6 (row 3), the explanation because of the strain of the heart is already clear from the concept chest pain. A generic explanation is often unreliable in explainable NLP systems since it does not provide any insight into the reasoning capability of the model (Ye and Durrett, 2022).
Error Type - Factually Incorrect (22%): Factual correctness is one of the biggest challenges in NLP applications (Petroni et al., 2020; Pagnoni et al., 2021). Incorrect factual information is especially acute for cross-domain reasoning applications. As shown in the example (row 4 in Table 6), the model incorrectly generates that people with a flu diagnosis should exercise. Factual correctness in generation models is an active area of research, and we believe that template-based approaches can provide additional insight into this phenomenon.
Overall, TemplateCSR remains a challenging task for SEQ-TO-SEQ models, specifically regarding their factual correctness, and we believe it opens several avenues for progress in this research direction.

Related Work
Knowledge Bases: Knowledge Bases (KBs) have been the predominant approach to commonsense reasoning in the past (Speer and Havasi, 2013). Some of the prominent knowledge bases for commonsense reasoning include DBPedia (Mendes et al., 2012), YAGO (Suchanek et al., 2007), and NELL (Mitchell et al., 2018), along with work extending KBs with domain knowledge (Khetan et al., 2021, 2022). In this work, we focus on TemplateCSR using LMs, which can be viewed as complementary to using KBs for commonsense reasoning.
Language Models for Reasoning: Using pretrained language models to generate knowledge has been studied for commonsense reasoning tasks (Sap et al., 2019; Bosselut et al., 2019; Shwartz et al., 2020; Bosselut et al., 2021). Our work closely aligns with Bosselut et al. (2019, 2021); compared to Bosselut et al. (2019), our goal is to extend towards more controllable commonsense reasoning. Our work is also related to the recent chain-of-thought prompting approach (Dalvi et al., 2021; Wei et al., 2022b), where a reasoning chain is first generated before the final solution. Compared to chain-of-thought prompting, our approach focuses on controllability of the reasoning process from the input, via template slots using instructions, similar to Wei et al. (2022a).

Conclusion and Future Work
In this paper, we present the ITO approach, which adapts language models to perform the TemplateCSR task via prompting. We collect a dataset for this task, and show that such an approach allows greater control over the reasoning process by enabling practitioners to specify the nature of the template slots. Through both automated and human metrics, we find that our ITO approach outperforms the baselines while also maintaining high factuality. For future work, we hope to extend this line of work to other controllable generation tasks such as story generation and summarization.

Limitations and Ethics Considerations
Our work only tests single-sentence template-expansion pairs. One limitation of our work is that we do not test longer template sequences, which we plan to explore in future work. While we present our dataset and corresponding modeling approaches, we acknowledge the limitations of the system and the potential risks if it were used for real-world use-cases due to its factual errors.
While we worked diligently with the annotators and real domain experts to ascertain the quality of the dataset, we believe that there is immense room for it to be improved and scaled further. We also did not test our dataset's efficacy on large-scale models such as GPT-3 due to budget constraints, which we consider another limitation of this work.
As our results show, commonsense reasoning and health-related knowledge reasoning are far from solved, and we hope this dataset starts a research direction towards addressing this reasoning challenge. In no way do we support using this system for real-world commonsense-related advice. The system, dataset, and the accompanying publication are intended only for research purposes and for testing current NLP systems' capabilities.
Figure2: An overview of the overall template structure for our approach.Our goal is to reason across concepts for TemplateCSR.In this example template, concept slots are people_with_habit and disease, and the multiple choice qualifier slot -higher/lower describes their relationship and an explanation reason slot aims to get a free-form text explanation for how they are related.
Model Infilling: Our work also closely relates to language model infilling work in the literature, such as Fedus et al. (2018) and Donahue et al. (2020). Compared to these works, which only look at cloze-test infilling, our work aims to expand templates that cannot be directly modeled as cloze-style. Our work is also related to story generation efforts such as Yao et al. (2019); Fan et al. (2018); Ippolito et al. (2019); Rashkin et al. (2020), but our application differs from them in that we focus on commonsense reasoning instead of content planning for stories.

Template: ... This is because {reason_for_activity}
Sentence 1: People often do reading before going to bed in night - to prevent risk of insomnia. This is because - doing some light reading helps lull you to sleep.
Sentence 2: People often do teeth brushing before going to bed in night - to prevent risk of tooth decay. This is because - brushing removes cavity-causing plaque from teeth.
Table 1: Examples from our dataset. Each template has two corresponding sentences. [concept] is a concept, [text] represents the explanation, and [text] represents a qualifier. We show two sentences each for a template. Each template slot is given in free-form text without any restriction in vocabulary.

Table 2 :
Overview of the ITO approach for TemplateCSR. Each concept category is given as an instruction in the input, and the slots are represented via the [MASK] token. The instruction describes each slot's abstraction, and the task is to generate the output.

Table 3 :
Overview of the results compared to baselines. The table shows that BART-BASE performs better than the T5-BASE model, and BART-LARGE outperforms both. In terms of both ROUGE and BERTSCORE, we also observe that our INSTRUCTION approach outperforms the SPL TOKEN approach. All experiments were run with 5 seeds, and the reported numbers are averages.

Table 4 :
Baseline Setup: We follow Donahue et al. (2020) and use special tokens to indicate the start and end of each slot. The model is trained to predict the output, which is a valid expansion of the template.

Table 6 :
Error analysis based on the BART-BASE-ITO model. We select 100 samples from the validation set, and each row shows an example of each class of error.