Data-Efficient Concept Extraction from Pre-trained Language Models for Commonsense Explanation Generation



Introduction
The natural language processing (NLP) field has made enormous advances since the adoption of deep learning (Young et al., 2018) and large-scale pre-trained models (Liu et al., 2019; Devlin et al., 2019; Radford and Narasimhan, 2018). However, creating human-like AI systems requires models that comprehend commonsense knowledge and can reason with it, which is still a challenge for current NLP models (Jia and Liang, 2017). Among the research on making language models' responses consistent with commonsense knowledge, we focus on generating sentences that explain a statement using commonsense knowledge (Wang et al., 2019). An example of this task is shown in Table 1. Given a statement, the goal is to output an explanation of why the statement is against common sense.
There are currently two approaches to this task. One way, Figure 1(a), is to let language models generate an explanation directly from a statement without external assistance (Wang et al., 2019, 2020). The benefit is that this is an end-to-end pipeline that is easy to use, but it is hard to interpret why models output such sentences, and the performance is often limited. Another method, Figure 1(b), extracts bridge concepts from external structured knowledge graphs to connect statements and explanations (Ji et al., 2020). With the assistance of knowledge graphs such as ConceptNet (Liu and Singh, 2004), this method can introduce ground truth knowledge that offers helpful information for generation, and the selected concepts can be used to interpret models' outputs. Our experiments in Figure 3 confirm that informative concepts can assist the generation process. However, this method needs additional training to search for concepts, and it is limited in constrained environments where structured data is rare.

Statement: The school was open for summer
Template: Statement that the school was open for summer is wrong because concepts about [MASK].
Concepts: summer, school, education, the, schools, time, it, vacation, holidays, students, learning, kindergarten, that, teaching, children
Explanation: Summer time is typically vacation time for school.
Table 1: Example of the task and our method. The task is to produce an explanation for a given statement. Our method adopts masked language modeling to first retrieve concepts from pre-trained language models by transferring statements into templates before generation.

This paper introduces a pre-trained language model driven method, Figure 1(c), to combine the advantages of both methods. The key is that we can adopt pre-trained language models to provide bridge concepts through masked language modeling (Devlin et al., 2019) and prompt-tuning (Liu et al., 2021). Our approach is composed of two distinct models, the conceptor and the generator. The conceptor provides bridge concepts for a given statement. The generator is responsible for generating explanations based on the combination of the statement and the concepts. Following prompt-based tuning (Schick and Schütze, 2021; Gao et al., 2021; Raffel et al., 2020), we design a template to transfer a generation question to a masked prediction task. Then we use data-efficient fine-tuning to adapt pre-trained language models such as BERT or RoBERTa to encode statements, and we pick the candidates with the highest scores in the vocabulary as concepts through masked language modeling. An example is in Table 1. In the next step, the selected concepts are appended to the statements as the generator's input to generate explanations. This method has two motivations. One is that pre-trained language models can implicitly encode world knowledge (Niven and Kao, 2019; Talmor et al., 2020; Davison et al., 2019). The other is that while pre-trained models find it challenging to produce sentences that call for world knowledge, it is simpler for them to output key concepts without taking syntax and grammar into account (Petroni et al., 2019).
We found that concepts from pre-trained language models improve the metric scores of generated sentences compared to baseline models.
Compared with structured knowledge graphs, pre-trained language models are widely available and convenient for processing unstructured texts. Moreover, fine-tuning the conceptor with more data points further improves the effectiveness of concept retrieval and the metric scores. Furthermore, we design experiments to identify the advantages of obtaining informative concepts from pre-trained language models for explanation generation.
Our contributions can be summarized as follows:
• We propose to replace annotated knowledge graphs with pre-trained language models for concept extraction. Our data-efficient training method demonstrates its advantages in the explanation generation task.
• We design a metric to assess the retrieved concepts and show the association between the metric scores and the quality of the generated sentences.
• We analyze the challenges of retrieving concepts from current pre-trained language models and how these concepts help generate better explanations.

Related Work
Knowledge of Pre-trained Language Model. Pre-trained language models are now the mainstream of the natural language processing community and demonstrate remarkable strength in solving complex tasks, such as text understanding, generation, and domain generalization (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020; Hendrycks et al., 2020). However, these models make inferences in a way that is opaque to humans and are easy to attack by attaching a trigger (Wallace et al., 2019). Therefore, some researchers are trying to interpret what these models learn during pre-training. For example, Hewitt and Manning (2019) show that syntactic dependencies can be recovered from BERT's token representations, and other work finds that pre-trained language models contain a degree of world knowledge (Petroni et al., 2019; Li et al., 2021a).
Moreover, to reduce the gap between pre-training and fine-tuning, prompt-based methods have proven to be an efficient way to improve pre-trained language models' performance on downstream tasks (Gao et al., 2021; Schick and Schütze, 2021; Le Scao and Rush, 2021). Thus, we adopt prompt-tuning and design templates to elicit pre-trained language models' internal knowledge as a replacement for external knowledge graphs.
Reasonable Explanation Sentence Generation. Some current works (Ji et al., 2020; Lin et al., 2020) focus on adjusting models to output sentences that fit the context and commonsense knowledge. Because investigations (Zhou et al., 2020; Richardson and Sabharwal, 2020) find that current pre-trained models struggle with tasks that need inference steps, some of these studies (Moon et al., 2019; Zhou et al., 2018; Guan et al., 2019) propose using a knowledge graph to provide additional concepts to ensure controllable generation.
Natural language explanation generation is also extensively explored in explainable recommender system research (Zhang et al., 2020), which aims to generate personalized and reasonable natural language sentences to explain why certain items are recommended to users. Early methods used manual templates to generate explanation sentences (Zhang et al., 2014), and more recent works explored neural-template generation methods (Li et al., 2020a,b), personalized transformers (Li et al., 2021c, 2022b), visual-enhanced explanation generation (Geng et al., 2022a; Chen et al., 2019), explanation ranking methods (Li et al., 2022a, 2021b), counterfactual explanations (Tan et al., 2021, 2022), logical explanations (Shi et al., 2020; Chen et al., 2021, 2022a,b; Zhu et al., 2021), path-based explanations (Geng et al., 2022b; Xian et al., 2019, 2020), as well as large language models for explanation generation such as the P5 large recommendation model (Geng et al., 2022c). Following previous work, we use pre-trained language models as a knowledge base to replace annotated knowledge graphs so as to provide concepts for commonsense explanation sentence generation.

Problem Setup
In this task, conditioned on a statement $x = \{x_1, x_2, \ldots, x_m\}$ of length $m$, where $x_i$ denotes the $i$-th token, a model is expected to generate a sentence $y = \{y_1, y_2, \ldots, y_n\}$ of length $n$ through a model $f : x \rightarrow y$ to explain why the input statement is against common sense.

Structure Overview
In our method, we adopt two separate pre-trained language models. The first is used for concept extraction in the first stage; we call this model the conceptor and denote it as $C$. The second, the generator, denoted as $G$, generates explanations based on the extracted concepts $c$ and the original statement $x$. The whole process is formulated in equation 1.
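As a sketch of what such a two-stage formulation could look like: the generator term $P(y \mid x, c, \theta_G)$ is the one the text refers to, while the conceptor term $P(c \mid x, \theta_C)$ and the symbol $\theta_C$ are assumptions introduced here for illustration. The sketch treats $c$ as the single top-$k$ concept set returned by the conceptor rather than marginalizing over all possible concept sets.

```latex
P(y \mid x) = P(y \mid x, c, \theta_G) \, P(c \mid x, \theta_C)
```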

Concepts Extraction
This part is the concept extraction process in equation 1. We take the tokens that appear in the ground truth explanations but not in the input statements as the labels. The details of fine-tuning the model are described in section 5.4.
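The labeling rule above (tokens in the explanation that are absent from the statement) can be sketched as a multi-hot vector over a vocabulary. This is a minimal illustration, not the authors' implementation: the function name `concept_labels`, the toy vocabulary, and the whitespace-level lowercase tokens are assumptions, whereas the actual tokenization would come from the chosen pre-trained model.

```python
def concept_labels(statement_tokens, explanation_tokens, vocab):
    """Return a multi-hot list over `vocab` marking the desired concepts.

    A token is a desired concept when it occurs in the ground truth
    explanation but not in the input statement.
    """
    desired = set(explanation_tokens) - set(statement_tokens)
    return [1 if token in desired else 0 for token in vocab]


# Toy vocabulary (hypothetical); the example sentences come from Table 1.
vocab = ["summer", "school", "vacation", "time", "is", "open", "the"]
statement = ["the", "school", "was", "open", "for", "summer"]
explanation = ["summer", "time", "is", "typically", "vacation", "time", "for", "school"]

labels = concept_labels(statement, explanation, vocab)
print(labels)
```

Tokens outside the vocabulary (here "typically") simply have no position in the label vector.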

Explanation Generation
This section describes the $P(y \mid x, c, \theta_G)$ process in equation 1. The concepts $c$ from the previous step are concatenated to the end of the input $x$ to construct a new input $(x, c) = (x_1, x_2, \ldots, x_m, c_1, c_2, \ldots, c_k)$, which is fed to the model $G$. The generator follows equation 2 to produce an explanation.
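The concatenation step can be illustrated at the string level. This is a sketch under stated assumptions: the helper name `build_generator_input` and the single-space separator are hypothetical, since the text only says that concepts are attached at the end of the statement.

```python
def build_generator_input(statement, concepts, sep=" "):
    """Append the retrieved concepts to the end of the statement.

    `sep` is an assumed separator; the source does not specify one.
    """
    return sep.join([statement.strip()] + list(concepts))


example = build_generator_input(
    "The school was open for summer",
    ["vacation", "holidays", "students"],
)
print(example)
```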
The generator $G$ is a sequence-to-sequence model that contains an encoder and a decoder. The encoder maps the input $(x, c)$ to high-dimensional contextualized representations.
The decoder conditions on the input representations and the previously generated tokens to generate the following tokens.
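The encoder-decoder description above corresponds to the usual autoregressive factorization of a sequence-to-sequence model. As a sketch under that assumption (the notation $y_{<t}$ for previously generated tokens and $\mathcal{L}$ for the loss are introduced here, and the paper's exact equations may differ), the generation probability and the matching negative log-likelihood objective can be written as:

```latex
P(y \mid x, c, \theta_G) = \prod_{t=1}^{n} P(y_t \mid y_{<t}, x, c, \theta_G),
\qquad
\mathcal{L}(\theta_G) = -\sum_{t=1}^{n} \log P(y_t \mid y_{<t}, x, c, \theta_G)
```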
The generator $G$ is trained by minimizing its loss function with respect to its parameters $\theta_G$.

Experiments

Data
Similar to earlier work (Ji et al., 2020), we evaluate the model on the dataset from the Commonsense Validation and Explanation competition. This challenge tests commonsense knowledge through three subtasks: (1) selecting the statement that agrees with commonsense knowledge from two similar options;
(2) choosing, from three options, the reason why a statement is against common sense;
(3) coming up with explanations for why a specific statement is absurd. In this work, we focus on task 3, the explanation generation task.

Model
We adopt various pre-trained language models as the concept extractor. The models we use are BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), in both base and large versions. We then employ the T5-small (Raffel et al., 2020) model as the generator, which accepts the combination of statements and concepts as input and outputs the explanation. The resulting explanations are evaluated using the automatic metrics BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005).

Implementation Details
At the start, we train the conceptor with 20% of the training data; then we use the trained conceptor to process all the data in the training, development, and test sets to produce concepts for the generator.
The default hyperparameter setting is as follows: the number of selected concepts is k = 3; the template is Id 0 in Table 3; the batch size is 8 for the conceptor and 32 for the generator; the number of training epochs is 5 for the conceptor and 3 for the generator; the learning rate is 2e-5 for the conceptor and 1e-3 for the generator. The model with the best BLEU-4 score during evaluation is saved and evaluated on the test data.
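For reference, the default setting above can be collected in one configuration object. The values are taken from the text; the dictionary layout and key names are hypothetical and chosen only for readability.

```python
# Default hyperparameters as stated in the text; key names are illustrative.
DEFAULT_CONFIG = {
    "top_k": 3,                       # number of concepts attached
    "template_id": 0,                 # template Id in Table 3
    "conceptor_train_fraction": 0.2,  # share of training data for the conceptor
    "conceptor": {"batch_size": 8, "epochs": 5, "learning_rate": 2e-5},
    "generator": {"batch_size": 32, "epochs": 3, "learning_rate": 1e-3},
    "model_selection_metric": "BLEU-4",
}

for stage in ("conceptor", "generator"):
    print(stage, DEFAULT_CONFIG[stage])
```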

Main Results
Table 2 shows our main results; the conceptor is trained with 20% of the data. Compared with the generator-only method (T5-small) and the knowledge graph augmentation method, using a pre-trained language model as the extractor improves most scores and achieves the best results. Another observation is that large models typically outperform their base versions. We attribute this to larger models tending to be more powerful (Liu et al., 2019; Hendrycks et al., 2020; Desai and Durrett, 2020; Brown et al., 2020). More analysis is in section 5.4. Moreover, attaching randomly chosen concepts (Random) lowers the scores, indicating that only appropriate concepts aid the generation process.
Additionally, to investigate the benefit of concepts, we attached the tokens that appear only in the references to the end of the statement and passed the combination to the generator (UpperBound in Table 2: attaching the concepts that appear in references but do not exist in input statements). We observe a significant boost from attaching the correct concepts. Therefore, we consider the UpperBound result to be the upper limit of our method with the selected models and training data size.
Please note that here we define "correct" concepts as tokens that are not seen in the statement but are included in the corresponding references. This demonstrates that improving the precision of concept retrieval from language models would benefit explanation generation. We discuss retrieved concepts further in section 5.5.

Effects of Different Templates
Since template engineering is an important part of our method, we examine whether these pre-trained models are sensitive to the design of templates. Table 3 shows the model's performance with different manually created templates for prompt-tuning. Although the BLEU-4 score varies somewhat across templates, the standard deviation is only about 0.1259. Therefore, the template choice has a small impact on the metric score, indicating that pre-trained models are robust to differences in templates.
Moreover, we conduct additional ablation experiments. Results are in Table 4. We study what happens when adopting a traditional fine-tuning method that takes the [CLS] representation for concept extraction rather than adding a template. Even though models tend to improve under few-shot learning with prompt-based tuning (Gao et al., 2021), there is only a small decrease in our results. The reason could be that a large amount of data closes the gap between traditional fine-tuning and prompt-tuning.
Next, we study the effects of an ensemble of templates (formula 4) and of randomly selected templates (formula 5).
M is the number of distinct templates, and i denotes the Id of a template in Table 3. In formula 5, i is selected uniformly at random from the range [1, 2, . . ., M] for each training example. Table 4 shows that these two methods improve over the default setting, and all single-template results are in Table 3. Therefore, one advantage of prompt-tuning is data augmentation with templates.
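The two template strategies can be sketched as follows. Only the random per-example selection is described in the text; combining templates by averaging their per-token scores is an assumption made here for illustration and may not match formula 4. The first template is the one shown in Table 1, and the other two strings, along with all function names, are hypothetical.

```python
import random

TEMPLATES = [
    "Statement that [DATA] is wrong because concepts about [MASK].",  # from Table 1
    "[DATA] This is wrong because of [MASK].",                        # hypothetical
    "[DATA] The missing concept is [MASK].",                          # hypothetical
]


def fill(template, statement):
    """Replace the [DATA] slot with the statement."""
    return template.replace("[DATA]", statement)


def random_template_input(statement, rng):
    """Pick one template uniformly at random for this training example."""
    return fill(rng.choice(TEMPLATES), statement)


def ensemble_scores(per_template_scores):
    """Average vocabulary scores over templates (assumed ensemble rule)."""
    m = len(per_template_scores)
    vocab_size = len(per_template_scores[0])
    return [sum(scores[j] for scores in per_template_scores) / m
            for j in range(vocab_size)]


rng = random.Random(0)
print(random_template_input("the school was open for summer", rng))
print(ensemble_scores([[0.2, 0.8], [0.4, 0.6]]))
```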

The Number Of Attached Concepts
To further investigate the best number of retrieved concepts for models, we designed an experiment that gradually changes the hyperparameter k. k denotes the number of tokens selected through masked language modeling and attached to the statements.
The results are shown in Figure 2. We use BLEU-3 and BLEU-4 to evaluate the model's outputs, and the pre-trained model selected for retrieving concepts is the base version of BERT. The x-axis denotes the number of concepts k, and the y-axis represents the evaluation scores on the test data.
In this figure, the model achieves the best scores across all metrics with k = 3. For k smaller than 3, the performance increases with k. However, when k exceeds 3, the results decrease as k increases.
We argue that although our method can bring some concepts that benefit the explanation generation stage, it also inevitably includes some noisy tokens that confuse the generation model. As long as k is at most 3, the benefit from retrieved concepts outweighs the noise. However, when k becomes larger than 3, the noise reduces the advantage of these concepts, thus decreasing the scores. Therefore, in our future work, the key to enhancing the method is raising the accuracy of concept retrieval and reducing the noise.

Evaluating Retrieved Concepts
In this section, we discuss how to evaluate retrieved concepts and introduce an automatic evaluation metric. We also provide human analysis in section 5.6.
Inspired by the concept F1 metric that Ji et al. (2020) use to evaluate sentences, we consider tokens that are in the ground truth explanations but do not appear in the input sentences to be the positive class. All other tokens in the vocabulary are regarded as the negative class.
In detail, we denote the unique tokens in the tokenized input $x$ as $C_x$, the unique tokens in the tokenized ground truth explanation $y$ as $C_y$, and the concepts $\hat{y}$ extracted by the conceptor as $C_{\hat{y}}$. $C_y - C_x$ represents the desired concepts. We then use precision to measure how well the outputs match the references:

$$\mathrm{Precision} = \frac{|C_{\hat{y}} \cap (C_y - C_x)|}{|C_{\hat{y}}|}$$

$C_{\hat{y}} \cap (C_y - C_x)$ is the set of overlapping concepts, $|\cdot|$ denotes the number of elements, and $|C_{\hat{y}}|$ is usually equal to $k$. This metric shows how many retrieved concepts are aligned with the ground truth explanations. We choose not to use recall, equation 7, for evaluation because we think recall cannot properly measure the retrieved concepts against the ground truth in our experimental setting.
$C_y$ contains many ambiguous and irrelevant concepts that are hard to retrieve, such as do, does, not, is, are, a, an. Moreover, $|C_{\hat{y}} \cap (C_y - C_x)|$ is less than or equal to $k$, while $|C_y|$ is usually much larger than $k$. All these factors make the recall score always low regardless of the quality of the retrieved concepts, and thus recall is not a good indicator.

20% 38.20 26.73 21.18 14.78 9.84
100% 47.50 34.80 28.42 20.15 13.83
Table 6: The precision scores for concepts from pre-trained language models with different top-k. The number before % is the percentage of data used to train the conceptor. 0% is zero-shot, and 100% is full fine-tuning. The higher, the better.
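The concept precision of this section, together with a recall variant for comparison, can be sketched with plain set operations. The function names are hypothetical, and the recall denominator $|C_y - C_x|$ is an assumption, since the recall equation is not reproduced here. The example uses the Table 1 sentences with lowercase whitespace tokens and a hypothetical set of three retrieved concepts.

```python
def concept_precision(retrieved, statement_tokens, explanation_tokens):
    """Share of retrieved concepts that are desired concepts (C_y - C_x)."""
    desired = set(explanation_tokens) - set(statement_tokens)
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & desired) / len(retrieved)


def concept_recall(retrieved, statement_tokens, explanation_tokens):
    """Share of desired concepts that were retrieved (assumed denominator)."""
    desired = set(explanation_tokens) - set(statement_tokens)
    if not desired:
        return 0.0
    return len(set(retrieved) & desired) / len(desired)


statement = ["the", "school", "was", "open", "for", "summer"]
explanation = ["summer", "time", "is", "typically", "vacation", "time", "for", "school"]
retrieved = ["vacation", "time", "students"]  # hypothetical top-3 output

precision = concept_precision(retrieved, statement, explanation)
recall = concept_recall(retrieved, statement, explanation)
print(precision, recall)
```

Because the number of retrieved concepts is capped at k while the desired set also contains function words such as "is", recall stays bounded well below 1 even when every retrieved concept is correct.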

We then use the precision in equation 6 to evaluate the concepts collected from BERT-base and RoBERTa-large. The result is shown in Table 6. These scores are low in the zero-shot setting, but the precision improves considerably after fine-tuning with 20% of the data. Furthermore, larger models perform better, which is consistent with the results in section 4.4 and section 5.4. However, even after training with 100% of the data, the best result that the model reaches is still 47.50, by RoBERTa-large. This indicates that the task is still challenging for current models. Results decrease as k grows, indicating that a larger k tends to bring more noise and disrupt the generation.

Fine-tuning The Conceptor
In the last section, we observed that training with more data and larger models can increase the precision of concept retrieval, but we also want to know whether these more accurate concepts benefit the final results. As with the precision described in section 5.3, we set all the concepts in $(C_y - C_x)$ as the desired concepts for masked language modeling and denote them as the label $C$. Because this task can be regarded as multi-label learning, we adopt the sigmoid function to map the output scores into probabilities and keep each value independent. We then optimize the conceptors with a cross-entropy loss.
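The multi-label objective described above, an independent sigmoid per vocabulary entry followed by cross-entropy, can be sketched in plain Python. This is an illustration rather than the training code: the function names, the mean reduction over the vocabulary, and the small `eps` for numerical safety are assumptions.

```python
import math


def sigmoid(z):
    """Map a raw score to an independent probability."""
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_cross_entropy(logits, labels, eps=1e-12):
    """Mean binary cross-entropy over vocabulary entries.

    `logits` are the [MASK] scores for each vocabulary token and
    `labels` is the multi-hot vector of desired concepts.
    """
    total = 0.0
    for z, t in zip(logits, labels):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1.0 - p + eps)
    return total / len(logits)


labels = [1, 0, 1, 0]
good = multilabel_cross_entropy([4.0, -4.0, 3.0, -3.0], labels)
bad = multilabel_cross_entropy([-4.0, 4.0, -3.0, 3.0], labels)
print(good, bad)
```

Scores that rank the desired concepts highly give a lower loss than scores that rank them low, which is what pushes the desired concepts into the top-k.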
As shown in Table 5, fine-tuning increases the metric scores as the percentage of data used grows. Comparing zero-shot and complete training, there is a 1.31 BLEU-4 increase for BERT and a 3.12 BLEU-4 increase for RoBERTa. This evidence underlines the connection between the concept precision metric and model performance. In addition, RoBERTa is better than BERT across all metrics when trained with the same percentage of data, in line with the earlier observations.

How Retrieved Concepts Affect Generations
Our proposed method simplifies concept extraction and benefits the generation process more efficiently. However, current methods still have difficulty guiding pre-trained language models to return correct concepts, as the scores in Table 6 are still deficient. Therefore, we design experiments in which m of the k concepts are fixed as correct and the remaining k − m concepts are noisy tokens, where 1 ≤ m ≤ k. Specifically, we randomly sample m concepts that appear in the ground truth explanations but are not included in the input sentence and pick k − m noisy concepts from the vocabulary at random. We then shuffle these k concepts and attach them to the tail of the statements for generation. We ran experiments with k ∈ {1, 3, 5, 8}, with m increasing from 1 to k for each k, and evaluated with the BLEU-3 and BLEU-4 metrics.
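The sampling procedure for this controlled experiment can be sketched as below. The function name `mix_concepts` and the toy vocabulary are hypothetical, and drawing noise only from tokens outside the desired set is an assumption; the text only says that noisy concepts are picked from the vocabulary at random.

```python
import random


def mix_concepts(desired, vocab, k, m, rng):
    """Return k shuffled concepts: m correct ones and k - m noisy ones."""
    if not 1 <= m <= k:
        raise ValueError("m must satisfy 1 <= m <= k")
    desired = sorted(desired)  # sorted for reproducible sampling
    if len(desired) < m:
        raise ValueError("not enough desired concepts to sample m of them")
    correct = rng.sample(desired, m)
    # Assumption: noise is drawn from tokens that are not desired concepts.
    noise_pool = [token for token in vocab if token not in set(desired)]
    noisy = rng.sample(noise_pool, k - m)
    mixed = correct + noisy
    rng.shuffle(mixed)
    return mixed


vocab = ["vacation", "time", "holiday", "car", "tree", "river", "blue", "seven"]
desired = {"vacation", "time", "holiday"}
mixed = mix_concepts(desired, vocab, k=5, m=2, rng=random.Random(0))
print(mixed)
```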
Figure 3 displays the results. The score rises as the number of correct concepts increases. Additionally, a smaller k typically performs better for the same m. Models need only m = 2 correct concepts to surpass our best results in Table 2. These observations suggest that enhancing models' capacity to identify better bridge concepts could be the focus of future work on generating commonsense sentences.

Human Analysis
As these automatic evaluation metrics rely heavily on the quality of human references, we also provide a human analysis to see whether our method helps generation compared with the baseline. We collect 50 statements paired with concepts, generated explanations, and ground truth explanations. One human annotator then evaluates these data against four criteria. Correct: whether the retrieved concepts occur in the ground truth explanations, regardless of whether these concepts are in the generated explanations. Reasonable: whether the generated explanations explain the statements with commonsense knowledge. Dependent: how many concepts are used in the generated explanations. Relevant: the ratio of concepts that are related to the statement's topic but unseen in the ground truth explanation.
The results are in Table 7, and Table 8 shows some sample cases. Table 7 indicates that our method selects relevant concepts more effectively and generates more reasonable explanations than the baselines. We also observe that Correct is always higher than Dependent, which means that the generator still mainly relies on its internal knowledge for generation.
Table 8 provides example cases from the test dataset with the explanations and concepts produced by our method and the baseline models. Although models can create explanations with the aid of concepts in these examples, the explanations follow a simple structure and grammar without further expansion.

Conclusion
In this paper, we present a concept extraction method that combines prompt-tuning and pre-trained language models for the commonsense explanation generation task. We demonstrate that our method is more data-efficient for concept extraction than methods that select concepts from a knowledge graph. Our method improves evaluation metric scores compared with baseline models. Additionally, we analyze the method of selecting concepts from pre-trained language models and the challenges to solve in future work.

Limitation
In our experiments, we found no method or pre-trained language model that retrieves concepts with sufficient precision. We believe this limitation restricts the generator's capacity to generate explanations, and future work should focus on methods that improve concept retrieval precision. An additional limitation is that we have not tested our method with alternative generation models for the generator, since the generation stage is not the focus of this work. Finally, our method cannot explain how the generator understands statements and concepts to generate explanations.
Figure 1: Illustrations of (a) the standard method for generation without concept augmentation, (b) the method enhanced by concepts from knowledge graphs and trained models, and (c) the method enhanced by k concepts from pre-trained language models without training for this explanation generation task. KG is a knowledge graph. G is a generator model. PLM means a pre-trained language model.

Figure 2: Performance when changing the number of top-k concepts selected.

Figure 3: Performance when changing the number of expectations. The extractor is BERT-base, and the generator is T5-small.
For the concept extraction process in equation 1, we first modify the input $x$ by attaching a template to create a masked sentence $M(x)$. The template we use is the one with Id 0 in Table 3, and an example is in Table 1. Then we use $C$ to encode $M(x)$. At this step, we obtain the representation of the [MASK] token and send it to a prediction head $W \in \mathbb{R}^{d \times N}$, where $d$ is the representation size and $N$ is the vocabulary size, to produce the probability that each word fits into the [MASK] position. Subsequently, we select the $k$ candidates $c = \{c_1, c_2, \ldots, c_k\}$ with the highest probability. $k$ is a hyperparameter, and the default setting is 3. The concepts $c$ combined with the input sentence $x$ are sent into the model $G$ for explanation generation. In the default setting, we train our conceptor with only 20% of the training data in a weak-supervision manner.
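The steps of this paragraph (fill the template, score every vocabulary entry from the [MASK] representation through a head $W$, keep the top-$k$ tokens) can be sketched without any model library. Everything here is a toy: the vector `h` stands in for the encoder's [MASK] representation, the matrix `W` and the four-word vocabulary are made up, and the softmax normalization is an assumption (the fine-tuning section describes a sigmoid, which gives the same ranking). The template string follows the Table 1 example, including its lowercased statement.

```python
import math


def fill_template(statement):
    """Build M(x) from a statement, mirroring the example in Table 1."""
    lowered = statement[:1].lower() + statement[1:]
    return f"Statement that {lowered} is wrong because concepts about [MASK]."


def mask_probabilities(h, W):
    """Project a d-dim [MASK] vector through a d x N head and normalize."""
    vocab_size = len(W[0])
    logits = [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(vocab_size)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_concepts(probs, vocab, k=3):
    """Return the k vocabulary tokens with the highest probability."""
    ranked = sorted(range(len(vocab)), key=lambda j: probs[j], reverse=True)
    return [vocab[j] for j in ranked[:k]]


vocab = ["the", "vacation", "school", "summer"]  # toy vocabulary
h = [1.0, 0.0]                                    # toy [MASK] representation, d = 2
W = [[0.1, 2.0, 0.5, 1.0], [0.0, 0.0, 0.0, 0.0]]  # toy head, shape d x N

masked = fill_template("The school was open for summer")
probs = mask_probabilities(h, W)
concepts = top_k_concepts(probs, vocab, k=3)
print(masked)
print(concepts)
```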

Table 2: Results for explanations from different methods. All these methods use T5-small as the generator. The bold number indicates the best results. T5-small: using T5 to generate explanations without any concept augmentation. Random: attaching concepts selected randomly. Knowledge Graph: following the method in (Wang et al., 2019) that looks up concepts from ConceptNet. The model searching through ConceptNet is trained with 100% of the data. BERT-base/large and RoBERTa-base/large denote our method and the model we used as the conceptor.

Table 3: Templates used in our method. [DATA] would be replaced by the original data. The templates are created manually and have semantic meanings.

Table 4: Template ablation study with BERT-base as the conceptor.

Table 5: Results after fine-tuning conceptors with k = 3. Models represent different concept extractors. % denotes the percentage of data used for training.

Table 7: Human evaluation of generated explanations and selected concepts.

Table 8: Sample cases of extracted concepts and generated explanations. Red color marks concepts that appear in the generated explanations.