Knowledge Graph Compression Enhances Diverse Commonsense Generation



Introduction
Commonsense knowledge graphs (CSKGs) have been used to improve the performance of downstream applications such as question answering (Yasunaga et al., 2021) and dialogue (Tu et al., 2022), as well as for enhancing neural models for commonsense reasoning tasks (Lin et al., 2019; Yu et al., 2022). Typically, these methods extract keywords from the input and construct a subgraph around them using the KG knowledge, which is then incorporated into the model.
Recent popular CSKGs such as ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2019) represent nodes in natural language, which allows flexibility but also adds redundancy and noise (Wu et al., 2023).¹ Moreover, the retrieved subgraphs around a task's concepts potentially include information that is not relevant to the context. For example, in Figure 1, the goal is to generate a reason why the input sentence ("A shark interviews a fish") defies commonsense. The concepts tank and business are semantically irrelevant to both the input and the reference output sentences. Including irrelevant information introduces noise that can deteriorate the model's performance. Recent work has addressed this by pruning noisy paths based on low edge confidence scores in knowledge base embeddings (Lin et al., 2019) or by using language models (LMs) (Yasunaga et al., 2021). Yet, the relevance of paths is not determined in relation to the given task.

Figure 1: An example from ComVE (Wang et al., 2020). The subgraph obtained for the input sentence includes unimportant information (in red) that can lead to noisy outputs.

¹ Code is available at: https://github.com/eujhwang/KG-Compression
In this paper, we propose to use differentiable graph compression, which enables the model to learn how to select the crucial concepts that are actually related to the task. Our method contains two main components: using self-attention scores to select relevant concept nodes in the retrieved subgraph, and employing an optimal transport loss to ensure the chosen concepts preserve the most crucial information of the original graph. In this way, irrelevant or redundant concepts can be automatically eliminated from the subgraph.
We demonstrate the usefulness of our method on two commonsense generation tasks: commonsense explanation generation and abductive commonsense reasoning. Our method outperforms a range of baselines that use KGs in terms of both diversity and quality of the generations. We further conduct a comprehensive analysis, exploring different setups, such as incorporating new knowledge into the subgraph. Unlike the baselines, our method enables the model to maintain performance even in the presence of potentially increased noisy data. Finally, we show that our approach demonstrates a better quality-diversity tradeoff than the large language model Vicuna-13b, which has 100 times more parameters.
Background

KG-Enhanced Neural Methods. KGs have been used to enhance models for question answering (Lin et al., 2019; Feng et al., 2020; Yasunaga et al., 2021), relation classification (Wang et al., 2021), textual entailment (Kapanipathi et al., 2020), and more. Typically, such methods extract a subgraph of knowledge related to keywords in the input, which is then either embedded or represented in natural language before being incorporated into the model. For example, both Wang et al. (2023) and Wang, Fang, et al. (2023) used CSKGs to enhance a commonsense inference and a QA model by including the abstraction of concepts in the input (e.g., vacation → relaxing event). However, some knowledge may be irrelevant in the context of the particular question.
To reduce such noise, prior methods have proposed to score and prune the paths. Lin et al. (2019) used TransE (Bordes et al., 2013) to score each edge in the path, while Yasunaga et al. (2021) score nodes based on the likelihood of a pre-trained LM generating them after the input. In both methods, the scores are not trained to represent a node's importance in relation to the task.
Generating Commonsense Explanations. This paper focuses on the task of generating commonsense explanations, in particular on the following datasets. In ComVE (Wang et al., 2020), the goal is to generate explanations for why a given sentence, such as "A shark interviews a fish", does not make sense. α-NLG (Bhagavatula et al., 2020) presents models with a past observation, such as "Mike spends a lot of his time on the internet", and a future observation, such as "Now other people love the internet because of Mike's website". The goal is to generate a plausible explanation for what might have happened in-between, such as "Mike created a website that helps people search". In a related line of work, researchers collected or generated commonsense explanations for existing tasks (e.g., Camburu et al., 2018; Rajani et al., 2019; Brahman et al., 2021).
Diverse Sentence Generation. One of the desired aspects of generating commonsense explanations is the diversity of the outputs. Popular LM decoding methods such as top-k (Fan et al., 2018), top-p (Holtzman et al., 2020), and truncated sampling (Hewitt et al., 2022) generate diverse outputs by pruning the probability distribution over the vocabulary for the next token and then sampling a token from the pruned distribution. An alternative approach is to use a mixture of experts (MoE) to produce diverse outputs (Shen et al., 2019; Cho et al., 2019). Our approach extends MoKGE (Yu et al., 2022), a model for commonsense explanation generation. MoKGE uses a combination of KGs to diversify the outputs of an MoE model. However, the knowledge that MoKGE retrieves from the KG is not filtered, and hence may contain loosely related, redundant, and irrelevant information, which can negatively impact the model's ability to generate high-quality diverse outputs. In our approach, we employ knowledge graph compression to prioritize more important information.

Method
Our goal is to generate diverse sentences {y_1, y_2, ..., y_k} that explain a given instance x (see Sec 2 for the specific task descriptions). The objective is to maximize the probability P(y_i | x) of generating each y_i, as well as to diversify them. Previous KG-enhanced approaches usually add an external graph G_x so that the generation is also conditioned on the graph: P(y_i | x, G_x). However, as we discussed in Sec 1, G_x often contains redundancy or noise. For example, given a target concept A, the graph G_x may contain a semantically similar concept A′ (e.g., a synonym) as well as a noisy concept B. A′ will negatively impact the diversity of the generations, because the model may select both A and A′ and produce semantically similar outputs; B will hurt the generation quality, since it is irrelevant to the context. A natural way to solve this problem is therefore to eliminate such concepts by compressing the graph.
Our method extends MoKGE (Yu et al., 2022) by compressing the retrieved external knowledge graph. The framework is illustrated in Figure 2 and described in detail subsequently. In a nutshell, it aims to identify the concepts within the KG that provide the most relevant knowledge for a particular instance. We first extract a subgraph from the KG based on the given input sentence, and encode it into a vector representation (Sec 3.1). Then, we learn a compressed graph that maintains only the most relevant concepts for the given instance (Sec 3.2). We train the model with the corresponding losses (Sec 3.3) and finally apply MoE to generate diverse outputs (Sec 3.4).

KG Subgraph Extraction and Encoding
The subgraph extraction and encoding follows MoKGE (Yu et al., 2022).
Subgraph Extraction. We first associate each input sentence with the set of concepts from the KG that match its tokens. For example, given the sentence q = "A shark interviews a fish" (the "query"), we extract the concepts C_q = {fish, shark, interview} from ConceptNet.² Second, we fix a radius h and extract a subgraph G_q with node set V_q ⊇ C_q from the KG, such that it contains all KG nodes and edges that are up to h = 2 hops around the concepts in C_q (e.g., shark).

Graph Encoding. To obtain embeddings for the concept nodes, we apply an off-the-shelf graph encoder over the extracted subgraph (Wu et al., 2021). In our implementation, we follow Yu et al. (2022) and use the relational graph convolutional network (R-GCN; Schlichtkrull et al., 2018). R-GCN computes node representations by iteratively aggregating neighboring node representations, thereby taking the relation types into account. In this way, the final embeddings capture the structural patterns of the subgraph.
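The extraction step amounts to a bounded breadth-first search over KG triples. The following is a minimal sketch with a toy edge list; the function name and data layout are ours for illustration, not from the MoKGE codebase:

```python
from collections import deque

def extract_subgraph(edges, query_concepts, h=2):
    """Collect all nodes within h hops of any query concept.

    edges: list of (head, relation, tail) triples from the KG.
    Returns the node set V_q and the induced edge list.
    """
    # Build an undirected adjacency map over the triples.
    adj = {}
    for head, _, tail in edges:
        adj.setdefault(head, set()).add(tail)
        adj.setdefault(tail, set()).add(head)

    # BFS from every query concept, bounded by h hops.
    nodes = set(query_concepts)
    frontier = deque((c, 0) for c in query_concepts)
    while frontier:
        node, dist = frontier.popleft()
        if dist == h:
            continue
        for nbr in adj.get(node, ()):
            if nbr not in nodes:
                nodes.add(nbr)
                frontier.append((nbr, dist + 1))

    # Keep only edges whose endpoints both survive.
    sub_edges = [(s, r, t) for s, r, t in edges if s in nodes and t in nodes]
    return nodes, sub_edges
```

With h = 2, a concept three hops away from every query concept is excluded, which is exactly what bounds the subgraph size before encoding.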

Differentiable Graph Compression
As we discussed before, the extracted subgraphs often contain redundancy and noise, and we aim to compress the graph and remove the irrelevant information. This introduces two challenges: (1) how to make the graph compression differentiable so that it can be trained in the context of downstream tasks; and (2) how to maintain the most important and relevant information in the compressed graph.
Self-Attention for Concept Scoring. Since we want to select concepts for the generation step (Sec 3.4), we cannot apply differentiable pooling methods (Ying et al., 2018; Ma and Chen, 2020); instead, we choose to construct a semantically meaningful subgraph containing the relevant nodes and edges. To do so, we apply self-attention, essentially using the features computed in the previous step as the main criterion to determine the concepts' importance. Specifically, we compute self-attention scores Z ∈ R^{C×1} as proposed by Lee et al. (2019) using graph convolution (Kipf and Welling, 2017):

Z = σ(D̃^{-1/2} Ã D̃^{-1/2} X Θ_att),

where σ is the non-linear activation function tanh; C := |V_q| is the number of concept nodes in the subgraph; Ã ∈ R^{C×C} is the adjacency matrix extended by self-connections; D̃ is the degree matrix of Ã, which is used for normalization; X ∈ R^{C×F} is the matrix of concept embeddings obtained in the previous step, with embedding dimension F; and Θ_att ∈ R^{F×1} is the parameter matrix for the self-attention scores. Given the concept scores Z, we consider a pre-set assignment ratio s ∈ (0, 1] and form the compressed graph G′ by selecting s% of the concept nodes. We denote by S the number of selected concept nodes. In the example in Figure 2, the compressed (third) graph contains 80% of the nodes in the original subgraph.
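The scoring and selection step can be sketched in NumPy as follows. This is an illustration only: in the model, Θ_att is a trainable parameter and selection happens inside the network, whereas here the top-S choice is a plain argsort:

```python
import numpy as np

def concept_scores(A, X, theta, ratio=0.8):
    """Score nodes with a one-layer graph convolution and keep the top s%.

    A: (C, C) adjacency matrix; X: (C, F) node embeddings;
    theta: (F, 1) attention parameters; ratio: assignment ratio s.
    """
    C = A.shape[0]
    A_tilde = A + np.eye(C)                  # extend adjacency by self-connections
    d = A_tilde.sum(axis=1)                  # node degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D^{-1/2} used for normalization
    # Z = tanh(D^{-1/2} A_tilde D^{-1/2} X theta), shape (C, 1)
    Z = np.tanh(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ theta)

    S = max(1, int(round(ratio * C)))        # number of nodes to keep
    keep = np.argsort(-Z[:, 0])[:S]          # indices of the S highest scores
    return Z, np.sort(keep)
```

The kept indices induce the compressed graph G′; the OT regularizer described next is what pushes the learned scores away from redundant selections.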
Optimal Transport for Regularization. The self-attention based concept selection makes the graph compression differentiable; however, the attention parameters can only be trained from the downstream generation task, which guarantees neither the compression quality nor its generalizability. Consider the case of a concept A and its synonym A′ in the retrieved graph G_q: if A is selected by the attention scores, it is very likely that A′ also receives a high score and is selected, so the redundancy is not removed.
For this reason, we additionally apply optimal transport (OT; Peyré and Cuturi, 2019), a method commonly used for measuring the distance between two probability measures. Here, we regard a graph as a discrete distribution, similarly to Ma and Chen (2020), and minimize the OT distance between the original graph and its compressed version. To this end, we define an optimal transport loss between graphs. Given an m-node graph and an n-node graph, we assume they have discrete distributions μ = Σ_{i=1}^{m} a_i δ_{x_i} and ν = Σ_{j=1}^{n} b_j δ_{x_j}, where x_i and x_j indicate the nodes, δ is a delta function, and a = (a_1, ..., a_m) and b = (b_1, ..., b_n) are the node weights (generally uniform). If we define a cost matrix M whose element M_ij indicates the transport cost from node x_i to node x_j, then the optimal transport distance is:

d_M(μ, ν) = min_T Σ_{ij} T_ij M_ij,   (1)

where T ∈ R^{m×n} is called a transportation plan, whose element T_ij denotes the transportation probability from x_i to x_j, and which meets the requirements T 1_n = a and T^⊤ 1_m = b. Once the optimal transport distance is minimized, the compressed graph is expected to keep as much information of the original graph as possible. Thus, redundant concepts will be largely removed, since involving them in the compressed graph would lead to less information being kept. As a simple example, given an original graph with nodes {A, A′, C}, the subgraph with nodes {A, C} should be more informative than the one with {A, A′}, and its optimal transport distance to the original graph should be smaller.
Since solving an OT problem is computationally expensive, we add an entropy regularization term E(T) = Σ_{ij} T_ij (log T_ij − 1) to allow for solving it approximately with Sinkhorn's algorithm (Cuturi, 2013) in practice, following prior work. With a hyperparameter γ > 0, the entropy-regularized loss becomes:

L_OT = min_T Σ_{ij} T_ij M_ij + γ E(T).   (2)

Loss Functions for Training
Following Yu et al. (2022), we train BART-base (Lewis et al., 2020) in a seq2seq architecture on the commonsense explanation generation task with a generation loss, and additionally apply a KG concept loss. We also include an optimal transport loss.
Generation Loss. For sentence generation, we maximize the conditional probability of the target sequence y given the input sequence x concatenated with the selected KG concepts c_1, c_2, ..., c_S. We use the standard auto-regressive cross-entropy loss:

L_gen = − Σ_t log P(y_t | y_{<t}, x, c_1, ..., c_S),

where t is the timestep of the output. In the generation step, the model auto-regressively generates the output y given the input x and the S selected concepts.
KG Concept Loss. The effectiveness of the concept selection can be measured in terms of which of the chosen concepts appear in the output sentence a (the reference answer). More specifically, we use a standard binary cross-entropy loss with targets y_c = I(c ∈ V_q ∩ C_a) for each c ∈ V_q. Here, I(·) represents the indicator function, and C_a is the set of concepts that are present in the output. To obtain a probability p_c for each of the S concepts in the compressed graph, we apply an MLP. The resulting loss is:

L_concept = − Σ_c [ y_c log p_c + (1 − y_c) log(1 − p_c) ].

Optimal Transport Loss. To make the optimal transport distance differentiable, we solve Eq. 2 using Sinkhorn's algorithm (Cuturi, 2013). Starting with any positive vector v_0, we iteratively update u and v as follows:

u_{k+1} = a ⊘ (K v_k),   v_{k+1} = b ⊘ (K^⊤ u_{k+1}),

where ⊘ is element-wise division and K is an intermediate variable derived from the cost matrix M: K = exp(−M/γ).
After k steps, we arrive at the k-step result P^k = diag(u_k) K diag(v_k) as an approximate optimal transportation plan; hence the optimal transport loss is approximated by

L_OT ≈ Σ_{ij} P^k_ij M_ij.

Altogether, our model is trained with three loss functions:

L = L_gen + α L_concept + β L_OT,

where α and β are hyperparameters that control the relative importance of the individual loss functions. In our experimental setup, we set both α and β to 0.3.
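The Sinkhorn iterations and the resulting approximate OT loss can be sketched as follows (a NumPy illustration of the update rules above; variable names are ours):

```python
import numpy as np

def sinkhorn_ot(M, a, b, gamma=0.1, n_iters=200):
    """Approximate the entropy-regularized OT distance via Sinkhorn iterations.

    M: (m, n) cost matrix; a, b: node weight vectors (each summing to 1).
    Returns the transport plan P and the approximated OT loss <P, M>.
    """
    K = np.exp(-M / gamma)           # elementwise kernel from the cost matrix
    v = np.ones_like(b)              # start with any positive vector v_0
    for _ in range(n_iters):
        u = a / (K @ v)              # u_{k+1} = a ./ (K v_k)
        v = b / (K.T @ u)            # v_{k+1} = b ./ (K^T u_{k+1})
    P = np.diag(u) @ K @ np.diag(v)  # P^k = diag(u_k) K diag(v_k)
    return P, float((P * M).sum())
```

After enough iterations, the plan's row and column sums match the node weights a and b, so `(P * M).sum()` approximates the OT distance between the original and compressed graphs.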

Diverse Generation based on MoE
To encourage more diverse outputs, we follow previous work (Shen et al., 2019; Cho et al., 2019; Yu et al., 2022) and use a mixture of experts (MoE). We use K experts, where each expert is responsible for generating a unique set of KG concepts. The model is trained using the hard-EM algorithm (Dempster et al., 1977). Since the procedure is similar to Yu et al. (2022), we provide the details in Appendix E.
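The hard-EM training scheme can be illustrated with a small sketch: the E-step assigns each reference output to its best-fitting expert, and the M-step minimizes only the winning expert's loss, so experts specialize. This is a simplification of the procedure in Appendix E, not the MoKGE code:

```python
import numpy as np

def hard_em_step(losses):
    """One hard-EM step over K experts.

    losses: (n, K) array, where losses[i, k] is the loss of expert k
    on reference output i.
    E-step: pick the lowest-loss expert per reference output.
    M-step: the training objective averages only the selected losses,
    so each expert is updated only on the outputs it wins.
    """
    assignment = losses.argmin(axis=1)                      # E-step
    selected = losses[np.arange(len(losses)), assignment]
    return assignment, selected.mean()                      # M-step objective
```

With K experts and K reference outputs per input, this assignment tends toward a one-to-one pairing, which is what drives the diversity of the mixture.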

Baselines
MoE-based Methods. MoE-embed (Cho et al., 2019) and MoE-prompt (Shen et al., 2019) produce diverse sentences by sampling different mixture components. While MoE-embed employs independent latent variables when generating diverse outputs, MoE-prompt shares the latent variable between the experts. MoKGE (Yu et al., 2022) is the approach that we extend by adding graph compression. It generates outputs by incorporating KG concepts on top of MoE-based methods.
Other Methods to Improve Diversity. To show that our method yields a sophisticated concept selection beyond regular filtering, we compare it to a simple synonym filtering on top of MoKGE, applied during the inference step, which yields a set of unique KG concepts for generating outputs. This baseline prevents the model from selecting similar concepts when generating the outputs. Second, we consider the common pruning approach, which removes irrelevant paths from the potentially noisy subgraph, following KagNet (Lin et al., 2019). To measure the quality of a path, the path is decomposed into a set of triples. Each triple is scored with the scoring function of the knowledge graph embedding technique TransE (Bordes et al., 2013), and the score of each path is the product of its triple scores. The threshold for pruning is a hyperparameter, set to 0.15 following Lin et al. (2019).
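The pruning baseline can be sketched as follows. Note that squashing the TransE distance into (0, 1] via an exponential is our assumption for illustration and may differ from the exact scoring used by Lin et al. (2019):

```python
import numpy as np

def transe_triple_score(h, r, t):
    """TransE-style plausibility for one triple: close to 1 when h + r ~ t,
    decaying toward 0 as the translation error grows (exp squashing is an
    illustrative choice)."""
    return float(np.exp(-np.linalg.norm(h + r - t)))

def prune_paths(paths, embeddings, threshold=0.15):
    """Keep a path only if the product of its triple scores passes the threshold.

    paths: list of paths, each a list of (head, relation, tail) ids.
    embeddings: dict mapping entity/relation ids to vectors.
    """
    kept = []
    for path in paths:
        score = 1.0
        for head, rel, tail in path:
            score *= transe_triple_score(
                embeddings[head], embeddings[rel], embeddings[tail])
        if score >= threshold:
            kept.append(path)
    return kept
```

Because the path score is a product, a single implausible triple is enough to push a long path below the threshold, which is why this pruning is insensitive to the downstream task.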

Metrics
Following the evaluation setting of previous work, we assess the generated sentences in terms of both diversity and quality.
Pairwise Diversity. Self-BLEU (Zhu et al., 2018) evaluates how similar each sentence is to the other generated sentences based on n-gram overlap. Self-BLEU-3/4 are diversity scores based on 3/4-gram overlap. The metrics compute the average of sentence-level self-BLEU scores between all pairwise combinations of generated outputs. Hence, lower self-BLEU scores indicate greater variety among the sentences generated for each input sample.
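A self-contained sketch of self-BLEU, using a simplified sentence-level BLEU (geometric mean of clipped n-gram precisions, without smoothing or brevity penalty), so the numbers will not exactly match standard toolkit implementations:

```python
from collections import Counter
from itertools import combinations
from math import exp, log

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=3):
    """Simplified BLEU-max_n for one candidate against one reference."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    return exp(sum(log(p) for p in precisions) / max_n)

def self_bleu(sentences, max_n=3):
    """Average pairwise BLEU over all ordered pairs; lower = more diverse."""
    pairs = list(combinations(sentences, 2))
    scores = [bleu(a, b, max_n) for a, b in pairs] + \
             [bleu(b, a, max_n) for a, b in pairs]
    return sum(scores) / len(scores)
```

Two identical generations give a self-BLEU of 1, while generations with no shared n-grams give 0.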
Corpus Diversity. Distinct-k (Li et al., 2016) is calculated by counting the number of unique k-grams in the generated sentences and dividing it by the total number of generated tokens, to prevent a preference towards longer sentences. Additionally, we report entropy-k (Zhang et al., 2018), which evaluates the evenness of the empirical n-gram distribution within the generated sentences.
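Both corpus-diversity metrics are straightforward to compute; a minimal sketch:

```python
from collections import Counter
from math import log

def distinct_k(sentences, k=2):
    """Unique k-grams divided by total generated tokens, normalizing away
    the advantage of longer outputs."""
    kgrams, total_tokens = set(), 0
    for tokens in sentences:
        total_tokens += len(tokens)
        kgrams.update(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return len(kgrams) / max(1, total_tokens)

def entropy_k(sentences, k=4):
    """Shannon entropy of the empirical k-gram distribution; higher means
    the k-grams are spread more evenly."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * log(c / total) for c in counts.values())
```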
Quality. We use standard metrics for automatic evaluation of generative tasks: BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), which are based on n-gram overlap between the generated sentences and human-written reference outputs. Following prior work, these metrics are computed by comparing the best generated sentence to the target sentences.

Results and Discussion
Comparison to Baselines, Table 1. We observe similar trends for the two datasets and across the two model series, based on embeddings and prompts. Overall, the differences are strongest for self-BLEU and Distinct-2, two aspects that are particularly important for diverse generation. This suggests that our model is able to reason about different possible contexts. On both datasets, our method, MoKGE+SAG+OT, outperforms the mixtures of experts by large margins in terms of diversity and, at the same time, achieves comparable or better performance in terms of quality. Note that, on ComVE, the quality differences between the best and our second-best model are within standard deviation. The effectiveness of our approach is especially evident from the comparison to the filtering and pruning baselines. Recall that these approaches similarly aim at better exploiting the KG by improving diversity and removing noise, respectively. However, we observe a considerable decrease in diversity and, nearly always, a slight decrease in quality. This shows that simple solutions, unrelated to the task at hand, are seemingly not able to identify the most relevant knowledge. More specifically, for the filtering baseline, we observed that the model is unable to learn which concepts to choose for unseen data. As a result, its ability to generalize to unseen data is limited, resulting in lower diversity scores on the test data. Altogether, this demonstrates that our approach, based on the compressed graph, is effective in suppressing redundant information present in the KG and promoting other knowledge that is more relevant in the given context.
We additionally confirm that our optimal transport loss helps the model keep the compressed KG subgraph more coherent; see especially the α-NLG results.
Generation Examples, Figure 4. Observe that MoKGE tends to generate sentences with simpler structure and fewer concepts, whereas our model employs a broader range of KG concepts. This makes the generations effectively more similar to the human-written ones, where each of the three sentences addresses a different context. We show more examples in Appendix B.
Testing Robustness with Potentially More Redundancy and Noise, Table 2. We created a more challenging scenario by extending the extracted subgraphs with additional related knowledge, potentially including more of both relevant and redundant information. This was done by applying COMET (Bosselut et al., 2019), a transformer trained to generate KG triples (i.e., entity-relation-entity tuples) based on given entity-relation pairs. The original MoKGE model seems to struggle in this scenario: its performance decreases without exception in terms of all metrics. In contrast, our approach, applied on top of MoKGE, succeeds in retaining both the performance of MoKGE alone and the improvements of MoKGE+SAG+OT.
Comparison with Large Language Model, Table 3 & Figure 4. We compare our method to Vicuna-13b. Most interestingly, our proposal outperforms the LLM on Distinct-2 and Entropy-4. Note that even MoKGE alone is slightly better than the LLM in these aspects, yet our method is effective in extending the gap by better exploiting the external knowledge. Figure 4 gives example outputs and shows that the LLM is still prone to generating sentences with similar structure (e.g., "I wore a wig to ..."), as can be seen for α-NLG. Furthermore, while the generated sentence "I wore a wig to a party and felt great" explains the first observation "I always wondered why I loved wearing wigs", it fails to explain the second observation "I got beat up and reminded of why I shouldn't". In the ComVE task, the generated sentences are diverse in terms of both sentence structure and word usage, but the model sometimes generates sentences that are less logical, such as "Writing in a paper with an eraser is not permanent". In contrast, our approach enables MoKGE to generate a wider range of sentences that incorporate relevant concepts and enhance the interpretability of the generation process.

Analysis
Compression Ratios, Figure 3. This hyperparameter determines the proportion of concept nodes to be kept in the compressed subgraph. Maintaining 65% of the nodes in the subgraph yields the optimal performance in terms of both diversity and quality on both datasets (see Appendix C for the ComVE dataset). Interestingly, we do not observe a strong negative impact on performance even down to a ratio of 30%. This shows that ConceptNet apparently contains much information that is not necessarily beneficial for diverse generation in the context of a particular task, which justifies our proposal.
Unique Concepts in the Output, Appendix D. The comparison of MoKGE and MoKGE+SAG+OT shows that MoKGE tends to generate more sentences containing no unique concepts.

Human Evaluation, Table 4. We conducted a human evaluation of the outputs produced by our model MoKGE+SAG+OT and the baseline MoKGE for the α-NLG task. We randomly selected 30 generations from each model. The annotation was performed by 3 researchers in the lab. We instructed the annotators to score the diversity and correctness (quality) of each generation on a scale of 0 to 3. Table 4 shows a consistent performance improvement across both diversity and quality when comparing our model to the baseline.

Conclusion
We present a differentiable graph compression algorithm that enables the model to focus on crucial information. Through experiments on two commonsense explanation generation tasks, we show that our approach not only improves the diversity but also maintains the quality of the outputs. Moreover, our graph compression helps the model regain performance when new and potentially noisy information is added to the graphs. Our work opens up future research on effectively selecting and incorporating symbolic knowledge into NLP models.

Limitations
Use of Single-Word Concepts. Since ConceptNet contains mostly single words, we limit additional KG concepts to single words only. However, our method can easily be extended to phrases, and we leave it to future work to investigate how to effectively utilize longer phrases.
Use of Relations. When additional KG concepts are added to the model, we focus on the concept nodes, not the edges. However, relation edges may provide additional insight. We leave the exploration of this for future work.

Ethics Statement
Data. The datasets used in our work, SemEval-2020 Task 4 Commonsense Validation and Explanation (ComVE; Wang et al., 2020) and Abductive Commonsense Reasoning (α-NLG; Bhagavatula et al., 2020), are publicly available. The two datasets aim to produce commonsense explanations and do not include any offensive, hateful, or sexual content. The commonsense knowledge graph, ConceptNet, was collected through crowdsourcing and may also introduce bias into our model (Mehrabi et al., 2021). However, we only use single-word nodes from ConceptNet, which limits the impact of such bias.

Models
The generative models presented in the paper are trained on a large-scale, publicly available web corpus and may also introduce some bias when generating sentences.

Figure 2: Overview of our approach. We retrieve a subgraph from ConceptNet for the given input sentence, compress it, and use MoE to generate diverse sentences containing concepts from the compressed graph.
In Figure 2, the nodes in the 4th graph highlighted in green, red, and blue indicate the K = 3 respective experts assigned to handle different concepts. The utilization of our compressed graph helps the model better prioritize the crucial concepts during output generation, as we demonstrate in our experiments.
ComVE (Wang et al., 2020) was part of the SemEval 2020 commonsense validation task. Given a nonsensical sentence, the task is to generate explanations for why it doesn't make sense. The dataset contains 10k training examples and roughly 1,000 examples each for test and validation. Each example comes with 3 reference output sentences. The other dataset, α-NLG (Bhagavatula et al., 2020), addresses the abductive commonsense reasoning task. Given a past observation and a future observation, the goal is to generate plausible explanations for what might have happened in-between. The dataset consists of 50k training examples, 1,779 validation and 3,560 test examples. Each example in the dataset includes up to 5 reference outputs.

Figure 3: Self-BLEU-3, Distinct-2, and ROUGE-L per assignment ratio on the α-NLG dataset. MoKGE-prompt with Self-Attention and Optimal Transport is used for the experiment.

Table 1: Diversity and quality evaluation on the ComVE and α-NLG datasets. All experiments are run three times with different random seeds, and standard deviations are reported in subscript.
The large LM was built upon LLaMA-13b (Touvron et al., 2023), a transformer-based LM trained on trillions of tokens exclusively sourced from publicly available data. Vicuna-13b performs on par with ChatGPT.