Diversifying Content Generation for Commonsense Reasoning with Mixture of Knowledge Graph Experts

Generative commonsense reasoning (GCR) in natural language is the task of reasoning about commonsense while generating coherent text. Recent years have seen a surge of interest in improving the generation quality of commonsense reasoning tasks. Nevertheless, these approaches have seldom investigated diversity in GCR tasks, which aims to generate alternative explanations for a real-world situation or to predict all possible outcomes. Diversifying GCR is challenging, as it requires generating multiple outputs that are not only semantically different but also grounded in commonsense knowledge. In this paper, we propose MoKGE, a novel method that diversifies generative reasoning with a mixture-of-experts (MoE) strategy on commonsense knowledge graphs (KG). A set of knowledge experts seek diverse reasoning on the KG to encourage varied generation outputs. Empirical experiments demonstrated that MoKGE significantly improves diversity while achieving on-par accuracy on two GCR benchmarks, based on both automatic and human evaluations.


Introduction
An important desideratum of natural language generation (NLG) is to produce outputs that are not only correct but also diverse (Tevet and Berant, 2021). The term "diversity" in NLG is defined as the ability of a generative model to create a set of possible outputs that are each valid given the input and vary as widely as possible in terms of content, language style, and word variability (Gupta et al., 2018). This research problem is also referred to as one-to-many generation (Shen et al., 2019; Cho et al., 2019; Shen et al., 2022).
Diversity in NLG has been extensively studied for various tasks in the past few years, such as machine translation (Shen et al., 2019) and paraphrase generation (Gupta et al., 2018).

§ Codes of our model and baselines are available at https://github.com/DM2-ND/MoKGE.

[Figure 1: An example from the ComVE task, showing a sub-KG on ConceptNet for the input "Piano is a kind of sport." and three different output explanations, e.g., (1) "You can produce music when pressing keys on the piano, so it is an instrument." and (2) "Piano is a musical instrument used in songs to produce different musical tones."]

In these tasks, output spaces are constrained by the input context, i.e., the contents of multiple outputs should be similar and, globally, under the same topic. However, many NLG tasks, e.g., generative commonsense reasoning, pose the unique challenge of generating multiple reasonable outputs that are semantically different. Figure 1 shows an example from the commonsense explanation generation (ComVE) task. The dataset collected explanations of counterfactual statements for sense-making from three annotators (Wang et al., 2020). From the annotations, we observed that different annotators explained the unreasonable statement from different perspectives, making the explanations diverse in terms of content, e.g., wrong effect and inappropriate usage.
In order to create diversity, existing methods attempted to produce uncertainty by introducing random noise into a latent variable (Gupta et al., 2018) or sampling the next token widely from the vocabulary. However, these methods cannot explicitly control the varying semantic units or produce outputs with diverse content. Meanwhile, the input text alone contains too little knowledge to support diverse reasoning and produce multiple reasonable outputs (Yu et al., 2022c). As an example, Table 1 shows human evaluation results on two GCR tasks. While human annotators were able to produce 2.60 different yet reasonable explanations on the ComVE dataset, one SoTA diversity-promoting method (i.e., nucleus sampling) could produce only 2.15 reasonable explanations.
To improve the diversity of outputs for GCR tasks, we investigated the ComVE task and found that 75% of the concepts (nouns and verbs) in human annotations were among the 2-hop neighbors, on the commonsense KG ConceptNet, of the concepts contained in the input sequence. Therefore, to produce diverse GCR, our idea is to enable NLG models to reason over different perspectives of knowledge on the commonsense KG and use them to generate diverse outputs, like the human annotators.
Thus, we present a novel Mixture of Knowledge Graph Experts (MoKGE) method for diverse generative commonsense reasoning on KG. MoKGE contains two major components: (i) a knowledge graph (KG) enhanced generative reasoning module that associates relevant concepts into the generation process, and (ii) a mixture-of-experts (MoE) module that produces diverse reasonable outputs. Specifically, the generative reasoning module performs compositional operations on the KG to obtain structure-aware representations of concepts and relations. Then, each expert uses these representations to seek a different yet relevant set of concepts and feeds them into a standard Transformer model to generate the corresponding output. To encourage different experts to specialize in different reasoning abilities, we employ the stochastic hard-EM algorithm, assigning full responsibility to the expert with the largest joint probability.

1 ConceptNet: https://conceptnet.io/
We conducted experiments on two generative commonsense reasoning benchmarks, i.e., commonsense explanation generation and abductive reasoning generation. Our empirical experiments showed that MoKGE can outperform existing diversity-promoting generation methods in diversity, while achieving on par performance in quality.
To the best of our knowledge, this is the first work to boost diversity in NLG by diversifying knowledge reasoning on commonsense KG.

Diversity Promoting Text Generation
Generating multiple valid outputs given a source sequence has a wide range of applications, such as machine translation (Shen et al., 2019), paraphrase generation (Gupta et al., 2018), question generation (Cho et al., 2019), dialogue systems (Dou et al., 2021), and story generation. For example, in machine translation, there are often many plausible and semantically equivalent translations due to information asymmetry between different languages (Lachaux et al., 2020).
Methods for improving diversity in NLG have been explored from various perspectives. Sampling-based decoding is one of the most effective solutions. For example, nucleus sampling samples next tokens from the dynamic nucleus of tokens containing the vast majority of the probability mass, instead of decoding text by maximizing the likelihood. Another line of work focused on introducing random noise (Gupta et al., 2018) or changing latent variables (Lachaux et al., 2020) to produce uncertainty. In addition, Shen et al. (2019) adopted a mixture of experts to diversify machine translation, where a minimum-loss predictor is assigned to each source input. Shi et al. (2018) employed an inverse reinforcement learning approach for unconditional diverse text generation.
However, no existing work considered performing diverse knowledge reasoning to generate multiple reasonable outputs of different contents.

Knowledge Graph for Text Generation
[Figure 2: The overall architecture of MoKGE. MoKGE consists of four steps: (S1) the model constructs a sequence-associated subgraph from the commonsense KG; (S2) a relational-GCN iteratively updates the representation of a concept node by aggregating information from its neighboring nodes and edges; (S3) each knowledge expert selects different salient concepts that should be considered during generation; (S4) the model generates the outputs by integrating the token embeddings of the input sequence and the top-ranked entities.]

Incorporating external knowledge is essential for many NLG tasks to augment the limited textual
information (Yu et al., 2022c; Dong et al., 2021; Yu et al., 2022b). Some recent work explored using graph neural networks (GNN) to reason over multi-hop relational knowledge graph (KG) paths (Zhou et al., 2018; Jiang et al., 2019; Zhang et al., 2020a; Wu et al., 2020; Yu et al., 2022a; Zeng et al., 2021). For example, Zhou et al. (2018) enriched the context representations of the input sequence with neighbouring concepts on ConceptNet using graph attention. Ji et al. (2020) performed dynamic multi-hop reasoning on multi-relational paths extracted from the external commonsense KG. Recently, some work attempted to integrate external commonsense knowledge into generative pre-trained language models (Guan et al., 2020; Bhagavatula et al., 2020; Liu et al., 2021). For example, Guan et al. (2020) conducted post-training on synthetic data constructed from a commonsense KG by translating triplets into natural language texts using templates. Yu et al. (2022c) provide a comprehensive survey with more detailed comparisons of different knowledge-graph-enhanced NLG methods.

Proposed Method
Problem formulation. In this paper, we focus on diversifying the outputs of generative commonsense reasoning (GCR) tasks, e.g., commonsense explanation generation and abductive reasoning generation. These tasks require one-to-many generation, i.e., creating a set of reasonable outputs that vary as widely as possible in terms of content, language style, and word variability. Formally, given a source input x, our goal is to model a conditional distribution for the target outputs p(y|x) that assigns high values to {p(y_1|x), ..., p(y_K|x)} for K mappings, i.e., {x → y_1, ..., x → y_K}. Meanwhile, the outputs {y_1, ..., y_K} are expected to be diverse from each other in terms of content.
Existing diversity-promoting methods only varied the language styles and failed to perform different knowledge reasoning to generate diverse contents (Cho et al., 2019; Shen et al., 2019). Here, incorporating a commonsense KG is essential for generative reasoning (GR) tasks because the KG can not only augment the limited information in the input text, but also provide a rich search space for knowledge reasoning. Therefore, we propose to employ the commonsense KG to play the central role in performing diverse knowledge reasoning, then use different sets of selected concepts to produce diverse outputs.
Model Outline. Our model has two major components: (i) a knowledge graph (KG) enhanced generative reasoning module to reasonably associate relevant concepts and background into the generation process, and (ii) a mixture of expert (MoE) module to diversify the generation process and produce multiple reasonable outputs.

KG-enhanced Generative Reasoning
The KG-enhanced generative reasoning module is illustrated in Figure 2. It consists of four steps.
First, a sequence-associated subgraph is retrieved from the KG given the input sequence ( §3.1.1). Then, a multi-relational graph encoder iteratively updates the representation of each node by aggregating information from its neighboring nodes and edges ( §3.1.2). Next, the model selects salient concepts that should be considered during generation ( §3.1.3). Finally, the model generates outputs by integrating the token embeddings of both the input sequence and the top-ranked concepts ( §3.1.4).

Sequence-aware subgraph construction
To facilitate the reasoning process, we resort to an external commonsense knowledge graph G = {V, E}, where V denotes the concept set and E denotes the relational edges. Since direct reasoning on the entire graph is intractable, we extract a sequence-associated subgraph G_x = {V_x, E_x}, where V_x consists of the concepts extracted from the input sequence (denoted as C_x) and their inter-connected concepts within two hops. For example, in Figure 2, C_x = {piano, sport, kind} and V_x = {piano, sport, kind, art, music, press, ...}. The generation task then becomes maximizing the conditional probability p(y|x, G_x).
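As an illustration (not the released MoKGE code), the two-hop subgraph extraction described above can be sketched in plain Python, assuming the KG is given as a list of (head, relation, tail) triples:

```python
from collections import deque

def extract_subgraph(kg_edges, query_concepts, max_hops=2):
    """Extract the sequence-associated subgraph G_x: the query concepts C_x
    plus every concept reachable within `max_hops` hops, together with the
    edges among the surviving concepts."""
    # Build an undirected adjacency map over the KG.
    neighbors = {}
    for h, r, t in kg_edges:
        neighbors.setdefault(h, set()).add(t)
        neighbors.setdefault(t, set()).add(h)

    # Breadth-first expansion up to max_hops from the query concepts.
    nodes = set(query_concepts)
    frontier = deque((c, 0) for c in query_concepts)
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue
        for nb in neighbors.get(node, ()):
            if nb not in nodes:
                nodes.add(nb)
                frontier.append((nb, dist + 1))

    # Keep only edges whose endpoints both survive.
    edges = [(h, r, t) for h, r, t in kg_edges if h in nodes and t in nodes]
    return nodes, edges
```

In practice the extraction also involves lemmatization and stop-word filtering when matching input tokens to ConceptNet concepts; the sketch only shows the hop-bounded expansion itself.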

Multi-relational graph encoding
To model the relational information in the commonsense KG, we employ the relational graph convolutional network (R-GCN) (Schlichtkrull et al., 2018), which generalizes GCN with relation-specific weight matrices. We follow Vashishth et al. (2020) and Ji et al. (2020) in using a non-parametric compositional operation ϕ(·) to combine the concept node embedding and the relation embedding. Specifically, given the input subgraph G_x = {V_x, E_x} and an R-GCN with L layers, we update the embedding of each node v ∈ V_x at the (l+1)-th layer by aggregating information from the embeddings of its neighbors in N(v) at the l-th layer:

h_v^{(l+1)} = ReLU( (1/|N(v)|) ∑_{(u,r) ∈ N(v)} W^{(l)} ϕ(h_u^{(l)}, h_r^{(l)}) ),

where h_v and h_r are the node embedding and the relation embedding. We define the compositional operation as ϕ(h_u, h_r) = h_u − h_r, inspired by TransE (Bordes et al., 2013). The relation embedding is also updated via another linear transformation:

h_r^{(l+1)} = W_rel^{(l)} h_r^{(l)}.

Finally, we obtain concept embeddings h_v^{(L)} that encode the sequence-associated subgraph context.
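The layer update can be sketched in NumPy. This is an illustrative simplification (a single shared weight matrix, mean aggregation, no self-loops or basis decomposition), not the exact R-GCN variant used in the experiments:

```python
import numpy as np

def compgcn_layer(node_emb, rel_emb, edges, W_node, W_rel):
    """One message-passing layer with the non-parametric composition
    phi(h_u, h_r) = h_u - h_r (TransE-style), mean aggregation over each
    node's incoming (neighbor, relation) pairs, and a ReLU.
    `edges` is a list of (u, r, v) index triples meaning u --r--> v."""
    num_nodes, dim = node_emb.shape
    agg = np.zeros((num_nodes, dim))
    count = np.zeros(num_nodes)
    for u, r, v in edges:
        agg[v] += (node_emb[u] - rel_emb[r]) @ W_node  # message W * phi(h_u, h_r)
        count[v] += 1
    count = np.maximum(count, 1.0)                     # avoid division by zero
    new_node = np.maximum(agg / count[:, None], 0.0)   # mean aggregation + ReLU
    new_rel = rel_emb @ W_rel                          # linear update for relations
    return new_node, new_rel
```

Note that nodes without incoming edges get zero vectors here; real implementations typically add a self-loop so each node retains its own information.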

Concept selection on knowledge graph
Not all concepts in G_x appear in the outputs. Thus, we design a concept selection module to choose salient concepts that should be considered during generation. For each concept v ∈ V_x, we calculate its probability of being selected with a multilayer perceptron (MLP) on top of the graph encoder:

p_v = σ( MLP(h_v^{(L)}) ).

To supervise the concept selection process, we use the overlap between the concepts appearing in the output sequence C_y and the concepts in the input-associated subgraph G_x, i.e., V_x ∩ C_y, as a simple proxy for the ground-truth supervision. The concept selection loss (shown here for one expert; see the MoE loss in Eq. (8)) is a binary cross-entropy:

L_concept = − ∑_{v ∈ V_x} [ 1[v ∈ C_y] log p_v + (1 − 1[v ∈ C_y]) log(1 − p_v) ].

Finally, the top-N ranked concepts on the subgraph G_x (denoted as v_1, ..., v_N) are selected as additional input to the generation process.
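A minimal NumPy sketch of this selection head, assuming the MLP scores are already computed; the function name and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def concept_selection(scores, gold_mask, top_n):
    """Sigmoid over the MLP scores gives each concept's selection probability;
    binary cross-entropy against the proxy labels (concepts appearing in both
    the subgraph and the reference output) supervises it; the top-N ranked
    concepts are passed on to the generator."""
    probs = 1.0 / (1.0 + np.exp(-scores))
    eps = 1e-12  # numerical stability for the logs
    loss = -np.mean(gold_mask * np.log(probs + eps)
                    + (1 - gold_mask) * np.log(1 - probs + eps))
    top_idx = np.argsort(-probs)[:top_n]  # indices of the top-N concepts
    return probs, loss, top_idx
```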

Concept-aware sequence generation
We utilize a standard Transformer (Vaswani et al., 2017) as our generation model. It takes the concatenation of the sequence x and all the selected concepts v_1, ..., v_N as input and auto-regressively generates the output y. We adopt the cross-entropy loss:

L_gen = − ∑_{t=1}^{|y|} log p(y_t | y_{<t}, x, v_1, ..., v_N).

Note that since the selected concepts do not have a rigorous order, we only apply positional encodings (used in the Transformer) to the input sequence x.

Overall objective
We jointly optimize the following loss:

L = L_gen + λ L_concept,

where λ is a hyperparameter that controls the relative importance of the two tasks.

MoE-Promoted Diverse Generation
To empower the generation model to produce multiple reasonable outputs, we employ a mixture-of-experts (MoE) module to model uncertainty and generate diverse outputs. While MoE models have primarily been explored as a means of increasing model capacity, they have also been used to diversify the generation process (Shen et al., 2019; Cho et al., 2019). Formally, the MoE module introduces a multinomial latent variable z ∈ {1, ..., K} and decomposes the marginal likelihood as:

p(y | x, G_x) = ∑_{z=1}^{K} p(z | x, G_x) p(y | z, x, G_x).

Training. We minimize the loss function (in Eq. (6)) using the MoE decomposition and train the model with the EM algorithm (Dempster et al., 1977). Ideally, we would like different experts to specialize in different reasoning abilities so that they can generate diverse outputs. Expert specialization means that, given the input, only one element in {p(y, z | x, G_x)}_{z=1}^{K} should dominate in value (Shen et al., 2019). To encourage this, we employ a hard mixture model that maximizes max_z p(y, z | x, G_x) by assigning full responsibility to the expert with the largest joint probability. Training via hard-EM proceeds as follows:
• E-step: estimate the responsibility of each expert, r_z ← 1[z = arg max_z p(y, z | x, G_x)], using the current parameters θ;
• M-step: update the parameters with gradients from the chosen expert (r_z = 1) in the E-step.
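A single hard-EM step can be sketched as follows. Here `expert_losses` stands in for the per-expert negative log joint probabilities, and the M-step is only indicated in the comments, since the actual gradient update depends on the surrounding training framework:

```python
import numpy as np

def hard_em_step(expert_losses):
    """One hard-EM iteration over K experts. `expert_losses[k]` is the
    negative log joint -log p(y, z=k | x, G_x) of expert k on the current
    example. E-step: give full responsibility to the expert with the
    largest joint probability, i.e., the smallest loss. M-step (sketched):
    only the chosen expert's parameters would then receive gradients."""
    K = len(expert_losses)
    winner = int(np.argmin(expert_losses))  # arg max_z p(y, z | x, G_x)
    responsibilities = np.zeros(K)
    responsibilities[winner] = 1.0          # r_z is a one-hot vector
    return winner, responsibilities
```

In a training loop, the loss of the winning expert alone would be backpropagated, which is what drives the experts to specialize.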
Expert parameterization. Independently parameterizing each expert may exacerbate overfitting, since the number of parameters increases linearly with the number of experts (Shen et al., 2019). We follow the parameter-sharing scheme of Cho et al. (2019) and Shen et al. (2019) to avoid this issue. It requires only a negligible increase in parameters over the baseline model that does not use MoE. In our experiments, we compared adding a unique expert embedding to each input token with adding an expert prefix token before the input text sequence, and they achieved very similar performance.
Producing K outputs during inference. To generate K different outputs on the test set, we follow Shen et al. (2019): we enumerate all values of the latent variable z and greedily decode each token by ŷ_t = arg max p(y | ŷ_{1:t−1}, z, x). In other words, we ask each expert to seek a different set of concepts on the knowledge graph and use the selected concepts to generate K different outputs. Notably, this decoding procedure is efficient and easily parallelizable. Furthermore, to make fair comparisons with sampling-based methods, we use greedy decoding without any sampling strategy.
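This enumeration can be sketched with a toy next-token distribution standing in for the expert-conditioned Transformer (`step_fn` is a hypothetical stand-in, not an API from the paper):

```python
def decode_all_experts(step_fn, num_experts, max_len, eos="<eos>"):
    """Enumerate the latent variable z = 0..K-1 and greedily decode one
    output per expert. `step_fn(z, prefix)` returns a dict of next-token
    probabilities p(y_t | y_<t, z, x) for the given expert and prefix."""
    outputs = []
    for z in range(num_experts):
        prefix = []
        for _ in range(max_len):
            probs = step_fn(z, tuple(prefix))
            token = max(probs, key=probs.get)  # greedy: arg max p(y | ...)
            if token == eos:
                break
            prefix.append(token)
        outputs.append(" ".join(prefix))
    return outputs
```

Because each expert decodes independently, the K decoding passes can run in parallel as a single batched forward pass.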

Tasks and Datasets
Commonsense explanation generation. This task aims to generate an explanation given a counterfactual statement for sense-making (Wang et al., 2019). We use the benchmark dataset ComVE from SemEval-2020 Task 4 (Wang et al., 2020). The dataset contains 10,000 / 997 / 1,000 examples for the training / development / test sets, respectively. The average input/output length is 7.7 / 9.0 words. All examples in the dataset have 3 references.

Abductive reasoning generation. Also referred to as α-NLG, this is the task of generating a valid hypothesis about likely explanations given partial observations of the past and future. We use the ART benchmark dataset (Bhagavatula et al., 2020), which consists of 50,481 / 1,779 / 3,560 examples for the training / development / test sets. The average input/output length is 17.4 / 10.8 words. Each example in the ART dataset has 1 to 5 references.

Baseline Methods
We note that since we target the one-to-many generation problem, we exclude the baseline methods mentioned in the related work that cannot produce multiple outputs, e.g., Zhang et al. (2020a); Ji et al. (2020); Liu et al. (2021). Different from the aforementioned methods, our MoKGE seeks diverse reasoning on the KG to encourage varied generation outputs without any additional conditions.
To the best of our knowledge, we are the first to explore diverse knowledge reasoning on a commonsense KG to generate multiple output sequences. Therefore, we only compared our MoKGE with existing diversity-promoting baselines that do not use a knowledge graph.

VAE-based method. The variational auto-encoder (VAE) (Kingma and Welling, 2014) is a deep generative latent variable model. VAE-based methods produce diverse outputs by sampling different latent variables from an approximate posterior distribution. CVAE-SVG (SVG is short for sentence variant generation) (Gupta et al., 2018) is a conditional VAE model that can produce multiple outputs based on an original sentence as input.

MoE-based method. Mixture models provide an alternative approach to generating diverse outputs by sampling different mixture components. We compare against two mixture-of-experts (MoE) implementations by Shen et al. (2019) and Cho et al. (2019), which we refer to as MoE-prompt (Shen et al., 2019) and MoE-embed (Cho et al., 2019).

Sampling-based method. Sampling methods create diverse outputs by sampling the next token widely from the vocabulary. We compare against two sampling algorithms for decoding: truncated sampling (Fan et al., 2018) and nucleus sampling. Truncated sampling (Fan et al., 2018) randomly samples words from the top-k probability candidates of the predicted distribution at each decoding step. Nucleus sampling avoids text degeneration by truncating the unreliable tail and sampling from the dynamic nucleus of tokens containing the vast majority of the probability mass.

Implementation Details
All baseline methods were built on the Transformer architecture with a 6-layer encoder and decoder, and initialized with pre-trained parameters from BART-base (Lewis et al., 2020), one of the state-of-the-art pre-trained Transformer models for natural language generation (Gehrmann et al., 2021). In our MoKGE, the Transformer parameters were also initialized from BART-base in order to make a fair comparison with all baseline methods. The R-GCN parameters were randomly initialized.
For model training, we used Adam with a batch size of 60, a learning rate of 3e-5, L2 weight decay of 0.01, learning rate warm-up over the first 10,000 steps, and linear decay of the learning rate. Our models were trained on one Tesla V100 GPU card with 32GB memory, and implemented in PyTorch with Huggingface's Transformers (Wolf et al., 2020). All Transformer-based methods were trained for 30 epochs, taking about 4-5 hours on the ComVE dataset and 7-9 hours on the α-NLG dataset.
In addition to our MoKGE implementation, we also provide the baseline implementation code on GitHub https://github.com/DM2-ND/MoKGE.

Automatic Evaluation
We evaluated the performance of different generation models from two aspects: quality (i.e., accuracy) and diversity. Quality tests the appropriateness of the generated response with respect to the context, and diversity tests the lexical and semantic diversity of the appropriate sequences generated by the model. These evaluation metrics have been widely used in existing work (Ott et al., 2018; Vijayakumar et al., 2018; Cho et al., 2019).

Quality metrics (⇑). Quality is measured by standard N-gram based metrics, including the BLEU score (Papineni et al., 2002) and the ROUGE score (Lin, 2004). This measures the highest accuracy by comparing the best hypothesis among the top-K with the target (Vijayakumar et al., 2018). Concretely, we generate hypotheses {Ŷ^(1), ..., Ŷ^(K)} from each source X, keep the hypothesis Ŷ_best that achieves the best sentence-level metric with the target Y, and then calculate a corpus-level metric on the greedily-selected hypotheses. Diversity is evaluated from three aspects: concept, pairwise, and corpus diversity.
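The best-of-K quality protocol above can be sketched as follows; `token_jaccard` is a toy stand-in for a sentence-level metric such as BLEU or ROUGE, and the final averaging is a simplification (real evaluation recomputes a corpus-level metric on the selected hypotheses):

```python
def token_jaccard(ref, hyp):
    """Toy sentence-level similarity: Jaccard overlap of token sets."""
    a, b = set(ref.split()), set(hyp.split())
    return len(a & b) / len(a | b)

def best_of_k_score(references, hypothesis_sets, sentence_metric):
    """For each source, keep the hypothesis among the top-K that scores
    best against the reference under `sentence_metric`, then average the
    winning scores over the corpus."""
    best = []
    for ref, hyps in zip(references, hypothesis_sets):
        best.append(max(sentence_metric(ref, h) for h in hyps))
    return sum(best) / len(best)
```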
Concept diversity. The number of unique concepts (Uni.C for short) measures how many unique concepts on the commonsense KG are covered by the generated outputs; a higher value indicates higher concept diversity. Besides, we also measure pairwise concept diversity using the Jaccard similarity, defined as the size of the intersection divided by the size of the union of two concept sets; a lower value indicates higher concept diversity.
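Both concept-diversity statistics can be computed directly from the per-output concept sets; a minimal sketch (illustrative, not the evaluation script used in the paper):

```python
from itertools import combinations

def concept_diversity(concept_sets):
    """Uni.C: number of unique concepts across the K outputs (higher is
    more diverse). Pairwise Jaccard: |A ∩ B| / |A ∪ B| averaged over all
    unordered pairs of outputs (lower is more diverse)."""
    unique = set().union(*concept_sets)
    pairs = list(combinations(concept_sets, 2))
    jaccard = sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
    return len(unique), jaccard
```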
Pairwise diversity (⇓). Referred to as "self-" metrics (e.g., self-BLEU), this measures within-distribution similarity. It computes the average of a sentence-level metric over all pairwise combinations of the hypotheses {Ŷ^(1), ..., Ŷ^(K)} generated from each source sequence X. A lower pairwise metric indicates higher diversity among the generated hypotheses.
Corpus diversity (⇑). Distinct-k (Li et al., 2016) measures the total number of unique k-grams normalized by the total number of generated k-gram tokens, to avoid favoring long sentences. Entropy-k reflects how evenly the empirical k-gram distribution is spread when word frequency is taken into account.
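Both corpus-diversity metrics can be sketched in a few lines; this illustrative version uses the natural log for Entropy-k, which may differ from the base used in reported numbers:

```python
import math
from collections import Counter

def distinct_and_entropy(sentences, k):
    """Distinct-k: unique k-grams / total k-grams over all generated
    sentences. Entropy-k: Shannon entropy of the empirical k-gram
    distribution; higher values mean k-grams are used more evenly."""
    grams = Counter()
    for s in sentences:
        toks = s.split()
        for i in range(len(toks) - k + 1):
            grams[tuple(toks[i:i + k])] += 1
    total = sum(grams.values())
    distinct = len(grams) / total
    entropy = -sum((c / total) * math.log(c / total) for c in grams.values())
    return distinct, entropy
```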

Experimental results
Comparison with baseline methods. We evaluated our proposed MoKGE and the baseline methods on both quality and diversity. As shown in Table 2, MoE-based methods achieved the best performance among all baseline methods. MoKGE further boosts diversity by at least 1.57% and 1.83% on Self-BLEU-3 and Self-BLEU-4, respectively, compared with the vanilla MoE methods. At the same time, MoKGE achieved on-par performance with the other baseline methods on the quality evaluation. Specifically, on the ComVE dataset, MoKGE achieved the best performance on BLEU-4 and ROUGE-L, and on the α-NLG dataset, the performance gap between MoKGE and the best baseline method was always less than 0.5% on BLEU-4.
Ablation study. We conducted an ablation study to analyze the two major components in the MoKGE.
The experimental results are shown in Table 3. First, we note that when not using MoE (line "w/o MoE"), we used the most basic decoding strategy, beam search, to generate multiple outputs. We observed that the outputs generated by beam search differed only in punctuation and minor morphological variations, and typically only the last few words differed from one another. Besides, integrating the commonsense knowledge graph into the MoE-based generation model brought both quality and diversity improvements on ComVE, but might sacrifice a little quality (less than 0.5% on BLEU-4) on the α-NLG dataset. Overall, our MoKGE benefits from both the KG and MoE modules, and achieves strong performance on both diversity and quality.

Human Evaluation
Automatic diversity evaluation (e.g., Self-BLEU, Distinct-k) cannot reflect the content-level diversity. Therefore, we conducted extensive human evaluations to assess both the quality and diversity of outputs generated from different models. The human evaluation was divided into two parts: independent scoring and pairwise comparisons. All evaluations were conducted on Amazon Mechanical Turk (AMT), and each evaluation form was answered by at least three AMT workers.
Independent scoring. In this part, human annotators were asked to evaluate the generated outputs from a single model. We first presented the top-3 generated outputs from a given model to the annotators, who evaluated diversity by answering "How many different meanings do the three outputs express?" Then we presented human-written outputs to the annotators, who evaluated quality by comparing the machine-generated outputs with the human-written ones and answering "How many machine-generated outputs are correct?" The diversity and quality scores are normalized to the range 0 to 3. In addition, the annotators gave a fluency and grammar score from 1 to 4 for each generated output.
Pairwise comparisons. In this part, the annotators were given two sets of top-3 generated explanations from two different methods each time and instructed to pick the more diverse set. The choices were "win," "lose," or "tie." As shown in Tables 4-5, our MoKGE significantly outperforms the state-of-the-art sampling-based methods in the diversity evaluation (p-value < 0.05 under a paired t-test), and is even slightly better than human performance on the ComVE task. At the same time, MoKGE obtains on-par performance with the other methods in the quality evaluation: the p-value is not smaller than 0.05 (i.e., no significant difference) under a paired t-test between MoKGE and the baseline methods.

Case Study
[Figure 3: Case studies comparing MoKGE with beam search, nucleus sampling, the vanilla MoE (Shen et al., 2019), and human references on the α-NLG and ComVE tasks. MoKGE can produce diverse knowledge reasoning on the commonsense KG, select different relevant concepts, and then generate diverse outputs.]

As the examples in Figure 3 show, the output diversity of MoKGE is significantly better than that of beam search and nucleus sampling, and close to human performance. The baselines tend to produce outputs with overlapping meanings, e.g., "go to the zoo and see elephants" and "took him to the zoo and see elephants" in the α-NLG case. On the contrary, MoKGE generates semantically richer and more diverse content than the other two methods by incorporating more commonsense concepts from the knowledge graph.

Future Directions
Improving content diversity in NLG. Most existing diversity-promoting work has focused on improving syntactic and lexical diversity, such as different language styles in machine translation (Shen et al., 2019) and word variability in paraphrase generation (Gupta et al., 2018). Nevertheless, methods for improving content diversity in NLG systems have rarely been studied in the existing literature. We believe that generating diverse content is one of the most promising aspects of machine intelligence, with applications to a wide range of real-world tasks, not limited to commonsense reasoning.
Besides, leveraging a knowledge graph is not the only way to promote content diversity, as it is a highly knowledge-intensive approach. Many existing knowledge-enhanced methods (Yu et al., 2022c) can be used to acquire different external knowledge for producing diverse outputs, e.g., taking different retrieved documents as conditions for the generator.
Designing neural diversity metrics. In spite of growing interest in NLG models that produce diverse outputs, there is currently no principled neural method for evaluating the diversity of an NLG system. As described in Tevet and Berant (2021), existing automatic diversity metrics (e.g., Self-BLEU) perform worse than humans at estimating content diversity, indicating a low correlation between the metrics and human judgments.
Therefore, neural diversity metrics are in high demand. Intuitively, such metrics should computationally compare multiple references and hypotheses by projecting them into the same semantic space, unlike metrics for evaluating generation quality, e.g., BERTScore (Zhang et al., 2020b) and BLEURT (Sellam et al., 2020), which only measure the correlation between a single reference-hypothesis pair.

Conclusions
In this paper, we proposed a novel method that diversified the generative reasoning by a mixture of expert strategy on commonsense knowledge graph. To the best of our knowledge, this is the first work to boost diversity in NLG by diversifying knowledge reasoning on commonsense knowledge graph. Experiments on two generative commonsense reasoning benchmarks demonstrated that MoKGE outperformed state-of-the-art methods on diversity, while achieving on par performance on quality.