Learning Contextualized Knowledge Structures for Commonsense Reasoning

Recently, knowledge graph (KG) augmented models have achieved noteworthy success on various commonsense reasoning tasks. However, KG edge (fact) sparsity and noisy edge extraction/generation often hinder models from obtaining useful knowledge to reason over. To address these issues, we propose a new KG-augmented model: Hybrid Graph Network (HGN). Unlike prior methods, HGN learns to jointly contextualize extracted and generated knowledge by reasoning over both within a unified graph structure. Given the task input context and an extracted KG subgraph, HGN is trained to generate embeddings for the subgraph's missing edges to form a "hybrid" graph, then reason over the hybrid graph while filtering out context-irrelevant edges. We demonstrate HGN's effectiveness through considerable performance gains across four commonsense reasoning benchmarks, plus a user study on edge validness and helpfulness.


Introduction
Commonsense reasoning (CSR) is essential for natural language understanding (NLU) systems to function effectively in the real world (Apperly, 2010). For example, to answer the question in Figure 1, one must already know that printing requires using paper. Yet, since commonsense knowledge is self-evident to humans, it is rarely stated in natural language (Gunning, 2018). This makes it hard for neural pre-trained language models (PLMs) (Devlin et al., 2019) to learn commonsense knowledge from corpora alone (Marcus, 2018).
A common remedy is to augment PLMs with external commonsense KGs, allowing such KG-augmented models to make predictions via multi-hop reasoning over the KG (Lin et al., 2019).
Despite the growing success of KG-augmented models, obtaining helpful KG facts for a given task instance remains challenging. Existing models assume that using either KG-extracted edges (Ma et al., 2019; Feng et al., 2020; Yasunaga et al., 2021), PLM-generated edges (to address KG edge sparsity), or a late fusion of both is sufficient. Yet both extraction and generation can produce unhelpful edges, so the model must decide which edges to focus on during reasoning. Since extracted and generated edges are derived from the same set of concepts (nodes), modeling the interactions between the two types of edges jointly within a shared KG structure could provide a stronger signal for identifying contextually relevant edges. However, current models do not leverage this information.
In response, we propose a new KG-augmented model: Hybrid Graph Network (HGN). Unlike prior models, HGN learns to jointly contextualize extracted and generated knowledge by reasoning over both within a unified graph structure. Given the task input (i.e., context) and an extracted KG subgraph, HGN is trained to generate embeddings for the subgraph's missing edges to form a "hybrid" graph, then reason over the graph (to update model parameters) while filtering out context-irrelevant edges. HGN achieves this primarily through edge reweighting, which downweights irrelevant edges, and edge-weighted message passing, which attenuates irrelevant edges' impact on reasoning.
Our extensive experiments demonstrate that HGN improves performance over all baselines across four CSR benchmarks. In particular, among comparable methods, HGN ranks first on the CommonsenseQA (Talmor et al., 2019) and OpenBookQA (Mihaylov et al., 2018) leaderboards. Plus, our user studies show that humans find HGN-filtered edges to be more valid and helpful than the heuristically extracted edges used in prior work.

Problem Statement
We consider CSR tasks, like question answering (QA), which can benefit from commonsense KGs. To solve CSR tasks, we focus on KG-augmented models, where a PLM is augmented with a commonsense KG. Given a CSR task, let x be the task's text input, f be the model, and f (x) be the model output. We denote a KG as G = (V, R, E). V, R, and E are the sets of nodes (concepts), relations, and edges (facts), respectively, in the KG. An edge is a directed triple of the form e = (h, r, t) ∈ E, where h ∈ V is the head node, t ∈ V is the tail node, and r ∈ R is the relation between h and t. Let [·, ·] denote concatenation of text or vectors.
As illustrated in Figure 2, a KG-augmented model f has three main components: text encoder f_text, graph encoder f_graph, and scoring function f_score. First, s = f_text(x; θ_text) is the encoding of x, where f_text is usually a Transformer PLM. Second, as supporting evidence, an x-specific graph G' = (V', R', E') is constructed from G (Figure 1). Typically, this is done via heuristic extraction by selecting V' ⊆ V as the concepts mentioned in x, R' ⊆ R as the relations between concepts in V', and E' ⊆ E as the edges involving V' and R'. If G does not provide enough knowledge to build a good G', new edges are sometimes added to G' using a PLM-based generator. We call G' the contextualized KG. Then, g = f_graph(G', s; θ_graph) is the joint encoding of G' and s. Third, the model output is computed as f(x) = f_score([s, g]; θ_score), where f_score is usually a multilayer perceptron (MLP). Existing KG-augmented models mainly differ in their design of f_graph, reasoning over the KG through message passing (Schlichtkrull et al., 2018a; Feng et al., 2020; Yasunaga et al., 2021) or edge/path aggregation (Ma et al., 2019).

[Figure 2: High-level schematic of a typical KG-augmented model for CSR. The text encoder f_text tends to be a Transformer PLM, and the scoring function f_score is usually an MLP; KG-augmented models generally vary more in their graph encoder f_graph and in graph construction.]
While KG-augmented models can be applied to any CSR task involving KGs (e.g., natural language inference), we consider multi-choice QA in this work. Given a question q and set of candidate answers {a i }, the QA model's goal is to predict a plausibility score ρ(q, a) for each a ∈ {a i }, so that the highest score is predicted for the correct answer. To use KG-augmented models for commonsense QA, we set x = [q, a] and ρ(q, a) = f (x).
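The pipeline above can be sketched end to end on toy data. Everything below is a stand-in: the encoders are deterministic random projections, not the paper's actual f_text, f_graph, or f_score, and the question, answers, and edges are hypothetical; the point is only the data flow x = [q, a] → s → g → ρ(q, a) = f_score([s, g]).

```python
import zlib
import numpy as np

D = 8  # toy encoding dimension

def f_text(x: str) -> np.ndarray:
    """Stand-in text encoder: CRC-seeded random vector for the input text."""
    rng = np.random.default_rng(zlib.crc32(x.encode()))
    return rng.standard_normal(D)

def f_graph(edges, s: np.ndarray) -> np.ndarray:
    """Stand-in graph encoder: mean of per-triple features, conditioned on s."""
    feats = [f_text(f"{h} {r} {t}") for (h, r, t) in edges]
    return np.mean(feats, axis=0) + 0.1 * s

def f_score(sg: np.ndarray) -> float:
    """Stand-in scoring head: a fixed linear layer over [s, g]."""
    w = np.linspace(-1.0, 1.0, 2 * D)
    return float(w @ sg)

def plausibility(q: str, a: str, edges) -> float:
    x = q + " " + a                          # x = [q, a]
    s = f_text(x)                            # text encoding
    g = f_graph(edges, s)                    # encoding of the contextualized KG
    return f_score(np.concatenate([s, g]))   # rho(q, a) = f_score([s, g])

q = "What do you need to print a document?"
answers = ["paper", "water", "music"]
edges = [("printing", "requires", "paper"), ("paper", "used_for", "printing")]
scores = {a: plausibility(q, a, edges) for a in answers}
prediction = max(scores, key=scores.get)  # highest-scoring candidate wins
```

With a trained model, the plausibility scores would be softmax-normalized over the candidate set and the argmax taken as the answer, exactly as the final line does here.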

Overview
As illustrated in §2 and Figure 2, given a question-answer pair (q, a) for an instance of the multi-choice QA task, the KG-augmented QA model first obtains a (q, a)-contextualized KG G' from the full KG G. Edges in G' can be extracted directly from G or generated using a PLM-based generator. Then, the model transforms (q, a) and G' into text encoding s and graph encoding g, respectively. Finally, s and g are used to predict (q, a)'s plausibility.
However, a contextualized KG may have low knowledge recall or precision, hindering the QA model's access to relevant knowledge. Low recall can stem from missing edges in G; low precision can result from bad annotations in G; and both can be caused by noisy edge extraction or generation when building G'. HGN addresses these issues by reasoning over both extracted and generated edges within a unified graph structure. To improve recall, HGN generates new edges via a PLM-based generator, then initializes a hybrid contextualized KG containing both extracted and generated edges. Note that edge generation is generally (q, a)-agnostic and may produce irrelevant edges that hurt knowledge precision. To improve precision, HGN learns to reweight edges in the hybrid graph and reason over the hybrid graph via edge-weighted message passing. This is akin to learning the hybrid graph's structure, and it reduces the impact of irrelevant edges on reasoning. Additionally, to further encourage downweighting of noisy edges during reasoning, HGN is trained with entropy regularization on the learned edge weights.

[Figure 3: Overview of HGN. After building a hybrid graph of extracted and generated edges (§3.2), HGN reasons over the hybrid graph by updating the node embeddings V, hybrid edge embeddings E, and adjacency matrix A at each layer (§3.3). Darker edges indicate higher weights; red variables are those updated in the previous step.]
The overall learning objective of HGN is defined as L = L_task + β L_edge, where L_task is the loss for the downstream task (in our work, QA), L_edge is the entropy regularization term for edge weights, and β ≥ 0 is a loss-weight hyperparameter. In the following subsections, we first explain how the contextualized KG G' is constructed as a hybrid graph, including its node embeddings V, hybrid edge embeddings E, and initial adjacency matrix A^0 (§3.2). Next, we show how HGN uses edge-weighted message passing to update V, E, and A^0 over L layers (Figure 3), yielding a refined adjacency matrix A^L of learned edge weights (§3.3). Finally, we describe how L_task is computed using s and g, while L_edge is calculated using A^L (§3.4).

Hybrid Graph Construction
Node Embeddings. The first step of retrieving knowledge from G is concept grounding, which involves identifying text spans in (q, a) that match nodes in V. We define V' = V_q ∪ V_a as the set of all concepts mentioned in (q, a), where V_q and V_a are the question and answer concepts, respectively. Each node v_i ∈ V' is represented by an embedding v_i ∈ V, which can be initialized using BERT (Devlin et al., 2019) or TransE (Bordes et al., 2013).
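Concept grounding can be illustrated with a minimal n-gram matcher. This is a simplification: real grounding pipelines additionally lemmatize and filter stop words, and the vocabulary below is a made-up toy, not the actual KG node set.

```python
# Ground concepts by matching token n-grams in the text against the KG's
# node vocabulary; longer spans are tried first so multiword concepts win.
def ground_concepts(text: str, vocab: set, max_ngram: int = 3) -> set:
    tokens = text.lower().split()
    found = set()
    for n in range(max_ngram, 0, -1):
        for i in range(len(tokens) - n + 1):
            span = "_".join(tokens[i:i + n])  # KG nodes use underscore joins
            if span in vocab:
                found.add(span)
    return found

vocab = {"printer", "paper", "print", "ink", "printing_press"}  # toy vocabulary
q_concepts = ground_concepts("what do you need to print on paper", vocab)
a_concepts = ground_concepts("a printer and ink", vocab)
nodes = q_concepts | a_concepts  # V' = V_q ∪ V_a
```

Each grounded node would then receive an initial embedding from BERT or TransE, as described above.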
Hybrid Edge Embeddings. In G', we loosen the definition of an edge to be an ordered node pair e_(i,j) = (v_i, v_j), without requiring a labeled relation. We build fully-connected edges between question and answer nodes in G'; the set of edges is thus E' = {(v_i, v_j) | v_i ∈ V_q, v_j ∈ V_a}. After concept grounding, we need an edge embedding e_(i,j) ∈ E for each edge e_(i,j). Let R be the relation embeddings for all relations in R, obtained using TransE. Each extracted edge is initialized with the embedding of its labeled relation. However, due to edge sparsity, many edges do not have labeled relations and cannot be initialized this way. Meanwhile, despite PLMs' limitations in commonsense, they have shown some ability to encode commonsense knowledge (Davison et al., 2019; Petroni et al., 2019) and aid KG completion. Hence, we generate edge embeddings for all unlabeled edges by feeding each unlabeled edge into a GPT-2 (Radford et al., 2019) based generator f_gen(·, ·). This is further explained in the "Edge Embedding Generation" paragraph.
In summary, edge embeddings are computed in a hybrid way: e_(i,j) is set to the TransE embedding of the labeled relation if (v_i, v_j) is an extracted edge, and to f_gen(v_i, v_j) otherwise.

Edge Embedding Generation. Inspired by recent work in PLM-based commonsense KG completion, we frame edge generation as text generation. First, for each extracted edge (h, r, t) ∈ E, we tokenize its node pair (h, t) and relation label r. Let h̃, r̃, and t̃ be the respective token sequences of h, r, and t. Also, let $ be the special separator token. Next, for each tokenized extracted edge, we train a GPT-2 model (Radford et al., 2019) to autoregressively generate the concatenated sequence [h̃, $, t̃, $, h̃, r̃, t̃].
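The sequence format can be made concrete with a small serialization sketch. This reconstruction only covers the string layout; the actual system operates on GPT-2 subword tokens, and the example triple is hypothetical.

```python
SEP = "$"  # special separator token

def training_sequence(h: str, r: str, t: str) -> str:
    # GPT-2 is trained to autoregressively reproduce this full sequence:
    # [h, $, t, $, h, r, t]
    return " ".join([h, SEP, t, SEP, h, r, t])

def inference_prompt(h: str, t: str) -> str:
    # at inference, an unlabeled pair is fed as the prefix up to the point
    # where the relation tokens would begin, so the model generates them
    return " ".join([h, SEP, t, SEP, h])

seq = training_sequence("printer", "requires", "paper")
prompt = inference_prompt("printer", "paper")
assert seq.startswith(prompt)  # the prompt is a prefix of the training format
```

Because the inference prompt is a prefix of the training sequence, the generator's continuation naturally corresponds to the missing relation tokens.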
During inference, we only have unlabeled edges (v_i, v_j); we feed each one to the trained generator in the same format and obtain the edge embedding f_gen(v_i, v_j) by pooling the model's hidden states over the generated relation tokens. Alternatively, we consider another previously proposed edge generation approach, in which f_gen(·, ·) is trained to generate a relational path connecting v_i to v_j, then pool the path into an edge embedding. The rationale for this approach is that such paths have been shown to contain useful semantic information about the relation between v_i and v_j (Neelakantan et al., 2015; Das et al., 2017).

Hybrid Graph Reasoning
The procedure described in §3.2 yields a hybrid graph containing unweighted edges between all question-answer node pairs. Constructing this hybrid graph may improve edge recall, but it does not address precision: some edges in the initial hybrid graph may be irrelevant to the question-answer pair, due to noisy edge extraction or generation. HGN is thus designed to downweight irrelevant edges by converting the unweighted graph into a weighted one, then learning to reweight all hybrid edges during reasoning (Figure 3).
Learnable Adjacency Matrix. Although A^0 is initialized as a binary adjacency matrix, HGN populates it with learned edge attention weights and iteratively updates them over L layers of reasoning. We denote the adjacency matrix at layer ℓ as A^ℓ, where 0 ≤ A^ℓ_(i,j) ≤ 1. Updating A^ℓ can be viewed as softly contextualizing the hybrid graph's structure with respect to (q, a).
Edge-Weighted Message Passing. Following the general Graph Network (GN) formulation proposed by Battaglia et al. (2018), HGN's graph reasoning module consists of layer-wise node-to-edge (v → e) and edge-to-node (e → v) message passing functions. However, we equip HGN with a modified version of GN's edge-to-node message passing function, in which each edge's weight is used to rescale information flow on that edge. Intuitively, an edge's weight signifies the edge's relevance for reasoning about the given task instance. We also use text encoding s as global context throughout message passing. Formally, HGN's update rule at layer ℓ is:

h^ℓ_(i,j) = φ_e([v^{ℓ-1}_i, v^{ℓ-1}_j, s]),
w^ℓ_(i,j) = φ_w(h^ℓ_(i,j)),
A^ℓ_(i,j) = exp(w^ℓ_(i,j)) / Σ_{(i',j') ∈ E'} exp(w^ℓ_(i',j')),
v^ℓ_j = φ_v([Σ_{i : (i,j) ∈ E'} A^ℓ_(i,j) h^ℓ_(i,j), v^{ℓ-1}_j]). (1)

In node-to-edge message passing, the embedding of each edge (v_i, v_j) ∈ E' is updated as h^ℓ_(i,j), a function of the edge's constituent nodes and the given context s. Through s, the hybrid graph is strongly contextualized with respect to (q, a). Then, h^ℓ_(i,j) is used to compute the edge score w^ℓ_(i,j), which measures the edge's relevance to s. Each edge score is globally normalized across all edges in the graph to produce the edge attention weight A^ℓ_(i,j), so that low-scoring edges are softly pruned by receiving close-to-zero weight.
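One layer of this update can be sketched numerically. The update functions here are random linear maps with tanh nonlinearities standing in for the learned MLPs, and the tiny graph is hypothetical; the sketch only demonstrates the node-to-edge update, the global softmax over all edges, and the edge-weighted edge-to-node update.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
n_nodes, edges = 3, [(0, 1), (0, 2), (1, 2)]
V = rng.standard_normal((n_nodes, D))   # node embeddings
s = rng.standard_normal(D)              # text (context) encoding

W_e = rng.standard_normal((3 * D, D))   # stand-in node-to-edge map phi_e
w_a = rng.standard_normal(D)            # stand-in edge-scoring vector phi_w
W_v = rng.standard_normal((2 * D, D))   # stand-in edge-to-node map phi_v

# node-to-edge: h_(i,j) = phi_e([v_i, v_j, s])
H = np.tanh(np.stack([np.concatenate([V[i], V[j], s]) for i, j in edges]) @ W_e)

# global edge attention: softmax of scores over ALL edges in the graph
scores = H @ w_a
A = np.exp(scores - scores.max())
A /= A.sum()

# edge-to-node: each node aggregates neighboring edges, rescaled by A_(i,j)
V_new = V.copy()
for j in range(n_nodes):
    msgs = [A[k] * H[k] for k, (i, jj) in enumerate(edges) if jj == j]
    if msgs:  # nodes with no incoming edges keep their embedding
        agg = np.sum(msgs, axis=0)
        V_new[j] = np.tanh(np.concatenate([V[j], agg]) @ W_v)
```

Note that the softmax runs over the full edge list rather than over each node's neighborhood, which is exactly the global-versus-local attention distinction discussed next.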
We use global edge attention (i.e., normalizing across E ) instead of local edge attention (i.e., normalizing across N j ) because local edge attention assumes at least one edge in N j is relevant, which may not be true. For example, given an irrelevant or incorrectly grounded concept, none of its edges will be helpful, and so all nodes in its neighborhood should be excluded from influencing the reasoning process. To demonstrate the advantage of global edge attention, we empirically compare our default HGN architecture to an HGN variant based on Graph Attention Network (GAT) (Velickovic et al., 2018), which uses local edge attention, in our experiments.
In edge-to-node message passing, the embedding of each node v_j ∈ V' is updated as a function of v_j's neighboring edges. For each neighboring edge, the attention weight A^ℓ_(i,j) is used to rescale that edge's influence on v_j's embedding update.

Learning Objective
Task Loss. After L layers of message passing, we obtain the final node and edge embeddings. Node embeddings are aggregated into v_agg via attentive pooling with s as the query vector. Edge embeddings are aggregated into e_agg via edge-weighted sum pooling. The final graph encoding is then given as g = [v_agg, e_agg]. The probability of a being the answer to q is calculated as ρ̂(q, a) ∝ exp(ρ(q, a)), where ρ(q, a) = f_score([s, g]; θ_score). We use cross-entropy loss for the QA classification task, so the loss for each (q, a) with label y is:

L_task(ρ̂(q, a; θ), y) = −y log ρ̂(q, a; θ). (2)

Entropy Regularization. To encourage the model to be decisive during edge reweighting, we use a regularization term to penalize nondiscriminative edge weights. In an extreme case, a blind model would assign the same weight to all edges, degenerating G' into an unweighted graph. This is a failure mode, since G' is likely to contain mostly irrelevant edges, and we want the model to focus on the helpful ones. Therefore, via L_edge, we train the model to minimize the entropy of the edge weight distribution (i.e., make the distribution more skewed), in order to maximize the informativeness of the predicted edge weights. Lower entropy means the model has higher certainty about edges' relevance to the given task instance, such that the model will discriminatively judge some edges as being much more relevant than others. L_edge is computed as the entropy of the learned edge weight distribution:

L_edge(A^L) = −Σ_{(i,j) ∈ E'} A^L_(i,j) log A^L_(i,j). (3)

Joint Learning. We jointly optimize L_task and L_edge, so graph reasoning and graph structure can be jointly learned. The full learning objective is:

L(θ) = E_{(q,a,y) ∼ X_train} [ L_task(ρ̂(q, a; θ), y) + β · L_edge(A^L(q, a; θ)) ],

where θ = {θ_text, θ_graph, θ_score} is the set of all learnable parameters, and X_train is the training set. We train our model end-to-end by minimizing L(θ) with the RAdam optimizer.
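The joint objective can be checked on toy values. The edge-weight vectors below are made up for illustration; the sketch shows why a decisive (skewed) weight distribution incurs a smaller entropy penalty than an indecisive (uniform) one.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Entropy of an edge-weight distribution (the L_edge term)."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum())

def total_loss(rho_hat_correct: float, A_L, beta: float = 0.01) -> float:
    """L = L_task + beta * L_edge on a single (q, a, y) instance."""
    l_task = -np.log(rho_hat_correct)   # cross-entropy for the gold answer
    l_edge = entropy(A_L)               # entropy of learned edge weights
    return float(l_task + beta * l_edge)

uniform = np.full(4, 0.25)                    # indecisive: all edges equal
skewed = np.array([0.85, 0.05, 0.05, 0.05])   # decisive: one edge dominates
# the skewed distribution is penalized less by the regularizer
assert entropy(skewed) < entropy(uniform)
loss = total_loss(0.9, skewed, beta=0.01)
```

A uniform distribution over n edges has the maximum entropy log n, so minimizing L_edge pushes the model away from the degenerate unweighted-graph case described above.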

Experimental Setup
We evaluate our proposed model on four multiple-choice commonsense QA datasets: CommonsenseQA (Talmor et al., 2019), CODAH, OpenBookQA (Mihaylov et al., 2018), and QASC. For OpenBookQA and QASC, the strongest leaderboard systems augment the question with retrieved evidence. Therefore, we build our graph reasoning model on top of retrieval-augmented methods on the leaderboard: "AristoRoBERTa" for OpenBookQA and "RoBERTa (2-step IR)" for QASC. In this way, we can study whether strong retrieval-augmented methods can still benefit from KG knowledge and our HGN framework.

Compared Methods
We compare our model with a series of KG-augmented methods built on different graph encoders.

Models Using Extracted Facts. We consider seven models that use only extracted facts: RN (Santoro et al., 2017), GconAttn (Wang et al., 2019a), KagNet (Lin et al., 2019), RGCN (Schlichtkrull et al., 2018b), GAT (Velickovic et al., 2018), GN (Battaglia et al., 2018), and MHGRN (Feng et al., 2020). RN builds the graph with the same node set as our method but with extracted edges only, and computes the graph vector by pooling over the encodings of individual triples. GconAttn softly aligns the nodes in the question and answer and pools over all matching nodes to get g. KagNet uses an LSTM to encode relational paths between question and answer concepts and pools over the path embeddings for the graph encoding.

Models Using Extracted and Generated Facts.
We consider two models that use both extracted and generated facts. RN + Link Prediction differs from RN by additionally considering the generated relation (predicted using TransE (Bordes et al., 2013)) between question and answer concepts. PathGenerator learns a path generator from paths collected through random walks on the KG; the learned generator is used to generate paths connecting question and answer concepts, and g is calculated as the concatenation of a pooled vector over the generated paths and a pooled vector over the extracted paths.
Our Model's Variants. As described in §3.2, the edge embedding can be computed either as a relation embedding or as a path embedding. We name these two variants HGN (w/ RelGen edges) and HGN (w/ PathGen edges), respectively.

Results
Performance Comparisons. Tables 1, 3, and 4 show performance comparisons between our models and baseline models on CommonsenseQA, CODAH, OpenBookQA, and QASC. We clearly find that models with stronger text encoders perform better (i.e., RoBERTa > BERT-Large > BERT-Base). For all text encoders, HGN shows consistent improvement over baseline models on all datasets. The improvement over all baselines is statistically significant under most settings, demonstrating the effectiveness of HGN both with and without retrieved evidence. We also submit our best model to the leaderboards for CommonsenseQA and OpenBookQA. For CommonsenseQA (Table 2), HGN ranks first among comparable approaches and shows remarkable improvement over PathGenerator and the LM finetuning approach (ALBERT (Lan et al., 2020)). Higher-ranking models either use stronger text encoders or leverage additional data resources. Specifically, UnifiedQA (Khashabi et al., 2020) and T5-3B (Raffel et al., 2020) are based on T5; with 11B and 3B parameters, respectively, they are impractical to finetune in an academic setting. ALBERT+DESC-KCR and ALBERT+KD additionally use concept definitions from dictionaries. ALBERT+DESC-KCR and ALBERT+KCR leverage "question concept" annotations, which were used during the construction of the CommonsenseQA dataset and allow the model to learn shortcuts that do not generalize to other datasets. ALBERT+KRD retrieves sentences from the OMCS corpus (Liu and Singh, 2004) as input. These methods are therefore not comparable with our model. For OpenBookQA (Table 5), our model ranks first among all models using AristoRoBERTa as the text encoder.

User Study on Learned Structures
To assess HGN's ability to refine graph structure, we compare the graph structure before and after being processed by HGN. Specifically, we sample 30 questions (with their answers) from CommonsenseQA's development set and ask 5 human annotators to evaluate the graph output by GN (with adjacency matrix A_extract and extracted facts only) and by HGN (with adjacency matrix A^L). We binarize A^L by removing edges with weight lower than 0.01. Given a graph, for each edge (fact), annotators are asked to rate its validness and helpfulness. The validness score is a binary value rated in a context-agnostic way: 0 (the fact does not make sense), 1 (the fact is generally true). The helpfulness score measures whether the fact is helpful for solving the question and is rated on a 0-2 scale: 0 (the fact is unrelated to the question and answer), 1 (the fact is related but does not directly lead to the answer), 2 (the fact directly leads to the answer). Note that the percentage of valid edges can be understood as the precision of graph edges, while, for a given instance, the number of valid edges is proportional to the recall of edges. We also include another metric, the "prune rate", calculated as 1 − (# edges in binarized A^L) / (# edges in A^0), which measures the portion of edges assigned very low weights (softly pruned) during training; it is only applicable to HGN.
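The prune-rate metric is a one-liner; the weight vector below is hypothetical, chosen only to show the 0.01 binarization threshold in action.

```python
def prune_rate(edge_weights, threshold: float = 0.01) -> float:
    """Fraction of hybrid-graph edges whose learned weight falls below the
    binarization threshold, i.e., 1 - |binarized A^L| / |A^0|."""
    kept = sum(1 for w in edge_weights if w >= threshold)
    return 1 - kept / len(edge_weights)

weights = [0.40, 0.30, 0.005, 0.002, 0.15, 0.001]  # hypothetical learned weights
rate = prune_rate(weights)  # 3 of 6 edges fall below 0.01 -> 0.5
```

A high prune rate thus indicates that the model judged many of the initial fully-connected hybrid edges to be irrelevant to the instance.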
The mean ratings for the 30 pairs of (GN, HGN) graphs by 5 annotators are reported in Table 6. Fleiss' Kappa (Fleiss, 1971) is 0.51 (moderate agreement) for validness and 0.36 (fair agreement) for helpfulness. The graph refined by HGN has both more edges and a higher proportion of valid edges than the extracted one, and it also achieves a higher average helpfulness score. These results indicate that HGN learns a superior graph structure with more helpful edges and fewer noisy edges, improving over previous works that rely on extracted, static graphs. Detailed cases can be found in Appendix §C.

Related Work
Commonsense QA. Commonsense QA is challenging because the required commonsense knowledge is seldom given in the question-answer context or encoded in the PLM's parameters. Thus, many works obtain this knowledge from external sources (e.g., KGs, corpora). While Lv et al. (2020) show that KGs and corpora can provide complementary knowledge, our paper focuses on improving the use of KG knowledge. KG knowledge can be acquired in different ways: from KG-extracted edges (Ma et al., 2019; Feng et al., 2020; Yasunaga et al., 2021), PLM-generated edges, or both. KG-augmented models mainly differ in how they encode KG knowledge, using message passing (Schlichtkrull et al., 2018a; Feng et al., 2020) or edge/path aggregation (Ma et al., 2019). The most relevant prior work also combines extracted and generated knowledge, but does so coarsely via late fusion, while HGN encodes both types of knowledge within a unified graph. Moreover, that work uses RN to pool over a set of paths for graph encoding, while HGN reasons over the graph via message passing and edge reweighting.
Graph Structure Learning. Instead of assuming a fixed graph structure, a number of graph models learn the graph structure with respect to the downstream task. Some models learn to discretely select edges for the graph (i.e., hard pruning). Franceschi et al. (2019) sample the graph structure from a predicted probability distribution with differentiable approximations. Norcliffe-Brown et al. (2018) calculate the relatedness between every pair of nodes and keep only the top-k strongest connections per node to construct the edge set. Sun et al. (2019) start with a small graph and iteratively expand it with retrieval operations. Others learn to reweight edges in a fully connected graph (i.e., soft pruning). Yu et al. (2019) propose heuristics for regularizing edge weights. Hu et al. (2019) use the question embedding to help predict edge weights. Unlike other edge-reweighting models, HGN operates over a hybrid graph of both extracted and generated edges, while updating edge weights with respect to node, edge, and text features.

Conclusion
In this paper, we propose HGN, a KG-augmented model for CSR. To address KG edge sparsity and noisy edge extraction/generation, HGN learns to jointly contextualize extracted and generated knowledge by reasoning over both within a unified graph structure. We justify HGN's design through performance gains on various CSR benchmarks and through user studies of its learned structures. In future work, we plan to increase the graph's relation expressiveness by incorporating open relations, and to make the edge extraction/generation process more dependent on the reasoning context.