System 1 + System 2 = Better World: Neural-Symbolic Chain of Logic Reasoning



Introduction
Current NLP neural network models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019) and many more recent language models have revolutionized how representations and semantic information can be extracted from language, which has led to large improvements on various tasks. One challenge, however, is how to perform logical reasoning, which relies on the informative representations learned from data but also requires reasoning abilities on top of them.
Our paper builds an architecture that explicitly conducts neural logical reasoning, inspired by cognitive science theories of the human mind; Figure 1 gives an overview of the two-system architecture. According to the Dual Process Theory (Sloman, 1996; Gilovich et al., 2002), human cognition involves two systems: an intuitive, unconscious and fast process called System 1, and a logical, conscious and slow process called System 2. System 1 relies mainly on perception, which is employed to get an intuitive judgment of the current situation and reach an experience-based conclusion. Meanwhile, System 2 is used for more complex logical reasoning processes such as solving numerical problems, logical deduction problems, and even logically coherent everyday communication. System 1 is deployed unconsciously; it generates representations and helps make intuitive judgments. When complex reasoning is needed, System 2 works together with System 1 and reasons over the representations from System 1 for decision making. Most existing models mainly focus on the System 1 stage, leveraging gigantic deep neural networks to learn high-quality representations such as language models (Devlin et al., 2018; Brown et al., 2020), vision representations (LeCun et al., 1995; He et al., 2016; Radford et al., 2021) and graph embeddings (Bordes et al., 2013; Grover and Leskovec, 2016; Kipf and Welling, 2017). Based on the Dual Process Theory, equipping a reasoning layer that works as System 2 on top of the representations from System 1 should work better at complex reasoning tasks.
In this paper, we use the Common Sense Knowledge Graph (CSKG) link prediction task to illustrate that models integrating representation learning ability and logical reasoning ability perform better than models that only perform representation learning. Technically, our model incorporates logical reasoning with representation learning to enhance the link prediction task, based on the observation that a correct proposition (triple) usually has a relevant chain of propositions (triples) that helps validate it. For example, suppose there is a candidate proposition (obesity, CausesDesire, exercise); from the two known propositions (lose weight, HasPrerequisite, exercise) and (obesity, CausesDesire, lose weight), we can infer that (obesity, CausesDesire, exercise) is likely to be valid. If the model can attend to such chains of propositions in the knowledge graph when determining validity, it can improve the graph completion task. A similar idea is introduced by Wei et al. (2022), where the model is trained to generate a chain of thoughts that leads to the final answer in a numerical question answering task. Training the model to generate an explicit reasoning process facilitates the acquisition of reasoning ability in language models.
Our work does not abandon the existing successful representation learning models; instead, we leverage their representations and develop a general architecture that infuses logical reasoning ability on top of the representations for improved performance. The reasoning layer is flexible enough to be plugged into any representation learning model as long as the outputs of that model are embeddings. Our main contributions are: (1) conceptually, we demonstrate the advantage of applying the Dual Process Theory to facilitate logical reasoning ability, and (2) technically, we develop a neural-symbolic two-system architecture for chain of logic reasoning based on both rule-driven logical regularizers and data-driven value regularizers.

Neural Symbolic Reasoning
A neural-symbolic system leverages the representation learning ability of connectionism (the neural system) and the reasoning ability of symbolism (the symbolic system) to effectively integrate learning and reasoning. Various designs of neural-symbolic reasoning models have been presented by researchers (Garcez et al., 2022; Zhang et al., 2021; Moghimifar et al., 2021; Yang et al., 2017). This work adopts the neural logic reasoning (NLR) paradigm, in which logical operators such as AND, OR, NOT are learned as neural modules based on self-supervised logic regularization, while inputs to the operators are representation vectors (Shi et al., 2020). The advantage of the NLR paradigm is that it can be easily infused into any established representation learning model. NLR helps various tasks such as solving logical equations (Shi et al., 2020), recommender systems (Chen et al., 2021), graph neural networks (Chen et al., 2022a), compositional reasoning (Chen et al., 2022b), analogy learning (Fan and Zhang, 2022), and visual reasoning (Li et al., 2022). Instead of the neighbor-based neural logic reasoning in previous works, we propose a chain of logic reasoning model that provides a clearer reasoning path. This section introduces the definition of chain used in the model and its coverage in the dataset.
Definition 3.1 (Chain). Let q = (e_hq, rel_q, e_tq) be a query proposition triple of CSKG G = (E, R), where E is the set of entities and R is the set of relations. Notice that q could be an existing triple in G (for training) or a non-existing one that needs to be predicted at inference. An ordered list of propositions p_1, p_2, …, p_n, where p_i = (e_hi, rel_i, e_ti), is a chain for q if and only if: (1) for each p_i, e_hi, e_ti ∈ E and rel_i ∈ R; (2) p_1 = (e_hq, rel_1, e_t1) and p_n = (e_hn, rel_n, e_tq) for some e_t1, e_hn ∈ E; and (3) for each p_i with i > 1, e_hi = e_t(i−1).
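Definition 3.1 can be sketched directly as code. The following is a minimal illustrative check (the function name and tuple encoding are our own, not from the paper) that tests conditions (2) and (3) on a list of (head, relation, tail) triples; condition (1), membership in E and R, is assumed to hold for any triple drawn from the graph.

```python
# Sketch of Definition 3.1: an ordered list of triples is a chain for a query
# iff it starts at the query head, ends at the query tail, and consecutive
# triples share entities (each head equals the previous tail).

def is_chain(query, props):
    """Check whether `props` is a chain for `query` per Definition 3.1."""
    if not props:
        return False
    h_q, _, t_q = query
    # Condition (2): the chain starts at the query head and ends at the query tail.
    if props[0][0] != h_q or props[-1][2] != t_q:
        return False
    # Condition (3): each triple's head equals the previous triple's tail.
    return all(props[i][0] == props[i - 1][2] for i in range(1, len(props)))
```

On the running example from the paper, the two known propositions form a valid chain for the query (obesity, CausesDesire, exercise), while either proposition alone does not.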
The above is a very strict definition of a propositional reasoning chain. To increase coverage, we apply two extensions to it.

Extension 1: Reversed Relations
To cover propositions such as (ferret, AtLocation, pet store), which is supported by a potential list of triples (ferret, IsA, mammal), (cat, IsA, mammal), (cat, AtLocation, pet store), we extend the definition to allow reversed links. We can rewrite a reversed link (e_t, rel, e_h) as (e_h, rel⁻¹, e_t). Based on this, the definition of chain remains unchanged except that each triple (e_hi, rel_i, e_ti) in the chain may use either an original relation rel_i ∈ R or its reversed counterpart rel_i⁻¹.
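Extension 1 amounts to closing the triple set under reversal before chain search. A minimal sketch, with an illustrative `^-1` suffix as the reversed-relation marker (the paper does not specify a concrete encoding):

```python
# Extension 1 as code: augment the triple set with reversed links so that a
# reversed edge (t, rel, h) can be traversed forward as (h, rel^-1, t).

def add_reversed(triples):
    """Return the triple set closed under reversal, marking inverses with '^-1'."""
    reversed_triples = {(t, rel + "^-1", h) for (h, rel, t) in triples}
    return set(triples) | reversed_triples
```

With the reversed edges present, the ferret example above becomes a valid chain: (ferret, IsA, mammal), (mammal, IsA⁻¹, cat), (cat, AtLocation, pet store).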

Extension 2: Graph Densification
CSKG is notorious for its sparsity; for example, 81% of the entities occur only once in ConceptNet-100k. One of the reasons is that free-formed text can differ even when it refers to the same entity or event. For example, "watch movie" and "watch film" are two separate entities in the dataset although they refer to the same event. Such sparsity makes it difficult to generate chains, which calls for densifying the graph. Our method finds similar pairs of entities and generates new triples using a newly created relation "SIM", so that, e.g., the two new triples (watch movie, SIM, watch film) and (watch film, SIM, watch movie) can be added to the dataset. We extend the knowledge graph with these triples by computing the similarity between entity embeddings, as in Malaviya et al. (2020). To form these edges, we fine-tune a BERT model on sentences transformed from each triple using simple heuristic rules. The format of the input to the model is [CLS] + e_phrase + [SEP], where e_phrase is the free-formed text of an entity node, and the embedding of each node is the embedding of its [CLS] token. We then extract node representations and use them to compute the cosine similarity between all pairs of nodes in the graph. Upon computing the pairwise cosine similarities, we use a threshold τ to filter the pairs of nodes that are most similar. With τ = 0.955, this creates 87,292 additional triples on ConceptNet-100k. This extension is only applied to ConceptNet-100k due to its sparsity.
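The densification step above can be sketched as follows. The toy two-dimensional vectors stand in for the fine-tuned BERT [CLS] embeddings, and the function names are illustrative; only the cosine-threshold logic with the "SIM" relation follows the paper.

```python
# Sketch of graph densification: link entity pairs whose embedding cosine
# similarity exceeds a threshold tau with a new symmetric "SIM" relation.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def densify(embeddings, tau=0.955):
    """Return (e1, SIM, e2) triples for all entity pairs above the threshold."""
    names = list(embeddings)
    new_triples = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= tau:
                # SIM is added in both directions, as in the paper's example.
                new_triples += [(a, "SIM", b), (b, "SIM", a)]
    return new_triples
```

In practice the pairwise comparison over all nodes is quadratic, so a vectorized similarity computation (or approximate nearest-neighbor search) would replace the double loop at ConceptNet scale.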

Coverage of Chains
For each proposition in the training, validation and testing datasets, we exhaustively compute all of its chains of length at most four. Table 2 shows the coverage of chains on the ConceptNet-100k (Speer et al., 2017) and WebChild (Tandon et al., 2014, 2017) datasets, where coverage is the percentage of known propositions in the graph that are accompanied by at least one chain. Each proposition may have multiple chains, and the model selects the shortest chain during training and evaluation. Table 1 lists some example chains. We will see in the experiments that chain-based neural logical reasoning significantly helps link prediction.

Given a query proposition q and its chain {p_1, p_2, …, p_n}, the model determines whether q is valid by determining whether the following logical expression is True or False:

p_1 ∧ p_2 ∧ … ∧ p_n → q

According to the definition of material implication (→), the above expression can be transformed to:

¬p_1 ∨ ¬p_2 ∨ … ∨ ¬p_n ∨ q

where each variable is represented as an embedding in ℝ^d. Since the above expression only involves negation and disjunction, the neural logic component involves only these two logical operators. We use a multi-layer feed-forward neural network with the GELU activation function to instantiate both logical operators as neural modules. The negation operator N(•) is a unary operator: it takes in an embedding and outputs an embedding of the same dimension. The disjunction operator D(•, •) is a binary operator: it takes in a concatenation of two embeddings and outputs one embedding in ℝ^d. The overall embedding Q of the whole expression is computed by applying the two neural logic operators over the proposition embeddings and the query embedding, as shown in Figure 2.
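The expression ¬p_1 ∨ … ∨ ¬p_n ∨ q is evaluated by nesting the binary operator D over the negated propositions and then attaching the query. This sketch composes the symbolic expression that the neural modules follow; it shows the operator nesting only, with the actual neural networks N and D abstracted away as string constructors (the left-fold order is our assumption, consistent with the Q = D(D(N(p_1), N(p_2)), q) example given later in the paper).

```python
# Build the nested D/N expression for a chain p1..pn and query q,
# mirroring how the dynamic computational graph is assembled per query.

def build_expression(chain, query="q"):
    """Left-fold the disjunction D over negated chain propositions, then add the query."""
    expr = f"N({chain[0]})"
    for p in chain[1:]:
        expr = f"D({expr},N({p}))"
    return f"D({expr},{query})"
```

For a chain of two propositions this yields `D(D(N(p1),N(p2)),q)`, matching the worked example in the regularizer section; in the real model each `N(...)`/`D(...)` application is a forward pass through the corresponding GELU feed-forward module.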
Notice that the model architecture is dynamic: it builds different computational graphs for different input logical expressions, since the number of prerequisite propositions can differ. To evaluate whether the whole expression is true, the model computes the similarity between the final expression embedding Q and a predefined constant true vector T. T is defined as an all-ones vector, functioning as the anchor vector of the logical space.

Logical and Value Regularizers
For now, the disjunction and negation modules are just multi-layer neural networks. To ensure that they indeed perform the expected logical operations, and that known expressions are correctly embedded into the logical reasoning space, we employ two regularizers: (1) the self-supervised logical regularizers for the disjunction and negation modules. They make sure that the modules satisfy the collection of basic logical rules shown in Table 3. These regularizers are applied over all expression embeddings v ∈ V (including the intermediate expressions) that appear while computing the score for a triple, forcing all the logical rules to be satisfied.
(2) the supervised value regularizers for expressions of known values, which guarantee that the embeddings of the intermediate expressions whose true/false value is known are close to their deserved true (T) or false (F = N(T)) embeddings.
The value regularizer enriches the model training by fusing the ground-truth true/false information into the intermediate expression embeddings.
Here is an example: given a query triple q = (obesity, CausesDesire, exercise) with a chain of length two, proposition p_1 = (obesity, CausesDesire, lose weight) and p_2 = (lose weight, HasPrerequisite, exercise), the model evaluates whether p_1 ∧ p_2 → q is valid by computing the similarity score between T and Q = D(D(N(p_1), N(p_2)), q), where the latter computes the final embedding for this expression. The logical regularizer requires that all expression embeddings v ∈ V appearing in the calculation process (i.e., p_1, p_2, q, N(p_1), N(p_2), D(N(p_1), N(p_2)) and D(D(N(p_1), N(p_2)), q)) satisfy the basic logical requirements in Table 3, while the value regularizer requires the intermediate expressions to be close to their deserved true or false embeddings: e.g., p_1 should be close to T since it is a known triple in the graph, while N(p_1) should be close to F due to the negation, and similarly for the other intermediate expressions.

Table 3: Self-supervised logical regularizers over the logical modules
If a query triple q has no corresponding chain, then we evaluate the true/false value of the simple expression T → q to decide if q is true or false.

Learning Proposition Embeddings
Following common practice (Bosselut et al., 2019; Malaviya et al., 2020), we use transfer learning to enhance representation learning from language models. Transfer learning from language models to knowledge graphs has been shown to be effective for commonsense knowledge graph completion (Bosselut et al., 2019). As in Malaviya et al. (2020), we fine-tune a BERT-large (Devlin et al., 2018) model on the phrases of all nodes in the knowledge graph using masked language modeling. The rich semantics from the language model enhances the node representations. Using the fine-tuned models, we initialize the node embeddings in ConceptNet-100k, and both the node embeddings and the relation embeddings in WebChild (since it has more than 6k relations).

Training by Negative Sampling
The model is trained using negative sampling, where each negative example consists of a negative triple and its chain. For each head entity e_h and relation rel, we sample k tails e′_t from the set of all entities to construct k triples (e_h, rel, e′_t) that do not belong to the knowledge graph. However, computing chains for randomly selected negative triples takes a long time. Thus, instead of randomly sampling tail entities, we sample negative tails from those entities that have a chain from the head entity. Figure 3 gives an example. This small graph G has nodes {A, B, C, D, E, F, G, H} and relations r, r′. Suppose (A, r, B) is the gold triple. Our goal is to create negative triples (A, r, e′_t) that do not exist in the graph. We first create a negative tail entity set E′_t, initialized as ∅. Then we randomly sample k entities e′_t that are connected to the head entity by a chain of length one and for which (A, r, e′_t) does not belong to the graph. In this example, nodes {C, D, E} are added to the set E′_t. Then, for each sampled entity, we randomly sample another k entities e′_t that are connected to it by a chain of length one and for which (A, r, e′_t) does not belong to the graph. Again, nodes {F, G, H} are added to the set E′_t. We iterate the sampling four times, which gives us negative tail entities within four hops of the head entity, all added to E′_t. We then randomly sample another k entities from the remaining entities without a chain and add them to E′_t. Finally, we randomly sample k entities from the set E′_t, and k negative triples (A, r, e′_t) are created for the positive triple (A, r, B). We refer to this new sampling method as chain sampling.
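The hop-wise expansion behind chain sampling can be sketched as follows. This simplified version gathers the full candidate pool within a hop limit rather than sampling k entities per hop, and the edge set below mirrors the A..H example of Figure 3 only loosely (the exact edges of that figure are assumed for illustration).

```python
# Sketch of chain sampling: collect entities reachable from the head within
# `hops` steps whose (head, rel, tail) triple is absent from the graph.
# The real method samples k candidates per hop and also adds some no-chain
# negatives; here we just build the candidate pool.

def chain_sample_candidates(edges, head, rel, hops=4):
    """Collect tails within `hops` of `head` that form unseen (head, rel, tail) triples."""
    triples = set(edges)
    frontier, seen = {head}, {head}
    candidates = set()
    for _ in range(hops):
        # Expand one hop: tails of edges whose head is in the current frontier.
        frontier = {t for (h, r, t) in triples if h in frontier} - seen
        seen |= frontier
        # Keep tails that would form a triple not already in the graph.
        candidates |= {t for t in frontier if (head, rel, t) not in triples}
    return candidates
```

On a toy graph where A links to B (relation r) and to C, D, E (relation r2), and C, D, E each link one hop further to F, G, H, the candidates for head A and relation r are {C, D, E, F, G, H}; B is excluded because (A, r, B) already exists.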

Model Optimization
The loss consists of three parts. The first is a contrastive loss: for each triple q in the training dataset, we sample k negative tails to create k negative triples q′. The model computes the logical expression embedding Q for q and Q′ for each q′ based on their chains. The contrastive loss maximizes the difference between the scores of Q and Q′:

L_c = Σ_q Σ_{q′ ∈ neg(q)} −log σ(α (s(Q) − s(Q′)))

where α is the amplifying parameter, σ(•) is the sigmoid function, s is the cosine similarity score between Q (or Q′) and T, and neg(q) is the set of negative triples for the positive triple q.
The second part minimizes the distance between Q and T as well as the distance between Q′ and F, which encodes the ground-truth supervision signal into the learned proposition embeddings:

L_t = Σ_q (1 − s(Q, T)) + Σ_{q′ ∈ neg(q)} (1 − s(Q′, F))

The third part consists of two regularizers, the logical regularizer L_l and the value regularizer L_v:

L_l = Σ_{v ∈ V} Σ_i r_i(v),  L_v = Σ_{v ∈ V} [I(v is true)(1 − s(v, T)) + I(v is false)(1 − s(v, F))]

where V is the set of all intermediate expressions appearing during the model calculation (see Section 4.1), r_i denotes the logical rule terms of Table 3, and I is an indicator function whose value is 1 if the condition holds and 0 otherwise. The final loss function is:

L = L_c + L_t + λ_l L_l + λ_v L_v

where λ_l is the weight for logical regularization and λ_v is the weight for value regularization.
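The contrastive term can be made concrete numerically. The following is an illustrative sketch (scores and α are made-up values, not from the paper): a positive whose cosine score against T is well above the negatives incurs a small loss, and widening the margin shrinks the loss further.

```python
# Numeric sketch of the contrastive term: -log sigmoid(alpha * (s_pos - s_neg)),
# summed over the negative set. Scores here are illustrative cosine similarities.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def contrastive_loss(s_pos, s_negs, alpha=1.0):
    """Sum of -log sigma(alpha * (s_pos - s_neg)) over the negatives."""
    return sum(-math.log(sigmoid(alpha * (s_pos - s))) for s in s_negs)
```

With s_pos = 0.9 and one negative at 0.1, the loss is −log σ(0.8) ≈ 0.371; pushing the negative down to −0.5 reduces it, which is exactly the gradient pressure the term applies.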

Inference
Each model is evaluated in two inference settings: (1) retrieval out of 1,000 randomly sampled candidates and (2) retrieval out of 1,000 candidates drawn by chain sampling. We explore the two settings because we noticed that the distributions of negative triples under random sampling (which is used in previous works) and under chain sampling differ: randomly sampled negative triples are in general much easier to distinguish than chain-sampled negatives, since the latter are closer to the corresponding positive triple. As a result, chain sampling presents a more challenging task that is worth studying.
In the random-sampling setting, we first compute the chains for each triple q by breadth-first search; if no chain is found, T is used as the null chain. In the chain-sampling setting, each triple already has its chain, which is used to infer the final expression score. We rank the triples based on their scores and select the top-ranked triples for evaluation.
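The scoring and ranking step at inference can be sketched as follows: each candidate's final expression embedding Q is scored by cosine similarity against the constant all-ones true vector T, and candidates are sorted by that score. The embeddings below are illustrative toy vectors.

```python
# Sketch of inference: score each candidate's expression embedding Q against
# the all-ones anchor vector T by cosine similarity, then rank candidates.
import math

def score(Q):
    """Cosine similarity between the expression embedding Q and T = (1, ..., 1)."""
    dot = sum(Q)  # dot product with the all-ones vector is just the sum
    return dot / (math.sqrt(sum(q * q for q in Q)) * math.sqrt(len(Q)))

def rank(candidates):
    """Sort (name, embedding) pairs by score, best first."""
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)
```

An embedding aligned with T scores near 1 (a "true" expression), one orthogonal to T scores near 0, so ranking by this score puts the most plausible candidate triples first.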
Datasets

ConceptNet-100k contains general commonsense facts over 78,093 entities and 34 relations. Entities contain 2.85 words on average. We use the original splits of the dataset and combine the two provided validation sets to create a larger validation set. The validation and test sets have 1,200 triples each.
WebChild is a large collection of commonsense knowledge from the Web. We take the WebChild-comparative dataset, which contains 800k comparisons among 576k entities. We randomly select 1,200 triples for validation and another 1,200 for testing, and use the remainder for training.

Baselines and Evaluation Metrics
We use a fine-tuned BERT-large (Devlin et al., 2018) model to initialize the node embeddings (see Section 4.2). Since our method is a general framework that can be applied to existing CSKG completion methods, we apply our Neural Logic (NL) layer (System 2) on top of the following methods (System 1) to see whether adding the System 2 reasoning layer improves the System 1 performance. Neural Tensor Network (NTN) (Socher et al., 2013) uses a bilinear tensor layer to learn how head, relation and tail embeddings interact across multiple dimensions. DistMult (Yang et al., 2014) is an embedding-based bilinear diagonal model that learns entity and relation embeddings. SimplE (Kazemi and Poole, 2018) decomposes an entity's embedding into two vectors, each capturing the entity's behavior as the head or as the tail of a relation; a relation's embedding is likewise decomposed into two vectors, itself and its reverse. ConvE (Dettmers et al., 2018) uses 2D convolution over entity and relation embeddings and multiple layers of nonlinear features to model knowledge graphs. ConvTransE (Shang et al., 2019) builds upon ConvE but additionally models the translational properties of TransE (Bordes et al., 2013). HypER (Balažević et al., 2019) uses a hypernetwork (one network that generates weights for another network) to generate convolutional filter weights for each relation to process the input entities.
Since the neural logic layer works on triple embeddings while the outputs of the above baselines are usually scores, we minimally modify the baseline models so that each outputs an embedding for every triple. The triple embedding is then either directly used to compute the triple's score via its similarity to the constant true vector T, or fed into the neural logic layer as a proposition embedding.

Main Results
Table 4 and Table 5 show the results under the random-sampling and chain-sampling settings, respectively. The best number for each base model is bolded, and the overall best value in each column is starred. Comparing BERT+baseline with BERT+baseline*+NL in Tables 4 and 5, we can see that adding the System 2 reasoning layer on top of the System 1 representation learning layer almost always improves performance on both datasets; on WebChild-comparative the improvement holds without exception. Moreover, in most cases the global best performance on each metric is achieved by the System 2 enhanced model with the neural logic reasoning layer.
Comparing the results in Table 4 and Table 5, we can see that the performance of all models decreases dramatically under chain-sampling-based evaluation. This indicates that negative triples whose tail entity is close to the head entity are more difficult to distinguish than random negative triples; retrieval under chain sampling poses a much more challenging problem than retrieval under random sampling. However, BERT+baseline*+NL still achieves improvements on both datasets in most scenarios, except for ConvTransE and HypER on ConceptNet-100k.

Ablation Study
The model performs neural logical reasoning by learning the logical operators NOT and OR as neural modules. It uses the logical regularizer to guide the learning of the logical operators, and the value regularizer to facilitate learning whether an expression is true or false. To study the effectiveness of the two regularizers, we use the BERT+ConvTransE*+NL model as the example for the ablation study; other models show similar results. We compare four versions of the model: without either regularizer (no reg.), with only the logical regularizer (logical reg.), with only the value regularizer (value reg.), and with both regularizers (both reg.).
Table 6 reports the results. The model without any regularizer already performs better than the baseline model, showing that simply providing the extra triple information is helpful. Performance improves further with either the logical regularizer or the value regularizer, with value reg. performing slightly better than logical reg. This shows that explicitly infusing logical reasoning ability at every step of the model computation helps prediction, whether by directly enforcing the true/false value of each intermediate expression or by enforcing the negation and disjunction operators to behave logically. The two regularizers combined achieve the best overall performance.

Parameter Sensitivity for Regularization
The default values for the regularizer weights λ_l and λ_v in Eq. (6) are both 0.5. In this section, we try multiple values λ_l, λ_v ∈ {0.01, 0.05, 0.1, 0.5, 1, 5} for each of the two regularizer weights while holding the other at its default value. The MRR of each experiment is shown in Figure 4. We can see that (1) both too-small and too-large values of λ have a negative effect, with the latter decreasing performance much more, and (2) an overly large λ hurts performance more for the logical regularizer than for the value regularizer.

Qualitative Analysis
We analyze some error cases and find three error types: (1) wrong reasoning conducted on the chain of propositions; (2) correct reasoning conducted on the chain, but the chain contains a mistaken triple, which leads to a wrong answer. For example, (fish, AtLocation, at beach) is invalid, but the provided chain (fish, HasPrerequisite, water), (water, AtLocation, at beach) validates it; this mistake stems from the wrong triple (water, AtLocation, at beach). (3) Correct reasoning conducted on a chain containing an imprecise triple, which leads to a wrong or unnatural triple. For example, it is questionable whether (door, AtLocation, street) is correct, but given the chain (door, PartOf, car), (car, AtLocation, street), it is a correct conclusion; a more natural triple would be considered correct if we replaced "door" with "car door" here. This shows that some ground-truth triples are not precise enough to support valid logical reasoning. Examples of the three error types and of false negative cases on ConceptNet-100k are presented in Table 7 and Table 8 of the Appendix, respectively.

Conclusions and Future Work
As a simple instantiation of the Dual Process Theory, this paper demonstrates the advantage of incorporating a System 2 reasoning model on top of a System 1 representation learning model on CSKG link prediction tasks. The positive results verify the potential of our method. Since the reasoning model is differentiable and can be easily infused with any representation learning model, we intend to combine our method with larger pre-trained language models and extend it to more NLP tasks in the future.

Figure 2: Overview of the model structure, where System 1 learns representations and System 2 conducts logical reasoning. The final output vector Q is compared with the constant true vector T to decide the final output.

4 The Two-System Architecture

Our architecture consists of two components: a representation learning component and a neural logic reasoning component on top of it, as shown in Figure 2. The representation learning component can be any model that encodes proposition triples into vector embeddings, and the neural logic reasoning component conducts reasoning in a latent logical space based on neural logical operators.

Figure 3: Example graph to illustrate negative sampling

Figure 4: Parameter Sensitivity on ConceptNet-100k

Table 2: Statistics of chain coverage on the two datasets