Deep Cognitive Reasoning Network for Multi-hop Question Answering over Knowledge Graphs

Knowledge Graphs (KGs) represent human knowledge, with nodes and edges being entities and the relations among them, respectively. Multi-hop question answering over KGs—which aims to find answer entities of given questions through reasoning paths in KGs—has recently attracted great attention from both academia and industry. However, this task remains challenging, as it requires accurately identifying answers in a large candidate entity set whose size grows exponentially with the number of reasoning hops. To tackle this problem, we propose a novel Deep Cognitive Reasoning Network (DCRN), which is inspired by the dual process theory in cognitive science. Specifically, DCRN consists of two phases—the unconscious phase and the conscious phase. The unconscious phase first retrieves informative evidence from candidate entities by leveraging their semantic information. The conscious phase then accurately identifies answers by performing sequential reasoning over the retrieved evidence according to the graph structure. Experiments demonstrate that DCRN significantly outperforms state-of-the-art methods on benchmark datasets.


Introduction
Knowledge Graphs (KGs) store structured human knowledge, in which nodes represent entities and edges represent relations between pairs of entities. Multi-hop Question Answering over KGs (KGQA) aims to find answer entities by reasoning over paths in KGs. We illustrate this task with an example in Figure 1. Recently, multi-hop question answering over KGs has attracted great attention from both academia and industry (Li et al., 2017; Fu et al., 2020; Saxena et al., 2020). However, this task remains challenging, because the number of candidate entities grows exponentially with the number of reasoning hops (Sun et al., 2018, 2019a), making it difficult to accurately identify answers. Previous works mitigate this problem by reducing the size of the candidate entity set. For example, GRAFT-Net (Sun et al., 2018) and PullNet (Sun et al., 2019a) first extract question-specific subgraphs, and then perform multi-hop reasoning on the extracted subgraphs via Graph Neural Networks (GNNs) to find answers. However, these approaches often sacrifice the recall of answers in exchange for small candidate entity sets; that is, an extracted subgraph may contain no answer at all. This trade-off between the recall of answer entities and the size of the candidate entity set limits their practical usage. Therefore, it remains desirable to find an approach that can accurately identify answers without sacrificing recall.
To tackle this problem, we take inspiration from the dual process theory (Evans, 1984, 2003, 2008) in cognitive science and propose a novel Deep Cognitive Reasoning Network (DCRN). In cognitive science, researchers have found that humans can reason over a large-capacity memory to find answers (Wang et al., 2003). Specifically, the dual process theory suggests that humans accomplish cognitive tasks by first exploiting fast intuition to retrieve task-relevant evidence via an unconscious process, and then performing sequential reasoning over that evidence to derive answers via a conscious process. Similarly, the proposed DCRN consists of two phases. The first is the unconscious phase, which retrieves informative evidence by softly selecting the candidate entities that are most likely to be correct answers. The second is the conscious phase, which accurately identifies answers by performing sequential reasoning with Bayesian networks based on the retrieved evidence from the first phase. Experiments demonstrate that DCRN significantly outperforms state-of-the-art methods on benchmark datasets.

Preliminaries
In this section, we first review the background of this paper and then introduce the notations used throughout this paper.

Background
In this part, we review the background of knowledge graphs and multi-hop KGQA.
Knowledge Graph Given a set of entities E, a set of relations R, and a set of triplets T = {(e_i, r_j, e_k)} ⊂ E × R × E, we define a knowledge graph G as G = {E, R, T}.
Multi-hop KGQA Given a knowledge graph G = {E, R, T} and a natural language question q with its topic entity e_topic ∈ E, the task of KGQA is to predict the answer e* to question q by e* = argmax_{e_i ∈ E} f(e_i), where f(e_i) is a score function that measures the plausibility of e_i being the correct answer. In multi-hop KGQA, the answers are not guaranteed to be direct neighbours of the topic entity of the given question; therefore, finding answers often requires multi-hop reasoning over the KG.
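The prediction rule above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `predict_answer` and the toy scores are hypothetical names standing in for the model's score function f.

```python
# Sketch of the KGQA prediction rule e* = argmax_{e in E} f(e).
# `score_fn` stands in for the model's score function f; names are illustrative.
def predict_answer(candidates, score_fn):
    """Return the candidate entity with the highest plausibility score."""
    return max(candidates, key=score_fn)

# Toy usage: scores are made up purely for illustration.
toy_scores = {"Paris": 0.9, "Lyon": 0.3, "Berlin": 0.1}
best = predict_answer(toy_scores, lambda e: toy_scores[e])  # -> "Paris"
```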
Bayesian Network A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). In a Bayesian network, the nodes represent random variables and the directed edges represent conditional dependencies between them.

Notations
In this paper, we use lower-case letters e and r to represent an entity and a relation, respectively. The corresponding boldface letters e and r denote the embeddings of e and r.

Related Work
In this section, we review related work on multi-hop KGQA and knowledge graph embeddings.

Multi-hop KBQA
Recent work on multi-hop KBQA can be divided into two categories: semantic parsing methods and information retrieval methods. Semantic parsing methods first parse the given question into an executable query, and then execute the query to locate answers. Information retrieval methods embed questions and the knowledge graph into low-dimensional spaces, and then find answers based on question-answer semantic similarity. Our proposed DCRN belongs to the information retrieval category.
Key-Value Memory Network (KV-Mem) KV-Mem (Miller et al., 2016) is a variant of Memory Networks (Weston et al., 2015), which performs reasoning over a memory component, i.e., an array storing triplets from the KG. KV-Mem iteratively reads from the memory to update the question embedding, which is then used to match correct answers.
Variational Reasoning Network (VRN) VRN (Zhang et al., 2018) proposes a variational framework for multi-hop KGQA. To identify answers, it computes compatibility scores between the question type and the reasoning graph of each candidate. However, its performance is limited on questions that require long reasoning paths, due to the exponentially growing number of candidates.
GRAFT-Net GRAFT-Net (Sun et al., 2018) first extracts a question-specific subgraph based on Personalized PageRank (PPR), and then encodes the subgraph with Graph Neural Networks (GNNs) to identify answers. However, as described in Sun et al. (2019a), the extracted subgraphs are often too large and have a low recall of answer entities.
PullNet PullNet (Sun et al., 2019a) mitigates the problem of GRAFT-Net with a trainable subgraph expansion strategy. It constructs a question-specific subgraph starting from the entities mentioned in the question, and then iteratively "pulls" relevant entities to expand the subgraph. However, it inevitably sacrifices the recall of answer entities in exchange for small candidate entity sets, which limits its practical usage.
EmbedKGQA EmbedKGQA (Saxena et al., 2020) models multi-hop KBQA as a link prediction task. It first embeds the given question into a latent relation embedding, and then exploits knowledge graph embedding techniques to identify answers.

Knowledge Graph Embedding in KGQA
Knowledge Graph Embedding (KGE) methods (Hitchcock, 1927; Trouillon et al., 2016; Sun et al., 2019b; Zhang et al., 2020a,b) aim to map entities and relations within KGs into distributed representations (vectors, matrices, etc.). These embeddings are typically trained on the link prediction task, where the model is required to predict the missing head or tail entity of a triplet. EmbedKGQA (Saxena et al., 2020) uses ComplEx (Trouillon et al., 2016) to train knowledge graph embeddings, which represents entity and relation embeddings as vectors in complex spaces. For fair comparison with previous work including GRAFT-Net (Sun et al., 2018) and PullNet (Sun et al., 2019a), we use Canonical Polyadic (CP) decomposition (Hitchcock, 1927) to train the knowledge graph embeddings in our proposed DCRN, which represents entity and relation embeddings as vectors in real spaces.
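The CP triplet score used here admits a very compact sketch: a triplet (h, r, t) is scored by the trilinear product of the three real-valued vectors. This is a minimal illustration of the scoring function, not the paper's training code; the dimension and random values are arbitrary.

```python
import numpy as np

# Sketch of the CP (Hitchcock, 1927) triplet score:
#   f(h, r, t) = <h, r, t> = sum_i h_i * r_i * t_i,
# with real-valued embedding vectors. Note that in CP the head-role and
# tail-role embeddings of the same entity are distinct vectors.
def cp_score(head, rel, tail):
    """Trilinear product score; higher means a more plausible triplet."""
    return float(np.sum(head * rel * tail))

# Toy usage with random 300-dimensional embeddings (dimension as in the paper).
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 300))
s = cp_score(h, r, t)
```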

The Dual Process Theory
The dual process theory (Evans, 1984, 2003, 2008) was originally proposed in cognitive science. Inspired by this theory, researchers have proposed to mimic human cognition in various cognitive tasks: for example, prior work applies the theory to one-shot KG reasoning, and proposes a cognitive framework for multi-hop reasoning over documents. Different from these works, we focus on the task of multi-hop question answering over knowledge graphs.

Method
In this section, we introduce our proposed Deep Cognitive Reasoning Network (DCRN) for multi-hop KGQA. In Section 4.1, we introduce the motivation and the overall architecture of DCRN. In Sections 4.2, 4.3, and 4.4, we introduce the components of DCRN.

Motivation
For multi-hop questions, it is challenging to accurately identify answers from a large candidate set whose size grows exponentially with the number of reasoning steps. Existing approaches (Sun et al., 2018, 2019a) reduce the size of the candidate entity set by extracting question-specific subgraphs. However, these approaches often sacrifice the recall of answers in exchange for small candidate sets, which limits their performance in practical usage. We instead take inspiration from the dual process theory (Evans, 1984, 2003, 2008) in cognitive science. The theory suggests that humans accomplish cognitive tasks by first exploiting fast intuition to retrieve task-relevant evidence via an unconscious process (System 1), and then performing sequential reasoning over that evidence to derive answers via a conscious process (System 2).
Inspired by this theory, we propose the Deep Cognitive Reasoning Network (DCRN) for multi-hop KGQA. The proposed DCRN consists of two phases. The first is the unconscious phase, which retrieves informative evidence from candidate entities by leveraging their semantic information. The second is the conscious phase, which accurately identifies answers by performing sequential reasoning over the retrieved evidence according to the graph structure.
The overall architecture of DCRN is shown in Figure 2. The basic module of DCRN is the Path Decoding Module, on top of which the two phases—the unconscious phase and the conscious phase—are built.

Path Decoding Module
The Path Decoding Module is the basic component of DCRN. As multi-hop KGQA requires multi-hop reasoning to arrive at answer entities, this module decodes the reasoning path information from the question.
Specifically, we adopt an RNN-based encoder-decoder structure, which first encodes the question into a hidden representation, and then decodes this representation to obtain the reasoning path information, i.e., the score of each relation at each reasoning step. These scores are used in both the unconscious and the conscious phase.
First, we encode the given question q with an RNN to obtain its latent representation q ∈ R^d: q = RNN-Encoder(q).
Then, we decode this representation q to obtain the reasoning path information. We illustrate this process in Figure 3. At each decoding step, the decoder predicts the score of each relation. The predictions at step t serve as the input to the decoder at step t + 1.

Figure 3: Illustration of the Path Decoding Module.
At step t, given the hidden state h^(t−1) of the previous step and the input i^(t), the RNN decoder outputs the updated hidden state h^(t) = RNN-Decoder(h^(t−1), i^(t)), where the initial hidden state h^(0) is initialized as the question embedding q and the initial input i^(0) is a zero vector. Then, we compute the output of step t as the weighted sum of relation embeddings, o^(t) = Σ_{r_i ∈ R} f^(t)_rel(r_i) r_i, where f^(t)_rel(r_i) denotes the score of relation r_i at step t, computed from the hidden state h^(t) (e.g., via a softmax over all relations). The output of step t then serves as the input of step t + 1, i.e., i^(t+1) = o^(t).
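The decoding loop above can be sketched with NumPy. This is an illustrative toy, not the authors' parameterization: the single-matrix tanh cell, the softmax scoring of relations against the hidden state, and all names are assumptions made for the sketch.

```python
import numpy as np

# Minimal sketch of the Path Decoding Module. At each step the decoder updates
# its hidden state, scores every relation with a softmax, and feeds the
# weighted sum of relation embeddings back as the next input.
def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def decode_path(q_emb, rel_embs, W, T):
    """q_emb: (d,) question encoding; rel_embs: (|R|, d); W: (d, 2d) toy RNN cell."""
    h = q_emb                           # h^(0) = q
    inp = np.zeros_like(q_emb)          # i^(0) = 0
    scores_per_step, outputs = [], []
    for _ in range(T):
        h = np.tanh(W @ np.concatenate([h, inp]))  # h^(t) = RNN(h^(t-1), i^(t))
        f_rel = softmax(rel_embs @ h)              # relation scores at step t
        o = f_rel @ rel_embs                       # o^(t): weighted relation sum
        scores_per_step.append(f_rel)
        outputs.append(o)
        inp = o                                    # i^(t+1) = o^(t)
    return scores_per_step, outputs

# Toy usage with random embeddings and T = 3 reasoning steps.
rng = np.random.default_rng(0)
d, n_rel, T = 8, 5, 3
q = rng.normal(size=d)
R = rng.normal(size=(n_rel, d))
W = rng.normal(size=(d, 2 * d))
step_scores, step_outputs = decode_path(q, R, W, T)
```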

The Unconscious Phase
The unconscious phase corresponds to the unconscious process (System 1) in the dual process theory from cognitive science. In this phase, we retrieve informative evidence from candidate entities by leveraging their semantic information.
The evidence refers to sketched results that predict which candidates are most likely to be correct answers. We expect the retrieved evidence to effectively filter out the candidate entities that are irrelevant to the given question.
To achieve this, we perform semantic matching between the given question and each candidate entity. The semantic matching score f_s(e) of candidate entity e is f_s(e) = q e^T, where q ∈ R^{1×d} is the query embedding obtained from the given question, and e ∈ R^{1×d} is the embedding of entity e.
In our model, the entity embedding e is pretrained with the CP (Hitchcock, 1927) model. Therefore, the key of the unconscious phase is to design an informative query representation q.
To design informative query representations, we take inspiration from PTransE (Lin et al., 2015), which extends knowledge graph embedding to relation paths. In PTransE, if a relation path connects entity e_1 to entity e_n through relations r_1, ..., r_{n−1}, then the embeddings are trained such that e_1 • r_1 • ... • r_{n−1} ≈ e_n, where • is a composition operation that can be addition, element-wise multiplication, an RNN, etc. This objective can be viewed as semantic matching between the query e_1 • r_1 • ... • r_{n−1} and its target e_n.
Similarly, we encode the query representation q as follows. First, the start entity of the reasoning path is the topic entity e_topic of the given question. Second, recall that the Path Decoding Module decodes reasoning path information, where the output at step t (i.e., o^(t)) is the weighted sum of relation embeddings. Therefore, we represent the query embedding as q = e_topic • o^(1) • ... • o^(T), where • denotes element-wise multiplication and T denotes the number of steps in the Path Decoding Module (i.e., the number of reasoning steps).
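The query construction and matching score above can be sketched directly. This is a minimal illustration under the stated formulas; the function names and toy vectors are assumptions, and the step outputs would come from the Path Decoding Module in the actual model.

```python
import numpy as np

# Sketch of the unconscious-phase query embedding:
#   q = e_topic ∘ o^(1) ∘ ... ∘ o^(T)   (element-wise multiplication),
# followed by the semantic matching score f_s(e) = q e^T.
def query_embedding(e_topic, step_outputs):
    """Compose the topic-entity embedding with the per-step decoder outputs."""
    q = e_topic.copy()
    for o in step_outputs:      # o^(t): weighted sum of relation embeddings
        q = q * o               # element-wise composition, as in PTransE
    return q

def semantic_score(q, e):
    """f_s(e) = q e^T, a dot product between query and entity embeddings."""
    return float(q @ e)

# Toy usage with hand-picked 2-dimensional vectors.
e_topic = np.array([1.0, 2.0])
outs = [np.array([2.0, 0.5]), np.array([1.0, 1.0])]
q = query_embedding(e_topic, outs)          # [2.0, 1.0]
score = semantic_score(q, np.array([1.0, 1.0]))
```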

The Conscious Phase
The conscious phase corresponds to the conscious process (System 2) in the dual process theory from cognitive science. In this phase, we accurately identify answers by performing sequential reasoning according to the graph structure on the retrieved evidence from the unconscious phase.
To model the sequential reasoning, we take inspiration from the consciousness prior (Bengio, 2017). It suggests that the conscious process refers to only a few variables at a time, which can be modeled with factor graphs—a form of knowledge representation factored into pieces that each involve only a few variables.
In this work, we perform sequential reasoning with Bayesian networks, which can be seen as a type of factor graph. First, we build a question-specific Bayesian network from the given KG, in which we view the predictions for entities as random variables and relations as the dependencies between them. Second, we perform marginal inference on the Bayesian network to predict the probability of each candidate entity being a correct answer.

Building Bayesian Networks
We build question-specific Bayesian networks from the given KG with the following two steps. First, we perform graph pruning on the KG to obtain a directed acyclic graph (DAG). Second, we transform the DAG into a Bayesian network.
Given a knowledge graph G = {E, R, T} and a question q with topic entity e_topic ∈ E, we prune G into a directed acyclic graph (DAG) by applying the breadth-first search (BFS) algorithm starting from e_topic. Specifically, we keep only the edges visited during the search and remove the unvisited edges. We illustrate this process in Figure 4, in which we perform a two-step BFS starting from the topic entity and prune the unvisited edges. Note that, following previous work (Sun et al., 2018, 2019a), we add an inverse relation r^{−1} for each relation r in the KG; that is, if (e_i, r_j, e_k) is a valid triplet, then (e_k, r_j^{−1}, e_i) is also valid. The reason to perform graph pruning is twofold. First, the number of potential reasoning paths from e_topic to an arbitrary candidate entity e in the KG can be extremely large, so we apply graph pruning to reduce the search space and keep only the shortest paths. Second, a Bayesian network is required to be a directed acyclic graph (DAG). Note that the pruning procedure only removes redundant edges, so the answer entities are guaranteed to remain within the candidate set.
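The BFS-based pruning can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: it keeps every edge that advances one BFS layer (so all shortest paths from the topic entity survive and the result is a DAG); the function and variable names are assumptions.

```python
from collections import deque

# Sketch of the graph pruning step: run BFS from the topic entity and keep only
# edges that go from one BFS layer to the next, yielding a shortest-path DAG.
def prune_to_dag(triplets, topic):
    """triplets: iterable of (head, rel, tail), inverse relations already added."""
    adj = {}
    for h, r, t in triplets:
        adj.setdefault(h, []).append((r, t))
    dist = {topic: 0}          # BFS layer (shortest-path distance) of each node
    kept = []                  # edges of the pruned DAG
    queue = deque([topic])
    while queue:
        e = queue.popleft()
        for r, t in adj.get(e, []):
            if t not in dist:                 # first visit: assign next layer
                dist[t] = dist[e] + 1
                queue.append(t)
            if dist[t] == dist[e] + 1:        # keep only layer-advancing edges
                kept.append((e, r, t))
    return kept, dist

# Toy usage: the edge c->b points backwards within a layer and is pruned.
triplets = [("a", "r1", "b"), ("b", "r2", "c"), ("a", "r3", "c"), ("c", "r4", "b")]
kept, dist = prune_to_dag(triplets, "a")
```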

Figure 4: Illustration of question-specific Bayesian networks built from KGs. In the KG, we add an inverse relation r^{−1} for each relation r; that is, if a triplet (e_i, r, e_j) exists, then (e_j, r^{−1}, e_i) also exists.
We use Ĝ(e_topic) to denote the pruned graph given the topic entity e_topic. Then, we have the following proposition.

Proposition 1. The pruned graph Ĝ(e_topic) is a directed acyclic graph (DAG).
For the detailed proof, please refer to the appendix. By the properties of BFS, if G is connected, then for any entity e there exists a path in the pruned graph Ĝ(e_topic) that starts from e_topic and ends at e. Furthermore, this path must be a shortest path among all paths in G connecting e_topic and e.
We then introduce how the DAG Ĝ(e_topic) corresponds to a Bayesian network. The transformed Bayesian network B(e_topic) shares the same graph structure as Ĝ(e_topic), but the definitions of nodes and edges differ. Table 1 illustrates the relationship between the DAG and the corresponding Bayesian network.

Table 1: Correspondence between the DAG Ĝ(e_topic) and the Bayesian network B(e_topic).
Nodes: entity e_i ↔ random variable X_{e_i} ∈ {0, 1}
Edges: relation r_j between e_i and e_k ↔ conditional dependency between X_{e_i} and X_{e_k}
Each entity e in Ĝ(e_topic) corresponds to a binary random variable X_e ∈ {0, 1} in B(e_topic), which represents the prediction for candidate entity e. Given a question q, X_e = 0 denotes that e is an incorrect answer and X_e = 1 denotes that e is a correct answer. In Ĝ(e_topic), each relation r connecting entities e_i and e_j corresponds to a directed edge connecting X_{e_i} and X_{e_j} in B(e_topic), which denotes the dependency between them.

Bayesian Reasoning
Based on the Bayesian network B(e_topic), we can make marginal inferences to predict whether an entity e is a correct answer, represented probabilistically as P(X_e = 1 | G, q, e_topic), where G is the KG, q is the given question, and e_topic is the topic entity. To calculate this marginal probability, we have the following proposition.
Proposition 2. The marginal probability P(X_e = 1 | G, q, e_topic) of a candidate entity e can be calculated via variable elimination: P(X_e = 1 | G, q, e_topic) = P(X_e = 1) ∏_{e′ ∈ pa(e)} P(X_{e′} = 0 | G, q, e_topic), where pa(e) denotes the set of parent nodes of e, i.e., the nodes that have edges directed at e, and the first component P(X_e = 1) is shorthand for P(X_e = 1 | G, q, e_topic, X_{pa(e)} = 0).
For detailed proofs, please refer to the appendix. The marginal probability P(X_e = 1 | G, q, e_topic) is the product of two components. The first component is the probability that entity e is an answer given that all of e's parent entities pa(e) are incorrect. The second component, ∏_{e′ ∈ pa(e)} P(X_{e′} = 0 | G, q, e_topic), is the product of the predictions for X_e's parent nodes. Note that, for convenience of computation, our implementation assumes P(X_{e′} = 0 | G, q, e_topic) = 1 when computing P(X_e = 1 | G, q, e_topic).
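The variable-elimination formula of Proposition 2 can be sketched as a single pass over the DAG in topological order. This sketch computes the full product over parents (rather than the implementation shortcut that fixes the parent terms to 1); all names are illustrative.

```python
# Sketch of the marginal inference of Proposition 2 over the pruned DAG:
#   P(X_e = 1) = P(X_e = 1 | parents incorrect) * prod_{e' in pa(e)} P(X_e' = 0).
# Nodes must be processed in topological order so parents are resolved first.
def marginal_inference(topo_order, parents, local_prob):
    """local_prob[e] = P(X_e = 1 | parents incorrect); returns P(X_e = 1) per node."""
    p_answer = {}
    for e in topo_order:
        prod = 1.0
        for par in parents.get(e, []):
            prod *= 1.0 - p_answer[par]   # P(X_parent = 0) = 1 - P(X_parent = 1)
        p_answer[e] = local_prob[e] * prod
    return p_answer

# Toy usage: a chain topic -> a -> b with made-up local probabilities.
p = marginal_inference(
    topo_order=["a", "b"],
    parents={"b": ["a"]},
    local_prob={"a": 0.8, "b": 0.5},
)
```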
We model the first component as follows.
P(X_e = 1 | G, q, e_topic, X_{pa(e)} = 0) = sigmoid(g(f_s(e), f_b(e))), where f_s(e) is the evidence provided by the unconscious phase and f_b(e) is the score computed over the Bayesian network. g(·, ·) is a function for combining the two scores; we choose g(x, y) = x + y in this work. The score f_b(e) is obtained by propagating relation scores along the edges of Ĝ(e_topic): for an edge labeled r that reaches e at reasoning step t, the prediction score f^(t)_rel(r) of relation r at step t from the Path Decoding Module is accumulated into f_b(e), weighted by a coefficient α^(t)_r. Note that t also denotes the topological distance between e_topic and e, i.e., the number of reasoning steps required from e_topic to e. We initialize the score of the topic entity to zero, i.e., f_b(e_topic) = 0.
The conscious phase differs from previous multi-hop reasoning approaches (Zhang et al., 2018; Sun et al., 2018, 2019a) in two aspects. First, we model the reasoning process probabilistically with Bayesian networks, while previous works typically apply GNNs for reasoning. Second, the conscious phase propagates scalar scores along paths for multi-hop reasoning, while previous works typically propagate embeddings with GNNs.

Loss Function
We use the binary cross-entropy loss for training. Specifically, given a question q, the loss L is computed as L = −Σ_{e ∈ E} [1(e ∈ A) log p(e) + (1 − 1(e ∈ A)) log(1 − p(e))], where E is the set of entities, A is the set of correct answers, 1(·) is the indicator function, and p(e) = P(X_e = 1 | G, q, e_topic).
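The loss can be sketched directly from the formula. Whether the sum is averaged over entities is not stated in the text, so the mean here is an assumption of the sketch; names are illustrative.

```python
import math

# Sketch of the binary cross-entropy training loss over candidate entities,
# with p(e) = P(X_e = 1 | G, q, e_topic). Averaging over entities is an
# assumption; the paper's formula is stated as a sum.
def bce_loss(probs, answers):
    """probs: dict entity -> p(e); answers: set of gold answer entities."""
    loss = 0.0
    for e, p in probs.items():
        y = 1.0 if e in answers else 0.0
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss / len(probs)

# Toy usage: one correct answer predicted at 0.9, one distractor at 0.1.
l = bce_loss({"a": 0.9, "b": 0.1}, {"a"})
```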

Experiments
This section is organized as follows. In Section 5.1, we introduce experimental settings in detail. In Section 5.2, we show the effectiveness of our model on benchmark datasets. In Section 5.3, we conduct ablation studies and analysis.

Experimental Settings
In this part, we introduce the benchmark datasets and the protocols for training and evaluation.

Datasets
We conduct experiments on two public datasets—WebQuestionSP (Yih et al., 2015) and MetaQA (Zhang et al., 2018)—which have been divided into training, validation, and test sets by previous works. The statistics are shown in Table 2.
WebQuestionSP WebQuestionSP is a small dataset containing 4,737 questions. These questions are 1-hop or 2-hop questions that can be answered with the Freebase (Bollacker et al., 2008) knowledge graph. Note that WebQuestionSP mainly consists of 1-hop questions, and only 0.5% of the questions are 2-hop.
MetaQA MetaQA is a large dataset containing over 400k questions in the movie domain. It is split into 1-hop, 2-hop, and 3-hop questions. Following previous work (Sun et al., 2018, 2019a; Saxena et al., 2020), we use the "vanilla" version of the dataset. On MetaQA, we evaluate our model under two settings: the "full" setting and the "half" setting. In the "full" setting, we use the vanilla knowledge graph for training. In the "half" setting, we follow previous work (Saxena et al., 2020) and randomly drop 50% of the triplets in the knowledge graph.

Evaluation Protocols
Following previous work (Sun et al., 2018, 2019a), we use Hits at N (H@N) to evaluate model performance. For each given question, we rank the candidates in descending order of their scores, and compute the percentage of correct answers that rank in the top N.
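One common per-question form of this metric can be sketched as follows: rank the candidates by score and check whether an answer lands in the top N. This is an illustrative implementation choice, not necessarily the paper's exact aggregation; names are assumptions.

```python
# Sketch of a per-question Hits@N check: rank candidates by score (descending)
# and report 1.0 if any gold answer appears among the top N, else 0.0.
def hits_at_n(scores, answers, n):
    """scores: dict entity -> score; answers: set of gold answer entities."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 1.0 if any(e in answers for e in ranked[:n]) else 0.0

# Toy usage with made-up scores.
h1 = hits_at_n({"a": 0.9, "b": 0.5, "c": 0.1}, {"a"}, 1)   # answer ranked first
h2 = hits_at_n({"a": 0.9, "b": 0.5, "c": 0.1}, {"c"}, 1)   # answer ranked last
```

Dataset-level H@N is then the mean of this indicator over all test questions.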

Training Protocols
We choose Adam (Kingma and Ba, 2015) as the optimizer, and use grid search to find the best hyperparameters based on model performance on the validation sets. For the details of hyperparameter selection, please refer to the appendix. Following previous work (Sun et al., 2018, 2019a), we use GloVe (Pennington et al., 2014) as word embeddings, and use a bidirectional LSTM as the encoder. We also use CP (Hitchcock, 1927) to train the entity and relation embeddings.

Candidate Set Generation Protocol
In the "full" setting, the candidate set of an n-hop question q consists of all entities within n hops of the topic entity of q. In the "half" setting, the candidate set of any question consists of all entities in the KG. Therefore, the answers are guaranteed to be included in the candidate sets, and the recall of answers is 1.0.

Hyperparameters
We use grid search to find the best hyperparameters. Specifically, we search the learning rate in {0.1, 0.01, 0.001} and the dropout rate in {0.1, 0.2, 0.3}. The optimal configuration of DCRN is learning rate = 0.01 and dropout rate = 0.2. For fair comparison with previous work (Sun et al., 2018, 2019a; Saxena et al., 2020), we set the embedding size to 300. When training the knowledge graph embeddings with CP (Hitchcock, 1927), we search the learning rate in {0.1, 0.01, 0.001}; the optimal configuration is learning rate = 0.1. We choose the value of the hyperparameter T as follows: on WebQuestionSP, we set T = 2; on MetaQA with the "full" setting, we set T = t for t-hop questions; on MetaQA with the "half" setting, we set T = 4.

Main Results
In Table 3, we show the results of our proposed DCRN on the WebQuestionSP and MetaQA datasets. Overall, our model significantly outperforms state-of-the-art models on benchmark datasets.
WebQuestionSP is a small dataset, but it uses a large-scale KG, a subset of Freebase. This dataset follows an inductive setting: some entities in the test set do not appear in the training set. Experiments demonstrate that our DCRN achieves 67.8 on H@1, which outperforms GRAFT-Net and KV-Mem, and performs comparably to the previous state-of-the-art PullNet.
MetaQA is a large dataset consisting of 1-hop to 3-hop questions. Overall, our DCRN achieves state of the art on all three subdatasets. On MetaQA 1-hop and 2-hop, some previous methods already exhibit satisfying performance. The questions in the MetaQA 3-hop dataset are more difficult to answer than those in MetaQA 1-hop and 2-hop, as they require longer reasoning paths to find answers. However, experiments demonstrate that our model achieves 99.3 on H@1, which significantly outperforms previous state-of-the-art methods: specifically, it gains 7.9 points over PullNet and 4.5 points over EmbedKGQA. The results on MetaQA 3-hop illustrate the effectiveness of our model on questions that require long reasoning paths.
We also conduct experiments in the "half" setting. In this setting, 50% of the triplets are dropped, making it more challenging to accurately identify answers. The results are shown in Table 4. Experiments demonstrate that our model achieves state of the art on all subsets of MetaQA.

Analysis
In this part, we conduct analysis on our model. In Section 5.3.1, we conduct ablation studies on the two phases in DCRN. In Section 5.3.2, we conduct a case study to illustrate the two-phase strategy of the proposed DCRN.

Ablation Studies on the Two Phases
In Table 5, we conduct ablation studies to show the performance of the two phases in DCRN. Overall, the experiments show that both phases are indispensable. The unconscious phase and the conscious phase are designed to better exploit node-level and path-level features, respectively, and both levels of features are critical to accurate answer identification. Therefore, the cooperation of the two phases brings significant performance improvements, as shown in Table 5. On MetaQA 1-hop and 2-hop, both phases achieve satisfying performance, as the number of candidate entities is relatively small. Furthermore, the conscious phase outperforms the unconscious phase. This is because the unconscious phase exploits the coarse-grained semantics of entities, while the conscious phase considers the fine-grained relational dependencies between entities; on small candidate entity sets, the conscious phase can therefore make more accurate predictions.
On MetaQA 3-hop, the unconscious phase outperforms the conscious phase. This is because 3-hop questions usually have large candidate entity sets, and errors can propagate along reasoning paths. Therefore, to make accurate predictions, DCRN requires the unconscious phase to softly filter out irrelevant candidates. Experiments demonstrate that, by combining both phases, DCRN achieves 99.3 on H@1.
We further compare the unconscious phase of DCRN with EmbedKGQA (Saxena et al., 2020). EmbedKGQA consists of two parts: knowledge graph embedding and relation matching. The former uses the question representation as a latent relation embedding. Different from EmbedKGQA, the unconscious phase in DCRN decodes a question into relation paths. To illustrate the effectiveness of the unconscious phase, we compare it with EmbedKGQA (w/o relation matching); the results are shown in Table 6. Note that EmbedKGQA (Saxena et al., 2020) uses RoBERTa (Liu et al., 2019) for word embeddings and ComplEx (Trouillon et al., 2016) for entity embeddings. For fair comparison with previous work including GRAFT-Net (Sun et al., 2018) and PullNet (Sun et al., 2019a), we use GloVe (Pennington et al., 2014) for word embeddings and CP (Hitchcock, 1927) for entity embeddings, and we reimplement EmbedKGQA (w/o relation matching) under our settings.
Experiments demonstrate that the unconscious phase outperforms EmbedKGQA (w/o relation matching) on all the three datasets of MetaQA, illustrating the effectiveness of our design on the query representation in the unconscious phase.

Case Study
In this part, we conduct a case study to illustrate the effectiveness of the two-phase strategy in DCRN. In Figure 5, we show the predictions made by DCRN on the 2-hop question "who is listed as screenwriter of John Derek acted films?", taken from the test set of MetaQA 2-hop.
The figure on the left shows the predictions of the unconscious phase. It shows that the unconscious phase successfully filters out the candidates that are unlikely to be correct answers; these predictions provide informative evidence for the subsequent conscious phase. The figure on the right shows the predictions of the conscious phase: based on the retrieved evidence, the conscious phase successfully ranks the correct answers first.

Figure 5: Predictions of DCRN on the question "who is listed as screenwriter of John Derek acted films?". We exhibit the 2-hop subgraph of the topic entity John Derek. A deeper color for an entity indicates a higher prediction score as a correct answer.

Conclusion
Multi-hop question answering over knowledge graphs aims to answer questions by multi-hop reasoning over KGs. In this work, we propose a novel Deep Cognitive Reasoning Network (DCRN), which is inspired by the dual process theory in cognitive science. DCRN accurately identifies answers with two phases: the unconscious phase and the conscious phase. Experiments demonstrate that our model outperforms state-of-the-art methods on benchmark datasets.