Prediction and Calibration: Complex Reasoning over Knowledge Graph with Bi-directional Directed Acyclic Graph Neural Network

Answering complex logical queries is a challenging task in knowledge graph (KG) reasoning. Recently, query embedding (QE) has been proposed to encode queries and entities into the same vector space and obtain answers via numerical computation. However, such models compute the representation of each node in a query only from its predecessor nodes, ignoring the information contained in successor nodes. In this paper, we propose a Bi-directional Directed Acyclic Graph neural network (BiDAG) that splits the reasoning process into prediction and calibration. The calibration process accounts for the joint probability of all nodes by applying a graph neural network (GNN) to the query graph. With prediction in the first layer and calibration in the deeper layers of the GNN, BiDAG outperforms previous QE based methods on FB15k, FB15k-237, and NELL995.

Recently, Query Embedding (QE) models (Hamilton et al., 2018; Ren et al., 2020) have been proposed to jointly encode logical queries and entities into the same vector space, and then retrieve answers (entities) based on similarity scores.
Although QE models can obtain answers in linear time and implicitly reason over incomplete KGs by iteratively predicting the representations of intermediate and target nodes, such models compute the representation of the current node only from its predecessor nodes, which causes two problems. (1) The joint probability of all nodes in the query graph is ignored. Take the example in Figure 1: the probability distribution of node V1 becomes more concentrated on Japan and China after considering node A1. (2) The information contained in successor nodes is ignored. As shown in Figure 1, the type of node V1 can only be country after considering the successor relation nationality. To address these drawbacks, we propose a novel QE based method called Bi-directional Directed Acyclic Graph neural network (BiDAG), which splits the reasoning process into two stages. (1) Prediction obtains the initial representation of each node by aggregating information from its predecessor nodes, similarly to previous QE models. (2) Calibration extends the original unidirectional query graph to a bidirectional graph and applies a GNN to it. In this way, BiDAG can take the joint probability into account, as each node is continuously calibrated by information from both its predecessor and successor nodes.
Our contributions can be summarized as follows: (1) We propose a framework for complex query answering (CQA) that predicts first and then calibrates, which enables the model to take the joint probability of all nodes into account.
(2) We conduct experiments on three standard benchmarks and show that calibration improves model performance significantly. The source code and data can be found at https://github.com/YaooXu/BiDAG.

Related Work
Modeling entity and query representations and logical operators is a central issue for QE models. GQE (Hamilton et al., 2018) answers conjunctive queries by representing queries and entities as points in Euclidean space. To represent queries with a large set of answer entities, Query2Box (Ren et al., 2020) uses hyper-rectangles to encode queries. By converting union queries into Disjunctive Normal Form (DNF) (Davey and Priestley, 2002), Query2Box can handle arbitrary existential positive first-order (EPFO) queries (i.e., queries that include any combination of ∧, ∨, and ∃). To further support the negation operator (¬), BetaE (Ren and Leskovec, 2020) encodes entities and queries as Beta distributions, supporting the full set of FOL operations. MLPMix (Amayuelas et al., 2022) uses MLP-Mixer (Tolstikhin et al., 2021) to model logical operators. By encoding each query as multiple points in the vector space, Query2Particles (Bai et al., 2022) can retrieve a set of diverse answers from the embedding space. In this paper, we not only predict the intermediate and target node representations but also continually calibrate them by modeling the joint probability of all nodes in the query graph.

Preliminary
In this section, we formally describe the task of complex query answering over KGs. We denote a KG as G = (V, R), where each v ∈ V represents an entity, and each r ∈ R represents a binary function r : V × V → {0, 1} that indicates whether the directed relation r holds between a pair of entities.
First-order logic queries Complex queries over KGs are described in first-order logic (FOL) with operators such as existential quantification (∃), conjunction (∧), disjunction (∨), and negation (¬). A complex query q consists of a set of anchor entities Va ⊆ V, existentially quantified variables V1, ..., Vk, and a single target variable V?. The disjunctive normal form (DNF) of a FOL query q is defined as

q[V?] = V? . ∃V1, ..., Vk : c1 ∨ c2 ∨ ... ∨ cn,

where each ci = e_{i1} ∧ e_{i2} ∧ ... ∧ e_{im} is a conjunction of literals, and each literal e_{ij} is an atomic formula of the form r(va, V) or r(V, V′), with anchor node va ∈ Va and variables V, V′ ∈ {V?, V1, ..., Vk}.

Computation Graph Each logical query can be converted into a corresponding computation graph in the form of a directed acyclic graph (DAG, as shown in Figure 1 (B)), where each node corresponds to an entity and each edge corresponds to a logical operation. The logical operations are defined as follows.
(1) Relation projection: given a set of entities S ⊆ V and a relation r ∈ R, relation projection returns all entities related to S by r, i.e., ∪_{v∈S} {v′ ∈ V : r(v, v′) = 1}. (2) Intersection/union: given sets of entities {S1, ..., Sn}, compute their intersection ∩_{i=1}^n Si or union ∪_{i=1}^n Si. Note that in QE models all of these operations are executed in the embedding space, so the target node representation can be obtained by iteratively computing node representations following the neural logic operators in the DAG.
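To make the set semantics concrete before moving to the embedding space, here is a minimal sketch of the logical operations on a toy KG stored as (head, relation, tail) triples. The entity and relation names are purely illustrative, not from the paper.

```python
def project(triples, entities, relation):
    """Relation projection: all tails reachable from `entities` via `relation`."""
    return {t for (h, r, t) in triples if r == relation and h in entities}

def intersect(*entity_sets):
    """Set intersection of entity sets."""
    return set.intersection(*entity_sets)

def union(*entity_sets):
    """Set union of entity sets."""
    return set.union(*entity_sets)

# Toy KG (illustrative triples only).
toy_kg = {
    ("TsingHua", "located_in", "China"),
    ("Tokyo_Univ", "located_in", "Japan"),
    ("Alice", "graduated_from", "TsingHua"),
}

universities = {"TsingHua", "Tokyo_Univ"}
assert project(toy_kg, universities, "located_in") == {"China", "Japan"}
```

QE models replace these exact set operations with differentiable operators over embeddings, which is what allows reasoning over missing triples.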

Bi-directional Directed Acyclic Graph Neural Network
The key idea of BiDAG is to use the information of predecessor nodes to obtain the current node representation and then calibrate that representation with global information, as shown in Figure 2. Specifically, BiDAG includes two modules: 1) a representation prediction module; 2) a representation calibration module. From the GNN perspective, BiDAG can be regarded as a stack of one prediction module (the first layer) and multiple calibration modules (the deeper layers).

Representation prediction
In this module, we define the neural logic operations. The representation of each node is obtained by applying these operations to the representations of its predecessor nodes.
Projection Given a node embedding h and an edge embedding r, the projection operator P outputs a new node embedding h′ = P(h, r). In contrast to the geometric projection operators and multi-layer perceptrons (MLPs) used in previous works (Hamilton et al., 2018; Ren and Leskovec, 2020), we use a gating mechanism to dynamically adjust the transformation of each node embedding under a specific relation, implemented with a Gated Recurrent Unit (GRU) (Cho et al., 2014):

h′ = GRU(r, h),

where r, h, and h′ are treated as the input, past state, and updated state/output of the GRU, respectively.
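The projection operator above can be sketched with a plain-numpy GRU cell, treating the relation embedding r as the input and the node embedding h as the past state. The weight initialization and dimension d are illustrative; in the model these parameters are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative)

# Randomly initialized GRU parameters (learned in the actual model).
Wz, Uz, bz = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
Wg, Ug, bg = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
Wh, Uh, bh = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_projection(r_emb, h):
    """h' = GRU(r, h): relation embedding as input, node embedding as past state."""
    z = sigmoid(Wz @ r_emb + Uz @ h + bz)          # update gate
    g = sigmoid(Wg @ r_emb + Ug @ h + bg)          # reset gate
    h_tilde = np.tanh(Wh @ r_emb + Uh @ (g * h) + bh)  # candidate state
    return (1 - z) * h + z * h_tilde               # gated combination h'
```

The gates let the operator decide, per dimension, how much of the old node state to keep versus how much of the relation-conditioned update to apply.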
Intersection We model the intersection of a set of query embeddings {q1, ..., qn} as their weighted sum, which can be regarded as performing set intersection in the embedding space. We implement it with an attention mechanism:

q_inter = Σ_{i=1}^n α_i q_i,  α_i = exp(MLP(q_i)) / Σ_{j=1}^n exp(MLP(q_j)),

where q_inter is the intersection of the query embeddings, α_i is the weight of query embedding q_i, and MLP is a multi-layer perceptron that takes q_i as input and outputs a single attention scalar.
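A minimal sketch of this attention-weighted intersection follows. The scoring function stands in for the paper's learned MLP; here it is a fixed linear map, which is an assumption for illustration only.

```python
import numpy as np

def attention_intersection(queries, score_fn):
    """Intersection as an attention-weighted sum of query embeddings.

    score_fn plays the role of the MLP in the paper: it maps one query
    embedding to a single attention scalar.
    """
    scores = np.array([score_fn(q) for q in queries])
    alphas = np.exp(scores - scores.max())   # numerically stable softmax
    alphas = alphas / alphas.sum()
    q_inter = sum(a * q for a, q in zip(alphas, queries))
    return q_inter, alphas

# Toy usage: a fixed linear scorer standing in for the learned MLP.
rng = np.random.default_rng(0)
w = rng.normal(size=4)
queries = [rng.normal(size=4) for _ in range(3)]
q_inter, alphas = attention_intersection(queries, lambda q: float(w @ q))
```

Because the weights form a convex combination, the intersection embedding always lies inside the convex hull of the branch embeddings.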
Union Following Ren et al. (2020), we handle queries with union operators by transforming them into the equivalent Disjunctive Normal Form (DNF). The original query is thereby transformed into the union of n conjunctive queries that contain no union operator, whose embeddings {q1, ..., qn} can then be obtained with the operators above.
The distance between a query q and a candidate answer entity e is defined as

d(q, e) = min({sim(q1, e), ..., sim(qn, e)}),

where {q1, ..., qn} are the embeddings of the conjunctive queries, e is the embedding of entity e, and sim is a similarity function such as the cosine function.
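This min-over-branches distance can be sketched as follows. Using 1 − cosine similarity as the per-branch distance is an illustrative choice on my part; the paper only fixes the min aggregation, so the exact per-branch function is an assumption.

```python
import numpy as np

def cosine_distance(q, e):
    """Per-branch distance: 1 - cosine similarity (illustrative choice)."""
    return 1.0 - float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e)))

def dnf_distance(conj_query_embs, entity_emb, dist=cosine_distance):
    """Distance of a union (DNF) query to an entity: an entity is a good
    answer if it is close to ANY conjunctive branch, hence the min."""
    return min(dist(q, entity_emb) for q in conj_query_embs)
```

Taking the minimum means adding a branch can only keep or shrink the distance, mirroring the monotonicity of set union.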

Representation calibration
In this module, the representation of each node is continuously calibrated by the context information contained in its predecessor and successor nodes, which addresses the drawback of ignoring the joint probability of all nodes. Context aggregation is performed by the multi-head attention mechanism (Vaswani et al., 2017) in a GNN, as first introduced for graphs by GAT (Velickovic et al., 2018). Compared to the attention mechanism in GAT, which uses a shared linear transformation for all nodes, we make the following improvements: (1) we extend the graph attention mechanism to handle directed relational graphs such as KGs; (2) we introduce three weight matrices Q ∈ R^{d×d} and K, V ∈ R^{d×2d} as the query, key, and value matrices, enabling the model to capture higher-level information among neighbor nodes; (3) to let the model choose what to retain and what to update, we use a GRU to update node representations during calibration, as first used by Li et al. (2017); (4) to prevent the calibrated representation from drifting too far from the original one, we adopt a residual connection (He et al., 2015) to adjust the original representation at each step. The representation of node j at the (t + 1)-th calibration is defined as follows (for simplicity, we only consider single-head self-attention):

w_{i,j} = (Q h_j^t)^T (K [h_i^t ∥ e_{i,j}]),
α_{i,j} = exp(LeakyReLU(w_{i,j})) / Σ_{k∈N(j)} exp(LeakyReLU(w_{k,j})),
h_j^{t+1} = h_j^t + GRU(Σ_{i∈N(j)} α_{i,j} (V [h_i^t ∥ e_{i,j}]), h_j^t),

where ∥ denotes concatenation, h_j^t is the representation of node j after the t-th calibration, e_{i,j} is the representation of the edge from node i to j, α_{i,j} are the attention coefficients, and N(j) is the set of neighbor nodes of node j.
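A single calibration step can be sketched in plain numpy as below. The score function is reconstructed from the fragments in the text (the matrices Q, K, V and the LeakyReLU-softmax coefficients), so its exact form is an assumption; the `update` callable stands in for the learned GRU state update.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # embedding dimension (illustrative)

# Illustrative parameters matching the shapes in the text:
# Q in R^{d x d}, K and V in R^{d x 2d}.
Q = rng.normal(size=(d, d)) / np.sqrt(d)
K = rng.normal(size=(d, 2 * d)) / np.sqrt(2 * d)
V = rng.normal(size=(d, 2 * d)) / np.sqrt(2 * d)

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def calibrate(h_j, neighbors, update):
    """One calibration step for node j with single-head attention.

    neighbors: list of (h_i, e_ij) pairs (neighbor embedding, edge embedding),
    covering both predecessors and successors in the bidirectional graph.
    update: stands in for the GRU update; the residual is the `h_j +` term.
    """
    cats = [np.concatenate([h_i, e_ij]) for h_i, e_ij in neighbors]
    w = [leaky_relu(float((Q @ h_j) @ (K @ c))) for c in cats]  # scores
    a = np.exp(np.array(w) - max(w))
    a /= a.sum()                                        # attention coefficients
    m_j = sum(ai * (V @ c) for ai, c in zip(a, cats))   # aggregated context
    return h_j + update(m_j, h_j)                       # residual connection
```

Because every node attends over both in- and out-edges, information flows against the original edge directions, which is precisely what lets successor nodes calibrate their predecessors.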

Model Training
Our objective is to minimize the distance between a query embedding and its answers while maximizing the distance between the query embedding and random negative entities via negative sampling (Bordes et al., 2013):

L = −log σ(γ − d(q, e)) − (1/k) Σ_{j=1}^k log σ(d(q, e_j) − γ),

where e_j represents a random negative sample, γ is the margin, σ is the sigmoid function, and d(q, e) is the distance between query q and entity e.

Experiments

Following Ren et al. (2020), we consider nine query types for evaluation and use the same evaluation protocol as Query2Box (Ren et al., 2020). Details about the datasets and query types can be found in Appendix A.
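The negative-sampling objective described above can be sketched as follows; the margin value and the Query2Box-style form of the loss are assumptions noted in the comments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def training_loss(d_pos, d_negs, gamma=12.0):
    """Margin-based negative sampling loss (assumed Query2Box/BetaE-style form):
    L = -log sigmoid(gamma - d(q, e)) - (1/k) sum_j log sigmoid(d(q, e_j) - gamma).
    gamma=12.0 is an illustrative margin, not a value from the paper.
    """
    pos = -np.log(sigmoid(gamma - d_pos))                     # pull answer inside margin
    neg = -np.mean([np.log(sigmoid(dn - gamma)) for dn in d_negs])  # push negatives out
    return float(pos + neg)
```

A query whose answer is close and whose negatives are far should incur a much smaller loss than the reverse, which the sigmoid-margin terms enforce smoothly.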
The results are reported in Table 1. More details can be found in Appendix B.
From the table, we can find that: (1) BiDAG demonstrates an average relative improvement in Mean Reciprocal Rank (MRR) of 3.2%, 3.9%, and 5.4% over previous QE based models on the FB15k, FB15k-237, and NELL995 datasets, respectively.
(2) The residual connection consistently improves model performance on all datasets, indicating that it is essential to the calibration process.
Even with the naive strategy of representing queries as point vectors, as in GQE, BiDAG achieves a significant performance gain over all baselines. Furthermore, BiDAG also performs well on conjunctive queries (2i/3i). In our opinion, the main reason is that the target node has more predecessor nodes, which provide more information for calibration. All these results demonstrate that calibration is helpful in complex query answering.
Ablation Study for BiDAG To better demonstrate the effectiveness of bi-directional calibration (BC), we conduct further ablation studies with different settings on FB15k. The experimental results are shown in Table 2. From the table, we find that, compared to BiDAG-0BC (the model without calibration), calibration improves performance significantly. Moreover, the marked improvement on multi-hop queries (2p/3p) demonstrates that calibration can also effectively alleviate error cascading.
Further study of the effect of calibration To further investigate how calibration affects the node representations in each layer, we record the relative change of the calibrated representations with respect to the initial representations (the layer-0 representations obtained by the prediction module), defined as

c_t = ||h_tgt^t − h_tgt^0|| / ||h_tgt^0||,

where h_tgt^0 is the initial representation of the target node and h_tgt^t is its representation after the t-th calibration. The larger the value of c_t, the greater the difference between the t-th calibrated representation and the initial representation.

Table 2: Ablation studies of BiDAG on FB15k. BC denotes bi-directional calibration (e.g., BiDAG-2BC uses two bi-directional calibration layers).
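As a sketch, the relative-change statistic can be computed as below; the choice of the Euclidean norm is an assumption, since the text does not specify which norm is used.

```python
import numpy as np

def relative_change(h_t, h_0):
    """c_t = ||h_t - h_0|| / ||h_0||: relative change of the calibrated
    target-node representation from the initial (layer-0) one.
    The Euclidean norm is an illustrative choice."""
    return float(np.linalg.norm(h_t - h_0) / np.linalg.norm(h_0))
```

Identical representations give c_t = 0, and c_t grows as calibration moves the target-node embedding further from its initial prediction.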
As shown in Figure 3, we find that: (1) Throughout training, the relative change of the final representations (c_3, the green line) first increases and then decreases. This suggests that at the early stages of training the initial representations are insufficiently accurate, so the calibration mechanism changes them substantially to reach the correct answers; as training progresses, the initial representations become increasingly precise, and the influence of later calibration diminishes. (2) In the middle and late stages of training, the values of c_1 (the blue line) and c_2 (the orange line) rise slowly while c_3 remains stable. This implies that the first two calibration steps remain crucial even as the initial representations become increasingly accurate.

Conclusion
In this paper, we propose BiDAG, a query embedding method for answering complex queries over incomplete KGs. BiDAG splits the reasoning process into prediction and calibration. In the calibration process, the joint probability of all nodes is considered by applying a GNN to the query graph extended with bidirectional message passing. Extensive experiments on multiple open datasets demonstrate that BiDAG outperforms previous QE based models and confirm the effectiveness of calibration in CQA.

Limitations
There are three main limitations of our approach. (1) Our model cannot handle the negation operation; enabling BiDAG to support negation is a direction for future work. (2) Our modeling of query representations and logical operators is simple; more sophisticated modeling is another direction for future work. (3) The training process cannot be parallelized well, which is a common drawback of QE models, since node representations must be predicted one by one.

Ethics Statement
This paper proposes a method for complex query answering in knowledge graph reasoning, and the experiments are conducted on publicly available datasets. As a result, there are no data privacy concerns. Moreover, this paper does not involve human annotation, so there are no related ethical concerns.

C Full experimental results
The full results of Comparison with Baselines are shown in Table 4.

ACL 2023 Responsible NLP Checklist

Figure 1: An example and its corresponding computation graph of CQA.

Figure 3: The curves of relative change of representations in each layer at different training steps.

Table 3: Number of training, validation, and test queries generated for each query type.

Table 4: MRR results for existential positive first-order (EPFO) queries on different datasets. "w/o res" indicates without the residual connection; "w/ res" indicates with the residual connection. Bold and underline indicate the top-two results, respectively.