Efficient Hyper-parameter Search for Knowledge Graph Embedding

While hyper-parameters (HPs) are important for knowledge graph (KG) embedding, existing methods fail to search them efficiently. To solve this problem, we first analyze the properties of different HPs and quantify the transferability from a small subgraph to the large graph. Based on the analysis, we propose an efficient two-stage search algorithm, which explores HP configurations on a small subgraph in the first stage and transfers the top configurations for fine-tuning on the large whole graph in the second stage. Experiments show that our method consistently finds better HPs than the baseline algorithms under the same time budget, achieving a 10.8% average relative improvement for four embedding models on large-scale KGs from the Open Graph Benchmark.

Table: The ranges of HPs. Conditioned HPs are shown in parentheses. "adv." and "reg." are short for "adversarial" and "regularization", respectively. Please refer to Appendix A for more details.

As in Figure 3, the batch size and dimension size show higher consistency than the other HPs.

Hence, the evaluation of the configurations can be [...]

We first investigate the HPs that have an influence on the evaluation cost. Then, we analyze the evaluation transferability from the small subgraph to the whole graph.

Cost of different HPs. The cost of each HP value θ ∈ X_i is averaged over the different anchor configurations in X̄_i, different models, and datasets. We [...] between small subgraphs and the whole graph.
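As a small illustration of this averaging, the sketch below groups hypothetical evaluation records by the value of one HP and averages the measured cost over anchor configurations, models, and datasets. The record layout, column names, and numbers are invented for illustration only.

    import pandas as pd

    # Hypothetical evaluation log: one row per (anchor configuration, model, dataset)
    # with the value chosen for the HP under study and the measured evaluation cost.
    records = pd.DataFrame({
        "hp_value": [256, 256, 512, 512, 1024, 1024],   # e.g. batch size
        "model":    ["TransE", "ComplEx", "TransE", "ComplEx", "TransE", "ComplEx"],
        "dataset":  ["FB15k-237"] * 6,
        "cost_sec": [120.0, 150.0, 200.0, 240.0, 380.0, 450.0],
    })

    # Average cost of each HP value over anchor configurations, models, and datasets.
    avg_cost = records.groupby("hp_value")["cost_sec"].mean()
    print(avg_cost)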

First, we study how to sample subgraphs. There are several approaches to sample small subgraphs from a large graph (Leskovec and Faloutsos, 2006). We compare four representative approaches in Figure 6, i.e., Pagerank node sampling (Pagerank), [...]; the consistency between the subgraph and the whole graph is evaluated by the SRCC in (4). We observe that multi-start random walk is the best among the different sampling methods.
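To make the chosen sampling step concrete, below is a minimal sketch of a multi-start random walk that visits nodes until a target fraction of entities is covered (20% in Algorithm 1) and then returns the induced subgraph. The networkx-based implementation, the restart probability, and the number of walkers are illustrative assumptions; the paper's actual sampler may differ.

    import random
    import networkx as nx

    def multi_start_random_walk(graph, target_ratio=0.2, num_starts=10, restart_prob=0.15):
        """Sample a subgraph covering roughly target_ratio of the entities.

        Assumes the graph is connected enough for the walkers to reach the target size.
        """
        target_size = int(target_ratio * graph.number_of_nodes())
        starts = random.sample(list(graph.nodes()), num_starts)
        visited = set(starts)
        walkers = list(starts)
        while len(visited) < target_size:
            for i, node in enumerate(walkers):
                neighbors = list(graph.neighbors(node))
                if not neighbors or random.random() < restart_prob:
                    walkers[i] = starts[i]        # restart the walker at its start node
                else:
                    walkers[i] = random.choice(neighbors)
                visited.add(walkers[i])
                if len(visited) >= target_size:
                    break
        return graph.subgraph(visited).copy()     # induced subgraph on visited entities

In practice the walk would operate on the training triplets viewed as a graph over entities, and the sampled KG would keep only the triplets whose head and tail both fall in the visited entity set.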

Apart from directly transferring the evaluation [...]

Algorithm 1
Require: KG embedding model F, dataset D, and budget B;
1: reduce the search space X to X̄ and decouple X̄ to X̂;
# stage one: efficient evaluation on the subgraph
2: sample a subgraph Ĝ (with 20% of the entities) from D_tra by multi-start random walk;
3: repeat
4: sample a configuration x̂ from X̂ by RF+BORE;
5: evaluate x̂ on the subgraph Ĝ to get the performance;
6: update the RF with the record (x̂, M(F(P*, x̂), Ĝ_val));
7: until B/2 of the budget is exhausted;
8: save the top-10 configurations in X̂*;
# stage two: fine-tune the top configurations
9: increase the batch/dimension size in X̂* to get X̄*;
10: set y* = 0 and initialize the RF surrogate;
11: repeat
12: select a configuration x̄* from X̄* by RF+BORE;
13: evaluate x̄* on the whole graph G to get the performance;
14: update the RF with the record [...]

[...] one day's budget. As in Figure 9(b), the size of [...]

Basically, the KG embedding models use a scoring function f and the model parameters P to measure the plausibility of triplets. We learn the embeddings such that the positive and negative triplets can be separated by f and P. In Table 4, we provide the form of f for each embedding model we use to evaluate the search space X in Section 3.

Table 4: Definitions of the embedding models. • is a rotation operation in the complex value space; ⊗ is the Hermitian dot product in the complex value space; Re(·) returns the real part of a complex value; W_{ijk} is the (i, j, k)-th element of a core tensor W ∈ R^{d×d×d}; and conv is a convolution operator on the head and relation embeddings. For more details, please refer to the corresponding references.
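To make the notation of Table 4 concrete, the sketch below shows a few scoring functions in their standard published forms (DistMult, ComplEx, and RotatE). It is illustrative only: it does not cover all models in Table 4, and the margin value used in the RotatE score is a placeholder, not a value from the paper.

    import numpy as np

    def distmult_score(h, r, t):
        # DistMult: <h, r, t> = sum_i h_i * r_i * t_i (real-valued embeddings).
        return np.sum(h * r * t)

    def complex_score(h, r, t):
        # ComplEx: Re(<h, r, conj(t)>), with complex-valued embeddings.
        return np.real(np.sum(h * r * np.conj(t)))

    def rotate_score(h, r, t, gamma=9.0):
        # RotatE: gamma - || h rotated by r - t ||_1, where r holds rotation phases.
        rotation = np.exp(1j * r)                    # unit-modulus complex rotation
        return gamma - np.linalg.norm(h * rotation - t, ord=1)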
• Margin ranking (MR) loss. The loss is defined as [...], where γ > 0 is the margin value and |a|_+ = max(a, 0). The MR loss is widely used in early models, like TransE (Bordes et al., 2013) and DistMult (Yang et al., 2015). The value of γ, conditioned on the MR loss, is another HP to search.

• Binary cross entropy (BCE) loss. It is typical to cast the positive and negative triplets as a binary classification problem. Let the labels for the positive and negative triplets be +1 and −1, respectively; the BCE loss is defined as [...], where α > 0 is the adversarial weight conditioned on the BCE_adv loss.

• Cross entropy (CE) loss. Since the number of negative triplets is fixed, we can also regard (h, r, t) as the true label over the negative ones. The loss can be written as [...], where the left part is the score of the positive triplet and the right part is the log-sum of the scores over the joint set of positive and negative triplets.
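Because the exact equations are not recoverable from this copy, the sketch below gives the standard forms of the three losses as used in the cited literature: the margin ranking loss of TransE, the self-adversarial BCE loss (the BCE_adv variant popularized by RotatE), and the CE loss over the joint set of positive and negative scores. Tensor shapes, the defaults for γ and α, and the function names are our own choices.

    import torch
    import torch.nn.functional as F

    def margin_ranking_loss(pos_score, neg_score, gamma=6.0):
        # [gamma - f(pos) + f(neg)]_+ averaged over negatives (standard MR form).
        return torch.clamp(gamma - pos_score.unsqueeze(1) + neg_score, min=0).mean()

    def bce_adv_loss(pos_score, neg_score, alpha=1.0):
        # Self-adversarial BCE: negatives weighted by softmax(alpha * f(neg)).
        pos_term = F.logsigmoid(pos_score).mean()
        weights = torch.softmax(alpha * neg_score, dim=1).detach()
        neg_term = (weights * F.logsigmoid(-neg_score)).sum(dim=1).mean()
        return -(pos_term + neg_term)

    def ce_loss(pos_score, neg_score):
        # CE: -f(pos) + log-sum-exp over the joint set of positive and negative scores.
        all_scores = torch.cat([pos_score.unsqueeze(1), neg_score], dim=1)
        return (-pos_score + torch.logsumexp(all_scores, dim=1)).mean()

    # Toy usage: batch of 4 positives, each with 16 negatives.
    pos, neg = torch.randn(4), torch.randn(4, 16)
    print(margin_ranking_loss(pos, neg), bce_adv_loss(pos, neg), ce_loss(pos, neg))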

To better analyze the continuous HPs, we discretize them in Table 5 according to their ranges. Then, for each HP i = 1 . . . n with range X_i, we sample a set X̄_i ⊂ X of s anchor configurations through quasi-random search (Bergstra and Bengio, 2012) and evaluate them uniformly across the different embedding models and datasets (see Figure 13).
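As a rough sketch of how such anchor configurations could be drawn, the snippet below uses a Sobol sequence (one common quasi-random design) and maps each coordinate onto a discretized HP grid. The HP names and grids are placeholders and do not reproduce Table 5.

    import numpy as np
    from scipy.stats import qmc

    # Hypothetical discretized HP grids (placeholders, not the paper's Table 5).
    grids = {
        "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
        "batch_size":    [128, 256, 512, 1024],
        "dimension":     [100, 200, 500, 1000, 2000],
    }

    def sample_anchor_configurations(n, seed=0):
        sampler = qmc.Sobol(d=len(grids), scramble=True, seed=seed)
        points = sampler.random(n)                     # quasi-random points in [0, 1)^d
        configs = []
        for p in points:
            cfg = {name: grid[int(u * len(grid))]      # map each coordinate onto its grid
                   for (name, grid), u in zip(grids.items(), p)}
            configs.append(cfg)
        return configs

    print(sample_anchor_configurations(8))             # powers of two balance the Sobol design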

In addition, we provide the details of Spearman's rank correlation coefficient (SRCC). Given a set of anchor configurations X̄_i to analyze the i-th HP, we denote r(x, θ) as the rank of different x ∈ X̄_i with fixed x_i = θ. Then, the SRCC between two HP values θ_1, θ_2 ∈ X_i is

SRCC(θ_1, θ_2) = 1 − 6 Σ_{x ∈ X̄_i} (r(x, θ_1) − r(x, θ_2))² / ( |X̄_i| (|X̄_i|² − 1) ),    (4)
where |X̄_i| is the number of anchor configurations in X̄_i. We evaluate the consistency of the i-th HP by averaging the SRCC over the different pairs (θ_1, θ_2) ∈ X_i × X_i, the different models, and the datasets.
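A minimal sketch of this consistency computation, assuming we already have the validation performance of the same anchor configurations under two values of the i-th HP; scipy's spearmanr is used in place of writing out the rank formula in (4), and the numbers are invented.

    import numpy as np
    from scipy.stats import spearmanr

    # Validation MRR of the same anchor configurations evaluated with two values
    # of the HP under study (made-up numbers for illustration).
    perf_theta1 = np.array([0.31, 0.28, 0.35, 0.22, 0.30])
    perf_theta2 = np.array([0.29, 0.27, 0.36, 0.20, 0.31])

    srcc, _ = spearmanr(perf_theta1, perf_theta2)   # rank correlation of the two orderings
    print(f"SRCC between the two HP values: {srcc:.3f}")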

We show the reduced and decoupled search space, compared with the full space, in [...]. Hence, the reduced and decoupled space is hundreds of times smaller than the full space. In addition, we show the details of the search procedure by RF+BORE in Algorithm 2.

Algorithm 2 Full procedure of HP search with RF+BORE (in stage one)
Require: KG embedding F, dataset G, search space X̂, budget B/2, RF model y = c(x), threshold τ = 0.8.
1: initialize the RF model and H = ∅;
2: split the triplets in G with ratio 9 : 1 into G_tra and G_val;
3: repeat
[...]
9: set label 0 for the configurations in H with ŷ_x < τ, and label 1 for those with ŷ_x ≥ τ;
10: update the RF model y = c(x) to classify the two labels;
11: until B/2 is exhausted.
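To illustrate the RF+BORE step, the sketch below follows the BORE idea of turning the surrogate into a classifier: configurations in the history are labelled 1 when their observed performance reaches the τ-quantile (τ = 0.8) and 0 otherwise, a random forest is fitted on these labels, and the next configuration is the candidate with the highest predicted probability of the good class. The feature encoding, the candidate generation, and the reading of the threshold as a performance quantile are our assumptions, not the paper's code.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def propose_next(history_x, history_y, candidates, tau=0.8, seed=0):
        """history_x: (n, d) encoded configurations; history_y: (n,) validation MRR;
        candidates: (m, d) encoded candidate configurations to choose from."""
        threshold = np.quantile(history_y, tau)
        labels = (history_y >= threshold).astype(int)          # 1 = "good" configuration
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(history_x, labels)
        prob_good = clf.predict_proba(candidates)[:, 1]        # acquisition = P(good | x)
        return candidates[np.argmax(prob_good)]

    # Toy usage with random encodings (for illustration only).
    rng = np.random.default_rng(0)
    hx, hy = rng.random((20, 5)), rng.random(20)
    cand = rng.random((50, 5))
    print(propose_next(hx, hy, cand))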
In Algorithm 1, we increase the batch size and dimension size in stage two. We set the search range for the batch size in stage two as [512, 1024] and for the dimension size as [1000, 2000]. There are some exceptions due to memory issues, i.e., the dimension size for RESCAL is in [500, 1000] and the dimension size for TuckER is in [200, 500]. For ogbl-wikikg2, since the GPU we used only has 24GB of memory, we cannot run the models with 500 dimensions used on the OGB leaderboard, which require much more memory. Instead, we set the dimension to 100 to be consistent with the smaller models with 100 dimensions on the OGB leaderboard, and increase the batch size in [512, 1024] in the second stage.

D.1 Implementation details

Evaluation metrics. We follow (Bordes et al., 2013; Wang et al., 2017; Ruffinelli et al., 2019) to use the filtered ranking-based metrics for evaluation. For each triplet (h, r, t) in the validation or testing set, we take the head prediction (?, r, t) and the tail prediction (h, r, ?) as the link prediction tasks. The filtered rankings on the head and tail are computed as [...], respectively, where | · | is the number of elements in a set. The two metrics used are:

• Mean reciprocal ranking (MRR): the average of the reciprocals of all the obtained rankings.

• Hit@k: the ratio of ranks no larger than k.

For both metrics, a larger value indicates better performance.
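A minimal sketch of these filtered metrics, assuming the scores of all candidate entities for one query are available as a vector and that the indices of the other known true answers form the filter set; tie handling and the data structures in the actual implementation may differ.

    import numpy as np

    def filtered_rank(scores, true_idx, filter_idx):
        """scores: (num_entities,) scores of all candidate entities for one query;
        true_idx: index of the correct entity; filter_idx: indices of other known
        true entities for this query, removed before ranking."""
        mask = np.ones_like(scores, dtype=bool)
        mask[list(filter_idx)] = False
        mask[true_idx] = True                       # always keep the target itself
        higher = np.sum(scores[mask] > scores[true_idx])
        return int(higher) + 1                      # ranks start at 1

    def mrr_and_hits(ranks, k=10):
        ranks = np.asarray(ranks, dtype=float)
        return (1.0 / ranks).mean(), (ranks <= k).mean()

    # Toy usage: 5 candidate entities, entity 2 is correct, entity 0 is another known answer.
    ranks = [filtered_rank(np.array([0.9, 0.1, 0.8, 0.3, 0.2]), true_idx=2, filter_idx={0})]
    print(mrr_and_hits(ranks, k=10))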

Dataset statistics. We summarize the statistics of different benchmark datasets in Table 8. As shown,