ACTC: Active Threshold Calibration for Cold-Start Knowledge Graph Completion

Self-supervised knowledge-graph completion (KGC) relies on estimating a scoring model over (entity, relation, entity) tuples, for example, by embedding an initial knowledge graph. Prediction quality can be improved by calibrating the scoring model, typically by adjusting the prediction thresholds using manually annotated examples. In this paper, we attempt for the first time cold-start calibration for KGC, where no annotated examples exist initially for calibration, and only a limited number of tuples can be selected for annotation. Our new method ACTC finds good per-relation thresholds efficiently based on a limited set of annotated tuples. In addition to the few annotated tuples, ACTC also leverages unlabeled tuples by estimating their correctness with Logistic Regression or Gaussian Process classifiers. We also experiment with different methods for selecting candidate tuples for annotation: density-based and random selection. Experiments with five scoring models and an oracle annotator show an improvement of 7 percentage points when using ACTC in the challenging setting with an annotation budget of only 10 tuples, and an average improvement of 4 percentage points over different budgets.


Introduction
Knowledge graphs (KG) organize knowledge about the world as a graph in which entities (nodes) are connected by different relations (edges). The knowledge-graph completion (KGC) task aims at adding new information in the form of (entity, relation, entity) triples to the knowledge graph. The main objective is assigning to each triple a plausibility score, which defines how likely it is that this triple belongs to the underlying knowledge base. These scores are usually predicted by knowledge graph embedding (KGE) models. However, most KGC approaches do not make any binary decision: they provide a ranking, not a classification, which does not allow one to use them as-is to populate KGs (Speranskaya et al., 2020). To transform the scores into predictions (i.e., how probable is it that this triple should be included in the KG), decision thresholds need to be estimated. Then, all triples with a plausibility score above the threshold are classified as positive and included in the KG; the others are predicted to be negative and not added to the KG. Since the initial KG includes only positive samples and thus cannot be used for threshold calibration, the calibration is usually performed on a manually annotated set of positive and negative tuples (decision set). However, manual annotation is costly and limited, and, as most knowledge bases include dozens (Ellis et al., 2018), hundreds (Toutanova and Chen, 2015), or even thousands (Auer et al., 2007) of different relation types, obtaining a sufficient amount of labeled samples for each relation may be challenging. This raises a question: how can we efficiently solve the cold-start threshold calibration problem with minimal human input?

Figure 1: ACTC method. The manually annotated samples are used to train a Logistic Regression or Gaussian Processes classifier, which labels the additional tuples using their scores predicted by a KGE model. All annotations (manual and automatic) are later used to estimate the per-relation thresholds.
We propose a new method for Active Threshold Calibration, ACTC, which estimates the relation thresholds by leveraging unlabeled data in addition to human-annotated data. In contrast to existing methods (Safavi and Koutra, 2020; Speranskaya et al., 2020) that use only the annotated samples, ACTC labels additional samples automatically with a trained predictor (a Logistic Regression or Gaussian Process model) estimated on the KGE model scores and the available annotations. A graphical illustration of ACTC is provided in Figure 1.
Our main contributions are:
• We are the first to study threshold tuning in a budget-constrained environment. This setting is more realistic and challenging than in previous works, where large validation sets were used for threshold estimation.
• We propose actively selecting examples for manual annotation, which is also a novel approach for the KGC setting.
• We leverage the unlabeled data to have more labels at a low cost without increasing the annotation budget, which is also a novel approach for the KGC setting.
Experiments on several datasets and with different KGE models demonstrate the efficiency of ACTC for different amounts of available annotated samples, even for as few as one.

Related Work
Knowledge graph embedding methods (Dettmers et al., 2017; Trouillon et al., 2016; Bordes et al., 2013; Nickel et al., 2011) have originally been evaluated on ranking metrics, not on the actual task of triple classification, which would be necessary for KGC. More recent works have acknowledged this problem by creating datasets for evaluating KGC (instead of ranking) and proposing simple algorithms for finding prediction thresholds from annotated triples (Speranskaya et al., 2020; Safavi and Koutra, 2020). In our work, we study the setting where only a limited amount of such annotations can be provided, experiment with different strategies for selecting samples for annotation, and analyze how to use them best. Ostapuk et al. (2019) have studied active learning for selecting triples for training a scoring model for KG triples, but their method cannot perform the crucial step of calibration. Consequently, they only evaluate on ranking metrics and do not measure actual link prediction quality. In contrast, our approach focuses on selecting much fewer samples for optimal calibration of a scoring model (using positive, negative, and unlabeled samples).

ACTC: Active Threshold Calibration
ACTC consists of three parts: selection of samples for manual annotation, automatic labeling of additional samples, and estimating the per-relation thresholds based on all available labels (manual and automatic ones).
The first step is selecting unlabeled samples for human annotation. In ACTC, this can be done in two ways. One option is random sampling from the set of all candidate tuples (ACTC_rndm; the pseudocode can be found in Algorithm 1). However, not all annotations are equally helpful and informative for estimation. To select representative and informative samples from which the system can profit the most, especially with a small annotation budget, we also introduce density-based selection (ACTC_dens), inspired by the density-based selective sampling method in active learning (Agarwal et al., 2020; Zhu et al., 2008); the pseudocode can be found in Algorithm 2 in Appendix A. The sample density is measured by summing the squared distances between this sample's score (predicted by the KGE model) and the scores of the other samples in the unlabeled dataset. The samples with the highest density are selected for human annotation.
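As a minimal sketch, the density-based selection step might be implemented as follows (assuming one-dimensional KGE scores; the function name and the reading of "highest density" as smallest summed squared distance to the remaining scores are our assumptions, not taken from the paper's code):

```python
import numpy as np

def density_select(scores, budget):
    """Density-based candidate selection (sketch of ACTC_dens).

    The density of a sample is measured via the summed squared
    distances between its KGE score and all other scores: a sample
    lying in a dense region has a small summed distance to the rest.
    """
    diffs = scores[:, None] - scores[None, :]
    total_sq_dist = (diffs ** 2).sum(axis=1)
    # Samples with the highest density (smallest summed distance)
    # are proposed for human annotation.
    return np.argsort(total_sq_dist)[:budget]
```

With a budget of 2 and scores clustered around one region plus an outlier, the two samples inside the cluster would be selected.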
In a constrained-budget setting with a limited amount of manual annotations available, some relations may have only a few annotated samples and others none at all. To mitigate this negative effect and to obtain good thresholds even with limited manual supervision, ACTC labels more samples (in addition to the manual annotations) with a classifier that is trained on the manually annotated samples and predicts labels based on the KGE model scores. We experiment with two classifiers: Logistic Regression (ACTC-LR) and Gaussian Processes (ACTC-GP). The amount of automatically labeled samples depends on the hyperparameter n, which reflects the minimal amount of samples needed for estimating each threshold (see the ablation study of different n values in Section 5).

Algorithm 1: ACTC_rndm
Input: unlabeled dataset X, annotation budget size l, minimal decision set size n, KGE model M, classifier C : R → [0, 1]
Output: set of per-relation thresholds T

      # Step 1: select samples for human annotation
 1: T ← empty set of per-relation thresholds
 2: X_gold ← l randomly selected samples from X
 3: manually annotate X_gold with labels y_gold
 4: for each relation r do
 5:   X_gold_r ← samples from X_gold with relation r
 6:   y_gold_r ← manual labels for X_gold_r
 7:   scores_gold_r ← KGE model scores for X_gold_r
 8:   l_r ← |X_gold_r|
      # Step 2: automatically label additional samples
 9:   if n > l_r then
10:     train a classifier C_r on scores_gold_r and y_gold_r
11:     X_auto_r ← n − l_r randomly selected samples from X
12:     scores_auto_r ← KGE model scores for X_auto_r
13:     predict y_auto_r = C_r(scores_auto_r)
14:     X_dec ← (X_gold_r, y_gold_r) ∪ (X_auto_r, y_auto_r)
15:   else
16:     X_dec ← (X_gold_r, y_gold_r)
      # Step 3: estimate the per-relation threshold τ_r
17:   τ ← 0, best_acc ← 0
18:   for score in the scores of X_dec do
19:     τ_i ← score
20:     accuracy_i ← acc(X_dec | τ_i)
21:     if accuracy_i > best_acc then
22:       τ ← τ_i
23:       best_acc ← accuracy_i
24:   T.append(τ)
If the number of samples annotated for a relation r, l_r, is greater than or equal to n, only these l_r annotated samples are used for threshold estimation. If the number of manually annotated samples is insufficient (i.e., less than n), an additional n − l_r samples are randomly selected from the dataset and labeled by the LR or GP classifier. The automatically labeled and manually annotated samples together form the per-relation decision set, which contains at least n samples with (manual or predicted) labels for relation r. The threshold for relation r is later optimized on this decision set.
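A sketch of this automatic-labeling step (Step 2 in Algorithm 1) for the Logistic Regression variant, assuming one-dimensional KGE scores; the function and variable names are illustrative, and the extra samples are taken from the front of the pool here rather than sampled at random:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_decision_set(scores_gold, y_gold, scores_pool, n):
    """Extend the manual annotations to at least n labels per relation.

    If the relation already has n or more manual labels, they are used
    as-is; otherwise a classifier is trained on the manual labels and
    labels n - l_r additional samples automatically.
    """
    l_r = len(scores_gold)
    if l_r >= n:
        return scores_gold, y_gold
    clf = LogisticRegression().fit(scores_gold.reshape(-1, 1), y_gold)
    extra = scores_pool[: n - l_r]          # ACTC samples these at random
    y_auto = clf.predict(extra.reshape(-1, 1))
    return (np.concatenate([scores_gold, extra]),
            np.concatenate([y_gold, y_auto]))
```

The returned arrays form the decision set on which the threshold for this relation is optimized.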
The final part of the algorithm is the estimation of the relation-specific thresholds. Each sample score from the decision set is tried out as a potential threshold; the relation-specific threshold that maximizes the local accuracy (calculated on this decision set) is selected.
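This threshold search (Step 3 in Algorithm 1) can be sketched as follows, with illustrative names; `scores` and `labels` stand for the decision set described above:

```python
import numpy as np

def estimate_threshold(scores, labels):
    """Try every decision-set score as a candidate threshold and keep
    the one that maximizes accuracy on the decision set."""
    best_tau, best_acc = 0.0, -1.0
    for tau in scores:
        preds = (scores >= tau).astype(int)   # positive iff score >= threshold
        acc = float((preds == labels).mean())
        if acc > best_acc:
            best_tau, best_acc = float(tau), acc
    return best_tau
```

On a decision set whose scores are perfectly separable, the lowest score of the positive class is returned.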

Experiments
We evaluate our method on two KGC benchmark datasets extracted from Wikidata and augmented with manually verified negative samples: CoDEx-S and CoDEx-M (Safavi and Koutra, 2020). Some details on their organization are provided in Appendix B. The KGE models are trained on the training sets. The ACTC algorithm is applied on the validation sets: the gold validation labels are taken as an oracle (manual annotations; in an interactive setting, they would be presented to human annotators on the fly); the remaining samples are used unlabeled. The test set is not exploited during ACTC training and serves solely for testing purposes. The dataset statistics are provided in Table 1.

Baselines
ACTC is compared to three baselines. The first baseline, LocalOpt (Acc), optimizes the per-relation thresholds towards accuracy: for each relation, the threshold is selected from the embedding scores assigned to the manually annotated samples that contain this relation, so that the local accuracy (i.e., the accuracy calculated only on these samples) is maximized (Safavi and Koutra, 2020). We also modified this approach into LocalOpt (F1) by changing the maximization metric to the local F1 score. The third baseline is GlobalOpt, where the thresholds are selected by iterative search over a manually defined grid (Speranskaya et al., 2020). The best thresholds are selected based on the global F1 score calculated for the whole dataset. In all baselines, the samples for manual annotation are selected randomly.
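To illustrate the grid-search idea behind GlobalOpt, a simplified single-threshold version might look as follows (the actual baseline searches per-relation thresholds iteratively; the function name and grid are our own):

```python
import numpy as np
from sklearn.metrics import f1_score

def global_opt_threshold(scores, labels, grid):
    """Pick the threshold from a manually defined grid that maximizes
    the global F1 score over the whole annotated set."""
    return max(grid, key=lambda t: f1_score(labels, scores >= t))
```

Unlike the LocalOpt baselines, the selection criterion here is computed over all annotated samples at once rather than per relation.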

Results
We ran the experiments for the following numbers of manually annotated samples: 1, 2, 5, 10, 20, 50, 100, 200, 500, and 1000. Experimental setup details are provided in Appendix E. Table 2 provides the results averaged over all experiments (here and in the following, n = 500 for a fair comparison; see Section 5 for an analysis of the n value): our method ACTC outperforms the baselines in every tried setting as well as on average. Figure 2a also demonstrates the improvement of ACTC_rndm over the baselines for every tried amount of manually annotated samples on the CoDEx-S dataset; the exact numbers for experiments with different budgets are provided in Appendix F. The density-based selection, on the other hand, achieves considerably better results when only a few manually annotated samples are available (see Figure 2b). Indeed, choosing representative samples from highly connected clusters can be especially useful when annotations are lacking. ACTC_dens, which selects points from regions of high density, can be helpful for small annotation budgets since it selects samples that are similar to other samples. In contrast, when the annotation budget is sufficient and a certain number of samples has already been selected, dense regions are already sufficiently covered, and ACTC_rndm provides a more unbiased sample from the entire distribution.

Ablation Study
A more detailed ablation study of different ACTC settings is provided in Appendix D.
Global Thresholds. All methods described above calibrate per-relation thresholds. Another option is to define a uniform (uni) threshold, which works as a generic threshold for all tuples regardless of the relations involved. We implemented it as the ACTC-LR_uni method, where the additional samples are automatically labeled and used to build a decision set together with the manually annotated ones, in the same way as for the relation-specific version, but only once for the whole dataset (thus significantly reducing the computational costs). We also applied the LocalOpt (Acc) and LocalOpt (F1) baselines in the uniform setting. Figure 3 demonstrates the results obtained with the ConvE KGE model and the random selection mechanism on the CoDEx-S dataset.
Although the uniform versions generally perform worse than the relation-specific ones, ACTC_uni still outperforms the uniform baselines and, for a small annotation budget, even the relation-specific ones.
Different n values. An important parameter in ACTC is n, the minimal sufficient amount of (manually or automatically) labeled samples needed to calibrate a threshold. An ablation study of different n values is provided in Figure 4 for the ACTC-LR_dens setting, averaged across all annotation budgets. ACTC is quite stable with respect to the value of n. Even a configuration with a minimal value of n = 5 outperforms the baselines for a small annotation budget, and sometimes even for quite a large one (e.g., for RESCAL).

Conclusion
In this work, we explored for the first time the problem of cold-start calibration of scoring models for knowledge graph completion. Our new method for active threshold calibration, ACTC, provides different strategies for selecting the samples for manual annotation and automatically labels additional tuples with Logistic Regression or Gaussian Processes classifiers trained on the manually annotated data. Experiments on datasets with oracle positive and negative triple annotations, and with several KGE models, demonstrate the efficiency of our method and a considerable increase in classification performance even for tiny annotation budgets.

Limitations
A potential limitation of our experiments is the use of oracle validation labels instead of human annotation as in a real-world setting. However, all validation sets used in our experiments were collected based on a manually defined seed set of entities and relations, carefully cleaned, and augmented with manually labeled negative samples. Moreover, we chose this easier-to-implement setting to make our results easily reproducible and comparable with future work.
Another limitation of experiments that use established datasets and focus on isolated aspects of knowledge-graph construction is their detachment from real-world scenarios. Indeed, in reality, knowledge graph completion is done in a much more complicated environment that involves a variety of stakeholders and aspects, such as data verification, requirements consideration, user management, and so on. Nevertheless, we do believe that our method, even if studied initially in isolation, can be useful as one component in real-world knowledge graph construction.

Ethics Statement
Generally, the knowledge graphs used in the experiments are biased towards a North American cultural background, and so are evaluations and predictions made on them. As a consequence, the testing that we conducted in our experiments might not reflect the completion performance for other cultural backgrounds. Due to the high costs of additional oracle annotation, we could not conduct our analysis on more diverse knowledge graphs. However, we have used the most established benchmark dataset with calibration annotations, CoDEx, which has been collected with significant human supervision. This gives us hope that our results are as reliable and trustworthy as possible.
While our method can lead to better and more helpful predictions from knowledge graphs, we cannot guarantee that these predictions are perfect and can be trusted as the sole basis for decision-making, especially in life-critical applications (e.g., healthcare).

B CoDEx datasets
In our experiments, we use the benchmark CoDEx datasets (Safavi and Koutra, 2020). The datasets were collected from Wikidata in the following way: a seed set of entities and relations for 13 domains (medicine, science, sport, etc.) was defined and used as queries to Wikidata in order to retrieve entities, relations, and triples. After additional postprocessing (e.g., removal of inverse relations), the retrieved data was used to construct three datasets: CoDEx-S, CoDEx-M, and CoDEx-L. For the first two datasets, the authors additionally constructed hard negative samples (by manually annotating candidate triples that were generated with a pretrained embedding model), which allows us to use these two datasets in our experiments.
• An example of positive triple: (Senegal, part of, West Africa).
• An example of negative triple: (Senegal, part of, Middle East).

C Embedding models
We use four knowledge graph embedding models. This section highlights their main properties and provides their scoring functions.
ComplEx (Trouillon et al., 2016) uses complex-valued embeddings and a diagonal relation embedding matrix to score triples; the scoring function is defined as s(h, r, t) = Re(e_h^T diag(r_r) ē_t), where ē_t is the complex conjugate of e_t. TransE (Bordes et al., 2013) is an example of a translational KGE model, where relations are treated as translations between entities; the embeddings are scored with s(h, r, t) = −‖e_h + r_r − e_t‖_p. RESCAL (Nickel et al., 2011) treats entities as vectors and relation types as matrices and scores entity and relation embeddings with the following scoring function: s(h, r, t) = e_h^T R_r e_t. These models were selected, first, following the previous works (Safavi and Koutra, 2020; Speranskaya et al., 2020), and, second, to demonstrate the performance of our method with different KGE approaches: linear (ComplEx and RESCAL), translational (TransE), and neural (ConvE).
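The scoring functions above can be illustrated in a few lines of NumPy; the random embeddings are our own, chosen purely to show the shape and sign conventions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
e_h, e_t = rng.normal(size=d), rng.normal(size=d)   # real entity embeddings
r_vec = rng.normal(size=d)                          # relation vector (TransE)
R = rng.normal(size=(d, d))                         # relation matrix (RESCAL)

# TransE: s(h, r, t) = -||e_h + r_r - e_t||_p, here with p = 2
s_transe = -np.linalg.norm(e_h + r_vec - e_t, ord=2)

# RESCAL: s(h, r, t) = e_h^T R_r e_t
s_rescal = e_h @ R @ e_t

# ComplEx: real part of the trilinear product with complex embeddings
c_h = rng.normal(size=d) + 1j * rng.normal(size=d)
c_t = rng.normal(size=d) + 1j * rng.normal(size=d)
c_r = rng.normal(size=d) + 1j * rng.normal(size=d)
s_complex = np.real(np.sum(c_h * c_r * np.conj(c_t)))
```

Note that the TransE score is always non-positive (a negated norm), while the RESCAL and ComplEx scores are unbounded real numbers; the per-relation thresholds are estimated on whatever scale the chosen model produces.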

D Ablation Study
Optimization towards F1 score. Just as we converted the LocalOpt (Acc) baseline from Safavi and Koutra (2020) into a LocalOpt (F1) setting, we also converted ACTC into ACTC (F1). The only difference is the metric that the thresholds maximize: instead of accuracy, the threshold that provides the best F1 score is looked for. Table 3 is an extended result table, which provides the ACTC (F1) numbers together with the standard ACTC (optimizing towards accuracy) and the baselines. As can be seen, there is no dramatic change in ACTC performance; naturally enough, the F1 test score for the ACTC (F1) experiments is slightly better than the F1 test score for the experiments where thresholds were selected based on accuracy.

Table 3: ACTC results in % averaged across different sizes of the annotation budget, reported with the standard error of the mean. The ACTC method is provided in two local optimization settings: first, the thresholds maximize accuracy (in the same way as presented in Figure 2); second, the thresholds are optimized towards the F1 score. The experiment with each annotation budget was repeated 100 times.
Estimate All Samples. Apart from the automatic labeling of additional samples discussed in Section 3 (i.e., additional samples are labeled in case of insufficient manual annotations, so that the size of the decision set built from manually annotated and automatically labeled samples equals n), we also experimented with annotating all samples: every sample that was not manually labeled is automatically labeled by a classifier. However, the performance was slightly better only for the middle budgets (i.e., for the settings with 5, 10, and 20 manually annotated samples) and became considerably worse for large budgets (i.e., 100, 200, etc.), especially in the density-based selection setting.
Based on that, we can conclude that a large amount of automatically labeled additional data is not what the model profits from the most; the redundant labels (which are not gold and potentially contain mistakes) only amplify the errors and lead to worse algorithm performance.
Hard vs. Soft Labels. The classifier's predictions can either be used directly as real-valued soft labels or be transformed into hard labels by selecting the class with the maximum probability. In most of our experiments, the performance of soft and hard labels was practically indistinguishable (yet with a slight advantage for the latter). All results provided in this paper were obtained with hard automatic labels.
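The two label types come directly out of the Scikit-learn classifier interface; a toy sketch with illustrative data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.2], [0.3], [0.7], [0.8]])  # KGE scores of annotated samples
y = np.array([0, 0, 1, 1])                  # manual labels
clf = LogisticRegression().fit(X, y)

soft = clf.predict_proba([[0.6]])[0, 1]  # real-valued soft label in [0, 1]
hard = clf.predict([[0.6]])[0]           # hard label: class with max probability
```

For a binary classifier, the hard label is simply the soft label thresholded at 0.5.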

E Experimental Setting
As no validation data is available in our setting, the ACTC method does not require any hyperparameter tuning. We did not use a GPU for our experiments; one ACTC run takes, on average, 2 minutes. All results are reproducible with the seed value 12345. ACTC does not impose any restrictions on the classifier architecture. We experimented with two classifiers: a Logistic Regression classifier and a Gaussian Processes classifier, both in the Scikit-learn implementation (Pedregosa et al., 2011). The Logistic Regression classifier was used in the default Scikit-learn setting, with an L2 penalty term and the inverse of the regularization strength equal to 100. For the Gaussian Processes classifier, we experimented with the following kernels:
• the squared exponential RBF kernel with length_scale = 10
• its generalized and smoothed version, the Matérn kernel, with length_scale = 0.1
• a mixture of different RBF kernels, the RationalQuadratic kernel, with length_scale = 0.1
All results for the Gaussian Processes classifier provided in this paper are obtained with the Matérn kernel (Minasny and McBratney, 2005), defined by the following kernel function:

k(x, x') = 1 / (Γ(ν) 2^(ν−1)) · (√(2ν) d(x, x') / ℓ)^ν · K_ν(√(2ν) d(x, x') / ℓ),

where d(x, x') is the distance between the inputs, ℓ is the length scale, K_ν is a Bessel function, and Γ is a Gamma function.
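In Scikit-learn terms, the Gaussian Processes classifier can be set up as follows (toy data of our own; the kernel hyperparameters match those listed above):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic

X = np.array([[0.1], [0.2], [0.8], [0.9]])  # KGE scores of annotated samples
y = np.array([0, 0, 1, 1])                  # manual labels

# The three kernels experimented with:
kernels = [RBF(length_scale=10.0),
           Matern(length_scale=0.1),
           RationalQuadratic(length_scale=0.1)]

# All reported Gaussian Processes results use the Matérn kernel.
gp = GaussianProcessClassifier(kernel=Matern(length_scale=0.1)).fit(X, y)
probs = gp.predict_proba(np.array([[0.15], [0.85]]))
```

`predict_proba` yields the soft labels discussed in Appendix D; taking the argmax per row gives the hard labels used in the reported experiments.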

F Results for Different Annotation Budgets
Tables 4, 5, and 6 demonstrate the performance of the different ACTC settings for different annotation budgets (1, 10, and 50, respectively). The results are averaged over all settings; each setting was repeated 100 times. Table 4 demonstrates how useful and profitable the density-based selection methods are in a lower-budget setting. However, the unbiased random selection works better with more manually annotated samples (e.g., 50).

Table 6: ACTC results for l = 50, n = 500, averaged across 100 tries for each experiment and reported with the standard error of the mean.