Robust Knowledge Graph Completion with Stacked Convolutions and a Student Re-Ranking Network

Knowledge Graph (KG) completion research usually focuses on densely connected benchmark datasets that are not representative of real KGs. We curate two KG datasets that include biomedical and encyclopedic knowledge and use an existing commonsense KG dataset to explore KG completion in the more realistic setting where dense connectivity is not guaranteed. We develop a deep convolutional network that utilizes textual entity representations and demonstrate that our model outperforms recent KG completion methods in this challenging setting. We find that our model’s performance improvements stem primarily from its robustness to sparsity. We then distill the knowledge from the convolutional network into a student network that re-ranks promising candidate entities. This re-ranking stage leads to further improvements in performance and demonstrates the effectiveness of entity re-ranking for KG completion.

The increased interest in KGC has led to the curation of a number of benchmark datasets such as FB15K (Bordes et al., 2013), WN18 (Bordes et al., 2013), FB15k-237 (Toutanova and Chen, 2015), and YAGO3-10 (Rebele et al., 2016) that have been the focus of most of the work in this area. However, these benchmark datasets are often curated in such a way as to produce densely connected networks that simplify the task and are not representative of real KGs. For instance, FB15K includes only entities with at least 100 links in Freebase, while YAGO3-10 only includes entities in YAGO3 (Rebele et al., 2016) that have at least 10 relations.
Real KGs are not as uniformly dense as these benchmark datasets and have many sparsely connected entities (Pujara et al., 2017). This can pose a challenge to typical KGC methods that learn entity representations solely from the knowledge that already exists in the graph.
Textual entity identifiers can be used to develop entity embeddings that are more robust to sparsity (Malaviya et al., 2020). It has also been shown that textual triplet representations can be used with BERT for triplet classification (Yao et al., 2019). Such an approach can be extended to the more common ranking paradigm through the exhaustive evaluation of candidate triples, but that does not scale to large KG datasets.
In our work, we found that existing neural KGC models lack the complexity to effectively fit the training data when used with the pre-trained textual embeddings that are necessary for representing sparsely connected entities. We develop an expressive deep convolutional model that utilizes textual entity representations more effectively and improves sparse KGC. We also develop a student re-ranking model that is trained using knowledge distilled from our original ranking model and demonstrate that the re-ranking procedure is particularly effective for sparsely connected entities. Through these innovations, we develop a KGC pipeline that is more robust to the realities of real KGs. Our contributions can be summarized as follows.
• We develop a deep convolutional architecture that utilizes textual embeddings more effectively than existing neural KGC models and significantly improves performance for sparse KGC.
• We develop a re-ranking procedure that distills knowledge from our ranking model into a student network that re-ranks promising candidate entities.
• We curate two sparse KG datasets containing biomedical and encyclopedic knowledge to study KGC in the setting where dense connectivity is not guaranteed. We release the encyclopedic dataset and the code to derive the biomedical dataset to encourage future work.

Related Work
Knowledge Graph Completion: KGC models typically learn entity and relation embeddings based on known facts (Nickel et al., 2011; Bordes et al., 2013; Yang et al., 2015) and use the learned embeddings to score potential candidate triples. Recent work includes both non-neural (Nickel et al., 2016; Trouillon et al., 2016; Liu et al., 2017; Sun et al., 2019) and neural (Socher et al., 2013; Dong et al., 2014; Dettmers et al., 2018; Vashishth et al., 2020b) approaches for embedding KGs. However, most of them only demonstrate their efficacy on artificially dense benchmark datasets. Pujara et al. (2017) show that the performance of such methods varies drastically with sparse, unreliable data. We compare our proposed method against the existing approaches in a realistic setting where the KG is not uniformly dense.
Prior work has effectively utilized entity names or descriptions to aid KGC (Socher et al., 2013; Ruobing Xie, 2016; Xiao et al., 2016). In more recent work, Malaviya et al. (2020) explore the problem of KGC using commonsense KGs, which are much sparser than standard benchmark datasets. They adapt an existing KGC model to utilize BERT (Devlin et al., 2019) embeddings. In this paper, we develop a deep convolutional architecture that is more effective than adapting existing shallow models, which we find to be underpowered for large KG datasets. Yao et al. (2019) developed a triplet classification model by directly fine-tuning BERT with textual entity representations and reported strong classification results. They also adapted their triplet classification model to the ranking paradigm by exhaustively evaluating all possible triples for a given query, (e_1, r, ?). However, the ranking performance was not competitive, and such an approach is not scalable to large KG datasets like those explored in this work. Exhaustively applying BERT to compute all rankings for the test set of our largest dataset would take over two months. In our re-ranking setting, we reduce the number of triples that need to be evaluated by over 7700×, reducing the evaluation time to less than 15 minutes.
BERT as a Knowledge Base: Recent work (Petroni et al., 2019; Jiang et al., 2020; Rogers et al., 2020) has utilized the masked-language-modeling (MLM) objective to probe the knowledge contained within pre-trained models using fill-in-the-blank prompts (e.g. "Dante was born in [MASK]"). This body of work has found that pre-trained language models such as BERT capture some of the relational knowledge contained within their pre-training corpora. This motivates us to utilize these models to develop entity representations that are well-suited for KGC.
Re-Ranking: Wang et al. (2011) introduced cascade re-ranking for document retrieval. This approach applies inexpensive models to develop an initial ranking and utilizes expensive models to improve the ranking of the top-k candidates. Re-ranking has since been successfully applied across many retrieval tasks (Matsubara et al., 2020; Pei et al., 2019; Nogueira and Cho, 2019). Despite re-ranking's widespread success, recent KGC work utilizes a single ranking model. We develop an entity re-ranking procedure and demonstrate the effectiveness of the re-ranking paradigm for KGC.
Knowledge Distillation: Knowledge distillation is a popular technique that is often used for model compression, where a large, high-capacity teacher is used to train a simpler student network (Hinton et al., 2015). However, knowledge distillation has since been shown to be useful for improving model performance beyond the original setting of model compression. Li et al. (2017) demonstrated that knowledge distillation improved image classification performance in a setting with noisy labels. The incompleteness of KGs leads to noisy training labels, which motivates us to use knowledge distillation to train a student re-ranking model that is more robust to the label noise.

Table 1: Dataset statistics

Dataset            # Nodes  # Rels  # Train  # Valid  # Test
FB15k-237           14,451     237  272,115   17,535   20,466
SNOMED CT Core      77,316     140  502,224   71,778  143,486
CN-100K             78,088      34  100,000    1,200    1,200
FB15k-237-Sparse    14,451     237   18,506   17,535   20,466

Datasets
We examine KGC in the realistic setting where KGs have many sparsely connected entities. We utilize a commonsense KG dataset that has been used in past work and curate two additional sparse KG datasets containing biomedical and encyclopedic knowledge. We release the encyclopedic dataset and the code to derive the biomedical dataset to encourage future work in this challenging setting. The summary statistics for all datasets are presented in Table 1 and we visualize the connectivity of the datasets in Figure 1.

SNOMED CT Core
For constructing SNOMED CT Core, we use the knowledge graph defined by SNOMED CT (Donnelly, 2006), which is contained within the Unified Medical Language System (UMLS) (Bodenreider, 2004). SNOMED CT is well-maintained and is one of the most comprehensive knowledge bases contained within the UMLS (Jiménez-Ruiz et al., 2011; Jiang and Chute, 2009). We first extract the UMLS concepts (we work with the 2020AA release of the UMLS) found in the CORE Problem List Subset of the SNOMED CT knowledge base. This subset is intended to contain the concepts most useful for documenting clinical information. We then expand the graph to include all concepts that are directly linked to those in the CORE Problem List Subset according to the relations defined by the SNOMED CT KG. Our final KG consists of this set of concepts and the SNOMED CT relations connecting them. Importantly, we do not filter out rare entities from the KG, as is commonly done during the curation of benchmark datasets.
To avoid leaking data from inverse, or otherwise informative, relations, we divide the facts into training, validation, and testing sets based on unordered tuples of entities {e 1 , e 2 } so that all relations between any two entities are confined to a single split. Unlike some other KG datasets that filter out inverse relations, we divide our dataset in such a way that this is not necessary; our dataset already includes inverse relations, and they do not need to be manually added for training and evaluation as is standard practice (Dettmers et al., 2018;Malaviya et al., 2020).
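The pair-based split can be sketched as follows (an illustrative reimplementation with our own function names and split ratios, not the released preprocessing code):

```python
import random
from collections import defaultdict

def split_by_entity_pair(facts, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign all facts sharing an unordered entity pair {e1, e2} to the same
    split, so inverse or otherwise informative relations cannot leak."""
    groups = defaultdict(list)
    for e1, r, e2 in facts:
        groups[frozenset((e1, e2))].append((e1, r, e2))
    keys = sorted(groups, key=lambda k: tuple(sorted(k)))
    random.Random(seed).shuffle(keys)
    n = len(keys)
    cut1 = int(ratios[0] * n)
    cut2 = cut1 + int(ratios[1] * n)
    train = [f for k in keys[:cut1] for f in groups[k]]
    valid = [f for k in keys[cut1:cut2] for f in groups[k]]
    test = [f for k in keys[cut2:] for f in groups[k]]
    return train, valid, test
```

Because every group is confined to one split, a test query can never be answered by memorizing its inverse fact from training.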
Because we represent entities using textual descriptions in this work, we also mine the entities' preferred concept names (e.g. "Traumatic hematoma of left kidney") from the UMLS.
The dense connectivity of FB15k-237 allows us to ablate the effect of this density. We utilize the FB15k-237 dataset and also develop a new dataset, denoted FB15k-237-Sparse, by randomly downsampling the facts in the training set of FB15k-237 to match the average in-degree of the ConceptNet-100K dataset. We use this to directly evaluate the effect of increased sparsity.
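The downsampling itself is straightforward (a sketch with hypothetical names; in the paper the target is the average in-degree of CN-100K):

```python
import random

def downsample_to_in_degree(train_facts, num_nodes, target_avg_in_degree, seed=0):
    """Randomly keep just enough training facts so that the average in-degree
    (facts per node) roughly matches a target value, as in FB15k-237-Sparse."""
    keep = min(len(train_facts), round(target_avg_in_degree * num_nodes))
    return random.Random(seed).sample(train_facts, keep)
```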
For the FB15k-237 dataset, we use the textual identifiers released by Ruobing Xie (2016). They released both entity names (e.g. "Jason Frederick Kidd") as well as brief textual descriptions (e.g. "Jason Frederick Kidd is a retired American professional basketball player. . . ") for most entities. We utilize the textual descriptions when available.

Figure 2: We utilize BERT to precompute entity embeddings. We then stack the precomputed entity embedding with a learned relation embedding and project them to a two-dimensional spatial feature map, upon which we apply a sequence of two-dimensional convolutions. The final feature map is then average pooled and projected to a query vector, which is used to rank candidate entities. We extract promising candidates and train a re-ranking model utilizing knowledge distilled from the original ranking model. The final candidate ranking is generated by ensembling the ranking and re-ranking models.

Methods
We provide an overview of our model architecture in Figure 2. We first extract feature representations from BERT (Devlin et al., 2019) to develop textual entity embeddings. Motivated by our observation that existing neural KG architectures are underpowered in our setting, we develop a deep convolutional network utilizing architectural innovations from deep convolutional vision models. Our model's design improves its ability to fit complex relationships in the training data which leads to downstream performance improvements. Finally, we distill our ranking model's knowledge into a student re-ranking network that adjusts the rankings of promising candidates. In doing so, we demonstrate the effectiveness of the re-ranking paradigm for KGC and develop a KGC pipeline with greater robustness to the sparsity of real KGs.

Entity Ranking
We follow the standard formulation for KGC. We represent a KG as a set of entity-relation-entity facts (e_1, r, e_2). Given an incomplete fact, (e_1, r, ?), our model computes a score for all candidate entities e_i that exist in the graph. An effective KGC model should assign greater scores to correct entities than to incorrect ones. We follow recent work (Dettmers et al., 2018; Malaviya et al., 2020) and consider both forward and inverse relations (e.g. treats and treated by) in this work. For the datasets that do not already include inverse relations, we introduce an inverse fact, (e_2, r^-1, e_1), for every fact, (e_1, r, e_2), in the dataset.
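Adding inverse facts can be sketched as follows (the "_inv" suffix is our own illustrative convention for r^-1):

```python
def add_inverse_facts(facts):
    """For every fact (e1, r, e2), add the inverse fact (e2, r^-1, e1).
    Here the inverse relation is tagged with an '_inv' suffix."""
    inverse = [(e2, r + "_inv", e1) for e1, r, e2 in facts]
    return facts + inverse
```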

Textual Entity Representations
We utilize BERT (Devlin et al., 2019) to develop entity embeddings that are invariant to the connectivity of the KG. We follow the work of Malaviya et al. (2020) and adapt BERT to each KG's naming style by fine-tuning BERT using the MLM objective with the set of entity identifiers in the KG.
For CN-100K and FB15k-237, we utilize the BERT-base uncased model. For SNOMED CT Core KG, we utilize PubMedBERT (Gu et al., 2020) which is better suited for the biomedical terminology in the UMLS.
We apply BERT to the textual entity identifiers and mean-pool across the token representations from all BERT layers to obtain a summary feature vector for the concept name. We fix these embeddings during training because we must compute scores for a large number of potential candidate entities for each training example, which makes fine-tuning BERT prohibitively expensive. Given an incomplete triple (e_i, r_j, ?), we begin by stacking the precomputed entity embedding e ∈ R^(1×d) with the learned relation embedding of the same dimension, r ∈ R^(1×d), to produce a feature vector of length d with two channels, q ∈ R^(2×d). We then apply a one-dimensional convolution with a kernel of width 1 along the length of the feature vector to project each position i to a two-dimensional spatial feature map x_i ∈ R^(f×f), where the convolution has f × f filters. Thus the convolution produces a two-dimensional spatial feature map X ∈ R^(f×f×d) with d channels, representing the incomplete query triple (e_i, r_j, ?).
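The stacking and width-1 convolution amount to a per-position linear map from 2 channels to f × f channels, which can be sketched shape-wise in NumPy (toy dimensions and our own variable names):

```python
import numpy as np

d, f = 8, 4             # embedding dim and spatial side length (toy values)
e = np.random.randn(d)  # precomputed BERT entity embedding
r = np.random.randn(d)  # learned relation embedding

q = np.stack([e, r])           # (2, d): two channels of length d
W = np.random.randn(f * f, 2)  # width-1 conv = shared linear map per position
X = (W @ q).reshape(f, f, d)   # (f, f, d) spatial feature map for the query
```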

Deep Convolutional Architecture
The spatial feature map, X ∈ R^(f×f×d), is analogous to a square image with a side length of f and d channels, allowing for the straightforward application of deep convolutional models such as ResNet. We apply a sequence of 3N bottleneck blocks to the spatial feature map, where N is a hyperparameter that controls the depth of the network. A bottleneck block consists of three consecutive convolutions: a 1 × 1 convolution, a 3 × 3 convolution, and then another 1 × 1 convolution. The first 1 × 1 convolution reduces the feature map dimensionality by a factor of 4 and the second 1 × 1 convolution restores the original dimensionality. This design reduces the dimensionality of the expensive 3 × 3 convolutions and allows us to increase the depth of our model without dramatically increasing its parameterization. We double the feature dimensionality of the bottleneck blocks after N and 2N blocks, so the dimensionality of the final feature map produced by the sequence of convolutions is 4d.
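A quick parameter count (pure arithmetic, ignoring biases and normalization layers) illustrates why the bottleneck design is cheaper than a single full-width 3 × 3 convolution:

```python
def bottleneck_params(d):
    """1x1 conv d -> d/4, then 3x3 conv d/4 -> d/4, then 1x1 conv d/4 -> d."""
    m = d // 4
    return d * m + 3 * 3 * m * m + m * d

def plain_params(d):
    """A single 3x3 convolution at full width, d -> d channels."""
    return 3 * 3 * d * d

# three bottleneck convolutions still use far fewer parameters
# than one full-width 3x3 convolution
assert bottleneck_params(256) < plain_params(256)
```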
We add residual connections to each bottleneck block which improves training for deep networks (He et al., 2016). If we let F(X) represent the application of the bottleneck convolutions, then the output of the bottleneck block is Y = F(X) + X. We apply batch normalization followed by a ReLU nonlinearity (Nair and Hinton, 2010) before each convolutional layer (He et al., 2016) . We utilize circular padding (Wang et al., 2018;Vashishth et al., 2020a) with the 3 × 3 convolutions to maintain the spatial size of the feature map and use a stride of 1 for all convolutions. For the bottleneck blocks that double the dimensionality of the feature map, we utilize a projection shortcut for the residual connection (He et al., 2016).

Entity Scoring
Given an incomplete fact (e_i, r_j, ?), our convolutional architecture produces a feature map X̄ ∈ R^(f×f×4d). We average pool this feature representation over the spatial dimensions, which produces a summary feature vector x̄ ∈ R^(4d). We then apply a fully connected layer followed by a PReLU nonlinearity (He et al., 2015) to project the feature vector back to the original embedding dimensionality d. We denote this final vector ê and compute scores for candidate entities using the dot product with candidate entity embeddings. The scores can be efficiently computed for all entities simultaneously using a matrix-vector product with the embedding matrix, y = êE^T, where E ∈ R^(m×d) stores the embeddings for all m entities in the KG.
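The scoring step is a single matrix-vector product, e.g. (toy shapes, our own variable names):

```python
import numpy as np

m, d = 1000, 8             # number of entities and embedding dim (toy values)
E = np.random.randn(m, d)  # candidate entity embedding matrix
e_hat = np.random.randn(d) # query vector produced by the convolutional network

y = E @ e_hat              # scores for all m entities at once
ranking = np.argsort(-y)   # highest-scoring candidates first
```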

Training
Adopting the terminology used by Ruffinelli et al. (2020), we utilize a 1vsAll training strategy with the binary cross-entropy loss function. We treat every fact in our dataset, (e_i, r_j, e_k), as a training sample where (e_i, r_j, ?) is the input to the model. We compute scores for all entities as described previously and apply a sigmoid operator to induce a probability for each entity. We treat all entities other than e_k as negative candidates and then compute the binary cross-entropy loss. We train our model using the Adam optimizer (Kingma and Ba, 2015) with decoupled weight decay regularization (Loshchilov and Hutter, 2019) and label smoothing. We train our models for a maximum of 200 epochs and terminate training early if the validation Mean Reciprocal Rank (MRR) has not improved for 20 epochs. We trained all of the models used in this work using a single NVIDIA GeForce GTX 1080 Ti.
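The 1vsAll objective with label smoothing can be sketched as follows (a minimal NumPy version; the particular smoothing scheme here is one common choice and may differ from the implementation):

```python
import numpy as np

def one_vs_all_bce(scores, target_idx, smoothing=0.1):
    """Binary cross-entropy over all entities: the target entity gets label
    1 - smoothing; every other entity acts as a smoothed negative."""
    num_entities = scores.shape[0]
    labels = np.full(num_entities, smoothing / num_entities)
    labels[target_idx] = 1.0 - smoothing
    probs = 1.0 / (1.0 + np.exp(-scores))  # sigmoid per entity
    eps = 1e-12
    return -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))
```

Scoring the correct entity highly drives the loss down, while a confident wrong prediction drives it up.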

Re-Ranking Network
We use our convolutional network to extract the top-k entities for every unique training query and then train a re-ranking network to rank these entities. We design our student re-ranking network as a triplet classification model that utilizes the full candidate fact, (e_i, r_j, e_k), instead of an incomplete fact, (e_i, r_j, ?). This allows the network to model interactions between all elements of the triple. The re-ranking setting also enables us to directly fine-tune BERT, which often improves performance (Peters et al., 2019).
We introduce relation tokens^4 for each relation in the knowledge graph and construct the textual input for a triple ("head name", r_i, "tail name") by prepending each entity name with the relation token for r_i and concatenating the two sequences. We use a learned linear combination of the [CLS] embedding from each layer as the final feature representation for the prediction.
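A sketch of this input construction (the exact template, the `[R7]` token, and the use of `[SEP]` here are illustrative assumptions, not the paper's exact format):

```python
def build_reranker_input(head_name, relation_token, tail_name):
    """Prepend each entity name with the relation's special token, then
    concatenate the two sequences for BERT. Illustrative layout only."""
    return f"{relation_token} {head_name} [SEP] {relation_token} {tail_name}"

text = build_reranker_input("Traumatic hematoma of left kidney", "[R7]",
                            "Kidney structure")
```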

Knowledge Distillation
A sufficiently performant ranking model can provide an informative prior that can be used to smooth the noisy training labels and improve our re-ranking model. For each training query i, we normalize the logits produced by our teacher ranking model, f_T(x_i), for the k candidate triples, f_T(x_i)_{0:k}, as

s = softmax(f_T(x_i)_{0:k} / T)

where T is the temperature (Hinton et al., 2015).
Our training objective for our student model, f_S(x_i), is a weighted average of the binary cross-entropy loss, L_bce, using the teacher's normalized logits, s, and the noisy training labels, y:

L = λ · L_bce(f_S(x_i), s) + (1 − λ) · L_bce(f_S(x_i), y)

[4] We use relation tokens instead of free-text relation representations because the relation identifiers for our datasets are not all well-formed natural language, and the different styles would introduce a confounding factor that would complicate our evaluation. Utilizing appropriate free-text relation identifiers may improve performance, but we leave that to future work.
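The temperature-normalized targets and the weighted distillation objective can be sketched as follows (a minimal NumPy version with our own function names):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def distillation_targets(teacher_logits, T=1.0):
    """Normalize the teacher's logits for the k candidates with temperature T."""
    return softmax(teacher_logits / T)

def reranker_loss(student_logits, teacher_logits, hard_labels, lam=0.5, T=1.0):
    """Weighted average of BCE against the teacher's soft targets and BCE
    against the (noisy) hard training labels."""
    s = distillation_targets(teacher_logits, T)
    p = 1.0 / (1.0 + np.exp(-student_logits))
    eps = 1e-12
    def bce(labels):
        return -np.mean(labels * np.log(p + eps)
                        + (1 - labels) * np.log(1 - p + eps))
    return lam * bce(s) + (1 - lam) * bce(np.asarray(hard_labels, dtype=float))
```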

Training
For our experiments, we extract the top k = 10 candidates produced by our ranking model for every query in the training set. We train our student network using the Adam optimizer (Kingma and Ba, 2015) with decoupled weight decay regularization (Loshchilov and Hutter, 2019). We fine-tune BERT for a maximum of 10 epochs and terminate training early if the Mean Reciprocal Rank (MRR) on validation data has not improved for 3 epochs.

Student-Teacher Ensemble
For every query, we apply our re-ranking network to the top k = 10 triples and compute the final ranking using an ensemble of the teacher and student networks. The final rankings are computed with

ŝ_{0:k} = α · f_S(x_i)_{0:k} + (1 − α) · f_T(x_i)_{0:k}

where 0 ≤ α ≤ 1 controls the impact of the student re-ranker. The cost of computing ŝ_{0:k} is negligible, so we sweep over [0, 1] in increments of 0.01 and select the α that achieves the best validation MRR.
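The ensemble and the α sweep can be sketched as follows (a minimal illustration; function names are our own):

```python
import numpy as np

def ensemble_scores(student, teacher, alpha):
    """Blend re-ranker and ranker scores for the top-k candidates."""
    return alpha * student + (1 - alpha) * teacher

def sweep_alpha(validation_mrr):
    """Pick the alpha in [0, 1] (steps of 0.01) that maximizes validation MRR;
    `validation_mrr` is any callable mapping an alpha to the metric."""
    alphas = [i / 100 for i in range(101)]
    return max(alphas, key=validation_mrr)
```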

Baselines
We utilize the same representative selection of KG models from Malaviya et al. (2020) as baselines: DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018), and ConvTransE (Shang et al., 2018). This is not an exhaustive selection of all recent KG methods, but a recent replication study by Ruffinelli et al. (2020) found that the baselines that we use are competitive with the state-of-the-art and often outperform more recent models when trained appropriately.
We develop additional baselines by adapting the shallow convolutional KGC models to use BERT embeddings to evaluate the benefits of utilizing our proposed convolutional architecture instead of simply repurposing existing KGC models. We refer to these models as BERT-ConvE and BERT-ConvTransE. Malaviya et al. (2020) used BERT embeddings in conjunction with ConvTransE for commonsense KGC, but their model was prohibitively large to reproduce. We refer to their model as BERT-Large-ConvTransE and compare directly against their reported results. We also develop a deep convolutional baseline, termed BERT-DeepConv, to evaluate the effect of the architectural innovations used in our model. BERT-DeepConv transforms the input embeddings to a spatial feature map like our proposed model, but it then applies a stack of 3 × 3 convolutions instead of a sequence of bottleneck blocks with residual connections. We select hyperparameters (detailed in the Appendix) for all of our BERT baselines so that they have a comparable number of trainable parameters to our proposed model. We discuss the size of these models in detail in Section 6.4.

Table 2 (significance markers): (1) improvements of the deep convolutional BERT models over both shallow convolutional BERT models are marked with an underline (p < 0.005); (2) improvements of BERT-ResNet over BERT-DeepConv with a * (p < 0.05); (3) improvements of the re-ranking configurations over the original rankings with a † (p < 0.005).
To evaluate the impact of our re-ranking stage, we ablate the use of knowledge distillation and ensembling. Thus we conduct experiments where our re-ranker uses only knowledge distillation, uses only ensembling, and uses neither. This means that in the most naive setting, we train the re-ranker using the hard training labels and re-rank the candidates using only the re-ranker.

Evaluation
We report standard ranking metrics: Mean Rank (MR), Mean Reciprocal Rank (MRR), Hits at 1 (H@1), Hits at 3 (H@3), and Hits at 10 (H@10). We follow past work and use the filtered setting (Bordes et al., 2013), removing all positive entities other than the target entity before calculating the target entity's rank.
We utilize paired bootstrap significance testing (Berg-Kirkpatrick et al., 2012) with the MRR to validate the statistical significance of improvements. To account for the large number of comparisons being performed, we apply the Holm-Bonferroni method (Holm, 1979) to correct for multiple hypothesis testing. We define families for the three primary hypotheses that we tested with our experiments. They are as follows: (1) The deep convolutional BERT models outperform the shallow convolutional BERT models. (2) BERT-ResNet improves upon our BERT-DeepConv baseline. (3) The re-ranking procedure improves the original rankings.
This selection has the benefit of allowing for a more granular analysis of each conclusion while significantly reducing the number of hypotheses. The first family includes all pairwise comparisons between the two deep convolutional models and the two shallow convolutional models. The second family involves all comparisons between BERT-ResNet and BERT-DeepConv. The third family includes comparisons between all re-ranking configurations and the original rankings. We note that the p-value for each family bounds the strict condition that we report any spurious finding within the family.
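For reference, the Holm-Bonferroni step-down procedure can be sketched as follows (the standard textbook procedure, not the authors' evaluation code):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/accept decision per hypothesis. P-values are tested in
    ascending order against alpha / (m - i); once one fails, all larger fail."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break
    return reject
```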
6 Results and Discussion

Ranking Performance
We report results across all of our datasets in Table 2. Our ranking model, BERT-ResNet, outperforms the previously published models and our baselines across all of the sparse datasets. We find that for all sparse datasets, the models that use free text entity representations outperform the models that learn the entity embeddings during training. Among the models utilizing textual information, the deep convolutional methods generally outperform the adaptations of existing neural KG models. BERT-ResNet outperforms BERT-DeepConv across all datasets, demonstrating that the architectural innovations do improve downstream performance.
On the full FB15k-237 dataset, our proposed model is able to achieve competitive results compared to strong baselines. However, the focus of this work is not to achieve state-of-the-art performance on densely connected benchmark datasets such as FB15k-237. These results do, however, allow us to observe the outsized impact of sparsity on models that do not utilize textual information.

Re-Ranking Performance
Re-ranking entities without knowledge distillation or ensembling leads to poor results, degrading the MRR across most datasets. We note that the performance of our re-ranking model could be limited by our use of a pointwise loss function. Further exploration of pairwise or listwise learning-to-rank methods is a promising direction for future work that could lead to further improvements (Guo et al., 2020).
The inclusion of either knowledge distillation or ensembling improves performance. Ensembling is particularly important, achieving a statistically significant improvement over the initial rankings across most datasets. Our final setting using both knowledge distillation and ensembling is the only setting to achieve a statistically significant improvement across all four datasets, although using both does not consistently improve performance over ensembling alone.
A plausible explanation for this is that knowledge distillation improves performance by reducing the divergence between the re-ranker and the teacher, but ensembling can already achieve a similar effect by simply increasing the weight of the teacher in the final prediction. We observe that the weight of the teacher is reduced across all four datasets when knowledge distillation is used which would be consistent with this explanation. Knowledge distillation has also been shown to be useful in situations with noisy labels (Li et al., 2017) which may explain why it was particularly effective for our sparsest dataset, CN-100K, where training with the hard labels led to particularly poor performance.

Effect of Re-Ranking
We bin test examples by the in-degree of the tail nodes and compute the MRR within these bins for our model before and after re-ranking. We report this breakdown for the SNOMED CT Core dataset in Figure 3. Our re-ranking stage improves performance uniformly across all levels of sparsity, but it is particularly useful for entities that are rarely seen during training. This is also consistent with the comparatively smaller topline improvement for the densely connected FB15k-237 dataset.

Model Capacity
We report the number of trainable parameters for the models that use textual representations along with the train and test set MRR for SNOMED CT Core in Table 3. We observe a monotonic relationship between training and testing performance and note that the shallow models fail to effectively fit the training data.

Conclusion
KGs often include many sparsely connected entities where the use of textual entity embeddings is necessary for strong performance. We develop a deep convolutional network that is better suited for this setting than existing neural models developed on artificially dense benchmark KGs. We also introduce a re-ranking procedure to distill the knowledge from our convolutional model into a student re-ranking network and demonstrate that our procedure is particularly effective at improving the ranking of sparse candidates. We utilize these innovations to develop a KGC pipeline with greater robustness to the realities of KGs and demonstrate the generalizability of our improvements across biomedical, commonsense, and encyclopedic KGs.

A.2.2 BERT-ResNet

We set N = 2, the hyperparameter that controls the depth of the convolutional network. This means that our BERT-ResNet model consists of 3N = 6 sequential bottleneck blocks. We trained the models using a batch size of 64 with a 1vsAll strategy (Ruffinelli et al., 2020) with the binary cross-entropy loss function. We use the Adam optimizer (Kingma and Ba, 2015) with decoupled weight decay regularization (Loshchilov and Hutter, 2019) and train the model with a learning rate of 1e-3. We use label smoothing with a value of 0.1, clip gradients to a max value of 1, and regularize the model using weight decay with a weight of 1e-4. We apply dropout with drop probability 0.2 after the embedding layer and apply 2D dropout (Tompson et al., 2015) with the same drop probability before the 2D convolutions. We apply dropout with probability 0.3 after the pooling and fully connected layer. We manually tuned the hyperparameters for this model based on validation performance.

A.2.3 Baseline Implementations
For our baseline implementations of DistMult, ComplEx, ConvE, and ConvTransE, we adapt the implementations released by Dettmers et al. (2018) and Malaviya et al. (2020). We utilize the hyperparameters reported in the original papers and conduct a grid search to tune the embedding dimension from [100,200,300] and the initial learning rate from [5e-3, 1e-3, 5e-4, 1e-4] for each dataset. We train the models with a batch size of 128 using the 1vsAll strategy with the cross entropy loss function because the replication study by Ruffinelli et al. (2020) found that this training strategy generally led to better performance than other training strategies. For the grid search, we train each model for a maximum of 50 epochs and then select the hyperparameters with the best validation performance and retrain the model with our aforementioned training procedure.
For our implementation of BERT-ConvE and BERT-ConvTransE, we adapt the baseline ConvE and ConvTransE to use BERT embeddings in the same manner as our model. The convolution for BERT-ConvE has 32 channels and the convolution for BERT-ConvTransE has 64 channels. These values were selected to produce models with a comparable number of trainable parameters to our model. We then project the final feature vector down to the embedding dimensionality and rank candidates identically to our model. We trained both models with a batch size of 64 using 1vsAll strategy (Ruffinelli et al., 2020) with the binary cross entropy loss function using the Adam optimizer (Kingma and Ba, 2015) with decoupled weight decay regularization (Loshchilov and Hutter, 2019). We train the models with a learning rate of 1e-4, use label smoothing with value 0.1, clip gradients to a max value of 1, and regularize the model using weight decay with a weight of 0.0001. We apply dropout with drop probability 0.2 after the embedding layer and after the convolution. We apply dropout with probability 0.3 after the fully connected layer.
For our baseline BERT-DeepConv model, we use the same hyperparameters as BERT-ResNet for the initial 1-D convolution and then apply a sequence of three 3 × 3 convolutions with circular padding. The second convolution doubles the number of channels so the dimensionality of the final feature map produced by the sequence of convolutions is 2d. We then mean pool and project the feature map to the embedding dimensionality identically to our proposed model. We selected these hyperparameters so that this baseline has a similar number of trainable parameters to our proposed model. All other implementation details are identical to our BERT-ResNet model (e.g. use of pre-activations, application of dropout, training hyperparameters, etc.).

A.3 Re-Ranking
We fine-tune BERT with a learning rate of 3e-5 using the Adam optimizer (Kingma and Ba, 2015) with decoupled weight decay regularization (Loshchilov and Hutter, 2019). We truncate the textual triple representation to a max length of 32 tokens and fine-tune BERT with a batch size of 128 for a maximum of 10 epochs. Training is terminated early if the validation MRR does not improve for 3 epochs. We set the weight decay parameter to 0.01 and clip gradients to a max value of 1 during training. We apply dropout with probability 0.3 to the final feature representation before the prediction and otherwise use the default parameters provided by the HuggingFace Transformers library (Wolf et al., 2020). We set λ = 0.5 for SNOMED CT Core, λ = 1.0 for CN-100K, and λ = 0.75 for FB15k-237 and FB15k-237-Sparse. We set the temperature as T = 1 for all models.

B Evaluation Metrics
We provide a mathematical formulation for our evaluation metrics. If we denote the set of all facts in the test set as T, then the Mean Rank (MR) is simply computed as

MR = (1 / |T|) Σ_{x_i ∈ T} rank(x_i)

The Mean Reciprocal Rank (MRR) is computed as

MRR = (1 / |T|) Σ_{x_i ∈ T} 1 / rank(x_i)

The Hits at k (H@k) is calculated as

H@k = (1 / |T|) Σ_{x_i ∈ T} I[rank(x_i) ≤ k]

where I[P] is 1 if the condition P is true and 0 otherwise. When computing rank(x_i), we first filter out all positive samples other than the target entity x_i. This is commonly referred to as the filtered setting.

Table 8: Validation re-ranking results. We report metrics for the subset of queries where the retrieved entity is already in the top 10 entities because the re-ranking procedure leaves other rankings unchanged.
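The filtered metrics can be sketched as follows (our own helper names; `scores` maps each candidate entity to its score, and `positives` holds the other known correct answers for the query):

```python
def filtered_rank(scores, target, positives):
    """Rank of the target among candidates after removing all other known
    positives for the query (the standard filtered setting)."""
    target_score = scores[target]
    rank = 1
    for entity, score in scores.items():
        if entity == target or entity in positives:
            continue  # skip the target itself and filtered positives
        if score > target_score:
            rank += 1
    return rank

def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    return sum(1 for r in ranks if r <= k) / len(ranks)
```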