Knowledge Base Completion Meets Transfer Learning

The aim of knowledge base completion is to predict unseen facts from existing facts in knowledge bases. In this work, we introduce the first approach for transfer of knowledge from one collection of facts to another without the need for entity or relation matching. The method works for both canonicalized knowledge bases and uncanonicalized or open knowledge bases, i.e., knowledge bases where more than one copy of a real-world entity or relation may exist. Such knowledge bases are a natural output of automated information extraction tools that extract structured data from unstructured text. Our main contribution is a method that can make use of large-scale pre-training on facts, collected from unstructured text, to improve predictions on structured data from a specific domain. The introduced method is most impactful on small datasets such as ReVerb20K, where we obtained a 6% absolute increase in mean reciprocal rank and a 65% relative decrease in mean rank over the previously best method, despite not relying on large pre-trained models like BERT.


Introduction
A knowledge base (KB) is a collection of facts, stored and presented in a structured way that allows simple use of the collected knowledge in applications. In this paper, a knowledge base is a finite set of triples (h, r, t), where h and t are head and tail entities, while r is a binary relation between them. Manually constructing a knowledge base is tedious and requires a large amount of labor. To speed up the process of construction, facts can be extracted from unstructured text automatically, using, e.g., open information extraction (OIE) tools, such as ReVerb (Fader et al., 2011) or more recent neural approaches (Stanovsky et al., 2018; Hohenecker et al., 2020). Alternatively, missing facts can be inferred from existing ones using knowledge base completion (KBC) algorithms, such as ConvE (Dettmers et al., 2018), TuckER (Balažević et al., 2019), or 5⋆E (Nayyeri et al., 2021). It is desirable to use both OIE and knowledge base completion approaches to automatically construct KBs. However, automatic extractions from text yield uncanonicalized entities and relations. An entity such as "the United Kingdom" may also appear as "UK", and a relation such as "located at" may also appear as "can be found in". (In a canonicalized knowledge base, "the United Kingdom" and "UK" are known to be the same entity; in an uncanonicalized or open knowledge base, this information may not be given.) If we fail to connect these occurrences and treat them as distinct entities and relations, the performance of KBC algorithms drops significantly (Gupta et al., 2019). If our target data are canonicalized, collecting additional uncanonicalized data from unstructured text is not guaranteed to improve the performance of said models. An illustration of a knowledge base can be found in Figure 1.
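The triple view of a KB and the canonicalization problem can be illustrated with a small toy example (hypothetical data, not from any of the datasets used here):

```python
# A knowledge base as a finite set of (head, relation, tail) triples.
# In the uncanonicalized version, the same real-world entity appears
# under two surface forms ("the United Kingdom" and "UK"), so the KB
# seems to contain one entity more than it actually describes.
canonicalized_kb = {
    ("United Kingdom", "located at", "Europe"),
    ("London", "capital of", "United Kingdom"),
}
uncanonicalized_kb = {
    ("the United Kingdom", "located at", "Europe"),
    ("London", "capital of", "UK"),  # same entity as above, but unlinked
}

def entities(kb):
    # Collect every name that appears as a head or a tail.
    return {h for h, _, _ in kb} | {t for _, _, t in kb}

print(len(entities(canonicalized_kb)))    # 3
print(len(entities(uncanonicalized_kb)))  # 4
```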
Open knowledge base completion (OKBC) aims to mitigate this problem and to predict unseen facts even when real-world entities and relations may appear under several different names. Existing work in the area overcomes this either by learning a mapping from entity and relation names to their embeddings (Broscheit et al., 2020) or by using external tools for knowledge base canonicalization (Vashishth et al., 2018), and using the obtained predictions to enhance the embeddings of the entities (Gupta et al., 2019;Chandrahas and Talukdar, 2021).
In this work, we follow the first of these two approaches. We pre-train RNN-based encoders that encode entities and relations from their textual representations to their embeddings jointly with a KBC model on a large OKBC benchmark. We use this pre-trained KBC model and encoders to initialize the final model that is later fine-tuned on a smaller dataset. More specifically, KBC parameters that are shared among all inputs are used as an initialization of the same parameters of the fine-tuned model. When initializing the input-specific embeddings, we introduce and compare two approaches: Either the pre-trained entity and relation encoders are also used and trained during the fine-tuning, or they are used in the beginning to compute the initial values of all entity and relation embeddings, and then dropped.
We evaluate our approach with three different KBC models and on five datasets, showing consistent improvements on most of them. We show that pre-training turns out to be particularly helpful on small datasets with scarce data, achieving SOTA performance on the ReVerb20K and ReVerb45K OKBC datasets (Gupta et al., 2019; Vashishth et al., 2018) and consistent results on the larger KBC datasets FB15K237 and WN18RR (Toutanova et al., 2015; Dettmers et al., 2018). Our results imply that even larger improvements can be obtained by pre-training on a larger corpus. The code used for the experiments is available at https://github.com/vid-koci/KBCtransferlearning. We highlight that Gupta et al. (2019) and Broscheit et al. (2020) use the term "open knowledge graph embeddings" for OKBC. We use the term "open knowledge base completion" to also include KBs with non-binary relations, which cannot be viewed as graphs, and to consider methods that may not produce embeddings.
Our main contributions are briefly as follows:
• We introduce a novel approach for the transfer of knowledge between KBC models that works on both open and regular knowledge bases without the need for entity or relation matching.
• We show that pre-training on a large OKBC corpus improves the performance of these models on both KBC and OKBC datasets.
• We obtain improvements over state-of-the-art approaches on 3 of the 5 observed datasets, with the difference being particularly significant on the smallest datasets (e.g., a 0.058 absolute increase in MRR and a 65% decrease in MR on the ReVerb20K dataset).

Model for Transfer Learning
In this section, we introduce the architecture of our model and describe how a pre-trained model is used to initialize the model for fine-tuning. The model consists of two encoders, one for entities and one for relations, and a KBC model. Given a triple (h, r, t), the entity encoder maps the head h and the tail t into their vector embeddings v_h and v_t, while the relation encoder maps the relation r into its vector embedding v_r. These embeddings are then used as the input to the KBC algorithm of choice to predict the score (correctness) of the triple, using the loss function defined by the KBC model. The two parts of the model are architecturally independent of each other and are described in the following paragraphs. An illustration of our approach is given in Figure 2.

Encoders
We use two types of mappings from an entity to a low-dimensional vector space. The first approach is to assign each entity and relation its own embedding, initialized randomly and trained jointly with the model. This is the default approach used by most KBC models; to distinguish it from the RNN-based approach, we denote it the NoEncoder approach.
The second approach that we test is an RNN-based mapping from the textual representation (name) of an entity or relation to its embedding. We use GloVe word embeddings (Pennington et al., 2014) to map each word into a vector, and then use these vectors as the input to the entity encoder, implemented as a GRU (Cho et al., 2014). To distinguish it from the NoEncoder approach, we call this the GRU encoder throughout this paper. Broscheit et al. (2020) test alternative encoders as well, but find that RNN-based approaches (LSTMs, in their case) perform the most consistently across their experiments.
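A minimal sketch of such an encoder follows; random vectors stand in for GloVe embeddings and the GRU cell is hand-rolled (the actual implementation uses trained parameters, this only illustrates the name-to-embedding mapping):

```python
import numpy as np

rng = np.random.default_rng(0)
WORD_DIM = HIDDEN_DIM = 8

# Stand-in for GloVe: random word vectors for a toy vocabulary.
vocab = {w: rng.normal(size=WORD_DIM) for w in ["the", "united", "kingdom"]}

# GRU parameters for the update (z), reset (r), and candidate gates.
Wz, Uz = rng.normal(size=(HIDDEN_DIM, WORD_DIM)), rng.normal(size=(HIDDEN_DIM, HIDDEN_DIM))
Wr, Ur = rng.normal(size=(HIDDEN_DIM, WORD_DIM)), rng.normal(size=(HIDDEN_DIM, HIDDEN_DIM))
Wh, Uh = rng.normal(size=(HIDDEN_DIM, WORD_DIM)), rng.normal(size=(HIDDEN_DIM, HIDDEN_DIM))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_encode(name):
    """Map a name to an embedding: the final hidden state of the GRU."""
    h = np.zeros(HIDDEN_DIM)
    for word in name.lower().split():
        x = vocab[word]                        # word embedding lookup
        z = sigmoid(Wz @ x + Uz @ h)           # update gate
        r = sigmoid(Wr @ x + Ur @ h)           # reset gate
        h_cand = np.tanh(Wh @ x + Uh @ (r * h))
        h = (1 - z) * h + z * h_cand
    return h

v = gru_encode("the united kingdom")
assert v.shape == (HIDDEN_DIM,)
```

The same encoder is applied to relation names; two names that share words produce related, but not identical, embeddings.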
The size of the NoEncoder model grows linearly with the number of entities, while the size of the GRU encoder grows linearly with the vocabulary. For large knowledge bases, the latter can significantly decrease memory usage, as the size of the vocabulary is often smaller than the number of entities and relations.

Figure 2: The diagram of our approach. Green and blue blocks represent data, grey blocks represent models, and arrows represent the data flow. Entity and relation encoders are used to map a triple (h, r, t) from their names (textual representations) to their vector embeddings (v_h, v_r, v_t). These vectors are used as the input to a KBC algorithm of choice to compute the score of the triple.
Transfer between datasets. Here, pre-training is always done with GRU encoders, as the transfer from the NoEncoder setup to any other encoder requires entity matching, which we want to avoid. When fine-tuning is done with a GRU encoder, its parameters are initialized from the pre-trained GRU parameters. The same applies to the vocabulary; however, if the target vocabulary includes any unknown words, their word embeddings are initialized randomly.
For the initialization of the NoEncoder setup, the pre-trained GRU is used to generate initial values of all vector embeddings by encoding their textual representations. Any unknown words are omitted, and entities with no known words are initialized randomly.
An equivalent process is used for relations. During our preliminary experiments, we also tried pre-training the encoders on the next-word-prediction task on English Wikipedia; however, that turned out to have a detrimental effect on the overall performance compared to randomly initialized GRUs (a 2–3% MRR drop and slower convergence). That line of experiments was not continued.
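The NoEncoder initialization described above can be sketched as follows; the pre-trained encoder is mocked by a simple deterministic word-averaging function, since only the fallback logic is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4
known_words = {"united", "kingdom", "london"}

def encode(words):
    # Mock of the pre-trained encoder: a deterministic vector per word
    # (derived from character codes), averaged over the name.
    vecs = [np.full(DIM, sum(map(ord, w)) % 97 / 97.0) for w in words]
    return np.mean(vecs, axis=0)

def init_embedding(name):
    # Unknown words are omitted; names with no known words at all
    # fall back to a random initialization.
    words = [w for w in name.lower().split() if w in known_words]
    if not words:
        return rng.normal(size=DIM)
    return encode(words)

emb = init_embedding("the United Kingdom")   # "the" is omitted
rand = init_embedding("Zanzibar")            # random fallback
```

After this one-off initialization, the encoder is dropped and the embedding table is trained directly, as in any NoEncoder model.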

Knowledge Base Completion Models
We use three models for knowledge base completion, ConvE (Dettmers et al., 2018), TuckER (Balažević et al., 2019), and 5⋆E (Nayyeri et al., 2021), chosen for their strong performance on various KBC benchmarks. In the following paragraphs, we briefly introduce the models. We assume that the vectors v_h, v_r, and v_t, obtained with encoders of either type, are the input to these models.
TuckER assigns a score to each triple by multiplying the vectors with a core tensor W ∈ R^(d_e × d_e × d_r), where d_e is the dimension of entity embeddings and d_r is the dimension of relation embeddings. Throughout our work, we make the simplifying assumption that d_e = d_r to reduce the number of hyperparameters. During transfer, W from the pre-trained model is used to initialize W in the fine-tuned model.
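The TuckER score is a three-way contraction of the core tensor with the three embeddings; a small numerical sketch with random toy values (d_e = d_r = 4, matching the simplification above):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # d_e = d_r, per our simplification
W = rng.normal(size=(d, d, d))           # core tensor
v_h, v_r, v_t = (rng.normal(size=d) for _ in range(3))

# Score of the triple: W contracted with head, relation, and tail.
score = np.einsum("ijk,i,j,k->", W, v_h, v_r, v_t)

# The same value, written out as an explicit triple sum.
check = sum(W[i, j, k] * v_h[i] * v_r[j] * v_t[k]
            for i in range(d) for j in range(d) for k in range(d))
assert np.isclose(score, check)
```

Because W is shared across all triples, it can be carried over to the fine-tuned model unchanged, while the entity- and relation-specific vectors must be re-initialized.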
ConvE assigns a score to each triple by concatenating v_h and v_r, reshaping them into a 2D matrix, and passing them through a convolutional neural network (CNN). The output of this CNN is a d_e-dimensional vector, which is multiplied with v_t and summed with a tail-specific bias term b_t to obtain the score of the triple. During transfer, the parameters of the pre-trained CNN are used as the initialization of the CNN in the fine-tuned model. Bias terms of the fine-tuned model are initialized at random, since they are entity-specific.

5⋆E models consider v_h and v_t to be complex projective lines and v_r a vector of 2 × 2 complex matrices. These matrices correspond to a relation-specific Möbius transformation of projective lines. We refer the reader to the work of Nayyeri et al. (2021) for the details. Unlike in ConvE and TuckER, there are no parameters shared between different relations and entities. Pre-training thus only serves as the initialization of the embeddings.
At evaluation time, the model is given a triple with a missing head or tail and is used to rank all possible entities based on how likely they are to appear in place of the missing entity. Following Dettmers et al. (2018), we transform head-prediction samples into tail-prediction samples by introducing a reciprocal relation r^(-1) for each relation and transforming (?, r, t) into (t, r^(-1), ?). Following Gupta et al. (2019), the name of the reciprocal relation is created by adding the prefix "inverse of".
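The reciprocal-relation transformation can be sketched as follows (the exact whitespace handling in the relation name is our assumption):

```python
# Turn a head-prediction query (?, r, t) into a tail-prediction query
# (t, r_inverse, ?); the reciprocal relation's name is formed by
# prefixing "inverse of", so its embedding can be computed by the
# same relation encoder as any other relation name.
def to_tail_prediction(h, r, t):
    if h is None:                        # the head is the missing slot
        return (t, "inverse of " + r, None)
    return (h, r, t)                     # already a tail-prediction query

query = to_tail_prediction(None, "located at", "Europe")
print(query)  # ('Europe', 'inverse of located at', None)
```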
During our preliminary experiments, we also experimented with BoxE (Abboud et al., 2020); however, we decided not to use it for further experiments, since it was much slower to train and evaluate compared to the other models. A single round of training of BoxE with GRU encoders on OlpBench takes over 24 days, which we could not afford.

Datasets and Experimental Setup
This section describes the used data, the experimental setup, and the baselines.

Datasets
We use the following five datasets to test our methods (their statistics are given in Table 1).

OlpBench (Broscheit et al., 2020) is a large-scale OKBC dataset collected from English Wikipedia that we use for pre-training. The dataset comes with multiple training and validation sets; however, we only use the training set with THOROUGH leakage removal and the VALID-LINKED validation set, since their data have the highest quality.

ReVerb20K and ReVerb45K (Gupta et al., 2019; Vashishth et al., 2018) are uncanonicalized OKBC datasets. They additionally come with gold clusters of entities, i.e., information on which knowledge-base entities refer to the same real-world entity; this information can be used to improve the accuracy of the evaluation. We find that 99.9% of test samples in ReVerb20K and 99.3% of test samples in ReVerb45K contain at least one entity or relation that did not appear in OlpBench, hence requiring out-of-distribution generalization. Two entities or relations were considered different if their textual representations differed after pre-processing (lowercasing and removal of redundant whitespace). If two inputs still differ after these steps, the potential canonicalization has to be performed by the model.

FB15K237 (Toutanova et al., 2015) and WN18RR (Dettmers et al., 2018) are the most commonly used datasets for knowledge base completion, collected from the Freebase and WordNet knowledge bases, respectively. Both datasets have undergone data cleaning to remove test leakage through inverse relations, as reported by Dettmers et al. (2018). We highlight that WN18RR differs from the other datasets in its content. While the other datasets describe real-world entities and were often collected from Wikipedia or similar sources, WN18RR consists of information on words and linguistic relations between them, creating a major domain shift between pre-training and fine-tuning.
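The pre-processing used above to decide whether two surface forms match amounts to the following normalization (a minimal sketch):

```python
def normalize(name):
    # Lowercase and collapse redundant whitespace; inputs that still
    # differ after this step are treated as distinct entities/relations,
    # and any remaining canonicalization is left to the model.
    return " ".join(name.lower().split())

assert normalize("The  United   Kingdom") == normalize("the united kingdom")
assert normalize("UK") != normalize("the United Kingdom")  # model's job
```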

Experimental Setup
In our experiments, we observe the performance of the randomly initialized and pre-trained GRU and NoEncoder variants of each of the three KBC models. Pre-training was done with GRU encoders on OlpBench. To ensure a completely fair comparison, we make several simplifications to the models, e.g., assuming that entities and relations in TuckER have the same dimension and that all dropout rates within TuckER and ConvE are equal. While these simplifications can result in a performance drop, they allow us to run exactly the same grid search over hyperparameters for all models, excluding human factors and randomness from the search. We describe the details of our experimental setup with all hyperparameter choices in Appendix B.

Baselines
We compare our work to a collection of baselines, re-implementing and re-training them where appropriate.

Models for Knowledge Base Completion.
We evaluate all three of our main KBC models, ConvE, TuckER, and 5⋆E, with and without encoders. We include results for these models obtained by related work and additionally compare to further KBC models from related work, such as BoxE (Abboud et al., 2020), ComplEx (Lacroix et al., 2018), TransH (Wang et al., 2014), and TransE (Bordes et al., 2013). We highlight that the numbers from external work were usually obtained with more experiments and a broader hyperparameter search than the experiments in our work, where we tried to guarantee exactly the same environment for a large number of models.

Models for Knowledge Base Canonicalization.
Gupta et al. (2019) use external tools for knowledge base canonicalization to improve the predictions of KBC models, testing multiple methods to incorporate such data into the model. The tested methods include graph convolution neural networks (Bruna et al., 2014), graph attention neural networks (Veličković et al., 2017), and the newly introduced local averaging networks (LANs). Since LANs consistently outperform the alternative approaches in all their experiments, we use them as the only baseline of this type.

Chandrahas and Talukdar (2021) take advantage of both knowledge base canonicalization tools and large pre-trained transformers at the same time. Their approach is also computationally efficient, since only (h, r) is encoded with BERT, instead of entire triples, requiring an order of magnitude fewer passes through the transformer model. This is the strongest published approach to OKBC.

Experimental Results
This section contains the outcome of pre-training and fine-tuning experiments. In the second part of this section, we additionally investigate an option of zero-shot transfer.
For each model, we report its mean rank (MR), mean reciprocal rank (MRR), and Hits@10 (H@10) on the test set. We selected N = 10 for comparison, since it was the most consistently reported Hits@N metric in related work. We report the Hits@N performance for other values of N, the validation set performance, the running time, and the best hyperparameters in Appendix C. All evaluations are performed using the filtered setting, as suggested by Bordes et al. (2013). When evaluating on the uncanonicalized datasets with known gold clusters, ReVerb20K and ReVerb45K, the best-ranked tail from the correct cluster is considered to be the answer, following Gupta et al. (2019). A formal description of all these metrics can be found in Appendix A.
Pre-training results. The performance of the pre-trained models on OlpBench is given in Table 2. Our models obtain better scores than the previously best approach based on ComplEx (Trouillon et al., 2017), however, we mainly attribute the improvement to the use of better KBC models.
OKBC results. The results of our models on ReVerb20K and ReVerb45K are given in Table 3. All models strictly improve their performance when pre-trained on OlpBench. This improvement is particularly noticeable for NoEncoder models, which tend to overfit and achieve poor results without pre-training. However, when initialized with a pre-trained model, they are able to generalize much better. 5⋆E seems to be an exception to this, likely because there are no parameters shared between relations and entities, resulting in weaker regularization. GRU-based models do not seem to suffer as severely from overfitting, but their performance still visibly improves if pre-trained on OlpBench.
Finally, our best model outperforms the state-of-the-art approach by Chandrahas and Talukdar (2021) on ReVerb20K and ReVerb45K. Even when compared to the pre-trained GRU_CONVE, which is based on the same KBC model, OKGIT(CONVE) and CARE(CONVE,LAN) lag behind. This is particularly remarkable because the BERT and ROBERTA language models used by OKGIT(CONVE) received several orders of magnitude more pre-training on unstructured text than our models, making the results all the more significant.
Similarly, initializing models with BigGraph seems to visibly help the performance; however, these models are in turn outperformed by a NoEncoder model initialized with pre-trained encoders instead. This indicates that our suggested pre-training is much more effective, despite its smaller computational cost.
KBC results. To evaluate the impact of pre-training on larger canonicalized knowledge bases, we compare the performance of the models on FB15K237 and WN18RR. For brevity, we treat the choice of an encoder as a hyperparameter and report the better of the two models in Table 4. Detailed results are given in Appendix C.
Pre-trained models outperform their randomly initialized counterparts here as well; however, the differences are usually smaller. We believe that several factors can explain the small difference, primarily the difference in dataset size. The best models on FB15K237 and WN18RR only made between 3 and 12 times more steps during pre-training than during fine-tuning. For comparison, this ratio was between 250 and 1000 for ReVerb20K. The smaller improvements on FB15K237 and WN18RR can also be explained by the domain shift, as already described in Section 3.1.

Figure 3: Comparing the convergence of the overall best randomly initialized (blue) and pre-trained (green) models on ReVerb45K and FB15K237. Pre-trained models converge in fewer training steps despite a smaller learning rate.

Table 4 additionally includes multiple recently published implementations of ConvE, 5⋆E, and TuckER, as well as other strong models in KBC. We note that the comparison with all these models should be taken with a grain of salt, as the other reported models were often trained with a much larger hyperparameter space, as well as additional techniques for regularization (e.g., DURA (Zhang et al., 2020) or label smoothing (Balažević et al., 2019)) and sampling (e.g., self-adversarial sampling (Abboud et al., 2020)). Due to the large number of observed models and baselines, we could not expand the hyperparameter search without compromising the fair evaluation of all compared models.
In Appendix C, we also report the best hyperparameters for each model. We highlight that pre-trained models usually obtain their best result with a larger dimension than their randomly initialized counterparts. Pre-training thus serves as a type of regularization, allowing us to fine-tune larger models. We believe that, by training on even larger pre-training datasets, we could obtain pre-trained models with more parameters and even stronger improvements across many KBC datasets.
Not only does pre-training allow us to train larger models, pre-trained models also require fewer training steps to obtain a similar performance, as seen in Figure 3. The experiments in the figure were done with the best hyperparameter setup for both the pre-trained and the randomly initialized model. Even though the pre-trained models were fine-tuned with a smaller learning rate, they converge in fewer steps, which can be attributed to the pre-training. Finally, we highlight that language modelling as pre-training hardly justifies its enormous computational cost. On the majority of metrics, KG-BERT performs worse than models with fewer parameters and less pre-training, with the notable exception of a remarkable MR on the WN18RR dataset.

Table 5: Comparison of the zero-shot performance of different models on the OKBC benchmarks ReVerb20K and ReVerb45K. The scores of each model are reported with and without pre-training on OlpBench, with the better of the two written in italics. Note that a model that was not pre-trained is equivalent to a random baseline.

Zero-Shot Experiments
To understand better what kind of knowledge is transferred between the datasets, we investigate the zero-shot performance of pre-trained models on ReVerb20K and ReVerb45K, where the impact of pre-training was the strongest. If the improvement of pre-training was mainly due to the direct memorization of facts, the zero-shot performance should already be high without fine-tuning.
The results of the zero-shot evaluation are given in Table 5. We report the performance of all models with and without pre-training, the latter being equivalent to the random baseline. Note that since no fine-tuning takes place, the choice of the encoder for the evaluation does not matter. We thus chose to only evaluate one of them.
Observing the results, we see that pre-training the models results in a lower MR, but not necessarily a much higher MRR. Even when an increase in MRR occurs, the difference is much smaller than when comparing the fine-tuned models in Table 3. This implies that the improvement induced by pre-training likely does not happen only due to the memorization of facts from OlpBench. On the other hand, the MR of pre-trained models is comparable to or even better than the MR of randomly initialized NoEncoder models fine-tuned on the ReVerb datasets, reported in Table 3. Hence, pre-trained models carry a lot of "approximate knowledge", which is consistent with earlier remarks on pre-training serving as a type of regularization.
Knowing that the ReVerb20K and ReVerb45K test sets consist of facts that contain at least one previously unseen entity or relation, this can be seen as out-of-distribution generalization. Comparing the zero-shot MRR results with the OlpBench results implies that while OKBC models are capable of out-of-domain generalization to unseen entities and relations, there is still space for improvement.

Related Work
In recent years, there have been several approaches to improve the performance of KBC algorithms through data augmentation, commonly through various levels of connection with unstructured text. Socher et al. (2013), for example, use pre-trained word embeddings to initialize their entity embeddings. Xie et al. (2016) make use of an encoder that generates an embedding given a description of the entity, and they show that their approach generalizes even to previously unseen entities. Yao et al. (2019) make use of a large-scale pre-trained transformer model to classify whether a fact is true. They rely on costly pre-training and do not generate embeddings for entities that could be used to incorporate background knowledge into natural language understanding systems (Zhang et al., 2019).
Previous attempts at open knowledge base completion are tied to existing work on the canonicalization of knowledge bases. To canonicalize open knowledge bases, automatic canonicalization tools cluster entities using manually defined features (Galárraga et al., 2014) or by finding additional information from external knowledge sources (Vashishth et al., 2018). Gupta et al. (2019) use clusters obtained with these tools to augment entity embeddings for KBC. We note that Gupta et al. (2019) use RNN-based encoders to encode relations, but not to encode entities. Broscheit et al. (2020), on the other hand, introduce a model with RNN-based encoders for both entities and relations, similarly to our approach; however, they do not transfer beyond the introduced OlpBench dataset. Finally, Chandrahas and Talukdar (2021) use both KB canonicalization tools and the large-scale pre-trained model BERT, combining their predictions to make a more informed decision.
The development of methods that improve predictions of KBC algorithms through data augmentation or transfer is tied to the advances in OIE and KBC methods. However, these are beyond the scope of this project; see the works by Niklaus et al. (2018) and Ruffinelli et al. (2020) for an overview.

Summary and Outlook
In this work, we have introduced a novel approach to transfer learning between various knowledge base completion datasets. The main strength of the introduced method is its ability to benefit from pre-training on uncanonicalized knowledge bases constructed from facts collected from unstructured text. Scaling the introduced method up would let us train large-scale pre-trained models, which have already been shown to be incredibly successful in natural language processing. We tested our method on 5 different datasets, showing that pre-training improves the performance of the models. Pre-training turned out to be particularly beneficial on small-scale datasets, where we were able to obtain the most significant gains, e.g., a 6% absolute increase in MRR and a 65% decrease in MR over the previously best method on ReVerb20K, despite not relying on large pre-trained models like BERT. There are several directions for future work, such as scaling pre-training to larger models and datasets and investigating the impact of the encoder architecture.

A Metrics
This section contains a formal description of the metrics that were used to score the models. Let (h, r, t) be some test triple. Given (h, r, ?), the model is used to rank all entities in the knowledge base from the most suitable to the least suitable tail, so that the entity ranked first is the most likely tail according to the model. All ranks are reported in the "filtered setting", which means that all other known correct answers, other than t, are removed from this list to reduce noise. Let r_t be the position or rank of the correct tail in this list. In ReVerb20K and ReVerb45K, there may be entities from the same cluster as t, equivalent to t. In this case, r_t is the lowest of their ranks. The reciprocal rank of an example is then defined as 1/r_t. The same process is repeated with the input (?, r, t) and the head entity.
The mean rank (MR) of the model is the average of all ranks across all test examples, both heads and tails. The mean reciprocal rank (MRR) is the average of all reciprocal ranks, both heads and tails. The Hits@N metric tells us how often the rank was less than or equal to N. The related literature most commonly uses 1, 3, and 10 as values of N; however, 5, 30, and 50 are also occasionally reported.
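Given the filtered ranks of all test queries, the three metrics can be computed as follows (the ranks here are toy values for illustration):

```python
def mean_rank(ranks):
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n):
    return sum(1 for r in ranks if r <= n) / len(ranks)

# Filtered ranks r_t / r_h for four hypothetical test queries.
ranks = [1, 2, 10, 50]
print(mean_rank(ranks))              # 15.75
print(mean_reciprocal_rank(ranks))   # 0.405
print(hits_at_n(ranks, 10))          # 0.75
```

Note that MRR and Hits@N reward the top of the ranking, while MR is dominated by the worst-ranked examples, which is why the zero-shot models in Table 5 can have a good MR but a low MRR.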

B Detailed Experimental Setup
Following Ruffinelli et al. (2020), we use 1-N scoring for negative sampling and the cross-entropy loss for all models. The Adam optimizer (Kingma and Ba, 2015) was used to train the network. We follow Dettmers et al. (2018) and Balažević et al. (2019) with the placement of batch norm and dropout in ConvE and TuckER, respectively; however, we simplify the setup by always setting all dropout rates to the same value to reduce the hyperparameter space. To find the best hyperparameters, grid search is used for all experiments to exhaustively compare all options. Despite the numerous simplifications, we find that our re-implementations of the baselines (non-pre-trained NoEncoder models) perform comparably to the originally reported values. The experiments were performed on a DGX-1 cluster, using one Nvidia V100 GPU per experiment.
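The 1-N scoring with cross-entropy can be sketched as follows: a single (h, r) query is scored against every candidate tail at once, and the loss is the negative log-softmax of the gold entity (toy random values, not the actual models):

```python
import numpy as np

rng = np.random.default_rng(0)
num_entities, dim = 5, 4
E = rng.normal(size=(num_entities, dim))   # entity embedding table
query = rng.normal(size=dim)               # KBC model's output for (h, r)
gold = 2                                   # index of the correct tail

logits = E @ query                         # one score per candidate tail
log_probs = logits - np.log(np.exp(logits).sum())
loss = -log_probs[gold]                    # cross-entropy for this query
assert loss > 0
```

With in-batch negatives (as used for pre-training), the entity table E is restricted to the entities appearing in the current batch rather than the full entity set.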
Pre-training setup. Pre-training was only done with GRU encoders, as discussed in Section 2.1. Due to the large number of entities in the pre-training set, 1-N sampling is only performed with negative examples from the same batch, and batches of size 4096 were used, following Broscheit et al. (2020). The learning rate was selected from {1·10^-4, 3·10^-4}, while the dropout rate was selected from {0.2, 0.3} for ConvE and {0.3, 0.4} for TuckER. For 5⋆E, the dropout rate was not used, but N3 regularization was (Lacroix et al., 2018), with its weight selected from {0.1, 0.03}. For TuckER, models with embedding dimensions 100, 200, and 300 were trained. We saved the best model of each dimension for fine-tuning. For ConvE, models with embedding dimensions 300 and 500 were trained, and the best model for each dimension was saved for fine-tuning. Following Gupta et al. (2019), we use a single 2D convolution layer with 32 channels and 3 × 3 kernel size. When the dimension of entities and relations is 300, they were reshaped into 15 × 20 inputs, while 20 × 25 input shapes were used for the 500-dimensional embeddings. For 5⋆E, models with embedding dimensions 200 and 500 were trained, and the best model for each dimension was saved for fine-tuning. Following Broscheit et al. (2020), we trained each model for 100 epochs. Testing on the validation set is performed every 20 epochs, and the model with the best overall mean reciprocal rank (MRR) is selected.
Fine-tuning setup. Fine-tuning is performed in the same way as the pre-training; however, the models were trained for 500 epochs, and a larger hyperparameter space was considered. More specifically, the learning rate was selected from {3·10^-5, 1·10^-4, 3·10^-4}. The dropout rate was selected from {0.2, 0.3} for ConvE and {0.3, 0.4} for TuckER. The weight of the N3 regularization for the 5⋆E models was selected from {0.3, 0.1, 0.03}. The batch size was selected from {512, 1024, 2048, 4096}. The same embedding dimensions as for pre-training were considered.

C Full Results
This appendix contains detailed information on the best performance of all models. Table 6 contains detailed information on the performance of the best models on both the validation and test sets of all datasets. Table 7 includes information on the best hyperparameter setups and approximate training times. Note that the given times are strongly affected by the selection of the hyperparameters as well as external factors.

Table 6: Full results on both the validation and test sets of all datasets. In addition to the metrics reported in the paper, we also report H@N for N ∈ {1, 3, 5, 10, 30, 50}, which appeared in related work. The best value in each column is written in bold.