Leveraging Capsule Routing to Associate Knowledge with Medical Literature Hierarchically

Integrating knowledge into text is a promising way to enrich text representation, especially in the medical field. However, undifferentiated knowledge not only confuses the text representation but also introduces unexpected noise. In this paper, to alleviate this problem, we propose leveraging capsule routing to associate knowledge with medical literature hierarchically (called HiCapsRKL). First, HiCapsRKL extracts two empirically designed text fragments from the medical literature and encodes them into fragment representations. Second, the capsule routing algorithm is applied to the two fragment representations: through capsule computing and dynamic routing, each representation is processed into a new representation (denoted a caps-representation), and we integrate the caps-representations as information gain to associate knowledge with medical literature hierarchically. Finally, HiCapsRKL is validated on relevance prediction and medical literature retrieval test sets. The experimental results and analyses show that HiCapsRKL can associate knowledge with medical literature more accurately than mainstream methods. In summary, HiCapsRKL can efficiently help select the knowledge most relevant to the medical literature, which may be an alternative way to improve knowledge-based text representation. The source code is released on GitHub.


Introduction
Knowledge is commonly represented as a triple <e_h, r, e_t> describing the relationship (r) between a head entity (e_h) and a tail entity (e_t). Popular neural models can improve their ability to learn text representations by integrating knowledge, because they usually lack the ability to learn entities and their relationships from the text alone. However, medical literature usually contains multiple pieces of knowledge, and not all of them are beneficial to its subject. Integrating undifferentiated knowledge into neural models may therefore reduce the accuracy of medical literature representation. For example, Figure 1 lists three pieces of knowledge that exist in a medical literature, but the third one is redundant to the subject of the literature. Therefore, determining the hierarchical association between knowledge and medical literature is an essential step before integrating knowledge.

Figure 1: An example of undifferentiated knowledge in medical literature. The literature is from the Chinese Medical Association and is translated from the Chinese version. '***' represents omitted text in the abstract. The entities are highlighted in italics, and the underlined sentences describe the relationship between two entities.
The hierarchical association between knowledge and medical literature defines the degree of their relevance according to how far the subject of the medical literature covers the knowledge. Given a knowledge triple and a medical literature, the task is to predict their relevance on four levels, namely "Highly relevant (H_r)", "Fairly relevant (F_r)", "Marginally relevant (M_r)", and "Irrelevant (I_r)" (Kekäläinen, 2005). The public four-point scale graded relevance assessment (Kekäläinen, 2005) from the Text REtrieval Conference (TREC) is commonly used for this hierarchical association (see Table 5 in Appendix A.1). For a more intuitive definition, two information measures should be considered: relationship correlation (RCor) and knowledge importance (KImp). RCor indicates whether the texts surrounding two entities describe their relationship in the medical literature (labels: positive/negative), and KImp indicates how important the knowledge is to the subject of the medical literature (labels: Imp/P-imp/M-imp/U-imp). Table 1 lists the definitions of RCor and KImp and their corresponding relevance labels. For example, in Figure 1, the underlined sentences describe the relationships "indications" and "combination", and this knowledge is also important to the literature because it is discussed as the subject. For the third knowledge triple, however, even though its RCor is positive, it is unimportant to the subject of the literature. So, based on RCor and KImp, one can easily conclude that the first two knowledge triples are highly relevant to the literature and the third one is marginally relevant.
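As an illustration, the mapping from the two measures to a relevance level can be sketched as a lookup table. Only two cells are stated directly in the text (positive RCor plus important knowledge gives H_r; positive RCor plus unimportant knowledge gives M_r, the Figure 1 example); all other entries below are illustrative assumptions, not the paper's actual Table 1.

```python
# Assumed (RCor, KImp) -> relevance mapping; cells marked "assumption" are
# guesses for illustration, since the paper's Table 1 is not reproduced here.
RELEVANCE = {
    ("positive", "Imp"):   "Hr",
    ("positive", "P-imp"): "Fr",   # assumption
    ("positive", "M-imp"): "Mr",   # assumption
    ("positive", "U-imp"): "Mr",
    ("negative", "Imp"):   "Fr",   # assumption
    ("negative", "P-imp"): "Mr",   # assumption
    ("negative", "M-imp"): "Ir",   # assumption
    ("negative", "U-imp"): "Ir",   # assumption
}

def relevance_label(rcor, kimp):
    return RELEVANCE[(rcor, kimp)]
```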
However, it is difficult for mainstream methods to capture these two types of information, mainly because 1) the subject of medical literature usually entangles multiple pieces of knowledge, and these methods can seldom learn the unique knowledge from it; and 2) the expression of relationship information in medical literature is complex and abstract, which requires methods with strong discriminative ability. This paper proposes leveraging the capsule routing algorithm to extract the RCor and KImp information. The capsule routing algorithm was proposed for the capsule network by Sabour et al. (2017) and is an efficient algorithm for decoupling multiple object features. For the entangled knowledge and complex relationships, the capsule routing algorithm splits the input feature into multiple capsules, and each lower-level capsule hands its output to higher-level capsules through the routing algorithm, completing the extraction and aggregation of the information flow. After multi-layer capsule calculation, the final-layer capsules represent unique knowledge or relation information, which resolves the issues of knowledge entanglement and complex relationships in medical literature and then determines the hierarchical association of knowledge and medical literature.
In summary, our contributions include: 1) proposing to hierarchically associate knowledge with medical literature via relationship correlation and knowledge importance, and recording this information through two empirically designed text fragments in the medical literature; 2) proposing to leverage the capsule routing algorithm to model the RCor and KImp text fragments (called HiCapsRKL), taking them as information gain to judge the hierarchical association of knowledge and medical literature; and 3) building a weakly supervised training set, a relevance prediction test set, and a medical literature retrieval test set, and using them to evaluate and analyze the proposed HiCapsRKL and other comparison methods. The experimental results and analyses prove the effectiveness of HiCapsRKL in associating knowledge and medical literature hierarchically.

Related Works
Neural information retrieval (IR) models are available techniques for associating knowledge with medical literature because of their powerful deep neural architectures, such as CNN (Hu et al., 2014), RNN (Pang et al., 2017), and pre-trained BERT (Devlin et al., 2019). Later works followed heuristics or users' interactions in the result pages to enrich the association features. These methods further improved the neural IR models, but they still did not directly explore the association semantically (Zheng et al., 2018; Zhang et al., 2020).
By comparison, the capsule network is a newly proposed neural architecture, and its applications in the NLP area are still being explored (Zupon et al., 2020; Nguyen et al., 2019; Zhao et al., 2019). Several studies have applied the capsule network to various NLP tasks, e.g., sentiment classification (Ke et al., 2021; Du et al., 2019b; Chen and Qian, 2019), relation extraction (Liu et al., 2020a), text classification (Chen et al., 2020; Du et al., 2019a; Xiao et al., 2018; Zhao et al., 2018), intent detection (Liu et al., 2019; Zhang et al., 2019; Xia et al., 2018), document translation (Yang et al., 2019), word sense disambiguation (Liu et al., 2020b), etc. Most of these works followed the convolution and dynamic routing architecture of the capsule network and did not explore the effectiveness of the capsule routing algorithm for NLP tasks on its own (Liu et al., 2020b).
In this work, the proposed HiCapsRKL model uses the capsule routing algorithm to learn the relationship correlation and knowledge importance information in texts, and it can semantically explore the hierarchical association of knowledge and medical literature.

The HiCapsRKL Model

Figure 2 shows the overall architecture of the proposed HiCapsRKL model. First, the model inputs include the input texts (i.e., the RCor text fragment, the medical literature, and the KImp text fragment) and the knowledge triple. Second, each text is paired with the knowledge and passed into the language encoder to learn a contextual representation for each pair, namely R_MedL, R_RCor, and R_KImp. Third, the two representations R_RCor and R_KImp go through two capsule routing algorithms, respectively; the algorithm in each branch outputs a new caps-representation, R^c_RCor or R^c_KImp. Finally, the model integrates the three representations for the relevance prediction. Besides, to learn accurate RCor and KImp features, the model uses each caps-representation to predict its RCor or KImp label as defined in Table 1 through multi-task training. This section introduces each part of the model in detail.

Input Pre-processing
First, the input texts include the medical literature, the RCor text fragment, and the KImp text fragment. The medical literature contains the title, abstract, and keywords. The RCor text fragment is composed of the sentences that simultaneously contain both entities of the knowledge triple. If the two entities do not occur in one sentence, it is instead composed of the sentences located between the two nearest entity mentions. The KImp text fragment is composed of the title, the first sentence of the abstract, and the keywords of the medical literature. By concatenating the sentences in each text, every input text becomes a sequence of words, namely SEQ_MedL, SEQ_RCor, and SEQ_KImp.
Second, the knowledge triple is input as the concatenation of the head entity, relationship, and tail entity separated by a delimiter, so the knowledge triple is also converted into a sequence of words, namely SEQ_K. Next, a pairwise operation pairs the knowledge triple with each of the three input texts. The inputs of the language encoder are then three sequence pairs: the medical literature and knowledge pair <SEQ_MedL, SEQ_K>, the RCor text and knowledge pair <SEQ_RCor, SEQ_K>, and the KImp text and knowledge pair <SEQ_KImp, SEQ_K>.
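The pre-processing above can be sketched as follows. The sentence splitter and the triple delimiter are hypothetical placeholders: the paper does not specify how sentences are segmented or which delimiter joins the triple, and real Chinese medical text would need a proper sentence segmenter.

```python
import re

def split_sentences(text):
    # hypothetical splitter on sentence-final punctuation
    return [s for s in re.split(r"(?<=[。.!?])\s*", text) if s]

def rcor_fragment(literature, head, tail):
    sents = split_sentences(literature)
    both = [s for s in sents if head in s and tail in s]
    if both:                               # sentences containing both entities
        return " ".join(both)
    h = [i for i, s in enumerate(sents) if head in s]
    t = [i for i, s in enumerate(sents) if tail in s]
    if not h or not t:
        return ""
    # otherwise: the span between the two nearest entity mentions
    _, i, j = min((abs(a - b), a, b) for a in h for b in t)
    return " ".join(sents[min(i, j):max(i, j) + 1])

def kimp_fragment(title, abstract, keywords):
    first = split_sentences(abstract)[0] if abstract else ""
    return " ".join([title, first] + keywords)

def knowledge_sequence(head, rel, tail, delim="|"):
    # SEQ_K: head entity, relationship, tail entity joined by a delimiter
    return f"{head} {delim} {rel} {delim} {tail}"
```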

Language Encoder
The language encoder in this paper is the pre-trained BERT model initialized with the BERT-Base Chinese parameters. The three sequence pairs are input into the same BERT model separately.
First, in the BERT model, each pair is processed with WordPiece tokenization and sequence concatenation. The first token of every sequence is always the special token [CLS], and another special token [SEP] is used as the delimiter and end terminator. For example, the input <SEQ_MedL, SEQ_K> pair is converted into the format <[CLS], tokens of SEQ_MedL, [SEP], tokens of SEQ_K, [SEP]>. Second, the converted sequence goes through the multi-layer Transformer architecture (12 layers in this paper). The model encodes each token with contextual information and takes the hidden vector in the last layer as the contextual representation of each token. As described in Devlin et al. (2019), the hidden vector of the special [CLS] token is regarded as the classification representation of the input sequence for downstream predictions.
Finally, the classification representation of each input pair is denoted R_MedL, R_RCor, and R_KImp, respectively. The two representations R_RCor and R_KImp will be processed by the capsule routing algorithm.

Figure 3: The calculation procedure between the initial two layers of the capsule routing algorithm.
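A minimal sketch of the pair formatting described above; whitespace tokenization stands in for the actual WordPiece tokenizer, and the example texts are hypothetical. The structure of the [CLS]/[SEP] concatenation is what the sketch illustrates.

```python
# Whitespace tokenization stands in for WordPiece here.
def to_bert_pair(seq_a, seq_b):
    return ["[CLS]"] + seq_a.split() + ["[SEP]"] + seq_b.split() + ["[SEP]"]

# e.g. the <SEQ_RCor, SEQ_K> pair (texts are hypothetical)
pair = to_bert_pair("aspirin relieves mild headache", "aspirin | indications | headache")
```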

Capsule Routing Algorithm (CapsR(θ))
In this step, R_RCor and R_KImp are processed with the same capsule routing function but with different initial parameters, namely CapsR(θ_r) for R_RCor and CapsR(θ_t) for R_KImp. Here, we only take the CapsR(θ_r) branch as an example to explain the calculation. The calculation procedure between the initial two layers is shown in Figure 3. First, a multi-head splitting operation is applied to the input representation R_RCor, splitting it into p sub-vectors of dimension D. Multi-head splitting cuts the contextual representation into different representation subspaces at different positions (Vaswani et al., 2017). R_RCor is thus converted into {v_0, v_1, ..., v_{p-1}}, and each sub-vector v corresponds to one capsule in the first layer. So, for a capsule caps_i in layer L(1) (abbr. caps_i^{L(1)}), its input is u_i = v_i. Next, a weight matrix W_ij of dimensions D × D builds the connection to capsule caps_j in layer L(2) (abbr. caps_j^{L(2)}) and produces a prediction vector û_{j|i} = W_ij u_i. In CapsR(θ_r), the parameter θ_r refers to the set of weight matrices W_ij. The total input x_j to capsule caps_j^{L(2)} is a weighted sum over all û_{j|i} from the capsules in layer L(1):

x_j = Σ_i c_ij û_{j|i}    (1)
where c_ij is the coupling coefficient from capsule caps_i^{L(1)} to capsule caps_j^{L(2)}. The coupling coefficients between caps_i^{L(1)} and all capsules in L(2) sum to 1, namely Σ_{j=0}^{p-1} c_ij = 1. In capsule caps_j^{L(2)}, a non-linear "squashing" function, shown in Equation 2, normalizes the length by shrinking short vectors to almost 0 and long vectors to a length slightly below 1:

v_j^{L(2)} = (||x_j||^2 / (1 + ||x_j||^2)) · (x_j / ||x_j||)    (2)
where v_j^{L(2)} is the squashed output of capsule caps_j^{L(2)}. The coupling coefficient c_ij is updated by iterative dynamic routing; it is the softmax of the routing logit b_ij:

c_ij = exp(b_ij) / Σ_k exp(b_ik)    (3)
We follow the procedure of Sabour et al. (2017). Initially, b_ij equals 0, and it is updated as

b_ij ← b_ij + û_{j|i} · v_j^{L(2)}    (4)

which measures the agreement between the output v_j^{L(2)} of caps_j^{L(2)} and the prediction û_{j|i} of caps_i^{L(1)}. The following layers repeat the same calculation: the output v^{L(2)} is passed to the capsules in the next layer and goes through the weight matrices, the weighted sum, and the non-linear squashing function. After K layer iterations, we take the outputs of layer K as the relation or topic capsules {v_0^{L(K)}, ..., v_{p-1}^{L(K)}}. Finally, these capsules are concatenated into the new caps-representation R^c_RCor for RCor. Through the CapsR(θ_t) function, we obtain the other caps-representation, R^c_KImp, for KImp.
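The routing procedure of Equations 1-4 can be sketched with NumPy. The sizes p and D, the number of output capsules, the iteration count, and the random initialization are illustrative assumptions; a single routing block between two layers is shown, not the paper's full K-layer model.

```python
import numpy as np

def squash(x, eps=1e-9):
    # Equation 2: shrink short vectors toward 0, long vectors to length < 1
    n2 = np.sum(x * x, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * x / np.sqrt(n2 + eps)

def caps_routing(R, p=8, D=16, n_out=4, iters=3, seed=0):
    """One routing block between layer L(1) (p capsules) and L(2) (n_out capsules)."""
    rng = np.random.default_rng(seed)
    u = R.reshape(p, D)                              # multi-head split: u_i
    W = rng.normal(0.0, 0.1, size=(p, n_out, D, D))  # weight matrices W_ij (theta)
    u_hat = np.einsum('ijab,ib->ija', W, u)          # prediction vectors u^_{j|i}
    b = np.zeros((p, n_out))                         # routing logits b_ij, init 0
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # Eq. 3: softmax
        x = np.einsum('ij,ija->ja', c, u_hat)        # Eq. 1: weighted sum x_j
        v = squash(x)                                # Eq. 2: squashed outputs v_j
        b = b + np.einsum('ija,ja->ij', u_hat, v)    # Eq. 4: agreement update
    return v.reshape(-1)                             # concatenated caps-representation

R = np.random.default_rng(1).normal(size=8 * 16)     # stand-in for R_RCor
Rc = caps_routing(R)
```

Squashing guarantees every output capsule has length below 1, so the concatenated caps-representation is bounded regardless of the input scale.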

Multi-task Training
After obtaining the representations R_MedL, R^c_RCor, and R^c_KImp, the model integrates the three representations into an overall representation for the relevance prediction. The prediction asks the model to predict the relevance label from "H_r", "F_r", "M_r", and "I_r", and the corresponding training loss with respect to the gold label is denoted L_MedL. To make R^c_RCor and R^c_KImp learn accurate RCor and KImp information, the model additionally trains each representation on a specific task, namely RCor prediction and KImp prediction.
In RCor prediction, the model uses R^c_RCor as the input feature and predicts the binary RCor label. The binary labels correspond to the two cases of the RCor definition in Table 1, namely whether the medical literature describes the relationship of the two entities; the training loss of this task is denoted L_RCor. In KImp prediction, the model uses R^c_KImp as the input feature and predicts the KImp label. The KImp labels correspond to the four cases of the KImp definition in Table 1, namely how important the knowledge is to the subject of the medical literature; the training loss of this task is denoted L_KImp.
Finally, the total training loss of the model is the sum of the three prediction losses, namely L = L_MedL + L_RCor + L_KImp. According to the loss L, the model fine-tunes the parameters of the BERT encoder and updates the parameters of the capsule routing algorithms during training.
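A toy illustration of the total loss, with hypothetical softmax outputs of the three prediction heads for one training pair (the probabilities and gold classes below are invented for the example):

```python
import math

def cross_entropy(probs, gold):
    # negative log-likelihood of the gold class
    return -math.log(probs[gold])

# hypothetical softmax outputs of the three heads for one training pair
p_medl = [0.70, 0.15, 0.10, 0.05]  # relevance head: Hr / Fr / Mr / Ir
p_rcor = [0.80, 0.20]              # RCor head: positive / negative
p_kimp = [0.60, 0.25, 0.10, 0.05]  # KImp head: Imp / P-imp / M-imp / U-imp

# L = L_MedL + L_RCor + L_KImp, with gold classes Hr, positive, Imp
L = cross_entropy(p_medl, 0) + cross_entropy(p_rcor, 0) + cross_entropy(p_kimp, 0)
```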

Experimental Datasets and Metrics
In this work, the medical literature was collected from the Chinese Medical Association in 2019, and each article is represented by a title, an abstract, and keywords. The knowledge triples come from the Chinese medical knowledge graph CMeKG (Odmaa et al., 2019).
In the experiment, the training data are automatically constructed based on the RCor and KImp labels. We first calculated the RCor and KImp labels between knowledge and medical literature, and then mapped them to relevance labels according to Table 1. Two manually labeled test sets are proposed to evaluate HiCapsRKL and the comparison methods. The knowledge and medical literature in both sets are independent of the training set, without any intersection. Both test sets were labeled by professional annotators according to the TREC graded relevance assessment. In the relevance prediction test set, the model must predict the relevance label of each pair, and Macro-F1 and Micro-F1 are used as the evaluation metrics. In the medical literature retrieval test set, given a knowledge triple, the model must rank the candidate documents by their relevance to the knowledge. The evaluation metrics on this test set are normalized discounted cumulative gain at 10 (NDCG@10), precision at 10 (P@10), mean reciprocal rank (MRR), and mean average precision (MAP). More details about the datasets are listed in Appendix A.

Table 3: Comparison in terms of P@10, NDCG@10, MRR, and MAP scores (%) on the medical literature retrieval test set. The methods in each type are ranked according to NDCG@10. In the MRR, MAP, and P@10 metrics, the scores before "/" are calculated based on the "H_r" literature, and the scores after "/" are based on the "H_r" and "F_r" literature simultaneously.

Baseline and Comparison Methods
In this work, the baseline and comparison methods are the RCor&KImp baseline, the unsupervised IR methods (unIR, namely TF·IDF and BM25 (Robertson and Zaragoza, 2009)), the neural learning-to-rank models (NeuL2R, namely KNRM and Conv-KNRM (Dai et al., 2018)), the pre-trained BERT (Devlin et al., 2019) and Siamese BERT (Reimers and Gurevych, 2019) models, and the knowledge graph embedding enhanced models (KGemb). Details of these methods are given in the Appendix.

Experimental Results
The experimental results of all comparison methods on the relevance prediction test set and the medical literature retrieval test set are listed in Table 2 and Table 3, respectively. In Table 2, HiCapsRKL gives more correct label predictions than the other methods, and the Macro-F1 score indicates that HiCapsRKL brings overall improvements across the four categories, because it is the average of the per-category F1 scores. Owing to its limited generalization, the RCor&KImp baseline may perform well on covered cases but poorly on uncovered ones, which may explain its poor performance on both metrics. The unsupervised TF·IDF and BM25 methods show similar performance on both metrics, which indicates the upper bound of the unsupervised methods. Among the NeuL2R methods, KNRM and Conv-KNRM outperform the unsupervised methods by about 5-7%, which mainly benefits from the training set. The BERT-based models (Lines 6-7) obtain much higher performance after training. The knowledge graph embeddings (Lines 8-11) contribute considerably to the improvement, but they are unstable, with about 2% differences compared to BERT. The HiCapsRKL model integrates the RCor and KImp information through the capsule routing algorithm and reaches the best performance. In Table 3, the HiCapsRKL model also obtains the best performance on all evaluation metrics. The P@10 and NDCG@10 metrics reflect the relevance of the top-10 results among all the retrieved literature, while the MRR and MAP metrics reflect the ranking over all the retrieved literature. The NDCG@10 metric considers not only the relevance label of each document but also its position in the top 10. Therefore, it is a comprehensive metric for expressing the capability of a model, and we use it as the main basis for ranking the comparison methods. In the NDCG@10 column, each method brings a certain improvement.
In particular, the NeuL2R and KGemb methods outperform the unsupervised methods and, benefiting from different mechanisms, show different degrees of improvement. Among all these results, HiCapsRKL ranks more highly relevant literature at the front, with a large margin over the other methods. The P@10 metric counts the literature with a given label among the top-10 results, so it mainly indicates how much literature meeting the label can be retrieved. From the P@10 column, whether on the "H_r" label alone or on both the "H_r" and "F_r" labels, HiCapsRKL retrieves the most relevant literature. The MAP and MRR metrics report the ranking performance of each method over all the retrieved literature, i.e., the positions of the relevant literature in the whole ranking. The results on these two metrics are roughly consistent with those on P@10 and NDCG@10. The results on this set indicate that HiCapsRKL is effective for retrieving relevant literature and is worthy of further research.
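For reference, the ranking metrics discussed above can be sketched as follows. Mapping the graded labels to gains as H_r=3, F_r=2, M_r=1, I_r=0 is an assumed encoding, and the example ranking is hypothetical.

```python
import math

def dcg_at_k(gains, k=10):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k=10):
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def p_at_k(gains, k=10, min_gain=3):
    # fraction of the top-k results whose grade meets the threshold
    # (min_gain=3 counts Hr only; min_gain=2 counts Hr and Fr together)
    return sum(1 for g in gains[:k] if g >= min_gain) / k

ranking = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]  # hypothetical retrieved list
```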

Significance Test
The significance test was performed over the comparison methods implemented in this paper. The well-known Wilcoxon signed-rank test was used to measure whether the improvements between the corresponding score distributions of two methods are significant. We first randomly sampled 50% of the data in each test set 20 times and used the trained methods to predict results on each sample. Second, we scored each sample with the evaluation script to obtain the metric scores. After 20 rounds of sampling, we had a sequence of 20 metric scores for each method. Finally, the corresponding metric score sequences of any two methods were input into the "wilcox.test()" function in R, which outputs the p-value indicating the significance of the difference between the two sequences. If p < 0.05, the improvement between the two methods is significant; otherwise it is not. In Tables 2 and 3, on both test sets, the improvement of HiCapsRKL over every comparison method on each metric is significant (p < 0.05).
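A pure-Python sketch of the Wilcoxon signed-rank test with the normal approximation, analogous to what R's wilcox.test() computes for paired samples (R additionally offers exact p-values and continuity correction, which this sketch omits):

```python
import math
from statistics import NormalDist

def wilcoxon_signed_rank(a, b):
    """Two-sided signed-rank test on paired scores, normal approximation."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                      # average ranks over tied |diff|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_pos = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_pos - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))
```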

Cohen's Kappa Coefficient
Cohen's kappa coefficient (Artstein and Poesio, 2008) is a statistic measuring inter-rater reliability for qualitative items between two categorical variables (McHugh, 2012). In this experiment, we used the coefficient to measure the agreement between the weakly supervised training set and the gold standard. First, we randomly sampled 25 pairs for each relevance label from the training set, obtaining a random subset of 100 pairs. Second, we manually annotated these pairs. Finally, on the subset, we calculated the Cohen's kappa coefficient between the automatic labels and the annotated labels. The calculation is completed by the "cohen_kappa_score" function in the sklearn toolkit.
The final coefficient score on the random subset is 0.707. Based on the interpretation of the Kappa coefficient in Han (2020), a score between 0.61 and 0.80 means the two variables are in "substantial agreement". The higher the score, the stronger the agreement; for example, scores between 0.81 and 0.99 indicate "near-perfect agreement". This experiment indicates the good quality of the training set. Since the training set is constructed from a large-scale collection of knowledge and medical literature pairs and keeps only the pairs with high confidence, it presents a high Kappa coefficient, indicating substantial agreement with the gold standard.
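Cohen's kappa can be computed in a few lines; the implementation below mirrors what sklearn's cohen_kappa_score returns, and the two label lists are hypothetical examples, not the paper's sampled data:

```python
from collections import Counter

def cohen_kappa(y1, y2):
    # chance-corrected agreement: (p_o - p_e) / (1 - p_e)
    n = len(y1)
    po = sum(a == b for a, b in zip(y1, y2)) / n                  # observed
    c1, c2 = Counter(y1), Counter(y2)
    pe = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)  # by chance
    return (po - pe) / (1 - pe)

# hypothetical automatic vs. manual labels on five sampled pairs
auto   = ["Hr", "Hr", "Fr", "Mr", "Ir"]
manual = ["Hr", "Fr", "Fr", "Mr", "Ir"]
```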

Ablation Study
We conducted experiments that remove one component at a time from HiCapsRKL to validate how it performs on the two test sets. The experimental results are listed in Table 4. The removed component is, in turn, the capsule routing algorithm for RCor (CapsR_RCor), the capsule routing algorithm for KImp (CapsR_KImp), the capsule routing algorithm for both (CapsR), the RCor part (RCor), the KImp part (KImp), and both RCor and KImp parts (RCor&KImp). Additionally, an experiment applying capsule routing to R_MedL is included as "w/ CapsR_MedL". In Table 4, from the last three lines, we can see that the RCor&KImp information plays an important role, and the RCor information shows a greater influence than the KImp information, mainly because the relation information is hard to capture in long medical literature. Moreover, the capsule routing algorithms further improve the performance when used for information extraction (the three "w/o Caps" lines), which indicates their powerful extraction ability. However, it is inappropriate to apply capsule routing to R_MedL (the "w/ CapsR_MedL" line), mainly because R_MedL is learned from the entire medical literature and must capture comprehensive features to determine the hierarchical association, so there is no specific feature to extract from it for this task. Overall, all these components are important in HiCapsRKL and contribute to associating knowledge with medical literature.

RCor&KImp Information Visualization
To clearly present how the learned caps-representations in HiCapsRKL relate to their input text fragments, we visualized the attentive weights between each caps-representation and its input text fragment. This analysis is performed on the relevance prediction test set.
First, we output the caps-representations R^c_RCor and R^c_KImp (Section 3.3) from the trained HiCapsRKL model. Second, we output each token representation in the RCor and KImp text fragments; each token representation comes from the language encoder. Finally, for RCor, we compute the cosine similarities between R^c_RCor and each token representation in the RCor text fragment as the attentive weights, and visualize them with a heat map; for KImp, the computation is between R^c_KImp and each token representation in the KImp text fragment. Due to page limitations, Figure 4 lists two typical examples for the most common relationship, "indications", randomly selected from all samples to present the relation between the caps-representation and its input text fragments. In each example, the heat map in the second block is for the RCor text fragment (RCor block), and the third block is for the KImp text fragment (KImp block). Each block includes the English text translated from its Chinese version, the heat map, and the corresponding Chinese characters and English words. For the relationship, the characters or words "in the treatment of" usually have larger attentive weights than the others, which indicates that R^c_RCor indeed contains the relationship information. For the subject, the heat map in the KImp block is quite different. In the left example, the subject of the medical literature concerns "pituitrin", "pulmonary tuberculosis", and "evaluation of the effectiveness"; the knowledge is related to these words, and the words also show larger attentive weights. The right example likewise reflects the subject information. The heat maps of attentive weights indicate that the caps-representations from HiCapsRKL have learned the RCor&KImp information in knowledge and medical literature.
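The attentive-weight computation described above is plain cosine similarity between the caps-representation and each token representation; a minimal sketch (the vectors in the test are hypothetical stand-ins for encoder outputs):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def attentive_weights(caps_rep, token_reps):
    # one weight per token of the text fragment, used to color the heat map
    return [cosine(caps_rep, t) for t in token_reps]
```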

Conclusion
In this paper, we proposed HiCapsRKL, which leverages capsule routing to associate knowledge with medical literature hierarchically. This is a worthwhile direction for better integrating knowledge to learn rich text representations. On two manually labeled test sets, namely the relevance prediction test set and the medical literature retrieval test set, the proposed HiCapsRKL model shows state-of-the-art performance compared with the other methods. Exhaustive experimental results and analyses prove the strong ability of the proposed model and show its potential for learning association features.
In the future, we will focus on applying this work to improve the text representation of knowledge integration methods through the hierarchical knowledge. For example, HiCapsRKL can be used in a multi-task manner, using the relevance of the knowledge (e.g., the softmax probability or relevance label) as a weight or filter to control the integration process. HiCapsRKL will help reduce the effect of noisy knowledge and may further improve the quality of text representation. Besides, this work can also contribute to other NLP research (e.g., medical information processing, question answering, information retrieval, and reading comprehension), which may benefit from integrating knowledge.

A.1 The definition of four-point scale graded relevance assessment
The four-point scale graded relevance assessment (Kekäläinen, 2005) from the Text REtrieval Conference (TREC) is shown in Table 5. The assessment has been properly adapted for this work. The TREC graded relevance assessment is used as the standard to guide annotators when manually annotating the test sets and the subset of the training set used for the Cohen's kappa coefficient computation in Section 5.1.

A.2 Constructing Training and Test sets
A.2.1 The weakly supervised training set

In this set, the relevance label between knowledge and medical literature is automatically mapped according to the RCor and KImp labels in Table 1. Thus, each training pair is assigned a relevance label, an RCor label, and a KImp label.
To calculate the RCor label of a pair, we first collected the RCor texts as described in Section 3.1 and grouped them by the relationship of the two entities. Since two entities have only one relationship in CMeKG, the texts in one group are the possible candidates to describe it. Second, we replaced all the entities in the texts with a placeholder e (denoted as e-placed texts) according to the CMeKG entities. Third, we used the KNN algorithm to cluster the e-placed texts and ranked the clusters by the number of texts they contain. Next, we selected the top 15 clusters for each group and randomly sampled e-placed texts in each cluster to manually check whether the texts describe the relationship. When over 70% of the sampled e-placed texts did, we regarded this cluster as a correct one for describing the relationship. Finally, when a new RCor text of a knowledge triple arrives, the KNN clustering algorithm calculates its distance to the known clusters in its group. Once the text is clustered into a correct one, it is regarded as describing the relationship, and the training pair is assigned a positive RCor label; otherwise, a negative RCor label.
To calculate the KImp label of a pair, we first used the optimized latent Dirichlet allocation (LDA) (Blei et al., 2003) model in the Gensim toolkit to learn a topic model over all the medical literature. When training the LDA model, each article is considered to discuss one independent subject. We passed the medical literature into the LDA model sequentially and used it to construct the word frequency matrix for training. After training, the LDA model outputs the subject probability of each word in each medical literature; the word probabilities are normalized and sum to 1. Second, we used the trained LDA model to output the subject probabilities of the entities and the relationship of the knowledge related to one medical literature, respectively. Next, we summed the entity and relationship probabilities as the subject probability of the knowledge with respect to the medical literature. Finally, based on the range of this knowledge subject probability, we set the KImp label at four levels, roughly corresponding to the definitions in Table 1.
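The final thresholding step might look like the following; the cut-off values are hypothetical, since the paper does not publish the probability ranges for the four levels:

```python
# Hypothetical cut-offs on the knowledge subject probability; the paper sets
# four levels from the probability range but does not state the thresholds.
def kimp_label(p, cuts=(0.6, 0.3, 0.1)):
    if p >= cuts[0]:
        return "Imp"
    if p >= cuts[1]:
        return "P-imp"
    if p >= cuts[2]:
        return "M-imp"
    return "U-imp"
```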
Finally, for each knowledge and medical literature pair, we have its RCor label and KImp label, and based on the mapping definition in Table 1 we also obtain an overall relevance label, which can be used to train the matching model.

A.2.2 The manually-labeled relevance prediction test set
In this set, the data pairs are randomly selected from the whole resource, excluding those in the training set. First, we recruited several professional annotators with the required language skills; they were trained in advance according to the TREC four-point scale graded relevance assessment to determine the label of a pair. These annotators knew nothing about RCor and KImp or their definitions. Second, each data pair was assigned to three annotators simultaneously, and each annotator assigned one label from "H_r", "F_r", "M_r", and "I_r". Finally, we determined the label of a pair by the crowd-sourcing principle (Liu et al., 2018a): two or more annotators give the same label, and the remaining annotator must not give a conflicting label. A label conflicts with another if the two labels follow one of the cases ("H_r", "M_r"), ("F_r", "M_r"), or ("I_r" with "H_r", "F_r", or "M_r"). Conflicting annotations are discussed further until no conflict remains.
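The crowd-sourcing principle and the listed conflict cases can be sketched as a small label resolver, where returning None stands for "send back for discussion":

```python
from collections import Counter

# conflicting label pairs as listed in the text
CONFLICTS = {frozenset(p) for p in
             [("Hr", "Mr"), ("Fr", "Mr"), ("Ir", "Hr"), ("Ir", "Fr"), ("Ir", "Mr")]}

def resolve(labels):
    """labels: the three annotators' labels; None means 'discuss further'."""
    top, cnt = Counter(labels).most_common(1)[0]
    if cnt < 2:
        return None                      # no majority at all
    if any(frozenset((top, l)) in CONFLICTS for l in labels if l != top):
        return None                      # minority label conflicts with majority
    return top
```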

A.2.3 The manually-labeled medical literature retrieval test set
In this set, each knowledge item corresponds to multiple medical literature documents. The four relevance labels used by the annotators are defined as follows:

Hr (Highly relevant): [...] In case of multi-faceted knowledge, all or most sub-themes or viewpoints are covered.

Fr (Fairly relevant): The retrieved literature contains more information than the knowledge description, but the presentation is not exhaustive. In case of multi-faceted knowledge, only some of the sub-themes or viewpoints are covered.

Mr (Marginally relevant): The retrieved literature only points to the knowledge; it does not contain more or other information than the description.

Ir (Irrelevant): The retrieved literature does not contain any information about the knowledge.

These documents are collected from the whole resource with the pooling
method. The annotators need to give the relevance label between the knowledge and each medical literature document. First, we randomly selected 100 knowledge items from CMeKG. Second, we trained the comparison and proposed models with the training set, and then applied the trained models to each knowledge item to retrieve medical literature from the whole medical literature resource. Third, the popular pooling method (Spärck Jones, 1975) in IR was used, in which the top-5 literature from the retrieval results of each model was collected. All the literature (5,500 documents in total) from these models was gathered and de-duplicated to obtain the candidate medical literature for each knowledge item. Finally, the annotators manually annotated the relevance label between each knowledge item and its corresponding medical literature with "Hr", "Fr", "Mr", or "Ir". Therefore, in the medical literature retrieval test set, each knowledge item is assigned multiple literature documents. The annotation and label determination process for each pair follows the same crowd-sourcing process as in the relevance prediction test set. The distributions of these datasets are shown in Table 6, including the numbers of each label, medical literature, knowledge, and relationships in the three datasets.

A.3 Comparison Methods

Neural learning-to-rank methods (NeuL2R): KNRM, Conv-KNRM (Dai et al., 2018), BERT (Devlin et al., 2019), and Siamese BERT (Reimers and Gurevych, 2019), which have shown promising performance on many relevance prediction or ranking benchmarks. The KNRM and Conv-KNRM models take the medical literature and knowledge as input and output the relevance label for prediction or the relevance probability for ranking. BERT and Siamese BERT are pre-trained matching methods; after fine-tuning, they can also output a relevance label for each medical literature and knowledge pair. In the BERT model, the medical literature and knowledge are converted into one sequence and modeled with a multi-layer Transformer architecture. The Siamese BERT model is a modification of BERT in which the knowledge and the medical literature are fed into a shared BERT encoder to learn independent representations; the two representations are fused in the last layer for the final label prediction. The source codes for KNRM, Conv-KNRM, and BERT are from the official releases on GitHub. For Siamese BERT, we follow the structure described in the paper (Reimers and Gurevych, 2019) for this experiment. Besides, we tried to implement more recent matching models (Liu et al., 2018b; Hofstätter, 2020) and BERT-based variant models (Boualili et al., 2020; Rudra and Anand, 2020), but we did not obtain the expected results; for fair comparison, these methods are not included in this paper.

Translation-based KG embedding methods (KGemb) (Bordes et al., 2013; Wang et al., 2014; Ji et al., 2015; Sun et al., 2019): The translation-based knowledge graph embedding methods learn entity and relationship representations through entity prediction. They are widely used to model knowledge in a low-dimensional vector space while maintaining the attributes of entities and relationships. First, four well-known methods are applied in this experiment, namely transE (Bordes et al., 2013), transH (Wang et al., 2014), transD (Ji et al., 2015), and rotatE (Sun et al., 2019). They are pre-trained on the knowledge in CMeKG, and each method outputs an embedding file containing the entity embeddings and relationship embeddings. These methods are implemented with the OpenKE toolkit (Han et al., 2018). Second, in each KGemb matching model, the knowledge representation is the concatenation of the KG embeddings of the entities and the relationship, and the medical literature representation is from the BERT encoder. Finally, both representations are fused in the last layer for relevance prediction or literature ranking. Since graph-based embedding methods, e.g. node2vec (Grover and Leskovec, 2016) and graph2vec (Narayanan et al., 2017), only focus on node embeddings in the graph and ignore the relationships, they are not included for comparison in this work.

Our implementations: In this work, we first implemented the proposed HiCapsRKL model according to the description of each part in Section 3. Second, we implemented seven additional models for the ablation study: CapsR_MedL, CapsR_RCor, CapsR_KImp, CapsR, RCor, KImp, and RCor&KImp. Each model differs from the proposed HiCapsRKL model in one component.
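The KGemb matching described above can be sketched as follows. This is a minimal illustration of the fusion step only: the embeddings are randomly initialized stand-ins (not pre-trained transE/BERT outputs), and the dimensions and single linear fusion layer are our assumptions, not the paper's exact configuration.

```python
# Sketch of a KGemb matching model: the knowledge representation is the
# concatenation of the entity and relationship KG embeddings, which is
# fused with the document representation for relevance scoring.
# All vectors here are random stand-ins for illustration.
import numpy as np

rng = np.random.default_rng(0)
dim = 64                                    # KG embedding dimension (assumed)

e_h, r, e_t = (rng.standard_normal(dim) for _ in range(3))
doc_vec = rng.standard_normal(768)          # stand-in for a BERT [CLS] vector

knowledge_rep = np.concatenate([e_h, r, e_t])        # shape (192,)
fused = np.concatenate([knowledge_rep, doc_vec])     # shape (960,)

W = rng.standard_normal((4, fused.shape[0])) * 0.01  # 4 relevance labels
logits = W @ fused
label = int(np.argmax(logits))              # predicted relevance level, 0..3
```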

A.4 Experimental setup
Some experimental settings and hyperparameters used in this work are listed below. The Chinese text segmentation tool for sentence processing is Jieba; during processing, the entities in CMeKG and the keywords in the literature are also added to the Jieba dictionary. The language encoder in all experiments is BERT-base, Chinese. The max input length for modeling the RCor and KImp texts is 256, and that of a knowledge and medical literature pair is 512. In the multi-head splitting operation, the number of splitting heads is 12. In the capsule routing method, the number of capsules is 12, the dimension D of the input and output capsules is 64, and the layer iteration K is 3. The threshold scopes for distinguishing the four relevance labels in the TF·IDF method are [...].

In Section 3.1, the sentences in the KImp text fragment are selected based on the distribution of the top-10 words in the literature. As described in Section A.2.1, we ranked and selected the top-10 words based on the subject probability of each word, counted which sentences cover most of these top-10 words, and selected those sentences into the KImp text fragment. According to this coverage statistic, these words mostly occur in the title, keywords, first sentence, tail sentence, and other parts, in descending order. The work integrates the three representations in Section 3 to obtain R_MedL using the "add" operation, and cross-entropy is the loss function for HiCapsRKL training. Early stopping is used for parameter selection when training all models. All the NeuL2R, KGemb, and our implemented methods are trained with the weakly supervised training data, and all comparison methods are evaluated on the two test sets.
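The capsule routing configuration listed above (12 capsules, capsule dimension D = 64, K = 3 iterations) can be illustrated with the standard routing-by-agreement scheme. This is our illustrative reading of the stated hyperparameters, not the paper's exact implementation.

```python
# Sketch of dynamic routing with the hyperparameters stated above:
# 12 input and 12 output capsules, dimension D = 64, K = 3 iterations.
import numpy as np

def squash(v, axis=-1, eps=1e-9):
    """Shrink vector norms into [0, 1) while preserving direction."""
    n2 = np.sum(v ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, K=3):
    """u_hat: (n_in, n_out, D) prediction vectors; returns (n_out, D) capsules."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                          # routing logits
    for _ in range(K):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        s = np.einsum("ij,ijd->jd", c, u_hat)            # weighted sum per capsule
        v = squash(s)                                    # (n_out, D) output capsules
        b = b + np.einsum("ijd,jd->ij", u_hat, v)        # agreement update
    return v

rng = np.random.default_rng(0)
u_hat = rng.standard_normal((12, 12, 64))                # 12 -> 12 capsules, D = 64
caps = dynamic_routing(u_hat, K=3)                       # caps-representation, (12, 64)
```

The squash nonlinearity keeps every output capsule's norm below 1, so the norm can be read as the activation strength of that capsule.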