Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT

Infusing factual knowledge into pre-trained models is fundamental for many knowledge-intensive tasks. In this paper, we propose Mixture-of-Partitions (MoP), an infusion approach that can handle a very large knowledge graph (KG) by partitioning it into smaller sub-graphs and infusing their specific knowledge into various BERT models using lightweight adapters. To leverage the overall factual knowledge for a target task, these sub-graph adapters are further fine-tuned along with the underlying BERT through a mixture layer. We evaluate our MoP with three biomedical BERTs (SciBERT, BioBERT, PubMedBERT) on six downstream tasks (incl. NLI, QA, classification), and the results show that MoP consistently enhances the underlying BERTs' task performance and achieves new SOTA results on five of the evaluated datasets.


Introduction
Leveraging factual knowledge to augment pretrained language models is of paramount importance for knowledge-intensive tasks, such as question answering and fact checking (Petroni et al., 2021). Especially in the biomedical domain, where public training corpora are limited and noisy, trusted biomedical KGs are crucial for deriving accurate inferences (Li et al., 2020). However, the infusion of knowledge from real-world biomedical KGs, whose entity sets are very large (e.g. UMLS (Bodenreider, 2004) contains ∼4M entities), demands highly scalable solutions.
Although many general knowledge-enhanced language models have been proposed, most of them rely on a computationally expensive joint training of an underlying masked language model (MLM) along with a knowledge-infusion objective function to minimize the risk of catastrophic forgetting (Xiong et al., 2019; Wang et al., 2021; Peters et al., 2019; Yuan et al., 2021). Alternatively, entity masking (or entity prediction) has emerged as one of the most popular self-supervised training objectives for infusing entity-level knowledge into pretrained models (Yu et al., 2020; He et al., 2020). However, due to the large number of entities in biomedical KGs, computing an exact softmax over all entities is very expensive for training and predicting (De Cao et al., 2021). Although negative sampling techniques could alleviate the computational issue (Sun et al., 2020), tuning an appropriately hard set of negative instances can be challenging, and predicting a very large number of labels may generalize poorly (Hinton et al., 2015).
To address the aforementioned challenges, we propose a novel knowledge infusion approach, named Mixture-of-Partitions (MoP), to infuse factual knowledge based on partitioned KGs into pretrained models (BioBERT, Lee et al. 2020; SciBERT, Beltagy et al. 2019; and PubMedBERT, Gu et al. 2020). More concretely, we first partition a KG into several sub-graphs, each containing a disjoint subset of its entities, using the METIS algorithm (Karypis and Kumar, 1998), and then apply the Transformer ADAPTER module (Houlsby et al., 2019; Pfeiffer et al., 2020b) to learn portable knowledge parameters from each sub-graph. In particular, using the ADAPTER module to infuse knowledge does not require fine-tuning the parameters of the underlying BERTs, which is more flexible and efficient while avoiding the catastrophic forgetting issue. To utilise the independently learned knowledge from the sub-graph adapters, we introduce mixture layers that automatically route useful knowledge from these adapters to downstream tasks. Figure 1 illustrates our approach.
Our results and analyses indicate that our "divide and conquer" partitioning strategy effectively preserves the rich information presented in two biomedical KGs from UMLS while enabling us to scale up training on these very large graphs. Additionally, we observe that while individual adapters specialize towards sub-graph specific knowledge, MoP can effectively utilise their individual expertise to enhance the performance of our tested biomedical BERTs on six downstream tasks, where five of them achieve new SOTA performances.

[Figure 1: overview of MoP. The KG G is partitioned into sub-graphs G_1, ..., G_K, and an adapter is trained on each sub-graph G_k, k ∈ {1, 2, ..., K} (repeated for all K sub-graphs).]

Mixture-of-Partitions (MoP)
We denote a KG as a collection of ordered triples G = {(h, r, t)} ⊆ E × R × E, where E and R are the sets of entities and relations, respectively. All entities and relations are associated with textual surface forms, which can be a single word (e.g. fever), a compound (e.g. sars-cov-2), or a short phrase (e.g. has finding site). Given a pretrained model Θ_0, our task is to learn knowledge parameters Φ_G based on an input knowledge graph G, such that they encapsulate the knowledge from G. The training objective, L_G, can be implemented in many ways, such as relation classification, entity linking (Peters et al., 2019), next sentence prediction (Goodwin and Demner-Fushman, 2020), or entity prediction. In this paper, we focus on entity prediction, one of the most widely used objectives, and leave the exploration of other objectives for future work.

Knowledge Graph Partitioning
Graph partitioning (i.e., partitioning the node set into mutually exclusive groups) is a critical step in our approach, since we need to properly and automatically cluster knowledge triples to support data parallelism and control computation. In particular, it must satisfy the following goals: (1) maximize the number of resulting knowledge triples to retain as much factual knowledge as possible; (2) balance nodes over partitions to reduce the overall number of parameters across the different entity prediction heads; (3) be efficient at scale for handling large KGs. An exact solution to (1) and (2) is known as the balanced graph partition problem, which is NP-complete. We use the METIS (Karypis and Kumar, 1998) algorithm as an approximation, simultaneously meeting all three requirements. METIS can handle billion-scale graphs by successively coarsening a large graph into smaller graphs, partitioning them quickly, and then projecting the partitions back onto the larger graph; it has been used in many tasks (Chiang et al., 2019; Defferrard et al., 2016).
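Since the balanced partition problem is NP-complete, METIS approximates it via multilevel coarsening. As a toy illustration of goals (1) and (2) only — not METIS itself, and with all function names being our own — a greedy neighbourhood-growing partitioner already makes concrete why keeping connected entities together preserves triples:

```python
import random
from collections import defaultdict, deque

def greedy_partition(triples, n_parts, seed=0):
    """Toy balanced node partitioner (a stand-in for METIS): grow each
    partition from unassigned nodes, preferring neighbours, so that
    connected entities tend to land in the same sub-graph."""
    adj = defaultdict(set)
    nodes = set()
    for h, _, t in triples:
        adj[h].add(t)
        adj[t].add(h)
        nodes.update((h, t))
    cap = -(-len(nodes) // n_parts)  # ceil(n / k): balance goal (2)
    rng = random.Random(seed)
    order = sorted(nodes)
    rng.shuffle(order)
    part, assigned = {}, set()
    for p in range(n_parts):
        queue = deque(n for n in order if n not in assigned)
        size = 0
        while queue and size < cap:
            n = queue.popleft()
            if n in assigned:
                continue
            part[n] = p
            assigned.add(n)
            size += 1
            # visit neighbours first: keeps triples intact (goal 1)
            queue.extendleft(m for m in adj[n] if m not in assigned)
    return part

def retained_triples(triples, part):
    """Only triples whose head and tail share a sub-graph are retained."""
    return [(h, r, t) for h, r, t in triples if part[h] == part[t]]
```

In practice the real METIS library should be used; this sketch merely makes the retained-triples criterion explicit.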

Knowledge Infusion with Adapters
Once the large knowledge graph is partitioned, we use ADAPTER modules to infuse the factual knowledge into a pretrained Transformer model by training an entity prediction objective for each sub-graph. ADAPTERs (Houlsby et al., 2019; Pfeiffer et al., 2020b) are newly initialized modules inserted between the Transformer layers of a pretrained model. Training an ADAPTER does not require fine-tuning the existing parameters of the pretrained model; instead, only the parameters within the ADAPTER modules are updated. In this paper, we use the ADAPTER module configured by Pfeiffer et al. (2020a), which is shown in Figure 1 (b). In particular, given a sub-graph G_k, we remove the tail entity name for each triple (h, r, t) ∈ G_k, and transform the triple into a list of tokens [CLS] X_h X_r [SEP], where X_h and X_r are the surface-form tokens of the head entity h and the relation r. The sub-graph specific ADAPTER module is trained to predict the tail entity using the representation of the [CLS] token, and the parameters Φ_{G_k} are optimized by minimizing the cross-entropy loss. During fine-tuning on downstream tasks, the parameters of both the ADAPTER and the pre-trained LM are updated.
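As a minimal sketch of this input construction (the exact [SEP] placement and all helper names are our assumptions, not the paper's released code), each triple yields one classification example whose label space is restricted to the sub-graph's own tail entities:

```python
def triple_to_example(h_name, r_name, t_name):
    """Build one entity-prediction example from a triple (h, r, t): the
    tail surface form is removed from the input and kept as the label."""
    text = f"[CLS] {h_name} [SEP] {r_name} [SEP]"
    return text, t_name

def build_label_space(sub_graph):
    """Each sub-graph gets its own small tail-entity label space, so a
    softmax over all ~4M UMLS entities is never computed."""
    labels = sorted({t for _, _, t in sub_graph})
    return {name: i for i, name in enumerate(labels)}
```

Restricting the label space per sub-graph is what keeps the entity-prediction softmax tractable compared with predicting over the full entity set.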

Mixture Layers
Given a set of knowledge-encapsulated adapters, we use AdapterFusion mixture layers to combine knowledge from different adapters for downstream tasks. AdapterFusion (Pfeiffer et al., 2020a) is a recently proposed model that learns to combine the information from a set of task adapters by a softmax attention layer. At layer l, it learns a contextual mixture weight over adapters with a softmax attention:

s_{l,k} = softmax_k( (W_Q h_l)^T (W_K o_{l,k}) ),

where h_l is the hidden state of the underlying Transformer at layer l, o_{l,k} is the output of the k-th adapter, and s_{l,k} is used to mix the adapter outputs to be passed into the next layer. The final layer L is used to predict a task label y:

y = f(h_L),

where f is the target task prediction head. Closely related to ours is the sparsely-gated Mixture-of-Experts layer (Shazeer et al., 2017). Alternatively, a more flexible mechanism such as Gumbel-Softmax (Jang et al., 2017) can be used for obtaining more discrete/continuous mixture weights. However, we found that both alternatives underperform AdapterFusion (see Appendix for a comparison).

Pretraining Knowledge Graphs

We evaluate our proposed MoP on two KGs, named SFull and S20Rel, which are extracted from the large biomedical knowledge graph UMLS (Bodenreider, 2004) under the SNOMED CT, US Edition vocabulary. The SFull KG contains the full set of relations and entities of SNOMED CT, while the S20Rel KG is a subset of SFull that contains only the top 20 most frequent relations. Since some relations in SFull are reversed mappings of the same entity pairs, e.g. "A has causative agent B" and "B causative agent of A", we exclude such reversed relations when selecting the top 20 for S20Rel. Table 2 shows the statistics of the two KGs, and the 20 relations used in S20Rel are listed in the appendix.

Evaluated Tasks and Datasets
We evaluate our MoP on six datasets over various downstream tasks, including four question answering datasets (PubMedQA, BioASQ7b, BioASQ8b, MedQA), one document classification dataset (HoC), and one natural language inference dataset (MedNLI). While HoC is a multi-label classification task and MedQA is a multi-choice prediction task, the rest can be formulated as binary/multiclass classification tasks. See Appendix for detailed descriptions of these tasks and their datasets.

Pretraining with Base Models
We experiment with three biomedical pretrained models, namely BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019) and PubMedBERT (Gu et al., 2020), as our base models, which have shown strong progress on biomedical text mining tasks. We first partition our KGs into different numbers of sub-graphs (i.e. {5, 10, 20, 40}); then, for each sub-graph, we train the base models loaded with the newly initialized ADAPTER modules (with a compression rate CRate = 8) for 1-2 epochs by minimizing the cross-entropy loss. AdamW (Loshchilov and Hutter, 2018) is used as our training optimizer, and the learning rate for all sub-graphs is fixed to 1e-4, as suggested by Pfeiffer et al. (2020b). Unless specified otherwise, all reported performances are based on a partition into 20 sub-graphs, since this was optimal for task performance (see Section 3.7 for performances over different numbers of partitions).

Partition Evaluation on Tasks
In Figure 2 we report the average performance (10 runs) of the knowledge-infused PubMedBERT on two QA datasets over partitioned SFull. We can see that partitions contribute to various degrees, while some (e.g. #5) have a negligible benefit.

Overall Performance on Tasks

Table 1 shows the overall performance of our MoP deployed on the SciBERT, BioBERT and PubMedBERT pretrained models. We see that MoP pretrained on the SFull KG improves both the BioBERT and PubMedBERT models on all tasks, while the SciBERT model is improved on 4 out of 6 tasks. The results also show that MoP pretrained with the S20Rel KG achieves new SOTA performances on four tasks. This suggests that further pruning of the knowledge triples helps task performance by reducing noise, and is a promising direction to explore in future work.

METIS Partitioning Quality
We design a controlled random partitioning scheme to test whether METIS produces high-quality partitions for training. We fix the entity sizes of a 20-partition result produced by METIS, and randomly shuffle a percentage (ranging from 0% to 100%) of entities across the sub-graphs. Table 3 shows the number of training triples under different shuffling ratios. In Figure 3 we report the results on BioASQ7b and PubMedQA under different shuffling rates. We can see that the performance of MoP on both datasets degrades significantly as the shuffling rate increases, which highlights the quality of the produced partitions.

Impact of the Number of Partitions

Table 4 shows the performance of PubMedBERT+MoP trained on the SFull knowledge graph over different numbers of partitions. We can clearly see that PubMedBERT+MoP performs best with 20 partitions on both the BioASQ7b and PubMedQA datasets, and that an average entity size of 15k-30k per sub-graph usually yields better performance than other settings.
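The controlled shuffle can be sketched as follows (a hypothetical helper, assuming entity-to-partition assignments are stored in a dict); the key property is that sub-graph entity counts stay fixed while a chosen fraction of entities is reassigned:

```python
import random

def shuffle_partition(part, ratio, seed=0):
    """Reassign a fraction `ratio` of entities across sub-graphs while
    keeping every sub-graph's entity count fixed: the selected entities
    simply permute their partition labels among themselves."""
    rng = random.Random(seed)
    entities = sorted(part)
    chosen = rng.sample(entities, int(len(entities) * ratio))
    labels = [part[e] for e in chosen]
    rng.shuffle(labels)
    out = dict(part)
    for e, lab in zip(chosen, labels):
        out[e] = lab
    return out
```

At ratio 0.0 the METIS partition is returned unchanged; at 1.0 the assignment is fully random but still size-balanced, matching the control described above.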

Case Study
In Figure 4, we show six examples (contexts are omitted for brevity) from BioASQ8b and compare their mixture weights of the final-layer [CLS] token inferred by the PubMedBERT+MoP (SFull) model. We see that each question elicits different mixture weights, indicating that MoP can leverage the expertise of different sub-graphs depending on the target example. We also plot the word cloud over six groups of sub-graphs that are clustered by k-means according to the TF-IDF features of the entity names in these sub-graphs. We can observe that MoP identifies the most related sub-graphs for each example (e.g. Q2 puts more weight on sub-graphs [1, 13, 14], which specialise in 'tumor' knowledge). This validates the effectiveness of our MoP in balancing useful knowledge across adapters.
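A minimal sketch of the TF-IDF features behind this clustering (our simplification; the paper's exact tokenization and k-means setup are not reproduced here), treating each sub-graph's bag of entity-name tokens as one document:

```python
import math
from collections import Counter

def tfidf(doc_tokens_list):
    """TF-IDF over 'documents', where each document is the bag of
    entity-name tokens belonging to one sub-graph."""
    n = len(doc_tokens_list)
    df = Counter()  # document frequency of each token
    for toks in doc_tokens_list:
        df.update(set(toks))
    vecs = []
    for toks in doc_tokens_list:
        tf = Counter(toks)
        total = len(toks)
        vecs.append({w: (c / total) * math.log(n / df[w])
                     for w, c in tf.items()})
    return vecs
```

Tokens shared by every sub-graph get zero weight, so the clusters (and word clouds) are driven by each sub-graph's distinctive vocabulary.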

Conclusion and Future Work
In this paper, we proposed MoP, a novel approach for infusing knowledge by partitioning knowledge graphs into smaller sub-graphs. We show that while the knowledge-encapsulated adapters perform very differently over different sub-graphs, our proposed MoP can automatically leverage and balance the useful knowledge across those adapters to enhance various downstream tasks. In the future, we will evaluate our approach on general-domain KGs and general-domain tasks.

A.1 Relations Used in S20Rel

... interprets, 11 has method, 12 has direct device, 13 has dose form, 14 has subject relationship context, 15 has pathological process, 16 has interpretation, 17 moved to, 18 has intent, 19 has temporal context

A.2 Evaluated Datasets and Experiment details
We evaluate our MoP on six datasets over various downstream tasks, including four question answering datasets (i.e., PubMedQA, Jin et al. 2019; BioASQ7b, Nentidis et al. 2019; BioASQ8b, Nentidis et al. 2020; MedQA, Jin et al. 2020), one document classification dataset (HoC, Baker and Korhonen 2017), and one natural language inference dataset (MedNLI, Romanov and Shivade 2018). While HoC is a multi-label classification task and MedQA is a multi-choice prediction task, the rest can be formulated as binary/multiclass classification tasks.
• HoC (Baker and Korhonen, 2017): The Hallmarks of Cancer corpus was extracted from 1852 PubMed publication abstracts by Baker and Korhonen (2017), and the class labels were manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes in a hierarchy, but in this paper we only consider the ten top-level ones. We use the publicly available train/dev/test split created by Gu et al. (2020) and report the average micro F1 across the ten cancer hallmarks over five runs.
• PubMedQA (Jin et al., 2019): This is a question answering dataset that contains a set of research questions, each with a reference text from a PubMed abstract as well as an annotated label of whether the text contains the answer to the research question (yes/maybe/no). We use the original train/dev/test split with 450/50/500 questions, respectively. The reported performances are the average of ten runs under the accuracy metric.
• BioASQ7b, BioASQ8b (Nentidis et al., 2019, 2020): Both BioASQ datasets are yes/no question answering tasks annotated by biomedical experts. Each question is paired with a reference text containing multiple sentences from a PubMed abstract and a yes/no answer. We use the official train/dev/test splits, i.e. 670/75/140 and 729/152/152 for BioASQ7b and BioASQ8b respectively, and the reported performances are the average of ten runs under the accuracy metric.
• MedNLI (Romanov and Shivade, 2018): MedNLI is a natural language inference (NLI) collection of sentence pairs extracted from MIMIC-III, a large clinical database. The objective of the NLI task is to determine whether a given hypothesis can be inferred from a given premise. The task is formulated as a classification task over three labels: {entailment, contradiction, neutral}. We use the same train/dev/test split generated by Romanov and Shivade (2018), and report the average accuracy over three runs.
• MedQA (Jin et al., 2020): MedQA is a publicly available large-scale multiple-choice question answering dataset extracted from professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, but in this paper we only adopt the English set, which is split by Jin et al. (2020). Following Jin et al. (2020), we use the Elasticsearch system to retrieve the top 25 sentences for each question+choice pair as the context for that choice, and concatenate them to obtain the normalized log probability over the five choices. Since this dataset is very large, we report the average accuracy over three runs for all models.

A.3 Comparison of Different Mixture Approaches
Our MoP approach first infuses the factual knowledge of all the partitioned sub-graphs using their respective adapters with newly initialized parameters; these knowledge-encapsulated adapters are then further fine-tuned along with the underlying BERT model through mixture layers. In this paper, we explored three approaches for implementing the mixture layers, which are described as follows:
• Softmax. As the default mixture layer deployed in our MoP, AdapterFusion (Pfeiffer et al., 2020a) is a recently proposed model that learns to combine the information from a set of task adapters by a softmax attention layer. In particular, the outputs from different adapters at layer l are combined using a contextual mixture weight calculated by a softmax over these adapters:

s_{l,k} = softmax_k( (W_Q h_l)^T (W_K o_{l,k}) ),   (3)

where h_l is the hidden state at layer l and o_{l,k} is the output of the k-th adapter. For brevity, we denote our MoP with the original AdapterFusion mixture layers as Softmax.
• Gumbel. We also extend AdapterFusion by replacing the softmax layer with the Gumbel-Softmax (Jang et al., 2017) layer for obtaining more discrete mixture weights:

s_{l,k} = exp((log π_{l,k} + g_k) / τ) / Σ_{k'} exp((log π_{l,k'} + g_{k'}) / τ),

where π_{l,k} are the attention scores over the adapters at layer l, g_1, ..., g_K are i.i.d. samples drawn from the Gumbel(0, 1) distribution, and τ is a hyper-parameter controlling the discreteness. For brevity, we denote our MoP with the Gumbel-Softmax AdapterFusion mixture layers as Gumbel.
• MoE. Mixture-of-Experts (MoE) is a general-purpose neural network component for selecting a combination of experts to process each input. In particular, we use the sparsely-gated mixture-of-experts introduced by Shazeer et al. (2017) for obtaining a top-K sparse mixture of these adapters, with the mixture weights calculated by:

s_l = softmax( TopK( [H(Φ_{l,G_1}), ..., H(Φ_{l,G_K})], K ) ),

where H(Φ_{l,G_k}) is a function for transferring hidden variables into scalars with tunable Gaussian noise, and TopK(·) is a function that keeps only the top K values. We denote our MoP with this mixture approach as MoE. Table 6 shows the performance comparison of the three mixture approaches on the BioASQ7b and PubMedQA tasks. Note that Gumbel and MoE have additional hyper-parameters, i.e. τ and K, for controlling the discreteness and the top-K sparsity respectively. From Table 6, we can see that the original AdapterFusion with the softmax layer outperforms the other two mixture approaches for all hyper-parameter choices. This result justifies the choice of the mixture layers in our MoP model.
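The three mixture mechanisms compared above can be sketched side by side in plain Python (a deliberately simplified sketch: scalar dot-product scores stand in for the learned projections and the noisy gating network, and all helper names are ours):

```python
import math
import random

def _softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def fuse_softmax(h_l, adapter_outputs):
    """AdapterFusion-style mixing: attention scores between the layer's
    hidden state h_l and each adapter output pick a convex combination
    (the query/key projections are omitted for brevity)."""
    scores = _softmax([sum(a * b for a, b in zip(h_l, o))
                       for o in adapter_outputs])
    dim = len(h_l)
    mixed = [sum(s * o[i] for s, o in zip(scores, adapter_outputs))
             for i in range(dim)]
    return scores, mixed

def fuse_gumbel(log_pi, tau, seed=None):
    """Gumbel-Softmax weights: add Gumbel(0, 1) noise to the log-scores
    and anneal with temperature tau (tau -> 0 approaches one-hot)."""
    rng = random.Random(seed)
    g = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in log_pi]
    return _softmax([(lp + gi) / tau for lp, gi in zip(log_pi, g)])

def fuse_moe_topk(scores, k):
    """Sparsely-gated MoE gate: keep the top-k scores, mask the rest to
    -inf, then softmax, so only k adapters get nonzero weight."""
    thresh = sorted(scores, reverse=True)[k - 1]
    kept = [s if s >= thresh else float("-inf") for s in scores]
    m = max(kept)
    weights = [math.exp(s - m) if s > float("-inf") else 0.0 for s in kept]
    z = sum(weights)
    return [w / z for w in weights]
```

All three return a convex combination over adapters; they differ in how sharp (softmax), how stochastic (Gumbel), or how sparse (top-K MoE) the weights are.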

A.4 Performance on the Split Sub-graphs
To further validate that the performance improvements of the evaluated BERTs under our MoP stem from the knowledge infused via the sub-graph adapters, rather than merely from the additional adapter parameters, we split the partitioned sub-graphs into two groups according to their test performance ranking, and use our MoP to fine-tune the adapters of each group. Tables 7 and 8 show the performances of all the adapters and of our MoP combining the grouped adapters over the train/dev/test sets of the BioASQ7b and PubMedQA datasets, respectively. As we can see from the two tables, MoP fine-tuned on the group of higher-performing adapters consistently outperforms MoP fine-tuned on the group of lower-performing adapters. Note that we have shown in Figure 2
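The grouping used here can be sketched as follows (a hypothetical helper; `adapter_scores` maps each sub-graph adapter to its individual test score, which is not the paper's exact procedure but illustrates the split):

```python
def split_by_rank(adapter_scores):
    """Split sub-graph adapters into higher- and lower-performing halves
    by their individual test scores; MoP is then fine-tuned separately
    on each half to isolate the effect of infused knowledge."""
    ranked = sorted(adapter_scores, key=adapter_scores.get, reverse=True)
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]
```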