Adapters for Enhanced Modeling of Multilingual Knowledge and Text

Large language models appear to learn facts from the large text corpora they are trained on. Such facts are encoded implicitly within their many parameters, making it difficult to verify or manipulate what knowledge has been learned. Language models have recently been extended to multilingual language models (MLLMs), enabling knowledge to be learned across hundreds of languages. Meanwhile, knowledge graphs contain facts in an explicit triple format, which require careful and costly curation and are only available in a few high-resource languages, restricting their research and application. To address these issues, we propose to enhance MLLMs with knowledge from multilingual knowledge graphs (MLKGs) so as to tackle language and knowledge graph tasks across many languages, including low-resource ones. Specifically, we introduce a lightweight adapter set to enhance MLLMs with cross-lingual entity alignment and facts from MLKGs for many languages. Experiments on common benchmarks show that such enhancement benefits both MLLMs and MLKGs, achieving: (1) comparable or improved performance for knowledge graph completion and entity alignment relative to baselines, especially for low-resource languages (for which knowledge graphs are unavailable); and (2) improved MLLM performance on language understanding tasks that require multilingual factual knowledge; all while maintaining performance on other general language tasks.


Introduction
Knowledge graphs serve as a source of explicit factual information for various NLP tasks. However, language models (Devlin et al., 2019; Brown et al., 2020), which capture implicit knowledge from vast text corpora, are already being used in knowledge-intensive tasks. Recently, language models have been successfully extended to multilingual language models (MLLMs) that integrate information sourced across hundreds of languages (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020). However, as with most neural networks, the information is encoded in a diffused and opaque manner that is difficult to interpret, verify or utilize (AlKhamissi et al., 2022).
Meanwhile, multilingual knowledge graphs (MLKGs) require careful curation of explicit facts and annotation of entities that occur across languages (cross-lingual entity alignment), making knowledge graphs expensive and time-consuming to extend to new languages and restricting knowledge graph research to a few high-resource languages. Further, open-source MLKGs such as WordNet (Bond and Foster, 2013) and Wikidata (Vrandečić and Krötzsch, 2014) suffer from incompleteness, as many true facts (or triples) and entity alignments are missing (Chen et al., 2017, 2020).
In this work, we propose to overcome the above limitations of each knowledge source by integrating MLKGs into MLLMs (as shown in Figure 1), to enable (i) the transfer of MLKG knowledge from high-resource languages to low-resource languages; and (ii) explicit knowledge of MLKGs to supplement MLLMs for knowledge-intensive language tasks, one of the key challenges in MLLMs (AlKhamissi et al., 2022).
While this idea seems intuitive, there is no easy way to incorporate the explicit knowledge of MLKGs into the parametrically stored information of MLLMs. Existing knowledge integration methods utilize language models and knowledge graphs in two ways: (1) training knowledge graph embeddings individually and combining the embeddings corresponding to linked entities in sentences with the language model representations (e.g., KnowBERT (Peters et al., 2019) and ERNIE (Zhang et al., 2019)); or (2) absorbing the knowledge in knowledge graphs into the language model's parameters via joint training (e.g., K-BERT (Liu et al., 2020) and K-Adapter (Wang et al., 2021)).
The first method requires embedding knowledge graph entities and accurately extracting entities in sentences across hundreds of languages, which is highly challenging. The second method typically suffers from the curse of multilinguality (Conneau et al., 2020; Doddapaneni et al., 2021; Jiao et al., 2022) and catastrophic forgetting (Kirkpatrick et al., 2016) due to limited model capacity. Most importantly, both methods integrate knowledge implicitly, such that it is difficult to access and extend to low-resource languages (AlKhamissi et al., 2022). Furthermore, both methods require large sets of aligned sentences and knowledge triples, which are costly to gather and accurately annotate across hundreds of languages.
To address the above issues, we first collect and clean multilingual data from Wikidata and Wikipedia for the enhancement, where rich factual knowledge and cross-lingual alignments are available. Then, we propose to enhance MLLMs with the MLKG information by using a set of adapters (Houlsby et al., 2019), which are lightweight, collectively adding only around 0.5% extra parameters to the MLLM. Each adapter integrates information from either MLKG Triples (i.e., facts) or cross-lingual Entity alignments, and is trained on either Phrase- or Sentence-level data. Each of the resulting four adapters (EP/TP/ES/TS) is trained individually to learn information supplemental to that already learned by the MLLM. Adapter outputs are combined by a fusion mechanism (Pfeiffer et al., 2021). Training objectives are similar to those for MLKG embedding (Chen et al., 2017) rather than masked language modeling, which is more efficient on large corpora.
We conduct experiments on various downstream tasks to demonstrate the effectiveness of our approach. For MLKG tasks, following the data collection methods of two existing benchmarks (Chen et al., 2020, 2017), we extend them from 2-5 languages to 22 languages, including two rare languages. Results show that our method obtains comparable performance to existing state-of-the-art baselines on the knowledge graph completion benchmark, and significantly better performance on the entity alignment benchmark. More importantly, we can perform these knowledge graph tasks in low-resource languages for which no knowledge graph exists, and achieve results comparable to those for high-resource languages. Improvements over baseline MLLMs are significant. The results demonstrate that our proposed method integrates the explicit knowledge from MLKGs into MLLMs in a way that can be used across many languages. Our method also noticeably improves existing MLLMs on knowledge-intensive language tasks, such as cross-lingual relation classification, whilst maintaining performance on general language tasks such as named entity recognition (NER) and question answering (QA).

Multilingual Knowledge Integration
In this paper, we fuse knowledge from an MLKG into an MLLM. Following previous works (Wang et al., 2021; Liu et al., 2021), we make use of an entity-tagged corpus of text (called a knowledge integration corpus) for knowledge integration. We formally introduce these concepts below.

MLLM.
A multilingual LM can be thought of as an encoder that can represent text in any language $l$ in a set of languages $L$. Let $V$ denote the shared vocabulary over all languages, and let $t^l \in V$ denote a token in language $l$. A sentence $s^l$ in language $l$ can be denoted as a sequence of tokens: $s^l = (t^l_1, t^l_2, \ldots)$. The output representations of the MLLM for $s^l$ can be denoted by a sequence of vectors, $\mathrm{LM}(s^l) = (h_1, h_2, \ldots)$, one representation per input token. Various tokenization schemes such as WordPiece or BPE might be considered here. We use the average of the token representations as the representation of the sentence: $\mathrm{LM}(s^l) = \mathrm{mean}(h_1, h_2, \ldots)$. Similarly, for a phrase $s^l_{ij}$ (starting at the $i$-th token and ending at the $j$-th token of the sentence), we obtain its contextualized representation as $\mathrm{LM}(s^l_{ij}) = \mathrm{mean}(h_i, h_{i+1}, \ldots, h_j)$.

MLKG.

A multilingual knowledge graph is a graph with entities and knowledge triples in each language $l \in L$. Let $E$ denote the set of entities and $T$ the set of knowledge triples. In an MLKG, an entity indexed $i$ might appear in several languages; let $e^l_i$ denote the label of the $i$-th entity in language $l$. Furthermore, we denote a knowledge triple in the MLKG as $(e^l_i, r^{l''}_k, e^{l'}_j) \in T$, where $r^{l''}_k$ is the $k$-th relation. Note that since entities (as well as relations) may appear in various languages under different labels, knowledge triples can be defined across languages.
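As a concrete illustration of the mean-pooling just described, the following sketch computes a sentence representation and a phrase representation from token vectors. The toy numpy array stands in for real MLLM hidden states, which is an assumption of this example.

```python
import numpy as np

def phrase_representation(token_reps, i, j):
    """Mean-pool contextualized token vectors h_i..h_j (inclusive, 0-indexed),
    i.e. LM(s^l_ij) = mean(h_i, ..., h_j)."""
    return np.mean(token_reps[i:j + 1], axis=0)

# Toy example: a 5-token sentence with 4-dimensional hidden states.
hidden = np.arange(20.0).reshape(5, 4)            # rows = token representations
sentence_rep = np.mean(hidden, axis=0)            # whole-sentence representation
phrase_rep = phrase_representation(hidden, 1, 2)  # phrase spanning tokens 1..2
```

The same pooling is applied whether the span covers a whole sentence, an entity label, or any other phrase.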
Knowledge Integration Corpus. For knowledge integration, besides the MLKG, we make use of a corpus of text $C$ (as shown in the right part of Figure 2). The corpus $C$ comprises two kinds of texts. First, we have a set of texts $C_1$ for cross-lingual entity alignment, which comprises sentences with mentions of entities in the MLKG. For example, in Figure 2, given the sentence "De Botton spent his early years in Zurich", we have the aligned entity Zurich and its cross-lingual labels. The second set of texts, $C_2$, is for knowledge triples, and comprises sentences aligned with knowledge triples in the MLKG. For example, in Figure 2, given the sentence "Zurich is the largest city in Switzerland", we have its aligned knowledge triple (Zurich, is located in, Switzerland).

Adapters and Adapter Fusion
In this section, we describe how we incorporate adapters into language models and how the adapters can be used to enhance MLLMs with different sources of knowledge from knowledge graphs.
Adapter. Adapters have become a popular choice for parameter-efficient finetuning of language models on downstream tasks (Houlsby et al., 2019) due to their flexibility, effectiveness, low cost and scalability (Pfeiffer et al., 2021). Adapters are new modules added between layers of language models; only their parameters are updated during finetuning, while the language model parameters are frozen. An adapter is a bottleneck layer composed of two feed-forward layers with one non-linear activation function. For $h_m$, the hidden representation of token $t^l_i$ at layer $m$, the adapter acts as
$$A(h_m) = W_{up}\,\sigma(W_{down}\, h_m + b_{down}) + b_{up}. \quad (1)$$
Here, $W_{down}$ and $W_{up}$ are weight matrices, which map the hidden representations to a low-dimensional space and then map them back; $b_{down}$ and $b_{up}$ are bias parameters, and $\sigma$ is a non-linear activation function.
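A minimal numpy sketch of this bottleneck layer follows; the dimensions and random weights are illustrative assumptions, and ReLU stands in for the unspecified activation.

```python
import numpy as np

def adapter(h, W_down, b_down, W_up, b_up):
    """Bottleneck adapter: down-project, apply a nonlinearity (ReLU here),
    then up-project back. Shapes: h (d,), W_down (r, d), W_up (d, r), r << d."""
    z = np.maximum(0.0, W_down @ h + b_down)  # down-projection + sigma
    return W_up @ z + b_up                    # up-projection

d, r = 8, 2  # hidden size and bottleneck size (toy values)
rng = np.random.default_rng(0)
h = rng.normal(size=d)
out = adapter(h, rng.normal(size=(r, d)), np.zeros(r),
              rng.normal(size=(d, r)), np.zeros(d))
```

Because $r \ll d$, the adapter adds only $O(2rd)$ parameters per layer, which is what keeps the whole adapter set at a fraction of a percent of the MLLM's size.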
Adapter Fusion. We follow the architecture of Pfeiffer et al. (2021), but instead of using adapters for finetuning, we use them to enhance MLLMs with knowledge. Our approach is similar to Wang et al. (2021), but our adapters supplement and augment the existing implicit knowledge of MLLMs (into the explicit geometric properties of hidden representations), and our approach is more lightweight, with only around 0.5% additional parameters (cf. more than 10% in Wang et al. (2021)).
As shown in Figure 2 (left), still considering the $m$-th layer, the output representations of the feed-forward layer (denoted $h_m$ as in Eq. 1) are input to the adapters. A fusion layer aggregates all adapter outputs $A_n(h_m)$ ($n \in \{1 \ldots N\}$ indexes each adapter) and the un-adapted representation with a multiplicative attention mechanism:
$$h'_m = \sum_{n=0}^{N} \operatorname{softmax}_n\!\big((Q_m h_m) \otimes (K_m A_n(h_m))\big)\; V_m A_n(h_m). \quad (2)$$
Here, $A_0(\cdot)$ is the identity function; $Q_m$, $K_m$, $V_m$ are parameters of the multiplicative attention mechanism; and $\otimes$ is the Hadamard product.
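The following numpy sketch illustrates the fusion step. It assumes the Hadamard-product attention score is reduced to a scalar per adapter (i.e., a dot product of the projected query and key), which is an interpretation of this sketch rather than the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse(h, adapter_outputs, Q, K, V):
    """Sketch of adapter fusion: score each candidate (the un-adapted h, acting
    as A_0, plus each adapter output) against the layer representation h with
    multiplicative attention, then return the attention-weighted sum of
    value-projected candidates."""
    candidates = [h] + adapter_outputs            # A_0 is the identity
    scores = np.array([(Q @ h) @ (K @ a) for a in candidates])
    alpha = softmax(scores)                       # one weight per candidate
    return sum(w * (V @ a) for w, a in zip(alpha, candidates))
```

With identity projections and identical candidates, the weighted sum simply reproduces the input, so the fusion layer can fall back to the un-adapted representation when no adapter helps.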
The additional knowledge to be learned by the adapters comes from knowledge Triples and Entity alignments, each provided in both Phrase and Sentence format (hence $N = 2 \times 2 = 4$). As shown in Figure 2 (center), for a given entity in two languages $l$ and $l'$, Adapter-EP learns to align the two (multilingual) representations of $e^l_i$ and $e^{l'}_i$; e.g., Zurich is aligned with Zurigo. Adapter-TP learns knowledge triples, e.g., predicting Switzerland given the entity and relation (Zurich, is located in, ·). Besides these non-contextualized settings, entities can also be considered within context (the MLLM corpus). Thus, Adapter-ES and Adapter-TS have similar objectives but use contextualized representations from input sentences.

Knowledgeable Adapters
Next, we design objectives with corresponding knowledge integration datasets to train a set of adapters. Similar to MLKG embedding (Chen et al., 2017), we aim to encode knowledge into the geometric properties of the adapted MLLM representations; i.e., the MLLM and adapters collectively act as an MLKG embedding model. Specifically, we use cosine similarity within the contrastive learning loss of InfoNCE (van den Oord et al., 2018):
$$\ell(x, x') = -\log \frac{\exp(\cos(x, x')/\tau)}{\sum_{\hat{x} \in X} \exp(\cos(x, \hat{x})/\tau)},$$
where $X$ is a batch that includes the positive sample $x'$ and a number of negative samples, and $\tau$ is a temperature.

Adapter-EP. We use Wikidata (Vrandečić and Krötzsch, 2014) to enhance MLLMs with the knowledge of cross-lingual entity alignments. Inspired by the idea that languages are aligned implicitly in a universal space in MLLMs (Wu and Dredze, 2019; Wei et al., 2021), we train the aligned entities to have closer representations. Denoting the MLLM with this adapter as $\mathrm{LM}(\cdot)$, the objective used to train EP is
$$\mathcal{L}_{EP} = \sum_{l, l'} \ell\big(\mathrm{LM}(e^l_i),\, \mathrm{LM}(e^{l'}_i)\big),$$
where $\mathrm{LM}(\cdot)$ takes the mean of the token representations as the entity representation vector.
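A minimal numpy sketch of an InfoNCE loss over cosine similarities follows; the temperature value is an illustrative assumption, not taken from the paper.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity of two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def info_nce(x, x_pos, negatives, tau=0.05):
    """InfoNCE with cosine similarity: pull x toward its positive sample
    x_pos and push it away from in-batch negatives. tau is a temperature
    (illustrative value)."""
    sims = np.array([cos(x, x_pos)] + [cos(x, n) for n in negatives]) / tau
    sims -= sims.max()  # for numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())
```

The loss is near zero when the positive pair is much more similar than any negative, and grows as a negative becomes more similar than the positive, which is exactly the geometry the adapters are trained to produce.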
Adapter-TP. We train this adapter using the knowledge triples in Wikidata. Inspired by previous knowledge graph embedding algorithms (e.g., Bordes et al., 2013), for a given fact triple we train the (adapted) object entity embedding to be close to the (adapted) joint embedding of the subject entity and relation. The objective used to train TP, quite different from existing masked language modeling-based ones, is
$$\mathcal{L}_{TP} = \sum \ell\big(\mathrm{LM}([e^l_i;\, r^{l''}_k]),\, \mathrm{LM}(e^{l'}_j)\big),$$
where $[\,;\,]$ denotes text concatenation. Note that we apply code-switching (Liu et al., 2021), so entities and relations can be in different languages. This helps capture knowledge triples for low-resource languages.
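The following sketch shows how a code-switched TP training pair might be constructed: subject, relation, and object labels are each drawn in an independently sampled language before concatenation. The label dictionary is hypothetical example data, not taken from Wikidata.

```python
import random

# Hypothetical multilingual labels for one Wikidata triple.
labels = {
    "Zurich":        {"en": "Zurich", "it": "Zurigo", "de": "Zürich"},
    "is located in": {"en": "is located in", "de": "liegt in"},
    "Switzerland":   {"en": "Switzerland", "it": "Svizzera"},
}

def code_switched_pair(subj, rel, obj, rng=random):
    """Build one TP training pair: the query is the concatenation
    [subject; relation], with subject, relation, and object labels each
    drawn in an independently sampled language (code-switching)."""
    pick = lambda e: rng.choice(sorted(labels[e].values()))
    return f"{pick(subj)} {pick(rel)}", pick(obj)

query, target = code_switched_pair("Zurich", "is located in", "Switzerland")
```

Mixing languages within a single triple means the object can be supervised in a language for which no knowledge graph exists, as long as its label is known.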
Adapter-ES. Entity alignment can also be applied to the contextualized embeddings produced by the MLLM when entities appear within natural language sentences. For this purpose, we use summaries taken from multilingual Wikipedia. Specifically, we first align each entity in Wikidata with its Wikipedia title, and extract sentences from the summary that contain the entity label. As described earlier, we denote this corpus as $C_1$. Thus, similar to Adapter-EP, we train ES by aligning the contextualized entity representations of cross-lingually aligned entities with the objective
$$\mathcal{L}_{ES} = \sum_{l, l'} \ell\big(\mathrm{LM}(s^l_{ij}),\, \mathrm{LM}(s^{l'}_{i'j'})\big),$$
where $s^l_{ij}$ means that we input sentence $s^l$ into the MLLM but keep only the representation of the entity label $e^l$ (spanning the $i$-th to the $j$-th token). As in Figure 2 (right), $s^l$ is "De Botton spent his early years in Zurich", and $s^l_{ij}$ here covers the entity label of $e^l$: Zurich. The difference between this adapter and Adapter-EP is that contextual information is included within the entity representation.
Adapter-TS. Knowledge triples can also be learned with contextualized embeddings. This requires paired data in which triples (entities and relations) are annotated in natural sentences. However, no such multilingual corpus exists. Thus, we use the T-REx-RC dataset (Elsahar et al., 2018), which provides aligned data in English and contains sentence-triple pairs (this is the corpus denoted $C_2$ earlier). The objective used to train TS is
$$\mathcal{L}_{TS} = \sum \ell\big(\mathrm{LM}(s_k \backslash e_j),\, \mathrm{LM}(e_j)\big),$$
where $s_k \backslash e_j$ represents the sentence $s_k$ with the entity label $e_j$ masked. In the example of Figure 2 (right), $s_k \backslash e_j$ is "[MASK] is the largest city in Switzerland", and the aligned triple is (Zurich, is located in, Switzerland). In contrast to Adapter-TP, subject entities and relations occur in natural sentences.
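The masking step for TS can be sketched in a few lines; the mask token string is the conventional BERT-style placeholder and an assumption of this example.

```python
def mask_entity(sentence, entity, mask_token="[MASK]"):
    """Produce s_k \\ e_j: the aligned sentence with the entity's surface
    form replaced by a mask token."""
    return sentence.replace(entity, mask_token)

masked = mask_entity("Zurich is the largest city in Switzerland", "Zurich")
# masked == "[MASK] is the largest city in Switzerland"
```

The masked sentence and the entity label are then encoded separately and pulled together by the contrastive loss.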

Enhancement Workflow
We introduce our overall enhancement workflow, which contains four stages. In the first stage, an MLLM is pretrained on a large amount of data. In the second stage, the MLLM is frozen while each adapter is trained separately on its particular dataset (knowledge integration corpus) to extract additional information. Adapter outputs are aggregated in the fusion layer to enable their collective knowledge to be pooled (Pfeiffer et al., 2021). For example, we lack knowledge graph data for low-resource languages; however, we have two adapters (TP, TS) that learn facts in a particular language (English) and two adapters (EP, ES) that learn cross-lingual alignment. By aggregating them, we can effectively integrate factual knowledge into the representations of low-resource languages. In the third and final stages, all parameters of the MLLM, the adapters, and the fusion module are finetuned on a training set for a specific downstream task, resulting in a specialized model for the task (see Figure 3).

Experiments
This section first introduces the general experimental settings (§5.1). We then show that our adapter set can enhance MLLMs with the knowledge of MLKGs and, in particular, that the enhanced MLLMs generalize well to perform MLKG-related tasks in low-resource languages (§5.2). We also show that enhancing MLLMs with MLKGs improves their performance on knowledge-intensive language tasks (§5.3). We compare our approach with the only existing MLKG integration work (§5.4). Finally, we present an ablation study of the adapter set to demonstrate the effectiveness of each adapter (§5.5).

MLLMs and Integration Corpus
We select three representative MLLMs implemented in Huggingface and train a set of adapters for each: the base version of mBERT (Devlin et al., 2019), and both the base and large versions of XLMR (i.e., XLM-RoBERTa) (Conneau et al., 2020). Since mBERT and XLMR cover different sets of languages, we consider the intersecting 84 languages supported by both models. All adapters are trained with the same hyperparameters (see Appendix A for details). The statistics of the knowledge integration corpora are summarized in Table 1. Next, we introduce the preprocessing steps. The set of entity alignments used to train Adapter-EP is extracted from Wikidata by keeping only entities that have more than 10 multilingual entity labels among the 84 considered languages. Knowledge graph triples are used to train Adapter-TP if both entities are in that entity set (see Table 8 of Appendix B for further details). For the Wikipedia dataset, we use the entities in the Wikidata subset and query their descriptions (the first sentence in the Wikipedia summary that contains the entity label). We remove entities that have fewer than 2 multilingual descriptions, which results in 1.93 million multilingual sentences to train Adapter-ES. For Adapter-TS, we use the monolingual dataset T-REx-RC (Elsahar et al., 2018), which has 0.97 million alignments between knowledge triples and sentences in English.
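The filtering rules above can be sketched as follows. The record structures and entity ids are hypothetical stand-ins for the Wikidata/Wikipedia data, not the paper's actual pipeline.

```python
# Hypothetical records: entity id -> {language: label} and
# entity id -> {language: description sentence}.
entity_labels = {
    "Q72":  {f"lang{i}": f"label{i}" for i in range(12)},  # 12 labels -> kept
    "Q999": {"en": "rare thing"},                          # 1 label  -> dropped
}
entity_descs = {
    "Q72":  {"en": "Zurich is a city.", "de": "Zürich ist eine Stadt."},
    "Q999": {"en": "A rare thing."},
}

# Keep entities with more than 10 multilingual labels (Adapter-EP data).
ep_entities = {e for e, ls in entity_labels.items() if len(ls) > 10}

# Keep a triple only if both entities survived the filter (Adapter-TP data).
def keep_triple(subj, rel, obj):
    return subj in ep_entities and obj in ep_entities

# Keep entities with at least 2 multilingual descriptions (Adapter-ES data).
es_entities = {e for e, ds in entity_descs.items()
               if e in ep_entities and len(ds) >= 2}
```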

MLKG Benchmarks
We show that our knowledge adapter set can enhance MLLM performance at MLKG-related tasks. We select two popular MLKG benchmarks for evaluation: DBP5L (Chen et al., 2020) for the knowledge graph completion task, and WK3L (Chen et al., 2017) for the cross-lingual entity alignment task. These tasks require the MLLM to identify the correct entity, which is done by maximizing the similarity of output representations. To evaluate MLLMs in a more comprehensive setting, we extend their test sets (from 2-5 languages) to 22 languages following their data construction settings, selecting the languages that contain the most entity labels. Statistics are in Figure 4. We split these languages into three categories to show the generalizability of enhanced MLLMs. Sup.: supervised languages, which are used to train adapters and for finetuning; ZS-In: zero-shot languages, which are used for adapter training but not for finetuning; ZS-Un.: unseen languages, which are unseen in both adapter training and finetuning.

Knowledge Graph Completion
The knowledge graph completion task tests whether the model can find missing triples in different languages. Specifically, for each test triple of a given language, the model is asked to retrieve the correct object entity from the entity set of that language, given the subject entity and relation.
Settings. We follow the settings of DBP5L. Specifically, we use the training set of knowledge triples of the five languages (i.e., the Sup. set) to finetune the model, and then use the provided test sets, as well as our extended test sets, to evaluate it. For comparison, we select two typical knowledge graph embedding methods, TransE (Bordes et al., 2013) and DistMult (Yang et al., 2015), as baselines, and compare the performance of MLLMs and MLLMs-A Fusion, enhanced with the knowledge adapters and fusion mechanism (see Appendix A for further implementation details).
Table 2: Results on the knowledge graph completion task; Hit@1 and MRR are reported (higher is better), and the number of languages of each type is given in parentheses (Sup. 5, ZS-In 15, ZS-Un. 2). For zero-shot and unseen languages, using our adapters significantly improves the performance of LMs on knowledge graph completion.

Results. Results are summarized in Table 2 (with further detail in Table 9 of Appendix C). We report both the Hit@1 score and Mean Reciprocal Rank (MRR) for evaluation. We find that enhancing MLLMs with adapters can improve performance for the supervised languages, which is comparable to existing knowledge graph embedding methods. For the zero-shot and unseen languages, existing (transductive) knowledge graph embedding methods cannot perform the task, since entities must be in the training set. Here we find that MLLMs still perform comparably to the supervised languages, and the enhanced MLLMs-A Fusion models outperform MLLMs on zero-shot languages by significant margins. This indicates that the adapters allow factual knowledge to be transferred across languages.
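The two retrieval metrics used throughout can be computed as in this sketch; the score matrix is toy data, and real evaluation ranks all candidate entities per query.

```python
import numpy as np

def hit1_and_mrr(scores, gold_indices):
    """Rank candidate entities by score for each query. Hit@1 is the fraction
    of queries where the gold entity ranks first; MRR is the mean of 1/rank
    of the gold entity."""
    hits, rr = [], []
    for row, gold in zip(scores, gold_indices):
        rank = 1 + np.sum(row > row[gold])  # 1-indexed rank of the gold entity
        hits.append(rank == 1)
        rr.append(1.0 / rank)
    return np.mean(hits), np.mean(rr)

scores = np.array([[0.9, 0.1, 0.3],   # gold = 0 -> rank 1
                   [0.2, 0.8, 0.5]])  # gold = 2 -> rank 2
hit1, mrr = hit1_and_mrr(scores, [0, 2])
# hit1 == 0.5, mrr == (1 + 1/2) / 2 == 0.75
```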

Entity Alignment
The entity alignment task is to align entities in different languages.Specifically, given a target language and an entity in a source language (typically English), the model should retrieve that entity from the set of all entities in the target language.
Settings. We follow the settings of WK3L. Specifically, we train models using the entity alignments English to German and English to French. We test models on those two supervised languages, as well as on our extended 17 zero-shot languages and 2 unseen languages. We select one typical MLKG embedding method, MTransE (Chen et al., 2017), and a state-of-the-art method, JEANS (Chen et al., 2021), as baselines (see Appendix A for details).

Results. The results are summarized in Table 3 (with further detail in Table 10 of Appendix C). Performance is again evaluated by Hits@1 and MRR. As previously, the (transductive) baselines cannot be extended to languages not in the training set.
For the supervised languages, we find that existing MLLMs often outperform the classic baselines. However, the performance of MLLMs on zero-shot languages is noticeably worse. This indicates that existing MLLMs do not transfer entity alignment knowledge well to other languages. However, MLLMs enhanced with the adapter set, MLLMs-A Fusion, generally achieve the best performance, often with significant improvements. The results indicate that our adapter set successfully enhances MLLMs with multilingual knowledge.

MLLM Benchmarks
The above results show that our adapter set can enhance MLLMs to perform well on MLKG-related tasks in both previously seen and unseen languages. Here, we show that our knowledge adapter set allows MLKGs to enhance MLLM performance on language tasks. In particular, the enhanced MLLMs achieve improved performance on knowledge-intensive language tasks while maintaining performance on other general language tasks.

Cross-Lingual Relation Classification
We select a popular relation classification benchmark, RELX (Köksal and Özgür, 2020), for which MLLMs must extract relations from sentences in a cross-lingual setting. Models are finetuned on a high-resource corpus and tested on low-resource languages in a zero-shot setting. For this task, MLLMs are required to transfer knowledge across languages, as well as capture factual knowledge for relation classification.
Settings. Our training data is only in English, and the test data contains 4 more (zero-shot) languages. We follow the exact setting of Köksal and Özgür (2020) and use the same provided set of hyperparameters to evaluate all MLLMs. We also report the performance of the enhanced BERT model of Köksal and Özgür (2020), called Matching the Multilingual Blanks (MTMB), as a baseline.

Results. (See Table 11 of Appendix D for further detail.) We find that for supervised languages, mBERT-A Fusion outperforms both the base version of mBERT and the knowledge-enhanced version (MTMB), whereas XLMR with adapters obtains comparable performance. As for zero-shot languages, MLLMs-A Fusion achieve consistent and significant improvements over baselines. This demonstrates that our knowledge adapter set can enhance MLLMs for knowledge-intensive tasks.

General Language Tasks
Besides the above knowledge-intensive tasks, we show that our knowledge adapter set maintains the performance of MLLMs on general multilingual language tasks. We select the popular multilingual benchmark XTREME (Hu et al., 2020) to evaluate the enhanced MLLMs, which are finetuned on English training data and tested on many other languages. We select cross-lingual NER and QA as two general tasks and follow the settings of the XTREME benchmark.

NER. We select the WikiAnn dataset (Pan et al., 2017) (under the setting of XTREME) for the NER task, where 40 languages are included for evaluation. The results are summarized in Table 5, and detailed results can be found in Table 12 in Appendix D. We find that MLLMs with our adapter set perform as well as the baseline MLLMs, with slight improvements on the zero-shot languages.

Question Answering. Following the setting of XTREME, we finetune the models on the SQuAD (Rajpurkar et al., 2016) dataset (in English), and evaluate on the test sets of XQuAD (Artetxe et al., 2020), involving 11 languages. Detailed results are in Table 13 in Appendix D. We find that mBERT-A Fusion maintains the performance of its original version, while XLMR large -A Fusion is boosted slightly. In general, MLLMs-A Fusion with our adapters obtain comparable or slightly better performance across different language tasks. For tasks requiring rich knowledge about triples and entity alignments, our adapter set can indeed enhance the MLLMs.

Comparison with Existing Methods
We compare our approach with the only existing related work (Liu et al., 2021) that attempts to integrate MLKGs into MLLMs. However, it only considers a relatively small set of 10 languages and finetunes the entire MLLM with a joint objective, which is computationally expensive. In contrast, as shown below, our knowledge adapter set can achieve better performance at a much lower cost.
Settings. We follow the settings and metrics of Liu et al. (2021), which are slightly different from the original settings of the RELX and WikiAnn (XTREME) datasets. We only report performance for the MLLMs implemented in their study.

Results. In Table 7, for the relation classification task, where Liu et al. (2021) outperforms the MLLM baseline, our method achieves significant further improvement. For NER, only 10 popular zero-shot languages (instead of the 40 languages in XTREME) are selected for their knowledge integration and evaluation. Although our method generally achieves better performance for XLMR large -A Fusion (40 languages) in Table 5, it performs slightly worse than the original version here (10 popular languages). However, the performance of Liu et al. (2021) is worse still. For QA, similar performance is achieved by all three MLLMs, although our enhanced MLLM slightly outperforms the other methods.

Figure 5: Ablation study results on the two MLKG-related tasks and the relation classification task. Adapters that integrate factual knowledge into MLLMs perform better than others on the MLKG completion task, while adapters that integrate cross-lingual alignments outperform others on the entity alignment task. For the relation classification task, sentence-level adapters perform better. Our full adapter set achieves roughly the best performance under all conditions.

Ablation Study
We conduct ablation studies to understand our knowledge adapters and show that they work as expected. We also compare against a large adapter (A Large) with a comparable total number of parameters (including fusion layers). The large adapter is trained with the same settings as our adapter set and has one set of parameters that integrates all knowledge types at once. As previously, we finetune the original mBERT, mBERT-A Large, and mBERT with our adapters on each downstream task.
In Figure 5, for the knowledge graph completion task (left), mBERT-A TP and mBERT-A TS perform better than their entity-based counterparts. While mBERT-A Large also performs well, mBERT-A Fusion outperforms it significantly. For the entity alignment task (center), the situation is reversed: better performance is achieved by mBERT-A EP and mBERT-A ES. Our mBERT-A Fusion also achieves comparable performance, which is much better than mBERT-A Large with shared parameters. As for the relation classification task (right), sentence-level adapters outperform phrase-level adapters, which is intuitive since the task requires sentence-level context. Fusing all four adapters (i.e., mBERT-A Fusion) gives the best performance, while mBERT-A Large performs worse than single smaller adapters. In summary, with our method we learn different types of knowledge in separate adapters, which can be fused in different proportions according to the downstream task at hand, typically performing better and more consistently than any single adapter-enhanced MLLM.
Other Related Work

MLLM for MLKG. Several works use the implicit knowledge in language models to improve knowledge graph-related tasks (Yao et al., 2019; Niu et al., 2022). However, these approaches are for monolingual knowledge triples and cannot easily incorporate cross-lingual entity alignment. Huang et al. (2022) use MLLMs for knowledge graph completion, but the language models only encode entities, and the task itself is achieved by graph neural networks. Previous MLKG embedding methods consider entity alignment (Chen et al., 2017, 2020), but are designed for existing MLKGs and cannot generalize to other, e.g., low-resource, languages without the multilingual knowledge in MLLMs (Pires et al., 2019; Wu and Dredze, 2019).

MLKG for MLLM. Liu et al. (2021) propose to synthesize code-switched sentences to address the problem, but the resulting MLKG-enhanced MLLMs achieve minimal improvement on language understanding tasks, as shown in our experiments, and the approach does not benefit the MLKG side. In summary, our work is the first to combine MLKG and MLLM, showing that combining them using our light knowledge adapter set can effectively improve downstream task performance on both sides.

Conclusion
In this paper, we propose an approach to enhance MLLMs with MLKGs using a set of knowledge adapters, where explicit knowledge from MLKGs is integrated into the implicit knowledge learned by MLLMs. In experiments, we show that enhanced MLLMs can conduct MLKG-related tasks and achieve better performance on knowledge-intensive tasks, especially for low-resource languages where knowledge graphs are not available.

Limitations
We point out some limitations of our work. First, even if the adapter set can enhance MLLMs to perform well on various downstream tasks, it is not suitable for tasks in the fully zero-shot setting (without any training data), since the fusion module has to be tuned to suit the task. Second, as shown in our results, the fusion module cannot always outperform all single adapters; for some tasks, a better fusion mechanism could be proposed.

Reproducibility Statement
We elaborate the experiment settings and hyperparameters in the paper and in Appendix A. We have published our preprocessed multilingual knowledge integration data, extended MLKG-related task datasets, and our code.

A Implementation Details
We implement the adapters using the AdapterHub library, inserting adapters into all Transformer layers of the MLLMs.
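The paper does not include code; as a rough sketch under the usual bottleneck-adapter design (down-projection, non-linearity, up-projection, residual connection), a single adapter's forward pass can be written in pure Python as follows. All names here are illustrative, not from the paper.

```python
import math

def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: down-project the hidden state to a small
    bottleneck dimension r, apply a non-linearity, up-project back to
    dimension d, and add a residual connection.

    h: hidden state (list of d floats)
    W_down: d x r matrix (list of d rows of length r)
    W_up:   r x d matrix (list of r rows of length d)
    """
    d, r = len(h), len(W_down[0])
    # down-projection to the bottleneck dimension r
    z = [sum(h[i] * W_down[i][j] for i in range(d)) for j in range(r)]
    # tanh-approximate GELU non-linearity
    z = [0.5 * x * (1.0 + math.tanh(math.sqrt(2 / math.pi)
                                    * (x + 0.044715 * x ** 3))) for x in z]
    # up-projection back to dimension d
    u = [sum(z[j] * W_up[j][i] for j in range(r)) for i in range(d)]
    # residual connection: the adapter output is added to the input
    return [hi + ui for hi, ui in zip(h, u)]
```

With `W_up` initialized to zeros (a common near-identity initialization), the adapter initially passes the hidden state through unchanged, which is why inserting adapters does not disturb the frozen MLLM at the start of training.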
Adapters in Knowledge Enhancement. To train these knowledgeable adapters, we use 8 GPUs (Tesla V100) with a batch size of 128. The learning rate is set to 1e-4, and we use the Adam optimizer with 1e4 warm-up steps. We train Adapter-EP by randomly sampling entity alignments in different languages; the number of sampled alignments is around 94.2 million. The number of training epochs for Adapter-TP, Adapter-ES, and Adapter-TS is set to 10. For the InfoNCE loss, we use in-batch negative sampling. Since we train adapters with a sampling strategy and use the contrastive learning loss instead of masked language modeling, training one adapter takes only a few hours (1-10 hours), and the whole enhancement procedure takes around half a day.
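To make the training objective concrete, the following pure-Python sketch shows InfoNCE with in-batch negatives: each query's positive key is at the same batch index, and every other key in the batch serves as a negative. The temperature value and function names are illustrative assumptions, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(queries, keys, temperature=0.07):
    """InfoNCE with in-batch negatives: queries[i] matches keys[i];
    all other keys in the batch act as negatives for query i."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        # cross-entropy of the softmax over the batch, target index i
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(queries)
```

Correctly aligned query/key pairs yield a much lower loss than mismatched ones, which is what drives, e.g., aligned entity names in different languages toward nearby representations.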
Adapters in Knowledge Graph Completion. For MLLM-based methods, we set all hyperparameters to the same values to ensure a fair comparison. We use the average of the word(-piece) representations as the entity embedding. Specifically, we train MLLMs as well as MLLMs-AF (including adapters and the fusion mechanism) to embed entities, such that the output representations of the object entities are close to the output representations of the context (subject entities with relations). Similarity is measured by cosine. During training, the learning rate is set to 1e-8, the number of epochs to 10, and the batch size to 8. We train the MLLMs using a contrastive learning loss similar to that of the knowledge integration process.
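The embedding and scoring steps above can be sketched as follows: the entity embedding is the mean of its word-piece output vectors, and candidate object entities are ranked by cosine similarity to the encoded (subject, relation) context, from which Hit@1/MRR are computed. This is a minimal illustration with hypothetical function names, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def entity_embedding(wordpiece_vectors):
    """Entity embedding = mean of the entity's word-piece output vectors."""
    n, d = len(wordpiece_vectors), len(wordpiece_vectors[0])
    return [sum(vec[i] for vec in wordpiece_vectors) / n for i in range(d)]

def rank_candidates(context_vec, candidate_entity_vecs):
    """Rank candidate object entities by cosine similarity to the encoded
    (subject, relation) context; returns candidate indices, best first."""
    scores = [cosine(context_vec, c) for c in candidate_entity_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

During training, the contrastive loss pulls the context representation toward the gold object entity's embedding, so at test time the gold entity should rank near the top of this list.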
Adapters in Entity Alignment. Similarly, we set all hyperparameters to the same values for all MLLM-based methods. Specifically, we set the number of epochs to 1, since overfitting occurs easily with training data covering only 2 languages. Other hyperparameters and settings are the same as those for the MLKG completion task.
Adapters in Language Tasks. We evaluate our adapter set with MLLMs on the XTREME benchmark, using the same evaluation settings as the benchmark.

B Knowledge Integration Dataset Statistics
The detailed statistics can be found in Table 8 below.

C MLKG Dataset Statistics and Detailed Results
The detailed statistics and results can be found in Table 9 and Table 10.

D MLLM Dataset Statistics and Detailed Results
The detailed statistics and results can be found in the tables below.

Table 9: Performance of various models on the MLKG completion task (Hit@1/MRR) across different languages. We also report the number of entities in each test set to indicate the general difficulty of the completion task in that language.

Figure 1: Combining MLLMs and MLKGs benefits both: MLKGs suffer from incompleteness and are limited to a few languages, which MLLMs can supplement; MLLMs lack entity alignment and firm facts, which MLKGs can provide.

Figure 2: The architecture of MLLMs with adapters and their roles. We enhance multilingual and factual knowledge at the phrase and sentence levels using different knowledge integration corpora.

Figure 3: Four stages of using the knowledge adapter set in MLLMs. Dashed outlines indicate that the parameters are frozen.

Figure 4: Statistics of the test set sizes for the MLKG completion and entity alignment tasks. The extended test sets for zero-shot languages have a comparable number of samples to the original test sets.

Table 1: Statistics of the knowledge integration corpora for training adapters. Align.: all aligned multilingual entities; Relat.: all relations in triples; Sent.: sentences.

Table 3: Results on multilingual entity alignment tasks. Our adapters significantly enhance MLLMs' performance on entity alignment, also outperforming existing MLKG embedding baselines.

Table 4: Results on the multilingual relation classification task (F1 score). Our adapters effectively enhance MLLMs on knowledge-intensive downstream tasks, especially for zero-shot languages.

Table 5: Results on the multilingual NER task (F1 score). Our adapters enhance MLLMs' NER performance for zero-shot languages.

Table 6: Results on the multilingual QA tasks. Using our adapters does not reduce performance on language modeling tasks, and even yields marginal improvements.

Table 8: Distribution of Wikidata for adapter training. We report the full name and ISO code for each language. For entities, relations, and triples, we report the ratio of labels in that specific language to the total number.