Enhancing Multilingual Language Model with Massive Multilingual Knowledge Triples

Knowledge-enhanced language representation learning has shown promising results across various knowledge-intensive NLP tasks. However, prior methods are limited in the efficient utilization of multilingual knowledge graph (KG) data for language model (LM) pretraining. They often train LMs with KGs in indirect ways, relying on extra entity/relation embeddings to facilitate knowledge injection. In this work, we explore methods to make better use of the multilingual annotation and language-agnostic properties of KG triples, and present novel knowledge-based multilingual language models (KMLMs) trained directly on the knowledge triples. We first generate a large amount of multilingual synthetic sentences using the Wikidata KG triples. Then, based on the intra- and inter-sentence structures of the generated data, we design pretraining tasks to enable the LMs to not only memorize the factual knowledge but also learn useful logical patterns. Our pretrained KMLMs demonstrate significant performance improvements on a wide range of knowledge-intensive cross-lingual tasks, including named entity recognition (NER), factual knowledge retrieval, relation classification, and a newly designed logical reasoning task.


Introduction
Pretrained Language Models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have achieved superior performance on a wide range of NLP tasks. Existing PLMs usually learn universal language representations from general-purpose large-scale corpora but do not concentrate on capturing the world's factual knowledge. It has been shown that knowledge graphs (KGs), such as Wikidata (Vrandečić and Krötzsch, 2014) and Freebase (Bollacker et al., 2008), can provide rich factual information for better language understanding. Many studies have demonstrated the effectiveness of incorporating such factual knowledge into monolingual PLMs (Peters et al., 2019; Zhang et al., 2019; Liu et al., 2020a; Poerner et al., 2020; Wang et al., 2021a). Following this, a few recent attempts have been made to enhance multilingual PLMs with Wikipedia or KG triples (Calixto et al., 2021; Ri et al., 2022; Jiang et al., 2022). However, due to the structural difference between KGs and texts, existing KG-based pretraining often relies on extra relation/entity embeddings or additional KG encoders for knowledge enhancement. These extra embeddings/components may add significantly more parameters, which in turn increases inference complexity, or may cause inconsistency between pretraining and downstream tasks. For example, mLUKE (Ri et al., 2022) has to enumerate all possible entity spans for NER to minimize the inconsistency caused by its entity and entity position embeddings. Other methods (Liu et al., 2020a; Jiang et al., 2022) also require KG triples to be combined with relevant natural sentences as model input during training or inference.
In this work, we propose KMLM, a novel Knowledge-based Multilingual Language Model pretrained on massive multilingual KG triples. Unlike prior knowledge-enhanced models (Zhang et al., 2019; Peters et al., 2019; Liu et al., 2020a; Wang et al., 2021a), our model requires neither a separate encoder to encode entities/relations, nor heterogeneous information fusion to fuse multiple types of embeddings (e.g., entities from KGs and words from sentences). The key idea of our method is to convert the structured knowledge from KGs to sequential data which can be directly fed as input to the LM during pretraining. Specifically, we generate three types of training data: the parallel knowledge data, the code-switched knowledge data and the reasoning-based data. The first two are obtained by generating parallel or code-switched sentences from triples of Wikidata (Vrandečić and Krötzsch, 2014), a collaboratively edited multilingual KG. The reasoning-based data, containing rich logical patterns, is constructed by converting cycles from Wikidata into word sequences in different languages. We then design pretraining tasks that operate on the parallel/code-switched data to memorize the factual knowledge across languages, and on the reasoning-based data to learn the logical patterns.
Compared to existing knowledge-enhanced pretraining methods (Zhang et al., 2019; Liu et al., 2020a; Peters et al., 2019; Jiang et al., 2022), KMLM has the following key advantages. (1) KMLM is explicitly trained to derive new knowledge through logical reasoning. Therefore, in addition to memorizing knowledge facts, it also learns the logical patterns from the data. (2) KMLM does not require a separate encoder for KG encoding, and eliminates relation/entity embeddings, which enables KMLM to be trained on a larger set of entities and relations without adding extra parameters. The token embeddings are enhanced directly with knowledge-related training data. (3) KMLM does not rely on any entity linker to link the text to the corresponding KG entities, as done in existing methods (Zhang et al., 2019; Peters et al., 2019; Poerner et al., 2020). This enables KMLM to utilize more KG triples even if they are not linked to any text data, and avoids noise caused by incorrect links. (4) KMLM keeps the model structure of the multilingual PLM without introducing any additional component during either the training or inference stage. This makes the training much easier, and the trained model is directly applicable to downstream NLP tasks.
We evaluate KMLM on a wide range of knowledge-intensive cross-lingual tasks, including NER, factual knowledge retrieval, relation classification, and logical reasoning, a novel task we design to test the reasoning capability of the models. Our KMLM achieves consistent and significant improvements on all knowledge-intensive tasks, while not sacrificing performance on general NLP tasks.

Related Work
Knowledge-enhanced language modeling aims to incorporate knowledge, concepts and relations into PLMs (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020), which has proved beneficial to language understanding (Talmor et al., 2020a).
The existing approaches mainly focus on monolingual PLMs and can be roughly divided into two lines: implicit knowledge modeling and explicit knowledge injection. Previous attempts at implicit knowledge modeling usually consist of entity-level masked language modeling (Sun et al., 2019; Liu et al., 2020a), entity-based replacement prediction (Xiong et al., 2020), knowledge embedding loss as regularization (Wang et al., 2021b) and universal knowledge-text prediction (Sun et al., 2021). In contrast to implicit knowledge modeling, the methods of explicit knowledge injection separately maintain a group of parameters for representing structural knowledge. Such methods (Zhang et al., 2019) usually require a heterogeneous information fusion component to fuse multiple types of embeddings obtained from the text and KGs. Zhang et al. (2019) and Poerner et al. (2020) employ an external entity linker to discover the entities in the text and perform feature interaction between the token embeddings and entity embeddings during the encoding phase of a transformer model. Peters et al. (2019) use pre-computed knowledge embeddings as supporting features for training an internal entity linker. Wang et al. (2021a) insert an adapter component (Houlsby et al., 2019; He et al., 2021) into each transformer layer to store the learned factual knowledge.
Extending knowledge-based pretraining methods to the multilingual setting has received increasing interest recently. Zhou et al. (2022b) propose an auto-regressive model trained on knowledge triples for multilingual KG completion. Calixto et al. (2021) and Ri et al. (2022) attempt to improve multilingual entity representations via Wikipedia hyperlink prediction; however, their methods add a large number of parameters due to the reliance on extra entity embeddings, mLUKE BASE (Ri et al., 2022) being one example. Meanwhile, much implicit knowledge is derivable from the existing facts. Such reasoning capability is regarded as a crucial part of building consistent and controllable knowledge-based models (Talmor et al., 2020b). In this paper, our explored methods for multilingual knowledge-enhanced pretraining boost the capability of implicit knowledge reasoning, together with the purpose of consolidating knowledge modeling and multilingual pretraining (Mulcaire et al., 2019; Conneau et al., 2020; Liu et al., 2022).

Framework
In this section, we describe the proposed framework for knowledge-based multilingual language model (KMLM) pretraining. We first describe the process of generating knowledge-intensive multilingual training data, followed by the pretraining tasks that train the language models to memorize factual knowledge and learn logical patterns from the generated data.

Knowledge Intensive Training Data
In addition to the large-scale plain text corpus that is commonly used for language model pretraining, we also generate a large amount of knowledge-intensive training data from Wikidata (Vrandečić and Krötzsch, 2014), a collaboratively edited, publicly accessible knowledge base. Wikidata is composed of massive amounts of KG triples (h, r, t), where h and t are the head and tail entities respectively, and r is the relation type. As shown in Table 1, most of the entities, as well as the relations in Wikidata, are annotated in multiple languages. In each language, many aliases are also given, though some of them are used infrequently.

Code-Switched Synthetic Sentences
Training language models on high-quality code-switched sentences is one of the most intuitive ways to learn language-agnostic representations (Winata et al., 2019), where the translations of words/phrases can be treated in a similar way as their aliases. Code-mixing techniques have also proved helpful for improving cross-lingual transfer performance in many NLP tasks (Qin et al., 2020; Santy et al., 2021). Therefore, we propose a novel method to generate code-switched synthetic sentences using the multilingual KG triples. See Fig. 1 for some generated examples.
For each triple (h, r, t) in Wikidata, we use h_{l,0} to denote the default label of h in language l. For the entity Q1420 in Table 1, h_{en,0} is "motor car" and h_{es,0} is "automóvil". h_{l,i} denotes the aliases when the integer i > 0. We define r_{l,i} and t_{l,i} in the same way for the relation and the tail entity, respectively. Since English is resource-rich and often treated as the source language for cross-lingual transfer, we only consider language pairs of {(en, l')} for code switching, where l' is an arbitrary non-English language. With such a design, English can also work as a bridge for cross-lingual transfer between a pair of non-English languages.
Specifically, the code-switched sentences for (h, r, t) can be generated in 4 steps: 1) select a language pair (en, l'); 2) find the English default labels (h_{en,0}, r_{en,0}, t_{en,0}); 3) for each item in the triple, uniformly sample a value v ∈ {true, false}; if v is true and the item has a translation (i.e., a default label) in l', replace the item with the translation in l'; 4) generate the sequence "h [mask] r [mask] t." by inserting two mask tokens. The alias-replaced sentences can be generated in a similar way, except that we randomly sample aliases in the desired language to replace the default labels in steps 2 and 3.

Parallel Synthetic Sentences

Parallel data has proven useful for learning cross-lingual transfer (Aharoni et al., 2019; Conneau and Lample, 2019; Chi et al., 2021). However, it is expensive to obtain a large amount of parallel data for LM pretraining. We propose a method to generate a large amount of knowledge-intensive parallel synthetic sentences, with a minor modification of the method for generating code-switched sentences described above. For each triple (h, r, t) extracted from Wikidata, the corresponding synthetic sentences in different languages can be generated by first finding the default labels (h_{l,0}, r_{l,0}, t_{l,0}) for each language l, and then inserting mask tokens to generate sequences of the form "h [mask] r [mask] t.". Fig. 1 shows an example. More sentences can be generated by replacing the default labels with their aliases.
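A minimal sketch of the generation procedure described above. The `triple_labels` dictionary layout is a hypothetical simplification of the Wikidata label data; the functions themselves follow steps 1-4 and the parallel-data variant (the uniform true/false sampling in step 3 corresponds to a 0.5 switching probability per item):

```python
import random

def synthetic_sentence(head, rel, tail):
    """Step 4: insert two [mask] tokens as placeholders for the
    unknown linking words between the triple items."""
    return f"{head} [mask] {rel} [mask] {tail}."

def code_switched_sentence(triple_labels, tgt, rng=random):
    """Steps 1-3 for one triple. triple_labels maps each of 'h', 'r', 't'
    to a {language: default_label} dict. Each item is switched from its
    English default label to the target language with probability 0.5,
    provided a translation exists."""
    parts = []
    for key in ("h", "r", "t"):
        labels = triple_labels[key]
        use_tgt = rng.random() < 0.5 and tgt in labels
        parts.append(labels[tgt] if use_tgt else labels["en"])
    return synthetic_sentence(*parts)

def parallel_sentences(triple_labels, langs):
    """Fully monolingual variant: one sentence per language,
    built from that language's default labels."""
    return {l: synthetic_sentence(*(triple_labels[k][l] for k in ("h", "r", "t")))
            for l in langs}

# A toy triple (hypothetical labels, illustrative only):
triple = {"h": {"en": "motor car", "fr": "automobile"},
          "r": {"en": "designed to carry", "fr": "conçu pour transporter"},
          "t": {"en": "passengers", "fr": "passagers"}}
```

Sampling aliases instead of default labels (for the alias-replaced variant) would only change where the label strings are looked up.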

Reasoning-Based Training Data
The capability of logical reasoning allows humans to solve complex problems with limited information. However, this ability has not received much attention in previous LM pretraining methods. In KGs, we can use nodes to represent entities, and edges between nodes to represent their relations. In order to train the model to learn logical patterns, we generate a large amount of reasoning-based training data by finding cycles in the Wikidata KG. As shown with an example in Fig. 2(a), cycles of length 3 can be viewed as the basic components of more complex logical reasoning processes. We train language models to learn the entity-relation co-occurrence patterns so as to infer the best candidate relations for incomplete cycles, i.e., to derive implicit information from the given context.
Similar to the structure of the parallel/code-switched synthetic sentences described above, the cycle in Fig. 2(a) is composed of 3 triples, and hence can be converted to 3 synthetic sentences (the first example in Fig. 4). To increase the difficulty, we also extract cycles of length 4 to generate the reasoning-oriented training data. We treat Wikidata as an undirected graph when extracting cycles. Given an entity, the length-3 cycles containing this entity can be easily extracted by first finding all the neighbouring entities, and then iterating through the pairs of neighbouring entities to check whether they are also connected. The length-4 cycles with an additional diagonal edge connecting two neighbours can be extracted with a few extra steps. Assuming we have identified a length-3 cycle containing entity A and its two neighbouring entities B and C, we can iterate through the neighbours of B (excluding A and C) to check whether each is also connected to C. We remove duplicate cycles during data generation.
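The extraction procedure above can be sketched with plain adjacency sets; the graph representation and deduplication keys here are our illustrative choices, not the paper's actual implementation:

```python
from itertools import combinations

def length3_cycles(adj):
    """adj: the KG viewed as an undirected graph, {node: set(neighbours)}.
    For each node, check every pair of its neighbours for a connecting
    edge; sorting the triple deduplicates cycles found from different
    starting nodes."""
    cycles = set()
    for a in adj:
        for b, c in combinations(sorted(adj[a]), 2):
            if c in adj.get(b, set()):
                cycles.add(tuple(sorted((a, b, c))))
    return cycles

def length4_cycles_with_diagonal(adj):
    """Extend a length-3 cycle: for each edge (b, c) of the triangle,
    any further common neighbour d of b and c yields the length-4 cycle
    a-b-d-c-a with diagonal b-c. The (node set, diagonal) pair is the
    deduplication key."""
    cycles = set()
    for tri in length3_cycles(adj):
        for b, c in combinations(tri, 2):       # candidate diagonal edge
            (a,) = set(tri) - {b, c}
            for d in (adj[b] & adj[c]) - set(tri):
                cycles.add((tuple(sorted((a, b, c, d))), (b, c)))
    return cycles
```

On a toy graph with a triangle 1-2-3 plus node 4 linked to 2 and 3, this finds two triangles and one length-4 cycle with diagonal (2, 3).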

Pretraining Tasks

Multilingual Knowledge Oriented Pretraining
In the generated code-switched and parallel synthetic sentences, the "[mask]" tokens added between entities and relations denote the linking words. For example, the first mask token in "motor car [mask] designed to carry [mask] passager." may denote "is", while the second one may denote "certains" (the French word "certains" means "some" or "certain"). Since the ground truth of such masked linking words is not known, we do not compute the loss for the corresponding predictions. Instead, we randomly mask the remaining tokens in the parallel/code-switched synthetic sentence, and compute the cross-entropy loss over these masked entity and relation tokens (Fig. 3). We use L_K to denote this cross-entropy loss for knowledge oriented pretraining. Note that our models are not trained on sentence pairs, unlike the Translation LM (TLM) loss (Conneau and Lample, 2019), when utilizing the parallel or code-switched data. Instead, we shuffle the data and feed one sentence into the model each time (as shown in Fig. 3), which makes our model inputs more consistent with those of the downstream tasks.
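A sketch of this masking scheme, assuming a label of -100 marks positions excluded from the cross-entropy loss (the convention used by common MLM implementations); the standard 80/10/10 token-replacement details of MLM are omitted for brevity:

```python
import random

MASK, IGNORE = "[mask]", -100  # -100: position ignored by cross-entropy

def mask_for_knowledge_loss(tokens, mlm_prob=0.15, rng=random):
    """tokens: a tokenized synthetic sentence that already contains the
    two inserted [mask] linking tokens. Those have no ground truth, so
    their label is IGNORE; the remaining entity/relation tokens are
    masked with probability mlm_prob and become prediction targets."""
    inputs, labels = [], []
    for tok in tokens:
        if tok == MASK:                  # inserted linking word: no loss
            inputs.append(MASK)
            labels.append(IGNORE)
        elif rng.random() < mlm_prob:    # knowledge token: mask and predict
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels
```

The key point is that the two inserted linking masks never receive a label, so the loss L_K is computed only over the masked entity and relation tokens.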
Logical Reasoning Oriented Pretraining We design tasks to train the model to learn logical reasoning patterns from the synthetic sentences generated from the length-3 and length-4 cycles. As can be seen in Fig. 4, both the relation prediction and entity prediction problems are cast as masked language modeling. For the length-3 cycles, each entity appears exactly twice in every training sample. Formulating the task as a masked entity prediction problem may lead to shortcut learning (Geirhos et al., 2020) by simply counting the appearances of the entities. Therefore, we only mask one random relation in each sample for model training, and let the model learn to predict the masked relation tokens based on the context.
Two types of tasks are designed to train the model to learn reasoning with the length-4 cycles: 1) For 80% of the time, we train the model to predict randomly masked relations and entities. We first mask one random relation. To increase the difficulty, we also mask one or two randomly selected entities with equal chance. The lower half of Fig. 4 shows an example where one relation and one entity are masked. 2) For the remaining 20% of the time, we randomly mask a whole sentence to let the model learn to derive new knowledge from the remaining context. To provide some hints on the expected new knowledge, we keep the relation of the selected sentence unmasked, i.e., we only mask its two entities. The loss L_L for logical reasoning oriented pretraining is also computed as the cross-entropy loss over the masked tokens. Note that masked entity prediction is not always nontrivial in this task. For example, when we mask exactly one entity and the entity E only appears once in the masked sample, it is easy to guess that E is the masked one. In Fig. 4, a concrete example is masking the first appearance of "Raj Kapoor" in the original sentence of the length-4 cycle. We do not deliberately avoid such cases, since they may help introduce more diversity to the training data.
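The two sampling schemes, together with the length-3 rule above, can be sketched as a masking plan; the sample layout (one relation and two entity slots per sentence) is a simplified assumption:

```python
import random

def reasoning_mask_plan(n_triples, cycle_len, rng=random):
    """Decide which items of a cycle-derived sample to mask.
    length-3: mask exactly one random relation (masking entities would
    invite counting shortcuts, since every entity appears twice);
    length-4, 80% of the time: one relation plus one or two entities;
    length-4, 20% of the time: both entities of one sentence, keeping
    its relation as a hint for the expected new knowledge."""
    if cycle_len == 3 or rng.random() < 0.8:
        plan = {("rel", rng.randrange(n_triples))}
        if cycle_len == 4:
            k = rng.choice((1, 2))  # one or two entities, equal chance
            slots = [(s, e) for s in range(n_triples) for e in (0, 1)]
            plan |= {("ent",) + slot for slot in rng.sample(slots, k)}
        return plan
    sent = rng.randrange(n_triples)          # mask a whole sentence's entities
    return {("ent", sent, 0), ("ent", sent, 1)}
```

Each returned item names a relation or an (sentence, entity-slot) position to replace with mask tokens before computing L_L.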
Loss Function In addition to the pretraining tasks designed above, we also train the model on the plain text data with the original masked language modeling loss L_MLM used in previous work (Devlin et al., 2019; Conneau et al., 2020). Therefore, the final loss can be computed as L = L_MLM + α (L_K + L_L), where α is a hyper-parameter to adjust the weights of the original MLM loss and the losses for modeling the multilingual knowledge and logical reasoning.
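As a sketch, the weighting can be expressed directly (assuming a single scalar α applied to both knowledge losses, consistent with the "knowledge task loss weight" tuned in the appendix):

```python
def total_pretraining_loss(l_mlm, l_k, l_l, alpha):
    """Combined objective: the plain-text MLM loss plus the
    knowledge-oriented (L_K) and reasoning-oriented (L_L) losses,
    down-weighted by the hyper-parameter alpha."""
    return l_mlm + alpha * (l_k + l_l)
```

With alpha = 0.1, for example, a batch with l_mlm = 1.0, l_k = 2.0 and l_l = 3.0 contributes a total loss of 1.5.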

Experiments
We first describe the pretraining details of our KMLMs. Then we verify their effectiveness on the knowledge-intensive tasks. Finally, we examine their performance on general cross-lingual tasks. In all of the tasks except X-FACTR (Jiang et al., 2020), the PLMs are fine-tuned on the English training set and then evaluated on the target language test sets. The evaluation results are averaged over 3 runs with different random seeds. X-FACTR does not require fine-tuning, so the PLMs are directly evaluated using the official code. The results of the baseline models are reproduced in the same environment.

Pretraining Details
Our proposed framework can be conveniently implemented on top of existing transformer encoder based models like mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) without any modification to the model structure. Therefore, instead of pretraining the model from scratch, it is more time- and cost-efficient to initialize the model with the checkpoints of existing pretrained models. We build our knowledge-intensive training data in 10 languages: English, Vietnamese, Dutch, German, French, Italian, Spanish, Japanese, Korean and Chinese. We only use the 5M entities and 822 relations filtered by Wang et al. (2021b), and generate 250M code-switched synthetic sentences, 190M parallel synthetic sentences and 100M reasoning-based samples following the steps in §3.1. In addition, 260M sentences are sampled from the CC100 corpus (Wenzek et al., 2020) for the 10 languages. Our models KMLM-XLM-R BASE and KMLM-XLM-R LARGE are initialized with XLM-R BASE and XLM-R LARGE, respectively. Then we continue to pretrain these models with the proposed tasks (§3.2). KMLM CS, KMLM Parallel and KMLM Mix denote the models trained on the code-switched data, the parallel data and the concatenation of the two, respectively. The reasoning-based data is used in all three models, and ablation studies are presented in §4.5 to verify the effectiveness of the logical reasoning task. Previous studies showed that the original mBERT model outperforms XLM-R on the X-FACTR (Jiang et al., 2020) and RELX (Köksal and Özgür, 2020) tasks, so we also initialize KMLM-mBERT BASE with mBERT BASE and train it on the Wikipedia corpus for a more faithful comparison. We find the KMLM CS and KMLM Mix models initialized with the XLM-R BASE checkpoint outperform the corresponding KMLM Parallel model in most of the tasks, so we only train KMLM CS and KMLM Mix when comparing with XLM-R LARGE and mBERT BASE. See Appendix §A.1 for more pretraining details.
Cross-lingual NER

Our knowledge-intensive synthetic sentences may also help improve entity representation more efficiently. We conduct experiments on the CoNLL02/03 (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) and WikiAnn (Pan et al., 2017) NER data to verify the effectiveness of our framework. The same transformer-based NER model and hyperparameters as Hu et al. (2020) are used in our experiments.
The results on CoNLL02/03 data are presented in Table 2. Compared with XLM-R BASE, all of our corresponding models improve the average F1 on target languages by more than 2.13 points. Especially on German, all of our models demonstrate at least 4.65 absolute gains. Moreover, all of our models also outperform XLM-K (Jiang et al., 2022), a knowledge-enhanced multilingual LM proposed in a recent work. Even when compared with XLM-R LARGE, our large model still improves the average performance by 1.56. The WikiAnn dataset allows us to evaluate our models on all of the 10 languages involved in pretraining. Jiang et al. (2022) did not report XLM-K results on WikiAnn, so we evaluate their pretrained model on WikiAnn and the following knowledge-intensive tasks for better comparison. On WikiAnn, our base and large models outperform the corresponding XLM-R models by 2.64 and 1.60 respectively. From both datasets we observe KMLM CS -XLM-R BASE performs better than KMLM Parallel -XLM-R BASE, which shows the efficacy of the code-switching technique for large-scale cross-lingual pretraining. Moreover, both KMLM Mix -XLM-R BASE and KMLM Mix -XLM-R LARGE (i.e., the models pretrained on the mixed code-switched and parallel data) surpass all of the compared models in terms of F1, suggesting that the mixed data can help further generalize the representations across languages.

Factual Knowledge Retrieval
X-FACTR (Jiang et al., 2020) is a benchmark for assessing the capability of multilingual pretrained language models to capture factual knowledge. It provides multilingual cloze-style question templates, and the underlying idea is to query knowledge from the models by filling in the blanks of these question templates. From Jiang et al. (2020), we notice the performance of XLM-R BASE is much worse than mBERT BASE (see Table 4). This is probably because mBERT BASE has a much smaller vocabulary than XLM-R (120k vs 250k) and uses the Wikipedia corpus instead of general data crawled from the Internet. So we also pretrain KMLM CS -mBERT BASE for a more comprehensive comparison. As we can see from Table 4, all of the models trained with our framework demonstrate significant improvements in factual knowledge retrieval accuracy, which again indicates the benefits of our method for factual knowledge acquisition. Our model still demonstrates better performance than XLM-K, though the latter is also trained on Wikipedia.

Cross-lingual Relation Classification
RELX (Köksal and Özgür, 2020) is developed by selecting a subset of KBP-37 (Zhang and Wang, 2015), a commonly-used English relation classification dataset, and by generating human translations and annotations in French, German, Spanish, and Turkish. We evaluate the same set of models as §4.3, since mBERT BASE also outperforms XLM-R BASE on this task. The evaluation script provided by Köksal and Özgür (2020) is used to finetune the pretrained models on the English training set and evaluate on the target language test sets.

Figure 5: An example (English) extracted from our cross-lingual logical reasoning (XLR) dataset. Choices: part of, said to be the same as, located in time zone, instance of, has part, followed by. Answer: said to be the same as.
As the results in Table 5 show, all of our models achieve consistently higher accuracy than XLM-K and XLM-R.

Cross-lingual Logical Reasoning
Dataset To verify the effectiveness of our logical reasoning oriented pretraining tasks (§3.2) in an intrinsic way, we propose a cross-lingual logical reasoning (XLR) task in the form of multiple-choice questions. An example of such a reasoning question is given in Fig. 5. The dataset is constructed using the cycles extracted from Wikidata. We manually annotate 1,050 samples in English and then translate them to the other 9 non-English languages (see §4.1) to build the multilingual test sets. The 3k training samples and 1k dev samples in English are generated and cleaned automatically.
The cycles used to build the test set are removed from the pretraining data, so our PLMs have never seen them beforehand. The detailed dataset construction steps can be found in Appendix §A.2.

Results
We modify the multiple choice evaluation script implemented by Hugging Face for this experiment. The models are finetuned on the English training set, and evaluated on the test sets in different target languages. Results are presented in Table 6. All of our models outperform the baselines significantly. Unlike on the previous tasks, where KMLM Mix often performs the best, here KMLM CS shows slightly higher accuracy than KMLM Mix. We also conduct an ablation study to verify the effectiveness of our proposed logical reasoning oriented pretraining task. We pretrain the non-reasoning models, KMLM CS-NR -XLM-R BASE and KMLM Mix-NR -XLM-R BASE, on the same data as KMLM CS -XLM-R BASE and KMLM Mix -XLM-R BASE, but without the logical reasoning tasks, i.e., with the MLM task only on the reasoning-based data. As presented in Fig. 6, the non-reasoning models also perform better than XLM-R BASE, which shows the usefulness of our reasoning-based data. We also observe that KMLM CS -XLM-R BASE and KMLM Mix -XLM-R BASE, i.e., the models pretrained with the logical reasoning tasks, consistently perform the best, which shows our proposed task helps models learn logical patterns more efficiently.

General Cross-lingual Tasks
Recall that our models are directly trained on the structured KG data. Though we attempt to minimize its difference from natural sentences when designing the pretraining tasks, it is unknown how the difference affects cross-lingual transfer performance on general NLP tasks. Therefore, we also evaluate our models on the part-of-speech (POS) tagging, question answering and classification tasks prepared by XTREME (Hu et al., 2020). Experimental results are shown in Table 7. Note that many of the languages covered by these tasks are not in our pretraining data, but we include all their results when computing the average performance. Overall, the performance of our models is comparable with the baselines on all of the tasks except POS, possibly because the POS task is more sensitive to changes in the training sentence structures. Though some of our models perform slightly better than the baselines on XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020), we find the performance gain of our models on TyDiQA (Clark et al., 2020), a more challenging QA task with less lexical overlap between question-answer pairs, is more obvious. From these results we can see that, while our KMLMs achieve consistent improvements on the knowledge-intensive tasks, as shown by the experimental results in the previous subsections, they do not sacrifice performance on the general NLP tasks.

Conclusions
In this paper, we have presented a novel framework for knowledge-based multilingual language pretraining. Our approach first creates a synthetic multilingual corpus from an existing KG, and then tailor-makes two pretraining tasks: multilingual knowledge oriented pretraining and logical reasoning oriented pretraining. These multilingual pretraining tasks not only facilitate factual knowledge memorization but also boost the capability of implicit knowledge modeling. We evaluate the proposed framework on a series of knowledge-intensive cross-lingual tasks, and the comparison results consistently demonstrate its effectiveness.

Limitations
The KMLM models proposed in this work are pretrained on 10 languages in our experiments, so it is unclear whether scaling up to more languages would further improve performance on the downstream tasks. Due to the high computation cost, we leave this for future work. Despite the promising performance improvement on the knowledge-intensive tasks, we also observe that KMLM does not perform well on the part-of-speech tagging tasks (§4.6). This is possibly caused by the large amount of synthetic sentences used in pretraining, where mask tokens are used to replace the linking words. In the future, we will explore efficient ways to leverage pretrained denoising models (Liu et al., 2020b) or graph-to-sequence models (Ammanabrolu and Riedl, 2021) to convert the synthetic sentences or knowledge triples to a form closer to natural sentences.

Ethical Impact
Neural models have achieved significant success in many NLP tasks, especially for popular languages like English and Spanish. However, neural models are data hungry, which poses challenges for applying them to low-resource languages with limited NLP resources. In this work, we propose methods to inject knowledge into multilingual pretrained language models and enhance their logical reasoning ability. Through extensive experiments, our methods have been proven effective in a wide range of knowledge-intensive multilingual NLP tasks. Therefore, our proposed method could help overcome the resource barrier and enable the advances in NLP to benefit a wider population.

A Appendix
Hyper-Parameters The hyper-parameters used for language model pretraining are presented in Table 9. After pretraining, we finetune the models on the plain text data with a max sequence length of 512 for another 600 steps. Due to the high computation cost of LM pretraining, we do not run many experiments for hyper-parameter search. Instead, the learning rate, batch size and MLM probability are set according to those used in previous LM pretraining studies. To determine the knowledge task loss weight α for large-scale pretraining, we compare α ∈ {0.5, 0.3, 0.1} using base models pretrained on a smaller dataset. Each base model takes about 30 days to train with 8 V100 GPUs.

A.2 Cross-lingual Logical Reasoning Task
We propose a cross-lingual logical reasoning (XLR) task in the form of multiple-choice questions to verify the effectiveness of our logical reasoning oriented pretraining tasks in an intrinsic way. An example of such a reasoning question is given in Fig. 5. The dataset is constructed using the length-3 and length-4 cycles extracted from Wikidata. For each cycle, we pick a triple to create the question and answer. The question is created by asking the relation between a pair of entities in that triple.

Figure 1 :
Figure 1: Examples of the en-fr code-switched and parallel synthetic sentences. The words replaced with translations or aliases are marked with bold font and underline, respectively.

Figure 2 :
Figure 2: Examples of extracted cycles of length 3 and 4.

Figure 4 :
Figure 4: Examples of the masked training samples for logical reasoning. The relations are highlighted in orange. The masked entity and relation tokens are highlighted in lime green.
Figure 5 :
(Poland, located in time zone, UTC+01:00) (Poland, located in time zone, Central European Time) Question: What is the relation between UTC+01:00 and Central European Time?

Figure 6 :
Figure 6: Comparison of the models trained with and without logical reasoning task.

Table 2 :
Zero-shot cross-lingual NER F1 on the CoNLL02/03 datasets. The average results of non-English languages are reported in column avg tgt.

Table 7 :
Zero-shot cross-lingual POS, QA and classification results. Note that the performance of the languages not appearing in our pretraining data is also counted.
†The results are from (Hu et al., 2020).

A.1 Language Model Pretraining Details

Training Data The statistics of the data used for pretraining are shown in Table 8.

reasoning-based training samples from length-3 cycles: 24,142,272
reasoning-based training samples from length-4 cycles: 73,881,422
sampled CC100 sentences (KMLM-XLM-R only): 260,000,000
sampled Wikipedia sentences (KMLM-mBERT only): 153,011,930

Table 8 :
Statistics of the data used for pretraining.