KBioXLM: A Knowledge-anchored Biomedical Multilingual Pretrained Language Model

Most biomedical pretrained language models are monolingual and cannot handle the growing cross-lingual requirements. The scarcity of non-English domain corpora, not to mention parallel data, poses a significant hurdle in training multilingual biomedical models. Since knowledge forms the core of domain-specific corpora and can be translated into various languages accurately, we propose a model called KBioXLM, which transforms the multilingual pretrained model XLM-R into the biomedical domain using a knowledge-anchored approach. We construct a biomedical multilingual corpus by incorporating knowledge alignments at three granularities (entity, fact, and passage levels) into monolingual corpora. We then design three corresponding training tasks (entity masking, relation masking, and passage relation prediction) and continue training on top of the XLM-R model to enhance its cross-lingual ability in the domain. To validate the effectiveness of our model, we translate the English benchmarks of multiple tasks into Chinese. Experimental results demonstrate that our model significantly outperforms monolingual and multilingual pretrained models in cross-lingual zero-shot and few-shot scenarios, achieving improvements of more than 10 points. Our code is publicly available at https://github.com/ngwlh-gl/KBioXLM.


Introduction
In recent years, biomedical pretrained language models (PLMs) (Lee et al., 2020; Gu et al., 2020; Yasunaga et al., 2022) have made remarkable progress on various natural language processing (NLP) tasks. However, biomedical annotation data (Li et al., 2016; Dogan et al., 2014; Du et al., 2019; Collier and Kim, 2004; Gurulingappa et al., 2012) are predominantly in English, so non-English biomedical natural language processing tasks highlight the pressing need for cross-lingual capability. Yet most biomedical PLMs are monolingual and cannot address cross-lingual requirements, while the performance of existing general-domain multilingual models falls far behind expectations (Devlin et al., 2018; Conneau et al., 2019; Chi et al., 2021). A multilingual biomedical model can effectively tackle both cross-lingual and monolingual tasks. Therefore, the development of a multilingual biomedical pretrained model is urgently needed.
Unlike in general domains, non-English biomedical corpora are scarce and parallel biomedical corpora are scarcer still, which presents a significant challenge for training multilingual biomedical models. In general domains, back translation (BT) (Sennrich et al., 2015) is commonly used for data augmentation. However, our experiments (see "XLM-R+BT" in Table 5) reveal that, owing to the quality of domain translation, back translation does not significantly improve multilingual biomedical models' cross-lingual understanding ability. Unlike full text, the entities and relations that constitute domain knowledge can be translated accurately. Since domain knowledge is considered the most crucial content (Michalopoulos et al., 2020; He et al., 2020), we propose a novel model called KBioXLM that brings multilingual PLMs such as XLM-R (Conneau et al., 2019) into the biomedical domain through a knowledge-anchored approach. Concretely, we incorporate knowledge alignments at three levels of granularity, entity, fact, and passage, to create a text-aligned biomedical multilingual corpus. We first translate the UMLS knowledge base into Chinese to obtain bilingual aligned entities and relations, named UMLS_trans. At the entity level, we employ code-switching (Yang et al., 2020) to replace entities in sentences with their expressions in the other language according to UMLS_trans. At the fact level, we translate entity pairs with relationships into the other language using UMLS_trans and concatenate them after the original monolingual sentence. At the passage level, we collect paired biomedical articles in English and Chinese from Wikipedia to form a coarse-grained aligned corpus. An example of the construction process is shown in Figure 1. Furthermore, we design three training tasks specifically tailored for the knowledge-aligned data: entity masking, relation masking, and passage relation prediction. Note that, to equip the model with preliminary biomedical comprehension ability, we first pretrain XLM-R on monolingual biomedical corpora in both Chinese and English. By then continuing training on top of this model with the three tasks, our approach can handle cross-lingual biomedical tasks effectively.
To compensate for the lack of cross-lingual evaluation datasets, we translate and proofread four English biomedical datasets into Chinese, covering three different tasks: named entity recognition (NER), relation extraction (RE), and document classification (DC). The experimental results demonstrate that our model consistently outperforms both monolingual and other multilingual pretrained models in cross-lingual zero-shot and few-shot scenarios. On average, our model achieves an impressive improvement of approximately 10 points over general multilingual models. Meanwhile, our method maintains comparable monolingual ability on common benchmarks in both the Chinese and English biomedical domains.
To summarize, our contributions can be outlined as follows:
• We innovatively propose a knowledge-anchored method for the multilingual biomedical scenario: achieving text alignment through knowledge alignment.
• We design corresponding training tasks for multi-granularity knowledge-aligned texts and, to our knowledge, develop the first multilingual biomedical pretrained language model.
• We translate and proofread four biomedical datasets to fill the evaluation gap in cross-lingual settings.
Related Work

Biomedical Pretrained Language Models
In recent years, the advent of pre-trained language models like BERT (Devlin et al., 2018) has inspired a series of biomedical PLMs, such as BioBERT (Lee et al., 2020), PubMedBERT (Gu et al., 2020), and BioLinkBERT (Yasunaga et al., 2022), which are pretrained on large-scale biomedical corpora and achieve strong results on biomedical NLP tasks.

Multi-lingual Pretrained Language Models
Multilingual pre-trained models represent multiple languages in a shared semantic vector space and enable effective cross-lingual processing. Notable examples include mBERT (Devlin et al., 2018), which utilizes Wikipedia data and employs Multilingual Masked Language Modeling (MMLM) during training. XLM (Lample and Conneau, 2019) focuses on learning cross-lingual understanding capability from parallel corpora. ALM (Yang et al., 2020) adopts a code-switching approach for sentences in different languages instead of simple concatenation. XLM-R (Conneau et al., 2019), based on RoBERTa (Liu et al., 2019), significantly expands the training data and covers one hundred languages. To further enhance the cross-lingual transferability of pre-trained models, InfoXLM (Chi et al., 2021) introduces a new pretraining task based on contrastive learning. However, as of now, there is no specialized multilingual model specifically tailored for the biomedical domain.

Method
This section presents the proposed knowledge-anchored multilingual biomedical PLM, KBioXLM. Considering the scarcity of biomedical parallel corpora, we utilize language-agnostic knowledge to facilitate text alignment at three levels of granularity: entity, fact, and passage. We then train KBioXLM on these knowledge-anchored aligned data on top of the multilingual XLM-R model. Figure 2 provides an overview of the training details.

Knowledge Base
We obtain entity- and fact-level knowledge from UMLS and retrieve aligned biomedical articles from Wikipedia.

UMLS. UMLS (Unified Medical Language System) is widely acknowledged as the largest and most comprehensive knowledge base in the biomedical domain, encompassing a vast collection of biomedical entities and their intricate relationships. While UMLS offers a broad range of entity types in English and adheres to rigorous classification criteria, it lacks annotated data in Chinese. Recognizing that knowledge often has distinct descriptions across languages, we undertake a meticulous manual translation process. A total of 982 relationship types are manually translated, and we leverage both Google Translator and ChatGPT to convert 880k English entities into their corresponding Chinese counterparts. Manual verification is conducted on divergent translation results to ensure accuracy and consistency. This Chinese-translated version of UMLS, called UMLS_trans, enables seamless cross-lingual access to biomedical knowledge.

Wikipedia. Wikipedia contains vast knowledge across various disciplines and is available in multiple languages. The website offers complete data downloads, providing detailed information about each page, category membership links, and interlanguage links. We first collect relevant items in medicine, pharmaceuticals, and biology by following the category membership links. We then carefully filter out irrelevant Chinese articles using a downstream biomedical NER model trained on the CMeEE-V2 (Zan et al., 2021) dataset, which covers nine major categories of medical entities, including 504 common pediatric diseases, 7,085 body parts, 12,907 clinical manifestations, and 4,354 medical procedures.

Entity-level Knowledge

Entity-level alignment is illustrated in orange in Figure 1. We extract entities from UMLS_trans that appear in monolingual sentences, together with their counterparts in the other language. To ensure balance, we randomly substitute 10 biomedical entities with their respective counterparts in each sample, keeping an equal number of replaced entities in both languages.
We design a specific pretraining task to facilitate the exchange of entity-level information between Chinese and English. Given a sentence $X = \{x_1, x_2, \cdots, x_n\}$ containing Chinese and English entities, we randomly mask 15% of the tokens representing entities, and the task of KBioXLM is to reconstruct these masked entities. The loss function is defined as follows:

$$\mathcal{L}_{\text{entity}} = -\sum_{i} \log P(e_i \mid X),$$

where $e_i$ represents the masked Chinese or English entity.
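As a concrete illustration of the entity-level construction and masking described above, the following sketch shows one way the code-switching substitution and entity masking could be implemented. The toy dictionary `umls_trans`, the function names, and the masking heuristic are illustrative assumptions, not the released KBioXLM code.

```python
import random
import re

# Toy bilingual entity dictionary in the spirit of UMLS_trans (assumed format).
umls_trans = {"opium poppy": "罂粟花", "morphine": "吗啡"}

def code_switch(sentence, ent2zh, max_swaps=10):
    """Replace up to `max_swaps` English entities with their Chinese counterparts."""
    hits = [e for e in ent2zh if re.search(re.escape(e), sentence, re.IGNORECASE)]
    random.shuffle(hits)
    for ent in hits[:max_swaps]:
        sentence = re.sub(re.escape(ent), ent2zh[ent], sentence, count=1, flags=re.IGNORECASE)
    return sentence

def mask_entity_tokens(tokens, entity_positions, mask_token="<mask>", ratio=0.15):
    """Mask roughly 15% of the tokens that belong to entity mentions."""
    n_mask = max(1, int(len(entity_positions) * ratio))
    for i in random.sample(sorted(entity_positions), min(n_mask, len(entity_positions))):
        tokens[i] = mask_token
    return tokens

sent = "Laudanum contains approximately 10% opium poppy, equivalent to 1% morphine."
print(code_switch(sent, umls_trans))
# e.g. "Laudanum contains approximately 10% 罂粟花, equivalent to 1% 吗啡."
```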

Fact-level Knowledge
Fact refers to a relationship between a subject, predicate, and object in the knowledge base. Our assumption is that if both entities mentioned in a fact are present together in a text, then the text should contain this fact. We employ fact matching to create bilingual corpora and develop a pretraining task called relation masking. Constructing the fact-level bilingual corpora involves the following steps:
• Retrieve potential relationships between paired entities in the monolingual corpus from UMLS_trans.
• Express these facts in the other language and concatenate them after the original monolingual sentence.
An example is depicted in green color in Figure 1.
Given the input text "Laudanum contains approximately 10% opium poppy ...", we extract the fact "(opium poppy, associated with, morphine)" and its corresponding Chinese translation "(罂粟花, 有关联, 吗啡)". The final text becomes "Laudanum contains approximately 10% opium poppy ... [SEP] 罂粟花有关联吗啡", in which we mask the relationship "有关联". The fact-level task is to reconstruct the masked relationships, with the following loss function for relation masking:

$$\mathcal{L}_{\text{fact}} = -\sum_{i} \log P(f_i \mid X),$$

where $f_i$ represents the masked relationship, which can be either a Chinese or an English expression.
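A minimal sketch of this fact-matching step under assumed data structures is given below: `facts_zh` maps an English (head, relation, tail) triple from UMLS_trans to its Chinese rendering, and the [SEP]/mask conventions mirror the example above. All names here are hypothetical.

```python
MASK = "<mask>"

# Assumed structure: English triple -> Chinese rendering of (head, relation, tail).
facts_zh = {
    ("opium poppy", "associated with", "morphine"): ("罂粟花", "有关联", "吗啡"),
}

def attach_fact(sentence, facts_zh, mask_relation=True):
    """If both entities of a fact appear in the sentence, append the fact in the
    other language after [SEP]; optionally replace the relation with mask tokens."""
    for (head, _rel, tail), (head_zh, rel_zh, tail_zh) in facts_zh.items():
        if head in sentence and tail in sentence:
            rel_out = MASK * len(rel_zh) if mask_relation else rel_zh
            return f"{sentence} [SEP] {head_zh}{rel_out}{tail_zh}"
    return sentence

text = "Laudanum contains approximately 10% opium poppy, equivalent to 1% morphine."
print(attach_fact(text, facts_zh))
# Laudanum ... morphine. [SEP] 罂粟花<mask><mask><mask>吗啡
```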

Passage-level Knowledge
Some biomedical NLP tasks are performed at the document level, so we broaden the scope of cross-lingual knowledge to the passage level, illustrated in blue in Figure 1. Specifically, we use paired biomedical English and Chinese Wikipedia articles to create an aligned corpus at the passage level, which serves as the foundation for a pretraining task that predicts passage relationships. Inspired by Yasunaga et al. (2022), we construct the passage-level corpus with the following strategies:
• We randomly select one Chinese segment and one English segment and label the pair as "positive" if the segments belong to paired articles and as "random" otherwise.
• We pair consecutive segments in the same language to create contextualized data pairs and label them as "context".
Ultimately, we gather 30k segment pairs, with approximately equal quantities of the three pair types. The pretraining task used to inject bilingual passage knowledge into the model is passage relationship prediction, with the following loss function:

$$\mathcal{L}_{\text{passage}} = -\log P(c \mid X_{\text{pair}}),$$

where $c \in \{\text{positive}, \text{random}, \text{context}\}$ and $X_{\text{pair}}$ is the hidden state with global contextual information. The token counts of the three-level biomedical multilingual corpus are documented in Table 1. KBioXLM is trained with an equal proportion of monolingual data and the three-level bilingual corpora constructed above, to preserve the model's monolingual understanding. The overall pretraining loss for KBioXLM combines the three objectives:

$$\mathcal{L} = \mathcal{L}_{\text{entity}} + \mathcal{L}_{\text{fact}} + \mathcal{L}_{\text{passage}}.$$

By integrating these three multi-task learning objectives, KBioXLM exhibits improved cross-lingual understanding capability.
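The segment-pair labeling can be sketched as below, assuming `paired_articles` is a list of aligned (English segments, Chinese segments) article pairs already split into segments; the sampling procedure is an illustrative reading of the strategies above rather than the exact released pipeline.

```python
import random

def build_passage_pairs(paired_articles, n_pairs=30_000):
    """Return (segment_a, segment_b, label) triples with label in
    {"positive", "random", "context"}."""
    pairs = []
    while len(pairs) < n_pairs:
        label = random.choice(["positive", "random", "context"])
        en_segs, zh_segs = random.choice(paired_articles)
        if label == "positive":
            # English and Chinese segments drawn from the same aligned article pair.
            pairs.append((random.choice(en_segs), random.choice(zh_segs), "positive"))
        elif label == "random":
            # Chinese segment drawn from an unrelated article.
            _, other_zh = random.choice(paired_articles)
            if other_zh is zh_segs:
                continue
            pairs.append((random.choice(en_segs), random.choice(other_zh), "random"))
        else:
            # Two consecutive segments from the same article and language.
            segs = random.choice([en_segs, zh_segs])
            if len(segs) < 2:
                continue
            i = random.randrange(len(segs) - 1)
            pairs.append((segs[i], segs[i + 1], "context"))
    return pairs
```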

Backbone Multilingual Model
Our approach is flexible and can be applied to any multilingual pretrained language model. In this study, we adopt XLM-R as our foundation, leveraging its strong cross-lingual understanding across various downstream tasks. To adapt XLM-R to the biomedical domain, we conduct additional pretraining on a substantial amount of monolingual biomedical data from CNKI (2.15 billion tokens) and PubMed (2.92 billion tokens). The pretraining strategy includes whole word masking (Cui et al., 2021) and biomedical entity masking, for which we match the biomedical Chinese and English entities in UMLS_trans against the monolingual corpora in both languages. In other words, entity-level knowledge is already incorporated at this pretraining stage to enhance performance.
For specific details regarding the pretraining process, please refer to Section A.1. For convenience, we refer to this model as XLM-R+Pretraining.
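The whole-word and entity masking used in this additional pretraining stage can be sketched as follows; grouping SentencePiece pieces by the "▁" word-start marker and the greedy span selection are assumptions about one reasonable implementation, not a description of the released code.

```python
import random

def whole_word_spans(pieces):
    """Group XLM-R SentencePiece pieces into whole-word spans (a new word starts with '▁')."""
    spans, start = [], 0
    for i in range(1, len(pieces)):
        if pieces[i].startswith("▁"):
            spans.append((start, i))
            start = i
    spans.append((start, len(pieces)))
    return spans

def choose_mask_spans(pieces, entity_spans, ratio=0.15):
    """Greedily pick entity spans and whole-word spans until ~15% of pieces are covered."""
    budget = max(1, int(len(pieces) * ratio))
    candidates = list(entity_spans) + whole_word_spans(pieces)
    random.shuffle(candidates)
    chosen, used = [], 0
    for start, end in candidates:
        if used + (end - start) <= budget:
            chosen.append((start, end))
            used += end - start
    return chosen
```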

Figure 3 (excerpt): the prompt given to ChatGPT, "Let's think step by step. I will provide you with a marked English text and its translated Chinese text. You need to detect and correct missing marks <num> or </num> and their orders.", followed by a positive example input ("EN: Severe <1> bleomycin </1> <0> lung toxicity </0> : reversal with high dose corticosteroids ."), ChatGPT's corrected output, and a human evaluation and revision step.
Figure 3: This figure illustrates how the sentence "After taking Metoclopramide, she developed dyskinesia and a period of unresponsiveness." is first marked for translation and subsequently revised through a collaborative effort involving ChatGPT and manual editing.
Due to the lack of biomedical cross-lingual understanding benchmarks, we translate several well-known biomedical datasets from English into Chinese by combining translation tools, ChatGPT, and human intervention. As shown in Table 2, these datasets are BC5CDR (Li et al., 2016) and ADE (Gurulingappa et al., 2012) for NER, GAD for RE, and HoC (Hanahan and Weinberg, 2000) for DC. ADE and HoC belong to the BLURB benchmark. Please refer to Section A.2 for details about these four tasks.
The process, depicted in Figure 3, can be summarized as follows. For the document classification dataset, we perform a straightforward translation with Google Translator. To preserve the alignment of English entities and their relationships in the NER and RE datasets after translation, we wrap each entity with markers "<num>Entity</num>" based on the gold NER labels in the original sentences, where "num" indicates the entity's order in the original sentence. During post-processing, we match the translated entities and their relationships back to their original counterparts using the numerical information in the markers. Despite the proficiency of Google Translator, it has some limitations, such as dropped entity words, incomplete translation of the English text, and semantic gaps. To address these issues, we design a prompt that asks ChatGPT to identify inconsistencies in meaning or incomplete translations between the original sentences and their translations. Professional annotators then manually proofread the sentences with identified defects, and a sample of error-free sentences is randomly checked. This rigorous process guarantees accurate and consistent translation results, with proper alignment of entities and relationships. Ultimately, through this meticulous data processing, we obtain high-quality Chinese biomedical datasets with accurately aligned entities and relationships.
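As a rough sketch of the marker scheme, the snippet below wraps gold entity spans before translation and reads the markers back afterwards; the tag syntax follows the "<num>Entity</num>" convention above, while the function names and the span-recovery logic are hypothetical.

```python
import re

def add_markers(text, entities):
    """Wrap each gold entity span (start, end) with numbered markers before translation."""
    marked, prev = [], 0
    for idx, (start, end) in enumerate(sorted(entities)):
        marked.append(text[prev:start])
        marked.append(f"<{idx}>{text[start:end]}</{idx}>")
        prev = end
    marked.append(text[prev:])
    return "".join(marked)

def recover_entities(translated):
    """After translation, read the markers back to recover each entity's text and order."""
    pattern = re.compile(r"<(\d+)>(.*?)</\1>")
    spans = {int(m.group(1)): m.group(2) for m in pattern.finditer(translated)}
    clean = pattern.sub(lambda m: m.group(2), translated)
    return clean, spans

src = "Severe bleomycin lung toxicity: reversal with high dose corticosteroids."
marked = add_markers(src, [(7, 16), (17, 30)])
print(marked)
# Severe <0>bleomycin</0> <1>lung toxicity</1>: reversal with high dose corticosteroids.
```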

Pretraining
The pretraining of KBioXLM uses a learning rate of 5e-5, a batch size of 128, and a total of 50,000 training steps, with 5,000 warm-up steps at the beginning of training. The model is pretrained on 4 NVIDIA RTX A5000 GPUs for 14 hours.
Table 3: Characteristics of our baselines. "✓" indicates that the model has this feature, while "✗" means the opposite. "CN" means Chinese, "EN" means English, and "Multi" means Multilingual.

Finetuning
For the monolingual and cross-lingual downstream tasks listed in Table 2, the named entity recognition model consists of the language model encoder plus a conditional random field layer (Lafferty et al., 2001). A simple sequence classification head is used for the relation extraction and document classification tasks. F1 is used as the evaluation metric for all tasks.
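For the relation extraction and document classification tasks, fine-tuning thus reduces to standard sequence classification on top of the encoder; a minimal Hugging Face Transformers sketch is shown below. The checkpoint name is a placeholder (the public XLM-R base model rather than the actual KBioXLM weights), and the NER setup would instead feed token-level encoder outputs into a CRF layer.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; substitute the KBioXLM weights for the actual experiments.
checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(
    ["Laudanum contains approximately 10% opium poppy."],
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([1])
outputs = model(**batch, labels=labels)
outputs.loss.backward()           # training signal
pred = outputs.logits.argmax(-1)  # predicted class at inference time
```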

Baselines
To compare the performance of our model on monolingual comprehension tasks, we select SOTA models in the English and Chinese biomedical domains.
Similarly, to assess our model's cross-lingual comprehension ability, we conduct comparative experiments with models that possess strong cross-lingual understanding capabilities in general domains, as there is currently a lack of multilingual PLMs specifically tailored to the biomedical domain.

Monolingual Biomedical PLMs.
For English PLMs, we select BioBERT (Lee et al., 2020), PubMedBERT (Gu et al., 2020), and BioLinkBERT (Yasunaga et al., 2022) for comparison, while for Chinese, we choose eHealth (Wang et al., 2021) and SMedBERT (Zhang et al., 2021b).

Multilingual PLMs. The XLM-R baseline model and other SOTA multilingual PLMs, including mBERT (Devlin et al., 2018) and InfoXLM (Chi et al., 2021), are used as our baselines. We also compare against two large language models (LLMs) that currently perform well on generation tasks, namely ChatGPT and ChatGLM-6B, on these four tasks.

Multilingual Biomedical PLMs. To our knowledge, there is currently no multilingual pretrained model in the biomedical domain. Therefore, we build two baseline models ourselves. Considering the effectiveness of back translation (BT) as a data augmentation strategy in general multilingual pretraining, we train the baseline XLM-R+BT on back-translated biomedical data. Additionally, to assess the impact of the additional pretraining in KBioXLM, we directly incorporate the three levels of knowledge mentioned above into the XLM-R architecture, forming XLM-R+three KL. Please refer to Table 3 for basic information about our baselines.

Main Results
This section explores our model's cross-lingual and monolingual understanding abilities. Please refer to Appendix A.3 for the model's monolingual understanding performance on more tasks.

Cross-lingual Understanding Ability
We test our model's cross-lingual understanding ability on four tasks in two scenarios: zero-shot and few-shot. As shown in Tables 4 and 5, KBioXLM achieves SOTA performance in both cases. Our model shows strong cross-lingual understanding in the zero-shot scenario under both the "EN-to-CN" and "CN-to-EN" settings. It is worth noting that the performance of LLMs on these language understanding tasks is not ideal. Moreover, compared to ChatGPT, the generation results of ChatGLM are unstable when the input sequence is long. Similarly, general-domain multilingual PLMs also trail our model by more than 10 points. The poor performance of LLMs and general multilingual PLMs underscores the importance of domain adaptation. XLM-R+three KL is pretrained with just 65M tokens, yet it already outperforms XLM-R by 14 and 9 points under the two settings, and it improves over XLM-R+BT by 5 and 2 points, highlighting the importance of knowledge alignment. Compared to KBioXLM, however, XLM-R+three KL is 5 or more points lower. This indicates that skipping the initial pretraining step significantly hurts biomedical cross-lingual understanding, underscoring the importance of first pretraining XLM-R to strengthen its biomedical understanding capability.
In the "EN-to-CN" few-shot scenario, we test models' cross-lingual understanding ability under two settings: 10 training samples and 100 training samples.It also can be observed that XLM-R+three KL and KBioXLM perform the best among these four types of PLMs.Multilingual PLMs and Chinese biomedical PLMs have similar performance.However, compared to our method, there is a difference of over 10 points in the 10-shot scenario and over 5 points in the 100-shot scenario.This indicates the importance of both domain-specific knowledge and multilingual capability, which our model satisfies.

Monolingual Understanding Ability
Although our model focuses on cross-lingual scenarios, we also test its monolingual comprehension ability on these four datasets. Table 6 shows the results. KBioXLM outperforms most other PLMs on these tasks, especially in the "CN-to-CN" scenario, where it improves over XLM-R by up to 4 points on average. BioLinkBERT performs slightly better than ours on English comprehension tasks because it incorporates more knowledge from Wikipedia, whereas KBioXLM focuses on cross-lingual scenarios and uses only a small amount of aligned Wikipedia articles.

Ablation Study
This section verifies the effectiveness of the different parts of the constructed data. Table 7 presents the ablation results in zero-shot scenarios. Removing the passage-level bilingual aligned data results in a roughly 1-point decrease in performance across the four tasks, and further removing the fact-level data leads to a continued decline. When all granularities of bilingual knowledge data are removed, our model's performance drops by approximately 4 points. These experiments demonstrate the effectiveness of constructing aligned corpora with three granularities of knowledge. Because entity knowledge is already used in the underlying XLM-R+Pretraining, it is difficult to accurately assess performance when none of the three types of knowledge is used.

Conclusion
This paper proposes KBioXLM, a model that transforms a general multilingual PLM into the biomedical domain using a knowledge-anchored approach. We first obtain biomedical multilingual corpora by incorporating three levels of knowledge alignment (entity, fact, and passage) into monolingual corpora. We then design three training tasks, namely entity masking, relation masking, and passage relation prediction, to enhance the model's cross-lingual ability. KBioXLM achieves SOTA performance in cross-lingual zero-shot and few-shot scenarios. In the future, we will explore biomedical PLMs in more languages and venture into multilingual PLMs for other domains.

Limitations
Due to the lack of proficient personnel in other less widely spoken languages, our experiments were limited to Chinese and English. However, our method can be applied to various other languages, which is highly significant for addressing cross-lingual understanding tasks in less-resourced languages. Due to device and time limitations, we did not explore our method on models with larger parameter sizes or investigate cross-lingual learning performance on generative models. These aspects are worth exploring in future research.
Figure 1: The construction process of aligned biomedical data relies on three granularities of knowledge. We begin with an English passage identified as 42030, which serves as the primary source. From this passage, we extract entities and relations using UMLS_trans, and we search for the corresponding aligned Chinese passage 6640907 from Wikipedia. Combining these three granular knowledge sources, we construct aligned corpora. The relationship between the two passages is predicted using the "[CLS]" token.

Figure 2: Overview of KBioXLM's training details. Masked entity prediction, masked relation token prediction, and contextual relation prediction tasks are shown in the figure. Orange represents the entity-level pretraining task, green the fact-level task, and blue the passage-level task. Given that the two passages are aligned, our model predicts a "positive" relationship between them.

Table 4: Cross-lingual zero-shot results. "EN-to-CN" and "CN-to-EN" indicate training on English and testing on Chinese datasets, and vice versa. AVG represents the average F1 score across the four cross-lingual tasks.

Table 5: Cross-lingual EN-to-CN few-shot results. "Bio" represents Biomedical and "Multi" represents Multilingual. Due to the limited number of input tokens, we only conduct 10-shot experiments for the LLMs.

Table 6: The performance of KBioXLM and our baselines on English and Chinese monolingual comprehension tasks.

Table 7: KBioXLM's cross-lingual understanding ablation experiments in the zero-shot scenario.