Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking

Injecting external domain-specific knowledge (e.g., UMLS) into pretrained language models (LMs) advances their capability to handle specialised in-domain tasks such as biomedical entity linking (BEL). However, such abundant expert knowledge is available only for a handful of languages (e.g., English). In this work, by proposing a novel cross-lingual biomedical entity linking task (XL-BEL) and establishing a new XL-BEL benchmark spanning 10 typologically diverse languages, we first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task. The results reveal large performance gaps relative to English. We then address the challenge of transferring domain-specific knowledge from resource-rich languages to resource-poor ones. To this end, we propose and evaluate a series of cross-lingual transfer methods for the XL-BEL task, and demonstrate that general-domain bitext helps propagate the available English knowledge to languages with little to no in-domain data. Remarkably, we show that our proposed domain-specific transfer methods yield consistent gains across all target languages, sometimes up to 20 Precision@1 points, without any in-domain knowledge in the target language, and without any in-domain parallel data.


Introduction
Recent work has demonstrated that it is possible to combine the strength of 1) Transformer-based encoders such as BERT (Devlin et al., 2019; Liu et al., 2019), pretrained on large general-domain data, with 2) external linguistic and world knowledge (Zhang et al., 2019; Levine et al., 2020; Lauscher et al., 2020). Such expert human-curated knowledge is crucial for NLP applications in specialised domains such as biomedicine. There, Liu et al. (2021) recently proposed self-alignment pretraining (SAP), a technique to fine-tune BERT on phrase-level synonyms extracted from the Unified Medical Language System (UMLS; Bodenreider 2004). 1 Their SAPBERT model currently holds state-of-the-art (SotA) results across all major English biomedical entity linking (BEL) datasets. However, this approach is not widely applicable to other languages: abundant external resources are available only for a few languages, hindering the development of domain-specific NLP models in all other languages.
Simultaneously, exciting breakthroughs in cross-lingual transfer for language understanding tasks have been achieved (Artetxe and Schwenk, 2019; Hu et al., 2020). However, it remains unclear whether such transfer techniques can be used to improve domain-specific NLP applications and mitigate the gap between knowledge-enhanced models in resource-rich versus resource-poor languages. In this paper, we thus investigate the current performance gaps in the BEL task beyond English, and propose several cross-lingual transfer techniques to improve domain-specialised representations and BEL in resource-lean languages.
In particular, we first present a novel cross-lingual BEL (XL-BEL) task and its corresponding evaluation benchmark in 10 typologically diverse languages, which aims to map biomedical names/mentions in any language to the controlled UMLS vocabulary. After empirically highlighting the deficiencies of multilingual encoders (e.g., MBERT and XLMR; Conneau et al. 2020) on XL-BEL, we propose and evaluate a multilingual extension of the SAP technique. Our main results suggest that expert knowledge can be transferred from English to resource-leaner languages, yielding large gains over vanilla MBERT and XLMR, and over English-only SAPBERT. We also show that leveraging general-domain word and phrase translations offers substantial gains in the XL-BEL task.
Contributions. 1) We highlight the challenge of learning (biomedical) domain-specialised cross-lingual representations. 2) We propose a novel multilingual XL-BEL task with a comprehensive evaluation benchmark in 10 languages. 3) We offer systematic evaluations of existing knowledge-agnostic and knowledge-enhanced monolingual and multilingual LMs in the XL-BEL task. 4) We present a new SotA multilingual encoder in the biomedical domain, which yields large gains in XL-BEL especially on resource-poor languages, and provides strong benchmarking results to guide future work. The code, data, and pretrained models are available online at: github.com/cambridgeltl/sapbert.

Methodology
Background and Related Work.
Learning biomedical entity representations is at the core of BioNLP, benefiting, e.g., relational knowledge discovery (Wang et al., 2018) and literature search (Lee et al., 2016). In the current era of contextualised representations based on Transformer architectures (Vaswani et al., 2017), biomedical text encoders are pretrained via Masked Language Modelling (MLM) on diverse biomedical texts such as PubMed articles (Lee et al., 2020; Gu et al., 2020), clinical notes (Peng et al., 2019; Alsentzer et al., 2019), and even online health forum posts (Basaldella et al., 2020). However, it has been empirically verified that naively applying MLM-pretrained models as entity encoders does not perform well in tasks such as biomedical entity linking (Basaldella et al., 2020; Sung et al., 2020). Recently, Liu et al. (2021) proposed SAP (Self-Alignment Pretraining), a fine-tuning method that leverages synonymy sets extracted from UMLS to improve BERT's ability to act as a biomedical entity encoder. Their SAPBERT model currently achieves SotA scores on all major English BEL benchmarks.
In what follows, we first outline the SAP procedure and its extension to multilingual UMLS synonyms (§2.1), and then introduce another SAP extension that combines domain-specific synonyms with general-domain translation data (§2.2).

Language-Agnostic SAP
Let (x, y) ∈ X × Y denote the tuple of a name and its categorical label. When learning from UMLS synonyms, X × Y is the set of all (name, CUI 2 ) pairs, e.g., (vaccination, C0042196). While Liu et al. (2021) use only English names, we here consider names in other UMLS languages. During training, the model is steered to create similar representations for synonyms regardless of their language. 3 The learning scheme includes 1) an online sampling procedure to select training examples and 2) a metric learning loss that encourages strings sharing the same CUI to obtain similar representations.

Training Examples. Given a mini-batch of name-label pairs (X B , Y B ), we start by constructing all possible triplets for all names x i ∈ X B . Each triplet has the form (x a , x p , x n ), where x a is the anchor, an arbitrary name from X B ; x p is a positive match of x a (i.e., y a = y p ) and x n is a negative match of x a (i.e., y a ≠ y n ). Let f (·) denote the encoder (i.e., MBERT or XLMR in this paper). Among the constructed triplets, we select all triplets that satisfy the following constraint:

||f (x a ) − f (x p )|| 2 > ||f (x a ) − f (x n )|| 2 − λ,

where λ is a predefined margin. In other words, we only keep triplets for which the positive is not yet closer to the anchor than the negative by at least the margin λ. These 'hard' triplets are more informative for representation learning (Liu et al., 2021). Every selected triplet then contributes one positive pair (x a , x p ) and one negative pair (x a , x n ). We collect all such positives and negatives, and denote them as P, N .
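The online mining step can be sketched as follows. This is an illustrative brute-force version with toy embeddings and a `mine_hard_triplets` helper of our own; the actual implementation operates on mini-batch encoder outputs with vectorised distance computations.

```python
import math

def mine_hard_triplets(embeddings, labels, margin=0.2):
    """Select 'hard' triplets (anchor, positive, negative): those where the
    positive is NOT already separated from the negative by the margin, i.e.
    ||f(x_a) - f(x_p)|| > ||f(x_a) - f(x_n)|| - margin."""
    n = len(labels)
    triplets = []
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # positives share the anchor's label (CUI)
            for ng in range(n):
                if labels[ng] == labels[a]:
                    continue  # negatives carry a different label
                d_ap = math.dist(embeddings[a], embeddings[p])
                d_an = math.dist(embeddings[a], embeddings[ng])
                if d_ap > d_an - margin:  # still 'hard': keep for training
                    triplets.append((a, p, ng))
    return triplets
```

Each selected triplet then contributes its positive pair to P and its negative pair to N, as described above.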
Multi-Similarity Loss. We compute the pairwise cosine similarity of all the name representations and obtain a similarity matrix S ∈ R |X B |×|X B | where each entry S ij is the cosine similarity between the i-th and j-th names in the mini-batch B. The Multi-Similarity loss (MS, Wang et al. 2019) is then used for learning from the triplets:

L MS = (1/|X B |) Σ i=1..|X B | [ (1/α) log(1 + Σ n∈N i e^{α(S in − ε)}) + (1/β) log(1 + Σ p∈P i e^{−β(S ip − ε)}) ],   (1)

where α, β are temperature scales; ε is an offset applied on the similarity matrix; P i , N i are indices of positive and negative samples of the i-th anchor.
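In plain Python, the loss can be sketched as below. The function and its default hyperparameter values are illustrative stand-ins, not the paper's configuration; `S` would be the cosine-similarity matrix of a mini-batch.

```python
import math

def multi_similarity_loss(S, pos, neg, alpha=2.0, beta=50.0, eps=0.5):
    """Multi-Similarity loss (Wang et al. 2019) over a similarity matrix S.
    pos[i] / neg[i] hold the column indices of positives / negatives for
    the i-th anchor; alpha and beta are temperature scales, eps the offset."""
    total = 0.0
    for i in range(len(S)):
        # pull positives: penalise low similarity to positive pairs
        pos_term = math.log(1.0 + sum(math.exp(-beta * (S[i][p] - eps)) for p in pos[i])) / beta
        # push negatives: penalise high similarity to negative pairs
        neg_term = math.log(1.0 + sum(math.exp(alpha * (S[i][j] - eps)) for j in neg[i])) / alpha
        total += pos_term + neg_term
    return total / len(S)
```

A batch in which positives are already more similar to their anchors than negatives yields a lower loss than a batch where the similarities are inverted.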

SAP with General-Domain Bitext
We also convert word and phrase translations into the same format (§2.1), where each 'class' now contains only two examples. For a translation pair (x p , x q ), we create a unique pseudo-label y xp,xq and produce two new name-label instances (x p , y xp,xq ) and (x q , y xp,xq ), 4 and proceed as in §2.1. This allows us to easily combine domain-specific knowledge with general translation knowledge within the same SAP framework.
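A minimal sketch of this conversion, using the 'LANGUAGE CODE + index' pseudo-label scheme described in footnote 4 (the function name and exact formatting are our own):

```python
def bitext_to_sap_examples(pairs, src_lang, tgt_lang):
    """Turn translation pairs into (name, pseudo-label) instances so that
    bitext runs through the same SAP pipeline as UMLS synonyms. The label
    only marks which names form one 'class'; its exact format is irrelevant."""
    prefix = (src_lang + tgt_lang).upper()
    examples = []
    for i, (x_p, x_q) in enumerate(pairs, start=1):
        label = f"{prefix}{i}"  # e.g. ENDE1, ENDE2, ...
        examples.append((x_p, label))
        examples.append((x_q, label))
    return examples
```

Because both sides of a pair share one pseudo-label, the triplet mining and MS loss of §2.1 apply unchanged.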

The XL-BEL Task and Evaluation Data
A general cross-lingual entity linking (EL) task (McNamee et al., 2011; Tsai and Roth, 2016) aims to map a mention of an entity in free text of any language to a controlled English vocabulary, typically obtained from a knowledge graph (KG). In this work, we propose XL-BEL, a cross-lingual biomedical EL task. Instead of grounding entity mentions to English-specific ontologies, we use UMLS as a language-agnostic KG: the XL-BEL task requires a model to associate a mention in any language with a (language-agnostic) CUI in UMLS. XL-BEL thus serves as an ideal evaluation benchmark for biomedical entity representations: it challenges models both to 1) represent domain entities and 2) associate entity names across different languages.
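At inference time, linking amounts to nearest-neighbour retrieval over encoded UMLS names. A minimal sketch (the vectors stand in for encoder outputs; Precision@1 then checks whether the top-ranked CUI matches the gold one):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def link_mention(mention_vec, ontology):
    """ontology: list of (CUI, name_vector) pairs covering the UMLS names.
    Return the CUI whose name embedding is most similar to the mention."""
    best_cui, _ = max(ontology, key=lambda entry: cosine(mention_vec, entry[1]))
    return best_cui
```

In practice the ontology side contains millions of encoded names, so retrieval is done with batched matrix multiplication or an approximate nearest-neighbour index rather than this linear scan.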
Evaluation Data Creation. For English, we take the available BEL dataset WikiMed (Vashishth et al., 2020), which links Wikipedia mentions to UMLS CUIs. We then follow similar procedures as WikiMed and create an XL-BEL benchmark covering 10 languages (see Table 2). For each language, we extract all sentences from its Wikipedia dump, find all hyperlinked concepts (i.e., words and phrases), look up their Wikipedia pages, and retain only concepts that are linked to UMLS. 5 For each UMLS-linked mention, we add a triplet (sentence, mention, CUI) to our dataset. 6 Only one example per surface form is retained to ensure diversity. We then filter out examples with mentions that have the same surface form as the title of their Wikipedia article page. 7 Finally, 1k examples are randomly selected for each language: they serve as the final test sets in our XL-BEL benchmark. The statistics of the benchmark are available in Table 1.

4 These pseudo-labels are not related to UMLS, but are used to format our parallel translation data into an input convenient for the SAP procedure. In practice, for these data we generate pseudo-labels ourselves as 'LANGUAGE CODE + index'. For instance, ENDE2344 indicates that this word pair is our 2,344th English-German word translation. Note that the actual coding scheme does not matter, as it is only used by our algorithm to determine which terms belong to the same (in this case, translation) category.
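The surface-form deduplication and mention-vs-title filtering can be sketched as follows; the field layout and exact matching rules (case-insensitive dedup, exact title comparison) are our simplifications of the procedure described above:

```python
def build_test_pool(records):
    """records: (sentence, mention, cui, page_title) tuples mined from a
    Wikipedia dump. Keep one example per mention surface form and drop
    mentions identical to the title of the page they link to."""
    seen_surface_forms = set()
    pool = []
    for sentence, mention, cui, title in records:
        key = mention.lower()
        if key in seen_surface_forms or mention == title:
            continue
        seen_surface_forms.add(key)
        pool.append((sentence, mention, cui))
    return pool  # 1k examples per language are then sampled from this pool
```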

Experiments and Results
UMLS Data. We rely on the UMLS (2020AA) as our SAP fine-tuning data, leveraging synonyms in all available languages. The full multilingual fine-tuning data comprises ≈15M biomedical entity names associated with ≈4.2M individual CUIs. As expected, English is dominant (69.6% of all 15M names), followed by Spanish (10.7%) and French (2.2%). The full stats are in App. §A.3.

Main Results and Discussion
Multilingual UMLS Knowledge Always Helps (Table 2). Table 2 summarises the results of applying multilingual SAP fine-tuning based on UMLS knowledge to a wide variety of monolingual, multilingual, and in-domain pretrained encoders. Injecting UMLS knowledge is consistently beneficial to the models' performance on XL-BEL across all languages and across all base encoders. Using multilingual UMLS synonyms to SAP-fine-tune the biomedical PUBMEDBERT (SAPBERT all syn ) instead of English-only synonyms (SAPBERT) improves its performance across the board. SAP-ing monolingual BERTs for each language also yields substantial gains across all languages; the only exception is Thai (TH), which is not represented in UMLS. Fine-tuning multilingual models MBERT and XLMR leads to even larger relative gains.
Performance across Languages (Table 2). UMLS data is heavily biased towards Romance and Germanic languages; as a result, languages closer to these families benefit most from SAP-ed monolingual LMs (Table 2, upper half). We further continue training on general translation data (§2.2) after the previous UMLS-based SAP. With this variant, base multilingual LMs become powerful multilingual biomedical experts. We observe additional strong gains (cf. Table 2) with out-of-domain translation data: e.g., for MBERT the gains range from 2.4% to 12.7% on all languages except ES. For XLMR, we report Precision@1 boosts of >10% on RU, TR, KO, and TH with XLMR+SAP en syn , and similar but smaller gains also with XLMR+SAP all syn . We stress the case of TH, not covered in UMLS: Precision@1 rises from 11.5% (XLMR+SAP en syn ) to 30.9% (↑19.4%; XLMR+SAP all syn (+en-th wt+muse)), achieved through the synergistic effect of both knowledge types: 1) UMLS synonyms in other languages push the score to 20.6% (↑9.1%); 2) translation knowledge increases it further to 30.9% (↑10.3%). In general, these results suggest that both external in-domain knowledge and general-domain translations boost the performance in resource-poor languages.
The More the Better (Table 4)? According to Table 4 (lower half), it holds almost universally for XLMR that all syn > en+{$LANG} syn > en syn/{$LANG} syn; that is, more in-domain knowledge (even in unrelated languages) seems to benefit cross-lingual transfer. However, for MBERT (Table 4, upper half), the trend is less clear, with en+{$LANG} syn sometimes outperforming the all syn variant. Despite the modest performance differences, this suggests that the choice of source languages for knowledge transfer also plays a role; this warrants further investigation in future work.
Are Large Models (Cross-Lingual) Domain Experts (Table 5)? XLMR LARGE reaches strong performance on English (78.7%) even without SAP-tuning (Table 5). However, its scores without SAP fine-tuning, although much higher than those of its BASE variant, drop on the other ('non-English') languages. At the same time, note that XLMR BASE achieves random-level performance without SAP-tuning. After SAP fine-tuning, on average, XLMR LARGE +SAP still outperforms BASE models, but the gap is much smaller: e.g., we note that the performance of the two SAP-ed models is on par in English. This suggests that with sufficient knowledge injection, the underlying base model matters less (English); however, when the external data are scarce (languages beyond English), a heavily parameterised large pretrained encoder can boost knowledge transfer to resource-poor languages.

Conclusion
We have introduced a novel cross-lingual biomedical entity linking task (XL-BEL), establishing a wide-coverage and reliable evaluation benchmark for cross-lingual entity representations in the biomedical domain in 10 languages, and have evaluated current SotA biomedical entity representations on XL-BEL. We have also presented an effective transfer learning scheme that leverages general-domain translations to improve the cross-lingual ability of domain-specialised representation models. We hope that our work will inspire more research on multilingual and domain-specialised representation learning in the future.

A Appendix

A.1 XL-BEL: Full Statistics

Table 1 in the main paper summarises the key statistics of the XL-BEL benchmark, which was extracted from the 20200601 version of the Wikipedia dump. "sentences" refers to the number of sentences in the dump that contain biomedical mentions. "unique titles (Wiki page)" denotes the number of unique Wikipedia articles the biomedical mentions link to. "mentions" denotes the number of all biomedical mentions in the dump. "unique mentions" refers to the number of mentions left after filtering out examples with duplicated mention surface forms. "unique mentions mention!=title " denotes the number of unique mentions whose surface forms differ from the titles of the Wikipedia articles they link to. The 1k test sets for each language are then randomly selected from the examples in "unique mentions mention!=title ".

A.2 XL-BEL: Selection of Languages
Our goal is to select a diverse and representative sample of languages for the resource and evaluation from the full set of possibly supported languages. For this reason, we exclude some Romance and Germanic languages that are too similar to languages already included in the resource (e.g., since we include Spanish as a representative of the Romance languages, evaluating on related languages such as Portuguese or Italian would not yield substantially new insights while requiring additional experiments). The language list covers languages that are close to English (Spanish, German); languages that are very distant from English (Thai, Chinese, etc.); and also languages that are in the middle (e.g., Turkish, which is typologically different but shares a similar writing script with English). The availability of biomedical texts in Wikipedia also slightly impacted our choice of languages. The overlapping entities of Wikipedia and UMLS are not evenly distributed in the biomedical domain. For example, since animal species are comprehensively encoded in UMLS, they become rather dominant for certain low-resource languages. We manually inspected the distribution of the covered entities in each language to ensure that they are indeed representative biomedical concepts. Languages with heavily skewed entity distributions are filtered out. E.g., biomedical concepts in Basque Wikipedia are heavily skewed towards plant and animal species (which are valid UMLS concepts but not representative enough); as a result, we dropped Basque as an evaluation language. The current 10 languages all have a reasonably fair distribution over biomedical concept categories.

A.3 UMLS Data Preparation
All our UMLS fine-tuning data for SAP is extracted from the MRCONSO.RRF file downloaded at

A.4 Translation Data
The full statistics of the used word and phrase translation data are listed in Table 7. The "muse" word translations are downloaded from https://github.com/facebookresearch/MUSE, while the Wikititle pairs ("wt") are extracted by us and made publicly available.

A.5 Pretrained Encoders
A complete listing of URLs for all used pretrained encoders hosted on huggingface.co is provided in Table 8. Where multiple checkpoints of a model were available, we made the best effort to select the most popular one (based on download counts).
A.6 Full Table for Comparing with LARGE Models

Table 9 lists results across all languages, comparing BASE and LARGE models.

A.7 Future Work
Investigating Other Cross-Lingual Transfer Learning Schemes. We also explored adapting multilingual sentence representation transfer techniques such as that of Reimers and Gurevych (2020), which leverage parallel data. However, we observed no improvement compared to the main transfer scheme reported in the paper. We plan to investigate existing techniques more comprehensively, and benchmark more results on XL-BEL in the future.
Comparison with in-Domain Parallel Data.
While we used general-domain bitexts to cover more resource-poor languages, we are aware that in-domain bitexts exist for several "mainstream" languages (EN, ZH, ES, PT, FR, DE; Bawden et al. 2019). 8 In the future, we plan to also compare with biomedical term/sentence translations in these languages to gain more insight into the impact of domain shift.