MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model

Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to become both domain-specific and multilingual. Evaluation on nine domain-specific datasets, covering biomedical named entity recognition and financial sentence classification in seven different languages, shows that a single multilingual domain-specific model can outperform the general multilingual model, and performs close to its monolingual counterpart. This finding holds across two different pretraining methods, adapter-based pretraining and full model pretraining.


Introduction
The unsupervised pretraining of language models on unlabelled text has proven useful to many natural language processing tasks. The success of this approach is a combination of deep neural networks (Vaswani et al., 2017), the masked language modeling objective, and large-scale corpora (Zhu et al., 2015). In fact, unlabelled data is so important that better downstream task performance can be realized by pretraining models on more unique tokens, without repeating any examples, instead of iterating over smaller datasets (Raffel et al., 2020). When it is not possible to find vast amounts of unlabelled text, a better option is to continue pretraining a model on domain-specific unlabelled text (Han and Eisenstein, 2019; Dai et al., 2020), referred to as domain adaptive pretraining (Gururangan et al., 2020). This results in a better initialization for subsequent fine-tuning on a downstream task in the specific domain, either on target domain data directly (Gururangan et al., 2020) or, if that is unavailable, on source domain data (Han and Eisenstein, 2019).
The majority of domain-adapted models are trained on English domain-specific text, given the availability of English language data. However, many real-world applications, such as working with financial documents (Araci, 2019), biomedical text, and legal opinions and rulings (Chalkidis et al., 2020), should be expected to work in multiple languages. For such applications, annotated target task datasets might be available, but we lack a good pretrained model that we can fine-tune on these datasets. In this paper, we propose a method for domain adaptive pretraining of a single domain-specific multilingual language model that can be fine-tuned for tasks within that domain in multiple languages. There are several reasons for wanting to train a single model: (i) Data availability: we cannot always find domain-specific text in multiple languages, so we should exploit the available resources for effective transfer learning. (ii) Compute intensity: it is environmentally unfriendly to run domain adaptive pretraining once per language (Strubell et al., 2019); BioBERT, for example, was domain adaptively pretrained for 23 days on 8×Nvidia V100 GPUs. (iii) Ease of use: a single multilingual model eases deployment when an organization needs to work with multiple languages on a regular basis (Johnson et al., 2017).
Our method, multilingual domain adaptive pretraining (MDAPT), extends domain adaptive pretraining to a multilingual scenario, with the goal of training a single multilingual model that performs as close as possible to N language-specific models. MDAPT starts with a base model, i.e. a pretrained multilingual language model such as mBERT or XLM-R (Conneau et al., 2020). As monolingual models have the advantage of language-specificity over multilingual models (Rust et al., 2020; Rönnqvist et al., 2019), we consider monolingual models as an upper baseline for our approach. We assume the availability of English-language domain-specific unlabelled text and, where possible, multilingual domain-specific text. However, given that multilingual domain-specific text can be a limited resource, we look to Wikipedia for general-domain multilingual text (Conneau and Lample, 2019). The base model is domain adaptively pretrained on the combination of the domain-specific text and the general-domain multilingual text. Combining these data sources should prevent the base model from forgetting how to represent multiple languages while it adapts to the target domain.
Experiments in the domains of financial and biomedical text, across seven languages (French, German, Spanish, Romanian, Portuguese, Danish, and English) and two downstream tasks (named entity recognition and sentence classification), show the effectiveness of multilingual domain adaptive pretraining. Further analysis in a cross-lingual biomedical sentence retrieval task indicates that MDAPT enables models to learn better domain-specific representations, and that these representations transfer across languages. Finally, we show that the difference in tokenizer quality between mono- and multilingual models is more pronounced in domain-specific text, indicating a direction for future improvement.
All models trained with MDAPT, the new datasets used for downstream tasks and pretraining, and our code are made publicly available.

Problem Formulation
Pretrained language models are trained from random initialization on a large corpus C of unlabelled sentences. Each sentence is used to optimize the parameters of the model using a pretraining objective, for example masked language modelling, where, for a given sentence, 15% of the token positions m are masked in the input and the model is trained to predict those tokens: J(θ) = −log p_θ(x_m | x_\m). C is usually a corpus of no specific domain, e.g. Wikipedia or crawled web text. Domain adaptive pretraining is the process of continuing to pretrain a language model to suit a specific domain (Gururangan et al., 2020; Han and Eisenstein, 2019). This process also uses the masked language modelling pretraining objective, but the model is trained on a domain-specific corpus S, e.g. biomedical text if the model should be suited to the biomedical domain. Our goal is to pretrain a single model, which will be used for downstream tasks in multiple languages within a specific domain, as opposed to having a separate model for each language. This single multilingual domain-specific model should, ideally, perform as well as language-specific domain-specific models on a domain-specific downstream task.
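As a minimal illustration of this objective (not the authors' training code), the following sketch applies the 15% masking to a toy batch with the Hugging Face transformers library and computes the masked language modelling loss; the model name and the example sentence are placeholders.

```python
# Minimal sketch of masked language modelling; not the exact training code used in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

batch = tokenizer(["Interest rates rose sharply."], return_tensors="pt")
inputs = batch["input_ids"].clone()
labels = inputs.clone()

# Select 15% of the non-special token positions to predict.
special = torch.tensor(
    tokenizer.get_special_tokens_mask(inputs[0].tolist(), already_has_special_tokens=True)
).bool()
mask = (torch.rand(inputs.shape) < 0.15) & ~special.unsqueeze(0)
labels[~mask] = -100                     # only masked positions contribute to the loss
inputs[mask] = tokenizer.mask_token_id   # (the 80/10/10 replacement scheme is omitted for brevity)

loss = model(input_ids=inputs, attention_mask=batch["attention_mask"], labels=labels).loss
```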
In pursuit of this goal, we use different types of corpora for domain adaptive pretraining of a single multilingual model. Each corpus we consider has two properties: (1) a domain property - it is either a general or a specific corpus; and (2) a language property - it is either monolingual or multilingual. These properties can be combined: for example, the multilingual Wikipedia is a multi-general corpus, while the abstracts of English biomedical publications would be a mono-specific corpus. Recall that specific corpora are not always available in languages other than English, but they are useful for adapting to the intended domain, while multi-general corpora are more readily available and should help maintain the multilingual abilities of the adapted language model. In the remainder of this paper, we explore the benefits of domain adaptive pretraining with mono-specific, multi-specific, and multi-general corpora. Figure 1 shows how MDAPT extends domain adaptive pretraining to a multilingual scenario.

Multilingual Domain Adaptive Pretraining
Recall that we assume the availability of large-scale English domain-specific and multilingual general unlabelled text. In addition to these mono-specific and multi-general corpora, we collect multilingual domain-specific corpora, using two specific domains, financial and biomedical, as examples (Section 3.1). Note that although we aim to collect domain-specific data in as many languages as possible, the collected data are usually still relatively small. We thus explore different strategies to combine different data sources (Section 3.2), resulting in three types of pretraining corpora of around 10 million sentences each, which exhibit the specific and multi properties to different extents: E_D, English domain-specific data; M_D+E_D, multilingual domain-specific data augmented with English domain-specific data; and M_D+M_WIKI, multilingual domain-specific data augmented with multilingual general data. We use mBERT as the multilingual base model and employ two different continued pretraining methods (Section 3.3), adapter-based training and full model training, on each of these three pretraining corpora.

Domain-specific corpus
Financial domain As specific data for the financial domain, we use the Reuters Corpora (RCV1, RCV2, TRC2), SEC filings (Desola et al., 2019), and FINMULTICORPUS, an in-house collected corpus. FINMULTICORPUS consists of articles in multiple languages published on the PwC website. The resulting corpus contains the following languages: zh, da, nl, fr, de, it, ja, no, pt, ru, es, sv, en, tr. Statistics on the included languages can be found in Table 9 in the Appendix. Information about preprocessing is detailed in Appendix C.
Biomedical domain As specific data for the biomedical domain, we use biomedical publications from the PubMed database in the following languages: fr, en, de, it, es, ro, ru, pt. For languages other than English, we use the language-specific PubMed abstracts published as training data by WMT, and additionally retrieve all language-specific paper titles from the database. For English, we only sample abstracts. We make sure that no translations of documents are included in the pretraining data. The final statistics on the biomedical pretraining data can be found in Table 8 in the Appendix, along with more details about preprocessing the documents. The descriptive statistics of these pretraining data can be found in Table 1.

Combination of data sources
Recall that multi-specific data is usually difficult to obtain, and we explore different strategies to account for this lack. The different compositions of pretraining data are illustrated in Figure 2. We control the size of the resulting corpora by setting a budget of 10 million sentences, which allows a fair comparison across data settings. With plenty of English specific text available, E_D and M_D+E_D are composed by simply populating the corpus until reaching the allowance. As a resource for multi-general data, we use Wikipedia page content, where we ensure the same page is not sampled twice across languages. Up-sampling M_D+M_WIKI using general-domain multilingual data requires a sampling strategy that accounts for the individual language sizes: sampling low-resource languages too often may lead to overfitting on the repeated content, whereas sampling high-resource languages too much can lead to an underfit model. We balance the language samples using exponentially smoothed weighting (Xue et al., 2020; Conneau and Lample, 2019). Following Xue et al., we use an α of 0.3 to smooth the probability of sampling a language, P(L), by P(L)^α. After exponentiating each probability by α, we normalize and populate the pretraining corpus with Wikipedia sentences according to the smoothed values until reaching our budget. Except for English, we up-sample using Wikipedia data. The statistics of the extracted sentences are presented in Tables 8 and 9 in the Appendix.
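The smoothed sampling weights can be computed as in the following sketch; the per-language sentence counts are hypothetical and only illustrate the re-weighting with α = 0.3 under a 10 million sentence budget.

```python
# Sketch of exponentially smoothed language sampling (alpha = 0.3);
# the sentence counts below are hypothetical.
counts = {"fr": 4_000_000, "de": 2_500_000, "es": 1_200_000, "ro": 150_000, "da": 80_000}
alpha = 0.3

total = sum(counts.values())
p = {lang: n / total for lang, n in counts.items()}             # raw probabilities P(L)
smoothed = {lang: q ** alpha for lang, q in p.items()}          # P(L)^alpha
norm = sum(smoothed.values())
weights = {lang: w / norm for lang, w in smoothed.items()}      # renormalised sampling weights

budget = 10_000_000
quota = {lang: int(weights[lang] * budget) for lang in counts}  # sentences to draw per language
```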

Pretraining methods
Continue pretraining the whole model We initialize our models with pretrained base model weights (MBERT: https://huggingface.co/bert-base-multilingual-cased) and then continue pretraining the whole base model via the masked language modeling objective. We follow the standard masking procedure in replacing 80% of the selected subtokens with the mask token and 10% with random subtokens. For all models, we use an effective batch size of 2048 via gradient accumulation, a sequence length of 128, and a learning rate of 5e-5. We train all models for 25,000 steps, which takes 10 GPU days.
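A sketch of this setup using the Hugging Face Trainer is shown below; it mirrors the reported hyper-parameters, but the per-device batch size, the accumulation split, and the toy corpus are assumptions rather than the authors' actual configuration.

```python
# Sketch of full-model continued pretraining with the Hugging Face Trainer;
# the toy corpus and the batch-size/accumulation split are assumptions.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Toy in-memory corpus; in practice this would be the 10M-sentence pretraining corpus.
texts = ["Aspirin inhibits platelet aggregation.", "Les taux d'intérêt augmentent."]
dataset = [tokenizer(t, truncation=True, max_length=128) for t in texts]

args = TrainingArguments(
    output_dir="mdapt-ckpt",
    per_device_train_batch_size=64,    # 64 x 32 accumulation steps = effective batch of 2048
    gradient_accumulation_steps=32,
    learning_rate=5e-5,
    max_steps=25_000,
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
trainer.train()
```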
Adapter-based training In contrast to fine-tuning all weights of the base model, adapter-based training introduces a small network between each layer in the base model, while keeping the base model fixed. The resulting adapter weights, which can be optimized using self-supervised pretraining or later downstream supervised objectives, are usually much lighter than the base model, enabling parameter-efficient transfer learning (Houlsby et al., 2019). We train each adapter for 1.5M steps, taking only 2 GPU days. We refer readers to Pfeiffer et al. (2020b) for more details of adapter-based training, and also describe it in Appendix D for self-containedness.

Domain-Specific Downstream Tasks
To demonstrate the effectiveness of our multilingual domain-specific models, we conduct experiments on two downstream tasks, Named Entity Recognition (NER) and sentence classification, using datasets from the biomedical and financial domains, respectively.

NER in the biomedical domain
Datasets We evaluate on five biomedical NER datasets in different languages: the French QUAERO dataset (Névéol et al., 2014), the Romanian BIORO dataset (Mitrofan, 2017), the Portuguese CLINPT dataset (Lopes et al., 2019), and Spanish and English datasets.

Sentence classification in the financial domain
Datasets We use three financial classification datasets: the publicly available English FINANCIAL PHRASEBANK (Malo et al., 2014), the German ONE MILLION POSTS (Schabus et al., 2017), and a new Danish dataset, FINNEWS. The FINANCIAL PHRASEBANK is an English sentiment analysis dataset in which sentences extracted from financial news and company press releases are annotated with three labels (Positive, Negative, and Neutral). Following its annotation guideline, we create FINNEWS, a dataset of Danish financial news headlines annotated with a sentiment. Two annotators were screened to ensure sufficient domain and language background. The resulting dataset has a high inter-rater reliability (82.1% percent agreement and a Krippendorff's alpha of .725, measured on 800 randomly sampled examples). ONE MILLION POSTS is sourced from an Austrian newspaper. We use TITLE and TOPIC for two classification settings on this dataset: a binary classification, determining whether a TITLE concerns a financial TOPIC or not, and a multi-class classification that assigns a TITLE to one of 9 TOPICs. We list the descriptive statistics in Table 3, and further details can be found in Appendix C.
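For reference, the raw percent agreement between the two annotators can be computed as in the following sketch; the label sequences are hypothetical, and Krippendorff's alpha would in practice be computed with a dedicated reliability library rather than by hand.

```python
# Sketch of the raw percent-agreement computation between two annotators;
# the label sequences below are hypothetical.
ann_a = ["POSITIVE", "NEUTRAL", "NEGATIVE", "NEUTRAL", "POSITIVE"]
ann_b = ["POSITIVE", "NEUTRAL", "NEUTRAL",  "NEUTRAL", "POSITIVE"]

agreement = sum(a == b for a, b in zip(ann_a, ann_b)) / len(ann_a)
print(f"percent agreement: {agreement:.1%}")
# Chance-corrected reliability (Krippendorff's alpha) is usually computed
# with a dedicated library rather than by hand.
```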

Results
To measure the effectiveness of multilingual domain adaptive pretraining, we compare our models trained with MDAPT on downstream NER and classification to the respective monolingual baselines (mono-general) and to the base multilingual model without MDAPT (Table 4). Where available, we also compare to the respective monolingual domain-specific models (mono-specific).
Baseline models As mono-general baselines, we use English BERT and the corresponding publicly available monolingual BERT models for the other languages. Since the Portuguese biomedical model of Schneider et al. (2020) is the only domain-specific model available for a non-English language, we use it as the Portuguese biomedical baseline; see Appendix A for more details.

Main results
The main results for the biomedical NER and financial sentence classification tasks are presented in Table 4. We report the evaluation results for the mono-BERT baselines in the respective languages and the performance difference of the multilingual models compared to these monolingual baselines.
We also consider two domain adaptive pretraining approaches: full model training, reported in the upper half of Table 4, and adapter-based training, reported in the lower half.

Our work is motivated by the finding that domain adaptive pretraining enables models to better solve domain-specific tasks in monolingual scenarios. The first row in Table 4 shows our re-evaluation of the performance of the three available domain adaptively pretrained mono-specific-BERT models matching the domains investigated in our study. We confirm the findings of the original works that the domain-specific models outperform their general-domain mono-BERT counterparts. This underlines the importance of domain adaptation in order to best solve domain-specific tasks. The improvements of PT-BIO-BERT over PT-BERT are small, which coincides with the findings of Schneider et al. (2020), and might be due to the fact that the CLINPT dataset comprises clinical entities rather than more general biomedical entities.
Full model training Recall that the aim of MDAPT is to train a single multi-specific model that performs comparably to the respective mono-general models. Using full model pretraining, we observe that the domain adaptively pretrained multilingual models can even outperform the monolingual baselines for es and en biomedical NER, and for de financial sentence classification. On the other hand, we observe losses of the multilingual models relative to the monolingual baselines for fr and ro NER, and da and en sentence classification. In all cases, MDAPT outperforms the MBERT baseline, i.e. multilingual domain adaptive pretraining helps to make the multilingual model better suited for the specific domain.

Table 5: Cross-domain control experiments. We report two control results for OMP-9 since two MDAPT settings achieved the same averaged accuracy.
Adapter-based training Adapter-based training exhibits a similar pattern: MDAPT improves MBERT across the board, except for the da and en sentence classification tasks, where MDAPT is conducted using only en-specific data. For most tasks, again except da and en sentence classification, the performance of adapter-based training is below that of full model training. On the pt NER dataset, the best score (66.2) achieved by adapter-based training is much lower than that (72.7) achieved by full model training.
Comparison of combination strategies Having observed that a single multi model can achieve performance competitive with several mono models, the next question is how different combination strategies affect the effectiveness of MDAPT. As a general trend, the pretraining corpora composed of multilingual data, M_D+E_D and M_D+M_WIKI, achieve better results than E_D, which is composed of only en data. This is evident across both full model and adapter-based training. M_D+E_D performs best in most cases, especially for adapter-based training. This result indicates the importance of multilingual data in the pretraining corpus. It is worth noting that even pretraining only on E_D data can improve the performance on non-English datasets, and for en tasks, we see an expected advantage of having more en-specific data in the corpus.

Cross-domain evaluations
To make sure that the improvements of MDAPT models over MBERT stem from observing multilingual domain-specific data, and not from exposure to more data in general, we run cross-domain experiments (Gururangan et al., 2020), in which we evaluate the models adapted to the biomedical domain on the financial downstream tasks, and vice versa. The results are shown in Table 5, where we report results for the best MDAPT model and its counterpart in the other domain (¬ MDAPT). In almost all cases, MDAPT outperforms ¬ MDAPT, indicating that adaptation to the domain, and not exposure to additional multilingual data, is responsible for MDAPT's improvement over MBERT. For the OMP datasets, ¬ MDAPT performs surprisingly well; we speculate this might be because classifying newspaper titles requires less domain-specific language understanding.

Analysis
Our experiments suggest that MDAPT results in a pretrained model which is better suited to solve domain-specific downstream tasks than MBERT, and that MDAPT narrows the gap to monolingual model performance. In this section, we present further analysis of these findings; in particular, we investigate the quality of domain-specific representations learned by MDAPT models compared to MBERT, and the gap between mono- and multilingual model performance.

Domain-specific multilingual representations

We investigate whether MDAPT models result in improved representations of domain-specific text in multiple languages. We evaluate the models' ability to learn better sentence representations via a cross-lingual sentence retrieval task, where, given a sentence in a source language, the model is tasked to retrieve the corresponding translation in the target language. To obtain a sentence representation, we average over the encoder outputs for all subtokens in the sentence, and retrieve the k nearest neighbors based on cosine similarity. As no fine-tuning is needed to perform this task, it allows us to directly evaluate encoder quality. We perform sentence retrieval on the parallel test sets of the WMT Biomedical Translation Shared Task 2020 (Bawden et al., 2020). The results in Table 6 show that MDAPT improves retrieval quality, presumably because the models learned better domain-specific representations across languages. Interestingly, with English as the target language (upper half), the model trained on English domain-specific data works best, whereas with English as the source language, it is important that the model has seen multilingual domain-specific data during pretraining.
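A sketch of this retrieval setup is given below, assuming the Hugging Face transformers library; the model name and the sentence pairs are placeholders, not the WMT evaluation data.

```python
# Sketch of cross-lingual sentence retrieval: mean-pooled encoder states as
# sentence vectors, retrieval by cosine similarity. Model name and sentences
# are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state             # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)              # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)               # mean over subtokens

src = embed(["Le patient présente une insuffisance cardiaque."])   # source sentence
tgt = embed(["The patient presents with heart failure.",
             "Interest rates rose sharply."])                      # candidate targets

sims = torch.nn.functional.cosine_similarity(src, tgt)   # similarity to each candidate
nearest = sims.topk(k=1).indices                          # index of the retrieved translation
```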
Effect of tokenization Ideally, we want an MDAPT model that performs close to the corresponding monolingual model. However, in the full fine-tuning setup, the monolingual model outperforms the MDAPT models in most cases. Rust et al. (2020) find that the superiority of monolingual over multilingual models can partly be attributed to the better tokenizers of the monolingual models, and we hypothesize that this difference in tokenization is even more pronounced in domain-specific text. Following Rust et al. (2020), we measure tokenizer quality via continued words, the fraction of words that the tokenizer splits into several subtokens (see the sketch at the end of this section), and compare the difference between monolingual and multilingual tokenizer quality on specific text (the train splits of the downstream tasks) with their difference on general text sampled from Wikipedia. Figure 3 shows that the gap between monolingual and multilingual tokenization quality is indeed larger on the specific texts (green bars) than on the general texts (brown bars), indicating that in a specific domain it is even harder for a multilingual model to outperform a monolingual model. This suggests that methods for explicitly adding representations of domain-specific words (Poerner et al., 2020; Schick and Schütze, 2020) could be a promising direction for improving our approach.

Error analysis on financial sentence classification To provide better insight into the difference between the mono and multi models, we compare the prediction errors on the Danish FINNEWS dataset, since the results in Table 4 show that the mono model outperforms all multi models by a large margin on this dataset. We note that the FINNEWS dataset, which is sampled from tweets, contains heavy use of idioms and jargon, on which the multi models usually fail. For example:

• Markedet lukker: Medvind til bankaktier på en rød C25-dag [POSITIVE] (English translation: Market closes: Tailwind for bank shares on a red C25-day)

• Nationalbanken tror ikke særskat får den store betydning: Ekspert kaldet det "noget pladder" [NEGATIVE] (English translation: The Nationalbank does not think the special tax will have great significance: Expert called it "some hogwash")

The pretraining data for the mono DA-BERT includes Common Crawl texts and custom scraped data from two large debate forums. We believe this exposes DA-BERT to this particular informal register. By contrast, the pretraining data we use are mainly sampled from publications. Covering the variety of a language across sub-domains could thus be an interesting direction for building a strong MDAPT model.
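The continued-words measure referenced above can be computed as in the following sketch; the tokenizer names and the sample sentence are illustrative, and any pair of mono- and multilingual tokenizers can be compared this way.

```python
# Sketch of the continued-words measure: the fraction of whitespace-separated
# words that a tokenizer splits into more than one subtoken.
from transformers import AutoTokenizer

def continued_word_fraction(tokenizer, sentences):
    words = [w for s in sentences for w in s.split()]
    split = sum(len(tokenizer.tokenize(w)) > 1 for w in words)
    return split / len(words)

mono = AutoTokenizer.from_pretrained("bert-base-cased")
multi = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

sample = ["The company reported quarterly earnings before interest and taxes."]
gap = continued_word_fraction(multi, sample) - continued_word_fraction(mono, sample)
print(f"continued-word gap (multi - mono): {gap:.3f}")
```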

Related Work
Recent studies on domain-specific BERT models (Alsentzer et al., 2019; Nguyen et al., 2020), which mainly focus on English text, have demonstrated that in-domain pretraining data can improve the effectiveness of pretrained models on downstream tasks. These works continue pretraining the whole base model, BERT or RoBERTa, on domain-specific corpora, and the resulting models are expected to capture both generic and domain-specific knowledge. By contrast, Beltagy et al. (2019); Gu et al. (2020); Shin et al. (2020) train domain-specific models from scratch, with an in-domain vocabulary. Despite its effectiveness, this approach requires much more compute than domain adaptive pretraining, which our work focuses on. Additionally, we explore an efficient variant of domain adaptive pretraining based on adapters (Houlsby et al., 2019; Pfeiffer et al., 2020b), and observe similar patterns regarding pretraining a multilingual domain-specific model.
Several efforts have trained large-scale multilingual language representation models using parallel data (Aharoni et al., 2019; Conneau and Lample, 2019) or without any cross-lingual supervision (Conneau et al., 2020; Xue et al., 2020). However, poor performance on low-resource languages is often observed, and efforts have been made to mitigate this problem (Rahimi et al., 2019; Ponti et al., 2020; Pfeiffer et al., 2020b). In contrast, we focus on the scenario in which an NLP model needs to process domain-specific text while supporting a modest number of languages.
Alternative approaches aim at adapting a model to a specific target task within the domain directly, e.g. by an intermediate supervised fine-tuning step (Pruksachatkun et al., 2020; Phang et al., 2020), resulting in a model specialized for a single task.
Domain adaptive pretraining, on the other hand, aims at providing a good base model for different tasks within the specific domain.

Conclusion
We extend domain adaptive pretraining to a multilingual scenario that aims to train a single multilingual model better suited for a specific domain. Evaluation results on datasets from the biomedical and financial domains show that although multilingual models usually underperform their monolingual counterparts, domain adaptive pretraining can effectively narrow this gap. On seven out of nine datasets for document classification and NER, the model resulting from multilingual domain adaptive pretraining outperforms the baseline multi-general model, and on four it even outperforms the mono-general model. These encouraging results support deploying a single model that can process financial or biomedical documents in different languages, rather than building separate models for each individual language.

The CLINPT dataset was introduced by Lopes et al. (2019) and comprises texts about neurology from a clinical journal.
Preprocessing NER data We convert all annotations to BIO format. The gaps in discontinuous entities are labeled. We sentence tokenize at line breaks and, if these are unavailable, at full stops. We word tokenize all data at white spaces and split off numbers and special characters. If available, we use official train/dev/test splits. For BIORO, we produce a random 60/20/20 split. For CLINPT, we use the data from volume 2 for training and development and test on volume 1.
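The word tokenization described above could look roughly as follows; the regular expression is an assumption about how numbers and special characters are split off, not the authors' exact preprocessing code.

```python
# Sketch of whitespace tokenization with numbers and special characters
# split off as separate tokens; the regular expression is an assumption.
import re

def tokenize(line):
    tokens = []
    for chunk in line.split():
        # keep runs of letters together, separate digits and anything else
        tokens.extend(re.findall(r"[^\W\d_]+|\d+|[^\w\s]", chunk))
    return tokens

print(tokenize("Aspirin 100mg (oral), twice daily."))
# ['Aspirin', '100', 'mg', '(', 'oral', ')', ',', 'twice', 'daily', '.']
```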

C Financial data
Preprocessing pretraining data Sentences are tokenized using NLTK. For languages not covered by the sentence tokenizer, we split at full stops. Additionally, we split particularly long sentences, filter out sentences containing no letters, and remove HTML markup and tags.
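A rough sketch of this preprocessing is shown below; the language handling, the length threshold, and the HTML stripping pattern are assumptions rather than the exact pipeline used.

```python
# Rough sketch of the pretraining-data preprocessing: strip HTML tags,
# sentence-split with NLTK (falling back to full stops for uncovered
# languages), and filter out sentences without letters or that are too long.
import re
import nltk

def preprocess(raw, language="english", max_words=200):
    text = re.sub(r"<[^>]+>", " ", raw)                     # remove HTML tags
    try:
        sentences = nltk.sent_tokenize(text, language=language)
    except LookupError:                                     # language not covered by NLTK
        sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences
            if any(c.isalpha() for c in s)                  # drop sentences with no letters
            and len(s.split()) <= max_words]                # drop overly long sentences

print(preprocess("<p>Renten stiger. 2021!</p>", language="danish"))
```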
FINMULTICORPUS The corpus consists of PwC publications in multiple languages made publicly available on PwC websites. The publications cover a diverse range of topics related to the financial domain. The corpus is created by extracting text passages from the publications. Table 2 describes the number of sentences and the languages that the corpus covers.

Downstream classification data
FINNEWS The financial sentiment dataset is curated from financial newspaper headline tweets. The motivation was to create a Danish equivalent of the FINANCIAL PHRASEBANK. The news headlines are annotated with a sentiment by two annotators. The annotators were screened to ensure sufficient domain and educational background. A description of the positive, neutral, and negative labels was formalized before the annotation process. The dataset has an 82.125% rater agreement and a Krippendorff's alpha of .725, measured on 800 randomly sampled instances.

ONE MILLION POSTS (Schabus et al., 2017) The annotated dataset includes user comments posted to an Austrian newspaper. We use the TITLE (newspaper headline) and TOPICS, i.e. 'KULTUR', 'SPORT', 'WIRTSCHAFT', 'INTERNATIONAL', 'INLAND', 'WISSENSCHAFT', 'PANORAMA', 'ETAT', 'WEB'. From the dataset, we derive two downstream tasks. The binary classification task OMP-binary determines whether a TITLE concerns a financial TOPIC or not; here we merge all non-financial TOPICS into one category. The multi-class classification task OMP-multi classifies a TITLE into one of the 9 TOPICS.

D Adapter-based training
Recall that the main component of a transformer model is a stack of transformer layers, each of which consists of a multi-head self-attention network and a feed-forward network, followed by layer normalization. The idea of adapter-based training (Houlsby et al., 2019; Stickland and Murray, 2019; Pfeiffer et al., 2020a) is to add a small network (called an adapter) into each transformer layer. During training, only the weights of the new adapters are updated, while the base transformer model is kept fixed. Different options exist regarding where adapters are placed and their network architecture. In this work, we use the bottleneck architecture proposed by Houlsby et al. (2019) and place the adapters after the feed-forward network, following Pfeiffer et al. (2020a): Adapter_l(h_l, r_l) = U_l(ReLU(D_l(h_l))) + r_l, where r_l is the output of the transformer's feed-forward layer and h_l is the output of the subsequent layer normalisation.

Table 7: A comparison between baseline mono models and the multi model MBERT. We use total file size (gigabytes) and the total number of tokens to represent the training data size.
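A minimal PyTorch sketch of this bottleneck adapter is given below; the hidden and bottleneck sizes are illustrative and not the configuration used for the experiments.

```python
# Sketch of the bottleneck adapter defined above, in PyTorch;
# hidden and bottleneck sizes are illustrative.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)   # D_l
        self.up = nn.Linear(bottleneck_size, hidden_size)     # U_l

    def forward(self, h, r):
        # Adapter_l(h_l, r_l) = U_l(ReLU(D_l(h_l))) + r_l
        return self.up(torch.relu(self.down(h))) + r

# During adapter-based training only these parameters are updated;
# the base transformer weights stay frozen.
adapter = Adapter()
h = torch.randn(2, 128, 768)   # output of the layer normalisation
r = torch.randn(2, 128, 768)   # output of the transformer's feed-forward layer
out = adapter(h, r)
```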