Inexpensive Domain Adaptation of Pretrained Language Models: Case Studies on Biomedical NER and Covid-19 QA

Domain adaptation of Pretrained Language Models (PTLMs) is typically achieved by unsupervised pretraining on target-domain text. While successful, this approach is expensive in terms of hardware, runtime and CO2 emissions. Here, we propose a cheaper alternative: We train Word2Vec on target-domain text and align the resulting word vectors with the wordpiece vectors of a general-domain PTLM. We evaluate on eight English biomedical Named Entity Recognition (NER) tasks and compare against the recently proposed BioBERT model. We cover over 60% of the BioBERT minus BERT F1 delta, at 5% of BioBERT's CO2 footprint and 2% of its cloud compute cost. We also show how to quickly adapt an existing general-domain Question Answering (QA) model to an emerging domain: the Covid-19 pandemic.


Introduction
Pretrained Language Models (PTLMs) such as BERT (Devlin et al., 2019) have spearheaded advances on many NLP tasks. Usually, PTLMs are pretrained on unlabeled general-domain and/or mixed-domain text, such as Wikipedia, digital books or the Common Crawl corpus.
When applying PTLMs to specific domains, it can be useful to domain-adapt them. Domain adaptation of PTLMs has typically been achieved by pretraining on target-domain text. One such model is BioBERT (Lee et al., 2020), which was initialized from general-domain BERT and then pretrained on biomedical scientific publications. The domain adaptation is shown to be helpful for target-domain tasks such as biomedical Named Entity Recognition (NER) or Question Answering (QA). On the downside, the computational cost of pretraining can be considerable: BioBERTv1.0 was adapted for ten days on eight large GPUs (see Table 1), which is expensive, environmentally unfriendly, prohibitive for small research labs and students, and may delay prototyping on emerging domains.
We therefore propose a fast, CPU-only domain-adaptation method for PTLMs: We train Word2Vec (Mikolov et al., 2013a) on target-domain text and align the resulting word vectors with the wordpiece vectors of an existing general-domain PTLM. The PTLM thus gains domain-specific lexical knowledge in the form of additional word vectors, but its deeper layers remain unchanged. Since Word2Vec and the vector space alignment are computationally cheap, the process requires a fraction of the resources associated with pretraining the PTLM itself, and it can be done on CPU.
In Section 4, we use the proposed method to domain-adapt BERT on PubMed+PMC (the data used for BioBERTv1.0). We improve over general-domain BERT on eight out of eight biomedical NER tasks, using a fraction of the compute cost associated with BioBERT. In Section 5, we show how to quickly adapt an existing Question Answering model to text about the Covid-19 pandemic, without any target-domain Language Model pretraining or finetuning. Code: www.github.com/npoe/covid-qa.
Related work

The BERT PTLM

For our purpose, a PTLM consists of three parts: a tokenizer $T_{LM}: \mathbb{L}^+ \rightarrow \mathbb{L}_{LM}^+$, a wordpiece embedding lookup function $E_{LM}: \mathbb{L}_{LM} \rightarrow \mathbb{R}^{d_{LM}}$, and an encoder function $F_{LM}$. $\mathbb{L}_{LM}$ is a limited vocabulary of wordpieces. All words from the natural language $\mathbb{L}^+$ that are not in $\mathbb{L}_{LM}$ are tokenized into sequences of shorter wordpieces, e.g., dementia becomes dem ##ent ##ia. Given a sentence $S = [w_1, \ldots, w_T]$, tokenized as $T_{LM}(S) = [T_{LM}(w_1); \ldots; T_{LM}(w_T)]$, $E_{LM}$ embeds every wordpiece in $T_{LM}(S)$ into a real-valued, trainable wordpiece vector. The wordpiece vectors of the entire sequence are stacked and fed into $F_{LM}$. Note that we consider position and segment embeddings to be a part of $F_{LM}$ rather than $E_{LM}$.
In the case of BERT, $F_{LM}$ is a Transformer (Vaswani et al., 2017), followed by a final feed-forward net. During pretraining, the feed-forward net predicts the identity of masked wordpieces. When finetuning on a supervised task, it is usually replaced with a randomly initialized layer.
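As a minimal illustration, the sketch below shows what $T_{LM}$ and $E_{LM}$ look like in practice, assuming the HuggingFace transformers library (an assumption; the notation above is library-agnostic).

```python
# Minimal sketch of T_LM and E_LM for bert-base-cased, assuming the
# HuggingFace transformers library.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")   # T_LM
model = BertModel.from_pretrained("bert-base-cased")           # E_LM + F_LM

# Words outside the wordpiece vocabulary L_LM are split into shorter pieces;
# the paper's example is dementia -> dem ##ent ##ia (the exact split depends
# on the vocabulary).
print(tokenizer.tokenize("dementia"))

# E_LM is the wordpiece embedding lookup; its weight matrix has shape |L_LM| x d_LM.
E_LM = model.get_input_embeddings()
print(E_LM.weight.shape)   # torch.Size([28996, 768]) for bert-base-cased
```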

Domain-adapted PTLMs
Domain adaptation of PTLMs is typically achieved by pretraining on unlabeled target-domain text. Some examples of such models are BioBERT (Lee et al., 2020), which was pretrained on the PubMed and/or PubMed Central (PMC) corpora, SciBERT (Beltagy et al., 2019), which was pretrained on papers from SemanticScholar, ClinicalBERT (Alsentzer et al., 2019; Huang et al., 2019a) and Clinical XLNet (Huang et al., 2019b), which were pretrained on clinical patient notes, and AdaptaBERT (Han and Eisenstein, 2019), which was pretrained on Early Modern English text. In most cases, a domain-adapted PTLM is initialized from a general-domain PTLM (e.g., standard BERT), though Beltagy et al. (2019) report better results with a model that was pretrained from scratch with a custom wordpiece vocabulary. In this paper, we focus on BioBERT, as its domain adaptation corpora are publicly available.

Word vectors
Word vectors are distributed representations of words that are trained on unlabeled text. Contrary to PTLMs, word vectors are non-contextual, i.e., a word type is always assigned the same vector, regardless of context. In this paper, we use Word2Vec (Mikolov et al., 2013a) to train word vectors. We denote the Word2Vec lookup function as $E_{W2V}: \mathbb{L}_{W2V} \rightarrow \mathbb{R}^{d_{W2V}}$. Word vector spaces can be aligned with one another via a linear transformation, a technique best known from cross-lingual alignment; in this paper, we apply such an alignment to domain adaptation within the same language.
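For concreteness, the sketch below trains target-domain word vectors with gensim; this is only a stand-in for the original Word2Vec implementation, and the toy corpus and parameter values are illustrative rather than the paper's exact setup.

```python
# Sketch: training non-contextual word vectors and using the lookup E_W2V.
# gensim is used here as a stand-in for the original word2vec tool.
from gensim.models import Word2Vec

# Toy stand-in for a tokenized target-domain corpus (one list per sentence).
corpus = [["dementia", "is", "a", "progressive", "disease"],
          ["the", "patient", "was", "treated", "with", "metformin"]]

w2v = Word2Vec(
    sentences=corpus,
    vector_size=768,   # d_W2V; chosen to match d_LM (768 for bert-base-cased)
    sg=0,              # CBOW
    negative=5,        # negative sampling
    min_count=1,       # kept low only for this toy corpus
    workers=8,         # CPU-only training
)

vec = w2v.wv["dementia"]   # E_W2V("dementia"), a 768-dimensional numpy array
```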

Method
In the following, we assume access to a general-domain PTLM, as described in Section 2.1, and a corpus of unlabeled target-domain text.

Creating new input vectors
In a first step, we train Word2Vec on the target-domain corpus. In a second step, we take the intersection of $\mathbb{L}_{LM}$ and $\mathbb{L}_{W2V}$. In practice, the intersection mostly contains wordpieces from $\mathbb{L}_{LM}$ that correspond to standalone words. It also contains single characters and other noise; however, we found that filtering them does not improve alignment quality. In a third step, we use the intersection to fit an unconstrained linear transformation $W \in \mathbb{R}^{d_{LM} \times d_{W2V}}$ via least squares:

$$\hat{W} = \underset{W}{\mathrm{arg\,min}} \sum_{x \in \mathbb{L}_{LM} \cap \mathbb{L}_{W2V}} \big\lVert W E_{W2V}(x) - E_{LM}(x) \big\rVert_2^2 \qquad (1)$$

Intuitively, $\hat{W}$ makes Word2Vec vectors "look like" the PTLM's native wordpiece vectors, just like cross-lingual alignment makes word vectors from one language "look like" word vectors from another language. In Table 2, we report word alignment accuracy when we split $\mathbb{L}_{LM} \cap \mathbb{L}_{W2V}$ into a training and development set. In Table 3, we show examples of within-space and cross-space nearest neighbors after alignment.
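A minimal sketch of the least-squares fit in Eq. 1, assuming `tokenizer`, `E_LM` and `w2v` from the earlier sketches; the variable names are illustrative and not taken from the paper's code.

```python
# Sketch of the least-squares alignment (Eq. 1): fit W so that aligned
# Word2Vec vectors approximate the PTLM's wordpiece vectors on the shared vocabulary.
import numpy as np

# Shared vocabulary L_LM ∩ L_W2V: wordpieces that are also Word2Vec word types.
shared = [w for w in tokenizer.vocab if w in w2v.wv.key_to_index]

X = np.stack([w2v.wv[w] for w in shared])                         # (n, d_W2V)
Y = np.stack([E_LM.weight[tokenizer.vocab[w]].detach().numpy()    # (n, d_LM)
              for w in shared])

# Solve min_B ||X B - Y||^2; then W = B^T has shape (d_LM, d_W2V), as in Eq. 1.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
W = B.T

# Aligned vector for a domain word (if it is in the Word2Vec vocabulary):
aligned = W @ w2v.wv["dementia"]    # lives in the PTLM's wordpiece vector space
```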

Updating the wordpiece embedding layer
Next, we redefine the wordpiece embedding layer of the PTLM. The most radical strategy would be to replace the entire layer with the aligned Word2Vec vectors. In initial experiments, this strategy led to a drop in performance, presumably because function words are not well represented by Word2Vec, and replacing their vectors disrupts BERT's syntactic abilities. To prevent this problem, we leave existing wordpiece vectors intact and only add new ones:

$$\hat{E}_{LM}(x) = \begin{cases} E_{LM}(x) & \text{if } x \in \mathbb{L}_{LM} \\ \hat{W} E_{W2V}(x) & \text{if } x \in \mathbb{L}_{W2V} \setminus \mathbb{L}_{LM} \end{cases} \qquad (2)$$
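A minimal sketch of building $\hat{E}_{LM}$ with the transformers API, assuming `model`, `tokenizer`, `w2v` and `W` from the earlier sketches.

```python
# Sketch: keep existing wordpiece vectors and append aligned Word2Vec vectors
# for words in L_W2V that are not in L_LM (Eq. 2).
import numpy as np
import torch

new_words = [w for w in w2v.wv.key_to_index if w not in tokenizer.vocab]
new_vectors = torch.tensor(
    np.stack([w2v.wv[w] for w in new_words]) @ W.T, dtype=torch.float32
)

old_vectors = model.get_input_embeddings().weight.data       # (|L_LM|, d_LM)
extended = torch.cat([old_vectors, new_vectors], dim=0)      # (|L_LM| + |new|, d_LM)

# Grow the embedding layer and overwrite it with the extended matrix.
model.resize_token_embeddings(len(tokenizer) + len(new_words))
model.get_input_embeddings().weight.data.copy_(extended)
```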

Updating the tokenizer
In a final step, we update the tokenizer to account for the added words. Let $T_{LM}$ be the standard BERT tokenizer, and let $\hat{T}_{LM}$ be the tokenizer that treats all words in $\mathbb{L}_{LM} \cup \mathbb{L}_{W2V}$ as one-wordpiece tokens, while tokenizing any other words as usual.
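A sketch of one way to obtain $\hat{T}_{LM}$ with the transformers API (an assumption about tooling; the paper does not prescribe an implementation): keep the original tokenizer as $T_{LM}$ and register the added words as unsplittable tokens in a second copy.

```python
# Sketch: T_LM stays untouched; T̂_LM treats all added words as single tokens.
from transformers import BertTokenizer

t_lm = BertTokenizer.from_pretrained("bert-base-cased")    # T_LM (unchanged)
t_hat = BertTokenizer.from_pretrained("bert-base-cased")
t_hat.add_tokens(new_words)   # new_words from the embedding step; never split again

print(t_lm.tokenize("euthymia"))    # wordpiece split, e.g. e ##uth ##ym ##ia
print(t_hat.tokenize("euthymia"))   # ["euthymia"], if it is among the added words
```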
In practice, a given word may or may not benefit from being tokenized by $\hat{T}_{LM}$ instead of $T_{LM}$. To give a concrete example, 82% of the words in the BC5CDR NER dataset that end in the suffix -ia are part of a disease entity (e.g., dementia). $T_{LM}$ tokenizes this word as dem ##ent ##ia, thereby exposing this strong orthographic cue to the model. As a result, $T_{LM}$ improves recall on -ia diseases. But there are many cases where wordpiece tokenization is meaningless or misleading. For instance, euthymia (not a disease) is tokenized by $T_{LM}$ as e ##uth ##ym ##ia, making it likely to be classified as a disease. By contrast, $\hat{T}_{LM}$ gives euthymia a one-wordpiece representation that depends only on distributional semantics. We find that using $\hat{T}_{LM}$ improves precision on -ia diseases.
To combine these complementary strengths, we use a 50/50 mixture of $T_{LM}$-tokenization and $\hat{T}_{LM}$-tokenization when finetuning the PTLM on a task. At test time, we use both tokenizers and mean-pool the outputs. Let $o(S; T)$ be some output of interest (e.g., a logit), given sentence $S$ tokenized by $T$. We predict:

$$o(S) = \frac{1}{2} \left[ o(S; T_{LM}) + o(S; \hat{T}_{LM}) \right]$$
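A minimal sketch of this test-time pooling, assuming a token-classification model and the two tokenizers from above; `gather_word_initial` is a hypothetical helper that keeps only word-initial wordpiece positions, so both tokenizations yield one logit vector per word.

```python
import torch

def pooled_logits(model, sentence_words, t_lm, t_hat, gather_word_initial):
    """Mean-pool o(S; T_LM) and o(S; T̂_LM) at test time."""
    per_tokenizer = []
    for tok in (t_lm, t_hat):
        enc = tok(" ".join(sentence_words), return_tensors="pt")
        out = model(**enc).logits                        # (1, seq_len, num_labels)
        per_tokenizer.append(gather_word_initial(out, tok, sentence_words))
    return 0.5 * (per_tokenizer[0] + per_tokenizer[1])   # one logit vector per word
```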

Experiment 1: Biomedical NER
In this section, we use the proposed method to create GreenBioBERT, an inexpensive and environmentally friendly alternative to BioBERT. Recall that BioBERTv1.0 (biobert_v1.0_pubmed_pmc) was initialized from general-domain BERT (bert-base-cased) and then pretrained on PubMed+PMC.

Domain adaptation
We train Word2Vec with vector size $d_{W2V} = d_{LM} = 768$ on PubMed+PMC (see Appendix for details). Then, we update the wordpiece embedding layer and tokenizer of general-domain BERT (bert-base-cased) as described in Section 3.

Finetuning
We finetune GreenBioBERT on the eight publicly available NER tasks used in Lee et al. (2020). We also run reproduction experiments with general-domain BERT and BioBERTv1.0, using the same setup as for our model. See Appendix for details on preprocessing and hyperparameters. Since some of the datasets are sensitive to the random seed, we report mean and standard error over eight runs.

Results and discussion

Table 4 shows entity-level precision, recall and F1, as measured by the CoNLL NER scorer. For ease of visualization, Figure 1 shows test set F1 shifted and scaled as

$$\frac{\mathrm{F1} - \mathrm{F1}_{\mathrm{BERT\,(ref)}}}{\mathrm{F1}_{\mathrm{BioBERTv1.0\,(ref)}} - \mathrm{F1}_{\mathrm{BERT\,(ref)}}}$$

where $\mathrm{F1}_{\mathrm{BERT\,(ref)}}$ and $\mathrm{F1}_{\mathrm{BioBERTv1.0\,(ref)}}$ are scores reported by Lee et al. (2020). In other words, the figure shows what portion of the reported BioBERT minus BERT F1 delta is covered by our less expensive GreenBioBERT model. On average, we cover between 61% and 70% of the delta (61% relative to BioBERTv1.0, 70% relative to BioBERTv1.1, and 61% if we take our own reproduction experiments as reference points).

Ablation study
To test whether the improvements over general-domain BERT are due to the aligned Word2Vec vectors, or just to the availability of additional word vectors in general, we perform an ablation study where we replace the aligned vectors with their non-aligned counterparts (by setting $W$ to the identity matrix in Eq. 1) or with randomly initialized vectors. Table 5 shows that dev set F1 drops on all datasets under these circumstances, i.e., vector space alignment seems to be important.
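A small sketch of the three vector variants compared in this ablation, assuming `w2v`, `W` and `new_words` from the earlier sketches; the random-initialization scale is an assumption (0.02 is BERT's default initializer range).

```python
import numpy as np

raw = np.stack([w2v.wv[w] for w in new_words])   # non-contextual Word2Vec vectors

aligned_vectors = raw @ W.T                      # proposed method (Eq. 1)
non_aligned_vectors = raw                        # ablation: W = identity
random_vectors = np.random.normal(scale=0.02, size=raw.shape)   # ablation: random init
```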

Experiment 2: Covid-19 QA
In this section, we use the proposed method to quickly adapt an existing general-domain QA model to an emerging target domain: the Covid-19 pandemic. Our baseline model is SQuADBERT, an existing BERT model that was finetuned on the general-domain SQuAD dataset (Rajpurkar et al., 2016). We evaluate on Deepset-AI Covid-QA (Möller et al., 2020), a SQuAD-style dataset with 2,019 annotated span-selection questions about 147 papers from CORD-19. We assume that there is no labeled target-domain data for finetuning on the task, and instead use the entire Covid-QA dataset as a test set. This is a realistic setup for an emerging domain without annotated training data.

Table 6: Results (%) on Deepset-AI Covid-QA. EM (exact answer match) and F1 (token-level F1 score) are evaluated with the SQuAD scorer. "substr": predictions that are a substring of the gold answer. This is much higher than EM because many gold answers are not minimal answer spans (see Appendix, "Notes on Covid-QA", for an example).

Domain adaptation
We train Word2Vec with vector size $d_{W2V} = d_{LM} = 1024$ on CORD-19 and/or PubMed+PMC. The process takes less than an hour on CORD-19 and about one day on the combined corpus, again without the need for a GPU. Then, we update SQuADBERT's wordpiece embedding layer and tokenizer, as described in Section 3. We refer to the resulting model as GreenCovidSQuADBERT.

Table 6 shows that GreenCovidSQuADBERT outperforms general-domain SQuADBERT on all measures. Interestingly, the small CORD-19 corpus is enough to achieve this result (compare "CORD-19 only" and "CORD-19+PubMed+PMC"), presumably because it is specific to the target domain and contains the Covid-QA context papers.

Conclusion
As a reaction to the trend towards high-resource models, we have proposed an inexpensive, CPU-only method for domain-adapting Pretrained Language Models: We train Word2Vec vectors on target-domain data and align them with the wordpiece vector space of a general-domain PTLM.
On eight biomedical NER tasks, we cover over 60% of the BioBERT minus BERT F1 delta, at 5% of BioBERT's domain adaptation CO2 footprint and 2% of its cloud compute cost. We have also shown how to rapidly adapt an existing BERT QA model to an emerging domain, the Covid-19 pandemic, without the need for target-domain Language Model pretraining or finetuning.
We hope that our approach will benefit practitioners with limited time or resources, and that it will encourage environmentally friendlier NLP.

Appendix

Word2Vec training

We extract all abstracts and text bodies and apply the BERT basic tokenizer (a rule-based word tokenizer that standard BERT uses before wordpiece tokenization). Then, we train CBOW Word2Vec (www.github.com/tmikolov/word2vec) with negative sampling. We use default parameters except for the vector size (which we set to $d_{W2V} = d_{LM}$).

NER preprocessing

We cut all sentences into chunks of 30 or fewer whitespace-tokenized words (without splitting inside labeled spans). Then, we tokenize every chunk $S$ with $T = T_{LM}$ or $T = \hat{T}_{LM}$ and add special tokens. Word-initial wordpieces in $T(S)$ are labeled as B(egin), I(nside) or O(utside), while non-word-initial wordpieces are labeled as X(ignore).
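A minimal sketch of this labeling scheme; `label_wordpieces` is an illustrative helper name, not from the paper's code.

```python
def label_wordpieces(words, word_labels, tokenizer):
    """Word-initial wordpieces keep the word's B/I/O tag, continuation pieces get X."""
    pieces, piece_labels = [], []
    for word, label in zip(words, word_labels):
        wp = tokenizer.tokenize(word)
        pieces.extend(wp)
        piece_labels.extend([label] + ["X"] * (len(wp) - 1))
    return pieces, piece_labels

# Illustrative call (tags and splits are examples only):
# label_wordpieces(["dementia", "was", "stable"], ["B", "O", "O"], t_lm)
# -> (["dem", "##ent", "##ia", "was", "stable"], ["B", "X", "X", "O", "O"])
```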

Modeling, training and inference
We follow Lee et al. (2020)'s implementation (www.github.com/dmis-lab/biobert): We add a randomly initialized softmax classifier on top of the last BERT layer to predict the labels. We finetune the entire model to minimize negative log likelihood, with the AdamW optimizer (Loshchilov and Hutter, 2018) and a linear learning rate scheduler (10% warmup). All finetuning runs were done on a GeForce Titan X GPU (12GB). At inference time, we gather the output logits of word-initial wordpieces only. Since the number of word-initial wordpieces is the same for $T_{LM}(S)$ and $\hat{T}_{LM}(S)$, this makes mean-pooling the logits straightforward.
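A minimal sketch of this finetuning setup, assuming the HuggingFace transformers API; the label count and step count are illustrative, not the paper's exact values.

```python
import torch
from transformers import BertForTokenClassification, get_linear_schedule_with_warmup

# Randomly initialized token-classification (softmax) head on top of BERT.
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=3)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

num_training_steps = 1000   # illustrative; depends on dataset size, batch size, epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),   # 10% linear warmup
    num_training_steps=num_training_steps,
)
```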

Hyperparameters
We tune the batch size and peak learning rate on the development set (metric: F1), using the same hyperparameter space as Lee et al. (2020): batch size $\in \{10, 16, 32, 64\}$ and peak learning rate $\in \{1 \cdot 10^{-5}, 3 \cdot 10^{-5}, 5 \cdot 10^{-5}\}$. We train for 100 epochs, which is the upper end of the 50-100 range recommended by the original authors. After selecting the best configuration for every task and model (see Table 7), we train the final model on the concatenation of the training and development sets, as was done by Lee et al. (2020). See Figure 2 for expected maximum development set F1 as a function of the number of evaluated hyperparameter configurations (Dodge et al., 2019).
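For reference, the search space above corresponds to the following small grid (a sketch; the tuning itself simply evaluates each configuration on the development set).

```python
from itertools import product

batch_sizes = [10, 16, 32, 64]
learning_rates = [1e-5, 3e-5, 5e-5]

grid = list(product(batch_sizes, learning_rates))
print(len(grid))   # 12 configurations per task and model
```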