Task-adaptive Pre-training of Language Models with Word Embedding Regularization

Pre-trained language models (PTLMs) acquire domain-independent linguistic knowledge through pre-training with massive textual resources. Additional pre-training is effective in adapting PTLMs to domains that are not well covered by the pre-training corpora. Here, we focus on the static word embeddings of PTLMs for domain adaptation to teach PTLMs domain-specific meanings of words. We propose a novel fine-tuning process: task-adaptive pre-training with word embedding regularization (TAPTER). TAPTER runs additional pre-training by making the static word embeddings of a PTLM close to the word embeddings obtained in the target domain with fastText. TAPTER requires no additional corpus except for the training data of the downstream task. We confirmed that TAPTER improves the performance of the standard fine-tuning and the task-adaptive pre-training on BioASQ (question answering in the biomedical domain) and on SQuAD (the Wikipedia domain) when their pre-training corpora were not dominated by in-domain data.


Introduction
Pre-trained language models (PTLMs) trained with massive textual and computational resources have achieved high performance in natural language processing tasks (Devlin et al., 2019). Additional pre-training is often used to tackle domain discrepancies between the downstream task and the pre-training corpora. Additional pre-training with a large corpus in the domain of the downstream task, such as BioBERT, improves the performance on the task (Alsentzer et al., 2019; Beltagy et al., 2019; Chalkidis et al., 2020). However, this approach requires large corpora in the target domain and entails a high computational cost. Gururangan et al. (2020) proposed task-adaptive pre-training (TAPT), which is additional pre-training using only the training data of the downstream task. TAPT can be regarded as a new fine-tuning process in which the standard fine-tuning is preceded by low-cost additional pre-training.
In this study, we focus on the static word embeddings of PTLMs (i.e., the non-contextualized 0-th layer representations) for domain adaptation. Our method is designed to teach PTLMs the domain-specific meanings of words as static word embeddings. We are motivated by the observation that the middle BERT layers capture syntactic information (Hewitt and Manning, 2019; Jawahar et al., 2019; Liu et al., 2019a). We consider that, by learning the static word embeddings directly, we can adapt the models without harming the domain-independent linguistic knowledge contained in the higher layers.
We propose a novel fine-tuning process called task-adaptive pre-training with word embedding regularization (TAPTER). First, TAPTER obtains word embeddings in the target domain by adapting a pre-trained fastText model (Bojanowski et al., 2017) to the target domain using the training data of the downstream task. Next, TAPTER runs the task-adaptive pre-training by making the static word embeddings of the PTLM close to the word embeddings obtained with the fastText model. Finally, TAPTER runs the standard fine-tuning process.
We found that TAPTER achieves higher scores than the standard fine-tuning and TAPT on question answering tasks in the biomedical domain, BioASQ (Tsatsaronis et al., 2015), and the Wikipedia domain, SQuAD1.1 (Rajpurkar et al., 2016). Our key findings are: (i) word embedding regularization in task-adaptive pre-training enhances domain adaptation when the initial pre-training corpora do not contain a high proportion of in-domain data; (ii) the word embeddings of fastText, which uses a shallow neural network, can be adapted to the target domain more easily than the static word embeddings of PTLMs.

Pre-trained Language Models
We focus on the static word embeddings of PTLMs. Let $V_{\mathrm{LM}}$ be the vocabulary. We input a token sequence $X \in V_{\mathrm{LM}}^{l}$ to the model, where $l$ is the length of the sequence. The embedding layer of the model has a word embedding matrix $E \in \mathbb{R}^{|V_{\mathrm{LM}}| \times d_{\mathrm{LM}}}$ as trainable parameters, where $d_{\mathrm{LM}}$ is the embedding dimension. The word embedding of the $i$-th token is $E_{x_i}$.
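This lookup can be sketched with a toy embedding table (the dimensions and values below are illustrative placeholders, not taken from any actual PTLM):

```python
import torch

# Toy illustration of a PTLM's static (0-th layer) word embeddings:
# |V_LM| = 8 tokens, d_LM = 4 dimensions.
V_LM, d_LM = 8, 4
torch.manual_seed(0)
E = torch.nn.Embedding(V_LM, d_LM)  # trainable matrix E in R^{|V_LM| x d_LM}

# A token-id sequence X of length l = 3; the embedding of the i-th token is E[x_i].
X = torch.tensor([2, 5, 2])
embeddings = E(X)  # shape (3, 4)
assert embeddings.shape == (3, 4)
# Identical token ids share one static vector (unlike contextual representations).
assert torch.equal(embeddings[0], embeddings[2])
```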

fastText
fastText is a word embedding method using subword information (Bojanowski et al., 2017). The skip-gram model (Mikolov et al., 2013) of fastText learns word embeddings by predicting the surrounding words $x_j$ ($j \in C_i$) from a word $x_i$, where $C_i$ is the set of the indices within a given window size. Specifically, at position $i$, we use the surrounding words as positive examples and randomly sample negative words $N_i$ from the vocabulary $V_{\mathrm{FT}}$. The loss function is
$$\sum_{i} \Big[ \sum_{j \in C_i} \log\big(1 + e^{-s(x_i, x_j)}\big) + \sum_{n \in N_i} \log\big(1 + e^{s(x_i, n)}\big) \Big].$$
That is, the model learns to score higher for positive examples and lower for negative examples.
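The objective can be sketched numerically; the scores below are given directly as hypothetical values rather than computed from subwords:

```python
import math

def ns_loss(pos_scores, neg_scores):
    """Skip-gram negative-sampling loss for one position: positive (context)
    words should score high, sampled negative words should score low."""
    loss = sum(math.log(1 + math.exp(-s)) for s in pos_scores)
    loss += sum(math.log(1 + math.exp(s)) for s in neg_scores)
    return loss

# A well-trained model (high positive score, low negative score) incurs a
# much smaller loss than a badly trained one.
good = ns_loss([10.0], [-10.0])
bad = ns_loss([-10.0], [10.0])
assert good < bad
```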
fastText uses subword information to model the score function $s$. Let $S_v$ be the set of substrings of the word $v \in V_{\mathrm{FT}}$. The score of the input word $x_i$ and the output word $x_j$ is
$$s(x_i, x_j) = \sum_{w \in S_{x_i}} W_{\mathrm{in},w}^{\top} W_{\mathrm{out},x_j}.$$
Here, $W_{\mathrm{in}} \in \mathbb{R}^{N \times d_{\mathrm{FT}}}$ consists of the word embeddings of the input layer and $W_{\mathrm{out}} \in \mathbb{R}^{|V_{\mathrm{FT}}| \times d_{\mathrm{FT}}}$ consists of the word embeddings of the output layer. $d_{\mathrm{FT}}$ is the embedding dimension, and $N$ is an arbitrary large number that determines the actual vocabulary size of the subwords. In the implementation of fastText, the model restricts the vocabulary size by hashing each subword $w$ into an index less than $N$. The model also has limits on the minimum and maximum lengths of subwords.
At inference time, the embedding of a word $v$ is $\sum_{w \in S_v} W_{\mathrm{in},w}$. Bojanowski et al. (2017) reported that, by utilizing the subword information, fastText learns word similarity with less training data than other methods do.

Related Work
Static word embeddings in PTLMs have attracted attention in the areas of domain adaptation and cross-lingual transfer learning. Artetxe et al. (2020) proposed to replace word embeddings in the PTLMs trained in the source or target languages. Poerner et al. (2020) proposed a vocabulary expansion using Word2Vec (Mikolov et al., 2013) trained in the target domain for domain adaptation on a CPU. However, our preliminary experiments showed that simple replacement or vocabulary expansion harms performance in our setting because of the limited amount of data. Unlike the previous studies, the proposed method requires no additional corpus by incorporating regularization of the word embeddings in the additional pre-training framework with the training data of the downstream task.

Proposed Method
The proposed method consists of three stages.
Additional Training of fastText First, we train a fastText model using the training data of the downstream task, where the model is initialized with publicly available fastText embeddings.
Our method uses the embeddings of the PTLM vocabulary inferred with this fastText model, $F \in \mathbb{R}^{|V_{\mathrm{LM}}| \times d_{\mathrm{FT}}}$, as the word embeddings in the target domain. Unlike other word embedding methods such as GloVe (Pennington et al., 2014), fastText retains subword information. Therefore, we can obtain the embeddings of the PTLM vocabulary, which contains subword units. The additional training of fastText runs much faster than the additional training of the PTLMs. Note that TAPTER does not make any changes to the original vocabulary of the PTLMs.
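A minimal sketch of this inference step, using a toy stand-in class in place of a trained fastText model (in practice the embeddings would come from the `fasttext` package; the dimensions, vocabulary, and hashing scheme here are simplified placeholders):

```python
import numpy as np

class TinyFastText:
    """Stand-in for a trained fastText model; it only mimics the
    subword-sum inference interface (get_word_vector)."""
    def __init__(self, n_buckets=1000, dim=5, seed=0):
        self.W = np.random.default_rng(seed).normal(size=(n_buckets, dim))
        self.n = n_buckets
    def get_word_vector(self, word):
        # Character n-grams (length 3-6) of the word wrapped in boundary markers,
        # hashed into buckets and summed. Real fastText uses FNV hashing;
        # Python's `hash` is a stand-in here.
        w = f"<{word}>"
        grams = [w[i:i + n] for n in range(3, 7) for i in range(len(w) - n + 1)]
        return self.W[[hash(g) % self.n for g in grams]].sum(axis=0)

ft = TinyFastText()
# A toy PTLM (WordPiece-style) vocabulary; "##" marks word-internal subwords.
vocab = ["the", "protein", "##ase", "bind", "##ing"]
# F in R^{|V_LM| x d_FT}: a fastText embedding for every PTLM vocabulary entry,
# available even for subword units thanks to the character n-grams.
F = np.stack([ft.get_word_vector(tok.lstrip("#")) for tok in vocab])
assert F.shape == (len(vocab), 5)
```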
Additional Pre-training of PTLMs Second, we train the entire PTLM using the training data of the downstream task. We input a token sequence $X \in V_{\mathrm{LM}}^{l}$ to the model. We train the model with the loss function of language modeling $L_{\mathrm{LM}}(X)$ with $l_2$-norm regularization on the difference between the word embeddings. That is, the loss function is
$$L(X) = L_{\mathrm{LM}}(X) + \lambda \sum_{i \in R(X)} \big\| f(E_{x_i}) - F_{x_i} \big\|_2^2, \quad (1)$$
where $R(X)$ is the set of the target tokens of the regularization and $\lambda$ is a hyperparameter controlling the strength of the regularization. The target tokens $R(X)$ exclude stop words and subwords shorter than the minimum length configured in fastText.
The function $f$ maps a $d_{\mathrm{LM}}$-dimensional embedding to a $d_{\mathrm{FT}}$-dimensional embedding:
$$f(e) = \mathrm{LN}(e)\, W + b,$$
where $W \in \mathbb{R}^{d_{\mathrm{LM}} \times d_{\mathrm{FT}}}$ and $b \in \mathbb{R}^{d_{\mathrm{FT}}}$ are trainable parameters and LN denotes layer normalization (Ba et al., 2016).

Fine-Tuning Finally, we run the standard fine-tuning process (Devlin et al., 2019) without any regularization.
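The regularized objective can be sketched as follows; this is a toy version assuming $f$ is layer normalization followed by a learned linear projection and $\lambda$ is a scalar hyperparameter, with placeholder dimensions:

```python
import torch

d_LM, d_FT, lam = 8, 5, 1.0  # toy dimensions and regularization weight (assumed)
ln = torch.nn.LayerNorm(d_LM)
proj = torch.nn.Linear(d_LM, d_FT)

def f(e):
    """Map a d_LM-dim PTLM embedding into the d_FT-dim fastText space."""
    return proj(ln(e))

def tapter_loss(lm_loss, E_x, F_x):
    """Language-modeling loss plus l2 regularization pulling the PTLM's
    static embeddings toward the fastText ones.
    E_x: PTLM embeddings of the target tokens R(X), shape (|R(X)|, d_LM).
    F_x: fastText embeddings of the same tokens, shape (|R(X)|, d_FT)."""
    reg = ((f(E_x) - F_x) ** 2).sum(dim=-1).mean()
    return lm_loss + lam * reg
```

Since the regularization term is non-negative, the combined loss is never smaller than the language-modeling term alone; minimizing it trades off MLM quality against closeness to the fastText embeddings.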

Dataset
We evaluated the proposed method on two question answering datasets. Table 1 shows the statistics. The experimental setup is shown in Appendix A.
SQuAD SQuAD1.1 is a task to answer a question with information from a textual source (Rajpurkar et al., 2016). The dataset provides pairs of a question and a related passage from Wikipedia as the textual source. The input of the model is a token sequence that is a concatenation of the question and the passage. The ground-truth answer is a span in the passage. The output of the model consists of the indices of the answer start and end positions. The indices are calculated from a two-dimensional linear layer on top of the PTLM. The official evaluation metrics are exact matching (EM) and partial matching (F1).
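The span-prediction head can be sketched as follows (toy dimensions and random hidden states; in the real model the head is applied to the PTLM's final hidden states):

```python
import torch

d_LM = 8
# Two logits per token: one for the answer start, one for the answer end.
span_head = torch.nn.Linear(d_LM, 2)

torch.manual_seed(0)
hidden = torch.randn(1, 10, d_LM)  # stand-in PTLM output for a 10-token input
logits = span_head(hidden)         # shape (1, 10, 2)
start_logits, end_logits = logits.unbind(dim=-1)

# Greedy decoding of the span; a real system would additionally constrain
# end >= start and restrict both indices to the passage tokens.
start = start_logits.argmax(dim=-1)
end = end_logits.argmax(dim=-1)
assert start.shape == end.shape == (1,)
```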
BioASQ BioASQ5 is a question answering dataset in the biomedical domain (Tsatsaronis et al., 2015). Following prior work, we used the factoid questions pre-processed into SQuAD format. We used three official evaluation metrics. Strict accuracy (SACC) is the rate at which the top-1 prediction is correct. Lenient accuracy (LACC) is the rate at which the top-5 predictions include the ground-truth answer. Mean reciprocal rank (MRR) is the average of the reciprocal of the rank of the ground-truth answer. We trained the models with ten random seeds and report the average performance. In the fine-tuning stage, as in Wiese et al. (2017), we first trained the model with SQuAD and then trained it with BioASQ.
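The three metrics can be sketched as below; we assume an answer ranked outside the returned top-5 list contributes zero to MRR (a common convention, not stated explicitly above), and we match answers by exact string equality for simplicity:

```python
def bioasq_metrics(ranked_predictions, gold):
    """SACC, LACC, and MRR over a list of questions.
    ranked_predictions: per question, candidate answers ordered best-first.
    gold: the ground-truth answer per question."""
    sacc = lacc = mrr = 0.0
    for preds, ans in zip(ranked_predictions, gold):
        top5 = preds[:5]
        sacc += top5[:1] == [ans]           # top-1 prediction is correct
        lacc += ans in top5                 # top-5 predictions contain the answer
        if ans in top5:
            mrr += 1.0 / (top5.index(ans) + 1)
    n = len(gold)
    return sacc / n, lacc / n, mrr / n

# Three toy questions: correct at rank 1, correct at rank 2, and missed.
s, l, m = bioasq_metrics([["a", "b"], ["x", "b"], ["c"]], ["a", "b", "z"])
```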

Compared Models
We used three PTLMs: BERT-base-cased (Devlin et al., 2019), BioBERT, and RoBERTa-base (Liu et al., 2019b). BERT-base-cased was pre-trained with English Wikipedia (2.5B words) and BookCorpus (800M words) (Zhu et al., 2015). BioBERT was initialized with BERT-base-cased and pre-trained with PubMed abstracts and PMC articles (18B words). RoBERTa-base was pre-trained with 160GB of corpora including news and Web articles as well as Wikipedia and BookCorpus (used to train BERT). We compared three fine-tuning methods: standard fine-tuning, TAPT, and TAPTER.

Results and Discussion
Is TAPTER effective at adaptation to the biomedical domain? Table 2 shows the results on BioASQ. We evaluated the performance of the domain adaptation with BERT-base-cased because its original pre-training did not use a biomedical corpus.
TAPTER improved BERT's performance by 3.05/0.27/2.01 points (SACC/LACC/MRR) over the simple fine-tuning. Moreover, TAPTER statistically significantly outperformed TAPT in the SACC (top-1 accuracy) and MRR metrics. We consider that the regularization of the word embeddings improves the adaptation of the PTLM. Appendix B shows the word embeddings of the models with principal component analysis. The scatter plots show that the word embeddings of BERT-base-cased and TAPT resemble each other, whereas TAPTER and BioBERT have word embedding distributions dissimilar to that of BERT-base-cased. This indicates that, unlike TAPTER, additional pre-training with language modeling alone does not adapt the static word embeddings to the biomedical domain.
Is additional pre-training effective with the model pre-trained in the target domain? The additional pre-training from BioBERT did not improve the overall performance, although some of the scores slightly increased. There was no statistically significant difference between TAPTER and TAPT in any metric (p < .05). We consider that BioBERT had already learned the knowledge of the biomedical domain because it was pre-trained with massive biomedical text. Therefore, the additional pre-training had little effect on performance.
Is TAPTER effective in the general domain? Table 3 shows the results for SQuAD. In the experiments with BERT, neither of the additional pre-training methods improved performance. On the other hand, in the experiments with RoBERTa, TAPTER improved performance by 0.79/0.46 points (EM/F1). This was the best performance among the compared models on SQuAD.
We consider that TAPTER and TAPT improve performance when the corpora of the original pre-training were not dominated by in-domain data. A large part of the pre-training corpora of BERT is Wikipedia; therefore, the additional pre-training was not effective. By contrast, the pre-training corpora of RoBERTa cover broader topics. Although the corpora include Wikipedia, the additional pre-training can still adapt the model to the Wikipedia domain.
It is known that the performance of PTLMs tends to improve as the amount of pre-training data increases (Baevski et al., 2019; Lan et al., 2019). Our results show that TAPTER can improve the performance of PTLMs that were pre-trained with very large corpora, even if the domain of the downstream task is included in the pre-training corpora.
How well does TAPTER learn the language modeling and the word embeddings? Figure 1 shows the learning curve of the additional pre-training in BioASQ from BERT. We can see that the second term in Eq. (1), representing the word embeddings, decreased more sharply than the first term, representing the language modeling. Since the BERT model is huge and complicated, we must train it slowly with a small learning rate over a large number of steps. However, the regularization term decreases quickly without corrupting the model. This is one of the advantages of TAPTER in low-resource settings.
In addition, the decrease in the first term on the development data plateaued partway through training. However, the word embeddings were trained with less discrepancy between the training and development data. We consider that the training data of BioASQ is too small to represent the distribution of text in the biomedical domain. Since MLM takes a document-level sequence $X$ as input, the search space of the true distribution $\Pr(X)$ is huge, and MLM is a very difficult task to train with limited training data. On the other hand, the regularization term depends only on the word-level distribution $\Pr(x_i)$. Therefore, the model can decrease the regularization term on the evaluation data even in low-resource settings without overfitting.

Conclusion
We proposed a new fine-tuning process including additional pre-training with word embedding regularization. TAPTER learns the meanings of words in the target domain by making the static word embeddings of the PTLM close to the word embeddings obtained in the target domain with fastText. TAPTER improves the performance of BERT in the biomedical domain. Moreover, it improves the performance of RoBERTa even in the Wikipedia domain although the original pre-training corpora of RoBERTa contain Wikipedia.
Many PTLMs with more parameters and trained with more data have been published (Raffel et al., 2020; Shoeybi et al., 2019). We believe that TAPTER is an important method for teaching such large-scale pre-trained language models knowledge of the target domain.

A Experimental Setup
We trained the models on one NVIDIA GeForce GTX 1080Ti (11GB). The hyperparameter settings are in Table 4 and Table 5. The optimization algorithm was Adam (Kingma and Ba, 2014). We used PyTorch (Paszke et al., 2019) and Transformers (Wolf et al., 2020). Stop words were implemented with NLTK (Bird et al., 2009). The word-level tokenizer was spaCy (Honnibal et al., 2020). For the target tokens of the regularization R(X), we randomly selected 50% of the tokens in the input, excluding stop words and subwords shorter than the minimum length configured in fastText. Following the default setting, the minimum length of the subwords in fastText was set to three and the maximum length to six. In BioASQ, we lowercased the corpora in the additional training of fastText and R(X) in the additional pre-training because only a limited number of words containing capital letters appear.
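The selection of R(X) can be sketched as follows; the stop-word list below is a tiny placeholder for NLTK's full list, and `select_targets` is a hypothetical helper name:

```python
import random

# Toy stand-in for selecting the regularization targets R(X): drop stop words
# and too-short subwords, then sample 50% of what remains.
STOP_WORDS = {"the", "a", "of", "is", "in"}  # placeholder for NLTK's list
MIN_SUBWORD_LEN = 3  # fastText's default minimum subword length

def select_targets(tokens, seed=0):
    """Return sorted indices of the sampled regularization targets."""
    candidates = [
        i for i, t in enumerate(tokens)
        if t.lower() not in STOP_WORDS and len(t.lstrip("#")) >= MIN_SUBWORD_LEN
    ]
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, len(candidates) // 2))

tokens = ["the", "protein", "##ase", "binds", "in", "the", "cell"]
targets = select_targets(tokens)  # half of {1, 2, 3, 6}
```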
Note that the computational time of our additional pre-training was about seven hours on one NVIDIA GeForce GTX 1080Ti (11GB) GPU, while that of BioBERT was more than ten days on eight NVIDIA V100 (32GB) GPUs.

B Visualization of Word Embeddings
Here, we show the word embeddings of the models with principal component analysis. Figures 2, 3, 4, and 5 are scatter plots of the word embeddings of BERT-base-cased, the model additionally pre-trained with TAPT, the model additionally pre-trained with TAPTER, and BioBERT, respectively. The figures show that the word embeddings of BERT-base-cased and TAPT resemble each other. The average distance between the embeddings of BERT-base-cased and TAPT over all words is 0.0576, whereas the distance between the embeddings of BERT-base-cased and TAPTER is 0.172. Therefore, additional pre-training with language modeling alone does not adapt the static word embeddings to the biomedical domain, unlike TAPTER. TAPTER and BioBERT have word embedding distributions dissimilar to that of BERT-base-cased.