AVocaDo: Strategy for Adapting Vocabulary to Downstream Domain

During the fine-tuning phase of transfer learning, the pretrained vocabulary remains unchanged while the model parameters are updated. However, the vocabulary generated from the pretraining data is suboptimal for downstream data when a domain discrepancy exists. We propose to treat the vocabulary as an optimizable parameter, updating it by expanding it with domain-specific vocabulary based on a tokenization statistic. Furthermore, we prevent the embeddings of the added words from overfitting to the downstream data by utilizing knowledge learned from a pretrained language model via a regularization term. Our method achieves consistent performance improvements on diverse domains (i.e., biomedical, computer science, news, and reviews).


Introduction
A language model (LM) is pretrained on a large general-domain corpus and then fine-tuned to perform various downstream tasks, such as text classification, named entity recognition, and question answering. However, fine-tuning the LM is challenging when the downstream domain differs significantly from the pretraining domain, requiring domain adaptation to improve downstream performance [Gururangan et al., 2020, Lee et al., 2020, Beltagy et al., 2019].
Prior approaches conduct additional training on a large domain-specific corpus between pretraining and fine-tuning. In these approaches, the pretrained vocabulary remains unchanged even though the model is adapted to a downstream domain such as biomedicine or politics.
We argue that the vocabulary should also be adapted to the downstream data during fine-tuning. Recent studies (e.g., SciBERT [Beltagy et al., 2019]) showed that using a vocabulary optimized for a particular downstream domain is more effective than using the vocabulary generated in the pretraining stage. However, these approaches require a large domain-specific corpus, in addition to the downstream data, in order to construct an optimized vocabulary for the downstream domain. We propose to Adapt the Vocabulary to the downstream Domain (AVocaDo), which updates the pretrained vocabulary by expanding it with words from the downstream data without requiring an additional domain-specific corpus. The relative importance of words is considered in determining the size of the added vocabulary. As shown in Figure 1-(c), domain-specific words are tokenized in an undesirable manner in the corresponding domain. For example, in the reviews domain, "bluetooth" denotes a short-range wireless technology standard, but when the word is tokenized into "blue" and "tooth", the combined meaning of the subwords is entirely different from the intended meaning of "bluetooth". Furthermore, we propose a regularization term that prevents the embeddings of the added words from overfitting to the downstream data, since the downstream data is relatively small compared to the pretraining data.
The experimental results show that our proposed method improves overall performance in a wide variety of domains, including biomedicine, computer science, news, and reviews. Moreover, qualitative results show the advantage of the domain-adapted vocabulary over the original pretrained vocabulary.

Related Work
As transfer learning has shown promising results in natural language processing (NLP), recent work has leveraged the knowledge learned from pretrained models such as BERT [Devlin et al., 2018]. Sato et al. [2020] proposed to expand the vocabulary and leverage an external domain-specific corpus to train new embedding layers.
In contrast, AVocaDo requires only the downstream dataset for domain adaptation. Furthermore, our method selects a subset of the domain-specific vocabulary by considering the relative importance of words.

Methods
In AVocaDo, we generate a domain-specific vocabulary based on the downstream corpus. A subset of the generated vocabulary is merged with the original pretrained vocabulary, and the size of this subset is controlled by the fragment score. Afterwards, we apply a regularization term during fine-tuning to prevent the embeddings of the added words from overfitting to the downstream data.

Adapting Vocabulary
In this section, we describe the procedure of adapting the vocabulary to the downstream domain (Algorithm 1). First, the domain-specific vocabulary set V_D is constructed from the downstream corpus C, given a vocabulary size N_D and a tokenization algorithm. The adapted vocabulary set V_A is then constructed by merging a subset of V_D, of size n_D, with the original pretrained vocabulary set V_P, of size N_P. In other words, N_A, the size of V_A, equals the sum of the sizes of the merged vocabulary sets, i.e., N_A = n_D + N_P. Note that n_D < N_D, because adding too many words introduces infrequent subwords that may cause the rare word problem [Luong et al., 2015, Schick and Schütze, 2020].
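As an illustration of this first step, the following is a minimal sketch of constructing V_D from the downstream corpus with the Hugging Face tokenizers library; the corpus file name, the BPE trainer settings, and the special-token list are assumptions for illustration, not details fixed by this paper.

```python
# Minimal sketch: build a domain-specific vocabulary V_D from the downstream corpus C.
# Assumes the corpus lives in "downstream_corpus.txt" (hypothetical path) and uses the
# BPE algorithm [Sennrich et al., 2015]; the paper's exact tokenizer settings may differ.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

N_D = 10_000  # domain vocabulary size (Table 6)

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=N_D,
                     special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["downstream_corpus.txt"], trainer=trainer)

V_D = tokenizer.get_vocab()  # token -> id mapping for the domain-specific vocabulary
```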
The subset of V_D that is added to V_P is determined by the fragment score f_C(V) ∈ R, which we introduce as a new metric that measures the relative number of subwords produced by a vocabulary V from a single word in corpus C, i.e.,

f_C(V) = (number of subwords obtained by tokenizing C with V) / (number of words in C).

During merging, we keep f_C(V_A) from falling below a certain threshold γ, a hyperparameter determining the lower bound of f_C(V_A). Decreasing the lower bound allows V_A to tokenize C less finely; in contrast, increasing the lower bound forces V_A to tokenize C more finely.
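To make the metric concrete, below is a minimal sketch of the fragment score under this definition (average number of subwords per word in C); the function name and the word-level iteration are illustrative assumptions.

```python
def fragment_score(corpus_words, tokenize):
    """Average number of subwords that a single word in corpus C is tokenized into.

    corpus_words: iterable of whitespace-separated words from the downstream corpus C
    tokenize: callable mapping a word to its list of subwords under vocabulary V
    """
    total_subwords = 0
    total_words = 0
    for word in corpus_words:
        total_subwords += len(tokenize(word))
        total_words += 1
    return total_subwords / total_words  # f_C(V); lower values mean coarser tokenization
```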
We sought to consider the importance of subwords when adding V_D. We simply select the subset of V_D following the merge order of the byte pair encoding algorithm [Sennrich et al., 2015]. The number of subwords added in each iteration is controlled by the hyperparameters α and β.
In summary, as frequent subword pairs from V_D are added to V_A as subwords, f_C(V_A) decreases.

Figure 2: Fine-tuning with regularization. An identical sentence, "... the bluetooth function in my car ...", sampled from AMAZON, is tokenized with the pretrained vocabulary (left) and with the adapted vocabulary (right). The domain-specific word "bluetooth" is tokenized in two ways, highlighted in green and yellow respectively. The model is fine-tuned with regularization on the l-th layer, highlighted as a brown box, to prevent the embeddings of added words (e.g., "bluetooth") from overfitting to the downstream dataset.

The objective of adding V_D to V_A is to decrease f_C(V_A), but we make sure that f_C(V_A) does not become too small, i.e., lower than the threshold γ. Therefore, we continue to add subwords from V_D while f_C(V_A) is higher than γ, and terminate the merging step otherwise.
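The following Python sketch puts these pieces together in the spirit of Algorithm 1: it adds subwords from V_D in merge order (α at first, β per subsequent step) until f_C(V_A) would drop below γ. It reuses the fragment_score helper sketched above; make_tokenizer is a hypothetical helper that builds a word-to-subwords tokenizer from a vocabulary set, and the exact stopping and batching details may differ from the authors' implementation.

```python
def adapt_vocabulary(V_P, V_D_ordered, corpus_words, make_tokenizer,
                     alpha=500, beta=50, gamma=3.0):
    """Sketch of the vocabulary-adaptation loop (cf. Algorithm 1).

    V_P:            pretrained vocabulary (set of subwords)
    V_D_ordered:    domain-specific subwords, ordered by BPE merge priority
    make_tokenizer: hypothetical helper mapping a vocabulary set to a
                    word -> [subwords] tokenizer
    alpha, beta:    subwords added in the first / each subsequent iteration
    gamma:          lower bound on the fragment score f_C(V_A)
    """
    V_A = set(V_P)
    added, step = 0, alpha
    while added < len(V_D_ordered):
        candidate = V_A | set(V_D_ordered[added:added + step])
        # terminate merging once the adapted vocabulary would tokenize C too coarsely
        if fragment_score(corpus_words, make_tokenizer(candidate)) < gamma:
            break
        V_A = candidate
        added += step
        step = beta
    return V_A
```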

Fine-tuning with Regularization
The embeddings of the words in the subset of V_D that is merged with V_P to construct the adapted vocabulary V_A are trained only on the downstream data during fine-tuning. Since the downstream data is much smaller than the pretraining corpus, embeddings trained only on the downstream data may overfit. To prevent this potential overfitting, we leverage the pretrained contextual representations learned from the large corpus.
In contrastive learning [Chen et al., 2020], a pair of instances is encouraged to learn representations that reflect the similarity of the instances. We apply this contrastive learning framework as a regularization during fine-tuning. As described in Figure 2, an identical sentence is tokenized in two ways: one with the pretrained vocabulary V_P and the other with the adapted vocabulary V_A. A minibatch consists of B input sentences x = {x_1, ..., x_B}. Each input x_i is tokenized with the two vocabularies, and its l-th layer encoder outputs are denoted as h_{A,i} and h_{P,i}. The pair (h_{A,i}, h_{P,i}) is trained to maximize agreement via the regularization term L_reg, i.e.,

L_reg = -(1/B) Σ_{i=1..B} log [ exp(sim(h_{A,i}, h_{P,i}) / τ) / Σ_{j=1..B} exp(sim(h_{A,i}, h_{P,j}) / τ) ],

where τ is a softmax temperature, B is the batch size, and sim(·, ·) is a similarity function between representations. h_{A,i} is prevented from overfitting by pulling it closer to its positive sample h_{P,i}.
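A minimal PyTorch sketch of this regularization is shown below, assuming h_A and h_P are the l-th layer sentence representations (e.g., the [CLS] vectors) of the same B sentences under the adapted and pretrained tokenizations, and assuming cosine similarity with in-batch negatives; these choices follow the SimCLR-style formulation above but are not guaranteed to match the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_regularization(h_A, h_P, tau=2.0):
    """L_reg: pull h_A[i] toward its positive sample h_P[i], contrasted against
    the other sentences in the minibatch (in-batch negatives).

    h_A, h_P: (B, d) l-th layer representations from the adapted / pretrained tokenization
    tau:      softmax temperature
    """
    h_A = F.normalize(h_A, dim=-1)
    h_P = F.normalize(h_P, dim=-1)
    logits = h_A @ h_P.t() / tau                 # (B, B) pairwise cosine similarities
    targets = torch.arange(h_A.size(0), device=h_A.device)
    return F.cross_entropy(logits, targets)      # -(1/B) * sum_i log softmax_i(i)
```

The resulting term is then combined with the task loss as L = L_CE + λ·L_reg with λ = 1.0, as described next.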
The model is trained to perform the target task together with the regularization term L_reg. The output of the encoder with V_A is supervised by the labels of the downstream data with a cross-entropy loss L_CE. The total loss L for domain-adaptive fine-tuning is formalized as

L = L_CE + λ L_reg,  with  L_CE = -(1/B) Σ_{over the batch} Σ_{i=1..C} t_i log f(s_i),

where f is the softmax function, C is the total number of classes, s_i is the logit for the i-th class, B is the batch size, and t_i is the target label. In our implementation, we set λ to 1.0 for all experiments.

Table 1: Comparisons with baselines in four different domains. Pretrained LMs (i.e., BERT_base, SciBERT, and BioBERT) are fine-tuned in two ways: one with the pretrained vocabulary (represented without a subscript) and the other with the adapted vocabulary (represented with the subscript AVocaDo). The performance improvement is shown in parentheses with +. The reported value is the averaged F1 score (micro-F1 for CHEMPROT and macro-F1 for the others) over five random seeds. Invalid comparisons are represented as -.
Evaluation Protocol We report the macro-F1 score for ACL-ARC, HYPERPARTISAN, and AMAZON and the micro-F1 score for CHEMPROT, following previous work [Lee et al., 2020, Beltagy et al., 2019]. The score is averaged over five random seeds. SciBERT is pretrained on a scientific corpus while BERT is pretrained on a general-domain corpus (e.g., Wikipedia), so SciBERT can be fine-tuned only on BIOMED and CS. BioBERT is additionally trained on a biomedical corpus, so it can be fine-tuned only on BIOMED.

Quantitative Results
As described in Table 1, fine-tuning with AVocaDo significantly improves downstream performance in all domains. Note that the performance improves despite the low-resource setting, where the dataset size is smaller than 5,000 as described in Appendix C (CHEMPROT, ACL-ARC, and HYPERPARTISAN). In the BIOMED domain, applying AVocaDo improves overall performance across various pretrained language models. This improvement shows that utilizing the domain-specific vocabulary provides additional benefits in the downstream domain. In CS, AVocaDo outperforms BERT_base and SciBERT, with performance improvements of 10.46 over BERT_base and 8.13 over SciBERT. In NEWS and REVIEWS, our strategy significantly improves performance, by 4.80 in NEWS and 13.01 in REVIEWS.

Qualitative Results
To analyze the effectiveness of the adapted vocabulary V_A, we show sampled words from each domain tokenized with the two vocabularies in Table 2. The adapted vocabulary V_A tokenizes domain-specific words into subwords that are informative in the target domain. For example, the word "sulfhydration" is tokenized as "sul, f, hy, dra, tion" with V_P but as "sulf, hydr, ation" with V_A; "sulf" and "hydr" imply "sulfur" and "water" respectively, which are frequently used in the BIOMED domain.
Furthermore, V_A preserves the semantics of a domain-specific word by keeping it as a whole word, whereas the subwords tokenized with V_P have completely different semantics from the original meaning. For instance, "otterbox" is an electronics accessory company in the REVIEWS domain. However, with V_P, it is split into "otter" and "box", where "otter" is a carnivorous mammal and "box" is a type of container. Randomly sampled tokenization examples from V_P and V_A are presented in Appendix Table 8.

Ablation Studies
The effectiveness of each component in AVocaDo, i.e., vocabulary adaptation and contrastive regularization, is shown in this section. As described in Table 3, vocabulary adaptation improves the performance on three datasets (i.e., ACL-ARC, HYPERPARTISAN, and AMAZON) even in the absence of the regularization term.

Size of Added Vocabulary
The size of the added vocabulary n_D is automatically determined by the fragment score of the adapted vocabulary V_A, as described in Algorithm 1. To analyze how n_D affects performance, we compare downstream performance when manually setting n_D to 500, 1,000, 2,000, and 3,000 without using the fragment score, as shown in Table 4. The automatically determined n_D is 1,600, 700, 2,850, and 1,300 for each dataset. Except for the AMAZON dataset, determining n_D by the fragment score yields the best performance.

Conclusion
In this paper, we demonstrate that a pretrained vocabulary should be updated towards the downstream domain during fine-tuning. We propose a fine-tuning strategy called AVocaDo that adapts the vocabulary to the downstream domain by expanding it based on a tokenization statistic and by regularizing the embeddings of the newly added words. Our approach shows consistent performance improvements in diverse domains across various pretrained language models. AVocaDo is applicable to a wide range of NLP tasks in diverse domains without requirements such as massive computing resources or a large domain-specific corpus.

A Details on Fragment Score
The fragment score is a measure of the fineness of tokenization. We observed that the pretrained vocabulary set V_P tokenizes domain-specific words (i.e., words that appear frequently in a downstream corpus but not in the pretraining corpus) into a larger number of subwords than non-domain-specific words (Figure 3). These finely tokenized subwords are not semantically informative enough. Motivated by this observation, we construct a new vocabulary V_A that tokenizes the domain-specific words less finely than V_P, i.e., V_A such that f_C(V_A) < f_C(V_P). This is why we chose the fragment score of the newly constructed vocabulary set V_A as the metric for selecting a subset of the domain-specific vocabulary V_D. Figure 3 shows the relative number of subwords tokenized from a single word in four domains, where the publicly available vocabulary of BERT [Devlin et al., 2018] is denoted as V_P and the domain-adapted vocabularies are denoted as V_A. WikiText [Stephen et al., 2016] represents the general domain, similar to the corpus used for pretraining BERT, while the others are chosen as downstream domains. The red and orange bars indicate the average number of subwords tokenized with the pretrained vocabulary and the adapted vocabulary, respectively. We observe that AVocaDo mitigates the domain gap.

B Different Aspects of the Vocabularies
C Implementation Details

C.1 Datasets
We used four datasets in various domains for classification. As shown in Table 5, the size of the training data varies from 500 to about 110,000, and the number of classes per dataset varies from 2 to 13.

C.2 Experimental Settings
In all experiments, we trained the networks on a single RTX 3090 GPU with 24GB of memory. We implemented all models in PyTorch using the Transformers library from Hugging Face. All baselines are reproduced as described in previous works [Gururangan et al., 2020, Tai et al., 2020, Lee et al., 2020, Beltagy et al., 2019]. In our experiments, the performance on the HYPERPARTISAN dataset tends to have high variance across random seeds since the dataset is extremely small. To produce reliable results on this dataset, we discard and resample seeds.
The embeddings of newly added words in AVocaDo are initialized as the mean of the BERT embeddings of their subword components. For instance, if the word "bluetooth" is tokenized into ["blue", "##tooth"] with V_P and into "bluetooth" with V_A, we initialize the embedding of "bluetooth" with the average of the two subword embeddings.
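A minimal sketch of this initialization with the Hugging Face Transformers API is shown below; the model checkpoint, the example word list, and the resize step are illustrative assumptions rather than the authors' exact code.

```python
import torch
from transformers import BertModel, BertTokenizer

# Hypothetical example: initialize the embedding of a newly added word as the mean of the
# BERT embeddings of the subwords it is split into under the pretrained vocabulary V_P.
model = BertModel.from_pretrained("bert-base-uncased")
pretrained_tok = BertTokenizer.from_pretrained("bert-base-uncased")

new_words = ["bluetooth"]                       # words added from V_D (illustrative)
old_vocab_size = model.config.vocab_size
model.resize_token_embeddings(old_vocab_size + len(new_words))

emb = model.get_input_embeddings().weight       # (N_A, d) embedding matrix
with torch.no_grad():
    for i, word in enumerate(new_words):
        sub_ids = pretrained_tok.convert_tokens_to_ids(pretrained_tok.tokenize(word))
        emb[old_vocab_size + i] = emb[sub_ids].mean(dim=0)
```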

C.3 Hyperparameters
Hyperparameter                              Value
lower bound of fragment score γ             3
number of added vocabulary (initial) α      500
number of added vocabulary β                50
batch size B                                16
learning rate                               1e-5, 2e-5, 5e-5
number of epochs                            10
temperature τ                               from 1.5 to 3.5
domain vocabulary size N_D                  10,000

Table 6: Hyperparameters used in experiments. We conduct a grid search to find the best hyperparameter settings.
As shown in Table 6, we followed the hyperparameter settings from previous work [Lee et al., 2020, Beltagy et al., 2019, Gururangan et al., 2020]. To find the values for the learning rate and temperature τ, we use grid search.
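As a minimal sketch of this search, assuming a hypothetical train_and_evaluate helper that fine-tunes with AVocaDo and returns the dev-set F1, and an assumed set of candidate τ values within the 1.5-3.5 range from Table 6:

```python
from itertools import product

learning_rates = [1e-5, 2e-5, 5e-5]           # values from Table 6
temperatures = [1.5, 2.0, 2.5, 3.0, 3.5]      # assumed grid within the 1.5-3.5 range

best = None
for lr, tau in product(learning_rates, temperatures):
    # train_and_evaluate is a hypothetical helper: fine-tunes with AVocaDo and returns dev F1
    score = train_and_evaluate(lr=lr, tau=tau)
    if best is None or score > best[0]:
        best = (score, lr, tau)
print("best F1 %.2f with lr=%g, tau=%.1f" % best)
```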

D Qualitative Results
For each downstream dataset, we randomly sampled ten words that are tokenized differently by the pretrained vocabulary V_P and the adapted vocabulary V_A. As shown in Table 8, the subwords produced by V_A are more informative in the target domain because they preserve the semantics of domain-specific words.

AVocaDo does not require an additional domain-specific corpus. As shown in Table 7, all other baseline models require an adaptive pretraining stage with a domain-specific corpus before fine-tuning. In general, the corpus used for adaptive pretraining is large relative to the downstream dataset, so most methods that rely on adaptive pretraining also require large training resources.

F Other Baselines
We perform additional experiments with other baseline models. In this experiment, we use exBERT [Tai et al., 2020], which expands the original BERT_base vocabulary, and SciBERT_SCIVOCAB [Beltagy et al., 2019], which constructs a customized vocabulary from large scientific and biomedical corpora, as baselines. Table 9 shows the overall performance on the BIOMED and CS domains. We outperform exBERT in the BIOMED domain, and AVocaDo shows competitive performance compared with SciBERT_SCIVOCAB.

Table 10: Experiments on other pretrained language models. The pretrained language models (i.e., RoBERTa_base and ELECTRA_base) are fine-tuned with or without AVocaDo. The performance improvement is shown in parentheses with +. The symbol † indicates the performance reported by Gururangan et al.