NB-MLM: Efficient Domain Adaptation of Masked Language Models for Sentiment Analysis

While Masked Language Models (MLM) are pre-trained on massive datasets, the additional training with the MLM objective on domain or task-specific data before fine-tuning for the final task is known to improve the final performance. This is usually referred to as the domain or task adaptation step. However, unlike the initial pre-training, this step is performed for each domain or task individually and is still rather slow, requiring several GPU days compared to several GPU hours required for the final task fine-tuning. We argue that the standard MLM objective leads to inefficiency when it is used for the adaptation step because it mostly learns to predict the most frequent words, which are not necessarily related to a final task. We propose a technique for more efficient adaptation that focuses on predicting words with large weights of the Naive Bayes classifier trained for the task at hand, which are likely more relevant than the most frequent words. The proposed method provides faster adaptation and better final performance for sentiment analysis compared to the standard approach.


Introduction
Pre-training of neural networks with a language model (LM) or masked language model (MLM) objective on large amounts of non-domain-specific texts has given a significant boost of performance in almost all natural language processing tasks. While 16GB of texts were shown to BERT (Devlin et al., 2019) and ten times more to RoBERTa (Liu et al., 2019) during pre-training, the further training of these models with the MLM objective on domain-specific texts before fine-tuning to the target task was shown to further improve the final results (Sun et al., 2019; Gururangan et al., 2020). This technique is called domain or task adaptation, depending on the degree of similarity between the adaptation data and the target dataset. While the initial pre-training is extremely expensive, it does not depend on the final task and can be performed only once. However, domain or task adaptation is done for each domain or task individually and is still quite resource-demanding, requiring hundreds of thousands of training steps or several GPU days, unlike final fine-tuning, which can often be done in a few GPU hours (Sun et al., 2019).
In this work, we propose a method for more efficient MLM adaptation. We have noticed that the standard MLM spends most of the training time on learning to restore the most frequent words like determiners or auxiliary verbs hidden (masked) from its input. While such training examples may be useful for learning English grammar, their domination during the adaptation phase seems to be wasteful for many final tasks. Since the final task and the dataset are already known in this phase, we propose to undersample such examples in favor of examples with targets related to the final task. This relatedness is estimated using a Naive Bayes classifier. Hence, we call our modified objective Naive Bayes Masked Language Model (NB-MLM). We hypothesize that hiding from the model and asking it to restore mostly features that are important for the final task will likely result in faster adaptation. Additionally, the absence of simple features and the requirement to restore them may teach the model to exploit more sophisticated and implicit features relevant to the final task.
We evaluate the proposed method on two datasets for sentiment analysis. It is one of the most popular tasks in natural language processing (Feldman, 2013) and an excellent playground for comparing adaptation methods due to the large amount of labeled and unlabeled user reviews of different products available. In particular, we consider the task of classifying the binary sentiment polarity of a given review. Our experiments show that the NB-MLM objective can significantly reduce adaptation time while achieving the same final performance, or can improve performance given the same amount of time for adaptation.
[Figure 1: the words most frequently sampled as prediction targets during task adaptation on IMDB; for NB-MLM these include sentiment-bearing words such as great, love, best, acting, plot, worst, horror.]

Related Work
Pre-training Transformer networks with the MLM objective is proposed in (Devlin et al., 2019) for the BERT model and is shown to outperform the more traditional LM objective, though the similar task of predicting a word from its left and right context was used with different architectures earlier (Mikolov et al., 2013;Melamud et al., 2016). RoBERTa enhances BERT by pre-training longer on ten times larger corpora, getting rid of the next sentence prediction (NSP) task during pre-training, and selecting different target words to be masked and predicted in each epoch (dynamic masking).
Various approaches to further pre-training of BERT on domain or task-specific data are compared in (Sun et al., 2019), while Gururangan et al. (2020) carry out a similar investigation with RoBERTa. They try various options of data sources for adaptation: texts only from the target dataset (called task adaptation or within-task pre-training), larger datasets from the same domain (called domain adaptation or in-domain pre-training), and datasets from different domains (called cross-domain pre-training). They find task adaptation, the computationally cheapest option, to be a surprisingly good solution. In their experiments, it often outperforms domain adaptation and is only marginally worse than combining both methods. However, due to the large amount of data used in domain adaptation, Gururangan et al. (2020) train the MLM only for one or very few epochs. We find that our method, by leveraging large data more efficiently, makes domain adaptation comparable to task adaptation, while their combination is significantly better than each of them alone.
Our idea of employing Naive Bayes weights is inspired by the NB-SVM model (Wang and Manning, 2012; Mesnil et al., 2014), which scales bag-of-ngrams vectors with Naive Bayes classifier weights and then trains a linear SVM or logistic regression classifier on them. It proved to be a very strong baseline, often outperforming both linear and more sophisticated models of its time.
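The NB-SVM recipe can be sketched as follows (a minimal illustration with scikit-learn; the toy data, smoothing value, and the use of logistic regression rather than SVM are our choices, not details from the cited papers):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled data (hypothetical examples).
texts = ["great acting and a great plot", "the worst script ever",
         "best performance, love it", "waste of money, nothing works"]
labels = np.array([1, 0, 1, 0])

# Binarized bag-of-ngrams counts.
vec = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vec.fit_transform(texts).toarray()

# Naive Bayes log-count ratio r = log((p / ||p||_1) / (q / ||q||_1)),
# with additive smoothing alpha.
alpha = 1.0
p = alpha + X[labels == 1].sum(axis=0)
q = alpha + X[labels == 0].sum(axis=0)
r = np.log((p / p.sum()) / (q / q.sum()))

# Scale features by the NB ratios, then fit a linear classifier on them.
clf = LogisticRegression().fit(X * r, labels)
```

Words that are much more frequent in one class get ratios far from zero, so the linear model starts from features already weighted by their class association.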

MLM Objectives for Adaptation
Uniform MLM. For each input example, the standard MLM objective, as proposed by Devlin et al. (2019), samples 15% of the input positions (subwords) for calculating the loss. The positions are sampled from the uniform distribution without replacement: P(pos) ∝ 1. Then 80% of the tokens on the sampled positions are masked (replaced with a [MASK] token), 10% are replaced with random tokens drawn uniformly from the vocabulary, and 10% are left intact.
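This standard procedure can be sketched as follows (a self-contained toy implementation; the mask token id and vocabulary size are placeholders, not tied to any real tokenizer):

```python
import random

MASK_ID = 0  # hypothetical id of the [MASK] token
VOCAB_SIZE = 50_000  # hypothetical vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Uniform MLM masking: sample 15% of positions without replacement,
    then apply the 80/10/10 mask/random/keep corruption."""
    rng = random.Random(seed)
    ids = list(token_ids)
    n_targets = max(1, round(mask_prob * len(ids)))
    # P(pos) ∝ 1: every position is equally likely to become a target.
    targets = rng.sample(range(len(ids)), n_targets)
    for pos in targets:
        roll = rng.random()
        if roll < 0.8:            # 80%: replace with [MASK]
            ids[pos] = MASK_ID
        elif roll < 0.9:          # 10%: replace with a random token
            ids[pos] = rng.randrange(VOCAB_SIZE)
        # else 10%: keep the original token unchanged
    return ids, sorted(targets)

corrupted, targets = mask_tokens(list(range(1, 21)), seed=0)
```

The loss is then computed only on the returned target positions.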

NB-MLM.
As an alternative, we propose sampling 15% of positions from a non-uniform distribution that gives higher probabilities to positions containing subwords with high feature importance fi(w): P(pos) ∝ exp(fi(w_pos)/T), where the temperature T is a hyperparameter that balances between uniform sampling and deterministic selection of the positions containing the most important features. For binary classification, the feature importance is estimated from the weights of the Naive Bayes classifier as follows:

fi(w) = |log P(w|c = 1) − log P(w|c = 0)|

Thus, features that are much more probable in one class than in the other receive the highest scores. Similar to the method proposed by Wang and Manning (2012), the probabilities are estimated by the multinomial Naive Bayes model with additive smoothing (α = 0.1). Additionally, the scores are set to zero for features that occur in fewer than m examples, to avoid the overrepresentation of unreliable features. As an example, Figure 1 shows the words that the model is most frequently asked to predict during task adaptation on the IMDB movie reviews dataset (T = 0.1, m = 5 for NB-MLM). Evidently, NB-MLM learns to predict words relevant to sentiment analysis more often than the standard MLM.
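Under this description, computing the NB feature importances and the resulting position distribution can be sketched as follows (an illustrative reimplementation on a toy corpus; the function names, toy data, and exact smoothing details are our assumptions):

```python
import math
from collections import Counter

def nb_feature_importance(docs, labels, alpha=0.1, min_df=2):
    """fi(w) = |log P(w|c=1) - log P(w|c=0)| from a multinomial NB model
    with additive smoothing; zeroed for words seen in fewer than min_df docs."""
    pos = Counter(w for d, y in zip(docs, labels) if y == 1 for w in d)
    neg = Counter(w for d, y in zip(docs, labels) if y == 0 for w in d)
    df = Counter(w for d in docs for w in set(d))  # document frequencies
    n_pos = sum(pos.values()) + alpha * len(df)
    n_neg = sum(neg.values()) + alpha * len(df)
    fi = {}
    for w in df:
        if df[w] < min_df:
            fi[w] = 0.0  # unreliable feature: seen in too few examples
            continue
        lp = math.log((pos[w] + alpha) / n_pos)
        lq = math.log((neg[w] + alpha) / n_neg)
        fi[w] = abs(lp - lq)
    return fi

def position_probs(tokens, fi, T=0.4):
    """P(pos) ∝ exp(fi(w_pos) / T): temperature softmax over positions."""
    logits = [fi.get(w, 0.0) / T for w in tokens]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy corpus (hypothetical): class 1 = positive, class 0 = negative.
docs = [["great", "plot", "the"], ["worst", "script", "the"],
        ["great", "acting", "the"], ["boring", "script", "the"]]
labels = [1, 0, 1, 0]
fi = nb_feature_importance(docs, labels)
probs = position_probs(["the", "great", "script"], fi)
```

Class-neutral words like "the" get near-zero importance, so their positions are sampled far less often than positions with class-discriminative words.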
Along with the uniform and NB-based distributions, during the preliminary experiments, we tried other options, which are described and compared in Appendix D. However, only NB-MLM outperformed the uniform baseline.

Experiments and Results
During the preliminary experiments described in Appendix A, we found that our method helps for both BERT and RoBERTa models. However, the latter model achieved significantly better performance. Therefore, we describe the results for RoBERTa in the rest of the paper.
For domain adaptation (denoted as DAPT), we employed the Amazon Reviews dataset (McAuley et al., 2015) with duplicates removed. We removed reviews shorter than 500 characters and split the rest into training and validation sets of 21M and 10K reviews correspondingly.

For domain and task adaptation, we used a batch size of 1024, while classifiers were fine-tuned with a batch size of 32. Based on our preliminary experiments, we set the learning rate to 2e-4 for domain adaptation, 1e-4 for task adaptation, and 1e-5 for final fine-tuning. Following Gururangan et al. (2020), we performed domain adaptation for one epoch on the Amazon dataset (20K steps, 38h on one V100 GPU) and task adaptation for 100 epochs on IMDB (18h) and 24 epochs on Yelp (14h). (Using the whole target dataset for task adaptation has shown the best results for both Uniform MLM and NB-MLM, see Appendix C. This setup, in which test examples without labels are exploited during training, is known as transductive learning.) To show that NB-MLM can obtain results similar to Uniform MLM in a much shorter time, we also report the results of short adaptation with the duration reduced to 4K steps on Amazon, 20 epochs on IMDB, and 6 epochs on Yelp. To estimate the variance of the results due to the randomness in the order of training examples and the positions selected for masking and prediction, we trained each model with different random seeds. For both Uniform MLM and NB-MLM, we aggregated metrics from 15 runs for DAPT on IMDB, 3 runs for DAPT+all-TAPT on both IMDB and Yelp, and 6 runs for all other scenarios. The classifiers were fine-tuned for 4 epochs on IMDB and 2 epochs on Yelp. For task adaptation with NB-MLM, we set T = 0.4, m = 50 based on preliminary experiments (see Appendix A). For domain adaptation with NB-MLM, we set T = 0.1, m = 10 on IMDB and T = 0.1, m = 50 on Yelp after a grid search over T ∈ {0.05, 0.1, 0.2, 0.4, 0.8} and m ∈ {10, 50}.
Generally, for task adaptation with many epochs of training on smaller datasets, larger temperatures are required to avoid over-fitting caused by the same words being masked in each example at every epoch. For domain adaptation, only one epoch of training is done on a large dataset; hence, smaller temperatures perform better.

Figure 2 shows how the final classification accuracy improves during the task and domain adaptation. Our NB-MLM model significantly helps for domain adaptation on IMDB. For task adaptation, the difference is much smaller and is within two standard deviations. Still, on average, NB-MLM seems to provide a consistent improvement throughout the adaptation. For Yelp, the improvements from NB-MLM are also small but consistent.

Table 1 compares our models and the previously published results on the test sets. In order to apply McNemar's test for statistical significance, instead of averaging across all runs of each model with different random seeds, we have to take the predictions of a particular run. Thus, for each of our models, we selected the run with the median performance (for an even number of runs, the one just above the median) and reported its performance in the table.
For IMDB, the domain adaptation with NB-MLM obtains results similar to the Uniform MLM in 5x fewer training steps and data (only 20% of the data is seen during the first 4K steps). When trained for one epoch, it improves the results by more than 0.3%, which is also statistically significant. For task adaptation, the NB-MLM gives a much smaller improvement. Similarly to the results of Gururangan et al. (2020), in our experiments, the task adaptation with the Uniform MLM outperforms the domain adaptation that employs much more data by almost 0.5%. We suppose that this is due to the small proportion of relevant examples sampled by the Uniform MLM, which require many repetitions to learn from. Probably, training domain adaptation for hundreds of epochs, similarly to task adaptation, can fix this problem, but this is not feasible for large datasets and moderate computation resources. More efficient domain adaptation with NB-MLM, which focuses on targets that are likely relevant for the final task, reduces this difference to 0.2%. Finally, using the domain adaptation followed by the task adaptation results in the best final performance. In this scenario, NB-MLM gives 0.2% improvement for short adaptation and 0.1% for long adaptation. For Yelp, the metrics are higher, and the differences are smaller but still consistent.

Conclusion
We proposed a technique for more efficient domain and task adaptation of MLMs. It is especially helpful for leveraging large data efficiently during domain adaptation, resulting in significantly shorter adaptation time or better final performance.

A Preliminary Experiments

To verify our hypothesis, in the preliminary experiments we tried improving the results of the ITPT (withIn-Task Pre-Training) method (Sun et al., 2019). Since no code for this paper was available at that time, we implemented the method using the Transformers library (Wolf et al., 2020), closely following the details and hyperparameters specified in the paper but adopting recommendations from more recent models: we did not use the NSP objective and exploited dynamic masking. Since no official development set is available for the IMDB dataset (Maas et al., 2011) and the split is not specified in the paper, we employed our own split for early stopping during classifier fine-tuning and for NB-MLM hyperparameter selection. Note that this split was used only for the preliminary experiments; later, we switched to the split of Gururangan et al. (2020). For adaptation, we used the whole dataset, excluding half of the development set that was used to measure the validation perplexity. Figure 3 (left) shows the final classification error rate depending on the number of adaptation steps. The best error rate on the development set across 10 epochs of classifier fine-tuning is shown.
Evidently, NB-MLM outperforms MLM on average. Although the variance of their difference is rather high, we can see that after 60K adaptation steps, NB-MLM with the best temperature robustly shows equal or better results than the best result of MLM across 150K adaptation steps, which is an almost 2.5x speedup. For comparison, Figure 3 (right) shows the results for RoBERTa using the same split. Evidently, RoBERTa with NB-MLM adaptation robustly outperforms MLM. With the small temperature T = 0.2, after 20K steps of adaptation we obtain better results than MLM trained more than 3 times longer. However, the performance later drops significantly for the smallest temperature. Inspecting perplexity during adaptation, we found that the model begins to strongly overfit after 20K steps, which is likely related to the same positions for masking and prediction being sampled at each epoch. The larger temperature T = 0.4 provides smaller benefits in the short run but gives more robust improvements and better final results. Overall, after 20K steps, it gives the same performance as the MLM trained for 75K steps, which is an almost 4x speedup.

B Results on the Development Sets

In this section, we show the results on the development sets corresponding to the results on the test sets provided in the main text. Since these results were used to select hyperparameters and also for early stopping during fine-tuning of the classifiers, they are less reliable for drawing conclusions about the final classification performance and should be considered only together with the results on the test sets. While the general trends are the same, we notice that for domain adaptation, the gap between NB-MLM and Uniform MLM on the IMDB dev set (Figure 4, top right) is twice as large as on the test set. This may be due to the large variance of classification accuracy during fine-tuning and the use of early stopping on the development set.
D Alternative Target Distributions

The uniform distribution over positions is traditionally used to sample the target subwords that are masked and predicted during MLM pre-training and adaptation. However, as Figure 1 shows, it makes the model learn to predict mostly frequent function words such as articles, prepositions, pronouns, etc. While this may teach the model to extract grammar-related features perfectly, it may also prevent the model from learning more specific features required for the final task, since such features are rarely needed during MLM training and the model capacity is limited. To address this problem, we may simply lower the probability of sampling positions containing frequent words. Figure 6 compares the standard Uniform MLM and the proposed NB-MLM to a frequency-based baseline. In this baseline, we perform domain or task adaptation similarly to NB-MLM, but sample positions from P(pos) ∝ (1/freq(w_pos))^(1/n), where n plays the same role as the temperature in NB-MLM, allowing to balance between sampling positions from the uniform distribution and selecting the positions containing the most infrequent words. Word frequencies freq(w) are estimated from the training subset of the IMDB dataset. We selected the optimal values of n on the IMDB development set for all-TAPT and DAPT separately, resulting in n = 3.5 and n = 2.5 correspondingly. Evidently, the frequency-based baseline is on par with the uniform baseline: there are occasional improvements in the best validation accuracy, but they do not translate into improvements on the test set.
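This frequency-based distribution can be sketched as follows (illustrative code; the function name and toy frequencies are invented for the example):

```python
def freq_position_probs(tokens, freq, n=2.5):
    """P(pos) ∝ (1 / freq(w_pos))^(1/n): rarer words get higher sampling
    probability; larger n moves the distribution towards uniform."""
    weights = [(1.0 / freq[w]) ** (1.0 / n) for w in tokens]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical corpus-level word frequencies.
freq = {"the": 100_000, "movie": 5_000, "mesmerizing": 12}
probs = freq_position_probs(["the", "movie", "mesmerizing"], freq)
```

Unlike NB-MLM, this distribution ignores the task labels entirely, which may explain why it fails to beat the uniform baseline.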

Next, we introduce another alternative, which is based on the conditional pointwise mutual information between tokens and classes given the context:

PMI(w, c|ctx) = log(P(w|c, ctx) / P(w|ctx)).
Conceptually, it prefers to select tokens that are easier to predict from the nearby context and the class of the example than from the context alone. We supposed that learning to predict such tokens would make the model extract class-related features from the whole example rather than use only the nearby context. We define the nearby context as one preceding token and one succeeding token and minimize the PMI over them while maximizing it over classes. This means that we prefer selecting tokens that are not easily predicted from either the preceding or the succeeding token alone, but are predicted much better, at least for examples of one of the classes, when that class is known.
fi(w_i) = max_c min_{ctx ∈ {w_{i−1}, w_{i+1}}} PMI(w_i, c|ctx)

Similarly to NB-MLM, we estimated these weights from the IMDB training set and set them to zero for tokens that appear in fewer than m examples. Then we applied a temperature softmax to convert the weights into a probability distribution over positions. We selected the hyperparameters on the development set, resulting in T = 0.1, m = 10. Figure 7 shows that for all-TAPT on IMDB, the weights based on conditional PMI, unlike the NB weights, do not help to improve over the results of the Uniform MLM. Based on these results, we did not experiment with them for DAPT.
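One way such conditional PMI scores could be estimated from adjacent-pair counts is sketched below (our own simplified estimator with additive smoothing on a toy corpus; it is not the authors' exact procedure, and aggregating per-position scores into per-word weights by taking the maximum is our assumption):

```python
import math
from collections import Counter

def pmi_importance(docs, labels, alpha=0.1):
    """fi(w_i) = max_c min_{ctx in {w_{i-1}, w_{i+1}}} PMI(w_i, c|ctx),
    with PMI(w, c|ctx) = log P(w|c, ctx) - log P(w|ctx), estimated
    from adjacent-pair counts with additive smoothing."""
    pair = Counter()                          # (ctx, w) adjacency counts
    pair_c = {0: Counter(), 1: Counter()}     # the same, per class
    ctx_tot = Counter()                       # how often each word is a context
    ctx_tot_c = {0: Counter(), 1: Counter()}
    vocab = set()
    for doc, y in zip(docs, labels):
        vocab.update(doc)
        for i, w in enumerate(doc):
            for j in (i - 1, i + 1):
                if 0 <= j < len(doc):
                    pair[(doc[j], w)] += 1
                    pair_c[y][(doc[j], w)] += 1
                    ctx_tot[doc[j]] += 1
                    ctx_tot_c[y][doc[j]] += 1
    V = len(vocab)

    def pmi(w, c, ctx):
        p_wc = (pair_c[c][(ctx, w)] + alpha) / (ctx_tot_c[c][ctx] + alpha * V)
        p_w = (pair[(ctx, w)] + alpha) / (ctx_tot[ctx] + alpha * V)
        return math.log(p_wc / p_w)

    fi = {}
    for doc in docs:
        for i, w in enumerate(doc):
            ctxs = [doc[j] for j in (i - 1, i + 1) if 0 <= j < len(doc)]
            # min over contexts, then max over the two classes.
            score = max(min(pmi(w, c, ctx) for ctx in ctxs) for c in (0, 1))
            fi[w] = max(fi.get(w, score), score)
    return fi

# Toy corpus (hypothetical): class 1 = positive, class 0 = negative.
docs = [["a", "great", "film"], ["a", "great", "cast"],
        ["a", "bad", "film"], ["a", "bad", "plot"]]
labels = [1, 1, 0, 0]
fi = pmi_importance(docs, labels)
```

Class-discriminative words next to class-neutral neighbors receive positive scores, while words whose neighbors predict them equally well in both classes score near zero.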