TADPOLE: Task ADapted Pre-Training via AnOmaLy DEtection

The paradigm of pre-training followed by finetuning has become a standard procedure for NLP tasks, with a known problem of domain shift between the pre-training and downstream corpora. Previous works have tried to mitigate this problem with additional pre-training, either on the downstream corpus itself when it is large enough, or on a manually curated unlabeled corpus from a similar domain. In this paper, we address the problem for the case when the downstream corpus is too small for additional pre-training. We propose TADPOLE, a task adapted pre-training framework based on data selection techniques adapted from Domain Adaptation. We formulate data selection as an anomaly detection problem that, unlike existing methods, works well when the downstream corpus is limited in size. The result is a scalable and efficient unsupervised technique that eliminates the need for any manual data curation. We evaluate our framework on eight tasks across four different domains: Biomedical, Computer Science, News, and Movie reviews, and compare its performance against competitive baseline techniques from the area of Domain Adaptation. Our framework outperforms all the baseline methods. On small datasets with fewer than 5K training examples, we get a gain of 1.82% in performance over the originally pre-trained models, with additional pre-training for only 5% of the original pre-training steps. It also complements other techniques, such as data augmentation, known for boosting performance when the downstream corpus is small; the highest performance is achieved when data augmentation is combined with task adapted pre-training.


Introduction
Pre-trained language models such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2018), Transformer-XL, and XLNet have become a key component in solving virtually all natural language tasks. These models are pre-trained on large amounts of cross-domain data, ranging from Wikipedia to the Books corpus to news articles, to learn powerful representations. A generic approach for using these models consists of two steps: (a) Pre-training: train the model on an extremely large general domain corpus, e.g. with a masked language model loss; (b) Finetuning: finetune the model on the labeled task dataset for the downstream task.
Even though the approach of pre-training followed by fine-tuning has been very successful, it suffers from domain shift when applied to tasks containing text from a domain that is not sufficiently represented in the pre-training corpus. An immediate way of solving the problem is to pre-train the model on task domain data instead of general domain data. For a handful of very popular task domains, the research community invested time and resources to collect a large domain-specific corpus and pre-train a language model on it. Such models include BioBERT pre-trained on biomedical text (Lee et al., 2020), ClinicalBERT pre-trained on clinical notes (Huang et al., 2019), SciBERT pre-trained on the Semantic Scholar corpus (Beltagy et al., 2019), and FinBERT pre-trained on financial documents (Araci, 2019). These models achieve significant gains in performance over a model trained on general domain data when the downstream task belongs to the respective domain.
These papers demonstrate how useful it can be to shift the domain of the pre-trained model. However, the approach is expensive and time consuming, as it requires collecting gigabytes of domain data for each new task. The long tail of domains is left behind without a realistic solution. To mitigate this, in the absence of huge task domain data, a different known approach is to collect a medium amount (MBs, not GBs) of unlabeled task data and adapt the model pre-trained on general data by, e.g., extending the pre-training procedure on the unlabeled data (Howard and Ruder, 2018; Gururangan et al., 2020). Such task adapted pre-training achieves a relatively smaller gain but is less expensive. Although this approach is cheaper in terms of manual labor than a domain adapted BERT, it still requires an effort to collect unlabeled data, and it requires much more data than is needed for fine-tuning alone. This is often impossible to achieve, for example when the data is highly sensitive. In this paper we propose a solution to this challenging problem, providing domain adaptation of the pre-trained model without any manual data collection effort.
The high level idea is quite intuitive. Given generic pre-training data containing text from multiple domains, we filter the available general domain corpus to contain only pieces that are similar to the downstream task corpus. By continuing the pre-training process on this adapted corpus we obtain a better tuned pre-trained model. Figure 1 illustrates the feasibility of this approach with an example downstream task from a medical domain and highlighted text from a news article available in a general domain corpus. The key to a successful implementation is finding the best way of evaluating the similarity of a given snippet to the downstream task.

[Figure 1: Example task data from RCT20K (biomedical abstract sentences labeled OBJECTIVE/METHODS, e.g. a trial of low-dose oral prednisolone for knee OA) alongside a news article from the general domain corpus discussing painkillers and NSAIDs that suppress inflammation, illustrating general domain text relevant to the task domain.]

Although not many methods exist to solve the problem of domain shift in the context of pre-training, the literature on Domain Adaptation provides several methods for the core task of evaluating the above mentioned similarity. These previous approaches use either a simple language model (LM) (Moore and Lewis, 2010; Axelrod et al., 2011; Duh et al., 2013; Wang et al., 2017b; van der Wees et al., 2017) or a hand-crafted similarity score (Wang et al., 2017a; Plank and Van Noord, 2011; Remus, 2012; Van Asch and Daelemans, 2010). The LM-based techniques are often overly simplistic and require a fairly large corpus of task data to create a reasonable LM. The hand-crafted similarity scores can be seen as ad-hoc methods for distinguishing inliers from outliers (i.e., anomaly detection); they tend to be focused on individual tasks and do not generalize well.
We formulate the similarity evaluation task as anomaly detection and propose the Task ADapted Pre-training via anOmaLy dEtection (TADPOLE) framework. Indeed, anomaly detection methods, given instances from a domain, are able to provide a score for new instances assessing how likely they are to belong to that domain. We exploit pre-trained models to get sentence representations that are in turn used to train an anomaly detection model. By using pre-trained models, our method is effective even for small text corpora. By taking advantage of existing anomaly detection methods, we replace hand-crafted rules with techniques proven to generalize well. Our approach does not require any manual data curation. To train the anomaly detection method, we only use the task data, which is already available since it is necessary for fine-tuning.
In what follows we discuss how we implement our technique and compare it with other data selection methods based on extensive experimental results. We start by filtering out the subset of the general domain corpus most relevant to the task. To do this, we explore several anomaly detection methods and give a quantitative criterion to identify the best method for given task data. Then, we start with a model pre-trained on the general domain corpus and run additional pre-training for only 5% more steps on the corpus filtered by the different methods. This is followed by the regular finetuning on the labeled task data. We measure the performance gain as the improvement in accuracy of the finetuned model with additional pre-training over the accuracy of the finetuned model without additional pre-training. To establish the performance gain of TADPOLE, we evaluate it on eight tasks across four domains: Biomedical, Computer Science, News, and Movie reviews. We investigate all aspects of TADPOLE by comparing its performance with various baselines based on its variants and the competitive methods available in the literature. The main highlights of our work are as follows:
• We provide TADPOLE, a novel anomaly detection based framework for adapting pre-training to the downstream task. The framework is explained in detail and all its steps are justified via extensive ablation studies.
• TADPOLE is superior to all the baseline methods, including (i) LM based relevance score, (ii) Distance based relevance score, (iii) Continued pre-training on the task data, and (iv) Data Augmentation while finetuning.
• On tasks with a small labeled dataset (fewer than 5K examples), our method achieves an average 1.82% lift in performance whereas the baselines achieve no more than 0.48%.
• For tasks with a large labeled dataset, our method does not degrade in performance and achieves an average gain of 0.32%. In addition, if only a subset of training samples is available, we observe significantly higher gains.
In particular, if only 1000 training samples are available, we observe an average 2.01% gain in performance whereas the baselines achieve no more than 0.24%. On all individual tasks, our method is either on par with or (statistically) significantly better than all alternatives.
• TADPOLE complements some of the other techniques known for improving performance on small datasets. For instance, TADPOLE performs better than data augmentation and performs even better when combined with data augmentation; Data-Aug ≤ TADPOLE ≤ TADPOLE + Data-Aug.
• For a task requiring little domain adaptation, GLUE sentiment analysis, our method achieves an improvement of 0.4% in accuracy.

Related Work
Since our focus is on Data Selection Methods, we only discuss the related work on Data Selection in Domain Adaptation here. We discuss the other Domain Adaptation techniques in Appendix A.
Data Selection: As discussed above, the core of data selection is to determine relevance weights that in turn modify the source domain to become more similar to the target domain. There has been a sequence of works trying to find the relevance weights via language models (Moore and Lewis, 2010; Wang et al., 2017b; van der Wees et al., 2017). For instance, Moore and Lewis (2010), Axelrod et al. (2011) and Duh et al. (2013) train two language models, an in-domain language model on the target domain dataset (the task domain in our case) and an out-of-domain language model on (a subset of) the general domain corpus. The relevance score is then defined as the difference in cross-entropy w.r.t. the two language models. These methods achieve some gain but have a major drawback: they rely on the crucial assumption that there is enough in-domain data to train a reasonable in-domain language model. This assumption does not hold in most cases. For most tasks, we only have access to a few thousand, or in some cases a few hundred, examples, which is not enough to train a reasonably accurate language model. Our technique relies on text representations based on the available pre-trained model. As such, our similarity score does not rely on models that would have to be trained from the small amount of task data.
Another line of work defines hand-crafted domain similarity measures to assign relevance scores and filter text from a general domain corpus (Wang et al., 2017a; Plank and Van Noord, 2011; Remus, 2012; Van Asch and Daelemans, 2010; Gururangan et al., 2020). For instance, Wang et al. (2017a) define the domain similarity of a sentence as the difference between the Euclidean distance of the sentence embedding from the mean of in-domain sentence embeddings and from the mean of out-of-domain sentence embeddings. Plank and Van Noord (2011) and Remus (2012) define the similarity measure as the Kullback-Leibler (KL) divergence between relative frequencies of words, character tetra-grams, and topic models. Van Asch and Daelemans (2010) define domain similarity as the Rényi divergence between the relevant token frequencies. These are ad-hoc measures suitable only for the respective tasks, and can be seen as manual, task-optimized anomaly detection. They fail to generalize well to new tasks and domains. Ruder and Plank (2017) attempt to remedy this issue by learning the correct combination of these metrics for each task. They learn the combination weight vector via Bayesian optimization. However, Bayesian optimization is infeasible for deep networks like BERT: each optimization step of this process amounts to pre-training the model and finetuning it for the task, and for Bayesian optimization to work well this process must be repeated many times, which is prohibitively computationally expensive. Thus, they use models such as a linear SVM classifier and LDA which do not yield state-of-the-art performance. In contrast, we propose a lightweight method, based on anomaly detection, that can be applied to state-of-the-art deep language models like BERT.

TADPOLE: Task ADapted Pre-training via anOmaLy dEtection

Language model and Downstream Tasks. A generic approach for using state-of-the-art language models such as ELMo, GPT, BERT, and XLNet is to pre-train them on an extremely large general domain corpus and then finetune the pre-trained model on the downstream labeled task data. There is an evident correlation between a model's pre-training loss and its performance on the downstream task after finetuning (Devlin et al., 2018). Our design is motivated by an observation, backed by empirical evidence, that the correlation is even stronger if we consider the pre-training loss not on the pre-training data but on the downstream task data.
To make this distinction formal, let $D$ and $D_{in}$ be the pre-training and task data. Let $\Theta$ denote the parameters of the language model and $\ell_{LM}$ denote the language model loss function. The pre-training loss on the pre-training data, $\mathcal{L}_{LM}(\Theta)$, and on the target data, $\mathcal{L}^{in}_{LM}(\Theta)$, are defined as follows:
$$\mathcal{L}_{LM}(\Theta) = \mathbb{E}_{x \sim D}\left[\ell_{LM}(x; \Theta)\right], \qquad \mathcal{L}^{in}_{LM}(\Theta) = \mathbb{E}_{x \sim D_{in}}\left[\ell_{LM}(x; \Theta)\right].$$
To demonstrate that $\mathcal{L}^{in}_{LM}(\Theta)$ is better correlated with the performance of the downstream task, we consider several BERT language models pre-trained on random combinations of datasets from the different domains mentioned in Section 4. Among these we select BERT models $M_1, \ldots, M_k$ such that the selected models have similar Masked Language Model (MLM) loss on the general domain corpus, $|\mathcal{L}_{LM}(\Theta_i) - \mathcal{L}_{LM}(\Theta_j)| \le 0.02$, while their MLM loss on text from the task domain, $\mathcal{L}^{in}_{LM}(\Theta_i)$, differs across models. For each $M_i$, we contrast this loss with the accuracy/f1 score of the finetuned model on the task data.
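For illustration, a model's MLM loss on task-domain text can be estimated with a short script. The sketch below uses the HuggingFace transformers API and a simplified random-masking scheme; our experiments use GluonNLP and the standard BERT masking procedure, so the helper name and defaults here are assumptions.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

@torch.no_grad()
def mlm_loss_on_corpus(model_name, sentences, mask_prob=0.15, seed=0):
    """Average masked-LM loss of a pre-trained model on a list of sentences;
    a lower value suggests the model is better tailored to that text."""
    torch.manual_seed(seed)
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name).eval()
    losses = []
    for sent in sentences:
        enc = tok(sent, truncation=True, max_length=256, return_tensors="pt")
        input_ids = enc["input_ids"].clone()
        labels = enc["input_ids"].clone()
        # Randomly mask ~15% of the non-special tokens and predict them.
        special = torch.tensor(
            tok.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True)
        ).bool()
        maskable = ~special
        if not maskable.any():
            continue
        mask = (torch.rand(input_ids.shape[1]) < mask_prob) & maskable
        if not mask.any():                       # ensure at least one masked token
            mask[maskable.nonzero()[0]] = True
        input_ids[0, mask] = tok.mask_token_id
        labels[0, ~mask] = -100                  # loss only on masked positions
        out = model(input_ids=input_ids, attention_mask=enc["attention_mask"], labels=labels)
        losses.append(out.loss.item())
    return sum(losses) / max(len(losses), 1)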
We observe in Figure 3 that if we have two equally good language models (similar language model loss on the general domain), the language model that is better tailored to the task domain has significantly better downstream performance. We conclude that in order to ensure success on the downstream task, we should aim to minimize $\mathcal{L}^{in}_{LM}(\Theta)$. A first attempt would be to pre-train or finetune the language model on $D_{in}$. However, training a language model such as ELMo, GPT, BERT or XLNet requires a large corpus with several GBs of text, while the available domain specific corpus $D_{in}$ is often just the task data with a few MBs of text. Training on such a small dataset would introduce high variance. We reduce this variance by taking training examples from the general domain corpus $D$, but control the bias this incurs by considering only elements having high relevance to the domain of $D_{in}$. Formally, we optimize a weighted pre-training loss function
$$\mathcal{L}^{\lambda}_{LM}(\Theta) = \sum_{x \in D} \lambda(x)\, \ell_{LM}(x; \Theta), \qquad (1)$$
where $\lambda(x)$ is (close to) 1 when $x$ is relevant to $D_{in}$ and (close to) 0 otherwise. We compute these weights using an anomaly detection model fitted on $D_{in}$. Note that the concept of a weighted loss to handle noisy or irrelevant data is well known (Moore and Lewis, 2010; Wang et al., 2017a). The major contribution of our paper is proposing an anomaly detection based, robust approach to finding the relevance weights.
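A minimal sketch of how such per-instance relevance weights can enter a masked-LM loss is given below (PyTorch-style and illustrative only; our experiments use the GluonNLP training pipeline).

import torch
import torch.nn.functional as F

def weighted_mlm_loss(logits, labels, relevance_weights):
    """Masked-LM loss where each instance is scaled by its relevance weight.

    logits:  (batch, seq_len, vocab) token predictions from the language model
    labels:  (batch, seq_len) masked-token targets, -100 for unmasked positions
    relevance_weights: (batch,) lambda(x) in [0, 1] for each pre-training instance
    """
    vocab = logits.size(-1)
    # Per-token cross entropy; ignore_index skips unmasked positions.
    token_loss = F.cross_entropy(
        logits.view(-1, vocab), labels.view(-1),
        ignore_index=-100, reduction="none",
    ).view(labels.shape)
    mask = (labels != -100).float()
    # Average the loss over the masked tokens of each instance.
    per_instance = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    # Weight each instance by its relevance to the task domain.
    return (relevance_weights * per_instance).sum() / relevance_weights.sum().clamp(min=1e-8)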

Anomaly Detection to solve the Domain Membership Problem
Detecting whether an instance x is an in-domain instance is equivalent to solving the following problem: Given task data T and a sentence s, determine if s is likely to come from the distribution generating T or if s is an anomaly.
This view helps us make use of a wide variety of anomaly detection techniques developed in the literature (Noble and Cook, 2003; Chandola et al., 2009; Chalapathy and Chawla, 2019). To make use of these techniques, we first need a good numeric representation (embedding) with a domain discrimination property. We use pre-trained BERT to embed each sentence into a 768-dimensional vector. Once the data is embedded, we need to decide which among the many anomaly detection algorithms proposed in the literature should be applied to the embeddings. To decide the anomaly detection method, we propose an evaluation method ranking the techniques based on their discriminative properties.
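As an illustration of the embedding step described above, a sketch along the following lines can be used to obtain the 768-dimensional sentence vectors (shown here with the HuggingFace transformers API and mean pooling; our experiments use BERT-Base via GluonNLP, and the pooling choice is an assumption).

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative embedding helper; the experiments in this paper use BERT-Base via GluonNLP.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(sentences, batch_size=32):
    """Return a (len(sentences), 768) array of sentence embeddings."""
    out = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], padding=True,
                          truncation=True, max_length=256, return_tensors="pt")
        hidden = model(**batch).last_hidden_state          # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
        # Mean-pool over real (non-padding) tokens.
        vec = (hidden * mask).sum(1) / mask.sum(1)
        out.append(vec.cpu().numpy())
    return np.concatenate(out, axis=0)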
Ranking anomaly detection algorithms: The idea is to treat the anomaly score as the prediction of a classifier distinguishing between in-domain and out-of-domain data. By doing so, we can use classification metrics such as the f1_score to rank the anomaly detection algorithms. To do this, we split the in-domain data (the task data) into $D^{train}_{in}$ and $D^{test}_{in}$ using a 90/10 split. We also create out-of-domain data $D_{out}$ as a random subset of $D$ of the same size as $D^{test}_{in}$. We train an anomaly detection algorithm A on $D^{train}_{in}$, and evaluate its f1_score on the labeled test set composed of the union $D^{test}_{in} \cup D_{out}$, where the labels indicate which set each instance originated from. Note that the anomaly detection algorithms considered do not require labeled samples for training, so mixing data from $D_{out}$ into training would not add much value. Table 1 provides the results of this evaluation on six anomaly detection algorithms. Details of the tasks can be found in Section 4. We can see that Isolation Forest consistently performs well for most of the tasks. Local Outlier Factor performs almost equally well but is slower in prediction. Although it is possible to adaptively choose for every task the anomaly detection algorithm maximizing the f1_score, we chose to use a single algorithm, Isolation Forest, for the sake of having a simpler technique and generalizable results. Due to space constraints, we defer the discussion of Isolation Forest to Appendix C. Having chosen the anomaly detection technique, we now discuss the effectiveness of the algorithm in (i) identifying the domain from the task data and (ii) identifying domain related data in the general domain corpus. Figure 4 shows that the anomaly detection algorithm is able to distinguish between the in-task-domain data and out-of-task-domain data. These experiments are done for the Sentiment Analysis task (SST) discussed in Section 4. Interestingly, we noticed in our experiments that a language model pre-trained on a diverse corpus is a better choice than a model finetuned on the target domain. We conjecture that the reason is that a finetuned BERT is overly focused on the variations in the task data which are useful for task prediction and forgets information pertaining to different domains which is useful for domain discrimination. We exhibit this phenomenon more clearly in Figure 4 (right) where it is evident that the discriminating ability of the finetuned model is worse.
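The ranking procedure described at the start of this subsection can be sketched as follows with scikit-learn detectors; the detector list shown is a subset of the six algorithms in Table 1 and is illustrative.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.metrics import f1_score

def rank_detectors(in_domain_emb, general_emb, seed=0):
    """Score candidate anomaly detectors by how well they separate task-domain
    sentences from general-domain sentences (higher f1 = better discriminator)."""
    rng = np.random.default_rng(seed)
    emb = np.array(in_domain_emb)
    rng.shuffle(emb)
    n_test = len(emb) // 10                                # 90/10 split of task data
    train, test_in = emb[n_test:], emb[:n_test]
    # Out-of-domain test set: random general-domain sentences, same size as test_in.
    test_out = general_emb[rng.choice(len(general_emb), size=n_test, replace=False)]
    X_test = np.vstack([test_in, test_out])
    y_true = np.concatenate([np.ones(len(test_in), dtype=int),
                             -np.ones(len(test_out), dtype=int)])  # 1 = in-domain

    detectors = {
        "IsolationForest": IsolationForest(random_state=seed),
        "LocalOutlierFactor": LocalOutlierFactor(novelty=True),
        "OneClassSVM": OneClassSVM(gamma="scale"),
    }
    scores = {}
    for name, det in detectors.items():
        det.fit(train)                                     # unsupervised: task data only
        y_pred = det.predict(X_test)                       # +1 inlier, -1 outlier
        scores[name] = f1_score(y_true, y_pred, pos_label=1)
    return sorted(scores.items(), key=lambda kv: -kv[1])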
In order to assess the ability of our model to identify related text we perform the following experiment. First, we create a diverse corpus by taking the union of four datasets: News, Finance, CS abstracts and Biology abstracts. Figure 5, column 'Input data', contains their respective sizes. We then train two anomaly score based discriminators, one on CS task data and the other on Bio abstracts. For each model we choose a threshold that filters out 80% of the data, and observe the data eventually retained. The fraction of data retained from each corpus for each model is given in Figure 5, columns 'Filtered (Bio)' and 'Filtered (CS)'. We see that data from the News and Finance corpora is almost completely filtered out, as it is quite different from the text in abstracts of academic papers. We also see that a non-negligible fraction of the data retained by the Bio model comes from CS and vice versa. Since both corpora are abstracts of academic papers, it makes sense that each corpus contains data relevant to the other. The details of these corpora are given in Appendix B.
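A sketch of this filtering experiment, assuming a fitted scikit-learn detector and pre-computed embeddings for each corpus, is given below; the helper name and the 20% retention default are illustrative.

import numpy as np

def retained_fraction(detector, corpora, keep=0.20):
    """Filter a mixed corpus with a fitted anomaly detector and report, per source
    corpus, the fraction of its sentences that survive the filter.

    corpora: dict mapping corpus name -> (n_i, 768) embedding array.
    keep:    fraction of the pooled corpus to retain (0.20 filters out 80%).
    """
    names = list(corpora)
    pooled = np.vstack([corpora[n] for n in names])
    origin = np.concatenate([[n] * len(corpora[n]) for n in names])
    # In scikit-learn, higher score_samples = more "normal" (less anomalous).
    normality = detector.score_samples(pooled)
    threshold = np.quantile(normality, 1.0 - keep)
    kept = normality >= threshold
    return {n: kept[origin == n].mean() for n in names}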

From Anomaly Detection Scores to Domain Adapted Pre-training
Once the anomaly detection model is trained, we use it to compute the relevance weights, i.e. the λ values defined in Equation 1. Let the sentences in the pre-training corpus be $s_1, \ldots, s_N$ with anomaly scores $A(s_1), \ldots, A(s_N)$. We explore two strategies for computing the λ values: in the first we normalize and transform the scores to obtain continuous values, and in the second we apply a threshold and obtain 0/1 values.
Continuous λ values: We start by normalizing the anomaly scores to have zero mean and unit variance: for every $i \in \{1, \ldots, N\}$, the normalized score is $\bar{A}(s_i) = (A(s_i) - \mu)/\sigma$. Using these normalized sentence anomaly scores, we compute the relevance weights as
$$\lambda(s_i) = \frac{1}{1 + e^{-C(\alpha - \bar{A}(s_i))}},$$
where $C$ and $\alpha$ are hyper-parameters. $C$ controls the sensitivity of the weight to the anomaly score and $\alpha$ reflects the fraction of target domain data present in the general domain corpus. $C \to \infty$ yields 0/1 weights, corresponding to the discrete λ setting, whereas $C = 0$ yields uniform weights, corresponding to no task adaptation.

Discrete λ values: We sort the sentences by anomaly score, $A(s_{\sigma(1)}) \le A(s_{\sigma(2)}) \le \cdots \le A(s_{\sigma(N)})$, and pick the β fraction of sentences with the lowest anomaly scores: $\lambda(s_{\sigma(i)}) = 1$ for $i \in \{1, \ldots, \beta N\}$ and 0 otherwise. Even though this approach is less general than the continuous case, it has the advantage of being model independent: we can filter out text, save it, and use it to train any language model in a black-box fashion, without any change to the pre-training or finetuning procedure. However, to utilize this option we need to make one change: instead of filtering individual sentences, we filter segments containing several consecutive sentences.
To understand why, suppose we keep sentence 1 and sentence 10 but none of the sentences in between. When we save the text and construct input instances from it for a language model, an input instance may contain the end of sentence 1 and the start of sentence 10. This is problematic because sentence 1 and sentence 10 were not adjacent in the original corpus, so the language modeling objectives (in particular next sentence prediction) do not apply to them; it distorts the training procedure and results in worse language models. To resolve this issue, we group sentences into segments and classify the relevance of each segment. Formally, let γ be a hyper-parameter and, for all $j \in \{1, \ldots, N/\gamma\}$, let the segment score be
$$y_j = \frac{1}{\gamma} \sum_{i=(j-1)\gamma + 1}^{j\gamma} A(s_i).$$
We sort the segments according to their anomaly scores, $y_{\sigma(1)} \le \cdots \le y_{\sigma(N/\gamma)}$, select the β fraction with the lowest anomaly scores, and save the sentences corresponding to these segments. To completely avoid the issue, we could set the segment length very large. However, this is not feasible: the diverse nature of the pre-training corpus means that large enough segments rarely belong to a single domain, so the extracted data would no longer represent our target domain. We experimented with a handful of options for the segment length and found the results to be stable with segments of 15 sentences.
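Both weighting schemes can be sketched as follows; the helper names and the default values for C, α, γ, and β are illustrative assumptions, not the exact values used in our experiments.

import numpy as np

def continuous_weights(anomaly_scores, C=5.0, alpha=0.0):
    """Continuous relevance weights lambda(s) in (0, 1).
    C controls sensitivity to the anomaly score (C -> inf gives 0/1 weights,
    C = 0 gives uniform weights); alpha shifts the operating point."""
    scores = np.asarray(anomaly_scores, dtype=float)
    normalized = (scores - scores.mean()) / scores.std()    # zero mean, unit variance
    return 1.0 / (1.0 + np.exp(-C * (alpha - normalized)))

def filter_segments(sentences, anomaly_scores, gamma=15, beta=0.20):
    """Discrete (0/1) weights applied at segment granularity: group consecutive
    sentences into segments of gamma, score each segment by its mean anomaly
    score, and keep the beta fraction of segments with the lowest scores so
    that the retained text stays contiguous."""
    scores = np.asarray(anomaly_scores, dtype=float)
    n_segments = len(sentences) // gamma
    segment_scores = scores[:n_segments * gamma].reshape(n_segments, gamma).mean(axis=1)
    n_keep = max(1, int(beta * n_segments))
    keep_ids = sorted(np.argsort(segment_scores)[:n_keep])  # lowest anomaly scores
    kept = []
    for j in keep_ids:
        kept.extend(sentences[j * gamma:(j + 1) * gamma])
    return kept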
Continued pre-training instead of pre-training from scratch: Once we have computed the relevance weights λ(s_i), we do not pre-train the language model from scratch, as this is not feasible for each new task/domain. Instead, we start with a language model pre-trained on the general domain corpus and perform additional pre-training for relatively few steps with the weighted loss function. In our case, we start with a BERT language model pre-trained for one million steps and continue pre-training with the updated loss function for either 50,000 or 100,000 steps.

Experiments
We use the datasets listed in Table 2 along with a general domain corpus consisting of 8GB of text from Wikipedia articles. We use the BERT-Base model provided in the GluonNLP library for all our experiments. It has 12 layers, 768 hidden dimensions per token, 12 attention heads and a total of 110 million parameters. It is pre-trained with the sum of two objectives: the masked language model objective, where the model learns to predict masked tokens, and the next sentence prediction objective, where the model learns to predict whether sentence B follows sentence A. We use a learning rate of 0.0001, batch size 256 and warm-up ratio 0.01. For finetuning, we pass the final layer [CLS] token embedding through a task-specific feed-forward layer for prediction. We use learning rate 3e-5, batch size 8, warm-up ratio 0.1 and finetune the network for five epochs. In all the experiments, we start with a BERT pre-trained for one million steps and continue pre-training for an additional 50,000 steps in the case of discrete λ and 100,000 steps in the case of continuous λ. Also, as mentioned in Section 3.2, we filter out segments instead of sentences and save them. We set the segment length to 15 sentences and filter out 20% of the data. Pseudo-code of the end-to-end algorithm can be found in Appendix D.

Table 2: Task datasets with train/dev/test sizes and number of classes (C).

Task           Train    Dev     Test    C
HYPERPARTISAN  516      64      65      2
ACL-ARC        1688     114     139     6
SCIERC         3219     455     974     7
CHEMPROT       4169     2427    3469    13
IMDB           20000    5000    25000   2
SST            67349    872     1821    2
AGNEWS         115000   5000    7600    4
HELPFULNESS    115251   5000    25000   2
RCT20K         180040   30212   30135   5

Baseline Methods
For each baseline data selection method, we start with a BERT pre-trained on the general domain corpus for one million steps, as in the case of TADPOLE. Then, we continue pre-training for the same number of steps as in our method. For baseline methods that filter the general domain corpus, we filter the same fraction of text as in our method. Due to space constraints, we discuss some of the technical details of the baseline methods in Appendix ??.
General: Continued pre-training on the general domain corpus.
Random: Continued pre-training on a random subset of the general domain corpus.
Task (Gururangan et al., 2020): Continued pre-training on the task data. Since the task data is small, we cannot pre-train on it for as many steps as in the other cases. Instead we run 100 epochs, save the model after every epoch and pick the best one.
LM (Moore and Lewis, 2010): Continued pre-training on text filtered via language models trained on the task data. We train two language models, one on the task data and another on a subset of the general domain corpus (of the same size as the task data). We select sentences with the lowest scores given by the function $f(s) = H_I(s) - H_O(s)$, where $H_I(s)$ and $H_O(s)$ are the cross-entropies between the n-gram distribution of $s$ and the in-domain and out-of-domain language model distributions, respectively.
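A minimal sketch of this cross-entropy difference score, using add-one-smoothed unigram LMs purely for illustration (Moore and Lewis (2010) use proper n-gram LMs), is given below.

import math
from collections import Counter

def make_unigram_lm(corpus_tokens):
    """Add-one-smoothed unigram LM; returns log P(w), with unseen words smoothed."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values()) + len(counts) + 1
    return lambda w: math.log((counts[w] + 1) / total)

def cross_entropy_diff(sentence_tokens, in_lm, out_lm):
    """Moore-Lewis style score H_I(s) - H_O(s); lower = more task-like."""
    h_in = -sum(in_lm(w) for w in sentence_tokens) / len(sentence_tokens)
    h_out = -sum(out_lm(w) for w in sentence_tokens) / len(sentence_tokens)
    return h_in - h_out

# Sentences with the lowest scores are selected for continued pre-training.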
Distance (Wang et al., 2017a): Continued pre-training on data filtered via a Euclidean distance scoring function. For each sentence $f$, we consider its BERT embedding $v_f$ and compute the vector centers $C_{F_{in}}$ and $C_{F_{out}}$ of the task data $F_{in}$ and of a random subset of the general domain corpus $F_{out}$. We score a sentence $f$ as $\delta(f) = \|v_f - C_{F_{in}}\| - \|v_f - C_{F_{out}}\|$ and select sentences with the lowest scores.
Data-Aug (Xie et al., 2019): Data augmentation via back translation and tf-idf based word replacement. In back translation, each sentence is translated to French and then back into English. In tf-idf based word replacement, we replace uninformative words with other uninformative words. The label of the new training example is the same as the label of the original example. Due to space constraints, we report the average scores of the two data augmentation strategies. Note that data augmentation only applies to the task data used while finetuning and does not involve additional pre-training.
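For concreteness, the Distance baseline's scoring function can be sketched as follows (illustrative helper, assuming pre-computed embeddings).

import numpy as np

def distance_score(sentence_emb, in_domain_emb, out_domain_emb):
    """Distance baseline (after Wang et al., 2017a): difference of Euclidean
    distances from a sentence embedding to the in-domain and out-of-domain
    centroids; lower scores indicate more task-like sentences."""
    c_in = in_domain_emb.mean(axis=0)
    c_out = out_domain_emb.mean(axis=0)
    return np.linalg.norm(sentence_emb - c_in) - np.linalg.norm(sentence_emb - c_out)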

Results
Table 3 shows the effectiveness of TADPOLE in automatically adapting pre-training to the task domain. The top half contains the results for four small datasets and the bottom half contains the results for five large datasets. Since the focus of the paper is on small datasets, we also take subsamples of the large datasets of size 500, 1000, 2000, 5000, and 20000. For tasks with fewer than 5K samples, continuing pre-training on the unfiltered corpus (General or a Random subset) yields an average gain of less than 0.11%. Adapting pre-training by training on the task data only yields an average gain of 0.20%. Applying popular data selection methods known from domain adaptation, including the Language Model based relevance score and the Distance based relevance score, yields a maximum gain of 0.34%. TADPOLE beats all these methods and achieves an average gain of 1.82%. Data augmentation on the task data while finetuning (no additional pre-training) achieves an average gain of 0.48%. Combining Data Augmentation with TADPOLE yields the maximum average gain, showing that TADPOLE and Data Augmentation are complementary methods and that TADPOLE is superior to Data Augmentation on its own. For tasks with more than 5K samples, we achieve an average gain of 0.36%. Detailed results can be found in Appendix G. To further test the efficacy of TADPOLE with a small number of training samples, we randomly select a subset of training samples from the large task datasets and treat it as a new task. We observe that the gap between the performance gain from TADPOLE and the baseline methods increases when the number of training samples is low.

Table 3: Performance of TADPOLE, six baseline methods, and TADPOLE combined with Data Augmentation. At the top, we list the performance on each task; at the bottom we list the average performance gain over five tasks when, for each task, we subsample a fixed number of training samples. Base corresponds to the model pre-trained on the general domain corpus with no further pre-training. Baseline methods are described in the previous subsection. TADPOLE corresponds to our method with discrete relevance weights. T+Data-Aug corresponds to our method combined with data augmentation during finetuning. In line with previous work, we use the following metrics: accuracy for SST, micro f1 score for CHEMPROT and RCT20K, macro f1 score for ACL-ARC, SCIERC, HELPFULNESS, HYPERPARTISAN, IMDB, and AGNEWS. Each model is finetuned eight times with different seeds and the mean value is reported. Subscripts correspond to the standard deviation in the finetuned model performance. Average gain corresponds to the average improvement over Base for each of the baseline methods and TADPOLE; its subscript corresponds to the standard deviation in the estimate of the average gain.
Models pre-trained on each of the four domain-specific corpora can achieve a higher gain (3.37%) over the base model. However, unlike these models, our method has the advantage that it does not require access to any large domain-specific corpus. Instead we only need the small task dataset already available for finetuning, so it is applicable to any new task from any new domain. We observe that the performance boost from TADPOLE is higher when the corresponding boost from additional pre-training on a large domain-specific corpus is higher. Results for this comparison can be found in Appendix F. In Table 3, results are presented for the discrete relevance weight case, as it is better when the number of steps available for continued pre-training is small. Results for the continuous weight case can be found in Appendix E. Results are not very sensitive to the fraction of data filtered, as can be seen in Figure 6 in Appendix H.

Conclusion
Domain shift between pre-training and finetuning can significantly impact the performance of deep learning models. We address this issue in the most reasonable setting, where we only have access to the labeled task data for finetuning. We adapt data selection methods from Domain Adaptation to adapt pre-training for the downstream task. Existing methods either require sufficiently large task data or are based on ad-hoc techniques that do not generalize well across tasks. Our major contribution is a new data selection technique that performs well even with very little task data and generalizes well across tasks.

A Related Work
Domain Adaptation: A typical setup for Domain Adaptation involves access to labeled data in the source domain, very limited or no labeled data in the target domain, and unlabeled data in both source and target domains. This is somewhat different from the setup in our paper, where we have access to labeled data with no additional unlabeled data in the task domain, and our objective is to optimize performance for that same domain. Nevertheless, several Domain Adaptation techniques have similarities or core components useful for our setup. There are two sets of approaches addressing the Domain Adaptation problem: model-centric and data-centric. Model-centric approaches redesign parts of the model: the feature space, the loss function or regularization, and the structure of the model (Blitzer et al., 2006; Pan et al., 2010; Ganin et al., 2016). A recent such approach, appropriate for our setting, is Pivot-based Domain Adaptation; it has recently been applied to task adaptive pre-training when additional unlabeled task data is available (Ben-David et al., 2020). In a nutshell, the idea is to distinguish between pivot and non-pivot features, where pivot features behave similarly in both domains. Then, by converting non-pivot to pivot features, one can make use of a model trained on the source data. This approach does not work well when the target data is small, since the mapping of non-pivot to pivot features cannot be trained with a limited size dataset. Since our technique is data-centric and applies to the regime of a small target corpus, we do not further analyze this or any other model-centric approach.
Data-centric approaches for domain adaptation include pseudo-labeling, using auxiliary tasks, and data selection. Pseudo-labeling applies a trained classifier to predict labels on unlabeled instances, which are then treated as 'pseudo' gold labels for further training (Abney, 2007; Cui and Bollegala, 2019). Auxiliary-task domain adaptation uses labeled data from auxiliary tasks via multi-task learning (Peng and Dredze, 2016) or intermediate-task transfer (Phang et al., 2018, 2020). The methods most relevant to us are those of data selection, discussed in detail above.

B Datasets in accuracy estimation of anomaly score based data filtration

CS task data: To train the anomaly score discriminator for CS data, we use the task data from ACL-ARC and SCIERC. Details of these datasets are mentioned in Section 4.
CS and Bio Abstracts: The CS and biology abstracts are taken from the Semantic Scholar corpus (Ammar et al., 2018).

C Isolation Forest (Liu et al., 2008)

Isolation Forest is an unsupervised decision tree ensemble method that identifies anomalies by isolating outliers in the data: it isolates anomalous points instead of profiling the normal points. The algorithm works by recursively partitioning the data using a random split between the minimum and maximum value of a random feature. It works due to the observation that outliers are less frequent than normal points and lie further away from them in the feature space. Thus, under random partitioning, anomalous points require fewer splits on features, resulting in shorter paths that distinguish them from the rest of the points. The anomaly score of a point $x$ is defined via the expected path length $E[h(x)]$ of $x$ across the decision trees,
$$s(x, n) = 2^{-E[h(x)]/c(n)},$$
where $c(n) = 2H(n-1) - 2(n-1)/n$ is the average path length of an unsuccessful search in a binary search tree, $H(n-1)$ is the $(n-1)$-th harmonic number, and $n$ is the number of external nodes.
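In practice we rely on scikit-learn's IsolationForest rather than implementing the algorithm ourselves; the sketch below shows how the paper's anomaly score s(x, n) relates to the library's output (the harmonic number is approximated by ln(n-1) plus Euler's constant, as is standard).

import numpy as np
from sklearn.ensemble import IsolationForest

def c_factor(n):
    """Average path length of an unsuccessful BST search over n external nodes."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649   # H(n-1) ~ ln(n-1) + Euler's constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def isolation_forest_scores(X, seed=0):
    """Anomaly score s(x, n) = 2^(-E[h(x)] / c(n)); higher = more anomalous."""
    forest = IsolationForest(random_state=seed).fit(X)
    # scikit-learn's score_samples returns the *negated* anomaly score of the
    # original paper, so -score_samples(X) recovers s(x, n).
    return -forest.score_samples(X)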

D Pseudo Code
Algorithm 1 shows the pseudo-code for the case of continuous relevance weights; the discrete relevance weight setting corresponds to $C \to \infty$. As discussed in Section 3.2, in the case of discrete relevance weights we filter out segments containing several consecutive sentences. We experimented with several options for the segment length and found a stable segment length to be 15 sentences. Here, a sentence is a consecutive piece of text such that, when passed through the BERT tokenizer, it results in 256 tokens.

Algorithm 1: Task Adaptive Pre-training
Input: Pre-trained model B, pre-training instances x_1, ..., x_N, task data T, hyper-parameters (C, α), #steps
Stage 1: Instance weight computation
  Let the sentences of the task data be s_1, ..., s_t with sentence embeddings P = {Embed(s_1), ..., Embed(s_t)}.
  Let a random subset of pre-training sentences be s'_1, ..., s'_{t/10} with BERT based sentence embeddings N' = {Embed(s'_1), ..., Embed(s'_{t/10})}.
  Train an anomaly detection model, IF = IsolationForest(P ∪ N').
  Compute relevance weights λ(x_1), ..., λ(x_N) from the anomaly scores of IF as in Section 3.2.
Stage 2: Adaptation of pre-training to the target domain
  Continue training language model B for #steps on instances x_1, ..., x_N with instance weights λ(x_1), ..., λ(x_N).
  Finetune the resulting model on the labeled task data T.
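An illustrative Python rendering of Algorithm 1 (continuous-weight variant) is sketched below; embed, continue_pretraining, and finetune are assumed helper functions standing in for the sentence embedder, the weighted-loss training loop, and standard finetuning.

import numpy as np
from sklearn.ensemble import IsolationForest

def tadpole(pretrained_model, pretrain_sentences, task_sentences, C, alpha, steps,
            embed, continue_pretraining, finetune, labeled_task_data):
    """Illustrative end-to-end TADPOLE pipeline (continuous-weight variant)."""
    # Stage 1: fit an anomaly detector on task-domain sentence embeddings
    # (plus a small random slice of the pre-training corpus, as in Algorithm 1)
    # and turn anomaly scores into relevance weights.
    rng = np.random.default_rng(0)
    sample = rng.choice(len(pretrain_sentences), size=len(task_sentences) // 10, replace=False)
    fit_emb = np.vstack([embed(task_sentences),
                         embed([pretrain_sentences[i] for i in sample])])
    detector = IsolationForest(random_state=0).fit(fit_emb)
    anomaly = -detector.score_samples(embed(pretrain_sentences))
    normalized = (anomaly - anomaly.mean()) / anomaly.std()
    weights = 1.0 / (1.0 + np.exp(-C * (alpha - normalized)))

    # Stage 2: continue pre-training with the weighted loss, then finetune.
    adapted = continue_pretraining(pretrained_model, pretrain_sentences, weights, steps)
    return finetune(adapted, labeled_task_data)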

E Continuous relevance weights
We see in Table 4 that a model additionally pre-trained for 50,000 steps with discrete λ values consistently outperforms the continuous case, even when we train with continuous relevance weights for a far higher number of steps. This is because many of those steps yield virtually no training at all. For instance, suppose the relevance weights are uniformly distributed between 0 and 1, say [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]. In the discrete case we pick the top two sentences, so two steps are sufficient to train on these most relevant sentences (assuming a batch size of 1). In the continuous case, however, we need to train the model for ten steps to cover these top two relevant sentences. Thus, we need many more steps to reach and beat the performance achieved in the discrete case. An open question is how to combine the two settings so as to benefit from the generality of the continuous case and the efficiency of the discrete case.

F Performance boost with Domain-specific Corpus vs TADPOLE
We compare the performance boost achieved by TADPOLE with the performance boost achieved when we have access to a large domain-specific pre-training corpus. In Table 5, we list the gain in performance in both cases over eight tasks from four domains. We see that the performance boost from TADPOLE is higher when the corresponding boost from the domain-specific corpus is higher. Thus, if there is a large domain shift between the general domain corpus and the task data, as measured by the performance boost from a large domain-specific pre-training corpus, then TADPOLE is able to achieve a large performance boost via task adaptation. The scales of the numbers in the two columns are not directly comparable for two reasons. First, the additional pre-training done in Gururangan et al. (2020) is for almost as many steps as are required to pre-train a network from scratch, whereas in our case additional pre-training is done for only 5% of that number of steps. Second, the model used in Gururangan et al. (2020) is different, RoBERTa, and the general domain corpus is different, so the domain shift is not exactly the same as in our case. The point however remains the same: as the target domain moves further away from the pre-training corpus, the benefits of TADPOLE increase.

G Performance gain for large datasets

Table 6 shows the results for datasets with more than 5K training samples.

Table 6: Performance of TADPOLE and five baseline methods on the large datasets. Base corresponds to the model pre-trained on the general domain corpus with no further pre-training. Baseline methods are described in Section 4. TADPOLE corresponds to our method with discrete relevance weights. T+Data-Aug corresponds to our method combined with data augmentation during finetuning. In line with previous work, we use the following metrics: accuracy for SST, micro f1 score for CHEMPROT and RCT20K, macro f1 score for ACL-ARC, SCIERC, HELPFULNESS, HYPERPARTISAN, IMDB, and AGNEWS. Each model is finetuned eight times with different seeds and the mean value is reported. Subscripts correspond to the standard deviation in the finetuned model performance. Average gain corresponds to the average improvement over Base for each of the baseline methods and TADPOLE; its subscript corresponds to the standard deviation in the estimate of the average gain.

H Different data fraction

Figure 6 shows that the results reported in Table 3 are not very sensitive to the fraction of pre-training data filtered; we can choose anywhere between 2-20% of the data.