Automatic Document Selection for Efficient Encoder Pretraining

Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. On both perplexity and across several downstream tasks in the target domain, it consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.


Introduction
Large pretrained language models have achieved state-of-the-art performance in NLP tasks (Devlin et al., 2019; Liu et al., 2019, i.a.). These studies find that increasing pretraining data size usually leads to better task performance. For many tasks, additional task (in-domain) data helps improve the performance further (Gururangan et al., 2020; Dery et al., 2021; Li et al., 2022). Several studies have found that directly pretraining on task data is more effective: science texts (Beltagy et al., 2019), tweets (Nguyen et al., 2020), legal texts (Chalkidis et al., 2020), or code (Tabassum et al., 2020; Chen et al., 2021). Notably, these domains are known a priori, and identifying data sources for curation is straightforward. In other instances where the domain is less clear, like "offensive online content" (Bai et al., 2021), more complicated data sampling is employed to guess at the desired data distribution suitable for training a downstream classifier.
To address such scenarios, we propose automatically identifying relevant domain-specific training data from a large corpus and subsequently pretraining a model on the selected data. Specifically, we use Cynical Data Selection (Axelrod, 2017), an approach that advanced Moore-Lewis sampling (Moore and Lewis, 2010), to select data from the Pile dataset (Gao et al., 2021). This automatic selection method can include possibly overlooked yet relevant documents from domains that may not be too close to the target domain. Figure 1 illustrates this method, which achieves higher performance on tasks in the target domain by using only 2.5GB (0.5%) of cynically selected data.
We experiment with pretraining encoders with varying amounts of data sampled from the Pile. With our "target corpus" of OntoNotes (Weischedel et al., 2013), we compare language models trained with cynical and random selection at various data levels. We find that the cynically selected encoder achieves consistently lower target corpus perplexity than one trained with random selection. We further finetune the encoders on a suite of tasks, some of which are derived from OntoNotes. Again, we find that models pretrained with cynical selection perform best. We suggest this as a viable method for inexpensively pretraining effective domain-specific encoders.

Cynical Data Selection
Methods for data selection for language-related tasks have been widely studied, usually to select in-domain data (Axelrod et al., 2011; van der Wees et al., 2017; Dai et al., 2020; Killamsetty et al., 2020). One such method is Cynical Data Selection (Axelrod, 2017). The intuition behind cynical selection is to greedily rank sentences from the text corpus by a score computed against text representative of the target domain, where the score reflects how much information is gained by selecting the sentence.
Concretely, given representative text from the target domain, cynical selection uses the cross-entropy of the selected text against the representative text and calculates the information gain of each sentence in the general corpus. It then picks the most useful sentence relative to what has already been selected and its similarity to the representative text. This also leads to a bias towards shorter sentences and a preference for sentences that contain words with high probability in the representative text.
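The greedy scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names `sentence_delta` and `greedy_select`, the whitespace tokenization, and the additive `smooth` constant (used to avoid taking the log of zero before anything has been selected) are all hypothetical choices. The score follows the penalty-plus-gain form of Axelrod (2017), where a lower value means the sentence is more useful.

```python
import math
from collections import Counter


def sentence_delta(tokens, sel_counts, sel_total, rep_counts, rep_total, smooth=1.0):
    """Estimated change in cross-entropy against REP if `tokens` were selected.

    Lower is better. `smooth` is an illustrative assumption that keeps the
    logarithms finite while the selected set is still empty.
    """
    # Penalty: grows with sentence length, hence the bias towards short sentences.
    penalty = math.log((sel_total + smooth + len(tokens)) / (sel_total + smooth))
    # Gain: non-positive; more negative for words that are frequent in REP
    # but still rare in the selected data.
    gain = 0.0
    for word, c_new in Counter(tokens).items():
        c_rep = rep_counts.get(word, 0)
        if c_rep:
            c_old = sel_counts.get(word, 0) + smooth
            gain += (c_rep / rep_total) * math.log(c_old / (c_old + c_new))
    return penalty + gain


def greedy_select(sentences, rep_tokens, k):
    """Pick k sentence indices, each time taking the lowest delta given what is already selected."""
    rep_counts, rep_total = Counter(rep_tokens), len(rep_tokens)
    sel_counts, sel_total = Counter(), 0
    remaining = list(range(len(sentences)))
    chosen = []
    for _ in range(min(k, len(sentences))):
        best = min(
            remaining,
            key=lambda i: sentence_delta(sentences[i], sel_counts, sel_total, rep_counts, rep_total),
        )
        chosen.append(best)
        remaining.remove(best)
        sel_counts.update(sentences[best])
        sel_total += len(sentences[best])
    return chosen


if __name__ == "__main__":
    rep = "the cat sat on the mat".split()
    corpus = [["the", "cat"], ["quantum", "flux", "capacitor", "matrix"], ["the", "mat"]]
    print(greedy_select(corpus, rep, k=2))
```

In this toy run the off-topic four-word sentence is never chosen: it pays the largest length penalty and earns no gain, which mirrors the two biases noted above.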
Our work extends cynical selection to document-level selection. Sentences are still scored at the sentence level, but the average sentence-level gain determines the information gain of a document. We demonstrate its advantages in efficiently selecting documents related to the target domain.
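As a rough illustration of the document-level extension, the sketch below averages per-sentence scores into one score per document and ranks documents by it. The names `document_score` and `rank_documents` are hypothetical, the per-sentence values are made-up numbers, and we assume the convention that a lower (more negative) value marks a more useful sentence.

```python
from statistics import mean


def document_score(sentence_deltas):
    """Average the sentence-level scores of one document (lower = better, by assumption)."""
    return mean(sentence_deltas)


def rank_documents(doc_deltas):
    """doc_deltas maps a document id to its list of sentence-level scores; best document first."""
    return sorted(doc_deltas, key=lambda doc_id: document_score(doc_deltas[doc_id]))


if __name__ == "__main__":
    docs = {
        "short": [-0.4, -0.3],                # two uniformly useful sentences
        "long": [-0.9, 0.1, 0.2, 0.1, 0.0],   # one very useful sentence, diluted by the rest
    }
    print(rank_documents(docs))
```

Averaging means that a single highly relevant sentence cannot carry a long document, so the short document ranks first here even though the long one contains the best individual sentence.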

Experiments and Results
In this work, we set OntoNotes 5.0 (Weischedel et al., 2013) as our target corpus, and we use a smaller sample from the training corpus of the CoNLL 2012 Shared Task (Pradhan et al., 2012) as the representative corpus for data selection. We first train an encoder based on the selected data and use the Edge Probing suite (Tenney et al., 2019b) for the downstream task evaluation, which has previously been used to probe and evaluate language models (Clark et al., 2019; Tenney et al., 2019a; Jiang et al., 2020; Zhang et al., 2021).

Data Selection
Dataset We adopt the Pile (Gao et al., 2021) for data selection, which consists of 1250GB text from 22 domains. Cynical selection naturally prefers text data based on the target corpus. To make a fairer comparison, we exclude 100GB data from "DM Mathematics" and "Github" to eliminate the noise of non-text data in random selection.
Selection Strategy Encoder pretraining is naturally a document-level task, as context contributes critically to improved representations. Thus, we need to extend sentence selection to document selection to achieve a better-contextualized representation at the pretraining stage. We apply our extended document-level cynical selection to the Pile and extract the top {0.5%, 1%, 2%, 5%} scored documents. We also randomly sample the same percentage of documents from the Pile to use as a corresponding baseline. As a baseline for manual selection, we use 30GB text from "Wikipedia" and "BookCorpus" subsets, following Liu et al. (2019).
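The two selection strategies compared here can be sketched as follows. This is only an illustration of the setup: the helper names `top_fraction` and `random_fraction`, the fixed seed, and the convention that lower document scores are better are assumptions, and the scores in the usage example are synthetic.

```python
import random


def top_fraction(doc_scores, fraction):
    """Ids of the best-scoring `fraction` of documents (lower score = better, by assumption)."""
    k = max(1, round(len(doc_scores) * fraction))
    return sorted(doc_scores, key=doc_scores.get)[:k]


def random_fraction(doc_ids, fraction, seed=0):
    """Uniform random baseline with the same number of documents as top_fraction."""
    doc_ids = list(doc_ids)
    k = max(1, round(len(doc_ids) * fraction))
    return random.Random(seed).sample(doc_ids, k)


if __name__ == "__main__":
    scores = {f"doc{i}": float(i) for i in range(200)}  # synthetic scores
    for fraction in (0.005, 0.01, 0.02, 0.05):
        cynical = top_fraction(scores, fraction)
        baseline = random_fraction(scores, fraction)
        print(fraction, len(cynical), len(baseline))
```

Both subsets contain the same number of documents at each percentage, so any difference in total size in GB comes from the lengths of the documents that each strategy picks.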

Encoder Pretraining
We set up a BERT-base model and follow the pretraining objective and settings described in RoBERTa (Liu et al., 2019). In Figure 2, we plot the validation perplexity on both the representative corpus (CoNLL 2012 Shared Task) and a held-out set of the Pile. The perplexity on the held-out set decreases when there is more training data for both the cynical and random selection. On this held-out set, cynical selection attains a higher perplexity than random selection, which shows that while the selected documents are better adapted to the target domain, they are not better adapted to the general corpus. As each encoder needs different training steps for different corpus sizes, we try to make a fair comparison by assuming a fixed training budget of 100k update steps. In Figure 2, we find that at 100k steps, 2% of the cynically selected data achieves the lowest perplexity, and more training data does not help the adaptation to the target corpus. Also, cynically selected documents consistently outperform the random selection, demonstrating the effectiveness of adapting to the target domain.
[Figure caption fragment: ... (Tenney et al., 2019b). The cynical selection consistently outperforms both the random and manual selection in most cases, even with only 0.5% selected documents.]
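For reference, validation perplexity of the kind compared in this section is conventionally the exponential of the average per-token negative log-likelihood. The sketch below assumes natural-log losses collected over the predicted (masked) positions of a validation set; the function name `perplexity` is illustrative, and some toolkits report the same quantity in base 2 instead.

```python
import math


def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (natural log) over predicted tokens."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every target token has perplexity 4.
    print(perplexity([math.log(4.0)] * 3))
```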

Edge Probing Evaluation
We evaluate the effectiveness of the pretrained encoders on 8 Edge Probing tasks (Tenney et al., 2019b), for which the metric and architecture are standardized to evaluate the span-level contextual representation of the language model; this suite has been widely studied in the past few years. We adopt jiant (https://github.com/nyu-mll/jiant) for edge probing data processing and finetuning. Results are plotted in Figure 3. We find:
Observation 1: Models trained on cynically selected documents show consistent performance gain on all tasks compared to the random selection.
Observation 3: Compared to random sampling, the performance gain of the cynically selected documents is larger with only 0.5% to 1% of training data, and decreases for larger training sets as random selection catches up.
Observation 4: For some tasks, especially "const" and "pos," which are two tasks exactly based on the OntoNotes dataset, cynically selected documents yield good task performance with only 0.5% data, and the scores decrease when increasing the selection size to 2%, but increase again with 5%. This could suggest that in cynical selection, the top-scored documents are strongly related and helpful to the target task domain, while the others may not contribute as much or even hurt. However, more data ultimately does improve performance.
Overall, we could achieve promising results with only 0.5% documents of the entire corpus, demonstrating the effectiveness and efficiency of cynical selection in the adaptation to downstream tasks in the target domain. We also notice the standard deviation of the runs for random selection is much larger than for cynical selection, indicating more stable encoder results from cynically selected documents.

Discussion
Data Distribution We plot the domain distribution of the selected documents in Figure 4. While random selection follows the distribution of the original Pile dataset, cynical selection prefers news-like articles such as the "Pile CC" and "OpenWebText2," rather than technical ones, like StackExchange. Also, since we consider the same number of selected documents for each split, the actual selected data size is not the same (Figure 5). We notice that cynical selection prefers shorter documents, especially in the top-ranked samples. This is likely related to our scoring strategy, since we average the sentence scores as the final document score. A long document, even one that contains some sentences with higher scores, is unlikely to be selected because its final score is averaged over its total number of sentences. This explains why the cynical selection prefers shorter documents in the 0.5% and 1% selection but not in the 5% selection. Therefore, when we bring the actual selected data sizes into the comparison, the cynical selection is much more efficient than the random sampling. Future work can investigate other methods of aggregating sentence-level scores.
Computational Trade-off Cynical selection enables the language models to use less training data and GPU time while achieving competitive results. However, the data selection needs to be done before the training, and pre-processing could be costly. Cynical selection on the Pile can be parallelized via sharding, because the specific order/ranking of a document in the final selected subset is not important. The intuition is that any good document will be chosen early, regardless of which shard it is in. So, we split the automatic document selection of the Pile into 10,000 smaller jobs, each requiring a single-core CPU and 10GB of RAM and taking 2 hours to finish. In general, the cost of the selection depends on the size of the general corpus that is being selected from. In our training environment with 8 RTX6000 GPUs, it takes 800+ GPU hours in total to train an encoder with 60GB randomly selected documents. To achieve comparable or even better performance with cynically selected documents, we only need 200 GPU hours for the 2.5GB of cynically selected data to converge. The market price for a single RTX6000 is $1.50/hour, so we need $1200+ to train with random selection but less than $300 for cynical selection. On the Google Cloud Platform, 20,000 hours on comparable or faster CPUs can be obtained with $200. Overall, cynically selected documents save more than 50% of the computational cost and achieve better task scores.
Overfitting Large language models have the ability to overfit or memorize small datasets (Kaplan et al., 2020; Carlini et al., 2022). We inspect the loss curves for two of the cynical selections (1% and 2%) in Figure 6. While the 1% encoder achieves a lower loss for most parts of the training, it is eventually surpassed by the 2% model. This highlights a tradeoff between computing cost and performance; given a limited compute budget (in this example, under 50K steps), it is better to use a smaller selection. While prior work suggests scaling up models to fit dataset size (Kaplan et al., 2020), we are successful in scaling down dataset sizes so that they can be efficiently fit (and outperform larger datasets) in fewer steps.
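The cost comparison in the Computational Trade-off paragraph can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted there and treats the approximate values ("800+", "less than $300") as exact lower or upper bounds, which is a simplifying assumption; the variable names are illustrative.

```python
# Figures quoted in the text; approximate values are treated as exact (an assumption).
GPU_PRICE_PER_HOUR = 1.50          # USD, single RTX6000
RANDOM_GPU_HOURS = 800             # "800+" GPU hours for 60GB of random data
CYNICAL_GPU_HOURS = 200            # GPU hours for 2.5GB of cynically selected data
SELECTION_JOBS = 10_000            # sharded selection jobs
HOURS_PER_JOB = 2                  # single-core CPU hours per job
SELECTION_CPU_COST = 200.0         # USD for 20,000 CPU hours

selection_cpu_hours = SELECTION_JOBS * HOURS_PER_JOB
random_cost = RANDOM_GPU_HOURS * GPU_PRICE_PER_HOUR
cynical_training_cost = CYNICAL_GPU_HOURS * GPU_PRICE_PER_HOUR
cynical_total_cost = cynical_training_cost + SELECTION_CPU_COST
saving = 1 - cynical_total_cost / random_cost

if __name__ == "__main__":
    print(f"selection CPU hours: {selection_cpu_hours}")
    print(f"random selection:  ${random_cost:.0f}")
    print(f"cynical selection: ${cynical_total_cost:.0f} (training ${cynical_training_cost:.0f} + selection ${SELECTION_CPU_COST:.0f})")
    print(f"saving: {saving:.0%}")
```

Even after charging the one-off selection cost to the cynical run, the total stays below half of the random-selection training cost, which is where the "more than 50%" figure comes from.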

Related Work
Due to the huge computational cost of training large models, both researchers and engineers have sought ways to use data more efficiently. Some prior works use statistical methods to select relevant data from a large corpus (Rousseau, 2013; Kirchhoff and Bilmes, 2014; Eetemadi et al., 2015; Xu and Koehn, 2017). Other studies introduce additional classifiers or language models to help the data selection (Ruder and Plank, 2017; Qu et al., 2019; Sun et al., 2021). Data selection has also been incorporated into active learning approaches (Shen et al., 2004; Lowell et al., 2018; Erdmann et al., 2019; Shelmanov et al., 2019; Margatina et al., 2022; Tsvigun et al., 2022). This work applies a simple statistical method to find the text most related to a target domain. It incrementally constructs a dataset out of a large corpus for the goal of training language models.

Conclusion
This work builds the connection from corpus subselection in statistical LM construction to neural LMs. We extend cynical data selection to efficiently select task-related documents for encoder pretraining and achieve lower perplexity in the target domain. We also demonstrate its effectiveness on downstream tasks by achieving comparable or even better results with 20x less data, 3x fewer training iterations, and 2x less computational cost on 8 Edge Probing tasks. We believe this fills a gap in the literature on an important topic in training powerful LMs. We purposefully keep this work in the space of methods used in the days of Stat NLP to highlight their out-of-the-box applicability, for which that line of research is still salient. Our findings revive this line of research and suggest that novel approaches should be studied. We anticipate that with this connection, researchers could explore this topic, investigate various subselection methods, and extend it to other domains.

Limitations
Since pretraining encoders is expensive, our study only experiments on one source corpus (Pile) and one target task domain (OntoNotes). However, this method could be demonstrated more effectively on other datasets that are more domain-specific. We do not run multiple random selections with different seeds due to the time and cost of training large models. We think the standard error for the randomly selected data would be significant, especially for the subset of only 0.5% or 1% documents. Also, we recognize that training our models longer or scaling up the model size is an "easy" method of improving performance (Liu et al., 2019; Kaplan et al., 2020). Our results assume a fixed training budget (max 100k steps). Thus, with a larger budget, the trade-offs will vary. Another concern is that we do not experiment with other subselection methods (Gururangan et al., 2019) or other languages, but we believe they should have similar trends.

A.1 Detailed Distribution
A detailed data distribution is shown in Table 2.

B Formalization of Cynical Data Selection
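The aim of CynDS is to incrementally construct $W$ by scoring each sentence by the information gained relative to the already selected data (Equation 1). Given a REPresentative corpus from the target domain, CynDS is an effective and efficient method to identify the most relevant subset of sentences from a large corpus. Formally, we can define a cross-entropy between REP and some set of tokens as

$$H(\mathrm{REP}) = -\sum_{v \in V} \frac{C_{\mathrm{REP}}(v)}{|\mathrm{REP}|} \log \frac{C(v)}{|W|},$$

where $W$ is the set of tokens, $V$ is the vocabulary, and $C$ indicates the count of word type $v$: $C_{\mathrm{REP}}(v)$ is the count within REP, $C(v)$ is the count within $W$, and $|\mathrm{REP}|$ and $|W|$ are the total token counts of REP and $W$. Let $W_1, \dots, W_n$ be the incrementally selected corpus, with $C_n(v)$ the count of $v$ in $W_1, \dots, W_n$ and $N_n$ their total number of tokens. We can define the cross-entropy after selecting $n$ sentences as

$$H_n(\mathrm{REP}) = -\sum_{v \in V} \frac{C_{\mathrm{REP}}(v)}{|\mathrm{REP}|} \log \frac{C_n(v)}{N_n}$$

and minimize $H_n$. This can be rewritten recursively as

$$H_{n+1} = H_n + \Delta H_{n \to n+1}(s),$$

where $\Delta H_{n \to n+1}(s)$ is the delta (effect) of a given sentence $s$. To find the $(n+1)$th sentence that minimizes $\Delta H_{n \to n+1}$, we can rewrite it as

$$\Delta H_{n \to n+1}(s) = \mathrm{Penalty} + \mathrm{Gain}.$$

Here, penalty refers to how similar the sentence is to already selected texts and gain refers to how similar the sentence is to the representative corpus. Axelrod (2017) derives the Penalty and Gain as

$$\mathrm{Penalty} = \log \frac{N_n + |s|}{N_n}, \qquad \mathrm{Gain} = \sum_{v \in V} \frac{C_{\mathrm{REP}}(v)}{|\mathrm{REP}|} \log \frac{C_n(v)}{C_n(v) + c_s(v)},$$

where $|s|$ is the number of tokens in $s$ and $c_s(v)$ is the count of $v$ in $s$. A proof of this derivation is given in Axelrod (2017).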
In our work, we still let $W_1, \dots, W_n$ represent the first $n$ sentences, and $H(\mathrm{REP})$ is unchanged. However, we use the scores, $\Delta H_{n \to n+1}(s)$, of each sentence and compute document-level scores for each document $D$,

$$\mathrm{Score}(D) = \frac{1}{|D|} \sum_{s \in D} \Delta H_{n \to n+1}(s),$$

where $|D|$ is the number of sentences in $D$. These document-level scores can then be ranked, and we select the top k% of the documents. Note that while there are many alternatives for selecting documents, our goal is to select one method and evaluate whether automatic data selection is effective for LM pretraining rather than to compare different methods, which can be future work.
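For readers who want the intermediate steps, the split of the sentence delta into a penalty and a gain term can be sketched directly from the cross-entropy definition. The derivation below is our own reconstruction and not a quotation of Axelrod (2017); it assumes the following notation: $N_n$ is the number of tokens in the first $n$ selected sentences, $C_n(v)$ is the count of word type $v$ among them, $|s|$ is the number of tokens in the candidate sentence $s$, $c_s(v)$ is the count of $v$ in $s$, and $|\mathrm{REP}|$ is the number of tokens in REP.

```latex
\begin{align*}
\Delta H_{n \to n+1}(s)
  &= H_{n+1} - H_n \\
  &= \sum_{v \in V} \frac{C_{\mathrm{REP}}(v)}{|\mathrm{REP}|}
     \left[ \log \frac{C_n(v)}{N_n} - \log \frac{C_n(v) + c_s(v)}{N_n + |s|} \right] \\
  &= \sum_{v \in V} \frac{C_{\mathrm{REP}}(v)}{|\mathrm{REP}|} \log \frac{N_n + |s|}{N_n}
   + \sum_{v \in V} \frac{C_{\mathrm{REP}}(v)}{|\mathrm{REP}|} \log \frac{C_n(v)}{C_n(v) + c_s(v)} \\
  &= \underbrace{\log \frac{N_n + |s|}{N_n}}_{\text{Penalty} \,\ge\, 0}
   + \underbrace{\sum_{v \in V} \frac{C_{\mathrm{REP}}(v)}{|\mathrm{REP}|} \log \frac{C_n(v)}{C_n(v) + c_s(v)}}_{\text{Gain} \,\le\, 0}
\end{align*}
```

The last step uses the fact that the REP word frequencies sum to one. The penalty depends only on sentence length, while the gain is most negative for sentences whose words are frequent in REP and still rare in the selected data.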

B.1 Sentence vs Document Selection
Results are shown below in

Figure 1: This figure highlights the efficiency of the automatic cynical selection of documents in the target domain. Scores are averaged from 8 Edge Probing tasks. Cynically selected 2.5GB data achieves the best score.

Figure 4: Data distribution over the Pile domains

Figure 5: For each percentage of cynically and randomly selected documents, we show the actual data size (GB) and corresponding document length.

Figure 6: This figure shows the training loss for the runs of 1% and 2% cynically selected subsets.

Table 1: Each subset consists of 15GB text.