Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, which has resulted in significant computational and energy costs. In this paper, we propose Influential Subset Selection (ISS) for language models, which explicitly utilizes end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples that will provide the most positive influence on the performance of the end-task. Furthermore, we design a gradient-matching-based influence estimation method, which can drastically reduce the computation time of influence. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.


Introduction
Pretrained language models (PTMs) (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019), trained on massive and heterogeneous corpora, have significantly improved the state-of-the-art across a variety of natural language processing tasks (Wang et al., 2022, 2023). Kaplan et al. (2020) found power laws relating cross-entropy loss to the sizes of language models and their training datasets. As a result, the field has recently shifted toward larger models and larger data (Brown et al., 2020; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022) in hopes of improving performance.
However, training a state-of-the-art language model requires substantial computational resources, which demand considerable energy, along with the associated financial and environmental costs (Strubell et al., 2019). For example, RoBERTa-Large (Liu et al., 2019), which was trained on 1000 V100 GPUs for approximately one day, has a computational cost of 4.36×10^21 FLOPs. Recently, Chowdhery et al. (2022) proposed PaLM, which consumes 580 times more FLOPs than RoBERTa-Large. PaLM was trained on 6144 TPU v4 chips for more than 1200 hours, which is unaffordable for most researchers. Therefore, finding ways to speed up pretraining is crucial for the development of pretrained model research.
In general, there are three main strategies used to speed up pretraining in NLP: parallel architectures, efficient model architectures, and novel pretraining tasks. The first is to train a single model utilizing multiple GPUs distributed across many computational nodes (Wang et al., 2020b; Shazeer et al., 2018; Huang et al., 2019). Unfortunately, the efficiency gains of this strategy depend entirely on the amount of computing hardware used. The second strategy is to improve model structures to reduce computational complexity and thereby improve efficiency (Wang et al., 2020a; Katharopoulos et al., 2020; Roy et al., 2021). The last explores more challenging pretraining tasks to accelerate a model's convergence (Clark et al., 2019; Joshi et al., 2020; Levine et al., 2020). However, their improvements are limited, with a reduction of less than an order of magnitude in computational expenses (measured in FLOPs).
In this paper, we aim to reduce computational costs at the data level (see Table 1). PTMs are trained on the entire pretraining corpus D, which is task-agnostic. To take the downstream task into account, we hope to select the most relevant samples from the pretraining corpus based on the downstream data. Recently, Yao et al. (2022) proposed TLM, which retrieves data from a pretraining corpus using task data as queries. However, TLM remains task-agnostic, because it only considers text (i.e., X) similarities and ignores the label (i.e., Y) information.
Motivated by the influence function (Cook and Weisberg, 1982; Koh and Liang, 2017), we propose Influential Subset Selection (ISS) for language models, i.e., selecting the samples with the most positive influence on the downstream task. To calculate the label-aware influence value, ISS applies the derivation chain rule from a test objective to training samples. Nevertheless, directly applying the chain rule requires computing the inverse of the Hessian, with complexity O(nq^2 + q^3) (n is the number of examples and q is the parameter size), which is computationally expensive and may run out of memory for neural networks. To address this problem, we propose a gradient-matching-based influence approximation method for selecting pretraining data, which estimates the influence score by matching the gradient values of pretraining samples and end-task samples. Our method avoids computing the inverse of the Hessian and significantly speeds up influence estimation.
Our main contributions are summarized as follows:
• We propose Influential Subset Selection for language models, which explicitly utilizes knowledge of the end-task to select the pretraining corpus.
• We design a simple, efficient, gradient-matching-based method for influence estimation, which avoids calculating the inverse of the Hessian and significantly speeds up estimation.
• We evaluate the effectiveness of our method on eight tasks covering four domains. Notably, ISS outperforms PTMs (e.g., RoBERTa) with only 0.45% of the data and three orders of magnitude fewer FLOPs. Our code can be found at https://github.com/nitwtog/ISS.

Definition
We assume an end-task dataset T = {Z_t}, where Z_t = {(x_t^1, y_t^1), (x_t^2, y_t^2), ..., (x_t^m, y_t^m)} represents a set of texts with their ground-truth labels, and a large-scale pretraining corpus D = {Z_p}, where Z_p = {x_p^1, x_p^2, ..., x_p^M} represents unlabeled data. We define f = f^(head) ∘ f^(feat), such that f^(feat)(·; θ ∈ Θ) is a feature extractor that is transferable across learning stages (e.g., pretraining to finetuning) and f^(head)(·; φ ∈ Φ) is a task-specific head that is not transferable. We denote by l_p(z_p, θ, φ_p) and l_t(z_t, θ, φ_t) the loss functions of pretraining and the end-task, respectively.

Influence Function
The influence function (Cook and Weisberg, 1982; Koh and Liang, 2017) provides an efficient way to estimate the importance of a training sample. Consider a training sample z upweighted by a small ε during training; the empirical risk minimizer can be written as

θ̂_{ε,z} = arg min_θ (1/n) Σ_{z_i ∈ D} l(z_i, θ) + ε l(z, θ).

Assigning −1/n to ε is equivalent to removing the training example z_p. The influence of upweighting z_p on the parameters is then given by

I_param(z) = dθ̂_{ε,z}/dε |_{ε=0} = −H_θ^{−1} ∇_θ l(z, θ̂),

where H_θ = (1/n) Σ_{z_i ∈ D} ∇²_θ l(z_i, θ̂) is the Hessian, positive definite by assumption, and I_param(z) ∈ R^N, with N the number of network parameters. We can then linearly approximate the parameter change due to removing z, without retraining the model, as θ̂_{−z} − θ̂ ≈ −(1/n) I_param(z).
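As a sanity check on these formulas, the following sketch (ours, not the paper's implementation; a toy linear regression with squared loss, where the Hessian is tiny and can be formed explicitly) compares the influence-based estimate of the parameter change against actual leave-one-out retraining:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Empirical risk minimizer for l(z, theta) = 0.5 * (x . theta - y)^2
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Hessian of the mean loss: H = (1/n) X^T X (positive definite here)
H = X.T @ X / n

def influence_param(x, y_, theta):
    """I_param(z) = -H^{-1} grad_theta l(z, theta_hat)."""
    grad = (x @ theta - y_) * x
    return -np.linalg.solve(H, grad)

# First-order estimate of the parameter change from removing sample 0:
# theta_{-z} - theta_hat ≈ -(1/n) I_param(z)
delta_est = -(1.0 / n) * influence_param(X[0], y[0], theta_hat)

# Actual leave-one-out retraining for comparison
theta_loo = np.linalg.lstsq(X[1:], y[1:], rcond=None)[0]
delta_true = theta_loo - theta_hat
```

For ordinary least squares the estimate is off only by a leverage factor 1/(1 − h_0), so with n ≫ d the two parameter changes nearly coincide.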

Methodology
We investigate an influence-based subset selection method to perform efficient pretraining while attempting to minimize accuracy loss on the end-task dataset (Section 3.1).Due to the high computational costs of influence function (Koh and Liang, 2017), we design an influence approximation strategy to speed up the calculation (Section 3.2).

Influence of Pretraining Corpus
PTMs used in previous works usually adopt language modeling as the pretraining task, lacking task-specific prior knowledge. However, we often know the end-task beforehand, so we can make specific choices about our pretraining regimen to improve end-task performance. Under this setting, we introduce Influential Subset Selection for language models, which measures the importance of pretraining samples by considering the X and Y information of the end-task simultaneously. Specifically, a pretraining sample z_p affects the prediction of an end-task sample z_t by influencing the parameters of the feature encoder θ. Applying the chain rule, the influence of upweighting pretraining sample z_p on the loss at end-task sample z_t is

I(z_p, z_t) = −∇_θ l_t(z_t, θ̂)ᵀ H_θ^{−1} ∇_θ l_p(z_p, θ̂).
The more negative I(z_p, z_t) is, the more positive influence z_p provides. However, computing the Hessian for the full training dataset is expensive, and inverting it is similarly prohibitive: with n training data points and q parameters, this computation requires O(nq^2 + q^3) operations, which makes evaluating the influence of a large-scale pretraining corpus infeasible. Thus, we propose an influence approximation algorithm to speed up estimation.

Influence Approximation
Motivated by the fundamental theorem of calculus, we note that the final model parameters are the cumulative result of updates over many training iterations. Similarly, the difference between the loss of a test point z_t at the end of training and at the beginning of training can be decomposed along the path taken by the training process. Thus, we hypothesize that the total influence of all training examples on a fixed test point z_t is exactly the total reduction in loss on z_t.
Assume that we train the feature encoder by minimizing the pretraining loss l_p(z_p; θ, φ) via an iterative optimization procedure (such as SGD) which utilizes one training example z_p in iteration t. Let the parameters of the feature encoder before and after iteration t be θ_t and θ_{t+1}, respectively. The influence of z_p on z_t can then be approximated as follows.

Figure 1: Illustration of gradient-matching-based influence approximation. g_1 and g_2 are the loss gradients of two different pretraining samples, while g' is the loss gradient of the end-task sample. The influence of a pretraining sample is measured by how a small step based on its gradient affects the loss on the end-task sample. Compared to g_1, the update step of g_2 is more generalized.
Suppose we are at point θ_t, and we make a first-order Taylor expansion of the end-task loss around θ_t:

l_t(z_t, θ_{t+1}) ≈ l_t(z_t, θ_t) + ∇_θ l_t(z_t, θ_t)ᵀ (θ_{t+1} − θ_t).   (5)

Assuming the model employs SGD as the optimizer, the update in parameters is θ_{t+1} − θ_t = −η_t ∇_θ l_p(z_p, θ_t), where η_t is the learning rate at iteration t. Eq. (5) guarantees approximation precision as long as the update magnitude of θ is sufficiently small. Substituting the parameter update formula and disregarding the higher-order terms, we arrive at the first-order approximation

I(z_p, z_t) ≈ −η_t ∇_θ l_t(z_t, θ_t)ᵀ ∇_θ l_p(z_p, θ_t).
We refer to this first-order approximation as gradient-matching-based influence estimation. The full algorithm is provided in Algorithm 1.
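To make the approximation concrete, here is a toy sketch (ours; simple quadratic losses stand in for l_p and l_t, and all values are illustrative) showing that −η ∇l_t · ∇l_p predicts the drop in end-task loss caused by one SGD step on a pretraining sample:

```python
import numpy as np

eta = 0.01
theta = np.array([0.0, 0.0])

# Toy "targets": the gradients of these quadratics play the roles of
# g_1, g_2 (pretraining samples) and g' (end-task sample) in Figure 1.
a1 = np.array([1.0, 1.0])   # pretraining sample 1
a2 = np.array([1.0, -1.0])  # pretraining sample 2
b = np.array([1.0, 0.9])    # end-task sample

def grad(theta, target):    # gradient of 0.5 * ||theta - target||^2
    return theta - target

def loss(theta, target):
    return 0.5 * np.sum((theta - target) ** 2)

g_t = grad(theta, b)
# Gradient-matching influence: I(z_p, z_t) ≈ -eta * g_t . g_p
I1 = -eta * g_t @ grad(theta, a1)
I2 = -eta * g_t @ grad(theta, a2)

# Actual change in end-task loss after one SGD step on each sample
drop1 = loss(theta - eta * grad(theta, a1), b) - loss(theta, b)
drop2 = loss(theta - eta * grad(theta, a2), b) - loss(theta, b)
```

Sample 1's gradient aligns with the end-task gradient, so its influence score is more negative and its SGD step yields the larger reduction in end-task loss, matching the first-order prediction up to an O(η²) term.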
Visualisation. We visualize our influence estimation method in Figure 1. g_1 and g_2 are the loss gradients of two different pretraining samples, while g' is the loss gradient of the end-task sample. The influence of a pretraining sample can be viewed as the dot product of its gradient and the gradient of the end-task sample. Higher influence suggests that the network is learning parameters that generalize.

Implementation Details
Based on the influence score, we select the most relevant samples from the pretraining corpus. Following TLM, we first select a candidate subset via BM25 retrieval. We then compute influence scores only on this subset, which keeps ISS scalable and efficient.
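The retrieval stage can be sketched as follows (a from-scratch Okapi BM25 scorer over a three-document toy corpus; the corpus, query, and parameter values k1 = 1.5, b = 0.75 are illustrative, not the paper's actual setup):

```python
import math
from collections import Counter

corpus = ["protein kinase inhibits tumor growth",
          "stock market rises on earnings",
          "enzyme activity in protein folding"]
docs = [d.split() for d in corpus]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))  # document frequencies
k1, b = 1.5, 0.75

def bm25(query, doc):
    """Okapi BM25 score of one document for a whitespace-tokenized query."""
    tf = Counter(doc)
    score = 0.0
    for t in query.split():
        if t not in tf:
            continue
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf[t] * (k1 + 1) / (
            tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Rank the corpus against a task-data "query", highest score first
query = "protein enzyme"
ranked = sorted(range(N), key=lambda i: bm25(query, docs[i]), reverse=True)
```

The document matching both query terms ranks first; unrelated documents score zero. In ISS this stage merely prunes the pool before the (more expensive) influence computation.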
Moreover, the number of parameters in large-scale language models is very large, leading to very high-dimensional gradients. To tackle this problem, we adopt a last-layer gradient approximation, considering only the gradients of the last layer of the pretrained encoder. In addition, we select subsets of mini-batches by matching the weighted sum of mini-batch pretraining gradients to the mini-batch task gradients. Let B_p and B_t be the batch sizes of pretraining and the end-task; using mini-batches considerably reduces the number of selection rounds in the ISS algorithm, resulting in a B_p × B_t speed-up.

Experimental Setup
To evaluate the efficiency and generality of our approach, we conduct experiments in two settings: pretraining from scratch, and further pretraining.
Baselines. We focus on comparison with general PLMs and TLM. Following Yao et al. (2022), we finetune both BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) at base and large scales as our baselines, and we finetune the released TLM models as baselines.
Evaluation Strategy. We report the average performance over three random seeds, with standard deviations. Following Gururangan et al. (2020), we report test micro-F1 for ChemProt and RCT, and macro-F1 for the rest of the datasets. Following TLM (Yao et al., 2022), we set three pretraining scales, namely small, medium, and large. Differently, at the same scale, our method utilizes only 20% of the TLM data. More detailed settings are shown in Table A.1 in the Appendix.
Training Details. We utilize a randomly initialized BERT of base scale as our starter model. We mostly follow the optimization and hyperparameter choices used in Yao et al. (2022). For each task, we report the average F1 score across three random seeds with standard deviations as subscripts, and we also show the number of parameters, the total training compute (FLOPs), and the size of the training corpus for comparison.
Evaluation Strategy. Similar to pretraining from scratch, we report the average performance across three random seeds, with micro-F1 for ChemProt and RCT, and macro-F1 for ACL-ARC and SciERC.
Training Details. In this setting, we perform further pretraining on off-the-shelf pretrained models, such as BERT and RoBERTa. All experiments were conducted on 4 NVIDIA GeForce RTX 3090 GPUs. Detailed hyper-parameters are provided in Table A.2 in the Appendix.

Experimental Results
In this section, we discuss the results of our methods compared against the baselines.

Table 4: Evaluation results for ISS in further pretraining. We report the average F1 score across three random seeds with standard deviations as subscripts.
Table 3 shows the main results of ISS alongside the corresponding TLM and PLM baselines at three different scales. We draw the following observations: 1) ISS achieves results that are better than or comparable to the PLM baselines, with significant reductions in FLOPs and training data size. At the large scale, ISS achieves comparable results to RoBERTa-Large with, on average, 0.19% of the FLOPs and 0.45% of the training corpus. At the small and medium scales, ISS improves performance by 0.29 and 0.74 points on average, respectively. 2) At the same data scale, ISS significantly outperforms TLM, which indicates that task label information is crucial and that influence-based subset selection picks more influential pretraining samples. 3) ISS offers limited performance gains on high-resource datasets, suggesting that the influence of the pretraining samples decreases as the task data grows sufficiently large.

Further Pretraining
We compared ISS with other domain-specific further pretraining methods. Differently from pretraining from scratch, we initialize the network with off-the-shelf pretrained models and select influential subsets from the domain corpus. Table 4 shows the main results. In conclusion, our method outperforms all the baselines, with reductions in FLOPs and training data size of one order of magnitude or more, which shows that our approach is feasible.

Comparison of Pretraining Steps
To validate the effect of pretraining steps, we compare the performance of ISS with TLM at different numbers of pretraining steps. The test results on the four tasks are shown in Figure 3. We observe that ISS achieves the best performance with fewer steps on most of the datasets.

Subset Size for Pretraining
To compare performance at different data scales, we extracted subsets at various scales from the TLM small-scale corpus via ISS and TLM, respectively. The results are shown in Table 5. The performance of TLM improves as the dataset grows, but its best results are still lower than those of our method. For ISS, the F1-score peaks at the 20%-40% scale and gradually decreases as the data size grows further. We believe that as the dataset expands, task-irrelevant or noisy data is added.

Last Better than First
As explained in Section 3.3, only the last-layer gradients of the model encoder are considered, to speed up computation. We studied the relationship between the layer whose gradients are used in ISS and the resulting performance. Table 3 shows the results on ChemProt and SciERC. We observe that the closer a layer is to the task head, the better the selected subset works. This suggests that different layers in the language model capture different information, with layers closer to the task head learning more task-specific information.
Table 7 shows the time required by ISS to calculate influences at different layers. Overall, the time cost of selecting a subset is negligible compared to pretraining. In addition, computing based on the last layer is nearly twice as fast as computing at the embedding layer.

Visualization of Pretrained Model
We visualize the task data representations of ISS-small, BERT, and RoBERTa using the t-SNE algorithm (Van der Maaten and Hinton, 2008). The results are shown in Figure 4. We can observe that the different classes of deep features in ISS-small form tighter clusters, suggesting that ISS provides better initialization for downstream tasks. In contrast, the features learned by BERT and RoBERTa are spread across separate clusters with overlapping parts that cannot be distinguished.

Analysis of Task-influential Words
We compute the point-wise mutual information (PMI) (Levy and Goldberg, 2014) between words and their corresponding labels in the task dataset. Briefly, PMI measures the likelihood of two events occurring together, so the higher the PMI of a word, the more likely it is to be task-influential. We select words with high PMI as task-influential words and compare their frequency in the ISS-small and TLM-small datasets, respectively. As shown in Table 6, their frequency in the ISS-small dataset is higher than in the TLM-small dataset. Thus, ISS may focus more on task-influential words.
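PMI here is the standard corpus statistic PMI(w, y) = log p(w, y) / (p(w) p(y)). A minimal sketch (ours, over a made-up four-document task dataset) of scoring word-label association:

```python
import math
from collections import Counter

# Tiny labeled "task dataset"; texts and labels are illustrative
docs = [("the drug inhibits the enzyme", "chem"),
        ("the enzyme binds the drug", "chem"),
        ("the parser reads the sentence", "nlp"),
        ("the sentence has a parser", "nlp")]

word_label = Counter()  # joint counts (word, label)
word = Counter()        # marginal word counts
label = Counter()       # marginal label counts (per token)
total = 0
for text, y in docs:
    for w in text.split():
        word_label[(w, y)] += 1
        word[w] += 1
        label[y] += 1
        total += 1

def pmi(w, y):
    """PMI(w, y) = log p(w, y) / (p(w) p(y)) from token counts."""
    p_wy = word_label[(w, y)] / total
    p_w = word[w] / total
    p_y = label[y] / total
    return math.log(p_wy / (p_w * p_y))
```

Here "drug" occurs only under the "chem" label, so its PMI with "chem" is high (log 2 with two equally sized labels), whereas "the" occurs under both labels and its PMI is near zero.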
Related Work

Efficient Pretraining for PLMs
Many attempts have been made to improve the efficiency of pretraining. Parallel architectures (Shazeer et al., 2018; Wang et al., 2020b) are commonly used in pretraining. However, parallelism would not actually reduce computational costs in terms of FLOPs. For most Transformer-based PTMs, as the input sequence grows longer, efficiency is limited by the computation of attention weights. Choromanski et al. (2020) and Wang et al. (2020a) design low-rank kernels to theoretically approximate the original attention weights. Child et al. (2019) and Roy et al. (2021) introduce sparsity into attention mechanisms by limiting the view of each token to a fixed size and separating tokens into several chunks. ELECTRA (Clark et al., 2019) applies replaced token detection, which is more challenging. PMI-Masking (Levine et al., 2020) selectively masks tokens based on their importance. However, their improvements are limited, with less than an order of magnitude reduction in computational expenses (measured in FLOPs). Orthogonal to these works, ISS reduces training-data redundancy via the influence of pretraining data points.

Further Pretraining in NLP
Continual pretraining can effectively improve PTMs' performance on new domains or downstream tasks (Gururangan et al., 2020). To achieve this, most previous works continually optimize the pretrained model parameters on large corpora collected from the target domain (e.g., scientific (Beltagy et al., 2019), finance (Araci, 2019), and bio-medical (Lee et al., 2020)). However, further pretraining on a large amount of unlabeled data is computationally expensive, and collecting unlabeled data at such a scale may not be feasible for certain domains. In contrast, ISS does not need any additional domain data and utilizes only the general corpus. In addition, our approach can also be employed for further pretraining, as we demonstrate in our experiments.

Dataset Pruning
Dataset pruning is closely related to coreset selection methods (Mirzasoleiman et al., 2020; Agarwal et al., 2004), which try to identify the most representative training samples. Several works (Killamsetty et al., 2021; Rebuffi et al., 2017; Toneva et al., 2018) score each training sample according to criteria such as diversity (Sener and Savarese, 2017) and forgetfulness (Toneva et al., 2018), and then rank and select the training data by the computed score. Recently, Yao et al. (2022) proposed TLM for transfer learning, which retrieves a subset of the pretraining corpus that is more similar to the task corpus. However, these methods are heuristic and lack a generalization guarantee; they also ignore the influence interactions between the selected samples. Our proposed method overcomes these shortcomings.

Conclusion
In this paper, we propose Influential Subset Selection for language models, which aims to reduce the computational costs of pretraining at the data level. Specifically, we introduce the influence function to measure the importance of each pretraining sample. Moreover, we design a simple, efficient, gradient-matching-based method for influence estimation, which significantly speeds up estimation.
Experiments on various datasets demonstrate that our method achieves comparable performance with PTMs, with a reduction of training FLOPs by three orders of magnitude.

Limitations
There are two potential limitations to our method. First, ISS trades generality for efficiency by learning only task-specific representations; consequently, the resulting model may not be suitable for other tasks. Second, our method is hardly practical for few-shot or zero-shot learning, since few or no task data are available as anchor points. We leave these issues to future work.

Ethics Statement
Pretraining from scratch and further pretraining methods such as DAPT need a large-scale unlabeled corpus to learn general knowledge, which results in greenhouse emissions due to energy consumption (Strubell et al., 2019). However, as shown in Section 5, our efficient algorithm greatly increases the data efficiency of PTMs, reducing these harms as well as the various harms associated with labor for data collection. Our work introduces a new subset selection algorithm but leverages pre-existing datasets and models.
Overall, this work inherits some of the risks of the original work upon which it is built, such as bias (Bender et al., 2021) or privacy leakage (Carlini et al., 2021).


Figure 4: Visualization of sentence representation on ChemProt using the t-SNE algorithm (Van der Maaten and Hinton, 2008). Each color denotes a class.
Algorithm 1: Influential Subset Selection for Language Model
Require: pretraining corpus D; task training set T_t and validation set T_v; learning rate α; initial subset S; candidate size k.
  Randomly initialize the network parameters θ, φ_p, φ_t
  repeat
    Update θ, φ_p, φ_t by minimizing the pretraining loss on S and the task loss on T_t
    For each candidate z_p in D, estimate its influence on T_v by gradient matching
    Sort the pretraining samples by influence
    Add the top-k influential samples to S
  until the selection budget is reached
  Return the influential subset S

Table 2: Statistics of various target datasets.
† indicates high-resource settings. 1 For ISS, data size is reported by averaging over eight tasks. 2 The training compute (FLOPs) is calculated as (6 × Training Tokens × Parameter Size), as in Kaplan et al. (2020). 3 ISS utilizes 20% of the TLM-size data, so we implemented the TLM model with the same-size version. 4 For a fair comparison, we implement TLM (Large) with BERT-base and the TLM large-scale dataset.

Table 3: Evaluation results for ISS at three different training scales.
Figure 3: Comparison of ISS and TLM at different pretraining steps. The experiments were conducted on the small-scale dataset; notably, the data scale of TLM was five times larger than ours.

Table 5: Results on the development set with different data scales.

Table 6: Comparison of the frequency of task-influential words in different subsets.

Table 7: Comparison of the speed of computing influence using different layers. The experiments were conducted on the ChemProt dataset.
1 C_TLM-small and C_TLM-large are provided by TLM (Yao et al., 2022). 2 ISS only uses a tiny subset of the source general corpus for training; we list the data size actually used for ISS training.

Table A.1: Detailed hyper-parameters for ISS of different scales for each task. 1 C_S2ORC is provided by S2ORC (Lo et al., 2020). 2 ISS only uses a tiny subset of the source general corpus for training; we list the data size actually used for ISS training.

Table A.2: Detailed hyper-parameters for ISS in further pretraining.