INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models

A salient characteristic of pre-trained language models (PTLMs) is the remarkable improvement in their generalization capability, and the emergence of new capabilities, with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models pushing the state-of-the-art. It is, however, imperative to realize that this inevitably leads to prohibitively long training times, exorbitant computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, with scant attention being paid to optimizing the utility of training data. The key question we ask is whether it is possible to train PTLMs using only highly informative subsets of the training data while maintaining downstream performance. Building upon recent progress in informative data subset selection, we show how to employ submodular optimization to select highly representative subsets of the training corpora and demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of the data. Further, a rigorous empirical evaluation shows that the resulting models achieve up to $\sim99\%$ of the performance of fully-trained models. We have made our framework publicly available at https://github.com/Efficient-AI/ingenious.


Introduction
Pre-trained language models (PTLMs) (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2020; Brown et al., 2020; Raffel et al., 2020) have revolutionized the field of natural language processing (NLP), becoming the default choice for a wide array of NLP tasks. The versatility of PTLMs, however, is accompanied by significant costs. For instance, it costs an estimated $12 million to train GPT-3 (Brown et al., 2020), with roughly 1.2 million pounds of CO2 emissions (Kahn, 2021). Megatron-Turing NLG (Smith et al., 2022) is a 530 billion parameter PTLM, which is thrice the size of GPT-3, is trained on 4480 A100 GPUs, and yields close to 1% performance improvement over GPT-3. By continually increasing the size of PTLMs and pre-training corpora to improve generalization, significant additional resources and energy are consumed, resulting in dire environmental consequences (Sharir et al., 2020). Further, such large-scale resource utilization and the costs associated with PTLMs create an uneven playing field for small organizations and universities, which operate with significant resource constraints. Hence, a crucial step towards developing responsible, fair, and GreenAI (Schwartz et al., 2020) involves minimizing the inefficiencies and costs of training these models.
Significant efforts toward improving the efficiency of PTLMs have ventured in directions such as optimizing the model architecture (Chen et al., 2020; Gordon et al., 2020; Zafrir et al., 2021), modifications to the training pipeline (Izsak et al., 2021; Shen et al., 2022) and task (Schick and Schütze, 2021), sample-efficient masking techniques for improved convergence (Bitton et al., 2021), and leveraging contextual knowledge to reduce model size (Kaur et al., 2022). In this work, driven by the observation that the scale of the pre-training corpus contributes significantly to the training costs of PTLMs, we explore the feasibility of training PTLMs using highly informative subsets of the corpus. Recent studies have demonstrated the feasibility of informative data subset selection for efficient deep model training on images (Mirzasoleiman et al., 2020; Killamsetty et al., 2021a,b,c; Pooladzandi et al., 2022) in both supervised and semi-supervised settings. In light of this, the key question we attempt to answer is: Can we efficiently pre-train language models using highly informative subsets of the training corpus without compromising performance?
The first step in answering the above question is identifying informative (or representative) subsets of the underlying training corpus such that they maximize the representation of the remaining samples in the corpus. Intuitively, given a set of sentences, the subsequent addition of sentences similar to existing sentences in the set yields diminishing returns; more information can be gained by adding diverse, dissimilar sentences. While the classical subset selection problem is NP-hard, we can leverage the diminishing-gains property of submodular functions (Fujishige, 2005) and frame subset selection as a submodular maximization problem. Several recent works (Wei et al., 2015; Mirzasoleiman et al., 2020; Kothawade et al., 2021; Karanam et al., 2022; Maheshwari et al., 2020) have formulated subset selection as maximization of a submodular objective. However, applying existing subset selection frameworks to PTLMs is nontrivial given the scale of corpora typically used for pre-training (e.g., Wikipedia and Common Crawl, consisting of hundreds of millions of sequences and billions of tokens). Most existing methods rely on per-sample gradients, which are expensive to compute, and to the best of our knowledge, none of the previous works have considered subset selection for such large datasets.

Our contributions:
We propose the informative data subset selection task for efficient pre-training of PTLMs and present INGENIOUS, a framework for subset selection using submodular optimization (Section 3). We show how to overcome the scalability challenge posed by typical large-scale pre-training corpora, employing scalable sentence feature encoders to obtain per-sample features relevant for subset selection, along with various engineering techniques to scalably select subsets from large-scale datasets (Section 3). We use INGENIOUS to pre-train BERT and GPT-2 and evaluate the performance of the resulting models on downstream tasks (Section 4). A rigorous empirical evaluation reveals that models pre-trained with INGENIOUS retain up to ≈ 99% of the performance of models pre-trained on the full dataset. Figure 1 summarizes the cost-savings vs. performance trade-off achieved by INGENIOUS for BERT pre-training. We also present thorough ablation studies revealing the impact of the various design choices and parameters involved. We further evaluate the models trained by INGENIOUS in terms of their knowledge retention capabilities and show how INGENIOUS can be used to accelerate pre-training of domain-specific language models such as BioBERT (Section 4.4). Finally, we discuss the inferences that can be drawn from our work and the limitations of our framework, and lay out directions for further improvement (Section 5).

Related Work
Knowledge distillation and pruning based methods (Sanh et al., 2019; Jiao et al., 2020; Muhamed et al., 2021) pre-train a smaller variant of a PTLM (such as BERT) with lower capacity, using the full model as the teacher network. Even though lighter versions such as DistilBERT (Sanh et al., 2019) retain ≈ 97% of the performance with up to 60% faster inference, the PTLM still needs to be completely pre-trained first so that the lighter version can be distilled from it. Thus, the efficiency gains are restricted to fine-tuning and inference. Other methods prune the architecture by forcing low-magnitude weights to zero during pre-training (Chen et al., 2020; Gordon et al., 2020) as well as during fine-tuning (Zafrir et al., 2021).
Model architecture and training task optimizations: Schick and Schütze (2021) have shown that smaller PTLMs can achieve better performance by formulating the task input in cloze style. Izsak et al. (2021) proposed to optimize BERT pre-training through multiple optimizations related to data, model size, and optimizer choice. Shen et al. (2022) proposed a staged training mechanism in which a relatively smaller model is trained first and then used to initialize the full-capacity model at a later stage. Yao et al. (2022) identify relevant samples from the pre-training corpus based on their similarity with the task-specific dataset to train a task-specific PTLM followed by fine-tuning, thus inherently suffering from the limitation of pre-training a separate model for every downstream task.
Curriculum learning based methods employ the sequence length of training samples as a proxy for hardness. Typically, shorter (easier) sequences are presented in the initial stages of pre-training, followed by longer (harder) sequences at later stages (Nagatsuka et al., 2021; Li et al., 2022). However, such methods have been shown to perform well only in limited configurations with respect to the choice of language model, stage of pre-training, etc.

Hardware optimizations for PTLM Training:
The suite of Open Pre-Trained Transformers (OPT) (Zhang et al., 2022) requires 1/7th of the carbon footprint for pre-training compared to popular PTLMs such as GPT-3 (Brown et al., 2020) while achieving comparable few-shot generalization. OPTs leverage extensive data and tensor parallelism with high-memory GPUs (supporting large batch sizes), which are usually not easily accessible and can incur exorbitant costs.
Noticeably different from the aforementioned works, we explore making PTLM training more efficient by utilizing highly informative subsets of the training data. Consequently, our proposal effectively complements other optimization methods that target aspects such as model architecture and hardware enhancements.

The INGENIOUS Framework
We now present INGENIOUS, an informative data subset selection framework for pre-training language models. We summarize the training pipeline in Figure 2. We first describe the notation used to formulate the problem, followed by the details of the different steps involved in the framework.

Notation
We denote the unlabeled pre-training dataset by $U = \{x_j\}_{j=1}^{n}$, consisting of $n$ data points, each a variable-length sequence of symbols $\{s_i\}_{i=1}^{m}$ (these symbols could be words or character sequences such as sub-words). Let $S \subseteq U$ be the subset of the unlabeled dataset on which the language model is trained, and let the language model be parameterized by $\theta$. We subscript the changing variables, such as the model parameters $\theta$ and the subset $S$, with the timestep $t$ to denote their values at that timestep.

Problem Formulation
In its most general form, subset selection is defined as
$$S_t = \underset{S \subseteq U}{\arg\max}\; f(S), \quad (1)$$
where the subset $S_t \subseteq U$ at step $t$ is selected such that it maximizes a set function $f$.
While the above general subset selection problem is NP-hard, it becomes approximable when the function $f$ is submodular (Fujishige, 2005). A set function $f: 2^U \to \mathbb{R}$ is submodular if, for every $A \subseteq B \subseteq U$ and $e \in U \setminus B$, it holds that $f(A \cup \{e\}) - f(A) \geq f(B \cup \{e\}) - f(B)$, i.e., the marginal gain of adding an element diminishes as the set grows. We pose the data subset selection problem as a submodular maximization problem since it allows for easier optimization by employing different approximations (Nemhauser et al., 1978; Iyer and Bilmes, 2019). In order to choose a suitable submodular function, one must understand the characteristics of the subsets that are crucial for the end goal, which in our case is efficient learning. Previous works in computer vision have demonstrated that commonly used vision datasets contain many redundancies, and eliminating such redundant data samples does not affect the model's performance (Birodkar et al., 2019; Toneva et al., 2019; Paul et al., 2021; Sorscher et al., 2022). Further, one can achieve faster model training by using highly informative and representative data subsets (Kaushal et al., 2019; Mirzasoleiman et al., 2020; Sorscher et al., 2022). Please refer to Appendix B for more related work on submodularity based subset selection. Building upon these learnings from computer vision research, our primary requirement for the selected subset is that it should faithfully represent the training data and have minimal redundancy within itself.

Figure 2: INGENIOUS framework for informative data subset selection to pre-train language models. We warm-start pre-training for W steps to enable the LM to learn useful representations (step A). Owing to the size of the pre-training data, we divide the total number of samples (n) into P partitions (step B1), followed by selecting instances according to submodular gains (step B2) through probabilistic sampling (step C). We obtain a subset (of total size k) of representative samples from each partition, and the subset is updated periodically (step D) after R steps of training on the selected subset.

Overview of Approach
In order to select a representative subset as discussed above, we use Facility Location (Salhi, 1991; Krause and Golovin, 2014), a commonly-used submodular function closely related to k-medoid clustering, which is defined as
$$f_{FL}(A) = \sum_{i \in U} \max_{j \in A} K_{ij}, \quad (2)$$
where $A$ is the subset being evaluated, $K$ is a pairwise similarity matrix, and $K_{ij}$ is the similarity between the $i$-th and $j$-th samples. Thus, our subset selection problem can be represented as:
$$S = \underset{A \subseteq U,\; |A| \leq k}{\arg\max}\; f_{FL}(A). \quad (3)$$
Here, $k$ represents the size of the subset $S$.
We would like to clarify that Equation (3) enables us to choose diverse samples such that each represents other samples in the corpus, rather than selecting similar samples. The optimization problem in Equation (3) is an instance of cardinality-constrained monotone submodular maximization, for which an approximate solution can be obtained by incrementally building the subset from scratch using algorithms such as Naive Greedy (Nemhauser et al., 1978), Lazy Greedy (Minoux, 1978), Stochastic Greedy (Mirzasoleiman et al., 2015), and Lazier-than-Lazy Greedy (Mirzasoleiman et al., 2015). We use the Lazier-than-Lazy Greedy optimizer, as it is the most computationally efficient, along with memoization (Iyer and Bilmes, 2019).
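As an illustration of this selection step, the following is a minimal sketch of stochastic ("lazier than lazy") greedy maximization of the facility location objective over a precomputed similarity kernel. The function and variable names are ours, not the authors' implementation, and the kernel is assumed to be nonnegative (e.g., shifted cosine similarities):

```python
import numpy as np

def fl_gain(K, covered, candidate):
    # Facility-location gain of adding `candidate`: total improvement in each
    # ground-set sample's best similarity to the selected set.
    return np.maximum(covered, K[candidate]).sum() - covered.sum()

def stochastic_greedy(K, k, eps=0.1, seed=0):
    """Approximately maximize facility location under a cardinality budget k."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    # Per-step random candidate pool size from Mirzasoleiman et al. (2015).
    pool = max(1, int(n / k * np.log(1.0 / eps)))
    covered = np.zeros(n)        # best similarity to any selected sample so far
    remaining = set(range(n))
    selected, gains = [], []
    for _ in range(k):
        cand = rng.choice(list(remaining), size=min(pool, len(remaining)),
                          replace=False)
        step_gains = [fl_gain(K, covered, c) for c in cand]
        best = int(cand[int(np.argmax(step_gains))])
        selected.append(best)
        gains.append(float(max(step_gains)))  # stored as importance scores later
        covered = np.maximum(covered, K[best])
        remaining.remove(best)
    return selected, gains
```

When the budget is set to the full data size, the stored gains provide the submodular ordering that the probabilistic sampling step described later builds on.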
The facility location function utilizes a pairwise similarity kernel $K$ (of size $|U| \times |U|$) between the data samples in $U$ to select representative subsets.
To estimate the kernel values, we compute the cosine similarity between feature representations of the data samples obtained using the LM itself. To ensure that the extracted representations are meaningful in the initial phase, we warm-start the model for W training steps, as suggested by Killamsetty et al. (2021a,c) (step A in Figure 2). Further, to ensure that the LM sees diverse data samples, we probabilistically sample data points based on the submodular ordering obtained from running the greedy algorithm (steps B and C in Figure 2) and update the subset after every R iterations (step D in Figure 2). This re-sampling procedure is repeated until the predetermined number of training steps is reached. Algorithm 1 summarises the steps involved, and in the following section, we describe the details of each step.
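The overall procedure can be summarized in a short sketch, where `select_subset` and `train_steps` are hypothetical helper callbacks standing in for the components described in this section:

```python
def ingenious_pretrain(model, corpus, total_steps, W, R, select_subset, train_steps):
    # Step A: warm-start on the full corpus so that LM features are meaningful.
    train_steps(model, corpus, W)
    step = W
    while step < total_steps:
        # Steps B/C: pick an informative subset using the current model's features.
        subset = select_subset(model, corpus)
        # Step D: train on the subset for R steps, then re-select.
        train_steps(model, subset, R)
        step += R
    return model
```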

Methodology Details
Feature Encoders for Similarity Computation: The selection of optimal representative subsets requires a similarity kernel that captures the intrinsic relationships between data samples. We explore dense feature representations extracted from the LM being pre-trained. Another possibility is to use sparse representations such as TF-IDF (Aizawa, 2003), owing to its success at capturing statistically important lexical features (Robertson et al., 2009).
We study the effect of using sparse feature representations (i.e., TF-IDF) and dense feature representations obtained from different layers of the LM in Section 4.3. Our experiments revealed that dense feature encoders yield the best results.
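As a concrete illustration of the sparse option, the following sketch builds a similarity kernel from hand-rolled TF-IDF features; for the dense option, the rows of the feature matrix would instead be hidden states from a chosen LM layer. This is a toy formulation for illustration, not necessarily the exact weighting scheme used in the paper:

```python
import numpy as np
from collections import Counter

def tfidf_matrix(docs):
    # Rows: documents; columns: vocabulary terms; entries: tf * idf.
    vocab = sorted({w for d in docs for w in d.split()})
    col = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w, c in Counter(d.split()).items():
            tf[r, col[w]] = c
    df = (tf > 0).sum(axis=0)           # document frequency per term
    idf = np.log(len(docs) / df) + 1.0
    return tf * idf

def cosine_kernel(X):
    # Pairwise cosine similarities between row vectors.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T
```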
Submodular Greedy Ordering based Data Selection: Having chosen the similarity kernel, we now describe how the subsets are selected (steps B and C in Figure 2), as defined by Equation (3). Approximate submodular maximization algorithms such as Lazier-than-Lazy Greedy start with an empty subset and incrementally add data points one by one until the size of the subset equals the budget $k$. If $S$ represents the subset selected so far and $e$ represents the next locally optimal data sample to be added, the submodular gain of $e$ is defined as $f(S \cup \{e\}) - f(S)$. While running the algorithm, we initially set the budget to the size of the entire dataset (say $M$) in order to obtain and store the submodular gain (step B2 in Figure 2) of each data sample at the time of its addition.
The key idea here is to use the submodular gain associated with each data sample as an importance score, convert the gains into a probability distribution using the second-order Taylor-softmax operation (de Brébisson and Vincent, 2016) (step C in Figure 2), and then sample a subset of the desired size (say $k$) from this distribution. Given the gains vector $\{g_1, g_2, \cdots, g_M\}$, the Taylor-softmax operation converting it into a probability distribution $P$ can be specified as
$$P_i = \frac{1 + g_i + g_i^2/2}{\sum_{j=1}^{M} \left(1 + g_j + g_j^2/2\right)}.$$
Using the probability distribution $P$ for sampling ensures that samples with a high importance score are selected with greater probability, while still allowing the LM to explore samples with a low importance score during training to prevent overfitting. We reuse this probability distribution to sample new subsets of size $k$ every $R$ steps by sampling $k$ points without replacement (step D in Figure 2).
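A minimal sketch of this importance-sampling step (the names are ours): the second-order Taylor expansion of the exponential, $1 + g + g^2/2$, is strictly positive, so normalizing it yields a valid distribution over samples:

```python
import numpy as np

def taylor_softmax(gains):
    # Second-order Taylor approximation of exp(g): 1 + g + g^2/2 (always > 0),
    # normalized into a probability distribution over the samples.
    t = 1.0 + gains + 0.5 * gains ** 2
    return t / t.sum()

def resample_subset(gains, k, rng):
    # Sample k indices without replacement, favoring high-gain samples while
    # still occasionally exploring low-gain ones.
    p = taylor_softmax(np.asarray(gains, dtype=float))
    return rng.choice(len(p), size=k, replace=False, p=p)
```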
Recall that we require a similarity kernel of size $|U| \times |U|$; the memory required for storing such a kernel is practically infeasible at LM-corpus scale. We now describe how we scale INGENIOUS to handle the size of the pre-training datasets used for LMs.

Partitioning based Efficient Subset Selection:
To minimize memory consumption, instead of constructing a probability distribution over the entire unlabeled set directly, we first partition (step B1 in Figure 2) the unlabeled set into $N_P$ random blocks of equal size (i.e., each partition has size $\frac{|U|}{N_P}$) and construct a probability distribution $P_i$ over each data block $U_{p_i}$ by applying the Taylor-softmax to the submodular gains within that block. We then use the constructed probability distribution $P_i$ over each data block $U_{p_i}$ to sample a subset $S_i$ of size $k/N_P$ from the block without replacement. We compute the final subset as the union of the subsets from all partitions:
$$S = \bigcup_{i=1}^{N_P} S_i.$$
Partitioning the unlabeled set allows us to construct similarity kernels of size $\frac{|U|}{N_P} \times \frac{|U|}{N_P}$ only, thereby reducing the similarity kernel memory usage by around $N_P^2$ times. We discuss the effect of the partition size in Appendix K. In order to maximize the utilization of available resources, we can construct the probability distributions over the blocks in parallel. As in recent work (Mittal et al., 2022), the partitioned facility location objective can be shown to be a lower bound of the original facility location objective being maximized. It should be noted that memory utilization also increases with the number of parallel processes: when $N_{PP}$ subsets are selected from partitions in parallel, the memory usage due to similarity kernels is of the order $O\!\left(N_{PP} \cdot \left(\frac{|U|}{N_P}\right)^2\right)$. In our experiments, we set $N_{PP} = 100$ processes.
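The partitioning scheme can be sketched as below; the `select_in_block` callback is a stand-in for the per-block greedy selection and Taylor-softmax sampling, and since blocks are independent they can be handed to parallel workers in practice:

```python
import numpy as np

def partitioned_selection(n, n_partitions, k, select_in_block, rng):
    # Shuffle indices and split into equal-sized random blocks, so each block
    # only ever needs an (n / n_partitions)^2 similarity kernel.
    perm = rng.permutation(n)
    blocks = np.array_split(perm, n_partitions)
    per_block = k // n_partitions
    subset = []
    for block in blocks:
        # The union of per-block selections forms the final subset.
        subset.extend(int(i) for i in select_in_block(block, per_block))
    return subset
```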

Experiments and Results
We use BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and a domain-specific version of BERT, BioBERT (Lee et al., 2020), as the underlying LMs; specifically, BERT-Base (110M) and GPT-2 Small (124M). For BERT, we use English Wikipedia in conjunction with BooksCorpus as the pre-training corpora and employ the MLM and NSP tasks for pre-training, following Devlin et al. (2019). We perform pre-training using a batch size of 1024 for 1,000,000 steps in the case of vanilla BERT. We perform ablations over data subset sizes and number of pre-training steps for INGENIOUS-enabled pre-training and find a subset size of 25% (Appendix J) with 250,000 pre-training steps (25%) to be an optimal choice. We set the value of R to 25,000 steps. We refer the reader to Appendix G for further implementation details. For INGENIOUS-enabled pre-training of BioBERT and GPT-2, we discuss the implementation details and experimental results in Sections 4.4 and 4.5, respectively.

Notably, INGENIOUS remains competitive even on the CoLA task (Table 1), which is deemed to be the most difficult task (Geiping and Goldstein, 2022) in the GLUE benchmark. This implies that the subsets selected by INGENIOUS capture the important and highly informative signals in the underlying data, resulting in robust performance on challenging tasks as well.
Further, to compare the different methods at different stages of pre-training, we obtain the corresponding checkpoints and fine-tune them on the GLUE tasks. For this setting, we present a comparison of vanilla BERT pre-training against INGENIOUS in Figure 3. We plot the downstream performance for all the methods; INGENIOUS shows better performance than all the baselines at 250K steps of pre-training, and beyond 250K steps the trend continues consistently (Figure 3, top).

Table 2: Knowledge retention of different models as measured by the LAMA probe. We report P@1 scores for all four subtasks in LAMA.

Knowledge Retention with INGENIOUS
Large PTLMs, when trained on a sufficiently large corpus, store various types of knowledge implicitly in their parameters (AlKhamissi et al., 2022). Since INGENIOUS uses only a subset of the whole data for pre-training, it is natural for the resulting model to contain less knowledge in its parameters, but how does it compare with vanilla BERT pre-training and the other baselines when it comes to knowledge retention? To answer this question, we use the LAMA benchmark (Petroni et al., 2019), a probe designed to analyze the factual knowledge present in PTLMs. LAMA is derived from four distinct knowledge sources (Google-RE, T-REx, ConceptNet, and SQuAD) from which cloze sentences are created using the facts contained in the respective sources. The PTLM has to predict the fact tokens in place of the mask tokens in the cloze sentences. We summarize the results in Table 2. We note that INGENIOUS suffers minimal loss in knowledge retention with respect to fully pre-trained vanilla BERT on all tasks. Further, its decrease in performance is smaller than that of the baselines on most tasks, which suffer a more severe degradation. Intuitively, we attribute this to the ability of INGENIOUS to select highly informative subsets from the corpus while excluding redundant information.

Effect of Embedding Representations
Different BERT layers have been shown to capture different information: lower layers capture word order (Rogers et al., 2020), middle layers capture syntactic information (Hewitt and Manning, 2019; Jawahar et al., 2019), and later layers capture task-specific information (Kovaleva et al., 2019; Hao et al., 2019). We vary the layer used to extract sample representations and report the performance on GLUE in Table 3. We observe that layer-9 features yield the best results. Further, in Table 3, we compare the effect of using TF-IDF as sample representations and contrast them against dense features (BERT layer 9). We observe that dense embeddings perform better than shallow TF-IDF features. We also report the effect of subset size and number of partitions in Appendices J and K.

INGENIOUS for Domain-Specific PTLM -BioBERT
We evaluate the performance of BioBERT (Lee et al., 2020) pre-trained on subsets selected through INGENIOUS and compare it with vanilla BioBERT by fine-tuning on biomedical datasets for the Named Entity Recognition (NER) and Relation Extraction (RE) tasks. For vanilla BioBERT, we start with a pre-trained BERT model and further pre-train it on the PubMed abstracts dataset for 200,000 steps (as recommended by the original study). Please refer to Appendix I for further implementation details. We present the performance convergence plots of vanilla BioBERT vs. training time using INGENIOUS with a subset size of 25% in Figure 5, which shows that INGENIOUS achieves much faster convergence than vanilla BioBERT during the initial stages of pre-training.

INGENIOUS for GPT-2 Pre-training
We also pre-train GPT-2 (Radford et al., 2019) using INGENIOUS. We estimate the mean accuracy for GLUE fine-tuning (averaged over 20 runs) and the zero-shot accuracy on the BBQ Lite generative task. Please refer to Appendix H for implementation details. We plot the performance (see Figure 6) obtained for the above benchmarks against checkpoints at different pre-training stages (steps).

Conclusions
We presented INGENIOUS, a framework for efficient pre-training of language models using highly informative data subsets selected via a submodular optimization based algorithm. We described how it can be scaled to language model pre-training corpora and showed its effectiveness through rigorous empirical evaluation. Our future work will explore exploiting external knowledge bases to identify and reduce redundancies in the corpus, and studying multi-modal training, where redundant information can be spread across different modalities.

Limitations
In terms of limitations, the submodular maximization based on estimating pairwise sample similarity can be constrained by memory limitations and might require high CPU RAM capacity. Further, we acknowledge that, owing to resource limitations, our experiments are performed on relatively smaller PTLMs compared to GPT-3, OPT, or PaLM. Within our resource constraints, we have tried our best to perform extensive experiments and ablation studies to inform our design choices.

Ethical Considerations
We believe that INGENIOUS has a significant positive impact on society, since it makes pre-training of LMs compute-efficient, thereby reducing CO2 emissions and energy costs.

B Submodular Data Subset Selection

Submodular optimization has been successfully employed for data subset selection in various applications such as speech recognition (Wei et al., 2014b,a; Mittal et al., 2022), machine translation (Kirchhoff and Bilmes, 2014), active learning (Wei et al., 2015; Kothawade et al., 2021), and efficient deep learning (Kaushal et al., 2019; Killamsetty et al., 2022; Pooladzandi et al., 2022). Another active area of research is selecting representative subsets of data, also known as coresets (Feldman, 2020).
A coreset is a weighted subset of the data that closely approximates certain desirable properties of the entire dataset (e.g., the loss function) (Feldman, 2020). Coreset selection has been shown to benefit a host of geometric problems such as k-means and k-median clustering (Har-Peled and Mazumdar, 2004) and, in recent times, has been used successfully for efficient Bayesian inference (Campbell and Broderick, 2018) and for improving training efficiency (Mirzasoleiman et al., 2020; Killamsetty et al., 2021a). Such informative data subset selection has shown remarkable promise for efficient and robust training of deep models (Killamsetty et al., 2021b,c). We direct the reader to the survey by Bilmes (2022) for a detailed review of submodularity and subset selection for ML.

C Datasets
For

J Subset size for efficiency gains
We study the effect of the size of the subset selected through INGENIOUS for pre-training BERT. In Table 5, we analyse subset sizes of 10%, 15%, 20%, 25%, and 30% and evaluate the fine-tuning performance on GLUE. While lower subset sizes (10-20%) result in inferior performance, since the LM is shown less information, optimal performance is observed when 25% of the pre-training corpus is used; hence, we report the corresponding results in Table 1.

K Partitions for efficient subset selection
As discussed in our approach, we divide the pre-training dataset into partitions. In Table 6, we analyse the impact on GLUE performance as the number of partitions is varied. Using the fewest partitions (1500) is found to yield optimal performance. This aligns with the intuition that fewer partitions enable better subset selection, since more samples are present in a single partition, allowing more representative samples to be selected overall.

L Few Examples of informative texts sampled by INGENIOUS
We summarize the three types of redundancies that we found in our analysis of selected subsets. More examples can be found at https://github.com/Efficient-AI/ingenious.
• Type 1: Same information conveyed by multiple sentences in different documents.
  - Sentence 1: "separate sovereign countries but acted as a single bloc in foreign policy and security issues.the proposed union was being discussed by a joint scandinavian committee during the winter of 1948 -1949, but the cold war tension between the united states and the soviet union, and preparations for a western alliance that would result in the north atlantic treaty overshadowed the effort.when it became"
  - Sentence 2: "they would remain separate sovereign countries but act as a single block in foreign policy and security issues.the proposed union was discussed by a joint scandinavian committee during the winter of 1948 -1949, but in the end the cold war tension between the united states and the soviet union and preparations for a western alliance that would result in"
• Type 2: Duplicates in the corpus.
  - Sentence 1: "after we'd been handed our menus.i always get the frozen hot chocolate."frozen hot chocolate?it's really a thing?i thought they just made that up." no," i said, pointing to the spot on her menu.see? it's right there."so, do you order anything else?" cake."she looked at my deadpanned face and laughed.so, we're"
  - Sentence 2: "what's good here?"mia asked me after we'd been handed our menus.i always get the frozen hot chocolate."frozen hot chocolate?it's really a thing?i thought they just made that up." no," i said, pointing to the spot on her menu.see? it's right there."
• Type 3: Recurring patterns of text.
  - Sentence 1: "according to the united states census bureau, the village has a total area of, all land.demographics 2010 census as of the census of 2010, there were 377 people, 159 households, and 101 families residing in the village.the population density was.there were 176 housing units at"
  - Sentence 2: "according to the united states census bureau, the village has a total area of, all land.demographics 2010 census as of the census of 2010, there were 801 people, 323 households, and 225 families living in the village.the population density was.there were 358 housing units at"

Figure 1 :
Figure 1: Cost-savings vs. performance trade-off achieved by INGENIOUS for BERT pre-training. We contrast the accuracy degradation with the cost savings relative to vanilla BERT pre-training on the entire dataset. We observe 4.35× cost savings with a 2.1% accuracy drop, and 2.33× cost savings with a 0.67% accuracy drop.

Figure 3 :
Figure 3: Comparison of INGENIOUS with vanilla BERT on GLUE performance vs. pre-training steps (top) and cost (bottom) using checkpoints obtained at intermediate pre-training stages.

Figure 4 :
Figure 4: INGENIOUS is found to outperform the baselines even on extended training up to 1M steps (top).

Pre-training through informative subsets enables BERT to achieve, at 250K steps, a performance level which vanilla pre-training reaches only after over 350K iterations. Similarly, for any given pre-training cost, INGENIOUS yields a better GLUE score than the baselines (Figure 3, bottom). Further, we observe that INGENIOUS consistently outperforms the baselines even when training is extended to 1 million steps (the maximum number of training steps prescribed by Devlin et al. (2019) for vanilla BERT pre-training), as shown in Figure 4.

Figure 5 :
Figure 5: Plots (a) and (b) are the convergence results comparing the average F1 score (over three runs) against wall-clock time for vanilla BioBERT and BioBERT trained using INGENIOUS with a 25% subset. We observe that INGENIOUS achieves much faster convergence than vanilla BioBERT (i.e., full training).

Figure 6 :
Figure 6: Comparison of INGENIOUS with vanilla GPT-2 pre-training at different pre-training stages. Pre-training on INGENIOUS subsets enables GPT-2 to consistently achieve a better GLUE score.
Figure 6 (left and right) shows that INGENIOUS performs consistently better than vanilla GPT-2 pre-training on GLUE and BBQ Lite, respectively, at different stages of pre-training, indicating better convergence.

Table 3 :
Ablation study varying the embedding representation used for selecting subsets. We report the mean GLUE score to compare INGENIOUS variants.
It makes pre-training of LMs compute-efficient, thereby reducing CO2 emissions and energy costs. Nonetheless, the INGENIOUS framework is susceptible to biases and toxic words within the pre-training corpora, as it relies on standard pre-training datasets. An exciting future direction of this research is to investigate whether targeted subset selection could be used to filter out toxic words, as well as phrases that promote cultural stereotypes and biases, from the pre-training corpora before LM pre-training.

Table 4 :
Comparison of pre-training cost and fine-tuning performance on GLUE tasks (averaged over 20 runs) for BERT. We report the difference relative to full pre-training of vanilla BERT in brackets for cost and avg. GLUE score. INGENIOUS achieves 98.6% of fully pre-trained BERT performance while reducing the pre-training cost to ∼28%.

Table 5 :
Ablation study varying the size of the selected subsets. We report the mean GLUE score to compare INGENIOUS variants.

Table 7 :
Comparison of validation set losses during pre-training. INGENIOUS achieves almost the same validation set loss as vanilla BERT.