TokenDrop + BucketSampler: Towards Efficient Padding-free Fine-tuning of Language Models

Introduction
Language Models (LMs) derived from attention-based Transformer networks have significantly advanced the field of Natural Language Processing (NLP), with recent models showing remarkable performance on a wide range of tasks (Bubeck et al., 2023). LMs are typically characterized by large model sizes and are pre-trained on very large text corpora. Pre-trained LMs are subsequently fine-tuned to solve a range of downstream tasks. While pre-training is by far the most expensive step in LM creation, and hence has attracted the most attention, we observe that the relatively high frequency of fine-tuning makes it an important challenge in its own right. Data available from public-domain pre-trained LMs such as BERT (Devlin et al., 2019) and OPT (Zhang et al., 2022) suggests that these LMs have been fine-tuned millions of times by different users on a diverse range of downstream tasks. Further, LMs are notoriously sensitive to the initialization of the task-specific final layer and the order in which training data is presented during fine-tuning (Dodge et al., 2020). As a result, multiple fine-tuning runs with different hyperparameters are often required to achieve high accuracy on a given downstream task. Finally, due to the large (and still growing) model sizes of LMs, even a single fine-tuning run is compute-intensive and can take several GPU-days. In summary, while the enormous computational costs of pre-training have attracted a lot of attention, it is likely that comparable, if not greater, compute time and energy have been spent on fine-tuning since the inception of LMs.
While several prior efforts have focused on reducing the costs of pre-training, relatively little attention has been paid to the computational challenges of fine-tuning. We observe that LM fine-tuning presents a unique set of challenges. First, fine-tuning is performed on variable-length text sequences with a significant spread in lengths. When batches of variable-length sequences are generated for fine-tuning, shorter sequences in a batch are "padded" to the length of the longest sequence in the batch by adding padding tokens. However, computations on padding tokens are useless and adversely affect throughput during fine-tuning. Second, fine-tuning is performed in a supervised manner and requires expensive human annotations. As a result, fine-tuning datasets are several orders of magnitude smaller than pre-training datasets. In addition, when overparameterized LMs are trained on small task-specific datasets, overfitting leads to sub-optimal generalization performance (Fig. 1), even with the use of popular regularizers such as Dropout (Srivastava et al., 2014). Consequently, fine-tuning is performed over a very small number of epochs (typically ≤5). Finally, we observe that fine-tuned LMs are adversely impacted by even minor grammatical errors in inputs during inference when only grammatically perfect sequences are used for fine-tuning (Fig. 2). This sensitivity is problematic since the assumption of seeing only grammatically correct sequences during inference does not always hold in real-world scenarios, especially when LM inputs are provided by users with different levels of language proficiency.

Figure 1: Training curve obtained from fine-tuning Roberta-base on RTE, a language understanding task with 2.5K training samples, with dropout rate = 0.1. We report loss averaged across 10 random seeds.
To address the aforementioned challenges, we present TokenDrop + BucketSampler, the synergistic combination of a new regularizer and batching method for accurate and efficient LM fine-tuning.
Our first contribution, BucketSampler, generates batches of samples with reduced spread in sequence lengths, thereby reducing computational overheads due to padding. In addition, the batch size is varied based on the lengths of the samples in the batch to maximize hardware utilization. Improved batching strategies have been incorporated into popular NLP libraries such as HuggingFace Transformers (Wolf et al., 2019) and Lingvo (Shen et al., 2019). However, these methods are optimized for training LMs from scratch. When applied to fine-tuning on small datasets for very few epochs, we find that these prior batching methods lead to a significant drop in accuracy. BucketSampler includes key algorithmic enhancements that tune the batch size and learning rate schedules, thereby enabling fine-tuning to achieve high accuracy while also maintaining high hardware utilization.
Our second contribution is TokenDrop, a novel regularizer that identifies and drops a random subset of insignificant tokens from each sequence in every epoch. More tokens are dropped from the longer sequences in each batch, further reducing the need for padding (Fig. 3). Further, TokenDrop reduces overfitting by ensuring that the model does not see the same sequence repeatedly over the course of fine-tuning. As a side effect, it also improves the resilience of LMs to grammatical errors during inference by exposing the model to grammatically incorrect inputs during fine-tuning.
Since TokenDrop + BucketSampler improves fine-tuning efficiency by eliminating ineffectual computations, it can be combined with previously proposed approaches for parameter-efficient fine-tuning, such as freezing layers (Lee et al., 2019; Zaken et al., 2022) and the use of adapters (Houlsby et al., 2019), to achieve further efficiency gains. We summarize our main contributions as follows:
• We propose TokenDrop + BucketSampler, a framework for accurate and efficient LM fine-tuning.
• BucketSampler is a length-aware batching method that generates batches of sequences with similar lengths to reduce padding. BucketSampler incorporates optimizations that enable fine-tuning convergence while maintaining high throughput.
• TokenDrop is a regularizer that drops a random subset of insignificant tokens from each sequence in every epoch to prevent LMs from memorizing fine-tuning data, while also further minimizing the need for padding.
• We demonstrate that TokenDrop can be synergistically combined with BucketSampler to simultaneously improve both accuracy (up to 1.2%) and efficiency (up to 10.61×) of LM fine-tuning.

Method
TokenDrop + BucketSampler is a framework for accurate and efficient fine-tuning of pre-trained LMs.
The first component, TokenDrop, is a regularizer that randomly drops a subset of insignificant tokens from each sequence in every epoch. The second component, BucketSampler, is a length-aware batching method that generates batches of samples with lower spread in sequence lengths. We demonstrate how TokenDrop and BucketSampler can be synergistically combined (hence the name TokenDrop + BucketSampler) to simultaneously reduce overfitting and eliminate the need for padding tokens, resulting in both accuracy and speed improvements (Fig. 3).

TokenDrop
Fine-tuning datasets for LMs are often small due to the need for expensive human annotations. As a result, we find that pre-trained LMs quickly memorize the training samples during fine-tuning (Fig. 1). Dropout (Srivastava et al., 2014), which drops a random subset of neurons in each training epoch, is currently the most widely used regularizer. However, we find that Dropout is ineffective at preventing overfitting during fine-tuning of LMs (Fig. 1).
In addition, most fine-tuning datasets are composed only of sentences and phrases that are grammatically correct. As a result, minor errors in user-provided inputs during inference (such as missing punctuation marks) can degrade the quality of outputs produced by the LM (Fig. 2).
To address these challenges, we propose TokenDrop, a regularizer that drops a random subset of words from each sequence in every training epoch. Unlike Dropout, TokenDrop introduces data diversity between the different training epochs, making the model unlikely to see the same sequence twice over the course of fine-tuning and hence less likely to overfit. We find that the choice of words selected for dropping has a significant impact on training convergence, since the semantics of the input text sequence can change if important tokens are dropped (see Appendix C). For instance, if the word "not" in the sequence "the movie was not good" is dropped, the sentiment of the sentence changes from negative to positive. To overcome this challenge, TokenDrop only drops stopwords from sequences. Stopwords are words in any language that do not contribute to the meaning of a sentence, but are added to make sentences grammatically correct. For instance, the words "the" and "was" in the aforementioned sequence are stopwords. In effect, TokenDrop provides a stronger regularization effect compared to Dropout (see Appendix D), and also improves the efficiency of fine-tuning by reducing sequence lengths without affecting the meaning of the sequences. TokenDrop also improves the resilience of LMs to grammatical errors during inference by exposing the model to grammatically incorrect sequences during fine-tuning; in particular, TokenDrop generates grammatically incorrect sequences by pruning stopwords from grammatically correct ones.
The procedure for applying TokenDrop to a given dataset is described in Algorithm 1. Given a dataset and a list of stopwords, we identify and prune a random subset of stopwords in each sequence. The number of stopwords pruned in each sequence is determined by the tokens_to_drop(sequence) parameter. When a global TokenDrop_Rate (analogous to Dropout_Rate in Dropout) is used, i.e., the same fraction of stopwords is dropped in each sequence, then tokens_to_drop(sequence) = number_of_stopwords(sequence) * TokenDrop_Rate. We note that it is not necessary to use the same TokenDrop_Rate for all sequences, and a different fraction of stopwords can be dropped in each sequence, as described in Section 2.3. The complete list of stopwords used in our experiments is provided in Appendix C.
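To make the dropping procedure concrete, the following is a minimal Python sketch of per-sequence stopword dropping with a global TokenDrop_Rate, assuming whitespace tokenization and an illustrative stopword list; the function name token_drop and the list below are our own illustrations, not the paper's Algorithm 1.

```python
import random

# Illustrative subset of stopwords; the full list used in the paper is in Appendix C.
STOPWORDS = {"a", "an", "the", "is", "was", "of", "to", "and", "in", "it"}

def token_drop(sequence, tokendrop_rate, rng=random):
    """Drop a random subset of stopwords from a whitespace-tokenized sequence.

    With a global rate, tokens_to_drop = number_of_stopwords * TokenDrop_Rate,
    as described above. This is a sketch, not the reference implementation.
    """
    tokens = sequence.split()
    stopword_positions = [i for i, tok in enumerate(tokens) if tok.lower() in STOPWORDS]
    num_to_drop = int(len(stopword_positions) * tokendrop_rate)
    dropped = set(rng.sample(stopword_positions, num_to_drop))
    return " ".join(tok for i, tok in enumerate(tokens) if i not in dropped)

# A different random subset is dropped each time (i.e., in every epoch):
print(token_drop("the movie was not good", tokendrop_rate=1.0))  # -> "movie not good"
```

Because only stopwords are candidates for dropping, the meaning of the sequence is preserved even at high drop rates.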

BucketSampler
BucketSampler combines a sequence-length-aware, variable-batch-size batching strategy with algorithmic optimizations to enable faster fine-tuning. The method for generating batches with BucketSampler is illustrated in Algorithm 2. All samples in the dataset are divided into buckets such that sequences that fall into the same bucket have similar lengths. Each bucket is defined by a triplet (min_seq_len, max_seq_len, HW_Cap). Here, min_seq_len and max_seq_len denote the minimum and maximum sequence lengths of the bucket, respectively, i.e., all sequences whose lengths lie between min_seq_len (inclusive) and max_seq_len (exclusive) of a bucket fall into that bucket (lines 20-23 in Alg. 2). Then, batches are generated by only combining sequences from the same bucket (lines 24-32 in Alg. 2). In effect, BucketSampler reduces the spread of sequence lengths in a batch, thereby reducing the need for padding tokens. BucketSampler also provides support for variable batch sizes. In particular, since the peak memory requirements for processing a batch scale quadratically with sequence length due to the quadratic complexity of self-attention, we propose using large batch sizes when generating batches from buckets with small max_seq_len, and vice versa. The HW_Cap parameter associated with each bucket encodes the maximum batch size that can be used for generating batches from that bucket on a given hardware platform. HW_Cap is experimentally determined by profiling on-chip memory usage for different sequence lengths.
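As an illustration of the bucketing step (not the full Algorithm 2), the sketch below groups samples into length buckets and emits per-bucket batches; the bucket boundaries and HW_Cap values are placeholders, since in practice HW_Cap is profiled per hardware platform.

```python
from collections import defaultdict

# Placeholder buckets: (min_seq_len inclusive, max_seq_len exclusive, HW_Cap).
BUCKETS = [(0, 16, 800), (16, 32, 400), (32, 64, 160), (64, 128, 64)]

def bucket_batches(sequence_lengths):
    """Assign sample indices to length buckets and emit batches per bucket.

    `sequence_lengths` maps sample index -> tokenized length. Residual Batch
    Merging, BatchCap, and LRM (described below) are omitted from this sketch.
    """
    buckets = defaultdict(list)
    for idx, length in sequence_lengths.items():
        for bucket_id, (lo, hi, _cap) in enumerate(BUCKETS):
            if lo <= length < hi:
                buckets[bucket_id].append(idx)
                break
    batches = []
    for bucket_id, samples in buckets.items():
        cap = BUCKETS[bucket_id][2]  # maximum batch size for this bucket
        batches.extend(samples[i:i + cap] for i in range(0, len(samples), cap))
    return batches
```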
While the BucketSampler algorithm described above provides some improvements in fine-tuning efficiency, it leaves room for further improvement along two directions. (1) The constraint that batches can only be formed using samples from the same bucket leads to "residual" batches with small numbers of samples, adversely affecting hardware utilization. Therefore, we propose Residual Batch Merging (RBM) to merge residual batches from different buckets into larger batches. (2) From an accuracy standpoint, we find that the convergence of fine-tuning is adversely impacted when using (a) very large batch sizes, and (b) a single learning rate schedule with variable batch sizes, since fine-tuning is often performed on small datasets for a very small number of epochs. To overcome these challenges, we propose BatchCap to progressively increase the maximum batch size over the course of fine-tuning. We also propose Learning Rate Modulation (LRM) to dynamically scale the learning rate based on the batch size. We explain these optimizations in greater detail in the following subsections.
Residual Batch Merging (RBM): Peak hardware utilization is achieved when the number of samples in each bucket is an exact multiple of the bucket's HW_Cap. In practice, this condition is rarely satisfied, resulting in one batch in most buckets with batch_size < HW_Cap. We term these batches "residual" batches, and the hardware is under-utilized when processing them. To reduce their impact, we propose merging residual batches from different buckets into larger batches. The Residual Batch Merging (RBM) algorithm maximizes hardware utilization while also minimizing the number of additional padding tokens introduced as a result of merging sequences from different buckets (lines 5-18 in Algorithm 2). New batches are created by appending samples one-by-one from residual batches in each bucket, with buckets processed in increasing order of max_seq_len (lines 7, 11-12). When the number of samples in the new batch becomes an exact multiple of the bucket_batch_size of the bucket corresponding to the longest sequence in this new batch, the new batch is added to the list of batches used for fine-tuning (lines 13-15). Lines 8-10 handle corner cases in this merging process.

BatchCap: BatchCap limits the maximum batch size in each epoch according to an exponential schedule. Here, base_batch_size and scaling_factor are hyperparameters that control the batch size used in the first epoch of fine-tuning and the growth rate of the maximum batch size across epochs, respectively. We note that BatchCap leads to hardware under-utilization in the early epochs of fine-tuning. However, BatchCap is necessary to achieve convergence when fine-tuning with BucketSampler, and our exponential scaling rule ensures high utilization in the majority of epochs.
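A minimal sketch of such an exponential cap is shown below; the exact schedule used in the paper is not reproduced here, and the function batch_cap and its example values are illustrative assumptions.

```python
def batch_cap(epoch, base_batch_size, scaling_factor, hw_cap):
    """Maximum batch size permitted in a given (0-indexed) epoch.

    Assumed shape of the BatchCap schedule: start at base_batch_size, grow
    exponentially by scaling_factor each epoch, and never exceed the bucket's
    hardware limit hw_cap. This is a sketch, not the paper's exact rule.
    """
    return min(hw_cap, int(base_batch_size * scaling_factor ** epoch))

# Example: base_batch_size=32, scaling_factor=4 caps batches at 32, 128, 512
# over 3 epochs, subject to each bucket's HW_Cap.
print([batch_cap(e, 32, 4, 800) for e in range(3)])  # [32, 128, 512]
```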
Learning Rate Modulation (LRM): The use of BucketSampler leads to large variance in batch sizes during fine-tuning. For instance, when fine-tuning Roberta-Base on an NVIDIA RTX 2080 Ti GPU, HW_Cap = 2525 for max_seq_len = 5 and HW_Cap = 64 for max_seq_len = 128. Since the choice of learning rate is highly sensitive to the batch size used during training (Krizhevsky, 2014; Smith et al., 2018), we find that fine-tuning with BucketSampler fails to converge with a single learning rate schedule, even when a grid search is performed to find the best learning rate. We propose Learning Rate Modulation (LRM) to overcome the limitations of using a single learning rate schedule when training with variable batch sizes. LRM dynamically modulates the base learning rate based on the batch size of each batch, scaling the base_learning_rate for each batch as learning_rate(batch) = base_learning_rate * sqrt(batch_size(batch) / base_batch_size).
Here, base_learning_rate is the optimal learning rate schedule when training with a fixed batch size, where all batches (except the last batch) have batch_size = base_batch_size. The formula for computing learning_rate(batch) is derived from the square-root scaling law relating learning rate and batch size (Krizhevsky, 2014), a popular heuristic for parallelizing the training of deep neural networks with large batch sizes on GPU clusters: when the batch size is scaled by a factor k, the optimal learning rate is approximately scaled by sqrt(k). In this work, we apply this rule on a batch-by-batch basis, i.e., the learning rate for each batch is set based on the size of that batch.
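The square-root rule can be applied per optimizer step as in the sketch below; the training-loop snippet in the comments assumes a standard PyTorch-style optimizer with param_groups and is only illustrative.

```python
import math

def lrm_learning_rate(base_learning_rate, batch_size, base_batch_size):
    """Per-batch learning rate under LRM (square-root scaling).

    Implements learning_rate(batch) = base_learning_rate *
    sqrt(batch_size / base_batch_size), as defined above.
    """
    return base_learning_rate * math.sqrt(batch_size / base_batch_size)

# Sketch of use inside a fine-tuning loop (assumed PyTorch-style optimizer):
# for batch in batches:
#     lr = lrm_learning_rate(2e-5, len(batch), base_batch_size=32)
#     for group in optimizer.param_groups:
#         group["lr"] = lr
#     ... forward / backward / optimizer.step() ...
```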

TokenDrop + BucketSampler
When batches are generated using BucketSampler, the sequence lengths of all samples in a batch lie between the min_seq_len and max_seq_len of the bucket the batch was drawn from (except in merged residual batches). TokenDrop can be synergistically combined with BucketSampler to further equalize the lengths of all sequences in a batch by pruning stopwords from the longer sequences in the batch, thereby eliminating the need for padding. To achieve this, we propose defining TokenDrop_Rate on a sequence-by-sequence basis, rather than using a single global TokenDrop_Rate for all sequences. In particular, the number of stopwords to drop in a sequence is computed as tokens_to_drop(sample) = cardinality(sample) − min_seq_len(batch). Consequently, all sequences in a batch are pruned to min_seq_len(batch) by dropping a random subset of stopwords (Algorithm 3), thereby eliminating padding tokens (except in merged residual batches, where tokens_to_drop(sample) may be larger than num_stopwords(sample) due to the larger variance in sequence lengths across merged buckets).
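The per-sequence rule can be sketched as follows, assuming a whitespace-tokenized sequence and an illustrative stopword list; the helper equalize_sequence is our own illustration, not Algorithm 3 itself.

```python
import random

STOPWORDS = {"a", "an", "the", "is", "was", "of", "to", "and", "in", "it"}  # illustrative

def equalize_sequence(tokens, batch_min_seq_len, rng=random):
    """Prune random stopwords so the sequence shrinks towards min_seq_len(batch).

    tokens_to_drop = len(tokens) - batch_min_seq_len, per the rule above. In
    merged residual batches the sequence may contain fewer stopwords than that,
    in which case all stopwords are dropped and the remainder is padded as usual.
    """
    num_to_drop = max(0, len(tokens) - batch_min_seq_len)
    stopword_positions = [i for i, tok in enumerate(tokens) if tok.lower() in STOPWORDS]
    dropped = set(rng.sample(stopword_positions, min(num_to_drop, len(stopword_positions))))
    return [tok for i, tok in enumerate(tokens) if i not in dropped]
```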

Experiments and Results
We implement TokenDrop + BucketSampler in PyTorch using Huggingface Transformers (Wolf et al., 2019). We perform experiments on an NVIDIA RTX 2080 Ti GPU with 11 GB memory, and report results averaged across 10 runs with different random seeds. We perform 3 epochs of fine-tuning for our method and all baselines. The details of all hyperparameters used in our experiments are described in Appendix A. We note that TokenDrop is not used when fine-tuning on CoLA, since the task involves identifying whether a given sentence is linguistically acceptable, and TokenDrop makes linguistically acceptable sequences unacceptable.

TokenDrop + BucketSampler improves the accuracy and efficiency of fine-tuning
Classification tasks: We present results of fine-tuning the popular Roberta (Liu et al., 2019) and Electra (Clark et al., 2020) models on classification tasks in Table 1. We find that fine-tuning with TokenDrop + BucketSampler consistently produces more accurate models compared to conventional fine-tuning with random batches (RandomSampler). In addition, TokenDrop + BucketSampler also reduces the wall-clock fine-tuning time by up to 10.61× compared to conventional fine-tuning, with an average speedup of 5.9× across the 9 GLUE tasks and SQUAD. With RandomSampler, 38.9% of all tokens used for training are padding tokens; with TokenDrop + BucketSampler, this drops to just 0.2% (not exactly 0%, since padding is needed in merged residual batches). We find that the speedup from using TokenDrop + BucketSampler on a given task depends on two factors: (1) the sequence length histogram, and (2) the size of the dataset. We provide a detailed analysis of the relationship between the statistics of the fine-tuning dataset and the speedup achieved with TokenDrop + BucketSampler in Appendix E. We also provide supplementary results on fine-tuning Roberta-Large in Appendix B to demonstrate the benefits of TokenDrop + BucketSampler for fine-tuning larger models.

Generation tasks: We present results of fine-tuning the T5-small seq2seq model (Raffel et al., 2020) on text summarization using the XSum (Narayan et al., 2018) and CNN/DailyMail (Nallapati et al., 2016) datasets in Table 2. We find that fine-tuning with TokenDrop + BucketSampler improves the ROUGE-1 score by up to 0.3 points, while also reducing the wall-clock fine-tuning time by up to 8.62× compared to RandomSampler. We note that for generation tasks, TokenDrop is only applied to input sequences, and bucketing is performed based on input sequence lengths. As a result, padding is still necessary for the target sequences.
Resilience to minor grammatical errors in inputs: We find that training with TokenDrop significantly enhances the resilience of fine-tuned models to minor grammatical errors in inputs. For instance, when articles ('a', 'an', 'the') and punctuation marks are removed from the test sequences, the average accuracy on GLUE (except CoLA) and SQUAD drops by 5.2% (Roberta-Base), and the average ROUGE-1 score drops by 3.1 points (T5-small) for the baseline models. On the other hand, models fine-tuned with TokenDrop incur only a 0.3% drop in average accuracy and a 0.06-point drop in ROUGE-1, respectively, thereby demonstrating significantly higher resilience to minor grammatical errors. This enhanced resilience can also be observed in Fig. 4, where there is negligible loss in accuracy even when 40% of all stopwords in each sequence are randomly chosen and deleted during inference.

TokenDrop + BucketSampler enables accurate and efficient inference
While the primary objective of TokenDrop + BucketSampler is to improve fine-tuning efficiency, we describe in the following subsections how they can also be used to improve the efficiency of both real-time and server-mode inference.
Real-time inference (batch size = 1): Real-time inference workloads have strict latency requirements and bursty input rates, and hence, inputs are typically processed as soon as they arrive with a batch size of 1. To reduce the latency of real-time inference, we propose filtering out stopwords in the input text sequence by applying TokenDrop during inference as well. Inference-time TokenDrop offers a promising approach for accelerating real-time inference, enabling speedups of 2.2× when all stopwords are pruned (TokenDrop_Rate = 1 in Fig. 4). However, we find that models fine-tuned without TokenDrop suffer a large accuracy drop when TokenDrop is applied at inference time. On the other hand, models trained with TokenDrop exhibit significantly higher resilience to inference-time TokenDrop, enabling a 1.5× reduction in inference latency with no loss in accuracy, and a 2.2× speedup with < 1% loss in accuracy (Fig. 4). In addition, TokenDrop can be combined with progressive token pruning methods that prune the least-important tokens in each layer based on attention scores (Wang et al., 2021; Goyal et al., 2020) to achieve further efficiency gains.
Server-mode inference (batch size > 1): In the server-mode inference setting, inputs arrive simultaneously from several sources. Consequently, inputs are processed in batches, with the goal of maximizing throughput. In this setting, we propose utilizing TokenDrop + BucketSampler to batch the inference queries. Here, we set BatchCap(bucket) = HW_Cap(bucket) for all buckets to maximize hardware utilization, and hence, throughput. We find that using BucketSampler leads to a 4.5× speedup over random batching.
In addition, TokenDrop can also be synergistically combined with BucketSampler at inference time in models fine-tuned with TokenDrop to achieve 4.9× speedup with no loss in accuracy (Table 3).

Ablation: Breakdown of benefits from the different BucketSampler optimizations
We study the impact of the different BucketSampler optimizations on the accuracy and efficiency of fine-tuning in Table 4 and Figure 5. When no optimizations are used, BucketSampler achieves a 5.3× speedup, but incurs a substantial accuracy drop due to insufficient training convergence. When Residual Batch Merging (RBM) is used, "stray" batches from different buckets are combined to form larger batches. In addition to improving efficiency by reducing hardware under-utilization, RBM also improves accuracy by reducing variability in batch sizes, thereby enabling better convergence with a single learning rate schedule. BatchCap further improves accuracy by using small batch sizes in early epochs, thereby ensuring a sufficient number of weight updates to achieve training convergence. While BatchCap is necessary for achieving convergence, it leads to a small drop in training efficiency due to hardware under-utilization in early epochs. Finally, the use of LRM to dynamically adjust the learning rate for each batch ensures that fine-tuning with BucketSampler results in near-identical training curves (Fig. 5), and hence accuracy (Table 4), compared to fine-tuning with RandomSampler.

Related Work
Prior works have developed batching strategies for variable-length inputs to improve the efficiency of LM training. RandomSampler is the most commonly used batching technique, and is the default method in most NLP libraries. RandomSampler randomly selects samples from the training dataset to generate batches, and pads all sequences in a batch to the maximum sequence length in the batch. Consequently, training with RandomSampler requires substantial padding, and hence, substantial wasted computation. LengthGroupedSampler (introduced in Huggingface Transformers (Wolf et al., 2019)) sorts sequences in order of increasing length and generates batches of adjacent sequences in the sorted list. While LengthGroupedSampler reduces padding, it uses fixed batch sizes, resulting in hardware under-utilization. Packing (Krell et al., 2022) generates batches by concatenating different inputs along the sequence length dimension. However, we find that, unlike when training from scratch, the overheads of packing and the additional computations in attention layers (computing irrelevant scores, followed by masking to prevent cross-contamination between different sequences) are not amortized over the small number of fine-tuning epochs. Finally, TensorFlow's tf.data.bucket_by_sequence_length divides the training dataset into buckets based on sequence length and generates batches only from sequences in the same bucket, similar to BucketSampler. While packing and tf.data.bucket_by_sequence_length support variable batch sizes to maximize hardware utilization, they incur accuracy drops during fine-tuning. In summary, while prior works have demonstrated efficiency gains when training LMs from scratch, the unique characteristics of fine-tuning (small datasets, very few epochs) make these methods ineffective. As a result, TokenDrop + BucketSampler significantly outperforms prior methods in terms of both accuracy and fine-tuning efficiency (Table 5).

Conclusion
In this work, we presented TokenDrop + BucketSampler for accurate and efficient fine-tuning of LMs. TokenDrop prunes a random subset of stopwords from each sequence in every epoch to reduce overfitting, while BucketSampler generates batches of similar-length sequences with variable batch sizes to minimize padding and maximize hardware utilization.

Limitations
(1) TokenDrop is not universally applicable to all fine-tuning tasks. In particular, TokenDrop cannot be used when dropping stopwords can potentially change the labels associated with sequences. For instance, when fine-tuning an LM to identify whether a given sequence is grammatically correct (as in CoLA), dropping stopwords from sequences will make all sequences grammatically incorrect. (2)

Figure 6: Training curves obtained from fine-tuning Roberta-base. We report loss averaged across 10 random seeds. Dropout rate = 0.1 is used for all datasets when fine-tuning without TokenDrop. When fine-tuning with TokenDrop, Dropout rate = {0 for RTE, 0.025 for SST-2 and MNLI}, as listed in Table 6.

E Factors impacting speedup from using TokenDrop + BucketSampler on a given task

We find that the size of the fine-tuning dataset and the sequence length spread play key roles in determining the speedup achieved when using TokenDrop + BucketSampler on a given task. We analyze these relationships in the following subsections.

E.1 Dataset size
Both TokenDrop and BucketSampler add small overheads at fine-tuning time. When TokenDrop is used, stopwords are first identified in each sequence. Then, a random subset of stopwords is pruned from each sequence in every epoch. When BucketSampler is used, the training dataset is first divided into buckets. Then, batches are randomly generated from each bucket in every epoch, and residual batches from different buckets are merged. Finally, the generated batches are shuffled to randomly order batches from different buckets. As a result, once the buckets are generated at the start of fine-tuning, the other steps of BucketSampler are very similar to those in RandomSampler (with the exception of RBM, which accounts for a very small fraction of the runtime), and hence, the overheads are negligible. We analyze the overheads of TokenDrop + BucketSampler on three fine-tuning datasets of different sizes (small: RTE, with 2.5K training samples; medium: SST-2, with 67K training samples; large: MNLI, with 393K training samples) in Figure 7. We find that the most time-consuming parts of TokenDrop (identifying all stopwords in each sequence) and BucketSampler (dividing the dataset into buckets) are performed only once at the start of fine-tuning. On the other hand, the operations performed in each epoch account for only a small fraction of the total runtime. Consequently, the overheads of TokenDrop and BucketSampler are better amortized over the course of fine-tuning when training on larger datasets, leading to higher speedups.
Further accelerating hyperparameter search during fine-tuning: When hyperparameter tuning is necessary, it is sufficient to split the dataset into buckets and perform stopword identification only once, during the first epoch of the first fine-tuning run. The resulting buckets and stopword lists can then be re-used for subsequent runs with different hyperparameters. As a result, if we compare the wall-clock time taken to perform fine-tuning with 10 different random seeds, TokenDrop + BucketSampler achieves an average speedup of 6.8× over RandomSampler across the 9 GLUE tasks and SQUAD. We note that the speedups reported in Section 3 are not computed this way; instead, it is assumed that both steps (stopword identification and bucket generation) are performed in every fine-tuning run, even when results are averaged across multiple random seeds.

E.2 Sequence length spread of a dataset
We quantify the sequence length spread of a dataset using two parameters: L_avg and L_K. L_avg is the average sequence length of all sequences in the dataset, while L_K is the K-th percentile of sequence lengths, i.e., the smallest length such that at least K% of all sequences in the dataset are no longer than L_K. When K = 87.5 and assuming a batch size of 64 (HW_Cap = 64 when max_seq_len = 128 on an NVIDIA RTX 2080 Ti GPU), the probability of each batch having at least one sequence with length > L_K is 1 − (87.5/100)^64 ≈ 0.9998 when batching with RandomSampler. As a result, each batch is expected to be padded to at least L_87.5. Consequently, padding_fraction_est = (L_87.5 − L_avg) / L_87.5 is a conservative estimate of the fraction of all tokens that are expected to be padding tokens when batching with RandomSampler. We observe that padding_fraction_est has a direct correlation with the speedup achieved when using TokenDrop + BucketSampler (Fig. 8). We observe some outliers when the datasets are very small, as in the case of RTE (2.5K training samples) and WNLI (634 training samples), since the overheads of TokenDrop + BucketSampler account for a larger fraction of the wall-clock fine-tuning time (see Fig. 7). We achieve the maximum speedup on SST-2, a relatively large dataset (67K training samples) with high padding_fraction_est (nearly 50% of all tokens are expected to be padding tokens with RandomSampler). In addition, SST-2 has L_avg = 14, and hence the majority of batches can be processed with batch sizes > 800 (HW_Cap = 800 for the bucket with max_seq_len = 15), leading to large speedups over RandomSampler (where the batch size of all batches is determined by the longest sequence in the dataset, leading to hardware under-utilization when processing batches with only short sequences).
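The estimate can be computed directly from the tokenized lengths of a dataset, as in the short NumPy-based sketch below; the example length distribution is purely illustrative.

```python
import numpy as np

def padding_fraction_est(lengths, k=87.5):
    """Conservative estimate of the padding-token fraction under RandomSampler.

    L_avg is the mean sequence length and L_K the K-th percentile; the estimate
    is (L_K - L_avg) / L_K, following the formula above.
    """
    l_avg = float(np.mean(lengths))
    l_k = float(np.percentile(lengths, k))
    return (l_k - l_avg) / l_k

# Illustrative example: mostly short sequences with a long tail yield a high
# estimate, predicting large speedups from TokenDrop + BucketSampler.
print(round(padding_fraction_est([10, 12, 14, 16, 18, 20, 60, 64]), 2))
```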

Figure 2: Impact of minor grammatical errors on a Roberta-Base model fine-tuned using only grammatically correct examples. We generate inputs with minor grammatical errors by pruning articles ('a', 'an', 'the') and punctuation marks (comma, full stop, apostrophe, etc.) from samples in the development set.

Figure 5: Average training loss across GLUE and SQUAD when fine-tuning Roberta-base.

Figure 7: Overheads of TokenDrop + BucketSampler. Times are measured on an NVIDIA RTX 2080 Ti GPU with 11 GB memory.

Figure 8: Impact of the sequence length spread of a dataset on the fine-tuning speedup achieved using TokenDrop + BucketSampler.

Table 2: Results of fine-tuning the T5-small seq2seq model on text summarization. We report the ROUGE-1 score. Subscripts indicate standard deviation.

Table 3: Results of using TokenDrop + BucketSampler during batched inference with Roberta-base. We report the average score across the 9 GLUE tasks and SQUADv1.1. We assume that all samples in the test dataset arrive simultaneously, and speedup is computed by comparing the wall-clock time taken to infer on all test samples.

Table 4: Impact of the different BucketSampler optimizations when fine-tuning Roberta-base. We report the average score across GLUE and SQUAD.

Table 5: Accuracy and efficiency of fine-tuning Roberta-base with different batching strategies. Results are averaged across the GLUE tasks and SQUAD.

Table 8: Evaluation of regularization strategies. We report the average score across GLUE and SQUAD.