Characterizing the Efficiency vs. Accuracy Trade-off for Long-Context NLP Models

With many real-world applications of Natural Language Processing (NLP) involving long texts, there has been a rise in NLP benchmarks that measure the accuracy of models designed to handle longer input sequences. However, these benchmarks do not consider the trade-offs between accuracy, speed, and power consumption as input sizes or model sizes are varied. In this work, we perform a systematic study of this accuracy vs. efficiency trade-off on two widely used long-sequence models - Longformer-Encoder-Decoder (LED) and Big Bird - during fine-tuning and inference on four datasets from the SCROLLS benchmark. To study how this trade-off differs across hyperparameter settings, we compare the models across four sequence lengths (1024, 2048, 3072, 4096) and two model sizes (base and large) under a fixed resource budget. We find that LED consistently achieves better accuracy at lower energy costs than Big Bird. For summarization, we find that increasing model size is more energy efficient than increasing sequence length for higher accuracy. However, this comes at the cost of a large drop in inference speed. For question answering, we find that smaller models are both more efficient and more accurate due to the larger training batch sizes possible under a fixed resource budget.


Introduction
Over the past few years, advances in sequence modeling have led to impressive results on several NLP benchmarks (Wang et al., 2019, 2020). A closer look at these results reveals that higher accuracies are typically achieved by increasingly large and computationally intensive models, which have large carbon footprints that can have an adverse effect on the environment (Strubell et al., 2019).
This has led to the Green AI initiative, which urges researchers to consider energy and computational efficiency when evaluating models in order to promote those which achieve high accuracies with smaller carbon footprints (Schwartz et al., 2020). However, although it has been a few years since Green AI was introduced, efficiency metrics have still not been integrated into many recently proposed benchmarks such as the Long Range Arena (LRA) (Tay et al., 2020a) and SCROLLS (Shaham et al., 2022). These benchmarks serve as a strong basis for comparison between Transformer models in terms of accuracy. However, improved accuracy is often obtained by either increasing the input sequence length or the model size, and the energy cost of these improvements is not clear. Moreover, previous characterizations of model efficiency in terms of speed (e.g., in LRA) only focus on inter-model comparisons, keeping model sizes and input sequence lengths fixed. Here, we argue that the accuracy-vs-efficiency trade-off also has implications for intra-model comparisons when selecting hyperparameters - e.g., increasing the sequence length might positively impact accuracy but may also negatively impact efficiency metrics. As a result, when faced with a fixed resource budget, it is not clear whether practitioners should opt for increasing the model size or increasing the input length for the most efficient use of resources.
In this work, we perform a systematic study of the trade-off between efficiency and accuracy for two widely used long-context NLP models - Big Bird (Zaheer et al., 2020) and Longformer-Encoder-Decoder (LED) (Beltagy et al., 2020) - on four datasets from the SCROLLS benchmark. We characterize efficiency using several metrics, including the total energy consumption during training, training speed, inference speed, and power efficiency. We compare the models across several different input lengths and two different model sizes (base and large). Overall, for summarization, we find that, perhaps surprisingly, increasing model size is a more energy efficient way of increasing accuracy as compared to increasing sequence length. However, if inference speed is the main efficiency metric of interest, then smaller models should be preferred. For question answering, on the other hand, we find that using smaller models is more efficient in terms of all metrics and more accurate due to the larger training batch sizes allowed under a fixed resource budget.

NLP Benchmarks
Benchmarks such as SuperGLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2018) have served as the gold standard in the development of NLP models. However, these benchmarks only capture model performance on short text sequences, while many NLP tasks of interest, such as question answering and summarization, involve long contexts. Recently, several efficient Transformer models have been introduced which require subquadratic memory and time complexity with respect to the input length (Tay et al., 2020b). Consequently, new standardized benchmarks have been introduced specifically focusing on the long sequence modeling capabilities of these models, including the Long Range Arena (LRA) (Tay et al., 2020a) and SCROLLS (Shaham et al., 2022).
Although LRA evaluates long-sequence models, it only contains two language datasets, which artificially elongate the input sequences through byte tokenization. The SCROLLS benchmark, on the other hand, focuses on language tasks which naturally require synthesizing information from long sequences, including summarization, question answering, and classification. SCROLLS does not compare models in terms of efficiency at all, and while LRA compares model speeds, it only does so across different model architectures, ignoring the impact of hyperparameter choices. For our analysis, we utilize three summarization tasks and one question answering task from SCROLLS.

Energy Considerations
As deep learning models grow more complex to meet increasing demands, the computation required to run these models generates an increasingly larger energy cost (Strubell et al., 2019). This has led to the Green AI initiative (Schwartz et al., 2020), which demands higher energy efficiency while maintaining state-of-the-art accuracies. The performance and energy efficiency of AI accelerators has been benchmarked during training, but only for 2-layer LSTMs and vanilla Transformers (Wang et al., 2020). HULK (Zhou et al., 2021) is an NLP benchmark that evaluates the energy efficiency of several Transformer models (e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019)) during pre-training, fine-tuning, and inference, but it does not consider long-range models. Additionally, neither benchmark considers the effects of different sequence lengths on both energy efficiency and accuracy. However, we confirm the observation from HULK that larger model sizes do not always imply lower efficiency.

Methodology
Our main contribution is an analysis of how different sequence lengths affect the trade-off between accuracy, power, and speed in long-context Transformer models during fine-tuning and inference. Since our focus is on long-context NLP tasks, we investigated the following four input sequence lengths: 1024, 2048, 3072, and 4096.

Datasets
We conduct our analyses on four datasets from the SCROLLS benchmark: GovReport (Huang et al., 2021), SummScreenFD (Chen et al., 2021), QMSum (Zhong et al., 2021), and Qasper (Dasigi et al., 2021). These datasets span two different tasks - summarization and question answering - which frequently involve long inputs. We provide a summary of these datasets in Table 1, with more details provided in Appendix A. We cast these datasets in a unified sequence-to-sequence format using the same procedure as in SCROLLS.
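As a concrete illustration of this setup, the following is a minimal sketch of loading these four datasets in their sequence-to-sequence form via the HuggingFace datasets library; the hub identifier tau/scrolls, the configuration names, and the input/output field names are assumptions about the hosted version rather than details stated in this paper.

```python
# Sketch: loading the four SCROLLS datasets used in this study.
# The hub ID "tau/scrolls", the configuration names, and the "input"/"output"
# fields are assumed properties of the hosted datasets; adjust if they differ.
from datasets import load_dataset

CONFIGS = ["gov_report", "summ_screen_fd", "qmsum", "qasper"]
scrolls = {name: load_dataset("tau/scrolls", name) for name in CONFIGS}

# In the unified sequence-to-sequence format, each example exposes a long
# "input" string (the document, optionally prefixed by a question) and an
# "output" string (the reference summary or answer).
example = scrolls["gov_report"]["train"][0]
print(example["input"][:200], "->", example["output"])
```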

Models
Following standard practice, we start with pre-trained models and restrict our analysis to the fine-tuning and inference stages. Since our tasks are cast in a sequence-to-sequence format, we pick two widely used encoder-decoder models for long-context NLP - the Longformer-Encoder-Decoder (LED) and Big Bird. To mimic a typical use-case, we obtained these two pre-trained models from the HuggingFace library (https://huggingface.co/) - hence our analysis can be easily extended to any HuggingFace model.
Longformer-Encoder-Decoder (LED). We analyzed both the base and large versions of the LED model released with the original paper (Beltagy et al., 2020). This version of the LED model utilizes the Longformer-chunks implementation, which achieves high compute efficiency at the cost of higher memory by chunking the key and query matrices so that only a single PyTorch matrix multiplication operation is needed.
The two versions of the model are stored on HuggingFace as allenai/led-base-16384 and allenai/led-large-16384.
Big Bird. Following the encoder-decoder setup in the original Big Bird paper (Zaheer et al., 2020), we utilized the version of Big Bird-large that has been pre-trained on the PubMed dataset starting from Pegasus-large. This model is stored on HuggingFace as google/bigbird-pegasus-large-pubmed. We only performed experiments on the large version of this model as the base version is not released on HuggingFace.
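As a minimal sketch, the checkpoints named above can be loaded through the HuggingFace transformers library as follows; the truncation length shown is one of the four sequence lengths studied, and the example input string is purely illustrative.

```python
# Sketch: loading the pre-trained long-context encoder-decoder checkpoints
# named in the text from the HuggingFace Hub.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CHECKPOINTS = {
    "led-base": "allenai/led-base-16384",
    "led-large": "allenai/led-large-16384",
    "bigbird-large": "google/bigbird-pegasus-large-pubmed",
}

tokenizers = {name: AutoTokenizer.from_pretrained(ckpt)
              for name, ckpt in CHECKPOINTS.items()}
models = {name: AutoModelForSeq2SeqLM.from_pretrained(ckpt)
          for name, ckpt in CHECKPOINTS.items()}

# Inputs are truncated to the sequence length under study (1024 to 4096 tokens).
batch = tokenizers["led-base"](["an illustrative long government report ..."],
                               max_length=4096, truncation=True,
                               return_tensors="pt")
```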

Hardware Resources Provisioned
Our initial experiments with the LED-base model suggest that large batch sizes are imperative for obtaining high accuracies on the question answering task but less so for the summarization tasks (see Table 2). Quadrupling the batch sizes on the Qasper question answering dataset - through the use of a gradient accumulation step size of four - resulted in a two to four point increase in the F1 scores across the input sequence lengths. Taking the input sequence length of 1024 as an example (i.e., the first row of Table 2), we were able to fit a batch size of 24 on one GPU (labeled 1 GPU) without suffering an out-of-memory error when performing fine-tuning, obtaining a modest F1 score of 17.68.
When we quadrupled the batch size to 96 by using gradient accumulation with a step size of four (labeled 1 GPU - Accum), the model accuracy went up to an F1 score of 21.39. When the batch sizes were further increased through the use of more GPUs (labeled 8 GPUs - Accum), the increase in F1 scores becomes more prominent at four to seven points. The same trends hold for all sequence lengths on the Qasper dataset. On the other hand, quadrupling the batch sizes for the GovReport summarization dataset resulted in negligible increases in Rouge scores, while the further increase via multiple GPUs actually resulted in (slightly) lower Rouge scores.
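The effective batch sizes behind these configurations follow directly from the per-GPU batch size, the gradient accumulation step size, and the number of GPUs; the sketch below works through the sequence-length-1024 case, assuming the per-GPU batch size of 24 is kept fixed across configurations.

```python
# Sketch: effective batch sizes for the three hardware configurations in Table 2,
# assuming the per-GPU batch size of 24 reported for sequence length 1024.
def effective_batch_size(per_gpu_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    # Each optimizer step consumes per_gpu_batch * grad_accum_steps * num_gpus examples.
    return per_gpu_batch * grad_accum_steps * num_gpus

print(effective_batch_size(24, 1, 1))  # 1 GPU:          24
print(effective_batch_size(24, 4, 1))  # 1 GPU - Accum:  96
print(effective_batch_size(24, 4, 8))  # 8 GPUs - Accum: 768 (assumed per-GPU size)
```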
These initial experiments informed our decision to use a fixed resource budget of 1 Nvidia RTX A6000 GPU for both fine-tuning and inference of all models on the summarization tasks, since increasing the number of GPUs does not have a positive effect on the model accuracy. On the other hand, for the question answering task, we used a much larger fixed resource budget of 8 Nvidia RTX A6000 GPUs (on the same server) for both fine-tuning and inference to allow for larger batch sizes that can obtain much better model accuracy.

Fine-tuning
All pre-trained models mentioned in Section 3.2 are fine-tuned without mixed precision or gradient checkpointing on all datasets until convergence. A model is considered converged when the accuracy metric of interest for that task has stayed the same or worsened for 3 consecutive validation calls. Since we perform validation every 500 steps for summarization tasks and every 10 steps for the question answering task, this corresponds to 1500 steps without improvement for summarization and 30 steps for question answering.
In terms of hyperparameters, we used the same hyperparameters as the SCROLLS benchmark for the LED-base model, except for the batch sizes. To control for the effects of memory on our metrics, for each sequence length and model, we selected the largest batch size that fits on the 48GB A6000 GPU. For the question answering task, the batch sizes were selected so that the minibatches on each of the 8 GPUs were maximized. To further increase the effective size of each minibatch in the question answering task, we set the gradient accumulation steps to four. More information about the hyperparameters is provided in Appendix B.
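A minimal sketch of how this fine-tuning setup might be expressed with the HuggingFace Seq2SeqTrainer is shown below; the per-device batch size is a placeholder, tokenized_train, tokenized_val, and compute_rouge are hypothetical helpers, and only the validation interval, the early-stopping patience of 3 validation calls, the gradient accumulation step size, and the absence of mixed precision and gradient checkpointing follow the description above.

```python
# Sketch: fine-tuning setup for a summarization task, mirroring the text above.
# tokenized_train / tokenized_val / compute_rouge are assumed helpers, and the
# per-device batch size is a placeholder for the largest size fitting on a 48GB A6000.
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          EarlyStoppingCallback)

args = Seq2SeqTrainingArguments(
    output_dir="led-base-govreport",
    per_device_train_batch_size=8,     # placeholder; chosen per model and length
    gradient_accumulation_steps=1,     # set to 4 for the question answering task
    evaluation_strategy="steps",
    eval_steps=500,                    # 10 for the question answering task
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="rouge_geometric_mean",  # returned by compute_rouge
    predict_with_generate=True,
    fp16=False,                        # no mixed precision
    gradient_checkpointing=False,      # no gradient checkpointing
)

trainer = Seq2SeqTrainer(
    model=models["led-base"],
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_rouge,
    # Stop once the metric has not improved for 3 consecutive validation calls.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```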

Inference
Since we do not have access to the labels in the test sets of SCROLLS, inference is run on the validation set using the fine-tuned models. All of our inference runs were performed with a batch size of 16.
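The corresponding inference pass can be sketched as follows, reusing the fine-tuned model and tokenizer from the earlier sketches; only the batch size of 16 comes from the text, while the generation length cap is illustrative.

```python
# Sketch: batched inference on the validation split with a batch size of 16.
import torch

model = models["led-base"].eval().cuda()
tok = tokenizers["led-base"]
val_inputs = [ex["input"] for ex in scrolls["gov_report"]["validation"]]

predictions = []
for i in range(0, len(val_inputs), 16):                 # batch size of 16
    batch = tok(val_inputs[i:i + 16], max_length=4096, truncation=True,
                padding=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        generated = model.generate(**batch, max_length=1024)  # illustrative cap
    predictions.extend(tok.batch_decode(generated, skip_special_tokens=True))
```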

Evaluation Criteria
Accuracy. Our evaluation metrics for the accuracy of the models on each dataset follow those used in the SCROLLS paper. GovReport, SummScreenFD, and QMSum are evaluated using Rouge, as is standard for summarization; Qasper is evaluated using a token-level F1 score after normalizing both the predicted and ground-truth answer strings in the same manner as SQuAD (Rajpurkar et al., 2018). For Rouge, following SCROLLS, we calculated the geometric mean of three different Rouge variants to provide a single value: Rouge-1 (unigram overlap), Rouge-2 (bigram overlap), and Rouge-L (longest common subsequence overlap).

Efficiency. For efficiency metrics, we explored the training power efficiency (number of samples trained per second per Watt), total training energy required (average power × training time), training speed (number of samples trained per second), and inference speed (number of samples inferenced per second). The training and inference speeds are provided by the HuggingFace library, while the total energy consumed and the power efficiency of the GPU(s) were collected with the help of the Weights and Biases (wandb) tool (https://wandb.ai/site). We chose power efficiency as one of our metrics because it is one of the most important industry-standard metrics for machine learning platforms (TPUs are evaluated in terms of performance per Watt, and MLPerf (Reddi et al., 2020; Mattson et al., 2020) measures the number of samples inferenced per second per Watt) and because it is a key component of TCO (Total Cost of Ownership). Cloud providers routinely spend 40-50% of their costs on electricity as well as powering and cooling the servers, and this cost is increasing. Hence, maximizing the utility of this spent power by increasing the number of samples processed per Watt is crucial for reducing the carbon footprint of NLP research.
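The two accuracy metrics can be sketched as follows; the Rouge variant scores are assumed to be computed elsewhere (e.g., with the rouge_score package), and the normalization mirrors the standard SQuAD recipe.

```python
# Sketch: the summarization and question answering accuracy metrics described above.
import re
import string
from collections import Counter

def rouge_geometric_mean(rouge1: float, rouge2: float, rougeL: float) -> float:
    # Single summarization score: geometric mean of Rouge-1, Rouge-2, and Rouge-L.
    return (rouge1 * rouge2 * rougeL) ** (1.0 / 3.0)

def normalize(text: str) -> str:
    # SQuAD-style normalization: lowercase, strip punctuation and articles,
    # and collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction: str, ground_truth: str) -> float:
    # Token-level F1 between the normalized predicted and gold answer strings.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```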

Summarization Datasets
Figure 1 plots the power efficiency against the corresponding model accuracy on each summarization dataset for input lengths ranging from 1024 to 4096 tokens. We make the following observations. First, power efficiency has a strong inverse correlation with the input sequence length, with small variations across datasets. Second, the Big Bird-large model has power efficiency similar to the LED-large model across the input sequence lengths, but Big Bird's Rouge scores are much lower, making the LED models a better choice for summarization tasks.
Figure 2 shows the total energy consumed during training on each of the three summarization datasets. Interestingly, we observe that on GovReport and QMSum, LED-large with sequence length 1024 is more efficient and has higher accuracy than each of the LED-base models with larger sequence lengths. Increasing the sequence length for LED-large further increases this accuracy while still often being more efficient than LED-base models with greater sequence lengths. This suggests that, for summarization, using larger models with short sequence lengths is a more energy friendly way to get higher accuracies (as compared to small models with larger sequence lengths). We find Big Bird to both consume more energy and achieve lower Rouge scores.
The training speed (Figure 3) and the inference speed (Figure 4) on the summarization datasets show similar trends. As the input sequence lengths increase, the training and inference speeds decrease due to the sub-quadratic runtime complexity (with respect to the input sequence length) of the attention mechanisms employed in these efficient Transformer models. Unlike training energy, inference speed increases when the model size is smaller, at the cost of lower accuracy. However, sometimes (such as for the datapoints in the GovReport dataset) a similar accuracy can be obtained by the LED-base model with a larger input length (2048) as opposed to LED-large with a smaller input length (1024).

Qasper Dataset and Scaling Up Resources
Figure 5 shows all four efficiency metrics for the Qasper question answering task. Once again, the LED models outperform Big Bird in the overall F1 score. Interestingly, we observe that under fixed resources, LED-base also outperforms LED-large on this dataset. We suspect this is due to the larger batch sizes we can fit for LED-base as compared to LED-large, which we found to be particularly important for this dataset. Hence, we found it to be more efficient and more accurate to use the smaller model on this task. Increasing sequence length brings large gains in accuracy with a small increase in training energy but a large slowdown in terms of speed.

Energy Consumption Deep Dive
To understand the energy consumption of the hardware platform, we present a deeper analysis on the GovReport dataset. We plot the GPU utilization (as an average over the entire training run), the GPU memory usage (as an average over the entire training run), and the training time (in seconds) in Figure 6. From the GPU utilization plot, we observe that the single GPU is well utilized by the LED models, while Big Bird does not saturate the GPU, especially when the input sequence length is 4096. This would suggest that Big Bird should incur a smaller energy cost because not all GPU resources are kept busy. However, Big Bird took about 48 hours to train with a sequence length of 4096, while LED-large took 14 hours to train at the same sequence length. This almost four times longer training time contributed to Big Bird's high energy consumption.
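The utilization and power figures in this analysis come from wandb's built-in system monitoring; as a rough sketch of how similar quantities could be measured directly, the snippet below polls NVIDIA's NVML bindings (pynvml) and integrates power over time into an energy estimate, with the polling interval and duration chosen purely for illustration.

```python
# Sketch: sampling GPU utilization and power with pynvml and integrating
# power over time to estimate energy consumption in kWh.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples, interval_s = [], 1.0
for _ in range(60):                                            # poll for ~1 minute
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu    # percent
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> W
    samples.append((util, power_w))
    time.sleep(interval_s)

avg_util = sum(u for u, _ in samples) / len(samples)
avg_power_w = sum(p for _, p in samples) / len(samples)
energy_kwh = avg_power_w * len(samples) * interval_s / 3.6e6   # W*s -> kWh
print(f"avg utilization: {avg_util:.1f}%, avg power: {avg_power_w:.1f} W, "
      f"energy: {energy_kwh:.4f} kWh")
```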

Figure 1: Power efficiency measured in number of samples per second per Watt vs. model accuracy in Rouge score for the three summarization datasets - GovReport (Left), SummScreenFD (Middle), QMSum (Right) - while varying input sequence lengths.

Figure 2: Total training energy consumption measured in kiloWatt-hours vs. model accuracy in Rouge score for the three summarization datasets - GovReport (Left), SummScreenFD (Middle), QMSum (Right) - while varying input sequence lengths.

Figure 3: Model training speed measured in number of samples per second vs. model accuracy in Rouge score for the three summarization datasets - GovReport (Left), SummScreenFD (Middle), QMSum (Right) - while varying input sequence lengths.

Figure 4: Model inference speed measured in number of samples per second vs. model accuracy in Rouge score for the three summarization datasets while varying input sequence lengths.

Figure 5: Efficiency metrics vs. model accuracy (F1 score) on the Qasper question answering dataset while varying input sequence lengths.

Table 2: Accuracy of the LED-base model with varying batch sizes across different hardware configurations. Accum indicates that a gradient accumulation step size of four was used to obtain the larger batch sizes. On the Qasper question answering task, where Acc represents the F1 score of the predicted answers, increasing the batch sizes significantly improves the accuracy for all sequence lengths. On the GovReport summarization task, where Acc represents the Rouge score, increasing the batch sizes has a negligible effect.

Table 5: Batch sizes used for fine-tuning the different models for each of the tasks at each input sequence length. Summ indicates summarization, and QA means question answering. The batch sizes listed for the QA task are the total batch size across the 8 GPUs with the gradient accumulation step size set to four.