Dynamic Stashing Quantization for Efficient Transformer Training

Large Language Models (LLMs) have demonstrated impressive performance on a range of Natural Language Processing (NLP) tasks. Unfortunately, the immense amount of computation and memory access required for LLM training makes it prohibitively expensive in terms of hardware cost, and thus challenging to deploy in use cases such as on-device learning. In this paper, motivated by the observation that LLM training is memory-bound, we propose a novel dynamic quantization strategy, termed Dynamic Stashing Quantization (DSQ), that places a special focus on reducing memory operations while also enjoying the other benefits of low-precision training, such as reduced arithmetic cost. We conduct a thorough study on two translation tasks (trained from scratch) and three classification tasks (fine-tuning). DSQ reduces the number of arithmetic operations by $20.95\times$ and the number of DRAM operations by $2.55\times$ on IWSLT17 compared to the standard 16-bit fixed-point format, which is widely used in on-device learning.


Introduction
Large Language Models (LLMs) based on the Transformer architecture (Vaswani et al., 2017) are currently seen as foundation models (Bommasani et al., 2021). The pre-train-then-finetune paradigm has shown promising results on a variety of Natural Language Processing (NLP) tasks (Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020). However, training LLMs is both computationally and memory intensive, posing a significant challenge for their deployment.
In the hardware world, the Roofline model demonstrates that there is an optimal balance between processor and memory performance. The metric used to assess this balance is the operational intensity, calculated as the ratio of the number of arithmetic operations to the amount of DRAM traffic:

Operational Intensity (I) = Number of Operations / DRAM Traffic

The Roofline model allows us to identify the sweet spot (I_opt) at which a processor reaches its peak arithmetic performance (Williams et al., 2009; Ding et al., 2022). As illustrated in Figure 1, as the operational intensity (I) increases, the maximum attainable performance first rises at a linear rate before plateauing at a constant value. The region to the left of the turning point is limited by the available memory bandwidth; the region to the right is constrained by the processor's arithmetic capability. Training Transformer models, as shown by Ivanov et al. (2021), is memory-bound, which means it sits in the left region of the Roofline model (I < I_opt). Consequently, the performance of LLM training on modern hardware is significantly hindered by inadequate memory bandwidth, as the movement of data to and from DRAM is the major performance bottleneck.
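For reference, the attainable-performance curve sketched in Figure 1 follows the standard Roofline bound from Williams et al. (2009); the formula below is not spelled out in the text, and the symbols $P_{\text{peak}}$ (peak arithmetic throughput) and $\beta$ (DRAM bandwidth) are our notation:

$$
P(I) = \min\left(P_{\text{peak}},\; \beta \cdot I\right), \qquad I_{\text{opt}} = \frac{P_{\text{peak}}}{\beta}.
$$

A memory-bound workload sits in the region $I < I_{\text{opt}}$, so it gains the most from shrinking the denominator of the operational intensity, i.e., the DRAM traffic.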
For this reason, researchers have sought to accelerate the training process of Transformers through quantization. This approach aims to reduce memory consumption by decreasing the precision of parameters. Prior work has examined the effect of quantization on Transformer models, the majority of which focuses on the forward pass of model inference with fixed weights (Zhang et al., 2020; Bai et al., 2020; Tao et al., 2022). A number of studies have also investigated low-precision training for Transformers (Sun et al., 2019, 2020). Although these works have demonstrated the effectiveness of quantization, they typically assume a single precision level, either per neural network layer or per network, which over-simplifies the hardware target. When viewed from a Roofline model perspective, existing quantization methods attempt to optimize both compute complexity and memory bandwidth requirements, and thus fail to recognize that the workload is heavily memory-bound.
Motivated by this observation, we propose a novel quantization strategy for LLM training named Dynamic Stashing Quantization (DSQ). We identify the most memory-intensive part of LLM training, namely the communication between the forward and backward passes, and define stashing as the process of storing intermediate results in a memory buffer (normally, DRAM) for later use. The proposed quantization places an emphasis on this communication and dynamically quantizes the intermediate results exchanged between the forward and backward passes, yielding a significant reduction in DRAM traffic. As illustrated in Figure 1, this reduction in DRAM traffic moves DSQ closer to the optimal operational intensity. We make the following contributions:

Related Work
Quantization has been studied in detail for inference. These studies include uniform (Zafrir et al., 2019; Bhandare et al., 2019) and non-uniform (Sun et al., 2019; Darvish Rouhani et al., 2020) quantization methods. Specifically, uniform quantization methods such as fixed-point (Zafrir et al., 2019; Bhandare et al., 2019; Lin et al., 2020), ternary (Zhang et al., 2020), or even binary (Bai et al., 2020) number formats have been applied to the inference of Transformer models. In this work, we focus on quantization for LLM training, which introduces new challenges such as the large dynamic range needed during the backward pass for lossless training (Sun et al., 2019), where non-uniform quantization methods have seen more success.
Training an LLM is approximately 3× more expensive than running inference on the same model. Thus, quantizing all operations during training has been an area of active research (Sun et al., 2019, 2020; Yang et al., 2019; Fu et al., 2020; Fox et al., 2020; Kalamkar et al., 2019). Most of these methods use non-uniform quantization to handle the larger dynamic range needed for gradient updates (Kalamkar et al., 2019). Floating-point arithmetic and its variants have become a popular choice for low-precision training (e.g., fewer than 8 bits). Mini-floats with extremely small exponents (e.g., 1 or 2 bits) have been shown to be effective for small language models such as LSTMs (Sun et al., 2019, 2020). Block floating-point, or block mini-floats, where an exponent is shared between a set of values, has become popular in quantized training (Yang et al., 2019; Drumond et al., 2018; Fox et al., 2020) as it allows for a large dynamic range while approaching the cost of integer formats for multiplication. Specifically, Drumond et al. utilized block floating-point with roughly 24 bits to perform lossless training on vision tasks (Drumond et al., 2018). Fox et al. demonstrated that 8-bit training is possible with around a 0.5 BLEU score degradation on machine translation (Fox et al., 2020). Our work extends these formats to Large Language Models, includes quantization of stashed weights, and introduces a dynamic aspect to further reduce the required bit widths. The idea of stashing has also been explored before by Jain et al. (2018), although they only focused on applying lossless encoding methods to single-precision numbers (Float16). In this paper, we show a more aggressive stashing technique (on average fewer than 4 bits per number) that is time-adaptive for LLM training. FracTrain (Fu et al., 2020) is, to our knowledge, the only work that has applied the idea of dynamic quantization to standard training, but it focused primarily on vision tasks. Our work extends dynamic quantization to encompass stashed values and evaluates these effects on LLMs. Prior research on distributed training has looked at reducing the communication cost (Alistarh et al., 2017; Hönig et al., 2022), where Hönig et al. also investigate how time-adaptive quantization can help federated systems learn. These works focused on device-to-device traffic, while our work focuses on reducing DRAM traffic.

Method
Figure 2 provides a high-level illustration of the DSQ flow. We consider the inputs x_l of a neural network layer with parameters w_l; the output of the layer is x_{l+1}. In the backward pass, we consider the partial derivatives dx_l with respect to the input and also the gradient of the weights, dw_l. Naturally, a single training step requires three GEMMs, as illustrated in Figure 2. We identify four quantization opportunities in this training step and their effects (a minimal sketch of how they map onto the three GEMMs follows this list):

• q_0: mainly affects the arithmetic density of the forward pass. Note that it is possible for x_l and w_l to use different precisions, but this optimization is not the focus of our work.
• q_1: affects the DRAM memory bandwidth. A key point of our work is that q_1 can differ from q_0 and can in fact be a very aggressive, dynamic quantization.
• q_2: mainly affects the computation complexity of the first GEMM in the backward pass.
• q_3: affects the DRAM bandwidth and also the computation complexity of the second GEMM in the backward pass.
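To make the mapping concrete, the following sketch (our illustration, not the authors' released code) walks through the three GEMMs of a single linear layer and marks where q_0 through q_3 apply. The helper quantize() is a hypothetical stand-in for the BFP quantizer (identity here; a concrete BFP sketch appears later), and the exact operand each quantizer touches reflects our reading of Figure 2.

```python
import numpy as np

def quantize(t: np.ndarray, bits: int) -> np.ndarray:
    """Hypothetical stand-in for the block floating-point quantizer at `bits`
    precision; returns its input unchanged in this sketch."""
    return t

def train_step_linear(x_l, W_l, dx_lp1, precision, dram):
    """One training step of a linear layer x_{l+1} = x_l @ W_l.
    `precision` is the DSQ configuration [q0, q1, q2, q3]; `dram` models the
    off-chip stash buffer."""
    q0, q1, q2, q3 = precision

    # GEMM 1 (forward): q0 sets the arithmetic precision of the forward pass.
    x_lp1 = quantize(x_l, q0) @ quantize(W_l, q0)

    # Stashing: the activation is quantized with q1 *before* being written to
    # DRAM for later use in the backward pass; this is where DRAM traffic drops.
    dram["x_l"] = quantize(x_l, q1)

    # GEMM 2 (backward, gradient w.r.t. the input): q2 sets this GEMM's precision.
    dx_l = quantize(dx_lp1, q2) @ quantize(W_l, q2).T

    # GEMM 3 (backward, gradient w.r.t. the weights), using the stashed
    # (q1-quantized) activation fetched back from DRAM.
    dW_l = dram["x_l"].T @ quantize(dx_lp1, q2)

    # The gradient output is quantized with q3 before being flushed to DRAM,
    # where the preceding layer's backward GEMMs will read it.
    dram["dx_l"] = quantize(dx_l, q3)
    return x_lp1, dW_l, dram["dx_l"]
```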
To our knowledge, we are the first to systematically illustrate the potential effects, on both compute and off-chip memory bandwidth, of the various quantization opportunities within a standard training pass. The two GEMMs in the backward pass can potentially be fused (e.g., pipelined), in which case dx_l does not have to be written to and then read from DRAM. In our cost model estimation, we use a conservative strategy and assume this tensor is always flushed to DRAM. In DSQ, we use Block Floating Point as the quantizer for q_0, q_1, q_2 and q_3, since this quantizer has been shown to be superior to fixed-point quantization (Darvish Rouhani et al., 2020). We also use a time-adaptive quantization strategy: the quantization uses a different quantization level q_i^t for each round t of training. We design DSQ to monotonically increase q_i^t as a function of t and use the validation loss to inform this increase. This monotonic increase strategy has been shown to be more effective than other, more complex scheduling methods by Hönig et al. (2022). Through extensive tuning and experimentation, we also find that it is important to keep q_3 ≥ 16 throughout the entire training process; Appendix C studies the effect of different quantization levels for q_3.
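The scheduling rule itself is simple to state. The sketch below is our illustration rather than released code: the precision ladder from [2, 2, 2, 16] to [16, 4, 4, 16] and the patience of N = 5 non-decreasing epochs are taken from Appendix B and the Limitations section, while the class structure and names are ours.

```python
# Hedged sketch of the time-adaptive precision schedule: start aggressive and
# monotonically increase precision when the validation loss plateaus.
PRECISION_LADDER = [
    [2, 2, 2, 16],    # [q0, q1, q2, q3] at the start of training
    [16, 4, 4, 16],   # fallback level once the loss stops improving
]

class DSQSchedule:
    def __init__(self, patience: int = 5):
        self.level = 0
        self.patience = patience
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def current_precision(self):
        return PRECISION_LADDER[self.level]

    def step(self, val_loss: float):
        """Call once per epoch with the validation loss."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        # Monotonic increase: move to the next (higher-precision) level after
        # `patience` non-decreasing epochs, and never move back down.
        if self.stale_epochs >= self.patience and self.level < len(PRECISION_LADDER) - 1:
            self.level += 1
            self.stale_epochs = 0
        return self.current_precision()
```

At each epoch, the returned [q_0, q_1, q_2, q_3] configuration would be fed to the quantizers in the training step sketched above.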

Evaluation
We evaluate the effectiveness of DSQ on two different translation tasks, WMT14 EN-DE (Bojar et al., 2014) (in Appendix D) and IWSLT17 EN-DE (Cettolo et al., 2017), and two tasks from the GLUE benchmark (Wang et al., 2018); the details of these datasets are given in Appendix A. We used the Adam optimizer, and the learning rate and batch size selections are detailed in Appendix B. For the translation tasks, we use a classic 6-layer Transformer model (Vaswani et al., 2017), and for the GLUE tasks we use the RoBERTa-base model (Liu et al., 2019). All tasks are executed on systems with 2 AMD EPYC 7763 64-Core Processors at 1.8 GHz (128 cores in total), 4 NVIDIA A100-SXM-80GB GPUs, and 1000 GiB of RAM. We are interested in understanding the costs of arithmetic operations as well as the number of memory reads and writes. To this end, we have built a hardware performance modeling framework to estimate the training cost. Our cost model is similar to those of Sun et al. (2020) and Samajdar et al. (2018), but our numbers are derived from a production hardware system, taking the figures reported in Darvish Rouhani et al. (2020), to provide a higher-fidelity estimation.
Table 1 presents the results of our study comparing different quantization strategies. We compare popular low-latency training baselines and Block Floating Point (BFP) (Darvish Rouhani et al., 2020; Fox et al., 2020) with different precisions. For all BFP implementations considered in this paper, we keep the exponent bitwidth at 8 and the bounding-box size at 16.
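As a reference for how such a format behaves, the following sketch (our illustration, not the authors' implementation) quantizes a tensor with the configuration stated above: bounding boxes of 16 values share one exponent, and each value keeps a signed mantissa of `mantissa_bits` bits. The rounding and clamping details are our assumptions.

```python
import numpy as np

def bfp_quantize(x: np.ndarray, mantissa_bits: int,
                 block_size: int = 16, exponent_bits: int = 8) -> np.ndarray:
    """Illustrative block floating-point quantizer: each block of `block_size`
    consecutive values shares one exponent, and every value keeps a signed
    mantissa of `mantissa_bits` bits."""
    flat = x.astype(np.float64).ravel()
    pad = (-len(flat)) % block_size
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)

    # Shared exponent per block, derived from the largest magnitude in the
    # block and clipped to the representable exponent range.
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    exp = np.floor(np.log2(np.where(max_abs > 0, max_abs, 1.0)))
    exp = np.clip(exp, -(2 ** (exponent_bits - 1)), 2 ** (exponent_bits - 1) - 1)

    # Round each value to a signed mantissa relative to the shared exponent.
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    q_max = 2 ** (mantissa_bits - 1) - 1
    mantissa = np.clip(np.round(blocks / scale), -q_max, q_max)

    out = (mantissa * scale).ravel()[: len(flat)]
    return out.reshape(x.shape).astype(x.dtype)
```

In DSQ terms, `mantissa_bits` would correspond to the current level q_i^t chosen by the schedule.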

Conclusion
In this paper, we propose Dynamic Stashing Quantization (DSQ) for LLM training. This new quantization strategy applies a more aggressive quantization to the intermediate results exchanged between the forward and backward passes during training, thereby reducing DRAM traffic. Specifically, our approach uses a low precision at the beginning of training and then gradually increases the precision level to reduce the effect of the round-off errors introduced by quantization. We demonstrate the effectiveness of DSQ by showing how it reduces both the computation cost and the DRAM bandwidth requirement on machine translation and LLM fine-tuning tasks.

Limitation
• DSQ precision configurations are decided through experimentation on the IWSLT dataset. The precision for the different stages is scheduled based on the validation loss: the precision increases if the validation loss becomes 'flat' (non-decreasing). Concretely, if the validation loss has been non-decreasing for a fixed number of N epochs, we move to the next quantization level, following a setup similar to that proposed by Hönig et al. (2022). We find that setting N = 5 is sufficient for all test scenarios. The choice of this free parameter is a limitation of our paper and will be investigated in future research. The same precision configuration is used for all other datasets.
• Language models larger than the classic Transformer and RoBERTa have been developed in recent work. Due to the limited resources available to us, we choose to work on smaller models as exemplars. While we have tested larger LMs such as OPT-1.3B (in Appendix E) to show that today's large LLMs are also memory-bound, additional experimentation is needed to enhance the robustness and precision of our findings.

A Datasets
Four datasets are used: WMT14 EN-DE and IWSLT2017 EN-DE for machine translation tasks, and QNLI and MNLI for textual entailment tasks. Table 2 presents the details of these datasets.

B Hyperparameters
The training hyperparameters, such as learning rates, are picked following standard benchmarks and open implementations (Liu et al., 2019; Vaswani et al., 2017). We summarize them in Table 3 for repeatability. We use the Adam optimizer with β_1 = 0.9, β_2 = 0.98 for both training and fine-tuning. The learning rate schedule is Inverse Square Root for trained-from-scratch models and Polynomial Decay for fine-tuned models. Dropout with rates of P_IWSLT = 0.3 and P_WMT = 0.2 and label smoothing with ϵ = 0.1 are applied when training models. DSQ precision configurations are decided through experimentation on the IWSLT dataset, as discussed in the Limitations section. Table 4 shows a collection of our tuning runs; we found that heavily quantized models still work at the start of training, and a [16, 4, 4, 16] quantized BFP model works as well as less aggressive ones. This indicates that DSQ should start with a very aggressive precision setup (we pick [2, 2, 2, 16] for IWSLT14 DE-EN) and jump to [16, 4, 4, 16] when needed during training.

C The effect of q_3
The gradient output (dx_l) plays an important role in the performance of fixed-point quantization. Note in Table 5 that quantizing the gradient output to 8 bits leads to training failure for fixed-point quantization.
In order to focus on the idea of stashing, we apply 16-bit quantization to the gradient output for all our stashing precision setups.

D Additional results on WMT14
We also train the model on the WMT14 EN-DE dataset. The BLEU scores we obtain are relatively low compared to the 27.3 BLEU score achieved by Vaswani et al. (2017) because we only train the models for 15 epochs. Table 6 presents the results.
E Memory-bound verification on OPT-1.3B
We have also tested larger LMs such as OPT-1.3B to show that today's large LLMs are also memory-bound, so that dynamic stashing serves as a useful technique for reducing DRAM traffic and compute.
We run an analysis similar to that of Ivanov et al. (2021). Table 7 presents the results. The analysis agrees with the conclusion of Ivanov et al. (2021). This may be of interest to the community in understanding the memory-bound nature of current LLMs.

Figure 1 :
Figure 1: The Roofline model with operational intensity (I) and attainable performance (P). 1 is non-quantized, 2 is a standard quantization, and 3 is DSQ. Operational Intensity = Number of Operations / DRAM Traffic.

Figure 2 :
Figure 2: An illustration of the DSQ flow for a single linear layer. The training step is viewed as a combination of a forward pass and a backward pass. q_0, q_1, q_2 and q_3 define where the tensors are quantized; we use [q_0, q_1, q_2, q_3] to describe the DSQ configuration. DSQ ensures all GEMM inputs are quantized. Note that for the second and third GEMMs, dx_{l+1}, x_l and dx_l are the quantized versions fetched from DRAM; the fact that these values are heavily quantized is what saves DRAM bandwidth.

Table 1 :
The performance of machine translation trained with a 6-layer Transformer architecture; the scores are reported as percentages. ∆ shows the performance difference compared to the 32-bit floating-point baseline.

Table 2 :
Details for each dataset, including the number of classes, a description and the source.

Table 3 :
Details of the optimal hyper-parameters including batch size, learning rate and weight decay values for each set of experiments with the same dataset and prompting model.

Table 4 :
Tests on stashing precision setups. The models are trained on IWSLT14 DE-EN. ∆ shows the performance difference compared to the 32-bit floating-point baseline.