Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation

We propose Speculative Decoding (SpecDec) to formally study, for the first time, exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding. SpecDec has two key innovations: Spec-Drafter -- an independent model specially optimized for efficient and accurate drafting -- and Spec-Verification -- a reliable method for verifying the drafted tokens efficiently in the decoding paradigm. Experimental results on various seq2seq tasks including machine translation and abstractive summarization show our approach can achieve around $5\times$ speedup for the popular Transformer architectures with comparable generation quality to beam search decoding, refreshing the impression that the draft-then-verify paradigm introduces only a $1.4\times\sim2\times$ speedup. In addition to the remarkable speedup, we also demonstrate three additional advantages of SpecDec, revealing its practical value for accelerating generative models in real-world applications. Our models and code are available at https://github.com/hemingkx/SpecDec.


Introduction
As the de facto method for text generation, AutoRegressive (AR) decoding is widely blamed for its poor inference efficiency due to its low level of parallelism, which fails to utilize the full potential of modern parallel computing devices like GPUs. This inefficiency not only leads to high deployment costs but also limits the application of advanced AR models in real-time scenarios.

Figure 1: Compared with autoregressive decoding (left) that generates token by token, the draft-then-verify paradigm (right) first drafts multiple tokens efficiently and then verifies these tokens in parallel. Drafted tokens after the bifurcation position (e.g., y_5) will be discarded to guarantee the generation quality.
In this work, we study the draft-then-verify paradigm for accelerating seq2seq generation with an existing AR model. As shown in Figure 1, the draft-then-verify paradigm first generates a number of drafted tokens efficiently and then verifies these tokens using the existing AR model in parallel to ensure the decoding result matches AR decoding. However, previous attempts at the "draft-then-verify" paradigm, such as Blockwise Decoding (Stern et al., 2018) and Aggressive Decoding (Sun et al., 2021), lack in-depth investigation of this paradigm. Their modest speedup (i.e., 1.4×∼2.0×) or limitation to certain seq2seq tasks like Grammatical Error Correction (GEC) has caused this paradigm to be underestimated, resulting in it not receiving much attention and remaining dormant for years.
To fully exploit the draft-then-verify paradigm, we propose Speculative Decoding (SpecDec), drawing inspiration from speculative execution in computer architecture, with two key innovations that improve the drafting and verification processes respectively. For drafting, we derive two principles for designing the drafting model (also called the drafter in this paper): the Capability Principle and the Latency Principle. Following these two principles, we propose Spec-Drafter, a specialized independent model optimized in the draft-then-verify paradigm, which can accurately and efficiently fulfill the drafting task.
For verification, we propose an advanced method, Spec-Verification, that relaxes the vanilla verification strategy. Spec-Verification allows the decoding results of SpecDec to be slightly different from AR greedy decoding, offering an opportunity to accept more drafted tokens without sacrificing generation quality and leading to higher decoding efficiency.
We conduct extensive experiments on various seq2seq generation tasks such as machine translation and abstractive summarization. Results show our approach can achieve around 5× speedup for the popular Transformer architectures with comparable generation quality to beam search decoding, largely outperforming previous draft-then-verify work (1.4×∼2.0× speedup). Moreover, we demonstrate that SpecDec has several additional advantages that enhance its practicality for accelerating generative models in real-world applications.
Our contributions can be summarized as follows:

• We are the first work that explicitly exploits the idea of speculative execution to accelerate Transformer inference. Our proposed two key innovations, the independent Spec-Drafter and the Spec-Verification strategy, allow SpecDec to achieve over 5× lossless speedup over autoregressive decoding in seq2seq tasks, refreshing the impression that the "draft-then-verify" paradigm only has a limited 1.5×∼2× acceleration potential.

• We demonstrate 3 advantages of SpecDec with extensive empirical results in addition to its remarkable acceleration performance: a better latency-throughput trade-off, easy adaptability to existing models, and retention of the original model's behavior, revealing its huge practical value and bringing the long-dormant draft-then-verify paradigm back into the spotlight.
Background: Draft-then-Verify Decoding

The "draft-then-verify" paradigm first drafts multiple tokens efficiently, as a speculation of AR decoding results; then, it verifies these tokens in parallel to ensure they match the AR decoding result, as illustrated in Figure 1. It is an implicit implementation of speculative execution in Transformer inference.
Draft There are different approaches to drafting tokens, including model-based (Stern et al., 2018) and input-(context-)based methods (Sun et al., 2021; Yang et al., 2023). Take Blockwise Decoding, the most representative work attempting the draft-then-verify paradigm, as an example (illustrated in Figure 2(a)): it introduces additional k − 1 feedforward network (FFN) heads on top of an existing AR model, enabling the model to predict the next k drafted tokens in parallel during inference.
Verify The generated drafted tokens are fed into the original AR model and verified in parallel. Specifically, it finds the bifurcation position c, the largest index that ensures all previous c − 1 drafted tokens and the corresponding AR decoded tokens are identical:

$$c = \max\Big\{\, i \;\Big|\; \prod_{i'=1}^{i-1} \mathbb{I}\big(\tilde{y}_{j+i'} = \hat{y}_{j+i'}\big) = 1 \,\Big\} \tag{1}$$

$$\hat{y}_{j+i} = \arg\max_{y}\ \log P\big(y \mid x,\ \hat{y}_{\le j},\ \tilde{y}_{j+1 \dots j+i-1};\ \theta_{\mathrm{AR}}\big) \tag{2}$$

where $\mathbb{I}(\cdot)$ is the indicator function, $x$ is the source sentence, $\hat{y}_{\le j}$ are the previously generated tokens, $\tilde{y}_{j+i}$ is the $i$-th drafted token and $\hat{y}_{j+i}$ is the corresponding AR decoded token. Drafted tokens after the position c are all discarded. The final decoded tokens in the current iteration are:

$$\hat{y}_{\le j+c} = \big(\hat{y}_{\le j},\ \tilde{y}_{j+1},\ \dots,\ \tilde{y}_{j+c-1},\ \hat{y}_{j+c}\big) \tag{3}$$

The above draft and verification steps are iterated until the termination condition is met.
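To make the verification step concrete, below is a minimal Python sketch of the vanilla accept-or-bifurcate logic; the helper names and inputs are illustrative rather than the paper's implementation, and, following the common formulation, the AR model's own prediction at the bifurcation position is kept so that every iteration makes progress.

```python
def verify_draft(draft_tokens, ar_next_tokens):
    """Vanilla verification: keep drafted tokens until the first one that
    disagrees with the AR model's greedy prediction at the same position.

    draft_tokens   -- the k token ids proposed by the drafter
    ar_next_tokens -- the AR model's greedy (top-1) token ids at those k
                      positions, obtained from one parallel forward pass
    """
    accepted = []
    for drafted, ar_top1 in zip(draft_tokens, ar_next_tokens):
        if drafted == ar_top1:
            accepted.append(drafted)   # matches AR decoding: accept
        else:
            accepted.append(ar_top1)   # bifurcation: keep the AR token instead
            break                      # and discard the remaining drafted tokens
    return accepted
```

Because the AR token at the bifurcation position is kept, each iteration decodes at least one token, and the overall result is identical to AR greedy decoding.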

Speculative Decoding
To fully exploit speculative execution for Transformer inference, we propose Speculative Decoding (SpecDec) with two innovations, Spec-Drafter and Spec-Verification, that substantially improve drafting (Section 3.1) and verification (Section 3.2), respectively.

Design Principles
As a crucial ingredient in the draft-then-verify paradigm, the drafting process has a drastic impact on end-to-end acceleration performance. However, previous studies offer very limited exploration of the design principles for the drafter; most of them arbitrarily implement a drafter, which accounts for their undesirable acceleration results.
To understand the effect of drafting, we look into the overall latency in the draft-then-verify paradigm for one sample of length L as follows:

$$\text{Latency} = \underbrace{\frac{L}{\text{Tok.}} \times t_d}_{\text{total drafting latency}} + \underbrace{\frac{L}{\text{Tok.}} \times t_v}_{\text{total verification latency}} \tag{4}$$

where Tok. denotes the average number of drafted tokens accepted per iteration, and t_d and t_v are the time costs of drafting and verification in each iteration respectively.
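As a quick illustration of Eq (4), the sketch below estimates the end-to-end speedup over AR decoding implied by a given Tok. and per-iteration costs; the timing numbers are made up for illustration, not measurements from this paper.

```python
def estimated_speedup(L, tok, t_d, t_v, t_ar):
    """Speedup over AR decoding implied by Eq (4).

    L    -- target length in tokens
    tok  -- average number of drafted tokens accepted per iteration (Tok.)
    t_d  -- drafting cost per iteration
    t_v  -- verification cost per iteration
    t_ar -- cost of a single AR decoding step
    """
    iterations = L / tok                        # fewer iterations when drafting is accurate
    specdec_latency = iterations * (t_d + t_v)  # Eq (4)
    ar_latency = L * t_ar                       # AR decodes one token per step
    return ar_latency / specdec_latency

# Illustrative numbers: if one verification pass costs about as much as one AR
# step and drafting is half as expensive, accepting ~8 tokens per iteration
# yields roughly a 5x end-to-end speedup.
print(estimated_speedup(L=30, tok=8.0, t_d=0.5, t_v=1.0, t_ar=1.0))
```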
According to Eq (4), Tok. is inversely proportional to the number of iterations, which is primarily influenced by drafting accuracy: a drafter that is more capable of accurate drafting can attain greater Tok. values, consequently completing the decoding process in fewer iterations. This observation leads us to derive the first principle for designing the drafter:

Principle I (Capability Principle): The drafter model should be seriously invested to guarantee its capability of accurate drafting.
Principle I is the most crucial principle in determining the end-to-end speedup, as it directly influences the value of Tok., which affects both total drafting and verification latency. Surprisingly, little previous work adheres to this seemingly simple and straightforward principle, perhaps due to the concern of increasing the drafting latency. For instance, the drafter in Blockwise Decoding is not properly invested: it not only has limited parameters (FFN prediction heads), making it difficult to fit the challenging drafting task, but more importantly, it employs a shared attention mechanism that forces all drafted tokens to share a single set of attentions (only differentiating at the final prediction head), as shown in Figure 2(a). However, different target positions should attend to different context tokens, as illustrated in Figure 3. Despite its computational efficiency, the shared attention mechanism in Blockwise Decoding severely constrains the drafter's capability, resulting in low drafting accuracy and consequently leading to most drafted tokens being discarded.
In addition to the drafter's accuracy, its latency also impacts the end-to-end speedup, but from another perspective: by affecting the latency of each iteration (i.e., t_d in Eq (4)). From this, we derive Principle II for designing the drafter:
Principle II (Latency Principle): The drafter should be fast at generating drafted tokens to minimize the latency overhead of each iteration.
Designing a fast drafter solely based on Principle II is not difficult, as done in most previous work. The real challenge lies in designing a low-latency drafter without compromising its capability (Principle I), since it is difficult to achieve both low latency and high capability simultaneously.

Model Architecture
We propose Spec-Drafter, which adheres to both principles for accurate and fast drafting. To ensure the drafter is sufficiently capable of accurate drafting (Principle I), Spec-Drafter employs an independent encoder-decoder model architecture, which generates drafted tokens conditioned on the leftward context and source tokens in a mask-predict manner (Ghazvininejad et al., 2019), as illustrated in Figure 2(b). This independent model design allows Spec-Drafter to predict each drafted token using a distinct attention query, in contrast to Blockwise Decoding, which employs a shared attention query for predicting all drafted tokens (as illustrated in Figure 2). In this way, Spec-Drafter can better align with the AR model's behavior, thereby increasing the chances of its drafted tokens being accepted during verification, as shown in Figure 3.
To make Spec-Drafter fast (Principle II) without compromising its capability, we design its decoder to be lightweight by reducing the number of decoder layers and reallocating the freed-up budget to its encoder (by increasing its depth). This design is motivated by the fact that the encoder is forwarded only once, while the decoder is forwarded repeatedly for iterative decoding. Such encoder-favored modeling has been demonstrated by previous work to improve latency with little generation quality degradation (Kasai et al., 2021; Sun et al., 2021; Ge et al., 2022a). We find it also highly effective for the drafter in the draft-then-verify decoding paradigm.
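A minimal PyTorch sketch of this encoder-favored budget allocation is shown below; the 12+2 layer split mirrors the configuration described in the experimental settings, while the remaining hyperparameters are illustrative defaults rather than the paper's exact training configuration.

```python
import torch.nn as nn

# Illustrative deep-encoder / shallow-decoder drafter backbone: the encoder is
# forwarded only once per sentence, so extra depth there is cheap (Principle I),
# while the decoder is forwarded at every drafting iteration, so it stays
# shallow (Principle II).
d_model, n_heads, ffn_dim = 512, 8, 2048

drafter_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, ffn_dim, batch_first=True),
    num_layers=12,  # budget reallocated from the decoder to the encoder
)
drafter_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, ffn_dim, batch_first=True),
    num_layers=2,   # lightweight decoder keeps per-iteration latency low
)
```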

Training and Inference
Formally, given the source sentence x and a randomly sampled prefix y_{\le p} (0 ≤ p < m) of the target sentence, Spec-Drafter appends k special "[MASK]" tokens to y_{\le p} and is trained to predict these masked tokens in parallel:

$$\mathcal{L}_{\mathrm{drafter}} = -\sum_{i=1}^{k} \log P\big(y_{p+i} \mid x,\ y_{\le p},\ [\mathrm{MASK}]_{1 \dots k};\ \theta_{\mathrm{drafter}}\big)$$

In addition, we leverage the glancing strategy following Qian et al. (2021), which exploits curriculum learning during training to achieve better generation performance.
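A minimal sketch of how such a training example could be constructed (assuming integer token ids and a hypothetical mask_id; this is not the paper's actual data pipeline):

```python
import random

def build_drafter_example(target_ids, k, mask_id):
    """Construct one Spec-Drafter training example: a randomly sampled target
    prefix y_<=p followed by k [MASK] tokens, whose labels are the next (up to
    k) gold tokens."""
    m = len(target_ids)
    p = random.randrange(m)                        # prefix length, 0 <= p < m
    decoder_input = target_ids[:p] + [mask_id] * k
    labels = target_ids[p:p + k]                   # targets for the masked slots
    return decoder_input, labels
```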
During inference, Spec-Drafter appends k "[MASK]" tokens to the previously decoded tokens ŷ_{\le j} and simultaneously predicts these masked tokens as a drafted block:

$$\tilde{y}_{j+i} = \arg\max_{y}\ \log P\big(y \mid x,\ \hat{y}_{\le j},\ [\mathrm{MASK}]_{1 \dots k};\ \theta_{\mathrm{drafter}}\big), \quad i = 1, \ldots, k$$
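Correspondingly, one drafting step at inference time can be sketched as follows, assuming a hypothetical `drafter(src, decoder_input)` callable that returns per-position logits; the interface is illustrative.

```python
import torch

@torch.no_grad()
def draft_block(drafter, src, prefix_ids, k, mask_id):
    """One drafting step: append k [MASK] tokens to the verified prefix and
    fill all of them in a single parallel forward pass of the drafter.

    `drafter(src, decoder_input)` is assumed to return logits of shape
    [len(decoder_input), vocab_size].
    """
    masks = prefix_ids.new_full((k,), mask_id)
    decoder_input = torch.cat([prefix_ids, masks])
    logits = drafter(src, decoder_input)
    return logits[-k:].argmax(dim=-1)  # greedy predictions at the k masked slots
```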

Spec-Verification
As introduced in Section 2, the vanilla verification strategy of preliminary studies only accepts drafted tokens that match the top-1 result of the AR model, which guarantees that the decoding results are identical to AR greedy decoding. However, the top-1 results are not necessarily better than the drafted tokens, especially when the paradigm is equipped with a high-quality drafter. Therefore, the strict verification criterion (i.e., top-1 matching) results in many good drafted tokens being discarded just because they differ from the top-1 result of the AR model, which limits the speedup of the paradigm.
To make better use of the drafting results, we propose an advanced verification strategy named Spec-Verification, which is illustrated in Figure 4. Instead of the rigid matching requirement shown in Eq (2), Spec-Verification relaxes the criterion to trust the drafting results more, by only requiring the drafted tokens to fall within the top-β candidates with a tolerable (log-likelihood) score gap τ away from the top-1 result. Formally, it accepts the i-th drafted token ỹ_{j+i} if all previous i − 1 drafted tokens are accepted and Eq (8) and Eq (9) are both true:

$$\tilde{y}_{j+i} \in \operatorname{top}\text{-}\beta\big(P(y \mid \triangle;\ \theta_{\mathrm{AR}})\big) \tag{8}$$

$$\log P\big(\hat{y}_{j+i} \mid \triangle;\ \theta_{\mathrm{AR}}\big) - \log P\big(\tilde{y}_{j+i} \mid \triangle;\ \theta_{\mathrm{AR}}\big) \le \tau \tag{9}$$

where △ denotes the conditioning context (x, ŷ_{≤j}, ỹ_{j+1...j+i−1}) and log P(ŷ_{j+i} | △; θ_AR) is the log-likelihood score of the AR model's top-1 ranked result.
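A minimal sketch of this relaxed acceptance rule, operating on the AR model's log-probabilities from one parallel verification pass (function and variable names are illustrative):

```python
import torch

def spec_verify(draft_tokens, ar_logprobs, beta, tau):
    """Relaxed acceptance (Eq (8) and Eq (9)): a drafted token is kept if it is
    among the AR model's top-beta candidates and its log-likelihood is at most
    tau below the AR top-1 candidate; otherwise we fall back to the AR top-1
    token and discard the rest of the draft.

    draft_tokens -- LongTensor [k] of drafted token ids
    ar_logprobs  -- FloatTensor [k, vocab] of AR log-probabilities at the same
                    positions, from one parallel verification pass
    """
    accepted = []
    for i, drafted in enumerate(draft_tokens.tolist()):
        top_scores, top_ids = ar_logprobs[i].topk(beta)
        in_top_beta = drafted in top_ids.tolist()                      # Eq (8)
        small_gap = (top_scores[0] - ar_logprobs[i, drafted]) <= tau   # Eq (9)
        if in_top_beta and small_gap:
            accepted.append(drafted)
        else:
            accepted.append(int(top_ids[0]))
            break
    return accepted
```

Setting beta = 1 and tau = 0 recovers the vanilla top-1 matching strategy.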

Experimental Settings
Datasets and Evaluation We mainly evaluate our approach on two standard machine translation benchmarks: WMT14 EN↔DE (4.5M pairs) and WMT16 EN↔RO (610K pairs). Following prior work (Ott et al., 2018), for WMT14 EN↔DE translation, we adopt newstest-13 as our validation set for finding the best hyperparameters and test on newstest-14. For WMT16 EN↔RO translation, we use the dataset released by Lee et al. (2018), where newsdev2016 and newstest2016 are taken as the validation and test sets. We use 32K Byte Pair Encoding (BPE) (Sennrich et al., 2016) subwords as the joint source-target dictionary. We evaluate performance with BLEU (Papineni et al., 2002) for both language pairs.
For inference efficiency, we report the decoding speedup over beam search. Specifically, we test inference speed by running the model with one sentence at a time (batch size 1). We perform model inference with the fairseq implementation using PyTorch 1.10.1 on 1 Nvidia Tesla P100-PCIe GPU with 16GB memory under CUDA 11.1.

Model Configuration
The primary target model we accelerate in our experiments is the Transformer-base model with a 6-layer encoder and a 6-layer decoder of 512/2048 embedding/FFN dimension, which can achieve state-of-the-art results on the benchmarks under comparable model size conditions. For the Spec-Drafter, we adopt a similar architecture to the AR model except with 12 encoder layers and 2 decoder layers to make sure it adheres to both the Capability and Latency Principles. We apply sequence-level knowledge distillation (Kim and Rush, 2016) by the AR teacher to the Spec-Drafter to align its behavior with the AR model as much as possible. We include model training details in Appendix A. For Spec-Verification, we choose the hyperparameters β and τ that lead to the best generation quality on the validation set. Besides, we re-implement Blockwise Decoding using the same device and environment as ours to facilitate a fair comparison.

Results
We present the performance and the acceleration effect of SpecDec on the Transformer in Table 1. As reported in previous work (Stern et al., 2018), Blockwise Decoding (k = 10) can only achieve 1.4×∼2× speedup without affecting the generation results of the Transformer-base model. Further increasing the parallel capability of Blockwise Decoding (e.g., k = 25) does not introduce more speedup, as its limited drafting accuracy prevents more drafted tokens from being accepted. In contrast, SpecDec shows consistent performance improvement with increased parallel capability (k = 10 → k = 25), resulting in around 4.6×∼5.5× speedup across the translation benchmarks; moreover, it even achieves an improvement in generation quality (by the BLEU metric) compared with AR greedy decoding. Similar results are also observed when accelerating the Transformer with a 12-layer encoder and 2-layer decoder: SpecDec can still achieve around 2.5×∼3.3× speedup, while Blockwise Decoding's acceleration effect becomes nearly negligible over this fast AR baseline.

Analysis
In this section, we conduct a comprehensive and thorough analysis to demonstrate that the significant improvement of SpecDec arises from both the Spec-Drafter (Section 4.3.1) and Spec-Verification (Section 4.3.2).

Drafter

According to Table 2, the Spec-Drafter significantly outperforms the head-based drafter (as used in Blockwise Decoding) in terms of both drafting accuracy and end-to-end acceleration. Ablating the Capability Principle (i.e., reducing the Spec-Drafter's model/FFN dimension, as in Table 2) results in a drastic drop in end-to-end acceleration performance, as more iterations are needed to complete the decoding process, indicated by a lower Tok. When we ablate the Latency Principle by using a balanced (6+6) encoder-decoder architecture for the drafter, it also experiences a substantial decline in end-to-end acceleration performance due to increased latency in each iteration (reflected by a higher t_d).
Moreover, we analyze the effect of the block size k on the end-to-end acceleration performance of the paradigm in Table 3. In contrast to Blockwise Decoding, which achieves its best acceleration performance at k = 10 as shown in Table 1, SpecDec achieves its best performance at k = 25 with 7.89 mean accepted tokens per iteration. Further increasing k has an adverse effect, because it becomes very hard for the model to learn to draft so many tokens simultaneously given its capacity, resulting in a drop of Tok.

Verification
We study the effect of Spec-Verification on the development set (i.e., newstest-2013) of WMT14 EN→DE in Table 4. Moderately increasing τ and β in Spec-Verification not only leads to an increase of mean accepted tokens (Tok.) and speed, since AR verification becomes less strict, but also improves the generation quality over greedy decoding. However, the generation quality may decrease if over-relaxed: the BLEU score degrades from the peak of 26.97 to 26.58 when decoding with top-5 selection (i.e., β = 5) and τ = 5.0. Based on the results on the development set, we select β = 3 and τ = 1.0 as our Spec-Verification hyperparameters.

Practical Value
In addition to the remarkable speedup results, we demonstrate SpecDec's additional advantages that enhance its practical value in the following three aspects:

Better latency-throughput trade-off SpecDec achieves inference acceleration by increasing GPU computing parallelism. Although increasing the batch size can also increase computing parallelism to improve throughput, it results in increased latency, which is not desirable in real-world application scenarios. Therefore, a smaller batch size is often employed during inference, but this in turn results in underutilization of GPU computing resources, leading to the dilemma of low throughput for small batches and high latency for large batches, as illustrated by Figure 5. SpecDec effectively addresses this dilemma: even with a small batch size, SpecDec can fully utilize the computing capability of the GPU, significantly improving both efficiency and throughput.
Easily adaptable for existing models In many practical applications, generative models are often pretrained with massive data, which yields very high performance. Developing a faster model from scratch to replace the pretrained model is highly challenging and typically requires substantial computational costs to reiterate the pretraining process.
Otherwise, the quality of the new models is very likely to be compromised despite the increased speed, as Table 5 shows. However, SpecDec can be easily adapted to accelerate existing pretrained models. Taking the BART-base (Lewis et al., 2020) model for the abstractive summarization task as an example, we can easily achieve 5.1× speedup without compromising the generation quality, only by initializing the Spec-Drafter with the BART-base encoder and training it with the BART-base distilled summarization training set.
Compared to prior NAR methods, SpecDec can be easily adapted to accelerate the BART model only by downstream task fine-tuning.

Models                              BLEU
Transformer-base (greedy)           100.00
GLAT+CTC (Qian et al., 2021)        59.10
DAT (Huang et al., 2022)            63.79
CMLM (Ghazvininejad et al., 2019)   60.15
RewriteNAT (Geng et al., 2021)      65.42
Deep-Shallow (Kasai et al., 2020)   64.66
SpecDec (k = 25)                    86.52

Retaining the behavior of the original model SpecDec does not build a new, faster model to replace the existing model. Instead, it accelerates the existing model with minimal changes to its behavior. As shown in Table 6, the consistency (BLEU) of SpecDec's generated results with the original model exceeds 85%, while that of a newly built fast NAR model is only around 55%. This characteristic of maintaining the behavior of the original model makes SpecDec even more valuable in practical applications, because transitioning from a well-tested and mature model to one with substantially different behavior is risky, requiring extensive recalibration and various offline evaluations and online feedback in practice.

Related Work
Speculative Decoding We have demonstrated since early 2022 (see our arXiv preprints in 2022) that our proposed methodology, which formally introduces an independent model as a drafter combined with an advanced verification strategy to fully exploit speculative execution, is promising and has the potential to evolve into a de facto standard for efficient and lossless decoding in the future. Since this work was proposed, we are pleased to see an increasing number of follow-up studies (Leviathan et al., 2023; Chen et al., 2023; Kim et al., 2023; Spector and Re, 2023; Zhang et al., 2023) acknowledge, explore and adopt this methodology to accelerate Transformer inference. Among them, Leviathan et al. (2023) use the same name as ours (i.e., Speculative Decoding), employing a small AR model as a drafter as well as an advanced sampling algorithm. Chen et al. (2023) is similar to Leviathan et al. (2023), but it was the first to validate this methodology for accelerating a large language model (i.e., 70B Chinchilla) with a 4B drafter model, thus receiving the most attention. SpecInfer (Miao et al., 2023) proposed to utilize various boost-tuned small language models for joint drafting to improve the speculation accuracy of the LLM's outputs; besides, it introduces an advanced token tree verification strategy to verify all candidate token sequences in parallel. DistillSpec (Zhou et al., 2023) further investigated the efficacy of knowledge distillation in enhancing the alignment between the target model and the drafter in speculative decoding. In addition to employing additional models as drafters, there has also been research proposing various strategies to efficiently generate drafts from the LLM itself (Santilli et al., 2023; Zhang et al., 2023). All the follow-up research strongly backs up the value of this original work.
Early draft-then-verify attempts This work is a generalized version of our previously proposed (Input-guided) Aggressive Decoding (Sun et al., 2021) for Grammatical Error Correction (GEC), which assumes that the input is exactly the sentence to be generated and then verifies the whole sentence in parallel. Blockwise Decoding (Stern et al., 2018) inserted k − 1 feedforward heads on top of the Transformer decoder to generate k positions in parallel and used the original head to verify these outputs. However, neither of the above studies fully investigated the potential of this paradigm and thus failed to uncover its great value for efficient seq2seq generation: Sun et al. (2021) only works for tasks whose inputs and outputs are highly similar (e.g., GEC), while Stern et al. (2018) overlooked the importance of drafting accuracy; as a result, their underinvested prediction heads severely limit the acceleration results. In contrast, we conduct thorough investigations and fully exploit speculative execution, refreshing the impression of its limited acceleration potential and revealing its real value in practice.
Non-autoregressive Decoding There is also another line of work named Non-Autoregressive Decoding (NAR) (Gu et al., 2018), which decodes multiple tokens in parallel compared with conventional AR decoding, thus showing remarkable superiority in inference efficiency. Recently, various attempts have been made to improve the performance of NAR models, including training with alignment-based objectives (Libovický and Helcl, 2018; Ghazvininejad et al., 2020; Saharia et al., 2020; Gu and Kong, 2021; Shao and Feng, 2022), modeling dependencies between target tokens (Ghazvininejad et al., 2019; Shu et al., 2020; Qian et al., 2021; Bao et al., 2021) and designing various model architectures (Zheng et al., 2021; Huang et al., 2022). As discussed in Section 4.4, replacing a powerful pretrained model with NAR models in practice is challenging due to the substantial computational costs required to reiterate the pretraining process. Additionally, transitioning from a well-tested and mature model to a new NAR model with significantly different behavior poses risks in practical applications. In contrast, our proposed SpecDec can be conveniently adapted to speed up existing AR models, including high-performance pretrained models like BART, with little effort. Moreover, SpecDec minimally alters the behavior of existing models, showcasing its ability to preserve reliable generation performance in real-world practical applications.

Conclusion
We present Speculative Decoding (SpecDec), the first work to explicitly embrace the idea of speculative execution for seq2seq generation acceleration, with a formal study and extensive discussion of both the drafting and verification phases. Contrary to the common belief that an increase in model complexity tends to hamper inference speed, SpecDec's introduction of an appropriately invested auxiliary drafter model substantially speeds up Transformer inference, owing to the higher computational parallelism introduced by speculative execution to better utilize computing resources.
The remarkable acceleration performance, combined with the advantages demonstrated in our experiments, clearly illustrates that SpecDec is a practical acceleration method for model deployment in real-world applications.We hope that our preliminary study could draw more attention to this promising decoding paradigm that may potentially evolve into a de facto standard for efficient Transformer decoding in the near future.

Limitations
Compared with conventional autoregressive decoding, SpecDec introduces an extra Spec-Drafter module to ensure drafting accuracy, which brings additional memory cost at test time. Therefore, SpecDec is particularly suitable for inference scenarios where GPU memory is abundant but there is an urgent need to improve latency; it provides a solution to trading the surplus GPU memory for speed improvements. As thoroughly discussed in Appendix B, such scenarios are very common in practice. Most importantly, memory is no longer the bottleneck for practical model deployment. With the emergence and maturity of various data/tensor/pipeline parallelism techniques, the addition of more GPUs can easily address memory issues, which is also why models continue to grow larger. In contrast, latency remains an inescapable bottleneck in model deployment that cannot be resolved merely by increasing the number of machines. Therefore, we believe the increased memory consumption may not severely affect its practical value.
• The Spec-Drafter's last encoder layer's representation, which is not freed until decoding finishes and is equal to B · S · d, where B is the batch size, S is the sequence length and d is the model dimension. This part is actually negligible.

Table 8 and Table 9 show the comparisons of peak GPU memory footprint (MB) between SpecDec and AR (during inference) in the above two scenarios (i.e., MT and summarization). The results are consistent with our analysis above: the majority of the additional memory cost (i.e., ΔMemory) is for storing the Spec-Drafter's weights, and the additional memory cost is not likely to increase significantly as the batch size or sequence length increases.
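For intuition about why this term is negligible, a rough back-of-the-envelope helper is given below; the sizes used are illustrative, not this paper's measured numbers.

```python
def encoder_repr_megabytes(batch_size, seq_len, d_model, bytes_per_value=4):
    """Approximate size of the Spec-Drafter's last encoder-layer representation
    (B x S x d) that stays resident until decoding finishes."""
    return batch_size * seq_len * d_model * bytes_per_value / 1024 ** 2

# Illustrative sizes: B=32, S=128, d=512 in fp32 is only about 8 MB.
print(encoder_repr_megabytes(32, 128, 512))
```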
Our experiments above pre-loaded both the Spec-Drafter and the AR model. In fact, it is also possible to load the static weights of the AR model and the Spec-Drafter in a lazy-loading manner, overlapping with GPU computation, to save memory, since they run alternately. However, this is usually unnecessary in practice, because for a seq2seq model deployed on modern GPUs for online service, it is latency rather than memory that is the performance bottleneck. See the next section for more discussion.

B.2 Memory Is Rarely the Bottleneck
To understand the performance bottleneck of online deployed seq2seq models, we test the latency and memory cost of T5-large (around 770M parameters) with fp16 on 1 Nvidia A40 GPU running greedy decoding on the machine translation and abstractive summarization tasks, and show the results in Table 10 and Table 11. For MT, T5-large's latency is over 1 second, which is actually too long to be accepted, because most MT engines in practice require the latency to be less than 100ms. However, its memory cost is less than 2GB, far below the A40 GPU's memory capacity (i.e., 48GB).
For abstractive summarization, even if the batch size increases to 32, its memory cost is still less than 50% of 1 A40 GPU's capacity, but its latency already approaches 5 seconds, which is too long for an online service in practice.
To sum up, latency is the bottleneck of seq2seq models for online deployment in most cases. Therefore, we do not think the additional memory cost of SpecDec will undermine its practical value; instead, we think a significant lossless acceleration even at the cost of memory (i.e., a time-memory trade-off) is much more meaningful than acceleration at the cost of quality, and is the right path that deserves more attention given the large memory headroom on modern GPUs.

C SacreBLEU and COMET Scores
In addition to tokenized BLEU scores, we also report SacreBLEU (Post, 2018) and COMET (Rei et al., 2020) scores in Table 12 and Table 13 to provide a reference for future research. SpecDec also achieves performance on par with the AR model under SacreBLEU and COMET evaluation. Following Schmidt et al. (2022), we recommend that future research use SacreBLEU when comparing with our work.

D Speedup Distribution
Figure 6 presents the distribution of SpecDec's per-sentence speedup on the WMT14 EN→DE test set (3,003 sentences in total), showing that most sentences are translated with a 3×∼7× speedup compared to AR beam search, while some rare cases can even achieve a 10×∼11× speedup.

E Discussions of Beam Search
Regarding possible concerns that SpecDec does not use beam search, we make three points here:

1. As Kim and Rush (2016) mentioned, knowledge distillation largely decreases the performance gap between beam search and greedy decoding. In practice, greedy decoding can be comparable to beam search results after KD.

2. In practical online deployment, KD is almost used by default to enhance student models, and greedy decoding is much more common than beam search because it is more cost-effective: it not only runs faster than beam search but also achieves decent performance with a student model trained through KD (as addressed in Point 1).

3. Beam search is also an approximate and heuristic solution, not a golden rule.
In fact, Spec-Verification works in a similar way to beam search: it is also an approximate and heuristic solution that considers the n-best candidates and their scores, and can be regarded as an approximation of beam search. As shown in Table 1, it achieves performance comparable to beam search while being much faster (4×∼6×).

F Carbon Emission
SpecDec introduces extra computational overhead, which leads to an increase in GPU power consumption. However, it can substantially reduce GPU hours owing to its high efficiency. We compare the GPU power consumption and GPU hours of autoregressive decoding (AR) and SpecDec for translating 3,000 sentences in Table 15. We follow the formula from Wu et al. (2022) to calculate the total energy consumption and carbon emitted: E = P × t × PUE, where we set PUE (Power Usage Effectiveness) to 1.1, and CO2eq = 0.385 gCO2eq/Wh × E, where 0.385 gCO2eq/Wh is the carbon intensity factor set based on the US national average.
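A small sketch of this bookkeeping is shown below; the wattage and GPU-hour values are illustrative placeholders, not the measurements in Table 15.

```python
def energy_wh(power_watts, hours, pue=1.1):
    """Total energy in Wh: E = P x t x PUE (Wu et al., 2022), with PUE = 1.1."""
    return power_watts * hours * pue

def co2_grams(energy, intensity=0.385):
    """CO2-equivalent in grams, using the US-average 0.385 gCO2eq/Wh factor."""
    return intensity * energy

# Illustrative placeholder numbers: a run drawing 28% more power but finishing
# in roughly 1/6 of the GPU hours still emits far less CO2 overall.
ar_co2 = co2_grams(energy_wh(250, 1.0))
specdec_co2 = co2_grams(energy_wh(250 * 1.28, 1.0 / 6.4))
print(ar_co2, specdec_co2)
```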
According to Table 15, while SpecDec's GPU power consumption is 28% higher, its GPU hours are 540% shorter than autoregressive decoding. As a result, SpecDec's total energy consumption and carbon emission are substantially lower than those of autoregressive decoding.

Figure 2: (a) Blockwise Decoding, which introduces k − 1 FFN heads on top of the target AR model for drafting the next k tokens with shared attention; (b) Spec-Drafter, an independent model for drafted token prediction that employs distinct attention queries for predicting each drafted token. Modules colored in yellow belong to the original AR model, while those colored in red denote newly introduced modules.

Figure 3: Upper: an AR model's attention heatmap showing that different target positions should attend to different source tokens; Lower: the Spec-Drafter's attention heatmap showing its capability of modeling drafted tokens at different positions, which aligns closely with the AR counterpart.

Figure 4: Illustration of Spec-Verification. Compared to the vanilla verification strategy, which strictly requires the drafted tokens to match the AR top-1 result, Spec-Verification slightly relaxes the criterion to trust the drafts more, only requiring the drafted tokens to fall within the top-β AR candidates with a tolerable log-likelihood gap (not shown in this figure; see Eq (9)). As a result, Spec-Verification allows more drafted tokens to be accepted even if they are slightly different from the AR top-1 result, leading to a higher inference speedup.

Figure 5: The latency-throughput curve with various batch sizes on WMT14 EN→DE.

Table 1: The performance of Speculative Decoding (SpecDec) in speeding up the Transformer-base model on the WMT benchmarks. We re-implement and evaluate Blockwise Decoding using the same device and environment as ours.

Table 2: Ablation studies of the drafter in SpecDec on WMT14 EN→DE. Tok. denotes the average number of drafted tokens accepted in each iteration; t_d denotes the time cost of drafting per iteration. The head-based drafter is the one used by Blockwise Decoding (Stern et al., 2018). The Spec-Drafter (w/o Principle I) reduces the model/FFN dimension to 256/1024, while the Spec-Drafter (w/o Principle II) does not use a deep encoder and shallow decoder but instead utilizes a more balanced architecture with equal numbers of encoder and decoder layers.


Table 3: The mean accepted tokens (Tok.), the generation quality (BLEU), and the efficiency (Speed) when decoding with various block sizes k on the development set of WMT14 EN→DE.

Table 4: Results of SpecDec (k = 25) on the development set of WMT14 EN→DE with different hyperparameters. Each cell lists the mean accepted tokens and the BLEU score. Among the runs in the table, the highest BLEU score of 26.97 is achieved when β = 3 and τ = 1.0, with a 5× speedup. On the other hand, when β = 5 and τ = 5, the highest Tok. (i.e., 11.01) is reached, resulting in almost 7× speedup, though the BLEU score slightly decreases to 26.58.

Table 5: Results of different methods for abstractive summarization on CNN-DM.

Table 6: Relative BLEU score computed between the generations of the existing Transformer (i.e., the target model) and other models/approaches. SpecDec shows much better alignment with the target model's behavior than others.

Table 8: Peak GPU memory utilization on the WMT14 EN-DE translation dataset. The results are obtained with fp32 on a single Nvidia P100 GPU.

Table 10: Latency and peak GPU memory utilization of T5-Large on WMT14 EN-DE.

Table 12: SacreBLEU and COMET scores on WMT14 EN-DE.