@inproceedings{bhendawade-etal-2025-speculative,
title = "Speculative Streaming: Efficient and Scalable Speculative Decoding with Multi-Stream Attention",
author = "Bhendawade, Nikhil and
Belousova, Irina and
Fu, Qichen and
Mason, Henry and
Lin, Antonie and
Rastegari, Mohammad and
Najibi, Mahyar",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.986/",
pages = "19547--19570",
ISBN = "979-8-89176-332-6",
abstract = "Speculative decoding is a prominent technique for accelerating LLM inference by leveraging an auxiliary draft model, but its effectiveness is limited by the autoregressive nature of draft generation, where acceptance rates depend on the draft model{'}s size. Scaling the draft model improves acceptance but also increases speculation latency, limiting overall speedup. Furthermore, fine-tuning both the draft and target models is often necessary to achieve high acceptance rates, adding complexity to inference systems as the number of downstream tasks grows. Single-model approaches like Medusa generate speculative tokens non-autoregressively but lack token dependencies, limiting effectiveness. Alternatives like Hydra and Eagle incorporate token dependencies but rely on dedicated heads, making speculation independent of the base model and limiting the extent to which stronger base models can improve speculation.We introduce a novel speculative decoding method that integrates speculative draft generation directly within the target model using multi-stream attention. This improves acceptance rates by introducing interdependencies between speculative tokens while ensuring non-autoregressive draft generation with minimal overhead. As target models scale in size and quality, speculative generation improves naturally with our method, unlike prior approaches. Furthermore, our approach is both parameter- and FLOP-efficient, requiring over 1000X fewer additional parameters than Medusa, making it highly suitable for resource-constrained devices. We design our method to operate in two modes: (1) Lossless mode, a plug-and-play method that preserves the output of any pre-trained model; and (2) Shared mode, optimizing both speedup and downstream performance. We demonstrate a 2{--}3.5X speedup across diverse tasks, including summarization, translation, question answering, mathematical reasoning, SQL generation, and retrieval-augmented generation (RAG)."
}
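
For readers skimming this entry: the abstract refers to the draft-then-verify loop that all speculative decoding methods share. The sketch below is a generic, greedy version of that loop, not the paper's multi-stream implementation; speculative_decode, draft_model, target_model, and gamma are names invented for this illustration, with toy callables standing in for real models.

from typing import Callable, List

Token = int

def speculative_decode(
    prompt: List[Token],
    draft_model: Callable[[List[Token]], Token],   # cheap proposer
    target_model: Callable[[List[Token]], Token],  # expensive verifier
    gamma: int = 4,              # speculative tokens drafted per step
    max_new_tokens: int = 16,
) -> List[Token]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        base = list(tokens)
        # 1) Draft gamma tokens with the cheap model (autoregressively
        #    here; per the abstract, Speculative Streaming instead drafts
        #    them non-autoregressively inside the target model itself).
        draft: List[Token] = []
        for _ in range(gamma):
            draft.append(draft_model(base + draft))
        # 2) Verify with the target model: keep the longest matching
        #    prefix, then take the target's own next token. Greedy
        #    matching keeps the output identical to plain greedy
        #    decoding with the target model ("lossless").
        for i, t in enumerate(draft):
            expected = target_model(base + draft[:i])
            if expected != t:
                tokens.extend(draft[:i])
                tokens.append(expected)  # target's token replaces the miss
                break
        else:
            tokens.extend(draft)
            tokens.append(target_model(base + draft))  # bonus token
    return tokens[: len(prompt) + max_new_tokens]

# Toy models: the target counts up by 1; the draft mostly agrees.
if __name__ == "__main__":
    target = lambda ctx: ctx[-1] + 1
    draft = lambda ctx: ctx[-1] + (1 if len(ctx) % 4 else 2)
    print(speculative_decode([0], draft, target, gamma=4, max_new_tokens=8))

A real system would verify all gamma drafted tokens in a single batched forward pass of the target model; the per-token target calls above are sequential only for readability. The paper's contribution, per the abstract, is to replace the separate draft model in step 1 with multi-stream attention inside the target model.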