SLiM: Speculative Decoding with Hypothesis Reduction

Chi-Heng Lin, Shikhar Tuli, James Smith, Yen-Chang Hsu, Yilin Shen, Hongxia Jin


Abstract
Speculative decoding has emerged as a prominent alternative to autoregressive decoding for expediting inference in large language models (LLMs). However, prevailing approaches often focus solely on latency reduction, neglecting computational expense. In this paper, we present Speculate Less, validate More (SLiM), a speculative decoding enhancement that reduces the speculation set while validating more effective tokens. SLiM is designed to mitigate the LLM's computation cost associated with token verification by introducing hypothesis reduction based on a fast posterior estimation. It consistently surpasses counterparts that lack cost reduction across hardware ranging from CPUs to GPUs. Our evaluation on diverse conversational datasets shows that SLiM achieves a substantial 70% reduction in FLOPs while generating more effective predictions on top of prior art.
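The abstract describes a three-stage pipeline: a draft model proposes a speculation set, a fast posterior estimate prunes that set, and the expensive LLM verifies only the surviving hypotheses. Below is a minimal, self-contained Python sketch of that pipeline; the toy models, the top-`keep` pruning heuristic, and all names are hypothetical stand-ins for illustration, not the paper's actual posterior estimator or verification rule.

```python
# Minimal sketch of speculative decoding with hypothesis reduction.
# Everything here (toy models, the top-`keep` pruning rule) is a
# hypothetical stand-in, meant only to illustrate where the FLOP savings
# come from: the expensive target model scores a reduced hypothesis set
# instead of the full speculation set.
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 100

def draft_logits(context):
    """Cheap draft model (toy: random logits)."""
    return rng.normal(size=VOCAB_SIZE)

def target_logits(context):
    """Expensive target LLM (toy: random logits)."""
    return rng.normal(size=VOCAB_SIZE)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def speculative_step(context, speculate=8, keep=3):
    # 1) Speculate: the draft model proposes a set of candidate tokens.
    p_draft = softmax(draft_logits(context))
    candidates = np.argsort(p_draft)[-speculate:]

    # 2) Reduce: a fast posterior estimate discards unlikely hypotheses,
    #    so the target model verifies `keep` tokens instead of `speculate`.
    #    (Here the estimate is just the draft posterior itself.)
    kept = candidates[np.argsort(p_draft[candidates])[-keep:]]

    # 3) Validate: the target model scores only the reduced set; the
    #    highest-probability survivor is accepted.
    p_target = softmax(target_logits(context))
    return int(kept[np.argmax(p_target[kept])])

context = [1, 2, 3]  # prompt token ids
for _ in range(5):
    context.append(speculative_step(context))
print(context)  # prompt followed by 5 generated tokens
```

In this toy setup, the verification cost per step scales with `keep` rather than `speculate`, which is the mechanism behind the reported FLOP reduction.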
Anthology ID:
2024.findings-naacl.63
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1005–1017
URL:
https://aclanthology.org/2024.findings-naacl.63
DOI:
10.18653/v1/2024.findings-naacl.63
Cite (ACL):
Chi-Heng Lin, Shikhar Tuli, James Smith, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. 2024. SLiM: Speculative Decoding with Hypothesis Reduction. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 1005–1017, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
SLiM: Speculative Decoding with Hypothesis Reduction (Lin et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.63.pdf