SLiM: Speculative Decoding with Hypothesis Reduction
Chi-Heng Lin | Shikhar Tuli | James Smith | Yen-Chang Hsu | Yilin Shen | Hongxia Jin
Findings of the Association for Computational Linguistics: NAACL 2024
Speculative decoding has emerged as a prominent alternative to autoregressive decoding for accelerating inference in large language models (LLMs). However, prevailing assumptions often center on latency reduction alone and neglect computational cost. In this paper, we present Speculate Less, validate More (SLiM), a speculative decoding enhancement that reduces the speculation set while validating more effective tokens. SLiM is designed to mitigate the computational cost of token verification in LLMs by introducing hypothesis reduction based on a fast posterior estimation. It consistently surpasses counterparts that lack cost reduction across hardware ranging from CPU to GPU. Our evaluation on diverse conversational datasets shows that SLiM can achieve a substantial 70% reduction in FLOPs while generating more effective predictions on top of prior art.