CLaSp: In-Context Layer Skip for Self-Speculative Decoding

Longze Chen; Renke Shan; Huiming Wang; Lu Wang; Ziqiang Liu; Run Luo; Jiawei Wang; Hamid Alinejad-Rokny; Min Yang

doi:10.18653/v1/2025.acl-long.1525

Abstract

Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD hinges primarily on the consistency between the draft model and the verify model. However, existing drafting approaches typically require training additional modules, which can be difficult to implement and to keep compatible across different LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp requires neither additional drafting modules nor extra training. Instead, it works as a plug-and-play mechanism that skips intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process, using the complete hidden states from the last verification stage as the objective. This enables CLaSp to adjust its layer-skipping strategy dynamically after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3× to 1.7× on LLaMA3 series models without altering the original distribution of the generated text.
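The abstract does not spell out the dynamic programming recurrence, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes cosine similarity to the full model's hidden states as the objective, a fixed budget of skipped layers, and toy layers that are plain Python functions on lists of floats; the name `select_skipped_layers` is hypothetical. Each table cell keeps a single best candidate, so the search is a heuristic rather than an exhaustive one.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_skipped_layers(layers, h0, num_skip):
    """Pick `num_skip` layers to skip so the draft's hidden states stay
    close to the full model's hidden states (illustrative assumption)."""
    n = len(layers)
    if not 0 <= num_skip <= n:
        raise ValueError("num_skip must be between 0 and len(layers)")

    # Hidden states of the full model after each layer (the "objective").
    full = [h0]
    for layer in layers:
        full.append(layer(full[-1]))

    # best[i][j] = (hidden state, skipped indices) after the first i layers
    # with exactly j of them skipped; None if that cell is unreachable.
    best = [[None] * (num_skip + 1) for _ in range(n + 1)]
    best[0][0] = (h0, [])
    for i in range(1, n + 1):
        for j in range(min(i, num_skip) + 1):
            candidates = []
            if best[i - 1][j] is not None:
                # Option 1: run layer i-1 on the best state with j skips.
                state, skipped = best[i - 1][j]
                candidates.append((layers[i - 1](state), skipped))
            if j > 0 and best[i - 1][j - 1] is not None:
                # Option 2: skip layer i-1, carrying the state through.
                state, skipped = best[i - 1][j - 1]
                candidates.append((state, skipped + [i - 1]))
            if candidates:
                # Keep the candidate closest to the full model at depth i.
                best[i][j] = max(candidates, key=lambda c: cosine(c[0], full[i]))

    state, skipped = best[n][num_skip]
    return skipped, cosine(state, full[n])


# Toy usage: the first layer is an identity, so skipping it costs nothing.
toy_layers = [
    lambda h: list(h),
    lambda h: [h[0] + h[1], h[1]],
    lambda h: [h[0], h[1] - 2 * h[0]],
]
skipped, similarity = select_skipped_layers(toy_layers, [1.0, 2.0], num_skip=1)
```

In this toy setup the identity layer is the one chosen for skipping, since removing it leaves the final hidden state unchanged. In a self-speculative setting, the selected layer set would define the draft model until the next verification stage supplies fresh hidden states.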

Anthology ID: 2025.acl-long.1525
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 31608–31618
URL: https://aclanthology.org/2025.acl-long.1525/
DOI: 10.18653/v1/2025.acl-long.1525
Cite (ACL): Longze Chen, Renke Shan, Huiming Wang, Lu Wang, Ziqiang Liu, Run Luo, Jiawei Wang, Hamid Alinejad-Rokny, and Min Yang. 2025. CLaSp: In-Context Layer Skip for Self-Speculative Decoding. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31608–31618, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): CLaSp: In-Context Layer Skip for Self-Speculative Decoding (Chen et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1525.pdf
