@inproceedings{luo-etal-2025-turning,
title = "Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling",
author = "Luo, Xianzhen and
Wang, Yixuan and
Zhu, Qingfu and
Zhang, Zhiming and
Zhang, Xuanyu and
Yang, Qing and
Xu, Dongliang",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.338/",
doi = "10.18653/v1/2025.acl-long.338",
pages = "6816--6831",
ISBN = "979-8-89176-251-0",
abstract = "The rapid growth in the parameters of LLMs has made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures to guess draft tokens, which need extra training before use. Alternatively, retrieval-based train-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. This approach stores candidate tokens in an adjacency matrix and employs a breadth-first-search (BFS)-like algorithm to construct a draft tree, which is then validated through tree attention. New candidate tokens from the decoding process are then used to update the matrix. Token Recycling requires {\ensuremath{<}}2MB of additional storage and achieves approximately 2x speedup across all sizes of LLMs. It significantly outperforms existing train-free methods by 30{\%} and even a training method by 25{\%}."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="luo-etal-2025-turning">
<titleInfo>
<title>Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling</title>
</titleInfo>
<name type="personal">
<namePart type="given">Xianzhen</namePart>
<namePart type="family">Luo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yixuan</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Qingfu</namePart>
<namePart type="family">Zhu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhiming</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xuanyu</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Qing</namePart>
<namePart type="family">Yang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dongliang</namePart>
<namePart type="family">Xu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-251-0</identifier>
</relatedItem>
<abstract>The rapid growth in the parameters of LLMs has made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures to guess draft tokens, which need extra training before use. Alternatively, retrieval-based train-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. This approach stores candidate tokens in an adjacency matrix and employs a breadth-first-search (BFS)-like algorithm to construct a draft tree, which is then validated through tree attention. New candidate tokens from the decoding process are then used to update the matrix. Token Recycling requires &lt;2MB of additional storage and achieves approximately 2x speedup across all sizes of LLMs. It significantly outperforms existing train-free methods by 30% and even a training method by 25%.</abstract>
<identifier type="citekey">luo-etal-2025-turning</identifier>
<identifier type="doi">10.18653/v1/2025.acl-long.338</identifier>
<location>
<url>https://aclanthology.org/2025.acl-long.338/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>6816</start>
<end>6831</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
%A Luo, Xianzhen
%A Wang, Yixuan
%A Zhu, Qingfu
%A Zhang, Zhiming
%A Zhang, Xuanyu
%A Yang, Qing
%A Xu, Dongliang
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-251-0
%F luo-etal-2025-turning
%X The rapid growth in the parameters of LLMs has made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures to guess draft tokens, which need extra training before use. Alternatively, retrieval-based train-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. This approach stores candidate tokens in an adjacency matrix and employs a breadth-first-search (BFS)-like algorithm to construct a draft tree, which is then validated through tree attention. New candidate tokens from the decoding process are then used to update the matrix. Token Recycling requires <2MB of additional storage and achieves approximately 2x speedup across all sizes of LLMs. It significantly outperforms existing train-free methods by 30% and even a training method by 25%.
%R 10.18653/v1/2025.acl-long.338
%U https://aclanthology.org/2025.acl-long.338/
%U https://doi.org/10.18653/v1/2025.acl-long.338
%P 6816-6831
Markdown (Informal)
[Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling](https://aclanthology.org/2025.acl-long.338/) (Luo et al., ACL 2025)
ACL
Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, and Dongliang Xu. 2025. Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6816–6831, Vienna, Austria. Association for Computational Linguistics.
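
The abstract above describes the core mechanism: candidate tokens observed during decoding are stored in an adjacency matrix, a BFS-like pass over that matrix builds a draft tree, and the tree is verified with tree attention. The Python sketch below illustrates only the matrix storage and draft-tree construction under stated assumptions; the names (VOCAB_SIZE, TOP_K, BRANCH) and the expansion heuristic are illustrative choices, not the authors' implementation.

```python
import numpy as np

VOCAB_SIZE = 32_000          # assumed vocabulary size
TOP_K = 8                    # candidates stored per token (assumed)
BRANCH = (4, 2, 2, 1)        # nodes expanded per tree level (assumed heuristic)

# Adjacency matrix: row t holds the most recent top-k candidate successors
# observed for token t during decoding. 32k x 8 int32 entries is about 1 MB,
# in line with the "<2MB of additional storage" figure quoted in the abstract.
adj = np.zeros((VOCAB_SIZE, TOP_K), dtype=np.int32)

def update_matrix(step_candidates):
    """Recycle the candidate tokens produced at a decoding step.

    step_candidates: iterable of (token_id, candidate_ids) pairs, where
    candidate_ids are the top-k next-token candidates read off the logits
    at the position of token_id.
    """
    for token_id, cand_ids in step_candidates:
        cand_ids = list(cand_ids)[:TOP_K]
        adj[token_id, : len(cand_ids)] = cand_ids

def build_draft_tree(root_token):
    """BFS-like construction of a draft tree from the adjacency matrix.

    Returns (node_index, parent_index, token_id) triples. In the method
    described by the abstract, such a tree would then be verified in one
    forward pass with tree attention, keeping decoding lossless.
    """
    tree = [(0, -1, int(root_token))]
    frontier = [0]
    for width in BRANCH:
        next_frontier = []
        for parent_idx in frontier:
            parent_token = tree[parent_idx][2]
            for cand in adj[parent_token, :width]:
                node_idx = len(tree)
                tree.append((node_idx, parent_idx, int(cand)))
                next_frontier.append(node_idx)
        frontier = next_frontier
    return tree

if __name__ == "__main__":
    # Pretend the logits at the position after token 42 ranked these candidates.
    update_matrix([(42, [7, 99, 3, 12, 5, 88, 17, 2])])
    print(len(build_draft_tree(42)), "draft nodes")
```

The matrix lookup replaces corpus retrieval: drafting is a handful of array reads, and each verified step refreshes the relevant rows, which is how the approach keeps storage small and adapts to the current generation.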