Faster In-Context Learning for LLMs via N-Gram Trie Speculative Decoding

Jinglin Chen; Qiwei Li; Zuchao Li; Baoyuan Qi; Liu Guoming; Haojun Ai; Hai Zhao; Ping Wang

doi:10.18653/v1/2025.emnlp-main.911

Abstract

As a crucial method in prompt engineering, In-Context Learning (ICL) enhances the generalization and knowledge utilization capabilities of Large Language Models (LLMs) (Dong et al., 2024). However, lengthy retrieved contexts and the limited token throughput of autoregressive models significantly constrain inference speed. To address this challenge, we propose N-Gram Trie Speculative Decoding, a novel approach that leverages the overlap between the context and the model output. The method constructs an n-gram trie from the context and uses it to generate drafts, accelerating token generation for LLMs. We evaluate our approach on summarization, Retrieval-Augmented Generation (RAG), and context-based Question Answering (QA) tasks. Experimental results on Vicuna-7B, Llama2-7B-Chat, and Llama3-8B-Instruct demonstrate substantial speed improvements without compromising accuracy. Compared with various strong baselines, our method achieves the highest mean speedup.
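
To make the drafting idea concrete, the following is a minimal Python sketch of building an n-gram trie from context tokens and proposing a draft from it. It is an illustration under stated assumptions, not the authors' implementation: the names `TrieNode`, `build_ngram_trie`, `propose_draft`, and `accepted_prefix` are hypothetical, tokens are plain strings rather than model vocabulary ids, the draft follows the most frequent continuation greedily, and verification is reduced to a comparison against tokens that a target model is assumed to have produced.

```python
class TrieNode:
    """One node of the n-gram trie; count = how often this path occurs in the context."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def build_ngram_trie(tokens, max_n):
    """Insert every n-gram of length <= max_n found in `tokens` (assumed helper)."""
    root = TrieNode()
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:start + max_n]:
            node = node.children.setdefault(tok, TrieNode())
            node.count += 1
    return root


def propose_draft(root, generated, max_prefix, draft_len):
    """Match the longest suffix of `generated` in the trie, then follow the
    most frequent continuation for up to `draft_len` tokens."""
    for k in range(min(max_prefix, len(generated)), 0, -1):
        node = root
        for tok in generated[-k:]:
            node = node.children.get(tok)
            if node is None:
                break  # this suffix does not occur in the context
        else:
            draft = []
            while node.children and len(draft) < draft_len:
                tok, node = max(node.children.items(), key=lambda kv: kv[1].count)
                draft.append(tok)
            if draft:
                return draft
    return []  # no overlap with the context: fall back to normal decoding


def accepted_prefix(draft, target_tokens):
    """Stand-in for verification: keep draft tokens until the first mismatch
    with what the target model would have generated (assumed greedy check)."""
    accepted = []
    for d, t in zip(draft, target_tokens):
        if d != t:
            break
        accepted.append(d)
    return accepted


context = "the cat sat on the mat and the cat sat on the rug".split()
trie = build_ngram_trie(context, max_n=5)
draft = propose_draft(trie, ["the", "cat"], max_prefix=2, draft_len=3)
print(draft)  # ['sat', 'on', 'the']
```

In this sketch the trie depth bounds the matched suffix plus the draft length at `max_n` tokens. A real system would verify the whole draft in a single forward pass of the target model, which is where the speedup would come from.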

Anthology ID: 2025.emnlp-main.911
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 18040–18051
URL: https://aclanthology.org/2025.emnlp-main.911/
DOI: 10.18653/v1/2025.emnlp-main.911
Cite (ACL): Jinglin Chen, Qiwei Li, Zuchao Li, Baoyuan Qi, Liu Guoming, Haojun Ai, Hai Zhao, and Ping Wang. 2025. Faster In-Context Learning for LLMs via N-Gram Trie Speculative Decoding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18040–18051, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Faster In-Context Learning for LLMs via N-Gram Trie Speculative Decoding (Chen et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.911.pdf
Checklist: 2025.emnlp-main.911.checklist.pdf
