Jihoon Tack
2025
Mamba Drafters for Speculative Decoding
Daewon Choi | Seunghyuk Oh | Saket Dingliwal | Jihoon Tack | Kyuyoung Kim | Woomin Song | Seojin Kim | Insu Han | Jinwoo Shin | Aram Galstyan | Shubham Katiyar | Sravan Babu Bodapati
Findings of the Association for Computational Linguistics: EMNLP 2025
Speculative decoding has emerged as a promising approach to accelerating large language model (LLM) generation using a fast drafter while maintaining alignment with the target model’s distribution. However, existing approaches face a trade-off: external drafters offer flexibility but can suffer from slower drafting, while self-speculation methods use drafters tailored to the target model but require re-training. In this paper, we introduce novel drafters based on Mamba, a state-of-the-art state space model (SSM), as a solution that combines the best aspects of both approaches. By leveraging the linear structure of SSMs, our approach avoids the quadratic complexity inherent in traditional Transformer-based methods, enabling faster drafting and lower memory usage while maintaining the flexibility to work across different target models. We further enhance efficiency with a novel test-time tree search algorithm for generating high-quality draft candidates. Our empirical evaluation demonstrates that Mamba-based drafters not only outperform existing external drafting methods but are also comparable to state-of-the-art self-speculation approaches while using less memory and maintaining their cross-model adaptability.
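As a rough illustration of how speculative decoding can keep outputs aligned with the target model's distribution, the sketch below shows the widely used accept/reject rule from speculative sampling. It is not taken from the paper: the function name `speculative_accept` and its arguments are hypothetical, the distributions are plain Python lists, and the bonus token normally sampled when every draft is accepted is omitted for brevity. The paper's Mamba drafter and tree search are not modeled here.

```python
import random

def speculative_accept(draft_tokens, q_probs, p_probs, rng):
    """Verify drafted tokens with the standard speculative sampling rule.

    draft_tokens: token ids proposed by the drafter (hypothetical input)
    q_probs[i]:   drafter distribution over the vocabulary at step i
    p_probs[i]:   target-model distribution over the vocabulary at step i
    Returns the accepted prefix, plus one corrected token on first rejection.
    """
    out = []
    for tok, q, p in zip(draft_tokens, q_probs, p_probs):
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        # q[tok] > 0 is assumed because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q), renormalized.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        weights = residual if sum(residual) > 0 else p
        out.append(rng.choices(range(len(p)), weights=weights)[0])
        break  # everything after the first rejection is discarded
    return out
```

When drafter and target agree exactly, every drafted token is accepted; the more the two distributions diverge, the earlier a rejection tends to occur, which is why drafter quality and drafting speed both matter.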
Think Clearly: Improving Reasoning via Redundant Token Pruning
Daewon Choi | Jimin Lee | Jihoon Tack | Woomin Song | Saket Dingliwal | Sai Muralidhar Jayanthi | Bhavana Ganesh | Jinwoo Shin | Aram Galstyan | Sravan Babu Bodapati
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent large language models have shown promising capabilities in long-form reasoning, following structured chains of thought before arriving at a final answer. However, we observe that these reasoning paths tend to include substantial redundancy; analyzing attention patterns reveals that attention scores are widely scattered, and that incorrect answers in particular exhibit greater attention sparsity. In this paper, we demonstrate that deliberately removing this redundancy in the reasoning process significantly improves performance through clear thinking (i.e., removing distraction). Specifically, we systematically identify such redundancy by measuring token-level attention scores to a special end-of-thinking token, which is appended to an explicit instruction inserted to conclude each intermediate reasoning step. Furthermore, we propose structure-aware pruning that prioritizes removing tokens in low-contributing reasoning chunks over individual tokens. After evicting redundant tokens, we remove the injected end-of-thinking instruction and then resume reasoning generation. We demonstrate that our method significantly improves overall accuracy across reasoning-intensive benchmarks without any training involved. In particular, our method shows strong performance on challenging mathematics competition benchmarks such as AIME and AMC, where reasoning redundancy is more prevalent.
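The structure-aware pruning idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `prune_by_chunk`, the use of the mean score as a chunk's contribution, and the fixed `keep_budget` are all hypothetical choices, and the per-token scores are assumed to have already been measured against the end-of-thinking token.

```python
def prune_by_chunk(scores, chunk_ids, keep_budget):
    """Return the sorted indices of tokens to keep.

    scores[i]:    attention score of token i measured against the
                  end-of-thinking token (assumed precomputed)
    chunk_ids[i]: id of the reasoning chunk that contains token i
    keep_budget:  maximum number of tokens to keep (hypothetical knob)
    """
    # Assumption: a chunk's contribution is the mean score of its tokens.
    totals, counts = {}, {}
    for s, c in zip(scores, chunk_ids):
        totals[c] = totals.get(c, 0.0) + s
        counts[c] = counts.get(c, 0) + 1
    chunk_score = {c: totals[c] / counts[c] for c in totals}

    # Eviction order: tokens in low-contributing chunks go first;
    # within a chunk, lower-scoring tokens go first.
    order = sorted(
        range(len(scores)),
        key=lambda i: (chunk_score[chunk_ids[i]], scores[i]),
    )
    n_evict = max(0, len(scores) - keep_budget)
    evicted = set(order[:n_evict])
    return [i for i in range(len(scores)) if i not in evicted]


# Example: three chunks of two tokens each; the middle chunk contributes least.
kept = prune_by_chunk([0.9, 0.8, 0.1, 0.05, 0.7, 0.6], [0, 0, 1, 1, 2, 2], 4)
```

Sorting by chunk contribution before token score is what makes the eviction chunk-first: a low-scoring token in a high-contributing chunk survives longer than any token in a weak chunk.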