Yingyan Celine Lin
2026
Think Hard Only When Needed: A Hybrid Best-of-N and Beam Search for Efficient Test-Time Compute
Hyewon Suh | Chaojian Li | Cheng-Jhih Shih | Zheng Wang | Kejing Xia | Yonggan Fu | Yingyan Celine Lin
Findings of the Association for Computational Linguistics: EACL 2026
Test-time compute has emerged as a promising paradigm that enables small language models (SLMs) to achieve large language model (LLM)-level capabilities by allocating additional compute for explicit reasoning during inference. Two common approaches are beam search and Best-of-N sampling. Beam search improves reasoning quality by scoring and optimizing token sequences using Process Reward Models (PRMs), but can incur non-trivial computational overhead and latency. In contrast, Best-of-N executes all reasoning trajectories without PRM guidance, often wasting compute on low-quality trajectories that may have gone astray early in the generation process. To address both inefficiencies, we propose THROW (THink haRd Only When needed)—a hybrid inference pipeline that combines the diversity of Best-of-N with the reasoning trajectory optimization of beam search. THROW introduces a selective branch truncation and expansion mechanism: it generates shorter initial trajectories than Best-of-N and evaluates them using PRMs to classify each query as "easy" or "hard." Based on this classification, THROW applies branch truncation for easy queries, mimicking Best-of-N, and PRM-guided branch expansion for hard ones, similar to beam search. Evaluations on MATH500, AMC23, and AIME24 demonstrate that THROW achieves 1.54× and 14.38× latency speedups and 35.7% and 80.4% token reductions on average while preserving high reasoning accuracy compared to Best-of-N and beam search, respectively.
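The selective truncation/expansion mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `prm_score`, the threshold, and the beam parameters are all assumptions, and a toy scoring function stands in for a learned Process Reward Model.

```python
# Hypothetical PRM scorer: in practice this is a learned Process Reward
# Model scoring partial reasoning trajectories; a toy average stands in here.
def prm_score(trajectory):
    return sum(trajectory) / len(trajectory)

def throw_pipeline(trajectories, easy_threshold=0.7, top_k=2, expand=4):
    """Sketch of a THROW-style easy/hard split (all names hypothetical).

    trajectories: short initial trajectories (lists of per-step scores
    standing in for token sequences).
    """
    scores = [prm_score(t) for t in trajectories]
    ranked = [t for t, _ in sorted(zip(trajectories, scores),
                                   key=lambda p: p[1], reverse=True)]
    if max(scores) >= easy_threshold:
        # "Easy" query: branch truncation -- keep the best trajectory and
        # finish it Best-of-N style, with no further PRM guidance.
        return "truncate", ranked[:1]
    # "Hard" query: PRM-guided branch expansion -- keep the top_k beams and
    # expand each into `expand` continuations, beam-search style.
    expanded = [t + [prm_score(t)] for t in ranked[:top_k]
                for _ in range(expand)]
    return "expand", expanded
```

The design point illustrated: PRM calls are spent only on queries whose initial trajectories all score poorly, which is how the hybrid avoids beam search's per-step scoring overhead on easy inputs.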
2025
LAMB: A Training-Free Method to Enhance the Long-Context Understanding of SSMs via Attention-Guided Token Filtering
Zhifan Ye | Zheng Wang | Kejing Xia | Jihoon Hong | Leshu Li | Lexington Whalen | Cheng Wan | Yonggan Fu | Yingyan Celine Lin | Souvik Kundu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
State space models (SSMs) achieve efficient sub-quadratic compute complexity but often exhibit significant performance drops as context length increases. Recent work attributes this deterioration to an exponential decay in hidden-state memory. While token filtering has emerged as a promising remedy, its underlying rationale and limitations remain poorly understood. In this paper, we first investigate the attention patterns of Mamba to shed light on why token filtering alleviates long-context degradation. Motivated by these findings, we propose LAMB, a training-free, attention-guided token filtering strategy designed to preserve critical tokens during inference. LAMB can boost long-context performance for both pure SSMs and hybrid models, achieving an average improvement of up to 30.35% over state-of-the-art techniques on standard long-context understanding benchmarks. Our analysis and experiments reveal new insights into the interplay between attention, token selection, and memory retention, and are thus expected to inspire broader applications of token filtering in long-sequence modeling.
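The core idea of attention-guided token filtering can be sketched as below. This is a toy illustration under stated assumptions, not LAMB's actual method: the per-token importance scores are assumed to come from some aggregated attention statistic (e.g. attention mass received, pooled over heads), and the keep-ratio policy is hypothetical.

```python
import heapq

def attention_guided_filter(tokens, attn_scores, keep_ratio=0.5):
    """Toy sketch of attention-guided token filtering (names and scoring
    are assumptions, not the paper's implementation).

    tokens: the input sequence; attn_scores: per-token importance scores.
    Keeps the top `keep_ratio` fraction of tokens in original order, so the
    SSM's decaying hidden-state memory is spent on critical tokens only.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    keep_idx = heapq.nlargest(k, range(len(tokens)),
                              key=lambda i: attn_scores[i])
    return [tokens[i] for i in sorted(keep_idx)]
```

Because the filter is a pure inference-time selection step, it needs no training, which matches the training-free framing in the abstract.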