Rui Ling

2026

Scaling test-time compute through extended chains of thought has become a dominant paradigm for improving large language model reasoning. However, existing research implicitly assumes that longer thinking always yields better results. This assumption remains largely unexamined. We systematically investigate how the marginal utility of additional reasoning tokens changes as compute budgets increase. We find that marginal returns diminish substantially at higher budgets and that models exhibit overthinking, where extended reasoning is associated with abandoning previously correct answers. Furthermore, we show that optimal thinking length varies across problem difficulty, suggesting that uniform compute allocation is suboptimal. Our cost-aware evaluation framework reveals that stopping at moderate budgets can reduce computation significantly while maintaining comparable accuracy.
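The marginal-utility and cost-aware ideas in this abstract can be illustrated with a small sketch. It assumes accuracy has already been measured at a few increasing token budgets. The function names, the tolerance rule, and the numbers are hypothetical and are not taken from the paper.

```python
def marginal_gains(budgets, accuracies):
    """Accuracy gained per extra 1,000 reasoning tokens between consecutive budgets."""
    gains = []
    for i in range(1, len(budgets)):
        extra_tokens = budgets[i] - budgets[i - 1]
        extra_accuracy = accuracies[i] - accuracies[i - 1]
        gains.append(extra_accuracy / extra_tokens * 1000)
    return gains


def smallest_budget_within(budgets, accuracies, tolerance):
    """Smallest budget whose accuracy is within `tolerance` of the best observed accuracy."""
    best = max(accuracies)
    for budget, accuracy in zip(budgets, accuracies):
        if best - accuracy <= tolerance:
            return budget
    return budgets[-1]


# Hypothetical measurements for illustration only, not results from the paper.
budgets = [1000, 2000, 4000, 8000]
accuracies = [0.50, 0.60, 0.64, 0.65]

gains = marginal_gains(budgets, accuracies)
chosen = smallest_budget_within(budgets, accuracies, tolerance=0.02)
```

With these made-up numbers each doubling of the budget buys less accuracy per token than the previous one, and the tolerance rule stops at a budget below the maximum.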


Recent preference optimization algorithms such as Direct Preference Optimization (DPO) have become prevalent for aligning large language models (LLMs) with human preferences. FocalPO improves upon DPO by introducing a modulating factor that down-weighs misranked preference pairs. However, using a fixed modulating factor throughout training is suboptimal, as the model’s learning capacity evolves during training. We introduce DynamicFocalPO, which employs a dynamic focusing strategy that adapts over the course of training. Inspired by curriculum learning, our method initially focuses on correctly ranked samples to establish a solid foundation, then gradually incorporates harder samples as training progresses. Experiments demonstrate that DynamicFocalPO surpasses both DPO and FocalPO on benchmarks including Alpaca Eval 2.0 and Arena-Hard using Mistral-Base-7B and Llama-3-Instruct-8B. We further provide theoretical analysis showing that the dynamic schedule enables adaptive entropy regularization and selective gradient suppression.
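A minimal sketch of a dynamically scheduled focal weight on a DPO-style loss follows. The abstract does not give the loss or the schedule, so everything here is an assumption: the modulating factor is taken to be `p ** gamma`, where `p` is the probability that the model ranks the pair correctly, and `gamma` is annealed linearly to zero. This may differ from the formulation used in DynamicFocalPO.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gamma_schedule(step, total_steps, gamma_start=2.0, gamma_end=0.0):
    """Linearly anneal the focusing exponent (hypothetical schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return gamma_start + frac * (gamma_end - gamma_start)


def focal_preference_loss(margin, gamma):
    """Focal-weighted preference loss for one pair (assumed form).

    margin: beta-scaled implicit reward difference, chosen minus rejected.
    p is the probability of ranking the pair correctly, as in DPO.
    The factor p ** gamma shrinks the loss on misranked pairs (small p);
    gamma = 0 recovers the plain DPO loss -log(p).
    """
    p = sigmoid(margin)
    return -(p ** gamma) * math.log(p)


# Early in training a misranked pair (negative margin) is heavily down-weighted;
# by the end of the schedule it contributes its full DPO loss.
early_gamma = gamma_schedule(step=0, total_steps=100)
late_gamma = gamma_schedule(step=100, total_steps=100)
early_loss = focal_preference_loss(-2.0, early_gamma)
late_loss = focal_preference_loss(-2.0, late_gamma)
```

Annealing `gamma` toward zero is one way to express "easy pairs first, harder pairs later"; other schedules would fit the description equally well.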


Modern software development demands code that is maintainable, testable, and scalable, which is achieved by organizing the implementation into modular components and iteratively reusing existing code. We formalize this iterative, multi-turn paradigm as codeflow and introduce CodeFlowBench, the first benchmark designed to comprehensively evaluate LLMs’ ability to perform codeflow: implementing new functionality by reusing existing functions over multiple turns. CodeFlowBench comprises two complementary components: CodeFlowBench-Comp, a core collection of 5,000+ competitive programming problems from Codeforces updated via an automated pipeline, and CodeFlowBench-Repo, which is sourced from GitHub repositories to better reflect real-world scenarios. We also introduce a novel evaluation framework featuring a dual assessment protocol and structural metrics derived from dependency trees. Extensive experiments reveal significant performance degradation in multi-turn codeflow scenarios, and our in-depth analysis shows that model performance correlates inversely with dependency complexity. These findings highlight the critical challenges in supporting real-world workflows and establish CodeFlowBench as an essential tool for advancing code generation research.
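To make the idea of a structural metric derived from a dependency tree concrete, here is a small sketch. It assumes a problem is described by a mapping from each function to the functions it calls directly. The metric (tree depth), the turn ordering, and all function names are hypothetical illustrations, not the benchmark's actual definitions.

```python
def dependency_depth(deps, root):
    """Number of levels in the dependency tree rooted at `root` (assumes no cycles)."""
    children = deps.get(root, [])
    if not children:
        return 1
    return 1 + max(dependency_depth(deps, child) for child in children)


def implementation_order(deps, root):
    """Post-order traversal: each helper appears before any function that reuses it."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for child in deps.get(name, []):
            visit(child)
        order.append(name)

    visit(root)
    return order


# Hypothetical problem: `solve` reuses two helpers, one of which reuses a third.
deps = {
    "solve": ["parse_input", "count_paths"],
    "count_paths": ["is_valid"],
}

depth = dependency_depth(deps, "solve")
turns = implementation_order(deps, "solve")
```

Under this reading, each entry of `turns` would correspond to one turn in which the model implements a function using only those already written, and `depth` gives a rough measure of dependency complexity.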

Co-authors

Venues

Findings: 2
ACL: 1
