Shiyi Cao


2025

S*: Test Time Scaling for Code Generation
Dacheng Li | Shiyi Cao | Chengkun Cao | Xiuyu Li | Shangyin Tan | Kurt Keutzer | Jiarong Xing | Joseph E. Gonzalez | Ion Stoica
Findings of the Association for Computational Linguistics: EMNLP 2025

Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* augments the existing parallel scaling approach with sequential scaling to improve performance. It also leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information, to robustly identify correct solutions. We evaluate S* across 12 Large Language Models and Large Reasoning Models and show that: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models: GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models: DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code, model generations, and intermediate experiment results are available at https://github.com/NovaSky-AI/SkyThought.
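
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: candidates are plain Python callables rather than generated programs, `probe_inputs` stands in for the adaptively generated distinguishing inputs, and `judge` is a hypothetical callback standing in for whatever component decides which execution result is correct. The names `run`, `find_distinguishing_input`, and `select_candidate` are illustrative only.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

Candidate = Callable[[Any], Any]
Outcome = Tuple[str, Any]  # ("ok", value) or ("error", exception name)


def run(candidate: Candidate, x: Any) -> Outcome:
    """Execute a candidate on one input and capture the result or the failure."""
    try:
        return ("ok", candidate(x))
    except Exception as exc:  # execution-grounded signal: crashes count as outcomes
        return ("error", type(exc).__name__)


def find_distinguishing_input(
    a: Candidate, b: Candidate, probe_inputs: Sequence[Any]
) -> Optional[Tuple[Any, Outcome, Outcome]]:
    """Return (input, outcome_a, outcome_b) for the first probe where a and b disagree."""
    for x in probe_inputs:
        out_a, out_b = run(a, x), run(b, x)
        if out_a != out_b:
            return (x, out_a, out_b)
    return None  # indistinguishable on the given probes


def select_candidate(
    candidates: Sequence[Candidate],
    probe_inputs: Sequence[Any],
    judge: Callable[[Any, Outcome, Outcome], int],
) -> Candidate:
    """Sequential pairwise comparison: keep a champion, replace it when the judge
    prefers the challenger on a distinguishing input (judge returns 0 or 1)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    for challenger in candidates[1:]:
        found = find_distinguishing_input(best, challenger, probe_inputs)
        if found is None:
            continue  # same behaviour on all probes; keep the current champion
        x, out_best, out_challenger = found
        if judge(x, out_best, out_challenger) == 1:
            best = challenger
    return best


# Toy usage (assumption: the task is "absolute value"; the judge below cheats by
# knowing the reference answer, which a real system would not have).
def identity(x): return x
def absolute(x): return -x if x < 0 else x
def square(x): return x * x

def reference_judge(x, out_a, out_b):
    return 1 if out_b == ("ok", abs(x)) and out_a != out_b else 0

chosen = select_candidate([identity, absolute, square], [2, -3, 0], reference_judge)
```

The sketch only shows why distinguishing inputs matter: two candidates that agree on every probe cannot be ranked by execution, so a comparison is made only on an input where their outputs differ.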

Language Models Can Easily Learn to Reason from Demonstrations
Dacheng Li | Shiyi Cao | Tyler Griggs | Shu Liu | Xiangxi Mo | Eric Tang | Sumanth Hegde | Kourosh Hakhamaneshi | Shishir G Patil | Matei Zaharia | Joseph E. Gonzalez | Ion Stoica
Findings of the Association for Computational Linguistics: EMNLP 2025

Large reasoning models (LRMs) tackle complex problems by following long chains of thought (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that language models can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). Crucially, we find that the structure of Long CoT is critical to this data-efficient fine-tuning process. Training on content-incorrect examples, e.g., those that lead to incorrect answers or contain corrupted digits, still leads to significant performance gains. In contrast, training on structurally incorrect examples, e.g., those with shuffled or deleted reasoning steps, yields smaller improvements or even degrades performance.
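
The two kinds of perturbation contrasted in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' data pipeline: a reasoning trace is modelled as a list of step strings, and the function names and the `rate` and `fraction` parameters are hypothetical.

```python
import random
from typing import List

DIGITS = "0123456789"


def corrupt_digits(steps: List[str], rate: float, rng: random.Random) -> List[str]:
    """Content perturbation: replace each digit with a different digit with
    probability `rate`, leaving the step order and all other text intact."""
    corrupted = []
    for step in steps:
        chars = []
        for ch in step:
            if ch in DIGITS and rng.random() < rate:
                chars.append(rng.choice([d for d in DIGITS if d != ch]))
            else:
                chars.append(ch)
        corrupted.append("".join(chars))
    return corrupted


def shuffle_steps(steps: List[str], rng: random.Random) -> List[str]:
    """Structural perturbation: keep every step's text but randomize the order."""
    shuffled = list(steps)
    rng.shuffle(shuffled)
    return shuffled


def delete_steps(steps: List[str], fraction: float, rng: random.Random) -> List[str]:
    """Structural perturbation: drop a random `fraction` of steps, keeping the
    remaining steps in their original order."""
    k = int(fraction * len(steps))
    dropped = set(rng.sample(range(len(steps)), k))
    return [s for i, s in enumerate(steps) if i not in dropped]


# Toy trace (made up for illustration).
trace = ["Let x = 12.", "Then 2x = 24.", "So x + 2x = 36.", "Answer: 36"]
rng = random.Random(0)
content_wrong = corrupt_digits(trace, rate=1.0, rng=rng)
order_wrong = shuffle_steps(trace, rng)
steps_missing = delete_steps(trace, fraction=0.5, rng=rng)
```

In this framing, `corrupt_digits` changes what each step says while preserving the logical skeleton, whereas `shuffle_steps` and `delete_steps` leave the local content correct but break the skeleton, which is the distinction the abstract draws.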