Zhaoyang Chu
2025
TestEval: Benchmarking Large Language Models for Test Case Generation
Wenhan Wang | Chenyuan Yang | Zhijie Wang | Yuheng Huang | Zhaoyang Chu | Da Song | Lingming Zhang | An Ran Chen | Lei Ma
Findings of the Association for Computational Linguistics: NAACL 2025
For programming languages, testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program under test. Recently, researchers have recognized the potential of large language models (LLMs) in software testing. However, there remains a lack of fair comparisons between different LLMs in terms of test case generation capabilities. In this paper, we propose TestEval, a novel benchmark for test case generation with LLMs. We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. We further evaluate 17 popular LLMs, including both commercial and open-source ones, on TestEval. We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs, indicating a lack of ability to comprehend program logic and execution paths.
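To make the targeted line coverage task concrete, here is a minimal sketch of how one might check whether a test input reaches a specific line. It is not the TestEval harness: the function `classify`, the helper `executed_lines`, and the use of `sys.settrace` are all assumptions chosen for illustration.

```python
import sys

# Hypothetical program under test (not taken from the benchmark).
def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"

def executed_lines(func, *args):
    """Run func(*args) and return the executed line offsets,
    counted relative to the function's `def` line."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Only follow frames that belong to the function under test.
        if frame.f_code is code:
            if event == "line":
                hit.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return hit

# Target: the `return "zero"` line, which sits at offset 4 from `def`.
target_offset = 4
for test_input in (5, -1, 0):
    covered = target_offset in executed_lines(classify, test_input)
    print(f"input={test_input!r} covers target line: {covered}")
```

Only an input that satisfies the path condition for the target line (here, `n == 0`) counts as covering it, which is why the targeted tasks require reasoning about program logic rather than just producing any valid input.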
Wait, We Don’t Need to “Wait”! Removing Thinking Tokens Improves Reasoning Efficiency
Chenlong Wang | Yuanning Feng | Dongping Chen | Zhaoyang Chu | Ranjay Krishna | Tianyi Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advances in large reasoning models have enabled complex, step-by-step reasoning but often introduce significant overthinking, resulting in verbose and redundant outputs that hinder efficiency. In this study, we examine whether explicit self-reflection, signaled by tokens such as “Wait” and “Hmm”, is necessary for advanced reasoning. We propose NoWait, a simple yet effective approach that disables explicit self-reflection by suppressing these tokens during inference. Extensive experiments on ten benchmarks across textual, visual, and video reasoning tasks show that NoWait reduces chain-of-thought trajectory length by up to 27%–51% in five R1-style model series, without compromising model utility. NoWait thus offers a plug-and-play solution for efficient and utility-preserving multimodal reasoning.
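The idea of suppressing specific tokens during inference can be sketched on a toy example. This is not the NoWait implementation: the vocabulary, the logit values, and the helper names `suppress_tokens` and `greedy_pick` are hypothetical, and the sketch assumes suppression is done by masking the logits of banned tokens before the next token is chosen.

```python
import math

def suppress_tokens(logits, banned_ids):
    """Return a copy of `logits` with banned token ids set to -inf,
    so they can never be selected at this decoding step."""
    banned = set(banned_ids)
    return [-math.inf if i in banned else score
            for i, score in enumerate(logits)]

def greedy_pick(logits):
    """Index of the highest-scoring token (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary and next-token scores (hypothetical values).
vocab = ["Wait", "Hmm", "So", "the", "answer"]
banned_ids = [vocab.index(tok) for tok in ("Wait", "Hmm")]
logits = [3.2, 2.9, 1.5, 0.4, 1.1]

unconstrained = vocab[greedy_pick(logits)]
constrained = vocab[greedy_pick(suppress_tokens(logits, banned_ids))]
print(unconstrained, "->", constrained)
```

Because the mask is applied only at decoding time, an approach of this kind needs no retraining, which is consistent with the abstract's description of a plug-and-play method.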