Yuyu Zhang
2026
SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding
Songcheng Cai | Zhiheng Lyu | Yuansheng Ni | Xiangchao Chen | Baichuan Zhou | Shenzhe Zhu | Yi Lu | Haozhe Wang | Chi Ruan | Benjamin Schneider | Weixu Zhang | Xiang Li | Andy Zheng | Yuyu Zhang | Ping Nie | Wenhu Chen
Findings of the Association for Computational Linguistics: ACL 2026
Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook long-tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.
SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction
Hangxiao Zhu | Yuyu Zhang | Ping Nie | Yu Zhang
Findings of the Association for Computational Linguistics: ACL 2026
The rapid growth of scientific literature calls for automated methods to assess and predict research impact. Prior work has largely focused on citation-based metrics, leaving limited evaluation of models’ capability to reason about other impact dimensions. To this end, we introduce SciImpact, a large-scale, multi-dimensional benchmark for scientific impact prediction spanning 19 fields. SciImpact captures various forms of scientific influence, ranging from citation counts to award recognition, media attention, patent reference, and artifact adoption, by integrating heterogeneous data sources and targeted web crawling. It comprises 215,928 contrastive paper pairs reflecting meaningful impact differences in both short- (e.g., Best Paper Award) and long-term settings (e.g., Nobel Prize). We evaluate 11 widely used large language models (LLMs) on SciImpact. Results show that off-the-shelf models show substantial variability across dimensions and fields, while multi-task supervised fine-tuning consistently enables smaller LLMs (e.g., 4B) to markedly outperform much larger models (e.g., 30B) and surpass powerful closed-source LLMs (e.g., o4-mini). These results establish SciImpact as a challenging benchmark and demonstrate its value for multi-dimensional, multi-field scientific impact prediction. Our project homepage is https://flypig23.github.io/sciimpact-homepage/.
Beyond Single-shot Writing: Deep Research Agents are Unreliable at Multi-turn Report Revision
Bingsen Chen | Boyan Li | Ping Nie | Yuyu Zhang | Xi Ye | Chen Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing benchmarks for Deep Research Agents (DRAs) treat report generation as a single-shot writing task, which fundamentally diverges from how human researchers iteratively draft and revise reports via self-reflection or peer feedback. Whether DRAs can reliably revise reports with user feedback remains unexplored. We introduce Mr Dre, an evaluation suite that establishes multi-turn report revision as a new axis. Mr Dre consists of (1) a unified long-form report evaluation protocol spanning comprehensiveness, factuality, and presentation, and (2) a human-verified feedback simulation pipeline for systematic multi-turn revision evaluation. Our analysis of five diverse DRAs reveals a critical limitation: while agents can address most user feedback, they also regress on 16–27% of previously covered content and citation quality. Over multiple revision turns, even the best-performing agents leave significant headroom, as they continue to disrupt content outside the feedback’s scope and fail to preserve earlier edits. We further show that these issues are not easily resolvable through inference-time fixes such as prompt engineering and a dedicated sub-agent for revision.
2024
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
Shen Zheng | Yuyu Zhang | Yijie Zhu | Chenguang Xi | Pengyang Gao | Zhou Xun | Kevin Chang
Findings of the Association for Computational Linguistics: NAACL 2024
With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI’s legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI’s earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3 progressively improves to GPT-4, including technical details like whether adding code data improves LLM’s reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, how much is the alignment tax, etc. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.
2020
Question Directed Graph Attention Network for Numerical Reasoning over Text
Kunlong Chen | Weidi Xu | Xingyi Cheng | Zou Xiaochuan | Yuyu Zhang | Le Song | Taifeng Wang | Yuan Qi | Wei Chu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Numerical reasoning over texts, such as addition, subtraction, sorting and counting, is a challenging machine reading comprehension task, since it requires both natural language understanding and arithmetic computation. To address this challenge, we propose a heterogeneous graph representation for the context of the passage and question needed for such reasoning, and design a question directed graph attention network to drive multi-step numerical reasoning over this context graph. Our model, which combines deep learning and graph reasoning, achieves remarkable results on benchmark datasets such as DROP.
2019
Language Modeling with Shared Grammar
Yuyu Zhang | Le Song
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Sequential recurrent neural networks have achieved superior performance on language modeling, but overlook the structure information in natural language. Recent works on structure-aware models have shown promising results on language modeling. However, how to incorporate structure knowledge on corpora without syntactic annotations remains an open problem. In this work, we propose the neural variational language model (NVLM), which enables the sharing of grammar knowledge among different corpora. Experimental results demonstrate the effectiveness of our framework on two popular benchmark datasets. With the help of shared grammar, our language model converges significantly faster to a lower perplexity on a new training corpus.
Co-authors
- Ping Nie 3
- Le Song 2
- Songcheng Cai 1
- Kevin Chen-Chuan Chang 1
- Bingsen Chen 1
- Kunlong Chen 1
- Wenhu Chen 1
- Xiangchao Chen 1
- Xingyi Cheng 1
- Wei Chu 1
- Pengyang Gao 1
- Boyan Li 1
- Xiang Li 1
- Yi Lu 1
- Zhiheng Lyu 1
- Yuansheng Ni 1
- Yuan Qi 1
- Chi Ruan 1
- Benjamin Schneider 1
- Haozhe Wang 1
- Taifeng Wang 1
- Chenguang Xi 1
- Zou Xiaochuan 1
- Weidi Xu 1
- Zhou Xun 1
- Xi Ye 1
- Weixu Zhang 1
- Yu Zhang 1
- Chen Zhao 1
- Andy Zheng 1
- Shen Zheng 1
- Baichuan Zhou 1
- Hangxiao Zhu 1
- Shenzhe Zhu 1
- Yijie Zhu 1