Chenxin An
2026
SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution
Zhenyu He | Qingping Yang | Wei Shen | Xiaojian Zhong | Kechi Zhang | Chenxin An | Wenlei Shi | Tianle Cai | Di He | Jiaze Chen | Jingjing Xu
Findings of the Association for Computational Linguistics: ACL 2026
Automated software engineering, particularly resolving real-world issues on benchmarks like SWE-bench, remains a significant challenge for Large Language Models (LLMs). To address this, we introduce SWE-Swiss, a two-phase training recipe that systematically develops these capabilities. Our approach first decomposes issue resolution into three core skills: Localization, Repair, and Unit Test Generation. In the first phase, we perform multi-task Supervised Fine-Tuning (SFT) on three new, meticulously curated datasets to build a versatile foundation. The second phase applies targeted Reinforcement Learning (RL), using direct feedback from test execution to boost the critical skill of code repair. The resulting model, SWE-Swiss-32B, establishes a new state-of-the-art for open-source models in its size class, achieving a 60.2% score on the SWE-bench Verified benchmark and placing it in the same top-tier performance bracket as much larger models. Finally, we show that despite its specialized training, SWE-Swiss-32B demonstrates strong generalization to other common LLM benchmarks. To accelerate research in the community, we are open-sourcing the models and our complete training datasets.
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
Jiazheng Zhang | Ziche Fu | Zhiheng Xi | Wenqing Jing | Mingxu Chai | Wei He | Guoqiang Zhang | Chenghao Fan | Chenxin An | Wenxiang Chen | Zhicheng Liu | Haojie Pan | Dingwei Zhu | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.
2025
Long Chain-of-Thought Fine-tuning via Understanding-to-Reasoning Transition
Chenxin An | Zhihui Xie | Xiaonan Li | Ming Zhong | Shansan Gong | Lei Li | Jun Zhang | Jingjing Xu | Lingpeng Kong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Reasoning models have demonstrated remarkable performance on complex tasks by generating long reasoning traces prior to producing final answers. However, previous research on long-context scaling in language models has generally focused on managing lengthy input prompts instead of producing long outputs. To leverage the strong long context understanding abilities of current models, we introduce Understanding-to-Reasoning Transition (URT) fine-tuning, a sequence-level curriculum learning framework that gradually shifts a model’s focus from interpreting long chain-of-thoughts to generating them. By incorporating partial reasoning steps in the input context, URT naturally exposes the model to diverse prompt lengths during training, preserving its performance on long-context comprehension while developing advanced reasoning capabilities. Experiments on rigorous reasoning benchmarks, including AIME24 and GPQA Diamond, reveal that our approach surpasses standard fine-tuning by over 10%, while maintaining robust performance on the understanding tasks in RULER.
2024
L-Eval: Instituting Standardized Evaluation for Long Context Language Models
Chenxin An | Shansan Gong | Ming Zhong | Xingjian Zhao | Mukai Li | Jun Zhang | Lingpeng Kong | Xipeng Qiu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recently, there has been growing interest in long-context scaling of large language models (LLMs). To facilitate research in this field, we propose L-Eval to institute a more standardized evaluation for Long-Context Language Models (LCLMs), addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and more than 2,000 human-labeled query-response pairs covering diverse task types, domains, and input lengths (3k~200k tokens). On the other hand, we investigate the effectiveness of evaluation metrics for LCLMs and show that Length-instruction-enhanced (LIE) evaluation and LLM judges correlate better with human judgments. We conducted a comprehensive study of 4 popular commercial LLMs and 12 open-source counterparts using the L-Eval benchmark. Our empirical findings offer useful insights into the study of LCLMs and lay the groundwork for the development of a more principled evaluation of these models.
2022
CoLo: A Contrastive Learning Based Re-ranking Framework for One-Stage Summarization
Chenxin An | Ming Zhong | Zhiyong Wu | Qin Zhu | Xuanjing Huang | Xipeng Qiu
Proceedings of the 29th International Conference on Computational Linguistics
Traditional training paradigms for extractive and abstractive summarization systems use only token-level or sentence-level training objectives. However, the output summary is always evaluated at the summary level, which leads to an inconsistency between training and evaluation. In this paper, we propose a Contrastive Learning based re-ranking framework for one-stage summarization called CoLo. By modeling a contrastive objective, we show that the summarization model is able to directly generate summaries according to the summary-level score without additional modules and parameters. Extensive experiments demonstrate that CoLo boosts the extractive and abstractive results of one-stage systems on the CNN/DailyMail benchmark to 44.58 and 46.33 ROUGE-1 score while preserving parameter efficiency and inference efficiency. Compared with state-of-the-art multi-stage systems, we save more than 100 GPU training hours and obtain a 3x to 8x speed-up ratio during inference while maintaining comparable results.
Co-authors
- Ming Zhong 3
- Shansan Gong 2
- Xuan-Jing Huang (黄萱菁) 2
- Lingpeng Kong 2
- Xipeng Qiu (邱锡鹏) 2
- Jingjing Xu 2
- Tianle Cai 1
- Mingxu Chai 1
- Jiaze Chen 1
- Wenxiang Chen 1
- Chenghao Fan 1
- Ziche Fu 1
- Tao Gui 1
- Di He 1
- Wei He 1
- Zhenyu He 1
- Wenqing Jing 1
- Lei Li 1
- Mukai Li 1
- Xiaonan Li 1
- Zhicheng Liu 1
- Haojie Pan 1
- Wei Shen 1
- Wenlei Shi 1
- Zhiyong Wu 1
- Zhiheng Xi 1
- Zhihui Xie 1
- Qingping Yang 1
- Guoqiang Zhang 1
- Jiazheng Zhang 1
- Jun Zhang 1
- Jun Zhang 1
- Kechi Zhang 1
- Qi Zhang 1
- Xingjian Zhao 1
- Xiaojian Zhong 1
- Dingwei Zhu 1
- Qin Zhu 1