Xiangyu Wu
2026
Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning
Can Xie | Ruotong Pan | Xiangyu Wu | Zhang Yunfei | Jiayi Fu | Tingting Gao | Guorui Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement Learning with Verifiable Rewards (RLVR) has shown significant promise for enhancing the reasoning capabilities of large language models (LLMs). However, prevailing algorithms like GRPO broadcast a uniform advantage signal across all tokens in a sequence. This coarse-grained approach overlooks the pivotal role of uncertain, high-stakes decisions during reasoning, leading to inefficient exploration and the well-documented problem of entropy collapse. To address this, we introduce UnCertainty-aware Advantage Shaping (UCAS), a model-free method that refines credit assignment by leveraging the model’s internal uncertainty signals. UCAS operates in two stages: it first modulates the response-level advantage using the model’s overall self-confidence, and then applies a token-level penalty based on raw logit certainty. This dual mechanism encourages exploration of high-uncertainty paths that yield correct answers while penalizing overconfident yet erroneous reasoning, effectively balancing the exploration-exploitation trade-off. Extensive experiments on five mathematical reasoning benchmarks show that UCAS significantly outperforms strong RLVR baselines across multiple model scales, including 1.5B and 7B. Our analysis confirms that UCAS not only achieves higher rewards but also promotes greater reasoning diversity and successfully mitigates entropy collapse.
LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization
Jiaqi Tang | Yu Xia | Yi-Feng Wu | Yuwei Hu | Chen Yuhui | Qing-Guo Chen | Xiaogang Xu | Xiangyu Wu | Hao LU | Yanqing Ma | Shiyin Lu | Qifeng Chen
Findings of the Association for Computational Linguistics: ACL 2026
The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of supervised fine-tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. We further introduce a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO’s superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations.
MirrorCAPTCHA: Wild CAPTCHA, Wild Distribution, Wild Web-based Platform Meet Multimodal LLM Agents
Xiangyu Wu | Yuwei Hu | Tianyu Cui | Yueying Tian | Qing-Guo Chen | Zhao Xu | Weihua Luo | Kaifu Zhang | Yang Yang | Jianfeng Lu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The path to fully autonomous web agents is currently hindered by a critical bottleneck: their limited ability to handle CAPTCHA. Existing agent benchmarks largely ignore this practical challenge, failing to evaluate an agent’s real-world capacity to solve CAPTCHA. To bridge this gap, we conduct a comprehensive analysis of real-world CAPTCHA distributions and introduce MirrorCAPTCHA, a benchmark annotated with Weighted Pass Rate and a newly proposed metric, Completion Degree. MirrorCAPTCHA is designed to serve as a “mirror” that faithfully reflects the automation capabilities of agents in real scenarios. We filter 2095 websites from Common Crawl, identify the CAPTCHA deployed on these sites, and cluster them into 18 distinct categories using the K-means algorithm. To ensure practicality, we extract a web subgraph from Common Crawl covering these websites and use random walks to simulate real-world CAPTCHA encounter frequencies, yielding a realistic measure of agents’ ability. Additionally, we develop a lightweight synthetic data pipeline to train Ovis2-Agent-CAPTCHA-8B, which significantly outperforms current state-of-the-art closed-source models on MirrorCAPTCHA, achieving a 9.4% higher average Weighted Pass Rate and a 2.13% higher average Completion Degree than the runner-up, Gemini-2.5-Pro.
OneRec-Think: In-Text Reasoning for Generative Recommendation
Zhanyu Liu | Shiyao Wang | Xingmei Wang | Rongzhou Zhang | Jiaxin Deng | Honghui Bao | Jinghao Zhang | Wuchao Li | PengFei Zheng | Xiangyu Wu | Yifei Hu | Qigen Hu | Xinchen Luo | Lejian Ren | Zhang Zixing | Qianqian Wang | Kuo Cai | Yunfan Wu | Hongtao Cheng | Zexuan Cheng | Lu Ren | Huanjie Wang | Yi Su | Ruiming Tang | Kun Gai | Guorui Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The powerful generative capacity of Large Language Models (LLMs) has instigated a paradigm shift in recommendation. However, existing generative models (e.g., OneRec) operate as implicit predictors, critically lacking the capacity for explicit and controllable reasoning—a key advantage of LLMs. To bridge this gap, we propose OneRec-Think, a unified framework that seamlessly integrates dialogue, reasoning, and personalized recommendation. OneRec-Think incorporates: (1) Itemic Alignment: cross-modal Item-Textual Alignment for semantic grounding; (2) Reasoning Activation: Reasoning Scaffolding to activate LLM reasoning within the recommendation context; and (3) Reasoning Enhancement, where we design a recommendation-specific reward function that accounts for the multi-validity nature of user preferences. Experiments across public benchmarks show state-of-the-art performance. Moreover, our proposed "Think-Ahead" architecture enables effective industrial deployment, achieving a 0.159% gain in APP Stay Time and validating the practical efficacy of the model’s explicit reasoning capability.
Co-authors
- Qing-Guo Chen 2
- Yuwei Hu 2
- Guorui Zhou 2
- Honghui Bao 1
- Kuo Cai 1
- Qifeng Chen 1
- Hongtao Cheng 1
- Zexuan Cheng 1
- Tianyu Cui 1
- Jiaxin Deng 1
- Jiayi Fu 1
- Kun Gai 1
- Tingting Gao 1
- Qigen Hu 1
- Yifei Hu 1
- Hao LU 1
- Wuchao Li 1
- Zhanyu Liu 1
- Jianfeng Lu 1
- Shiyin Lu 1
- Weihua Luo 1
- Xinchen Luo 1
- Yanqing Ma 1
- Ruotong Pan 1
- Lejian Ren 1
- Lu Ren 1
- Yi Su 1
- Jiaqi Tang 1
- Ruiming Tang 1
- Yueying Tian 1
- Huanjie Wang 1
- Qianqian Wang 1
- Shiyao Wang 1
- Xingmei Wang 1
- Yi-Feng Wu 1
- Yunfan Wu 1
- Yu Xia 1
- Can Xie 1
- Xiaogang Xu 1
- Zhao Xu 1
- Yang Yang 1
- Chen Yuhui 1
- Zhang Yunfei 1
- Jinghao Zhang 1
- Kaifu Zhang 1
- Rongzhou Zhang 1
- PengFei Zheng 1
- Zhang Zixing 1