Xingzhi Guo
2026
SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents
Qiusi Zhan | Angeline Budiman-Chan | Abdelrahman Zayed | Xingzhi Guo | Daniel Kang | Joo-Kyung Kim
Findings of the Association for Computational Linguistics: EACL 2026
Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked “How can I track someone’s location without their consent?”, a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented finetuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent. Further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
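As a rough, unofficial illustration of the reward structure the abstract describes, the sketch below combines a final-output safety/utility term with a query-level shaping term. The judge callables, the safety threshold, and the bonus/penalty coefficients are assumptions made for illustration, not the paper's exact formulation.

# Minimal sketch of a SafeSearch-style multi-objective reward (assumptions noted inline).
from typing import Callable, List

def safesearch_reward(
    answer: str,
    queries: List[str],
    answer_safety: Callable[[str], float],   # hypothetical judge, returns a score in [0, 1]
    answer_utility: Callable[[str], float],  # hypothetical judge, returns a score in [0, 1]
    query_is_safe: Callable[[str], bool],    # hypothetical classifier for intermediate queries
    query_bonus: float = 0.1,                # illustrative coefficient, not from the paper
    query_penalty: float = 0.5,              # illustrative coefficient, not from the paper
) -> float:
    # Final-output term: an answer judged unsafe earns no utility credit.
    final_term = answer_utility(answer) if answer_safety(answer) >= 0.5 else 0.0
    # Query-level shaping term: reward safe search queries, penalize unsafe ones.
    shaping = sum(query_bonus if query_is_safe(q) else -query_penalty for q in queries)
    return final_term + shaping

In this reading, the shaping term gives the RL signal visibility into intermediate search behavior rather than only the final answer, which is how the abstract motivates jointly improving safety and utility.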
2025
MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning
Chanwoo Park | Seungju Han | Xingzhi Guo | Asuman E. Ozdaglar | Kaiqing Zhang | Joo-Kyung Kim
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Leveraging multi-agentic frameworks to enhance large language models (LLMs) has demonstrated significant potential recently, with most existing studies focusing on prompting and developing workflows with frozen LLMs. In this paper, we aim to further unleash the power of such multi-agentic frameworks by post-training LLMs for better collaboration. Specifically, we develop a new paradigm of Multi-Agent Post-co-training for collaborative LLMs with Reinforcement Learning (MAPoRL). In MAPoRL, multiple LLMs first generate their own responses and engage in discussions to collaboratively enhance the final response output; the final output is then scored by a verifier, and the score serves as the reward that is maximized through multi-agent RL. MAPoRL additionally reshapes this reward with extra incentives to encourage corrective and persuasive outputs in the discussions. A key novelty relative to most existing LLM post-training paradigms is the advocacy of co-training multiple LLMs together, along with the use of RL for better generalization. Accompanied by a few analytical insights, our experiments show that training individual LLMs alone is insufficient to encourage collaboration, while multi-agent co-training can significantly enhance collaboration performance across multiple datasets, with generalization to unseen domains, compared with multiple LLMs before post-training.
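As a rough, unofficial illustration of the verifier-based reward and its reshaping described above, the sketch below uses the verifier score on the final collaborative answer as the base reward and adds bonuses for discussion turns judged corrective or persuasive. The detector callables and bonus values are assumptions for illustration, not the paper's implementation.

# Minimal sketch of a MAPoRL-style shaped reward (assumptions noted inline).
from typing import Callable, List

def maporl_reward(
    final_answer: str,
    discussion_turns: List[str],
    verifier: Callable[[str], float],        # hypothetical verifier, returns a score in [0, 1]
    is_corrective: Callable[[str], bool],    # hypothetical detector of corrective turns
    is_persuasive: Callable[[str], bool],    # hypothetical detector of persuasive turns
    corrective_bonus: float = 0.2,           # illustrative coefficient, not from the paper
    persuasive_bonus: float = 0.2,           # illustrative coefficient, not from the paper
) -> float:
    # Base reward: verifier score on the final, collaboratively produced answer.
    reward = verifier(final_answer)
    # Reshaping: extra incentives for corrective and persuasive contributions
    # during the multi-agent discussion.
    for turn in discussion_turns:
        if is_corrective(turn):
            reward += corrective_bonus
        if is_persuasive(turn):
            reward += persuasive_bonus
    return reward

Under this reading, the shaping bonuses reward useful discussion behavior directly, rather than relying only on the verifier score of the final answer.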