Effective Red-Teaming of Policy-Adherent Agents

Itay Nakash; George Kour; Koren Lazar; Matan Vetzler; Guy Uziel; Ateret Anaby Tavor

doi:10.18653/v1/2025.emnlp-main.114

Abstract

Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercion. Building upon the existing Tau-bench benchmark, we introduce Tau-break, a complementary benchmark designed to rigorously assess the agent’s robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.

Anthology ID: 2025.emnlp-main.114
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 2250–2268
URL: https://aclanthology.org/2025.emnlp-main.114/
DOI: 10.18653/v1/2025.emnlp-main.114
Cite (ACL): Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, and Ateret Anaby Tavor. 2025. Effective Red-Teaming of Policy-Adherent Agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2250–2268, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Effective Red-Teaming of Policy-Adherent Agents (Nakash et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.114.pdf
Checklist: 2025.emnlp-main.114.checklist.pdf
