Refusal-Aware Red Teaming: Exposing Inconsistency in Safety Evaluations

Yongkang Chen; Xiaohu Du; Xiaotian Zou; Chongyang Zhao; Huan Deng; Hu Li; Xiaohui Kuang

doi:10.18653/v1/2025.emnlp-main.49

Abstract

The responsible deployment of Large Language Models (LLMs) necessitates rigorous safety evaluations. However, a critical challenge arises from inconsistencies between an LLM’s internal refusal decisions and external safety assessments, hindering effective validation. This paper introduces the concept of the ‘refusal gap’ to formally define these discrepancies. We then present a novel, refusal-aware red teaming framework designed to automatically generate test cases that expose such gaps. Our framework employs ‘refusal probes’, which leverage the target model’s hidden states, to detect internal model refusals. These are subsequently contrasted with judgments from an external safety evaluator. The identified discrepancy serves as a signal to guide a red-teaming model in crafting test cases that maximize this refusal gap. To further enhance test case diversity and address challenges related to sparse rewards, we introduce a hierarchical, curiosity-driven mechanism that incentivizes both refusal gap maximization and broad topic exploration. Empirical results demonstrate that our method significantly outperforms existing reinforcement learning-based approaches in generating diverse test cases and achieves a substantially higher discovery rate of refusal gaps.
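The signal described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows one plausible shape for the idea. It rests on several assumptions that the abstract does not state: the refusal probe is taken to be a linear (logistic) classifier over a single hidden-state vector, the probe output and the external evaluator's judgment are both mapped to a common [0, 1] scale where 1 means "refused", the refusal gap is taken to be their absolute difference, and the hierarchical curiosity mechanism is replaced by a much simpler count-based novelty bonus over topics. All function and parameter names are hypothetical.

```python
import math
from typing import Dict, Sequence


def probe_refusal_probability(hidden_state: Sequence[float],
                              weights: Sequence[float],
                              bias: float) -> float:
    """Hypothetical linear refusal probe: sigmoid of a dot product.

    Assumption: the probe reads one hidden-state vector from the target model.
    """
    logit = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def refusal_gap(internal_refusal: float, external_refusal: float) -> float:
    """Assumed gap definition: disagreement between the internal probe and
    the external evaluator, both expressed on a [0, 1] 'refused' scale."""
    return abs(internal_refusal - external_refusal)


def novelty_bonus(topic: str, topic_counts: Dict[str, int]) -> float:
    """Count-based stand-in for a curiosity term: rarely seen topics score
    higher. This is a simplification, not the paper's hierarchical mechanism."""
    return 1.0 / math.sqrt(1 + topic_counts.get(topic, 0))


def red_team_reward(hidden_state: Sequence[float],
                    weights: Sequence[float],
                    bias: float,
                    external_refusal: float,
                    topic: str,
                    topic_counts: Dict[str, int],
                    novelty_weight: float = 0.1) -> float:
    """Reward for one test case: refusal gap plus a weighted novelty bonus."""
    internal = probe_refusal_probability(hidden_state, weights, bias)
    gap = refusal_gap(internal, external_refusal)
    return gap + novelty_weight * novelty_bonus(topic, topic_counts)


if __name__ == "__main__":
    # Toy numbers only: a 3-dimensional "hidden state" and made-up probe weights.
    counts = {"chemistry": 3}
    reward = red_team_reward(
        hidden_state=[0.2, -1.0, 0.5],
        weights=[1.5, -2.0, 0.3],
        bias=0.0,
        external_refusal=0.0,
        topic="chemistry",
        topic_counts=counts,
    )
    print(f"reward = {reward:.3f}")
```

Adding the novelty term means a test case on an unexplored topic still earns some reward when the gap itself is zero, which is one simple way to soften the sparse-reward problem the abstract mentions.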

Anthology ID: 2025.emnlp-main.49
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 946–955
URL: https://aclanthology.org/2025.emnlp-main.49/
DOI: 10.18653/v1/2025.emnlp-main.49
Cite (ACL): Yongkang Chen, Xiaohu Du, Xiaotian Zou, Chongyang Zhao, Huan Deng, Hu Li, and Xiaohui Kuang. 2025. Refusal-Aware Red Teaming: Exposing Inconsistency in Safety Evaluations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 946–955, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Refusal-Aware Red Teaming: Exposing Inconsistency in Safety Evaluations (Chen et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.49.pdf
Checklist: 2025.emnlp-main.49.checklist.pdf
