What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks

Nathalie Maria Kirch; Constantin Niko Weisser; Severin Field; Helen Yannakoudakis; Stephen Casper

doi:10.18653/v1/2025.blackboxnlp-1.28

Abstract

Jailbreaks have been a central focus of research regarding the safety and reliability of large language models (LLMs), yet the mechanisms underlying these attacks remain poorly understood. While previous studies have predominantly relied on linear methods to detect jailbreak attempts and model refusals, we take a different approach by examining both linear and non-linear features in prompts that lead to successful jailbreaks. First, we introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods. Leveraging this dataset, we train linear and non-linear probes on hidden states of open-weight LLMs to predict jailbreak success. Probes achieve strong in-distribution accuracy but transfer is attack-family-specific, revealing that different jailbreaks are supported by distinct internal mechanisms rather than a single universal direction. To establish causal relevance, we construct probe-guided latent interventions that systematically shift compliance in the predicted direction. Interventions derived from non-linear probes produce larger and more reliable effects than those from linear probes, indicating that features linked to jailbreak success are encoded non-linearly in prompt representations. Overall, the results surface heterogeneous, non-linear structure in jailbreak mechanisms and provide a prompt-side methodology for recovering and testing the features that drive jailbreak outcomes.
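The probing-and-intervention pipeline summarized in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation. It assumes synthetic Gaussian vectors as a stand-in for LLM hidden states, an XOR-like label as a stand-in for jailbreak success, and a gradient-based shift as one plausible way to build a "probe-guided latent intervention". All names (`LinearProbe`, `MLPProbe`, `intervene`, `alpha`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n=400, d=8):
    # Synthetic stand-in for prompt hidden states (assumption, not real activations).
    # The label depends on the sign of x0 * x1, so no single linear direction separates it.
    X = rng.normal(size=(n, d))
    y = (X[:, 0] * X[:, 1] > 0).astype(float)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LinearProbe:
    """Logistic-regression probe trained with full-batch gradient descent."""
    def __init__(self, d):
        self.w = np.zeros(d)
        self.b = 0.0
    def logit(self, X):
        return X @ self.w + self.b
    def fit(self, X, y, lr=0.1, steps=500):
        for _ in range(steps):
            g = sigmoid(self.logit(X)) - y          # dLoss/dlogit per example
            self.w -= lr * X.T @ g / len(y)
            self.b -= lr * g.mean()
        return self
    def input_grad(self, X):
        # The logit's gradient w.r.t. the input is the same direction w everywhere.
        return np.tile(self.w, (len(X), 1))

class MLPProbe:
    """One-hidden-layer tanh probe; its input gradient varies with the input."""
    def __init__(self, d, h=16):
        self.W1 = rng.normal(scale=0.5, size=(d, h))
        self.b1 = np.zeros(h)
        self.w2 = rng.normal(scale=0.5, size=h)
        self.b2 = 0.0
    def logit(self, X):
        return np.tanh(X @ self.W1 + self.b1) @ self.w2 + self.b2
    def fit(self, X, y, lr=0.1, steps=2000):
        for _ in range(steps):
            H = np.tanh(X @ self.W1 + self.b1)
            g = (sigmoid(H @ self.w2 + self.b2) - y) / len(y)
            gH = np.outer(g, self.w2) * (1 - H ** 2)  # backprop through tanh
            self.w2 -= lr * H.T @ g
            self.b2 -= lr * g.sum()
            self.W1 -= lr * X.T @ gH
            self.b1 -= lr * gH.sum(axis=0)
        return self
    def input_grad(self, X):
        H = np.tanh(X @ self.W1 + self.b1)
        return ((1 - H ** 2) * self.w2) @ self.W1.T

def accuracy(probe, X, y):
    return float(((probe.logit(X) > 0) == (y > 0.5)).mean())

def intervene(probe, X, alpha=0.5):
    # Move each state a distance of about alpha along the probe's logit gradient.
    g = probe.input_grad(X)
    norm = np.linalg.norm(g, axis=1, keepdims=True) + 1e-12
    return X + alpha * g / norm

X, y = make_data()
lin = LinearProbe(X.shape[1]).fit(X, y)
mlp = MLPProbe(X.shape[1]).fit(X, y)

X_lin = intervene(lin, X)
X_mlp = intervene(mlp, X)

print("linear probe accuracy:", accuracy(lin, X, y))
print("MLP probe accuracy:   ", accuracy(mlp, X, y))
print("mean logit shift (linear):", float((lin.logit(X_lin) - lin.logit(X)).mean()))
print("mean logit shift (MLP):   ", float((mlp.logit(X_mlp) - mlp.logit(X)).mean()))
```

In this sketch the linear probe shifts every state along one shared direction, while the MLP probe yields a different direction for each state. That difference is the structural distinction between linear and non-linear interventions that the abstract points to. In the paper's setting the shifted representations would be fed back into the model to measure compliance, which this toy example does not attempt.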

Anthology ID: 2025.blackboxnlp-1.28
Volume: Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Month: November
Year: 2025
Address: Suzhou, China
Editors: Yonatan Belinkov, Aaron Mueller, Najoung Kim, Hosein Mohebbi, Hanjie Chen, Dana Arad, Gabriele Sarti
Venues: BlackboxNLP | WS
Publisher: Association for Computational Linguistics
Pages: 480–520
URL: https://aclanthology.org/2025.blackboxnlp-1.28/
DOI: 10.18653/v1/2025.blackboxnlp-1.28
Cite (ACL): Nathalie Maria Kirch, Constantin Niko Weisser, Severin Field, Helen Yannakoudakis, and Stephen Casper. 2025. What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 480–520, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks (Kirch et al., BlackboxNLP 2025)
PDF: https://aclanthology.org/2025.blackboxnlp-1.28.pdf
