TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

Cheng Liu; Xiaolei Liu; Xingyu Li; Bangzhou Xin; Kangyi Ding

Abstract

Existing jailbreak defense paradigms primarily rely on static detection of prompts, outputs, or internal states, often neglecting the dynamic evolution of risk during decoding. This oversight leaves risk signals embedded in decoding trajectories underutilized, constituting a critical blind spot in current defense systems. In this work, we empirically demonstrate that hidden states in critical layers during the decoding phase carry stronger and more stable risk signals than input jailbreak prompts. Specifically, the hidden representations of tokens generated during jailbreak attempts progressively approach high-risk regions in the latent space. Based on this observation, we propose TrajGuard, a training-free, decoding-time defense framework. TrajGuard aggregates hidden-state trajectories via a sliding window to quantify risk in real time, triggering a lightweight semantic adjudication only when risk within a local window persistently exceeds a threshold. This mechanism enables the immediate interruption or constraint of subsequent decoding. Extensive experiments across 12 jailbreak attacks and various open-source LLMs show that TrajGuard achieves an average defense rate of 95%. Furthermore, it reduces detection latency to 5.2 ms/token while maintaining a false positive rate below 1.5%. These results confirm that hidden-state trajectories during decoding can effectively support real-time jailbreak detection, highlighting a promising direction for defenses without model modification.
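The sliding-window trigger described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify how per-token risk is computed, so the sketch assumes cosine similarity between each generated token's hidden state and a precomputed "risk direction" vector. The names `SlidingWindowMonitor`, `first_trigger`, `risk_direction`, `window`, `threshold`, and `patience` are all hypothetical, as are their default values.

```python
from collections import deque
from math import sqrt
from typing import Deque, Iterable, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class SlidingWindowMonitor:
    """Hypothetical streaming monitor; an illustration, not the paper's method."""

    def __init__(
        self,
        risk_direction: Sequence[float],  # assumed: a precomputed high-risk direction
        window: int = 4,                  # assumed window length, in tokens
        threshold: float = 0.5,           # assumed risk threshold
        patience: int = 2,                # assumed number of consecutive high-risk steps
    ) -> None:
        self.risk_direction = list(risk_direction)
        self.threshold = threshold
        self.patience = patience
        self.scores: Deque[float] = deque(maxlen=window)
        self.streak = 0

    def update(self, hidden_state: Sequence[float]) -> bool:
        """Consume one token's hidden state.

        Returns True when the windowed risk has exceeded the threshold for
        `patience` consecutive steps, i.e. when a downstream semantic
        adjudication step would be triggered.
        """
        self.scores.append(cosine(hidden_state, self.risk_direction))
        window_risk = sum(self.scores) / len(self.scores)
        self.streak = self.streak + 1 if window_risk > self.threshold else 0
        return self.streak >= self.patience


def first_trigger(
    monitor: SlidingWindowMonitor, trajectory: Iterable[Sequence[float]]
) -> Optional[int]:
    """Return the index of the first decoding step that triggers, or None."""
    for step, hidden_state in enumerate(trajectory):
        if monitor.update(hidden_state):
            return step
    return None
```

In a real decoding loop, `update` would be called once per generated token with the hidden state from a chosen layer, and a True result would hand control to the adjudication step before decoding continues. Requiring several consecutive high-risk windows, rather than a single one, is one way to read "persistently exceeds a threshold"; the paper may define persistence differently.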

Anthology ID: 2026.findings-acl.655
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 13371–13388
URL: https://aclanthology.org/2026.findings-acl.655/
Cite (ACL): Cheng Liu, Xiaolei Liu, Xingyu Li, Bangzhou Xin, and Kangyi Ding. 2026. TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense. In Findings of the Association for Computational Linguistics: ACL 2026, pages 13371–13388, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense (Liu et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.655.pdf
Checklist: 2026.findings-acl.655.checklist.pdf
