Jailbreaks as Inference-Time Alignment: A Framework for Understanding Safety Failures in LLMs

James Beetham; Souradip Chakraborty; Mengdi Wang; Furong Huang; Amrit Singh Bedi; Mubarak Shah

Abstract

Large language models (LLMs) are safety-aligned to prevent harmful response generation, yet remain vulnerable to jailbreak attacks. While prior work has focused on improving jailbreak attack effectiveness, it offers little explanation for why safety alignment fails. We address this gap by framing jailbreaks as inference-time alignment, connecting attack design and safety alignment within a unified optimization framework. This framing allows us to extend best-of-N inference-time alignment to the adversarial setting, yielding a method we call LIAR (Leveraging Inference-time Alignment to jailbReak), and to derive suboptimality bounds showing that LIAR provably approaches an optimal jailbreak as compute scales. Our framework also lets us define the Safety-Net, a measure of how vulnerable an LLM is to jailbreaks, which helps explain why safety alignment can fail. Empirically, LIAR produces natural, hard-to-detect prompts that achieve a competitive attack success rate while running 10 to 100x faster than prior suffix-based jailbreaks.

Anthology ID: 2026.eacl-long.360
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 7689–7713
URL: https://aclanthology.org/2026.eacl-long.360/
Cite (ACL): James Beetham, Souradip Chakraborty, Mengdi Wang, Furong Huang, Amrit Singh Bedi, and Mubarak Shah. 2026. Jailbreaks as Inference-Time Alignment: A Framework for Understanding Safety Failures in LLMs. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7689–7713, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Jailbreaks as Inference-Time Alignment: A Framework for Understanding Safety Failures in LLMs (Beetham et al., EACL 2026)
PDF: https://aclanthology.org/2026.eacl-long.360.pdf
Checklist: 2026.eacl-long.360.checklist.pdf