Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment

Yavuz Faruk Bakman; Duygu Nur Yaldiz; Salman Avestimehr; Sai Praneeth Karimireddy

Abstract

Large Language Models (LLMs) are rarely static and are frequently updated in practice. A growing body of alignment research has shown that models initially deemed “aligned” can exhibit misaligned behavior after fine-tuning, such as forgetting jailbreak safety features or re-surfacing knowledge that was intended to be forgotten. These works typically assume that the initial model is aligned based on static black-box evaluation, i.e., the absence of undesired responses to a fixed set of queries. In contrast, we formalize model alignment in both the static and post-update settings and uncover a fundamental limitation of black-box evaluation. We theoretically show that, due to overparameterization, static alignment provides no guarantee of post-update alignment for any update dataset. Moreover, we prove that static black-box probing cannot distinguish a model that is genuinely post-update robust from one that conceals an arbitrary amount of adversarial behavior, which can be activated by even a single benign gradient update. We further validate these findings empirically in LLMs across three core alignment domains: privacy, jailbreak safety, and behavioral honesty. We demonstrate the existence of LLMs that pass all standard black-box alignment tests, yet become severely misaligned after a single benign update. Finally, we show that the capacity to hide such latent adversarial behavior increases with model scale, confirming our theoretical prediction that post-update misalignment grows with the number of parameters. Together, our results highlight the inadequacy of static evaluation protocols and emphasize the urgent need for post-update–robust alignment evaluation.

Anthology ID: 2026.trustnlp-main.10
Volume: Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month: July
Year: 2026
Address: San Diego, California
Editors: Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galystan, Anoop Kumar, Rahul Gupta
Venues: TrustNLP | WS
Publisher: Association for Computational Linguistics
Pages: 180–203
URL: https://aclanthology.org/2026.trustnlp-main.10/
Cite (ACL): Yavuz Faruk Bakman, Duygu Nur Yaldiz, Salman Avestimehr, and Sai Praneeth Karimireddy. 2026. Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 180–203, San Diego, California. Association for Computational Linguistics.
Cite (Informal): Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment (Bakman et al., TrustNLP 2026)
PDF: https://aclanthology.org/2026.trustnlp-main.10.pdf
