From Narrow Unlearning to Emergent Misalignment in LLMs

Erum Mushtaq; Anil Ramakrishna; Satyapriya Krishna; Sattvik Sahai; Prasoon Goyal; Kai-Wei Chang; Tao Zhang; Rahul Gupta

From Narrow Unlearning to Emergent Misalignment in LLMs

Erum Mushtaq, Anil Ramakrishna, Satyapriya Krishna, Sattvik Sahai, Prasoon Goyal, Kai-Wei Chang, Tao Zhang, Rahul Gupta

Abstract

Recent work has shown that fine-tuning on insecure code data can trigger an emergent misalignment (EMA) phenomenon, where models generate malicious responses even to prompts unrelated to the original insecure code-writing task. Such cross-domain generalization of harmful behavior underscores the need for a deeper understanding of the algorithms, tasks, and datasets that induce emergent misalignment. In this work, we extend this study by demonstrating that emergent misalignment can also arise from narrow refusal unlearning in specific domains. We perform refusal unlearning on Cybersecurity and Safety concept, and evaluate EMA by monitoring refusal scores across seven responsible AI (RAI) domains, Cybersecurity, Safety, Toxicity, Bias, Sensitive Content, Medical/Legal, and Privacy. Our work shows that narrow domain unlearning can yield compliance responses for the targeted concept, however, it may also propagate EMA to unrelated domains. Among the two intervened concepts, Cybersecurity and Safety, we find that the safety concept can have larger EMA impact, i.e, causing lower refusal scores, across other unrelated domains such as bias. We observe this effect consistently across two model families, Mistral-7b-0.3v, and Qwen-7b-2.5. Further, we show that refusal unlearning augmented with cross-entropy loss function on a small set of retain data from the affected domains can largely, if not fully, restore alignment across the impacted domains while having lower refusal rate on the concept we perform unlearning on. To investigate the underlying causes of EMA, we analyze concept entanglements at the representation level via concept vectors. Our analysis reveals that concepts with higher representation similarity in earlier layers are more susceptible to EMA after intervention when the refusal stream is altered through targeted refusal unlearning.

Anthology ID:: 2026.acl-short.32
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 386–395
Language:
URL:: https://aclanthology.org/2026.acl-short.32/
DOI:
Bibkey:
Cite (ACL):: Erum Mushtaq, Anil Ramakrishna, Satyapriya Krishna, Sattvik Sahai, Prasoon Goyal, Kai-Wei Chang, Tao Zhang, and Rahul Gupta. 2026. From Narrow Unlearning to Emergent Misalignment in LLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 386–395, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: From Narrow Unlearning to Emergent Misalignment in LLMs (Mushtaq et al., ACL 2026)
Copy Citation:
PDF:: https://aclanthology.org/2026.acl-short.32.pdf
Checklist:: 2026.acl-short.32.checklist.pdf

PDF Cite Search Checklist Fix data