Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

Nikita Afonin; Nikita Andriianov; Vahagn Hovhannisyan; Nikhil Bageshpura; Kyle Liu; Kevin Zhu; Sunishchal Dev; Ashwinee Panda; Oleg Rogov; Elena Tutubalina; Alexander Panchenko; Mikhail Seleznyov

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

Nikita Afonin, Nikita Andriianov, Vahagn Hovhannisyan, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Oleg Rogov, Elena Tutubalina, Alexander Panchenko, Mikhail Seleznyov

Abstract

Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across four model families (Gemini, Kimi-K2, Grok, and Qwen), narrow in-context examples cause models to produce misaligned responses to benign, unrelated queries. With 16 in-context examples, EM rates range from 1% to 24% depending on model and domain, appearing with as few as 2 examples. Neither larger model scale nor explicit reasoning provides reliable protection, and larger models are typically even more susceptible. Next, we formulate and test a hypothesis, which explains in-context EM as conflict between safety objectives and context-following behavior. Consistent with this, instructing models to prioritize safety reduces EM while prioritizing context-following increases it. These findings establish ICL as a previously underappreciated vector for emergent misalignment that resists simple scaling-based solutions.

Anthology ID:: 2026.acl-long.1770
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 38197–38212
Language:
URL:: https://aclanthology.org/2026.acl-long.1770/
DOI:
Bibkey:
Cite (ACL):: Nikita Afonin, Nikita Andriianov, Vahagn Hovhannisyan, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Oleg Rogov, Elena Tutubalina, Alexander Panchenko, and Mikhail Seleznyov. 2026. Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 38197–38212, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs (Afonin et al., ACL 2026)
Copy Citation:
PDF:: https://aclanthology.org/2026.acl-long.1770.pdf
Checklist:: 2026.acl-long.1770.checklist.pdf

PDF Cite Search Checklist Fix data