Neuronal Insights into LLM Attacks: Targeted Neuron Tuning for Precise and Robust Vulnerability Patching

Dan Shi; Renren Jin; Zhuowen Han; Yuqi Ren; Xinwei Wu; Zhigen Li; Deyi Xiong (德意 熊)

Abstract

Despite recent advances in safety alignment, large language models (LLMs) remain highly susceptible to adversarial attacks, and the internal mechanisms behind these vulnerabilities are still poorly understood. Existing gradient-based attribution methods offer valuable interpretability for analyzing how LLMs store and process information. However, they are inapplicable to adversarial attacks, which typically occur in open-ended generation settings without fixed ground-truth outputs. To address these challenges, we propose a novel similarity-based gradient attribution method that identifies key neurons sensitive to adversarial behaviors in open-ended generation tasks. The detected neurons, termed targeted neurons, play a critical role in safety training. Building on this neuron-level perspective, we uncover two key neuronal patterns: (i) universal neurons, which are consistently exploited across multiple attack strategies, and (ii) interference neurons, which hinder safety improvements when fine-tuned indiscriminately. These patterns provide mechanistic insight into adversarial vulnerabilities. Inspired by these findings, we propose a neuron-level defense strategy, Targeted Neuron Tuning (TNT), which selectively fine-tunes the targeted neurons identified for specific attacks. Experimental evaluations across multiple LLM architectures and scales demonstrate that TNT substantially improves model robustness against a wide range of jailbreak attacks, achieving safe rates exceeding 90% and even approaching 100%, while preserving general task performance. TNT thereby enables precise and robust safety interventions. Warning: This paper contains example data that may be harmful.
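The idea of fine-tuning only a selected set of neurons can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how attribution scores are computed or how the update is applied, so the scores here are placeholder numbers, and `top_k_neurons`, `masked_sgd_step`, the top-k selection rule, and the plain SGD update are all hypothetical names and assumptions chosen for illustration. Each row of the weight matrix is treated as one neuron.

```python
from typing import List

Matrix = List[List[float]]


def top_k_neurons(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring neurons.

    Assumption: neurons are selected by a simple top-k rule over
    precomputed attribution scores. The paper's actual similarity-based
    attribution and selection criteria are not reproduced here.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def masked_sgd_step(weights: Matrix, grads: Matrix,
                    targeted: List[int], lr: float) -> Matrix:
    """Apply one SGD step to the targeted rows only.

    Rows (neurons) outside `targeted` are copied unchanged, which is
    equivalent to zeroing their gradients before the update.
    """
    selected = set(targeted)
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if i in selected else list(w_row)
        for i, (w_row, g_row) in enumerate(zip(weights, grads))
    ]


# Placeholder attribution scores for four neurons (illustrative values only).
scores = [0.1, 0.9, 0.3, 0.7]
targeted = top_k_neurons(scores, k=2)

weights = [[1.0, 1.0] for _ in range(4)]
grads = [[1.0, 1.0] for _ in range(4)]
updated = masked_sgd_step(weights, grads, targeted, lr=0.5)
```

In this toy run only the two highest-scoring neurons change, while the remaining rows keep their original weights. In a real framework the same effect is usually obtained by masking gradients before the optimizer step.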

Anthology ID: 2026.findings-acl.1719
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 34414–34435
URL: https://aclanthology.org/2026.findings-acl.1719/
Cite (ACL): Dan Shi, Renren Jin, Zhuowen Han, Yuqi Ren, Xinwei Wu, Zhigen Li, and Deyi Xiong. 2026. Neuronal Insights into LLM Attacks: Targeted Neuron Tuning for Precise and Robust Vulnerability Patching. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34414–34435, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Neuronal Insights into LLM Attacks: Targeted Neuron Tuning for Precise and Robust Vulnerability Patching (Shi et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1719.pdf
Checklist: 2026.findings-acl.1719.checklist.pdf
