@inproceedings{gur-arieh-etal-2025-precise,
title = "Precise In-Parameter Concept Erasure in Large Language Models",
author = "Gur-Arieh, Yoav and
Suslik, Clara Haya and
Hong, Yihuai and
Barez, Fazl and
Geva, Mor",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.960/",
pages = "18997--19017",
ISBN = "979-8-89176-332-6",
abstract = "Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES, a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7{\%}, while dramatically improving erasure specificity (by up to 31{\%}) and robustness (by up to 41{\%}). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models."
}
Yoav Gur-Arieh, Clara Haya Suslik, Yihuai Hong, Fazl Barez, and Mor Geva. 2025. Precise In-Parameter Concept Erasure in Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18997–19017, Suzhou, China. Association for Computational Linguistics.
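
The abstract above describes PISCES only at a high level: decompose MLP vectors into interpretable features, identify the features tied to a target concept, and remove them from the parameters. As a rough illustration of that final removal step only, the sketch below projects a set of hypothetical concept feature directions out of an MLP down-projection matrix. The names `W_down` and `feature_dirs`, and the simple orthogonal-projection edit, are assumptions made for illustration; they are not the procedure defined in the paper or its released code.

```python
# Illustrative sketch only -- not the authors' implementation.
# Assumes: W_down is an MLP down-projection weight matrix of shape
# (d_model, d_mlp), so each column is one neuron's output vector;
# feature_dirs is a (k, d_model) array of unit-norm directions that some
# disentangler / interpretability step has flagged as encoding the concept.
import numpy as np

def erase_concept_directions(W_down: np.ndarray,
                             feature_dirs: np.ndarray) -> np.ndarray:
    """Project the flagged feature directions out of every MLP output vector."""
    # Orthonormalize the concept directions so the projector is well defined.
    Q, _ = np.linalg.qr(feature_dirs.T)          # (d_model, k)
    # P removes any component lying in the span of the concept directions.
    P = np.eye(W_down.shape[0]) - Q @ Q.T        # (d_model, d_model)
    return P @ W_down                            # edited weights, same shape

# Toy usage with random stand-ins for real model weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))                    # pretend d_model=16, d_mlp=64
dirs = rng.normal(size=(2, 16))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
W_edited = erase_concept_directions(W, dirs)
# Components of the edited columns along the erased directions are ~0.
assert np.allclose(dirs @ W_edited, 0.0, atol=1e-8)
```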