Distributional Surgery for Language Model Activations

Bao Nguyen, Binh Nguyen, Duy Nguyen, Viet Anh Nguyen


Abstract
Language models, while capable of generating remarkably coherent and seemingly accurate text, can occasionally produce undesirable content, including harmful or toxic outputs. In this paper, we present a new two-stage approach to detect and mitigate the generation of undesirable content by rectifying activations. First, we train an ensemble of layerwise classifiers to detect undesirable content from activations by minimizing a smooth surrogate of the risk-aware score. Then, for detected undesirable content, we propose layerwise distributional steering policies that transform the attention heads. These policies are computed through principled semidefinite programming that aims to minimally perturb the attention distribution while probabilistically guaranteeing the effectiveness of the edits. Empirical evaluations across multiple language models and datasets show that our method outperforms baselines in reducing the generation of undesirable outputs.
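The abstract's first stage, a layerwise classifier on activations, can be illustrated with a toy sketch. The snippet below is not the paper's method: it trains a plain logistic-regression probe on synthetic "activations" and applies a naive mean-shift edit in place of the paper's risk-aware surrogate loss and semidefinite-programming steering policies, purely to make the two-stage detect-then-edit structure concrete.

```python
import numpy as np

# Toy illustration of a detect-then-edit pipeline on synthetic activations.
# Stage 1: a linear probe flags "toxic" activations.
# Stage 2: flagged activations are shifted toward the benign cluster
# (a simplistic stand-in for the paper's distributional steering policies).

rng = np.random.default_rng(0)
d, n = 16, 200  # activation dimension and samples per class (toy sizes)

# Synthetic activations: two well-separated clusters.
benign = rng.normal(0.0, 1.0, size=(n, d))
toxic = rng.normal(1.5, 1.0, size=(n, d))
X = np.vstack([benign, toxic])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 1 = undesirable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stage 1: train a logistic-regression probe by gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)

# Stage 2 (naive edit): shift flagged activations toward the benign mean.
shift = benign.mean(axis=0) - toxic.mean(axis=0)
edited = toxic + shift
edited_scores = sigmoid(edited @ w + b)  # probe scores after the edit
```

After the mean-shift edit, the probe's toxicity scores on the edited activations drop below the detection threshold, which is the qualitative behavior the paper's steering policies guarantee probabilistically with a minimal perturbation.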
Anthology ID:
2025.findings-emnlp.435
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8192–8212
URL:
https://aclanthology.org/2025.findings-emnlp.435/
Cite (ACL):
Bao Nguyen, Binh Nguyen, Duy Nguyen, and Viet Anh Nguyen. 2025. Distributional Surgery for Language Model Activations. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8192–8212, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Distributional Surgery for Language Model Activations (Nguyen et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.435.pdf
Checklist:
 2025.findings-emnlp.435.checklist.pdf