Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders

Aaron J. Li; Suraj Srinivas; Usha Bhalla; Himabindu Lakkaraju

Abstract

Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the base LLM’s activations. Overall, our results suggest that SAE concept representations are fragile and that, without further denoising or postprocessing, they might be ill-suited for applications in model monitoring and oversight.

Anthology ID: 2026.eacl-long.279
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 5940–5957
URL: https://aclanthology.org/2026.eacl-long.279/
Cite (ACL): Aaron J. Li, Suraj Srinivas, Usha Bhalla, and Himabindu Lakkaraju. 2026. Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5940–5957, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders (Li et al., EACL 2026)
PDF: https://aclanthology.org/2026.eacl-long.279.pdf
Checklist: 2026.eacl-long.279.checklist.pdf
