When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations

Huaizhi Ge; Yiming Li; Qifan Wang; Yongfeng Zhang; Ruixiang Tang

doi:10.18653/v1/2025.acl-long.114

Abstract

Large Language Models (LLMs) are known to be vulnerable to backdoor attacks, where triggers embedded in poisoned samples can maliciously alter LLMs’ behaviors. In this paper, we move beyond attacking LLMs and instead examine backdoor attacks through the novel lens of natural language explanations. Specifically, we leverage LLMs’ generative capabilities to produce human-readable explanations for their decisions, enabling direct comparisons between explanations for clean and poisoned samples. Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data, a pattern consistent across classification and generation tasks for different backdoor attacks. Further analysis reveals key insights into the explanation generation process. At the token level, explanation tokens associated with poisoned samples only appear in the final few transformer layers. At the sentence level, attention dynamics indicate that poisoned inputs shift attention away from the original input context during explanation generation. These findings enhance our understanding of backdoor mechanisms in LLMs and present a promising framework for detecting vulnerabilities through explainability.

Anthology ID: 2025.acl-long.114
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2278–2296
URL: https://aclanthology.org/2025.acl-long.114/
DOI: 10.18653/v1/2025.acl-long.114
Cite (ACL): Huaizhi Ge, Yiming Li, Qifan Wang, Yongfeng Zhang, and Ruixiang Tang. 2025. When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2278–2296, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations (Ge et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.114.pdf
