Explaining Language Model Predictions with High-Impact Concepts

Ruochen Zhao, Tan Wang, Yongjie Wang, Shafiq Joty


Abstract
To encourage fairness and transparency, there is an urgent demand for reliable explanations of large language models (LLMs). One promising solution is concept-based explanation, i.e., explanations expressed as human-understandable concepts derived from the model's internal representations. However, due to the compositional nature of language, current methods mostly discover correlational explanations rather than causal features. We therefore propose a novel framework that provides impact-aware explanations to help users understand an LLM's behavior: explanations that are robust to feature changes and that substantially influence the model's predictions. Specifically, we extract predictive high-level features (concepts) from the model's hidden-layer activations, and then optimize for features whose presence causes the output predictions to change substantially. Extensive experiments on real and synthetic tasks demonstrate that our method achieves superior results on predictive impact, explainability, and faithfulness compared to the baselines, especially for LLMs.
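The full optimization is described in the paper; purely as a rough illustration of the general idea (not the authors' actual method), the sketch below extracts candidate concept directions from toy hidden-layer activations with NMF and ranks them by predictive impact, i.e., how much the output distribution shifts when each concept's contribution is removed. The activations, the linear classifier head, and all variable names here are hypothetical stand-ins.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Toy stand-in for a frozen LM: hidden activations for N inputs (N x d)
# and a linear classification head mapping activations to class logits.
N, d, n_classes, n_concepts = 200, 64, 2, 5
H = np.abs(rng.normal(size=(N, d)))        # hidden-layer activations (non-negative for NMF)
W_head = rng.normal(size=(d, n_classes))   # hypothetical frozen classifier head

def predict_proba(acts):
    """Softmax over the toy head's logits."""
    logits = acts @ W_head
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Step 1: discover candidate concepts as NMF components of the activations.
nmf = NMF(n_components=n_concepts, init="nndsvda", random_state=0, max_iter=500)
scores = nmf.fit_transform(H)              # per-example concept scores (N x k)
concepts = nmf.components_                 # concept directions in activation space (k x d)

# Step 2: score each concept by predictive impact: the mean change in the
# predicted probabilities when that concept's contribution is ablated.
base = predict_proba(H)
impacts = []
for k in range(n_concepts):
    ablated = H - np.outer(scores[:, k], concepts[k])        # remove concept k
    shifted = predict_proba(np.clip(ablated, 0.0, None))
    impacts.append(np.abs(shifted - base).mean())

ranking = np.argsort(impacts)[::-1]
print("concepts ranked by predictive impact:", ranking)
```

In this toy setup, a high-impact concept is one whose removal noticeably changes the model's output distribution, which mirrors the paper's goal of favoring causal over merely correlational features.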
Anthology ID:
2024.findings-eacl.67
Volume:
Findings of the Association for Computational Linguistics: EACL 2024
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
995–1012
URL:
https://aclanthology.org/2024.findings-eacl.67
Cite (ACL):
Ruochen Zhao, Tan Wang, Yongjie Wang, and Shafiq Joty. 2024. Explaining Language Model Predictions with High-Impact Concepts. In Findings of the Association for Computational Linguistics: EACL 2024, pages 995–1012, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Explaining Language Model Predictions with High-Impact Concepts (Zhao et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-eacl.67.pdf