IvRA: A Framework to Enhance Attention-Based Explanations for Language Models with Interpretability-Driven Training

Sean Xie, Soroush Vosoughi, Saeed Hassanpour


Abstract
Attention has long served as a foundational technique for generating explanations. With recent developments in Explainable AI (XAI), the multi-faceted nature of interpretability has become more apparent. Can attention, as an explanation method, be adapted to meet the diverse needs that our expanded understanding of interpretability demands? In this work, we aim to address this question by introducing IvRA, a framework designed to directly train a language model’s attention distribution through regularization to produce attribution explanations that align with interpretability criteria such as simulatability, faithfulness, and consistency. Our extensive experimental analysis demonstrates that IvRA outperforms existing methods in guiding language models to generate explanations that are simulatable, faithful, and consistent, in tandem with their predictions. Furthermore, we perform ablation studies to verify the robustness of IvRA across various experimental settings and to shed light on the interactions among different interpretability criteria.
Anthology ID: 2024.blackboxnlp-1.27
Volume: Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Month: November
Year: 2024
Address: Miami, Florida, US
Editors: Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, Hanjie Chen
Venue: BlackboxNLP
Publisher: Association for Computational Linguistics
Pages: 431–451
URL: https://aclanthology.org/2024.blackboxnlp-1.27
Cite (ACL): Sean Xie, Soroush Vosoughi, and Saeed Hassanpour. 2024. IvRA: A Framework to Enhance Attention-Based Explanations for Language Models with Interpretability-Driven Training. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 431–451, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal): IvRA: A Framework to Enhance Attention-Based Explanations for Language Models with Interpretability-Driven Training (Xie et al., BlackboxNLP 2024)
PDF: https://aclanthology.org/2024.blackboxnlp-1.27.pdf