Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations

Supriya Manna, Niladri Sett


Abstract
Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer’s response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques and, furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.
Anthology ID:
2024.blackboxnlp-1.12
Volume:
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Month:
November
Year:
2024
Address:
Miami, Florida, US
Editors:
Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, Hanjie Chen
Venue:
BlackboxNLP
Publisher:
Association for Computational Linguistics
Pages:
193–206
URL:
https://aclanthology.org/2024.blackboxnlp-1.12
Cite (ACL):
Supriya Manna and Niladri Sett. 2024. Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 193–206, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal):
Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations (Manna & Sett, BlackboxNLP 2024)
PDF:
https://aclanthology.org/2024.blackboxnlp-1.12.pdf