To what extent do human explanations of model behavior align with actual model behavior?

Authors: Grusha Prasad, Yixin Nie, Mohit Bansal, Robin Jia, Douwe Kiela, Adina Williams
Date: November 2021
Type: Conference paper (text)
Published in: Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 1-14
Editors: Jasmijn Bastings, Yonatan Belinkov, Emmanuel Dupoux, Mario Giulianelli, Dieuwke Hupkes, Yuval Pinter, Hassan Sajjad
Publisher: Association for Computational Linguistics, Punta Cana, Dominican Republic
Anthology ID: prasad-etal-2021-extent
DOI: 10.18653/v1/2021.blackboxnlp-1.1
URL: https://aclanthology.org/2021.blackboxnlp-1.1/