Wrapper Boxes for Faithful Attribution of Model Predictions to Training Data

Yiheng Su, Junyi Jessy Li, Matthew Lease


Abstract
Can we preserve the accuracy of neural models while also providing faithful explanations of model decisions to training data? We propose a “wrapper box” pipeline: training a neural model as usual and then using its learned feature representation in classic, interpretable models to perform prediction. Across seven language models of varying sizes, including four large language models (LLMs), two datasets at different scales, three classic models, and four evaluation metrics, we first show that the predictive performance of wrapper classic models is largely comparable to that of the original neural models. Because classic models are transparent, each model decision is determined by a known set of training examples that can be shown directly to users. Our pipeline thus preserves the predictive performance of neural language models while faithfully attributing classic model decisions to training data. Among other use cases, such attribution enables model decisions to be contested based on the training instances responsible for them. Compared to prior work, our approach achieves higher coverage and correctness in identifying which training data to remove to change a model decision. To reproduce our findings, source code is available at: https://github.com/SamSoup/WrapperBox.
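To illustrate the pipeline the abstract describes, here is a minimal sketch (not the authors' code): extract a neural encoder's learned representations, then fit a classic, interpretable model on them. The encoder name, mean-pooling choice, k value, and toy sentiment data are all illustrative assumptions; the paper evaluates several classic models, of which k-NN is one plausible instance since its predictions attribute directly to nearest training examples.

```python
# Hedged sketch of a "wrapper box": a classic k-NN model fitted on a
# neural encoder's feature representations. All names below are
# illustrative assumptions, not the authors' exact setup.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.neighbors import KNeighborsClassifier

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # Mean-pool the final hidden states into one fixed-size vector per text.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy training data (hypothetical, for illustration only).
train_texts = ["great movie", "terrible film", "loved it", "awful plot"]
train_labels = [1, 0, 1, 0]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(embed(train_texts), train_labels)

# Each prediction is determined by a known set of training examples,
# which can be shown directly to users as a faithful attribution.
x = embed(["what a wonderful story"])
pred = knn.predict(x)
_, idx = knn.kneighbors(x)
print(pred, [train_texts[i] for i in idx[0]])
```

Under this sketch, contesting a decision amounts to inspecting (or removing) the retrieved neighbors, which is the kind of training-data attribution the abstract refers to.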
Anthology ID:
2024.blackboxnlp-1.33
Volume:
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Month:
November
Year:
2024
Address:
Miami, Florida, US
Editors:
Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, Hanjie Chen
Venue:
BlackboxNLP
Publisher:
Association for Computational Linguistics
Pages:
551–576
URL:
https://aclanthology.org/2024.blackboxnlp-1.33
DOI:
10.18653/v1/2024.blackboxnlp-1.33
Cite (ACL):
Yiheng Su, Junyi Jessy Li, and Matthew Lease. 2024. Wrapper Boxes for Faithful Attribution of Model Predictions to Training Data. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 551–576, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal):
Wrapper Boxes for Faithful Attribution of Model Predictions to Training Data (Su et al., BlackboxNLP 2024)
PDF:
https://aclanthology.org/2024.blackboxnlp-1.33.pdf