Markus Bayer
2024
XAI-Attack: Utilizing Explainable AI to Find Incorrectly Learned Patterns for Black-Box Adversarial Example Creation
Markus Bayer
|
Markus Neiczer
|
Maximilian Samsinger
|
Björn Buchhold
|
Christian Reuter
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Adversarial examples, capable of misleading machine learning models into making erroneous predictions, pose significant risks in safety-critical domains such as crisis informatics, medicine, and autonomous driving. To counter this, we introduce a novel textual adversarial example method that identifies falsely learned word indicators by leveraging explainable AI methods as importance functions on incorrectly predicted instances, thus revealing and understanding the weaknesses of a model. To evaluate the effectiveness of our approach, we conduct a human and a transfer evaluation and propose a novel adversarial training evaluation setting for better robustness assessment. While outperforming current adversarial example and training methods, the results also show our method’s potential in facilitating the development of more resilient transformer models by detecting and rectifying biases and patterns in training data, showing baseline improvements of up to 23 percentage points in accuracy on adversarial tasks. The code of our approach is freely available for further exploration and use.