Enhancing adversarial robustness in Natural Language Inference using explanations

Alexandros Koulakos, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou


Abstract
The surge of state-of-the-art transformer-based models has undoubtedly pushed the limits of NLP model performance, excelling in a variety of tasks. We cast the spotlight on the underexplored task of Natural Language Inference (NLI), since models trained on popular well-suited datasets are susceptible to adversarial attacks, allowing subtle input interventions to mislead the model. In this work, we validate the usage of natural language explanations as a model-agnostic defence strategy through extensive experimentation: simply fine-tuning a classifier on the explanation rather than on premise-hypothesis inputs yields robustness under various adversarial attacks in comparison to explanation-free baselines. Moreover, since there is no standard strategy for testing the semantic validity of the generated explanations, we investigate the correlation of widely used language generation metrics with human perception, so that they may serve as a proxy towards robust NLI models. Our approach is resource-efficient and reproducible without significant computational limitations.
Anthology ID:
2024.blackboxnlp-1.7
Volume:
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Month:
November
Year:
2024
Address:
Miami, Florida, US
Editors:
Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, Hanjie Chen
Venue:
BlackboxNLP
Publisher:
Association for Computational Linguistics
Pages:
105–117
URL:
https://aclanthology.org/2024.blackboxnlp-1.7
Cite (ACL):
Alexandros Koulakos, Maria Lymperaiou, Giorgos Filandrianos, and Giorgos Stamou. 2024. Enhancing adversarial robustness in Natural Language Inference using explanations. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 105–117, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal):
Enhancing adversarial robustness in Natural Language Inference using explanations (Koulakos et al., BlackboxNLP 2024)
PDF:
https://aclanthology.org/2024.blackboxnlp-1.7.pdf