Textual Manifold-based Defense Against Natural Language Adversarial Examples

Dang Nguyen Minh, Anh Tuan Luu


Abstract
Despite the recent success of large pretrained language models in NLP, these models are susceptible to adversarial examples. Concurrently, several studies on adversarial images have observed an intriguing property: adversarial images tend to leave the low-dimensional manifold of natural data. In this study, we find that a similar phenomenon occurs in the contextualized embedding space of natural sentences induced by pretrained language models: the embeddings of textual adversarial examples tend to diverge from the manifold of natural sentence embeddings. Based on this finding, we propose Textual Manifold-based Defense (TMD), a defense mechanism that learns the embedding-space manifold of the underlying language model and projects novel inputs back onto the approximated structure before classification. Through extensive experiments, we find that our method consistently and significantly outperforms previous defenses under various attack settings while leaving clean accuracy unaffected. To the best of our knowledge, this is the first manifold-based defense adapted to the NLP domain.
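The abstract describes TMD only at a high level: approximate the manifold of natural sentence embeddings, then project each input embedding onto that approximation before it reaches the classifier. The sketch below illustrates this project-then-classify idea under a deliberately simple assumption: the manifold is approximated by a rank-k affine subspace fitted with PCA. That linear approximation is a substitute chosen for illustration, not the manifold model used in the paper. The class `LinearManifoldProjector`, the function `defended_predict`, the toy classifier head, and the synthetic data are all hypothetical.

```python
import numpy as np


class LinearManifoldProjector:
    """Hypothetical stand-in for a learned embedding manifold.

    Approximates the manifold of clean sentence embeddings with a rank-k
    affine subspace (PCA). This is an illustrative simplification, not the
    manifold model used by TMD.
    """

    def __init__(self, k: int):
        self.k = k
        self.mean = None   # (d,) centroid of the clean embeddings
        self.basis = None  # (k, d) orthonormal rows spanning the subspace

    def fit(self, clean_embeddings: np.ndarray) -> "LinearManifoldProjector":
        # Center the clean embeddings and keep the top-k principal directions.
        self.mean = clean_embeddings.mean(axis=0)
        _, _, vt = np.linalg.svd(clean_embeddings - self.mean, full_matrices=False)
        self.basis = vt[: self.k]
        return self

    def project(self, embeddings: np.ndarray) -> np.ndarray:
        # Orthogonal projection onto the affine subspace mean + span(basis).
        coords = (embeddings - self.mean) @ self.basis.T  # (n, k) subspace coordinates
        return self.mean + coords @ self.basis

    def off_manifold_distance(self, embeddings: np.ndarray) -> np.ndarray:
        # Euclidean distance between each embedding and its projection.
        return np.linalg.norm(embeddings - self.project(embeddings), axis=1)


def defended_predict(projector, classifier_head, embeddings):
    # Project first, then classify, so the head only sees on-manifold inputs.
    return classifier_head(projector.project(embeddings))


# Toy demo with synthetic data: "clean" embeddings lie on a 2-D plane in R^8.
rng = np.random.default_rng(0)
d, k = 8, 2
plane = np.linalg.qr(rng.normal(size=(d, k)))[0].T  # (k, d) orthonormal rows
clean = rng.normal(size=(500, k)) @ plane
projector = LinearManifoldProjector(k).fit(clean)

x_clean = rng.normal(size=(1, k)) @ plane
x_adv = x_clean + 0.5 * rng.normal(size=(1, d))  # hypothetical off-manifold perturbation

W = rng.normal(size=(d, 3))  # hypothetical linear classifier head


def head(e: np.ndarray) -> np.ndarray:
    # Predict the class with the largest linear score.
    return (e @ W).argmax(axis=1)


print(projector.off_manifold_distance(x_clean))  # close to zero: already on the plane
print(projector.off_manifold_distance(x_adv))    # clearly positive: pushed off the plane
print(defended_predict(projector, head, x_adv))
```

In this toy setting, `off_manifold_distance` mirrors the abstract's observation that adversarial embeddings drift away from the manifold of clean ones, and `project` maps such an input back before classification. With a real model, the embeddings would come from the pretrained language model's encoder, and a linear subspace would likely be too crude an approximation of the manifold.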
Anthology ID: 2022.emnlp-main.443
Volume: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates
Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 6612–6625
URL: https://aclanthology.org/2022.emnlp-main.443
DOI: 10.18653/v1/2022.emnlp-main.443
Cite (ACL): Dang Nguyen Minh and Anh Tuan Luu. 2022. Textual Manifold-based Defense Against Natural Language Adversarial Examples. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6612–6625, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal): Textual Manifold-based Defense Against Natural Language Adversarial Examples (Nguyen Minh & Luu, EMNLP 2022)
PDF: https://aclanthology.org/2022.emnlp-main.443.pdf