What Models Know About Their Attackers: Deriving Attacker Information From Latent Representations

Zhouhang Xie, Jonathan Brophy, Adam Noack, Wencong You, Kalyani Asthana, Carter Perkins, Sabrina Reis, Zayd Hammoudeh, Daniel Lowd, Sameer Singh


Abstract
Adversarial attacks curated against NLP models are increasingly becoming practical threats. Although various methods have been developed to detect adversarial attacks, securing learning-based NLP systems in practice would require more than identifying and evading perturbed instances. To address these issues, we propose a new set of adversary identification tasks, Attacker Attribute Classification via Textual Analysis (AACTA), that attempts to obtain more detailed information about the attackers from adversarial texts. Specifically, given a piece of adversarial text, we hope to accomplish tasks such as localizing perturbed tokens, identifying the attacker’s access level to the target model, determining the evasion mechanism imposed, and specifying the perturbation type employed by the attacking algorithm. Our contributions are as follows: we formalize the task of classifying attacker attributes, and create a benchmark on various target models from sentiment classification and abuse detection domains. We show that signals from BERT models and target models can be used to train classifiers that reveal the properties of the attacking algorithms. We demonstrate that adversarial attacks leave interpretable traces in both feature spaces of pre-trained language models and target models, making AACTA a promising direction towards more trustworthy NLP systems.
Anthology ID:
2021.blackboxnlp-1.6
Volume:
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venue:
BlackboxNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
69–78
Language:
URL:
https://aclanthology.org/2021.blackboxnlp-1.6
DOI:
10.18653/v1/2021.blackboxnlp-1.6
Bibkey:
Cite (ACL):
Zhouhang Xie, Jonathan Brophy, Adam Noack, Wencong You, Kalyani Asthana, Carter Perkins, Sabrina Reis, Zayd Hammoudeh, Daniel Lowd, and Sameer Singh. 2021. What Models Know About Their Attackers: Deriving Attacker Information From Latent Representations. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 69–78, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
What Models Know About Their Attackers: Deriving Attacker Information From Latent Representations (Xie et al., BlackboxNLP 2021)
Copy Citation:
PDF:
https://aclanthology.org/2021.blackboxnlp-1.6.pdf
Video:
 https://aclanthology.org/2021.blackboxnlp-1.6.mp4
Data
SST