Wenjie Ruan


2024

pdf bib
CROWD: Certified Robustness via Weight Distribution for Smoothed Classifiers against Backdoor Attack
Siqi Sun | Procheta Sen | Wenjie Ruan
Findings of the Association for Computational Linguistics: EMNLP 2024

Language models are vulnerable to clandestinely modified data and manipulation by attackers. Despite considerable research dedicated to enhancing robustness against adversarial attacks, the realm of provable robustness for backdoor attacks remains relatively unexplored. In this paper, we initiate a pioneering investigation into the certified robustness of NLP models against backdoor triggers.We propose a model-agnostic mechanism for large-scale models that applies to complex model structures without the need for assessing model architecture or internal knowledge. More importantly, we take recent advances in randomized smoothing theory and propose a novel weight-based distribution algorithm to enable semantic similarity and provide theoretical robustness guarantees.Experimentally, we demonstrate the efficacy of our approach across a diverse range of datasets and tasks, highlighting its utility in mitigating backdoor triggers. Our results show strong performance in terms of certified accuracy, scalability, and semantic preservation.

2023

pdf bib
TextVerifier: Robustness Verification for Textual Classifiers with Certifiable Guarantees
Siqi Sun | Wenjie Ruan
Findings of the Association for Computational Linguistics: ACL 2023

When textual classifiers are deployed in safety-critical workflows, they must withstand the onslaught of AI-enabled model confusion caused by adversarial examples with minor alterations. In this paper, the main objective is to provide a formal verification framework, called TextVerifier, with certifiable guarantees on deep neural networks in natural language processing against word-level alteration attacks. We aim to provide an approximation of the maximal safe radius by deriving provable bounds both mathematically and automatically, where a minimum word-level L_0 distance is quantified as a guarantee for the classification invariance of victim models. Here, we illustrate three strengths of our strategy: i) certifiable guarantee: effective verification with convergence to ensure approximation of maximal safe radius with tight bounds ultimately; ii) high-efficiency: it yields an efficient speed edge by a novel parallelization strategy that can process a set of candidate texts simultaneously on GPUs; and iii) reliable anytime estimation: the verification can return intermediate bounds, and robustness estimates that are gradually, but strictly, improved as the computation proceeds. Furthermore, experiments are conducted on text classification on four datasets over three victim models to demonstrate the validity of tightening bounds. Our tool TextVerifier is available at https://github.com/TrustAI/TextVerifer.