Weijun Li
2025
Cut the Deadwood Out: Backdoor Purification via Guided Module Substitution
Yao Tong | Weijun Li | Xuanli He | Haolan Zhan | Qiongkai Xu
Findings of the Association for Computational Linguistics: EMNLP 2025
NLP models are commonly trained (or fine-tuned) on datasets from untrusted platforms like HuggingFace, posing significant risks of data poisoning attacks. A practical yet underexplored challenge arises when such backdoors are discovered only after model deployment, making defenses that require retraining undesirable due to computational costs and data constraints. In this work, we propose Guided Module Substitution (GMS), an effective retraining-free method based on guided merging of the victim model with a single proxy model. Specifically, GMS selectively replaces modules in the victim model based on a trade-off signal between utility and backdoor. GMS offers four desirable properties: (1) robustness to the choice and trustworthiness of the proxy model, (2) applicability under relaxed data assumptions, (3) stability across hyperparameters, and (4) transferability across different attacks. Extensive experiments on encoder models and decoder LLMs demonstrate the strong effectiveness of GMS. GMS significantly outperforms even the strongest defense baseline, particularly against challenging attacks like LWS.
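To make the module-substitution idea concrete, here is a minimal, hypothetical Python/PyTorch sketch of a greedy guided substitution loop. It is not the paper's released implementation: the scoring functions utility() and backdoor_rate(), the weighting alpha, and the greedy search order are all illustrative assumptions.

# Hypothetical sketch of guided module substitution (not the authors' code).
# Assumes `victim` and `proxy` are architecturally identical torch.nn.Module
# instances, and `utility(model)` / `backdoor_rate(model)` are user-supplied
# evaluation functions returning floats.
import copy

def guided_module_substitution(victim, proxy, utility, backdoor_rate,
                               module_names, alpha=0.5):
    # Trade-off signal: reward clean-task utility, penalize backdoor success.
    def score(model):
        return alpha * utility(model) - (1.0 - alpha) * backdoor_rate(model)

    merged = copy.deepcopy(victim)
    best = score(merged)
    proxy_modules = dict(proxy.named_modules())

    for name in module_names:  # e.g. attention or feed-forward blocks
        candidate = copy.deepcopy(merged)
        parent_name, _, child_name = name.rpartition(".")
        parent = candidate.get_submodule(parent_name) if parent_name else candidate
        # Swap in the corresponding module from the proxy model.
        setattr(parent, child_name, copy.deepcopy(proxy_modules[name]))
        new_score = score(candidate)
        if new_score > best:  # keep the substitution only if it improves the trade-off
            merged, best = candidate, new_score
    return merged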
2024
Seeing the Forest through the Trees: Data Leakage from Partial Transformer Gradients
Weijun Li | Qiongkai Xu | Mark Dras
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent studies have shown that distributed machine learning is vulnerable to gradient inversion attacks, where private training data can be reconstructed by analyzing the gradients of the models shared in training. Previous attacks established that such reconstructions are possible using gradients from all parameters in the entire model. However, we hypothesize that most of the involved modules, or even their sub-modules, are at risk of training data leakage, and we validate such vulnerabilities in various intermediate layers of language models. Our extensive experiments reveal that gradients from a single Transformer layer, or even a single linear component with 0.54% of the parameters, are susceptible to training data leakage. Additionally, we show that applying differential privacy to gradients during training offers limited protection against this novel vulnerability of data disclosure.
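To illustrate the threat model only (not the paper's reconstruction attack), the sketch below isolates the gradient of a single linear component inside one Transformer layer; the model bert-base-uncased, the input sentence, and the choice of the query projection are assumptions for demonstration.

# Hypothetical sketch of the partial-gradient threat model: the adversary
# observes gradients of one linear sub-module rather than the full model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

batch = tokenizer("a private training sentence", return_tensors="pt")
labels = torch.tensor([1])
loss = model(**batch, labels=labels).loss
loss.backward()

# The "partial gradient" available to the attacker: one linear component
# (the query projection of the first attention block), a tiny fraction
# of all model parameters.
target = model.bert.encoder.layer[0].attention.self.query
observed = {n: p.grad.clone() for n, p in target.named_parameters()}
total = sum(p.numel() for p in model.parameters())
leaked = sum(g.numel() for g in observed.values())
print(f"observed gradients for {leaked} of {total} parameters "
      f"({100 * leaked / total:.2f}%)")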