@article{zhu-etal-2023-removing,
title = "Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training",
author = "Zhu, Biru and
Cui, Ganqu and
Chen, Yangyi and
Qin, Yujia and
Yuan, Lifan and
Fu, Chong and
Deng, Yangdong and
Liu, Zhiyuan and
Sun, Maosong and
Gu, Ming",
journal = "Transactions of the Association for Computational Linguistics",
volume = "11",
year = "2023",
address = "Cambridge, MA",
publisher = "MIT Press",
url = "https://aclanthology.org/2023.tacl-1.91",
doi = "10.1162/tacl_a_00622",
pages = "1608--1623",
abstract = "Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. The attackers can implant transferable task-agnostic backdoors in PTMs, and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and they are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end approach. The regularization term removes backdoor functionalities from PTMs while the continual pre-training maintains the normal functionalities of PTMs. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve benign functionalities of PTMs with a few downstream-task-irrelevant auxiliary data, e.g., unlabeled plain texts. The average attack success rate on three downstream datasets is reduced from 99.88{\%} to 8.10{\%} after our defense on the backdoored BERT. The codes are publicly available at https://github.com/thunlp/RECIPE.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="zhu-etal-2023-removing">
<titleInfo>
<title>Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training</title>
</titleInfo>
<name type="personal">
<namePart type="given">Biru</namePart>
<namePart type="family">Zhu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ganqu</namePart>
<namePart type="family">Cui</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yangyi</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yujia</namePart>
<namePart type="family">Qin</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lifan</namePart>
<namePart type="family">Yuan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chong</namePart>
<namePart type="family">Fu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yangdong</namePart>
<namePart type="family">Deng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhiyuan</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Maosong</namePart>
<namePart type="family">Sun</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ming</namePart>
<namePart type="family">Gu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2023</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<genre authority="bibutilsgt">journal article</genre>
<relatedItem type="host">
<titleInfo>
<title>Transactions of the Association for Computational Linguistics</title>
</titleInfo>
<originInfo>
<issuance>continuing</issuance>
<publisher>MIT Press</publisher>
<place>
<placeTerm type="text">Cambridge, MA</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">periodical</genre>
<genre authority="bibutilsgt">academic journal</genre>
</relatedItem>
<abstract>Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. The attackers can implant transferable task-agnostic backdoors in PTMs, and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and they are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end approach. The regularization term removes backdoor functionalities from PTMs while the continual pre-training maintains the normal functionalities of PTMs. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve benign functionalities of PTMs with a few downstream-task-irrelevant auxiliary data, e.g., unlabeled plain texts. The average attack success rate on three downstream datasets is reduced from 99.88% to 8.10% after our defense on the backdoored BERT. The codes are publicly available at https://github.com/thunlp/RECIPE.</abstract>
<identifier type="citekey">zhu-etal-2023-removing</identifier>
<identifier type="doi">10.1162/tacl_a_00622</identifier>
<location>
<url>https://aclanthology.org/2023.tacl-1.91</url>
</location>
<part>
<date>2023</date>
<detail type="volume"><number>11</number></detail>
<extent unit="page">
<start>1608</start>
<end>1623</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Journal Article
%T Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training
%A Zhu, Biru
%A Cui, Ganqu
%A Chen, Yangyi
%A Qin, Yujia
%A Yuan, Lifan
%A Fu, Chong
%A Deng, Yangdong
%A Liu, Zhiyuan
%A Sun, Maosong
%A Gu, Ming
%J Transactions of the Association for Computational Linguistics
%D 2023
%V 11
%I MIT Press
%C Cambridge, MA
%F zhu-etal-2023-removing
%X Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. The attackers can implant transferable task-agnostic backdoors in PTMs, and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and they are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end approach. The regularization term removes backdoor functionalities from PTMs while the continual pre-training maintains the normal functionalities of PTMs. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve benign functionalities of PTMs with a few downstream-task-irrelevant auxiliary data, e.g., unlabeled plain texts. The average attack success rate on three downstream datasets is reduced from 99.88% to 8.10% after our defense on the backdoored BERT. The codes are publicly available at https://github.com/thunlp/RECIPE.
%R 10.1162/tacl_a_00622
%U https://aclanthology.org/2023.tacl-1.91
%U https://doi.org/10.1162/tacl_a_00622
%P 1608-1623
Markdown (Informal)
[Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training](https://aclanthology.org/2023.tacl-1.91) (Zhu et al., TACL 2023)
ACL
Biru Zhu, Ganqu Cui, Yangyi Chen, Yujia Qin, Lifan Yuan, Chong Fu, Yangdong Deng, Zhiyuan Liu, Maosong Sun, and Ming Gu. 2023. Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training. Transactions of the Association for Computational Linguistics, 11:1608–1623.