B⁴: A Black-Box Scrubbing Attack on LLM Watermarks

Baizhou Huang; Xiao Pu; Xiaojun Wan

doi:10.18653/v1/2025.naacl-long.460

Abstract

Watermarking has emerged as a prominent technique for detecting LLM-generated content by embedding imperceptible patterns. Despite its strong performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known. Some attacks even require knowledge of the watermarking method's hyperparameters. Such prerequisites are unattainable in real-world scenarios. Targeting a more realistic black-box threat model with fewer assumptions, we propose B⁴, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy distributions. Experimental results across 12 different settings demonstrate the superior performance of B⁴ compared with other baselines.

Anthology ID: 2025.naacl-long.460
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 9113–9126
URL: https://aclanthology.org/2025.naacl-long.460/
DOI: 10.18653/v1/2025.naacl-long.460
Cite (ACL): Baizhou Huang, Xiao Pu, and Xiaojun Wan. 2025. B4: A Black-Box Scrubbing Attack on LLM Watermarks. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 9113–9126, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): B4: A Black-Box Scrubbing Attack on LLM Watermarks (Huang et al., NAACL 2025)
PDF: https://aclanthology.org/2025.naacl-long.460.pdf
