Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?

Leyi Pan; Aiwei Liu; Shiyu Huang; Yijian Lu; Xuming Hu; Lijie Wen; Irwin King; Philip S. Yu

doi:10.18653/v1/2025.acl-long.648

Abstract

The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies.

Anthology ID: 2025.acl-long.648
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 13228–13251
URL: https://aclanthology.org/2025.acl-long.648/
DOI: 10.18653/v1/2025.acl-long.648
Cite (ACL): Leyi Pan, Aiwei Liu, Shiyu Huang, Yijian Lu, Xuming Hu, Lijie Wen, Irwin King, and Philip S. Yu. 2025. Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13228–13251, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation? (Pan et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.648.pdf
