KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server

WenHao Wang; Xiaoyu Liang; Rui Ye; Jingyi Chai; Siheng Chen; Yanfeng Wang

doi:10.18653/v1/2024.emnlp-main.438

Abstract

The success of large language models (LLMs) has led many parties to fine-tune LLMs on their own private data. However, this practice raises privacy concerns because LLMs can memorize their training data. Existing solutions, such as substituting synthetic data for the private data, struggle to improve performance and preserve privacy at the same time. They either rely on a local model for generation, which degrades performance, or use external APIs, which directly exposes the data to the API servers. To address this issue, we propose KnowledgeSG, a novel client-server framework that enhances synthetic data quality and improves model performance while ensuring privacy. We achieve this by learning local knowledge from the private data with differential privacy (DP) and distilling professional knowledge from the server. Additionally, inspired by federated learning, we transmit models rather than data between the client and server to prevent privacy leakage. Extensive experiments in the medical and financial domains demonstrate the effectiveness of KnowledgeSG. Our code is publicly available at https://github.com/wwh0411/KnowledgeSG.

Anthology ID: 2024.emnlp-main.438
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 7677–7695
URL: https://aclanthology.org/2024.emnlp-main.438/
DOI: 10.18653/v1/2024.emnlp-main.438
Cite (ACL): WenHao Wang, Xiaoyu Liang, Rui Ye, Jingyi Chai, Siheng Chen, and Yanfeng Wang. 2024. KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7677–7695, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server (Wang et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.438.pdf
