The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)

Shenglai Zeng; Jiankun Zhang; Pengfei He; Yiding Liu; Yue Xing; Han Xu; Jie Ren; Yi Chang; Shuaiqiang Wang; Dawei Yin; Jiliang Tang

doi:10.18653/v1/2024.findings-acl.267

Abstract

Retrieval-augmented generation (RAG) is a powerful technique to facilitate language model generation with proprietary and private data, where data privacy is a pivotal concern. Whereas extensive research has demonstrated the privacy risks of large language models (LLMs), the RAG technique could potentially reshape the inherent behaviors of LLM generation, posing new privacy issues that are currently under-explored. To this end, we conduct extensive empirical studies with novel attack methods, which demonstrate the vulnerability of RAG systems to leaking the private retrieval database. Despite the new risks that RAG brings to the retrieval data, we further discover that RAG can be used to mitigate the old risks, i.e., the leakage of the LLMs’ training data. In general, we reveal many new insights in this paper for privacy protection of retrieval-augmented LLMs, which could benefit builders of both LLMs and RAG systems.

Anthology ID: 2024.findings-acl.267
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4505–4524
URL: https://aclanthology.org/2024.findings-acl.267/
DOI: 10.18653/v1/2024.findings-acl.267
Cite (ACL): Shenglai Zeng, Jiankun Zhang, Pengfei He, Yiding Liu, Yue Xing, Han Xu, Jie Ren, Yi Chang, Shuaiqiang Wang, Dawei Yin, and Jiliang Tang. 2024. The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG). In Findings of the Association for Computational Linguistics: ACL 2024, pages 4505–4524, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG) (Zeng et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.267.pdf
