Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains

Krithika Ramesh; Nupoor Gandhi; Pulkit Madaan; Lisa Bauer; Charith Peris; Anjalie Field

doi:10.18653/v1/2024.findings-emnlp.894

Abstract

The difficulty of anonymizing text data hinders the development and deployment of NLP in high-stakes domains that involve private data, such as healthcare and social services. Poorly anonymized sensitive data cannot be easily shared with annotators or external researchers, nor can it be used to train public models. In this work, we explore the feasibility of using synthetic data generated from differentially private language models in place of real data to facilitate the development of NLP in these domains without compromising privacy. In contrast to prior work, we generate synthetic data for real high-stakes domains, and we propose and conduct use-inspired evaluations to assess data quality. Our results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data. Overall, our work underscores the need for further improvements to synthetic data generation for it to be a viable way to enable privacy-preserving data sharing.

Anthology ID: 2024.findings-emnlp.894
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 15254–15269
URL: https://aclanthology.org/2024.findings-emnlp.894/
DOI: 10.18653/v1/2024.findings-emnlp.894
Cite (ACL): Krithika Ramesh, Nupoor Gandhi, Pulkit Madaan, Lisa Bauer, Charith Peris, and Anjalie Field. 2024. Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15254–15269, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains (Ramesh et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.894.pdf
