CluSanT: Differentially Private and Semantically Coherent Text Sanitization

Ahmed Musa Awon; Yun Lu; Shera Potka; Alex Thomo

doi:10.18653/v1/2025.naacl-long.187

Abstract

We introduce CluSanT, a novel text sanitization framework based on Metric Local Differential Privacy (MLDP). Our framework consists of three components: token clustering, cluster embedding, and token sanitization. For the first, CluSanT employs Large Language Models (LLMs) to create a set of potential substitute tokens, which we meaningfully cluster. Then, we develop a parameterized cluster embedding that balances the trade-off between privacy and utility. Lastly, we propose an MLDP algorithm that sanitizes sensitive tokens in a text by substituting them with the help of our embedding. Notably, our MLDP-based framework can be tuned with parameters such that (1) existing state-of-the-art (SOTA) token sanitization algorithms can be described, and improved, via our framework with extremal values of our parameters, and (2) by varying our parameters, we allow for a whole spectrum of privacy-utility trade-offs between the two SOTA approaches. Our experiments demonstrate CluSanT’s balance between privacy and semantic coherence, highlighting its capability as a valuable framework for privacy-preserving text sanitization.
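
To make the token sanitization step more concrete, the following is a minimal sketch of a generic exponential mechanism under metric LDP. It is not CluSanT's algorithm: it omits the clustering and the parameterized cluster embedding, and the function names, the Euclidean metric, and the toy embeddings are all assumptions made for illustration. A substitute y for token x is drawn with probability proportional to exp(-ε·d(x, y)/2), which by the triangle inequality bounds the ratio of output probabilities for any two inputs x, x' by exp(ε·d(x, x')).

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def substitution_probs(token, embeddings, epsilon):
    """Probability of each candidate replacing `token`.

    Generic exponential mechanism: weight(y) = exp(-epsilon * d(x, y) / 2).
    This is an illustrative stand-in, not the mechanism from the paper.
    """
    x = embeddings[token]
    weights = {
        cand: math.exp(-epsilon * euclidean(x, vec) / 2.0)
        for cand, vec in embeddings.items()
    }
    total = sum(weights.values())
    return {cand: w / total for cand, w in weights.items()}


def sanitize_token(token, embeddings, epsilon, rng=random):
    """Sample one substitute token according to `substitution_probs`."""
    probs = substitution_probs(token, embeddings, epsilon)
    candidates = list(probs)
    weights = [probs[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical 2-D embeddings, invented purely for this example.
TOY_EMBEDDINGS = {
    "Paris": (0.0, 0.0),
    "Lyon": (1.0, 0.0),
    "Berlin": (3.0, 1.0),
    "Tokyo": (8.0, 6.0),
}
```

In this sketch, a larger ε concentrates probability on candidates near the original token (more utility, less privacy), while ε = 0 yields a uniform draw over all candidates. A call such as `sanitize_token("Paris", TOY_EMBEDDINGS, epsilon=1.0)` returns one sampled substitute.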

Anthology ID: 2025.naacl-long.187
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 3676–3693
URL: https://aclanthology.org/2025.naacl-long.187/
DOI: 10.18653/v1/2025.naacl-long.187
Cite (ACL): Ahmed Musa Awon, Yun Lu, Shera Potka, and Alex Thomo. 2025. CluSanT: Differentially Private and Semantically Coherent Text Sanitization. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3676–3693, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): CluSanT: Differentially Private and Semantically Coherent Text Sanitization (Awon et al., NAACL 2025)
PDF: https://aclanthology.org/2025.naacl-long.187.pdf
