SPY: Enhancing Privacy with Synthetic PII Detection Dataset

Maksim Savkin; Timur Ionov; Vasily Konovalov

doi:10.18653/v1/2025.naacl-srw.23

Abstract

We introduce the **SPY Dataset**: a novel synthetic dataset for the task of **Personally Identifiable Information (PII) detection**, underscoring the significance of protecting PII in modern data processing. Our research innovates by leveraging Large Language Models (LLMs) to generate a dataset that emulates real-world PII scenarios. Through evaluation, we validate the dataset’s quality, providing a benchmark for PII detection. Comparative analyses reveal that while PII detection and Named Entity Recognition (NER) share similarities, **dedicated NER models exhibit limitations** when applied to PII-specific contexts. This work contributes to the field by making the generation methodology and the generated dataset publicly available, thereby enabling further research and development.

Anthology ID: 2025.naacl-srw.23
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Month: April
Year: 2025
Address: Albuquerque, USA
Editors: Abteen Ebrahimi, Samar Haider, Emmy Liu, Sammar Haider, Maria Leonor Pacheco, Shira Wein
Venues: NAACL | WS
Publisher: Association for Computational Linguistics
Pages: 236–246
URL: https://aclanthology.org/2025.naacl-srw.23/
DOI: 10.18653/v1/2025.naacl-srw.23
Cite (ACL): Maksim Savkin, Timur Ionov, and Vasily Konovalov. 2025. SPY: Enhancing Privacy with Synthetic PII Detection Dataset. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 236–246, Albuquerque, USA. Association for Computational Linguistics.
Cite (Informal): SPY: Enhancing Privacy with Synthetic PII Detection Dataset (Savkin et al., NAACL 2025)
PDF: https://aclanthology.org/2025.naacl-srw.23.pdf
