BPID: A Benchmark for Personal Identity Deduplication

Runhui Wang; Yefan Tao; Adit Krishnan; Luyang Kong; Xuanqing Liu; Yuqian Deng; Yunzhao Yang; Henrik Johnson; Andrew Borthwick; Shobhit Gupta; Aditi Sinha Gundlapalli; Davor Golac

doi:10.18653/v1/2024.emnlp-industry.40

Abstract

Data deduplication is a critical task in data management and mining, focused on consolidating duplicate records that refer to the same entity. Personally Identifiable Information (PII) is a critical class of data for deduplication across various industries. Consumer data, stored and generated through various engagement channels, is crucial for marketers, agencies, and publishers. However, a major challenge to PII data deduplication is the lack of open-source benchmark datasets due to stringent privacy concerns, which hinders the research, development, and evaluation of robust solutions. This paper addresses this critical lack of PII deduplication benchmarks by introducing the first open-source, high-quality dataset for this task. We provide two datasets: one with 1,000,000 unlabeled synthetic PII profiles and a subset of 10,000 pairs curated and labeled by trained annotators as matches or non-matches. Our datasets contain synthetic profiles built from publicly available sources that do not represent any real individuals, thus ensuring privacy and ethical compliance. We provide several challenging data variations to evaluate the effectiveness of various deduplication techniques, including traditional supervised methods, deep-learning approaches, and large language models (LLMs). Our work aims to set a new standard for PII deduplication, paving the way for more accurate and secure solutions. We share our data publicly at https://zenodo.org/records/13932202.
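As a rough illustration of how labeled match/non-match pairs like those described in the abstract could be used to evaluate a deduplication method, the sketch below scores profile pairs with a simple string-similarity baseline and computes precision and recall at a fixed threshold. It is not one of the methods evaluated in the paper. The field names (`name`, `email`, `phone`, `address`), the helper functions, and the 0.8 threshold are assumptions made for illustration; the released dataset's actual schema may differ.

```python
from difflib import SequenceMatcher

# Hypothetical field names; the released dataset's actual schema may differ.
FIELDS = ("name", "email", "phone", "address")


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def pair_score(p1: dict, p2: dict) -> float:
    """Average similarity over the fields that are non-empty in both profiles."""
    sims = [
        field_similarity(p1[f], p2[f])
        for f in FIELDS
        if p1.get(f) and p2.get(f)
    ]
    return sum(sims) / len(sims) if sims else 0.0


def precision_recall(pairs, labels, threshold=0.8):
    """Precision and recall of the thresholded baseline against boolean match labels."""
    preds = [pair_score(a, b) >= threshold for a, b in pairs]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A fixed threshold on averaged field similarity is only a starting point; the supervised, deep-learning, and LLM-based approaches mentioned in the abstract would replace `pair_score` while the evaluation loop stays the same.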

Anthology ID: 2024.emnlp-industry.40
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month: November
Year: 2024
Address: Miami, Florida, US
Editors: Franck Dernoncourt, Daniel Preoţiuc-Pietro, Anastasia Shimorina
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 538–546
URL: https://aclanthology.org/2024.emnlp-industry.40/
DOI: 10.18653/v1/2024.emnlp-industry.40
Cite (ACL): Runhui Wang, Yefan Tao, Adit Krishnan, Luyang Kong, Xuanqing Liu, Yuqian Deng, Yunzhao Yang, Henrik Johnson, Andrew Borthwick, Shobhit Gupta, Aditi Sinha Gundlapalli, and Davor Golac. 2024. BPID: A Benchmark for Personal Identity Deduplication. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 538–546, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal): BPID: A Benchmark for Personal Identity Deduplication (Wang et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-industry.40.pdf
