Junjie Shen


2024

De-Identification of Sensitive Personal Data in Datasets Derived from IIT-CDIP
Stefan Larson | Nicole Cornehl Lima | Santiago Pedroza Diaz | Amogh Manoj Joshi | Siddharth Betala | Jamiu Tunde Suleiman | Yash Mathur | Kaushal Kumar Prajapati | Ramla Alakraa | Junjie Shen | Temi Okotore | Kevin Leach
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The IIT-CDIP document collection is the source of several widely used and publicly accessible document understanding datasets. In this paper, manual inspection of 5 datasets derived from IIT-CDIP uncovers the presence of thousands of instances of sensitive personal data, including US Social Security Numbers (SSNs), birth places and dates, and home addresses of individuals. The presence of such sensitive personal data in commonly used and publicly available datasets is startling and has ethical and potentially legal implications; we believe such sensitive data ought to be removed from the internet. Thus, in this paper, we develop a modular data de-identification pipeline that replaces sensitive data with synthetic, but realistic, data. Via experiments, we demonstrate that this de-identification method preserves the utility of the de-identified documents so that they can continue to be used in various document understanding applications. We will release redacted versions of these datasets publicly.