Automated Anonymization as Spelling Variant Detection

Steven Kester Yuwono, Hwee Tou Ng, Kee Yuan Ngiam


Abstract
Privacy has always been a concern when clinical texts are used for research purposes. Personal health information (PHI), such as name and identification number, needs to be removed so that patients cannot be identified. Manual anonymization is not feasible due to the large number of clinical texts to be anonymized. In this paper, we tackle the task of anonymizing clinical texts that are written in sentence fragments and frequently contain symbols, abbreviations, and misspelled words. Our clinical texts therefore differ from those in the i2b2 shared tasks, which are in prose form with complete sentences. Our clinical texts are also part of a structured database that contains patient name and identification number in structured fields. As such, we formulate our anonymization task as spelling variant detection, exploiting patients’ personal information in the structured fields to detect their spelling variants in clinical texts. We successfully anonymized clinical texts consisting of more than 200 million words, using minimum edit distance and regular expression patterns.
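The idea in the abstract, using a patient's known name and identification number from structured fields to find their spelling variants in free text, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`edit_distance`, `anonymize`), the `max_dist` threshold, the rule that short name parts must match exactly, the optional space or hyphen inside an ID, and the `<NAME>` and `<ID>` placeholders are all assumptions made for illustration.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Minimum edit (Levenshtein) distance via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def anonymize(text: str, name: str, id_number: str, max_dist: int = 1) -> str:
    """Mask spelling variants of a known name and ID in free text.

    `name` and `id_number` stand in for values read from structured
    database fields. All thresholds here are illustrative assumptions.
    """
    name_parts = [part.lower() for part in name.split()]

    # Assumed ID variant pattern: the known ID, case-insensitive, with an
    # optional space or hyphen between consecutive characters.
    id_regex = r"[\s-]?".join(re.escape(ch) for ch in id_number)
    pattern = re.compile(rf"(?P<id>{id_regex})|(?P<word>[A-Za-z]+)",
                         re.IGNORECASE)

    def replace(match: re.Match) -> str:
        if match.group("id") is not None:
            return "<ID>"
        token = match.group("word")
        for part in name_parts:
            # Assumed rule: name parts shorter than 4 letters must match
            # exactly, to limit false positives on common short words.
            allowed = max_dist if len(part) >= 4 else 0
            if edit_distance(token.lower(), part) <= allowed:
                return "<NAME>"
        return token

    # A single pass over the text, so inserted placeholders are never rescanned.
    return pattern.sub(replace, text)


if __name__ == "__main__":
    note = "pt JOHN TAN c/o SOB. Mr Jon Tan, ID s1234567-a"
    print(anonymize(note, "John Tan", "S1234567A"))
```

Because the known values come from each patient's own record, a matcher of this kind only has to compare tokens against a handful of strings per text, which is what makes edit distance practical at scale. Note that plain Levenshtein distance counts a transposition such as "Jhon" as two edits, so a real system would need to choose its distance measure and thresholds with care.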
Anthology ID:
W16-4214
Volume:
Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Anna Rumshisky, Kirk Roberts, Steven Bethard, Tristan Naumann
Venue:
ClinicalNLP
Publisher:
The COLING 2016 Organizing Committee
Pages:
99–103
URL:
https://aclanthology.org/W16-4214/
Cite (ACL):
Steven Kester Yuwono, Hwee Tou Ng, and Kee Yuan Ngiam. 2016. Automated Anonymization as Spelling Variant Detection. In Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP), pages 99–103, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Automated Anonymization as Spelling Variant Detection (Yuwono et al., ClinicalNLP 2016)
PDF:
https://aclanthology.org/W16-4214.pdf