Utility Preservation of Clinical Text After De-Identification

Thomas Vakili, Hercules Dalianis


Abstract
Electronic health records contain valuable information about the symptoms, diagnoses, treatments, and treatment outcomes of individual patients. However, the records may also contain information that can reveal the identity of the patients. Removing these identifiers, known as Protected Health Information (PHI), can protect the identity of the patient. Automatic de-identification is a process that employs machine learning techniques to detect and remove PHI. However, automatic techniques are imperfect in their precision and introduce noise into the data. This study examines the impact of this noise on the utility of Swedish de-identified clinical data by using human evaluators and by training and testing BERT models. Our results indicate that de-identification does not harm the utility for clinical NLP and that human evaluators are less sensitive to noise from de-identification than expected.
Anthology ID:
2022.bionlp-1.38
Volume:
Proceedings of the 21st Workshop on Biomedical Language Processing
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venues:
ACL | BioNLP
Publisher:
Association for Computational Linguistics
Pages:
383–388
URL:
https://aclanthology.org/2022.bionlp-1.38
DOI:
10.18653/v1/2022.bionlp-1.38
Cite (ACL):
Thomas Vakili and Hercules Dalianis. 2022. Utility Preservation of Clinical Text After De-Identification. In Proceedings of the 21st Workshop on Biomedical Language Processing, pages 383–388, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Utility Preservation of Clinical Text After De-Identification (Vakili & Dalianis, BioNLP 2022)
PDF:
https://aclanthology.org/2022.bionlp-1.38.pdf