The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text

Hanna Berg, Aron Henriksson, Hercules Dalianis


Abstract
The impact of de-identification on data quality and, in particular, utility for developing models for downstream tasks has been more thoroughly studied for structured data than for unstructured text. While previous studies indicate that text de-identification has a limited impact on models for downstream tasks, it remains unclear what the impact is with various levels and forms of de-identification, in particular concerning the trade-off between precision and recall. In this paper, the impact of de-identification is studied on downstream named entity recognition in Swedish clinical text. The results indicate that de-identification models with moderate to high precision lead to similar downstream performance, while low precision has a substantial negative impact. Furthermore, different strategies for concealing sensitive information affect performance to different degrees, ranging from pseudonymisation having a low impact to the removal of entire sentences with sensitive information having a high impact. This study indicates that it is possible to increase the recall of models for identifying sensitive information without negatively affecting the use of de-identified text data for training models for clinical named entity recognition; however, there is ultimately a trade-off between the level of de-identification and the subsequent utility of the data.
Anthology ID:
2020.louhi-1.1
Volume:
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis
Month:
November
Year:
2020
Address:
Online
Editors:
Eben Holderness, Antonio Jimeno Yepes, Alberto Lavelli, Anne-Lyse Minard, James Pustejovsky, Fabio Rinaldi
Venue:
Louhi
Publisher:
Association for Computational Linguistics
Pages:
1–11
URL:
https://aclanthology.org/2020.louhi-1.1
DOI:
10.18653/v1/2020.louhi-1.1
Cite (ACL):
Hanna Berg, Aron Henriksson, and Hercules Dalianis. 2020. The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, pages 1–11, Online. Association for Computational Linguistics.
Cite (Informal):
The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text (Berg et al., Louhi 2020)
PDF:
https://aclanthology.org/2020.louhi-1.1.pdf
Optional supplementary material:
2020.louhi-1.1.OptionalSupplementaryMaterial.zip
Video:
https://slideslive.com/38940038