Hanna Berg


HB Deid - HB De-identification tool demonstrator
Hanna Berg | Hercules Dalianis
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

This paper describes a freely available web-based demonstrator called HB Deid. HB Deid identifies so-called protected health information (PHI) in text written in Swedish and removes it, masks it, or replaces it with surrogates or pseudonyms. PHI consists of named entities such as personal names, locations, ages, phone numbers, and dates. HB Deid uses a CRF model trained on non-sensitive annotated text in Swedish, as well as a rule-based post-processing step, to find PHI. The final step in obscuring the PHI is then to either mask it, show only the class name, or use a rule-based pseudonymisation system to replace it.
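
The three concealment options named in the abstract (masking, showing only the class name, and replacing with a pseudonym) can be illustrated with a minimal sketch. This is not HB Deid's code: the function `conceal`, the span format, the class labels, and the `SURROGATES` table are all hypothetical, and the PHI spans are assumed to have been found already by some upstream tagger.

```python
from typing import Dict, List, Tuple

# Hypothetical span format: (start offset, end offset, PHI class label).
Span = Tuple[int, int, str]

# Hypothetical surrogate table; a real pseudonymiser would pick
# replacements with rules rather than a fixed lookup.
SURROGATES: Dict[str, str] = {"First_Name": "Anna", "Location": "Uppsala"}


def conceal(text: str, spans: List[Span], strategy: str) -> str:
    """Rewrite `text`, concealing each span according to `strategy`."""
    pieces = []
    last = 0
    for start, end, label in sorted(spans):
        pieces.append(text[last:start])  # keep the text between spans
        if strategy == "mask":
            pieces.append("X" * (end - start))
        elif strategy == "class":
            pieces.append(f"[{label}]")
        elif strategy == "pseudonym":
            # Fall back to the class name when no surrogate is known.
            pieces.append(SURROGATES.get(label, f"[{label}]"))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        last = end
    pieces.append(text[last:])
    return "".join(pieces)


example = "Karin bor i Lund."
example_spans = [(0, 5, "First_Name"), (12, 16, "Location")]
for name in ("mask", "class", "pseudonym"):
    print(name, "->", conceal(example, example_spans, name))
```

Masking keeps the length of the original token visible, class names keep the entity type readable, and pseudonyms keep the text looking natural.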


The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text
Hanna Berg | Aron Henriksson | Hercules Dalianis
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis

The impact of de-identification on data quality and, in particular, utility for developing models for downstream tasks has been more thoroughly studied for structured data than for unstructured text. While previous studies indicate that text de-identification has a limited impact on models for downstream tasks, it remains unclear what the impact is with various levels and forms of de-identification, in particular concerning the trade-off between precision and recall. In this paper, the impact of de-identification is studied on downstream named entity recognition in Swedish clinical text. The results indicate that de-identification models with moderate to high precision lead to similar downstream performance, while low precision has a substantial negative impact. Furthermore, different strategies for concealing sensitive information affect performance to different degrees, ranging from pseudonymisation having a low impact to the removal of entire sentences with sensitive information having a high impact. This study indicates that it is possible to increase the recall of models for identifying sensitive information without negatively affecting the use of de-identified text data for training models for clinical named entity recognition; however, there is ultimately a trade-off between the level of de-identification and the subsequent utility of the data.

A Semi-supervised Approach for De-identification of Swedish Clinical Text
Hanna Berg | Hercules Dalianis
Proceedings of the Twelfth Language Resources and Evaluation Conference

An abundance of electronic health records (EHRs) is produced every day within healthcare. The records contain valuable information for research and for the future improvement of healthcare. Multiple efforts have been made to protect the integrity of patients while making electronic health records usable for research by removing personally identifiable information from patient records. Supervised machine learning approaches for de-identification of EHRs need annotated data for training, and such annotations are costly in time and human resources. Annotating clinical text is even more costly, as the process must be carried out in a protected environment with a limited number of annotators who must have signed confidentiality agreements. This paper therefore proposes a semi-supervised method for automatically creating high-quality training data. The study shows that the method can be used to improve recall from 84.75% to 89.20% with a smaller loss in precision, which drops from 95.73% to 94.20%. The model’s recall is arguably more important for de-identification than precision.
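
To put the reported precision and recall figures on a single scale, the balanced F1 score can be derived from them. The abstract does not report F1; the values computed below are plain arithmetic on the four percentages it does state, and the names `f1`, `baseline_f1`, and `semi_supervised_f1` are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, in percent.
baseline_f1 = f1(precision=95.73, recall=84.75)
semi_supervised_f1 = f1(precision=94.20, recall=89.20)

print(f"baseline F1:        {baseline_f1:.2f}")
print(f"semi-supervised F1: {semi_supervised_f1:.2f}")
```

Under this derived measure the balanced score rises from roughly 89.9 to 91.6, so the recall gain outweighs the precision loss even before recall is given extra weight.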


Augmenting a De-identification System for Swedish Clinical Text Using Open Resources and Deep Learning
Hanna Berg | Hercules Dalianis
Proceedings of the Workshop on NLP and Pseudonymisation

Building a De-identification System for Real Swedish Clinical Text Using Pseudonymised Clinical Text
Hanna Berg | Taridzo Chomutare | Hercules Dalianis
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

This article presents experiments with pseudonymised Swedish clinical text used as training data to de-identify real clinical text, with the future aim of transferring non-sensitive training data to other hospitals. Conditional Random Fields (CRF) and Long Short-Term Memory (LSTM) machine learning algorithms were used to train de-identification models. The two models were trained on pseudonymised data and evaluated on real data. For benchmarking, models were also trained and evaluated on real data, as well as trained and evaluated on pseudonymised data. CRF showed better performance for some PHI classes such as Date Part, First Name and Last Name, consistent with some reports in the literature. In contrast, poor performance on Location and Health Care Unit information was noted, partially due to the constrained vocabulary in the pseudonymised training data. It is concluded that it is possible to train transferable models based on pseudonymised Swedish clinical data, but even small narrative and distributional variation could negatively impact performance.