Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data

Thomas Vakili, Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis
Abstract
Automatic de-identification is a cost-effective and straightforward way of removing personally identifiable information from large and sensitive corpora. However, these systems also introduce errors into datasets due to their imperfect precision. Such corruption of the data may negatively impact the utility of the de-identified dataset. This paper de-identifies a very large clinical corpus in Swedish either by removing entire sentences containing sensitive data or by replacing sensitive words with realistic surrogates. These two datasets are used to perform domain adaptation of a general Swedish BERT model. The impact of the de-identification techniques is assessed by training and evaluating the models using six clinical downstream tasks. The results are then compared to a similar BERT model domain-adapted using an unaltered version of the clinical corpus. The results show that using an automatically de-identified corpus for domain adaptation does not negatively impact downstream performance. We argue that automatic de-identification is an efficient way of reducing the privacy risks of domain-adapted models and that the models created in this paper should be safe to distribute to other academic researchers.
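
As a rough illustration of the two de-identification strategies compared in the abstract, the Python sketch below contrasts sentence removal with surrogate replacement over automatically detected PII spans. The `detect_pii` tagger (a toy date-only rule here) and the surrogate pools are hypothetical stand-ins, not the authors' actual pipeline or data.

```python
# Minimal sketch of the two de-identification strategies (not the authors' pipeline):
# 1) drop every sentence containing detected PII, or
# 2) keep all sentences but replace each PII span with a realistic surrogate.
import random
import re

# Hypothetical surrogate pools per PII class.
SURROGATES = {
    "DATE": ["2015-03-12", "2017-09-01"],
}

def detect_pii(sentence):
    """Stand-in for an automatic PII tagger; returns (start, end, label) spans.
    Here only ISO-style dates are detected, as a toy example."""
    return [(m.start(), m.end(), "DATE")
            for m in re.finditer(r"\b\d{4}-\d{2}-\d{2}\b", sentence)]

def deidentify_by_removal(sentences):
    """Strategy 1: remove entire sentences in which any PII was detected."""
    return [s for s in sentences if not detect_pii(s)]

def deidentify_by_surrogates(sentences):
    """Strategy 2: replace each detected PII span with a surrogate of the same class."""
    out = []
    for s in sentences:
        # Replace from right to left so earlier offsets stay valid.
        for start, end, label in sorted(detect_pii(s), reverse=True):
            s = s[:start] + random.choice(SURROGATES[label]) + s[end:]
        out.append(s)
    return out

if __name__ == "__main__":
    corpus = ["Patient admitted on 2014-05-03 with chest pain.",
              "No known drug allergies."]
    print(deidentify_by_removal(corpus))    # first sentence dropped
    print(deidentify_by_surrogates(corpus)) # date swapped for a surrogate
```

Either transformed corpus can then be used for continued pretraining (domain adaptation) of a general BERT model, which is the comparison the paper evaluates on its clinical downstream tasks.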
Anthology ID: 2022.lrec-1.451
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 4245–4252
URL: https://aclanthology.org/2022.lrec-1.451
Cite (ACL): Thomas Vakili, Anastasios Lamproudis, Aron Henriksson, and Hercules Dalianis. 2022. Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4245–4252, Marseille, France. European Language Resources Association.
Cite (Informal): Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data (Vakili et al., LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.451.pdf
Data: MIMIC-III