Efficiently and Thoroughly Anonymizing a Transformer Language Model for Dutch Electronic Health Records: a Two-Step Method

Stella Verkijk; Piek Vossen

Efficiently and Thoroughly Anonymizing a Transformer Language Model for Dutch Electronic Health Records: a Two-Step Method

Abstract

Neural Network (NN) architectures are used more and more to model large amounts of data, such as text data available online. Transformer-based NN architectures have shown to be very useful for language modelling. Although many researchers study how such Language Models (LMs) work, not much attention has been paid to the privacy risks of training LMs on large amounts of data and publishing them online. This paper presents a new method for anonymizing a language model by presenting the way in which MedRoBERTa.nl, a Dutch language model for hospital notes, was anonymized. The two-step method involves i) automatic anonymization of the training data and ii) semi-automatic anonymization of the LM’s vocabulary. Adopting the fill-mask task where the model predicts what tokens are most probable in a certain context, it was tested how often the model will predict a name in a context where a name should be. It was shown that it predicts a name-like token 0.2% of the time. Any name-like token that was predicted was never the name originally present in the training data. By explaining how a LM trained on highly private real-world medical data can be published, we hope that more language resources will be published openly and responsibly so the scientific community can profit from them.

Anthology ID:: 2022.lrec-1.118
Volume:: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:: June
Year:: 2022
Address:: Marseille, France
Editors:: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:: LREC
SIG:
Publisher:: European Language Resources Association
Note:
Pages:: 1098–1103
Language:
URL:: https://aclanthology.org/2022.lrec-1.118/
DOI:
Bibkey:
Cite (ACL):: Stella Verkijk and Piek Vossen. 2022. Efficiently and Thoroughly Anonymizing a Transformer Language Model for Dutch Electronic Health Records: a Two-Step Method. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1098–1103, Marseille, France. European Language Resources Association.
Cite (Informal):: Efficiently and Thoroughly Anonymizing a Transformer Language Model for Dutch Electronic Health Records: a Two-Step Method (Verkijk & Vossen, LREC 2022)
Copy Citation:
PDF:: https://aclanthology.org/2022.lrec-1.118.pdf

PDF Cite Search Fix data