Revisiting De-Identification of Electronic Medical Records: Evaluation of Within- and Cross-Hospital Generalization

Yiyang Liu, Jinpeng Li, Enwei Zhu


Abstract
The de-identification task aims to detect and remove protected health information from electronic medical records (EMRs). Previous studies generally focus on the within-hospital setting and achieve strong results, while the cross-hospital setting has been overlooked. This study introduces a new de-identification dataset comprising EMRs from three hospitals in China, creating a benchmark for evaluating both within- and cross-hospital generalization. We find significant domain discrepancy between hospitals: a model with almost perfect within-hospital performance struggles when transferred across hospitals. Further experiments show that pretrained language models and some domain generalization methods can alleviate this problem. We believe that our data and findings will encourage investigations into the generalization of medical NLP models.
Anthology ID:
2023.emnlp-main.224
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3666–3674
URL:
https://aclanthology.org/2023.emnlp-main.224
DOI:
10.18653/v1/2023.emnlp-main.224
Cite (ACL):
Yiyang Liu, Jinpeng Li, and Enwei Zhu. 2023. Revisiting De-Identification of Electronic Medical Records: Evaluation of Within- and Cross-Hospital Generalization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3666–3674, Singapore. Association for Computational Linguistics.
Cite (Informal):
Revisiting De-Identification of Electronic Medical Records: Evaluation of Within- and Cross-Hospital Generalization (Liu et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.224.pdf
Video:
https://aclanthology.org/2023.emnlp-main.224.mp4