RecordTwin: Towards Creating Safe Synthetic Clinical Corpora

Seiji Shimizu; Ibrahim Baroud; Lisa Raithel; Shuntaro Yada; Shoko Wakamiya; Eiji Aramaki

doi:10.18653/v1/2025.findings-acl.759

Abstract

The scarcity of publicly available clinical corpora hinders the development and application of NLP tools in clinical research. While existing work tackles this issue by using generative models to create high-quality synthetic corpora, these methods require learning from the original in-hospital clinical documents, which makes them infeasible in practice. To address this problem, we introduce RecordTwin, a novel synthetic corpus creation method designed to generate synthetic documents from anonymized clinical entities. In this method, we first extract and anonymize entities from in-hospital documents to restrict the information contained in the synthetic corpus. Then, we use a large language model to fill in the context between the anonymized entities. To do so, we use a small, privacy-preserving subset of the original documents so that the model can mimic their formatting and writing style. This approach requires only anonymized entities and a small subset of original documents in the generation process, making it more feasible in practice. To evaluate the synthetic corpus created with our method, we conduct a proof-of-concept study using a publicly available clinical database. Our results demonstrate that the synthetic corpus has utility comparable to the original data and a safety advantage over baselines, highlighting the potential of RecordTwin for privacy-preserving synthetic corpus creation.
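The two steps described in the abstract (anonymize extracted entities, then ask a language model to fill in the context around them using a few style examples) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Entity` type, the `SENSITIVE_TYPES` set, the placeholder scheme, and the prompt wording are hypothetical, and no model is actually called.

```python
from dataclasses import dataclass

# Hypothetical set of entity types treated as identifying (an assumption,
# not the paper's anonymization scheme).
SENSITIVE_TYPES = {"NAME", "DATE", "LOCATION"}


@dataclass(frozen=True)
class Entity:
    text: str   # surface form extracted from a document
    label: str  # entity type, e.g. "DISEASE" or "NAME"


def anonymize(entities):
    """Replace the text of sensitive entities with a type placeholder."""
    result = []
    for e in entities:
        if e.label in SENSITIVE_TYPES:
            result.append(Entity(f"[{e.label}]", e.label))
        else:
            result.append(e)
    return result


def build_prompt(entities, style_examples):
    """Assemble a few-shot prompt from anonymized entities and style samples."""
    shots = "\n\n".join(f"Example document:\n{doc}" for doc in style_examples)
    entity_list = "\n".join(f"- {e.text} ({e.label})" for e in entities)
    return (
        "Write a clinical document in the style of the examples.\n"
        "Mention only the entities listed, in the given order.\n\n"
        f"{shots}\n\nEntities:\n{entity_list}\n\nDocument:"
    )


# Toy input: one identifying entity and two clinical entities.
entities = [
    Entity("John Doe", "NAME"),
    Entity("pneumonia", "DISEASE"),
    Entity("amoxicillin", "DRUG"),
]
prompt = build_prompt(anonymize(entities), ["Pt admitted with cough. Started on abx."])
print(prompt)
```

In this sketch the resulting `prompt` string would be passed to whatever language model is available; the identifying name never reaches the prompt, while the clinical entities and the style example do.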

Anthology ID: 2025.findings-acl.759
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 14714–14726
URL: https://aclanthology.org/2025.findings-acl.759/
DOI: 10.18653/v1/2025.findings-acl.759
Cite (ACL): Seiji Shimizu, Ibrahim Baroud, Lisa Raithel, Shuntaro Yada, Shoko Wakamiya, and Eiji Aramaki. 2025. RecordTwin: Towards Creating Safe Synthetic Clinical Corpora. In Findings of the Association for Computational Linguistics: ACL 2025, pages 14714–14726, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): RecordTwin: Towards Creating Safe Synthetic Clinical Corpora (Shimizu et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.759.pdf
