Lotta Kiefer


2025

Instruction-Tuning LLaMA for Synthetic Medical Note Generation in Swedish and English
Lotta Kiefer | Jesujoba Alabi | Thomas Vakili | Hercules Dalianis | Dietrich Klakow
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

The increasing capabilities of large language models (LLMs) have unlocked transformative potential for medical applications, but privacy constraints limit access to high-quality training data from electronic health records (EHRs). In response, we propose a framework to generate synthetic EHRs by instruction-tuning an LLM using descriptions of diagnosis codes. We show that this framework overcomes problems of prior approaches, such as reduced diversity and medical incoherence, while maintaining strong privacy protections. Utility was measured by training models to predict diagnosis codes for EHRs. Real data still yields higher utility, but synthetic data approaches the real-data results as the dataset size increases. The differences in utility were most likely due to noise in the synthetic data. A user study involving medical professionals confirmed no significant loss in readability or medical coherence compared to the real EHRs, although inter-annotator agreement was low. These findings establish synthetic EHRs as a viable alternative for privacy-preserving and scalable clinical NLP applications. We release our code on GitHub.
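To make the instruction-tuning setup concrete, here is a minimal sketch of how training pairs might be built from diagnosis-code descriptions. This is not the authors' released code; the prompt template, the ICD-10 descriptions, and the field names are illustrative assumptions.

```python
# Minimal sketch (hypothetical, not the paper's implementation):
# pair an instruction derived from a diagnosis-code description with a
# reference clinical note, yielding examples for supervised fine-tuning.

ICD10_DESCRIPTIONS = {
    "I10": "Essential (primary) hypertension",
    "E11.9": "Type 2 diabetes mellitus without complications",
}

def build_example(code: str, note_text: str) -> dict:
    """Turn one (diagnosis code, note) pair into an instruction-tuning example."""
    instruction = (
        "Write a clinical note for a patient with the following diagnosis: "
        f"{ICD10_DESCRIPTIONS[code]} (ICD-10 {code})."
    )
    return {"instruction": instruction, "output": note_text}

# Example usage: the resulting dicts can be fed to any standard
# instruction-tuning pipeline for a causal LM such as LLaMA.
examples = [
    build_example("I10", "Patient presents with persistently elevated blood pressure ..."),
    build_example("E11.9", "Follow-up visit for type 2 diabetes, currently diet-controlled ..."),
]
```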