Exploring Transformer Text Generation for Medical Dataset Augmentation

Ali Amin-Nejad; Julia Ive; Sumithra Velupillai

Abstract

Natural Language Processing (NLP) can help unlock the vast troves of unstructured data in clinical text and thus improve healthcare research. However, a major barrier to progress in this field is data access: patient confidentiality prohibits the sharing of this data, so the datasets that are openly available are small, fragmented and sequestered. Since NLP model development requires large quantities of data, we aim to help side-step this roadblock by exploring the use of Natural Language Generation to augment datasets so that they can be used for NLP model development on downstream clinically relevant tasks. We propose a methodology that guides the generation with structured patient information in a sequence-to-sequence manner. We experiment with state-of-the-art Transformer models and demonstrate that our augmented dataset is capable of beating our baselines on a downstream classification task. Finally, we also create a user interface and release the scripts to train generation models to stimulate further research in this area.

Anthology ID: 2020.lrec-1.578
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 4699–4708
Language: English
URL: https://aclanthology.org/2020.lrec-1.578/
Cite (ACL): Ali Amin-Nejad, Julia Ive, and Sumithra Velupillai. 2020. Exploring Transformer Text Generation for Medical Dataset Augmentation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4699–4708, Marseille, France. European Language Resources Association.
Cite (Informal): Exploring Transformer Text Generation for Medical Dataset Augmentation (Amin-Nejad et al., LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.578.pdf
