Standardizing Genomic Reports: A Dataset, A Standardized Format, and A Prompt-Based Technique for Structured Data Extraction

Tamali Banerjee, Akshit Varmora, Jay J. Gorakhiya, Sanand Sasidharan, Anuradha Kanamarlapudi, Pushpak Bhattacharyya


Abstract
Extracting information from genomic reports of cancer patients is crucial for both healthcare professionals and cancer research. While Large Language Models (LLMs) have shown promise in extracting information, their potential for handling genomic reports remains unexplored. These reports are complex, multi-page documents that feature a variety of visually rich, structured layouts and contain many domain-specific terms. Two primary challenges complicate the process: (i) extracting data from PDFs with intricate layouts and domain-specific terminology and (ii) dealing with variations in report layouts from different laboratories, making extraction layout-dependent and posing challenges for subsequent data processing. To tackle these issues, we propose GR-PROMPT, a prompt-based technique, and GR-FORMAT, a standardized format. Together, these two convert a genomic report in PDF format into GR-FORMAT as a JSON file using a multimodal LLM. To address the lack of available datasets for this task, we introduce GR-DATASET, a synthetic collection of 100 cancer genomic reports in PDF format. Each report is accompanied by key-value information presented in a layout-specific format, as well as structured key-value information in GR-FORMAT. This is the first dataset in this domain to promote further research for the task. We performed our experiment on this dataset.
Anthology ID:
2024.icon-1.5
Volume:
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2024
Address:
AU-KBC Research Centre, Chennai, India
Editors:
Sobha Lalitha Devi, Karunesh Arora
Venue:
ICON
SIG:
Publisher:
NLP Association of India (NLPAI)
Note:
Pages:
45–53
Language:
URL:
https://aclanthology.org/2024.icon-1.5/
DOI:
Bibkey:
Cite (ACL):
Tamali Banerjee, Akshit Varmora, Jay J. Gorakhiya, Sanand Sasidharan, Anuradha Kanamarlapudi, and Pushpak Bhattacharyya. 2024. Standardizing Genomic Reports: A Dataset, A Standardized Format, and A Prompt-Based Technique for Structured Data Extraction. In Proceedings of the 21st International Conference on Natural Language Processing (ICON), pages 45–53, AU-KBC Research Centre, Chennai, India. NLP Association of India (NLPAI).
Cite (Informal):
Standardizing Genomic Reports: A Dataset, A Standardized Format, and A Prompt-Based Technique for Structured Data Extraction (Banerjee et al., ICON 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.icon-1.5.pdf