Relation Extraction in Underexplored Biomedical Domains: A Diversity-optimized Sampling and Synthetic Data Generation Approach

Maxime Delmas; Magdalena Wysocka; André Freitas

doi:10.1162/coli_a_00520

Relation Extraction in Underexplored Biomedical Domains: A Diversity-optimized Sampling and Synthetic Data Generation Approach

Maxime Delmas, Magdalena Wysocka, André Freitas

Abstract

The sparsity of labeled data is an obstacle to the development of Relation Extraction (RE) models and the completion of databases in various biomedical areas. While being of high interest in drug-discovery, the literature on natural products, reporting the identification of potential bioactive compounds from organisms, is a concrete example of such an overlooked topic. To mark the start of this new task, we created the first curated evaluation dataset and extracted literature items from the LOTUS database to build training sets. To this end, we developed a new sampler, inspired by diversity metrics in ecology, named Greedy Maximum Entropy sampler (https://github.com/idiap/gme-sampler). The strategic optimization of both balance and diversity of the selected items in the evaluation set is important given the resource-intensive nature of manual curation. After quantifying the noise in the training set, in the form of discrepancies between the text of input abstracts and the expected output labels, we explored different strategies accordingly. Framing the task as an end-to-end Relation Extraction, we evaluated the performance of standard fine-tuning (BioGPT, GPT-2, and Seq2rel) and few-shot learning with open Large Language Models (LLMs) (LLaMA 7B-65B). In addition to their evaluation in few-shot settings, we explore the potential of open LLMs as synthetic data generators and propose a new workflow for this purpose. All evaluated models exhibited substantial improvements when fine-tuned on synthetic abstracts rather than the original noisy data. We provide our best performing (F1-score = 59.0) BioGPT-Large model for end-to-end RE of natural products relationships along with all the training and evaluation datasets. See more details at https://github.com/idiap/abroad-re.

Anthology ID:: 2024.cl-3.4
Volume:: Computational Linguistics, Volume 50, Issue 3 - September 2024
Month:: September
Year:: 2024
Address:: Cambridge, MA
Venue:: CL
SIG:
Publisher:: MIT Press
Note:
Pages:: 953–1000
Language:
URL:: https://aclanthology.org/2024.cl-3.4
DOI:: 10.1162/coli_a_00520
Bibkey:
Cite (ACL):: Maxime Delmas, Magdalena Wysocka, and André Freitas. 2024. Relation Extraction in Underexplored Biomedical Domains: A Diversity-optimized Sampling and Synthetic Data Generation Approach. Computational Linguistics, 50(3):953–1000.
Cite (Informal):: Relation Extraction in Underexplored Biomedical Domains: A Diversity-optimized Sampling and Synthetic Data Generation Approach (Delmas et al., CL 2024)
Copy Citation:
PDF:: https://aclanthology.org/2024.cl-3.4.pdf

PDF Cite Search