Shinjae Yoo


2024

Leveraging LLMs and Web-based Visualizations for Profiling Bacterial Host Organisms and Genetic Toolboxes
Gilchan Park | Vivek Mutalik | Christopher Neely | Carlos Soto | Shinjae Yoo | Paramvir Dehal
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

Building genetic tools to engineer microorganisms is at the core of understanding and redesigning natural biological systems for useful purposes. Every project to build such a genetic toolbox for an organism starts with a survey of available tools. Despite a decade-long investment and advancement in the field, it is still challenging to mine information about a genetic tool published in the literature and connect that information to microbial genomics and other microbial databases. This information gap not only limits our ability to identify and adopt available tools to a new chassis but also conceals available opportunities to engineer a new microbial host. Recent advances in natural language processing (NLP), particularly large language models (LLMs), offer solutions by enabling efficient extraction of genetic terms and biological entities from a vast array of publications. This work presents a method to automate this process, using text-mining to refine models with data from bioRxiv and other databases. We evaluated various LLMs to investigate their ability to recognize bacterial host organisms and genetic toolboxes for engineering. We demonstrate our methodology with a web application that integrates a conversational LLM and a visualization tool, connecting user inquiries to genetic resources and literature findings, thereby saving researchers time, money, and effort in their laboratory work.

2023

Automated Extraction of Molecular Interactions and Pathway Knowledge using Large Language Model, Galactica: Opportunities and Challenges
Gilchan Park | Byung-Jun Yoon | Xihaier Luo | Vanessa López-Marrero | Patrick Johnstone | Shinjae Yoo | Francis Alexander
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

Understanding protein interactions and pathway knowledge is essential for comprehending living systems and investigating the mechanisms underlying various biological functions and complex diseases. While numerous databases curate such biological data obtained from literature and other sources, they are not comprehensive and require considerable effort to maintain. One mitigation strategy is to use large language models to automatically extract biological information and to explore their potential in life science research. This study presents an initial investigation of the efficacy of a large language model, Galactica, in life science research by assessing its performance on tasks involving protein interactions, pathways, and gene regulatory relation recognition. The paper details the results obtained from the model evaluation, highlights the findings, and discusses the opportunities and challenges.

2019

Visual Detection with Context for Document Layout Analysis
Carlos Soto | Shinjae Yoo
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We present 1) a work-in-progress method to visually segment key regions of scientific articles using an object detection technique augmented with contextual features, and 2) a novel dataset of region-labeled articles. A continuing challenge in scientific literature mining is the difficulty of consistently extracting high-quality text from formatted PDFs. To address this, we adapt the object-detection technique Faster R-CNN for document layout detection, incorporating contextual information that leverages the inherently localized nature of article contents to improve the region detection performance. Due to the limited availability of high-quality region labels for scientific articles, we also contribute a novel dataset of region annotations, the first version of which covers 9 region classes and 822 article pages. Initial experimental results demonstrate a 23.9% absolute improvement in mean average precision over the baseline model by incorporating contextual features, and a processing speed 14x faster than a text-based technique. Ongoing work on further improvements is also discussed.