Keyuan Jiang


2026

Accurate normalization of health-related expressions to standardized biomedical concepts is crucial for both healthcare and biomedical research. However, traditional string-based matching methods are limited by lexical variations. In this study, we propose a neural embedding-based normalization framework that utilizes an embedding model trained on biomedical terminology, generating over 3.59 million embeddings corresponding to UMLS terms and Concept Unique Identifiers (CUIs). For clinical data, CUIs were retrieved via semantic matching, while Twitter phrases were first processed using a large language model (LLM) to generate preferred terms prior to embedding-based CUI retrieval. Our approach substantially outperforms exact string matching and MetaMap Lite. For clinical data (3,144 phrases), normalization accuracy improved from 0.679 (string match) and 0.574 (MetaMap Lite) to 0.858. For Twitter data (102 phrases), accuracy increased from 0.235 (string match) and 0.118 (MetaMap Lite) to a range of 0.882 (Gemini 2.5 Flash) to 0.980 (GPT-4o mini). These findings highlight both the effectiveness of embedding-based semantic retrieval and the ability of LLMs to generate preferred terms, enhancing robustness in health concept normalization across diverse text sources.
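The retrieval step described above — embedding a phrase and finding the nearest UMLS term/CUI pair — can be sketched as a cosine-similarity nearest-neighbor lookup. This is a minimal illustration, not the paper's pipeline: the vectors below are made up for demonstration (the real system uses over 3.59 million embeddings from a model trained on biomedical terminology), though the CUIs shown are genuine UMLS identifiers.

```python
from math import sqrt

# Toy stand-in for the ~3.59M precomputed UMLS term embeddings.
# Vectors here are illustrative, not real model output.
term_embeddings = {
    ("myocardial infarction", "C0027051"): (0.9, 0.1, 0.0),
    ("hypertension", "C0020538"): (0.1, 0.9, 0.1),
    ("type 2 diabetes mellitus", "C0011860"): (0.0, 0.2, 0.9),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def retrieve_cui(query_vec):
    """Return the CUI of the term whose embedding is most similar to the query."""
    (term, cui), _ = max(term_embeddings.items(),
                         key=lambda kv: cosine(query_vec, kv[1]))
    return cui

# A phrase such as "heart attack" would be embedded by the same model;
# here we fake a query vector close to the "myocardial infarction" embedding.
print(retrieve_cui((0.85, 0.15, 0.05)))  # C0027051
```

For Twitter input, the abstract's pipeline adds one step before this lookup: an LLM first maps the colloquial phrase to a preferred term, which is then embedded and retrieved the same way.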

2021

First-hand experience of changes in one's health condition, and an understanding of such experience, can play an important role in advancing medical science and healthcare. Monitoring the safe use of medications is an important task of pharmacovigilance, and consumers' first-hand accounts of the effects of medication intake can provide valuable insight into how the human body reacts to medications. Social media have been considered a possible alternative data source for gathering personal experience with medications posted by users. Identifying personal experience tweets is a challenging classification task, and efforts have been made to tackle the challenge using supervised approaches that require annotated data. There exists an abundance of unlabeled Twitter data, and being able to use such data for training without a loss in classification performance is of great value, as it can reduce the cost of the laborious annotation process. We investigated two semi-supervised learning methods, with different mixes of labeled and unlabeled data in the training set, to understand the impact on classification performance. Our results show that both the pseudo-label and consistency regularization methods generated a noticeable improvement in F1 score when the labeled set was small, and that consistency regularization could still provide a small gain even when a larger labeled set was used.
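The pseudo-label method mentioned above follows a simple self-training loop: fit a model on the labeled data, predict the unlabeled data, keep only confident predictions as extra training examples, and refit. The sketch below illustrates that loop on a toy 1-D problem with a nearest-centroid classifier; the labels, threshold, and margin-based confidence are all illustrative stand-ins, not the paper's tweet classifier.

```python
def train(examples):
    """Fit a 1-D nearest-centroid classifier: mean feature value per class."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return (label, confidence); confidence is the margin between the
    two nearest class centroids -- a crude stand-in for model confidence."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[ranked[0]])
    return ranked[0], margin

def pseudo_label(labeled, unlabeled, threshold=0.5, rounds=3):
    """Self-training: repeatedly add confidently predicted unlabeled points
    to the training set and refit on labeled + pseudo-labeled data."""
    data = list(labeled)
    for _ in range(rounds):
        model = train(data)
        confident = []
        for x in unlabeled:
            label, conf = predict(model, x)
            if conf >= threshold:
                confident.append((x, label))
        data = list(labeled) + confident
    return train(data)

labeled = [(0.0, "other"), (1.0, "personal_experience")]
unlabeled = [0.1, 0.9, 0.2]
model = pseudo_label(labeled, unlabeled)
print(predict(model, 0.15)[0])  # other
```

Consistency regularization, the second method studied, instead penalizes the model for predicting differently on perturbed versions of the same unlabeled input, and is not shown here.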

2020

Post-market surveillance, the practice of monitoring the safe use of pharmaceutical drugs, is an important part of pharmacovigilance. Being able to collect personal experience related to pharmaceutical product use could help us gain insight into how the human body reacts to different medications. Twitter, a popular social media service, is considered an important alternative data source for collecting personal experience with medications. Identifying personal experience tweets is a challenging classification task in natural language processing. In this study, we utilized three methods based on Facebook’s Robustly Optimized BERT Pretraining Approach (RoBERTa) to predict personal experience tweets related to medication use: the first combines the pre-trained RoBERTa model with a classifier; the second further pre-trains RoBERTa on a corpus of unlabeled tweets before attaching the classifier; and the third trains a RoBERTa model from scratch on our unlabeled tweets and combines it with the classifier. Our results show that all three approaches outperform the published method (word embedding + LSTM) in classification performance (p < 0.05), and that updating the pre-trained language model with medication-related tweets can improve performance further.
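All three variants above share one structure: a pre-trained (or re-trained) encoder that maps a tweet to a feature vector, plus a classification head trained on labeled tweets. The sketch below shows that two-stage structure in miniature — a frozen stand-in "encoder" feeding a logistic-regression head. Everything here is illustrative: the cue-word features, the tiny training set, and the labels are invented, and the real work fine-tunes RoBERTa rather than using hand-built features.

```python
import math

def fake_encoder(text):
    """Stand-in for a frozen pre-trained encoder: maps a tweet to a fixed
    feature vector. Here: crude first-person/medication cue counts."""
    cues = ["i", "me", "my", "took", "felt"]
    words = text.lower().split()
    return [sum(w == c for w in words) for c in cues]

def train_classifier(data, epochs=20, lr=0.5):
    """Logistic-regression head on top of the (frozen) encoder output,
    trained by per-example gradient descent."""
    dim = len(fake_encoder(""))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for text, y in data:
            x = fake_encoder(text)
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))
            g = p - y  # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(params, text):
    """1 = personal experience tweet, 0 = other (toy labels)."""
    w, b = params
    x = fake_encoder(text)
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

train_data = [
    ("i took aspirin and felt dizzy", 1),
    ("my doctor gave me a new dose", 1),
    ("aspirin sales rose last year", 0),
    ("new study on statin drugs published", 0),
]
model = train_classifier(train_data)
print(predict(model, "i felt sick after i took the pill"))  # 1
```

The paper's second and third variants differ only in how the encoder is obtained (continued pre-training on medication tweets, or pre-training from scratch); the classification head and training loop stay the same.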

2016