The goal of this paper is to examine the impact of simple feature engineering mechanisms before applying more sophisticated techniques to the task of medical NER. Papers that use scientifically sound techniques sometimes present raw baselines that could be improved by adding simple and cheap features. This work focuses on entity recognition for the clinical domain in three languages: English, Swedish and Spanish. The task is tackled using simple features, starting from the window size, capitalization and prefixes, and moving to POS and semantic tags. This work demonstrates that a simple initial step of feature engineering can improve the baseline results significantly. Hence, the contributions of this paper are: first, a short list of guidelines well supported with experimental results on three languages and, second, a detailed description of the relevance of these features for medical NER.
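As an illustration only, the kind of simple token-level features named above (context window, capitalization, prefixes) could be sketched as follows. The function name, feature keys, default window size and prefix length are assumptions made for this sketch and are not taken from the paper; POS and semantic tags are omitted because they require external taggers.

```python
def token_features(tokens, i, window=2, prefix_len=3):
    """Return a feature dict for tokens[i].

    Hypothetical sketch: feature names and defaults are illustrative
    assumptions, not the feature set used in the paper.
    """
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "is_capitalized": word[:1].isupper(),  # first character is upper case
        "is_all_caps": word.isupper(),
        "prefix": word[:prefix_len].lower(),
    }
    # Context window: lower-cased neighbours, padded at sentence boundaries.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return feats
```

Dictionaries of this shape can be passed to a standard sequence labeller such as a CRF; widening `window` or adding further keys corresponds to the incremental feature steps the abstract describes.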
Automated extraction of concepts from patient clinical records is an essential facilitator of clinical research. For this reason, the 2010 i2b2/VA Natural Language Processing Challenges for Clinical Records introduced a concept extraction task aimed at identifying and classifying concepts into predefined categories (i.e., treatments, tests and problems). State-of-the-art concept extraction approaches rely heavily on handcrafted features and domain-specific resources which are hard to collect and define. For this reason, this paper proposes an alternative, streamlined approach: a recurrent neural network (the bidirectional LSTM with CRF decoding) initialized with general-purpose, off-the-shelf word embeddings. The experimental results achieved on the 2010 i2b2/VA reference corpora using the proposed framework outperform all recent methods and rank close to the best submission from the original 2010 i2b2/VA challenge.
Due to the recent replacement of physical documents with electronic medical records (EMR), the importance of information processing in medical fields has increased. We have been organizing the MedNLP task series in NTCIR-10 and 11. These workshops were the first shared tasks to evaluate technologies that retrieve important information from medical reports written in Japanese. In this report, we describe the NTCIR-12 MedNLPDoc task, which is designed for more advanced and practical use in medical fields. The task is framed as a multi-labeling task on a patient record. This report presents the results of the shared task and discusses and illustrates the remaining issues in the field of medical natural language processing.
Patient notes contain a wealth of information of potentially great interest to medical investigators. However, to protect patients’ privacy, Protected Health Information (PHI) must be removed from the patient notes before they can be legally released, a process known as patient note de-identification. The main objective for a de-identification system is to have the highest possible recall. Recently, the first neural-network-based de-identification system has been proposed, yielding state-of-the-art results. Unlike other systems, it does not rely on human-engineered features, which allows it to be quickly deployed, but it does not leverage knowledge from human experts or from electronic health records (EHRs). In this work, we explore a method to incorporate human-engineered features as well as features derived from EHRs into a neural-network-based de-identification system. Our results show that the addition of features, especially the EHR-derived features, further improves the state-of-the-art in patient note de-identification, including for some of the most sensitive PHI types such as patient names. Since in a real-life setting patient notes typically come with EHRs, we recommend that developers of de-identification systems leverage the information EHRs contain.
Semi-supervised clustering is an attractive alternative to traditional (unsupervised) clustering in targeted applications. By using the information of a small annotated dataset, semi-supervised clustering can produce clusters that are customized to the application domain. In this paper, we present a semi-supervised clustering technique based on a multi-objective evolutionary algorithm (NSGA-II-clus). We apply this technique to the task of clustering medical publications for Evidence Based Medicine (EBM) and observe improved results compared with unsupervised and other semi-supervised clustering techniques.
Rapid growth in Electronic Medical Records (EMR) has led to an expansion of data in the clinical domain. The majority of the available health care information is locked in narrative documents, which form a rich source of clinical information. Text mining of such clinical records has gained considerable attention in various medical applications such as treatment and decision making. However, medical records contain patient Private Health Information (PHI), which can reveal the identities of the patients. To preserve patient privacy, all PHI must be removed from the records before they are made publicly available. The aim is to de-identify or encrypt the PHI in the patient medical records. In this paper, we propose an algorithm based on a deep learning architecture to solve this problem. We perform de-identification of seven PHI terms from the clinical records. Experiments on benchmark datasets show that our proposed approach achieves encouraging performance, which is better than the baseline model developed with Conditional Random Fields.
Paraphrase generation is important in various applications such as search, summarization, and question answering due to its ability to generate textual alternatives while keeping the overall meaning intact. Clinical paraphrase generation is especially vital in building patient-centric clinical decision support (CDS) applications where users are able to understand complex clinical jargon via easily comprehensible alternative paraphrases. This paper presents Neural Clinical Paraphrase Generation (NCPG), a novel approach that casts the task as a monolingual neural machine translation (NMT) problem. We propose an end-to-end neural network built on an attention-based bidirectional Recurrent Neural Network (RNN) architecture with an encoder-decoder framework to perform the task. Conventional bilingual NMT models mostly rely on word-level modeling and are often limited by out-of-vocabulary (OOV) issues. In contrast, we represent the source and target paraphrase pairs as character sequences to address this limitation. To the best of our knowledge, this is the first work that uses attention-based RNNs for clinical paraphrase generation and also proposes end-to-end character-level modeling for this task. Extensive experiments on a large curated clinical paraphrase corpus show that the attention-based NCPG models achieve improvements of up to 5.2 BLEU points and 0.5 METEOR points over a strong non-attention-based baseline for word-level modeling, whereas further gains of up to 6.1 BLEU points and 1.3 METEOR points are obtained by the character-level NCPG models over their word-level counterparts. Overall, our models demonstrate comparable performance relative to the state-of-the-art phrase-based non-neural models.
The proliferation of deep learning methods in natural language processing (NLP) and the large amounts of data they often require stand in stark contrast to the relatively data-poor clinical NLP domain. In particular, large text corpora are necessary to build high-quality word embeddings, yet large corpora that are suitably representative of the target clinical data are often unavailable. This forces a choice between building embeddings from small clinical corpora and building them from larger, less representative corpora. This paper explores this trade-off, as well as intermediate compromise solutions. Two standard clinical NLP tasks (the i2b2 2010 concept and assertion tasks) are evaluated with commonly used deep learning models (recurrent neural networks and convolutional neural networks) using a set of six corpora ranging from the target i2b2 data to large open-domain datasets. While combinations of corpora are generally found to work best, the single-best corpus is generally task-dependent.
The importance of utilizing medical information is increasing as electronic health records (EHRs) become widely used. We aim to assign international standardized disease codes, ICD-10, to Japanese textual information in EHRs so that users can reuse the information accurately. In this paper, we propose methods to automatically extract diagnoses and to assign ICD codes to Japanese medical records. Due to the lack of available training data, we deliberately employed rule-based methods rather than machine learning. We carefully observed the characteristics of medical records and wrote rules by hand to build effective methods. We applied our system to the NTCIR-12 MedNLPDoc shared task data, where participants are required to assign ICD-10 codes of possible diagnoses in given EHRs. In this shared task, our system achieved the highest F-measure score among all participants under the strictest evaluation criteria. Through comparison with other approaches, we show that our approach could be a useful milestone for the future development of Japanese medical record processing.
A fine-grained corpus annotation schema of German nephrology records
Roland Roller | Hans Uszkoreit | Feiyu Xu | Laura Seiffe | Michael Mikhailov | Oliver Staeck | Klemens Budde | Fabian Halleck | Danilo Schmidt
In this work we present a fine-grained annotation schema to detect named entities in German clinical data of chronically ill patients with kidney diseases. The annotation schema is driven by the needs of our clinical partners and the linguistic aspects of the German language. In order to generate annotations within a short period, the work also presents a semi-automatic annotation method which uses additional sources of knowledge, such as UMLS, to pre-annotate concepts. The presented schema will be used to apply novel techniques from natural language processing and machine learning that support doctors in treating their patients through improved access to information in unstructured German texts.
In recent years, detecting Alzheimer disease (AD) in its early stages based on natural language processing (NLP) has drawn much attention. To date, vocabulary size, grammatical complexity, and fluency have been studied using NLP metrics. However, content analysis of AD narratives is still out of reach for NLP. This study investigates features of the words that AD patients use in their spoken language. We recruited 18 examinees aged 53–90 years (mean: 76.89) and divided them into two groups based on MMSE scores. The AD group comprised 9 examinees with scores of 21 or lower. The healthy control group comprised 9 examinees with scores of 22 or higher. Linguistic Inquiry and Word Count (LIWC) word categories were used to categorize the words that the examinees used. The word frequency was found from observation. Significant differences were confirmed for the usage of impersonal pronouns in the AD group. This result demonstrates the basic feasibility of the proposed NLP-based detection approach.
The number of unstructured medical records kept in hospital information systems is increasing. The conditions of patients are formulated as outcomes in a clinical pathway. A variance of an outcome describes a deviation from the standards of care, such as a patient’s poor condition. The present paper applies text mining to extract feature words and phrases of the variances from admission records. We report the cases of the variances of “pain control” and “no neuropathy worsening” in cerebral infarction.
Clinical narratives in electronic health record systems are a rich resource of patient-based information. They constitute an ongoing challenge for natural language processing, due to their high compactness and abundance of short forms. German medical texts exhibit numerous ad-hoc abbreviations that terminate with a period character. The disambiguation of period characters is therefore an important task for sentence and abbreviation detection. This task is addressed by a combination of co-occurrence information of word types with trailing period characters, a large domain dictionary, and a simple rule engine, thus merging statistical and dictionary-based disambiguation strategies. An F-measure of 0.95 could be reached by using the unsupervised approach presented in this paper. The results are promising for a domain-independent abbreviation detection strategy, because our approach avoids retraining of models or use case specific feature engineering efforts required for supervised machine learning approaches.
The issue of privacy has always been a concern when clinical texts are used for research purposes. Personal health information (PHI) (such as name and identification number) needs to be removed so that patients cannot be identified. Manual anonymization is not feasible due to the large number of clinical texts to be anonymized. In this paper, we tackle the task of anonymizing clinical texts written in sentence fragments and which frequently contain symbols, abbreviations, and misspelled words. Our clinical texts therefore differ from those in the i2b2 shared tasks which are in prose form with complete sentences. Our clinical texts are also part of a structured database which contains patient name and identification number in structured fields. As such, we formulate our anonymization task as spelling variant detection, exploiting patients’ personal information in the structured fields to detect their spelling variants in clinical texts. We successfully anonymized clinical texts consisting of more than 200 million words, using minimum edit distance and regular expression patterns.
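The spelling-variant formulation above can be illustrated with a minimal sketch: a standard Levenshtein (minimum edit distance) function, plus a helper that masks any alphabetic token close to a known patient name taken from a structured field. The helper name `mask_name_variants`, the distance threshold, the tokenizing regular expression and the `[NAME]` placeholder are all assumptions for illustration; the paper's actual patterns and thresholds are not given in the abstract.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mask_name_variants(text, name, max_dist=1, placeholder="[NAME]"):
    """Replace tokens within max_dist edits of `name` (hypothetical helper).

    `name` stands in for a value read from a structured database field.
    """
    target = name.lower()

    def repl(match):
        token = match.group(0)
        close = edit_distance(token.lower(), target) <= max_dist
        return placeholder if close else token

    # Assumed tokenization: runs of ASCII letters only.
    return re.sub(r"[A-Za-z]+", repl, text)
```

For example, `mask_name_variants("pt Jon c/o fever", "John")` masks `Jon` because it is one deletion away from `John`. A fixed threshold is a simplification: for very short names it would over-match, so a real system would likely scale the threshold with name length.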