Eric Lehman


pdf bib
Learning to Ask Like a Physician
Eric Lehman | Vladislav Lialin | Katelyn Edelwina Legaspi | Anne Janelle Sy | Patricia Therese Pile | Nicole Rose Alberto | Richard Raymund Ragasa | Corinna Victoria Puyat | Marianne Katharina Taliño | Isabelle Rose Alberto | Pia Gabrielle Alfonso | Dana Moukheiber | Byron Wallace | Anna Rumshisky | Jennifer Liang | Preethi Raghavan | Leo Anthony Celi | Peter Szolovits
Proceedings of the 4th Clinical Natural Language Processing Workshop

Existing question answering (QA) datasets derived from electronic health records (EHR) are artificially generated and consequently fail to capture realistic physician information needs. We present Discharge Summary Clinical Questions (DiSCQ), a newly curated question dataset composed of 2,000+ questions paired with the snippets of text (triggers) that prompted each question. The questions are generated by medical experts from 100+ MIMIC-III discharge summaries. We analyze this dataset to characterize the types of information sought by medical experts. We also train baseline models for trigger detection and question generation (QG), paired with unsupervised answer retrieval over EHRs. Our baseline model is able to generate high quality questions in over 62% of cases when prompted with human selected triggers. We release this dataset (and all code to reproduce baseline model results) to facilitate further research into realistic clinical QA and QG:

pdf bib
Towards Generalizable Methods for Automating Risk Score Calculation
Jennifer J Liang | Eric Lehman | Ananya Iyengar | Diwakar Mahajan | Preethi Raghavan | Cindy Y. Chang | Peter Szolovits
Proceedings of the 21st Workshop on Biomedical Language Processing

Clinical risk scores enable clinicians to tabulate a set of patient data into simple scores to stratify patients into risk categories. Although risk scores are widely used to inform decision-making at the point-of-care, collecting the information necessary to calculate such scores requires considerable time and effort. Previous studies have focused on specific risk scores and involved manual curation of relevant terms or codes and heuristics for each data element of a risk score. To support more generalizable methods for risk score calculation, we annotate 100 patients in MIMIC-III with elements of CHA2DS2-VASc and PERC scores, and explore using question answering (QA) and off-the-shelf tools. We show that QA models can achieve comparable or better performance for certain risk score elements as compared to heuristic-based methods, and demonstrate the potential for more scalable risk score automation without the need for expert-curated heuristics. Our annotated dataset will be released to the community to encourage efforts in generalizable methods for automating risk scores.


pdf bib
Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?
Eric Lehman | Sarthak Jain | Karl Pichotta | Yoav Goldberg | Byron Wallace
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Large Transformers pretrained over clinical notes from Electronic Health Records (EHR) have afforded substantial gains in performance on predictive clinical tasks. The cost of training such models (and the necessity of data access to do so) coupled with their utility motivates parameter sharing, i.e., the release of pretrained models such as ClinicalBERT. While most efforts have used deidentified EHR, many researchers have access to large sets of sensitive, non-deidentified EHR with which they might train a BERT model (or similar). Would it be safe to release the weights of such a model if they did? In this work, we design a battery of approaches intended to recover Personal Health Information (PHI) from a trained BERT. Specifically, we attempt to recover patient names and conditions with which they are associated. We find that simple probing methods are not able to meaningfully extract sensitive information from BERT trained over the MIMIC-III corpus of EHR. However, more sophisticated “attacks” may succeed in doing so: To facilitate such research, we make our experimental setup and baseline probing models available at


pdf bib
ERASER: A Benchmark to Evaluate Rationalized NLP Models
Jay DeYoung | Sarthak Jain | Nazneen Fatema Rajani | Eric Lehman | Caiming Xiong | Richard Socher | Byron C. Wallace
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

State-of-the-art models in NLP are now predominantly based on deep neural networks that are opaque in terms of how they come to make predictions. This limitation has increased interest in designing more interpretable deep models for NLP that reveal the ‘reasoning’ behind model outputs. But work in this direction has been conducted on different datasets and tasks with correspondingly unique aims and metrics; this makes it difficult to track progress. We propose the Evaluating Rationales And Simple English Reasoning (ERASER a benchmark to advance research on interpretable models in NLP. This benchmark comprises multiple datasets and tasks for which human annotations of “rationales” (supporting evidence) have been collected. We propose several metrics that aim to capture how well the rationales provided by models align with human rationales, and also how faithful these rationales are (i.e., the degree to which provided rationales influenced the corresponding predictions). Our hope is that releasing this benchmark facilitates progress on designing more interpretable NLP systems. The benchmark, code, and documentation are available at

pdf bib
Evidence Inference 2.0: More Data, Better Models
Jay DeYoung | Eric Lehman | Benjamin Nye | Iain Marshall | Byron C. Wallace
Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing

How do we most effectively treat a disease or condition? Ideally, we could consult a database of evidence gleaned from clinical trials to answer such questions. Unfortunately, no such database exists; clinical trial results are instead disseminated primarily via lengthy natural language articles. Perusing all such articles would be prohibitively time-consuming for healthcare practitioners; they instead tend to depend on manually compiled systematic reviews of medical literature to inform care. NLP may speed this process up, and eventually facilitate immediate consult of published evidence. The Evidence Inference dataset was recently released to facilitate research toward this end. This task entails inferring the comparative performance of two treatments, with respect to a given outcome, from a particular article (describing a clinical trial) and identifying supporting evidence. For instance: Does this article report that chemotherapy performed better than surgery for five-year survival rates of operable cancers? In this paper, we collect additional annotations to expand the Evidence Inference dataset by 25%, provide stronger baseline models, systematically inspect the errors that these make, and probe dataset quality. We also release an abstract only (as opposed to full-texts) version of the task for rapid model prototyping. The updated corpus, documentation, and code for new baselines and evaluations are available at


pdf bib
Inferring Which Medical Treatments Work from Reports of Clinical Trials
Eric Lehman | Jay DeYoung | Regina Barzilay | Byron C. Wallace
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

How do we know if a particular medical treatment actually works? Ideally one would consult all available evidence from relevant clinical trials. Unfortunately, such results are primarily disseminated in natural language scientific articles, imposing substantial burden on those trying to make sense of them. In this paper, we present a new task and corpus for making this unstructured published scientific evidence actionable. The task entails inferring reported findings from a full-text article describing randomized controlled trials (RCT) with respect to a given intervention, comparator, and outcome of interest, e.g., inferring if a given article provides evidence supporting the use of aspirin to reduce risk of stroke, as compared to placebo. We present a new corpus for this task comprising 10,000+ prompts coupled with full-text articles describing RCTs. Results using a suite of baseline models — ranging from heuristic (rule-based) approaches to attentive neural architectures — demonstrate the difficulty of the task, which we believe largely owes to the lengthy, technical input texts. To facilitate further work on this important, challenging problem we make the corpus, documentation, a website and leaderboard, and all source code for baselines and evaluation publicly available.