Byron C. Wallace

Also published as: Byron Wallace


2021

pdf bib
Disentangling Representations of Text by Masking Transformers
Xiongyi Zhang | Jan-Willem van de Meent | Byron Wallace
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Representations from large pretrained models such as BERT encode a range of features into monolithic vectors, affording strong predictive accuracy across a range of downstream tasks. In this paper we explore whether it is possible to learn disentangled representations by identifying existing subnetworks within pretrained models that encode distinct, complementary aspects. Concretely, we learn binary masks over transformer weights or hidden units to uncover subsets of features that correlate with a specific factor of variation; this eliminates the need to train a disentangled model from scratch for a particular task. We evaluate this method with respect to its ability to disentangle representations of sentiment from genre in movie reviews, toxicity from dialect in Tweets, and syntax from semantics. By combining masking with magnitude pruning we find that we can identify sparse subnetworks within BERT that strongly encode particular aspects (e.g., semantics) while only weakly encoding others (e.g., syntax). Moreover, despite only learning masks, disentanglement-via-masking performs as well as — and often better than —previously proposed methods based on variational autoencoders and adversarial training.

pdf bib
Unsupervised Data Augmentation with Naive Augmentation and without Unlabeled Data
David Lowell | Brian Howard | Zachary C. Lipton | Byron Wallace
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Unsupervised Data Augmentation (UDA) is a semisupervised technique that applies a consistency loss to penalize differences between a model’s predictions on (a) observed (unlabeled) examples; and (b) corresponding ‘noised’ examples produced via data augmentation. While UDA has gained popularity for text classification, open questions linger over which design decisions are necessary and how to extend the method to sequence labeling tasks. In this paper, we re-examine UDA and demonstrate its efficacy on several sequential tasks. Our main contribution is an empirical study of UDA to establish which components of the algorithm confer benefits in NLP. Notably, although prior work has emphasized the use of clever augmentation techniques including back-translation, we find that enforcing consistency between predictions assigned to observed and randomly substituted words often yields comparable (or greater) benefits compared to these more complex perturbation models. Furthermore, we find that applying UDA’s consistency loss affords meaningful gains without any unlabeled data at all, i.e., in a standard supervised setting. In short, UDA need not be unsupervised to realize much of its noted benefits, and does not require complex data augmentation to be effective.

pdf bib
Biomedical Interpretable Entity Representations
Diego Garcia-Olano | Yasumasa Onoe | Ioana Baldini | Joydeep Ghosh | Byron Wallace | Kush Varshney
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations
Chaitanya Shivade | Rashmi Gangadharaiah | Spandana Gella | Sandeep Konam | Shaoqing Yuan | Yi Zhang | Parminder Bhatia | Byron Wallace
Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations

pdf bib
Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?
Eric Lehman | Sarthak Jain | Karl Pichotta | Yoav Goldberg | Byron Wallace
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Large Transformers pretrained over clinical notes from Electronic Health Records (EHR) have afforded substantial gains in performance on predictive clinical tasks. The cost of training such models (and the necessity of data access to do so) coupled with their utility motivates parameter sharing, i.e., the release of pretrained models such as ClinicalBERT. While most efforts have used deidentified EHR, many researchers have access to large sets of sensitive, non-deidentified EHR with which they might train a BERT model (or similar). Would it be safe to release the weights of such a model if they did? In this work, we design a battery of approaches intended to recover Personal Health Information (PHI) from a trained BERT. Specifically, we attempt to recover patient names and conditions with which they are associated. We find that simple probing methods are not able to meaningfully extract sensitive information from BERT trained over the MIMIC-III corpus of EHR. However, more sophisticated “attacks” may succeed in doing so: To facilitate such research, we make our experimental setup and baseline probing models available at https://github.com/elehman16/exposing_patient_data_release.

pdf bib
An Empirical Comparison of Instance Attribution Methods for NLP
Pouya Pezeshkpour | Sarthak Jain | Byron Wallace | Sameer Singh
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Widespread adoption of deep models has motivated a pressing need for approaches to interpret network outputs and to facilitate model debugging. Instance attribution methods constitute one means of accomplishing these goals by retrieving training instances that (may have) led to a particular prediction. Influence functions (IF; Koh and Liang 2017) provide machinery for doing this by quantifying the effect that perturbing individual train instances would have on a specific test prediction. However, even approximating the IF is computationally expensive, to the degree that may be prohibitive in many cases. Might simpler approaches (e.g., retrieving train examples most similar to a given test point) perform comparably? In this work, we evaluate the degree to which different potential instance attribution agree with respect to the importance of training samples. We find that simple retrieval methods yield training instances that differ from those identified via gradient-based methods (such as IFs), but that nonetheless exhibit desirable characteristics similar to more complex attribution methods. Code for all methods and experiments in this paper is available at: https://github.com/successar/instance_attributions_NLP.

pdf bib
On the Impact of Random Seeds on the Fairness of Clinical Classifiers
Silvio Amir | Jan-Willem van de Meent | Byron Wallace
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent work has shown that fine-tuning large networks is surprisingly sensitive to changes in random seed(s). We explore the implications of this phenomenon for model fairness across demographic groups in clinical prediction tasks over electronic health records (EHR) in MIMIC-III —— the standard dataset in clinical NLP research. Apparent subgroup performance varies substantially for seeds that yield similar overall performance, although there is no evidence of a trade-off between overall and subgroup performance. However, we also find that the small sample sizes inherent to looking at intersections of minority groups and somewhat rare conditions limit our ability to accurately estimate disparities. Further, we find that jointly optimizing for high overall performance and low disparities does not yield statistically significant improvements. Our results suggest that fairness work using MIMIC-III should carefully account for variations in apparent differences that may arise from stochasticity and small sample sizes.

pdf bib
Paragraph-level Simplification of Medical Texts
Ashwin Devaraj | Iain Marshall | Byron Wallace | Junyi Jessy Li
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We consider the problem of learning to simplify medical texts. This is important because most reliable, up-to-date information in biomedicine is dense with jargon and thus practically inaccessible to the lay audience. Furthermore, manual simplification does not scale to the rapidly growing body of biomedical literature, motivating the need for automated approaches. Unfortunately, there are no large-scale resources available for this task. In this work we introduce a new corpus of parallel texts in English comprising technical and lay summaries of all published evidence pertaining to different clinical topics. We then propose a new metric based on likelihood scores from a masked language model pretrained on scientific texts. We show that this automated measure better differentiates between technical and lay summaries than existing heuristics. We introduce and evaluate baseline encoder-decoder Transformer models for simplification and propose a novel augmentation to these in which we explicitly penalize the decoder for producing “jargon” terms; we find that this yields improvements over baselines in terms of readability.

pdf bib
What Would it Take to get Biomedical QA Systems into Practice?
Gregory Kell | Iain Marshall | Byron Wallace | Andre Jaun
Proceedings of the 3rd Workshop on Machine Reading for Question Answering

Medical question answering (QA) systems have the potential to answer clinicians’ uncertainties about treatment and diagnosis on-demand, informed by the latest evidence. However, despite the significant progress in general QA made by the NLP community, medical QA systems are still not widely used in clinical environments. One likely reason for this is that clinicians may not readily trust QA system outputs, in part because transparency, trustworthiness, and provenance have not been key considerations in the design of such models. In this paper we discuss a set of criteria that, if met, we argue would likely increase the utility of biomedical QA systems, which may in turn lead to adoption of such systems in practice. We assess existing models, tasks, and datasets with respect to these criteria, highlighting shortcomings of previously proposed approaches and pointing toward what might be more usable QA systems.

2020

pdf bib
Proceedings of the First Workshop on Natural Language Processing for Medical Conversations
Parminder Bhatia | Steven Lin | Rashmi Gangadharaiah | Byron Wallace | Izhak Shafran | Chaitanya Shivade | Nan Du | Mona Diab
Proceedings of the First Workshop on Natural Language Processing for Medical Conversations

pdf bib
ERASER: A Benchmark to Evaluate Rationalized NLP Models
Jay DeYoung | Sarthak Jain | Nazneen Fatema Rajani | Eric Lehman | Caiming Xiong | Richard Socher | Byron C. Wallace
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

State-of-the-art models in NLP are now predominantly based on deep neural networks that are opaque in terms of how they come to make predictions. This limitation has increased interest in designing more interpretable deep models for NLP that reveal the ‘reasoning’ behind model outputs. But work in this direction has been conducted on different datasets and tasks with correspondingly unique aims and metrics; this makes it difficult to track progress. We propose the Evaluating Rationales And Simple English Reasoning (ERASER a benchmark to advance research on interpretable models in NLP. This benchmark comprises multiple datasets and tasks for which human annotations of “rationales” (supporting evidence) have been collected. We propose several metrics that aim to capture how well the rationales provided by models align with human rationales, and also how faithful these rationales are (i.e., the degree to which provided rationales influenced the corresponding predictions). Our hope is that releasing this benchmark facilitates progress on designing more interpretable NLP systems. The benchmark, code, and documentation are available at https://www.eraserbenchmark.com/

pdf bib
Learning to Faithfully Rationalize by Construction
Sarthak Jain | Sarah Wiegreffe | Yuval Pinter | Byron C. Wallace
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In many settings it is important for one to be able to understand why a model made a particular prediction. In NLP this often entails extracting snippets of an input text ‘responsible for’ corresponding model output; when such a snippet comprises tokens that indeed informed the model’s prediction, it is a faithful explanation. In some settings, faithfulness may be critical to ensure transparency. Lei et al. (2016) proposed a model to produce faithful rationales for neural text classification by defining independent snippet extraction and prediction modules. However, the discrete selection over input tokens performed by this method complicates training, leading to high variance and requiring careful hyperparameter tuning. We propose a simpler variant of this approach that provides faithful explanations by construction. In our scheme, named FRESH, arbitrary feature importance scores (e.g., gradients from a trained model) are used to induce binary labels over token inputs, which an extractor can be trained to predict. An independent classifier module is then trained exclusively on snippets provided by the extractor; these snippets thus constitute faithful explanations, even if the classifier is arbitrarily complex. In both automatic and manual evaluations we find that variants of this simple framework yield predictive performance superior to ‘end-to-end’ approaches, while being more general and easier to train. Code is available at https://github.com/successar/FRESH.

pdf bib
Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions
Xiaochuang Han | Byron C. Wallace | Yulia Tsvetkov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Modern deep learning models for NLP are notoriously opaque. This has motivated the development of methods for interpreting such models, e.g., via gradient-based saliency maps or the visualization of attention weights. Such approaches aim to provide explanations for a particular model prediction by highlighting important words in the corresponding input text. While this might be useful for tasks where decisions are explicitly influenced by individual tokens in the input, we suspect that such highlighting is not suitable for tasks where model decisions should be driven by more complex reasoning. In this work, we investigate the use of influence functions for NLP, providing an alternative approach to interpreting neural text classifiers. Influence functions explain the decisions of a model by identifying influential training examples. Despite the promise of this approach, influence functions have not yet been extensively evaluated in the context of NLP, a gap addressed by this work. We conduct a comparison between influence functions and common word-saliency methods on representative tasks. As suspected, we find that influence functions are particularly useful for natural language inference, a task in which ‘saliency maps’ may not have clear interpretation. Furthermore, we develop a new quantitative measure based on influence functions that can reveal artifacts in training data.

pdf bib
Trialstreamer: Mapping and Browsing Medical Evidence in Real-Time
Benjamin Nye | Ani Nenkova | Iain Marshall | Byron C. Wallace
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We introduce Trialstreamer, a living database of clinical trial reports. Here we mainly describe the evidence extraction component; this extracts from biomedical abstracts key pieces of information that clinicians need when appraising the literature, and also the relations between these. Specifically, the system extracts descriptions of trial participants, the treatments compared in each arm (the interventions), and which outcomes were measured. The system then attempts to infer which interventions were reported to work best by determining their relationship with identified trial outcome measures. In addition to summarizing individual trials, these extracted data elements allow automatic synthesis of results across many trials on the same topic. We apply the system at scale to all reports of randomized controlled trials indexed in MEDLINE, powering the automatic generation of evidence maps, which provide a global view of the efficacy of different interventions combining data from all relevant clinical trials on a topic. We make all code and models freely available alongside a demonstration of the web interface.

pdf bib
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020
Karin Verspoor | Kevin Bretonnel Cohen | Mark Dredze | Emilio Ferrara | Jonathan May | Robert Munro | Cecile Paris | Byron Wallace
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020

pdf bib
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020
Karin Verspoor | Kevin Bretonnel Cohen | Michael Conway | Berry de Bruijn | Mark Dredze | Rada Mihalcea | Byron Wallace
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

pdf bib
Evidence Inference 2.0: More Data, Better Models
Jay DeYoung | Eric Lehman | Benjamin Nye | Iain Marshall | Byron C. Wallace
Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing

How do we most effectively treat a disease or condition? Ideally, we could consult a database of evidence gleaned from clinical trials to answer such questions. Unfortunately, no such database exists; clinical trial results are instead disseminated primarily via lengthy natural language articles. Perusing all such articles would be prohibitively time-consuming for healthcare practitioners; they instead tend to depend on manually compiled systematic reviews of medical literature to inform care. NLP may speed this process up, and eventually facilitate immediate consult of published evidence. The Evidence Inference dataset was recently released to facilitate research toward this end. This task entails inferring the comparative performance of two treatments, with respect to a given outcome, from a particular article (describing a clinical trial) and identifying supporting evidence. For instance: Does this article report that chemotherapy performed better than surgery for five-year survival rates of operable cancers? In this paper, we collect additional annotations to expand the Evidence Inference dataset by 25%, provide stronger baseline models, systematically inspect the errors that these make, and probe dataset quality. We also release an abstract only (as opposed to full-texts) version of the task for rapid model prototyping. The updated corpus, documentation, and code for new baselines and evaluations are available at http://evidence-inference.ebm-nlp.com/.

2019

pdf bib
Practical Obstacles to Deploying Active Learning
David Lowell | Zachary C. Lipton | Byron C. Wallace
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Active learning (AL) is a widely-used training strategy for maximizing predictive performance subject to a fixed annotation budget. In AL, one iteratively selects training examples for annotation, often those for which the current model is most uncertain (by some measure). The hope is that active sampling leads to better performance than would be achieved under independent and identically distributed (i.i.d.) random samples. While AL has shown promise in retrospective evaluations, these studies often ignore practical obstacles to its use. In this paper, we show that while AL may provide benefits when used with specific models and for particular domains, the benefits of current approaches do not generalize reliably across models and tasks. This is problematic because in practice, one does not have the opportunity to explore and compare alternative AL strategies. Moreover, AL couples the training dataset with the model used to guide its acquisition. We find that subsequently training a successor model with an actively-acquired dataset does not consistently outperform training on i.i.d. sampled data. Our findings raise the question of whether the downsides inherent to AL are worth the modest and inconsistent performance gains it tends to afford.

pdf bib
An Analysis of Attention over Clinical Notes for Predictive Tasks
Sarthak Jain | Ramin Mohammadi | Byron C. Wallace
Proceedings of the 2nd Clinical Natural Language Processing Workshop

The shift to electronic medical records (EMRs) has engendered research into machine learning and natural language technologies to analyze patient records, and to predict from these clinical outcomes of interest. Two observations motivate our aims here. First, unstructured notes contained within EMR often contain key information, and hence should be exploited by models. Second, while strong predictive performance is important, interpretability of models is perhaps equally so for applications in this domain. Together, these points suggest that neural models for EMR may benefit from incorporation of attention over notes, which one may hope will both yield performance gains and afford transparency in predictions. In this work we perform experiments to explore this question using two EMR corpora and four different predictive tasks, that: (i) inclusion of attention mechanisms is critical for neural encoder modules that operate over notes fields in order to yield competitive performance, but, (ii) unfortunately, while these boost predictive performance, it is decidedly less clear whether they provide meaningful support for predictions.

pdf bib
Browsing Health: Information Extraction to Support New Interfaces for Accessing Medical Evidence
Soham Parikh | Elizabeth Conrad | Oshin Agarwal | Iain Marshall | Byron Wallace | Ani Nenkova
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications

Standard paradigms for search do not work well in the medical context. Typical information needs, such as retrieving a full list of medical interventions for a given condition, or finding the reported efficacy of a particular treatment with respect to a specific outcome of interest cannot be straightforwardly posed in typical text-box search. Instead, we propose faceted-search in which a user specifies a condition and then can browse treatments and outcomes that have been evaluated. Choosing from these, they can access randomized control trials (RCTs) describing individual studies. Realizing such a view of the medical evidence requires information extraction techniques to identify the population, interventions, and outcome measures in an RCT. Patients, health practitioners, and biomedical librarians all stand to benefit from such innovation in search of medical evidence. We present an initial prototype of such an interface applied to pre-registered clinical studies. We also discuss pilot studies into the applicability of information extraction methods to allow for similar access to all published trial results.

pdf bib
Predicting Annotation Difficulty to Improve Task Routing and Model Performance for Biomedical Information Extraction
Yinfei Yang | Oshin Agarwal | Chris Tar | Byron C. Wallace | Ani Nenkova
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Modern NLP systems require high-quality annotated data. For specialized domains, expert annotations may be prohibitively expensive; the alternative is to rely on crowdsourcing to reduce costs at the risk of introducing noise. In this paper we demonstrate that directly modeling instance difficulty can be used to improve model performance and to route instances to appropriate annotators. Our difficulty prediction model combines two learned representations: a ‘universal’ encoder trained on out of domain data, and a task-specific encoder. Experiments on a complex biomedical information extraction task using expert and lay annotators show that: (i) simply excluding from the training data instances predicted to be difficult yields a small boost in performance; (ii) using difficulty scores to weight instances during training provides further, consistent gains; (iii) assigning instances predicted to be difficult to domain experts is an effective strategy for task routing. Further, our experiments confirm the expectation that for such domain-specific tasks expert annotations are of much higher quality and preferable to obtain if practical and that augmenting small amounts of expert data with a larger set of lay annotations leads to further improvements in model performance.

pdf bib
Attention is not Explanation
Sarthak Jain | Byron C. Wallace
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Attention mechanisms have seen wide adoption in neural NLP models. In addition to improving predictive performance, these are often touted as affording transparency: models equipped with attention provide a distribution over attended-to input units, and this is often presented (at least implicitly) as communicating the relative importance of inputs. However, it is unclear what relationship exists between attention weights and model outputs. In this work we perform extensive experiments across a variety of NLP tasks that aim to assess the degree to which attention weights provide meaningful “explanations” for predictions. We find that they largely do not. For example, learned attention weights are frequently uncorrelated with gradient-based measures of feature importance, and one can identify very different attention distributions that nonetheless yield equivalent predictions. Our findings show that standard attention modules do not provide meaningful explanations and should not be treated as though they do.

pdf bib
Inferring Which Medical Treatments Work from Reports of Clinical Trials
Eric Lehman | Jay DeYoung | Regina Barzilay | Byron C. Wallace
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

How do we know if a particular medical treatment actually works? Ideally one would consult all available evidence from relevant clinical trials. Unfortunately, such results are primarily disseminated in natural language scientific articles, imposing substantial burden on those trying to make sense of them. In this paper, we present a new task and corpus for making this unstructured published scientific evidence actionable. The task entails inferring reported findings from a full-text article describing randomized controlled trials (RCT) with respect to a given intervention, comparator, and outcome of interest, e.g., inferring if a given article provides evidence supporting the use of aspirin to reduce risk of stroke, as compared to placebo. We present a new corpus for this task comprising 10,000+ prompts coupled with full-text articles describing RCTs. Results using a suite of baseline models — ranging from heuristic (rule-based) approaches to attentive neural architectures — demonstrate the difficulty of the task, which we believe largely owes to the lengthy, technical input texts. To facilitate further work on this important, challenging problem we make the corpus, documentation, a website and leaderboard, and all source code for baselines and evaluation publicly available.

2018

pdf bib
Structured Multi-Label Biomedical Text Tagging via Attentive Neural Tree Decoding
Gaurav Singh | James Thomas | Iain Marshall | John Shawe-Taylor | Byron C. Wallace
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We propose a model for tagging unstructured texts with an arbitrary number of terms drawn from a tree-structured vocabulary (i.e., an ontology). We treat this as a special case of sequence-to-sequence learning in which the decoder begins at the root node of an ontological tree and recursively elects to expand child nodes as a function of the input text, the current node, and the latent decoder state. We demonstrate that this method yields state-of-the-art results on the important task of assigning MeSH terms to biomedical abstracts.

pdf bib
Learning Disentangled Representations of Texts with Application to Biomedical Abstracts
Sarthak Jain | Edward Banner | Jan-Willem van de Meent | Iain J. Marshall | Byron C. Wallace
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We propose a method for learning disentangled representations of texts that code for distinct and complementary aspects, with the aim of affording efficient model transfer and interpretability. To induce disentangled embeddings, we propose an adversarial objective based on the (dis)similarity between triplets of documents with respect to specific aspects. Our motivating application is embedding biomedical abstracts describing clinical trials in a manner that disentangles the populations, interventions, and outcomes in a given trial. We show that our method learns representations that encode these clinically salient aspects, and that these can be effectively used to perform aspect-specific retrieval. We demonstrate that the approach generalizes beyond our motivating application in experiments on two multi-aspect review corpora.

pdf bib
A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature
Benjamin Nye | Junyi Jessy Li | Roma Patel | Yinfei Yang | Iain Marshall | Ani Nenkova | Byron Wallace
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a corpus of 5,000 richly annotated abstracts of medical articles describing clinical randomized controlled trials. Annotations include demarcations of text spans that describe the Patient population enrolled, the Interventions studied and to what they were Compared, and the Outcomes measured (the ‘PICO’ elements). These spans are further annotated at a more granular level, e.g., individual interventions within them are marked and mapped onto a structured medical vocabulary. We acquired annotations from a diverse set of workers with varying levels of expertise and cost. We describe our data collection process and the corpus itself in detail. We then outline a set of challenging NLP tasks that would aid searching of the medical literature and the practice of evidence-based medicine.

pdf bib
Syntactic Patterns Improve Information Extraction for Medical Search
Roma Patel | Yinfei Yang | Iain Marshall | Ani Nenkova | Byron Wallace
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Medical professionals search the published literature by specifying the type of patients, the medical intervention(s) and the outcome measure(s) of interest. In this paper we demonstrate how features encoding syntactic patterns improve the performance of state-of-the-art sequence tagging models (both neural and linear) for information extraction of these medically relevant categories. We present an analysis of the type of patterns exploited and of the semantic space induced for these, i.e., the distributed representations learned for identified multi-token patterns. We show that these learned representations differ substantially from those of the constituent unigrams, suggesting that the patterns capture contextual information that is otherwise lost.

2017

pdf bib
Aggregating and Predicting Sequence Labels from Crowd Annotations
An Thanh Nguyen | Byron Wallace | Junyi Jessy Li | Ani Nenkova | Matthew Lease
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite sequences being core to NLP, scant work has considered how to handle noisy sequence labels from multiple annotators for the same text. Given such annotations, we consider two complementary tasks: (1) aggregating sequential crowd labels to infer a best single set of consensus annotations; and (2) using crowd annotations as training data for a model that can predict sequences in unannotated text. For aggregation, we propose a novel Hidden Markov Model variant. To predict sequences in unannotated text, we propose a neural approach using Long Short Term Memory. We evaluate a suite of methods across two different applications and text genres: Named-Entity Recognition in news articles and Information Extraction from biomedical abstracts. Results show improvement over strong baselines. Our source code and data are available online.

pdf bib
Exploiting Domain Knowledge via Grouped Weight Sharing with Application to Text Categorization
Ye Zhang | Matthew Lease | Byron C. Wallace
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A fundamental advantage of neural models for NLP is their ability to learn representations from scratch. However, in practice this often means ignoring existing external linguistic resources, e.g., WordNet or domain specific ontologies such as the Unified Medical Language System (UMLS). We propose a general, novel method for exploiting such resources via weight sharing. Prior work on weight sharing in neural networks has considered it largely as a means of model compression. In contrast, we treat weight sharing as a flexible mechanism for incorporating prior knowledge into neural models. We show that this approach consistently yields improved performance on classification tasks compared to baseline strategies that do not exploit weight sharing.

pdf bib
Automating Biomedical Evidence Synthesis: RobotReviewer
Iain Marshall | Joël Kuiper | Edward Banner | Byron C. Wallace
Proceedings of ACL 2017, System Demonstrations

pdf bib
A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification
Ye Zhang | Byron Wallace
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Convolutional Neural Networks (CNNs) have recently achieved remarkably strong performance on the practically important task of sentence classification (Kim, 2014; Kalchbrenner et al., 2014; Johnson and Zhang, 2014; Zhang et al., 2016). However, these models require practitioners to specify an exact model architecture and set accompanying hyperparameters, including the filter region size, regularization parameters, and so on. It is currently unknown how sensitive model performance is to changes in these configurations for the task of sentence classification. We thus conduct a sensitivity analysis of one-layer CNNs to explore the effect of architecture components on model performance; our aim is to distinguish between important and comparatively inconsequential design decisions for sentence classification. We focus on one-layer CNNs (to the exclusion of more complex models) due to their comparative simplicity and strong empirical performance, which makes it a modern standard baseline method akin to Support Vector Machine (SVMs) and logistic regression. We derive practical advice from our extensive empirical results for those interested in getting the most out of CNNs for sentence classification in real world settings.

2016

pdf bib
Modelling Context with User Embeddings for Sarcasm Detection in Social Media
Silvio Amir | Byron C. Wallace | Hao Lyu | Paula Carvalho | Mário J. Silva
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

pdf bib
Retrofitting Word Vectors of MeSH Terms to Improve Semantic Similarity Measures
Zhiguo Yu | Trevor Cohen | Byron Wallace | Elmer Bernstam | Todd Johnson
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

pdf bib
Leveraging coreference to identify arms in medical abstracts: An experimental study
Elisa Ferracane | Iain Marshall | Byron C. Wallace | Katrin Erk
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

pdf bib
Rationale-Augmented Convolutional Neural Networks for Text Classification
Ye Zhang | Iain Marshall | Byron C. Wallace
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
MGNC-CNN: A Simple Approach to Exploiting Multiple Word Embeddings for Sentence Classification
Ye Zhang | Stephen Roller | Byron C. Wallace
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

pdf bib
Sparse, Contextually Informed Models for Irony Detection: Exploiting User Communities, Entities and Sentiment
Byron C. Wallace | Do Kook Choe | Eugene Charniak
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

pdf bib
Humans Require Context to Infer Ironic Intent (so Computers Probably do, too)
Byron C. Wallace | Do Kook Choe | Laura Kertz | Eugene Charniak
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

pdf bib
A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication
Byron C. Wallace | Thomas A. Trikalinos | M. Barton Laws | Ira B. Wilson | Eugene Charniak
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2012

pdf bib
Multiple Narrative Disentanglement: Unraveling Infinite Jest
Byron Wallace
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies