Ellie Pavlick


2021

pdf bib
Proceedings of the Society for Computation in Linguistics 2021
Allyson Ettinger | Ellie Pavlick | Brandon Prickett
Proceedings of the Society for Computation in Linguistics 2021

pdf bib
Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering
Najoung Kim | Ellie Pavlick | Burcu Karagol Ayan | Deepak Ramachandran
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Many Question-Answering (QA) datasets contain unanswerable questions, but their treatment in QA systems remains primitive. Our analysis of the Natural Questions (Kwiatkowski et al. 2019) dataset reveals that a substantial portion of unanswerable questions (~21%) can be explained based on the presence of unverifiable presuppositions. Through a user preference study, we demonstrate that the oracle behavior of our proposed system—which provides responses based on presupposition failure—is preferred over the oracle behavior of existing QA systems. Then, we present a novel framework for implementing such a system in three steps: presupposition generation, presupposition verification, and explanation generation, reporting progress on each. Finally, we show that a simple modification of adding presuppositions and their verifiability to the input of a competitive end-to-end QA system yields modest gains in QA performance and unanswerability detection, demonstrating the promise of our approach.

pdf bib
AND does not mean OR: Using Formal Languages to Study Language Models’ Representations
Aaron Traylor | Roman Feiman | Ellie Pavlick
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

A current open question in natural language processing is to what extent language models, which are trained with access only to the form of language, are able to capture the meaning of language. This question is challenging to answer in general, as there is no clear line between meaning and form, but rather meaning constrains form in consistent ways. The goal of this study is to offer insights into a narrower but critical subquestion: Under what conditions should we expect that meaning and form covary sufficiently, such that a language model with access only to form might nonetheless succeed in emulating meaning? Focusing on several formal languages (propositional logic and a set of programming languages), we generate training corpora using a variety of motivated constraints, and measure a distributional language model’s ability to differentiate logical symbols (AND, OR, and NOT). Our findings are largely negative: none of our simulated training corpora result in models which definitively differentiate meaningfully different symbols (e.g., AND vs. OR), suggesting a limitation to the types of semantic signals that current models are able to exploit.

pdf bib
Are Rotten Apples Edible? Challenging Commonsense Inference Ability with Exceptions
Nam Do | Ellie Pavlick
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Interpretability and Analysis in Neural NLP
Yonatan Belinkov | Sebastian Gehrmann | Ellie Pavlick
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

While deep learning has transformed the natural language processing (NLP) field and impacted the larger computational linguistics community, the rise of neural networks is stained by their opaque nature: It is challenging to interpret the inner workings of neural network models, and explicate their behavior. Therefore, in the last few years, an increasingly large body of work has been devoted to the analysis and interpretation of neural network models in NLP. This body of work is so far lacking a common framework and methodology. Moreover, approaching the analysis of modern neural networks can be difficult for newcomers to the field. This tutorial aims to fill this gap and introduce the nascent field of interpretability and analysis of neural networks in NLP. The tutorial will cover the main lines of analysis work, such as structural analyses using probing classifiers, behavioral studies and test suites, and interactive visualizations. We will highlight not only the most commonly applied analysis methods, but also the specific limitations and shortcomings of current approaches, in order to inform participants where to focus future efforts.

pdf bib
Are “Undocumented Workers” the Same as “Illegal Aliens”? Disentangling Denotation and Connotation in Vector Spaces
Albert Webson | Zhizhong Chen | Carsten Eickhoff | Ellie Pavlick
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In politics, neologisms are frequently invented for partisan objectives. For example, “undocumented workers” and “illegal aliens” refer to the same group of people (i.e., they have the same denotation), but they carry clearly different connotations. Examples like these have traditionally posed a challenge to reference-based semantic theories and led to increasing acceptance of alternative theories (e.g., Two-Factor Semantics) among philosophers and cognitive scientists. In NLP, however, popular pretrained models encode both denotation and connotation as one entangled representation. In this study, we propose an adversarial nerual netowrk that decomposes a pretrained representation as independent denotation and connotation representations. For intrinsic interpretability, we show that words with the same denotation but different connotations (e.g., “immigrants” vs. “aliens”, “estate tax” vs. “death tax”) move closer to each other in denotation space while moving further apart in connotation space. For extrinsic application, we train an information retrieval system with our disentangled representations and show that the denotation vectors improve the viewpoint diversity of document rankings.

pdf bib
A Visuospatial Dataset for Naturalistic Verb Learning
Dylan Ebert | Ellie Pavlick
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

We introduce a new dataset for training and evaluating grounded language models. Our data is collected within a virtual reality environment and is designed to emulate the quality of language data to which a pre-verbal child is likely to have access: That is, naturalistic, spontaneous speech paired with richly grounded visuospatial context. We use the collected data to compare several distributional semantics models for verb learning. We evaluate neural models based on 2D (pixel) features as well as feature-engineered models based on 3D (symbolic, spatial) features, and show that neither modeling approach achieves satisfactory performance. Our results are consistent with evidence from child language acquisition that emphasizes the difficulty of learning verbs from naive distributional data. We discuss avenues for future work on cognitively-inspired grounded language learning, and release our corpus with the intent of facilitating research on the topic.

pdf bib
What Happens To BERT Embeddings During Fine-tuning?
Amil Merchant | Elahe Rahimtoroghi | Ellie Pavlick | Ian Tenney
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

While much recent work has examined how linguistic information is encoded in pre-trained sentence representations, comparatively little is understood about how these models change when adapted to solve downstream tasks. Using a suite of analysis techniques—supervised probing, unsupervised similarity analysis, and layer-based ablations—we investigate how fine-tuning affects the representations of the BERT model. We find that while fine-tuning necessarily makes some significant changes, there is no catastrophic forgetting of linguistic phenomena. We instead find that fine-tuning is a conservative process that primarily affects the top layers of BERT, albeit with noteworthy variation across tasks. In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI involve much shallower processing. Finally, we also find that fine-tuning has a weaker effect on representations of out-of-domain sentences, suggesting room for improvement in model generalization.

2019

pdf bib
How well do NLI models capture verb veridicality?
Alexis Ross | Ellie Pavlick
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In natural language inference (NLI), contexts are considered veridical if they allow us to infer that their underlying propositions make true claims about the real world. We investigate whether a state-of-the-art natural language inference model (BERT) learns to make correct inferences about veridicality in verb-complement constructions. We introduce an NLI dataset for veridicality evaluation consisting of 1,500 sentence pairs, covering 137 unique verbs. We find that both human and model inferences generally follow theoretical patterns, but exhibit a systematic bias towards assuming that verbs are veridical–a bias which is amplified in BERT. We further show that, encouragingly, BERT’s inferences are sensitive not only to the presence of individual verb types, but also to the syntactic role of the verb, the form of the complement clause (to- vs. that-complements), and negation.

pdf bib
Using Grounded Word Representations to Study Theories of Lexical Concepts
Dylan Ebert | Ellie Pavlick
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

The fields of cognitive science and philosophy have proposed many different theories for how humans represent “concepts”. Multiple such theories are compatible with state-of-the-art NLP methods, and could in principle be operationalized using neural networks. We focus on two particularly prominent theories–Classical Theory and Prototype Theory–in the context of visually-grounded lexical representations. We compare when and how the behavior of models based on these theories differs in terms of categorization and entailment tasks. Our preliminary results suggest that Classical-based representations perform better for entailment and Prototype-based representations perform better for categorization. We discuss plans for additional experiments needed to confirm these initial observations.

pdf bib
Probing What Different NLP Tasks Teach Machines about Function Word Comprehension
Najoung Kim | Roma Patel | Adam Poliak | Patrick Xia | Alex Wang | Tom McCoy | Ian Tenney | Alexis Ross | Tal Linzen | Benjamin Van Durme | Samuel R. Bowman | Ellie Pavlick
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

We introduce a set of nine challenge tasks that test for the understanding of function words. These tasks are created by structurally mutating sentences from existing datasets to target the comprehension of specific types of function words (e.g., prepositions, wh-words). Using these probing tasks, we explore the effects of various pretraining objectives for sentence encoders (e.g., language modeling, CCG supertagging and natural language inference (NLI)) on the learned representations. Our results show that pretraining on CCG—our most syntactic objective—performs the best on average across our probing tasks, suggesting that syntactic knowledge helps function word comprehension. Language modeling also shows strong performance, supporting its widespread use for pretraining state-of-the-art NLP models. Overall, no pretraining objective dominates across the board, and our function word probing tasks highlight several intuitive differences between pretraining objectives, e.g., that NLI helps the comprehension of negation.

pdf bib
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
Tom McCoy | Ellie Pavlick | Tal Linzen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.

pdf bib
Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling
Alex Wang | Jan Hula | Patrick Xia | Raghavendra Pappagari | R. Thomas McCoy | Roma Patel | Najoung Kim | Ian Tenney | Yinghui Huang | Katherin Yu | Shuning Jin | Berlin Chen | Benjamin Van Durme | Edouard Grave | Ellie Pavlick | Samuel R. Bowman
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use language modeling, especially when combined with pretraining on additional labeled-data tasks. However, our results are mixed across pretraining tasks and show some concerning trends: In ELMo’s pretrain-then-freeze paradigm, random baselines are worryingly strong and results vary strikingly across target tasks. In addition, fine-tuning BERT on an intermediate task often negatively impacts downstream transfer. In a more positive trend, we see modest gains from multitask training, suggesting the development of more sophisticated multitask and transfer learning techniques as an avenue for further research.

pdf bib
BERT Rediscovers the Classical NLP Pipeline
Ian Tenney | Dipanjan Das | Ellie Pavlick
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Pre-trained text encoders have rapidly advanced the state of the art on many NLP tasks. We focus on one such model, BERT, and aim to quantify where linguistic information is captured within the network. We find that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference. Qualitative analysis reveals that the model can and often does adjust this pipeline dynamically, revising lower-level decisions on the basis of disambiguating information from higher-level representations.

pdf bib
Inherent Disagreements in Human Textual Inferences
Ellie Pavlick | Tom Kwiatkowski
Transactions of the Association for Computational Linguistics, Volume 7

We analyze human’s disagreements about the validity of natural language inferences. We show that, very often, disagreements are not dismissible as annotation “noise”, but rather persist as we collect more ratings and as we vary the amount of context provided to raters. We further show that the type of uncertainty captured by current state-of-the-art models for natural language inference is not reflective of the type of uncertainty present in human disagreements. We discuss implications of our results in relation to the recognizing textual entailment (RTE)/natural language inference (NLI) task. We argue for a refined evaluation objective that requires models to explicitly capture the full distribution of plausible human judgments.

2018

pdf bib
Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation
Adam Poliak | Aparajita Haldar | Rachel Rudinger | J. Edward Hu | Ellie Pavlick | Aaron Steven White | Benjamin Van Durme
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

We present a large scale collection of diverse natural language inference (NLI) datasets that help provide insight into how well a sentence representation encoded by a neural network captures distinct types of reasoning. The collection results from recasting 13 existing datasets from 7 semantic phenomena into a common NLI structure, resulting in over half a million labeled context-hypothesis pairs in total. Our collection of diverse datasets is available at http://www.decomp.net/, and will grow over time as additional resources are recast and added from novel sources.

pdf bib
Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation
Adam Poliak | Aparajita Haldar | Rachel Rudinger | J. Edward Hu | Ellie Pavlick | Aaron Steven White | Benjamin Van Durme
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We present a large-scale collection of diverse natural language inference (NLI) datasets that help provide insight into how well a sentence representation captures distinct types of reasoning. The collection results from recasting 13 existing datasets from 7 semantic phenomena into a common NLI structure, resulting in over half a million labeled context-hypothesis pairs in total. We refer to our collection as the DNC: Diverse Natural Language Inference Collection. The DNC is available online at https://www.decomp.net, and will grow over time as additional resources are recast and added from novel sources.

pdf bib
WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse
Manaal Faruqui | Ellie Pavlick | Ian Tenney | Dipanjan Das
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We release a corpus of 43 million atomic edits across 8 languages. These edits are mined from Wikipedia edit history and consist of instances in which a human editor has inserted a single contiguous phrase into, or deleted a single contiguous phrase from, an existing sentence. We use the collected data to show that the language generated during editing differs from the language that we observe in standard corpora, and that models trained on edits encode different aspects of semantics and discourse than models trained on raw text. We release the full corpus as a resource to aid ongoing research in semantics, discourse, and representation learning.

pdf bib
Learning Scalar Adjective Intensity from Paraphrases
Anne Cocos | Skyler Wharton | Ellie Pavlick | Marianna Apidianaki | Chris Callison-Burch
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Adjectives like “warm”, “hot”, and “scalding” all describe temperature but differ in intensity. Understanding these differences between adjectives is a necessary part of reasoning about natural language. We propose a new paraphrase-based method to automatically learn the relative intensity relation that holds between a pair of scalar adjectives. Our approach analyzes over 36k adjectival pairs from the Paraphrase Database under the assumption that, for example, paraphrase pair “really hot” <–> “scalding” suggests that “hot” < “scalding”. We show that combining this paraphrase evidence with existing, complementary pattern- and lexicon-based approaches improves the quality of systems for automatically ordering sets of scalar adjectives and inferring the polarity of indirect answers to “yes/no” questions.

2017

pdf bib
Identifying 1950s American Jazz Musicians: Fine-Grained IsA Extraction via Modifier Composition
Ellie Pavlick | Marius Paşca
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a method for populating fine-grained classes (e.g., “1950s American jazz musicians”) with instances (e.g., Charles Mingus ). While state-of-the-art methods tend to treat class labels as single lexical units, the proposed method considers each of the individual modifiers in the class label relative to the head. An evaluation on the task of reconstructing Wikipedia category pages demonstrates a >10 point increase in AUC, over a strong baseline relying on widely-used Hearst patterns.

2016

pdf bib
The Gun Violence Database: A new task and data set for NLP
Ellie Pavlick | Heng Ji | Xiaoman Pan | Chris Callison-Burch
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Tense Manages to Predict Implicative Behavior in Verbs
Ellie Pavlick | Chris Callison-Burch
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Most “babies” are “little” and most “problems” are “huge”: Compositional Entailment in Adjective-Nouns
Ellie Pavlick | Chris Callison-Burch
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Simple PPDB: A Paraphrase Database for Simplification
Ellie Pavlick | Chris Callison-Burch
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
An Empirical Analysis of Formality in Online Communication
Ellie Pavlick | Joel Tetreault
Transactions of the Association for Computational Linguistics, Volume 4

This paper presents an empirical study of linguistic formality. We perform an analysis of humans’ perceptions of formality in four different genres. These findings are used to develop a statistical model for predicting formality, which is evaluated under different feature settings and genres. We apply our model to an investigation of formality in online discussion forums, and present findings consistent with theories of formality and linguistic coordination.

pdf bib
Optimizing Statistical Machine Translation for Text Simplification
Wei Xu | Courtney Napoles | Ellie Pavlick | Quanze Chen | Chris Callison-Burch
Transactions of the Association for Computational Linguistics, Volume 4

Most recent sentence simplification systems use basic machine translation models to learn lexical and syntactic paraphrases from a manually simplified parallel corpus. These methods are limited by the quality and quantity of manually simplified corpora, which are expensive to build. In this paper, we conduct an in-depth adaptation of statistical machine translation to perform text simplification, taking advantage of large-scale paraphrases learned from bilingual texts and a small amount of manual simplifications with multiple references. Our work is the first to design automatic metrics that are effective for tuning and evaluating simplification systems, which will facilitate iterative development for this task.

pdf bib
So-Called Non-Subsective Adjectives
Ellie Pavlick | Chris Callison-Burch
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

2015

pdf bib
Effectively Crowdsourcing Radiology Report Annotations
Anne Cocos | Aaron Masino | Ting Qian | Ellie Pavlick | Chris Callison-Burch
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

pdf bib
Adding Semantics to Data-Driven Paraphrasing
Ellie Pavlick | Johan Bos | Malvina Nissim | Charley Beller | Benjamin Van Durme | Chris Callison-Burch
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Domain-Specific Paraphrase Extraction
Ellie Pavlick | Juri Ganitkevitch | Tsz Ping Chan | Xuchen Yao | Benjamin Van Durme | Chris Callison-Burch
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
FrameNet+: Fast Paraphrastic Tripling of FrameNet
Ellie Pavlick | Travis Wolfe | Pushpendre Rastogi | Chris Callison-Burch | Mark Dredze | Benjamin Van Durme
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification
Ellie Pavlick | Pushpendre Rastogi | Juri Ganitkevitch | Benjamin Van Durme | Chris Callison-Burch
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
Proceedings of the ACL-IJCNLP 2015 Student Research Workshop
Kuan-Yu Chen | Angelina Ivanova | Ellie Pavlick | Emily Bender | Chin-Yew Lin | Stephan Oepen
Proceedings of the ACL-IJCNLP 2015 Student Research Workshop

pdf bib
Inducing Lexical Style Properties for Paraphrase and Genre Differentiation
Ellie Pavlick | Ani Nenkova
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Crowdsourcing for NLP
Chris Callison-Burch | Lyle Ungar | Ellie Pavlick
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts

2014

pdf bib
The Language Demographics of Amazon Mechanical Turk
Ellie Pavlick | Matt Post | Ann Irvine | Dmitry Kachaev | Chris Callison-Burch
Transactions of the Association for Computational Linguistics, Volume 2

We present a large scale study of the languages spoken by bilingual workers on Mechanical Turk (MTurk). We establish a methodology for determining the language skills of anonymous crowd workers that is more robust than simple surveying. We validate workers’ self-reported language skill claims by measuring their ability to correctly translate words, and by geolocating workers to see if they reside in countries where the languages are likely to be spoken. Rather than posting a one-off survey, we posted paid tasks consisting of 1,000 assignments to translate a total of 10,000 words in each of 100 languages. Our study ran for several months, and was highly visible on the MTurk crowdsourcing platform, increasing the chances that bilingual workers would complete it. Our study was useful both to create bilingual dictionaries and to act as census of the bilingual speakers on MTurk. We use this data to recommend languages with the largest speaker populations as good candidates for other researchers who want to develop crowdsourced, multilingual technologies. To further demonstrate the value of creating data via crowdsourcing, we hire workers to create bilingual parallel corpora in six Indian languages, and use them to train statistical machine translation systems.

pdf bib
Are Two Heads Better than One? Crowdsourced Translation via a Two-Step Collaboration of Non-Professional Translators and Editors
Rui Yan | Mingkun Gao | Ellie Pavlick | Chris Callison-Burch
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)