Percy Liang


2021

pdf bib
Conditional probing: measuring usable information beyond a baseline
John Hewitt | Kawin Ethayarajh | Percy Liang | Christopher Manning
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Probing experiments investigate the extent to which neural representations make properties—like part-of-speech—predictable. One suggests that a representation encodes a property if probing that representation produces higher accuracy than probing a baseline representation like non-contextual word embeddings. Instead of using baselines as a point of comparison, we’re interested in measuring information that is contained in the representation but not in the baseline. For example, current methods can detect when a representation is more useful than the word identity (a baseline) for predicting part-of-speech; however, they cannot detect when the representation is predictive of just the aspects of part-of-speech not explainable by the word identity. In this work, we extend a theory of usable information called V-information and propose conditional probing, which explicitly conditions on the information in the baseline. In a case study, we find that after conditioning on non-contextual word embeddings, properties like part-of-speech are accessible at deeper layers of a network than previously thought.

pdf bib
LM-Critic: Language Models for Unsupervised Grammatical Error Correction
Michihiro Yasunaga | Jure Leskovec | Percy Liang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Grammatical error correction (GEC) requires a set of labeled ungrammatical / grammatical sentence pairs for training, but obtaining such annotation can be prohibitively expensive. Recently, the Break-It-Fix-It (BIFI) framework has demonstrated strong results on learning to repair a broken program without any labeled examples, but this relies on a perfect critic (e.g., a compiler) that returns whether an example is valid or not, which does not exist for the GEC task. In this work, we show how to leverage a pretrained language model (LM) in defining an LM-Critic, which judges a sentence to be grammatical if the LM assigns it a higher probability than its local perturbations. We apply this LM-Critic and BIFI along with a large set of unlabeled sentences to bootstrap realistic ungrammatical / grammatical pairs for training a corrector. We evaluate our approach on GEC datasets on multiple domains (CoNLL-2014, BEA-2019, GMEG-wiki and GMEG-yahoo) and show that it outperforms existing methods in both the unsupervised setting (+7.7 F0.5) and the supervised setting (+0.5 F0.5).

pdf bib
QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering
Michihiro Yasunaga | Hongyu Ren | Antoine Bosselut | Percy Liang | Jure Leskovec
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The problem of answering questions using knowledge from pre-trained language models (LMs) and knowledge graphs (KGs) presents two challenges: given a QA context (question and answer choice), methods need to (i) identify relevant knowledge from large KGs, and (ii) perform joint reasoning over the QA context and KG. Here we propose a new model, QA-GNN, which addresses the above challenges through two key innovations: (i) relevance scoring, where we use LMs to estimate the importance of KG nodes relative to the given QA context, and (ii) joint reasoning, where we connect the QA context and KG to form a joint graph, and mutually update their representations through graph-based message passing. We evaluate QA-GNN on the CommonsenseQA and OpenBookQA datasets, and show its improvement over existing LM and LM+KG models, as well as its capability to perform interpretable and structured reasoning, e.g., correctly handling negation in questions.

pdf bib
Swords: A Benchmark for Lexical Substitution with Improved Data Coverage and Quality
Mina Lee | Chris Donahue | Robin Jia | Alexander Iyabor | Percy Liang
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We release a new benchmark for lexical substitution, the task of finding appropriate substitutes for a target word in a context. For writing, lexical substitution systems can assist humans by suggesting words that humans cannot easily think of. However, existing benchmarks depend on human recall as the only source of data, and therefore lack coverage of the substitutes that would be most helpful to humans. Furthermore, annotators often provide substitutes of low quality, which are not actually appropriate in the given context. We collect higher-coverage and higher-quality data by framing lexical substitution as a classification problem, guided by the intuition that it is easier for humans to judge the appropriateness of candidate substitutes than conjure them from memory. To this end, we use a context-free thesaurus to produce candidates and rely on human judgement to determine contextual appropriateness. Compared to the previous largest benchmark, our Swords benchmark has 3x as many substitutes per target word for the same level of quality, and its substitutes are 1.4x more appropriate (based on human judgement) for the same number of substitutes.

pdf bib
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Xiang Lisa Li | Percy Liang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Fine-tuning is the de facto way of leveraging large pretrained language models for downstream tasks. However, fine-tuning modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen and instead optimizes a sequence of continuous task-specific vectors, which we call the prefix. Prefix-tuning draws inspiration from prompting for language models, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”. We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We show that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics that are unseen during training.

2020

pdf bib
Task-Oriented Dialogue as Dataflow Synthesis
Jacob Andreas | John Bufe | David Burkett | Charles Chen | Josh Clausman | Jean Crawford | Kate Crim | Jordan DeLoach | Leah Dorner | Jason Eisner | Hao Fang | Alan Guo | David Hall | Kristin Hayes | Kellie Hill | Diana Ho | Wendy Iwaszuk | Smriti Jha | Dan Klein | Jayant Krishnamurthy | Theo Lanman | Percy Liang | Christopher H. Lin | Ilya Lintsbakh | Andy McGovern | Aleksandr Nisnevich | Adam Pauls | Dmitrij Petters | Brent Read | Dan Roth | Subhro Roy | Jesse Rusak | Beth Short | Div Slomin | Ben Snyder | Stephon Striplin | Yu Su | Zachary Tellman | Sam Thomson | Andrei Vorobev | Izabela Witoszko | Jason Wolfe | Abby Wray | Yuchen Zhang | Alexander Zotov
Transactions of the Association for Computational Linguistics, Volume 8

We describe an approach to task-oriented dialogue in which dialogue state is represented as a dataflow graph. A dialogue agent maps each user utterance to a program that extends this graph. Programs include metacomputation operators for reference and revision that reuse dataflow fragments from previous turns. Our graph-based state enables the expression and manipulation of complex user intents, and explicit metacomputation makes these intents easier for learned models to predict. We introduce a new dataset, SMCalFlow, featuring complex dialogues about events, weather, places, and people. Experiments show that dataflow graphs and metacomputation substantially improve representability and predictability in these natural dialogues. Additional experiments on the MultiWOZ dataset show that our dataflow representation enables an otherwise off-the-shelf sequence-to-sequence model to match the best existing task-specific state tracking model. The SMCalFlow dataset, code for replicating experiments, and a public leaderboard are available at https://www.microsoft.com/en-us/research/project/dataflow-based-dialogue-semantic-machines.

pdf bib
ExpBERT: Representation Engineering with Natural Language Explanations
Shikhar Murty | Pang Wei Koh | Percy Liang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Suppose we want to specify the inductive bias that married couples typically go on honeymoons for the task of extracting pairs of spouses from text. In this paper, we allow model developers to specify these types of inductive biases as natural language explanations. We use BERT fine-tuned on MultiNLI to “interpret” these explanations with respect to the input sentence, producing explanation-guided representations of the input. Across three relation extraction tasks, our method, ExpBERT, matches a BERT baseline but with 3–20x less labeled data and improves on the baseline by 3–10 F1 points with the same amount of labeled data.

pdf bib
Enabling Language Models to Fill in the Blanks
Chris Donahue | Mina Lee | Percy Liang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present a simple approach for text infilling, the task of predicting missing spans of text at any position in a document. While infilling could enable rich functionality especially for writing assistance tools, more attention has been devoted to language modeling—a special case of infilling where text is predicted at the end of a document. In this paper, we aim to extend the capabilities of language models (LMs) to the more general task of infilling. To this end, we train (or fine tune) off-the-shelf LMs on sequences containing the concatenation of artificially-masked text and the text which was masked. We show that this approach, which we call infilling by language modeling, can enable LMs to infill entire sentences effectively on three different domains: short stories, scientific abstracts, and lyrics. Furthermore, we show that humans have difficulty identifying sentences infilled by our approach as machine-generated in the domain of short stories.

pdf bib
Robust Encodings: A Framework for Combating Adversarial Typos
Erik Jones | Robin Jia | Aditi Raghunathan | Percy Liang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Despite excellent performance on many tasks, NLP systems are easily fooled by small adversarial perturbations of inputs. Existing procedures to defend against such perturbations are either (i) heuristic in nature and susceptible to stronger attacks or (ii) provide guaranteed robustness to worst-case attacks, but are incompatible with state-of-the-art models like BERT. In this work, we introduce robust encodings (RobEn): a simple framework that confers guaranteed robustness, without making compromises on model architecture. The core component of RobEn is an encoding function, which maps sentences to a smaller, discrete space of encodings. Systems using these encodings as a bottleneck confer guaranteed robustness with standard training, and the same encodings can be used across multiple tasks. We identify two desiderata to construct robust encoding functions: perturbations of a sentence should map to a small set of encodings (stability), and models using encodings should still perform well (fidelity). We instantiate RobEn to defend against a large family of adversarial typos. Across six tasks from GLUE, our instantiation of RobEn paired with BERT achieves an average robust accuracy of 71.3% against all adversarial typos in the family considered, while previous work using a typo-corrector achieves only 35.3% accuracy against a simple greedy attack.

pdf bib
Shaping Visual Representations with Language for Few-Shot Classification
Jesse Mu | Percy Liang | Noah Goodman
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

By describing the features and abstractions of our world, language is a crucial tool for human learning and a promising source of supervision for machine learning models. We use language to improve few-shot visual classification in the underexplored scenario where natural language task descriptions are available during training, but unavailable for novel tasks at test time. Existing models for this setting sample new descriptions at test time and use those to classify images. Instead, we propose language-shaped learning (LSL), an end-to-end model that regularizes visual representations to predict language. LSL is conceptually simpler, more data efficient, and outperforms baselines in two challenging few-shot domains.

pdf bib
Selective Question Answering under Domain Shift
Amita Kamath | Robin Jia | Percy Liang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

To avoid giving wrong answers, question answering (QA) models need to know when to abstain from answering. Moreover, users often ask questions that diverge from the model’s training data, making errors more likely and thus abstention more critical. In this work, we propose the setting of selective question answering under domain shift, in which a QA model is tested on a mixture of in-domain and out-of-domain data, and must answer (i.e., not abstain on) as many questions as possible while maintaining high accuracy. Abstention policies based solely on the model’s softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs. Instead, we train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely. Crucially, the calibrator benefits from observing the model’s behavior on out-of-domain data, even if from a different domain than the test data. We combine this method with a SQuAD-trained QA model and evaluate on mixtures of SQuAD and five other QA datasets. Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model’s probabilities only answers 48% at 80% accuracy.

pdf bib
RNNs can generate bounded hierarchical languages with optimal memory
John Hewitt | Michael Hahn | Surya Ganguli | Percy Liang | Christopher D. Manning
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recurrent neural networks empirically generate natural language with high syntactic fidelity. However, their success is not well-understood theoretically. We provide theoretical insight into this success, proving in a finite-precision setting that RNNs can efficiently generate bounded hierarchical languages that reflect the scaffolding of natural language syntax. We introduce Dyck-(k,m), the language of well-nested brackets (of k types) and m-bounded nesting depth, reflecting the bounded memory needs and long-distance dependencies of natural language syntax. The best known results use O(km2) memory (hidden units) to generate these languages. We prove that an RNN with O(m log k) hidden units suffices, an exponential reduction in memory, by an explicit construction. Finally, we show that no algorithm, even with unbounded computation, can suffice with o(m log k) hidden units.

pdf bib
The EOS Decision and Length Extrapolation
Benjamin Newman | John Hewitt | Percy Liang | Christopher D. Manning
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of a modeling decision often overlooked: predicting the end of the generative process through the use of a special end-of-sequence (EOS) vocabulary item. We study an oracle setting - forcing models to generate to the correct sequence length at test time - to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket closing task, as well as achieving a 40% improvement over +EOS in the difficult SCAN dataset length generalization task. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position is a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.

pdf bib
On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks
Stephen Mussmann | Robin Jia | Percy Liang
Findings of the Association for Computational Linguistics: EMNLP 2020

Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., 99.99% of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: State-of-the art models trained on QQP and WikiQA each have only 2.4% average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to 32.5% on QQP and 20.1% on WikiQA.

pdf bib
Learning Adaptive Language Interfaces through Decomposition
Siddharth Karamcheti | Dorsa Sadigh | Percy Liang
Proceedings of the First Workshop on Interactive and Executable Semantic Parsing

Our goal is to create an interactive natural language interface that efficiently and reliably learns from users to complete tasks in simulated robotics settings. We introduce a neural semantic parsing system that learns new high-level abstractions through decomposition: users interactively teach the system by breaking down high-level utterances describing novel behavior into low-level steps that it can understand. Unfortunately, existing methods either rely on grammars which parse sentences with limited flexibility, or neural sequence-to-sequence models that do not learn efficiently or reliably from individual examples. Our approach bridges this gap, demonstrating the flexibility of modern neural systems, as well as the one-shot reliable generalization of grammar-based methods. Our crowdsourced interactive experiments suggest that over time, users complete complex tasks more efficiently while using our system by leveraging what they just taught. At the same time, getting users to trust the system enough to be incentivized to teach high-level utterances is still an ongoing challenge. We end with a discussion of some of the obstacles we need to overcome to fully realize the potential of the interactive paradigm.

2019

pdf bib
Designing and Interpreting Probes with Control Tasks
John Hewitt | Percy Liang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Probes, supervised models trained to predict properties (like parts-of-speech) from representations (like ELMo), have achieved high accuracy on a range of linguistic tasks. But does this mean that the representations encode linguistic structure or just that the probe has learned the linguistic task? In this paper, we propose control tasks, which associate word types with random outputs, to complement linguistic tasks. By construction, these tasks can only be learned by the probe itself. So a good probe, (one that reflects the representation), should be selective, achieving high linguistic task accuracy and low control task accuracy. The selectivity of a probe puts linguistic task accuracy in context with the probe’s capacity to memorize from word types. We construct control tasks for English part-of-speech tagging and dependency edge prediction, and show that popular probes on ELMo representations are not selective. We also find that dropout, commonly used to control probe complexity, is ineffective for improving selectivity of MLPs, but that other forms of regularization are effective. Finally, we find that while probes on the first layer of ELMo yield slightly better part-of-speech tagging accuracy than the second, probes on the second layer are substantially more selective, which raises the question of which layer better represents parts-of-speech.

pdf bib
Certified Robustness to Adversarial Word Substitutions
Robin Jia | Aditi Raghunathan | Kerem Göksel | Percy Liang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

State-of-the-art NLP models can often be fooled by adversaries that apply seemingly innocuous label-preserving transformations (e.g., paraphrasing) to input text. The number of possible transformations scales exponentially with text length, so data augmentation cannot cover all transformations of an input. This paper considers one exponentially large family of label-preserving transformations, in which every word in the input can be replaced with a similar word. We train the first models that are provably robust to all word substitutions in this family. Our training procedure uses Interval Bound Propagation (IBP) to minimize an upper bound on the worst-case loss that any combination of word substitutions can induce. To evaluate models’ robustness to these transformations, we measure accuracy on adversarially chosen word substitutions applied to test examples. Our IBP-trained models attain 75% adversarial accuracy on both sentiment analysis on IMDB and natural language inference on SNLI; in comparison, on IMDB, models trained normally and ones trained with data augmentation achieve adversarial accuracy of only 12% and 41%, respectively.

pdf bib
Distributionally Robust Language Modeling
Yonatan Oren | Shiori Sagawa | Tatsunori B. Hashimoto | Percy Liang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Language models are generally trained on data spanning a wide range of topics (e.g., news, reviews, fiction), but they might be applied to an a priori unknown target distribution (e.g., restaurant reviews). In this paper, we first show that training on text outside the test distribution can degrade test performance when using standard maximum likelihood (MLE) training. To remedy this without the knowledge of the test distribution, we propose an approach which trains a model that performs well over a wide range of potential test distributions. In particular, we derive a new distributionally robust optimization (DRO) procedure which minimizes the loss of the model over the worst-case mixture of topics with sufficient overlap with the training distribution. Our approach, called topic conditional value at risk (topic CVaR), obtains a 5.5 point perplexity reduction over MLE when the language models are trained on a mixture of Yelp reviews and news and tested only on reviews.

pdf bib
Unifying Human and Statistical Evaluation for Natural Language Generation
Tatsunori B. Hashimoto | Hugh Zhang | Percy Liang
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

How can we measure whether a natural language generation system produces both high quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low quality samples would be insufficiently penalized. In this paper, we propose a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated. We demonstrate that this error rate can be efficiently estimated by combining human and statistical evaluation, using an evaluation metric which we call HUSE. On summarization and chit-chat dialogue, we show that (i) HUSE detects diversity defects which fool pure human evaluation and that (ii) techniques such as annealing for improving quality actually decrease HUSE due to decreased diversity.

pdf bib
Pun Generation with Surprise
He He | Nanyun Peng | Percy Liang
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We tackle the problem of generating a pun sentence given a pair of homophones (e.g., “died” and “dyed”). Puns are by their very nature statistically anomalous and not amenable to most text generation methods that are supervised by a large corpus. In this paper, we propose an unsupervised approach to pun generation based on lots of raw (unhumorous) text and a surprisal principle. Specifically, we posit that in a pun sentence, there is a strong association between the pun word (e.g., “dyed”) and the distant context, but a strong association between the alternative word (e.g., “died”) and the immediate context. We instantiate the surprisal principle in two ways: (i) as a measure based on the ratio of probabilities given by a language model, and (ii) a retrieve-and-edit approach based on words suggested by a skip-gram model. Based on human evaluation, our retrieve-and-edit approach generates puns successfully 30% of the time, doubling the success rate of a neural generation baseline.

2018

pdf bib
Textual Analogy Parsing: What’s Shared and What’s Compared among Analogous Facts
Matthew Lamm | Arun Chaganty | Christopher D. Manning | Dan Jurafsky | Percy Liang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

To understand a sentence like “whereas only 10% of White Americans live at or below the poverty line, 28% of African Americans do” it is important not only to identify individual facts, e.g., poverty rates of distinct demographic groups, but also the higher-order relations between them, e.g., the disparity between them. In this paper, we propose the task of Textual Analogy Parsing (TAP) to model this higher-order meaning. Given a sentence such as the one above, TAP outputs a frame-style meaning representation which explicitly specifies what is shared (e.g., poverty rates) and what is compared (e.g., White Americans vs. African Americans, 10% vs. 28%) between its component facts. Such a meaning representation can enable new applications that rely on discourse understanding such as automated chart generation from quantitative text. We present a new dataset for TAP, baselines, and a model that successfully uses an ILP to enforce the structural constraints of the problem.

pdf bib
QuAC: Question Answering in Context
Eunsol Choi | He He | Mohit Iyyer | Mark Yatskar | Wen-tau Yih | Yejin Choi | Percy Liang | Luke Zettlemoyer
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We present QuAC, a dataset for Question Answering in Context that contains 14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation. We also report results for a number of reference models, including a recently state-of-the-art reading comprehension architecture extended to model dialog context. Our best model underperforms humans by 20 F1, suggesting that there is significant room for future work on this data. Dataset, baseline, and leaderboard available at http://quac.ai.

pdf bib
Decoupling Strategy and Generation in Negotiation Dialogues
He He | Derek Chen | Anusha Balakrishnan | Percy Liang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We consider negotiation settings in which two agents use natural language to bargain on goods. Agents need to decide on both high-level strategy (e.g., proposing $50) and the execution of that strategy (e.g., generating “The bike is brand new. Selling for just $50!”). Recent work on negotiation trains neural models, but their end-to-end nature makes it hard to control their strategy, and reinforcement learning tends to lead to degenerate solutions. In this paper, we propose a modular approach based on coarse dialogue acts (e.g., propose(price=50)) that decouples strategy and generation. We show that we can flexibly set the strategy using supervised learning, reinforcement learning, or domain-specific knowledge without degeneracy, while our retrieval-based generation can maintain context-awareness and produce diverse utterances. We test our approach on the recently proposed DEALORNODEAL game, and we also collect a richer dataset based on real items on Craigslist. Human evaluation shows that our systems achieve higher task success rate and more human-like negotiation behavior than previous approaches.

pdf bib
Mapping natural language commands to web elements
Panupong Pasupat | Tian-Shun Jiang | Evan Liu | Kelvin Guu | Percy Liang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The web provides a rich, open-domain environment with textual, structural, and spatial properties. We propose a new task for grounding language in this environment: given a natural language command (e.g., “click on the second article”), choose the correct element on the web page (e.g., a hyperlink or text box). We collected a dataset of over 50,000 commands that capture various phenomena such as functional references (e.g. “find who made this site”), relational reasoning (e.g. “article by john”), and visual reasoning (e.g. “top-most article”). We also implemented and analyzed three baseline models that capture different phenomena present in the dataset.

pdf bib
The price of debiasing automatic metrics in natural language evalaution
Arun Chaganty | Stephen Mussmann | Percy Liang
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

For evaluating generation systems, automatic metrics such as BLEU cost nothing to run but have been shown to correlate poorly with human judgment, leading to systematic bias against certain model improvements. On the other hand, averaging human judgments, the unbiased gold standard, is often too expensive. In this paper, we use control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone. In practice, however, we obtain only a 7-13% cost reduction on evaluating summarization and open-response question answering systems. We then prove that our estimator is optimal: there is no unbiased estimator with lower cost. Our theory further highlights the two fundamental bottlenecks—the automatic metric and the prompt shown to human evaluators—both of which need to be improved to obtain greater cost savings.

pdf bib
Training Classifiers with Natural Language Explanations
Braden Hancock | Paroma Varma | Stephanie Wang | Martin Bringmann | Percy Liang | Christopher Ré
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Training accurate classifiers requires many labels, but each label provides only limited information (one bit for binary classification). In this work, we propose BabbleLabble, a framework for training classifiers in which an annotator provides a natural language explanation for each labeling decision. A semantic parser converts these explanations into programmatic labeling functions that generate noisy labels for an arbitrary amount of unlabeled data, which is used to train a classifier. On three relation extraction tasks, we find that users are able to train classifiers with comparable F1 scores from 5-100 faster by providing explanations instead of just labels. Furthermore, given the inherent imperfection of labeling functions, we find that a simple rule-based semantic parser suffices.

pdf bib
Know What You Don’t Know: Unanswerable Questions for SQuAD
Pranav Rajpurkar | Robin Jia | Percy Liang
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Extractive reading comprehension systems can often locate the correct answer to a question in a context document, but they also tend to make unreliable guesses on questions for which the correct answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to identify. To address these weaknesses, we present SQuADRUn, a new dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuADRUn, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. SQuADRUn is a challenging natural language understanding task for existing models: a strong neural system that gets 86% F1 on SQuAD achieves only 66% F1 on SQuADRUn. We release SQuADRUn to the community as the successor to SQuAD.

pdf bib
Generating Sentences by Editing Prototypes
Kelvin Guu | Tatsunori B. Hashimoto | Yonatan Oren | Percy Liang
Transactions of the Association for Computational Linguistics, Volume 6

We propose a new generative language model for sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Compared to traditional language models that generate from scratch either left-to-right or by first sampling a latent sentence vector, our prototype-then-edit model improves perplexity on language modeling and generates higher quality outputs according to human evaluation. Furthermore, the model gives rise to a latent edit vector that captures interpretable semantics such as sentence similarity and sentence-level analogies.

pdf bib
Planning, Inference and Pragmatics in Sequential Language Games
Fereshte Khani | Noah D. Goodman | Percy Liang
Transactions of the Association for Computational Linguistics, Volume 6

We study sequential language games in which two players, each with private information, communicate to achieve a common goal. In such games, a successful player must (i) infer the partner’s private information from the partner’s messages, (ii) generate messages that are most likely to help with the goal, and (iii) reason pragmatically about the partner’s strategy. We propose a model that captures all three characteristics and demonstrate their importance in capturing human behavior on a new goal-oriented dataset we collected using crowdsourcing.

pdf bib
Delete, Retrieve, Generate: a Simple Approach to Sentiment and Style Transfer
Juncen Li | Robin Jia | He He | Percy Liang
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We consider the task of text attribute transfer: transforming a sentence to alter a specific attribute (e.g., sentiment) while preserving its attribute-independent content (e.g., “screen is just the right size” to “screen is too small”). Our training data includes only sentences labeled with their attribute (e.g., positive and negative), but not pairs of sentences that only differ in the attributes, so we must learn to disentangle attributes from attribute-independent content in an unsupervised way. Previous work using adversarial methods has struggled to produce high-quality outputs. In this paper, we propose simpler methods motivated by the observation that text attributes are often marked by distinctive phrases (e.g., “too small”). Our strongest method extracts content words by deleting phrases associated with the sentence’s original attribute value, retrieves new phrases associated with the target attribute, and uses a neural model to fluently combine these into a final output. Based on human evaluation, our best method generates grammatical and appropriate responses on 22% more inputs than the best previous system, averaged over three attribute transfer datasets: altering sentiment of reviews on Yelp, altering sentiment of reviews on Amazon, and altering image captions to be more romantic or humorous.

2017

pdf bib
Naturalizing a Programming Language via Interactive Learning
Sida I. Wang | Samuel Ginn | Percy Liang | Christopher D. Manning
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Our goal is to create a convenient natural language interface for performing well-specified but complex actions such as analyzing data, manipulating text, and querying databases. However, existing natural language interfaces for such tasks are quite primitive compared to the power one wields with a programming language. To bridge this gap, we start with a core programming language and allow users to “naturalize” the core language incrementally by defining alternative, more natural syntax and increasingly complex concepts in terms of compositions of simpler ones. In a voxel world, we show that a community of users can simultaneously teach a common system a diverse language and use it to build hundreds of complex voxel structures. Over the course of three days, these users went from using only the core language to using the naturalized language in 85.9% of the last 10K utterances.

pdf bib
From Language to Programs: Bridging Reinforcement Learning and Maximum Marginal Likelihood
Kelvin Guu | Panupong Pasupat | Evan Liu | Percy Liang
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Our goal is to learn a semantic parser that maps natural language utterances into executable programs when only indirect supervision is available: examples are labeled with the correct execution result, but not the program itself. Consequently, we must search the space of programs for those that output the correct result, while not being misled by spurious programs: incorrect programs that coincidentally output the correct result. We connect two common learning paradigms, reinforcement learning (RL) and maximum marginal likelihood (MML), and then present a new learning algorithm that combines the strengths of both. The new algorithm guards against spurious programs by combining the systematic search traditionally employed in MML with the randomized exploration of RL, and by updating parameters such that probability is spread more evenly across consistent programs. We apply our learning algorithm to a new neural semantic parser and show significant gains over existing state-of-the-art results on a recent context-dependent semantic parsing task.

pdf bib
Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings
He He | Anusha Balakrishnan | Mihail Eric | Percy Liang
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We study a symmetric collaborative dialogue setting in which two agents, each with private knowledge, must strategically communicate to achieve a common goal. The open-ended dialogue state in this setting poses new challenges for existing dialogue systems. We collected a dataset of 11K human-human dialogues, which exhibits interesting lexical, semantic, and strategic elements. To model both structured knowledge and unstructured language, we propose a neural model with dynamic knowledge graph embeddings that evolve as the dialogue progresses. Automatic and human evaluations show that our model is both more effective at achieving the goal and more human-like than baseline neural and rule-based models.

pdf bib
Importance sampling for unbiased on-demand evaluation of knowledge base population
Arun Chaganty | Ashwin Paranjape | Percy Liang | Christopher D. Manning
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Knowledge base population (KBP) systems take in a large document corpus and extract entities and their relations. Thus far, KBP evaluation has relied on judgements on the pooled predictions of existing systems. We show that this evaluation is problematic: when a new system predicts a previously unseen relation, it is penalized even if it is correct. This leads to significant bias against new systems, which counterproductively discourages innovation in the field. Our first contribution is a new importance-sampling based evaluation which corrects for this bias by annotating a new system’s predictions on-demand via crowdsourcing. We show this eliminates bias and reduces variance using data from the 2015 TAC KBP task. Our second contribution is an implementation of our method made publicly available as an online KBP evaluation service. We pilot the service by testing diverse state-of-the-art systems on the TAC KBP 2016 corpus and obtain accurate scores in a cost effective manner.

pdf bib
Macro Grammars and Holistic Triggering for Efficient Semantic Parsing
Yuchen Zhang | Panupong Pasupat | Percy Liang
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

To learn a semantic parser from denotations, a learning algorithm must search over a combinatorially large space of logical forms for ones consistent with the annotated denotations. We propose a new online learning algorithm that searches faster as training progresses. The two key ideas are using macro grammars to cache the abstract patterns of useful logical forms found thus far, and holistic triggering to efficiently retrieve the most relevant patterns based on sentence similarity. On the WikiTableQuestions dataset, we first expand the search space of an existing model to improve the state-of-the-art accuracy from 38.7% to 42.7%, and then use macro grammars and holistic triggering to achieve an 11x speedup and an accuracy of 43.7%.

pdf bib
Adversarial Examples for Evaluating Reading Comprehension Systems
Robin Jia | Percy Liang
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of 75% F1 score to 36%; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to 7%. We hope our insights will motivate the development of new models that understand language more precisely.

2016

pdf bib
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar | Jian Zhang | Konstantin Lopyrev | Percy Liang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Data Recombination for Neural Semantic Parsing
Robin Jia | Percy Liang
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Inferring Logical Forms From Denotations
Panupong Pasupat | Percy Liang
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
How Much is 131 Million Dollars? Putting Numbers in Perspective with Compositional Descriptions
Arun Chaganty | Percy Liang
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Unanimous Prediction for 100% Precision with Application to Learning Semantic Mappings
Fereshte Khani | Martin Rinard | Percy Liang
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Simpler Context-Dependent Logical Forms via Model Projections
Reginald Long | Panupong Pasupat | Percy Liang
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Learning Language Games through Interaction
Sida I. Wang | Percy Liang | Christopher D. Manning
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

pdf bib
Traversing Knowledge Graphs in Vector Space
Kelvin Guu | John Miller | Percy Liang
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Imitation Learning of Agenda-based Semantic Parsers
Jonathan Berant | Percy Liang
Transactions of the Association for Computational Linguistics, Volume 3

Semantic parsers conventionally construct logical forms bottom-up in a fixed order, resulting in the generation of many extraneous partial logical forms. In this paper, we combine ideas from imitation learning and agenda-based parsing to train a semantic parser that searches partial logical forms in a more strategic order. Empirically, our parser reduces the number of constructed partial logical forms by an order of magnitude, and obtains a 6x-9x speedup over fixed-order parsing, while maintaining comparable accuracy.

pdf bib
Environment-Driven Lexicon Induction for High-Level Instructions
Dipendra Kumar Misra | Kejia Tao | Percy Liang | Ashutosh Saxena
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Building a Semantic Parser Overnight
Yushi Wang | Jonathan Berant | Percy Liang
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Compositional Semantic Parsing on Semi-Structured Tables
Panupong Pasupat | Percy Liang
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing
Phil Blunsom | Shay Cohen | Paramveer Dhillon | Percy Liang
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

2014

pdf bib
Zero-shot Entity Extraction from Web Pages
Panupong Pasupat | Percy Liang
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Semantic Parsing via Paraphrasing
Jonathan Berant | Percy Liang
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

pdf bib
Feature Noising for Log-Linear Structured Prediction
Sida Wang | Mengqiu Wang | Stefan Wager | Percy Liang | Christopher D. Manning
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Semantic Parsing on Freebase from Question-Answer Pairs
Jonathan Berant | Andrew Chou | Roy Frostig | Percy Liang
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Learning Dependency-Based Compositional Semantics
Percy Liang | Michael I. Jordan | Dan Klein
Computational Linguistics, Volume 39, Issue 2 - June 2013

2011

pdf bib
Learning Dependency-Based Compositional Semantics
Percy Liang | Michael Jordan | Dan Klein
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
Type-Based MCMC
Percy Liang | Michael I. Jordan | Dan Klein
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
A Game-Theoretic Approach to Generating Spatial Descriptions
Dave Golland | Percy Liang | Dan Klein
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
A Simple Domain-Independent Probabilistic Approach to Generation
Gabor Angeli | Percy Liang | Dan Klein
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

pdf bib
Online EM for Unsupervised Models
Percy Liang | Dan Klein
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Learning Semantic Correspondences with Less Supervision
Percy Liang | Michael Jordan | Dan Klein
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

pdf bib
Learning Bilingual Lexicons from Monolingual Corpora
Aria Haghighi | Percy Liang | Taylor Berg-Kirkpatrick | Dan Klein
Proceedings of ACL-08: HLT

pdf bib
Analyzing the Errors of Unsupervised Learning
Percy Liang | Dan Klein
Proceedings of ACL-08: HLT

2007

pdf bib
The Infinite PCFG Using Hierarchical Dirichlet Processes
Percy Liang | Slav Petrov | Michael Jordan | Dan Klein
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf bib
A Probabilistic Approach to Diachronic Phonology
Alexandre Bouchard | Percy Liang | Thomas Griffiths | Dan Klein
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib
Alignment by Agreement
Percy Liang | Ben Taskar | Dan Klein
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf bib
An End-to-End Discriminative Approach to Machine Translation
Percy Liang | Alexandre Bouchard-Côté | Dan Klein | Ben Taskar
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Search
Co-authors