Rowan Zellers


NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints
Ximing Lu | Peter West | Rowan Zellers | Ronan Le Bras | Chandra Bhagavatula | Yejin Choi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Conditional text generation often requires lexical constraints, i.e., which words should or shouldn’t be included in the output text. While the dominant recipe for conditional text generation has been large-scale pretrained language models that are finetuned on the task-specific training data, such models do not learn to follow the underlying constraints reliably, even when supervised with large amounts of task-specific examples. We propose NeuroLogic Decoding, a simple yet effective algorithm that enables neural language models – supervised or not – to generate fluent text while satisfying complex lexical constraints. Our approach is powerful yet efficient. It handles any set of lexical constraints that is expressible under predicate logic, while its asymptotic runtime is equivalent to conventional beam search. Empirical results on four benchmarks show that NeuroLogic Decoding outperforms previous approaches, including algorithms that handle a subset of our constraints. Moreover, we find that unsupervised models with NeuroLogic Decoding often outperform supervised models with conventional decoding, even when the latter is based on considerably larger networks. Our results suggest the limit of large-scale neural networks for fine-grained controllable generation and the promise of inference-time algorithms.
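
As a rough illustration of what "lexical constraints expressible under predicate logic" can look like, the sketch below checks a finished sentence against constraints written as an AND of OR-clauses over "word appears" / "word does not appear" literals. The clause layout, the names `literal_holds` and `satisfies`, and the sample sentences are assumptions made for this example; it only verifies a completed output and is not the decoding algorithm described in the abstract.

```python
from typing import List, Tuple

# A literal is (word, must_appear): ("dog", True) means "dog" should be present,
# ("cat", False) means "cat" should be absent. (Hypothetical representation.)
Literal = Tuple[str, bool]
Clause = List[Literal]  # literals in one clause are joined by OR


def literal_holds(tokens: List[str], literal: Literal) -> bool:
    """True if the word's presence in tokens matches what the literal asks for."""
    word, must_appear = literal
    return (word in tokens) == must_appear


def satisfies(tokens: List[str], clauses: List[Clause]) -> bool:
    """True if every clause (AND) contains at least one literal that holds (OR)."""
    return all(
        any(literal_holds(tokens, lit) for lit in clause) for clause in clauses
    )


# Hypothetical constraint: (dog OR dogs) AND (NOT cat)
constraints = [[("dog", True), ("dogs", True)], [("cat", False)]]
ok = satisfies("the dogs run in the park".split(), constraints)
bad = satisfies("the cat chases the dog".split(), constraints)
```

Here `ok` is true because "dogs" appears and "cat" does not, while `bad` is false because the second clause is violated.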

TuringAdvice: A Generative and Dynamic Evaluation of Language Use
Rowan Zellers | Ari Holtzman | Elizabeth Clark | Lianhui Qin | Ali Farhadi | Yejin Choi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today’s models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

Probing Contextual Language Models for Common Ground with Visual Representations
Gabriel Ilharco | Rowan Zellers | Ali Farhadi | Hannaneh Hajishirzi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent are contextual representations of concrete nouns aligned with corresponding visual representations? We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Moreover, they are effective in retrieving specific instances of image patches; textual context plays an important role in this process. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans. We hope our analyses inspire future research in understanding and improving the visual capabilities of language models.

Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
Jeff Da | Maxwell Forbes | Rowan Zellers | Anthony Zheng | Jena D. Hwang | Antoine Bosselut | Yejin Choi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Understanding manipulated media, from automatically generated ‘deepfakes’ to manually edited ones, raises novel research challenges. Because the vast majority of edited or manipulated images are benign, such as photoshopped images for visual enhancements, the key challenge is to understand the complex layers of underlying intents of media edits and their implications with respect to disinformation. In this paper, we study Edited Media Frames, a new formalism to understand visual media manipulation as structured annotations with respect to the intents, emotional reactions, attacks on individuals, and the overall implications of disinformation. We introduce a dataset for our task, EMU, with 56k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 48.2% of the time. At the same time, there is still much work to be done – and we provide analysis that highlights areas for further progress.

PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World
Rowan Zellers | Ari Holtzman | Matthew Peters | Roozbeh Mottaghi | Aniruddha Kembhavi | Ali Farhadi | Yejin Choi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don’t. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence, simulate neurally what might happen next, and then communicate that result through a literal symbolic representation, or natural language. Experimental results show that our model effectively learns world dynamics, along with how to communicate them. It is able to correctly forecast what happens next given an English sentence over 80% of the time, outperforming a 100x larger, text-to-text approach by over 10%. Likewise, its natural language summaries of physical interactions are also judged by humans as more accurate than LM alternatives. We present comprehensive analysis showing room for future work.


HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers | Ari Holtzman | Yonatan Bisk | Ali Farhadi | Yejin Choi
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as “A woman sits at a piano,” a machine must select the most likely followup: “She sets her fingers on the keys.” With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human-level commonsense inference? In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical ‘Goldilocks’ zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models. Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.


SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
Rowan Zellers | Yonatan Bisk | Roy Schwartz | Yejin Choi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Given a partial description like “she opened the hood of the car,” humans can reason about the situation and anticipate what might come next (“then, she examined the engine”). In this paper, we introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning. We present SWAG, a new dataset with 113k multiple choice questions about a rich spectrum of grounded situations. To address the recurring challenges of the annotation artifacts and human biases found in many existing datasets, we propose Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data. To account for the aggressive adversarial filtering, we use state-of-the-art language models to massively oversample a diverse set of potential counterfactuals. Empirical results demonstrate that while humans can solve the resulting inference problems with high accuracy (88%), various competitive models struggle on our task. We provide comprehensive analysis that indicates significant opportunities for future research.
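
The filtering loop described in this abstract (train a classifier, use it to filter the data, repeat) could be sketched roughly as follows. This is a heavily simplified illustration under stated assumptions, not the authors' procedure: the function `adversarial_filter`, the stand-in word-overlap "classifier" `toy_train`, the random half-and-half split, the rule of keeping the `k` least detectable candidates, and all sample sentences are hypothetical.

```python
import random
from typing import Callable, List

Scorer = Callable[[str], float]  # higher score = easier to flag as machine-written


def adversarial_filter(
    real: List[str],            # one human-written ending per question
    pools: List[List[str]],     # oversampled wrong endings, one pool per question
    train: Callable[[List[str], List[str]], Scorer],
    k: int = 2,
    rounds: int = 3,
    seed: int = 0,
) -> List[List[str]]:
    """Pick k wrong endings per question that the current classifier finds hard."""
    rng = random.Random(seed)
    chosen = [pool[:k] for pool in pools]  # start from the first k candidates
    indices = list(range(len(pools)))
    for _ in range(rounds):
        # Random split: train on one half, filter the other half.
        rng.shuffle(indices)
        half = len(indices) // 2
        train_ids, filter_ids = indices[:half], indices[half:]
        negatives = [ending for i in train_ids for ending in chosen[i]]
        scorer = train([real[i] for i in train_ids], negatives)
        for i in filter_ids:
            # Keep the k candidates this round's classifier finds least detectable.
            chosen[i] = sorted(pools[i], key=scorer)[:k]
    return chosen


def toy_train(real_endings: List[str], fake_endings: List[str]) -> Scorer:
    """Stand-in 'stylistic classifier': scores by words seen only in fake endings."""
    real_words = {w for e in real_endings for w in e.split()}
    fake_only = {w for e in fake_endings for w in e.split()} - real_words

    def score(ending: str) -> float:
        words = ending.split()
        return sum(w in fake_only for w in words) / max(len(words), 1)

    return score


# Hypothetical toy data.
real = ["she examined the engine", "he poured the coffee",
        "they sat down", "it began to rain"]
pools = [
    ["she flew the engine away", "she examined the hood", "she ate the car"],
    ["he poured the table", "he drank the coffee", "he sang the cup"],
    ["they sat up high", "they stood down", "they walked away"],
    ["it began to sing", "it stopped to rain", "it started to snow"],
]
result = adversarial_filter(real, pools, toy_train, k=2, rounds=3)
```

Each entry of `result` holds the two wrong endings retained for the corresponding question; a real setup would replace `toy_train` with an actual trained classifier.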


Zero-Shot Activity Recognition with Verb Attribute Induction
Rowan Zellers | Yejin Choi
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this paper, we investigate large-scale zero-shot activity recognition by modeling the visual and linguistic attributes of action verbs. For example, the verb “salute” has several properties, such as being a light movement, a social act, and short in duration. We use these attributes as the internal mapping between visual and textual representations to reason about a previously unseen action. In contrast to much prior work that assumes access to gold standard attributes for zero-shot classes and focuses primarily on object attributes, our model uniquely learns to infer action attributes from dictionary definitions and distributed word representations. Experimental results confirm that action attributes inferred from language can provide a predictive signal for zero-shot prediction of previously unseen activities.