Yoav Artzi


2024

pdf bib
CoGen: Learning from Feedback with Coupled Comprehension and Generation
Mustafa Omer Gul | Yoav Artzi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Systems with both language comprehension and generation capabilities can benefit from the tight connection between the two. This work studies coupling comprehension and generation with focus on continually learning from interaction with users. We propose techniques to tightly integrate the two capabilities for both learning and inference. We situate our studies in two-player reference games, and deploy various models for thousands of interactions with human users, while learning from interaction feedback signals. We show dramatic improvements in performance over time, with comprehension-generation coupling leading to performance improvements up to 26% in absolute terms and up to 17% higher accuracies compared to a non-coupled system. Our analysis also shows coupling has substantial qualitative impact on the system’s language, making it significantly more human-like.

2023

pdf bib
lilGym: Natural Language Visual Reasoning with Reinforcement Learning
Anne Wu | Kiante Brantley | Noriyuki Kojima | Yoav Artzi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly-compositional human-written natural language statements grounded in an interactive visual environment. We introduce a new approach for exact reward computation in every possible world state by annotating all statements with executable Python programs. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty. We experiment with lilGym with different models and learning regimes. Our results and analysis show that while existing methods are able to achieve non-trivial performance, lilGym forms a challenging open problem. lilGym is available at https://lil.nlp.cornell.edu/lilgym/.

pdf bib
CB2: Collaborative Natural Language Interaction Research Platform
Jacob Sharf | Mustafa Omer Gul | Yoav Artzi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

CB2 is a multi-agent platform to study collaborative natural language interaction in a grounded task-oriented scenario. It includes a 3D game environment, a backend server designed to serve trained models to human agents, and various tools and processes to enable scalable studies. We deploy CB2 at https://cb2.ai as a system demonstration with a learned instruction following model.

pdf bib
Continually Improving Extractive QA via Human Feedback
Ge Gao | Hung-Ting Chen | Yoav Artzi | Eunsol Choi
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We study continually improving an extractive question answering (QA) system via human user feedback. We design and deploy an iterative approach, where information-seeking users ask questions, receive model-predicted answers, and provide feedback. We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time. Our experiments show effective improvement from user feedback of extractive QA models over time across different data regimes, including significant potential for domain adaptation.

2022

pdf bib
Simulating Bandit Learning from User Feedback for Extractive Question Answering
Ge Gao | Eunsol Choi | Yoav Artzi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We study learning from user feedback for extractive question answering by simulating feedback using supervised data. We cast the problem as contextual bandit learning, and analyze the characteristics of several learning scenarios with focus on reducing data annotation. We show that systems initially trained on few examples can dramatically improve given feedback from users on model-predicted answers, and that one can use existing datasets to deploy systems in new domains without any annotation effort, but instead improving the system on-the-fly via user feedback.

pdf bib
Abstract Visual Reasoning with Tangram Shapes
Anya Ji | Noriyuki Kojima | Noah Rush | Alane Suhr | Wai Keen Vong | Robert Hawkins | Yoav Artzi
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We introduce KiloGram, a resource for studying abstract visual reasoning in humans and machines. Drawing on the history of tangram puzzles as stimuli in cognitive science, we build a richly annotated dataset that, with >1k distinct stimuli, is orders of magnitude larger and more diverse than prior resources. It is both visually and linguistically richer, moving beyond whole shape descriptions to include segmentation maps and part labels. We use this resource to evaluate the abstract visual reasoning capacities of recent multi-modal models. We observe that pre-trained weights demonstrate limited abstract reasoning, which dramatically improves with fine-tuning. We also observe that explicitly describing parts aids abstract reasoning for both humans and models, especially when jointly encoding the linguistic and visual inputs.

pdf bib
Analysis of Language Change in Collaborative Instruction Following
Anna Effenberger | Eva Yan | Rhia Singh | Alane Suhr | Yoav Artzi
Proceedings of the Society for Computation in Linguistics 2022

2021

pdf bib
Continual Learning for Grounded Instruction Generation by Observing Human Following Behavior
Noriyuki Kojima | Alane Suhr | Yoav Artzi
Transactions of the Association for Computational Linguistics, Volume 9

We study continual learning for natural language instruction generation, by observing human users’ instruction execution. We focus on a collaborative scenario, where the system both acts and delegates tasks to human users using natural language. We compare user execution of generated instructions to the original system intent as an indication of the system’s success in communicating its intent. We show how to use this signal to improve the system’s ability to generate instructions via contextual bandit learning. In interaction with real users, our system demonstrates dramatic improvements in its ability to generate language over time.

pdf bib
When in Doubt: Improving Classification Performance with Alternating Normalization
Menglin Jia | Austin Reiter | Ser-Nam Lim | Yoav Artzi | Claire Cardie
Findings of the Association for Computational Linguistics: EMNLP 2021

We introduce Classification with Alternating Normalization (CAN), a non-parametric post-processing step for classification. CAN improves classification accuracy for challenging examples by re-adjusting their predicted class probability distribution using the predicted class distributions of high-confidence validation examples. CAN is easily applicable to any probabilistic classifier, with minimal computation overhead. We analyze the properties of CAN using simulated experiments, and empirically demonstrate its effectiveness across a diverse set of classification tasks.

pdf bib
Analysis of Language Change in Collaborative Instruction Following
Anna Effenberger | Rhia Singh | Eva Yan | Alane Suhr | Yoav Artzi
Findings of the Association for Computational Linguistics: EMNLP 2021

We analyze language change over time in a collaborative, goal-oriented instructional task, where utility-maximizing participants form conventions and increase their expertise. Prior work studied such scenarios mostly in the context of reference games, and consistently found that language complexity is reduced along multiple dimensions, such as utterance length, as conventions are formed. In contrast, we find that, given the ability to increase instruction utility, instructors increase language complexity along these previously studied dimensions to better collaborate with increasingly skilled instruction followers.

pdf bib
Crowdsourcing Beyond Annotation: Case Studies in Benchmark Data Collection
Alane Suhr | Clara Vania | Nikita Nangia | Maarten Sap | Mark Yatskar | Samuel R. Bowman | Yoav Artzi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Crowdsourcing from non-experts is one of the most common approaches to collecting data and annotations in NLP. Even though it is such a fundamental tool in NLP, crowdsourcing use is largely guided by common practices and the personal experience of researchers. Developing a theory of crowdsourcing use for practical language problems remains an open challenge. However, there are various principles and practices that have proven effective in generating high quality and diverse data. This tutorial exposes NLP researchers to such data collection crowdsourcing methods and principles through a detailed discussion of a diverse set of case studies. The selection of case studies focuses on challenging settings where crowdworkers are asked to write original text or otherwise perform relatively unconstrained work. Through these case studies, we discuss in detail processes that were carefully designed to achieve data with specific properties, for example to require logical inference, grounded reasoning or conversational understanding. Each case study focuses on data collection crowdsourcing protocol details that often receive limited attention in research presentations, for example in conferences, but are critical for research success.

2020

pdf bib
What is Learned in Visually Grounded Neural Syntax Acquisition
Noriyuki Kojima | Hadar Averbuch-Elor | Alexander Rush | Yoav Artzi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Visual features are a promising signal for learning bootstrap textual models. However, blackbox learning models make it difficult to isolate the specific contribution of visual components. In this analysis, we consider the case study of the Visually Grounded Neural Syntax Learner (Shi et al., 2019), a recent approach for learning syntax from a visual training signal. By constructing simplified versions of the model, we isolate the core factors that yield the model’s strong performance. Contrary to what the model might be capable of learning, we find significantly less expressive versions produce similar predictions and perform just as well, or even better. We also find that a simple lexical signal of noun concreteness plays the main role in the model’s predictions as opposed to more complex syntactic reasoning.

pdf bib
Interactive Classification by Asking Informative Questions
Lili Yu | Howard Chen | Sida I. Wang | Tao Lei | Yoav Artzi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We study the potential for interaction in natural language classification. We add a limited form of interaction for intent classification, where users provide an initial query using natural language, and the system asks for additional information using binary or multi-choice questions. At each turn, our system decides between asking the most informative question or making the final classification prediction. The simplicity of the model allows for bootstrapping of the system without interaction data, instead relying on simple crowd-sourcing tasks. We evaluate our approach on two domains, showing the benefit of interaction and the advantage of learning to balance between asking additional questions and making the final prediction.

pdf bib
Retouchdown: Releasing Touchdown on StreetLearn as a Public Resource for Language Grounding Tasks in Street View
Harsh Mehta | Yoav Artzi | Jason Baldridge | Eugene Ie | Piotr Mirowski
Proceedings of the Third International Workshop on Spatial Language Understanding

The Touchdown dataset (Chen et al., 2019) provides instructions by human annotators for navigation through New York City streets and for resolving spatial descriptions at a given location. To enable the wider research community to work effectively with the Touchdown tasks, we are publicly releasing the 29k raw Street View panoramas needed for Touchdown. We follow the process used for the StreetLearn data release (Mirowski et al., 2019) to check panoramas for personally identifiable information and blur them as necessary. These have been added to the StreetLearn dataset and can be obtained via the same process as used previously for StreetLearn. We also provide a reference implementation for both Touchdown tasks: vision and language navigation (VLN) and spatial description resolution (SDR). We compare our model results to those given in (Chen et al., 2019) and show that the panoramas we have added to StreetLearn support both Touchdown tasks and can be used effectively for further research and comparison.

pdf bib
Evaluating Models’ Local Decision Boundaries via Contrast Sets
Matt Gardner | Yoav Artzi | Victoria Basmov | Jonathan Berant | Ben Bogin | Sihao Chen | Pradeep Dasigi | Dheeru Dua | Yanai Elazar | Ananth Gottumukkala | Nitish Gupta | Hannaneh Hajishirzi | Gabriel Ilharco | Daniel Khashabi | Kevin Lin | Jiangming Liu | Nelson F. Liu | Phoebe Mulcaire | Qiang Ning | Sameer Singh | Noah A. Smith | Sanjay Subramanian | Reut Tsarfaty | Eric Wallace | Ally Zhang | Ben Zhou
Findings of the Association for Computational Linguistics: EMNLP 2020

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model’s decision boundary, which can be used to more accurately evaluate a model’s true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.

2019

pdf bib
A Corpus for Reasoning about Natural Language Grounded in Photographs
Alane Suhr | Stephanie Zhou | Ally Zhang | Iris Zhang | Huajun Bai | Yoav Artzi
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.

pdf bib
Executing Instructions in Situated Collaborative Interactions
Alane Suhr | Claudia Yan | Jack Schluger | Stanley Yu | Hadi Khader | Marwa Mouallem | Iris Zhang | Yoav Artzi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We study a collaborative scenario where a user not only instructs a system to complete tasks, but also acts alongside it. This allows the user to adapt to the system abilities by changing their language or deciding to simply accomplish some tasks themselves, and requires the system to effectively recover from errors as the user strategically assigns it new goals. We build a game environment to study this scenario, and learn to map user instructions to system actions. We introduce a learning approach focused on recovery from cascading errors between instructions, and modeling methods to explicitly reason about instructions with multiple goals. We evaluate with a new evaluation protocol using recorded interactions and online games with human users, and observe how users adapt to the system abilities.

2018

pdf bib
Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies
Max Grusky | Mor Naaman | Yoav Artzi
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications. Extracted from search and social media metadata between 1998 and 2017, these high-quality summaries demonstrate high diversity of summarization styles. In particular, the summaries combine abstractive and extractive strategies, borrowing words and phrases from articles at varying rates. We analyze the extraction strategies used in NEWSROOM summaries against other datasets to quantify the diversity and difficulty of our new data, and train existing methods on the data to evaluate its utility and challenges. The dataset is available online at summari.es.

pdf bib
Learning to Map Context-Dependent Sentences to Executable Formal Queries
Alane Suhr | Srinivasan Iyer | Yoav Artzi
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We propose a context-dependent model to map utterances within an interaction to executable formal queries. To incorporate interaction history, the model maintains an interaction-level encoder that updates after each turn, and can copy sub-sequences of previously predicted queries during generation. Our approach combines implicit and explicit modeling of references between utterances. We evaluate our model on the ATIS flight planning interactions, and demonstrate the benefits of modeling context and explicit references.

pdf bib
Situated Mapping of Sequential Instructions to Actions with Single-step Reward Observation
Alane Suhr | Yoav Artzi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a learning approach for mapping context-dependent sequential instructions to actions. We address the problem of discourse and state dependencies with an attention-based model that considers both the history of the interaction and the state of the world. To train from start and goal states without access to demonstrations, we propose SESTRA, a learning algorithm that takes advantage of single-step reward observations and immediate expected reward maximization. We evaluate on the SCONE domains, and show absolute accuracy improvements of 9.8%-25.3% across the domains over approaches that use high-level logical representations.

pdf bib
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Yoav Artzi | Jacob Eisenstein
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

pdf bib
Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction
Dipendra Misra | Andrew Bennett | Valts Blukis | Eyvind Niklasson | Max Shatkhin | Yoav Artzi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We propose to decompose instruction execution to goal prediction and action generation. We design a model that maps raw visual observations to goals using LINGUNET, a language-conditioned image generation network, and then generates the actions required to complete them. Our model is trained from demonstration only without external resources. To evaluate our approach, we introduce two benchmarks for instruction following: LANI, a navigation task; and CHAI, where an agent executes household instructions. Our evaluation demonstrates the advantages of our model decomposition, and illustrates the challenges posed by our new benchmarks.

pdf bib
Simple Recurrent Units for Highly Parallelizable Recurrence
Tao Lei | Yu Zhang | Sida I. Wang | Hui Dai | Yoav Artzi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation, and comes with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves 5-9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average of 0.7 BLEU improvement over the Transformer model (Vaswani et al., 2017) on translation by incorporating SRU into the architecture.

2017

pdf bib
Mapping Instructions and Visual Observations to Actions with Reinforcement Learning
Dipendra Misra | John Langford | Yoav Artzi
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We propose to directly map raw visual observations and text input to actions for instruction execution. While existing approaches assume access to structured environment representations or use a pipeline of separately trained models, we learn a single model to jointly reason about linguistic and visual input. We use reinforcement learning in a contextual bandit setting to train a neural network agent. To guide the agent’s exploration, we use reward shaping with different forms of supervision. Our approach does not require intermediate representations, planning procedures, or training different models. We evaluate in a simulated environment, and show significant improvements over supervised learning and common reinforcement learning variants.

pdf bib
Proceedings of the First Workshop on Language Grounding for Robotics
Mohit Bansal | Cynthia Matuszek | Jacob Andreas | Yoav Artzi | Yonatan Bisk
Proceedings of the First Workshop on Language Grounding for Robotics

pdf bib
A Corpus of Natural Language for Visual Reasoning
Alane Suhr | Mike Lewis | James Yeh | Yoav Artzi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present a new visual reasoning language dataset, containing 92,244 pairs of examples of natural statements grounded in synthetic images with 3,962 unique sentences. We describe a method of crowdsourcing linguistically-diverse data, and present an analysis of our data. The data demonstrates a broad set of linguistic phenomena, requiring visual and set-theoretic reasoning. We experiment with various models, and show the data presents a strong challenge for future research.

2016

pdf bib
Neural Shift-Reduce CCG Semantic Parsing
Dipendra Kumar Misra | Yoav Artzi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

pdf bib
Event Detection and Factuality Assessment with Non-Expert Supervision
Kenton Lee | Yoav Artzi | Yejin Choi | Luke Zettlemoyer
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Broad-coverage CCG Semantic Parsing with AMR
Yoav Artzi | Kenton Lee | Luke Zettlemoyer
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

pdf bib
Proceedings of the ACL 2014 Workshop on Semantic Parsing
Yoav Artzi | Tom Kwiatkowski | Jonathan Berant
Proceedings of the ACL 2014 Workshop on Semantic Parsing

pdf bib
Learning Compact Lexicons for CCG Semantic Parsing
Yoav Artzi | Dipanjan Das | Slav Petrov
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

bib
Semantic Parsing with Combinatory Categorial Grammars
Yoav Artzi | Nicholas FitzGerald | Luke Zettlemoyer
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Semantic parsers map natural language sentences to formal representations of their underlying meaning. Building accurate semantic parsers without prohibitive engineering costs is a long-standing, open research problem. The tutorial will describe general principles for building semantic parsers. The presentation will be divided into two main parts: learning and modeling. In the learning part, we will describe a unified approach for learning Combinatory Categorial Grammar (CCG) semantic parsers, that induces both a CCG lexicon and the parameters of a parsing model. The approach learns from data with labeled meaning representations, as well as from more easily gathered weak supervision. It also enables grounded learning where the semantic parser is used in an interactive environment, for example to read and execute instructions. The modeling section will include best practices for grammar design and choice of semantic representation. We will motivate our use of lambda calculus as a language for building and representing meaning with examples from several domains. The ideas we will discuss are widely applicable. The semantic modeling approach, while implemented in lambda calculus, could be applied to many other formal languages. Similarly, the algorithms for inducing CCG focus on tasks that are formalism independent, learning the meaning of words and estimating parsing parameters. No prior knowledge of CCG is required. The tutorial will be backed by implementation and experiments in the University of Washington Semantic Parsing Framework (UW SPF, http://yoavartzi.com/spf).

pdf bib
Learning to Automatically Solve Algebra Word Problems
Nate Kushman | Yoav Artzi | Luke Zettlemoyer | Regina Barzilay
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Context-dependent Semantic Parsing for Time Expressions
Kenton Lee | Yoav Artzi | Jesse Dodge | Luke Zettlemoyer
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

pdf bib
Scaling Semantic Parsers with On-the-Fly Ontology Matching
Tom Kwiatkowski | Eunsol Choi | Yoav Artzi | Luke Zettlemoyer
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Learning Distributions over Logical Forms for Referring Expression Generation
Nicholas FitzGerald | Yoav Artzi | Luke Zettlemoyer
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Semantic Parsing with Combinatory Categorial Grammars
Yoav Artzi | Nicholas FitzGerald | Luke Zettlemoyer
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Tutorials)

pdf bib
Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions
Yoav Artzi | Luke Zettlemoyer
Transactions of the Association for Computational Linguistics, Volume 1

The context in which language is used provides a strong signal for learning to recover its meaning. In this paper, we show it can be used within a grounded CCG semantic parsing approach that learns a joint model of meaning and context for interpreting and executing natural language instructions, using various types of weak supervision. The joint nature provides crucial benefits by allowing situated cues, such as the set of visible objects, to directly influence learning. It also enables algorithms that learn while executing instructions, for example by trying to replicate human actions. Experiments on a benchmark navigational dataset demonstrate strong performance under differing forms of supervision, including correctly executing 60% more instruction sets relative to the previous state of the art.

2012

pdf bib
Predicting Responses to Microblog Posts
Yoav Artzi | Patrick Pantel | Michael Gamon
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2011

pdf bib
Bootstrapping Semantic Parsers from Conversations
Yoav Artzi | Luke Zettlemoyer
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing
