Jesse Thomason


2023

Task-Attentive Transformer Architecture for Continual Learning of Vision-and-Language Tasks Using Knowledge Distillation
Yuliang Cai | Jesse Thomason | Mohammad Rostami
Findings of the Association for Computational Linguistics: EMNLP 2023

The size and the computational load of fine-tuning large-scale pre-trained neural networks are becoming two major obstacles to adopting machine learning in many applications. Continual learning (CL) can serve as a remedy by enabling knowledge transfer across sequentially arriving tasks, which relaxes the need to fine-tune all network weights from scratch. However, existing CL algorithms primarily consider learning unimodal vision-only or language-only tasks. We develop a transformer-based CL architecture for learning bimodal vision-and-language tasks based on dynamically increasing the number of learnable parameters and using knowledge distillation. The additional parameters are used to specialize the network for each task. Our approach enables sharing information between the tasks while addressing the challenge of catastrophic forgetting. Our approach scales to a large number of tasks because it requires little memory and time overhead. Our model reaches state-of-the-art performance on challenging vision-and-language tasks.

Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering
Wang Zhu | Jesse Thomason | Robin Jia
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We propose Chain-of-Questions, a framework that trains a model to robustly answer multistep questions by generating and answering sub-questions. We obtain supervision for sub-questions from human-annotated question decomposition meaning representation (QDMR), but QDMR does not include annotated answers to sub-questions. To overcome this technical challenge, we treat sub-answers as latent variables and infer them with a novel dynamic mixture of Hard-EM and MAPO. Chain-of-Questions is effective and robust, greatly outperforming strong neuro-symbolic methods by 9.0 F1 on a DROP contrast set and GPT-3.5 by 24.3 F1 on a HotpotQA adversarial set.

Improving Sign Recognition with Phonology
Lee Kezar | Jesse Thomason | Zed Sehyr
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

We use insights from research on American Sign Language (ASL) phonology to train models for isolated sign language recognition (ISLR), a step towards automatic sign language understanding. Our key insight is to explicitly recognize the role of phonology in sign production to achieve more accurate ISLR than existing work which does not consider sign language phonology. We train ISLR models that take in pose estimations of a signer producing a single sign to predict not only the sign but additionally its phonological characteristics, such as the handshape. These auxiliary predictions lead to a nearly 9% absolute gain in sign recognition accuracy on the WLASL benchmark, with consistent improvements in ISLR regardless of the underlying prediction model architecture. This work has the potential to accelerate linguistic research in the domain of signed languages and reduce communication barriers between deaf and hearing people.

2022

Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems
Wang Zhu | Jesse Thomason | Robin Jia
Findings of the Association for Computational Linguistics: EMNLP 2022

For vision-and-language reasoning tasks, both fully connectionist, end-to-end methods and hybrid, neuro-symbolic methods have achieved high in-distribution performance. In which out-of-distribution settings does each paradigm excel? We investigate this question on both single-image and multi-image visual question-answering through four types of generalization tests: a novel segment-combine test for multi-image queries, contrast set, compositional generalization, and cross-benchmark transfer. Vision-and-language end-to-end trained systems exhibit sizeable performance drops across all these tests. Neuro-symbolic methods suffer even more on cross-benchmark transfer from GQA to VQA, but they show smaller accuracy drops on the other generalization tests and their performance quickly improves by few-shot training. Overall, our results demonstrate the complementary benefits of these two paradigms, and emphasize the importance of using a diverse suite of generalization tests to fully characterize model robustness to distribution shift.

Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions
Jing Gu | Eliana Stefani | Qi Wu | Jesse Thomason | Xin Wang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and receives increasing attention from natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, methods, etc. Through structured analysis of current progress and challenges, we also highlight the limitations of current VLN and opportunities for future work. This paper serves as a thorough reference for the VLN research community.

ALFRED-L: Investigating the Role of Language for Action Learning in Interactive Visual Environments
Arjun Akula | Spandana Gella | Aishwarya Padmakumar | Mahdi Namazifar | Mohit Bansal | Jesse Thomason | Dilek Hakkani-Tur
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Embodied Vision and Language Task Completion requires an embodied agent to interpret natural language instructions and egocentric visual observations to navigate through and interact with environments. In this work, we examine ALFRED, a challenging benchmark for embodied task completion, with the goal of gaining insight into how effectively models utilize language. We find evidence that sequence-to-sequence and transformer-based models trained on this benchmark are not sufficiently sensitive to changes in input language instructions. Next, we construct a new test split, ALFRED-L, to test whether ALFRED models can generalize to task structures not seen during training that intuitively require the same types of language understanding required in ALFRED. Evaluation of existing models on ALFRED-L suggests that (a) models are overly reliant on the sequence in which objects are visited in typical ALFRED trajectories and fail to adapt to modifications of this sequence and (b) models trained with additional augmented trajectories are able to adapt relatively better to such changes in input language instructions.

2020

Experience Grounds Language
Yonatan Bisk | Ari Holtzman | Jesse Thomason | Jacob Andreas | Yoshua Bengio | Joyce Chai | Mirella Lapata | Angeliki Lazaridou | Jonathan May | Aleksandr Nisnevich | Nicolas Pinto | Joseph Turian
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Language understanding research is held back by a failure to relate language to the physical world it describes and to the social interactions it facilitates. Despite the incredible effectiveness of language processing models to tackle tasks after being trained on text alone, successful linguistic communication relies on a shared experience of the world. It is this shared experience that makes utterances meaningful. Natural language processing is a diverse field, and progress throughout its development has come from new representational theories, modeling techniques, data collection paradigms, and tasks. We posit that the present success of representation learning approaches trained on large, text-only corpora requires the parallel tradition of research on the broader physical and social context of language to address the deeper questions of communication.

Proceedings of the First Workshop on Advances in Language and Vision Research
Xin Wang | Jesse Thomason | Ronghang Hu | Xinlei Chen | Peter Anderson | Qi Wu | Asli Celikyilmaz | Jason Baldridge | William Yang Wang
Proceedings of the First Workshop on Advances in Language and Vision Research

RMM: A Recursive Mental Model for Dialogue Navigation
Homero Roman Roman | Yonatan Bisk | Jesse Thomason | Asli Celikyilmaz | Jianfeng Gao
Findings of the Association for Computational Linguistics: EMNLP 2020

Language-guided robots must be able to both ask humans questions and understand answers. Much existing work focuses only on the latter. In this paper, we go beyond instruction following and introduce a two-agent task where one agent navigates and asks questions that a second, guiding agent answers. Inspired by theory of mind, we propose the Recursive Mental Model (RMM). The navigating agent models the guiding agent to simulate answers given candidate generated questions. The guiding agent in turn models the navigating agent to simulate navigation steps it would take to generate answers. We use the progress agents make towards the goal as a reinforcement learning reward signal to directly inform not only navigation actions, but also both question and answer generation. We demonstrate that RMM enables better generalization to novel environments. Interlocutor modelling may be a way forward for human-agent dialogue where robots need to both ask and answer questions.

2019

Shifting the Baseline: Single Modality Performance on Visual Navigation & QA
Jesse Thomason | Daniel Gordon | Yonatan Bisk
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We demonstrate the surprising strength of unimodal baselines in multimodal domains, and make concrete recommendations for best practices in future research. Where existing work often compares against random or majority class baselines, we argue that unimodal approaches better capture and reflect dataset biases and therefore provide an important comparison when assessing the performance of multimodal techniques. We present unimodal ablations on three recent datasets in visual navigation and QA, observing up to a 29% absolute gain in performance over published baselines.

Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) and Grounded Communication for Robotics (RoboNLP)
Archna Bhatia | Yonatan Bisk | Parisa Kordjamshidi | Jesse Thomason
Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) and Grounded Communication for Robotics (RoboNLP)

2017

Guiding Interaction Behaviors for Multi-modal Grounded Language Learning
Jesse Thomason | Jivko Sinapov | Raymond Mooney
Proceedings of the First Workshop on Language Grounding for Robotics

Multi-modal grounded language learning connects language predicates to physical properties of objects in the world. Sensing with multiple modalities, such as audio, haptics, and visual colors and shapes, while performing interaction behaviors like lifting, dropping, and looking at objects enables a robot to ground non-visual predicates like “empty” as well as visual predicates like “red”. Previous work has established that grounding in multi-modal space improves performance on object retrieval from human descriptions. In this work, we gather behavior annotations from humans and demonstrate that these improve language grounding performance by allowing a system to focus on relevant behaviors for words like “white” or “half-full” that can be understood by looking or lifting, respectively. We also explore adding modality annotations (whether to focus on audio or haptics when performing a behavior), which improves performance, and sharing information between linguistically related predicates (if “green” is a color, “white” is a color), which improves grounding recall but at the cost of precision.

Integrated Learning of Dialog Strategies and Semantic Parsing
Aishwarya Padmakumar | Jesse Thomason | Raymond J. Mooney
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Natural language understanding and dialog management are two integral components of interactive dialog systems. Previous research has used machine learning techniques to individually optimize these components, with different forms of direct and indirect supervision. We present an approach to integrate the learning of both a dialog strategy using reinforcement learning, and a semantic parser for robust natural language understanding, using only natural dialog interaction for supervision. Experimental results on a simulated task of robot instruction demonstrate that joint learning of both components improves dialog performance over learning either of these components alone.

Improving Black-box Speech Recognition using Semantic Parsing
Rodolfo Corona | Jesse Thomason | Raymond Mooney
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Speech is a natural channel for human-computer interaction in robotics and consumer applications. Natural language understanding pipelines that start with speech can have trouble recovering from speech recognition errors. Black-box automatic speech recognition (ASR) systems, built for general purpose use, are unable to take advantage of in-domain language models that could otherwise ameliorate these errors. In this work, we present a method for re-ranking black-box ASR hypotheses using an in-domain language model and semantic parser trained for a particular task. Our re-ranking method significantly improves both transcription accuracy and semantic understanding over a state-of-the-art ASR’s vanilla output.

2014

Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild
Jesse Thomason | Subhashini Venugopalan | Sergio Guadarrama | Kate Saenko | Raymond Mooney
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

Differences in User Responses to a Wizard-of-Oz versus Automated System
Jesse Thomason | Diane Litman
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies