Vasudev Lal


pdf bib
NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation
Shachar Rosenman | Vasudev Lal | Phillip Howard
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Despite impressive recent advances in text-to-image diffusion models, obtaining high-quality images often requires prompt engineering by humans who have developed expertise in using them. In this work, we present NeuroPrompts, an adaptive framework that automatically enhances a user’s prompt to improve the quality of generations produced by text-to-image models. Our framework utilizes constrained text decoding with a pre-trained language model that has been adapted to generate prompts similar to those produced by human prompt engineers. This approach enables higher-quality text-to-image generations and provides user control over stylistic features via constraint set specification. We demonstrate the utility of our framework by creating an interactive application for prompt enhancement and image generation using Stable Diffusion. Additionally, we conduct experiments utilizing a large dataset of human-engineered prompts for text-to-image generation and show that our approach automatically produces enhanced prompts that result in superior image quality. We make our code, a screencast video demo and a live demo instance of NeuroPrompts publicly available.


pdf bib
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Xiao Xu | Bei Li | Chenfei Wu | Shao-Yen Tseng | Anahita Bhiwandiwalla | Shachar Rosenman | Vasudev Lal | Wanxiang Che | Nan Duan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Two-Tower Vision-Language (VL) models have shown promising improvements on various downstream VL tasks. Although the most advanced work improves performance by building bridges between encoders, it suffers from ineffective layer-by-layer utilization of uni-modal representations and cannot flexibly exploit different levels of uni-modal semantic knowledge. In this work, we propose ManagerTower, a novel VL model architecture that gathers and combines the insights of pre-trained uni-modal experts at different levels. The managers introduced in each cross-modal layer can adaptively aggregate uni-modal semantic knowledge to facilitate more comprehensive cross-modal alignment and fusion. ManagerTower outperforms previous strong baselines both with and without Vision-Language Pre-training (VLP). With only 4M VLP data, ManagerTower achieves superior performances on various downstream VL tasks, especially 79.15% accuracy on VQAv2 Test-Std, 86.56% IR@1 and 95.64% TR@1 on Flickr30K. Code and checkpoints are available at


pdf bib
Opinion-based Relational Pivoting for Cross-domain Aspect Term Extraction
Ayal Klein | Oren Pereg | Daniel Korat | Vasudev Lal | Moshe Wasserblat | Ido Dagan
Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis

Domain adaptation methods often exploit domain-transferable input features, a.k.a. pivots. The task of Aspect and Opinion Term Extraction presents a special challenge for domain transfer: while opinion terms largely transfer across domains, aspects change drastically from one domain to another (e.g. from restaurants to laptops). In this paper, we investigate and establish empirically a prior conjecture, which suggests that the linguistic relations connecting opinion terms to their aspects transfer well across domains and therefore can be leveraged for cross-domain aspect term extraction. We present several analyses supporting this conjecture, via experiments with four linguistic dependency formalisms to represent relation patterns. Subsequently, we present an aspect term extraction method that drives models to consider opinion–aspect relations via explicit multitask objectives. This method provides significant performance gains, even on top of a prior state-of-the-art linguistically-informed model, which are shown in analysis to stem from the relational pivoting signal.

pdf bib
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation
Yongfei Liu | Chenfei Wu | Shao-Yen Tseng | Vasudev Lal | Xuming He | Nan Duan
Findings of the Association for Computational Linguistics: NAACL 2022

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performances on a broad scope of vision-language tasks after finetuning. Previous mainstream VLP approaches typically adopt a two-step strategy relying on external object detectors to encode images in a multi-modal Transformer framework, which suffer from restrictive object concept space, limited image context and inefficient computation. In this paper, we propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly. More importantly, we propose to perform object knowledge distillation to facilitate learning cross-modal alignment at different semantic levels. To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision: 1.) Object-guided masked vision modeling task focuses on enforcing object-aware representation learning in the multi-modal Transformer; 2.) Phrase-region alignment task aims to improve cross-modal alignment by utilizing the similarities between noun phrases and object labels in the linguistic space. Extensive experiments on a wide range of vision-language tasks demonstrate the efficacy of our proposed framework, and we achieve competitive or superior performances over the existing pretraining strategies.

pdf bib
NeuroCounterfactuals: Beyond Minimal-Edit Counterfactuals for Richer Data Augmentation
Phillip Howard | Gadi Singer | Vasudev Lal | Yejin Choi | Swabha Swayamdipta
Findings of the Association for Computational Linguistics: EMNLP 2022

While counterfactual data augmentation offers a promising step towards robust generalization in natural language processing, producing a set of counterfactuals that offer valuable inductive bias for models remains a challenge. Most existing approaches for producing counterfactuals, manual or automated, rely on small perturbations via minimal edits, resulting in simplistic changes. We introduce NeuroCounterfactuals, designed as loose counterfactuals, allowing for larger edits which result in naturalistic generations containing linguistic diversity, while still bearing similarity to the original document. Our novel generative approach bridges the benefits of constrained decoding, with those of language model adaptation for sentiment steering. Training data augmentation with our generations results in both in-domain and out-of-domain improvements for sentiment classification, outperforming even manually curated counterfactuals, under select settings. We further present detailed analyses to show the advantages of NeuroCounterfactuals over approaches involving simple, minimal edits.


pdf bib
InterpreT: An Interactive Visualization Tool for Interpreting Transformers
Vasudev Lal | Arden Ma | Estelle Aflalo | Phillip Howard | Ana Simoes | Daniel Korat | Oren Pereg | Gadi Singer | Moshe Wasserblat
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

With the increasingly widespread use of Transformer-based models for NLU/NLP tasks, there is growing interest in understanding the inner workings of these models, why they are so effective at a wide range of tasks, and how they can be further tuned and improved. To contribute towards this goal of enhanced explainability and comprehension, we present InterpreT, an interactive visualization tool for interpreting Transformer-based models. In addition to providing various mechanisms for investigating general model behaviours, novel contributions made in InterpreT include the ability to track and visualize token embeddings through each layer of a Transformer, highlight distances between certain token embeddings through illustrative plots, and identify task-related functions of attention heads by using new metrics. InterpreT is a task agnostic tool, and its functionalities are demonstrated through the analysis of model behaviours for two disparate tasks: Aspect Based Sentiment Analysis (ABSA) and the Winograd Schema Challenge (WSC).