Abhishek Das


Proceedings of the Fourth Workshop on Visually Grounded Interaction and Language
Cătălina Cangea | Abhishek Das | Drew Hudson | Jacob Krantz | Stefan Lee | Jiayuan Mao | Florian Strub | Alane Suhr | Erik Wijmans
Proceedings of the Fourth Workshop on Visually Grounded Interaction and Language


ABSA-Bench: Towards the Unified Evaluation of Aspect-based Sentiment Analysis Research
Abhishek Das | Wei Emma Zhang
Proceedings of the 18th Annual Workshop of the Australasian Language Technology Association

Aspect-Based Sentiment Analysis (ABSA) has gained much attention in recent years. It is the task of identifying fine-grained opinion polarity towards a specific aspect associated with a given target. However, there is no benchmarking platform that provides a unified environment with consistent evaluation criteria for ABSA, which makes fair comparisons difficult. In this work, we address this issue and define a benchmark, ABSA-Bench, by unifying the evaluation protocols and the pre-processed publicly available datasets in a Web-based platform. ABSA-Bench provides two means of evaluation: participants can submit either their predictions or their models for online evaluation. Performance is ranked on a leaderboard, and a discussion forum serves as a collaborative platform for academics and researchers to discuss queries.


Improving Generative Visual Dialog by Answering Diverse Questions
Vishvak Murahari | Prithvijit Chattopadhyay | Dhruv Batra | Devi Parikh | Abhishek Das
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Prior work on training generative Visual Dialog models with reinforcement learning (Das et al., ICCV 2017) has explored a Q-Bot-A-Bot image-guessing game and shown that this ‘self-talk’ approach can lead to improved performance at the downstream dialog-conditioned image-guessing task. However, this improvement saturates and starts degrading after a few rounds of interaction, and does not lead to a better Visual Dialog model. We find that this is due in part to repeated interactions between Q-Bot and A-Bot during self-talk, which are not informative with respect to the image. To improve this, we devise a simple auxiliary objective that incentivizes Q-Bot to ask diverse questions, thus reducing repetitions and in turn enabling A-Bot to explore a larger state space during RL, i.e., to be exposed to more visual concepts to talk about and more varied questions to answer. We evaluate our approach via a host of automatic metrics and human studies, and demonstrate that it leads to better dialog, i.e., dialog that is more diverse (i.e., less repetitive), consistent (i.e., has fewer conflicting exchanges), fluent (i.e., more human-like), and detailed, while remaining comparably image-relevant to prior work and ablations.


Connecting Language and Vision to Actions
Peter Anderson | Abhishek Das | Qi Wu
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment. To this end, recent advances at the intersection of language and vision have made incredible progress – from being able to generate natural language descriptions of images/videos, to answering questions about them, to even holding free-form conversations about visual content! However, while these agents can passively describe images or answer (a sequence of) questions about them, they cannot act in the world (what if I cannot answer a question from my current view, or I am asked to move or manipulate something?). Thus, the challenge now is to extend this progress in language and vision to embodied agents that take actions and actively interact with their visual environments. To reduce the entry barrier for new researchers, this tutorial will provide an overview of the growing number of multimodal tasks and datasets that combine textual and visual understanding. We will comprehensively review existing state-of-the-art approaches to selected tasks such as image captioning, visual question answering (VQA) and visual dialog, presenting the key architectural building blocks (such as co-attention) and novel algorithms (such as cooperative/adversarial games) used to train models for these tasks. We will then discuss some of the current and upcoming challenges of combining language, vision and actions, and introduce some recently-released interactive 3D simulation environments designed for this purpose.


Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?
Abhishek Das | Harsh Agrawal | Larry Zitnick | Devi Parikh | Dhruv Batra
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing