Unnat Jain
2026
Compositional Reasoning via Joint Image and Language Decomposition
Dwip Dalal | Madhav Kanda | Zhenhailong Wang | Heng Ji | Unnat Jain
Findings of the Association for Computational Linguistics: EACL 2026
Multimodal reasoning tasks such as visual question answering (VQA) require models to process both language and visual inputs. However, existing approaches typically decompose only language queries, treating images as monolithic inputs. We introduce REDI, a framework that jointly decomposes both images and questions into visual sub-domains (segmentation, material, depth, and color) with corresponding sub-questions. REDI uses an MLLM orchestrator to select the sub-domains required for each query, generate domain-specific sub-questions with grounded object references (via shared object labels), and fuse worker outputs via consistency-aware aggregation (verify–refine–override) to produce the final answer. This hierarchical multi-agent design mitigates error propagation and improves compositional reasoning across both open- and closed-source MLLMs. On SEEDBench, MMBench, and CLEVR, REDI achieves absolute accuracy improvements of 8.9%, 8.2%, and 16.0% over chain-of-thought and visual programming baselines. Project webpage: https://madhav-kanda.github.io/redi
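The orchestrate-then-aggregate flow described in the abstract can be sketched in miniature. This is an illustrative stand-in, not the paper's implementation: the cue-word domain selector and the majority-vote aggregator are hypothetical simplifications of the MLLM orchestrator and the verify–refine–override fusion step.

```python
from collections import Counter

# Hypothetical sketch of REDI-style orchestration (all names illustrative).
SUB_DOMAINS = ["segmentation", "material", "depth", "color"]

def select_sub_domains(question: str) -> list[str]:
    # Stand-in for the MLLM orchestrator's domain-selection step:
    # pick the visual sub-domains whose cue words appear in the question.
    cues = {
        "segmentation": ["object", "shape", "left", "right"],
        "material": ["metal", "rubber", "material"],
        "depth": ["behind", "front", "closer", "farther"],
        "color": ["color", "red", "blue", "green"],
    }
    q = question.lower()
    chosen = [d for d, words in cues.items() if any(w in q for w in words)]
    return chosen or ["segmentation"]  # fall back to the base domain

def aggregate(worker_answers: list[str]) -> str:
    # Consistency-aware aggregation, reduced here to a majority vote;
    # the paper's verify-refine-override step instead uses an MLLM to
    # verify agreement and refine or override inconsistent answers.
    return Counter(worker_answers).most_common(1)[0][0]
```

For a query like "What color is the metal cube behind the sphere?", the selector routes to the material, depth, and color workers, and the aggregator fuses their per-sub-question answers into the final response.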
2021
Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments
Sonia Raychaudhuri | Saim Wani | Shivansh Patel | Unnat Jain | Angel Chang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
In the Vision-and-Language Navigation (VLN) task, an embodied agent navigates a 3D environment by following natural language instructions. A challenge in this task is handling ‘off the path’ scenarios, where an agent veers from the reference path. Prior work supervises the agent with actions based on the shortest path from the agent’s location to the goal, but such goal-oriented supervision often does not align with the instruction. Furthermore, the evaluation metrics employed by prior work do not measure how much of a language instruction the agent is able to follow. In this work, we propose a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.
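A sub-instruction completion metric of the kind the abstract describes can be sketched as follows. This is a minimal illustration under assumed conventions (one waypoint per sub-instruction, 2D positions, a fixed success radius), not the paper's exact definition.

```python
import math

def dist(a: tuple, b: tuple) -> float:
    # Euclidean distance between two 2D positions.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def sub_instructions_completed(agent_path, waypoints, radius=3.0):
    # Count how many sub-instructions the agent completed, where a
    # sub-instruction counts as completed if the agent's path passed
    # within `radius` of its associated waypoint. Waypoints are ordered,
    # and we require them to be completed in instruction order.
    done = 0
    for wp in waypoints:
        if any(dist(p, wp) <= radius for p in agent_path):
            done += 1
        else:
            break  # a skipped sub-instruction ends the count
    return done
```

An agent that reaches the first two waypoints but strays before the third scores 2, regardless of whether a shortest-path detour would eventually reach the goal, which is what distinguishes this from purely goal-oriented metrics.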