Rohan Pandey


Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
Rohan Pandey | Rulin Shao | Paul Pu Liang | Ruslan Salakhutdinov | Louis-Philippe Morency
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite recent progress towards scaling up multimodal vision-language models, these models are still known to struggle on compositional generalization benchmarks such as Winoground. We find that a critical component lacking from current vision-language models is relation-level alignment: the ability to match directional semantic relations in text (e.g., ‘mug in grass’) with spatial relationships in the image (e.g., the position of the mug relative to the grass). To tackle this problem, we show that relation alignment can be enforced by encouraging the language attention from ‘mug’ to ‘grass’ (capturing the semantic relation ‘in’) to match the visual attention from the mug to the grass (capturing the corresponding physical relation). Tokens and their corresponding objects are softly identified using a weighted mean of cross-modal attention. We prove that this notion of soft cross-modal equivalence is equivalent to enforcing congruence between vision and language attention matrices under a ‘change of basis’ provided by the cross-modal attention matrix. Intuitively, our approach projects visual attention into the language attention space to calculate its divergence from the actual language attention, and vice versa. We apply our Cross-modal Attention Congruence Regularization (CACR) loss to fine-tune UNITER and improve its Winoground Group score by 5.75 points.
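The 'change of basis' idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the use of plain Python lists, the row re-normalization, and the choice of a symmetric KL-style divergence are all assumptions made for illustration. Given a language self-attention matrix, a visual self-attention matrix, and the two cross-modal attention matrices, the sketch projects each self-attention matrix into the other modality's space and measures how far it lands from the actual attention there.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_normalize(m):
    """Rescale each row to sum to 1 so it can be read as an attention distribution."""
    return [[x / sum(row) for x in row] for row in m]

def kl_rows(p, q, eps=1e-9):
    """Mean row-wise KL divergence KL(p || q); eps guards against log(0)."""
    total = 0.0
    for p_row, q_row in zip(p, q):
        total += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p_row, q_row))
    return total / len(p)

def congruence_loss(lang_attn, vis_attn, lang_to_vis, vis_to_lang):
    """Hypothetical congruence penalty (an assumption, not the published loss).

    lang_attn:   n x n language self-attention
    vis_attn:    m x m visual self-attention
    lang_to_vis: n x m cross-modal attention (tokens attending to regions)
    vis_to_lang: m x n cross-modal attention (regions attending to tokens)
    """
    # Project visual attention into the language attention space.
    vis_in_lang = row_normalize(matmul(matmul(lang_to_vis, vis_attn), vis_to_lang))
    # And vice versa: project language attention into the visual attention space.
    lang_in_vis = row_normalize(matmul(matmul(vis_to_lang, lang_attn), lang_to_vis))
    return kl_rows(lang_attn, vis_in_lang) + kl_rows(vis_attn, lang_in_vis)

# Toy case: two tokens ('mug', 'grass') matched one-to-one with two image regions.
identity = [[1.0, 0.0], [0.0, 1.0]]
lang = [[0.7, 0.3], [0.2, 0.8]]
print(congruence_loss(lang, lang, identity, identity))                      # congruent attention
print(congruence_loss(lang, [[0.3, 0.7], [0.8, 0.2]], identity, identity))  # mismatched attention
```

In this toy setup the penalty vanishes when the two modalities attend in the same pattern and grows when the visual attention between the matched regions disagrees with the language attention between the tokens.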

Syntax-guided Neural Module Distillation to Probe Compositionality in Sentence Embeddings
Rohan Pandey
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Past work probing compositionality in sentence embedding models faces issues determining the causal impact of implicit syntax representations. Given a sentence, we construct a neural module net based on its syntax parse and train it end-to-end to approximate the sentence’s embedding generated by a transformer model. The distillability of a transformer to a Syntactic NeurAl Module Net (SynNaMoN) then captures whether syntax is a strong causal model of its compositional ability. Furthermore, we address questions about the geometry of semantic composition by specifying individual SynNaMoN modules’ internal architecture and linearity. We find differences in the distillability of various sentence embedding models that broadly correlate with their performance, but observe that distillability does not vary considerably by model size. We also present preliminary evidence that much syntax-guided composition in sentence embedding models is linear, and that non-linearities may serve primarily to handle non-compositional phrases.
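As a rough illustration of the setup described in this abstract, the sketch below composes word vectors bottom-up along a binary parse tree, using one purely linear module per syntactic label, and scores the result against a target sentence embedding. Everything here is assumed for illustration: the tuple-based tree encoding, the `compose` and `distillation_error` names, the pair-of-matrices module form, and the mean squared error. The abstract does not specify these details, and training the modules is omitted.

```python
def matvec(w, v):
    """Apply matrix w (list of rows) to vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def compose(tree, word_vecs, modules):
    """Recursively build a phrase vector from a binary parse tree.

    A leaf is a word string; an internal node is (label, left, right).
    modules[label] is a hypothetical pair (w_left, w_right) of matrices,
    i.e. a linear module with no non-linearity.
    """
    if isinstance(tree, str):
        return word_vecs[tree]
    label, left, right = tree
    w_left, w_right = modules[label]
    left_vec = matvec(w_left, compose(left, word_vecs, modules))
    right_vec = matvec(w_right, compose(right, word_vecs, modules))
    return [a + b for a, b in zip(left_vec, right_vec)]

def distillation_error(tree, word_vecs, modules, target):
    """Mean squared error between the module net's output and the target embedding."""
    pred = compose(tree, word_vecs, modules)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

# Toy example with 2-dimensional vectors and a single 'NP' module.
identity = [[1.0, 0.0], [0.0, 1.0]]
word_vecs = {"red": [1.0, 0.0], "mug": [0.0, 1.0]}
modules = {"NP": (identity, identity)}
tree = ("NP", "red", "mug")
print(compose(tree, word_vecs, modules))
print(distillation_error(tree, word_vecs, modules, target=[1.0, 1.0]))
```

Under this reading, a low error after training would indicate that the parse-guided, linear composition reproduces the transformer's embedding well; swapping in modules with a non-linearity would be the natural comparison point.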


Athena 2.0: Contextualized Dialogue Management for an Alexa Prize SocialBot
Juraj Juraska | Kevin Bowden | Lena Reed | Vrindavan Harrison | Wen Cui | Omkar Patil | Rishi Rajasekaran | Angela Ramirez | Cecilia Li | Eduardo Zamora | Phillip Lee | Jeshwanth Bheemanpally | Rohan Pandey | Adwait Ratnaparkhi | Marilyn Walker
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Athena 2.0 is an Alexa Prize SocialBot that has been a finalist in the last two Alexa Prize Grand Challenges. One reason for Athena’s success is its novel dialogue management strategy, which allows it to dynamically construct dialogues and responses from component modules, leading to novel conversations with every interaction. Here we describe Athena’s system design and performance in the Alexa Prize during the 20/21 competition. A live demo of Athena as well as video recordings will provoke discussion on the state of the art in conversational AI.