Song-chun Zhu

Also published as: Song-Chun Zhu


CrossVQA: Scalably Generating Benchmarks for Systematically Testing VQA Generalization
Arjun Akula | Soravit Changpinyo | Boqing Gong | Piyush Sharma | Song-Chun Zhu | Radu Soricut
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

One challenge in evaluating visual question answering (VQA) models in the cross-dataset adaptation setting is that the distribution shifts are multi-modal, making it difficult to identify whether shifts in visual or language features play the key role. In this paper, we propose a semi-automatic framework for generating disentangled shifts by introducing a controllable visual question-answer generation (VQAG) module that is capable of generating highly relevant and diverse question-answer pairs with the desired dataset style. We use it to create CrossVQA, a collection of test splits for assessing VQA generalization based on the VQA2, VizWiz, and Open Images datasets. We provide an analysis of our generated datasets and demonstrate their utility by using them to evaluate several state-of-the-art VQA systems. One important finding is that the visual shifts in cross-dataset VQA matter more than the language shifts. More broadly, we present a scalable framework for systematically evaluating the machine with little human intervention.

Mind the Context: The Impact of Contextualization in Neural Module Networks for Grounding Visual Referring Expressions
Arjun Akula | Spandana Gella | Keze Wang | Song-Chun Zhu | Siva Reddy
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Neural module networks (NMN) are a popular approach for grounding visual referring expressions. Prior implementations of NMN use pre-defined and fixed textual inputs in their module instantiation. This necessitates a large number of modules as they lack the ability to share weights and exploit associations between similar textual contexts (e.g. “dark cube on the left” vs. “black cube on the left”). In this work, we address these limitations and evaluate the impact of contextual clues in improving the performance of NMN models. First, we address the problem of fixed textual inputs by parameterizing the module arguments. This substantially reduces the number of modules in NMN, by up to 75%, without any loss in performance. Next, we propose a method to contextualize our parameterized model to enhance the module’s capacity in exploiting the visiolinguistic associations. Our model outperforms the state-of-the-art NMN model on the CLEVR-Ref+ dataset with +8.1% improvement in accuracy on the single-referent test set and +4.3% on the full test set. Additionally, we demonstrate that contextualization provides +11.2% and +1.7% improvements in accuracy over prior NMN models on CLOSURE and NLVR2. We further evaluate the impact of our contextualization by constructing a contrast set for CLEVR-Ref+, which we call CC-Ref+. We significantly outperform the baselines by as much as +10.4% absolute accuracy on CC-Ref+, illustrating the generalization skills of our approach.
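The effect of parameterizing module arguments on module count can be illustrated with a toy sketch. This is not the authors' implementation: the `Program` type, the two counting functions, and the example programs are all hypothetical, and the resulting counts say nothing about the 75% figure reported in the abstract. It only shows why one module per (type, text) pair grows with the vocabulary of arguments, while one module per type that takes the text as an input does not.

```python
from typing import List, Tuple

# Hypothetical representation: a program is a list of (module type, textual argument).
Program = List[Tuple[str, str]]


def count_fixed_modules(programs: List[Program]) -> int:
    """One module per distinct (type, text) pair: no weight sharing across texts."""
    return len({(module, arg) for program in programs for module, arg in program})


def count_parameterized_modules(programs: List[Program]) -> int:
    """One module per type; the text is passed in as an argument (assumed design)."""
    return len({module for program in programs for module, _ in program})


# Toy programs, invented for illustration only.
programs: List[Program] = [
    [("filter", "dark cube"), ("relate", "left")],
    [("filter", "black cube"), ("relate", "left")],
    [("filter", "red sphere"), ("relate", "right")],
]

fixed = count_fixed_modules(programs)
parameterized = count_parameterized_modules(programs)
print(f"fixed-input modules: {fixed}, parameterized modules: {parameterized}")
```

In this toy setting, "dark cube" and "black cube" each require their own filter module under fixed inputs, whereas a single parameterized filter module could in principle handle both and share weights between them.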

GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational rEasoning
Zilong Zheng | Shuwen Qiu | Lifeng Fan | Yixin Zhu | Song-Chun Zhu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues
Liang Qiu | Yuan Liang | Yizhou Zhao | Pan Lu | Baolin Peng | Zhou Yu | Ying Nian Wu | Song-Chun Zhu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Inferring social relations from dialogues is vital for building emotionally intelligent robots to interpret human language better and act accordingly. We model the social network as an And-or Graph, named SocAoG, to keep relations consistent within a group and to leverage attributes as inference cues. Moreover, we formulate a sequential structure prediction task, and propose an 𝛼-𝛽-𝛾 strategy to incrementally parse SocAoG for the dynamic inference upon any incoming utterance: (i) an 𝛼 process predicting attributes and relations conditioned on the semantics of dialogues, (ii) a 𝛽 process updating the social relations based on related attributes, and (iii) a 𝛾 process updating individuals’ attributes based on interpersonal social relations. Empirical results on DialogRE and MovieGraph show that our model infers social relations more accurately than the state-of-the-art methods. Moreover, the ablation study shows the three processes complement each other, and the case study demonstrates the dynamic relational inference.

Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning
Pan Lu | Ran Gong | Shibiao Jiang | Liang Qiu | Siyuan Huang | Xiaodan Liang | Song-Chun Zhu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Geometry problem solving has attracted much attention in the NLP community recently. The task is challenging as it requires abstract problem understanding and symbolic reasoning with axiomatic knowledge. However, current datasets are either small in scale or not publicly available. Thus, we construct a new large-scale benchmark, Geometry3K, consisting of 3,002 geometry problems with dense annotation in formal language. We further propose a novel geometry solving approach with formal language and symbolic reasoning, called Interpretable Geometry Problem Solver (Inter-GPS). Inter-GPS first parses the problem text and diagram into formal language automatically via rule-based text parsing and neural object detection, respectively. Unlike implicit learning in existing methods, Inter-GPS incorporates theorem knowledge as conditional rules and performs symbolic reasoning step by step. Also, a theorem predictor is designed to infer the theorem application sequence fed to the symbolic solver, yielding a more efficient and reasonable search path. Extensive experiments on the Geometry3K and GEOS datasets demonstrate that Inter-GPS achieves significant improvements over existing methods. The project with code and data is available at


Words Aren’t Enough, Their Order Matters: On the Robustness of Grounding Visual Referring Expressions
Arjun Akula | Spandana Gella | Yaser Al-Onaizan | Song-Chun Zhu | Siva Reddy
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure, i.e., words alone are enough to identify the target object and word order doesn’t matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which doesn’t. Additionally, we create an out-of-distribution dataset Ref-Adv by asking crowdworkers to perturb in-domain examples such that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and are 12% to 23% lower in performance than the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at
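The idea that a model may ignore word order can be probed with a simple shuffle test, sketched below. This is an illustrative assumption, not the procedure from the paper (which relies on a human study and crowdworker perturbations): `shuffle_words`, `order_insensitive_fraction`, and the toy `bag_of_words_predict` model with its two candidate objects are all hypothetical names invented for this example.

```python
import random
from typing import Callable, List


def shuffle_words(expression: str, seed: int = 0) -> str:
    """Return the expression with its words in a seeded random order."""
    words = expression.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def order_insensitive_fraction(
    predict: Callable[[str], str], expressions: List[str], seed: int = 0
) -> float:
    """Fraction of expressions whose prediction is unchanged after shuffling."""
    unchanged = sum(
        predict(expr) == predict(shuffle_words(expr, seed)) for expr in expressions
    )
    return unchanged / len(expressions)


def bag_of_words_predict(expression: str) -> str:
    """Toy stand-in model: picks the candidate whose keywords overlap most.

    It only looks at the set of words, so it cannot use word order at all.
    """
    candidates = {"obj_a": {"dog", "left"}, "obj_b": {"cat", "right"}}
    words = set(expression.lower().split())
    return max(sorted(candidates), key=lambda name: len(candidates[name] & words))


expressions = ["the dog on the left", "cat to the right"]
fraction = order_insensitive_fraction(bag_of_words_predict, expressions)
print(f"predictions unchanged after shuffling: {fraction:.0%}")
```

A model that scores high on this fraction while still performing well on a benchmark suggests that the benchmark instances can be solved without reasoning on linguistic structure.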

Structured Attention for Unsupervised Dialogue Structure Induction
Liang Qiu | Yizhou Zhao | Weiyan Shi | Yuan Liang | Feng Shi | Tao Yuan | Zhou Yu | Song-Chun Zhu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Inducing a meaningful structural representation from one or a set of dialogues is a crucial but challenging task in computational linguistics. Advances in this area are critical for dialogue system design and discourse analysis. They can also be extended to solve grammatical inference. In this work, we propose to incorporate structured attention layers into a Variational Recurrent Neural Network (VRNN) model with discrete latent states to learn dialogue structure in an unsupervised fashion. Compared to a vanilla VRNN, structured attention enables a model to focus on different parts of the source sentence embeddings while enforcing a structural inductive bias. Experiments show that on two-party dialogue datasets, VRNN with structured attention learns semantic structures that are similar to the templates used to generate the dialogue corpus. On multi-party dialogue datasets, our model learns an interactive structure, demonstrating its capability of distinguishing speakers or addressees and automatically disentangling dialogues without explicit human annotation.


Jointly Learning Grounded Task Structures from Language Instruction and Visual Demonstration
Changsong Liu | Shaohua Yang | Sari Saba-Sadiya | Nishant Shukla | Yunzhong He | Song-Chun Zhu | Joyce Chai
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Grounded Semantic Role Labeling
Shaohua Yang | Qiaozi Gao | Changsong Liu | Caiming Xiong | Song-Chun Zhu | Joyce Y. Chai
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies