Joshua Tenenbaum
2025
Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models
Anna A. Ivanova | Aalok Sathe | Benjamin Lipkin | Unnathi U. Kumar | Setayesh Radkani | Thomas H. Clark | Carina Kauf | Jennifer Hu | R. T. Pramod | Gabriel Grand | Vivian C. Paulun | Maria Ryskina | Ekin Akyürek | Ethan G. Wilcox | Nafisa Rashid | Leshem Choshen | Roger Levy | Evelina Fedorenko | Joshua Tenenbaum | Jacob Andreas
Transactions of the Association for Computational Linguistics, Volume 13
The ability to build and reason about models of the world is essential for situated language understanding. But evaluating world modeling capabilities in modern AI systems, especially those based on language models, has proven challenging, in large part because of the difficulty of disentangling conceptual knowledge about the world from knowledge of surface co-occurrence statistics. This paper presents Elements of World Knowledge (EWoK), a framework for evaluating language models’ understanding of the conceptual knowledge underlying world modeling. EWoK targets specific concepts from multiple knowledge domains known to be important for world modeling in humans, from social interactions (help, deceive) to spatial relations (left, right). Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWoK-core-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights large language models (1.3B–70B parameters) and compare them with human performance. All tested models perform worse than humans, with results varying drastically across domains: performance on social interactions and social properties was highest, while performance on physical relations and spatial relations was lowest. Overall, this dataset highlights simple cases where even large models struggle and presents rich avenues for targeted research on LLM world modeling capabilities.
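To make the evaluation setup concrete, here is a minimal sketch (not code from the paper) of the template-filling and minimal-pair scoring the abstract describes: a template is filled with agents/objects, and an off-the-shelf causal language model is checked for whether it assigns higher probability to the plausible continuation. The template wording, fillers, and scoring rule are hypothetical stand-ins for actual EWoK items.

```python
# Sketch of EWoK-style evaluation with a Hugging Face causal LM.
# The item below is illustrative, not taken from EWoK-core-1.0.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text: str) -> float:
    """Sum of token log-probabilities of `text` under the LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logps = torch.log_softmax(logits[:, :-1], dim=-1)   # predict token t+1 from t
    return logps.gather(2, ids[:, 1:].unsqueeze(-1)).sum().item()

# Hypothetical spatial-relations template with fillable slots.
context = "{obj1} is to the left of {obj2}.".format(obj1="The cup", obj2="the plate")
plausible   = context + " The plate is to the right of the cup."
implausible = context + " The plate is to the left of the cup."

print("model prefers plausible target:",
      sentence_logprob(plausible) > sentence_logprob(implausible))
```

Comparing full-sentence log-probabilities on a minimal pair needs no task-specific prompting, which fits the abstract's goal of probing conceptual knowledge rather than instruction-following.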
2024
MMToM-QA: Multimodal Theory of Mind Question Answering
Chuanyang Jin | Yutong Wu | Jing Cao | Jiannan Xiang | Yen-Ling Kuo | Zhiting Hu | Tomer Ullman | Antonio Torralba | Joshua Tenenbaum | Tianmin Shu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Theory of Mind (ToM), the ability to understand people’s mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets – either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person’s mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person’s activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.
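As a rough illustration of the Bayesian inverse planning idea at the heart of BIP-ALM (not the paper's implementation), the sketch below infers a posterior over an agent's goals from observed actions via Bayes' rule, P(goal | actions) ∝ P(actions | goal) P(goal). In BIP-ALM the likelihood is computed by a language model over unified symbolic representations extracted from video and text; here `action_likelihood` is a hypothetical stand-in.

```python
# Minimal Bayesian inverse planning sketch: posterior over goals given actions.
from math import prod

def action_likelihood(action: str, goal: str) -> float:
    """Hypothetical P(action | goal): actions mentioning the goal are likelier.
    In BIP-ALM this term would be scored by a language model instead."""
    return 0.8 if goal in action else 0.2

def infer_goal(actions, goals, prior=None):
    """Return the normalized posterior P(goal | actions)."""
    prior = prior or {g: 1.0 / len(goals) for g in goals}
    scores = {g: prior[g] * prod(action_likelihood(a, g) for a in actions)
              for g in goals}
    z = sum(scores.values())
    return {g: s / z for g, s in scores.items()}

actions = ["walk to fridge", "open fridge"]
goals = ["fridge", "cabinet"]
print(infer_goal(actions, goals))  # posterior mass shifts toward "fridge"
```

The model-based structure is what distinguishes this from end-to-end question answering: the goal is inferred by inverting a generative model of rational action, not read off from surface patterns.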
2023
LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers
Theo Olausson | Alex Gu | Ben Lipkin | Cedegao Zhang | Armando Solar-Lezama | Joshua Tenenbaum | Roger Levy
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Logical reasoning, i.e., deductively inferring the truth value of a conclusion from a set of premises, is an important task for artificial intelligence with wide potential impacts on science, mathematics, and society. While many prompting-based strategies have been proposed to enable Large Language Models (LLMs) to do such reasoning more effectively, they still appear unsatisfactory, often failing in subtle and unpredictable ways. In this work, we investigate the validity of instead reformulating such tasks as modular neurosymbolic programming, which we call LINC: Logical Inference via Neurosymbolic Computation. In LINC, the LLM acts as a semantic parser, translating premises and conclusions from natural language to expressions in first-order logic. These expressions are then offloaded to an external theorem prover, which symbolically performs deductive inference. Leveraging this approach, we observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate. On ProofWriter, augmenting the comparatively small open-source StarCoder+ (15.5B parameters) with LINC even outperforms GPT-3.5 and GPT-4 with Chain-of-Thought (CoT) prompting by an absolute 38% and 10%, respectively. When used with GPT-4, LINC scores 26% higher than CoT on ProofWriter while performing comparably on FOLIO. Further analysis reveals that although both methods on average succeed roughly equally often on this dataset, they exhibit distinct and complementary failure modes. We thus provide promising evidence for how logical reasoning over natural language can be tackled through jointly leveraging LLMs alongside symbolic provers. All corresponding code is publicly available.
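The symbolic half of this pipeline is straightforward to sketch: once an LLM has translated premises and a conclusion into first-order logic, a theorem prover decides entailment. The sketch below uses NLTK's pure-Python ResolutionProver as a stand-in for the prover used in the paper, with hand-written FOL strings in place of LLM output.

```python
# Sketch of LINC's prover stage: FOL formulas in, entailment verdict out.
from nltk.sem import Expression
from nltk.inference import ResolutionProver

read = Expression.fromstring
premises = [
    read("all x.(human(x) -> mortal(x))"),  # "All humans are mortal."
    read("human(socrates)"),                # "Socrates is a human."
]
conclusion = read("mortal(socrates)")       # "Socrates is mortal."

# True iff the premises entail the conclusion.
print(ResolutionProver().prove(conclusion, premises))  # True
```

Because deduction is delegated to the prover, the LLM's only job is faithful translation, which is where the distinct failure modes the abstract mentions (parsing errors rather than reasoning errors) arise.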
Co-authors
- Roger Levy 2
- Benjamin Lipkin 2
- Ekin Akyürek 1
- Jacob Andreas 1
- Jing Cao 1
- Leshem Choshen 1
- Thomas H. Clark 1
- Evelina Fedorenko 1
- Gabriel Grand 1
- Alex Gu 1
- Zhiting Hu 1
- Jennifer Hu 1
- Anna A. Ivanova 1
- Chuanyang Jin 1
- Carina Kauf 1
- Unnathi U. Kumar 1
- Yen-Ling Kuo 1
- Theo Olausson 1
- Vivian C. Paulun 1
- R. T. Pramod 1
- Setayesh Radkani 1
- Nafisa Rashid 1
- Maria Ryskina 1
- Aalok Sathe 1
- Tianmin Shu 1
- Armando Solar-Lezama 1
- Antonio Torralba 1
- Tomer Ullman 1
- Ethan G. Wilcox 1
- Yutong Wu 1
- Jiannan Xiang 1
- Cedegao Zhang 1