Kevin Liu
2025
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation
Qiyue Gao | Xinyu Pi | Kevin Liu | Junrong Chen | Ruolan Yang | Xinqi Huang | Xinyu Fang | Lu Sun | Gautham Kishore | Bo Ai | Stone Tao | Mengyang Liu | Jiaxi Yang | Chao-Jung Lai | Chuanyang Jin | Jiannan Xiang | Benhao Huang | Zeming Chen | David Danks | Hao Su | Tianmin Shu | Ziqiao Ma | Lianhui Qin | Zhiting Hu
Findings of the Association for Computational Linguistics: ACL 2025
Internal world models (WMs) enable agents to understand the world’s state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision-Language Models (VLMs), such as GPT-4o and Gemini, exhibit potential as general-purpose WMs. While recent studies have evaluated and shown limitations in specific capabilities such as visual understanding, a systematic evaluation of VLMs’ fundamental WM abilities remains absent. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that assesses **perception** (visual, spatial, temporal, quantitative, and motion) and **prediction** (mechanistic simulation, transitive inference, compositional inference) to provide an atomic evaluation of VLMs as WMs. Guided by this framework, we introduce **WM-ABench**, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660 experiments on 15 of the latest commercial and open-source VLMs, we find that these models exhibit striking limitations in basic world modeling abilities. For instance, all models perform at near-random accuracy when distinguishing motion trajectories. Additionally, they lack disentangled understanding; for example, they tend to believe blue objects move faster than green ones. Further results and analyses reveal significant gaps between VLMs and human-level world modeling.
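To illustrate what an atomic, dimension-by-dimension evaluation of this kind might look like in code, here is a minimal Python sketch. It is not the released WM-ABench harness: the `Item` fields and the `vlm` callable are assumptions standing in for the real data format and model client.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Item:
    """One atomic evaluation item from a simulated environment."""
    observation: str    # e.g., path to a rendered frame or video clip
    question: str       # e.g., "Which trajectory does the ball follow?"
    options: List[str]  # candidate answers, including counterfactual foils
    answer: int         # index of the correct option
    dimension: str      # e.g., "motion", "spatial", "mechanistic simulation"

def accuracy_by_dimension(items: List[Item],
                          vlm: Callable[[str, str, List[str]], int]) -> Dict[str, float]:
    """Score a VLM separately on each dimension; chance level is 1/len(options)."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for it in items:
        pred = vlm(it.observation, it.question, it.options)
        total[it.dimension] = total.get(it.dimension, 0) + 1
        correct[it.dimension] = correct.get(it.dimension, 0) + int(pred == it.answer)
    return {d: correct[d] / total[d] for d in total}

# A random-guess baseline stands in for a real VLM client here.
random_vlm = lambda obs, q, opts: random.randrange(len(opts))
```

Reporting accuracy per dimension, rather than a single aggregate score, is what makes the "near-random on motion trajectories" style of finding visible.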
2023
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
Kevin Liu | Stephen Casper | Dylan Hadfield-Menell | Jacob Andreas
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Neural language models (LMs) can be used to evaluate the truth of factual statements in two ways: they can be either queried for statement probabilities, or probed for internal representations of truthfulness. Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs. This has led some researchers to conclude that LMs “lie” or otherwise encode non-cooperative communicative intents. Is this an accurate description of today’s LMs, or can query–probe disagreement arise in other ways? We identify three different classes of disagreement, which we term confabulation, deception, and heterogeneity. In many cases, the superiority of probes is simply attributable to better calibration on uncertain answers rather than a greater fraction of correct, high-confidence answers. In some cases, queries and probes perform better on different subsets of inputs, and accuracy can further be improved by ensembling the two.
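To make the query–probe contrast concrete, here is a minimal Python sketch, not the paper’s code: querying scores a statement by its token probabilities, while probing fits a classifier on hidden states. The model choice (`gpt2`) and the toy labeled statements are illustrative assumptions.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model choice is illustrative; any causal LM that exposes hidden states works.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name).eval()

def query_logprob(statement: str) -> float:
    """Query: score a statement by its average token log-probability."""
    ids = tok(statement, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)
    return -out.loss.item()  # higher means more probable under the LM

def hidden_state(statement: str) -> torch.Tensor:
    """Probe input: the last-layer representation of the final token."""
    ids = tok(statement, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

# Toy labeled statements; a real probe would be trained on a factual dataset.
statements = ["Paris is the capital of France.", "Paris is the capital of Spain."]
labels = [1, 0]
X = torch.stack([hidden_state(s) for s in statements]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
```

Disagreement, in this framing, is the case where `query_logprob` ranks a statement as likely while `probe` classifies its hidden state as false, or vice versa; ensembling combines the two signals.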
2021
Stanford MLab at SemEval-2021 Task 1: Tree-Based Modelling of Lexical Complexity using Word Embeddings
Erik Rozi | Niveditha Iyer | Gordon Chi | Enok Choe | Kathy J. Lee | Kevin Liu | Patrick Liu | Zander Lack | Jillian Tang | Ethan A. Chi
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
This paper presents our system for the single- and multi-word lexical complexity prediction tasks of SemEval-2021 Task 1: Lexical Complexity Prediction. Text comprehension depends on the reader’s ability to understand the words in a text; evaluating lexical complexity can help readers find appropriate material and help systems tailor a text to an audience’s needs. We present our model pipeline, which combines embedding-based and manual features to predict lexical complexity on the CompLex English dataset using various tree-based and linear models. Our method ranks 27th of 54 on single-word prediction and 14th of 37 on multi-word prediction.
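A rough sketch of this kind of pipeline in Python, not the team’s actual system: a tree-based regressor fit on hand-crafted features. The word list, frequencies, and complexity scores below are placeholder assumptions; in the real pipeline, pretrained word embeddings would be concatenated onto the manual feature vector.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def manual_features(word: str, freq: dict) -> np.ndarray:
    """Hand-crafted features: length, vowel count, and log corpus
    frequency (a rough proxy for word familiarity)."""
    vowels = sum(ch in "aeiou" for ch in word.lower())
    return np.array([len(word), vowels, np.log1p(freq.get(word.lower(), 0))])

# Placeholder corpus frequencies and CompLex-style complexity scores in [0, 1].
freq = {"the": 1_000_000, "cat": 90_000, "ubiquitous": 120, "sesquipedalian": 3}
words = ["the", "cat", "ubiquitous", "sesquipedalian"]
scores = [0.05, 0.10, 0.65, 0.85]

X = np.stack([manual_features(w, freq) for w in words])
model = GradientBoostingRegressor(n_estimators=50).fit(X, scores)
print(model.predict(manual_features("perspicacious", freq).reshape(1, -1)))
```

Tree-based models like this one handle the mixed scales of manual and embedding features without normalization, which is one reason such ensembles are a common choice for this task.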
Co-authors
- Bo Ai 1
- Jacob Andreas 1
- Stephen Casper 1
- Junrong Chen 1
- Zeming Chen 1
- Gordon Chi 1
- Ethan A. Chi 1
- Enok Choe 1
- David Danks 1
- Xinyu Fang 1
- Qiyue Gao 1
- Dylan Hadfield-Menell 1
- Zhiting Hu 1
- Xinqi Huang 1
- Benhao Huang 1
- Niveditha Iyer 1
- Chuanyang Jin 1
- Gautham Kishore 1
- Zander Lack 1
- Chao-Jung Lai 1
- Kathy J. Lee 1
- Patrick Liu 1
- Mengyang Liu 1
- Ziqiao Ma 1
- Xinyu Pi 1
- Lianhui Qin 1
- Erik Rozi 1
- Tianmin Shu 1
- Hao Su 1
- Lu Sun 1
- Jillian Tang 1
- Stone Tao 1
- Jiannan Xiang 1
- Ruolan Yang 1
- Jiaxi Yang 1