OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Xianjing Han; Bin Zhu; Shiqi Hu; Franklin Mingzhe Li; Patrick Carrington; Roger Zimmermann; Jingjing Chen

Abstract

Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text–video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object’s state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action–object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)–based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models. Project page: https://hanxjing.github.io/OSCBench.

Anthology ID: 2026.acl-long.1425
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 30867–30884
URL: https://aclanthology.org/2026.acl-long.1425/
Cite (ACL): Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, and Jingjing Chen. 2026. OSCBench: Benchmarking Object State Change in Text-to-Video Generation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30867–30884, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): OSCBench: Benchmarking Object State Change in Text-to-Video Generation (Han et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1425.pdf
Checklist: 2026.acl-long.1425.checklist.pdf
