OSCaR: Object State Captioning and State Change Representation

Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu


Abstract
The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating Multimodal Large Language Models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR.
Anthology ID:
2024.findings-naacl.226
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3565–3576
URL:
https://aclanthology.org/2024.findings-naacl.226
Cite (ACL):
Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, and Chenliang Xu. 2024. OSCaR: Object State Captioning and State Change Representation. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3565–3576, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
OSCaR: Object State Captioning and State Change Representation (Nguyen et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.226.pdf
Copyright:
 2024.findings-naacl.226.copyright.pdf