Non-Event Oriented Video Assessments in Long-Form Robot Videos

Stephanie M. Lukin; Kimberly A. Pollard; Claire Bonial; Cory Hayes; Ron Artstein; Kallirroi Georgila; David Traum

Abstract

We introduce Video-SCOUT, a novel dataset of sixty 20-minute robot-recorded videos from human-robot collaborative exploration exercises, together with a new video analysis method for these types of exploration videos. Unlike video from stationary cameras, where detection of motion can help identify events of interest, the camera in an exploration task is constantly in motion while the environment is stationary. Our analysis method, Non-Event Oriented Video Assessments (NOVA), uses vision-language models to select frames relevant for supporting a particular assessment within continuous long-form videos. Results of testing with two different video-language models reveal a trade-off between precision and recall and show gains in overall recall when combined with a human’s knowledge, suggesting that NOVA may improve human analysis of robot navigation. We outline future work to mitigate miscommunication in human-robot interaction by leveraging dialogue with NOVA in support of better collaboration.

Anthology ID: 2026.magmar-main.8
Volume: Proceedings of the 2nd Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2026)
Month: July
Year: 2026
Address: San Diego, USA
Editors: Kenton Murray, Reno Kriz
Venues: MAGMaR | WS
Publisher: Association for Computational Linguistics
Pages: 27–41
URL: https://aclanthology.org/2026.magmar-main.8/
Cite (ACL): Stephanie M. Lukin, Kimberly A. Pollard, Claire Bonial, Cory J. Hayes, Ron Artstein, Kallirroi Georgila, and David Traum. 2026. Non-Event Oriented Video Assessments in Long-Form Robot Videos. In Proceedings of the 2nd Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2026), pages 27–41, San Diego, USA. Association for Computational Linguistics.
Cite (Informal): Non-Event Oriented Video Assessments in Long-Form Robot Videos (Lukin et al., MAGMaR 2026)
PDF: https://aclanthology.org/2026.magmar-main.8.pdf
