@inproceedings{hanna-etal-2022-act,
    title = "{ACT}-Thor: A Controlled Benchmark for Embodied Action Understanding in Simulated Environments",
    author = "Hanna, Michael and
      Pedeni, Federico and
      Suglia, Alessandro and
      Testoni, Alberto and
      Bernardi, Raffaella",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.495",
    pages = "5597--5612",
    abstract = "Artificial agents are nowadays challenged to perform embodied AI tasks. To succeed, agents must understand the meaning of verbs and how their corresponding actions transform the surrounding world. In this work, we propose ACT-Thor, a novel controlled benchmark for embodied action understanding. We use the AI2-THOR simulated environment to produce a controlled setup in which an agent, given a before-image and an associated action command, has to determine what the correct after-image is among a set of possible candidates. First, we assess the feasibility of the task via a human evaluation that resulted in 81.4{\%} accuracy, and very high inter-annotator agreement (84.9{\%}). Second, we design both unimodal and multimodal baselines, using state-of-the-art visual feature extractors. Our evaluation and error analysis suggest that only models that have a very structured representation of the actions together with powerful visual features can perform well on the task. However, they still fall behind human performance in a zero-shot scenario where the model is exposed to unseen (action, object) pairs. This paves the way for a systematic way of evaluating embodied AI agents that understand grounded actions.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="hanna-etal-2022-act">
    <titleInfo>
      <title>ACT-Thor: A Controlled Benchmark for Embodied Action Understanding in Simulated Environments</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Michael</namePart>
      <namePart type="family">Hanna</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Federico</namePart>
      <namePart type="family">Pedeni</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Alessandro</namePart>
      <namePart type="family">Suglia</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Alberto</namePart>
      <namePart type="family">Testoni</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Raffaella</namePart>
      <namePart type="family">Bernardi</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2022-10</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 29th International Conference on Computational Linguistics</title>
      </titleInfo>
      <originInfo>
        <publisher>International Committee on Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Gyeongju, Republic of Korea</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Artificial agents are nowadays challenged to perform embodied AI tasks. To succeed, agents must understand the meaning of verbs and how their corresponding actions transform the surrounding world. In this work, we propose ACT-Thor, a novel controlled benchmark for embodied action understanding. We use the AI2-THOR simulated environment to produce a controlled setup in which an agent, given a before-image and an associated action command, has to determine what the correct after-image is among a set of possible candidates. First, we assess the feasibility of the task via a human evaluation that resulted in 81.4% accuracy, and very high inter-annotator agreement (84.9%). Second, we design both unimodal and multimodal baselines, using state-of-the-art visual feature extractors. Our evaluation and error analysis suggest that only models that have a very structured representation of the actions together with powerful visual features can perform well on the task. However, they still fall behind human performance in a zero-shot scenario where the model is exposed to unseen (action, object) pairs. This paves the way for a systematic way of evaluating embodied AI agents that understand grounded actions.</abstract>
    <identifier type="citekey">hanna-etal-2022-act</identifier>
    <location>
      <url>https://aclanthology.org/2022.coling-1.495</url>
    </location>
    <part>
      <date>2022-10</date>
      <extent unit="page">
        <start>5597</start>
        <end>5612</end>
      </extent>
    </part>
  </mods>
</modsCollection>
%0 Conference Proceedings
%T ACT-Thor: A Controlled Benchmark for Embodied Action Understanding in Simulated Environments
%A Hanna, Michael
%A Pedeni, Federico
%A Suglia, Alessandro
%A Testoni, Alberto
%A Bernardi, Raffaella
%S Proceedings of the 29th International Conference on Computational Linguistics
%D 2022
%8 October
%I International Committee on Computational Linguistics
%C Gyeongju, Republic of Korea
%F hanna-etal-2022-act
%X Artificial agents are nowadays challenged to perform embodied AI tasks. To succeed, agents must understand the meaning of verbs and how their corresponding actions transform the surrounding world. In this work, we propose ACT-Thor, a novel controlled benchmark for embodied action understanding. We use the AI2-THOR simulated environment to produce a controlled setup in which an agent, given a before-image and an associated action command, has to determine what the correct after-image is among a set of possible candidates. First, we assess the feasibility of the task via a human evaluation that resulted in 81.4% accuracy, and very high inter-annotator agreement (84.9%). Second, we design both unimodal and multimodal baselines, using state-of-the-art visual feature extractors. Our evaluation and error analysis suggest that only models that have a very structured representation of the actions together with powerful visual features can perform well on the task. However, they still fall behind human performance in a zero-shot scenario where the model is exposed to unseen (action, object) pairs. This paves the way for a systematic way of evaluating embodied AI agents that understand grounded actions.
%U https://aclanthology.org/2022.coling-1.495
%P 5597-5612
Markdown (Informal)
[ACT-Thor: A Controlled Benchmark for Embodied Action Understanding in Simulated Environments](https://aclanthology.org/2022.coling-1.495) (Hanna et al., COLING 2022)
ACL
Michael Hanna, Federico Pedeni, Alessandro Suglia, Alberto Testoni, and Raffaella Bernardi. 2022. ACT-Thor: A Controlled Benchmark for Embodied Action Understanding in Simulated Environments. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5597–5612, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
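A citation record alone does not show the shape of the benchmark it describes, so here is a minimal sketch of the contrastive task format from the abstract: a before-image plus an action command, a set of candidate after-images, and accuracy as the fraction of items where the model picks the correct candidate. All names and fields below (ActThorInstance, gold_index, etc.) are illustrative assumptions, not the benchmark's actual data schema or API.

```python
from dataclasses import dataclass

@dataclass
class ActThorInstance:
    """One contrastive item, per the abstract: a before-image, an action
    command, and candidate after-images with exactly one correct answer.
    Field names are hypothetical, not ACT-Thor's real schema."""
    before_image: str      # path to the before-image
    action: str            # action command, e.g. "slice" or "open"
    candidates: list[str]  # paths to the candidate after-images
    gold_index: int        # index of the correct after-image

def accuracy(instances: list[ActThorInstance], predict) -> float:
    """Fraction of items where `predict`, given (before_image, action,
    candidates), returns the index of the gold after-image."""
    hits = sum(
        predict(ex.before_image, ex.action, ex.candidates) == ex.gold_index
        for ex in instances
    )
    return hits / len(instances)

# Trivial baseline: always pick the first candidate.
if __name__ == "__main__":
    demo = [ActThorInstance("before.png", "slice", ["a.png", "b.png"], 1)]
    print(accuracy(demo, lambda img, act, cands: 0))  # 0.0
```

Under this framing, the paper's zero-shot setting corresponds to evaluating on (action, object) pairs that never appear in training.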