Learning Human Action Representations from Temporal Context in Lifestyle Vlogs

Oana Ignat; Santiago Castro; Weiji Li; Rada Mihalcea

Abstract

We address the task of human action representation and show how the co-occurrence-based approach to generating word representations can be adapted to human actions by analyzing their co-occurrence in videos. To this end, we formalize the new task of human action co-occurrence identification in online videos, i.e., determining whether two human actions are likely to co-occur in the same interval of time. We create and make publicly available the Co-Act (Action Co-occurrence) dataset, consisting of a large graph of ~12k co-occurring pairs of visual actions and their corresponding video clips. We describe graph link prediction models that leverage visual and textual information to automatically infer whether two actions co-occur. We show that graphs are particularly well suited to capturing relations between human actions, and that the learned graph representations are effective for our task and capture novel and relevant information across different data domains.
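To make the task concrete, the following is a minimal sketch of co-occurrence identification framed as link prediction on an action graph. It is not the authors' model: the abstract describes models that use visual and textual information, whereas this sketch substitutes a plain Jaccard neighbor-overlap heuristic. The input format, the `window` value, the function names, and the toy clips are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(clips, window=10.0):
    """Link two actions if they occur within `window` seconds in the same video.

    `clips` maps a video id to a list of (action, start_seconds) pairs.
    Both this format and the window length are assumptions for illustration.
    """
    graph = defaultdict(set)
    for actions in clips.values():
        for (a, ta), (b, tb) in combinations(actions, 2):
            if a != b and abs(ta - tb) <= window:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def jaccard_score(graph, a, b):
    """Neighbor-overlap score in [0, 1]; a stand-in for a learned link predictor."""
    na, nb = graph.get(a, set()), graph.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Toy data, invented for this example (not taken from the Co-Act dataset).
clips = {
    "v1": [("chop onion", 0.0), ("heat pan", 4.0), ("add oil", 8.0)],
    "v2": [("heat pan", 1.0), ("add oil", 3.0), ("crack egg", 9.0)],
    "v3": [("chop onion", 2.0), ("add oil", 5.0), ("fold laundry", 60.0)],
}
graph = build_cooccurrence_graph(clips)

# "chop onion" and "crack egg" never appear together, but they share all
# neighbors, so the heuristic rates them as a likely co-occurring pair.
print(jaccard_score(graph, "chop onion", "crack egg"))
print(jaccard_score(graph, "chop onion", "fold laundry"))
```

In this toy setting, a pair that was never observed together can still receive a high score because the two actions share neighbors in the graph, which is the intuition behind treating co-occurrence identification as link prediction.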

Anthology ID: 2024.textgraphs-1.1
Volume: Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Dmitry Ustalov, Yanjun Gao, Alexander Panchenko, Elena Tutubalina, Irina Nikishina, Arti Ramesh, Andrey Sakhovskiy, Ricardo Usbeck, Gerald Penn, Marco Valentino
Venues: TextGraphs | WS
Publisher: Association for Computational Linguistics
Pages: 1–18
URL: https://aclanthology.org/2024.textgraphs-1.1/
Cite (ACL): Oana Ignat, Santiago Castro, Weiji Li, and Rada Mihalcea. 2024. Learning Human Action Representations from Temporal Context in Lifestyle Vlogs. In Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing, pages 1–18, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Learning Human Action Representations from Temporal Context in Lifestyle Vlogs (Ignat et al., TextGraphs 2024)
PDF: https://aclanthology.org/2024.textgraphs-1.1.pdf
