Development of a MultiModal Annotation Framework and Dataset for Deep Video Understanding

Erika Loc, Keith Curtis, George Awad, Shahzad Rajput, Ian Soboroff


Abstract
In this paper, we introduce our approach and methods for collecting and annotating a new dataset for deep video understanding. The proposed dataset is composed of 3 seasons (15 episodes) of the BBC Land Girls TV series in addition to 14 Creative Commons movies, with a total duration of 28.5 hours. The main contribution of this paper is a novel annotation framework at the movie and scene levels that supports an automatic query generation process capable of capturing high-level movie features (e.g. how characters and locations are related to each other) as well as fine-grained scene-level features (e.g. character interactions, natural language descriptions, and sentiments). Movie-level annotations include constructing a global static knowledge graph (KG) to capture major relationships, while scene-level annotations include constructing a sequence of knowledge graphs (KGs) to capture fine-grained features. The annotation framework supports generating multiple query types. The objective of the framework is to provide a guide for annotating long-duration videos to support tasks and challenges in the video and multimedia understanding domains. These tasks and challenges can test automatic systems on their ability to learn and comprehend a movie or long video in terms of actors, entities, events, interactions, and their relationships to each other.
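The paper itself does not specify a concrete data format, but the two-level annotation scheme it describes (one global static KG plus a sequence of scene-level KGs feeding query generation) can be illustrated with a minimal Python sketch, assuming a simple (subject, predicate, object) triple schema. All class and field names below (Relation, SceneKG, MovieAnnotation, generate_queries) and the example entities are hypothetical, not the authors' actual annotation format.

from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One KG edge as a (subject, predicate, object) triple -- assumed schema."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneKG:
    """Scene-level annotation: fine-grained interactions plus free-text fields."""
    scene_id: str
    interactions: list[Relation] = field(default_factory=list)
    description: str = ""          # natural language description of the scene
    sentiment: str = "neutral"     # e.g. positive / negative / neutral


@dataclass
class MovieAnnotation:
    """Movie-level annotation: a global static KG and an ordered list of scene KGs."""
    title: str
    global_kg: list[Relation] = field(default_factory=list)
    scenes: list[SceneKG] = field(default_factory=list)

    def generate_queries(self):
        """Toy query generation: turn each global relation into a relationship question."""
        for r in self.global_kg:
            yield f"How is {r.subject} related to {r.obj}?"


# Example usage with invented entities (not taken from the dataset itself).
movie = MovieAnnotation(title="Example Movie")
movie.global_kg.append(Relation("Alice", "sister_of", "Bob"))
movie.scenes.append(
    SceneKG(
        scene_id="scene_001",
        interactions=[Relation("Alice", "argues_with", "Bob")],
        description="Alice confronts Bob in the kitchen.",
        sentiment="negative",
    )
)
for query in movie.generate_queries():
    print(query)  # -> How is Alice related to Bob?

Under this reading, the global KG supports movie-level relationship queries, while each SceneKG carries the fine-grained, per-scene features the abstract mentions; a richer implementation would presumably cover the paper's multiple query types rather than the single template shown here.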
Anthology ID:
2022.pvlam-1.3
Volume:
Proceedings of the 2nd Workshop on People in Vision, Language, and the Mind
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Patrizia Paggio, Albert Gatt, Marc Tanti
Venue:
PVLAM
Publisher:
European Language Resources Association
Pages:
12–16
URL:
https://aclanthology.org/2022.pvlam-1.3
Cite (ACL):
Erika Loc, Keith Curtis, George Awad, Shahzad Rajput, and Ian Soboroff. 2022. Development of a MultiModal Annotation Framework and Dataset for Deep Video Understanding. In Proceedings of the 2nd Workshop on People in Vision, Language, and the Mind, pages 12–16, Marseille, France. European Language Resources Association.
Cite (Informal):
Development of a MultiModal Annotation Framework and Dataset for Deep Video Understanding (Loc et al., PVLAM 2022)
PDF:
https://aclanthology.org/2022.pvlam-1.3.pdf