Video Event Detection by Exploiting Word Dependencies from Image Captions

Sang Phan, Yusuke Miyao, Duy-Dinh Le, Shin’ichi Satoh


Abstract
Video event detection is a challenging problem in information and multimedia retrieval. Unlike single-action detection, event detection requires richer semantic information from video. To address this challenge, existing solutions often represent videos using high-level features such as concepts. However, a concept-based representation can be ambiguous because it does not encode the relationships between concepts. This issue can be addressed by exploiting co-occurrences of concepts, but doing so often leads to a very large number of possible combinations. In this paper, we propose a new approach that obtains the relationships between concepts by exploiting the syntactic dependencies between words in image captions. The main advantage of this approach is that it significantly reduces the number of combinations between concepts, keeping those that are informative. We conduct extensive experiments to analyze the effectiveness of the new dependency representation for event detection on two large-scale datasets, TRECVID Multimedia Event Detection 2013 and 2014. Experimental results show that (i) dependency features are more discriminative than concept-based features, and (ii) dependency features can be combined with our current event detection system to further improve performance. For instance, the relative improvement reaches 8.6% on the MEDTEST14 10Ex setting.
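The contrast the abstract draws between co-occurrence pairs and dependency pairs can be illustrated with a minimal Python sketch. This is not the authors' implementation. The captions, the hand-written dependency edges, and the names `FRAMES`, `cooccurrence_pairs`, `dependency_pairs`, and `video_histogram` are all hypothetical; a real pipeline would obtain captions from an image captioning model and edges from a dependency parser.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-frame captions with hand-written dependency edges
# (head, relation, dependent); these stand in for real parser output.
FRAMES = [
    {
        "tokens": ["man", "rides", "bike", "on", "street"],
        "deps": [
            ("rides", "nsubj", "man"),
            ("rides", "dobj", "bike"),
            ("rides", "nmod", "street"),
            ("street", "case", "on"),
        ],
    },
    {
        "tokens": ["man", "repairs", "bike", "tire"],
        "deps": [
            ("repairs", "nsubj", "man"),
            ("repairs", "dobj", "tire"),
            ("tire", "compound", "bike"),
        ],
    },
]


def cooccurrence_pairs(tokens):
    """All unordered pairs of distinct words in one caption."""
    return {tuple(sorted(pair)) for pair in combinations(set(tokens), 2)}


def dependency_pairs(deps):
    """Only the (head, dependent) pairs linked by a dependency edge."""
    return {(head, dependent) for head, _relation, dependent in deps}


def video_histogram(frames, key, extractor):
    """Sum pair counts over all frames to get a bag-of-pairs video feature."""
    counts = Counter()
    for frame in frames:
        counts.update(extractor(frame[key]))
    return counts


cooc = video_histogram(FRAMES, "tokens", cooccurrence_pairs)
dep = video_histogram(FRAMES, "deps", dependency_pairs)
print(len(cooc), "co-occurrence pairs vs", len(dep), "dependency pairs")
```

In this toy setting the co-occurrence feature grows quadratically with caption length, while the dependency feature grows roughly linearly, since a dependency tree over n words has about n edges.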
Anthology ID:
C16-1313
Volume:
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Yuji Matsumoto, Rashmi Prasad
Venue:
COLING
Publisher:
The COLING 2016 Organizing Committee
Pages:
3318–3327
URL:
https://aclanthology.org/C16-1313
Cite (ACL):
Sang Phan, Yusuke Miyao, Duy-Dinh Le, and Shin’ichi Satoh. 2016. Video Event Detection by Exploiting Word Dependencies from Image Captions. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3318–3327, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Video Event Detection by Exploiting Word Dependencies from Image Captions (Phan et al., COLING 2016)
PDF:
https://aclanthology.org/C16-1313.pdf
Data:
ImageNet, MS COCO