Sketchy Scene Captioning: Learning Multi-Level Semantic Information from Sparse Visual Scene Cues

Zhou Lian, Chen Yangdong, Zhang Yuejie


Abstract
To enrich research on the sketch modality, this paper proposes a new task termed Sketchy Scene Captioning. The task aims to generate sentence-level and paragraph-level descriptions for a sketchy scene. The sentence-level description conveys the salient semantics of a sketchy scene, while the paragraph-level description gives more details about it. Sketchy Scene Captioning can be viewed as an extension of sketch classification, which can only provide one class label for a sketch. Generating multi-level descriptions for a sketchy scene is challenging because of the visual sparsity and ambiguity of the sketch modality. To achieve our goal, we first contribute a sketchy scene captioning dataset to lay the foundation of this new task. A popular sequence learning scheme, a Long Short-Term Memory (LSTM) neural network with a visual attention mechanism, is then adopted to recognize the objects in a sketchy scene and infer the relations among them. In the experiments, promising results are achieved on the proposed dataset. We believe that this work will motivate further research on understanding the sketch modality and on the numerous sketch-based applications in daily life. The collected dataset is released at https://github.com/SketchysceneCaption/Dataset.
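The abstract names an LSTM decoder with visual attention. The following is a minimal sketch of only the soft-attention step, not the authors' model. It assumes the sketchy scene has already been encoded into a list of region feature vectors and that the decoder's hidden state has the same dimension. It uses plain dot-product scoring as a simplification (attention-based captioners often score regions with a small MLP instead). The names `soft_attention`, `regions`, and `hidden` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def soft_attention(regions: List[Vector], hidden: Vector) -> Tuple[Vector, Vector]:
    """Hypothetical soft attention over sketch region features.

    Each region is scored by its dot product with the decoder hidden state
    (a simplifying assumption), the scores are normalized with softmax, and
    the context vector is the attention-weighted sum of the region features.
    """
    scores = [dot(r, hidden) for r in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    context = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return weights, context


# Three toy "regions" and a hidden state that points toward the first one.
weights, context = soft_attention([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [2.0, 0.0])
print(weights, context)
```

In a full captioner, the context vector would typically be combined with the previous word embedding as input to the next LSTM step. The weights then indicate which parts of the sketch the model attends to when emitting each word.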
Anthology ID:
2021.ccl-1.104
Volume:
Proceedings of the 20th Chinese National Conference on Computational Linguistics
Month:
August
Year:
2021
Address:
Huhhot, China
Editors:
Sheng Li (李生), Maosong Sun (孙茂松), Yang Liu (刘洋), Hua Wu (吴华), Kang Liu (刘康), Wanxiang Che (车万翔), Shizhu He (何世柱), Gaoqi Rao (饶高琦)
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
1167–1177
Language:
English
URL:
https://aclanthology.org/2021.ccl-1.104
Cite (ACL):
Zhou Lian, Chen Yangdong, and Zhang Yuejie. 2021. Sketchy Scene Captioning: Learning Multi-Level Semantic Information from Sparse Visual Scene Cues. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 1167–1177, Huhhot, China. Chinese Information Processing Society of China.
Cite (Informal):
Sketchy Scene Captioning: Learning Multi-Level Semantic Information from Sparse Visual Scene Cues (Lian et al., CCL 2021)
PDF:
https://aclanthology.org/2021.ccl-1.104.pdf
Data:
Sketch, SketchyScene