Query-based Image Captioning from Multi-context 360° Images

Koki Maeda, Shuhei Kurita, Taiki Miyanishi, Naoaki Okazaki


Abstract
A 360-degree image captures the entire surrounding scene, free of the field-of-view limits of an ordinary camera, so it is difficult to describe all of its contexts in a single caption. We propose a novel task called Query-based Image Captioning (QuIC) for 360-degree images, where a query (a word or short phrase) specifies the context to describe. This task is more challenging than conventional image captioning, which describes only the salient objects in an image, as it requires fine-grained scene understanding to select the contents consistent with the user's intent expressed by the query. We construct a dataset for the new task comprising 3,940 360-degree images and 18,459 manually annotated pairs of queries and captions. Experiments demonstrate that image captioning models further fine-tuned on our dataset generate more diverse and controllable captions covering multiple contexts of 360-degree images.
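This page does not include code; the sketch below is only an illustration of the query-conditioned captioning interface the abstract describes. It prompts an off-the-shelf BLIP captioning model (via Hugging Face Transformers) with the query as a text prefix. The checkpoint name, image path, and query string are hypothetical placeholders, and prefix prompting here is a stand-in for the fine-tuned query conditioning studied in the paper, not the authors' published pipeline.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Off-the-shelf captioning model; the paper fine-tunes such models on QuIC.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("pano_equirect.jpg").convert("RGB")  # hypothetical 360-degree panorama
query = "wooden table"                                  # query: a word or short phrase

# Conditional generation: BLIP treats the text as a prefix, so the decoded
# caption continues from the query and is steered toward that context.
inputs = processor(images=image, text=query, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))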
Anthology ID:
2023.findings-emnlp.463
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6940–6954
URL:
https://aclanthology.org/2023.findings-emnlp.463
DOI:
10.18653/v1/2023.findings-emnlp.463
Cite (ACL):
Koki Maeda, Shuhei Kurita, Taiki Miyanishi, and Naoaki Okazaki. 2023. Query-based Image Captioning from Multi-context 360° Images. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6940–6954, Singapore. Association for Computational Linguistics.
Cite (Informal):
Query-based Image Captioning from Multi-context 360° Images (Maeda et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.463.pdf