When an Image Tells a Story: The Role of Visual and Semantic Information for Generating Paragraph Descriptions

Nikolai Ilinykh, Simon Dobnik


Abstract
Generating multi-sentence image descriptions is a challenging task, which requires a model to produce coherent and accurate paragraphs that describe the salient objects in the image. We argue that multiple sources of information are beneficial when describing visual scenes with long sequences. These include (i) perceptual information and (ii) semantic (language) information about how to describe what is in the image. We also compare the effects of using two different pooling mechanisms on either a single modality or their combination. We demonstrate that the model which utilises both visual and language inputs can generate accurate and diverse paragraphs when combined with a particular pooling mechanism. The results of our automatic and human evaluation show that learning to embed semantic information along with visual stimuli into the paragraph generation model is not trivial, suggesting a variety of directions for future experiments.
Anthology ID: 2020.inlg-1.40
Volume: Proceedings of the 13th International Conference on Natural Language Generation
Month: December
Year: 2020
Address: Dublin, Ireland
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 338–348
URL: https://aclanthology.org/2020.inlg-1.40
Cite (ACL):
Nikolai Ilinykh and Simon Dobnik. 2020. When an Image Tells a Story: The Role of Visual and Semantic Information for Generating Paragraph Descriptions. In Proceedings of the 13th International Conference on Natural Language Generation, pages 338–348, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
When an Image Tells a Story: The Role of Visual and Semantic Information for Generating Paragraph Descriptions (Ilinykh & Dobnik, INLG 2020)
PDF: https://aclanthology.org/2020.inlg-1.40.pdf
Data:
Image Description Sequences
Image Paragraph Captioning