Learning Visually Grounded Sentence Representations

Douwe Kiela, Alexis Conneau, Allan Jabri, Maximilian Nickel


Abstract
We investigate grounded sentence representations, where we train a sentence encoder to predict the image features of a given caption—i.e., we try to “imagine” how a sentence would be depicted visually—and use the resultant features as sentence representations. We examine the quality of the learned representations on a variety of standard sentence representation quality benchmarks, showing improved performance for grounded models over non-grounded ones. In addition, we thoroughly analyze the extent to which grounding contributes to improved performance, and show that the system also learns improved word embeddings.
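The idea in the abstract, training an encoder to predict a caption's image features and then reusing the predicted features as the sentence representation, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' model: the encoder is a simple average of word vectors followed by a linear projection, the loss is squared error, and the word vectors, image features, and function names (`mean_embed`, `project`, `train`) are hypothetical.

```python
import random

def mean_embed(tokens, emb, dim):
    """Average the word vectors of a caption (a stand-in for a real sentence encoder)."""
    vec = [0.0] * dim
    for tok in tokens:
        for i, x in enumerate(emb[tok]):
            vec[i] += x / len(tokens)
    return vec

def project(W, s):
    """Map a sentence vector into image-feature space (matrix-vector product W @ s)."""
    return [sum(w * x for w, x in zip(row, s)) for row in W]

def mse(pred, target):
    """Mean squared error between predicted and true image features."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def train(pairs, emb, word_dim, img_dim, epochs=500, lr=0.5, seed=0):
    """Fit W by per-example gradient descent so that W @ encode(caption) ~ image features."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(word_dim)] for _ in range(img_dim)]
    for _ in range(epochs):
        for tokens, img in pairs:
            s = mean_embed(tokens, emb, word_dim)
            pred = project(W, s)
            for i in range(img_dim):
                # Gradient of the squared-error loss with respect to row i of W.
                g = 2.0 * (pred[i] - img[i]) / img_dim
                for j in range(word_dim):
                    W[i][j] -= lr * g * s[j]
    return W

# Toy data (made up): 2-d word vectors and 3-d "image features" per caption.
emb = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "runs": [0.5, 0.5]}
pairs = [
    (["dog", "runs"], [1.0, 0.0, 0.5]),
    (["cat", "runs"], [0.0, 1.0, 0.5]),
]

W = train(pairs, emb, word_dim=2, img_dim=3)

# The predicted image features serve as the grounded sentence representation.
rep = project(W, mean_embed(["dog", "runs"], emb, 2))
final_loss = mse(rep, pairs[0][1])
```

After training, `rep` lives in the image-feature space and can be passed to a downstream classifier like any other sentence vector. A real system would replace the averaging encoder with a learned one and the toy vectors with features from a pretrained image model.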
Anthology ID:
N18-1038
Volume:
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
Month:
June
Year:
2018
Address:
New Orleans, Louisiana
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
408–418
URL:
https://aclanthology.org/N18-1038
DOI:
10.18653/v1/N18-1038
Cite (ACL):
Douwe Kiela, Alexis Conneau, Allan Jabri, and Maximilian Nickel. 2018. Learning Visually Grounded Sentence Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 408–418, New Orleans, Louisiana. Association for Computational Linguistics.
Cite (Informal):
Learning Visually Grounded Sentence Representations (Kiela et al., NAACL 2018)
PDF:
https://aclanthology.org/N18-1038.pdf
Video:
http://vimeo.com/277631178
Data
COCO, MPQA Opinion Corpus, SICK, SNLI, SST, SentEval