Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents

Jack Hessel, Lillian Lee, David Mimno


Abstract
Images and text co-occur constantly on the web, but explicit links between images and sentences (or other intra-document textual units) are often not present. We present algorithms that discover image-sentence relationships without relying on explicit multimodal annotation in training. We experiment on seven datasets of varying difficulty, ranging from documents consisting of groups of images captioned post hoc by crowdworkers to naturally-occurring user-generated multimodal documents. We find that a structured training objective based on identifying whether collections of images and sentences co-occur in documents can suffice to predict links between specific sentences and specific images within the same document at test time.
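As a rough illustration of the idea in the abstract, the sketch below trains on document-level co-occurrence only and reads off sentence-to-image links at test time. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the similarity function or the loss, so the cosine similarity, the max-then-mean document score, the hinge loss with margin 0.2, and every function name here are hypothetical choices.

```python
import numpy as np

def similarity_matrix(sent_vecs, img_vecs):
    """Cosine similarity between every sentence (rows) and every image (columns)."""
    s = sent_vecs / np.linalg.norm(sent_vecs, axis=1, keepdims=True)
    v = img_vecs / np.linalg.norm(img_vecs, axis=1, keepdims=True)
    return s @ v.T

def doc_score(sim):
    """Assumed document-level score: each sentence's best image match, averaged."""
    return float(sim.max(axis=1).mean())

def hinge_loss(sent_vecs, true_imgs, neg_imgs, margin=0.2):
    """Assumed objective: the sentences should score higher with the images
    from their own document than with images sampled from another document."""
    pos = doc_score(similarity_matrix(sent_vecs, true_imgs))
    neg = doc_score(similarity_matrix(sent_vecs, neg_imgs))
    return max(0.0, margin + neg - pos)

def predict_links(sent_vecs, img_vecs):
    """At test time, link each sentence to its most similar image in the same document."""
    return similarity_matrix(sent_vecs, img_vecs).argmax(axis=1)

# Toy embeddings (made up for illustration): two sentences, two images.
sents = np.array([[1.0, 0.0], [0.0, 1.0]])
own_imgs = np.array([[0.0, 1.0], [1.0, 0.0]])    # images from the same document
other_imgs = np.array([[1.0, 1.0], [1.0, 1.0]])  # images from a different document

links = predict_links(sents, own_imgs)
loss = hinge_loss(sents, own_imgs, other_imgs)
```

Only the document-level score enters the loss; the per-sentence links fall out of the same similarity matrix, which is why no explicit image-sentence annotation is needed during training in this sketch.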
Anthology ID:
D19-1210
Volume:
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Kentaro Inui, Jing Jiang, Vincent Ng, Xiaojun Wan
Venues:
EMNLP | IJCNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2034–2045
URL:
https://aclanthology.org/D19-1210
DOI:
10.18653/v1/D19-1210
Cite (ACL):
Jack Hessel, Lillian Lee, and David Mimno. 2019. Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2034–2045, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents (Hessel et al., EMNLP-IJCNLP 2019)
PDF:
https://aclanthology.org/D19-1210.pdf
Attachment:
D19-1210.Attachment.zip
Code:
jmhessel/multi-retrieval
Data:
MS COCO, RecipeQA