IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images

Varuna Krishna Kolla, Suryavardan Suresh, Shreyash Mishra, Sathyanarayanan Ramamoorthy, Parth Patwa, Megha Chakraborty, Aman Chadha, Amitava Das, Amit Sheth


Abstract
Word embeddings, i.e., semantically meaningful vector representations of words, are largely influenced by the distributional hypothesis "You shall know a word by the company it keeps" (Harris, 1954), whereas modern prediction-based neural network embeddings rely on design choices and hyperparameter optimization. Word embeddings like Word2Vec and GloVe capture contextuality and real-world analogies well, but contemporary convolution-based image embeddings such as VGGNet and AlexNet do not capture contextual knowledge. The popular king-queen analogy does not hold true for most commonly used vision embeddings. In this paper, we introduce a pre-trained joint embedding (JE), named IMAGINATOR, trained on 21K distinct image objects. A JE is a way to encode multimodal data into a vector space where the text modality serves as the grounding key with which the complementary modality (in this case, the image) is anchored. IMAGINATOR encapsulates three individual representations: (i) object-object co-location, (ii) word-object co-location, and (iii) word-object correlation. These three representations capture complementary aspects of the two modalities and are combined to obtain the final object-word JEs.

The generated JEs are evaluated intrinsically to assess how well they capture contextuality and real-world analogies. We also evaluate the pre-trained IMAGINATOR JEs on three downstream tasks: (i) image captioning, (ii) Image2Tweet, and (iii) text-based image retrieval. IMAGINATOR establishes a new standard on the aforementioned downstream tasks by outperforming the current SoTA on all of them. The code is available at https://github.com/varunakk/IMAGINATOR.
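The abstract's "king-queen" analogy test refers to the well-known vector-arithmetic property of word embeddings: the offset king − man + woman should land nearest to queen in the embedding space. The sketch below illustrates that test on tiny hand-crafted 2-D vectors (purely hypothetical values chosen so the analogy holds by construction; real embeddings such as Word2Vec or the paper's JEs are high-dimensional and learned from data):

```python
import numpy as np

# Toy 2-D "embeddings" (hypothetical, for illustration only):
# axis 0 ~ "royalty", axis 1 ~ "gender" (+1 male, -1 female).
vocab = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),  # distractor word
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Analogy test: king - man + woman should be closest to queen.
target = vocab["king"] - vocab["man"] + vocab["woman"]
best = max((w for w in vocab if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vocab[w]))
print(best)  # -> queen
```

The paper's point is that the same nearest-neighbour analogy query fails when run over off-the-shelf CNN image embeddings, motivating word-grounded joint embeddings.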
Anthology ID: 2023.icon-1.1
Volume: Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Month: December
Year: 2023
Address: Goa University, Goa, India
Editors: Jyoti D. Pawar, Sobha Lalitha Devi
Venue: ICON
SIG: SIGLEX
Publisher: NLP Association of India (NLPAI)
Pages: 1–18
URL: https://aclanthology.org/2023.icon-1.1
Cite (ACL): Varuna Krishna Kolla, Suryavardan Suresh, Shreyash Mishra, Sathyanarayanan Ramamoorthy, Parth Patwa, Megha Chakraborty, Aman Chadha, Amitava Das, and Amit Sheth. 2023. IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images. In Proceedings of the 20th International Conference on Natural Language Processing (ICON), pages 1–18, Goa University, Goa, India. NLP Association of India (NLPAI).
Cite (Informal): IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images (Kolla et al., ICON 2023)
PDF: https://aclanthology.org/2023.icon-1.1.pdf