When does CLIP generalize better than unimodal models? When judging human-centric concepts

Romain Bielawski, Benjamin Devillers, Tim Van De Cruys, Rufin Vanrullen


Abstract
CLIP, a vision-language network trained with a multimodal contrastive learning objective on a large dataset of images and captions, has demonstrated impressive zero-shot abilities on a variety of tasks. However, recent work has shown that, compared to unimodal (visual) networks, CLIP's multimodal training does not improve generalization (e.g. few-shot or transfer learning) on standard visual classification tasks such as object, street-number, or animal recognition. Here, we hypothesize that CLIP's improved generalization abilities may be most prominent in domains that involve human-centric concepts (cultural, social, aesthetic, affective...); this is because CLIP's training dataset is mainly composed of image annotations made by humans for other humans. To evaluate this, we use three tasks that require judging human-centric concepts: sentiment analysis on tweets, and genre classification of books and movies. We introduce and publicly release a new multimodal dataset for movie genre classification. We compare CLIP's visual stream against two visually trained networks and CLIP's textual stream against two linguistically trained networks, as well as multimodal combinations of these networks. We show that CLIP generally outperforms the other networks, whether using one or two modalities. We conclude that CLIP's multimodal training is beneficial for both unimodal and multimodal tasks that require classification of human-centric concepts.
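The comparisons described above amount to few-shot (linear-probe) evaluation of frozen representations. The following is a minimal sketch, not the authors' released code, of how such a probe can be set up: frozen CLIP embeddings are extracted with the Hugging Face transformers CLIP model and a scikit-learn logistic regression is fit on a small labeled set. The checkpoint name and the placeholder training arrays are assumptions for illustration only.

# Minimal sketch (not the authors' code): linear-probe evaluation of frozen
# CLIP features with Hugging Face `transformers` and scikit-learn.
# The checkpoint name and placeholder arrays below are assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import LogisticRegression

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_features(pil_images, captions):
    """Return frozen visual and textual CLIP embeddings as NumPy arrays."""
    with torch.no_grad():
        img_in = processor(images=pil_images, return_tensors="pt")
        txt_in = processor(text=captions, return_tensors="pt",
                           padding=True, truncation=True)
        img_feats = model.get_image_features(**img_in)
        txt_feats = model.get_text_features(**txt_in)
    return img_feats.numpy(), txt_feats.numpy()

# Hypothetical probe: X_train can be image features, text features, or their
# concatenation (a "multimodal combination"); y_train holds genre/sentiment labels.
# probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# print("probe accuracy:", probe.score(X_test, y_test))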
Anthology ID:
2022.repl4nlp-1.4
Volume:
Proceedings of the 7th Workshop on Representation Learning for NLP
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Spandana Gella, He He, Bodhisattwa Prasad Majumder, Burcu Can, Eleonora Giunchiglia, Samuel Cahyawijaya, Sewon Min, Maximilian Mozes, Xiang Lorraine Li, Isabelle Augenstein, Anna Rogers, Kyunghyun Cho, Edward Grefenstette, Laura Rimell, Chris Dyer
Venue:
RepL4NLP
Publisher:
Association for Computational Linguistics
Pages:
29–38
URL:
https://aclanthology.org/2022.repl4nlp-1.4
DOI:
10.18653/v1/2022.repl4nlp-1.4
Cite (ACL):
Romain Bielawski, Benjamin Devillers, Tim Van De Cruys, and Rufin Vanrullen. 2022. When does CLIP generalize better than unimodal models? When judging human-centric concepts. In Proceedings of the 7th Workshop on Representation Learning for NLP, pages 29–38, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
When does CLIP generalize better than unimodal models? When judging human-centric concepts (Bielawski et al., RepL4NLP 2022)
PDF:
https://aclanthology.org/2022.repl4nlp-1.4.pdf
Video:
https://aclanthology.org/2022.repl4nlp-1.4.mp4
Data
Book Cover Dataset