Does CLIP Bind Concepts? Probing Compositionality in Large Image Models

Martha Lewis, Nihal Nayak, Peilin Yu, Jack Merullo, Qinan Yu, Stephen Bach, Ellie Pavlick


Abstract
Large-scale neural network models combining text and images have made incredible progress in recent years. However, it remains an open question to what extent such models encode compositional representations of the concepts over which they operate, such as correctly identifying ‘red cube’ by reasoning over the constituents ‘red’ and ‘cube’. In this work, we focus on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way (e.g., differentiating ‘cube behind sphere’ from ‘sphere behind cube’). To inspect the performance of CLIP, we compare several architectures from research on compositional distributional semantics models (CDSMs), a line of research that attempts to implement traditional compositional linguistic structures within embedding spaces. We benchmark them on three synthetic datasets – single-object, two-object, and relational – designed to test concept binding. We find that CLIP can compose concepts in a single-object setting, but in situations where concept binding is needed, performance drops dramatically. At the same time, CDSMs also perform poorly, with best performance at chance level.
Anthology ID:
2024.findings-eacl.101
Volume:
Findings of the Association for Computational Linguistics: EACL 2024
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1487–1500
URL:
https://aclanthology.org/2024.findings-eacl.101
Cite (ACL):
Martha Lewis, Nihal Nayak, Peilin Yu, Jack Merullo, Qinan Yu, Stephen Bach, and Ellie Pavlick. 2024. Does CLIP Bind Concepts? Probing Compositionality in Large Image Models. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1487–1500, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Does CLIP Bind Concepts? Probing Compositionality in Large Image Models (Lewis et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-eacl.101.pdf
Software:
 2024.findings-eacl.101.software.zip