Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality

Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning, leading to state-of-the-art models for various downstream multimodal tasks. However, recent research has highlighted severe limitations of these models in their ability to perform compositional reasoning over objects, attributes, and relations. Scene graphs have emerged as an effective way to understand images compositionally. These are graph-structured semantic representations of images that contain objects, their attributes, and relations with other objects in a scene. In this work, we consider the scene graph parsed from text as a proxy for the image scene graph and propose a graph decomposition and augmentation framework along with a coarse-to-fine contrastive learning objective between images and text that aligns sentences of various complexities to the same image. Along with this, we propose novel negative mining techniques in the scene graph space for improving attribute binding and relation understanding. Through extensive experiments, we demonstrate the effectiveness of our approach, which significantly improves attribute binding, relation understanding, systematic generalization, and productivity on multiple recently proposed benchmarks (for example, improvements of up to $18\%$ for systematic generalization and $16.5\%$ for relation understanding over a strong baseline), while achieving similar or better performance than CLIP on various general multimodal tasks.


Introduction
Recent progress in contrastive learning using large-scale image-text data for joint image-text representation learning has led to Vision-Language Models (VLMs) like CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) that show remarkable zero-shot classification and retrieval capabilities. However, recent works have shown that these models struggle at compositional reasoning (Yuksekgonul et al., 2022; Thrush et al., 2022; Ma et al., 2022). In particular, they struggle with binding the correct attributes to the correct objects, understanding relations between objects, and generalizing systematically to unseen combinations of concepts and to larger and more complex sentences.
Some works have made progress on this problem. Yuksekgonul et al. (2022) show that hard negative mining of images and text during fine-tuning is a promising first step towards improving compositionality. However, performance gains are highly dependent on how clean the training data is, and generalizing to unseen combinations of concepts remains a challenge. Doveh et al. (2023) use LLMs for hard negative mining, and Cascante-Bonilla et al. (2023) explore using synthetic datasets to improve compositional understanding in VLMs; synthetic datasets, however, introduce a domain gap relative to natural data. We aim to develop a general-purpose approach for improving the compositionality of all such contrastively trained VLMs.
In this paper, we consider a scene graph representation of the image and text. We observe that multiple sub-graphs of the text scene graph, with different semantic complexities, can be matched to the same image. Performing this matching improves fine-grained and hierarchical understanding of text and, thereby, of images. We achieve this by developing a scene graph-based text decomposition strategy that creates a scene graph for any given text, decomposes it into sub-graphs, and matches an image to multiple sentences derived from these sub-graphs (see Fig. 2 for an overview). Each sub-graph represents a distinct part of the image, aligning well with CLIP's original image-text matching objective. Focusing on improving attribute binding and relation understanding, we develop novel hard negative graph creation strategies that help V-L contrastive learning. We provide a novel Image-to-Multi-Text contrastive loss for matching individual images to multiple sentences. Our approach of matching texts of different complexity (from coarse-grained to fine-grained) to the image leads to fine-grained and hierarchical text understanding. Our resulting model is MosaiCLIP.
Our approach leads to significant improvements across compositionality benchmarks. For example, Figure 1 b) and c) show that MosaiCLIP improves performance by 11.5% and 9.1% on the CREPE and ARO datasets over a strong baseline, and by > 20% over CLIP. Our contributions are: • A novel graph-based text decomposition and augmentation framework and a coarse-to-fine contrastive learning objective for matching images to text sub-graphs of varying complexity.
• Hard-negative mining techniques based on graph transformations of text scene graphs, which are seamlessly coupled with our text decomposition strategy and can be applied to any text.
• A thorough analysis of why MosaiCLIP improves vision-language compositionality, disentangling the effect of image and text encoders and providing a novel tree-score-based analysis showing that MosaiCLIP exhibits improved hierarchical text understanding.
• Extensive experiments over three model architectures, two pre-training datasets, and three fine-tuning datasets, with tests on four compositionality benchmarks (11 datasets), to demonstrate the efficacy of MosaiCLIP for improving compositionality.

Related Work
Contrastive Vision-Language Pre-training: Large-scale contrastive learning for vision and language is utilized to create models like CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021). These models showcase impressive performance on a variety of tasks, including image classification, text and image retrieval, image captioning (Mokady et al., 2021), and object detection (Zhong et al., 2022; Li et al., 2022c).
Visio-Linguistic Compositionality: Various studies have introduced benchmarks for assessing the compositional reasoning abilities of vision-language foundation models (VLMs). For instance, Winoground (Thrush et al., 2022) is a hand-picked collection of 400 test cases, each comprising two images and two sentences; the sentences have the same word content and differ only in word order. Diwan et al. (2022) show that the Winoground dataset tests additional challenges beyond compositionality, including handling ambiguous image-text pairs and unusual examples. Yuksekgonul et al. (2022) proposed the ARO benchmark for probing VLMs' ability to understand attributes, relations, and word order. Ma et al. (2022) proposed CREPE for measuring two aspects of compositionality: systematic generalization and productivity. All benchmarks suggest that contrastively trained VLMs have severe difficulty with compositional reasoning. As a remedy, NegCLIP (Yuksekgonul et al., 2022) and Teaching SVLC (Doveh et al., 2023) create targeted rule-based and LLM-guided hard negative sentences, and SyViC (Cascante-Bonilla et al., 2023) fine-tunes CLIP with million-scale synthetic image-text pairs to improve relational and attribute understanding. We observe that previous methods are either highly dependent on how clean the training data is, use expensive LLMs for data augmentation, or use synthetic datasets that require special solutions to resolve the synthetic-to-real domain gap. We hence develop a coarse-to-fine contrastive learning framework that matches images with texts of multiple complexities, which serves as a general-purpose solution to improve fine-grained and hierarchical text understanding, thereby improving compositionality.
Scene Graphs are structured representations of visual scenes, consisting of objects, their attributes, and relationships between objects. Scene graphs are beneficial for a range of tasks including image retrieval (Wu et al., 2019; Johnson et al., 2015), image captioning (Yang et al., 2019), and image generation (Johnson et al., 2018), among others.

Overview
Here we present the key high-level ideas of our approach. We first present a graph-centric view of the standard image-text matching objective in CLIP, which serves as motivation for our approach (Sec. 3.2). We create scene graphs derived from the text, decompose them into multiple sub-graphs (Sec. 3.3), and apply augmentations on these sub-graphs to create negative sub-graphs (Sec. 3.4), which are used as hard negatives in a batch. Sec. 3.5 formally defines the Image-to-Multi-Text and Text-to-Image losses used for a batch of V-L inputs, which are key for learning from multiple positive and negative texts derived from sub-graphs. Matching images with coarse-to-fine sub-graphs results in improved fine-grained and hierarchical understanding of text. Sec. 3.6 provides a two-stage curriculum learning strategy for improved fine-tuning performance.

Image-Text-Graph Alignment
Our approach builds on the idea that standard image-text contrastive learning in CLIP can be viewed as a matching between an image scene graph and one of its sub-graphs. Formally, given an image-text pair (I, T), the image can be represented by its scene graph G_I and the text by its scene graph G_T, which we assume to be a sub-graph of G_I. According to this assumption, during contrastive learning in CLIP, we implicitly bring the representation of the image scene graph close to one of its sub-graphs (the text scene graph). Now, let S_G = {g | g ⊂ G} represent the set of sub-graphs of a graph G. According to the assumption above, g ∈ S_{G_T} ⇒ g ∈ S_{G_I}. Hence, for all g ∈ S_{G_T}, (g, G_I) becomes a correct matching pair during contrastive learning. We match multiple sub-graphs of the text scene graph to the same image, while also including hard negative sub-graphs in the batch. Matching between graphs is an implicit concept: all graphs are first converted to text via templates, encoded with transformer-based (text) encoders, and matched to image embeddings.

Scene Graph Guided Text Decomposition
Scene graphs are succinct representations of images. However, an image scene graph generator that produces a scene graph for any given input image is expensive to train, since it requires supervised scene graph annotations (Li et al., 2017; Xu et al., 2017; Zhang et al., 2019), and it also suffers from issues like low coverage and biased generations due to the long-tail nature of object and relationship annotations. We instead use the text scene graph created by an off-the-shelf text scene graph parser (Wu et al., 2019). This serves as a proxy for the scene graph of (part of) the image and is assumed to be a sub-graph of the image scene graph, as depicted in Figure 2.
Let the text scene graph be G_T = (V_T, E_T), where V_T are the nodes of the graph, which are either objects or their attributes, and E_T are the edges of the graph, which represent relations between objects. See Fig. 2 for an example of a text scene graph. As shown in the figure, we decompose this scene graph into multiple positive sub-graphs P_g = {g_1, g_2, ..., g_M}, where M is the maximum number of decomposed sub-graphs and is a hyperparameter. Each sub-graph is a representation of a part of the image. We then convert sub-graphs to sentences so that they can be easily processed by the transformer-based (text) encoders commonly used to train CLIP. For this, we use a simple template-based approach. For example, we use templates of the form "{N_1} {R} {N_2}" to convert a graph having two nodes (N_1, N_2) and a relation R into a sentence. Corresponding to each sub-graph, we obtain one positive text for the image, creating a positive text set for that image.
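As a concrete illustration, the snippet below sketches how a parsed text scene graph could be decomposed into one- and two-object sub-graphs and verbalized with such templates. The SceneGraph container, the decompose and to_text helpers, and the example caption are our own illustrative assumptions, not the exact data structures used in this work.

```python
# Illustrative sketch: decomposing a parsed text scene graph into sub-graphs
# and converting each sub-graph to a sentence with a simple template.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SceneGraph:
    objects: List[str]                                                   # e.g. ["cat", "table"]
    attributes: Dict[str, List[str]] = field(default_factory=dict)       # object -> attributes
    relations: List[Tuple[str, str, str]] = field(default_factory=list)  # (obj1, relation, obj2)

def decompose(graph: SceneGraph, max_subgraphs: int = 3) -> List[SceneGraph]:
    """Enumerate sub-graphs with one or two objects (plus their attributes/relations)."""
    subs = []
    for obj in graph.objects:                               # single-object sub-graphs
        subs.append(SceneGraph([obj], {obj: graph.attributes.get(obj, [])}, []))
    for o1, rel, o2 in graph.relations:                     # two-object sub-graphs
        attrs = {o: graph.attributes.get(o, []) for o in (o1, o2)}
        subs.append(SceneGraph([o1, o2], attrs, [(o1, rel, o2)]))
    return subs[:max_subgraphs]

def to_text(g: SceneGraph) -> str:
    """Verbalize a sub-graph with a "{N1} {R} {N2}"-style template."""
    def noun_phrase(obj: str) -> str:
        return " ".join(g.attributes.get(obj, []) + [obj])
    if g.relations:
        o1, rel, o2 = g.relations[0]
        return f"{noun_phrase(o1)} {rel} {noun_phrase(o2)}"
    return noun_phrase(g.objects[0])

# Example caption: "a black cat sitting on a wooden table"
graph = SceneGraph(["cat", "table"],
                   {"cat": ["black"], "table": ["wooden"]},
                   [("cat", "sitting on", "table")])
print([to_text(s) for s in decompose(graph)])
# -> ['black cat', 'wooden table', 'black cat sitting on wooden table']
```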

Negative Sub-Graph Creation
Corresponding to the sub-graphs in P_g, we create negative sub-graphs N_g = {n_g1, n_g2, n_g3, ...}. Sub-graphs in N_g are minimally perturbed versions of the positive sub-graphs in P_g. As with the positive sub-graphs, we convert the sub-graphs in N_g to text using the same template-based approach and use the resulting sentences as hard negative texts in a given batch (see Fig. 2). We focus on creating negative sub-graphs that improve the attribute binding and relation understanding capabilities of the model, for which we use the following strategies. We first consider an external set of objects (N), attributes (A), and relations (R). 1) Node Swapping and Replacement: We swap nodes in sub-graphs; these can be swaps of attribute or object nodes. We also replace nodes with external nodes from N or A based on their type. 2) Edge Replacement: We replace edges with randomly sampled edges from the external relation set R. 3) Connecting Sub-graphs: Here we join two sub-graphs. For this, we use one sub-graph from P_g and another random graph created from nodes and edges sampled from the external sets N, A, R; the two are joined by connecting nodes from both graphs through a randomly sampled edge from R. This creates an overall hard negative graph. These strategies result in minimally perturbed hard negative sub-graphs for improving attribute and relation understanding. Using the above techniques, we define multiple graph transformations f_rel, f_attr, f_obj : G → P(G) and create hard negative sub-graphs. A sketch of these perturbations is shown below; see Appendix Sec. B for more details regarding negative sub-graph creation.
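The following sketch illustrates the three perturbation strategies, reusing the SceneGraph container from the previous snippet. The external sets N_ext, A_ext, R_ext and the helper names are illustrative assumptions rather than the paper's exact implementation.

```python
# Illustrative sketch of hard-negative sub-graph creation: node swap/replacement,
# edge replacement, and connecting a positive sub-graph to a random external node.
import random

N_ext = ["dog", "chair", "bowl"]        # external object nodes (N)
A_ext = ["red", "plastic", "striped"]   # external attribute nodes (A)
R_ext = ["under", "next to", "behind"]  # external relation edges (R)

def swap_objects(g: SceneGraph) -> SceneGraph:
    """Node swap: exchange the two object nodes around a relation."""
    o1, rel, o2 = g.relations[0]
    return SceneGraph([o2, o1], g.attributes, [(o2, rel, o1)])

def replace_attribute(g: SceneGraph) -> SceneGraph:
    """Node replacement: substitute one attribute with an external attribute.
    Assumes at least one object in the sub-graph has an attribute."""
    obj = random.choice([o for o in g.objects if g.attributes.get(o)])
    new_attrs = dict(g.attributes)
    new_attrs[obj] = [random.choice(A_ext)] + new_attrs[obj][1:]
    return SceneGraph(g.objects, new_attrs, g.relations)

def replace_relation(g: SceneGraph) -> SceneGraph:
    """Edge replacement: substitute the relation edge with an external relation."""
    o1, _, o2 = g.relations[0]
    return SceneGraph(g.objects, g.attributes, [(o1, random.choice(R_ext), o2)])

def connect_with_random_subgraph(g: SceneGraph) -> SceneGraph:
    """Connect the positive sub-graph to a random external node via a random edge."""
    extra = random.choice(N_ext)
    return SceneGraph(g.objects + [extra],
                      {**g.attributes, extra: [random.choice(A_ext)]},
                      g.relations + [(g.objects[0], random.choice(R_ext), extra)])
```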

Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space
Given an image-text batch during training B = {(x_i, t_i)}_{i=1}^n, consider separately the batch of images B_I = {x_i}_{i=1}^n and the batch of texts B_T = {t_i}_{i=1}^n. The sentences in the text batch are first converted to scene graphs to obtain a batch of scene graphs B_G = {G_i}_{i=1}^n, followed by decomposition into sub-graphs to obtain the positive sub-graph batch B_g^{pos} = {g_i}_{i=1}^m, m > n. r negative sub-graphs are sampled and added to the batch to obtain B_g = {g_i}_{i=1}^{m+r}. We convert these sub-graphs to text to obtain the final text batch B_t = {t_{g_i}}_{i=1}^{m+r}. Consider an image encoder f_θ parameterized by θ and a text encoder f_ϕ parameterized by ϕ. For any image x and text t, ũ = f_θ(x) is the unnormalized image feature and ṽ = f_ϕ(t) is the unnormalized text feature. As is common practice, the features are normalized to obtain u = ũ/∥ũ∥ and v = ṽ/∥ṽ∥. Let u_i denote the normalized feature of image x_i, v_j the normalized feature of text t_{g_j}, P(i) ⊆ {1, ..., m} the indices of texts derived from positive sub-graphs of image x_i, and a(j) the index of the image from which positive text t_{g_j} was derived; texts with indices 1, ..., m and m+1, ..., m+r represent the texts in B_t obtained from positive and negative sub-graphs respectively. The Image-to-Multi-Text contrastive loss is given by:

$$\mathcal{L}^{MC}_{i2t} = -\frac{1}{n}\sum_{i=1}^{n}\frac{1}{|P(i)|}\sum_{j \in P(i)} \log \frac{\exp(u_i^\top v_j/\tau)}{\sum_{k=1}^{m+r}\exp(u_i^\top v_k/\tau)}$$

where τ is the temperature. The Text-to-Image contrastive loss is only calculated for the positive texts and is given by:

$$\mathcal{L}^{MC}_{t2i} = -\frac{1}{m}\sum_{j=1}^{m} \log \frac{\exp(v_j^\top u_{a(j)}/\tau)}{\sum_{k=1}^{n}\exp(v_j^\top u_k/\tau)}$$

The overall loss is $\mathcal{L}_{\text{MosaiCLIP}} = (\mathcal{L}^{MC}_{t2i} + \mathcal{L}^{MC}_{i2t})/2$.
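To make the batch construction concrete, the sketch below gives one possible PyTorch reading of these two losses. It is illustrative rather than the released implementation: the pos_idx mapping (indices of positive texts per image), the tensor shapes, and the fixed temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def image_to_multi_text_loss(image_feats, text_feats, pos_idx, tau=0.07):
    """image_feats: [n, d], text_feats: [m + r, d], both L2-normalized.
    pos_idx[i] lists indices of texts derived from positive sub-graphs of image i;
    texts from negative sub-graphs appear only in the softmax denominator."""
    logits = image_feats @ text_feats.t() / tau            # [n, m + r]
    log_prob = F.log_softmax(logits, dim=1)
    per_image = [-log_prob[i, idx].mean() for i, idx in enumerate(pos_idx)]
    return torch.stack(per_image).mean()

def text_to_image_loss(image_feats, text_feats, pos_idx, tau=0.07):
    """Only positive texts are matched back to the image they were derived from."""
    logits = text_feats @ image_feats.t() / tau            # [m + r, n]
    log_prob = F.log_softmax(logits, dim=1)
    losses = [-log_prob[j, i] for i, idx in enumerate(pos_idx) for j in idx]
    return torch.stack(losses).mean()

# Overall objective, following L_MosaiCLIP = (L_t2i + L_i2t) / 2:
# loss = 0.5 * (image_to_multi_text_loss(u, v, pos_idx)
#               + text_to_image_loss(u, v, pos_idx))
```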

Curriculum and Robust Fine-tuning
For fine-tuning experiments, we develop a two-stage curriculum learning strategy motivated by recent work (Goyal et al., 2022; Wortsman et al., 2022; Kumar et al., 2022) showing that fine-tuning can distort pre-trained features and that closely mimicking the contrastive pre-training objective while fine-tuning CLIP can help mitigate this problem (Goyal et al., 2022). However, our coarse-to-fine contrastive learning objective naturally deviates from pre-training in two ways: a) the existence of hard negative texts in the batch, and b) having multiple positive and negative texts per image. This can lead to a gap between the pre-training and fine-tuning objectives, and a lower than optimal performance after fine-tuning. To address this, our two-stage curriculum learning strategy first fine-tunes the model while sampling (at most) a single positive and negative sub-graph per image, and then fine-tunes it with multiple positive and negative sub-graphs. The hardness of data in this curriculum learning setup is defined by how much the fine-tuning setup differs from the pre-training setup. According to this intuition, it is easier for the model to first learn to handle hard negatives in a batch and then learn to handle multiple positive and hard negative sentences at once. We see consistent improvements using this strategy compared to direct one-step fine-tuning, which we term MosaiCLIP NoCurric in our ablations. For better performance on non-compositional tasks, we use the robust fine-tuning approach (Wortsman et al., 2022) of weight-space ensembling of the vision encoder before and after fine-tuning. This model is called MosaiCLIP WiSE-FT.
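The robust fine-tuning step is standard weight-space ensembling; a minimal sketch is shown below, assuming PyTorch modules and an interpolation coefficient alpha (the function name and the default value of alpha are illustrative assumptions, not values fixed by this work).

```python
# Sketch of the WiSE-FT step used for MosaiCLIP WiSE-FT: linearly interpolate the
# vision encoder's weights before and after fine-tuning.
import copy
import torch

def wise_ft_vision_encoder(zero_shot_encoder: torch.nn.Module,
                           finetuned_encoder: torch.nn.Module,
                           alpha: float = 0.5) -> torch.nn.Module:
    """Return an encoder whose weights are (1 - alpha) * zero-shot + alpha * fine-tuned."""
    merged = copy.deepcopy(finetuned_encoder)
    sd_zs = zero_shot_encoder.state_dict()
    sd_ft = finetuned_encoder.state_dict()
    merged_sd = {
        # interpolate floating-point tensors; copy buffers such as integer counters as-is
        k: (1 - alpha) * sd_zs[k] + alpha * sd_ft[k] if sd_ft[k].is_floating_point() else sd_ft[k]
        for k in sd_ft
    }
    merged.load_state_dict(merged_sd)
    return merged
```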

Experiments
Evaluation Datasets: We test MosaiCLIP and baselines on large-scale benchmarks that require compositional reasoning: CREPE-Systematicity (Ma et al., 2022) measures systematic generalization, ARO (Yuksekgonul et al., 2022) measures attribute, relation, and word-order understanding, SVO (Hendricks and Nematzadeh, 2021) measures verb (relation) understanding, and VL-Checklist (Zhao et al., 2022) measures relation, attribute, and object understanding. We use CREPE-Productivity (Ma et al., 2022) for measuring a model's ability to productively generalize to more complex and longer sentences. Methods for improving compositionality should also be tested on general downstream tasks used to evaluate the quality of the learned language and vision representations. For this, we utilize the popular ELEVATER benchmark (Li et al., 2022a), consisting of 20 datasets, and ImageNet (Deng et al., 2009), following prior work (Doveh et al., 2023). Baselines: We compare with all recent techniques for improving the compositionality of CLIP-style models, including NegCLIP (Yuksekgonul et al., 2022), Teaching SVLC (Doveh et al., 2023), and Syn-CLIP (Cascante-Bonilla et al., 2023), along with CLIP (Radford et al., 2021) and CLIP-FT (CLIP fine-tuned on the datasets we use). See Appendix Sec. F for more details.

Training and Evaluation Details:
Fine-tuning: NegCLIP (Yuksekgonul et al., 2022) was developed by fine-tuning CLIP on the COCO dataset (Lin et al., 2014); however, COCO images may overlap with benchmarks like CREPE and ARO, which may confound results. Hence, we consider two additional similar-sized fine-tuning datasets randomly sampled from CC-12M (Sharma et al., 2018; Changpinyo et al., 2021) and YFCC-15M (Thomee et al., 2016), which we call CC-FT and YFCC-FT. We also use CC3M (Sharma et al., 2018) for comparing with recent baselines. We fine-tune the commonly used OpenAI CLIP-ViT-B32 model and report results on all datasets, except for the CREPE dataset, which tests systematic generalization; for CREPE we use OpenCLIP (Ilharco et al., 2021) models pre-trained on {CC-12M, YFCC-15M}, fine-tune them on {CC-FT, YFCC-FT}, and report results on the {CC-12M, YFCC-15M} splits of CREPE. See Appendix E.3 for more information on evaluation datasets.
Pre-training: We pre-train MosaiCLIP, NegCLIP, and CLIP on two prominent large-scale pre-training datasets, CC-12M and YFCC-15M, using two different backbones (ResNet-50 and Swin-Tiny) following prior work (Yang et al., 2022), and report zero-shot performance on all test datasets. See Appendix H.1 for hyperparameter details.

Results
In this section we provide experimental results in both pre-training and fine-tuning settings to show the efficacy of our approach. Fine-tuning: The main fine-tuning results are shown in Tables 1 and 2, where we fine-tune CLIP models using our method and compare with baselines. Notably, the generalization performance on unseen compounds and atoms, as measured by the CREPE dataset, is up to 18% higher than NegCLIP. Additionally, MosaiCLIP shows up to 16.5%, 5.3%, and 32.3% improvements over NegCLIP in understanding relations, attributes, and word order, respectively.
MosaiCLIP also shows consistent improvements in the verb understanding task as measured by the SVO dataset. Additional Comparisons: We also compare with the latest contemporary works in Table 2 and Appendix Sec. D.1. We find significant improvements (up to 14% on ARO) over models that use LLMs or synthetic data to make CLIP more compositional.
Pre-training: Table 3 shows pre-training results over all benchmarks. CREPE results show a significant gain in the ability to systematically generalize to unseen combinations of concepts: across pre-training settings, MosaiCLIP improves over NegCLIP by up to 42.5% and 4.9% when evaluated against HN-Comp (CU) and HN-Atom (AU) hard negatives respectively, and significant improvements are observed in attribute and relation understanding, with gains of up to 8.3% and 12.0% respectively across pre-training settings. We also note that the word-order understanding of MosaiCLIP is worse than that of NegCLIP for the CC-12M pre-training dataset, while better than NegCLIP for the YFCC-15M dataset. Notably, there is a large variance in NegCLIP's performance across pre-training datasets, as seen in Table 3, and it also performs poorly when the pre-training dataset is noisier (e.g., YFCC-15M). MosaiCLIP is fairly consistent and more robust to the change in pre-training dataset. In Appendix C.5 we find that MosaiCLIP can provide improvements over NegCLIP while using as little as 0.3x of the total pre-training or fine-tuning data.
Results on classification and retrieval: On average, MosaiCLIP achieves +3.3% and +6.3% better performance on the ELEVATER classification benchmark compared to NegCLIP and CLIP when pre-training, and maintains similar accuracy to CLIP when fine-tuning. We also combine our method with the robust fine-tuning technique (WiSE-FT) so that performance degradation during fine-tuning is minimal, as shown in Appendix Table 9. See Fig. 3 for average results on ELEVATER over four training settings and Table 4 for results on retrieval benchmarks, where we see a +5.4 point improvement over NegCLIP; we use the popular Karpathy splits, with 5K and 1K test sets, to obtain the COCO and Flickr30k retrieval scores respectively. Productivity: As defined by Ma et al. (2022), a productive V-L model can handle arbitrarily long and complex sentences, which is an important aspect of compositionality. Although we do not explicitly train our models for generalization to longer sentences, the improved hierarchical language understanding from our methods leads to an emergent behavior whereby MosaiCLIP generalizes better than NegCLIP and CLIP to more complex sentences. We can see this effect in Fig. 4 a) and Appendix Figs. 8 and 9. We report the average of retrieval over the swap and atom splits and find that MosaiCLIP significantly improves over NegCLIP by up to 15% across different text complexities (4-12).
Application to more advanced VLMs: While our focus in this work has been on CLIP-style dual-encoder models due to their various benefits, we believe our methods are model agnostic and aimed at improving contrastive learning through our coarse-to-fine learning framework and negative mining techniques. In this section we test our approach on a more advanced VLM, BLIP. We modify BLIP's original image-text contrastive learning objective and create two variants: BLIP+NegCLIP, which uses NegCLIP-style hard negatives, and BLIP+MosaiCLIP, which uses our scene graph guided text decomposition and negative sub-graph creation. We fine-tune the BLIP model taken from the official BLIP repository and use the "BLIP w/ ViT-B and CapFilt-L" model (pre-trained on 129M examples) as our base model. Results for the fine-tuning experiment on the COCO dataset are shown in Table 5. We use the hyperparameters from the official codebase (for the task of fine-tuning on COCO for image-text retrieval). For each setting, we report the performance of four models: BLIP (before fine-tuning), BLIP-FT (vanilla fine-tuned), BLIP+NegCLIP, and BLIP+MosaiCLIP. The models are evaluated on the ARO dataset to measure attribute, relation, and word-order understanding, using the evaluation scripts provided by the authors of the dataset (Yuksekgonul et al., 2022).

Model            Rel   Attr  Ord   Avg
BLIP             53.5  91.0  53.5  66.0
BLIP-FT          58.9  88.4  58.9  68.7
BLIP+NegCLIP     63.6  90.7  63.6  72.6
BLIP+MosaiCLIP   69.9  91.1  69.9  77.0

We find that, compared to vanilla fine-tuning, both the NegCLIP and MosaiCLIP methodologies bring improvements to relation and word-order understanding, while maintaining or improving performance on attribute understanding. The MosaiCLIP methodology significantly improves relational reasoning and word-order understanding compared to the NegCLIP methodology, by up to 6.3%. Attribute understanding remains nearly the same as for the baseline BLIP, with the MosaiCLIP methodology bringing slight gains over NegCLIP's methodology. On average, MosaiCLIP's methodology brings larger improvements to BLIP than NegCLIP or vanilla fine-tuning.

Analysis
We provide a detailed analysis of our models and baselines across different dimensions, as follows. Disentangling MosaiCLIP improvements: We quantify the relative importance of the vision and language sides by freezing the language and vision encoders individually while fine-tuning all models. See Fig. 4 c,d for the results. Notably, we find that 1) the language encoder has significant scope for improvement over NegCLIP's language encoder, and MosaiCLIP successfully exploits this potential to deliver an enhanced compositional understanding of language, which is evident from performance increases of +3.7% and +6.9% over NegCLIP when only the language encoder is fine-tuned, as shown in Fig. 4 c,d.
2) The improvements brought by MosaiCLIP over NegCLIP in the text encoder are always higher than the improvements in the image encoder. This is evident from Fig. 4 c,d, where the performance increase over NegCLIP when only the language encoder is fine-tuned is always higher than when only the image encoder is fine-tuned; for example, 3.7% > 0.0% and 6.9% > 1.8% for ARO-Relation and ARO-Attribution. 3) MosaiCLIP brings significant improvements on the image encoder side (higher than NegCLIP) without using any image negative mining, unlike NegCLIP.
MosaiCLIP improves hierarchical text understanding: To further understand MosaiCLIP's improved compositional understanding, we provide a novel analysis based on the recently proposed Tree-Score (Murty et al., 2022), which measures the degree to which a transformer (text) encoder processes text in a hierarchical manner. We hypothesize that tree-like hierarchical computation over language can be one leading factor in explaining the compositionality (or lack thereof) of CLIP-like models. Along with this, we have shown above that the language encoder has the most prominent effect on improving compositionality in the case of MosaiCLIP. These two reasons motivate the use of tree-score to compare the language encoders' hierarchical understanding capability. Fig. 4 a) shows that MosaiCLIP's language encoder has higher tree-scores than NegCLIP's language encoder, suggesting that MosaiCLIP performs more tree-like computations. This explains the improved language compositionality of MosaiCLIP, since a hierarchical, tree-structured computation allows the language encoder to better understand input text compositionally, thereby improving vision-language compositionality. This is in line with the hypothesis, for which there is significant evidence, that humans' semantic understanding of sentences involves a hierarchical (tree-structured) computation (Crain and Nakayama, 1987; Hale et al.).

Figure 5: Qualitative analysis on the ARO dataset (Top: ARO-Attribution, Bottom: ARO-Relation). Models highlighted in blue match the image to the correct sentence (in green), while models in white match the image to the incorrect sentence (in red). Here, models are taken from our fine-tuning experiments on COCO from Table 1.
MosaiCLIP is Robust: Noisy texts often have meaningful sub-texts, which MosaiCLIP can exploit; hence MosaiCLIP often achieves a consistent performance increase regardless of noise in the pre-training or fine-tuning dataset. For example, NegCLIP achieves significantly lower performance on ARO when fine-tuned with YFCC-FT (which has noisier text) as compared to CC-FT or COCO, as shown in Table 1. NegCLIP takes a > 10% hit in performance across various ARO datasets when the fine-tuning dataset is changed from COCO to YFCC, whereas MosaiCLIP achieves similar performance using both datasets. Appendix Sec. D.3 shows that pre-trained MosaiCLIP is robust to natural distribution shifts.
Qualitative Analysis: We take MosaiCLIP, NegCLIP, and CLIP fine-tuned on COCO and filter out examples from the ARO dataset where MosaiCLIP and NegCLIP disagree. Some notable examples in Fig. 5 include cases where NegCLIP and CLIP struggle to understand simple concepts, like the colors of the cat and the table (top-left of Fig. 5) or the "is holding" relation between the sandwich and the box (bottom-right of Fig. 5).

Ablations
Table 6 and Appendix Tables 8 and 9 show the effect of curriculum learning and robust fine-tuning. We find that curriculum learning brings consistent improvements of up to 1.2% on average, and the robust fine-tuning (WiSE-FT) technique performs best on zero-shot tasks (i.e., minimal forgetting while fine-tuning), while still improving over NegCLIP by about 5% on compositional reasoning tasks. Table 7 shows the effects of different kinds of sub-graphs sampled during training. More details, including the effect of sampling a larger number of sub-graphs, are presented in Appendix Sec. C.

Conclusion
We present a method to improve the compositional reasoning capabilities of contrastively trained large vision-language models. In particular, we provide a coarse-to-fine contrastive learning framework and a scene graph-based text decomposition strategy for matching sub-graphs of the text scene graph, of varying complexity, to an image during contrastive learning. We also develop hard negative graph creation strategies focused on improving attribute binding and relation understanding. Our techniques lead to significant improvements in compositional reasoning capabilities. We investigate the reasons for the improved compositionality and present a novel finding based on language encoder tree-scores, suggesting that our models learn improved fine-grained and hierarchical text understanding, which is likely the key reason for the improved vision and language compositionality of MosaiCLIP compared to baselines.

Limitations
Computational Cost: Although MosaiCLIP leads to significant performance gains on several benchmarks that test compositional reasoning, it requires a higher per-batch computational cost while training. We give a detailed analysis of the computational cost in Appendix C.6 and show that simply providing more compute to prior methods in the form of larger batch sizes does not improve compositional reasoning. We also show ways to reduce this computational cost by using less data in Appendix C.5, since MosaiCLIP is data efficient and can provide improvements over baselines with as little as 0.3x of the total data. This, along with our ablations in Appendix C.1, gives a practitioner some control to vary either the training dataset size or the number of sub-graphs in our method and obtain a clean tradeoff between accuracy and compute. As future work we would like to develop a coarse-to-fine grained objective requiring minimal extra computational cost per batch. Future work should also look at decreasing the extra computational cost incurred by contemporary methods like Syn-CLIP (Cascante-Bonilla et al., 2023) and Teaching SVLC (Doveh et al., 2023).
Other Vision Language Models: In our current work we primarily aim to improve the compositionality of CLIP-style, dual-tower models trained using large-scale contrastive learning, since they severely lack compositional reasoning capabilities as shown by Yuksekgonul et al. (2022). Many other VLMs exist, such as those with cross-modal interactions between vision and language, e.g., BLIP (Li et al., 2022b), X-VLM (Zeng et al., 2021), and LXMERT (Tan and Bansal, 2019). Although our methods show promise in improving more advanced VLMs like BLIP, as shown in Section 4 and Table 5, a more thorough analysis would be beneficial to study the extent to which our methods can improve vision-language contrastive learning for these models.
Sentence Templates: For simplicity, we currently use manually curated templates to convert sub-graphs to sentences; however, this can lead to similar-looking and synthetic-sounding sentences. Large language models like GPT-4 (OpenAI, 2023) and BLOOM (Mitchell et al., May 2021-May 2022) should be looked into for generating sentences from scene graphs, by directly giving the LLM a scene graph as input and asking it to generate a sentence. This approach might be effective but may also lead to a higher computational cost while training.

Appendix A Background
Contrastive Language-Image Pre-training (CLIP; Radford et al., 2021) aims to learn general-purpose representations of vision and language using paired image-text data. This is achieved using contrastive learning in the image-text space. In particular, consider a pre-training dataset of size n, D ⊂ X × T, D = {x_i, t_i}_{i=1}^n. Here X and T are the space of images and text, respectively, and x_i, t_i are images and texts in the dataset. Also, consider access to image and text encoders, represented by f_θ : X → R^d and f_ϕ : T → R^d respectively. To learn distributed representations for images and text, the following contrastive losses are used:

$$\mathcal{L}_{i2t} = -\frac{1}{|B|}\sum_{i=1}^{|B|} \log \frac{\exp(u_i^\top v_i/\tau)}{\sum_{j=1}^{|B|}\exp(u_i^\top v_j/\tau)}, \qquad \mathcal{L}_{t2i} = -\frac{1}{|B|}\sum_{i=1}^{|B|} \log \frac{\exp(v_i^\top u_i/\tau)}{\sum_{j=1}^{|B|}\exp(v_i^\top u_j/\tau)}$$

where B represents the batch during one iteration of training, u_i, v_i are the ℓ2-normalized embeddings of ũ_i = f_θ(x_i) and ṽ_i = f_ϕ(t_i), and τ is a trainable temperature parameter. The overall loss is L_clip = (L_t2i + L_i2t)/2.
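For reference, a minimal PyTorch sketch of this symmetric objective is given below; the tensor shapes and the handling of the trainable temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
              log_tau: torch.Tensor) -> torch.Tensor:
    """image_feats, text_feats: [n, d] L2-normalized embeddings of paired (x_i, t_i)."""
    tau = log_tau.exp()                                   # trainable temperature
    logits = image_feats @ text_feats.t() / tau           # [n, n] similarity matrix
    targets = torch.arange(image_feats.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # i-th image matches i-th text
    loss_t2i = F.cross_entropy(logits.t(), targets)       # and vice versa
    return 0.5 * (loss_i2t + loss_t2i)
```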

B Scene Graph Decomposition
Here we provide additional details of the text scene graph decomposition. Denote the text scene graph obtained from the scene graph parser by G_T = (V_T, E_T), where V_T are the nodes of the graph, which are either objects or their attributes, and E_T are the edges of the graph, which represent relations between objects. Let G denote the set of all possible scene graphs. We first consider external sets of objects (N), attributes (A), and relations (R) that we use for creating negative sub-graphs. In practice, we create these sets from the Visual Genome (VG) dataset (Krishna et al., 2016). Following Zhang et al. (2021), we sample a total of 1594 entities that have 30 instances in the VG dataset. The attribute and relation lists contain 524 and 50 unique instances, respectively. Hence |N| = 1594, |A| = 524, |R| = 50. We first sample all possible sub-graphs having one or two objects; these can have multiple attributes per object. We then develop and use scene graph transformations that take a sub-graph as input and return a (set of) modified versions of the graph (minimally perturbed negative sub-graphs for the image). For this, we define three graph transformations as follows: • f_obj : G → P(G) takes as input a single-object scene graph, where the object has attributes A_o. For each attribute a ∈ A_o, a random attribute a' is sampled uniformly at random from A. We finally obtain a set of sub-graphs G_obj ∈ P(G), where P(·) denotes the power set. Each g ∈ G_obj contains one object node connected to an attribute node sampled from A.
• f_rel : G → P(G) takes as input sub-graphs having one relation edge and outputs a set of sub-graphs G_rel ∈ P(G), where each g ∈ G_rel has its object nodes shuffled, replaced by an external object node n' sampled uniformly at random from N, and/or its relation replaced by an external relation r' sampled uniformly at random from R. Along with this, we also join the input positive sub-graph with a random sub-graph created by sampling random nodes and edges from N, A, R.
• f_attr : G → P(G) takes as input sub-graphs having one relation edge and outputs a set of sub-graphs G_attr ∈ P(G), where each g ∈ G_attr has its attribute nodes shuffled and/or replaced by an external attribute node a' sampled uniformly at random from A.
f_obj and f_attr broadly aim at improving the model's attribute understanding, while f_rel broadly targets improved relation understanding. For each positive sub-graph, we sample all possible negative sub-graphs using f_obj, f_rel, f_attr and form positive-negative sub-graph pairs (g_i^pos, {g_i^neg}). These pairs can be classified into three categories C = {c_obj, c_rel, c_attr} according to the transformation that created the negative sub-graphs. We sample sub-graph pairs from these categories according to probabilities p_i, i ∈ {1, 2, 3} corresponding to the three categories respectively, with Σ_i p_i = 1. These probabilities are hyperparameters; see Appendix Section H.1 for more details. Multiple sub-graph pairs can have common positive or negative sub-graphs, and sampling these pairs would result in duplication; hence, for each image, we deduplicate sub-graphs so that all sub-graphs, and therefore the texts made from them, are unique for a given image in a batch. After sampling, all sub-graphs are transformed to text using simple templates, as explained in Section 3.3.
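A small sketch of this category-based sampling step is shown below; the pair containers, the deduplication key, and the retry cap are illustrative assumptions rather than the exact implementation.

```python
# Illustrative sketch: sample positive/negative sub-graph pairs per image from the
# three categories (c_obj, c_rel, c_attr) with probabilities p1, p2, p3, deduplicating
# so that all resulting texts are unique for a given image in a batch.
import random

def sample_pairs(pairs_by_category, probs, num_samples, max_tries=100):
    """pairs_by_category: {'obj': [...], 'rel': [...], 'attr': [...]}, each a list of
    (positive_subgraph, negative_subgraphs) pairs; probs: [p1, p2, p3] summing to 1."""
    categories = ["obj", "rel", "attr"]
    sampled, seen = [], set()
    tries = 0
    while len(sampled) < num_samples and tries < max_tries:
        tries += 1
        cat = random.choices(categories, weights=probs, k=1)[0]
        if not pairs_by_category[cat]:
            continue
        pair = random.choice(pairs_by_category[cat])
        key = repr(pair)                  # deduplication key for this image's sub-graphs
        if key not in seen:
            seen.add(key)
            sampled.append(pair)
    return sampled
```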

C Ablations and Model Analysis

C.1 Sampling more sub-graphs
We analyze the effect of increasing the maximum number of sub-graphs sampled for any given image in a batch of data during training. See Figures 6 and 7, in which we test performance on the ARO and CREPE benchmarks (averaged over the three fine-tuning datasets considered in this work) as we increase the maximum number of positive and negative sub-graphs per image. We find that as we increase both positive and negative sub-graphs for an image, the performance steadily increases up to a point for all datasets, after which the performance can either flatten out, increase, or even decrease on some datasets. This is intuitive, since a larger number of positive and negative sub-graphs per image leads to a gap w.r.t. the pre-training stage, as described in Sec. 3.6. Also, different compositional splits require different reasoning skills, and as we keep sampling positive and negative sub-graphs for an image, it is natural for certain types of positive and negative sub-graphs to be more pronounced, depending on the dataset statistics, and this can have varied effects on different datasets.

C.2 Effect of different sub-graph types
Here we analyze the effect of sampling different kinds of sub-graphs from the original scene graph of the text. In particular, we measure the effect of the graph transformations that we define in Appendix Sec. B. Results are presented in Table 7. We observe that both f_rel and f_attr, as described in Appendix Sec. B, are useful for improving relation and attribute understanding (as measured on the ARO benchmark), across fine-tuning datasets.

C.3 Effect of curriculum training
As shown in Table 8, in all fine-tuning results we see consistent improvements when using our curriculum learning strategy, such as up to 2% on systematic generalization, and sometimes more than 6%, as seen for the ARO-Order results when the fine-tuning dataset is YFCC-FT.

C.4 Effect of robust fine-tuning
Among the many techniques developed for mitigating forgetting when large models are fine-tuned, one prominent one is robust fine-tuning (WiSE-FT; Wortsman et al., 2022). Following Wortsman et al. (2022), we perform weight-space ensembling on the image encoder before and after fine-tuning with our method and call this model MosaiCLIP WiSE-FT. The results on compositionality benchmarks can be seen in Table 8, while results on 21 multimodal tasks from ELEVATER and ImageNet can be seen in Table 9. We find that MosaiCLIP WiSE-FT has a slight performance decrease on some compositional benchmarks compared to MosaiCLIP; however, it is significantly better than NegCLIP on most benchmarks. The real benefit of using MosaiCLIP WiSE-FT is that it leads to the least forgetting, and there is little to no performance degradation on the 21 tasks, as shown in Table 9.

C.5 Data efficiency
We find that our technique leads to significant data efficiency, requiring only about 0.3x-0.6x of the total fine-tuning or pre-training data to match or exceed NegCLIP's performance. Results are shown in Tables 10 and 11.

C.6 Computational cost
Even though MosaiCLIP uses the same global batch size of image-text pairs, it requires more compute than NegCLIP or CLIP, since decomposing texts into sub-graphs leads to a larger effective text batch size and hence a larger contrastive learning matrix. It is common practice in the literature to trade off extra compute for improving CLIP's compositionality, as also done by previous methods such as Syn-CLIP (Cascante-Bonilla et al., 2023), which generates data using external graphics engines, and Teaching SVLC (Doveh et al., 2023), which uses LLMs requiring massive compute even during inference.
Providing NegCLIP with more compute: One can argue that providing more compute to NegCLIP could lead to better performance; on the contrary, we found that NegCLIP's performance decreases as the batch size is scaled (from 256 to 4096, much beyond MosaiCLIP's text or image batch size), as shown in Table 12.
Performance-Compute Tradeoff: It is to be noted that MosaiCLIP's performance continues to increase up to a threshold as the number of sub-graphs is increased, as shown in Tables 6 and 7; hence this provides a clean tradeoff between the number of sub-graphs and compute, and a practitioner can choose the number of sub-graphs according to their compute availability. Along with this, in Appendix Sec. C.5 we showed that we can achieve improved performance compared to NegCLIP with as little as 0.3x of the data, closing the compute gap between NegCLIP and MosaiCLIP even further. It is to be noted that MosaiCLIP is a drop-in replacement for CLIP after training and requires the same inference cost as CLIP.

D.1 Comparison with recent baselines
We compare with recently published and contemporary works (Cascante-Bonilla et al., 2023; Doveh et al., 2023). Doveh et al. (2023) show that one can create rule-based and Large Language Model (LLM) based hard negative sentences and use them when training CLIP-style models to obtain a model that is better at tasks requiring compositional reasoning. We fine-tune on CC3M (Sharma et al., 2018) for a fair comparison with Doveh et al. (2023). Results are reported in Table 13. A fair comparison with Syn-CLIP (Cascante-Bonilla et al., 2023) is not possible since their synthetic dataset is not released. However, in Table 13 we find that the performance difference between MosaiCLIP and Syn-CLIP is large, showing that our general coarse-to-fine grained approach is better than using targeted synthetic datasets for inducing compositional understanding in VLMs. Comparisons with Doveh et al. (2023) in Table 13 show that our approach is competitive or better at attribute, relation, and object understanding as measured by the VL-Checklist benchmark (Zhao et al., 2022). Zero-shot performance on 21 datasets suffers minimally using our approach, and is even better than (Zhao et al., 2022). It is to be noted that both Syn-CLIP (Cascante-Bonilla et al., 2023) and Doveh et al. (2023) are orthogonal to our approach, and combining them with our coarse-to-fine understanding approach will likely result in better overall performance than the individual techniques. In particular, Syn-CLIP (Cascante-Bonilla et al., 2023) faces the issue of having long captions for images, and they average the embeddings of parts of a caption before matching it to the image. This issue can be easily resolved using our framework, which can naturally handle multiple positive captions for an image. Performing this ablation is future work for us, once synthetic datasets like that used by Cascante-Bonilla et al. (2023) are open-sourced and gain more popularity. Our approach can similarly also include captions generated from LLMs, as explored by Doveh et al. (2023).

D.2 Standard deviations for fine-tuning results
Here we provide fine-tuning results on the CC-FT dataset with standard deviations over 3 random seeds, where OpenAI CLIP-ViT-B-32 is fine-tuned on CC-FT using MosaiCLIP and baseline techniques. See Table 14 for the results. The main paper Table 1 has average results for CC-FT, while for the COCO and YFCC-FT fine-tuning datasets the results are for a single seed. We do not run multiple pre-training experiments since they are significantly more costly.

D.3 Robustness to natural distribution shifts
We find that pre-trained MosaiCLIP is robust to natural distribution shifts; improved fine-grained understanding should intuitively help performance on robustness benchmarks, given that the model is able to recognise details in images and texts more accurately.

E Dataset Details
Here we provide details about the datasets used for fine-tuning, pre-training, and evaluating models in this study. A summary is shown in Table 16.

E.1 Fine-tuning datasets

Following NegCLIP (Yuksekgonul et al., 2022), we use the COCO dataset released by Yuksekgonul et al. (2022), which has 109k samples together with the hard negative sentences that Yuksekgonul et al. (2022) create for training NegCLIP. As mentioned in the main paper, COCO images are used for creating Visual Genome (Krishna et al., 2016), which is further used to create datasets such as CREPE (Ma et al., 2022), ARO (Yuksekgonul et al., 2022), and part of VL-Checklist (Zhao et al., 2022). This can lead to confounding and potentially misleading results, since it is unclear whether a performance increase from any method comes from the fine-tuning dataset (COCO) being close to the domain of the test datasets, or from the fine-tuning methodology itself. Hence, for rigorous experimentation, one must use other datasets to fine-tune contrastively trained VLMs. We randomly sample similar-sized subsets (100k data points) from the popular pre-training datasets CC-12M and YFCC-15M, and call these smaller datasets CC-FT and YFCC-FT. To train NegCLIP, hard negative sentences and images are required, for which we use the code released by Yuksekgonul et al. (2022) to create hard negative sentences as well as to sample three hard negative images for each image based on OpenAI CLIP ViT-B/32 features, strictly following Yuksekgonul et al. (2022). For comparing with contemporary works (Doveh et al., 2023; Cascante-Bonilla et al., 2023) (as shown in Table 13), we fine-tune on CC3M (Sharma et al., 2018).

E.2 Pre-training datasets

We use CC-12M (Changpinyo et al., 2021) and YFCC-15M (Thomee et al., 2016) for pre-training all models in this study, including CLIP, NegCLIP, and MosaiCLIP.

E.3 Evaluation datasets
Here we list the evaluation datasets used in this study and provide a short description of each. CREPE-Systematicity (Ma et al., 2022): measures systematic generalization of VLMs to unseen combinations of concepts. VL-Checklist (Zhao et al., 2022): This benchmark is created by combining annotations from datasets like Visual Genome (Krishna et al., 2016), SWiG (Pratt et al., 2020), HAKE (Li et al., 2019), and VAW (Pham et al., 2021). Each image in the resulting dataset has two captions, a positive and a negative. The positive caption is taken from the source dataset of the image, while the negative caption differs from the positive in only one word, which makes it a hard negative and helps test compositional and fine-grained understanding of VLMs across various dimensions like attributes, relations, and the size and location of objects.

F Baselines:
Here we list the baselines used in this study and provide a short description of each. CLIP (Radford et al., 2021): Our first baseline is the CLIP model released by OpenAI (Radford et al., 2021) and OpenCLIP (Ilharco et al., 2021).
In particular, we use the ViT-B/32 model for fine-tuning results. Syn-CLIP (Cascante-Bonilla et al., 2023): fine-tunes CLIP with million-scale synthetic image-text pairs generated using an external graphics engine (Gan et al., 2021). This contemporary work is complementary to our data-centric approach, and we believe our methods can help fine-tuning with synthetic datasets as well. Cascante-Bonilla et al. (2023) showed how dense and long captions can be obtained for synthetic images, which require splitting into sub-captions followed by averaging of features from all captions while fine-tuning CLIP. This is one avenue where we believe our method can be useful, since it inherently allows matching of images to multiple texts. This is part of future work, once such synthetic datasets are released and are easily available.

G Detailed Experimental Results
In the main paper Table 1 and Table 3, we provided concise results for some datasets owing to space constraints. Here we provide detailed results on these datasets.

G.1 VL-Checklist: detailed results

Detailed fine-tuning results on the VL-Checklist dataset are provided in Table 17. These are an extension of the VL-Checklist results provided in the main paper Table 1. Detailed pre-training results for the VL-Checklist dataset are provided in Table 18, which are an extension of the VL-Checklist results provided in the main paper Table 3.

G.2 SVO-Probes: detailed results
Detailed fine-tuning results on the SVO-Probes dataset are provided in Table 19. These are an extension of the SVO-Probes results provided in the main paper Table 1. Detailed pre-training results for the SVO-Probes dataset are provided in Table 20, which are an extension of the SVO-Probes results provided in the main paper Table 3.

G.3 CREPE-Systematicity: detailed results
Here we provide detailed results on the CREPE-Systematicity dataset used for measuring systematic generalization. In the main paper we only provided the results related to systematic generalization (i.e., the unseen split), but here we provide results on both the seen and unseen splits, for both hard negative retrieval sets (Comp and Atom) that are used when evaluating performance on CREPE by Ma et al. (2022). Detailed fine-tuning results on the CREPE-Systematicity dataset on both the seen and unseen splits are provided in Table 21. These are an extension of the CREPE-Systematicity results provided in the main paper Table 1. Detailed pre-training results for the CREPE-Systematicity dataset are provided in Table 22, which are an extension of the CREPE-Systematicity results provided in the main paper Table 3.

H Reproducibility
Here we provide the details necessary to reproduce our work that might not have been included in the main paper.

H.1 Training and hyperparameter details
Fine-tuning: For all fine-tuning experiments, we follow Yuksekgonul et al. (2022); Radford et al. (2021). In particular, all models are fine-tuned for 5 epochs, with a batch size of 256, using a cosine learning rate schedule with 50 steps of warmup and random-crop augmentation during training. AdamW is used for optimization. 1e-5 is used as the initial learning rate. Training is performed using 4 NVIDIA A100 GPUs for all models. From the ARO dataset, 10% of the examples from the attribute and relation splits are used as validation examples, and the rest are used as the test set for all models. On all other datasets, we evaluate zero-shot performance. For MosaiCLIP, we find that sampling a maximum of 3 positive and 6 negative sub-graphs per image during fine-tuning gives the best result on the ARO validation set and hence is used in all our experiments (including pre-training experiments). For MosaiCLIP, we keep the sub-graph sampling probabilities as p_2 = p_3. We vary p_1 in {0, 0.08, 0.15} while fine-tuning on the randomly chosen YFCC dataset. We choose the best model according to the ARO val-set and keep the hyperparameters the same for all other fine-tuning datasets.

Pre-training: For pre-training experiments, we follow the training protocol used in Yang et al. (2022); Radford et al. (2021). In particular, all models are trained for 32 epochs, with a batch size of 4096, using a cosine learning rate schedule with 5000 steps of warmup and random-crop augmentation during training. AdamW is used for optimization. The initial learning rate is 1e-3, and the weight decay is set to 0.1. Training is performed using 64 NVIDIA A100 GPUs. NegCLIP's hard negative text creation method often results in no negative text for some texts in the pre-training dataset. Removing all such image-text pairs with no possible hard negative text results in poor performance for NegCLIP (due to less data to pre-train on). If we include these image-text pairs, the text batch size might differ across GPUs, since some image-text pairs have no hard negative texts, and this causes instabilities. We hence keep a cache of sentences from previous batches and add it to the batch as negative examples so that all GPUs have the same text batch size during training. The same is done for MosaiCLIP, since not all images might have the same number of unique positive and negative sub-graphs available. For NegCLIP, we create hard negative sentences using the code released by Yuksekgonul et al. (2022).

Figure 1 :
Figure 1: (Left) a) A typical example from the ARO benchmark for testing attribute understanding of VLMs. VLMs struggle with matching the image to the correct caption (in green). (Right) Average scores of MosaiCLIP (our method) compared with NegCLIP and CLIP on prominent compositionality benchmarks for measuring b) systematic generalization and c) attribute, relation, and word-order understanding.

Figure 2 :
Figure 2: Overview of our approach. a) Depiction of the scene graph of an image (hypothetical) and a scene graph parsed from text. The text scene graph is a sub-graph of the image scene graph. The text scene graph is decomposed into sub-graphs, from which b) minimally perturbed hard-negative sub-graphs are created. c) The ground-truth similarity matrix used for a batch of data during contrastive learning. Solid boxes represent a match between the image and the corresponding text. Different from CLIP, each image can be matched to multiple texts in our method.

Figure 3 :
Figure 3: MosaiCLIP's average score difference with NegCLIP on 20 datasets from the ELEVATER benchmark.

Figure 6 :
Figure 6: Effect of increasing the number of positive and negative sub-graphs on the ARO benchmark when fine-tuning MosaiCLIP. Results are averaged over the 3 fine-tuning datasets considered in this work.

Figure 7 :
Figure 7: Effect of increasing the number of positive and negative sub-graphs on the CREPE-Systematicity benchmark when fine-tuning MosaiCLIP (here we use the OpenCLIP RN-50 model pre-trained on CC-12M and fine-tune it on CC-FT).

Figure 11 :
Figure 11: Comparison of CLIP, NegCLIP and MosaiCLIP on 20 datasets from the ELEVATER (Li et al., 2022a) benchmark. Models in this graph are pre-trained with CC-12M data and have Swin-Tiny as the vision backbone. See Sec. 4.1 for more details.

Figure 12 :
Figure 12: Comparison of CLIP, NegCLIP and MosaiCLIP on 20 datasets from the ELEVATER (Li et al., 2022a) benchmark. Models in this graph are pre-trained with YFCC-15M data and have Swin-Tiny as the vision backbone. See Sec. 4.1 for more details.

Figure 13 :
Figure 13: Comparison of CLIP, NegCLIP and MosaiCLIP on 20 datasets from the ELEVATER (Li et al., 2022a) benchmark. Models in this graph are pre-trained with CC-12M data and have ResNet-50 as the vision backbone. See Sec. 4.1 for more details.

Figure 14 :
Figure 14: Comparison of CLIP, NegCLIP and MosaiCLIP on 20 datasets from the ELEVATER (Li et al., 2022a) benchmark. Models in this graph are pre-trained with YFCC-15M data and have ResNet-50 as the vision backbone. See Sec. 4.1 for more details.

Table 4 :
Comparison of Recall@1 scores of MosaiCLIP with NegCLIP and CLIP. All models are pre-trained on YFCC-15M with a Swin-Tiny backbone.

Table 5 :
Comparison of BLIP (Li et al., 2022b) and a vanilla fine-tuned version of BLIP with BLIP models that integrate the NegCLIP and MosaiCLIP methodologies while training. Fine-tuning has been performed on COCO.

Table 6 :
Effect of curriculum learning and robust fine-tuning (MosaiCLIP WiSE-FT) using CC-FT data.
MosaiCLIP WiSE-FT: 78.8 69.4 82.6 67.5 41.2 76.4 88.08 72.0

Table 7 :
Effect of different positive-negative sub-graph types sampled while training. Results are presented on the ARO benchmark.

Table 8 :
Ablating the effect of curriculum learning and robust fine-tuning. MosaiCLIP NoCurric refers to the version of our model without any curriculum learning. MosaiCLIP WiSE-FT refers to the version where the image encoders of the final model (after fine-tuning) and the model before fine-tuning are weight-space ensembled. CLIP and NegCLIP scores are also shown for reference. See Appendix Sec. C.3.

Table 10 :
Data efficiency of MosaiCLIP during pre-training. Numbers in blue are the lowest numbers that are within 1% of or greater than NegCLIP's performance. Pre-training dataset: YFCC-15M.

Table 11 :
Data efficiency of MosaiCLIP during fine-tuning. Numbers in blue are the lowest numbers that are within 1% of or greater than NegCLIP's performance. Fine-tuning dataset: CC-FT. Curriculum learning has not been used for these experiments.

Table 12 :
Performance of NegCLIP with increasing batch size. A batch size of B corresponds to an effective batch size of 8*B in NegCLIP after image and text negative mining. Fine-tuning dataset: CC-FT.

Table 13 :
Comparison with recently published and contemporary works Syn-CLIP (Cascante-Bonilla et al., 2023) and Teaching SVLC (Doveh et al., 2023). Results are reported on VL-Checklist, ARO, and average zero-shot results on 21 datasets from ELEVATER and ImageNet. Performance numbers of these models are reported from their respective papers (blank fields (-) are not reported in the respective papers). † Uses million-scale synthetic data for fine-tuning. ‡ Uses external Large Language Models (LLMs) like BLOOM (Mitchell et al., May 2021-May 2022) for text augmentation and hard negative text creation. See Sec. D.1 for more details.

Table 14 :
Fine-tuning results on the CC-FT dataset with standard deviations across 3 random seeds. These results correspond to the CC-FT fine-tuning results in main paper Table 1. Here the base model fine-tuned using the different techniques is OpenAI CLIP-ViT-B-32.

Table 17 :
                     Attr  Rel   Obj   Attr  Rel   Obj   Attr  Rel   Obj
MosaiCLIP NoCurric   86.0  77.2  77.7  84.0  72.2  75.1  89.2  70.4  72.6
MosaiCLIP WiSE-FT    85.3  71.4  72.4  83.6  69.5  69.6  88.5  75.5  77.0
MosaiCLIP            86.0  76.8  78.4  84.1  72.1  74.8  89.0  70.1  71.3
Fine-tuning results on the VL-Checklist benchmark, testing compositionality in terms of attribute, relation and object understanding. The OpenAI CLIP ViT-B-32 pre-trained model is used as the base model for fine-tuning. See Sec. G.1 for more details.

Table 18 :
Pre-training results on the VL-Checklist benchmark, testing compositionality in terms of attribute, relation and object understanding. Results for both backbones, Swin-Tiny and RN-50, are shown. See Sec. G.1 for more details.
MosaiCLIP NoCurric 93.37 89.74 83.62 88.91
MosaiCLIP WiSE-FT 92.65 88.69 82.90 88.08
CC-100K

Table 19 :
Detailed fine-tuning results on the SVO-Probes dataset. See Sec. G.2 for more details.

Table 20 :
Detailed pre-training results on the SVO-Probes dataset. See Sec. G.2 for more details.