Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs



Introduction
In recent years, Vision and Language (VL) models (e.g., CLIP [63], BLIP [47]) have shown impressive results across a wide range of tasks, and extraordinary zero-shot capabilities for tasks such as visual question answering, image captioning, and object detection. To achieve these capabilities, these large models are trained on massive datasets containing image-text pairs (e.g., LAION 400M [68]). However, despite the impressive capabilities of these models, recent empirical studies [14,76,85] have shown that even the strongest VL models struggle to perform scene understanding, including identifying object attributes and inter-object relations. More broadly, it has been argued that VL models exhibit little compositionality [56].
Understanding the structure of visual scenes is a fundamental problem in machine perception and has been explored extensively in many previous works [10,25,49,81,83,87]. In particular, datasets containing scene graph (SG) annotations (e.g., Visual Genome [43]) have been collected and used for training models using structure. While they contribute to scene understanding, these datasets are relatively small and expensive to collect compared to large-scale image-text pair datasets, and thus are not considered in many large VL models. This raises the question: can small datasets containing structured annotations provide sufficient information for finetuning VL models to improve scene understanding? Next, we show that it is indeed possible to improve VL models using such data by utilizing a specialized model architecture and a new training paradigm.
Considering that VL models process both textual and visual representations, these representations must contain structure-related information in order for these models to accurately represent complex scenes. Consequently, our approach incorporates components that directly supervise each of these representations when learning from SGs paired with images. Our first step is to convert image-SG pairs into image-text pairs, which are natural inputs for VL model training, such that the text accurately describes the structure of the scene graph. A key advantage here is that the resulting captions are dense, and tend to be more exhaustive than in datasets like LAION.
We found that simply introducing these image-caption pairs does not improve performance on Winoground and VL-Checklist. This is aligned with recent work [16,84], showing that commonly used contrastive learning approaches allow the model to concentrate mainly on object labels, disregarding other important aspects such as relations and attributes. To further introduce structure, we observe that SGs can be used naturally to generate hard negatives. For example, if an SG contains an edge "dog-chasing-cat", then we can reverse that edge to "cat-chasing-dog" and generate a corresponding negative caption. A contrastive loss between the negative and the original captions can be used to force the model to focus on finer details.
We next turn to introduce structure into the visual representation. Intuitively, we would like the visual encoder to capture structure-related information. We argue that this can be achieved by predicting scene graph information from the visual encoder, adding tokens that are designed to capture objects and relations, inspired by prompt learning approaches [35,24,92,90]. We refer to these as "adaptive scene graph tokens" and train them to predict scene graph information directly using an open-vocabulary approach, which predicts objects and relations described by text inputs rather than relying on a fixed vocabulary.
Finally, integrating complex structured knowledge into large VL models using standard finetuning techniques may result in catastrophic forgetting, as shown recently [13,16]. To alleviate this issue, we design a new adaptation technique tailored specifically to the prompting approach. In particular, we modify the transformer layers by adding parameters that are specific to the SG tokens, so that the SG component has a reduced effect on VL parameters. We name our proposed approach SGVL (Scene Graphs for Vision-Language Models). See Figure 1 for an overview.
To summarize, our main contributions are as follows: (i) we propose to exploit a small set of SG annotations to incorporate structured representations into pretrained large VL models; (ii) we propose a new approach that captures structure-related information for both encoders by directly supervising the visual and textual components when learning from SG labels; (iii) we further design a new adaptation technique tailored specifically to the SG tokens that allows better learning of the graph prediction task while still maintaining zero-shot capabilities; (iv) our method shows improved performance upon CLIP [63] and BLIP [47] on the Winoground [76] and VL-CheckList [89] datasets with only a mild degradation in zero-shot performance, highlighting the effectiveness of the proposed approach.

Related Work
Vision and Language (VL) Models. In recent years, VL models [11,18,22,34,47,50,63,69] have emerged as powerful models for a wide range of tasks, demonstrating extraordinary zero-shot capabilities. These models are trained with image-text pairs in order to align the two modalities in a joint embedding space. More recently, popular vision-language methods, such as CLIP [63], CyCLIP [22], BLIP [47], ALIGN [34], PyramidCLIP [18], and others, have proposed bridging the two modalities by learning two encoders simultaneously. These models exploit the advantage of large-scale pretraining through the use of large neural networks and large datasets. However, recent empirical studies (e.g., VL-CheckList [89], Winoground [76], and the recent ARO [84] benchmark) have shown that these models do not perform well on tasks that require understanding of the structure of the image, including identifying relations between objects and mapping between objects and attributes. Moreover, recent research [16] has shown that these models tend to learn a "bag of objects" representation, leading them to be less structure-aware. Here we show that even training with a small number of structured annotations can contribute to gaining this structured knowledge for large VL models.
A key principle behind parameter-efficient finetuning methods is to gain efficiency over full fine-tuning by achieving comparable performance while storing fewer parameters. Popular approaches in this family use adapters [29,64,52] to enhance efficiency by fine-tuning models with fewer parameters. Adapters are powerful modules added to frozen layers, in which only the adapter parameters are updated. They have been widely used to avoid catastrophic forgetting in multiple topics, such as NLP [30,4], VL [72,57], continual learning [70], video understanding [52], and more. Unlike these works, here we add special tokens that are designed to capture objects and relations, supervised by SG annotations rather than the main contrastive loss. Since SG prediction is challenging and may result in large parameter changes during training, we modify the transformer layers by adding parameters specific to these tokens to minimize the impact on VL parameters. This allows us to maximize the potential of the structured task without losing zero-shot performance.

Scene Graphs for Vision-Language Models
We begin by describing the standard VL transformer architecture, the scene graph annotations, and the problem setup (Section 3.1). We then introduce our structural considerations for both the language (Section 3.2) and vision (Section 3.3) encoders, and the training losses (Section 3.4). Our method is illustrated in Figure 2.

Preliminaries
VL models are typically trained with pairs of images and texts X_i = (I_i, T_i). Each of these modalities is processed by a separate transformer encoder, and the objective of training is to map the embeddings of images and their corresponding texts to nearby positions, usually via a contrastive loss. Next, we briefly describe the encoders.

Language Transformer Encoder E_T. The text encoder is a transformer [77] as described in CLIP [63] and BLIP [47], where a CLS token is appended to the beginning of the text, and the final CLS embedding is used as the text embedding.

Vision Transformer Encoder E_I. A typical vision transformer model [15] (ViT) takes an image I ∈ R^{3×H×W} as input, extracts N non-overlapping patches, and projects them into a lower dimension d. We refer to these patches as "patch tokens" and denote them PT_i. Spatial position embeddings PE ∈ R^{N×d} are then added to provide spatial location information, resulting in a new embedding z_i = PT_i + PE_i. This forms the sequence of input tokens to the vision transformer encoder, z = [z_CLS, z_1, ..., z_N], where z_CLS is a learnable token. The input z is fed into a standard transformer, and the final representation of the CLS token is the image embedding.

Scene Graphs (SGs). As mentioned above, our motivation is to improve VL models by leveraging structured annotations from a scene graph dataset. Formally, an SG is a tuple G = (V, E), where the nodes V represent the objects in the image (each node is annotated with an object category, optional attributes, and a bounding box) and the edges E ⊆ V × V correspond to relationships between object pairs, each annotated with a relationship category. Since the scene graphs can be quite large, we simplify them and generate sub-graphs, along with corresponding image crops. For more details, see Section A.1. Next, we describe the textual and visual structured components that capture structure in both modalities.
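For concreteness, the following is a minimal sketch of how the ViT input sequence described above can be built; the module and parameter names are illustrative and not taken from our implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Builds the ViT input sequence: learnable CLS token + projected patch tokens + position embeddings."""
    def __init__(self, img_size=224, patch_size=32, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2            # N non-overlapping patches
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # learnable z_CLS
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))  # PE

    def forward(self, images):                                       # images: (B, 3, H, W)
        pt = self.proj(images).flatten(2).transpose(1, 2)            # patch tokens PT: (B, N, d)
        z = pt + self.pos_embed                                      # z_i = PT_i + PE_i
        cls = self.cls_token.expand(images.size(0), -1, -1)
        return torch.cat([cls, z], dim=1)                            # [z_CLS, z_1, ..., z_N]
```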

Structural Language Component
We begin by discussing how an SG is transformed into text, and then explain how to manipulate the SG with our Graph-Based Negatives to further capture structure in the model.

Scene Graph to Text. Given an image I and a corresponding scene graph G = (V, E) from the training dataset, we use the graph G to generate a textual caption for the image. We iterate over the connected components of the graph one by one and generate a textual caption for each component by concatenating the class labels from the graph edges over a Hamiltonian path. We prepend the corresponding attribute before an object class label if one exists. Last, we generate a single caption by concatenating the captions of the connected components, separated by a dot. For a visualization, see Figure 4 in the supplementary.

Graph-Based Negatives (GN). We have found that using the scene graph data solely as image-text pairs with the contrastive loss is not enough to force the model to develop structural understanding. As shown in recent work [16,84], the commonly used contrastive learning allows the model to concentrate mainly on object labels, disregarding other important aspects such as relations and attributes. In order to provide more focus on such aspects, we exploit the SG structure and propose a set of predefined graph-based rules (see Section A.2) that modify SGs and make them semantically inconsistent with the image. Next, these SGs are transformed into negative textual captions, which are used with a dedicated loss to motivate the model to focus on structural aspects. For a visualization, see Figure 4 in the supplementary.
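To illustrate the SG-to-text step, here is a minimal sketch assuming a toy edge-list representation of each connected component; the tuple format below is our own illustration, not the dataset format.

```python
def sg_to_caption(components):
    """components: list of connected components; each is an ordered list of
    (subject, subject_attribute_or_None, relation, object, object_attribute_or_None)
    edges along the component's Hamiltonian path."""
    captions = []
    for edges in components:
        words = []
        for i, (subj, subj_attr, rel, obj, obj_attr) in enumerate(edges):
            if i == 0:  # emit the first subject (with its attribute, if any)
                words += ([subj_attr] if subj_attr else []) + [subj]
            words += [rel] + ([obj_attr] if obj_attr else []) + [obj]
        captions.append(" ".join(words))
    return ". ".join(captions)

# e.g. [[("dog", "black", "chasing", "cat", None)]] -> "black dog chasing cat"
```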

Structural Visual Component
Scene Graph Tokens. As mentioned earlier, a natural way to capture structure-related information in the visual encoder is to use it to predict scene graphs. Towards this end, we add a set of "SG tokens", which are learned queries designed to predict objects and relationships from the graph. The SG tokens consist of two groups. The first are "object tokens", meant to represent objects, their locations, and their attributes. The second are "relationship tokens", meant to represent SG edges that describe relationships.
Formally, we define a fixed set of ñ learned object queries, denoted p^o_1, p^o_2, ..., p^o_ñ ∈ R^{1×d}. Similarly, we define a fixed set of m̃ relationship queries, denoted p^r_1, p^r_2, ..., p^r_m̃ ∈ R^{1×d}. We refer to these queries as the learned SG tokens. We note that ñ > n_max and m̃ > m_max, where n_max/m_max is the maximal number of objects/relationships. This implies that some queries predict the categories "no object" and "no relationship".
The scene graph tokens are concatenated with the standard CLS and patch tokens to obtain the following set of inputs to the transformer: z = [z_CLS, z_1, ..., z_N, p^o_1, ..., p^o_ñ, p^r_1, ..., p^r_m̃]. Next, the transformer processes the input z, resulting in a new representation for each token (i.e., the CLS token, the SG tokens, and the patch tokens). Given an input image I that induces an input z, we denote by F^j_O(I) and F^k_R(I) the representations of the j-th object token and k-th relationship token. We use these representations to predict object and relationship labels and localization by passing them through two feed-forward networks (see Equation 8). These predictions are supervised with ground-truth SG annotations through a matching process. Figure 3 visualizes the SG tokens learned by our model. More details are in Section 3.4.

Adaptive tokens. Our training approach is to fine-tune an existing VL model on the SG data, with the added SG tokens. Finetuning machine learning models may result in catastrophic forgetting, and therefore the model may lose its VL skills, including zero-shot capabilities. This is of special concern in our setup, as the task of scene graph prediction is challenging and deviates significantly from the initial training scheme of the model, potentially resulting in large changes to the parameters. To alleviate this issue, we design a new adaptation technique tailored specifically to the prompting approach. We propose to add parameters that are specific to the SG tokens, so that the SG component has a reduced effect on VL parameters. Recall that transformer layers have matrices for mapping to queries, keys, and values. We denote these matrices for the pretrained VL model by Q_P, K_P, V_P. The SG tokens we add could have used the same matrices. However, we have found that the SG prediction task does not converge well using this setup, and thus we introduce a separate set of matrices Q_SG, K_SG, V_SG that is used with the SG tokens. Importantly, the attention is performed over all tokens (patch and SG). Similarly, for the MLP component in the transformer layer, we also have a different version for the patch tokens (denoted MLP_P) and the SG tokens (denoted MLP_SG).
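The following sketch illustrates the token-type-specific attention parameters described above, as a single-head simplification under assumed dimensions; the actual model uses multi-head attention and also duplicates the MLPs.

```python
import torch
import torch.nn as nn

class DualPathAttention(nn.Module):
    """Self-attention with separate Q/K/V projections for patch/CLS tokens (pretrained path)
    and SG tokens (new path). Attention is still computed jointly over ALL tokens."""
    def __init__(self, dim=768):
        super().__init__()
        self.q_p = nn.Linear(dim, dim)
        self.k_p = nn.Linear(dim, dim)
        self.v_p = nn.Linear(dim, dim)
        self.q_sg = nn.Linear(dim, dim)
        self.k_sg = nn.Linear(dim, dim)
        self.v_sg = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, num_sg):
        # x: (B, 1 + N + num_sg, d); the SG tokens are assumed to be the last num_sg tokens
        p, sg = x[:, :-num_sg], x[:, -num_sg:]
        q = torch.cat([self.q_p(p), self.q_sg(sg)], dim=1)
        k = torch.cat([self.k_p(p), self.k_sg(sg)], dim=1)
        v = torch.cat([self.v_p(p), self.v_sg(sg)], dim=1)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # attention over all tokens (patch + SG)
        return attn.softmax(dim=-1) @ v
```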
Finally, we also use Low-Rank Adapters (LoRA [30]), which restrict the parameter updates to low-rank matrices and have been shown to be effective for fine-tuning transformers. Specifically, for each trainable matrix W ∈ R^{u×v} (e.g., W = Q_P or W = Q_SG), we let W_0 denote its pretrained weights and parameterize the learned matrix as W = W_0 + AB, where A ∈ R^{u×r} and B ∈ R^{r×v} are rank-r matrices and r is a hyperparameter. We note that we use two distinct ranks, r_P and r_SG, for weights associated with the patch tokens and SG tokens (as described above), respectively. This allows the use of a higher rank for the SG parameters in order to perform better at the SG prediction. During training, W_0 is kept frozen, while A and B are learned. Overall, our additional trainable parameters are only 7.5% of the model.

Open-Vocabulary Scene Graph Prediction. The annotated SGs contain category information (for objects, relationships, and attributes). Our SG tokens are meant to predict these categories. A naive implementation of this idea would require a prediction head with an output for each category. However, the VG dataset contains approximately 70K object categories and 40K relationship categories, and thus poses a significant challenge in terms of handling the imbalanced, limited, and sparse nature of the data. One possible solution [81] introduced a split containing only 100 object labels and 50 relation labels, which further restricted the already small dataset. Rather than restricting our data, we use an open-vocabulary approach: we utilize the text encoder to embed the categories from the SG components. Next, we use these embeddings in training to supervise the SG tokens. For example, if node i has category "dog" and attribute "black", we train some object token to predict the embedding of the phrase "black dog". This also applies to the prediction of relationship tokens, which might predict the embedding of the relationship "dog chasing cat". More details are in Scene Graph Loss in Section 3.4.
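A minimal sketch of the LoRA parameterization with the two ranks mentioned above (initialization details are assumptions; the rank values r_P = 16 and r_SG = 32 follow Section D):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """W = W_0 + A B with W_0 frozen; the rank r differs for patch (r_P) and SG (r_SG) weights."""
    def __init__(self, pretrained: nn.Linear, rank: int):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad = False                          # W_0 stays frozen
        u, v = pretrained.out_features, pretrained.in_features
        self.A = nn.Parameter(torch.randn(u, rank) * 0.01)   # A in R^{u x r}
        self.B = nn.Parameter(torch.zeros(rank, v))          # B in R^{r x v}; zero init starts from W_0

    def forward(self, x):
        return self.base(x) + x @ (self.A @ self.B).t()

# e.g. wrap the pretrained Q_P projection with rank r_P = 16 and the Q_SG projection with rank r_SG = 32
```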

Training and Losses
As previously mentioned, we use CLIP [63] and BLIP [47], which both include a transformer image encoder E_I and a transformer text encoder E_T. The similarity between an image I and a text T is calculated as s(I, T) = cos(E_I(I), E_T(T)), where cos is the cosine similarity.
As previously described (Section 3.2), during training we have a set of image-text pairs (I, T) and a set of image-SG pairs (I_G, G). We use the image-SG pairs to generate a set of images with positive textual captions (I_G, T_p) and negative textual captions (I_G, T_n). We use these as inputs to our model (Equation 3), while optimizing the following losses.

Image-Text Loss. Our image-text loss is comprised of two terms: a contrastive loss and a graph-based negative loss.

Contrastive Loss: The standard approach to train VL models is via a contrastive loss on image-text pairs, as in [63]. We apply it here both to the standard pairs (I, T) and to those generated from the scene graphs (I_G, T_p), yielding the loss L_Cont.

Graph-Based Negative Loss: For each image with an SG, we have a text T_p that faithfully describes it and a negative text T_n that does not describe it. We thus use a loss L_GN that drives T_p to be more similar to I_G than T_n (e.g., see [16]).

Finally, the image-text loss is a weighted sum of the contrastive loss and the graph-based negative loss, L_IT = L_Cont + λ_GN · L_GN, where λ_GN is a hyperparameter. As in the BLIP [47] paper, we use the image-text matching module to apply a loss to all image-text pairs. We also add an additional term to the graph-based negative loss that uses this module; see Section A.4 in the supplementary.

Scene Graph Loss. As mentioned in Section 3.3, we incorporate SG tokens into the model, which are used to predict the SG that corresponds to the image. We next explain this process. The graph G contains several annotations: the set of object categories O = {o_i}_{i=1}^n, the set of bounding boxes B_O = {b_{o_i}}_{i=1}^n, and the set of relationship categories R = {r_i}_{i=1}^m. We also augment the relationships with a set of bounding boxes, constructed as the union of the boxes of the corresponding objects, and denote these by B_R = {b_{r_i}}_{i=1}^m. As mentioned, we do not aim to predict the object and relationship categories directly, but rather use the embeddings from the VL model. Thus, we extract the embeddings of these labels with the text encoder E_T to get class embeddings Õ = E_T(O) ∈ R^{(n+1)×d} and R̃ = E_T(R) ∈ R^{(m+1)×d}. We note that the (n+1) and (m+1) classes are due to the "no object" and "no relationship" categories (denoted ∅), which are represented with learned embeddings.
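For illustration, a minimal sketch of the graph-based negative term above; the log-sigmoid form is an assumption made here for exposition, and the exact formulation follows [16].

```python
import torch
import torch.nn.functional as F

def graph_negative_loss(img_emb, pos_emb, neg_emb):
    """Drives the positive caption T_p to score higher than the negative T_n for the image I_G.
    img_emb, pos_emb, neg_emb: (B, d) embeddings of I_G, T_p, and T_n."""
    img_emb = F.normalize(img_emb, dim=-1)
    pos = (img_emb * F.normalize(pos_emb, dim=-1)).sum(-1)   # cos(E_I(I_G), E_T(T_p))
    neg = (img_emb * F.normalize(neg_emb, dim=-1)).sum(-1)   # cos(E_I(I_G), E_T(T_n))
    return -F.logsigmoid(pos - neg).mean()
```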
Thus far we described the elements of the SG that we aim to predict from the SG tokens. We next describe the prediction process and the corresponding losses. The image encoder outputs a set of ñ object queries and m̃ relationship queries. We apply two feed-forward networks, FFN_bb and FFN_e, on top of these queries to predict bounding boxes and label embeddings, respectively. Specifically, given the j-th object token and k-th relationship token, we predict b̂_j = FFN_bb(F^j_O(I)) and ê_j = FFN_e(F^j_O(I)), and analogously b̂_k and ê_k from F^k_R(I) (Equation 8), where b̂_j, b̂_k ∈ R^{1×4} are bounding box predictions and ê_j, ê_k ∈ R^{1×d} are class embedding predictions. Next, we use the class embedding matrices Õ and R̃ to predict probabilities over the n+1 and m+1 classes, giving q̂^o_j ∈ R^{1×(n+1)} and q̂^r_k ∈ R^{1×(m+1)}. Next, we need to match the predictions of the SG tokens with the ground-truth SG, in order to determine which SG tokens correspond to which GT objects and relationships. We follow the matching approach of DETR [9], except that in our case, objects and relationships are matched separately. For example, for the objects, this matching process checks the compatibility between GT and permuted objects in terms of object category (i.e., the probability assigned by the query to the GT object o_i) and in terms of bounding boxes (i.e., how well the predicted box matches the GT one). With the optimal matching σ we calculate the objects loss L_Obj, where L_Box is a weighted combination of the L_giou [65] and L_1 losses. The relationship loss L_Rel is calculated in a similar manner, and the total loss is the sum of these losses. Finally, our method can be applied on top of a variety of VL models. For our experiments, we use the CLIP [63] and BLIP [47] models. The complete matching and loss details are in Section A.3.
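A sketch of the SG-token prediction heads and the open-vocabulary classification described above; the layer sizes and logit scale are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGTokenHeads(nn.Module):
    """Prediction heads applied to SG-token outputs: a box head FFN_bb and an embedding head FFN_e."""
    def __init__(self, dim=768):
        super().__init__()
        self.ffn_bb = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))  # box (cx, cy, w, h)
        self.ffn_e = nn.Linear(dim, dim)                                                 # class embedding

    def forward(self, tokens, class_embeds, logit_scale=100.0):
        # tokens: (Q, d) SG-token outputs; class_embeds: (C+1, d) text embeddings incl. the "no object" class
        boxes = self.ffn_bb(tokens).sigmoid()                                            # b_hat in [0, 1]^4
        e_hat = F.normalize(self.ffn_e(tokens), dim=-1)
        class_embeds = F.normalize(class_embeds, dim=-1)
        probs = (logit_scale * e_hat @ class_embeds.t()).softmax(dim=-1)                 # q_hat over C+1 classes
        return boxes, probs
```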

Datasets
For training, we use the LAION dataset as "standard" image-text pairs, along with SG data from Visual Genome (VG). For evaluation we use the VL-Checklist [89] and Winoground [76] datasets, as well as zero-shot classification tasks. (1) LAION 400M [68] is a large-scale image-text pair dataset that was automatically curated from the Internet and filtered using the pretrained CLIP model. LAION 400M has been used to reproduce CLIP, achieving similar performance to the original model [63]. (2) Visual Genome [43] (VG) is annotated with 108,077 images and scene graphs. On average, images have 12 entities and 7 relations per image. For the datasets used for testing: (1) VL-Checklist [89] is a new study that combines the following datasets: Visual Genome [43], SWiG [61], VAW [60], and HAKE [51]. For each image, two captions are given, a positive and a negative. The positive caption is coherent with the visual structure of the image, and the negative caption is constructed by modifying one word in the positive caption that corresponds to a structural visual aspect (e.g., an attribute). We report results on a combined VL-Checklist dataset excluding Visual Genome. For more details see Section D.1. (2) Winoground [76] is a dataset that probes compositionality in VL models. The dataset contains 400 samples, each composed of two image-text pairs. The pairs have overlapping lexical content but are differentiated by a swapping of an object, a relation, or both. Each sample tests the performance on two text-retrieval tasks (text score) and two image-retrieval tasks (image score). The combined performance is represented by the group score. A recent study [14] has shown that solving Winoground requires not just compositionality but also other abilities such as commonsense reasoning. The study proposed a new split of Winoground differentiating the samples by the source of their hardness. Throughout our work we use this suggested split; for more details, see Section D.2. Zero-Shot Classification. Our method was evaluated on 21 classification datasets following the zero-shot classification protocol described in ELEVATER [46]. The evaluation includes common classification datasets such as ImageNet [66], CIFAR100 [44], EuroSAT [23], and others. We report the average results over the 21 tasks in Table 1.

Implementation Details
We have implemented SGVL in PyTorch; the code is included in the supplementary and will be released upon acceptance. Our code and training procedures are based on CLIP and BLIP. For CLIP [63], we use the ML-Foundation Open-CLIP repository [31], and for BLIP [47] we use the official implementation. We use the ViT-B/32 model architecture for CLIP and ViT-B/16 for BLIP. The models are initialized with the original weights released by the respective authors. For more details, please refer to Section D.

Baselines
We compare SGVL to two models reported in previous work. The first is CLIP [63], the OpenAI pretrained model, which was trained on 400M image-text pairs and achieves high zero-shot results. Second, we consider the recent BLIP [47] model, which shows improved results on VL understanding tasks. To ensure a fair comparison, all methods use the same network architecture and initialization from the respective pretrained models. We provide additional results in Section B in the supplementary.
The evaluation of Winoground and VL-Checklist requires matching images and texts. We follow the standard protocols for CLIP and BLIP. Namely, for CLIP, we compute the cosine similarity between the image and text CLS embeddings. For BLIP, we use the ITM module, which predicts a positive-negative score for an image-text pair.

Results
We evaluate our method on Winoground and VL-Checklist, and the results are shown in Table 1. On Winoground, CLIP-SGVL improves the image, text, and group scores, while BLIP-SGVL improves the image and group scores significantly, and the text score remains comparable. We note that our BLIP-SGVL is the first to report above-chance results on all three Winoground score splits (see "Random Chance"). On VL-Checklist, CLIP-SGVL improves all tasks, and BLIP-SGVL achieves gains on the "Attribute" and "Object" tests, while the results for the "Relation" test slightly degrade. We note that these improvements come at the price of some degradation in zero-shot performance compared to the CLIP and BLIP models.
Table 2 shows our performance on fine-grained Winoground splits. We report the NoTag, Visually Difficult, and Complex Reasoning splits, as introduced in [14]. The first split measures compositional understanding, while the second tests the detection quality of visual elements. Resolving the last category requires common sense reasoning or world knowledge. It can be seen that our approach incorporates structure well, enhancing compositionality, detection quality, and common sense reasoning.
Last, Table 3 shows improvements on VL-Checklist. Specifically, in the "Attribute" and "Object" categories we show consistent improvements for CLIP-SGVL and BLIP-SGVL. For the "Relation" category, we assume that our graph-based negatives do not focus as much on action relations as they do on spatial relations (which are more representative of the scene graphs from VG).

Ablations
We perform an ablation study on the Winoground [76] dataset with our SGVL approach using the BLIP model (see Table 4). For more ablations, see Section B.2 in the supplementary.

Scene Graph Information. To validate the best way of utilizing the SG information, Table 4a illustrates how the visual and textual components contribute to the approach. By adding only the textual descriptions generated from the SGs (BLIP + Graph Text in the table), we obtain the following Text/Image/Group score changes over the BLIP baseline: +2.7/-2.2/-0.2. The results compared to the baseline improve to +1.5/+6.3/+3.2 when the graph-based negatives (GN) are included, demonstrating the effectiveness of the generated negatives. Moreover, when the SG tokens are added, the image/group scores improve significantly by +11.3/+6.5, whereas the text score remains comparable. This highlights that supervising both visual and textual components when learning from SG labels is beneficial.

Visual Encoder Components. In this ablation, we justify our design choices within the visual encoder, using the same generated captions and negatives for all variants. We report performance in Table 4b on three different tasks: Winoground (group score), zero-shot classification (ZS), and SG prediction (IoU metric). To measure the contribution of the SG tokens, we propose an alternative variant, BLIP MT (BLIP Multi-Task), which does not include SG tokens and predicts the graph using the CLS token. It can be seen that BLIP+SG Tokens outperforms BLIP MT, demonstrating the benefits of the SG tokens. Last, our "adaptive SG tokens" variant outperforms all other models, demonstrating that our proposed adaptation technique tailored to the SG tokens indeed allows better learning of the SG prediction task and Winoground while still maintaining zero-shot capabilities.

Image-SG Comprehensiveness. To examine the significance of SG comprehensiveness with respect to the image, we train our model with partial graphs. Specifically, we train two variants where objects and relations from the graphs are randomly removed (30% and 70%) and a third variant in which all relations are removed. As can be seen in Table 4c, our model performs better when the graphs are denser, richer, and describe the image more accurately, highlighting the motivation to utilize SGs.

Discussion and Limitations
Structured understanding of complex scenes is a key element of human visual perception, but modeling it remains a challenge. In this work, we propose a new approach for incorporating structured information from a small set of SG annotations into pretrained VL models to improve scene understanding. We demonstrate improved performance on Winoground and VL-CheckList with only a mild degradation in zero-shot performance. Although our work exploits SG annotations, it is important to note that these annotations are limited in availability and expensive to collect. Therefore, we leave for future work the challenge of applying our approach in an unsupervised manner.
In this supplementary file, we provide additional information about our experimental results, qualitative examples, implementation details, and datasets. Specifically, Section A provides additional method details, Section B provides more experimental results, Section C provides qualitative visualizations to illustrate our approach, and Section D provides additional implementation details.

A. Additional Method Details
We begin by presenting additional model details regarding our graph preprocessing procedure (Section A.1). Next, we explain in detail our method for creating graph-based negatives (Section A.2), which is illustrated in Figure 4. We conclude by giving more details on our scene graph loss (Section A.3) and some modifications made to our loss calculation when training BLIP [47] (Section A.4).

A.1. Graph Preprocessing
We next describe how we process the image-SG pairs to create our training dataset. Our guiding principle is to create image-SG training pairs where the graphs are dense enough but not too large, in order to allow structured and short descriptions. To this end, given an image I and a corresponding graph G = (V, E), we extract sub-graphs by taking a random walk on the graph. The random walk is initialized by randomly picking a relationship from the graph (an edge e ∈ E and nodes v_1, v_2 ∈ V such that e = (v_1, v_2)) and ends when a node that has no outgoing edges is reached, resulting in a sub-graph G_1 = (V_1, E_1). Next, the image is cropped to the union of the bounding boxes of all objects v ∈ V_1 in the extracted sub-graph, resulting in an image crop I_1. We finish the process by adding to G_1 nodes and relationships from the residual graph G_r = (V \ V_1, E \ E_1) that are visible in I_1. We use G_1 and I_1 as a training sample only if the derived G_1 contains at most 10 objects (i.e., |V_1| ≤ 10). This process creates SGs composed of connected components that are all DAGs with a single Hamiltonian path, which facilitates caption generation.
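A minimal sketch of the random-walk sub-graph extraction, assuming a toy adjacency-list graph format; the real pipeline additionally crops the image to the union of object boxes and re-attaches visible residual nodes, as described above.

```python
import random

def extract_subgraph(edges, max_objects=10):
    """Random-walk sub-graph extraction over a toy graph.
    `edges` maps a node to a list of (relation, child) pairs; the graph is assumed acyclic."""
    candidates = [v for v in edges if edges[v]]          # nodes with at least one outgoing edge
    if not candidates:
        return None
    cur = random.choice(candidates)                      # start from a randomly picked relationship
    nodes, path = {cur}, []
    while edges.get(cur):                                # walk until a node with no outgoing edges
        rel, nxt = random.choice(edges[cur])
        path.append((cur, rel, nxt))
        nodes.add(nxt)
        cur = nxt
    if len(nodes) > max_objects:                         # keep only sufficiently small sub-graphs
        return None
    return nodes, path
```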

A.2. Graph-Based Negatives
In order to generate negative image captions, we propose a set of predefined rules that, when applied to an image-scene graph pair, result in a negative scene graph that incorrectly describes the image. Our scene-graph-to-caption scheme then transforms these negative scene graphs into negative captions that are semantically inconsistent with the images they accompany. For each training sample we randomly apply one of the following negative rules, focused on object attributes and relationships in the graph: (i) asymmetric relation swapping - we call a relationship R asymmetric if, for two objects a, b, aRb ⟹ ¬bRa. We manually annotated which of the 300 most common VG relations are asymmetric. We use these to generate a negative scene graph by searching for an edge e = (v_1, v_2) ∈ E representing an asymmetric relationship, and modify the graph by replacing e with an edge e_n = (v_2, v_1). For example, in the case of a graph describing the phrase "dog chasing cat", such a negative will result in the phrase "cat chasing dog". (ii) relation falsification - we replace relations in the graph with false relations; for example, we turn "cup on table" into "cup under table". For negatives focused on object attributes, we first scan the dataset and split the attributes into the following categories: color, material, size, state. Next, we use this split to perform two types of negatives: (i) attribute falsification - we replace attributes of objects in the graph with false attributes from the same category, for example turning "blue ball" into "red ball". (ii) attribute swapping - we search the graph for two objects that are annotated with attributes from the same category. If such a pair is found, we switch the attributes, resulting, for example, in "silver spoon and golden knife" from "golden spoon and silver knife".
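A sketch of two of the rules above, assuming a toy representation of edges as (subject, relation, object) triplets and attributes as an object-to-attribute dict; the asymmetric-relation set below is illustrative, not our annotated list.

```python
import random

ASYMMETRIC = {"chasing", "on", "under", "holding"}   # illustrative subset of manually annotated relations

def swap_asymmetric_relation(edges):
    """Reverse one asymmetric edge, e.g. ("dog", "chasing", "cat") -> ("cat", "chasing", "dog")."""
    idx = [i for i, (_, rel, _) in enumerate(edges) if rel in ASYMMETRIC]
    if not idx:
        return None
    i = random.choice(idx)
    subj, rel, obj = edges[i]
    negative = list(edges)
    negative[i] = (obj, rel, subj)
    return negative

def swap_attributes(attrs, categories):
    """Swap attributes of two objects whose attributes belong to the same category (e.g. both colors)."""
    objs = list(attrs)
    for a in objs:
        for b in objs:
            same_cat = categories.get(attrs[a]) is not None and categories.get(attrs[a]) == categories.get(attrs[b])
            if a != b and attrs[a] != attrs[b] and same_cat:
                negative = dict(attrs)
                negative[a], negative[b] = attrs[b], attrs[a]
                return negative
    return None
```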

A.3. Scene Graph Loss
We need to match the predictions of the SG tokens with the ground-truth SG, in order to determine which SG tokens correspond to which ground-truth objects and relationships. We follow the matching approach of DETR [9], except that in our case, objects and relationships are matched separately. We describe the object matching below. Given a permutation σ over the object queries, we define the matching cost between the permutation and the GT as the sum over GT objects i of a classification term, -q̂^o_{σ(i)}(o_i), and a box term, λ_l1 ||b_{o_i} - b̂_{σ(i)}||_1 + λ_giou L_giou(b_{o_i}, b̂_{σ(i)}). Namely, we check the compatibility between the GT and permuted objects both in terms of object category (i.e., the probability assigned by the query to the GT object o_i) and in terms of bounding boxes (i.e., how well the predicted box matches the GT one). Here L_giou is from [65], and λ_l1 and λ_giou are hyperparameters.
Finally, we use the optimal matching σ from above to calculate the objects loss L_Obj, which combines a classification term on the class probabilities, -log q̂^o_{σ(i)}(o_i), with a box loss L_Box(b_{o_i}, b̂_{σ(i)}) for the matched (non-∅) objects, where L_Box is a weighted combination of the L_giou and L_1 losses. The relation loss L_Rel is calculated in a similar manner, and the total scene graph loss is the sum of the two, L_SG = L_Obj + L_Rel. Finally, we optimize our final loss as the weighted sum of the scene graph loss and the image-text loss, L = L_IT + λ_SG · L_SG.
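A sketch of the object matching step using a Hungarian solver; the GIoU term of the cost is omitted for brevity, and tensor shapes are assumptions.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_objects(probs, pred_boxes, gt_labels, gt_boxes, l1_weight=1.0):
    """DETR-style bipartite matching between SG-token predictions and ground-truth objects.
    probs: (Q, C+1) class probabilities, pred_boxes: (Q, 4),
    gt_labels: (M,) long tensor of GT class indices, gt_boxes: (M, 4)."""
    cls_cost = -probs[:, gt_labels]                        # (Q, M): -q_hat_sigma(i)(o_i)
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)      # (Q, M): L1 distance between boxes
    cost = (cls_cost + l1_weight * box_cost).detach().cpu().numpy()
    query_idx, gt_idx = linear_sum_assignment(cost)        # optimal assignment sigma
    return query_idx, gt_idx
```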

A.4. BLIP Image-Text Loss Details
Besides the image and text unimodal encoders trained using a contrastive loss, BLIP [47] also includes an image-grounded text encoder that uses additional image-text cross-attention layers. This encoder is equipped with a binary classification head (a linear layer) and is trained to predict whether an image-text pair is positive (matching) or negative (non-matching). In the training procedure described by the authors, the encoder uses a hard-negative mining strategy to calculate an additional loss for all image-text pairs in the batch. When training our BLIP-SGVL model, we apply this loss as well. Additionally, we use this encoder to add another term to our graph-based negative loss (L_GN). Letting E^P_IT(I, T) denote the positive score given by the encoder to the image-text pair (I, T), the added term drives E^P_IT(I_G, T_p) to be higher than E^P_IT(I_G, T_n), analogously to the cosine-similarity-based term.

B. Additional Experiment Results
We begin by presenting additional results (Section B.1). Next, we present additional ablations (Section B.2) we performed in order to test the contribution of the different SGVL components.

B.1. Additional Results
We start by presenting additional results from two recent datasets, ARO and VSR, which demonstrate that our approach works on a variety of datasets. We then show additional results on all splits of Winoground.

The ARO and VSR datasets. To further evaluate our proposed SGVL approach, we experiment on two recent benchmarks, the ARO [85] and VSR [53] datasets, in Table 5.
For the ARO dataset (more details are in Section D.3), we test our models on the COCO Order and Flickr30k Order tasks, which test the ability of VL models to identify the correct order of words in a caption. We compare our CLIP-SGVL and BLIP-SGVL models to the CLIP and BLIP baselines, as well as to the NegCLIP model [85], a variant of CLIP that was trained on the COCO dataset with hard negatives generated from the captions without using graph structure. It can be seen that our models significantly improve over the CLIP and BLIP baselines by +41.2 and +42.1 on COCO, and +31.5 and +39.7 on Flickr30k, respectively. Last, our CLIP-SGVL model also improves over NegCLIP by +1.2 on COCO, demonstrating the effectiveness of our graph-based negative approach over the hard-negative mining used in NegCLIP.
Table 5: Results on ARO and VSR. We report accuracy on the ARO [84] and VSR [53] datasets.

For the VSR dataset (more details are in Section D.4), which checks the ability of VL models to correctly identify spatial relations in images, we compare our BLIP-SGVL model to the BLIP baseline. Our performance improves across all categories, including an average improvement of +5.8. We note that we could not evaluate CLIP since the task requires assigning a true or false label to an image-text pair, which CLIP does not allow in a straightforward manner.
Winoground results. We provide a complete analysis in Table 6 for all the different splits, as reported in [14]. As can be seen, our SGVL approach is consistent across all splits.

B.2. Additional Ablations
Next, we provide additional ablations that further illustrate the benefits of our SGVL.

Training with negatives from non-graph data. In order to demonstrate the importance of structured information in the textual descriptions, we examine the performance of the model when only LAION captions are used. More specifically, we trained on generated negatives that were not derived from scene graph data but were generated in a similar manner to our graph-based negatives. Since we do not have graphs in this setup, we apply the following augmentations: (i) asymmetric relation swapping - we swap the nouns that are relevant to the relation by using a standard parser; (ii) relation falsification - the relation is replaced with one from a closed set of relations we manually annotated in order to obtain the wrong semantic meaning; (iii) attribute swapping - we swap attributes from a closed set of attribute categories that we manually annotated (e.g., color, etc.). The Text/Image/Group scores compared to the BLIP baseline are +4.0/-0.7/-0.8, while using BLIP with our graph-based augmentations (with SG tokens) obtains +1.5/+6.3/+3.2 compared to the BLIP baseline. It can be seen that the negatives generated from LAION improve only the Text score, while applying our graph-based negatives improves all the metrics. This indicates that the main reason for the improvement is the structured information contained in the descriptions generated from the SGs.

The importance of the scene graph data. To examine the significance of the information provided by scene graphs, we test learning from scene graphs that carry no useful information. Thus, we run an experiment in which the scene graphs are completely random. This ablation obtains 37.8/18.7/14.0 on Winoground, compared to 39.0/30.5/21.5 for our BLIP-SGVL (Text/Image/Group scores; the BLIP baseline obtains 39.0/19.2/15.0). This illustrates that our approach is not merely a regularization: the scene graphs provide important information that can be exploited by pretrained large VL models.

The effect of image-SG data size. In this experiment, we train our method using varying amounts of image-SG pairs (10%, 40%, 70%, and 100% of the dataset) in order to examine the effect of the data portion. Figure 5 shows the Winoground group score as a function of the image-SG data portion. As can be seen, the results continue to improve as more data is used, suggesting that adding image-SG data consistently leads to better results.

The importance of simultaneous training. In our SGVL approach, as described in the main paper, we train simultaneously with the original image-text pairs from LAION and the image-SG pairs from Visual Genome. To verify the effectiveness of simultaneous training, we train only with VG data and obtain a degradation of -2.0/-1.2/-1.7 in the Text/Image/Group scores compared to our BLIP-SGVL.
In addition, we observe a degradation of 3% in zero-shot performance when compared to our BLIP-SGVL model.
The results indicate that training simultaneously on image-text pairs from LAION and image-SG pairs from Visual Genome is crucial for both downstream performance and zero-shot capabilities.

Token specialization. Our SGVL approach learns a different specialization for each scene graph token, as shown in Figure 6.
We visualize the boxes predicted by 10 different object tokens and 7 random relationship tokens for all images in the COCO validation set. We observe that the object tokens specialize in different locations in the image, whereas the relationship tokens tend to be centered, since their boxes are larger and spread over a greater area.

Scene graph token representations.
To verify what the scene graph tokens have learned, we evaluate whether the object and relationship tokens can be used directly as a simple scene graph predictor for images. This is accomplished by predicting the scene graphs on Visual Genome from the learned SG tokens. We compare the learned SG tokens with a BLIP model extended with object and relationship heads. Our model achieves an mAP of 17.3, while the proposed baseline achieves an mAP of 14.4. These results indicate that the SG tokens learn meaningful and useful representations.

C. Qualitative Visualizations
Figure 4 shows a visualization of the caption generation process, including positive captions as well as negative captions based on our Graph-Based Negatives module. As shown in the figure, captions generated from scene graphs are much more focused on describing fine-grained details. Furthermore, we show in Figure 7 visualizations of the "scene graph tokens" predictions for images from Visual Genome that were not used in training. It can be seen that although the model has not been trained on these images, the predictions are reasonable. Finally, we show in Figure 8 and Figure 9 an error analysis on Winoground and VL-Checklist to evaluate the successes and errors of our method and the baselines. This illustrates which examples our model succeeds on while the BLIP model fails.
D. Additional Implementation Details

For our experiments, we choose the CLIP [63] and BLIP [47] models as they are among the most popular and easy-to-use methods. These models are implemented based on the Open-CLIP library [31] and the BLIP code base (available at https://github.com/salesforce/BLIP), and we implement SGVL based on these repositories. As described above, our approach is trained using both the original image-text pairs from LAION and the image-SG pairs from Visual Genome. In particular, we use 3M image-text pairs for CLIP-SGVL, while for BLIP-SGVL we use 750K due to computational constraints.
Last, we set the number of object and relationship tokens to 25 and 7, respectively. We also set the LoRA ranks to r_SG = 32 and r_P = 16, and the λ parameters (see Equation 7 and Equation 13) to λ_GN = 1.0 and λ_SG = 0.1. Next, we elaborate on the additional implementation details for each dataset, including inference information.

D.1. VL-Checklist
Dataset. VL-Checklist [89] is a new study that combines the following datasets: Visual Genome [43], SWiG [61], VAW [60], and HAKE [51]. For each image, two captions are given, a positive and a negative. The positive caption is derived from the source dataset and is coherent with the visual structure of the image. The negative caption is constructed by modifying one word in the positive caption that corresponds to a structural aspect of the image. To correctly solve a sample, the model needs to identify the caption faithfully describing the image. Specifically, VL-Checklist evaluates the following structured concepts: (1) Object: identifying objects invariantly to their spatial location and size, (2) Relation: spatial or action relations between two objects, and (3) Attribute: color, material, size, state, and action bound to objects. In the following sections, we report results on a combined VL-Checklist dataset excluding Visual Genome.

Inference details. A test sample consists of an image and two captions. For CLIP, we compute the cosine similarity between the image and the captions and report the positive caption as the one with the higher similarity. For BLIP, we use the ITM head, which predicts both a positive and a negative score for each pair. We consider the caption with the higher positive score to be the correct one.

D.2. Winoground
Dataset. Winoground [76] is a new challenging dataset that evaluates the ability of VL models to capture compositionality in vision & language. The dataset contains 400 samples, each composed of two image-text pairs (I_0, C_0), (I_1, C_1). The pairs have overlapping lexical content but are differentiated by a swapping of an object, a relation, or both. To correctly solve a sample, the model needs to correctly solve two text-retrieval and two image-retrieval tasks. A recent study [14] has shown that solving Winoground requires not just compositional understanding but also other abilities such as commonsense reasoning. The study proposed a new split of the dataset differentiating the samples by the source of their hardness. Specifically, the samples are split into the following categories. Non Compositional - 30 samples that do not require compositional reasoning. Visually Difficult - the model must be able to detect an item that is visually difficult to identify (small, blurry, in the background, etc.) in order to solve these samples correctly; this category includes 38 samples. Ambiguously Correct - 46 samples where at least one caption accurately describes both images or does not quite describe any of the images. Unusual Text & Unusual Image - 106 samples, all of which contain unrealistic or awkward texts or images that make them difficult to solve with a VL model. Complex Reasoning - 78 samples that require common sense reasoning or knowledge of the world around us. No Tag - vanilla Winoground examples that solely probe compositional understanding.

Inference details. For testing, the pairs are given, and a text score, an image score, and a group score are computed for each sample in the following way. The text score is 1 if and only if image I_0 has a higher similarity to caption C_0 than to C_1, and image I_1 has a higher similarity to caption C_1 than to C_0. Similarly, the image score is 1 if and only if caption C_0 has a higher similarity to image I_0 than to image I_1, and C_1 has a higher similarity to image I_1 than to image I_0. The group score is 1 if and only if both the text and image scores are 1. Thus, the random chance for both the image and text scores is 1/4, while for the group score it is 1/6. Similarities between image-text pairs are computed as in Section D.1.
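A minimal sketch of the scoring above, where sim is a 2×2 nested list with sim[i][j] the similarity between image I_i and caption C_j:

```python
def winoground_scores(sim):
    """Compute Winoground text, image, and group scores for one sample."""
    text = int(sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0])    # each image is closer to its own caption
    image = int(sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1])   # each caption is closer to its own image
    group = int(text and image)
    return text, image, group
```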

D.3. ARO
Dataset. ARO [84] (Attribution, Relation, and Order) is a new benchmark that tests compositionality in VL models. This dataset consists of four large-scale tasks designed to test the relational, attributive, and order understanding of the model. For evaluation, the authors propose four tasks that are sensitive to order and composition, namely Visual Genome Relation, Visual Genome Attribution, and COCO & Flickr30k Order. Since our approach is trained on Visual Genome, we report only the COCO and Flickr30k Order tasks (PRC). For the order tasks, image-text pairs from the mentioned datasets are used. The words in the text are reordered in order to create false captions for the image, according to the following perturbations: shuffling nouns and adjectives, shuffling everything but nouns and adjectives, shuffling trigrams, and shuffling words within trigrams.

Inference details. During inference, each sample consists of an image and five textual descriptions. The similarity of each text to the image is measured as in Section D.1, and the text with the highest similarity to the image is reported as the real caption.

D.4. VSR
Dataset. VSR [53] (Visual Spatial Reasoning) is a new benchmark for measuring the spatial understanding of vision-language models. The VSR dataset consists of natural image-text pairs in English; each example contains an image and a natural language description of the spatial relationship between two objects shown in the image. The VL model needs to classify image-caption pairs as either true or false, indicating whether the caption accurately describes the spatial relationship. The dataset has more than 10K image-text samples, derived from 6,940 COCO images, and covers 65 spatial relations. The dataset is split into train, validation, and test sets; however, since we evaluate in a zero-shot manner, we test our model and the baselines on all samples from the train, validation, and test splits. The spatial relations are divided into 7 meta-categories: Adjacency, Directional, Orientation, Projective, Proximity, Topological, and Unallocated. We report results according to these categories, as well as the average over all spatial relations.

Inference details. We do not evaluate CLIP on this task since the task requires assigning a true or false label to an image-text pair, which CLIP does not allow in a straightforward manner; therefore only the BLIP model can be used. We use the ITM head to determine whether the sample is true or false, as done in [47].

D.5. Scene Graph Datasets
In our work, we use the LAION dataset as "standard" image-text pairs, along with image-SG pairs from Visual Genome [43] (VG). Visual Genome is annotated with 108,077 images accompanied by their corresponding scene graphs. On average, images have 35 entities, 21 relationships, and 26 attributes per image. Additionally, there are approximately 70K object categories and 40K relationship categories. In general, Visual Genome scene graphs can be viewed as dense knowledge representations for images, similar to the format used for knowledge bases in natural language processing.

D.6. Licenses and Privacy
The license, PII, and consent details of each dataset are in the respective papers. In addition, we wish to emphasize that the datasets we use do not contain any harmful or offensive content, and they are used by many other papers in the field. Thus, we do not anticipate a specific negative impact, but, as with any machine learning method, we recommend exercising caution.

Figure 2 :
Figure 2: Our Scene Graphs for Vision-Language Models (SGVL) Approach. Our approach trains simultaneously on image-SG pairs (black solid arrows) and "standard" image-text pairs (black dotted arrows). Our key aspect is to capture structure-related information in both the visual and textual components when learning from scene graph labels. For the textual side, we generate captions and negative captions using the graph (Graph-Based Negatives and SG-to-text modules). For the visual side, we predict SG information (classes and boxes) by adding tokens that are intended to capture objects (pink tokens & arrows) and relationships (cyan tokens & arrows), and these predictions are supervised with ground-truth labels using bipartite matching. The object and relationship class labels are matched in embedding space between SG token representations and textual embeddings of ground-truth classes. Finally, our new adaptation technique, shown on the right of the figure, is tailored specifically to the SG tokens and allows better learning of the graph prediction and VL tasks.

Figure 3 :
Figure 3: "Scene Graph Tokens" Visualization.The predictions of object tokens (pink) and relationship tokens (cyan) are shown for images not in the Visual Genome training data.

Table 4 :
Ablations on the Winoground Dataset. We report Text, Image, and Group scores. We show (a) the contribution of scene graph information, (b) method components, and (c) image-SG comprehensiveness. More ablations are in Section B in the supplementary.

Figure 4 :
Figure 4: Qualitative visualization of the Graph-Based Negatives and SG-to-Text modules. We show the generation process of positive captions (green) and negative captions generated using the graph (red).

Figure 5 :
Figure 5: Image-SG Pair Data Size. We report the performance of our model on the Winoground group score as a function of the amount of scene-graph data used during training (percentage of the available data).

Figure 6 :
Figure 6: Token Specialization. We visualize the box predictions of 10 random object tokens (left) and 7 relationship tokens (right) on all images from the COCO validation set. Each box is represented as a point with the normalized coordinates of its center. Colors indicate the predictions made by different tokens.

Figure 7 :
Figure 7: Scene Graph Prediction. We show the predictions of the "scene graph tokens" on images from Visual Genome that our model was not trained on.

Figure 8 :
Figure 8: Error Analysis on Winoground. On the left, in green, we show examples where our BLIP-SGVL model succeeds while the baseline BLIP model fails. On the right, in red, we show examples in which our BLIP-SGVL model fails. As can be seen, our model improves on samples that require understanding relations between objects, binding attributes to objects, and counting objects.

Table 1 :
Winoground and VL-Checklist Summary Results. The gains and losses are shown in color.

Table 3 :
VL-Checklist Results on the Attribute, Object, and Relation tests, excluding the Visual Genome dataset.