Probing Cross-Modal Representations in Multi-Step Relational Reasoning

We investigate the representations learned by vision and language models in tasks that require relational reasoning. Focusing on the problem of assessing the relative size of objects in abstract visual contexts, we analyse both one-step and two-step reasoning. For the latter, we construct a new dataset of three-image scenes and define a task that requires reasoning at the level of the individual images and across images in a scene. We probe the learned model representations using diagnostic classifiers. Our experiments show that pretrained multimodal transformer-based architectures can perform higher-level relational reasoning, and are able to learn representations for novel tasks and data that are very different from what was seen in pretraining.


Introduction
Intelligence is classically described as "the ability to see the similarities among dissimilar things and the dissimilarities among similar things" (Thomas Aquinas, 1225-1274, reported by Ruiz, 2011). Developing systems that can reason over objects and their relations is indeed a long-standing goal of artificial intelligence research, as argued by Johnson et al. (2017). In recent years, huge progress toward this goal has been made in the language and vision community. Starting from Malinowski and Fritz (2014) and Antol et al. (2015), a wealth of studies have focused on language-driven visual reasoning, namely the problem of reasoning about an image given some linguistic input.
Generally speaking, there are two main types of problems in visual reasoning datasets (see Santoro et al., 2017): non-relational, requiring models to focus only on a given object (e.g., answering the question "What material is the cube made of?"), and relational, requiring models to pay attention to several or even all the objects in the image (e.g., indicating whether the statement "There are four cubes that are red" is true or false). Relational problems call for higher-level abilities, such as counting or directly comparing objects, both of which involve recognising the (dis)similarities among things.
In this paper, we focus on an important but understudied relational reasoning task: assessing the relative size of objects in visual contexts, that is, determining whether an object counts as 'big' or 'small' in an image. We define a multi-step relational reasoning problem formulated as a sentence verification task. We construct a dataset of three-image scenes where a given target object, e.g., a blue triangle, is present in each image: two images have target objects with the same contextually-defined size and one image stands out in this regard. The task requires verifying whether a simple natural language statement standing for a first-order logical form describes a scene, e.g., "There is exactly one blue triangle that is small in its image in this scene" (Figure 1). Such multi-step relational reasoning is at play in many real-life situations: e.g., the very same pan may count as 'big' in all contexts except a restaurant kitchen.
We experiment with two types of models to solve this task: a modular neural network (Hu et al., 2017) and LXMERT, a pre-trained multimodal transformer (Tan and Bansal, 2019). We probe the learned representations of LXMERT to assess whether, and to what extent, it has learned the underlying structure of the data. By means of two experiments with probing classifiers (Alain and Bengio, 2017; Hupkes et al., 2018; Belinkov and Glass, 2019), we first verify that it is able to perform the task at the image level (i.e., to compute the relative size of the target object at the image level); then, we test its ability to reason at the multi-image level and detect the image that stands out.
The experiments show that LXMERT is able to solve the multi-step relational reasoning task with an accuracy of 88.8%, and that the majority of errors occur when the relative size of the target object is difficult to determine. Our analyses show (i) that, in most cases, different attention heads in LXMERT specialise in localising the smallest and biggest objects in the images, (ii) that the cross-modal representations learned appear to encode a threshold function that controls whether an object is 'big' or 'small' in an image, and (iii) that a simple diagnostic classifier successfully identifies the instance that stands out in a three-image scene. Taken together, these findings lend further support to the advanced reasoning abilities of pretrained transformer-based architectures, showing that they can perform higher-level relational reasoning and are able to deal with novel tasks and novel data, including synthetic data not available during pre-training.

Figure 1: One sample scene from our dataset and the four statements it can be paired with, including corresponding truth values assigned as explained in Section 4.1. For clarity, the odd-one-out image (holding the odd size) is framed in red. Best viewed in color.

Problem Formulation
We investigate multi-step relational reasoning by formulating the problem as a visually grounded sentence verification task (see Figure 1). Given a pair ⟨scene, statement⟩ consisting of a visual scene and a statement about that scene, the task consists in classifying the statement as either true or false. In our setup, a scene consists of 3 images, $\mathit{img}_1$, $\mathit{img}_2$, $\mathit{img}_3$, each including an instance of the target object (e.g., a blue triangle) together with other geometrical shapes of the same type (e.g., triangles of other colours). A statement paired with a scene is of the following form: "there is exactly one blue triangle that is small in its image in this scene" or "there are exactly two blue triangles that are big in their images in this scene". As we will explain in detail in Sec. 4.1, the dataset is created such that the target object counts as either 'big' or 'small' in only one of the three images in a scene.
Arguably, solving the task requires the following two steps of relational reasoning: (1) identifying whether the target object counts as either 'big' or 'small' in each image, and (2) counting how many images include a big/small target. However, in our setup there is no direct supervision for any of these steps. In other words, the training data does not indicate which images contain an object that counts as big/small nor explicitly how many images contain a big/small target.
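The second reasoning step can be made concrete with a small oracle that is handed the per-image sizes, which the model never observes during training. This is only an illustrative sketch: the function name `verify_statement` and the `"big"`/`"small"` string labels are assumptions introduced here, not part of the dataset.

```python
from typing import Sequence


def verify_statement(image_sizes: Sequence[str], quantity: int, size: str) -> bool:
    """Oracle for 'there is/are exactly <quantity> ... that is/are <size> ...'.

    image_sizes is the (hidden) outcome of step 1: one 'big'/'small' label
    per image in the scene. Step 2 counts the images whose target has the
    queried size and compares the count with the stated quantity.
    """
    matching = sum(1 for s in image_sizes if s == size)
    return matching == quantity


# Hypothetical scene: the target is small in one image and big in the other two.
scene = ["big", "small", "big"]
print(verify_statement(scene, 1, "small"))  # True
print(verify_statement(scene, 2, "small"))  # False
```

The model has to recover both the per-image labels and the count from the True/False supervision alone.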

Visual Reasoning
To evaluate the reasoning abilities of multimodal models, several datasets of synthetic scenes and questions, such as CLEVR (Johnson et al., 2017), ShapeWorld (Kuhnle and Copestake, 2017), and MALeViC (Pezzelle and Fernández, 2019), have been proposed in recent years. Our work directly builds on them, and particularly on approaches adopting a multi-image setting, such as NLVR (Suhr et al., 2017) and NLVR2 (which, however, contains pairs of natural scenes; Suhr et al., 2019). In NLVR, in particular, a crowdsourced statement is coupled with a synthetic scene including 3 independent images, and models must verify whether the statement is true or false with respect to the entire visual input. This involves handling phenomena such as counting, negation or comparisons, which require performing relational reasoning over the entire scene, e.g.: "There is a black item in every box", "There is a tower with yellow base", etc. However, most ⟨scene, statement⟩ pairs do not challenge models to do the same at the level of the single image (or box), where a low-level understanding of the object(s) of interest (shape, color, etc.) often suffices. Our approach is novel since it requires two steps of relational reasoning: at the level of both the single image and the multi-image context.

Multi-Image Approaches
Our approach is also related to other work in language and vision involving multiple images. One is the spot-the-difference task: in Jhamtani and Berg-Kirkpatrick (2018), models are fed with pairs of video-surveillance images that only differ in one detail, and asked to generate text which describes such difference. The same task, with different real-scene datasets, is explored by Forbes et al. (2019) and Su et al. (2017); others experiment with pairs of similar images drawn from CLEVR (Johnson et al., 2017) or similar synthetic 3D datasets (Park et al., 2019; Qiu et al., 2020). This task is akin to ours since it requires a higher-level reasoning step: systems must reason over the two independent representations to describe what is different. However, in practice, it does not always require semantic understanding (Jhamtani and Berg-Kirkpatrick, 2018); when it does, the changes often involve one object's fixed attribute (color, shape, material, etc.) rather than a contextually-defined property whose applicability depends on the other objects in the image.
A similar, partially overlapping task is discriminative captioning: systems are fed with a set of similar images and asked to provide a description that unequivocally refers to a target one. Many approaches have been proposed focusing on synthetic (Achlioptas et al., 2019) or natural scenes (Vedantam et al., 2017; Cohn-Gordon et al., 2018; Vered et al., 2019), very often embedding pragmatic components based on the Rational Speech Acts framework (RSA; Goodman and Frank, 2016). Also in this case, however, differences among images mainly involve intrinsic attributes of the objects rather than relational properties defined at the level of the image.

3POS1 Dataset
Our dataset is based on the POS1 dataset from MALeViC (Pezzelle and Fernández, 2019), in which images contain 4 to 9 same-shape objects, e.g., squares. Each object is labeled with a ground-truth relative size, indicating whether the object counts as either big or small in that particular context. The label is determined by the following threshold function, motivated by cognitive science studies on how humans interpret relative gradable adjectives (Schmidt et al., 2009):

$$T = \mathit{Max} - k\,(\mathit{Max} - \mathit{Min}) \tag{1}$$

where Max and Min represent the areas of the biggest and smallest objects in the image, and k is a positive value < 0.5. An object counts as big if its area is above the threshold T, and as small otherwise. Thus, an object with a certain area can count as big in one image and as small in another one. In total, the POS1 dataset contains 20K ⟨image, statement⟩ datapoints (16K train, 2K val, 2K test), where statements are about the size of a target object based on its unique color: e.g., "the blue triangle is a small triangle".
The dataset for the present experiments, which we name 3POS1, is constructed as follows. For each image in each split of POS1, we randomly sample two images from that split with the same target object (e.g., a blue triangle) but the opposite ground-truth size (e.g., big). We obtain 20K sets of three images where one size is prevalent, i.e., present in two images, and one is odd, i.e., held by only one image. The sizes big and small are the prevalent ones in 10K cases each, thus the dataset is balanced. Then, for each three-image scene, we generate four logic-based templated statements, two of which are true and two false for the given scene. The only variation in the statements is the target object.
The four types of statement are (alongside examples with respect to Figure 1):
(i) one shape, color small: "There is exactly one blue triangle that is small in its image in this scene" → True
(ii) one shape, color big: "There is exactly one blue triangle that is big in its image in this scene" → False
(iii) two shapes, color small: "There are exactly two blue triangles that are small in their images in this scene" → False
(iv) two shapes, color big: "There are exactly two blue triangles that are big in their images in this scene" → True
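The labelling and statement generation can be illustrated with a short sketch. It assumes the threshold takes the form T = Max - k(Max - Min), that an object counts as big when its area is strictly above T, and that k = 0.29 (the fixed value mentioned later for the ceiling performance). The function names and the toy areas are hypothetical and only serve the illustration.

```python
def relative_size(target_area, areas_in_image, k=0.29):
    """Label the target 'big' or 'small' against T = Max - k * (Max - Min).

    areas_in_image holds the areas of all objects in the image (target included).
    """
    biggest, smallest = max(areas_in_image), min(areas_in_image)
    threshold = biggest - k * (biggest - smallest)
    return "big" if target_area > threshold else "small"


def statements_for_scene(image_sizes, shape="triangle", color="blue"):
    """Return the four templated statements with their truth values."""
    out = []
    for quantity in (1, 2):
        for size in ("small", "big"):
            if quantity == 1:
                text = (f"there is exactly one {color} {shape} "
                        f"that is {size} in its image in this scene")
            else:
                text = (f"there are exactly two {color} {shape}s "
                        f"that are {size} in their images in this scene")
            out.append((text, image_sizes.count(size) == quantity))
    return out


# Toy scene: (target area, areas of all objects) for each of the three images.
scene = [(40, [10, 40, 100]), (40, [10, 40, 45]), (90, [20, 90, 100])]
sizes = [relative_size(target, areas) for target, areas in scene]
print(sizes)  # ['small', 'big', 'big']
for text, truth in statements_for_scene(sizes):
    print(truth, text)
```

In the toy scene, the same area (40) counts as small in the first image and as big in the second, which is the context dependence that the task relies on.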

Models
To tackle the visually grounded sentence verification task, we use two models that achieve state-of-the-art results on the NLVR (Suhr et al., 2017) and NLVR2 (Suhr et al., 2019) tasks, respectively: N2NMN (Hu et al., 2017) and LXMERT (Tan and Bansal, 2019). The End-to-End Module Network (N2NMN) belongs to the family of modular networks, which treat a sentence as a collection of predefined subproblems (e.g., counting, localization, conjunction, etc.), each handled by a dedicated module. Compared to its direct predecessor NMN, in particular, N2NMN does not require any external supervision (e.g., a parser) to process the sentence into its components. The latter, Learning Cross-Modality Encoder Representations from Transformers (LXMERT), is a transformer-based multimodal architecture pretrained on several language-and-vision tasks; as such, it is claimed to be universal, that is, capable of solving virtually any visual reasoning problem. LXMERT uses BERT (Devlin et al., 2019) to encode the language input; as for the image, it considers the sequence of N salient regions output by Faster R-CNN (Ren et al., 2015).
To assess the suitability of these models for the 3POS1 task, we first evaluate them on the original POS1 task, where statements are evaluated against a single image. For N2NMN, we use a public implementation,⁶ specifically, the code developed for training and evaluating the model on the CLEVR dataset (Johnson et al., 2017). For LXMERT, we use a snapshot pre-trained on several multi-modal tasks,⁷ which we fine-tune using the training set of POS1. The ceiling performance for this task is 97% accuracy (using a fixed interpretation of the threshold parameter k = 0.29). LXMERT achieves 93.4% accuracy, which outperforms both N2NMN (78.1%) and the models tested by Pezzelle and Fernández (2019). This shows the overall advantage of transformer-based architectures over competing methods, in line with previous findings (Devlin et al., 2019). Moreover, it indicates the capability of LXMERT, which is pre-trained on natural images and language, to deal with synthetic data after fine-tuning (crucially, when not fine-tuned it yields an accuracy of 50%, i.e., random). Based on its performance, we focus on LXMERT in the main experiments and analyses in this paper.

Experimental Setup
We fine-tune LXMERT on the 3POS1 dataset by adapting the method applied by Suhr et al. (2019) for the two-image scenes of NLVR2 to our three-image scenes. More concretely, each datum in 3POS1 is composed of 3 images $\mathit{img}_0$, $\mathit{img}_1$, $\mathit{img}_2$, a statement $\mathit{stat}$, and a ground-truth label True or False. Recall that the visually grounded sentence verification task is to predict a label (True or False), given a representation of the images and the statement. An overview of how this is achieved with LXMERT is shown in Figure 2. First, visual features are extracted separately for each image with Faster R-CNN (Ren et al., 2015). Then cross-modal representations $x_i$ are extracted from the [CLS] token of the LXMERT encoder for each image in a scene:

$$x_0 = \mathrm{lxmert\_encoder}(\mathit{img}_0, \mathit{stat}), \quad x_1 = \mathrm{lxmert\_encoder}(\mathit{img}_1, \mathit{stat}), \quad x_2 = \mathrm{lxmert\_encoder}(\mathit{img}_2, \mathit{stat}) \tag{2}$$

For label prediction, we train a classifier on the concatenation of the three image-statement representations (Eqn. 3), followed by a linear layer with learned parameters W and a bias vector b (Eqn. 4), followed by layer normalization (Ba et al., 2016) and a GeLU activation (Hendrycks and Gimpel, 2016):

$$x = [x_0; x_1; x_2] \tag{3}$$

$$z = W x + b \tag{4}$$

The LXMERT encoder and the classifier are fine-tuned for 12 epochs, to prevent overfitting, with a batch size of 64. The learning rate of the Adam optimizer (Kingma and Ba, 2014) is 5e-5. The fine-tuning is performed for 5 random seeds.

⁶ https://github.com/ronghanghu/n2nmn
⁷ Downloaded from http://nlp1.cs.unc.edu/data/model_LXRT.pth
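The classification head described in this section (concatenation, linear layer, layer normalization, GeLU) can be sketched in plain Python as follows. This is a toy illustration rather than the code used in the experiments: the tiny dimensions, the random stand-ins for the [CLS] representations, and the names `scene_features`, `layer_norm` and `gelu` are assumptions, the learned gain and bias of layer normalization are left out, and so is any output layer mapping the features to the two labels.

```python
import math
import random


def layer_norm(values, eps=1e-12):
    """Normalise a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def gelu(v):
    """Exact GeLU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def scene_features(x0, x1, x2, W, b):
    """Concatenate the three representations, apply W x + b, layer norm, GeLU."""
    x = list(x0) + list(x1) + list(x2)
    z = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]
    return [gelu(v) for v in layer_norm(z)]


# Random stand-ins for the three [CLS] representations (toy hidden size 4).
rng = random.Random(0)
hidden, out_dim = 4, 6
x0, x1, x2 = ([rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(3))
W = [[rng.uniform(-0.5, 0.5) for _ in range(3 * hidden)] for _ in range(out_dim)]
b = [0.0] * out_dim

features = scene_features(x0, x1, x2, W, b)
print(len(features))  # 6
```

In a real setup the same computation would be expressed with a deep learning framework so that W, b and the encoder can be fine-tuned jointly.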

Results
Overall, LXMERT achieves a very high accuracy on the task, averaged across 5 runs: 0.8909 ± 0.004 on the validation set, 0.8864 ± 0.005 on the test set. Moreover, its performance turns out to be fairly stable across the various statement types, with the best model run's accuracy (see Table 1) ranging from 0.868 (one shape, color big, true) to 0.924 (two shapes, color small, false). Interestingly, for all four statement types, the model performs slightly better on false than on true statements, even though the dataset was carefully balanced. Taken together, these results indicate that the model, which is pre-trained on natural images, can deal with the synthetic scenes in our dataset after fine-tuning. This is in line with the claim that off-the-shelf transformer-based models can be applied to a wide range of different learning problems and data. At the same time, the model yields random accuracy when not fine-tuned, which reveals that our new dataset is challenging and involves a type of reasoning not captured during pre-training.
In Pezzelle and Fernández (2019), models were shown to make more errors when the area of the queried object is closer to the threshold (see Eq. 1). We check if this is also the case for LXMERT on our 3POS1 task. To do so, we consider the cases where the model gives a wrong prediction. Among the 3 images in a scene, we take the one with the lowest distance from the threshold. We then check whether the model makes more errors when such distance is lower, i.e., when there is at least one image in the scene with a borderline size. As reported in Table 2, this is indeed the case: 75% of incorrect predictions involve cases where, in at least one image, the target object is close to the threshold (< 0.1) (see Figure 3, where the leftmost image is borderline). In contrast, only around 3% of the errors involve clear-cut cases, i.e., images where the target object's distance from the threshold is ≥ 0.2. As observed by Pezzelle and Fernández (2019), this may suggest that the model is genuinely learning to compute the threshold function based on the areas of the relevant objects in the scene. Further support for this is given by the performance of the model on the 15 cases in the test set where the target object has the same area in all three images of the scene. These cases could be expected to act as a confound for the model,⁹ but LXMERT succeeds in 14/15 cases. Consistent with the error pattern reported above, the missed case contains low-distance objects (the lowest distance is equal to 0.1). In the next section, we explore this issue more extensively.

Figure 3: A sample from the test split of 3POS1, paired with the statement "there is exactly one green circle that is small in its image in this scene", for which LXMERT predicts the incorrect label (True, instead of False). The numbers above the images are the distances of the target object (green circle) from the image-specific threshold. Here, the target object in the leftmost image is very close to that image's threshold value, so it is challenging for the model to detect whether it is big or small. The odd-one-out image is framed in red. Best viewed in color.
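The binning behind this error analysis can be sketched as follows. The bin edges 0.1 and 0.2 come from the text, while the intermediate bin, the function name `bin_errors_by_distance`, and the toy distances are assumptions made for illustration; they are not the values behind Table 2.

```python
def bin_errors_by_distance(error_scenes):
    """Share of misclassified scenes per bin of the minimum distance to threshold.

    error_scenes: one tuple per misclassified scene, holding the distance of the
    target object from the image-specific threshold in each of the three images.
    """
    counts = {"< 0.1": 0, "0.1-0.2": 0, ">= 0.2": 0}
    for distances in error_scenes:
        closest = min(distances)  # the most borderline image in the scene
        if closest < 0.1:
            counts["< 0.1"] += 1
        elif closest < 0.2:
            counts["0.1-0.2"] += 1
        else:
            counts[">= 0.2"] += 1
    total = len(error_scenes)
    return {name: (count / total if total else 0.0) for name, count in counts.items()}


# Made-up distances for four misclassified scenes.
toy_errors = [(0.03, 0.4, 0.5), (0.08, 0.2, 0.3), (0.15, 0.3, 0.6), (0.25, 0.4, 0.5)]
print(bin_errors_by_distance(toy_errors))
# {'< 0.1': 0.5, '0.1-0.2': 0.25, '>= 0.2': 0.25}
```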

Analysis at the Individual Image Level
Our results show that LXMERT achieves a high level of accuracy on our visually-grounded sentence verification task on the three-image 3POS1 dataset. In this section, we investigate how the model may be solving the task. Specifically, we explore what visual information the model attends to within each image and whether the representations learned by the model encode information about the context-dependent threshold that determines what counts as big or small in a given image.

⁹ The target objects have exactly the same area in pixels, but each target object has its own context-defined size.

Visual Attention over Key Object Types
Recall that the ground-truth labels in our dataset are assigned based on the function in Eqn. 1, which was shown to fit well with human judgements about relative gradable adjectives (Schmidt et al., 2009). This function computes a threshold value taking into account the biggest and smallest objects in the context of an image. Thus, a possible strategy adopted by the model at the level of individual images could be to identify the target object and reason about the context by focusing on the biggest and smallest objects. We test this hypothesis by checking whether the model pays particular attention to these object types (target, biggest, smallest) or whether its attention is rather uniformly distributed over all regions detected by Faster R-CNN (Ren et al., 2015).
To compute which objects are the most attended, we use the Intersection over Union (IoU) metric (Russakovsky et al., 2015). We take the attention weights provided by the [CLS] token representation, extracted from the final layer of the best fine-tuned model with frozen parameters. We then compute IoU Precision@K over the instances whose label the model predicts correctly, using the following steps:
1. Extract top-K object proposals: For each correctly predicted label, separately for each of the three images in a scene, we take the object proposals of the image regions detected by Faster R-CNN with the K highest scores in the [CLS] token. We perform the procedure for each attention head of the representation, extracted from the cross-modality encoder for the corresponding visual-language input. We ignore the object proposals related to the background areas of the image, which we identify based on the labels provided by Faster R-CNN.¹⁰
2. Extract ground-truth bounding boxes: We take the ground-truth bounding boxes of the biggest/smallest/target objects from all three images in the input scene.¹¹
3. Calculate pairwise IoU: We calculate the pairwise IoU between the top-K object proposals and the ground-truth bounding boxes obtained in Steps 1 and 2. We take the highest IoU value calculated for all these pairs.
4. Calculate IoU Precision@K: The IoU Precision@K is the percentage of all the IoU values obtained in Step 3 that are > 0.5.
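The four steps above can be sketched as follows for a single attention head. This is a simplified illustration under several assumptions: boxes are given as `(x_min, y_min, x_max, y_max)` tuples, background proposals are taken to be filtered out beforehand, each sample corresponds to one image, and the function names and toy boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_iou(proposals, attention, gt_boxes, k=1):
    """Steps 1-3: highest IoU between the k most attended proposals and any GT box."""
    top_k = sorted(range(len(proposals)), key=lambda i: attention[i], reverse=True)[:k]
    return max(iou(proposals[i], gt) for i in top_k for gt in gt_boxes)


def iou_precision_at_k(samples, k=1, min_iou=0.5):
    """Step 4: share of samples whose best IoU is above min_iou."""
    hits = [best_iou(props, att, gts, k) > min_iou for props, att, gts in samples]
    return sum(hits) / len(hits)


# Two toy images: (proposals, attention scores of one head, ground-truth boxes).
proposals = [(0, 0, 10, 10), (20, 20, 30, 30)]
samples = [
    (proposals, [0.9, 0.1], [(1, 1, 10, 10)]),  # most attended box overlaps the GT
    (proposals, [0.2, 0.8], [(0, 0, 10, 10)]),  # most attended box misses the GT
]
print(iou_precision_at_k(samples, k=1))  # 0.5
```

The random baseline would replace the attention-based ranking with a random choice of K proposals.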
We also compute a random baseline for all three categories with the same steps, except that in Step 1 we randomly select K objects from the 36 detected by Faster R-CNN, instead of using the ones with the highest attention scores. We use the smallest possible value, K = 1, as the most illustrative case, in which the metric only looks at the single object in each image to which the model attends the most.
Figure 4 shows the results of the IoU Precision@K for the 12 attention heads in LXMERT. In particular, Figure 4a shows that many of the attention heads attend to the target object that is queried directly in the input sentence. Figures 4b and 4c demonstrate that the model also looks at the surrounding visual context, which is needed to perform relational reasoning. A comparison of behaviour across the figures reveals that different attention heads appear to specialise on different object types: attention head 9 learns to attend to the smallest objects, while it pays no attention to the biggest objects and less than random attention to the target objects. We also highlight the observed behaviour of attention head 11, which is the only head that reliably attends to the biggest objects. Figure 5 shows an example of the objects attended to by attention head 9 in one sample scene. Here, we can see that the model is primarily attending to the smallest objects in the scene.

Figure 5: Example of object proposals most attended to by the 9th head of the last layer of the cross-modality encoder, for the statement "there are exactly two green circles that are big in their images in this scene" (True). In each image, the model attends to all of the objects except the biggest ones. Simultaneously, in the leftmost image, it also focuses on the green circle, which is the target object in this scene.

¹⁰ The attributes predicted for the regions corresponding to the black background in our scenes could be black or dark.
¹¹ We calculate the coordinates of the boxes using the objects' position and radius provided in the annotation of the POS1 dataset by Pezzelle and Fernández (2019).

Implicit Knowledge of the Threshold
The analysis above showed that the model, besides the target object, also pays attention to key contextual information, particularly to the smallest and biggest objects in an image. These objects are critical to compute the threshold that determines whether a target object is big or small relative to the context of an image. To test whether the representations learned by the model implicitly encode information about the context-dependent threshold, we use a diagnostic classifier (Alain and Bengio, 2017; Hupkes et al., 2018; Belinkov and Glass, 2019). Probing or diagnostic tests are useful tools to better understand the inner workings of deep models. Given a hypothesis about information that may be encoded by a trained model, a probe checks whether such information is accessible by a relatively simple classifier.
Concretely, in this experiment we use a linear regression model as the probe to predict the threshold values for each of the three images in a scene, given the cross-modality features learned by the LXMERT encoder ($x_0$, $x_1$, $x_2$ in Eqn. 2). The probe uses the same train/val/test splits of the 3POS1 dataset. The predicted and actual values are displayed in Figure 6, which shows that a simple linear model can predict the threshold values for each image in a scene remarkably accurately (the mean squared error on the test set is 6.64e-05). This indicates that the cross-modality representations learned by the model encode the threshold of each image.
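The logic of this probe, fitting a linear map from frozen features to the threshold value and reporting the mean squared error, can be sketched with a toy example. Everything below is an assumption for illustration: the two-dimensional features stand in for the LXMERT representations, the targets are generated from a made-up linear rule, and plain gradient descent replaces whatever solver a regression library would use.

```python
def fit_linear_probe(features, targets, lr=0.1, steps=5000):
    """Fit y ~ w . x + b by batch gradient descent on the mean squared error."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(d):
                grad_w[j] += 2 * err * x[j] / n
            grad_b += 2 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def mse(features, targets, w, b):
    """Mean squared error of the linear probe on the given data."""
    errors = [sum(wi * xi for wi, xi in zip(w, x)) + b - y
              for x, y in zip(features, targets)]
    return sum(e * e for e in errors) / len(errors)


# Toy stand-ins: 2-d "representations" on a grid and linearly generated "thresholds".
features = [(a, c) for a in (0.0, 0.5, 1.0) for c in (0.0, 0.5, 1.0)]
targets = [0.5 * a - 0.25 * c + 0.1 for a, c in features]

w, b = fit_linear_probe(features, targets)
print(mse(features, targets, w, b))
```

A low error on held-out data is what suggests that the threshold is linearly recoverable from the frozen representations; on real features the probe would be fit on the training split and evaluated on the test split.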

Analysis at the Multi-Image Level
In the previous section, we analysed the model representations at the level of the independent images. Here, we probe the representations with respect to the entire three-image scene. First, we investigate whether the representations encode information on the overall configuration of the scene (Sec. 7.1). Second, we probe their effectiveness in identifying the odd-one-out image in the scene (Sec. 7.2). In both analyses, we use diagnostic classifiers that take as input the concatenation of the three image-statement cross-modal representations (Eqn. 3).

Scene Configuration Classification
We first investigate whether the representations learned by the model encode the configuration of the scene, that is, whether they are effective to distinguish between scenes where 1 target object counts as small and 2 as big (hence, 1small2big), and vice versa (1big2small). In principle, this counting step is necessary to solve the sentence-verification task (see Sec. 2), and this probe determines whether the model is reasoning at the level of the scene or exploiting other strategies, such as capturing random correlations in the data. We use an SVM classifier with a linear kernel (Boser et al., 1992) to probe the representations learned by the model, and find that they are indeed useful for predicting the configurations. Accuracy on the test set is 88.15%, which is well above chance level (50%). As reported in Table 3, in the large majority of cases (85.7%) a correct prediction in the sentence verification task corresponds to a correct assessment by the diagnostic classifier. This confirms that LXMERT learns representations that encode the configuration of the scene.

Odd-One-Out Image Identification
Our results so far show that the model is able to perform the multi-step sentence verification task with high accuracy and that the representations encode information about different configurations of scenes. However, there is yet no guarantee that the model is able to identify the odd-one-out image (i.e., the image that is not prevalent; see Sec. 4.1). We test this by means of another diagnostic classifier: given a scene representation, the task is to predict the position of the odd-one-out image (hence, OOO), namely image 0, 1, or 2.
We initially experiment with the same type of diagnostic classifier used in the previous analysis: an SVM with a linear kernel. However, this linear classifier was only able to accurately classify the position of odd-one-out images associated with image-scene instances labelled True, suggesting that the prediction of the position of the odd-one-out cannot be solved by a linear classifier. Therefore, we use a non-linear MLP and also report the results of a control task, where the labels are randomly assigned to the instances (Hewitt and Liang, 2019). The MLP is a two-layer neural network with 128 units in each layer followed by a ReLU activation function, and finally a learned projection into 3 output units, followed by a softmax normalisation. We train the MLP with a cross-entropy objective function for four epochs using the Adam optimiser with the default learning rate. Table 4 reports the results of the non-linear diagnostic classifier in both the OOO and control settings. As can be seen, while the MLP does not exceed chance level in the control setting, in the OOO setting it achieves a striking 87.67% accuracy, a performance similar to the one reported in Sec. 7.1. On the one hand, this indicates that the probe cannot fit the data when the assigned labels are not related to the actual OOO image positions. On the other hand, these results show that the representations learned by LXMERT do encode information regarding the odd-one-out object in the scene.
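The control setting can be illustrated with a short sketch of how random labels might be assigned and scored. The function names, the fixed seed, and the uniform sampling over the three positions are assumptions made here; the text only states that labels are randomly assigned to the instances.

```python
import random


def control_labels(n_instances, n_positions=3, seed=0):
    """Assign each instance a random odd-one-out position, ignoring its content."""
    rng = random.Random(seed)
    return [rng.randrange(n_positions) for _ in range(n_instances)]


def accuracy(predictions, gold):
    """Share of positions predicted correctly."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# The probe is trained twice on the same representations: once with the true
# OOO positions and once with control labels, and the accuracies are compared.
true_positions = [0, 2, 1, 1, 0, 2]
random_positions = control_labels(len(true_positions))
print(len(random_positions), set(random_positions) <= {0, 1, 2})  # 6 True
```

A probe that scores well on the true positions but stays at chance on the control labels is reading information from the representations rather than memorising the training instances.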
Taken together, these analyses demonstrate that LXMERT reasons over the multi-image scene to perform the sentence-verification task. In particular, it is able to compute the contextually-defined size of the objects in the scene and perform higher-level reasoning over these representations.

Conclusion
We performed an in-depth analysis of the representations learned by the pretrained multimodal transformer LXMERT when performing relational reasoning. We proposed a multimodal reasoning task that requires multi-step relational reasoning and showed that LXMERT can perform the task with high accuracy. Our analysis reveals that the majority of the errors arise from target objects with contextually-defined sizes close to the threshold, and that LXMERT solves the task by (i) encoding information regarding the size of objects and by (ii) reasoning over that size. Most of its errors concern borderline cases for which the first, image-level reasoning step was shown to be challenging. Overall, our results show that transformer-based architectures pretrained on natural images can generalise to synthetic datasets. We leave to future work an extensive exploration of the extent to which our findings apply to similar tasks and models, for example other vision and language transformers (Bugliarello et al., 2021), as well as to natural multimodal data.

C Hyperparameters and fine-tuning for LXMERT
For the fine-tuning of LXMERT, the pre-trained model with standard hyperparameters was used,¹⁷ with only the learning rate changed from 1e-5 to 5e-5, since even with these out-of-the-box parameters it was able to achieve high performance on the given task. We fine-tuned this model with the POS1 training split using early stopping after 12 epochs, with the number-of-epochs parameter of the BertAdam optimizer set to 150, learning rate 1e-5, and batch size 32 (the only difference in the hyperparameters used during fine-tuning with 3POS1 was the batch size, 64). We validated the model after each epoch; the model with the highest validation accuracy during the 12 epochs was then selected and further evaluated on the test split.

¹⁵ https://github.com/ronghanghu/n2nmn
¹⁶ https://github.com/sandropezzelle/malevic
¹⁷ https://github.com/airsplay/lxmert.git
The running time of each fine-tuning epoch for the POS1 dataset was 3 minutes, while each epoch of fine-tuning with 3POS1 took around 6 minutes.