Visually Grounded Concept Composition

We investigate ways to compose complex concepts in texts from primitive ones while grounding them in images. We propose Concept and Relation Graph (CRG), which builds on top of constituency analysis and consists of recursively combined concepts with predicate functions. Meanwhile, we propose a concept composition neural network called Composer to leverage the CRG for visually grounded concept learning. Specifically, we learn the grounding of both primitive and all composed concepts by aligning them to images and show that learning to compose leads to more robust grounding results, measured in text-to-image matching accuracy. Notably, our model learns grounded concepts at both the finer-grained sentence level and the coarser-grained intermediate (phrase and word) levels. Composer leads to pronounced improvements in matching accuracy when the evaluation data has significant compound divergence from the training data.


Introduction
Visually grounded text expressions denote the images they describe. These expressions of visual concepts are naturally organized hierarchically into sub-expressions. The organization reveals structural relations that do not manifest when the sub-expressions are studied in isolation. For example, the phrase "a soccer ball in a gift-box" is a compound of two shorter phrases, i.e., "a soccer ball" and "a gift-box", but carries the meaning of the spatial relationship "something in something" that goes beyond the two shorter phrases taken separately. The compositional structure of the grounded expression requires a concept learner to understand which primitive concepts appear visually and
how the compound relating multiple primitives modifies their appearance. Existing approaches (Kiros et al., 2014; Faghri et al., 2017; Lu et al., 2019; Chen et al., 2020, 2021) tackle visual grounding via end-to-end learning, which typically learns to align image and text information using neural networks without explicitly modeling their compositional structures. While neural networks have shown strong generalization on test examples that are i.i.d. with respect to the training distribution (Devlin et al., 2019), they often struggle with out-of-domain examples that exhibit novel compositional structures, in tasks such as Visual Reasoning (Johnson et al., 2017; Bahdanau et al., 2019; Pezzelle and Fernández, 2019), Semantic Parsing (Finegan-Dollak et al., 2018; Keysers et al., 2020), and (Grounded) Command Following (Lake and Baroni, 2018; Chaplot et al., 2018; Hermann et al., 2017; Ruis et al., 2020).
In this work, we investigate how complex concepts, composed of simpler ones, are grounded in images at the sentence, phrase, and token levels. In particular, we investigate whether the structures by which these concepts are composed can be exploited as a modeling prior to improve visual grounding. To this end, we design the Concept & Relation Graph (CRG), which is derived from constituency parse trees. The resulting CRG is a graph-structured database where concept nodes encode language expressions of concepts and their visual denotations (e.g., a set of images corresponding to the concept), and predicate nodes define how a concept is semantically composed from its child concepts. Our graph is related to the denotation graph (Young et al., 2014; Zhang et al., 2020) but differs in two key aspects. First, our graph extracts the concepts without specially crafted heuristic rules. Second, CRG's predicates can encode richer information explicitly than the subsumption relationships implicitly expressed in denotation graphs. An illustrative example of the graph is shown in Figure 1.
In addition to CRG, we propose the Concept cOMPOSition transformER (COMPOSER), which leverages the structure of text expressions to recursively encode grounded concept embeddings, from coarse-grained ones such as the noun words that refer to objects, to finer-grained ones built through multiple levels of composition. The Transformer (Vaswani et al., 2017) is used as a building block in our model to encode the predicates and to perform grounded concept composition. We learn COMPOSER using the task of visual-semantic alignment. Unlike traditional approaches, we perform hierarchical learning of visual-semantic alignment, which aligns the image to words, phrases, and sentences, and preserves the order of matching confidences.
We conduct experiments on multi-modal matching and show that COMPOSER achieves strong grounding capability in both sentence-to-image and phrase-to-image retrieval on popular benchmarks. We validate the generalization capability of COMPOSER by designing an evaluation procedure for a more challenging compositional generalization task that uses test examples with maximum compound divergence (MCD) from the training data (Shaw et al., 2020; Keysers et al., 2020). Experiments show that COMPOSER is more robust under compositional generalization than other approaches.
Our contributions are summarized as follows: • We study the compositional structure of visually grounded concepts and design the Concept & Relation Graph (CRG) that reflects such structures.
• We propose the Concept cOMPOSition transformER (COMPOSER) that recursively composes concepts from child concepts using semantically meaningful rules, which leads to strong compositional generalization performance.
• We propose a new evaluation task to assess a model's compositional generalization on text-to-image matching and conduct comprehensive experiments to evaluate both baseline models and COMPOSER.
Note that our approach has been developed and evaluated on English-language corpora; its multilingual utility depends on the parsing techniques available for languages other than English.

Concept & Relation Graph
We introduce the multi-modal Concept and Relation Graph (CRG), a graph of concept and predicate nodes that together compose visually grounded descriptive phrases and sentences. Figure 1 provides an illustrative example. The concepts include sentences and intermediate phrases, shown as blue nodes. The primitives are the leaf nodes (typically noun words) that refer to visual objects, shown as green nodes. The predicates (red nodes) are n-ary functions that define the meaning of concept composition. Their "signatures" consist of lexicalized templates, the number of arguments, and the syntactic types of the arguments. They combine primitives or simpler concepts into more complex ones.
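To make the graph structure concrete, below is a minimal sketch of how CRG nodes could be represented in memory. It is an illustration under our own naming assumptions; ConceptNode, PredicateNode, and their fields are not the paper's actual schema.

```python
# A minimal sketch of CRG node structures; field names are illustrative assumptions,
# not the paper's actual schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ConceptNode:
    text: str                                            # e.g. "a soccer ball in a gift-box"
    images: List[str] = field(default_factory=list)      # visual denotation: ids of images
    is_primitive: bool = False                            # True for leaf nouns such as "ball"

@dataclass
class PredicateNode:
    template: str                                         # lexicalized template, e.g. "[NP] in [NP]"
    arg_tags: List[str] = field(default_factory=list)     # syntactic type of each argument slot
    args: List[ConceptNode] = field(default_factory=list)  # child concepts (arity = len(args))
    result: Optional[ConceptNode] = None                   # the composed, more complex concept
```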
Identifying concepts and relations. Given pairs of aligned images and sentences, we first parse a sentence into a constituency tree using a state-of-the-art syntactic parser (Kitaev and Klein, 2018). We use the sentence's constituent tags to identify concepts and their relations. The relations are regarded as n-ary functions with placeholders denoted by constituency tags. We refer to such functions as predicates. Simpler concepts are arguments to the predicates, and the return values of the functions are complex concepts. The edges of the graph represent the relationship between predicates and their arguments. We restrict the types of constituents that can be concepts and how the predicates can be formed. A concrete example is as follows: given an input concept "two dogs running on the grass", the algorithm extracts the predicate "[NP] running on [NP]" and the child concepts "two dogs" and "the grass". Here we use syntactic placeholders to replace the concept phrases. Details are in the Appendix. This idea is closely related to semantically augmented parse trees (Ge and Mooney, 2009), though we focus on visually grounded concepts.
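The following is a simplified sketch of one such extraction step, assuming nltk-style constituency trees and treating noun phrases (NP) as the only concept constituents; the full criteria in the Appendix are richer.

```python
# A simplified sketch of one extraction step over a constituency tree.
# Assumption: only NP constituents are treated as (sub-)concepts.
from nltk import Tree

CONCEPT_TAGS = {"NP"}

def extract(tree):
    """Return (predicate_template, child_concept_texts) for one composition step."""
    template, children = [], []

    def walk(node):
        for child in node:
            if isinstance(child, Tree) and child.label() in CONCEPT_TAGS:
                template.append("[NP]")                    # syntactic placeholder
                children.append(" ".join(child.leaves()))  # child concept text
            elif isinstance(child, Tree):
                walk(child)                                 # descend into non-concept spans
            else:
                template.append(child)                      # keep non-argument words

    walk(tree)
    return " ".join(template), children

sent = Tree.fromstring(
    "(S (NP (CD two) (NNS dogs)) (VP (VBP are) (VBG running) "
    "(PP (IN on) (NP (DT the) (NN grass)))))")
print(extract(sent))   # ('[NP] are running on [NP]', ['two dogs', 'the grass'])
```

In the actual pipeline, this step is applied recursively to each extracted child concept until no sub-concepts remain (see the Appendix).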
Finding visually grounded concepts. We take paired images and texts and convert the texts into derived trees of predicates and primitives. With the generated text graph, we then group all images that refer to the same concept to form the image denotation, similar to Young et al. (2014) and Zhang et al. (2020). The image denotation is the set of images that contain the referred concept. For example, the image denotation of the concept "ball" is all the images that have the visual object category "ball". As a result, we associate an image denotation with each concept, in the form of words, phrases, and sentences, which creates a multi-modal graph database as shown in Figure 1.
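A minimal sketch of this denotation-grouping step is shown below; extract_concepts is a hypothetical helper standing in for the parsing procedure described above.

```python
# A minimal sketch of building image denotations from (image, caption) pairs.
# `extract_concepts` is a hypothetical helper that yields every concept string
# (words, phrases, and the full sentence) derived from a caption's parse.
from collections import defaultdict

def build_denotations(pairs, extract_concepts):
    denotation = defaultdict(set)            # concept text -> set of image ids
    for image_id, caption in pairs:
        for concept in extract_concepts(caption):
            denotation[concept].add(image_id)
    return denotation

# e.g. denotation["ball"] would hold every image whose captions mention a "ball" concept.
```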

COMPOSER: Recursive Modeling of the Compositional Structure
The main idea of COMPOSER is to recursively compose primitive concepts into sentences with complex structure, using composition rules defined by the predicates. Figure 2 presents a conceptual diagram of the high-level idea. Concretely, COMPOSER first takes the primitive word embeddings as input and performs cross-modal attention to obtain their visually grounded word embeddings. Next, it calls the composition procedure to modify or combine primitive or intermediate concepts, according to the description of the corresponding predicates. At the end of this recursive procedure, we obtain the desired sentence concept embedding. In the rest of this section, we first discuss notation and background, then introduce how primitives and predicates are encoded (§ 3.1), and present the recursive composition procedure in detail (§ 3.2). Finally, we discuss the learning objectives (§ 3.3).
Notation. We denote a paired image and sentence as (x, y) and the corresponding concepts and predicates as a tree (x, U, E), where U and E correspond to the set of primitives and the set of predicates, respectively. We also denote the set of all concepts derived from a sentence y as C, where U ⊂ C and y ∈ C.
Multi-head attention mechanism. Multi-Head Attention (MHA) (Vaswani et al., 2017) is the building block of our model. It takes three sets of input elements, i.e., the key set K, the query set Q, and the value set V, and performs scaled dot-product attention as:

MHA(Q, K, V) = FFN(softmax(QK^T / √d) V).

Here, d is the dimension of elements in K and Q, and FFN is a feed-forward neural network. With different choices of K and V, MHA can be categorized as self-attention (SelfAtt) or cross-attention (CrossAtt), corresponding to the variants with K and V containing only single-modality or cross-modality features.
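As a concrete reference, a single-head version of this attention computation can be sketched as follows (the model itself uses the multi-head variant of Vaswani et al., 2017).

```python
# A single-head sketch of the scaled dot-product attention described above.
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q: (n_q, d), K: (n_k, d), V: (n_k, d)
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # scaled dot-product
    return F.softmax(scores, dim=-1) @ V          # weighted sum of values

# SelfAtt: K, Q, V all come from the same (single-modality) sequence.
# CrossAtt: Q comes from text tokens while K and V include the visual features.
```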

Encoding Primitives and Predicates
Given a paired image and sentence (x, y), we parse the sentence into the tree of primitives and predicates (x, U, E). Here, we represent the image as a set of visual feature vectors {φ}, which are the object-centric features from an object detector (Anderson et al., 2018). Note that we do not use structural information beyond object proposals/regions. COMPOSER takes the primitives and predicates as input and outputs the visually grounded concept embeddings, representing both primitives and predicates as continuous vectors with different levels of contextualization.
Representing primitives with visual context. The primitive concepts refer to tokens that can be visually grounded, and we represent them as word embeddings contextualized with visual features. To this end, we use a one-layer Transformer with the CrossAtt mechanism, where K, V, and Q are linear transformations of φ, φ, and u, respectively. This essentially uses the word embedding to query the visual features and outputs the grounded primitive embeddings Û = {û}. Note that the output is always a single vector for each primitive, as each primitive is a single word.
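A sketch of this primitive encoder is given below, assuming PyTorch's nn.MultiheadAttention and the 768-dimensional, 12-head configuration described in the Appendix. For brevity the keys and values here contain only the projected visual features, whereas the Appendix notes that they also include the sub-word token embeddings.

```python
# A sketch of primitive encoding with a one-layer CrossAtt Transformer.
# Simplification: K and V contain only the (projected) visual features.
import torch
import torch.nn as nn

class PrimitiveEncoder(nn.Module):
    def __init__(self, dim=768, heads=12, visual_dim=2048):
        super().__init__()
        self.proj = nn.Linear(visual_dim, dim)       # map region features to text dimension
        self.cross_att = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, word_emb, regions):
        # word_emb: (B, 1, dim)       embedding u of a primitive word (the query)
        # regions:  (B, R, visual_dim) object-centric features {phi} (keys/values)
        phi = self.proj(regions)
        grounded, _ = self.cross_att(word_emb, phi, phi)
        return self.ffn(grounded).squeeze(1)         # one grounded vector per primitive
```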

Representing predicates as neural templates.
A predicate e is a semantic n-place function that combines multiple concepts into one. We represent it as a template sentence with words and syntactic placeholders, such as "[NP]_1 running on [NP]_2", where the syntactic placeholders denote the positions and types of arguments. We encode such template sentences via the SelfAtt mechanism, using a multi-layer Predicate Transformer (PT). The output of this model is a contextualized sequence of the words and syntactic placeholders, denoted as ê.

Recursive Concept Composition
With the encoded primitives Û and predicates Ê, COMPOSER performs multiple recursive composition steps to obtain the grounded concept embedding v(x, y), representing the visual-linguistic embedding of the sentence and the image, as shown in Figure 2. To further illustrate this process, we detail the composition function below, as shown in Figure 3.
Input concept modulation. We use a modulator to bind the arguments in the predicate to the input child concepts. Given an encoded predicate ê = {[NP]_1, running, on, [NP]_2, with, [NP]_3} and an input concept c_1 = "a man", the modulator is a neural network that takes the concept embedding c_1 and its corresponding syntactic placeholder [NP]_1 as input and outputs a modulated embedding. This embedding is then reassembled with the embeddings of the non-argument tokens in the predicate and used in the later stage.

Contextualization with visual context. After concept modulation, we obtain a sequence of embeddings for the non-argument words of the predicate and the bound child concepts, which is then fed as input to a Composition Transformer (CT) model. This Transformer has multiple layers, with both CrossAtt layers that attend to the object-centric visual features and SelfAtt layers that contextualize between tokens. Please refer to the Appendix for the detailed network architecture. Given that our model is recursive by nature, the computation complexity of CT is proportional to the depth of the tree. We provide a study in § 5.3 on the correlation between parameter count/complexity and the model's performance.
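The composition step can be summarized with the following sketch. Here, modulator and composition_transformer are placeholders for the FiLM modulator and Composition Transformer described in the Appendix, and the choice of the pooled output token is an assumption for illustration.

```python
# A high-level sketch of one composition step: modulate arguments, reassemble the
# predicate sequence, then contextualize against the image regions.
import torch

def compose(predicate_tokens, placeholder_positions, child_embs,
            modulator, composition_transformer, visual_feats):
    # predicate_tokens:      (L, d) encoded predicate sequence from the Predicate Transformer
    # placeholder_positions: indices of the [NP]_k placeholder tokens in that sequence
    # child_embs:            list of child concept embeddings, one per placeholder
    seq = predicate_tokens.clone()
    for pos, child in zip(placeholder_positions, child_embs):
        # bind the k-th argument: modulate the child concept with its placeholder slot
        seq[pos] = modulator(child, predicate_tokens[pos])
    # contextualize the reassembled sequence with the object-centric visual features
    out = composition_transformer(seq.unsqueeze(0), visual_feats.unsqueeze(0))
    return out[0, 0]   # pooled (first-token) embedding as the composed concept (assumption)
```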

Learning COMPOSER with Visual-Semantic Alignments
With the composed grounded concept embedding v(x, y), we use visual-semantic alignment as the primary objective to learn COMPOSER. To this end, we compute the alignment score by learning an additional linear regressor θ:

s(x, y) = θ^T v(x, y),   p(x, y) = exp(s(x, y)) / Σ_{y' ∈ {y} ∪ D_i^-} exp(s(x, y')),

where p(x, y) is the probability that the sentence and image are a good match. Then we learn the sentence-to-image alignment by minimizing the negative log-likelihood (NLL):

L_NLL = − log p(x, y).

To properly normalize the probability, it is necessary to sample a set of negative examples to contrast against. Thus, we generate D_i^- using the strategy of Lu et al. (2019).

Multi-level visual-semantic alignment (MVSA). Since COMPOSER composes grounded concepts recursively from the primitives, we obtain the embeddings of all the intermediate concepts automatically. Therefore, it is natural to extend the alignment learning objective to all those intermediate concepts. We optimize the triplet hinge loss (Kiros et al., 2014):

L_MVSA = Σ_{c ∈ C} [α + s(x, c^-) − s(x, c)]_+,

where [·]_+ denotes the hinge loss and α is the margin to be tuned. We derive the negative concepts c^- from the negative sentences in D_i^-. We observe that negative concepts at the word/phrase level are noisier than those at the sentence level, because many are common objects present in the positive image and lead to ambiguity in learning. Therefore, we choose the hinge loss over NLL because it is more robust to label noise (Biggio et al., 2011).
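A sketch of these two objectives is given below, assuming s(x, y) is the alignment score defined above; the margin value shown is illustrative rather than the tuned one.

```python
# A sketch of the alignment (NLL) and multi-level (hinge) objectives.
import torch
import torch.nn.functional as F

def nll_loss(pos_score, neg_scores):
    # contrast the positive pair against the sampled negative sentences D^-
    logits = torch.cat([pos_score.view(1), neg_scores])   # positive pair at index 0
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)

def mvsa_hinge_loss(pos_scores, neg_scores, alpha=0.2):
    # pos_scores / neg_scores: alignment scores for positive / negative concepts
    # alpha is the margin (illustrative value; it is tuned in the paper)
    return torch.clamp(alpha + neg_scores - pos_scores, min=0).mean()
```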
Learning to preserve orders in the tree. Finally, we use an order-preserving objective proposed by Zhang et al. (2020) to ensure that a fine-grained concept (closer to the sentence) produces a more confident alignment score than a coarse-grained concept (closer to the primitives):

L_Order = Σ_{e_jk} [β + s(x, c_k) − s(x, c_j)]_+.

Here, e_jk represents a predicate connecting c_j and c_k, with c_j the fine-grained parent concept, which is closer to the sentence, and c_k the coarse-grained child concept, which is closer to the primitives. β is the margin that sets how strictly the order of embeddings should be preserved.
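A corresponding sketch of the order-preserving term follows; the margin value is again illustrative.

```python
# A sketch of the order-preserving objective over predicate edges e_jk:
# the finer-grained parent c_j should score at least beta higher than its child c_k.
import torch

def order_loss(parent_scores, child_scores, beta=0.2):
    # parent_scores[i], child_scores[i]: alignment scores s(x, c_j) and s(x, c_k)
    return torch.clamp(beta + child_scores - parent_scores, min=0).mean()
```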
The complete learning objective is a weighted combination of the three individual losses defined above, with loss weights λ_1 = 1 and λ_2 = 1:

L = L_NLL + λ_1 L_MVSA + λ_2 L_Order.

The details of model optimization and hyperparameter settings are included in the Appendix.


Related Work

(2017) evaluate RL agents' capability to generalize to novel compositions of shape, size, and color in 3D simulators, showing that RL agents generalize poorly. gSCAN (Ruis et al., 2020) provides a systematic benchmark to assess command following in a grounded environment. In this work, we focus on assessing compositional generalization under visual context.
Compositional networks. State-of-the-art visually grounded language learning models typically use deep Transformers (Vaswani et al., 2017) such as ViLBERT (Lu et al., 2019), LXMERT (Tan and Bansal, 2019), and UNITER (Chen et al., 2020). Though effective on i.i.d. data, these models do not explicitly exploit the structure of language and are thus prone to failure on compositional generalization. In contrast, another line of work (Andreas et al., 2016; Yi et al., 2018; Mao et al., 2019; Shi et al., 2019; Wang et al., 2018) parses the language into an executable program composed as a graph of atomic neural modules, where each module is designed to perform an atomic task and is learned end-to-end. Such models show almost perfect performance on synthetic benchmarks (Johnson et al., 2017) but underperform on real-world data (Young et al., 2014; Chen et al., 2015) that is noisy and highly variable. Unlike them, we propose a compositional neural network based on the Transformer architecture, which extends state-of-the-art neural networks to explicitly exploit language structure.

Experiment
In this section, we perform experiments to validate the proposed COMPOSER model on the tasks of sentence-to-image retrieval and phrase-to-image retrieval. We begin by introducing the setup in § 5.1 and then present the main results in § 5.2, comparing models for their in-domain, cross-dataset, and compositional generalization performance. Finally, we perform an analysis and ablation study of our model design in § 5.3.

Compositional generalization evaluation. To generate evaluations of compositional generalization, we use a method similar to that of Shaw et al. (2020) and Keysers et al. (2020), which maximizes the compound divergence between the distribution of compounds in the evaluation set and that in the training set. Here, compounds are defined based on the predicates occurring in captions. Following this method, we first calculate the overall divergence of compounds from the evaluation data to the training data using predicates from all the sentences. Then, for each sentence in the evaluation data, we calculate a compound divergence with this specific example removed. We rank the sentences based on the difference in compound divergence. Finally, we choose the top-K sentences with the largest compound divergence differences and their corresponding images to form the evaluation splits.

Experiment Setup
Using this method, we generate evaluation splits with 1,000 images and 5,000 text queries, COCO-MCD and F30K-MCD, to assess models trained on F30K and COCO, respectively. Therefore, these splits assess both compositional generalization and cross-dataset transfer. Defining such splits across datasets also helps achieve greater compound divergence than is otherwise possible, given the small amount of available in-domain test data. More details are included in the Appendix.
CRG construction. We constructed two CRGs, on the F30K and C30K datasets, using the procedure described in § 2. The key statistics of the generated graphs are shown in Table 1.
Baselines and our approach. We compare COMPOSER to two strong baseline methods, i.e., ViLBERT (Lu et al., 2019) and VSE (Kiros et al., 2014). We make sure all models use the same object-centric visual features extracted from the Up-Down object detector (Anderson et al., 2018) for a fair comparison. For the texts, both ViLBERT and the re-implemented VSE use the pre-trained BERT model as initialization. For COMPOSER, we only initialize the Predicate Transformer with the pre-trained BERT, using its first six layers. Note that the ViLBERT results are reproduced using the codebase from its authors. ViLBERT is not pre-trained on any additional image-text pairs, to prevent information leakage in both cross-dataset evaluation and compositional generalization. Therefore, we used the pre-trained BERT models provided by HuggingFace to initialize the text stream of ViLBERT and then followed the rest of the procedure in the original ViLBERT paper. Please refer to the Appendix for complete details.

Main Results
We compare COMPOSER with ViLBERT (Lu et al., 2019) and VSE (Kiros et al., 2014) on F30k and COCO for in-domain, zero-shot cross-dataset transfer, and compositional generalization evaluations (e.g., F30K→COCO-MCD). The notation A→B means that the model is trained on A and evaluated on B. We report the results of sentence-to-image retrieval in the main paper and defer additional ablation study results to the Appendix.
In-domain performance. Table 2 presents the in-domain performance on both the F30k and COCO datasets. First, we observe that both COMPOSER and ViLBERT consistently outperform VSE, which is expected as ViLBERT contains a cross-modal Transformer with stronger modeling capacity. Compared to ViLBERT, COMPOSER performs on par.
Zero-shot cross-dataset transfer. We also consider zero-shot cross-dataset transfer, where we evaluate models on a dataset different from the training dataset. In this setting, COMPOSER outperforms ViLBERT and VSE significantly. Concretely, on the F30k→COCO setting, COMPOSER improves R1 and R5 by 11.0% and 7.0%, respectively.

Compositional generalization. Results under varying compound divergence (CD) are shown in Figure 4. As CD increases, the performance of both COMPOSER and ViLBERT decreases. Compared to ViLBERT, COMPOSER is relatively more robust to this distribution shift, as its relative performance improvement grows as CD increases.

Analysis and Ablation Study
We perform several ablation studies to analyze COMPOSER, and provide qualitative results to demonstrate the model's interpretability.
Is CrossAtt in primitive encoding useful? Table 3 compares variants of COMPOSER with and without CrossAtt for primitive encoding, and shows that CrossAtt improves all metrics in in-domain and cross-dataset evaluation.
Which modulator works better? We consider three modulators for combining input concepts with the syntax token embeddings for later composition: Replace, MLP, and FiLM. Replace directly substitutes the syntax embedding with the input concept embedding; this is an inferior approach by design, as it ignores the relative position of each concept. The MLP modulator applies a multi-layer neural network to the concatenation of the syntax and input concept embeddings. The FiLM modulator uses the syntax embedding to infer the parameters of an affine transformation, which is then applied to the input concept. We show the results in Table 4.
Replace achieves the worst performance, indicating the importance of identifying the position of input concepts. COMPOSER chooses FiLM as the modulator given its strong performance over all metrics.
Is MVSA supervision useful? We evaluate the influence of multi-level visual-semantic alignment on sentence- and phrase-to-image retrieval. In the phrase-to-image experiments, we sample 5 non-sentence concepts from the CRG for each annotation in the corresponding test data and use them as queries, reporting results in R1.

Performance under different parsing qualities. CRG is generated based on a constituency parser. We investigate the performance of COMPOSER with CRGs built under different parsing qualities. Given a parse tree, we randomly remove its branches with a probability of 0.1, 0.3, or 0.5 to generate a tree with degraded parsing quality, and evaluate COMPOSER on the resulting CRGs. We summarize the results in Table 7. When parsing quality drops, both in-domain and cross-dataset transfer performance drop; performance degrades by 12% when half of the parse may be missing. We expect that with better parsing quality, COMPOSER can achieve stronger performance.
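The parse-degradation probe can be sketched as below, assuming nltk-style trees; the exact pruning rule used in the paper may differ.

```python
# A minimal sketch of degrading parse quality: randomly drop subtrees with
# probability p before building the CRG (the paper's exact pruning rule may differ).
import random
from nltk import Tree

def degrade(tree, p=0.3, rng=None):
    rng = rng or random.Random(0)
    if not isinstance(tree, Tree):
        return tree                                   # leaf token, keep as-is
    kept = [degrade(child, p, rng) for child in tree if rng.random() >= p]
    # if every branch was dropped, fall back to the node's leaves so the text survives
    return Tree(tree.label(), kept if kept else tree.leaves())
```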
Interpreting COMPOSER's decisions. Despite the solid performance, COMPOSER is also highly interpretable. Specifically, we visualize its alignment scores throughout the concept composition procedure in Figure 5. Empirically, we observe that most failures are caused by visual grounding mistakes at the primitive concept level. The error then propagates "upwards" through concept composition. For instance, the left example shows that COMPOSER confuses the ground-truth and the negative image when only the shared visual concept "a bald man" is presented. As more information is given, the confusion is resolved: the model notices that the target sentence is composed not only of the above subject but also of the prepositional phrase "by the beer pumps at the bar", which reflects the visual environment.
Scalability to the full COCO dataset. Finally, we trained our model (PT=6, CT=5) on the full COCO training split and evaluated it on both the in-domain and cross-dataset transfer tasks, using the same hyperparameters as for C30K. However, COMPOSER underperforms ViLBERT in this setting: it achieves 56.06% and 44.24% R1 on the in-domain task (COCO→COCO) and the cross-dataset evaluation task (COCO→F30k), while ViLBERT obtains 56.83% and 46.62%, respectively. We hypothesize that this negative result is largely due to the limited model capacity of the proposed COMPOSER, as it has about 33% fewer parameters than ViLBERT. Meanwhile, we also observe that COMPOSER fits the training data worse than ViLBERT. Doubling the number of training epochs increases both in-domain and out-of-domain performance by about 2% relative, and increasing the number of Composition Transformer (CT) layers to 7 also improves R1 by about 2.5% relative. Further scaling up COMPOSER may resolve this issue but requires more computational resources, and we leave this for future research.

Conclusion
In this paper, we propose the Concept and Relation Graph (CRG) to explore the compositional structure in visually grounded text data. We further develop a novel concept composition neural network (COMPOSER) on top of the CRG, which leverages the explicit structure to compose concepts from the word level to the sentence level. We conduct extensive experiments to validate our model on image-text matching benchmarks. Compared with prior methods, COMPOSER achieves significant improvements, particularly in zero-shot cross-dataset transfer and compositional generalization. Despite these highlights, there are also challenges that COMPOSER does not address within the scope of this paper. First, it requires high-quality parsing results to achieve strong performance, which may not be readily available in languages beyond English. Moreover, similar to other recursive neural networks, COMPOSER is computationally demanding, which limits its scalability to large-scale data.
A Details on CRG Construction

As mentioned in the main paper, we parse the sentence and convert it into a tree of concepts and primitives. In particular, we first perform constituency parsing using the self-attention parser (Kitaev and Klein, 2018). Table 8 provides a visualization of two examples of the syntax sub-trees. Next, we perform a tree search (i.e., breadth-first search) on the constituency tree of the current input concept to extract the sub-concepts and predicate functions. Note that this step is applied recursively until we can no longer decompose a concept into any sub-concepts. At each step of the extraction, we enumerate the nodes in the constituency tree of the current input text expression and examine whether a constituent satisfies the criterion that defines a visually grounded concept.
The concept criterion defined for the Flickr30K and COCO datasets contains several principles: (1) if the constituent is a word, it is a primitive concept if its Part-of-Speech (POS) tag belongs to a predefined set of noun tags. The extraction is illustrated with the examples in Table 8. For instance, in the first example, we search the text "two dogs are running on the grass" and extract the two noun constituents "two dogs" and "the grass" as concepts. We use the remaining text "[NP] is running on [NP]" as the predicate that indicates the semantic meaning of how these two sub-concepts compose into the original sentence.

B Details on Generation of Compositional Evaluation Splits
As mentioned in the main text, we generate compositional generalization (CG) splits with 1,000 images and 5,000 text queries, maximizing the compound divergence (MCD) as in Shaw et al. (2020), to assess models' capability of generalizing to data with a different predicate distribution. Concretely, we select Flickr30K training data to generate the F30K-MCD split. First, we remove all F30K data that contain primitive concepts unseen in the COCO training data. Next, we collect and count the predicates for each image among all the remaining data over the five associated captions. These predicates correspond to the "compounds" defined in Keysers et al. (2020) and Shaw et al. (2020), and the objective is to maximize the divergence between the compound distribution of the evaluation data and that of the training data. As a result of this step, we end up with a dataset formed of (image, predicate counts) pairs, which are then used to compute the overall compound divergence (CD_ALL) to the training dataset. Afterwards, we enumerate over each pair and again compute the compound divergence to the training dataset, but with this specific example removed. We denote the change of compound divergence as ∆_i = CD_i − CD_ALL and use it as an additional score associated with every example. Finally, we sort all the data with respect to ∆_i and use the top-ranking one thousand examples as the maximum compound divergence (MCD) split. The process for generating the COCO-MCD split is symmetrical, except that the data is collected from the COCO val+test splits (as they are sufficiently large). Similarly, to generate different CDs for Figure 4 of the main text, we make use of the same data sorted by ∆_i. Concretely, we slide a window of 1,000 examples over the sorted data to obtain many candidate subsets (a stride can be used to make this computation sparser). For each window of data, we measure the compound divergence and only keep the windows that satisfy our criteria. In Figure 4, we keep the windows whose CD values are closest to the desired x-axis values for plotting.

Table 8: Explanatory example of extracting predicates and sub-concepts from a concept (sub-concepts: NP1 = "two dogs", NP2 = "the grass"; NP1 = "a small pizza", NP2 = "a white plate").
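The split-selection procedure described above can be sketched as follows; compound_divergence is an assumed helper following the definition of Keysers et al. (2020), and the ranking follows the ∆_i ordering described in the text.

```python
# A sketch of the MCD split construction: rank examples by the change in compound
# divergence when they are removed, then keep the top-ranked 1,000.
# `compound_divergence` is an assumed helper following Keysers et al. (2020).
from collections import Counter

def mcd_split(eval_examples, train_counts, compound_divergence, k=1000):
    # eval_examples: list of (image_id, Counter of predicates over its 5 captions)
    total = sum((counts for _, counts in eval_examples), Counter())
    cd_all = compound_divergence(total, train_counts)
    scored = []
    for image_id, counts in eval_examples:
        cd_i = compound_divergence(total - counts, train_counts)  # leave-one-out CD
        scored.append((cd_i - cd_all, image_id))                  # delta_i
    scored.sort(reverse=True)                                     # rank by delta_i
    return [image_id for _, image_id in scored[:k]]
```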

C Implementation Details of COMPOSER and Baselines
Visual feature pre-processing. We follow ViLBERT (Lu et al., 2019) and extract patch-based ResNet features using the Bottom-Up Attention model. Each image patch feature has a dimension of 2048. A 5-dimensional position feature that describes the normalized top-left and bottom-right positions is extracted alongside the image patch feature, so each image region is described by both the patch feature and the position feature. We extract features from up to 100 patches per image.

• Primitive encoding (CrossAtt). We implement the CrossAtt model as a one-layer multi-head cross-modal Transformer with a hidden dimension of 768 and 12 attention heads. The query set Q is the sub-word token embeddings of the primitive word, and the key and value sets K and V are the union of the sub-word token embeddings and the object-centric visual features (which are linearly transformed to have the same dimensionality). We use the average of the contextualized sub-word token embeddings as the final primitive encoding.
• Predicate Transformer (PT). We use a 6-layer text Transformer with a hidden dimension of 768 and 12 attention heads to instantiate the Predicate Transformer. This network is initialized with the first 6 layers of a pre-trained BERT model.
• Modulator. We use FiLM (Perez et al., 2018) as the modulator. Specifically, it contains two MLPs with a hidden dimension of 768 that generate the scale vector a and bias vector b, using the syntactic placeholder as input. The scale a and bias b are then used to transform the input concept embedding c as a ⊙ c + b, where ⊙ represents element-wise multiplication. This modulated concept embedding is then projected by another MLP with a hidden dimension of 768 and used for reassembling with the predicate sequence (see the sketch after this list).
• Composition Transformer (CT). We follow the architecture of ViLBERT (Lu et al., 2019) to design the Composition Transformer (shown in Figure 6). Specifically, it interleaves SelfAtt Transformer and CrossAtt Transformer layers. For example, a three-layer Composition Transformer has a SelfAtt Transformer at the beginning for each modality, followed by a CrossAtt Transformer that interchanges information between the modalities, and then another SelfAtt Transformer that operates only on the text modality. The output embedding of this last text SelfAtt Transformer is then used to compute the visual-semantic alignment scores with the linear regressor θ. Thus, when we consider a shallower or deeper network, we remove or add two layers of interleaved SelfAtt and CrossAtt Transformers. The hidden dimension of the SelfAtt Transformer is 768, with 12 attention heads.
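The FiLM modulator and the interleaved Composition Transformer can be sketched as follows, assuming PyTorch; layer counts, residual connections, and pooling are illustrative simplifications of the description above and of Figure 6 (e.g., self-attention is applied to the text side only).

```python
# A sketch of the FiLM modulator and an interleaved Composition Transformer.
# This simplifies the description above (e.g., text-side-only self-attention).
import torch
import torch.nn as nn

class FiLMModulator(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.scale = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.bias = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, concept, placeholder):
        a, b = self.scale(placeholder), self.bias(placeholder)
        return self.proj(a * concept + b)          # a ⊙ c + b, then projection

class CompositionTransformer(nn.Module):
    """Interleaved SelfAtt (text) and CrossAtt (text -> vision) blocks."""
    def __init__(self, dim=768, heads=12, num_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList()
        for _ in range(num_blocks):
            self.blocks.append(nn.TransformerEncoderLayer(dim, heads, batch_first=True))
            self.blocks.append(nn.MultiheadAttention(dim, heads, batch_first=True))
        self.final = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, text, visual):
        # text:   (B, L, dim) reassembled predicate/concept sequence
        # visual: (B, R, dim) projected object-centric region features
        for layer in self.blocks:
            if isinstance(layer, nn.TransformerEncoderLayer):
                text = layer(text)                           # SelfAtt on the text sequence
            else:
                attended, _ = layer(text, visual, visual)    # CrossAtt to visual regions
                text = text + attended                       # residual connection
        return self.final(text)
```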

D Additional Experiments on COMPOSER
We report additional ablation studies that were omitted from the main paper due to space limitations. We find that the hinge loss outperforms the NLL objective across both in-domain and cross-dataset generalization settings; therefore, for all the experiments training with MVSA, we use the hinge loss instead.
Ablation study on α and β. We study COMPOSER's performance with different margins for the MVSA and order objectives. First, we fix the margin β of the order objective and tune the margin α for MVSA; COMPOSER with a larger MVSA margin achieves better in-domain R1. Alternatively, fixing α and tuning β, COMPOSER achieves the best in-domain R1 and the best cross-dataset R5 with β = 0.2.