The Role of Syntactic Planning in Compositional Image Captioning

Image captioning has focused on generalizing to images drawn from the same distribution as the training set, and not to the more challenging problem of generalizing to different distributions of images. Recently, Nikolaus et al. (2019) introduced a dataset to assess compositional generalization in image captioning, where models are evaluated on their ability to describe images with unseen adjective–noun and noun–verb compositions. In this work, we investigate different methods to improve compositional generalization by planning the syntactic structure of a caption. Our experiments show that jointly modeling tokens and syntactic tags enhances generalization in both RNN- and Transformer-based models, while also improving performance on standard metrics.


Introduction
Image captioning is a core task in multimodal NLP, where the aim is to automatically describe the content of an image in natural language. To succeed in this task, a model first needs to recognize and understand the properties of the image. Then, it needs to generate well-formed sentences, requiring both syntactic and semantic knowledge of the language (Hossain et al., 2019). Deep learning techniques are the standard approach to this problem: images are represented by visual features extracted from Convolutional Neural Networks (e.g. He et al. 2016), and sentences are generated by conditioning Recurrent Neural Networks (e.g. Hochreiter and Schmidhuber 1997) or Transformers (Vaswani et al., 2017) on the extracted visual features.
While deep neural networks achieve impressive performance in a variety of applications, including image captioning, their ability to demonstrate compositionality, defined as the algebraic potential to understand and produce novel combinations from known components (Loula et al., 2018), has been questioned. Semantic compositionality of language in neural networks has attracted interest in the community (Irsoy and Cardie, 2014; Lake and Baroni, 2019) as compositionality is conjectured to be a core feature not only of language but also of human thought (Fodor and Lepore, 2002).
In image captioning, improving compositional generalization is a fundamental step towards generalizable systems that can be employed in daily life. To this end, Nikolaus et al. (2019) recently introduced a compositional generalization dataset where models need to describe images that depict unseen compositions of primitive concepts. For example, models are trained to describe images with "white" entities and all types of "dog" concepts but never the adjective-noun composition of "white dog." In their dataset, models are evaluated on their ability to caption images depicting the unseen composition of held out concepts. Their study suggests that RNN-based captioning models do not compositionally generalize, and that this is primarily attributable to the language generation component.
In this paper, we study the potential for syntax to improve compositional generalization in image captioning by combining syntactic planning and language generation in a single model. Our study is inspired by the traditional Natural Language Generation (NLG) framework (Reiter and Dale, 1997), where NLG is split into three distinct steps: text planning, sentence planning, and linguistic realization. While state-of-the-art captioning models typically proceed directly from visual features to sentence generation, we hypothesize that a model that plans the structure of a sentence as an intermediate step will improve compositional generalization. A model with a planning step can learn the high-level structure of sentences, making it less prone to overfitting the training data.
Specifically, we explore three methods for integrating syntactic planning into captioning in our experiments: (a) pre-generation of syntactic tags from the image, (b) interleaved generation of syntactic tags and words (Nădejde et al., 2017), and (c) multi-task learning with a shared decoder that predicts syntactic tags or words (Currey and Heafield, 2019). We do so while also empirically investigating four different levels of syntactic granularity.
The main findings of our experiments are that:
• jointly modeling syntactic tags and tokens leads to improvements in Transformer-based (Cornia et al., 2020) and RNN-based (Anderson et al., 2018) image captioning models;
• although the effectiveness of each syntactic tag set varies across the explored approaches, the widely-used chunking tag set never outperforms syntactic tags of finer granularity;
• compositional generalization suffers from directly mapping image representations to tokens, since performance can be improved even by interleaving a dummy tag with no meaning;
• interleaving syntactic tags with tokens leads to a loss in performance for retrieval systems.
Finally, we also propose an attention-driven image-sentence ranking model, which makes it possible to adaptively combine syntax with the re-scoring approach of Nikolaus et al. (2019) to further improve compositional generalization in image captioning.

Planning Image Captions
Natural language generation has traditionally been framed in terms of six basic sub-tasks: content determination, discourse planning, sentence aggregation, lexicalization, referring expression generation and linguistic realization (Reiter and Dale, 1997). Within this framework, a three-stage pipeline has emerged (Reiter, 1994):
• Text Planning: combining content determination and discourse planning.
• Sentence Planning: combining sentence aggregation, lexicalization and referring expression generation to determine the structure of the selected input to be included in the output.
• Linguistic Realization: syntactic, morphological and orthographic processing to produce the final sentence.
Early methods for image captioning drew inspiration from this framework; for example, the MIDGE system (Mitchell et al., 2012) features explicit steps for content determination, given detected objects, and sentence aggregation based on local and full phrase-structure tree construction, while TREETALK composes tree fragments using integer linear programming (Kuznetsova et al., 2014). More recently, Wang et al. (2017) propose a two-stage algorithm where the skeleton sentence of the caption (main objects and their relationships) is generated first, and the attributes of each object are then generated if they are worth mentioning. In contrast, the majority of neural network models are based on the encoder-decoder framework (Sutskever et al., 2014), learning a direct mapping from different granularities of visual representations (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018) to language model decoders based on RNNs (Vinyals et al., 2015) or Transformers (Guo et al., 2020; Cornia et al., 2020).

Motivation
In this paper, we explore whether image captioning models can be improved by explicitly modeling sentence planning as an intermediate step between content determination and linguistic realization. In particular, we study the use of syntactic tags in enriching the sentence planning step to improve compositional generalization. In the compositional image captioning task, models are tasked with describing images that depict unseen combinations of adjective-noun and noun-verb constructions (see Nikolaus et al. 2019 for a more detailed description of this task). Nikolaus et al. (2019) presented a model that improves generalization with a jointly trained discriminative re-ranker, whereas here, we investigate the role of sentence planning via syntax.
From a psycholinguistic perspective (Griffin and Bock, 2000;Coco and Keller, 2012), there is evidence that humans make plans about how to describe the visual world: they first decide what to talk about (analogous to content determination), then they decide what they will say (a sentence planning phase), and finally, they produce an utterance (linguistic realization). We hypothesize that, analogously to humans, neural network decoders will also find it useful to make such sentence plans.
From a machine learning perspective, the use of syntactic structure can mitigate the bias introduced by the maximum likelihood training of neural network image captioning models. Recall that in the context of image captioning, the optimization objective consists of maximizing the likelihood of a caption w_1, ..., w_T given the image:

L(θ) = Σ_{t=1}^{T} log p(w_t | w_1, ..., w_{t−1}, v; θ)

Figure 1: Approaches to syntactically plan image captioning (with POS tags). STANDARD captioning systems directly generate a sequence of surface forms (i.e. words). SEQUENTIAL generates a sequence of syntactic tags, followed by a sequence of surface forms. INTERLEAVE alternates syntactic tags and surface forms. MULTI-TASK generates either a sequence of syntactic tags or a sequence of surface forms from a shared decoder.
where v denotes the visual features (either a single vector or a set of vectors extracted from an image). In a standard generation task, a model learns to predict the next token based only on what it has observed so far. This is especially limiting when the model is evaluated on unseen combinations of adjective-noun and noun-verb constructions in the compositional generalization task (i.e. data points that fall outside the training distribution): models are never explicitly asked to learn word classes, nor how to connect them to form novel combinations. By contrast, if a system also models syntax, it can assign higher probability to "white dog" when it expects to generate a sequence with an adjective followed by a noun.
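The intuition above can be made concrete with a toy sketch: conditioning the next-word distribution on a syntactic tag restricts its support to the words of that class. The vocabulary, tags, and probabilities below are invented for illustration only.

```python
def condition_on_tag(p_word, word_tags, tag):
    """Renormalize a next-word distribution over the words carrying `tag`."""
    masked = {w: p for w, p in p_word.items() if word_tags[w] == tag}
    total = sum(masked.values())
    return {w: p / total for w, p in masked.items()}

# Toy unconditional next-word distribution and word classes (illustrative).
p_word = {"white": 0.10, "dog": 0.25, "cat": 0.25, "small": 0.15, "runs": 0.25}
word_tags = {"white": "ADJ", "dog": "NOUN", "cat": "NOUN",
             "small": "ADJ", "runs": "VERB"}

# After observing an ADJ tag, only adjectives remain possible:
p_adj = condition_on_tag(p_word, word_tags, "ADJ")
# p_adj == {"white": 0.4, "small": 0.6}
```

The model that expects an ADJ followed by a NOUN can thus place mass on "white dog" even if the bi-gram never occurred in training.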

Planning Approaches
We investigate three approaches to jointly modeling tokens and syntactic tags: syntax-driven sequential caption planning (SEQUENTIAL), syntax-interleaved caption generation (INTERLEAVE), and syntax and caption generation via multi-task learning (MULTI-TASK). See Figure 1 for an overview.
SEQUENTIAL: Our first approach closely follows the traditional NLG pipeline and it is related to the text planning stage defined above, although limited to sentence-level rather than to a full discourse. Here, a model plans, through syntactic tags, the order of the information to be presented. Specifically, the model is required to generate a sequence whose first T outputs represent the underlying syntactic structure of the sentence before subsequently generating the corresponding T surface forms.
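A minimal sketch of how a SEQUENTIAL training target can be built from a tokenized caption and its tag sequence (the helper name is ours, not from the paper):

```python
def sequential_target(tokens, tags):
    """SEQUENTIAL: emit the T syntactic tags first, then the T surface forms."""
    assert len(tokens) == len(tags)
    return tags + tokens

# sequential_target(["a", "white", "dog"], ["DET", "ADJ", "NOUN"])
# -> ["DET", "ADJ", "NOUN", "a", "white", "dog"]
```

Note that the target sequence is twice as long as the caption, so the model must track long-range dependencies between each tag and its realization.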
INTERLEAVE: Our second approach consists of interleaving syntactic tags and tokens during generation, which means a syntactic tag and its realization are next to each other, removing the pressure for a model to successfully track long-range dependencies between tags and tokens. Moreover, this allows for a more flexible planning, where the model can adapt the sentence structure based on the previously generated tags and tokens. In particular, a model can break bi-gram dependencies and learn narrower distributions over the next word based on the current syntactic tag. For instance, if we consider part-of-speech tags, the model learns that only a subset of the vocabulary corresponds to nouns, and another subset to adjectives, and so on.
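The corresponding INTERLEAVE target places each tag immediately before its realization, which can be sketched as (helper name ours):

```python
def interleave_target(tokens, tags):
    """INTERLEAVE: each syntactic tag immediately precedes its surface form."""
    assert len(tokens) == len(tags)
    out = []
    for tag, tok in zip(tags, tokens):
        out.extend([tag, tok])
    return out

# interleave_target(["a", "white", "dog"], ["DET", "ADJ", "NOUN"])
# -> ["DET", "a", "ADJ", "white", "NOUN", "dog"]
```

Here the tag-token dependency is strictly local, and a dummy <IDLE> tag can be substituted for every tag to test the effect of breaking bi-grams alone.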

MULTI-TASK: Our last approach is based on multi-task learning, where a model produces either a sequence of tokens (main task) or syntactic tags (secondary task). We draw on the simple and effective approach of Currey and Heafield (2019), proposed for neural machine translation (NMT). In their NMT setup, a task-specific tag is prepended to the source sentence, leading the decoder to predict either the translation of the source sentence or its syntactic tags. We adapt this to image captioning by setting the first token to either a start-of-syntax token (<T>) or a start-of-sentence token (<S>), and then generating tags or tokens, respectively. Compared to the other approaches, MULTI-TASK allows the model to learn both types of forms at the same position. While this approach does not double sequence length, it doubles the number of sequences per training epoch.
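Assuming the literal prefix tokens <S> and <T> described above, each caption yields two training sequences, which can be sketched as:

```python
def multitask_examples(tokens, tags):
    """MULTI-TASK: two training sequences per caption, sharing the decoder.

    <S> prefixes the word sequence (main task); <T> prefixes the tag
    sequence (secondary task). Sequence length stays the same as the
    caption, but the number of sequences per epoch doubles.
    """
    assert len(tokens) == len(tags)
    return [["<S>"] + tokens, ["<T>"] + tags]

# multitask_examples(["a", "dog"], ["DET", "NOUN"])
# -> [["<S>", "a", "dog"], ["<T>", "DET", "NOUN"]]
```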

Syntactic Granularity
In addition to the three approaches to realizing sentence planning, we investigate the effects of syntactic tags from coarse to fine granularity. We experiment with the following tag sets:
• CHUNK: Also known as shallow parsing, chunks are syntactic tags that model phrasal structure in a sentence, such as noun phrases (NP) and verb phrases (VP).
• POS: Part-of-speech tags are the lexical categories to which words are assigned based on their syntactic context and role, such as nouns (N) and adjectives (ADJ).
• DEP: Dependency-based grammars model the structure as well as the semantic dependencies and relationships between words in a sentence. In this study, we consider the dependency label assigned to each word, such as adjectival modifier (amod), which denotes any adjective that modifies the meaning of a noun.
• CCG: Combinatory categorial grammar (Steedman and Baldridge, 2006) is based on combinatory logic and provides a transparent interface between surface syntax and the underlying semantic representation. For example, the syntactic category assigned to "sees" is "(S\NP)/NP", denoting a transitive verb that will be followed by a noun phrase.
We also study the merit of breaking bi-gram dependencies in the INTERLEAVE approach by tagging each word with a synthetic tag <IDLE>. We hypothesize that this approach should not give any benefit on any metric, as attention-based models can simply learn to ignore these pseudo-tags.

Experimental Setup
Data We use training and evaluation sets such that paradigmatic gaps exist in the training set. That is, for a concept pair {c_i, c_j}, the validation set D_val and test set D_test only contain images for which at least one caption contains the pair of concepts, while the complementary set, where the concepts c_i and c_j can only be observed independently, is used for training (D_train). Following Nikolaus et al. (2019), we select the same 24 adjective-noun and verb-noun concept pairs, and split the English COCO dataset (Lin et al., 2014) into four sets, each containing six held out concept pairs.
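A simplified sketch of this split: images whose captions realize both concepts go to evaluation, the rest to training. Here membership is plain token co-occurrence, without the dependency check used at evaluation time.

```python
def split_by_concept_pair(dataset, c_i, c_j):
    """Partition images so the pair {c_i, c_j} never co-occurs in training.

    `dataset` maps image ids to lists of tokenized captions. Images where
    at least one caption contains both concepts form the evaluation set;
    all remaining images (where the concepts appear only independently)
    form the training set.
    """
    train, evaluation = {}, {}
    for image_id, captions in dataset.items():
        if any(c_i in cap and c_j in cap for cap in captions):
            evaluation[image_id] = captions
        else:
            train[image_id] = captions
    return train, evaluation

toy = {"img1": [["a", "white", "dog"]],
       "img2": [["a", "white", "cat"], ["a", "dog", "runs"]]}
train, evaluation = split_by_concept_pair(toy, "white", "dog")
# img1 pairs "white" and "dog" in one caption -> evaluation;
# img2 shows the concepts only independently -> training.
```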
Pre-processing We first lower-case the captions and strip away punctuation. We then use StanfordNLP (Qi et al., 2018) to tokenize and lemmatize the captions, and to extract universal POS tags and syntactic dependency relations. For IOB-based chunking, we train a classifier-based tagger on CoNLL-2000 data (Tjong Kim Sang and Buchholz, 2000) using NLTK (Bird et al., 2009). Finally, we use the A* CCG parsing model of Yoshikawa et al. (2017) with ELMo embeddings (Peters et al., 2018) to extract CCG tags. Visual features are extracted from 36 regions of interest in each image using Bottom-Up attention (Anderson et al., 2018) trained on Visual Genome (Krishna et al., 2017).
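The first normalization step (lower-casing and stripping punctuation) can be sketched as below; this is only a stand-in for the paper's pipeline, since the actual tokenization, lemmatization, and tagging use StanfordNLP.

```python
import string

# Translation table that deletes all ASCII punctuation.
_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(caption):
    """Lower-case, strip punctuation, and whitespace-tokenize a caption."""
    return caption.lower().translate(_PUNCT).split()

# normalize("A white dog, running!") -> ["a", "white", "dog", "running"]
```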
Evaluation Following Nikolaus et al. (2019), we evaluate compositional generalization with Recall@K. Given K generated captions for each of the M images in an evaluation set, {s^1_1, ..., s^1_K, ..., s^M_1, ..., s^M_K}, the recall of a concept pair is given by:

R@K = |{m : ∃k such that s^m_k ∈ C}| / M

where s^m_k denotes the k-th generated caption for image m and C is the set of captions which contain the expected concept pair and in which the adjective or the verb is a dependent of the noun.
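The metric reduces to the fraction of images for which at least one of the K captions realizes the pair. In the sketch below, the predicate `contains_pair` stands in for the paper's dependency-based check:

```python
def recall_at_k(generated, contains_pair):
    """Recall@K over M images.

    `generated` is a list of M lists of up to K captions; `contains_pair`
    is a predicate deciding whether a caption realizes the held-out pair
    (here a simple placeholder for the dependency-based membership in C).
    """
    hits = sum(any(contains_pair(c) for c in caps) for caps in generated)
    return hits / len(generated)

captions = [["a white dog runs", "a cat sits"],   # image 1: hit
            ["a dog runs", "a brown dog"]]        # image 2: miss
score = recall_at_k(captions, lambda c: "white dog" in c)
# score == 0.5
```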
In addition, we use pycocoeval to score models on common image captioning metrics, including BLEU (Papineni et al. 2002) and METEOR, as well as the recent multi-reference BERTSCORE (BS; Yi et al. 2020). In particular, we report the average recall across all concept pairs, the average across the four splits for each score in pycocoeval, and the average across all captions for BERTSCORE.

Models
We evaluate three models:
• BUTD: Bottom-Up and Top-Down attention (Anderson et al., 2018), a strong and widely-employed RNN-based captioning system.
• BUTR: BUTD extended with a jointly-trained image-sentence ranking network (Nikolaus et al., 2019), the state of the art for compositional image captioning.
• M 2 -TRM: the Meshed-Memory Transformer (Cornia et al., 2020), a state-of-the-art Transformer-based captioning system.
Each model is trained until the BLEU score on the validation set does not increase for five consecutive epochs. We use the default hyperparameters and do not fine-tune them when tasking the models with syntax generation. For full experimental details, refer to App. A. Our code and data are publicly available.

Syntax Awareness
In Table 1, we first report the performance of BUTD when jointly modeling different types of syntactic tags and each approach to sentence planning.
Syntax helps compositional image captioning Table 1 clearly shows that, regardless of the level of granularity, syntactic planning enhances compositional generalization in image captioning (R@5). Moreover, CHUNK -one of the most widely-used tag sets for syntax-aware image captioning (e.g. Kuznetsova et al. 2012; Yang and Liu 2020) -is outperformed by tag sets with finer granularity (e.g. DEP) in every approach, motivating further research into incorporating them in image captioning. Looking at the results for the SEQUENTIAL approach, we see that, with the exception of POS tags, syntactic planning increases the ability of the model to recall novel concept pairs, with gains of at least +1.1 R@5 points. We then hypothesize that syntax-based sequential planning is effective if the tags convey information about words in relation to each other, e.g. CCG tags as opposed to POS tags.
When the model INTERLEAVEs syntactic tags and words, there is an improvement of at least +1.0 R@5, except for CHUNK. Moreover, POS tags lead to the highest gain of +2.3 R@5.
Finally, the MULTI-TASK approach also leads to significant gains in compositional generalization, with DEP (original setup of Currey and Heafield 2019) giving the highest R@5, corroborating the effectiveness of our porting into image captioning.
Generalization across categories We further investigate the role of syntactic planning for the different unseen composition categories defined by Nikolaus et al. (2019). Figure 2 illustrates how our different combinations of approaches and syntactic tags handle color and size modifiers, object types (animate and inanimate) and verb types (transitive and intransitive). We see that DEP tags consistently improve upon BUTD for color and size concept pairs, regardless of the planning approach, making them a robust tag set for future research. INTERLEAVE+POS also leads to gains for all color and size categories, with up to +10 R@5 for colors of inanimate objects. Conversely, all the variants perform worse than the baseline for the sizes of animate objects. However, this drop is not substantial because BUTD already performs poorly in this category.
Towards neural NLG pipelines While the SEQUENTIAL approach closely follows the traditional NLG pipeline, it consistently degrades performance on standard image captioning metrics. On the other hand, both INTERLEAVE and MULTI-TASK lead to higher performance in compositional generalization and in the other metrics. In particular, when BUTD is trained to predict either words or CCG tags in the MULTI-TASK approach, the generated captions achieve the highest average scores, including a substantial gain of +2.9 CIDER points. These results indicate that neural models require novel ways of sentence planning, and that effectively doing so consistently leads to the same or better performance on every considered metric.
Grounding the need for planning Overall, Table 1 provides empirical support that an explicit planning step improves compositional generalization in image captioning. In fact, even breaking bi-grams with the <IDLE> tag in the INTERLEAVE approach improves performance: the standard approach of directly mapping image representations to tokens is sub-optimal because the model learns to generate n-grams seen during training. Given its superior performance in recalling novel compositions of concepts, we adopt INTERLEAVE+POS throughout the remainder of this paper to jointly model syntactic tags and words. For clarity of exposition, we refer to this approach as POS.

Adaptive Re-Ranking for Syntax
Recall that the best-performing model for compositional image captioning re-ranks its generated captions given the image (BUTR; Nikolaus et al. 2019). Here, we study how to combine the benefits of syntactic planning and their re-ranking approach.
The BUTD model, investigated above, is a two-layer LSTM (Hochreiter and Schmidhuber, 1997) in which the first LSTM encodes the sequence of words, and the second LSTM integrates visual features through an attention mechanism to generate the output sequence (Anderson et al., 2018). The state-of-the-art BUTR model extends this with an image-sentence ranking network that projects images and captions into a joint visual-semantic embedding. The sentence representation used by the ranking network is a learned projection of the final hidden state h^1_T of the first LSTM: s = W h^1_T. Table 2 shows that the image-sentence retrieval performance of BUTR decreases when interleaving POS tags with words. Given this formulation of BUTR and its connection to BUTD, we conclude that jointly modeling syntactic tags and words leads to decreased performance on both the generation and ranking tasks.

Ranking performance
Adaptively attending to tags We explore two approaches to combining the improvements of interleaved syntactic tagging with ranking:
• mean: The model creates a mean representation over the hidden states of the first LSTM.
• weight: The model forms a weighted pooling of the hidden states of the first LSTM layer, whose weights are learned through a linear layer. This is a simple form of attention mechanism, equivalent to the one used by Nikolaus et al. (2019) to represent image features in the shared embedding space:

s = Σ_t α_t h^1_t,  α = softmax(w^T H^1)   (3)

Table 2 shows that the weighting mechanism in the ranking model effectively disentangles syntactic tags and tokens, resulting in +1.5 RECALL points over BUTR, with small improvements on the other metrics. Compared to BUTR, BUTR weight also improves the retrieval performance of the ranking module. Adding POS tags still decreases retrieval performance but, compared to BUTR+POS, the difference is now halved for text retrieval and only 0.7 points for image retrieval. Overall, our BUTR weight is a more general and robust approach to jointly training a captioning system and a discriminative image-sentence ranker.
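The weight variant can be sketched as softmax-weighted pooling of the hidden states. In the real model the scores come from a learned linear layer; here they are passed in directly, so this is an illustrative sketch rather than the trained module.

```python
import math

def weighted_pool(hidden_states, scores):
    """Pool hidden states with softmax weights (the `weight` variant).

    `hidden_states` is a list of T vectors (lists of floats); `scores` is
    a list of T unnormalized attention scores. Returns the convex
    combination of the hidden states.
    """
    m = max(scores)                              # for numerical stability
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    alphas = [e / z for e in exp]                # softmax weights
    dim = len(hidden_states[0])
    return [sum(a * h[d] for a, h in zip(alphas, hidden_states))
            for d in range(dim)]

# With equal scores, pooling reduces to the mean variant:
# weighted_pool([[1.0, 0.0], [3.0, 0.0]], [0.0, 0.0]) -> [2.0, 0.0]
```

With learned, non-uniform scores, the ranker can down-weight the hidden states that correspond to interleaved tag positions, which is the disentangling effect observed in Table 2.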

Results and Discussion
We now report the final performance of three image captioning models that integrate syntactic planning (INTERLEAVE+POS) with word generation.
Model-agnostic improvements We start by investigating whether the compositionality given by syntactic planning generalizes across architectures. Table 3 reports average validation and test scores for the BUTD, BUTR weight and M 2 -TRM models. We find that interleaving POS tags and tokens consistently leads to +2 RECALL points in each model without affecting the performance on other metrics, with the exception of a decreased BLEU score for M 2 -TRM. In this case, M 2 -TRM+POS generates captions that are abnormally truncated, ending with bi-grams such as "of a," "on a" and "to a". This is known as reward hacking (Amodei et al., 2016), which arises in models with a reinforcement learning-based optimization phase; investigating whether proposed approaches to mitigate this problem (Liu et al., 2017; Li et al., 2019, inter alia) are also effective in our setup is left as future work. Furthermore, we can clearly see that despite M 2 -TRM outperforming the RNN-based models on every standard metric, it is only +1 RECALL point better than BUTD at compositional generalization. Hence, syntactic planning is an effective strategy for compositional generalization, regardless of the language model used.

Performance on held out categories Table 4 lists the R@5 scores for the different categories of held out pairs. Differently from the results reported by Nikolaus et al. (2019), we find the performance of BUTD for noun-verb concept pairs to be much higher thanks to a larger beam size (equal to the one used for BUTR in our experiments). Moreover, the performance from interleaving POS across different categories of held out pairs shows that the improvements are consistent across categories and models, with the exception of size modifiers of animate objects, where all models perform poorly. This was also found by Nikolaus et al. (2019) and is likely due to the need for real-world knowledge (i.e. does this image depict a "big dog" compared to all other "dogs"?). For a full breakdown of the R@5 generalization performance of each model for each held out pair, see Table 9 in App. B.

Performance by minimum importance Given that annotators of the COCO dataset were given a relatively open task to describe images, captioning systems should exhibit higher recall of concept pairs when more annotators use them in their descriptions. As shown in Figure 3, this behavior is seen in each model, with increasing gains given by jointly modeling lexical and syntactic forms. In particular, we observe that the M 2 -TRM model recalls fewer pairs than BUTD when they are considered more relevant (more annotators use them in describing an image), and that interleaving POS tags partially solves this limitation. Moreover, as agreement among annotators increases, we also see that BUTD+POS is as effective as BUTR weight, corroborating the effectiveness of our model-agnostic approach against a more complex, multi-task model.

Captions diversity
Table 5 reports the average scores for caption diversity (van Miltenburg et al., 2018) in the validation data. Comparing the BUTD (RNN-based) and M 2 -TRM (Transformer-based) models, we see that the output vocabulary of M 2 -TRM spans many more word types, resulting in +11% novel captions. However, M 2 -TRM has lower mean segmented type-token ratios (TTRs), contrasting with the conclusion of van Miltenburg et al. (2018) that the number of novel descriptions is strongly correlated with the TTR (while this correlation is maintained with the number of word types). Jointly modeling syntactic tags and tokens leads to a higher number of types in both models and a substantial +8% in novel captions for M 2 -TRM, without affecting other metrics. Clearly, BUTR weight leads to longer sentences, more types, higher TTRs and the highest percentage of novel captions. We can also see that BUTR weight achieves the highest coverage (defined as the percentage of learnable words it can recall), while M 2 -TRM has the highest local recall score, being able to better recall the content words that are important for describing a given image.
Accuracy of syntactic forms We verify that the models can correctly predict syntactic tags, regardless of their granularity and of the approach used to jointly model them with tokens. Indeed, the accuracy of the syntactic tags generated by BUTD, measured as the ratio of sequences matching the annotations from StanfordNLP, is high, ranging between 95% and 99%. See App. B for details.
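Measured as exact sequence match against the reference annotations, this accuracy can be sketched as (helper name ours):

```python
def tag_sequence_accuracy(generated, reference):
    """Fraction of generated tag sequences that exactly match the
    reference tag annotations (exact-match over whole sequences)."""
    matches = sum(g == r for g, r in zip(generated, reference))
    return matches / len(reference)

# One of two sequences matches exactly:
# tag_sequence_accuracy([["DET", "NOUN"], ["ADJ"]],
#                       [["DET", "NOUN"], ["NOUN"]]) -> 0.5
```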
Qualitative examples Figure 4 shows generated captions. Compared to standard BUTD, all syntax-aware approaches allow the model to recall more unseen concept pairs, while also improving the overall quality of the captions. In addition, when looking at the captions generated by all three models, both with and without interleaved POS tags, we find that the integration of syntactic tags clearly improves the quality of the generated captions. See Figure 5 in App. B for more examples.

Related Work
Compositional image captioning Nikolaus et al. (2019) study compositional generalization in image captioning with combinations of unseen adjective-noun and verb-noun pairs, whose constituents are observed at training time but not their combination, thus introducing a paradigmatic gap in the training data. Nikolaus et al. (2019) showed how to improve compositional generalization by jointly training an image-sentence ranking model with a captioning model. Other work has also investigated generalization to unseen combinations of visual concepts as a classification task (Misra et al., 2017; Kato et al., 2018), triplet prediction (Atzmon et al., 2016), or unseen objects (Lu et al., 2018). Here, we improve generalization by jointly modeling syntactic tags and tokens, and we show how to combine this with the improvements gained from a jointly-trained ranking model.

Figure 4 (example captions): BUTD: "there is a woman that is on the floor"; BUTD+SEQUENTIAL: "a woman doing a trick on a bicycle"; BUTD+INTERLEAVE: "a woman riding a bike on a wooden floor"; BUTD+MULTI-TASK: "a woman riding a bike on a wooden surface". BUTD: "a woman with a child sitting on a bench"; BUTD+POS: "a girl that is standing on a skateboard"; BUTR weight: "a girl and child playing with a toy in a backyard"; BUTR weight+POS: "a girl doing a trick on a skateboard on a brick walkway".

Joint syntactic and semantic representations
While little work has investigated the interaction of jointly modeling semantics and various syntactic forms in captioning models, a few studies have exploited syntax in image and video captioning. Zhao et al. (2018) propose a multi-task system that jointly trains image captioning with two additional tasks: multi-object classification and syntax generation. The same LSTM decoder is used to generate captions and CCG tags by mapping the hidden representations to either the word or the tag vocabulary through two different output layers. Dai et al. (2018) propose a two-stage sequential pipeline where a sequence of noun-phrases is first selected from a fixed pool, and the phrases are then patched together via predetermined connecting phrases. This method, however, is unlikely to realize any benefits for compositional generalization because it uses the top-50 noun-phrases and 1,000 connecting phrases from the training set. Our INTERLEAVE approach could be used to address these limitations in their "phrase pool" and "connecting" modules to produce unseen compositions. Deshpande et al. (2019) rely on sequences of POS tags to produce diverse captions. Similarly to our SEQUENTIAL approach, their model first predicts a sequence of POS tags conditioned on the input image. However, the authors limit the POS sequences to 1,024 templates obtained through quantization of the training set. During inference, the model samples k POS tag sequences and uses them to condition a greedy decoder for caption generation. Hou et al. (2019) take yet another approach to jointly learning POS tags and surface forms in the framework of video captioning. They introduce a model that resembles our INTERLEAVE approach, but with two main differences: (i) the t-th tag is not conditioned on previous tags, and (ii) the t-th word is conditioned only on the t-th tag and the video.

Conclusion
We investigated a variety of approaches, along with different syntactic tag sets, to achieve compositional generalization in image captioning via sentence planning. Our results support the claim that combining syntactic planning and language generation consistently improves the generalization capability of RNN- and Transformer-based image captioning models, especially for inanimate color-noun combinations. While this approach penalizes image-sentence ranking models, we showed that this can be overcome with an adaptive attention mechanism, resulting in state-of-the-art performance on the compositional generalization task. We believe our results will lead to further exploration of syntax-aware captioning models given their potential to better generalize, both in terms of under-researched syntactic granularities (e.g. CCG) and more expressive alternatives for modeling syntactic structure. Another direction for future work is to focus on size-noun compositions, which rely on the successful integration of real-world knowledge.

A Experimental Setup
Data In order to evaluate the compositional generalization of a model, we use training and evaluation sets such that paradigmatic gaps are observed in the training set. That is, for a concept pair {c_i, c_j}, the validation set D_val and test set D_test only contain images for which at least one caption contains the pair of concepts, while the complementary set, where the concepts c_i and c_j can only be observed independently, is used for training (D_train). Specifically, following Nikolaus et al. (2019), we select the same 24 adjective-noun and verb-noun concept pairs, and split the English COCO dataset (Lin et al., 2014) into four sets, each containing six held out concept pairs (training and validation instances are drawn from train2014, while test instances are drawn from val2014). Table 6 lists the sizes (in number of images) of each split. For more details, we refer the reader to Nikolaus et al. (2019).
Training details Following Nikolaus et al. (2019) and Cornia et al. (2020), each system is trained with teacher forcing. Model selection is performed using early stopping: training ends when the BLEU score of the generated captions on the validation set does not increase for five consecutive epochs. All models are trained using the Adam optimizer (Kingma and Ba, 2014): BUTD and BUTR use an initial learning rate of 1e-4, β1 = 0.9 and β2 = 0.999, and gradients are clipped when they exceed 10.0. For the GradNorm optimizer (Chen et al., 2018) used in BUTR, the initial learning rate is 0.01 and the asymmetry is 2.5, although we find it beneficial to tune the latter when generating syntax. Moreover, we find that taking the absolute value of the GradNorm weights for each loss in the renormalization step (given that our loss functions are by definition positive) leads to more stable multi-task training. M 2 -TRM first uses an initial learning rate of 1, β1 = 0.9 and β2 = 0.98, with a warm-up of 10,000 iterations (Vaswani et al., 2017); the learning rate is then fixed to 5e-6 during CIDER-D optimization. A batch size of 50 is used when training M 2 -TRM, while BUTD and BUTR are trained with batch sizes of 100. When adding syntactic forms, due to memory limitations, a batch size of 50 is always used (see Table 7 for a comparison of the planning approaches). All models are trained on one NVIDIA TitanX GPU in a shared cluster; see Table 8 for running times on the second held out dataset. (Note that the size of each split is slightly different from that reported by Nikolaus et al. 2019.)
Inference At evaluation time, a maximum caption length of 20 is used when generating lexical forms only, and of 40 when syntactic tags are also generated. Notably, we use the default hyperparameters provided by the respective authors and do not fine-tune them when tasking the models with syntax generation. Differently from Nikolaus et al. (2019), rather than using a beam of 100 for BUTR only, we let all systems generate captions with this beam size, as we found it to significantly improve the compositional generalization of BUTD on our validation sets.

B Further Analysis
Accuracy of syntactic forms We verify that a model can correctly predict syntactic forms, regardless of their granularity and of the approach used to jointly model them with lexical forms. Figure 6 shows that, indeed, the accuracy of the syntactic tags generated by BUTD, measured as the ratio of sequences matching the annotations from StanfordNLP, is high, ranging between 95% and 99%. Note that we only evaluate the accuracy of the SEQUENTIAL and INTERLEAVE approaches, as there is no close relationship between syntactic and lexical sequences in the MULTI-TASK approach.
Qualitative examples Figure 5 shows more generated captions for images in the validation sets.

Table 9: R@5 for each of the held out concept pairs in the validation sets.