Retrieval, Analogy, and Composition: A framework for Compositional Generalization in Image Captioning

Image captioning systems are expected to combine individual concepts when describing scenes with concept combinations that are not observed during training. In spite of significant progress in image captioning with the help of the autoregressive generation framework, current approaches fail to generalize well to novel concept combinations. We propose a new framework that revolves around probing several similar image caption training instances (retrieval), performing analogical reasoning over relevant entities in retrieved prototypes (analogy), and enhancing the generation process with reasoning outcomes (composition). Our method augments the generation model by referring to the neighboring instances in the training set to produce novel concept combinations in generated captions. We perform experiments on widely used image captioning benchmarks. The proposed models achieve substantial improvements over the compared baselines on both composition-related evaluation metrics and conventional image captioning metrics.


Introduction
Generating a textual description for a given image, a problem known as image captioning (Chen et al., 2015), requires a conditional generation model to recognize salient visual regions, e.g., via object (Anderson et al., 2018) or scene graph detection (Yao et al., 2018), align visual features with textual tokens (Lu et al., 2017; Pu et al., 2018; Shi et al., 2020), and verbalize them in a natural language sentence (Xu et al., 2015). Current state-of-the-art image captioning models benefit from powerful neural autoregressive generation models, attention mechanisms, and progress in object or scene graph detection. They have achieved significant progress in obtaining visual representations for images as well as modelling alignment between visual features and textual tokens, resulting in superior performance under a variety of text-similarity based metrics.
Figure 1: Comparison of compositional generalization in generated descriptions between human and machine (Anderson et al., 2018).
However, when verbalising visual semantic concepts into natural language sentences, these models still fall short of compositional generalization for images with novel concept combinations (Nikolaus et al., 2019). Note that making systematic generalizations (Lake and Baroni, 2018; Janssen and Partee, 1997) from limited data is an essential property of human language. As shown in Figure 1, the visual instances of "horse" and "cow" as well as a scene containing the concept combination "cow eat" have been observed during training. While existing models often only generate "horse on" for the picture, it would be effortless for humans to generate a caption containing "horse eat", even though this combination has not been observed during training. This is partly because current language generation models rely heavily on the surface distributional characteristics of the captions and hence are discouraged from generating unseen concept combinations (Holtzman et al., 2019; Nikolaus et al., 2019).
To remedy the problem, we propose to leverage prototype-based generation approaches, which can explicitly expose concepts of other training examples by asking the model to decide what prototypes to retrieve in either a heuristic or learned way. In other words, these approaches have a chance to peek into retrieved prototypes for concepts without relying on the generation component. In addition, to combine the concepts from the prototypes, we enhance the conditional generation model by incorporating analogical reasoning (Vosniadou and Ortony, 1989; Gentner and Smith, 2012; Wu et al., 2020), based on the idea that if two things are similar on the visual side, they are probably also similar on the text side. Specifically, at each generation step, we compare the visual and textual representations of the current state in the language model decoder and the analogy entity pairs (a visual entity and its text form a pair) extracted from retrieved prototypes to produce sentences with improved generalization of semantic compositions.
As a result, our model consists of two major components: (1) a multi-prototype retriever (cf. Section 3.3) for obtaining multiple prototypes that aim to cover the basic concepts in the described image, and (2) an analogical reasoning editor (cf. Section 3.4) that performs analogical reasoning over extracted analogy entity pairs in order to compose these concepts for generation. We perform extensive experiments on the widely used benchmark MSCOCO (Lin et al., 2014) with both maximum likelihood estimation and reinforcement learning strategies (Rennie et al., 2017). The experimental results show that the proposed models significantly outperform the baselines under both text-similarity based metrics and composition-related metrics. The main contributions of our work are summarized as follows:
• To the best of our knowledge, this is the first attempt to introduce a prototype-based generation framework in image captioning, which helps the generation process with improved compositional generalization.
• The proposed framework substantially improves upon the baselines (Anderson et al., 2018;Nikolaus et al., 2019) on both composition related metrics (from 13.6 to 18.8 on R@5) and conventional evaluation metrics (from 109.9 to 114.3 on CIDEr).
• We analyze various types of concept composition in captioning generation and provide detailed discussion on how the proposed framework improves compositional generalization for each type.

Related Work
Image Caption Generation Image captioning aims at generating visually grounded descriptions for images. Current models often leverage a CNN or its variants as the image encoder and an RNN or transformer as the decoder to generate sentences (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015; Donahue et al., 2015; Yang et al., 2016; Huang et al., 2019). Previous work has used visual attention mechanisms (Anderson et al., 2018; Pu et al., 2018; Lu et al., 2017; Pedersoli et al., 2017; Xu et al., 2015; Pan et al., 2020; Shi et al., 2021b) and explicit high-level attribute detection (Yao et al., 2017; You et al., 2016) to align visual and textual features. For the learning method, people use reinforcement learning (Rennie et al., 2017; Ranzato et al., 2015; Liu et al., 2018), or contrastive or adversarial learning, to generate descriptive captions with improved quality (Luo et al., 2018; Shi et al., 2021a). The distribution shift between the training and test stages has also received a lot of attention, such as generating captions with novel concepts (Agrawal et al., 2019; Anderson et al., 2016a). More recently, Nikolaus et al. (2019) propose 24 concept pairs to explicitly investigate the composition generation ability of current neural image captioning models.
Compositional generalization Systematic compositionality, the ability to capture underlying rules from limited data and generalize them to novel situations, is a key feature of human intelligence (Fodor and Pylyshyn, 1988). The topic is closely related to cognitive science (Fodor and Lepore, 2002) and the connectionist literature (McClelland et al., 1986). While it is widely studied in the semantic parsing literature (Lake and Baroni, 2018; Keysers et al., 2019), it is less investigated in natural language generation. Akyürek et al. (2020) introduce a resample-and-recombine network to improve generalization in two NLP problems, i.e., instruction following and morphological analysis.

Method
Our framework is designed to enhance text generation with compositional generalization through analogical reasoning over retrieved prototypes. The framework is built on the classical two-layer LSTM network, i.e., Updown (Anderson et al., 2018), but the method is orthogonal to more recent architectures.
Figure 2: The model framework consists of a prototype retriever and an analogical reasoning editor, where the former attempts to obtain multiple prototypes for the described image and the latter uses analogical reasoning to leverage the analogy entity pairs for generation. Therefore, even if "white refrigerator" is a novel combination, we can generate a caption containing it from entities in prototypes.

Problem Definition
We are given a training dataset D which contains matched image-caption pairs {d_i}, where d_i denotes an image x_i and its caption c_i.

Composition of Common Concepts
Following Nikolaus et al. (2019), we use a set of common concepts {s_i} of interest, which covers a range of frequently occurring attributes, objects, and verbs, and then select a number of concept pairs {S_j} based on {s_i}, including attribute-noun and noun-verb compositions. Note that the concepts in {s_i} are frequently seen in both the training and evaluation stages, but {S_j} is a held-out set of concept combinations used to test the generalization ability of the model (cf. Section 4.1 for dataset splits).

Composition of Rare Concepts
We further select a few rare concepts {s_i} of interest, covering a few verbs and objects. As these concepts are rarely seen in the training stage but frequently used in the evaluation stage, they are proposed to test the ability to learn new concepts in context from little data (cf. Section 4.1 for dataset splits).

Overall Framework
The goal of image captioning is to train a conditional generation model p_m(c | x). As shown in Figure 2, the framework corresponds to the following retrieve-and-edit generative process: given an input x, we first retrieve k prototypes d_{1:k} from D by sampling from p_r(d_{1:k} | x). We then generate a visually grounded sentence c using an analogical reasoning editor p_e(c | x, d_{1:k}).
Typical models leverage a two-phase training process to learn p_m(c | x): the former phase uses the cross-entropy loss to maximize the log probability of the ground-truth captions,

L_XE(θ) = − Σ_{t=1}^{T} log p_m(c*_t | c*_{1:t−1}, x),   (1)

and the latter phase uses a policy gradient algorithm to maximize the expected reward metric r, i.e., CIDEr:

L_RL(θ) = − E_{c ∼ p_m}[ r(c) ].   (2)

Here, we focus on deterministic retrievers, where p_r(d_{1:k} | x) is a point mass on particular prototypes d_{1:k}. Note that when generating texts with novel semantic compositions, neither a basic LSTM editor p_e nor a single-prototype retriever p_r is enough. We accordingly elaborate on the retrieval and edit models separately in the rest of this section.
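The two-phase objective above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: scalar rewards and per-token log-probabilities stand in for real model outputs, and the RL phase follows the self-critical form of Rennie et al. (2017), where the reward of the greedy-decoded caption serves as the baseline.

```python
def xe_loss(gt_logprobs):
    """Cross-entropy phase (Eq. 1): negative log-likelihood of the
    ground-truth caption tokens, summed over time steps."""
    return -sum(gt_logprobs)

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical policy-gradient phase (Eq. 2): REINFORCE with the
    reward of the greedy-decoded caption as baseline, so the advantage
    r(sample) - r(greedy) weights the sample's log-probability."""
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)
```

A sampled caption whose CIDEr beats the greedy baseline yields a negative loss contribution on its negative log-probability, i.e., its probability is pushed up.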

Multi-Prototype Retriever
To generate captions with novel compositions, we aim to provide a large inventory of contextualized individual concepts and encourage further use of both their visual and textual features. Furthermore, the retrieval should not only favor visual similarity with the query but also ensure that the retrieved prototypes collectively cover the concepts in the query.
Specifically, given an image x, we first obtain n neighbor prototypes x_{1:n} from the training set by ranking the cosine similarity of image features encoded by CLIP (Radford et al., 2021), a state-of-the-art visual encoder trained on a large amount of image-text data by contrastive learning. Then, for each neighbor image x_i and the query image x, we extract the entity¹ sets g_{x_i} and g_x from scene graphs produced by a pre-trained parser (Yang et al., 2019). We obtain K images {x_j}, j = 1, …, K, by iteratively selecting the neighbor that covers the most query entities not yet covered by previously selected prototypes:

x_j = argmax_{x_i ∈ x_{1:n}} | g_{x_i} ∩ (g_x \ (g_{x_1} ∪ … ∪ g_{x_{j−1}})) |.

As such, we obtain K retrieved images with their corresponding captions d_{1:K}, so that these K prototypes cover most meaningful semantic concepts in the query image x.
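The iterative selection step can be sketched as a greedy coverage procedure. The exact scoring rule here (uncovered-entity count with visual similarity as a tie-breaker) is our assumption for illustration; the paper's formula may differ.

```python
def select_prototypes(query_entities, neighbors, K=3):
    """Greedy coverage selection over CLIP-ranked neighbors.

    neighbors: list of (id, similarity, entity_set) tuples, already
    ranked by CLIP cosine similarity to the query image. At each step
    we pick the neighbor covering the most query entities not yet
    covered, breaking ties by similarity.
    """
    uncovered = set(query_entities)
    chosen = []
    pool = list(neighbors)
    for _ in range(min(K, len(pool))):
        best = max(pool, key=lambda n: (len(uncovered & n[2]), n[1]))
        chosen.append(best[0])
        uncovered -= best[2]
        pool.remove(best)
    return chosen
```

With a query containing "horse", "eat", and "grass", a neighbor covering two uncovered entities is preferred over a visually closer neighbor covering only one.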

Analogical Reasoning Editor
We take p_e(c | d_{1:k}, x) to be a neural autoregressive conditional text generation model (a two-layer LSTM) which decomposes as

p_e(c | d_{1:k}, x) = ∏_{t=1}^{T} p(c_t | c_{0:t−1}, d_{1:k}, x),

where T is the length of the caption and c_0 is the start token "<s>". For the image x, the model employs Faster R-CNN (Ren et al., 2015) to recognize object instances and returns a set of image regions: x = {r_1, r_2, …, r_M}.
Bottom LSTM The bottom LSTM is used to align a textual state to image region representations:

h^1_t = LSTM^1([h^2_{t−1}; r̄; e(c_{t−1})], h^1_{t−1}),

where LSTM denotes one step of recurrent unit computation; r̄ is the mean-pooled representation of all object regions in the image; h^1_{t−1} and h^2_{t−1} denote the hidden states of the bottom and top LSTM at time step t−1, respectively; and e is the word embedding lookup table.
¹Entity means attributes, objects and predicates here.
Attention Unit The state h^1_t is then used as a query to attend over the object features {r_i} to obtain a contextualized image region feature r̂_t:

a_{i,t} = W_a^T tanh(W_{ra} r_i + W_{ha} h^1_t),  α_t = softmax(a_t),  r̂_t = Σ_i α_{i,t} r_i,

where W_{ra}, W_{ha} and W_a are model parameters.
Top LSTM The top-layer LSTM works as a recurrent language model. At time step t, its input consists of the output of the bottom LSTM layer h^1_t and the output of the visual attention unit r̂_t:

h^2_t = LSTM^2([r̂_t; h^1_t], h^2_{t−1}).

Analogy Entity Pairs We first run the two-layer LSTM on the K retrieved prototypes d_{1:K} to obtain aligned visual and textual representations. We take the attention unit outcome as the visual feature and its corresponding ground-truth token as the textual feature, obtaining a total of K · T aligned pairs. Specifically, at time step t in retrieved prototype k, we get the aligned pair (e_{c_{k,t}}, r̂_{k,t}). To obtain the analogy entity pairs, we remove a pair if c_{k,t} is not an entity, getting Y analogy entity pairs {(e_{c_{en_i}}, r̂_{en_i})}, 1 ≤ i ≤ Y, where Y depends on the input x and its retrieved prototypes d_{1:K}.
Analogical Reasoning For the described image x, we obtain the analogy entity pairs {(e_{c_{en_i}}, r̂_{en_i})}, each consisting of a visual feature and a textual feature, and perform analogical reasoning over them. Specifically, we use r̂_t as the query for attending over these entity pairs to get an analogy context feature r̃_t:

β_{i,t} = softmax_i(r̂_t^T W_b r̂_{en_i}),  r̃_t = Σ_i β_{i,t} e_{c_{en_i}}.

We then combine r̃_t and the top-layer LSTM hidden state h^2_t to predict the next token:

p(c_t | c_{0:t−1}, d_{1:K}, x) = softmax(W_p [h^2_t; r̃_t] + b_p).
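The analogical reasoning step amounts to one round of attention with a visual query and textual values. The sketch below uses an unparameterized dot-product score for simplicity; the model itself uses learned projections.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def analogy_context(r_t, pairs):
    """Attend over analogy entity pairs: score each pair by the dot
    product of the current visual context r_t with the pair's visual
    feature, then read out a convex combination of the paired textual
    embeddings (pairs: list of (visual_vec, textual_vec))."""
    weights = softmax([dot(r_t, vis) for vis, _ in pairs])
    dim = len(pairs[0][1])
    return [sum(w * txt[i] for w, (_, txt) in zip(weights, pairs))
            for i in range(dim)]
```

When the visual query strongly matches one pair, the returned context is dominated by that pair's textual embedding, which is exactly how "similar on the visual side" transfers to the text side.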
Diversity-related We report diversity by calculating the number of distinct generated unigrams (Div-1) and bigrams (Div-2) scaled by sentence length, as well as self-BLEU (Zhu et al., 2018), where a lower value indicates higher diversity, computed among multiple generated sentences.
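Div-n can be computed directly. In this sketch we normalize the distinct n-gram count by the token count; the exact normalization in the paper may differ slightly.

```python
def distinct_n(tokens, n):
    """Div-n: number of distinct n-grams in the generated tokens,
    divided by the number of tokens (length normalization)."""
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams) / max(len(tokens), 1)
```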
Composition-related We calculate the recall of the held-out concept pairs (R@K) (Nikolaus et al., 2019) over the K generated captions for each image in the evaluation dataset.
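A minimal sketch of R@K: a concept pair counts as recalled for an image if any of its K generated captions contains both members of the pair. Exact-token matching here is a simplification of the paper's synonym- and dependency-aware matching.

```python
def pair_hit(pair, captions):
    """True if any caption contains both members of the concept pair."""
    a, b = pair
    return any(a in c.split() and b in c.split() for c in captions)

def recall_at_k(items):
    """items: list of (concept_pair, K_generated_captions) per image;
    returns the fraction of images whose pair is recalled."""
    hits = sum(pair_hit(p, caps) for p, caps in items)
    return hits / len(items)
```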

Implementation Details
Parameter Setting For a fair comparison, we use the default experiment setup of the compared baselines, as indicated in Luo's package². The number of retrieved prototypes k is 3, and the retrieval model used for obtaining prototypes is the officially released ViT-B/32 (note that for prototype retrieval we only use the image encoder). The scene graph parser is the official release from Yang et al. (2019). In the decoding stage, we use beam search with a beam size of 5 to produce 5 sentences for evaluation. The re-rank strategy uses a beam search with a size of 100 and then ranks the sentences in the beam by ViT-B/32 similarity.
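The re-rank strategy reduces to sorting a large candidate list by an external scorer and keeping the top few. Here `score_fn` is a stand-in for the ViT-B/32 image-text similarity used in the paper.

```python
def rerank(candidates, score_fn, top=5):
    """Generate a large beam (size 100 in the paper), then keep the
    `top` captions ranked by a cross-modal similarity score."""
    return sorted(candidates, key=score_fn, reverse=True)[:top]
```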

Split Construction
We first use a set of synonyms (Nikolaus et al., 2019) to represent each concept, so that a concept accounts for the variations in how it can be expressed across the dataset. Then we use the dependency parser from StanfordNLP (Qi et al., 2019) to identify the chosen nouns, verbs, attributes, and noun-verb and attribute-noun concept combinations. To construct the rare concept splits, we take all image-caption pairs in the original training set that contain the rare concept and move 95 percent of them into the validation set, leaving the remaining 5 percent in the training set.
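Concept matching with synonym sets can be sketched as follows; the paper additionally verifies a dependency relation between the two members of a pair, which this sketch omits.

```python
def contains_pair(caption_lemmas, pair, synonyms):
    """A caption matches a concept pair if it contains a synonym of
    both members. `synonyms` maps a concept to its synonym set; a
    concept with no entry matches only itself."""
    return all(any(w in synonyms.get(c, {c}) for w in caption_lemmas)
               for c in pair)
```

For example, with "pony" registered as a synonym of "horse" and "graze" as a synonym of "eat", the caption "a pony graze grass" counts as containing the "horse eat" pair.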

Overall Performance
Composition and diversity related metrics. We analyze the composition-related and diversity-related metrics together to get a clearer view of compositional generalization ability, as, intuitively, a more diversified generation method would help increase the R@5 of concept pairs in generated sentences. As shown in Table 2: (a) On the common concept split, our method achieves a significant increase in compositional generalization, improving recall@5 from 7.0 (UD) to 10.3 (Ours), and from 13.6 (UR+Rank) to 18.8 (Ours+Rank) with the re-ranking strategy applied. (b) On the rare concept split, we obtain a similar relative result, increasing recall@5 from 13.5 (UD) to 15.5 (Ours), and from 15.8 (UR+Rank) to 18.7 (Ours+Rank) with re-ranking applied. (c) The increase in recall is not caused by a change in diversity: Div-1 stays almost unchanged from 27.2 (UD) to 26.8 (Ours) on the common concept split, and from 25.9 (UD) to 25.6 (Ours) on the rare concept split. However, the re-rank strategy significantly increases diversity while also improving recall@5. Table 4 shows more detailed results for various concept combinations. The performance increase mostly rests on the noun-verb concept combinations, rising from 13.2 (UD) to 22.4 (Ours) for transitive verbs (eat, ride, hold) and from 14.7 (UD) to 19.9 (Ours) for intransitive verbs (lie, fly, stand). One explanation for that increase lies in the characteristic of the prototype retriever, which is capable of obtaining prototypes that share verbs or nouns with the query image. However, attribute-noun pairs with size modifiers (big, small) remain the hardest compositional generalization problem.
Quality related metrics. The quality-related results on the common concept and rare concept splits, in Table 2, show that our method gains improvements in CIDEr and SPICE, improving CIDEr from 99.4 to 101.3 and SPICE from 19.9 to 20.2. To further verify to what extent the model improves caption quality, we also test the quality-related metrics on the widely used Karpathy split. As shown in Table 3, our method consistently outperforms the baseline models on most conventional metrics, especially SPICE and CIDEr, in both the CE and RL phases; e.g., the proposed model improves the baseline from 109.9 to 114.3 on CIDEr and from 19.9 to 20.3 on SPICE in the CE phase, and from 123.5 to 125.3 on CIDEr and from 21.4 to 21.5 on SPICE in the RL phase. This is partly because the framework can also be viewed as a general method for leveraging neighbor instances in training: in contrast to the baseline, which conditions only on image features, our method can refer to both the visual and textual features of multiple prototypes, thus making the model refer to more training examples during inference.

Ablation Analysis
Effect of multiple prototype retriever We analyze the effect of the retriever on the recall value in two respects: (1) How many prototypes should be used? (2) What kind of retrieved samples are beneficial?
Change of prototype numbers We compare recall@5 when varying the number of prototypes in both the training and inference stages. As shown in Figure 3, compositional generation ability improves with an increasing number of prototypes, though the gain is marginal when moving from 3 to 5 prototypes. In addition, the model achieves the best performance when using the same number of prototypes in training and inference, for prototype numbers of 2, 3 and 5; using more prototypes at inference does not further improve recall.
Change of prototype retrievers To evaluate how the retriever affects the recall@5 of concept pairs, we compare two retrievers on the common concept split: a random prototype retriever and the retriever used in this work. The random retriever picks three image-caption pairs from the training set at random as prototypes. As shown in Figure 4, using the random retriever in both the training and inference stages yields little improvement over the baselines. This demonstrates that the analogy entity pairs extracted from retrieved prototypes play an important role in improving recall@5.
Comparison between CLIP and VSE We also train a visual semantic embedding model (Faghri et al., 2017).
Table 5: Concept hit of the prototypes retrieved by CLIP (C) and VSE (V); "Other" means verb or attribute (columns: Noun, Other, Combine, Total).
Table 5 shows the hit rate of prototypes retrieved by different cross-modal retrieval models (the VSE model is trained on the training set of the relevant split); e.g., for images containing "black cat" (448), the three prototypes from CLIP can cover "cat" in 420 of the 448 images, while the three prototypes from VSE can cover "cat" in 405 of the 448. Overall, the table shows that the CLIP model has better retrieval capacity than VSE, achieving a better combination hit rate in 5 of the 6 concept pairs. Though both models show similar retrieval performance for nouns, CLIP yields better performance on attributes and verbs.
Effect of Analogical Reasoning We analyze the effect of using analogical reasoning over prototype entity pairs, compared to mean-pooling the entity pair representations as input to the editor. The recall value drops from 10.3 to 7.2 when mean pooling is used, almost the same as the baseline (7.0). This demonstrates that aligning the visual features of the described image with the visual features of the entity pairs is of critical importance for recall@5.

Qualitative Analysis
Case Study We list a few cases to show how our model achieves better generation results through the retrieved prototypes. As shown in Figure 5, the image in the first example retrieves prototypes with similar "red" objects (red lights) and "bus" objects (trolley) and then generates a caption covering the concept "red bus". For the second example, the described image retrieves similar images that include "woman" from an image containing "woman eat", "lie" from an image including "man lie", and "couch" from another picture, thus helping generate a sentence with the concept combination "woman lie". For the last one, we find the "horse eat" combination from "zebra stand" (zebra is categorized as a synonym of "horse") and "cow grazing" (graze is categorized as a synonym of "eat"), helping generate "horse eat". Table 6 shows the hit rate of prototypes, e.g., for images containing "black cat" (448), the three prototypes can cover "cat" in 420 of the 448 images, "black" in 210 of 448, and both "black" and "cat" in 195 of 448 (note that "black" and "cat" may be covered by different prototypes).

Attribute-Noun
(1) Color as the modifier: the attribute-noun pairs with color as the modifier have relatively good generalization performance, as shown in Table 4. Similar to other methods, we find that our model is better at generalizing to descriptions of inanimate objects than animate objects, as inanimate objects are more feature-invariant.
(2) Size as the modifier: the generalization performance for size modifiers remains low for all models. This is because the size modifier has little correlation with bounding box size; for example, a big bird can appear very small in an image because it is viewed from a distance. Size is more object- or context-dependent, e.g., a human has to grasp commonsense knowledge of an average cat before describing a cat as small or large. Meanwhile, people sometimes need to reference other objects in the picture to describe the object of interest with a size modifier. In addition, Table 6 shows that the retriever also fails to retrieve prototypes with the size modifier. It therefore remains a hard problem under this framework.
Noun-verb For these concept pairs, our method achieves a significant increase over the baseline. Table 6 indicates that the hit rate of the three prototypes covering verbs is relatively higher than that for attributes; the increase in composition generalization can be attributed to this higher hit rate.
Rare concepts For the rare concepts, our method consistently improves the concept recall rate. This is because our retriever is capable of retrieving the concept from other training instances, thus upsampling the rare concept and enhancing the generation model with it.

Why re-ranking helps
As illustrated in Table 2, re-ranking a large number of sentences produced by the beam search algorithm significantly increases recall@5. We presume that the gain comes from a debiased decoding objective. The original objective is

ĉ = argmax_c log p(c | x).

To discount the concept-occurrence bias of the training captions, so that the probability of sentences with novel concepts increases, we can add a regularization term log p(c):

ĉ = argmax_c (log p(c | x) − λ log p(c))   (16)
  = argmax_c ((1 − λ) log p(c | x) + λ log p(x | c)),   (17)

where the second line follows from Bayes' rule after dropping the term λ log p(x), which is constant in c. However, directly decoding from Equation 17 is intractable, as the second term p(x | c) can only be computed once caption generation is complete. In practice, we therefore turn to a re-ranking approach: first generate the top-n candidates based on the first term of the objective, then re-rank the top-n list using the second. As training a model to predict p(x | c) is nontrivial, we empirically use the visual semantic similarity score s(x, c) as an alternative³.

Conclusion
We explore a prototype-based generation approach to encourage image captioning models to produce sentences with improved compositional generalization. We design a multi-prototype retriever and an analogical reasoning editor to merge the analogy entity pairs into the generation process. We demonstrate the effectiveness of the model on both composition-related and quality-related evaluation metrics over both the common concept and rare concept splits, and perform detailed analyses of the results. In the future, we will explore this framework with transformer-based decoders.