Fine-grained Image Captioning with CLIP Reward

Modern image captioning models are usually trained with text similarity objectives. However, since reference captions in public datasets often describe only the most salient common objects, models trained with text similarity objectives tend to ignore the specific and detailed aspects of an image that distinguish it from others. Toward more descriptive and distinctive caption generation, we propose using CLIP, a multimodal encoder trained on large-scale image-text pairs collected from the web, to calculate multimodal similarity and use it as a reward function. We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotation. This completely eliminates the need for reference captions during reward computation. To comprehensively evaluate descriptive captions, we introduce FineCapEval, a new dataset for caption evaluation with fine-grained criteria: overall, background, object, and relation. In our experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model. We also show that our unsupervised grammar finetuning of the CLIP text encoder alleviates the degeneration problem of the naive CLIP reward. Lastly, we present a human analysis in which annotators strongly prefer the CLIP reward to the CIDEr and MLE objectives on various criteria. Code and Data: https://github.com/j-min/CLIP-Caption-Reward


Introduction
Describing an image with its detailed and distinguishing aspects is crucial for many applications, such as creating text keys for image search engines and accessibility for the visually impaired. Standard deep learning approaches train an image-conditioned language model by maximizing the textual similarity between generated and reference captions (Vinyals et al., 2015; Xu et al., 2015; Rennie et al., 2017; Anderson et al., 2018). However, the reference captions in public datasets often describe only the most prominent objects in the images. As a result, models trained to maximize textual similarity with reference captions tend to generate less distinctive captions that ignore the fine-grained aspects of an image that distinguish it from others.
To alleviate this problem, we propose to use CLIP (Radford et al., 2021), a multimodal encoder model trained on large image-text data (mostly English) collected from the web, by using its similarity scores as rewards (Sec. 3.1). In addition, we propose a CLIP text encoder finetuning strategy with synthetic negative caption augmentation to improve the grammar of the captioning model, without any extra text annotations (Sec. 3.2). Note that our approach completely eliminates the need for reference captions during reward computation. We illustrate our approach in Fig. 1. To comprehensively evaluate descriptive captions, we also introduce FineCapEval, a new dataset that evaluates captions on diverse aspects: overall, background, object, and relation between objects (Sec. 4).
In our experiments on the MS COCO (Lin et al., 2014) dataset, we show that the captions of models trained with the CLIP reward are more distinctive and contain more detailed information than the captions from CIDEr (Vedantam et al., 2015)-optimized models. CLIP-guided captions even achieve higher text-to-image retrieval performance than the reference captions originally paired with the images. We also show that our text encoder finetuning significantly improves caption grammar by removing degeneration artifacts such as word repetition. In fine-grained caption evaluation on FineCapEval and in human analysis, we show that our CLIP-based rewards outperform text similarity objectives by a large margin in all categories.
Related Works

Objectives for Image Captioning. Standard deep learning-based image captioning approaches train models with a maximum likelihood estimation (MLE) objective. Ranzato et al. (2016) point out that MLE suffers from an exposure bias problem: while language models are trained with ground-truth previous context, during inference they generate words conditioned on the words they have previously generated themselves. To address exposure bias, Bengio et al. (2015) propose a curriculum learning strategy called scheduled sampling. Ranzato et al. (2016) propose to train models by directly maximizing the text similarity between generated and reference captions with REINFORCE (Williams, 1992). Rennie et al. (2017); Luo (2020) propose the self-critical sequence training (SCST) approach, which normalizes rewards to stabilize their high variance.
As illustrated in Fig. 2, the de facto standard reward function for captioning is text similarity between generated and reference captions. Recent studies have found that reference-trained captioning models often neglect important information from images (Wang et al., 2017). Lee et al. (2020b) use the accuracy of a visual question answering model as a reward, encouraging models to generate captions that include sufficient information to answer a visual question. Luo et al. (2018); Liu et al. (2018) use an image-text retrieval model's self-retrieval score as a reward and combine it with n-gram-based metrics, encouraging captioning models to generate captions that are distinctive to each input image.
Note that these works require a careful balance between self-retrieval and text similarity objectives for stable training. In contrast, with the CLIP text encoder finetuning (Sec. 3.2), our approach eliminates the need for reference captions and text similarity metrics in the reward computation.

CLIP-guided Image Captioning
We propose using the CLIP (Radford et al., 2021) image-text similarity score to guide an image captioning model. Following Hessel et al. (2021), we use CLIP-S as our reward: CLIP-S(I, c) = w · max(f_I(I) · f_T(c) / (|f_I(I)| |f_T(c)|), 0), where I and c are the image and caption, f_I and f_T are the CLIP image and text encoders, and w = 2.5. By learning to maximize the image-text similarity of the contrastive model, image captioning models are encouraged to generate captions that contain more distinctive information about the input image. Fig. 1 (a) illustrates this training strategy.
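As a minimal sketch, the CLIP-S reward above can be computed from precomputed image and text features (the features here are placeholders; in practice they come from the pretrained CLIP encoders f_I and f_T):

```python
import numpy as np

def clip_s(image_feat: np.ndarray, text_feat: np.ndarray, w: float = 2.5) -> float:
    """CLIP-S: rescaled cosine similarity between image and caption features,
    clipped at zero (Hessel et al., 2021)."""
    cos = float(image_feat @ text_feat /
                (np.linalg.norm(image_feat) * np.linalg.norm(text_feat)))
    return w * max(cos, 0.0)
```

Because the cosine is clipped at zero, captions that are unrelated (or anti-correlated) to the image receive zero reward rather than a negative one.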
We approximate the gradient of the expected reward for the generated caption ĉ following SCST, where the reward of the beam-search caption is normalized with the baseline reward b from the greedy-decoded caption ĉ_greedy:

∇_θ L(θ) ≈ −(R(I, ĉ) − b) ∇_θ log p_θ(ĉ | I), where b = R(I, ĉ_greedy).

Table 1: FineCapEval annotation examples.

Example 1:
- Background: white house; truck digging soil in front of the house; trees and bushes; house surrounded by a small garden; mini excavator; houses; white and grey building; greenery; two houses; blue and white colored machine
- Object: a blue car; a blue car; black car; car; dozer; white and grey building; greenery; black car; green bushes
- Relation: parked in the front yard; in front; parked in front of; parked; car standing on the road
- Overall: "A blue car parked in the front yard of an off white house with a truck digging soil in front of the house." "A blue car in front of a house surrounded by a small garden with trees and bushes in the background." "A black car parked in front of a house with a mini excavator behind it with other houses in the background." "A car and a dozer parked in front of two white and grey buildings and greenery on both sides." "A black car standing on the road surrounded by green bushes on both sides and two houses and a blue and white colored machine in the background."

Example 2:
- Background: velvet carpet stairs; light-brown colored stairs; off white wall; cream painted walls; cream wall with straight line light
- Object: brown jumpsuit; kid; toy; black jumpsuit; boy; brown clothes; toy; brown carpet; little young boy; cotton carpeted stair; dark brown jumper dress; cream wall
- Relation: with its head on to; touching; hiding; holding; boy holding and playing with the toy; putting; wearing
- Overall: "A child wearing a brown jumpsuit with its head on to the velvet carpet stairs." "A kid is touching their head on a light brown colored stairs." "A kid wearing a black jumpsuit and holding a toy hiding below the stairs with off white wall in the background." "A boy wearing brown clothes holding and playing with his toy and playing on a brown carpet on stairs with cream painted walls." "Little young boy is putting his forehead on the cotton carpeted stair wearing dark brown jumper dress and background of cream wall with straight line light."
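The self-critical reward normalization described above can be sketched as follows, as a simplified per-example loss (in practice this is computed over a batch, and the reward R is the CLIP-S score):

```python
import numpy as np

def scst_loss(token_log_probs, reward, baseline_reward):
    """Self-critical sequence training loss for one caption:
    -(R(c_hat) - b) * sum_t log p_theta(c_hat_t | ...),
    where the baseline b is the reward of the greedy-decoded caption."""
    advantage = reward - baseline_reward
    return -advantage * float(np.sum(token_log_probs))
```

Captions that score above the greedy baseline get their log-probability pushed up (negative loss gradient on those tokens), and captions below the baseline get pushed down.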

Improving Grammar with CLIP Text Encoder Finetuning
Since CLIP is not trained with a language modeling objective, the captioning model trained with the CLIP-S reward often generates grammatically incorrect captions (e.g., repeated words; see Table 3). We inject grammatical knowledge into the CLIP text encoder with negative captions, generated by randomly repeating, removing, inserting, swapping, or shuffling tokens of the reference captions. We provide the implementation details of these operations in the appendix. We introduce a grammar head, a two-layer perceptron that takes the CLIP text feature f_T(c) as input and produces the probability that c is grammatically correct: g(c) ∈ [0, 1]. We use binary cross-entropy for the grammar objective, −y log g(c) − (1 − y) log(1 − g(c)), whose label y is 1 for reference captions and 0 for negative captions. We jointly finetune the text encoder and the grammar head with the sum of the original CLIP objective and the grammar objective. Note that we fix the CLIP image encoder parameters during finetuning. We illustrate the finetuning process in Fig. 1 (b). After finetuning CLIP, we train captioning models with the CLIP-S reward augmented with the grammar score g(ĉ).
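A minimal sketch of the grammar head and its objective, using plain NumPy stand-ins for the CLIP text feature and the learned weights (all shapes and initializations here are illustrative assumptions):

```python
import numpy as np

def grammar_head(text_feat, W1, b1, W2, b2):
    """Two-layer perceptron g(c) in [0, 1] on top of a CLIP text feature."""
    h = np.maximum(text_feat @ W1 + b1, 0.0)   # ReLU hidden layer
    logit = h @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logit))        # sigmoid output

def grammar_bce(g, y):
    """Binary cross-entropy: y = 1 for reference captions, y = 0 for negatives."""
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))
```

During finetuning, reference captions are pushed toward g(c) = 1 and the synthetically corrupted negatives toward g(c) = 0, while the CLIP image encoder stays frozen.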

FineCapEval: Fine-grained Caption Evaluation Dataset
We introduce FineCapEval, a new dataset for caption evaluation in four different aspects. To construct FineCapEval, we collect 500 images each from the MS COCO (Lin et al., 2014) test2015 split and the Conceptual Captions (Sharma et al., 2018) val split. Then, for each image, we ask 5 human annotators to write phrases describing 1) the background, 2) the objects (and their attributes, e.g., color and shape), 3) the relations between objects (e.g., spatial relations), and 4) a detailed caption that includes all three aspects. See the appendix for details of the data collection process. In total, FineCapEval consists of 1,000 images with 5,000 annotations for each of the four criteria. In Table 1, we show samples from the FineCapEval dataset.

Experiments
We compare different reward configurations: MLE, CIDEr, CLIP-S, CIDEr+CLIP-S, and CLIP-S+Grammar. Following previous work, we conduct experiments on the MS COCO (Lin et al., 2014) English captioning dataset with the Karpathy split (Karpathy and Fei-Fei, 2015). We evaluate the models with n-gram-based metrics, embedding-based metrics, text-to-image retrieval scores, and FineCapEval. We also perform a human evaluation on five criteria to understand human preference for the generated captions in various aspects. For the human evaluation, we show pairs of captions, comparing the CLIP-S+Grammar reward (ours) against the CIDEr reward and against the MLE baseline, to human annotators from Amazon Mechanical Turk. We then ask them to select the better caption on five criteria (overall, background, object, attribute, relation). For each of the five criteria, we ask 10 annotators 50 pairwise selection questions, using 50 images from FineCapEval for caption generation.
Results and Discussions

CLIP Guides Distinctive Captions
In Table 2, the models with CLIP-S and CLIP-S+Grammar rewards achieve higher image-text metrics (CLIP-S / RefCLIP-S) and text-to-image retrieval scores than the baselines. Interestingly, their retrieval scores are even higher than those of the reference captions, which shows the distinctiveness of their generated captions. For image (a) in Table 3, our model with the CLIP-S+Grammar reward describes the rainy weather with 'wet', while the model with the CIDEr reward does not.
Our models with CLIP-S and CLIP-S+Grammar rewards score lower on text similarity metrics (n-gram metrics and BERT-S) than the model with the CIDEr reward. However, the low scores on these reference-based metrics can be explained by the fact that models with CLIP-S and CLIP-S+Grammar rewards often generate captions that include fine-grained information that is not even present in the reference captions. For image (b) in Table 3, the CLIP-S+Grammar model describes the 'blue sign' of the restaurant, whereas none of the reference captions mentions it. The degeneration artifacts of the naive CLIP-S reward, such as word repetition, are mitigated by adding the grammar reward (CLIP-S+Grammar). Table 2 shows that adding the grammar reward significantly increases all text similarity metrics (e.g., +60 for CIDEr).

Fine-grained Caption Evaluation
FineCapEval. The four right columns of Table 2 show that CLIP-S and CLIP-S+Grammar significantly outperform CIDEr on all four criteria of FineCapEval: overall, background, object, and relation. The gap is smallest on the object criterion, which implies that MS COCO reference captions describe object information more than background or relations between objects.
Human Evaluation. In the pairwise comparisons, annotators strongly prefer the captions from our CLIP-S+Grammar model to those from the CIDEr and MLE baselines across the five criteria.

Conclusion and Future Directions
We introduce a novel training strategy for image captioning models: maximizing the multimodal similarity score of CLIP, with its text encoder finetuned to improve grammar. The use of the CLIP reward eliminates the need for reference captions, and their biases, in the reward computation. We also introduce FineCapEval, a dataset for fine-grained caption evaluation. We demonstrate the effectiveness of our proposed method through improvements in text-to-image retrieval, FineCapEval, and human evaluation on fine-grained criteria, along with qualitative examples. Future work includes finetuning CLIP reward models with desired writing styles for different applications, and improving the synthetic negative caption augmentation with external data and more advanced linguistic expertise.

Ethical Considerations
The CLIP models that we use are trained on millions of web image-text pairs. Birhane et al. (2021) show that such large-scale datasets often contain explicit and problematic image-text pairs. As the CLIP model card suggests, the use of the CLIP reward to train image captioning models is intended as a research output, and any deployed use case of the models is out of scope. Our captioning models and CLIP models are trained on English datasets, so their use should be limited to English language use cases. As our proposed method is not limited to English and easily extends to other languages, future work will explore extensions to various languages.
In this appendix, we include more image captioning examples with different rewards (Sec. A), implementation details (Sec. B), FineCapEval details (Sec. C), human evaluation details (Sec. D), and the licenses for the datasets and models used in this project (Sec. E).

A More Image Captioning Examples
We provide more image captioning examples using different reward functions in Table 5. Overall, the captions from the model with the CLIP-S+Grammar reward are 1) more descriptive than the captions from the CIDEr model and the reference captions, and 2) more grammatically correct than the captions from the model with the CLIP-S reward.

B Implementation Details
Negative Caption Generation. In Alg. 1, we show the Python implementation of the negative text generation (Sec. 3.2) used for grammar finetuning. In summary, we generate negative captions by applying one of the following operations to the original captions: repeat, remove, insert, swap, or shuffle.
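A sketch of one way to implement these corruption operations (the exact sampling details follow Alg. 1 in the paper; the specifics below are illustrative assumptions):

```python
import random

def make_negative(tokens, rng=random):
    """Corrupt a caption with one random operation:
    repeat, remove, insert, swap, or shuffle tokens."""
    tokens = list(tokens)
    op = rng.choice(["repeat", "remove", "insert", "swap", "shuffle"])
    i = rng.randrange(len(tokens))
    if op == "repeat":                        # duplicate a token in place
        tokens.insert(i, tokens[i])
    elif op == "remove":                      # drop a token (keep at least one)
        if len(tokens) > 1:
            tokens.pop(i)
    elif op == "insert":                      # insert a copy of another token
        tokens.insert(i, rng.choice(tokens))
    elif op == "swap":                        # swap two token positions
        j = rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    else:                                     # shuffle the whole caption
        rng.shuffle(tokens)
    return tokens
```

Each corrupted caption keeps the original vocabulary but breaks word order or word count, giving the grammar head easy-to-generate negative examples with label y = 0.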
Evaluation Scripts. We use pycocoevalcap for MS COCO caption evaluation metrics such as CIDEr. We use the official BERTScore repository with the roberta-large model to calculate BERT-S. We report numbers from a single run (a single weight initialization), as we did not observe meaningful score fluctuations across multiple runs in our initial experiments.

C FineCapEval Details
Data Collection. To create a fine-grained description of each image, we ask annotators to write a caption that describes the target image's 1) background, 2) objects and their attributes (e.g., color, shape), and 3) the relationships between the objects, if any (e.g., spatial relations). Furthermore, we ask the annotators to provide metadata marking which words/phrases in their caption belong to each of the three criteria. We also provide annotators with the following caption-writing guidelines: 1) Write a single sentence describing the image.
2) The image may be a photo, an illustration, or a pure background. 3) Pay close attention to local and global events in the image. 4) Descriptions should be at least ten words per image. 5) Avoid subjective descriptions (e.g., a dog runs "very fast", a man feels "successful"). 6) Avoid named entities such as specific locations (e.g., Eiffel Tower), times (e.g., 4 pm), events (e.g., Halloween), or proper names. 7) When describing people, use man/woman/boy/girl only if clear; otherwise, use person/child. All annotators are hired through a professional crowdsourcing platform, TELUS. The crowdsourcing company obtained consent from the crowdworkers before the annotation process and conducted ethical reviews. We collect English captions, and all annotators are native English speakers living in the US. We pay 5,400 USD in total, covering 1) caption creation (5k samples) and 2) a quality assurance process in which different workers manually examine 50% of the created captions.

Table 5: Image captioning examples with different rewards.

(a)
- CIDEr: a group of boats parked in the water on a lake
- CLIP-S: several rows of boats parked near a canal mountains horizon area and a mountain horizon horizon area horizon ear motion
- CLIP-S+Grammar: a lot of boats parked on the grass next to the lake with the hills behind
- Reference captions: "A blue boat docked on a green lush shore." "A small marina with boats docked there" "a group of boats sitting together with no one around" "Some boats parked in the water at a dock" "boats sitting around the side of a lake by a tree"

(b)
- CIDEr: a zebra standing in the snow next to a brick wall
- CLIP-S: a adult zebra wearing black and grey stripes standing near a brick wall area area with grey stance position stance
- CLIP-S+Grammar: a large black and grey zebra standing together in the snowy ground next to a stone
- Reference captions: "A zebra is standing outside in the snow" "One zebra standing in snow near a stone wall." "A zebra is standing in a snowy field." "A zebra stands in snow in front of a wall." "A zebra standing alone in the snow with a stone block wall and wooden fence behind it."

(c)
- CIDEr: a man riding a bike next to a train
- CLIP-S: older adult male riding a bicycle near a red and commuter train passing a train station motion stance ear stance
- CLIP-S+Grammar: a person walking on a bike next to a red passenger train on the road
- Reference captions: "A man on a bicycle riding next to a train" "A person is riding a bicycle but there is a train in the background." "a red and white train and a man riding a bicycle" "a guy that is riding his bike next to a train" "A man riding a bike past a train traveling along tracks."

(g)
- CIDEr: a window of an airport with planes on the runway
- CLIP-S: several rows of planes parked outside a terminal window area with fog outside a terminal window motion position area motionn
- CLIP-S+Grammar: a lot of airplanes parked on a wet airport terminal
- Reference captions: "An airport filled with planes sitting on tarmacs." "The view of runway from behind the windows of airport." "a truck driving towards some planes parked on the runway" "Planes on a wet tarmac unloading at arrival gates." "Window view from the inside of airplanes, baggage carrier and tarmac."
Word-level Recall R_word. In Alg. 2, we show the Python implementation of the word-level recall R_word. In summary, R_word measures how many words from each reference phrase are included in a generated caption, averaged over the phrases.
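A plausible sketch of R_word matching this description (the official implementation is Alg. 2 in the paper; the whitespace tokenization and lowercasing here are assumptions):

```python
def word_recall(caption: str, reference_phrases) -> float:
    """Average, over reference phrases, of the fraction of each
    phrase's words that appear in the generated caption."""
    caption_words = set(caption.lower().split())
    recalls = [
        sum(w in caption_words for w in phrase.lower().split()) / len(phrase.split())
        for phrase in reference_phrases
    ]
    return sum(recalls) / len(recalls)
```

For example, against the phrases "blue car" and "white house", the caption "a blue car parked in front of a house" fully covers the first phrase and half of the second.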

D Human Evaluation Details
We conduct a pairwise evaluation of human preference, as described in Sec. 5. For each image, we show two captions generated by two models: ours (CLIP-S+Grammar) and a baseline (MLE or CIDEr). A human worker selects the caption that better describes the image in terms of five criteria: overall, background, object, attribute, and relation. For each criterion, we use 50 images from