ECOL-R: Encouraging Copying in Novel Object Captioning with Reinforcement Learning

Novel Object Captioning is a zero-shot Image Captioning task requiring describing objects not seen in the training captions, but for which information is available from external object detectors. The key challenge is to select and describe all salient detected novel objects in the input images. In this paper, we focus on this challenge and propose the ECOL-R model (Encouraging Copying of Object Labels with Reinforced Learning), a copy-augmented transformer model that is encouraged to accurately describe the novel object labels. This is achieved via a specialised reward function in the SCST reinforcement learning framework (Rennie et al., 2017) that encourages novel object mentions while maintaining the caption quality. We further restrict the SCST training to the images where detected objects are mentioned in reference captions to train the ECOL-R model. We additionally improve our copy mechanism via Abstract Labels, which transfer knowledge from known to novel object types, and a Morphological Selector, which determines the appropriate inflected forms of novel object labels. The resulting model sets new state-of-the-art on the nocaps (Agrawal et al., 2019) and held-out COCO (Hendricks et al., 2016) benchmarks.


Introduction
Novel Object Captioning is a zero-shot Image Captioning task where the captions should mention novel objects (i.e., not seen in the training captions), but for which information is available from external object detectors. To produce high-quality captions, the captioning models should select and describe all salient detected objects and avoid mentioning minor or irrelevant details in the input images. As shown in Figure 1, caption A is the best caption among the three because A mentions all salient objects in the images without any unnecessary details, while B mentions Bread, which is just a minor detail, and C misses the salient object Drink. This paper aims to develop a captioning model that produces caption A.
Figure 1: Caption A is the ground-truth caption for the image. Compared with B and C, A is the best caption because it mentions all salient objects (i.e., Hamburger, French Fries and Drinks). We use Abstract Labels, that is, hypernyms of the objects' detected object labels, in the object representations, transferring knowledge from the objects seen in the training captions to novel objects. Our copy mechanism also selects appropriate inflected forms of object labels (i.e., Hamburgers vs. Hamburger).
We use an advanced copy mechanism to effectively integrate novel objects. We follow the setup in Agrawal et al. (2019) and use two object detectors: one providing rich object visual features and another providing task specific (including novel) object labels as copy candidates. Our preliminary experiments show that the copy mechanism is infrequently triggered and unable to mention many salient objects in the input images. We propose the ECOL-R model (Encouraging Copying of Object Labels with Reinforced Learning), a copy-augmented transformer model trained in the Self-Critical Sequence Training (SCST) framework (Rennie et al., 2017). SCST with a CIDEr reward (Vedantam et al., 2015) is a standard approach for training captioning models (Anderson et al., 2018b), but this paper will show that it does not sufficiently encourage the model to use copy operations. We design a new reward function that provides a reward for each copy operation proportional to the caption quality. We further restrict the SCST training to the images that contain at least one word in the ground truth captions that corresponds to one of the detected object labels. With these innovations, the ECOL-R model outperforms an SCST baseline and a strong inference encouragement baseline by a large margin.
Our copy mechanism and caption generator incorporate two enhancements to better choose and incorporate novel objects: a) Abstract Labels which correspond to hypernyms of the object labels and facilitate knowledge transfer between objects appearing in training captions and novel objects; b) a Morphological Selector which determines the correct inflected form of the copied task specific object labels which is similar in purpose to that proposed in (Lu et al., 2018b).
We evaluate the ECOL-R model on the novel object captioning benchmark nocaps (Agrawal et al., 2019) and held-out COCO (Hendricks et al., 2016). The ECOL-R model achieves a new state of the art on both benchmarks and generalizes well to in-domain images.

Related Work
Popular Image Captioning models include LSTM-based (Anderson et al., 2018b) and Transformer-based decoders (Herdade et al., 2019; Cornia et al., 2020). The visual encoders are often neural object detectors (Anderson et al., 2018b) producing Region-of-Interest (ROI) vectors. To train the model to copy novel object labels, the Neural Baby Talk model (NBT) (Lu et al., 2018a) and follow-up work (Wu et al., 2018; Yao et al., 2017; Li et al., 2019) use copy mechanisms (Vinyals et al., 2015). The copying candidates are labels of salient objects produced by external object detectors. In this paper, we follow previous work by using the Visual Genome object detector from (Anderson et al., 2018b) as the visual feature extractor and a task specific object detector to provide object labels for copying.
These models are typically trained with the Cross-Entropy loss (CE). This creates a mismatch between the training and testing environments because the evaluation metrics are non-differentiable text-based measures (Ranzato et al., 2015). Self-Critical Sequence Training (SCST) (Rennie et al., 2017) was proposed to address this issue by directly optimizing the inference output using caption-level rewards, such as CIDEr-D (Vedantam et al., 2015).
There are two existing novel object captioning benchmarks: a) the held-out COCO Benchmark (Hendricks et al., 2016), constructed by excluding images containing one of eight selected object classes from the standard COCO 2014 benchmark, and b) nocaps (Agrawal et al., 2019), which uses the COCO 2017 benchmark for training and provides new validation and test images from the Open Images Dataset with over 400 novel objects. Both benchmarks are object-centric, and there are no reliable benchmarks that systematically evaluate the quality of generated actions or attributes.
Figure 2 provides an overview of the ECOL-R model. We refer to the ECOL-R model without SCST training as ECOL. We describe this model in Sec. 3.1 and our novel reinforced copy encouragement training in Sec. 3.2.

The ECOL Model
Input Image Objects: Following the setup in Agrawal et al. (2019), we use two object detectors: the Visual Genome object detector from Anderson et al. (2018b), producing image objects and regions $G$ (represented by embedding vectors $[x^g_1, \dots, x^g_{k_g}]$) with detailed visual features; and a task specific object detector, producing image objects $F$ and their corresponding labels $L_f = [l_1, \dots, l_{k_f}]$ used as copy candidates during caption generation. We introduce the object representations $x_i$ below and define them in Eq. 1.
Image Object Representations: Following Anderson et al. (2018b); Lu et al. (2018a), we represent both sets of objects with Region-Of-Interest (ROI, $r_i \in \mathbb{R}^{2048}$) vectors from the Visual Genome object detector and object positional features ($p_i \in \mathbb{R}^{8}$), including bounding box coordinates and size, and an object label confidence score. In addition, to transfer knowledge from the seen objects to the novel ones, we propose Abstract Labels for the task specific objects, described below.
Figure 2: Overview of the ECOL-R Model. $X$ is the concatenated object representations $G$ and $F$ from the two object detectors. The Transformer encoder produces $H$ and the decoder provides $h_t$ at step $t$. We then estimate the probabilities for generating each vocabulary word (yellow box) and copying from each task specific image object (green box). The results are concatenated and jointly softmaxed (red box). We refine each copy probability into the concrete inflected word probability in the M Selector. The final output $P(y_t)$ concatenates all of the above word probabilities.
Abstract Labels: The task specific object detectors we use provide taxonomies of object classes, and every detected object is assigned a label from that taxonomy. More general object classes conceptually include all the labels lower in the taxonomy. This provides us with a mechanism for associating class labels not present in the training data with those that do occur in the training data by mapping them to a common ancestor in the hierarchy. Inspired by Ciaramita and Johnson (2003), we define Abstract Labels to be a fixed set of ancestor class labels that spans the entire taxonomy (see Figure 3). Using the abstract labels to drive copy decisions allows the usage of known object types to inform the word generation of novel objects. Each object from the task specific detector is associated with its nearest abstract label ancestor. We choose the set of abstract labels such that the objects in the training data are evenly distributed across the set of abstract labels. We represent abstract labels with trainable embeddings $e_i \in \mathbb{R}^{d}$, where $d$ is the hidden size of our base model. We use the Open Images V4 class hierarchy for the nocaps benchmark and a merged hierarchy of 8 COCO super-categories (Lin et al., 2014) for the held-out COCO benchmark. The final representation for each object $x_i$ is:
$$x_i = \mathrm{LN}(W_r r_i + e_i) + \mathrm{LN}(W_p p_i) \qquad (1)$$
where LN is layer normalization and $W_r \in \mathbb{R}^{d \times 2048}$, $W_p \in \mathbb{R}^{d \times 8}$ are trainable projections. The two sets of object representations are concatenated as $X = F \oplus G$, where $\oplus$ represents concatenation.
Figure 3: A part of the class hierarchy from the Open Images V4 Dataset (Kuznetsova et al., 2018). The green nodes are used as abstract object labels. For each label, its abstract label is its closest green ancestor.
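The nearest-abstract-ancestor lookup described for Abstract Labels can be sketched as a walk up the class hierarchy. This is a minimal illustration only: the toy taxonomy, the chosen abstract set and the function name `abstract_label` are hypothetical and are not taken from the Open Images V4 hierarchy or from the authors' code. The sketch also assumes that a label which is itself an abstract label maps to itself.

```python
from typing import Dict, Optional, Set

# Hypothetical toy taxonomy (child -> parent). The real hierarchy would be
# the Open Images V4 class hierarchy; these names are illustrative only.
PARENT: Dict[str, Optional[str]] = {
    "Entity": None,
    "Animal": "Entity",
    "Bird": "Animal",
    "Ostrich": "Bird",
    "Food": "Entity",
    "Fast food": "Food",
    "Hamburger": "Fast food",
}

# Hypothetical fixed set of abstract labels (the "green nodes" of Figure 3).
# Including the root guarantees that every label has an abstract ancestor.
ABSTRACT: Set[str] = {"Entity", "Animal", "Food"}


def abstract_label(label: str) -> str:
    """Walk up the taxonomy until the closest abstract label is reached."""
    node: Optional[str] = label
    while node is not None:
        if node in ABSTRACT:
            return node
        node = PARENT[node]
    raise ValueError(f"no abstract ancestor for {label!r}")


# A novel label shares its abstract label with any seen label in the same subtree.
ostrich_abstract = abstract_label("Ostrich")
hamburger_abstract = abstract_label("Hamburger")
```

Because the embedding $e_i$ is indexed by the abstract label rather than by the fine-grained label, a novel class such as Ostrich would reuse an embedding that was trained on other labels under the same ancestor.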
Transformer Base Model: We use a transformer model (Vaswani et al., 2017) with an $N_{enc}$-layer encoder and an $N_{dec}$-layer decoder ($N_{enc} = N_{dec} = 3$ in our experiments). We denote the encoder output $H = \mathrm{Encoder}(X)$. The decoder uses frozen word and positional embeddings WE and PE from GPT2 (Radford et al., 2019), which are helpful in producing better captions describing novel objects. In step $t$:
$$w_{1:t-1} = \mathrm{WE}(y_{1:t-1}) + \mathrm{PE}(y_{1:t-1}) \qquad (2)$$
$$h_t = \mathrm{Decoder}(H, w_{1:t-1}) \qquad (3)$$
where $y_{1:t-1}$ is the generation history and $h_t \in \mathbb{R}^{d}$.
Outputs With Copy Mechanism: The ECOL model either generates words from the vocabulary or copies from task specific objects. We deploy a copy mechanism similar to a dynamic pointer network. Given the decoder output $h_t$, we first calculate a raw score for each vocabulary word:
$$\mathrm{VScore}(h_t) = W_e h_t$$
where $W_e \in \mathbb{R}^{|V| \times d}$ and $|V|$ is the GPT2 vocabulary size. We then calculate raw additive attention scores, $\mathrm{OScore}(H, h_t)$, over the encoder output of task specific image objects (i.e., $H_{1:k_f}$). Finally, we concatenate the raw scores from VScore and OScore and jointly softmax:
$$v_t \oplus c_t = \mathrm{Softmax}\big(\mathrm{VScore}(h_t) \oplus \mathrm{OScore}(H, h_t)\big)$$
where $\oplus$ represents concatenation. $v_t$ provides probabilities for GPT2 vocabulary words and $c_t$ provides probabilities for copying task specific object labels.
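The joint softmax over vocabulary and copy scores can be illustrated with a small, dependency-free sketch. The function name `joint_softmax` and the raw scores are made up for illustration; in the model the scores would come from VScore and OScore, whose exact parameterisation is not reproduced here.

```python
import math
from typing import List, Tuple


def joint_softmax(vocab_scores: List[float],
                  object_scores: List[float]) -> Tuple[List[float], List[float]]:
    """Softmax over the concatenation of vocabulary and object scores.

    Returns (v_t, c_t): probabilities for generating each vocabulary word and
    for copying each task specific object label. Together they sum to 1.
    """
    scores = vocab_scores + object_scores      # concatenation
    m = max(scores)                            # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    n_vocab = len(vocab_scores)
    return probs[:n_vocab], probs[n_vocab:]


# Toy raw scores: 3 vocabulary words and 2 detected objects (made-up numbers).
v_t, c_t = joint_softmax([2.0, 0.5, 0.1], [1.0, 0.2])
```

Because the two score vectors share one normaliser, raising a vocabulary score necessarily lowers every copy probability, which is why a well-trained vocabulary head can crowd out copy operations.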
Morphological Selector: Object labels can appear in inflected forms in captions. For example, in Figure 1, after selecting the object hamburger, the ECOL model should generate "hamburgers" after "Two". We propose a morphological selector (M Selector) to refine the copy probability of each task specific image object label $l_i$ (i.e., $c_{t,i}$) into the probabilities of generating all possible morphological forms $y^{l_i}_t$ (i.e., $P(y^{l_i}_t \mid l_i)$). Specifically, we use $h_t$ to choose an inflected form from its possible inflected forms (e.g., Singular or Plural in English):
$$P(y^{l_i}_t \mid l_i) = \mathrm{Softmax}(W_{l_i} h_t)$$
Here $W_{l_i} \in \mathbb{R}^{s_i \times d}$, where $s_i$ is the number of inflected forms of label $l_i$ (in most cases 2 for English, singular and plural). Finally, the ECOL model concatenates the above refined probabilities as follows:
$$P(y^{l_i}_t) = c_{t,i} \cdot P(y^{l_i}_t \mid l_i), \qquad P(y^{v}_t) = v_t$$
$$P(y_t) = P(y^{v}_t) \oplus P(y^{l_1}_t) \oplus \dots \oplus P(y^{l_{k_f}}_t)$$
Model Application Scope: In this paper, we focus on the Novel Object Captioning task. However, in general, our copy mechanism is capable of copying any type of information. The Abstract Label approach is general to zero-shot learning problems where novel items share characteristics with training instances. The Morphological Selector is also applicable to linguistic copy mechanisms in other contexts, such as Commonsense Reasoning (Lin et al., 2020), where copied terms may require linguistic alignment with the generated text.
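The refinement performed by the Morphological Selector can be sketched as splitting one label's copy probability across its inflected forms. This sketch assumes the refined probability is the product of the copy probability $c_{t,i}$ and a softmax over the forms; the function name `refine_copy_prob` and all numbers are hypothetical, and `form_scores` merely stands in for $W_{l_i} h_t$.

```python
import math
from typing import Dict, List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def refine_copy_prob(copy_prob: float,
                     forms: List[str],
                     form_scores: List[float]) -> Dict[str, float]:
    """Distribute one label's copy probability over its inflected forms.

    form_scores stands in for W_{l_i} h_t: one raw score per inflected form.
    """
    form_probs = softmax(form_scores)
    return {form: copy_prob * p for form, p in zip(forms, form_probs)}


# Made-up numbers: the label "hamburger" has copy probability 0.4 and the
# decoder state favours the plural form (for example after the word "Two").
refined = refine_copy_prob(0.4, ["hamburger", "hamburgers"], [0.0, 2.0])
```

Under this assumption the refined probabilities of a label always sum back to its copy probability, so the final concatenated output remains a valid distribution.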

Copying More Object Labels
In this paper, we encourage the copying of object labels by using a suitable reward function in the Self-Critical Sequence Training (SCST) framework, which has proven effective for image captioning tasks. Compared with injecting additional loss terms together with the standard CE loss, using the SCST framework allows us to design arbitrary encouragement signals based on the inference output. It minimizes the negative expected reward score:
$$L_R(\theta) = -\mathbb{E}_{y_{1:T} \sim p_\theta}\left[r(y_{1:T})\right]$$
where $r$ is the reward function and $p_\theta$ represents the model's outputs. In this paper, following Cornia et al. (2020), we first pre-train the ECOL model with the CE loss, then switch to fine-tuning the ECOL model with the above SCST loss.
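In practice the expectation in the SCST objective is estimated from sampled captions with a baseline subtracted from the reward. The sketch below assumes the greedy-decoding baseline of Rennie et al. (2017); the text does not state which baseline ECOL-R uses, and the function name `scst_loss` and the numbers are illustrative only.

```python
from typing import List


def scst_loss(sample_logprobs: List[float],
              sample_reward: float,
              baseline_reward: float) -> float:
    """Single-caption SCST surrogate loss.

    sample_logprobs: log-probabilities of the tokens of a sampled caption.
    sample_reward:   reward r(.) of the sampled caption (e.g. CIDEr-D).
    baseline_reward: reward of the greedily decoded caption (assumed baseline).
    """
    advantage = sample_reward - baseline_reward
    # Minimising this raises the likelihood of captions that beat the baseline
    # and lowers the likelihood of captions that fall below it.
    return -advantage * sum(sample_logprobs)


# Same sampled caption, scored above and below the baseline (made-up rewards).
loss_good = scst_loss([-0.5, -1.0, -0.2], sample_reward=1.2, baseline_reward=0.9)
loss_bad = scst_loss([-0.5, -1.0, -0.2], sample_reward=0.6, baseline_reward=0.9)
```

Only the reward enters through `advantage`, which is why swapping the reward function (as done in the following paragraphs) requires no change to the training loop itself.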
Inference Bias Baseline: We add an Inference Bias (IB) $b \in \mathbb{R}^+$ to increase $P(y^{l_i}_t)$ at inference time. Eq. 9 is changed to include $b$, and the remaining probabilities are normalised accordingly. IB is functionally equivalent to adjusting the threshold for the copy decision during inference. Surprisingly, this simple inference trick provides a strong baseline (see Table 3). This shows that after the CE training, many correct copy operations are assigned low probabilities compared to the fixed vocabulary items. However, we believe that it is better to train the model to increase the probabilities of these copy operations than to add ad hoc adjustments at inference time.
Prior work (e.g., Anderson et al., 2018b) uses CIDEr as the standard reward function in SCST optimization. This suggests that the problem of overfitting of SCST training with the CIDEr reward is minimal. Intuitively, the CIDEr reward is positively correlated with the number of salient object label mentions and should encourage the model to copy salient novel object labels. However, CIDEr equally rewards the generation of object labels present in the training data via the vocabulary, $P(y^{v}_t)$, and via copy operations, $P(y^{l_i}_t)$. Novel object labels, however, can only be generated by copy operations (see Sec. 3.1); thus the CIDEr reward function does not sufficiently encourage these operations. We propose two orthogonal modifications to the standard SCST algorithm to address this issue:
Novel Encouragement Reward: We propose combining the standard CIDEr-D reward with a reward function that gives captions with words copied from object labels an extra bonus, which is intended to encourage copy operations. One straightforward way to implement this idea is to provide a constant bonus for each triggered copy operation:
$$R_a(X) = \text{CIDEr-D}(X) + a \cdot C$$
where $X$ is a generated caption, $C$ is the number of copy actions in the caption $X$ and $a \in \mathbb{R}^+$ is a fixed hyper-parameter. We refer to this as the additive bias.
Optimizing with the additive bias, the captioning model only needs to trigger the copy operation for arbitrary objects at arbitrary generation steps. That is, the reward may encourage copying object labels at the expense of caption quality (i.e., high CIDEr-D scores). Therefore, we propose a proportional bias that assigns different rewards to the copy operations in different images by making a connection between the copy bonus and the generated caption's CIDEr-D score:
$$R_p(X) = \text{CIDEr-D}(X) \cdot (1 + p \cdot C)$$
where $p \in \mathbb{R}^+$ is a fixed hyper-parameter. Although $R_a$ can effectively encourage the model to copy objects, it may introduce noisy object mentions. $R_p$ penalizes those noisy object mentions via the low caption CIDEr score.
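The difference between the additive and proportional biases is easiest to see on numbers. The sketch below assumes the forms $R_a = \text{CIDEr-D} + a \cdot C$ and $R_p = \text{CIDEr-D} \cdot (1 + p \cdot C)$; the proportional form in particular is an assumption inferred from the description of a copy bonus tied to the caption's CIDEr-D score. Function names and scores are hypothetical.

```python
def additive_reward(cider: float, num_copies: int, a: float = 0.3) -> float:
    """R_a: a constant bonus `a` for every copy operation."""
    return cider + a * num_copies


def proportional_reward(cider: float, num_copies: int, p: float = 0.3) -> float:
    """R_p (assumed form): the per-copy bonus scales with the caption's CIDEr-D."""
    return cider * (1.0 + p * num_copies)


# Two hypothetical captions with 2 copy operations each: one poor, one good.
poor_cider, good_cider, copies = 0.1, 1.0, 2

# Bonus earned by copying, i.e. reward minus the plain CIDEr-D score.
bonus_a_poor = additive_reward(poor_cider, copies) - poor_cider
bonus_a_good = additive_reward(good_cider, copies) - good_cider
bonus_p_poor = proportional_reward(poor_cider, copies) - poor_cider
bonus_p_good = proportional_reward(good_cider, copies) - good_cider
```

With these numbers the additive bias pays the same bonus (0.6) to both captions, whereas the proportional bias pays 0.06 to the poor caption and 0.6 to the good one, so copying into a low-quality caption earns almost nothing.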
Visual Object Aligned (VOA) Images: VOA Images refers to the set of training images where the reference captions contain at least one word from retained object labels. During SCST training, images that contain no object label words (i.e., non-VOA images) will not utilise copy operations, thus these images encourage the model NOT to copy. VOA images account for approximately 70% of the full COCO-2017 training images set. Although restricting training to VOA images can be done with arbitrary models, this may hurt the diversity of generated captions. Experiments in Table 3 confirm that restricting to VOA images only improves performance when used with SCST training.
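Selecting VOA images amounts to a simple membership test between reference-caption words and retained object label words. The sketch below is a minimal illustration with hypothetical data and names (`is_voa`, `images`); it matches single-word labels only, so multi-word labels such as "french fries" would need phrase matching that is not shown.

```python
import re
from typing import Dict, List, Set, Tuple


def is_voa(reference_captions: List[str], label_words: Set[str]) -> bool:
    """True if any reference caption contains at least one object label word."""
    for caption in reference_captions:
        tokens = set(re.findall(r"[a-z]+", caption.lower()))
        if tokens & label_words:
            return True
    return False


# Hypothetical data: image id -> (reference captions, retained label words,
# including inflected variants of each label).
images: Dict[str, Tuple[List[str], Set[str]]] = {
    "img1": (["Two hamburgers on a plate."], {"hamburger", "hamburgers"}),
    "img2": (["A man walking down a street."], {"ostrich", "ostriches"}),
}

# Only these images would be used during SCST fine-tuning.
voa_ids = [image_id for image_id, (caps, labels) in images.items()
           if is_voa(caps, labels)]
```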

Hyper-Parameters For Copy Encouragement:
The above approaches introduce two additional parameters: a and p. In our experiments, a and p range over 0.2, 0.3 and 0.4; we found that 0.3 works the best for both reward functions. Combined with restricting SCST training to VOA images, R p works better than R a and sets a new SOTA for novel object image captioning.

Experiments
We conduct experiments on the nocaps (Agrawal et al., 2019) and the held-out COCO (Hendricks et al., 2016) benchmarks. We set the layer and embedding size to $d = 768$ and use Adam optimisation (Kingma and Ba, 2014). We train our models for 15 epochs with batch size 100 for the CE loss and for 15 epochs with batch size 10 for the SCST loss.

Evaluation Metrics
We use CIDEr (Vedantam et al., 2015), SPICE (Anderson et al., 2016a) and METEOR (Banerjee and Lavie, 2005) to evaluate the caption quality. CIDEr measures the similarity between the reference captions and generated outputs using tf-idf weighted n-gram overlap. SPICE is based on scene graph matching between the reference captions and generated outputs. METEOR focuses on the alignment between the words in reference captions and generated outputs, with an aim of 1:1 correspondence.
To measure the effectiveness of our copy encouragement approach, we report object F1 (Anderson et al., 2017) on the held-out COCO Benchmark. As the nocaps benchmark does not release its ground-truth captions, we instead report the average number of mentioned objects (Ave. O) and the CIDEr score for dummy captions that only contain copied object words (Object CIDEr, OC.; see the Appendix for details).

Comparison with the State-of-the-art
We compare our models ECOL + IB and ECOL-R with other state-of-the-art systems in Tables 1 and 2.
On the nocaps benchmark (Table 1), our models outperform previous work, including the recently proposed OSCAR L + CBS + SCST model (Li et al., 2020), which is fine-tuned from the BERT-LARGE model (Devlin et al., 2019), by 2.0 CIDEr and 0.9 SPICE, and set a new state of the art. Compared with the OSCAR L model, our models use far fewer parameters (60M vs. 340M) and outperform OSCAR L on both the CIDEr and SPICE metrics. We train our model for about 10 hours for the CE loss and 24 hours for the SCST loss using a single Nvidia P100 GPU. As a comparison, the OSCAR L model, which is fine-tuned from BERT-LARGE, uses 60-90 hours for training with the CE loss and 60-90 hours for training with the SCST loss. This shows that simply deploying a BERT-based language model is not sufficient for the Novel Object Captioning task.
On the held-out COCO benchmark (Table 2), the ECOL-R model produces more novel objects (+13.3 Object F1) and higher quality captions (+3.9 CIDEr on the out-of-domain split) than the ECOL model with run-time Inference Bias. Compared with previous work, the ECOL-R model achieves 10.9 CIDEr and 1.9 SPICE higher in the out-of-domain split, and 21.2 CIDEr and 2.8 SPICE higher in the in-domain split, with the highest object F1. This shows that our copy encouragement approach successfully trains our model to correctly copy more novel objects and to produce high-quality captions. Compared with PS3 (Anderson et al., 2018a) and the FDM-net model (Cao et al., 2020), which are trained on extra images containing novel objects and scene graphs, our models still outperform the PS3 model and score 13.9 CIDEr higher than FDM-net. We set a new state of the art on this benchmark without additional novel object information.
Table 3 presents ablation results for various ECOL-R components, including our copy encouragement approach. Table 4 shows that our encouragement of copying in the ECOL-R model does not benefit from additional Inference Bias. Table 5 shows the effect of Abstract Labels and the Morphological Selector in the ECOL-R model. Finally, Table 6 confirms the ECOL-R model's generalization ability for in-domain COCO images.

ECOL-R Components:
The ECOL model produces better captions using the frozen GPT2 parameters (row 1 vs. 2). Our copy mechanism (C) helps the model to explicitly integrate novel objects, substantially improving the out-of-domain split by 15.3 CIDEr and 0.3 SPICE (row 2 vs. 3). The Inference Bias (IB) introduces a noticeable performance improvement of 8.4 CIDEr and 0.3 SPICE (row 3 vs. 4) in models that do not use our reinforcement learning approach. The ECOL model trained with the standard SCST reward function obtains an overall 8.1 CIDEr improvement, but most of the improvement is from the in-domain and near-domain splits (row 8 vs. 6). Compared with the ECOL + IB model, the ECOL model trained with the standard SCST algorithm is 8.1 CIDEr lower in the out-of-domain split (row 5 vs. 4). As discussed in Sec. 3.2, standard SCST cannot provide sufficient copy encouragement, as object words can be generated via either pathway (fixed vocabulary or copy). Optimizing either the $R_a$ or the $R_p$ reward function improves the ECOL + CIDEr model by 7.0 CIDEr and 7.8 CIDEr respectively (row 7 and 5). $R_a$ achieves 3.7 CIDEr higher than $R_p$ in the out-of-domain split. Interestingly, after restricting the model training to VOA images, $R_p$ achieves a 7.8 CIDEr improvement in the out-of-domain split (row 8 vs. 7), outperforming the ECOL + $R_a$ w/ VOA model by 1.4 CIDEr (row 10 vs. 8).
Effectiveness of Copy Encouragement: We directly measure the copy quantity by counting the number of copied object labels and by Object CIDEr. Rows 5 and 3 confirm that the standard SCST algorithm has little impact on the copy quantity (only +0.1 object per image and +1.4 Object CIDEr). Inference Bias (IB) and the $R_a$ and $R_p$ rewards substantially increase the number of copied objects (row 4, 6, 7 vs. 3). Among these three components, the models trained with $R_a$ and $R_p$ work better than the IB baseline (row 6, 7 vs. 4). The model trained with the $R_a$ reward copies more objects than the one trained with the $R_p$ reward, especially when training with all training images. This is because the $R_a$ reward assigns a constant positive reward to all copied objects. However, such a naive reward appears to encourage noisy copying operations (i.e., copying non-salient objects). As a result, the ECOL + $R_a$ model performs worse than the ECOL + $R_p$ model in terms of caption quality (row 7 vs. 6). After restricting training to VOA images, the models trained with $R_a$ and $R_p$ copy a similar number of objects, but the model with $R_p$ produces better captions than the one with $R_a$, especially in the out-of-domain split (row 10 vs. 8). The $R_p$ reward maintains a good balance between copying more objects and high caption quality.
Are The VOA images Always Useful? Restricting training to the VOA images can be done with any captioning models. However, this does not necessarily encourage copy operations and improve the output caption quality. When we restrict training to VOA images, the ECOL-R model performs consistently worse in all three splits compared to our proposed training scheme (row 9 vs. 10). The only difference is that the ECOL model is not trained with diverse images during the cross-entropy stage. That is, restricting to VOA images is only suitable for fine-tuning in the SCST stage.

Sufficient Encouragement For Copy:
Here we investigate whether our ECOL-R model mentions a sufficient number of salient objects. We apply an increasing amount of inference bias to the ECOL, ECOL + CIDEr and ECOL-R models in Table 4. We note that only the ECOL-R model is negatively impacted (measured by CIDEr score) by the different Inference Bias values. This shows that the ECOL-R model does not benefit from further copy encouragement.

Figure 4 (three example images; × marks the captions flagged in the figure):
Image 1. ECOL-R: A bathroom with a shower curtain and a toilet. ECOL + IB: A white bath tub sitting next to a white toilet. × GT: The bathtub is white and has a white shower curtain.
Image 2. ECOL-R: An ostrich and a deer standing in a field. ECOL + IB: Two ostriches and a deer in a grassy field. × GT: An ostrich standing in grass with a few deer in the background.
Image 3. ECOL-R: A red door of a red house with a red phone. × ECOL + IB: A red telephone booth sitting next to a brick wall. GT: A red phone booth is standing against a brick wall.

As SPICE is sensitive to long-range object word relationships, such as attributes and predicate words (Anderson et al., 2016a), Abstract Labels and the M Selector improve the semantic coherence and fluency of the captions. The performance gap in the ECOL-R model becomes smaller. Our copy encouragement approach contributes to the generation coherency and fluency.
We compare our models with the novel object captioning models (NBT + CBS and Up-Down + ELMo + CBS) reported in Agrawal et al. (2019) in Table 6. Both of our models outperform the Up-Down and NBT models by a large margin. Our models produce high-quality captions for images with novel objects as well as known objects.

Qualitative analysis on nocaps
Qualitative analysis on the nocaps validation set reveals that the ECOL-R model mentions the salient object in the input image (first example in Figure 4) and is able to generate more accurate descriptions of novel objects (second example in Figure 4), but may generate inaccurate captions due to non-informative detected object labels (third example in Figure 4). In summary, the ECOL-R model is better at incorporating detected image objects into generated captions than the ECOL + IB model.

Conclusion and Future work
This paper proposes the ECOL-R model that includes a training scheme to encourage copying novel object labels using Reinforced Learning. Our experiments show that the ECOL-R model successfully integrates novel object information and achieves state-of-the-art performance on two Novel Object Caption Benchmarks. In the future, we plan to extend our SCST reward function to other metrics such as SPICE (Anderson et al., 2016b) and BertScore (Zhang et al., 2020).

Appendices A Model Details
The hyper-parameters of the ECOL-R model are shown in Table 7. This architecture is essentially that of Cornia et al. (2020). We only change the hidden size of the model to 768 to fit the size of GPT2 (the smallest version). Our model has a total of $60.8 \times 10^6$ parameters, of which $43.0 \times 10^6$ are trainable. This scale is slightly smaller than the Transformer Base model ($65.8 \times 10^6$) (Vaswani et al., 2017). We optimise with Adam ($\alpha = 0.9$, $\beta = 0.98$, $\epsilon = 10^{-9}$) (Kingma and Ba, 2014) and clip gradients to 0.1 for both benchmarks. In cross-entropy training, we vary the learning rate over the course of training using a heuristic defined in terms of the step number $S$ and the number of warm-up steps $W$. We set $W$ to 20000 steps for the nocaps Benchmark and 10000 steps for the held-out COCO Benchmark. The number of warm-up steps has some impact on both benchmarks; we tried 20,000 and 10,000 for both. For SCST training, we set the initial learning rate to $10^{-6}$ and halve it if the reward metric (validation set CIDEr) does not improve for 3 evaluations. We conduct an evaluation every 3000 steps.
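The two learning-rate rules can be sketched as follows. The warm-up function assumes the heuristic is the inverse-square-root warm-up schedule of Vaswani et al. (2017), which the text does not confirm; the plateau rule follows the stated "halve after 3 evaluations without improvement". Function names are hypothetical.

```python
from typing import List


def warmup_lr(step: int, warmup: int, d_model: int = 768) -> float:
    """Assumed schedule: linear warm-up for `warmup` steps, then 1/sqrt(step) decay."""
    step = max(step, 1)  # avoid raising zero to a negative power at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


def halve_on_plateau(lr: float, history: List[float], patience: int = 3) -> float:
    """Halve `lr` if none of the last `patience` scores beats the earlier best.

    history: validation CIDEr after each evaluation, oldest first.
    """
    if len(history) > patience and \
            max(history[-patience:]) <= max(history[:-patience]):
        return lr / 2
    return lr


# With W = 20000 the assumed schedule peaks exactly at the end of warm-up.
peak = warmup_lr(20000, 20000)
# Three evaluations in a row without beating the earlier best of 1.1.
scst_lr = halve_on_plateau(1e-6, [1.0, 1.1, 1.05, 1.0, 1.1])
```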
We use PyTorch 1.4.0 to implement our model. The cross-entropy training takes about 8 hours and the SCST optimization takes about 15 hours on a single NVIDIA Tesla P100 GPU.
Our source code is submitted in the Software. We set up an anonymous Google Drive to host large files.

A.1 Input Object Detector
We follow the processing of input objects in Agrawal et al. (2019). We observed that some object categories are frequently mentioned in the training captions and that they often have variable, context-sensitive verbalisation (e.g., a person might be described as a sports player, a student, etc., depending on the context). For those objects, vocabulary based word generation often did a better job at selecting the correct verbalisation due to their frequency in training captions. On the other hand, novel objects typically have lower-frequencies and a fixed, single verbalisation. For example, elephants are usually only referred to with the word elephant. For this reason, we remove objects with high-frequency in training captions from the output of the task specific object detector, leaving their corresponding words to be generated via vocabulary softmax. We also remove the more abstract objects (higher in the object class hierarchy) when object regions overlap. Finally, we keep only one detected object for each label (the one with highest confidence score). We provide the downloadable link of filtered results in Sec B. We use exactly the same Visual Genome objects as described in Anderson et al. (2018b). The Visual Genome object detector (Anderson et al., 2018b) can produce ROI vectors for arbitrary bounding boxes, hence we use it also to produce ROI vectors for objects from the task specific detector.
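Two of the filtering steps above (dropping high-frequency labels and keeping one detection per label) can be sketched in a few lines. The function name `filter_detections`, the label set and the confidence values are hypothetical, and the removal of overlapping, more abstract objects is omitted because it needs box geometry and the class hierarchy.

```python
from typing import Dict, List, Set, Tuple

Detection = Tuple[str, float]  # (label, confidence score)


def filter_detections(detections: List[Detection],
                      frequent_labels: Set[str]) -> List[Detection]:
    """Drop high-frequency labels, then keep the most confident box per label."""
    best: Dict[str, float] = {}
    for label, conf in detections:
        if label in frequent_labels:
            continue  # left to be generated via the vocabulary softmax
        if conf > best.get(label, float("-inf")):
            best[label] = conf
    # Return copy candidates ordered by decreasing confidence.
    return sorted(best.items(), key=lambda item: -item[1])


# Made-up detector output for one image.
candidates = filter_detections(
    [("person", 0.9), ("elephant", 0.6), ("elephant", 0.8), ("tree", 0.5)],
    frequent_labels={"person"},
)
```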

A.2 ECOL-R Inference Details
We use Beam Search with beam size 5 to decode the captions. We first apply length normalization to the overall score of each decoded caption. We also penalize captions when they generate repeated bi-grams: once a repetition is found, the log-probability for that particular word is divided by a penalty value $e^2$. Each image object is only allowed to be copied once. During the SCST optimization, we mask out words from the vocabulary that can be generated via copy operations to encourage the model to copy. All the above constraints are applied to all of our models in the ablation study.

A.3 Object CIDEr Details
Object CIDEr is the CIDEr score for dummy captions that only contain copied object words. It measures the correctness of our copy mechanism: a high Object CIDEr score means that many of the copied object labels are also mentioned in the ground-truth captions. We use this metric because the nocaps benchmark does not release its ground-truth captions and only provides online evaluation APIs.

B Dataset Details
For the nocaps Benchmark, we train with the COCO-2017 dataset, which is available at http://images.cocodataset.org/zips/train2017.zip.
The nocaps Validation and Test datasets are available from https://nocaps.org/download.

The Visual Genome image object detection files can be found in https://github.com/nocaps-org/updown-baseline.
For the held-out COCO Benchmark, the training and evaluation data can be found in https://github.com/LisaAnne/DCC. The Visual Genome image object detector is used for both benchmarks because COCO-2017 and COCO-2014 share the same set of images. The anonymous Google Drive includes the above data and the sets of task specific objects detected for the above two benchmarks.

B.1 Duplicated Caption Removal
We find some images in COCO share exactly the same reference captions. We find it beneficial to remove those duplicates. We simply iterate over all reference captions and remove any captions if they have already been found previously. This removes 25463 captions from the training data of the nocaps Benchmark and 7059 captions from the training data of the held-out COCO Benchmark.
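The de-duplication pass described above is a single scan with a set of already-seen captions. This is a minimal sketch with hypothetical names and data; it uses exact string equality, matching the "exactly the same" wording, and applies no case or whitespace normalisation.

```python
from typing import List, Set, Tuple

Caption = Tuple[str, str]  # (image id, caption text)


def remove_duplicate_captions(captions: List[Caption]) -> List[Caption]:
    """Keep only the first occurrence of each caption string."""
    seen: Set[str] = set()
    kept: List[Caption] = []
    for image_id, text in captions:
        if text in seen:
            continue  # identical caption already encountered earlier
        seen.add(text)
        kept.append((image_id, text))
    return kept


# Made-up example: the second image repeats a caption of the first.
deduplicated = remove_duplicate_captions([
    ("img1", "A plate of food."),
    ("img2", "A plate of food."),
    ("img2", "A dog on a couch."),
])
```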

B.2 VOA (visual object aligned) Images
VOA (visual object aligned) image/reference caption pairs are those that mention at least one detected task specific image object label (or its linguistic variant). Non-VOA image/caption pairs are removed from our SCST training process. We provide the reduced set of reference captions in the anonymous Google Drive (ddc captions/ddc train VOA.json and nocaps captions/nocaps train VOA.json). Table 9 and Table 10 show the number of images and annotated reference captions of the nocaps and held-out COCO Benchmarks, respectively. On average, each image has five reference captions. The COCO Train set in the nocaps Benchmark is larger than that of the held-out COCO Benchmark.

C Evaluation
The nocaps Benchmark hosts its evaluation server at https://evalai.cloudcv.org/web/
We provide an off-the-shelf version of this tool in the anonymous Google Drive (in tools).

C.1 held-out COCO Benchmark Validation Performance
We only show the test performance on the held-out COCO Benchmark in our main paper. Here, we show our models' performance on the validation set in Table 8. The models achieve a similar level of performance on the validation set.