Learning from Children: Improving Image-Caption Pretraining via Curriculum

Image-caption pretraining has been successfully applied to downstream vision tasks like zero-shot image classification and object detection. However, image-caption pretraining is still a hard problem: it requires multiple concepts (nouns) from captions to be aligned to several objects in images. To tackle this problem, we go to the roots, the best learners: children. We take inspiration from cognitive science studies of children's language learning to propose a curriculum learning framework. The learning begins with easy-to-align image-caption pairs containing one concept per caption. The difficulty is progressively increased, with each new phase adding one more concept per caption. Correspondingly, the knowledge acquired in each learning phase is utilized in subsequent phases to effectively constrain the learning problem to aligning one new concept-object pair in each phase. We show that this learning strategy improves over vanilla image-caption training in various settings: pretraining from scratch, using a pretrained image and/or text encoder, low-data regimes, etc.


Introduction
Recently, there has been tremendous interest in employing image-caption pretraining for downstream vision tasks like zero-shot object classification (Radford et al., 2021) and zero-shot object detection (Zareian et al., 2021; Li et al., 2022). The idea is to learn a common semantic space where the visual embeddings of objects in images lie close to the textual embeddings of the concepts (objects' names/tags/labels) in the captions that refer to them. This learned semantic space is later exploited for zero-shot object recognition by finding the concept embedding nearest to each object's embedding.
Despite the recent success, image-caption pretraining is a complex problem, as it entails aligning multiple concepts in a caption with multiple objects in an image, as shown in fig. 1. Different methods have tried to solve this problem from various angles: CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) by using more data, ALBEF (Li et al., 2021) by using a more complex network architecture, Florence (Yuan et al., 2021) and CoCa (Yu et al., 2022) by using more tasks, and ERNIE-ViL 2.0 (Shan et al., 2022) by using more data augmentations (views).
We propose an alternative approach based on a novel learning strategy that is architecture agnostic and does not require any additional data or compute. We take inspiration from cognitive science research studying how children learn language (concepts) in the early stages by just observing their surroundings (images). Specifically, we refer to two studies showing that children learn rapidly when the object of interest is unambiguous (Pereira et al., 2014) and by applying co-referential statistics across multiple scenes (Smith and Yu, 2008).
We implement these two ideas via a curriculum learning approach (demonstrated in fig. 1): 1. We train the model in multiple phases of increasing difficulty, with each phase containing one more concept per caption than the previous one. Moreover, each phase contains only one new concept; the rest are seen in prior phases.
2. In each phase, we leverage the concept-object association learned in prior phases to recognize the seen concepts and focus on aligning the new/unseen concept (section 2.2.2).
These two strategies effectively reduce the problem of aligning multiple object-concept pairs per training sample to aligning only one such pair.
To the best of our knowledge, no prior work has applied curriculum learning to image-caption pretraining in this way. Srinivasan et al. (2022) apply a curriculum based on the difficulty of negative samples in the contrastive loss, whereas Liu et al. (2021) design their curriculum based on the granularity of text: from words to phrases to sentences.
Although our proposed approach can be applied to any multimodal network architecture, we pick OVR-CNN (Zareian et al., 2021) due to its simplicity. We pretrain it with the proposed curriculum learning approach and evaluate on the downstream task of zero-shot object detection. We demonstrate that curriculum learning outperforms vanilla image-caption pretraining in a variety of architectural settings: with and without a pretrained image encoder and/or a pretrained text encoder. We also show superior performance in low-data settings, suggesting our method can be leveraged in low-resource scenarios as well.

Method
We propose a curriculum learning framework to improve image-caption pretraining. In this work, we apply it to OVR-CNN, as its architecture is simple and fast to train and evaluate. We begin the description of our approach with a brief background on OVR-CNN. Next, we discuss how we modify it to implement the proposed curriculum learning framework.

OVR-CNN Background
OVR-CNN is a dual-encoder (separate visual and text encoders) multimodal architecture. It first pretrains the encoders using image-caption pairs and later utilizes them for the downstream task of object detection. We discuss only the pretraining procedure, as it is the only component we utilize.
OVR-CNN's visual encoder is ResNet-50 (He et al., 2016) and its text encoder is either BERT (Devlin et al., 2019) or GloVe (Pennington et al., 2014). The visual encoder takes an image I of dimensions w × h as input and outputs a feature map of w/32 × h/32 regions. Each region's feature vector is mapped to the language space by a projection layer, giving a visual embedding e^I_i for each region i. The tokenized caption C is input to the text encoder, which outputs an embedding e^C_j for each token j. Token-region pairs are aligned via weak supervision. Specifically, a global alignment score between image and caption, ⟨I, C⟩_G, is calculated as a locally weighted average of region-token alignment scores:

⟨I, C⟩_G = (1/n_C) Σ_j Σ_i a_{i,j} ⟨e^I_i, e^C_j⟩_L,   (1)

where ⟨·, ·⟩_L is the dot product of two vectors, n_I and n_C are the number of image regions and caption tokens respectively, and

a_{i,j} = exp(⟨e^I_i, e^C_j⟩_L) / Σ_{i'} exp(⟨e^I_{i'}, e^C_j⟩_L).   (2)

The model is trained using contrastive learning by maximizing the global alignment score ⟨I, C⟩_G between positive image-caption pairs and minimizing it between negative pairs sampled from the same training batch:

L = −log( exp(⟨I, C⟩_G) / Σ_{C'∈B_C} exp(⟨I, C'⟩_G) ) − log( exp(⟨I, C⟩_G) / Σ_{I'∈B_I} exp(⟨I', C⟩_G) ),   (3)

where B_C and B_I are the batch captions and batch images respectively. This learning objective aligns paired images and captions together and also provides weak supervision for the association between image regions and caption tokens.
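As a rough sketch (not OVR-CNN's released code), this weakly supervised alignment objective can be written as follows; `e_img` and `e_cap` stand for the projected region and token embeddings of one image-caption pair, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def global_alignment(e_img, e_cap):
    """e_img: (n_I, d) region embeddings; e_cap: (n_C, d) token embeddings."""
    # Local alignment scores <e^I_i, e^C_j>_L for every region-token pair.
    local = e_img @ e_cap.t()                   # (n_I, n_C)
    # Each token's attention over image regions: softmax across regions (eq. 2).
    attn = local.softmax(dim=0)                 # (n_I, n_C)
    # Global score (eq. 1): attention-weighted sum over regions per token,
    # averaged over the caption tokens.
    return (attn * local).sum(dim=0).mean()

def contrastive_loss(images, captions):
    """images / captions: lists of per-sample embedding tensors (same batch size)."""
    # Batch score matrix of <I, C>_G for every image-caption pairing, then
    # symmetric InfoNCE: the matched pair is the positive on each row/column.
    scores = torch.stack([
        torch.stack([global_alignment(e_i, e_c) for e_c in captions])
        for e_i in images
    ])                                          # (B, B)
    targets = torch.arange(len(images))
    return 0.5 * (F.cross_entropy(scores, targets)
                  + F.cross_entropy(scores.t(), targets))
```

Because each token attends over regions via a softmax, a token's contribution is dominated by the region it matches best, which is the weak region-token supervision described above.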

Curriculum Learning Framework
OVR-CNN facilitates object-concept alignment through coarse image-region and concept alignment. However, as an object can span multiple image regions, or multiple objects can span one image region, this strategy can be noisy. To eliminate this noise and isolate the contribution of our curriculum framework to object-concept alignment, we train the model using object region features instead of image region features. To this end, object region bounding boxes are used to ROI pool (Girshick, 2015a) the image region features. The resulting feature vector e^I_o, for each object o, replaces e^I_i in eqs. (1) and (2).

Curriculum Design
The learning is divided into phases 1, 2, 3, . . ., k. Each phase p is trained with only those image-caption pairs having p concepts per caption. To divide the data into phases, we use spaCy^1 to PoS (Part-of-Speech) tag the captions. Depending on the number of nouns in each caption, the caption and its paired image are grouped into the corresponding phase. Empirically, this curriculum design also imparts an additional property to the data: at most one new concept is introduced per caption in each phase (as demonstrated in fig. 2b).
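The phase-assignment step amounts to bucketing captions by their noun count. A minimal sketch with illustrative names follows; `spacy_noun_counter` shows where spaCy's PoS tags would plug in, and any tagger with the same interface works:

```python
from collections import defaultdict

def assign_phases(pairs, count_nouns, max_phase=4):
    """Group (image_id, caption) pairs into curriculum phases 1..max_phase,
    where a caption with p nouns (concepts) lands in phase p."""
    phases = defaultdict(list)
    for image_id, caption in pairs:
        p = count_nouns(caption)
        if 1 <= p <= max_phase:
            phases[p].append((image_id, caption))
        # Captions with 0 or more than max_phase nouns fall outside the
        # k-phase curriculum and are left out.
    return dict(phases)

def spacy_noun_counter(nlp):
    """Wrap a loaded spaCy pipeline, e.g. nlp = spacy.load("en_core_web_sm")."""
    return lambda text: sum(tok.pos_ == "NOUN" for tok in nlp(text))
```

For example, a caption tagged with one noun goes to phase 1, one with two nouns to phase 2, and so on.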

Curriculum Aware Alignment Loss
To recognize the concepts in captions already seen in prior phases and focus on aligning the new/unseen concept, we formulate a novel Curriculum Aware Alignment Loss (L_C). Specifically, we first calculate the previously learned object-concept alignment a_{o,j} from modified eq. (2), using either the trained model from the last iteration (L_CR) or the trained model from the last phase (L_CP). Next, a_{o,j} is used to compute

a'_{o,j} = a_{o,j} (1 − (t/T) · max_o(a_{o,j})),   (4)

where t is the current iteration number and T is the total number of iterations in training. For a concept j that is already closely aligned to an object o, max_o(a_{o,j}) is high. This leads to a low value of a'_{o,j}, resulting in less attention being paid to concept j in the current training iteration/phase; the converse holds for a concept that is not well aligned with any object. a'_{o,j} thus effectively redistributes the attention of learning toward concepts that are not well aligned with any object. The term t/T is low at the beginning of training and gradually scales to 1 by the end, allowing the network to ignore prior knowledge early on while utilizing it in the later stages.

1 https://spacy.io/usage
We use a'_{o,j} to replace a_{o,j} in modified eq. (1), and then use eq. (3) to compute L_C.
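The redistribution can be sketched as follows. Here we assume the multiplicative form a'_{o,j} = a_{o,j}·(1 − (t/T)·max_o a_{o,j}), which matches the behavior described above (well-aligned concepts are damped more as t/T grows); `a_prev` denotes the alignment matrix taken from the last phase or iteration, and the function name is illustrative:

```python
import torch

def redistribute_attention(a_prev, t, T):
    """a_prev: (n_objects, n_tokens) object-concept alignment from the
    previously trained model; t: current iteration; T: total iterations."""
    ramp = t / T                                    # scales 0 -> 1 over training
    # Per-concept damping: concepts already well aligned to some object
    # (high column max) receive less attention as training progresses.
    damp = 1.0 - ramp * a_prev.max(dim=0).values    # (n_tokens,)
    return a_prev * damp.unsqueeze(0)               # a'_{o,j}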

Pretraining Dataset and Implementation Details
We use the COCO Captions dataset (Chen et al., 2015) for pretraining. It contains 118,287 images with five captions per image. To obtain bounding box regions for objects in images, we use the COCO Objects (Lin et al., 2014) dataset, as it uses the same set of images as COCO Captions. We divide the data into k = 4 phases using the strategy discussed in section 2.2.1. Figure 2a shows the number of captions assigned to each phase. As shown in fig. 2b, the majority of captions in each phase p have at least p−1 concepts previously seen, allowing the curriculum to introduce at most one new concept per training sample. Further, as more concepts are introduced with each passing phase, the percentage of captions per phase actually introducing a new concept decreases (as depicted in fig. 2c). By phase 4, this percentage reduces to < 5%. Additional phases of training would not contain enough captions actually introducing a new concept in the curriculum manner, making those phases similar to regular image-caption training. Hence, we limit the curriculum to 4 phases. We train the model using the SGD optimizer, with a batch size of 32 for 4 epochs in each phase, a learning rate of 0.005, a step scheduler, and the loss L_CP.

Downstream Task, Dataset and Transfer
We evaluate the performance of the model on the zero-shot object detection task on the COCO Objects val split (4,836 images; 33,374 instances of 65 object classes). The task involves predicting object bounding boxes in addition to classifying those object regions into a label (concept). However, our method is aimed only at improving the alignment of object regions to concepts. As such, we eliminate any performance noise from bounding box prediction by evaluating only the classification accuracy of object regions given ground-truth object bounding boxes.
Transfer to Downstream Task: We extract object features from the image and object bounding boxes using the visual backbone and use them to find the closest class label vector (obtained via the language backbone).
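This transfer is a nearest-concept lookup in the shared semantic space; a minimal sketch, with `obj_feats` and `class_embs` standing in for the visual- and language-backbone outputs (names are illustrative):

```python
import torch

def classify_regions(obj_feats, class_embs):
    """obj_feats: (n_obj, d) ROI-pooled object features;
    class_embs: (n_classes, d) class label embeddings."""
    # Dot-product similarity in the shared semantic space; the argmax over
    # class embeddings gives each object region's predicted concept.
    sims = obj_feats @ class_embs.t()      # (n_obj, n_classes)
    return sims.argmax(dim=1)
```

Top-5 accuracy, as reported later, replaces the argmax with the five highest-scoring classes per region.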

Baseline and Evaluation
Our baseline is OVR-CNN, a regular image-caption pretrained model. However, since our method uses object region features instead of image patch features for multimodal alignment (section 2.2), we also pretrain OVR-CNN with object regions to obtain OVR-CNN_O. It is transferred to the downstream task in the same way as our proposed model (section 3.2).
Our proposed curriculum framework outperforms the baseline in various settings, as shown in table 1. The accuracy numbers reported are averaged across three seeds. This demonstrates that our proposed learning strategy works with encoders trained from scratch as well as pretrained ones.
Performance Gain Analysis. We analyze model performance on object classes introduced during pretraining in phase 1 and phase 2 separately. As reported in table 2, the improvement on phase 2 objects is ~10x that on phase 1 objects. This illustrates that our curriculum strategy improves the alignment of multiple concepts in a caption by focusing on one at a time.
Low Data Setting. Our model outperforms the baseline even when both use 50%, 25%, or 10% of the data (fig. 3), indicating its utility when data is scarce.
Region proposals instead of ground-truth object regions. We use an RPN model (Girshick, 2015b) trained class-agnostically on Visual Genome (Krishna et al., 2016) to generate object regions. The superior performance of our model over the baseline, reported in table 3, demonstrates that our approach is effective even when ground-truth object regions are not available.
Loss Ablation. From table 4, we can conclude that our curriculum design works (Ours + L > OVR-CNN_O + L); our proposed curriculum aware loss works (Ours + L < Ours + L_CR) irrespective of the curriculum (OVR-CNN_O + L < OVR-CNN_O + L_CR); and the curriculum aware loss works better when previous knowledge is taken from the last phase instead of the last iteration (Ours + L_CP > Ours + L_CR).
Qualitative Analysis. We also provide qualitative analysis to shed more light on the cases where our approach does and doesn't work. From Figure 4, we find that our model performs better than OVR-CNN_O in certain cases, especially when the objects are from Phase 2: "snowboard", "cup", "skis", etc. This provides further evidence for our claim that our approach improves the alignment of Phase 2 objects.
Comparison on the traditional mAP metric for object detection. As mentioned before, we have focused our experiments on evaluating object-concept alignment rather than the traditional object detection mAP metric. This was done to avoid unnecessary performance noise arising from training an RPN, which is required for mAP evaluation. However, to test the limits of our model, we evaluate on this noisy mAP metric as well. We keep all settings identical to Zareian et al. (2021), except that we pretrain using our curriculum learning approach.
The results are reported in Table 5. We find that our model performs better on the most generic Generalized ('All') set (41.33 vs. 39.9), signifying the effectiveness of our approach even in this noisy setting. We further observe that we perform better on the base classes while lagging behind on the target classes. A deeper analysis shows that most of the Phase 2 objects, on which we make major improvements, lie in the base classes. This explains the improved performance on base classes and the slight drop in target-class performance.
Training with image grid regions. Our curriculum-based pretraining method is aimed at improving object-concept alignment by focusing on one object at a time. To facilitate this, we pretrained directly with object regions. Image regions were not used, to eliminate the noise arising from an object spanning multiple regions or multiple objects being present in the same image region (object presence noise). However, we further push the limits of our model to assess how it performs when trained with noisy image regions instead of object regions.
The results are reported in Table 6. We find that our model performs slightly worse in this setting. We attribute this performance degradation to the inherent object presence noise in image regions, as discussed earlier.

Conclusion
We proposed a curriculum learning framework to improve image-caption pretraining, based on the number of concepts in captions. We also designed a novel curriculum aware loss to focus learning on the unaligned concept in each phase. Our approach outperforms vanilla image-caption pretraining in various settings, including with/without pretrained encoders and with small data. Further, we extensively analysed our model to study the contribution of each component.

Limitations
Although our proposed curriculum can be applied to any multimodal architecture, the curriculum aware loss requires modifications for use with dual-encoder architectures that don't use cross-modal attention. Additionally, we use an off-the-shelf Part-of-Speech tagger to divide the data into phases. As such, the correctness of this division depends on the quality of the tagger; a poor tagger can negatively impact the curriculum design. Moreover, our approach does not apply to image-caption datasets that contain only short captions with possibly a single noun.

Figure 1 :
Figure 1: Top: Normal image-caption pretraining. Bottom: Proposed Curriculum Learning Framework. The curriculum eases the learning problem by requiring the model to align only one concept-object pair at a time.

Figure 2 :
Figure 2: Curriculum Statistics. (a) #Captions vs. phase. (b) The number next to each bar shows the % of captions per phase with at least the shaded number of concepts seen previously. (c) % of captions per phase introducing 1 new concept.

Table 1 :
Curriculum learning vs. baseline in various settings with/without pretrained encoders. BB: backbone.

Table 2 :
Phase-wise top-5 accuracy. P_i Obj.: Phase i objects.

Table 3 :
Object bounding boxes from RPN.

Table 4 :
Ablation of proposed loss.
Figure 4:
Qualitative analysis of cases where our approach works/doesn't work. The top two rows show samples where our model successfully aligns the objects with the correct concept while OVR-CNN_O makes mistakes. Interestingly, most of these objects are from Phase 2: "snowboard", "cup", "skis", etc. The bottom row shows cases where our model makes mistakes. Note: only relevant objects, not all, are shown from each image.

Table 5 :
Comparison of traditional mAP metric for object detection.

Table 6 :
Training with image grid regions instead of object regions.