Delving into the Openness of CLIP

Contrastive Language-Image Pre-training (CLIP) formulates image classification as an image-to-text matching task, i.e., matching images to the corresponding natural language descriptions instead of discrete category IDs. This allows for open-vocabulary visual recognition, where the model can recognize images from an open class set (also known as an open vocabulary) in a zero-shot manner. However, evaluating the openness of CLIP-like models is challenging, as the models are open to arbitrary vocabulary in theory, but their accuracy varies in practice. To address this, we resort to an incremental perspective to assess the openness through vocabulary expansions, and define extensibility to measure a model's ability to handle novel classes. Our evaluation shows that CLIP-like models are not truly open, and their performance deteriorates as the vocabulary expands. We further dissect the feature space of CLIP from the perspectives of representation alignment and uniformity. Our investigation reveals that the overestimation of openness is due to confusion among competing text features, rather than a failure to capture the similarity between image features and text features of novel classes. We hope that our investigation and analysis will facilitate future research on the CLIP openness issue.


Introduction
An intrinsically open mechanism for visual recognition (Deng et al., 2009; He et al., 2016) has always been a shared goal in the computer vision community (Scheirer et al., 2013; Geng et al., 2021; Bendale and Boult, 2015). This mechanism requires models to maintain flexibility to cope with the scaling of recognition targets, where both input images and the corresponding classes dynamically expand according to actual needs. For example, in medical diagnosis (Razzak et al., 2017), new diseases emerge constantly, and in e-commerce, new categories of products appear daily (Xu et al., 2019); these cannot be predefined in a finite, fixed class set.
Faced with the challenging task of open-world recognition, Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) and its open-vocabulary learning paradigm demonstrate superiority over traditional supervised classifiers (He et al., 2016; Dosovitskiy et al., 2021). CLIP pre-trains a vision-language model on web-scale collections of image-text pairs, learning semantic alignment between images and their corresponding textual descriptions. During inference, it formulates image classification as an image-to-text matching task, where the set of class names serves as a vocabulary, and textual prompts like "a photo of a [CLASSNAME]" are curated as class descriptions for images. By varying the [CLASSNAME] placeholder and computing the similarity between class descriptions and images, CLIP can identify the most suitable class name and predict it as the target class. This approach allows CLIP to operate with arbitrary vocabularies and adapt to novel classes by expanding the vocabulary, enabling zero-shot inference for new input images and classes.
Nevertheless, previous evaluation protocols for CLIP models only assess their accuracy on static, closed vocabularies from downstream datasets, leaving their actual performance on open tasks in the shadows (Radford et al., 2021). In this work, we delve into openness, an intriguing yet underexplored property of CLIP-like models (Li et al., 2021b; Mu et al., 2021; Yao et al., 2021; Zhou et al., 2021), and present a novel protocol for evaluating openness from an incremental view. Specifically, we define a metric of extensibility to measure a model's ability to handle new visual concepts through vocabulary expansion. Different from previous metrics, ours explicitly models the dynamics of the real open world and formulates the empirical risk of CLIP as new vocabularies incrementally emerge. Additionally, we define a metric of stability to explore how stable the model's predictions on old classes remain when new classes are introduced, which provides a tool to analyze the compatibility between different classes.
Using our protocol, we conduct a systematic and comprehensive evaluation of CLIP-like models. Our experimental results on extensibility show that CLIP and its variants suffer a significant drop in accuracy as the vocabulary size increases. For example, CLIP (RN101) on CIFAR100 experiences a 12.9% drop in accuracy when the vocabulary expands from 5 to 100 classes. This indicates that the limited zero-shot capability of CLIP-like models is inadequate for supporting their deployment in the open world. Worse still, through an analysis of the prediction shift during vocabulary expansion, we find that the performance of CLIP can be dramatically reduced by adding only three adversarial class names to the vocabulary, exposing the model's poor stability and its security risks. Furthermore, we investigate the representation space of CLIP-like models via three metrics: margin, inter-modal alignment, and intra-modal uniformity. Our results show that the small margin between positive and negative class descriptions leads to prediction shifting when competing class features appear. Therefore, enforcing the distinguishability of class features increases the margin and improves the stability of these models.
In summary, our contribution is threefold. First, to the best of our knowledge, we are the first to systematically quantify the openness of CLIP, for which we design an evaluation protocol and two indicators, extensibility and stability. Second, we conduct extensive experiments on CLIP-like models based on our protocol and find that their openness is overestimated and their performance declines as the vocabulary expands. Finally, we analyze the feature space of CLIP from the perspectives of representation alignment and uniformity, observing that the uniformity of the textual space is critical for better extensibility.

Related work
Contrastive language-image pre-training and open-vocabulary learning. CLIP (Radford et al., 2021) introduces the paradigm of open-vocabulary learning and learns transferable visual models from natural language supervision. The CLIP model consists of an image encoder and a text encoder, which are utilized to encode image-text pairs into a joint feature space for learning the semantic alignment of vision and language. Paired images and texts are pulled together in the feature space, while those with dissimilar semantics are pushed apart via a contrastive loss. After pre-training on large-scale image-text pairs, CLIP is able to map images to their corresponding language descriptions, which allows visual recognition to generalize in the wild. Recent studies further improve CLIP by using more pre-training data (Jia et al., 2021), and by incorporating self-supervision (Mu et al., 2021), fine-grained supervision (Yao et al., 2021), and widespread supervision (Li et al., 2021b) into pre-training. Another line of recent studies (Li et al., 2021a; Wang et al., 2022; Yu et al., 2022; Alayrac et al., 2022) adopts a seq2seq generation framework instead of contrastive discrimination to achieve open-vocabulary recognition. We leave the investigation of their extensibility for future work.
Open Set and Open-World Visual Recognition. Open Set Recognition (OSR) (Scheirer et al., 2013; Geng et al., 2021) and Open World Recognition (OWR) (Bendale and Boult, 2015) are paradigms aiming to cope with input images from novel classes during inference. OSR requires classifiers to identify images whose classes were not introduced during training as "unknown". OWR raises higher demands: models are supposed to incrementally extend and retrain the multi-class classifier as the unknowns are labeled as additional training data. Contrary to the above research, CLIP-based Open-Vocabulary Recognition (OVR) aims to identify novel classes in a zero-shot manner by using natural language representations of categories instead of discrete label IDs. This allows CLIP to directly synthesize textual descriptions of novel classes for matching, eliminating the need for labeling additional training data and re-training the entire model. A more detailed comparison of OSR, OWR, and OVR can be found in Appendix A.1.

Openness, Extensibility, and Stability
In this section, we first review CLIP's visual recognition paradigm and demonstrate how it realizes open-vocabulary image classification through vocabulary expansion (§ 3.1). To quantify the actual performance of CLIP-like models as the vocabulary expands, we define the metric of extensibility and propose a systematic evaluation protocol (§ 3.2).

Figure 1: Left: the original accuracy of CLIP with the target vocabulary (Eq. (1)) and the conditional accuracy of CLIP with a non-target vocabulary (Eq. (4)). In the latter, classes from the non-target vocabulary serve as distractors for input images restricted to the target vocabulary. Upper right: calculation of Acc-E (Eq. (2)). It measures the extensibility of models when recognition targets, including both classes and the associated input images, scale simultaneously. Bottom right: calculation of Acc-S (Eq. (5)), a sub-problem introduced by Acc-E. It measures the prediction stability on images from the target vocabulary as distractors from the non-target vocabularies are incorporated incrementally.
The experimental results and further analysis reveal that, as the vocabulary expands, CLIP's predictions become unstable and prone to drift towards newly introduced competing class descriptions, which limits its extensibility and poses a huge security risk when deployed in real-world applications (§ 3.3).

Openness of CLIP
CLIP (Radford et al., 2021) models image classification as an image-to-text matching task. Formally, let f be the CLIP model, and f_T and f_I be the text and image encoders in CLIP, respectively. The CLIP model takes an image x and a target vocabulary V^(T) = {w_i} of class names w_i as inputs, and predicts the image label as:

ŷ = argmax_{w_i ∈ V^(T)} sim(f_I(x), f_T(t_i)), (1)

where t_i is the textual description of the class name w_i in a prompt format, e.g., "a photo of a w_i", and sim(·,·) denotes cosine similarity. Such a modeling paradigm can in theory realize open-world image classification by extending the target vocabulary V^(T) to an arbitrary degree. However, in most previous work (Radford et al., 2021; Li et al., 2021b; Mu et al., 2021; Yao et al., 2021; Zhou et al., 2021), CLIP is evaluated with a fixed vocabulary:

Acc(V^(T)) = (1/|D^(T)|) Σ_{(x,y)∈D^(T)} I(f(x, V^(T)) = y),

where |D^(T)| is the size of the dataset and I(·) is the indicator function. This vanilla evaluation setting, utilizing restricted input images and classes, falls short for open recognition tasks. It fails to consider the dynamic expansion of the vocabulary during inference and, as a result, cannot accurately reflect CLIP's openness in real-world scenarios where the number of classes may increase.
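The matching paradigm above can be sketched in a few lines. This is a minimal numpy sketch with stand-in features rather than a real CLIP model; the function and variable names (`clip_predict`, `class_feats`, etc.) are illustrative, not from the paper or the CLIP codebase.

```python
import numpy as np

def normalize(v):
    # project features onto the unit sphere so dot products are cosine similarities
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def clip_predict(image_feat, class_feats, vocab):
    """Predict the class whose text feature is most similar to the image feature."""
    sims = normalize(class_feats) @ normalize(image_feat)
    return vocab[int(np.argmax(sims))]

def accuracy(image_feats, labels, class_feats, vocab):
    """Closed-vocabulary accuracy over a dataset D^(T)."""
    correct = sum(
        clip_predict(x, class_feats, vocab) == y
        for x, y in zip(image_feats, labels)
    )
    return correct / len(labels)

# toy features: text features are one-hot, image features are noisy copies
rng = np.random.default_rng(0)
vocab = ["cat", "dog", "car"]
class_feats = np.eye(3)                                    # stand-in f_T(t_i)
image_feats = np.eye(3) + 0.05 * rng.normal(size=(3, 3))   # stand-in f_I(x)
labels = ["cat", "dog", "car"]
acc = accuracy(image_feats, labels, class_feats, vocab)
```

Expanding the vocabulary amounts to appending rows to `class_feats` and names to `vocab`, which is exactly the operation the following sections stress-test.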

Quantifying extensibility for open world
To quantify the model's capability in dealing with newly emerged recognition targets, we propose an evaluation protocol and define a metric of extensibility based on vocabulary expansion. Concretely, we incrementally expand the vocabulary V^(T) in Eq. (1) by introducing new classes and their associated input images, then evaluate the accuracy after each expansion. These accuracy values reflect the model's dynamic performance as openness increases, and the expected average of these values is defined as the model's extensibility. In practice, we achieve this expansion by incrementally unioning N disjoint target vocabularies, as shown in the upper right panel of Figure 1.
Definition 3.1 (Extensibility). Given N disjoint target vocabularies {V^(T)_1, ..., V^(T)_N}, we denote the set of all possible permutations of these vocabularies as S_N, and V^(T)_{s_i} as the i-th vocabulary in a permutation s ∈ S_N. When we union the i-th vocabulary with the previous i−1 vocabularies, we achieve a vocabulary expansion. The extensibility refers to the averaged classification accuracy across N incremental expansions as i increases from 1 to N:

Acc-E = E_{s∈S_N} [ (1/N) Σ_{i=1}^{N} Acc(∪_{j=1}^{i} V^(T)_{s_j}) ]. (2)

Experimental settings. We evaluate the extensibility of CLIP and its variants, including DeCLIP (Li et al., 2021b), SLIP (Mu et al., 2021), Prompt Ensemble (Radford et al., 2021), and CoOp (Zhou et al., 2021), on the CIFAR100 (Krizhevsky and Hinton, 2009) and ImageNet (Deng et al., 2009) datasets. Non-matching methods (Gao et al., 2021; Zhang et al., 2021; Wortsman et al., 2021), such as linear probing, are NOT included, since they train a classifier with finite class vectors and are thus not suitable for class scaling in operation. To construct the vocabularies, we leverage the underlying superclass-class hierarchical structure of the two datasets (Krizhevsky and Hinton, 2009; Santurkar et al., 2021) and group the classes belonging to the same superclass into a vocabulary. We also report the closed-vocabulary accuracy Acc-C as a reference; it represents the original model performance on closed vocabularies. To calculate the expectation in Acc-E, we sample 100 × N permutations for N vocabularies and take the average. (3) The most extensible results are obtained by CoOp (Zhou et al., 2021), which performs prompt tuning on all classes of CIFAR100 and ImageNet. However, the prompt tuning method utilizes additional category information and training data, which cannot be applied to real-world open tasks.
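The Acc-E protocol above can be sketched as follows. This is a hedged sketch: `eval_accuracy` stands in for a real CLIP evaluation, and the toy evaluator simply decays with vocabulary size to mimic the trend the paper reports; none of these names come from the paper's code.

```python
import random

def acc_e(vocabularies, eval_accuracy, n_samples=100, seed=0):
    """Extensibility (Eq. 2): average accuracy over N incremental vocabulary
    expansions, in expectation over sampled permutations of the vocabularies."""
    rng = random.Random(seed)
    n = len(vocabularies)
    totals = []
    for _ in range(n_samples):
        perm = rng.sample(vocabularies, n)        # one permutation s in S_N
        merged, accs = [], []
        for vocab in perm:                        # union the i-th vocabulary
            merged = merged + list(vocab)
            accs.append(eval_accuracy(merged))    # Acc on the expanded vocab
        totals.append(sum(accs) / n)
    return sum(totals) / len(totals)

# toy evaluator: accuracy decays as the vocabulary grows
toy_vocabs = [["a", "b"], ["c", "d"], ["e", "f"]]
toy_eval = lambda vocab: 1.0 / len(vocab)
score = acc_e(toy_vocabs, toy_eval)
```

Because the toy evaluator depends only on vocabulary size, every permutation yields the same trajectory here; with a real model the permutations matter, which is why the expectation is sampled.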

Stability during vocabulary expansion
As the vocabulary expansion introduces new classes incrementally, some images belonging to previous vocabularies may be incorrectly predicted as new classes, resulting in a drop in accuracy and poor extensibility. To analyze the prediction stability of CLIP during vocabulary expansion, we introduce non-target classes. They do NOT correspond to any input images, and only serve as distractors for the target classes. Based on this, we define the conditional classification accuracy as:

Acc(V^(T) | V^(T) ∪ V^(NT)) = (1/|D^(T)|) Σ_{(x,y)∈D^(T)} I(f(x, V^(T) ∪ V^(NT)) = y), (4)

where V^(NT) is the non-target vocabulary, i.e., the vocabulary of non-target classes. The conditional accuracy is depicted in the left panel of Figure 1. In Eq. (4), the categories of the input images are limited to the target vocabulary ((x, y) ∈ D^(T)), but CLIP is asked to distinguish all categories from a larger vocabulary V^(T) ∪ V^(NT). In other words, compared to traditional closed-set classification, CLIP is expected to reject all the negative categories from V^(NT). The model is required to distinguish visual concepts stably and robustly, rather than making wrong predictions in the presence of distractors. Based on Eq. (4), we define the stability of CLIP in the open task as:

Definition 3.2 (Stability). Given a target vocabulary V^(T) and M non-target vocabularies {V^(NT)_1, ..., V^(NT)_M}, we denote S_M as the set of their permutations, and V^(NT)_{s_i} as the i-th vocabulary in a permutation s ∈ S_M. We define the local stability as the averaged classification accuracy of CLIP on the given target vocabulary when non-target vocabularies are extended incrementally:

Acc-S(V^(T)) = E_{s∈S_M} [ (1/M) Σ_{i=1}^{M} Acc(V^(T) | V^(T) ∪ (∪_{j=1}^{i} V^(NT)_{s_j})) ]. (5)

As Eq. (5) only reflects the local stability with respect to a single target vocabulary, we further design the general stability as an average of local stability over a set of target vocabularies, to reduce the bias from data distribution and vocabulary sampling. Specifically, given N vocabularies {V_1, ..., V_N}, we in turn treat each V_i as the target vocabulary V^(T) and the rest V_{≠i} as the non-target vocabularies V^(NT), and then formulate the general stability as:

Acc-S = (1/N) Σ_{i=1}^{N} Acc-S(V_i).

Experimental settings and results
The models and datasets adopted for evaluation are consistent with those in § 3.2. For the calculation of stability, taking CIFAR100 with N = 20 vocabularies as an example, we treat each vocabulary as the target vocabulary and the rest as non-target vocabularies. To calculate the expectation in Eq. (5), we sample 100 permutations of the M = 19 non-target vocabularies and report the averaged scores. Table 1 demonstrates the stability of CLIP-like models. On CIFAR100, the Acc-S of CLIP (RN101) decreases by 13.4%. Figure 2a shows Acc-S on CIFAR100 during non-target vocabulary expansion. Given a closed V^(T) = Insects, CLIP (ViT-B/32) achieves an accuracy of 81.2%. However, when the remaining 19 non-target vocabularies are incorporated, the accuracy sharply drops to 57.0%. The decrease in Acc-S brought by each introduced non-target vocabulary indicates that more images from Insects are incorrectly classified into the new vocabulary. Figure 2b demonstrates the difference between Acc-C and Acc-S for each target vocabulary. When V^(T) = Medium-sized Mammals, CLIP is most easily interfered with by the non-target vocabularies, with a 21.08% performance drop. This suggests that unstable predictions lead to the poor extensibility of CLIP when new categories are introduced. Besides, we notice that CLIP performs stably on groups like Flowers, where its Acc-S declines by only 0.53% compared to Acc-C. The different behaviors across groups indicate that stability is also influenced by the inherent properties of the image categories and naming variation (Silberer et al., 2020; Takmaz et al., 2022).
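The conditional accuracy (Eq. (4)) and local stability (Eq. (5)) computations above can be sketched as follows. This is an illustrative sketch with a toy predictor, not the paper's implementation; `predict`, `target_images`, and the toy values are assumptions for demonstration.

```python
import random

def conditional_accuracy(target_images, predict, target_vocab, nontarget_vocab):
    """Eq. (4): accuracy on images from the target vocabulary, with predictions
    made over the union of target and non-target vocabularies."""
    vocab = list(target_vocab) + list(nontarget_vocab)
    correct = sum(predict(x, vocab) == y for x, y in target_images)
    return correct / len(target_images)

def acc_s(target_images, predict, target_vocab, nontarget_vocabs,
          n_samples=100, seed=0):
    """Local stability (Eq. (5)): average conditional accuracy as M non-target
    vocabularies are incorporated incrementally, over sampled permutations."""
    rng = random.Random(seed)
    m = len(nontarget_vocabs)
    totals = []
    for _ in range(n_samples):
        perm = rng.sample(nontarget_vocabs, m)
        distractors, accs = [], []
        for v in perm:                        # add one non-target vocabulary
            distractors = distractors + list(v)
            accs.append(conditional_accuracy(
                target_images, predict, target_vocab, distractors))
        totals.append(sum(accs) / m)
    return sum(totals) / len(totals)

# toy predictor that always answers the first class of the expanded vocabulary
toy_predict = lambda x, vocab: vocab[0]
images = [(0, "a"), (1, "a"), (2, "b")]
stability = acc_s(images, toy_predict, ["a", "b"], [["c"], ["d"]])
```

The general stability is then just the mean of `acc_s` over each vocabulary taken in turn as the target.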

Adversarial non-target vocabulary
In order to explore the lower bound of the stability of CLIP, we define the adversarial non-target vocabulary V^(ANT) as the non-target vocabulary that reduces Acc-S the most:

V^(ANT) = argmin_{V^(NT)} Acc(V^(T) | V^(T) ∪ V^(NT)).

To build V^(ANT), we refer to methods for adversarial example generation (Ren et al., 2019): we traverse the words in a large vocabulary, e.g., the nouns in WordNet (Fellbaum, 2000), treating them as non-target classes, calculate Acc-S, and then take the most confusing words to form the adversarial non-target vocabulary.
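A greedy version of this search can be sketched as below. The paper only states that the most confusing words are selected after traversing the candidates; the greedy loop, the `damage` table, and its values are all illustrative assumptions (the two real adversarial words from § 3.3, bitmap and equidae, are reused only as toy candidates).

```python
def adversarial_vocab(candidates, acc_fn, size=3):
    """Greedy sketch of building V^(ANT): repeatedly pick the candidate
    non-target class name whose addition reduces the accuracy the most."""
    chosen = []
    for _ in range(size):
        remaining = [w for w in candidates if w not in chosen]
        # the word minimizing the conditional accuracy is the most confusing
        best = min(remaining, key=lambda w: acc_fn(chosen + [w]))
        chosen.append(best)
    return chosen

# toy scorer: each word has a made-up "damage"; accuracy drops by the sum
damage = {"bitmap": 0.3, "equidae": 0.2, "flower": 0.0, "tree": 0.05}
toy_acc = lambda words: 1.0 - sum(damage[w] for w in words)
adv = adversarial_vocab(list(damage), toy_acc, size=3)
```

In the paper's setting, `acc_fn` would be the conditional accuracy of Eq. (4) evaluated with the candidate words as distractors, which makes each step a full evaluation pass.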
We constrain the size of V^(ANT) to 3. Results in Figure 3 illustrate the performance with nouns in WordNet and class names in ImageNet as the candidate vocabulary, respectively. First, we observe a clear performance degradation on both datasets under adversarial attack, e.g., adding bitmap, automobile insurance, and equidae leads to an absolute 52.7% accuracy drop on CIFAR10. Besides, we find that the selected adversarial words are much less concrete than common visual concepts like Flower, indicating that a potential reason behind this is CLIP's poor semantic modeling of objects at higher abstraction levels. This investigation reveals that CLIP is vulnerable when facing a malicious non-target vocabulary, and we hope future work will pay more attention to the robustness of CLIP on open recognition tasks.

In the following, we delve into the representation space of CLIP to understand its extensibility. We first point out that the small margin between positive and negative class descriptions leads to prediction shifting when competing class features appear, which limits the stability of CLIP (§ 4.1). Further, we investigate the representation space of CLIP-like models via two metrics: inter-modal alignment and intra-modal uniformity. The results show that enforcing the distinguishability of class features increases the margin and makes the models scale more stably (§ 4.2).

Small margin limits the stability of CLIP
Since CLIP formalizes visual recognition as an image-to-text matching task, each text feature of a class description corresponds to a class vector in traditional classifiers, and the image-text similarity scores are analogous to classification logits. Ideally, regardless of vocabulary expansion, for an image, the similarity of the positive pair (the image with the text specifying the ground-truth class) should be higher than that of the negative pairs (the image with texts specifying other classes) to ensure correct prediction on open tasks. In other words, the margin (Jiang et al., 2019) between the positive and the largest negative similarity is a direct contributor to stability. Unfortunately, the similarity and margin distributions of CLIP do not meet this expectation. Figure 4 illustrates the averaged cosine similarity of CLIP (ViT-B/32) on 15 classes of CIFAR100. The diagonal elements represent the similarity of the positive image-text pairs, while the others represent that of the negative ones. In general, the cosine similarity of image-text pairs is very low, with an average of 0.20; it is only 0.26 even for the positive pairs. Besides, the similarities of positive and negative pairs are very close, indicating low distinguishability between different classes. As shown in Figure 5 and Figure 6, the similarity histograms of positive and negative pairs have a large overlap, and the margin is clustered around zero, leaving the predictions of models at risk of being reversed to new non-target classes. For example, as the vocabulary extends from the red box to the green box (diagonal) or the yellow box (horizontal) in Figure 4, more deceptive classes (circles) with negative margins are added, leading to prediction shift. In particular, classes belonging to the same vocabulary³ have higher similarity and smaller margins, making them more likely to be confused with each other.
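The margin notion used here is easy to compute from a similarity matrix. A minimal sketch with made-up similarity values (chosen to mirror the three regimes in the text: confident, near-zero margin, and flipped prediction):

```python
import numpy as np

def margins(sim_matrix, labels):
    """Per-image margin: similarity to the ground-truth class description
    minus the largest similarity among the other (negative) descriptions."""
    sim = np.asarray(sim_matrix, dtype=float)
    out = []
    for i, y in enumerate(labels):
        pos = sim[i, y]
        neg = np.delete(sim[i], y).max()   # strongest competing class
        out.append(pos - neg)
    return np.array(out)

# toy similarity rows: a confident image, a borderline one, a flipped one
sims = [[0.30, 0.10, 0.12],
        [0.26, 0.25, 0.20],
        [0.18, 0.22, 0.15]]
m = margins(sims, labels=[0, 0, 0])
```

A negative margin means the prediction has already shifted to a competing class; a margin clustered near zero, as in Figure 6, means one new distractor can flip it.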

Inter-modal alignment and intra-modal uniformity ground the margin
According to the results in § 4.1, the ideal feature space for CLIP-like models should have a large margin between different classes to ensure stability in open-vocabulary recognition tasks. To achieve this, the text feature of a class name should be close to the features of the images it describes (Ren et al., 2021), and the intra-modal features, especially textual features, should be uniformly distributed to make the descriptions of competing categories more distinguishable (Wang and Isola, 2020). In order to measure the quality of representations in the vision-and-language domain, we propose two metrics, inter-modal alignment and intra-modal uniformity. Inter-modal alignment calculates the expected distance between features of positive image-text pairs p_pos:

ℓ_align = E_{(x,t)∼p_pos} ||f_I(x) − f_T(t)||²,

while intra-modal uniformity measures how well the image or text features are uniformly distributed:

ℓ_uniform = log E_{x,x′∼p_data} e^{−2||f(x)−f(x′)||²}.

³Every 5 adjacent classes in Figure 4 constitute a vocabulary (superclass); see Table 4 in Appendix A.2.
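The two metrics can be computed directly from unit-norm features. A sketch following the form of Wang and Isola (2020), with the Gaussian-potential temperature fixed to 2 (an assumption; the paper does not state its constants) and toy 2-D features:

```python
import numpy as np

def l_align(img_feats, txt_feats):
    """Inter-modal alignment: expected squared distance between the
    (unit-norm) features of positive image-text pairs. Lower is better."""
    d = img_feats - txt_feats
    return float((d ** 2).sum(axis=1).mean())

def l_uniform(feats, t=2.0):
    """Intra-modal uniformity (Wang & Isola, 2020): log of the average
    Gaussian potential over all pairs of features. Lower is more uniform."""
    n = len(feats)
    sq = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    off = sq[~np.eye(n, dtype=bool)]          # exclude self-pairs
    return float(np.log(np.exp(-t * off).mean()))

# unit-norm toy features: spread points vs. points clustered on the circle
spread = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
theta = np.linspace(0, 0.2, 4)
clustered = np.stack([np.cos(theta), np.sin(theta)], axis=1)
u_spread, u_clustered = l_uniform(spread), l_uniform(clustered)
```

As expected, the spread features score lower (better) uniformity than the clustered ones, matching the paper's observation that bunched-up text features are the ones that confuse competing classes.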

Discussions
After these preliminary explorations on the openness of CLIP-like models, we present potential ways to enhance the models' extensibility and stability.
(1) For pre-training: in order to improve the quality of CLIP's feature space and enhance alignment and uniformity, more high-quality pre-training data and effective supervision signals such as ℓ_align and ℓ_uniform can be introduced during pre-training.
(2) For inference: the context for each class name is the same during inference, making it difficult to discriminate between distinct visual categories because the semantics of each cannot be holistically represented. To remedy this, we suggest customizing class descriptions with diverse captions retrieved from the pre-training corpus as a prompt ensemble. The effectiveness of this idea is verified through experiments; details can be found in Appendix A.5.

Conclusion
In this paper, we evaluate the extensibility of CLIP-like models for open-vocabulary visual recognition.
Our comprehensive study reveals that as the vocabulary expands, the performance of these models deteriorates significantly due to indistinguishable text features among competing classes.We hope that our investigation and analysis will facilitate future research on the CLIP openness issue.

Limitations
To facilitate future research, we analyze the difficulties and possible solutions in this new area.
(1) As we present extensive empirical results and expose the weakness of CLIP under vocabulary expansion, its theoretical risk on open tasks remains to be investigated.
(2) The current evaluation protocol is an approximation of the real open world.An evolving benchmark could facilitate future research.
(3) For various visual categories, their degree of abstraction, the ease of describing them in natural language, and their density in the data distribution can also influence the extensibility and stability of models, all of which are worth studying.

A Appendix
A.1 Comparison of related work

A.2 Superclass-class hierarchy for vocabulary construction
To construct the vocabularies in § 3, we leverage the underlying superclass-class hierarchical structure of CIFAR100 (Krizhevsky and Hinton, 2009) and ImageNet (Deng et al., 2009), and group the classes belonging to the same superclass into a vocabulary. Table 4 lists the vocabularies in CIFAR100, which are specified by (Krizhevsky and Hinton, 2009). There are 20 vocabularies, each with 5 classes. For ImageNet, we utilize two superclass-class structures, Entity13 and Living17 (Santurkar et al., 2021), as shown in Table 5 and Table 6, respectively. Entity13 has 13 vocabularies, each with 20 classes, while Living17 has 17 vocabularies, each with 4 classes.
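The grouping step is a simple inversion of the class-to-superclass mapping. A sketch using a tiny slice of the CIFAR100 hierarchy (the mapping literal here is hand-picked for illustration):

```python
def build_vocabularies(class_to_superclass):
    """Group classes sharing a superclass into one vocabulary."""
    vocabs = {}
    for cls, sup in class_to_superclass.items():
        vocabs.setdefault(sup, []).append(cls)
    return vocabs

# tiny slice of the CIFAR100 superclass-class hierarchy
mapping = {"rose": "flowers", "tulip": "flowers",
           "bee": "insects", "beetle": "insects"}
vocabs = build_vocabularies(mapping)
```

Each resulting value is one vocabulary V^(T)_i used in the expansion protocol of § 3.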

A.3 Dataset-level extensibility
The evaluation protocol in § 3 estimates extensibility and stability within a single task dataset, where the input images and classes during vocabulary expansion come from the same data distribution. Although the protocol is only an approximation of the real open world, current CLIP-like models already exhibit serious performance degradation under it. In this section, we take a step further toward real open recognition by conducting vocabulary expansion at the dataset level, where the expanded vocabularies come from different datasets.
In this way, the relationship between vocabularies is more uncertain and thus can be viewed as a rigorous stress test for the CLIP-like models.Specifically, we group all categories in a dataset into one vocabulary.Afterward, the inputs and classes of the entire new dataset are introduced at each expansion.
Classes in the new vocabulary will be removed if they already exist in the previous vocabularies.
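This order-preserving deduplication can be sketched as:

```python
def expand_vocabulary(current, new_vocab):
    """Union a new dataset's vocabulary into the current one, dropping class
    names that already exist in the previous vocabularies."""
    seen = set(current)
    added = [w for w in new_vocab if w not in seen]
    return current + added

v = expand_vocabulary(["cat", "dog"], ["dog", "truck", "cat", "ship"])
```

Note this only removes exact duplicates; subclass-superclass pairs across datasets (e.g., cat vs. tiger cat, discussed below) survive the filter, which is precisely what makes the generic-dataset compositions unstable.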
Table 7 demonstrates the results of dataset-level expansion. First, the performance of CLIP-like models on generic dataset expansion drops dramatically. For example, the accuracy (Acc-E) of CLIP (RN101) decreases by an average of 14.2 absolute points on the CIFAR100-Caltech101-SUN397 composition during expansion, and by 14.5 on the CIFAR10-CIFAR100-ImageNet composition. Due to the subclass-superclass relationships between some classes in different generic datasets, e.g., cat in CIFAR10 and tiger cat in ImageNet, CLIP is extremely unstable on such expansions across generic datasets. For example, the Acc-S of CLIP (RN101) on the CIFAR10-CIFAR100-ImageNet composition is 28.2% lower than its Acc-C, indicating that the models are prone to confusion about subclass-superclass relationships. Meanwhile, CLIP-like models exhibit much better extensibility and stability on dataset-level expansion across specialized datasets, e.g., the Flowers102-OxfordPets-StanfordCars composition. The vocabularies of this composition are intrinsically disjoint in semantics, so the model can be extended stably. In summary, our investigations on dataset-level expansions, along with the task-level results in the paper, show that current CLIP-like models fail to meet the expectation of conducting real open-vocabulary recognition.

A.4 Incremental Acc-E and Acc-S on CIFAR100
We record Acc-E (Eq. (2)) and Acc-S (Eq. (5)) after each vocabulary expansion on CIFAR100 to investigate the openness of CLIP-like models.
Figure 10 shows the Acc-E for 20 trials as new vocabularies are merged incrementally. The falling lines indicate that the model is either performing poorly on the new input images, or that some images that were correctly identified before are misclassified after the new classes are introduced.
Figure 11 shows the Acc-S of CLIP-like models during non-target vocabulary expansion. Each subfigure represents the situation when one vocabulary is selected as the target vocabulary. As the remaining 19 non-target vocabularies are incorporated and the model is required to recognize the 5 target classes among 100 potential classes, the accuracy drops sharply. The decrease in Acc-S brought by each introduction of a non-target vocabulary indicates that more images from the target vocabulary are incorrectly classified into the new non-target vocabulary.

A.5 Retrieval-enhanced prompt engineering
In light of the previous investigations, we propose a simple yet effective method named Retrieval-enhanced Prompt Engineering (REPE) to enforce the distinguishability of class features and the image-class semantic alignment (Cao et al., 2020; Ren et al., 2021). Recall that the context for each class name is the same in vanilla CLIP-like models (e.g., "a photo of a [CLASSNAME]"), making it difficult to discriminate between distinct visual categories because the semantics of each cannot be holistically represented (Zhou et al., 2022).
To remedy this, we propose to customize each class description with diverse captions retrieved from the pre-training corpus as a prompt ensemble. Specifically, for each class description based on the original prompt, we utilize CLIP to recall the most similar images from the pre-training dataset via image-text similarity, then obtain their corresponding captions. Retrieved captions in which the class name does not appear are filtered out, yielding K captions. Such a workflow leverages both visual semantics and class names, achieving better performance. Table 8 shows some examples of the captions retrieved by our proposed REPE on CIFAR100. They share the same target of interest with the original prompt, i.e., "a photo of a [CLASS]", but provide the context in which the class name is located and thus have richer semantics. For example, given a class like bridge, the retrieved captions describe its possible properties (e.g., "golden", "wooden"), its connections to other objects (e.g., "over a mountain river"), etc., yielding more expressive and distinguishable text features for the class.
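The retrieve-then-filter step can be sketched as follows. This is a toy sketch: the corpus features and captions are made up, the substring check stands in for the paper's class-name filter, and the paper uses FAISS over CC12M rather than a dense matrix product.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def retrieve_captions(class_feat, image_feats, captions, class_name, k=2):
    """REPE retrieval sketch: rank corpus images by similarity to the class
    description, take their captions, keep only those mentioning the class."""
    sims = normalize(image_feats) @ normalize(class_feat)
    ranked = np.argsort(-sims)                 # most similar images first
    kept = [captions[i] for i in ranked if class_name in captions[i].lower()]
    return kept[:k]

# toy corpus: the first two images (and captions) match the class "apple"
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
caps = ["Apples growing on a tree", "Still life with apples", "A red truck"]
res = retrieve_captions(np.array([1.0, 0.0]), feats, caps, "apple")
```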
After retrieval, we encode the retrieved captions and conduct a mean pooling operation among them.
The final text representation is:

f_T^REPE(t_i) = λ · f_T(t_i) + (1 − λ) · (1/K) Σ_{j=1}^{K} f_T(rt_{ij}),

where rt_{ij} is the j-th retrieved caption for class i and λ is a weighting factor. After that, the ensemble text representation f_T^REPE(t_i) is adopted as the class anchor for conducting the image classification. With REPE, the representation of the class description shifts towards that of the representative captions in the pre-training dataset, which alleviates the semantic inconsistency between pre-training and inference.

Table 6: Superclass-class hierarchy in ImageNet (Living17). Each superclass corresponds to a vocabulary, and each vocabulary has 4 classes. There are 17 vocabularies in total, specified by BREEDS (Santurkar et al., 2021).

Table 8 (content): retrieved captions per class.
apple: "Apple slices stacked on top of each other"; "Apples growing on a tree"; "Still life with apples in a basket"
woman: "Portrait of a young woman"; "Woman standing at the window"; "Confident woman in a red dress and gold crown"
bridge: "The golden bridge in Bangkok"; "Bridge on the River Kwai ~Video Clip"; "Wooden bridge over a mountain river"
ray: "Stingray in the Grand Cayman, Cayman Islands stock photography"; "Common Stingray swimming close to the sea floor."; "Sun Rays Tours: Go Pro captured the rays under water"
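The ensemble pooling can be sketched as below. The λ-weighted combination of the prompt feature with the mean-pooled caption features is our reading of the paper's description ("mean pooling" plus "a weighting factor λ"), not a verified reproduction; the final re-normalization is also an assumption, since CLIP anchors are compared by cosine similarity.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def repe_text_feature(prompt_feat, caption_feats, lam=0.25):
    """Ensemble class anchor: weight the original prompt feature against the
    mean-pooled features of the K retrieved captions, then re-normalize."""
    pooled = np.mean(caption_feats, axis=0)     # mean pooling over captions
    return normalize(lam * prompt_feat + (1.0 - lam) * pooled)

prompt = np.array([1.0, 0.0])                   # stand-in f_T(t_i)
caption_feats = np.array([[0.8, 0.6], [0.6, 0.8]])
anchor = repe_text_feature(prompt, caption_feats)
```

With λ = 0.25 (the value reported in the experiments), the anchor leans toward the retrieved captions while retaining the original prompt's direction.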
Experiments. We retrieve the images and captions from CC12M (Changpinyo et al., 2021), a subset of the pre-training dataset of CLIP. The images and captions are pre-encoded within an hour using a single RTX TITAN GPU; we then build their indices for KNN search with the FAISS framework (Johnson et al., 2019), which also takes about an hour. Once the indices are built, we can efficiently search over the dataset for a query image in less than 5 ms, which is applicable to query-intensive scenarios.
Table 9 shows the results of REPE, with hyperparameters K = 100 and λ = 0.25. REPE consistently improves the extensibility and stability of CLIP by an average of 1.2% across all three datasets. We further evaluate the quality of the enhanced representations by analyzing the text uniformity and inter-modal alignment losses. As shown in Figure 7, our proposal effectively reduces ℓ_uniform-T from −0.8 to −1.0 and ℓ_align from 1.5 to 1.4, verifying its effectiveness in improving the class anchor for better extensibility and stability. Additionally, as shown in Figure 9, REPE increases the median of the margin distribution from 0.005 to 0.01 and pushes the overall distribution towards the positive side compared to vanilla CLIP. This indicates that REPE widens the gap between positive and negative class features, making it more difficult to invert predictions with competing classes. These findings support REPE's effectiveness in alleviating the openness issue.
It is worth noting that, compared to methods that require computation-intensive pre-training (DeCLIP and SLIP) or prompt tuning with access to the downstream target dataset (CoOp), our REPE is a lightweight framework for the zero-shot inference stage that requires no fine-tuning. Besides, since REPE is model-agnostic and orthogonal to parameter-tuning methods, it can also be combined with fine-tuning methods like adapter-tuning (Gao et al., 2021) to achieve a further performance boost of 0.6 on CIFAR100 and ImageNet, which demonstrates the adaptability and superiority of our method. Please refer to Table 10 for details.

Figure 2b: Difference between Acc-C and Acc-S of CLIP (ViT-B/32) on different groups.

Figure 2 :
Figure 2: Acc-C and Acc-S (%) of CLIP and its variants on CIFAR100. The horizontal axis represents the extended non-target vocabularies in order. PE refers to Prompt Ensemble.

Figure 3 :
Figure 3: Adversarial non-target vocabulary for the CIFAR datasets. Adding 3 adversarial non-target classes leads to severe performance (Acc-S) deterioration, revealing the vulnerability of CLIP when faced with a malicious vocabulary.

Figure 4 :
Figure 4: Cosine similarity between image (-I) and text (-T) features of CLIP on CIFAR100. Each value in the matrix is averaged over 100 samples. The expansions from the red box to the green box (diagonal) and the yellow box (horizontal) refer to the calculation of extensibility and stability, respectively. A circle indicates that more than 15 wrong predictions arose after adding this class.

Figure 7 :
Figure 7: ℓ_align and ℓ_uniform of CLIP-like models. For both metrics, lower numbers are better. The color of the points and numbers denotes the extensibility performance (Acc-E) on CIFAR100 (higher is better).

Figure 8 :
Figure 8: Representation visualization of CLIP and CoOp (ViT-B/16). The five classes with different colors are from CIFAR100. • refers to image features (-I), while × and ⋆ refer to text features (-T) of CLIP and CoOp, respectively. The color of ⋆ from transparent to opaque indicates the optimization trajectory during the CoOp prompt-tuning process.

Figure 9 :
Figure 9: Margin distribution of similarity scores of our REPE (blue) and CLIP (ViT-B/32) (red). The median of REPE's distribution (the blue vertical line) is larger than that of CLIP (the red line), indicating that the predictions of REPE are harder to invert with competing classes than those of the original CLIP.

Figure 11 :
Figure 11: Incremental Acc-S of CLIP and its variants on CIFAR100.

Table 3 :
A comparison of Closed Set Recognition, Open Set Recognition (OSR), Open World Recognition (OWR), and Open-Vocabulary Recognition (OVR).

Table 4 :
Superclass-class hierarchy in CIFAR100. Each superclass corresponds to a vocabulary, and each vocabulary has 5 classes. There are 20 vocabularies in total, specified by (Krizhevsky and Hinton, 2009).

Table 7 :
Extensibility and stability of CLIP and its variants during dataset-level vocabulary expansion. ∆ refers to the decline of Acc-E/Acc-S (%) compared to Acc-C (%). PE denotes Prompt Ensemble.

Table 8 :
Instances of the captions retrieved by our REPE on CIFAR100.

Table 9 :
Extensibility and stability of our REPE method on CIFAR100 and ImageNet datasets.

Table 10 :
Accuracy of CLIP-Adapter and our REPE method with few-shot learning.
Figure 10: Incremental Acc-E of CLIP and its variants on CIFAR100.