CLIPTEXT: A New Paradigm for Zero-shot Text Classification



Introduction
Understanding various modalities is one of the core goals of Artificial Intelligence. To achieve this, vision-and-language (VL) tasks such as visual question answering (Antol et al., 2015) and image captioning (Chen et al., 2015) have emerged, aiming to test a system's ability to understand the semantics of both the visual world and natural language. Recently, CLIP (Radford et al., 2021), a cross-modality model pre-trained on 400M noisy image-text pairs collected from the Internet, has achieved remarkable success on various VL tasks.
In addition, CLIP shows strong zero-shot transfer capabilities on over 30 existing computer vision (CV) datasets (e.g., image classification (Jia et al., 2021) and object detection (Gu et al., 2021b)). Beyond CV tasks, various works have begun to explore transferring knowledge from CLIP to other VL tasks. For example, Shen et al. (2021) demonstrate that using CLIP as a strong visual encoder can benefit VL tasks in both the pre-training and fine-tuning stages. Song et al. (2022) show that CLIP can be a strong few-shot learner for VL tasks through a comprehensive empirical study on visual question answering and visual entailment (Xie et al., 2019). Nevertheless, while significant recent progress has been made in applying CLIP to VL and CV tasks, the same success has not yet been achieved on language tasks. In this work, we argue that because CLIP was pre-trained with natural language supervision, it should be capable of helping language tasks. Motivated by this, we aim to close this gap by studying the following research question: can CLIP benefit language tasks?
To this end, we provide a comprehensive investigation of the zero-shot text classification task, aiming to study how to transfer CLIP's zero-shot ability to the language domain. Specifically, this work presents CLIPTEXT, a novel paradigm for zero-shot text classification. The key insight is that CLIPTEXT reformulates zero-shot text classification as a text-image matching problem, so that CLIP can be directly applied to zero-shot text classification. As shown in Fig. 1, CLIPTEXT consists of a two-step procedure: (i) Label Mapping and (ii) Inference. Specifically, the Label Mapping step maps each text classification label to a corresponding image, so that text-image pairs can be constructed. The Inference step then passes the generated text-image pairs into the CLIP model, and the label with the highest alignment score is taken as the prediction. In addition, inspired by recent progress on prompting methods in natural language processing (Liu et al., 2021; Zhao and Schütze, 2021; Zhu et al., 2022; Hu et al., 2022; Qi et al., 2022), we further present PROMPT-CLIPTEXT, which adds an additional semantic prompt word at the beginning of the text in CLIPTEXT, enabling the model to better infer language knowledge from CLIP. Compared with previous methods, our method has the following advantages. First, some prior work (Yin et al., 2019) requires an additional NLI dataset to further train the zero-shot classification model. In contrast, our framework makes full use of the powerful zero-shot capability of CLIP without any extra pre-training. Second, we present an innovative perspective on zero-shot text classification that naturally leverages the additional vision information inferred from CLIP to benefit language tasks. Third, our framework is model-agnostic, without any task-specific network design, and can therefore easily be extended to other VL pre-trained models.
We first evaluate our approaches on the standard zero-shot text classification benchmark (Yin et al., 2019). Experimental results show that CLIPTEXT and PROMPT-CLIPTEXT achieve superior performance. In addition, we further evaluate CLIPTEXT and PROMPT-CLIPTEXT on four other publicly available zero-shot text classification datasets to verify their generalization.
Figure 2: CLIP consists of a text encoder and an image encoder, followed by a dot-product operation; the label with the highest alignment score is predicted as the result.

In summary, the contributions of this work are:
• To our knowledge, this is the first work to investigate how to transfer the zero-shot capabilities of CLIP to language tasks. We hope this work will spur more researchers to rethink the role of VL models for language tasks;
• We introduce CLIPTEXT, a novel paradigm for zero-shot text classification that reformulates it as a text-image matching problem. In addition, we further propose PROMPT-CLIPTEXT to better transfer knowledge from CLIP to zero-shot text classification;
• Experiments on seven text classification datasets show the effectiveness of our framework. Extensive analysis further verifies the generalization and superiority of our approach.
To promote further research, our code is publicly available at https://github.com/LightChen233/CLIPText.

Preliminaries
CLIP

CLIP (Contrastive Language-Image Pre-training) (Radford et al., 2021), an efficient and scalable approach to learning visual concepts from natural language supervision, has obtained remarkable success on various zero-shot computer vision tasks (Gu et al., 2021b). Instead of pre-training on traditional high-quality annotated data, CLIP is trained on 400 million noisy web-crawled image-text pairs, which are much easier to collect. As shown in Fig. 2 (a), CLIP contains a visual encoder V and a text encoder T. Specifically, CLIP employs ResNet (He et al., 2016) or ViT (Dosovitskiy et al., 2020) as the visual encoder backbone and uses a Transformer (Vaswani et al., 2017) as the text encoder backbone. After the text encoder and image encoder produce the text representation T(text) and image representation V(image), a dot product (V(image) · T(text)) is used to calculate the similarity between the given text and image. Specifically, the normalized similarity score of matching image $i$ with text $j$ can be calculated by:

$$p_{i,j} = \frac{\exp\big(\beta \,\mathbf{V}(\mathrm{image}_i) \cdot \mathbf{T}(\mathrm{text}_j)\big)}{\sum_{k=1}^{N} \exp\big(\beta \,\mathbf{V}(\mathrm{image}_i) \cdot \mathbf{T}(\mathrm{text}_k)\big)},$$

where $\beta$ is a temperature hyperparameter and $N$ denotes the number of samples in a batch.
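For concreteness, the following is a minimal sketch of this text-image matching, assuming the HuggingFace `transformers` implementation of CLIP; the checkpoint name and example inputs are illustrative, not the exact setup used in this paper.

```python
# A minimal sketch of CLIP text-image matching with the HuggingFace
# `transformers` implementation; checkpoint name and inputs are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                    # one candidate image
texts = ["a photo of a cat", "a photo of a dog"]     # candidate texts

inputs = processor(text=texts, images=image,
                   return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the temperature-scaled dot products
# V(image) . T(text); softmax gives the normalized similarity p_{i,j}.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```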

Zero-shot Text Classification
To provide an intuitive understanding of zero-shot text classification, we first introduce the classic supervised text classification and then describe the key difference between the supervised paradigm and zero-shot paradigm.
Supervised Text Classification Paradigm. In the traditional supervised text classification paradigm, given training data $D_{train}$, validation data $D_{dev}$, and test data $D_{test}$, we first leverage $D_{train}$ and $D_{dev}$ to train a model in a supervised manner, and then apply the trained model to $D_{test}$, which can be denoted as:

$$Y = M(D_{test}),$$

where $M$ denotes the model trained on $D_{train}$ and $D_{dev}$; $Y$ represents the outputs of $M$.
Zero-shot Text Classification Paradigm. In contrast to the supervised paradigm, following FitzGerald et al. (2022), a zero-shot text classification model $M$ does not require any training process on $D_{train}$; it can only access the dev set $D_{dev}$ and the test set $D_{test}$. The model is directly applied to the test set without any training, which is formulated as:

$$\hat{Y} = M(D_{test}),$$

where $\hat{Y}$ represents the outputs of zero-shot text classification.

Model
This section illustrates how to solve the zero-shot text classification task with CLIP (see CLIPTEXT (§3.1) and PROMPT-CLIPTEXT (§3.2)).

CLIPTEXT
We convert the original text-label pairs in text classification into text-image pairs to keep the original CLIP structure unchanged. To this end, CLIPTEXT consists of two steps: (i) Step I, Label Mapping (§3.1.1), converts each text label into an image to build text-image pairs; (ii) Step II, Inference (§3.1.2), passes the generated text-image pairs into CLIP to obtain the matching similarity score of each text-image pair and produce the final zero-shot prediction results.

Step I: Label Mapping
Given the test set $D_{test} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ (where $N$ denotes the number of test samples), label mapping aims to convert the text label set $\mathcal{V}_{Label}$ into a semantically aligned image label set $\mathcal{V}_{Image}$ to build the text-image pairs.
In our framework, for each text label $y$, we manually use the Google search engine to find the corresponding image according to the dev performance:

$$v = \mathrm{LabelMapping}(y).$$

Therefore, with the help of the label mapping step, the original text-label pairs are converted into text-image pairs $\{(x^{(i)}, v^{(i)})\}_{i=1}^{N}$.

Step II: Inference

Given the constructed text-image pairs, the CLIP model produces a zero-shot prediction by:

$$\hat{y}^{(i)} = \arg\max_{j}\; p(v_j \mid x^{(i)}),$$

where $p(v_j \mid x^{(i)})$ is the normalized similarity score (§2.1) between the input text $x^{(i)}$ and the image $v_j$ mapped from label $j$. We select the label with the highest probability as the final prediction in the single-label text classification task, while in multi-label classification we choose all labels whose scores exceed a threshold value $t$.
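A minimal sketch of this two-step procedure is shown below, again assuming the HuggingFace CLIP implementation; the label-to-image mapping is a hypothetical stand-in for images retrieved manually with a search engine.

```python
# Sketch of CLIPTEXT inference, assuming the HuggingFace CLIP implementation;
# the label-to-image mapping is a hypothetical stand-in for images retrieved
# manually with a search engine.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Step I: Label Mapping -- each text label is mapped to one image.
label_to_image = {
    "joy": "images/joy.jpg",
    "sadness": "images/sadness.jpg",
    "fear": "images/fear.jpg",
}
labels = list(label_to_image)
images = [Image.open(p) for p in label_to_image.values()]

def classify(text: str) -> str:
    # Step II: Inference -- match the input text against every label image
    # and return the label whose image has the highest alignment score.
    inputs = processor(text=[text], images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    scores = outputs.logits_per_text[0]   # one score per label image
    return labels[int(scores.argmax())]

print(classify("I felt fear when my mother was heavily ill."))
```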

PROMPT-CLIPTEXT
Similar to CLIPTEXT, PROMPT-CLIPTEXT also consists of a Label Mapping step and an Inference step.

Step I: Label Mapping
PROMPT-CLIPTEXT employs the same label mapping step as CLIPTEXT to acquire the constructed text-image pairs.

Step II: Inference
Instead of directly passing $D_{test}$ into CLIP, PROMPT-CLIPTEXT adds an additional semantic prompt word at the beginning of the input text $x$ to generate a new prompt-guided text $\tilde{x}$:

$$\tilde{x} = [\mathrm{Prompt};\, x],$$

where Prompt denotes the task-specific hard prompt word for each zero-shot text classification dataset and $[\cdot\,;\cdot]$ denotes concatenation.
Given the updated prompt-guided text-image pairs $\{(\tilde{x}^{(i)}, v^{(i)})\}_{i=1}^{N}$, PROMPT-CLIPTEXT employs CLIP to obtain the final prediction in the same way as CLIPTEXT:

$$\hat{y}^{(i)} = \arg\max_{j}\; p(v_j \mid \tilde{x}^{(i)}).$$

Take the input text in Fig. 3 as an example: the original input text in the topic classification dataset is $x$ = {What is an "imaginary number"...} (Fig. 3 (a)); we insert the additional prompt word topic: to generate the prompt-guided text {topic: What is an "imaginary number"...} (Fig. 3 (b)). The intuition is that the prompt in PROMPT-CLIPTEXT can be regarded as inductive prior knowledge that helps the CLIP model better understand the theme of the text classification task and thus better transfer knowledge from CLIP to the language task. Specifically, the prompt words for topic classification, emotion classification, situation classification, intent detection, news categorization, opinion classification and question categorization are topic, interest, publication, type, clarify, caption and match, respectively.
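Under the same assumptions as the earlier sketch, the prompt-guided variant only changes how the input text is constructed before it reaches the classify function; the prompt words are the task-specific ones listed above, and the colon formatting follows the Fig. 3 example.

```python
# Sketch of PROMPT-CLIPTEXT on top of the classify() function from the
# earlier sketch: the only change is prepending the task-specific hard
# prompt word (colon formatting assumed, following the Fig. 3 example).
TASK_PROMPTS = {
    "topic": "topic:",
    "emotion": "interest:",
    "situation": "publication:",
}

def classify_with_prompt(text: str, task: str) -> str:
    prompt_guided_text = f"{TASK_PROMPTS[task]} {text}"
    return classify(prompt_guided_text)

print(classify_with_prompt('What is an "imaginary number"...', "topic"))
```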

Experimental Datasets
We first evaluate our approach on three standard zero-shot text classification benchmarks: (1) Topic classification: Yin et al. (2019) choose the Yahoo! Answers dataset (Zhang et al., 2015) to evaluate topic classification; it consists of 10 topic categories. (2) Emotion classification: the Unify Emotion dataset released by Bostan and Klinger (2018), which includes 9 emotion types. (3) Situation classification: the Situation Typing dataset released by Mayhew et al. (2019), which includes 11 situation types.
To further demonstrate the generalization of our method, we adopt four other publicly available datasets: (1) Intent detection: we choose Snips, a widely used intent detection benchmark collected from the Snips personal voice assistant (Coucke et al., 2018), which contains seven intent labels. (2) News categorization: the AG's News dataset (Conneau et al., 2017) is the most popular dataset for news categorization and contains four news types. (3) Opinion classification: the Subjectivity dataset (Subj) (Pang and Lee, 2005) with two opinion categories. (4) Question categorization: the Question dataset (TREC) (Li and Roth, 2002) contains six question types. Detailed statistics of the datasets are summarized in Table 1.

Experimental Baselines
We compare the performance of our approach with the following strong zero-shot text classification baselines: (1) Majority: this method directly adopts the most frequent label as the output. (2) Word2Vec (Mikolov et al., 2013): this approach first uses average embeddings to represent the input text and the labels, and then applies maximum cosine similarity to obtain the final output. (3) ESA (Chang et al., 2008): this method represents the input text and labels in the Wikipedia concept vector space, and then acquires the final prediction. (4) RTE (Yin et al., 2019): an entailment-based approach that treats the input text and label as an entailment problem; it trains an entailment model based on bert-base-uncased on the RTE dataset. (5) MNLI (Yin et al., 2019): the same entailment-based approach, with the entailment model trained on the MNLI dataset. We follow Ma et al. (2021) to obtain results. All experiments are conducted on GeForce GTX TITAN X, 2080Ti and 3080 GPUs.

Experimental Results
Following Yin et al. (2019) and Ma et al. (2021), we report label-wise weighted F1 for the emotion and situation datasets, and accuracy for the other datasets.
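For concreteness, a minimal sketch of how these metrics can be computed with scikit-learn; the prediction lists below are placeholders, not the paper's outputs.

```python
# Minimal sketch of the evaluation metrics with scikit-learn;
# y_true / y_pred are placeholder lists, not the paper's actual outputs.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["joy", "sadness", "fear", "sadness"]
y_pred = ["joy", "sadness", "joy", "sadness"]

# Label-wise weighted F1 (used for the emotion and situation datasets).
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
# Accuracy (used for the other datasets).
acc = accuracy_score(y_true, y_pred)
print(weighted_f1, acc)
```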
Experimental results are shown in Table 2, from which we have the following interesting observations: • Our framework obtains better performance than all baselines. Compared with the previous NSP-base (Reverse) model, CLIPTEXT obtains a 4.6% improvement on AVG, which verifies our hypothesis that knowledge transferred from CLIP can benefit language tasks, even more than knowledge from pre-trained language models themselves.
• PROMPT-CLIPTEXT further improves over CLIPTEXT, which indicates that the semantic prompt word helps to better transfer knowledge from CLIP to enhance zero-shot text classification.

Analysis
To better understand our model, we provide a comprehensive analysis to answer the following questions: (1) Does the vision knowledge from CLIP benefit the language task?
(2) Is it better to convert a label into multiple images and then ensemble them?
(3) Why does our approach successfully perform zero-shot text classification?
(4) What is the intuition behind our approach?
(5) What is the impact of image selection?

Answer 1: Vision Knowledge inferred from CLIP can Benefit Zero-shot Text Classification
In this section, we investigate whether the vision knowledge inferred from CLIP can benefit zero-shot text classification. To this end, we conduct experiments that directly encode both the text and the label with the CLIP text encoder and calculate the similarity score to predict the final results. We refer to this baseline as CLIP Text Encoder.
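A minimal sketch of this text-only baseline, under the same HuggingFace assumptions as before; the label names are placeholders.

```python
# Sketch of the CLIP Text Encoder baseline: both the input text and the
# label names go through CLIP's text encoder, and the label with the
# highest cosine similarity is predicted. Labels are placeholders.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
labels = ["joy", "sadness", "fear"]

def classify_text_only(text: str) -> str:
    inputs = processor(text=[text] + labels, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize
    scores = feats[0] @ feats[1:].T                    # cosine similarity
    return labels[int(scores.argmax())]
```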
Table 2 (CLIP Text Encoder) illustrates the results. We observe that our framework surpasses CLIP Text Encoder by a large margin (54.4% vs. 43.0%), indicating that the image knowledge learned through CLIP's text-image matching pre-training benefits zero-shot text classification tasks.

Answer 2: Ensemble Model Boosts Performance
This section investigates the effectiveness of the ensemble approach. Specifically, each text label is converted into two corresponding images, and we sum the two text-image alignment scores as the final prediction score. Table 2 (ensemble) shows the results. We observe that the ensemble model consistently outperforms the single model for both CLIPTEXT and PROMPT-CLIPTEXT, which suggests that different images provide different knowledge and views of the text, thereby promoting performance.
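A minimal sketch of this ensembling, building on the model and processor from the earlier sketch; the two images per label are hypothetical.

```python
# Sketch of the two-image ensemble: each label maps to two images and the
# two alignment scores are summed. Image paths are hypothetical; `model`
# and `processor` are reused from the earlier sketch.
import torch
from PIL import Image

label_to_images = {
    "joy": ["images/joy_1.jpg", "images/joy_2.jpg"],
    "sadness": ["images/sadness_1.jpg", "images/sadness_2.jpg"],
}
labels = list(label_to_images)
images = [Image.open(p) for paths in label_to_images.values() for p in paths]

def classify_ensemble(text: str) -> str:
    inputs = processor(text=[text], images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_text[0]   # one score per image
    # Sum the two scores belonging to the same label.
    summed = scores.view(len(labels), 2).sum(dim=-1)
    return labels[int(summed.argmax())]
```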

Answer 3: Why CLIPTEXT Works
To analyze why our approach works, we provide an intuitive visualization analysis of CLIPTEXT. We take the representation of each text from the CLIP text encoder T and of the corresponding image label from the CLIP vision encoder V for visualization. As shown in Fig. 4, the text representations and their corresponding image label representations lie close to each other, which demonstrates the powerful cross-modal alignment capabilities of CLIP and enables the model to perform zero-shot text classification.
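A minimal sketch of such a visualization, assuming text and image feature matrices produced as in the sketches above; it is illustrative rather than the paper's exact plotting code.

```python
# Sketch of the t-SNE visualization: project CLIP text features (dots) and
# label-image features (stars) into 2D. The feature matrices are random
# placeholders standing in for get_text_features / get_image_features outputs.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

text_feats = np.random.randn(200, 512)    # placeholder for T(text) features
image_feats = np.random.randn(10, 512)    # placeholder for V(image) features

embedded = TSNE(n_components=2, init="pca").fit_transform(
    np.vstack([text_feats, image_feats]))
plt.scatter(embedded[:200, 0], embedded[:200, 1], s=8, label="text")
plt.scatter(embedded[200:, 0], embedded[200:, 1], marker="*", s=120,
            label="label image")
plt.legend()
plt.show()
```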

Answer 4: Qualitative analysis
To intuitively understand our approach, we conduct a qualitative analysis by providing a case study on the emotion classification task, comparing CLIPTEXT with NSP (Reverse).
Fig. 5 illustrates the case study. Given the input text "I felt frustrated, angry, utterly dejected.", the NSP (Reverse) model incorrectly predicts the label angry. We suspect that the spurious cue word angry in the text misleads the NSP (Reverse) model into predicting angry. In contrast, our approach CLIPTEXT correctly predicts the label sadness. This further demonstrates that the rich information in the image can help our model make a correct prediction, compared with the single text label used in traditional zero-shot text classification models.

Answer 5: Impact of Image Selection
An interesting question that arises is what the impact of image selection in the label mapping stage is. To answer this question, for each text label, after obtaining M images returned by the Google search engine, we randomly choose one of the M images as the mapping image. We repeat this over 30 different experiments and report the standard deviation.
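A minimal sketch of this randomized selection study; the candidate image lists are placeholders, and `evaluate` is a hypothetical helper that runs CLIPTEXT over a dataset with a given label-to-image mapping and returns a score.

```python
# Sketch of the image-selection study: for each of 30 runs, pick one random
# image per label from its M candidates, evaluate, and report mean/std.
# `evaluate(mapping)` is a hypothetical evaluation helper, not part of any
# existing library.
import random
import statistics

candidate_images = {
    "joy": ["images/joy_1.jpg", "images/joy_2.jpg", "images/joy_3.jpg"],
    "sadness": ["images/sadness_1.jpg", "images/sadness_2.jpg"],
}

scores = []
for run in range(30):
    random.seed(run)
    mapping = {label: random.choice(imgs)
               for label, imgs in candidate_images.items()}
    scores.append(evaluate(mapping))   # hypothetical evaluation helper

print(statistics.mean(scores), statistics.stdev(scores))
```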
Results are illustrated in Fig. 6, which shows a relatively high standard deviation on each dataset. Therefore, future work could focus on how to automatically select images for label mapping, which is an interesting and important topic to investigate.

Potential Impact
Recently, CLIP (a powerful vision-and-language (VL) model) has shown remarkable success on various zero-shot VL and computer vision tasks. Inspired by this, our work makes the first attempt to investigate how to transfer knowledge of CLIP to language tasks. To achieve this, we introduce CLIPTEXT and PROMPT-CLIPTEXT, a novel paradigm for zero-shot text classification that reformulates it as a text-image matching problem. Our work demonstrates that CLIP can be a good zero-shot learner for language tasks, and we hope it will encourage more researchers to explore how to better leverage the knowledge of VL models to help language tasks.

Related Work
In this section, we discuss related work on the zero-shot text classification task and applications of CLIP.

Zero-shot Text Classification Task
Zero-shot text classification allows a model to directly make predictions without any training process, and has gained increasing attention since it can greatly reduce human annotation effort. Yin et al. (2019) introduce three zero-shot text classification benchmarks and propose several strong entailment-based baselines to facilitate this line of research. Puri and Catanzaro (2019) introduce a generative language model (e.g., GPT-2) for zero-shot text classification. Ma et al. (2021) explore the powerful zero-shot ability of BERT for zero-shot text classification, achieving promising performance. Compared with their work, which focuses on natural language understanding models, our approach explores the zero-shot capacity of a VL model (CLIP) for zero-shot text classification.

Application of CLIP
CLIP (Radford et al., 2021), a powerful text-image cross-modal pre-trained model, has shown strong zero-shot capability on various downstream tasks. Gu et al. (2021a) apply CLIP to open-vocabulary object detection by detecting objects described by arbitrary text inputs rather than pre-defined categories. Portillo-Quintero et al. (2021) use CLIP for zero-shot video retrieval. Song et al. (2022) provide a comprehensive investigation of applying CLIP to zero-shot visual question answering and visual entailment. Subramanian et al. (2022) present a strong zero-shot baseline for referring expression comprehension. Su et al. (2022) combine CLIP with an off-the-shelf language model for image-grounded text generation, achieving promising performance. In contrast to these works, which mainly focus on zero-shot computer vision or vision-and-language tasks, our work applies CLIP to zero-shot text classification and shows that knowledge from CLIP can benefit language tasks. To the best of our knowledge, we are the first to explore CLIP for the zero-shot text classification task.

Conclusion
In this work, we studied how to transfer knowledge from CLIP to zero-shot text classification. To this end, we introduced a novel paradigm, CLIPTEXT and PROMPT-CLIPTEXT, for zero-shot text classification by reformulating it as a text-image matching problem. Experimental results demonstrated that CLIP can be a good zero-shot learner for text classification. To the best of our knowledge, this is the first work to apply CLIP to the zero-shot text classification task. We hope our work will motivate further research on transferring knowledge from VL models (e.g., CLIP) to language tasks.

Limitations
We present some limitations of our approach, which can be investigated in the future: (1) Currently, our approach requires manually choosing an image for each text label, which may make the model sensitive to the selected images. Although the ensemble method can alleviate this problem to some extent, how to automatically map a text label to a corresponding image remains an interesting research question. (2) Since CLIP was pre-trained on noisy web-crawled data from the Internet, our approach is limited by the pre-training data distribution of CLIP. Therefore, a potential future direction is to further pre-train CLIP on more general downstream task datasets.

Figure 1: Illustration of the two steps in CLIPTEXT. CLIPTEXT consists of two steps: (1) Label Mapping, which maps each label to a corresponding image and constructs text-image pairs; (2) Inference, which directly passes the generated text-image pairs into CLIP to obtain the final prediction results.

Figure 3: Illustration of CLIPTEXT (a) vs. PROMPT-CLIPTEXT (b). topic: stands for the hard prompt for the text classification task.


Figure 4: t-SNE visualization of text vectors (dots) from the CLIP text encoder T and image vectors (pentagrams) from the CLIP image encoder V. Dots in the same color represent text representations with the same intent, and different colors denote different intents.

Figure 5: Case study on the emotion classification task, comparing CLIPTEXT with NSP (Reverse).

Figure 6: Performance distribution boxplots for each task over 30 random experiments.

Table 1: Statistics of the datasets.

Table 2: Zero-shot main results. AVG denotes the average score over all datasets. Results with † are obtained by our re-implementation; other results are taken from the corresponding published papers (Chang et al., 2008; Mikolov et al., 2013; Yin et al., 2019; Ma et al., 2021). Results with BERT-base denote models using BERT-base as the backbone, and BERT-large denotes models using BERT-large as the backbone. "-" denotes results missing from the corresponding published work.