Constructing Contrastive Samples via Summarization for Text Classification with Limited Annotations

Contrastive learning has emerged as a powerful representation learning method that facilitates various downstream tasks, especially when supervised data is limited. Constructing effective contrastive samples through data augmentation is key to its success. Unlike in vision tasks, data augmentation for contrastive learning has not been sufficiently investigated in language tasks. In this paper, we propose a novel approach to constructing contrastive samples for language tasks using text summarization. We use these samples for supervised contrastive learning to obtain better text representations, which greatly benefit text classification tasks with limited annotations. To further improve the method, we mix up samples from different classes and add an extra regularization, named Mixsum, on top of the cross-entropy loss. Experiments on real-world text classification datasets (Amazon-5, Yelp-5, AG News, and IMDb) demonstrate the effectiveness of the proposed contrastive learning framework with summarization-based data augmentation and Mixsum regularization.


Introduction
Learning a good representation has been an essential problem in the deep learning era. In natural language processing in particular, language-model pre-training techniques such as BERT (Devlin et al., 2019) have achieved overwhelming success on a wide range of tasks by learning contextualized representations. However, the success of these pre-trained models hinges largely on the availability of plentiful labeled data for fine-tuning. With limited labels on the target task, fine-tuning BERT has been shown to be unstable (Zhang et al., 2021). In practice, it is costly to gather labeled data for a new task, and the lack of training data remains a major challenge in many real-world problems.
Recently, contrastive learning methods have become popular self-supervised learning tools and have made great progress in few-shot learning due to their better discriminative ability (Gidaris et al., 2019; Su et al., 2020). Various contrastive learning methods have been developed, leading to state-of-the-art performance in many computer vision tasks. They have also been extended to the fully supervised setting by leveraging label information to make further improvements. In natural language processing, contrastive learning has not been fully investigated, but it is attracting more and more attention.
A contrastive learning method generally consists of two components: finding positive and negative samples for each anchor sample, and building an effective objective function to discriminate between them. In many contrastive learning frameworks, efficiently finding contrastive samples is the key to success. For example, in MoCo, contrastive pairs are constructed by matching an encoded query with a dynamic dictionary; in SimCLR, contrastive pairs are created by applying two different data augmentation operators, and it was shown that the composition of data augmentation operations is crucial for learning good representations. In supervised contrastive learning, the positive sample space is essentially augmented: instead of only using the anchor sample and its own transformation, all samples in the same class can further be regarded as positive pairs.
In this paper, we focus on using contrastive learning to assist text classification tasks with limited labels. Considering the specifics of the text classification task, we propose two novel strategies to further enhance the performance of supervised contrastive learning. We assume that a good summarization system keeps the most critical information of the original text, and that the generated summary tends to belong to the same category as the original text. Thus, we utilize text summarization as a data augmentation method to create more positive and negative samples for supervised contrastive learning. Furthermore, we propose Mixsum, an idea similar to the methodology of mix-up (Zhang et al., 2018), which combines texts from different categories and creates new summary samples to further augment the data for contrastive learning. We adapt the supervised contrastive loss to the Mixsum setting and show that it brings great benefit for text classification when training data is extremely scarce.
Our main contributions are listed below:
• We propose a new contrastive learning framework for text representation learning that mitigates the label-deficiency problem in text classification.
• We employ text summarization, a new data augmentation method, to construct positive and negative sample pairs for contrastive learning.
• We improve the supervised contrastive learning method by mixing up samples from different categories. Combined with the summarization-based data augmentation method, our model shows superior performance on four real-world datasets.

Related Work

In a self-supervised contrastive learning framework, anchor samples are the original data samples, positive samples are augmented versions of the anchor samples, and negative samples are generally set to all other samples in the mini-batch. Equation 1 is the self-supervised contrastive learning objective of the popular SimCLR framework. For each mini-batch with N anchor samples, we obtain another N positive samples by data augmentation and concatenate them to form a new batch. Then, for each anchor example with index i in the range {1, 2, ..., N}, the index of the corresponding positive sample is 2i, and all other 2N − 2 samples in the batch are negative samples. f(·) is a representation model mapping input samples to normalized dense vectors in R^d, and τ is the temperature parameter.
Contrastive learning for NLP tasks has also attracted much research interest recently. Fang et al. (2020) propose to learn sentence-level representations by fine-tuning BERT (Devlin et al., 2019) with back-translation-based data augmentation and a self-supervised contrastive learning objective. Klein and Nabi (2020) propose to use contrastive learning for commonsense reasoning, and their method alleviates a current limitation of supervised commonsense reasoning. Khosla et al. (2020) explore the general supervised contrastive learning loss and show the effectiveness of supervised contrastive learning. Gunel et al. (2020) add the supervised contrastive loss to the original cross-entropy loss for fine-tuning pre-trained transformers such as RoBERTa and BERT (Devlin et al., 2019), which is highly related to our work. Our approach differs from these previous works in that we utilize a new data augmentation method, i.e., summarization, for supervised contrastive learning. Our Mixsum method has also never been explored by those methods.
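To make the objective in Equation 1 concrete, the following is a minimal NumPy sketch of a SimCLR-style (NT-Xent) loss. The function name and the interleaved pair layout (rows 2k and 2k+1 form an anchor/positive pair) are our illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """Self-supervised contrastive (NT-Xent) loss in the style of Equation 1.

    z: (2N, d) array of l2-normalized representations, laid out so that
       rows 2k and 2k+1 are an anchor/positive pair.
    """
    n2 = z.shape[0]
    sim = z @ z.T / tau                              # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                   # exclude self-similarity
    pos = np.arange(n2) ^ 1                          # partner index: 2k <-> 2k+1
    logits = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n2), pos].mean()
```

With well-aligned pairs the loss is small; breaking the pairing increases it, which is exactly the discrimination pressure the objective provides.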

Beyond Empirical Risk Minimization
The general theme of supervised learning is minimizing the empirical risk over a dataset by defining a loss function l that describes the difference between the model prediction f(x) and the target label y. The expected risk can be described as in Equation 2:

R(f) = ∫ l(f(x), y) dP(x, y). (2)
P(x, y) is the distribution of the data, which is unknown but can be approximated by the empirical distribution. We can then approximate the expected risk by the empirical risk in Equation 3:

R_δ(f) = (1/n) Σ_i l(f(x_i), y_i). (3)
Minimizing the empirical risk in Equation 3 is called Empirical Risk Minimization (ERM) (Vapnik, 1999). ERM leads the model to memorize the training samples and fail on data outside of them. Motivated by this limitation of ERM, Zhang et al. (2018) propose a generic vicinal distribution, called mixup, which generates virtual samples as convex combinations of pairs of training samples, x̃ = λ x_i + (1 − λ) x_j and ỹ = λ y_i + (1 − λ) y_j with λ ~ Beta(α, α) (Equation 4). Zhang et al. (2018) use this vicinal distribution to approximate the expected risk and minimize the empirical vicinal risk (Chapelle et al., 2001) in Equation 5.
The proposed vicinal distribution, mixup, can be viewed as a form of data augmentation that leads the model to behave linearly in between training samples and softens the labels. Experiments demonstrate that mixup can improve the robustness of the trained model and avoid undesirable oscillations when predicting unseen samples (Zhang et al., 2018).
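As an illustration, one mixup step (Zhang et al., 2018) can be sketched as follows; the function name and the default RNG seed are our own choices.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Create one vicinal (mixup) sample from two labeled samples.

    x1, x2: feature vectors; y1, y2: one-hot labels; lam ~ Beta(alpha, alpha).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x1 + (1.0 - lam) * x2   # convex combination of inputs
    y_mix = lam * y1 + (1.0 - lam) * y2   # soft label with the same weights
    return x_mix, y_mix, lam
```

The soft label y_mix always sums to one, which is what lets the cross-entropy loss treat the mixed sample as partially belonging to both classes.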
Besides, Kim et al. (2020) propose MixCo, which creates a vicinal distribution for self-supervised contrastive learning based on the idea of mixup (Zhang et al., 2018); they demonstrate the effectiveness of vicinal risk minimization for the self-supervised contrastive loss on image classification tasks. Inspired by mixup and MixCo, we propose a novel vicinal distribution, i.e., Mixsum, for supervised contrastive learning.

Problem Definition
The task we address is text classification with limited annotations. In text classification, the input is usually a sentence, a paragraph, or a document. Assume we have a small number of labeled training samples D_train and a large amount of unlabeled data D_test. Each text sample x ∈ D_train has a label y drawn from L classes, and we want to predict the labels of all samples in the test data.

Text Summarization
We propose to use text summarization as the data augmentation strategy for constructing positive and negative samples in supervised contrastive learning when the number of annotated training samples is limited. Intuitively, the summarization process filters out unnecessary and redundant information in the text and extracts the most representative semantics. The summary is assumed to share the same label as its source text.
We use PreSumm (Liu and Lapata, 2019) for automatic text summarization. PreSumm utilizes BERT as a general framework for both extractive and abstractive summarization, both of which can achieve good summarization quality even without text–summary pairs for fine-tuning. For each input text x_i, we obtain its summary x'_i by feeding x_i to the PreSumm model (Equation 6), where i is the index within the mini-batch.
We use the abstractive summarization model trained by Liu and Lapata (2019) without any text–summary pairs for fine-tuning. Extractive summarization can only generate summaries by extracting key sentences from the original paragraphs; in comparison, abstractive summarization can generate information-rich, coherent, and less redundant summaries and is not limited to sentences taken verbatim from the original text.
Assuming the generated summaries belong to the same class as their original source texts, we can add them to the training samples.
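A sketch of this augmentation step, assuming an arbitrary `summarize` callable standing in for the PreSumm model (both `build_contrastive_batch` and `summarize` are hypothetical names):

```python
def build_contrastive_batch(texts, labels, summarize):
    """Augment a mini-batch with summaries.

    `summarize` is any text -> summary callable (e.g. a wrapper around a
    summarization model); here it is an injected placeholder. Each summary
    inherits the label of its source text, giving 2N samples from N inputs.
    """
    summaries = [summarize(t) for t in texts]
    aug_texts = list(texts) + summaries
    aug_labels = list(labels) + list(labels)  # summaries keep source labels
    return aug_texts, aug_labels
```

For example, with a trivial "first sentence" summarizer, two labeled texts become a four-sample batch whose second half carries the same labels as the first.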

Supervised Contrastive Learning
Fine-tuning a pretrained model with cross-entropy loss is commonly used for text classification and achieves state-of-the-art results on many tasks. However, this approach still cannot achieve optimal performance in the few-shot setting, where training data is limited. To alleviate this limitation, we propose to add a supervised contrastive learning objective (Gunel et al., 2020) and to use text summaries as contrastive samples to train a more robust text classifier under the limited-annotation setting.
The main idea of supervised contrastive learning is minimizing the intra-class representation distance while maximizing the inter-class representation distance. It would be easier for the classifier to learn a good decision boundary by applying supervised contrastive learning. This process can be achieved by minimizing Equation 7.
For each batch with N input texts and N labels, we first apply summarization to obtain N augmented text summaries; we then have 2N samples in a batch. For each anchor sample x_i, we want to minimize the vector distance between x_i and positive samples x_j whose labels y_i and y_j belong to the same class.

Figure 1: Illustration of using summaries as contrastive samples for text classification. x_i is the original text, x'_i is the summary of x_i, and y_i is the target label of x_i. Another sample x_j is randomly selected, its summary x'_j is concatenated with x'_i, and the result is used as the contrastive sample of x_i.
where N is the mini-batch size and 2N is the size of the augmented batch after applying summarization. N_{y_i} is the number of samples that share the label y_i. The label of a summary is the same as that of its original text. X and Y are the batches of augmented training samples and target labels. g(·) is the l2-normalized representation of the input text in R^n, where n is the dimension of the text feature used for supervised contrastive learning. The similarity measure over g(·) is cosine similarity with temperature parameter τ. The cosine similarity of g(x_i) and g(x_j) should be maximized when x_i and x_j come from the same class; otherwise, it should be minimized.
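The supervised contrastive objective of Equation 7 can be sketched in NumPy as follows; this is an illustrative implementation in the style of Khosla et al. (2020), not the authors' exact code, and the function name is ours.

```python
import numpy as np

def sup_con_loss(g, y, tau=0.5):
    """Supervised contrastive loss over an augmented batch (Equation 7 style).

    g: (M, d) l2-normalized representations (M = 2N after summarization).
    y: (M,) integer class labels; summaries share their source's label.
    """
    m = g.shape[0]
    sim = g @ g.T / tau
    np.fill_diagonal(sim, -np.inf)                   # drop self-pairs
    logits = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    same = (y[:, None] == y[None, :]).astype(float)  # positive-pair mask
    np.fill_diagonal(same, 0.0)                      # anchor is not its own positive
    n_pos = same.sum(axis=1)                         # positives per anchor
    per_anchor = -(same * log_prob).sum(axis=1) / np.maximum(n_pos, 1.0)
    return per_anchor.mean()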
Since contrastive learning gains better performance when an MLP head is used, we also apply an MLP head on top of the base text encoder Φ(·). The text encoder Φ(·) can be any pretrained text encoder that maps a text to a dense vector in R^d, e.g., BERT (Devlin et al., 2019), XLNet, RoBERTa, LSTMs, or CNNs (Zhang et al., 2015). d is the feature dimension of the text encoder. The entire text encoding process is expressed in Equations 8 and 9.
Combining the cross-entropy loss in Equation 11 with a trade-off parameter λ, we obtain the final loss function in Equation 10. λ is a hyperparameter controlling the relative importance of the cross-entropy loss and the supervised contrastive loss.
where y_i is the one-hot label of training sample x_i, and p(x_i) is the probability distribution predicted by the text classification model. Φ(·) is the backbone text encoder, which is exactly the same encoder used in the supervised contrastive learning stage, with shared model weights. W is a fully connected classification projection matrix in R^{C×d}, which maps the text feature in R^d to a score vector over the output classes in R^C. b is the bias of the classification head in R^C. C is the number of distinct classes in the training samples.
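A sketch of the combined objective; since the exact weighting in Equation 10 is not restated here, the convex combination below follows Gunel et al. (2020) and is an assumption, as are the function names.

```python
import numpy as np

def cross_entropy(p, y_onehot):
    """Cross-entropy between predicted distributions p and (soft or hard) labels."""
    return -(y_onehot * np.log(p + 1e-12)).sum(axis=1).mean()

def total_loss(l_ce, l_sup, lam=0.9):
    """One plausible form of Equation 10: (1 - lam) * CE + lam * SCL."""
    return (1.0 - lam) * l_ce + lam * l_sup
```

With λ = 0.9, as used in the experiments, the supervised contrastive term dominates the gradient signal.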

Mixsum
We propose another novel method, Mixsum, which combines the idea of mix-up (Zhang et al., 2018) with the use of summarization for constructing contrastive samples, to achieve better text classification performance under the limited-annotation setting. The main idea is that the summary of concatenated texts from different classes contains features of both classes; the newly generated summary can then serve as a regularizer for the cross-entropy loss and the supervised contrastive learning objective, leading the model to behave in between training samples and softening the labels. Similar to mixup (Zhang et al., 2018), which uses a convex combination of input images to create the vicinal distribution, we propose to combine the summaries of texts from two different classes and use the joint summary as the augmentation.
There are also other methods for mixing texts from two different classes, such as linear interpolation of sentence-level features (Guo et al., 2019; Sun et al., 2020) or word-level features (Guo et al., 2019). Those methods are also applicable under our setting. In the summarization context, concatenating two documents with equal weight is the simplest and most intuitive way to keep our model neat and practical. Consequently, we choose this method for mixing texts, and the λ for mixing the vicinal label in Equation 4 is fixed at 0.5. Let x'_i be the summary of the original text x_i in a batch; we randomly pick another summary x'_j in the batch and concatenate the two to form a mix-up summary x̂_i. This process is visualized in Figure 1. The newly generated label ŷ_i follows the mix-up method introduced in Zhang et al. (2018). As with the contrastive-sample augmentation strategy in Section 3.3, we concatenate the original N input texts with the mix-up summaries to form a new mini-batch with 2N samples. We can then formulate the new cross-entropy loss and supervised contrastive loss under the Mixsum setting in Equations 15 and 19.
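The Mixsum sample construction described above can be sketched as follows; `mixsum` and its arguments are hypothetical names, and the fixed 0.5/0.5 label mixing matches the choice stated in the text.

```python
import numpy as np

def mixsum(summaries, labels_onehot, perm):
    """Build Mixsum samples: concatenate each summary with another summary
    from the batch (given by permutation `perm`) and mix their one-hot
    labels with fixed weight 0.5."""
    mixed_texts = [s + " " + summaries[j] for s, j in zip(summaries, perm)]
    mixed_labels = [
        0.5 * np.asarray(labels_onehot[i]) + 0.5 * np.asarray(labels_onehot[j])
        for i, j in enumerate(perm)
    ]
    return mixed_texts, mixed_labels
```

When the two source texts come from different classes, the mixed label places half its mass on each class, which is the soft target used by the Mixsum cross-entropy loss.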
We can derive a similar compact form for the supervised contrastive loss under the Mixsum setting in Equation 19. The derivation is inspired by the cross-entropy loss under the Mixsum setting.
The constraint 1_{y_i = y_j} in Equation 7 can be written as y_i · y_j, where y_i and y_j are one-hot label vectors. In the Mixsum setting, each mixed label y^mix_i is obtained as 0.5 · y_i + 0.5 · y^m_i, where y_i ∈ Y and y^m_i ∈ Y^m. Thus, by expanding the left-hand side of Equation 19, we can replace the constraint 1_{y^mix_i = y^mix_j} with y^mix_i · y^mix_j, which is

(0.5 · y_i + 0.5 · y^m_i) · (0.5 · y_j + 0.5 · y^m_j). (20)

Expanding Equation 20, we get

0.25 (y_i · y_j + y_i · y^m_j + y^m_i · y_j + y^m_i · y^m_j). (21)

Equation 21 is too complex to compute and not neat, so we make an approximation: we use y_i · y_j + y^m_i · y^m_j to approximate y_i · y^m_j + y^m_i · y_j. The benefit of this approximation is that it reduces complexity and keeps the final form neat; we acknowledge that it inevitably loses some information.
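A small numeric check of the expansion in Equations 20 and 21, and of the information lost by the approximation; the label vectors are illustrative.

```python
import numpy as np

# Illustrative one-hot labels: y_i, y_j are the original labels,
# y^m_i, y^m_j are the labels of the randomly paired samples.
yi, yj = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
ymi, ymj = np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 0.0])

yi_mix = 0.5 * yi + 0.5 * ymi        # mixed label of anchor i
yj_mix = 0.5 * yj + 0.5 * ymj        # mixed label of sample j

exact = yi_mix @ yj_mix              # Equation 20
expanded = 0.25 * (yi @ yj + yi @ ymj + ymi @ yj + ymi @ ymj)  # Equation 21
assert np.isclose(exact, expanded)   # the expansion is an identity

# The paper's simplification keeps only the "aligned" cross terms:
approx = 0.25 * 2.0 * (yi @ yj + ymi @ ymj)
```

Here `exact` is 0.25 (because y_i matches y^m_j) while `approx` is 0, which makes concrete the information loss the approximation trades away for a neater final form.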
Minimizing Equation 19 is sufficient to achieve the goal: pulling the representation of a Mixsum sample in between the representations of classes y_i and y_j.
Finally, combining the cross-entropy loss and the supervised contrastive loss under the Mixsum setting, we obtain the final objective in Equation 23.

Datasets
We use the Amazon-5, Yelp-5, AG News, and IMDb text classification datasets for benchmarking; the dataset splits are obtained from Zhang et al. (2015). To demonstrate the effectiveness of the proposed methods under the limited-annotation setting, for each experiment we randomly sample ten subsets, using ten different random seeds, from each of Amazon-5, Yelp-5, AG News, and IMDb; each subset contains 80 training samples and 1000 test samples. The statistics of the sampled datasets are shown in Table 1.

Experimental Setting
For all experiments, we test the proposed methods using several pretrained transformer models as backbone text-feature encoders, including RoBERTa-base and BERT-base (Devlin et al., 2019). As the pooling strategy of the backbone encoder, we simply use the feature of the [CLS] token as the sentence feature, which is commonly used for text classification. The Adam optimizer (Kingma and Ba, 2015) is used for optimization. The maximum learning rate is set to 1e−5, and the learning rate is decayed linearly with warm-up steps. The batch size is set to 8. We set the trade-off parameter λ to 0.9 for experiments involving L_sup, since 0.9 is the optimal trade-off between the supervised contrastive loss and the cross-entropy loss when using back-translation for augmentation, according to Gunel et al. (2020). The summarization method used for creating contrastive samples is PreSumm (Liu and Lapata, 2019), which is available on GitHub 1; we fall back to the TextRank algorithm when PreSumm generates junk outputs. It is inevitable for abstractive summarization methods like PreSumm to produce some junk outputs for certain input texts, but only a few such outputs occur. TextRank is an extractive summarization method that generates summaries by extracting existing sentences from the text.
All of our code and datasets are available on the github repository 2 .

Baselines
To verify the effectiveness of creating contrastive samples via summarization, we compare the proposed data augmentation strategy with back-translation (Edunov et al., 2018), a common data augmentation strategy for contrastive learning in NLP (Fang et al., 2020). We first translate the English training samples into Chinese and then translate the Chinese texts back into English using Google Translate.
We also conduct an ablation experiment under a setting that does not use summaries as contrastive samples. Under this setting, we simply remove the augmented samples and use only the original samples in each batch. The objective function then consists only of the cross-entropy loss and the supervised contrastive loss over the original samples.

Results
All reported experiment results are averages over repeated runs with ten different random seeds. The experimental settings used to produce all results are introduced in Sections 4.2 and 4.3.

[Table 2: results with BERT and RoBERTa backbones.]

We have two findings from the experiment results in Table 2. First, the proposed contrastive-sample generation technique, i.e., summarization, outperforms the back-translation method (Edunov et al., 2018) under the limited-annotation setting on all four datasets. Second, the proposed Mixsum method can further improve the performance of using summarization for contrastive sample generation (Sum).

Ablation Study
To demonstrate the effectiveness of the two proposed methods, we conduct ablation experiments on Amazon(S), Yelp(S), AG-News(S), and IMDb(S) to measure the classification accuracy gain of each method. The results are shown in Tables 3, 4, 5, and 6. L_ce denotes the setting that uses only the cross-entropy loss without any data augmentation. L_ce + L_sup(N) denotes the setting that does not use summaries as contrastive samples and uses only the original samples for supervised contrastive learning; under this setting, we simply remove the augmented samples from the mini-batch. L_ce + L_sup(Sum) denotes the setting that uses summarization to create contrastive samples, as introduced in Section 3.3. L_ce + L_sup(Sum + BT) denotes the setting that combines summarization and back-translation for contrastive sample generation. L^mix_ce + L^mix_sup denotes the setting that uses Mixsum, introduced in Section 3.4, for supervised contrastive learning.

[Tables 3–6: ablation results with BERT and RoBERTa backbones.]

We have four findings from the ablation results.
• The proposed summarization method significantly increases performance; the average gain across all datasets and models is 2.61% compared to the L_ce setting.
• The proposed Mixsum method further improves classifier performance. The average gain compared to the L_ce setting is 4.38%, and the average gain compared to the summarization method is 1.7%.
• Supervised contrastive learning without any augmented contrastive samples may or may not improve classifier performance; the average gain is 0.0875% across all datasets and models, and it sometimes even decreases performance.
• Combining Sum and BT samples does not outperform using either of them alone.

Compared to the setting in Section 4.4.2, where the number of training samples is only 80, we find that the performance improvement of the two proposed methods is much smaller. When the number of training samples increases to 6500, the performance of the proposed methods is even lower than in the ablation setting. Combined with the results from Section 4.4.2, it is reasonable to infer that the proposed methods are beneficial in the limited-annotation scenario but may not be necessary when the number of training samples grows larger.

Sensitivity Analysis
To investigate how the summarization method impacts the performance of the proposed approach, we replace the original abstractive summarization method, PreSumm (Liu and Lapata, 2019), with the extractive summarization method TextRank. The TextRank algorithm ranks the relative importance of the sentences in a text and selects the most important sentences as the summary. We report the test accuracy of using TextRank for text summarization in Table 9.
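For illustration, a toy TextRank-style extractive summarizer can be sketched as below; this is a simplified word-overlap variant with a PageRank-style score update, not the exact implementation used in the experiments, and all names are ours.

```python
import numpy as np

def textrank_summary(text, n_sentences=1, d=0.85, iters=50):
    """Tiny TextRank-style extractive summarizer (illustrative sketch).

    Sentences are graph nodes; edge weights are Jaccard word-overlap
    similarities; scores come from a short power iteration of the
    PageRank update; the top-scoring sentences form the summary.
    """
    sents = [s.strip() for s in text.split(".") if s.strip()]
    if len(sents) <= n_sentences:
        return ". ".join(sents)
    words = [set(s.lower().split()) for s in sents]
    n = len(sents)
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i, j] = len(words[i] & words[j]) / (len(words[i] | words[j]) or 1)
    row_sum = w.sum(axis=1, keepdims=True)
    p = np.divide(w, row_sum, out=np.zeros_like(w), where=row_sum > 0)
    score = np.ones(n) / n
    for _ in range(iters):
        score = (1 - d) / n + d * (p.T @ score)   # PageRank-style update
    top = np.argsort(-score)[:n_sentences]
    return ". ".join(sents[i] for i in sorted(top))
```

Sentences that share vocabulary with many others accumulate score, so an off-topic sentence is never selected; this also illustrates the limitation discussed below, since the output can only be a sentence already present in the text.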

[Table 9: results with TextRank summaries.]

With this alternative summarization system, the performance of the proposed Mixsum regularization method is not as good as with PreSumm. We attribute the drop to the limitation of extractive summarization: it can only build summaries from sentences in the original text and therefore loses more information than abstractive summarization. Nevertheless, the proposed Mixsum regularization still outperforms the other ablation models, which demonstrates the generalization ability of the proposed Mixsum method across different summarization methods. Furthermore, we also investigate the effect of using different text-mixing methods. Sun et al. (2020) propose to mix texts by linearly interpolating their sentence-level features, which are encoded by a pre-trained transformer model such as BERT or RoBERTa. We replace our text-mixing method with the linear interpolation of sentence-level features (LISF) introduced by Sun et al. (2020) and keep all other settings the same as for Mixsum in Sections 3 and 4. The results are shown in Table 10. All experiments are repeated with 10 different random seeds.

[Table 10: results with linear interpolation of sentence-level features.]

We observe that replacing our text-mixing method with LISF still achieves similar results and outperforms the Sum setting. Thus, we believe that other sentence-mixing methods can also be adopted in the Mixsum framework.

Conclusion
We proposed a novel data augmentation technique, summarization, for constructing contrastive samples in supervised contrastive learning. We also proposed Mixsum, a method built on using summarization to construct contrastive samples. We demonstrated the effectiveness of the two techniques on text classification under the limited-annotation setting: experimental results on four datasets show that Mixsum and summarization-based contrastive samples improve text classification performance when annotations are limited. We further showed that Mixsum generalizes to different summarization methods and text-mixing methods.
Our work also opens up several possibilities for future work, since using summarization to construct contrastive samples has shown its effectiveness in supervised contrastive learning. We may investigate whether summarization-based data augmentation can improve unsupervised text classification (Wu et al., 2018), as well as the robustness and performance of other NLP applications such as question answering, commonsense reasoning, and semantic code retrieval (Ling et al., 2021).