Bi-Granularity Contrastive Learning for Post-Training in Few-Shot Scene

The major paradigm for applying a pre-trained language model to downstream tasks is to fine-tune it on labeled task data, which often suffers from instability and low performance when labeled examples are scarce. One way to alleviate this problem is to apply post-training on unlabeled task data before fine-tuning, adapting the pre-trained model to the target domain by contrastive learning that considers either token-level or sequence-level similarity. Inspired by the success of sequence masking, we argue that both token-level and sequence-level similarities can be captured with a pair of masked sequences. We therefore propose complementary random masking (CRM) to generate a pair of masked sequences from an input sequence for sequence-level contrastive learning, and then develop contrastive masked language modeling (CMLM) for post-training to integrate both token-level and sequence-level contrastive learning. Empirical results show that CMLM surpasses several recent post-training methods in few-shot settings without the need for data augmentation.


Introduction
The past few years have seen the rapid proliferation of large-scale pre-trained language models such as GPT (Radford et al., 2018), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019b). These models are generally characterized by pre-training on huge general-domain corpora and then fine-tuning on task-specific labeled examples when applied to a downstream task. Despite the tremendous knowledge obtained from general-domain corpora through pre-training, sufficient labeled examples from the task domain are still needed. However, collecting them is often infeasible in many scenes, and directly fine-tuning these pre-trained models on scarce labeled data tends to be unstable and low-performing (Dodge et al., 2020).

Many efforts have been devoted to addressing this issue. Firstly, it can be relieved by improving the fine-tuning process, such as introducing regularization (Jiang et al., 2020; Lee et al., 2020), re-initializing top layers, and using a debiased Adam optimizer (Mosbach et al., 2021). Besides, according to empirical results (Mosbach et al., 2021), fine-tuning with a small learning rate and more fine-tuning epochs can also improve the situation. Secondly, additional data can be explored, for which two main kinds of data might be helpful: labeled examples from related tasks and unlabeled task examples. The former has shown considerable success on the GLUE tasks (Phang et al., 2018; Liu et al., 2019a), whereas such labeled examples are not always easy to collect. By contrast, the latter is more feasible, especially in scenes where task examples are easy to collect but expensive to label.
Contrastive learning (Hadsell et al., 2006) is a recently re-emerged method for leveraging unlabeled data to enhance representation learning. The key to contrastive learning is to capture the similarity between samples. As shown in Figure 1, two sorts of similarity can be captured for a natural language sequence: token-level similarity and sequence-level similarity. Masked language modeling (MLM) (Devlin et al., 2019), widely adopted in pre-trained language models, can be considered token-level contrastive learning, as it maximizes the similarity between the "[MASK]" token and the original token before masking and minimizes the similarity with other tokens. As for sequence-level similarity, several works (Iter et al., 2020; Giorgi et al., 2020; Wu et al., 2020) introduce sequence-level contrastive learning into pre-training. While all these works focus on the pre-training phase, the focal point of this paper is to improve the performance of pre-trained models in downstream tasks through post-training, especially for scenes where limited labeled data is available.
Speaking of pre-trained language models, Xu et al. (2019) and Gururangan et al. (2020) demonstrate improvements in various downstream tasks by training the models with MLM on task examples before fine-tuning, which, following Xu et al. (2019), is termed post-training in this paper even though Gururangan et al. (2020) term it adaptive pre-training. Fang and Xie (2020) also post-train their model on task examples by contrastive self-supervised learning, which pulls together two augmented sentences generated from the same sentence by back-translation (Edunov et al., 2018) while separating those otherwise. However, these works consider either token-level or sequence-level contrastive learning, without integrating them. Moreover, adopting back-translation to generate augmented sentences demands considerable computation and makes the effect of post-training dependent on the translation systems.
To capture both sequence-level and token-level similarities, we propose contrastive masked language modeling (CMLM) to achieve more effective knowledge transfer in post-training and to improve the performance of pre-trained language models in few-shot downstream tasks. For this purpose, we propose complementary random masking (CRM) to generate a pair of masked sequences from a single sequence for both sequence-level and token-level contrastive learning. We conduct extensive experiments comparing CMLM with several recent post-training approaches, and the empirical results show that CMLM achieves superior or competitive performance in a wide range of downstream tasks.
Our contributions can be summarized as follows. First, we propose a new random masking strategy, CRM, to generate a pair of masked sequences favorable to sequence-level contrastive learning. Second, we propose a novel objective, CMLM, for post-training pre-trained language models, which realizes both sequence-level and token-level contrastive learning on a pair of masked sequences. Third, we compare our approach with several post-training methods and obtain superior or competitive results in few-shot settings. Lastly, we compare two contrastive learning implementations, SimCLR (Chen et al., 2020) and SimSiam (Chen and He, 2021), in pre-trained language models. To the best of our knowledge, our work is the first effort to implement SimSiam in NLP and compare it with SimCLR.

Pre-trained Language Model
Pre-trained language models such as GPT (Radford et al., 2018), BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b) have become a new paradigm of NLP research and have been successfully applied in a wide range of tasks that used to be thorny. These models are generally structured as stacks of Transformer blocks (Vaswani et al., 2017) and pre-trained on large-scale unlabeled corpora. Among them, GPT is pre-trained with a unidirectional language modeling objective, while BERT uses masked language modeling (MLM) and next sentence prediction (NSP). In RoBERTa, Liu et al. (2019b) turn the static MLM in BERT into a dynamic one, remove the NSP task, and pre-train more intensively with larger corpora.
Despite the tremendous knowledge learned from unlabeled corpora during pre-training, sufficient labeled task examples are still needed for fine-tuning, which can be challenging in some scenes. Moreover, unlabeled task examples are not leveraged when sticking to the pre-training-then-fine-tuning paradigm.

Contrastive Learning
To take advantage of unlabeled or labeled data more effectively, contrastive learning (Hadsell et al., 2006) has recently re-emerged in computer vision (CV) and natural language processing (NLP). The key to contrastive learning is to pull positive samples together while pushing negative samples apart. The construction of positive and negative sample pairs varies from task to task. In CV, Chen et al. (2020) take augmented images (e.g., by random crop, color distortion, and Gaussian blur) originating from the same image as positive pairs and treat others as negative pairs. Khosla et al. (2020) take images with the same label as positive pairs and others as negative. In NLP, MLM in BERT can be viewed as contrastive learning at the token level, which takes "[MASK]" and its original token before masking as a positive pair and the rest as negative. At the sequence level, both Fang and Xie (2020) and Wu et al. (2020) follow similar augmentation-based schemes to construct sample pairs. Specifically, Fang and Xie (2020) utilize back-translation for sequence augmentation, while Wu et al. (2020) use simple deformation operations such as deletion, reordering, and synonym substitution. Besides, Giorgi et al. (2020) take two spans as a positive pair if they overlap, subsume each other, or are adjacent. Gunel et al. (2021) introduce the supervised contrastive learning of Khosla et al. (2020) into NLP and treat sequences with the same label as positive pairs. Li et al. (2021) augment the cross-entropy loss with a contrastive self-supervised learning term and a mutual information maximization term for the cross-domain sentiment classification task.

Post-training
Post-training has been broadly applied in downstream tasks. For example, Xu et al. (2019) post-train BERT with MLM on task examples to improve sentiment analysis. Gururangan et al. (2020) further divide post-training into two categories, domain-adaptive pre-training and task-adaptive pre-training, and evaluate them with extensive experiments. Phang et al. (2018) fine-tune their model on a related task before fine-tuning on the target task, and Liu et al. (2019a) extend this into a multi-task learning fashion. Fang and Xie (2020) introduce contrastive self-supervised learning (CSSL) to perform post-training and name it CSSL Pre-training.

Dynamic Random Masking
Masked language modeling (MLM) was first applied in BERT (Devlin et al., 2019), where some of the tokens in the input sequence are selected and replaced by a special token "[MASK]". BERT uniformly selects 15% of the input tokens for replacement, and among the selected tokens, 80% are replaced with "[MASK]", 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token. In the original implementation of BERT, random masking and replacement are performed once in the beginning, and the sequences are kept unchanged throughout pre-training. Liu et al. (2019b) turn this static masking strategy into dynamic random masking (DRM) by generating a new masking pattern every time a sequence is fed. That is to say, given an input sequence T = {t_1, t_2, ..., t_N}, the probability of each token being selected is determined by p_m, which is fixed to 15% in BERT and RoBERTa:

P_DRM(t_n) = p_m.  (1)
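As a concrete illustration, here is a minimal Python sketch of one DRM step, assuming a toy vocabulary; the 80/10/10 split and the fresh per-call masking pattern follow the description above.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "runs", "fast"]  # toy vocabulary for illustration

def dynamic_random_mask(tokens, p_m=0.15, rng=random):
    """Return (masked_tokens, selected_flags) using BERT's 80/10/10 rule.

    Each token is independently selected with probability p_m; a selected
    token becomes [MASK] 80% of the time, a random vocabulary token 10%,
    and stays unchanged 10%. A fresh pattern is drawn on every call,
    which is what makes the masking *dynamic* (RoBERTa-style).
    """
    masked, selected = [], []
    for t in tokens:
        if rng.random() < p_m:
            selected.append(True)
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)
            elif r < 0.9:
                masked.append(rng.choice(VOCAB))
            else:
                masked.append(t)  # kept unchanged but still a prediction target
        else:
            selected.append(False)
            masked.append(t)
    return masked, selected
```

Passing a seeded `random.Random` makes the pattern reproducible per call while remaining different across calls with different seeds.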

Complementary Random Masking
A straightforward idea is to generate a pair of masked sequences from a single sequence by random masking, apply MLM to each masked sequence to capture token-level similarity, and then perform sequence-level contrastive learning between the two sequences to capture sequence-level similarity. However, this idea faces a dilemma: setting a small p_m makes the pair of masked sequences too similar, so the contrastive learning loss quickly drops to 0, harming sequence-level contrastive learning. On the other hand, setting a large p_m leaves each masked sequence overly corrupted, making it hard for the model to recover the original tokens behind "[MASK]" from the context, which in turn harms token-level contrastive learning.
To address this issue, we decouple the pair of masked sequences, denoted by T^0 and T^1, and assign them different masking probabilities, p_m and p_c. Specifically, we obtain T^0 with a small masking probability p_m and T^1 with a larger probability p_c. Moreover, to avoid a single word being masked in both sequences, which cripples their relevance, we propose complementary random masking (CRM) to generate a pair of masked sequences that maintain a complementary relationship. Concretely, in CRM, we first generate T^0 by DRM (Liu et al., 2019b) from the original sequence T, and then generate T^1 with an extra constraint: the masking probability P_CRM(t_n) is set to p_c if and only if t_n has not been selected in T^0; otherwise, it is set to 0:

P_CRM(t_n) = p_c if t_n is not selected in T^0, and 0 otherwise.  (2)

The process of CRM is described in Figure 2. CRM aims to generate a complementary pair of masked sequences: if p_c = 1, all tokens that are not selected in T^0 will be selected in T^1. Reducing p_c softens this complementary relationship and makes the two sequences overlap increasingly.
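A minimal sketch of the complementary constraint, given T^0's selection pattern; for brevity it always substitutes "[MASK]" for a selected token instead of applying the full 80/10/10 rule.

```python
import random

def complementary_random_mask(tokens, t0_selected, p_c=0.7, rng=random):
    """Generate T^1 given the selection pattern of T^0.

    A token is a masking candidate for T^1 (with probability p_c) only if
    it was NOT selected in T^0; tokens already masked in T^0 are never
    masked again, so the pair stays complementary. With p_c = 1, every
    token outside T^0's selection would be masked in T^1.
    """
    masked, selected = [], []
    for t, was_in_t0 in zip(tokens, t0_selected):
        if (not was_in_t0) and rng.random() < p_c:
            selected.append(True)
            masked.append("[MASK]")  # simplified: always [MASK], no 80/10/10
        else:
            selected.append(False)
            masked.append(t)
    return masked, selected
```

By construction, no position is ever selected in both T^0 and T^1, which is exactly the property the ablation in Section 5.3 attributes CRM's gains to.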

Contrastive Masked Language Modeling
We propose contrastive masked language modeling, CMLM, based on CRM to realize domain transfer by masked language modeling (MLM) and sequence-level contrastive learning (CL) with pairs of masked sequences. The framework of CMLM is shown in Figure 2, which is described as follows.
Given a batch T = {T_1, T_2, ..., T_B} of input sequences, we firstly apply dynamic random masking (DRM) on each sequence T_b to generate a masked sequence T^0_b, and then apply CRM K times to generate K complementary masked sequences:

T^k_b = CRM(T_b, T^0_b), k = 1, ..., K.  (3)

After obtaining K+1 masked sequences from each sequence T_b in T, we compute their representations H^k_b ∈ R^{N×d} using an encoder, where d is the hidden size of the encoder:

H^k_b = Encoder(T^k_b), k = 0, 1, ..., K.  (5)

Even though our approach is model-agnostic, in this paper we focus on the Transformer-based pre-trained language model RoBERTa, an enhanced version of BERT; we therefore employ RoBERTa to implement the Encoder(·) function. To capture token-level similarity, we apply MLM on H^0_b as in Devlin et al. (2019) and Liu et al. (2019b), computing the cross-entropy loss of predicting the original tokens at the masked positions, denoted L_MLM. To capture sequence-level similarity, we apply contrastive learning between each H^k_b and H^0_b, obtaining the loss term L_CL. We compare two different implementations of contrastive learning: SimCLR (Chen et al., 2020) and SimSiam (Chen and He, 2021). For SimCLR, L_CL can be calculated as:

L_CL = -(1/(BK)) Σ_{b=1}^{B} Σ_{k=1}^{K} log [ exp(sim(H^k_b, H^0_b)/τ) / Σ_{j=1}^{B} exp(sim(H^k_b, H^0_j)/τ) ],  (7)

where τ is a temperature parameter. Following Gunel et al. (2021), we take the first-token representation h^k_b ∈ R^d of H^k_b to calculate the similarity between H^k_i and H^0_j:

sim(H^k_i, H^0_j) = (h^k_i · h^0_j) / (‖h^k_i‖ ‖h^0_j‖).  (8)
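As a rough pure-Python sketch of the SimCLR-style variant of L_CL (a simplification that uses the other sequences' H^0 views as in-batch negatives and plain cosine similarity over the first-token vectors):

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two vectors (lists of floats)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def simclr_loss(h0, h1, tau=0.1):
    """InfoNCE-style sequence-level loss over a batch.

    h0[b] and h1[b] are the first-token vectors of the two masked views
    of sequence b. h1[b] is the positive for h0[b]; the other sequences'
    views in the batch act as negatives. Averaged over the batch.
    """
    B = len(h0)
    total = 0.0
    for i in range(B):
        logits = [math.exp(cos_sim(h0[i], h1[j]) / tau) for j in range(B)]
        total -= math.log(logits[i] / sum(logits))
    return total / B
```

When the paired views agree and the others are dissimilar, the positive dominates the softmax and the loss approaches zero; scrambling the pairing drives it up.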
SimSiam is similar to SimCLR except that it uses no negative pairs and has a negative loss value. To be consistent with L_MLM and the SimCLR form of L_CL, we define the loss function of SimSiam as follows:

L_CL = (1/(BK)) Σ_{b=1}^{B} Σ_{k=1}^{K} [ 1 + (1/2) (D(z^k_b, h^0_b) + D(z^0_b, h^k_b)) ],  (9)

where z^k_b and D(z, h) are defined as:

z^k_b = W_2 σ(W_1 h^k_b),  (10)
D(z, h) = -sim(z, Stopgrad(h)).  (11)

Here, W_1, W_2 ∈ R^{d×d} are learnable parameters, sim(·) is similar to Equation 8, and Stopgrad(·) is a stop-gradient operation which is crucial for SimSiam (Chen and He, 2021). Finally, we combine L_MLM and L_CL for contrastive masked language modeling:

L_CMLM = L_MLM + α L_CL,  (12)

where α is a tunable hyper-parameter.
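To make the shifted SimSiam objective concrete, here is a minimal pure-Python sketch for a single pair of views. The symmetric form and the +1 shift are one plausible reading of the non-negative formulation; under autograd, the second argument of D would be detached (stop-gradient), which a plain-Python forward pass cannot express.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (lists of floats)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def simsiam_pair_loss(h0, h1, z0, z1):
    """Shifted symmetric SimSiam loss for one pair of views.

    h0, h1: encoder first-token outputs; z0, z1: predictor outputs.
    D(z, h) = -cosine(z, Stopgrad(h)); the +1 shift keeps the loss
    non-negative so it composes cleanly with L_MLM. In a real setup,
    h0/h1 would be detached from the autograd graph inside D.
    """
    d = lambda z, h: -cosine(z, h)  # h is treated as a constant (Stopgrad)
    return 1.0 + 0.5 * (d(z0, h1) + d(z1, h0))
```

Perfectly aligned views give loss 0; orthogonal views give 1; opposing views give 2.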

Relationship to Existing Approaches
Among existing approaches, the closest to ours is CSSL Pre-training (Fang and Xie, 2020). We can implement CSSL Pre-training by slightly modifying Equations 3 and 12 into the following ones:

T^1_b = back(T_b),
L_CSSL = L_CL,

where back(T) denotes the back-translation of T, and the other equations stay the same. By comparing these equations, we note that CMLM can be considered as: (1) replacing back-translation with CRM, which not only reduces the computational cost but also prevents the model from depending on translation systems; (2) adding L_MLM to implement token-level contrastive learning, which is shown to be crucial in Section 5.2; (3) easily extending one pair of positive samples to K pairs, which can be attributed to the nature of random masking. As for the loss term, both Giorgi et al. (2020) and Wu et al. (2020) use terms similar to ours for pre-training: L_total = L_MLM + L_CL, where L_MLM captures token-level similarity and L_CL captures sequence-level similarity. The main difference between these methods and ours is that we use differently masked sequences from the same sequence as a positive pair, while Giorgi et al. (2020) use position-related segments (overlapping, adjacent, or subsumed) and Wu et al. (2020) use sequences produced by different deformations as the positive pair.

Tasks: GLUE
We evaluate our model on the GLUE benchmark (Wang et al., 2018), which contains 9 natural language understanding tasks that can be divided into three categories: (1) single-sentence tasks: CoLA and SST-2; (2) similarity and paraphrase tasks: MRPC, QQP, and STS-B; (3) inference tasks: MNLI, QNLI, RTE, and WNLI. All of them are classification tasks except STS-B, so we exclude it to focus on classification. WNLI has a small development set (70 examples) and is also excluded. MNLI contains two evaluation sets: one, denoted MNLI, is drawn from the same sources as the training set, and the other, denoted MNLI-MM, is drawn from different sources.
To simulate few-shot scenes of different degrees, we randomly select 20, 100, and 1000 examples, respectively, from these tasks as our training sets, following recent work (Gunel et al., 2021). For each subset size in each task, we sample 5 times with replacement, obtaining 15 training sets per task. As for the development and test sets, we randomly select 500 examples from the original development set as our development set and take the remainder as our test set. Since QQP contains too many examples (40k) in its original development set, we randomly select 2000 of the remaining examples after sampling our development set as its test set. Note that all 15 training sets in each task share the same development and test sets.
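The sampling protocol can be sketched as follows; `train_pool` and `dev_pool` are hypothetical stand-ins for a task's original training and development data.

```python
import random

def make_few_shot_splits(train_pool, dev_pool, sizes=(20, 100, 1000),
                         n_samples=5, dev_size=500, seed=0):
    """Sample few-shot training sets plus a shared dev/test split.

    For each size in `sizes`, draw `n_samples` training sets with
    replacement from `train_pool`. Carve `dev_size` examples off a
    shuffled copy of `dev_pool` as the development set and keep the
    remainder as the test set, shared by all sampled training sets.
    """
    rng = random.Random(seed)
    dev_pool = list(dev_pool)
    rng.shuffle(dev_pool)
    dev, test = dev_pool[:dev_size], dev_pool[dev_size:]
    train_sets = {
        size: [[rng.choice(train_pool) for _ in range(size)]
               for _ in range(n_samples)]
        for size in sizes
    }
    return train_sets, dev, test
```

The key property is that every sampled training set of every size is evaluated against the same held-out development and test sets.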

Model: RoBERTa
As mentioned above, we take RoBERTa to implement the encoder in Equation 5. The base version of RoBERTa, RoBERTa-base, which contains 12 Transformer blocks with 12 self-attention heads, is employed. All the blocks have the same hidden size of 768. The input is either a single segment or two segments separated by the special token "</s>", with "<s>" always being the first token. We take the implementation and pre-trained weights from the Huggingface Transformers library (Wolf et al., 2020).

Training Details
For the fine-tuning of all approaches reported below, unless otherwise specified, we use AdamW (Loshchilov and Hutter, 2019) with a learning rate of 1e-5 and 350, 100, and 10 epochs for subsets of size 20, 100, and 1000, respectively. This setting is based on previous empirical results (Mosbach et al., 2021), which show that fine-tuning with a small learning rate and more epochs stabilizes the performance of a model in few-shot scenes. We set the batch size to 16 and the dropout rate to 0.1, save model parameters every 100 update steps, and pick the best checkpoint based on validation.

[Table 1: Results of fine-tuning (FT) (Liu et al., 2019b) and several recent post-training or contrastive learning methods (SCL (Gunel et al., 2021), CSSL (Fang and Xie, 2020), TAPT (Gururangan et al., 2020)). Unit of standard deviation is 10^-2.]
For post-training with CMLM, we apply AdamW with a learning rate of 1e-5 and 200, 50, and 5 epochs for subsets of size 20, 100, and 1000, respectively. For a fair comparison with other approaches, we set K in Equation 3 to 1 and the batch size to 8, so that the maximum GPU memory usage is approximately equal to that of fine-tuning. For the implementation of L_CL, we choose SimSiam because it consumes less computation. For p_m in Equation 1, we follow Liu et al. (2019b) and set it to 0.15. We conduct a grid search over hyper-parameters with α ∈ {0.01, 0.1, 0.3, 0.5, 0.7, 1} (Equation 12) and p_c ∈ {0.1, 0.3, 0.5, 0.7, 0.9} (Equation 2), and find that the combination of α = 0.5 and p_c = 0.7 performs best on the development set.
For the baselines introduced below, we follow the same fine-tuning and post-training settings as our CMLM, keeping only their method-specific hyper-parameters unchanged.

Baseline Approaches
As mentioned in Sections 2.2 and 2.3, there have been works trying to add extra loss terms in fine-tuning or to insert a post-training phase between pre-training and fine-tuning. To make a comprehensive comparison, we employ the following approaches as our baselines: (1) fine-tuning (FT) (Liu et al., 2019b), which directly fine-tunes a model with cross-entropy loss; (2) fine-tuning with SCL (SCL) (Gunel et al., 2021), which fine-tunes a model with cross-entropy loss and supervised contrastive loss; (3) post-training with CSSL (CSSL) (Fang and Xie, 2020), which post-trains a model with contrastive self-supervised learning loss; (4) post-training with MLM (TAPT) (Gururangan et al., 2020), which post-trains a model with MLM loss and is equal to CMLM when α = 0. Compared with recent works (Fang and Xie, 2020; Gunel et al., 2021) that take only conventional BERT or RoBERTa as their baseline, we consider a few more baselines to obtain more conclusive results.

Evaluation Details
In few-shot scenes, the distribution of the training set may deviate seriously from that of the test set. Gunel et al. (2021) pick the top-3 results from all combinations of training sets and model seeds for each task. In contrast, for each data size of 20, 100, and 1000 described in Section 4.1, we train our model with random seeds {31, 42, 53} on the 5 training subsets and report the mean and standard deviation of the 15 test results. We believe this better evaluates the overall effect of our model.
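The evaluation loop amounts to averaging over all training-set/seed combinations; `run_fn` below is a hypothetical callable that fine-tunes on one sampled training set with one seed and returns the test score.

```python
import statistics
from itertools import product

SEEDS = (31, 42, 53)

def evaluate_protocol(run_fn, train_sets):
    """Run `run_fn(train_set, seed)` for every combination of the
    sampled training sets and the 3 seeds (5 x 3 = 15 runs), and report
    the mean and standard deviation of all test scores, rather than
    picking the top-k runs."""
    scores = [run_fn(ts, seed) for ts, seed in product(train_sets, SEEDS)]
    return statistics.mean(scores), statistics.stdev(scores)
```

Reporting mean and deviation over all 15 runs penalizes methods that are only occasionally good, which matters given the instability of few-shot fine-tuning.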

Few-Shot Results
In Table 1, we report our few-shot results on the GLUE tasks with 20, 100, and 1000 training examples, respectively. Five observations can be made from the table. First, CMLM obtains superior performance on the datasets with 100 and 1000 examples, surpassing the baselines in 13 of 16 tasks. Since we use the same hyper-parameters for all these approaches and report results averaged over 3 random seeds and 5 randomly sampled training sets, these results are convincing. Second, on the datasets with 20 training examples, CMLM surpasses the other approaches in only 3 of 8 tasks. Training a model with only 20 examples is very unstable, and the test results of the baseline approaches indeed show large deviations across different training sets. Third, we find that post-training with only L_MLM (Xu et al., 2019; Gururangan et al., 2020) achieves results competitive with the baselines, showing the effectiveness of this widely used approach. Fourth, SCL (Gunel et al., 2021) performs extremely poorly on CoLA, MNLI, and MRPC when the data size is 1000, which is beyond our expectation. In the original paper, the authors of SCL report only the top-3 results from combinations of model seeds and training sets, so we speculate this under-performance might come from the instability of SCL in few-shot settings. Fifth, CSSL (Fang and Xie, 2020) performs even worse than FT when the data size is 20 or 100 but achieves competitive results when the data size is 1000. CSSL is designed for full-size GLUE tasks and might not be suitable for few-shot scenes.

Full-Size Results
To verify whether post-training with CMLM can still achieve desirable results when sufficient labeled examples are available, we conduct experiments on the RTE (2.5k), MRPC (3.7k), CoLA (8.5k), SST-2 (67k), and QNLI (106k) tasks with their full-size training sets. We set the learning rate to 3e-5 for both post-training and fine-tuning and the number of epochs to 3; other hyper-parameters remain the same as in Section 4.3. Experimental results are shown in Table 2, from which we note that CMLM maintains its superiority on RTE and CoLA but fails on MRPC, SST-2, and QNLI. TAPT performs better on tasks with more training examples, which can be explained by the better generalizability of its token-level representations, though it demands more training steps to learn well. Note that CMLM is specifically proposed for few-shot settings, so the experiments in the full-size setting only evaluate it from a different perspective and make the comparison more complete.

[Table 2: Results on the RTE, MRPC, CoLA, SST-2 and QNLI tasks with full-size training sets; average results over 3 random seeds are reported.]

Additional Unlabeled Examples
We consider the scene where additional unlabeled task examples are provided. We evaluate CMLM on CoLA, SST-2, QNLI, and MRPC with 100 labeled examples for fine-tuning and increase unlabeled examples from 100 to 2500 for post-training. We depict the results on the development sets in Figure 3, from which two observations can be made.

Hyper-parameter K
We evaluate whether increasing K (in Equation 7) can improve our model in the few-shot setting. We evaluate CMLM on the CoLA, SST-2, QNLI, and MRPC tasks with 100 training examples while increasing K, and the results are depicted in Figure 4. Similar to increasing unlabeled examples, the performance improves slightly on these tasks but with some fluctuation, which is in line with our expectation. Intuitively, exposing the model to differently masked forms of the same sequences can better reflect the distribution of examples sampled from a large training set, but cannot narrow the deviation between these examples and the original training set.

SimCLR vs SimSiam
As described above, L_CL can be implemented by either Equation 7 or Equation 9; we adopt the latter in the above experiments for its lower computational cost. According to Chen and He (2021), SimSiam performs better than SimCLR on ImageNet (Deng et al., 2009). It is thus interesting to verify whether the same holds in our situation. We compare SimSiam (CMLM) and SimCLR (w/ SimCLR) on SST-2, CoLA, QNLI, and RTE, and the results are reported in Table 3. From the results, we cannot easily conclude which one is better due to their comparable performance, and further investigation is beyond the scope of this paper. However, we prefer SimSiam because it consumes less computation and is easier to implement.

Are MLM & CL Critical for CMLM?
One of the improvements of CMLM over previous works is combining L_MLM and L_CL to implement both token-level and sequence-level contrastive learning. Here, we verify how the two granularities of contrastive learning contribute to performance. We remove L_MLM and L_CL alternately from L_CMLM and evaluate the resulting models on SST-2, CoLA, QNLI, and RTE with 100 and 1000 training examples, respectively. The results are reported in Table 3. As we can see, performance suffers severe deterioration, by up to 7.5%, after removing L_MLM, while removing L_CL only leads to a drop of up to 1.4%. Although both L_MLM and L_CL contribute to the improvement of CMLM, MLM tends to play the more essential role.

Complementary Random Masking vs Dynamic Random Masking
We propose a complementary random masking (CRM) strategy to generate complementary masked sequences T^k, k ∈ [1, K], based on T^0, which is generated by dynamic random masking (DRM).
Here, we verify whether this complementary nature of T^k benefits contrastive learning. We replace CRM in Equation 3 with DRM and conduct experiments on SST-2, CoLA, QNLI, and RTE with 100 and 1000 training examples, respectively. As shown in Table 3, CRM surpasses DRM on all 8 tasks, with improvements of up to 1.6%. The superiority of CRM mainly comes from the fact that it prevents tokens from being masked in both T^0 and T^k.
Conclusion

In this paper, we proposed a novel post-training objective, CMLM, for pre-trained language models in downstream few-shot scenes. CMLM combines token-level and sequence-level contrastive learning for more efficient domain transfer during post-training. For sequence-level contrastive learning, we developed a random masking strategy, CRM, to generate a pair of complementary masked sequences from an input sequence. Empirical results show that post-training with CMLM outperforms other recent approaches on the GLUE tasks with 100 and 1000 labeled training examples. We also conducted extensive ablation studies, showing that both token-level and sequence-level contrastive learning contribute to the results of CMLM, and that CRM achieves more favorable sequence-level contrastive learning than the previous masking strategy. In future work, we will further investigate how token-level and sequence-level contrastive learning affect domain transfer in post-training and explore more effective methods for sequence-level contrastive learning.