ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation



Introduction
In recent years, sentence representation has played a crucial role in various NLP tasks operating at the sentence level (Reimers and Gurevych, 2019; Yang et al., 2020; Zhang et al., 2021; Yang et al., 2021). Many researchers use a transformer language model (LM) (Devlin et al., 2019; Liu et al., 2019) as the backbone of sentence representation by finetuning the LM on natural language inference (NLI) and semantic textual similarity (STS) labeled data, which yields promising results (Reimers and Gurevych, 2019; Li et al., 2020). However, these techniques require labeled data during the finetuning process, which can be a limiting factor in low-resource settings.
In order to incorporate unlabeled data into the training process, unsupervised learning paradigms have gained popularity. The contrastive learning paradigm has recently led to significant advances in unsupervised learning. The main idea of contrastive learning in sentence representation is to learn a meaningful representation by maximizing the similarity between differently augmented views (Kim et al., 2021; Yan et al., 2021; Gao et al., 2021; Carlsson et al., 2021; Giorgi et al., 2021; Liu et al., 2021a; Fang et al., 2020). For example, Gao et al. (2021) proposed a state-of-the-art contrastive framework called SimCSE, a learning framework that uses dropout noise to produce differently augmented views and works well in both unsupervised and supervised settings. In particular, SimCSE is SOTA on the STS benchmark, and the gap between its unsupervised and supervised settings on large networks (e.g., BERT-base) is only five points when evaluated on the STS benchmark.
However, the performance of SimCSE degrades rapidly as we decrease the model size, which is undesirable when computational resources are limited, e.g., in edge computing (Jiao et al., 2020; Sun et al., 2020b). For instance, when we use MiniLM-L3 (#parameters: 17M) instead of BERT-base (#parameters: 109M), the Spearman rank correlation of unsupervised SimCSE drops from 76.25 to 55.10 (averaged across 7 STS corpora). The gap between compressed and base LMs is 21.15, as shown in Figure 1. In addition, the gap between unsupervised and supervised learning for MiniLM-L3 is 21.56, while the gap for larger models like BERT-base is only 5.32 points. Maintaining high performance in both supervised and unsupervised learning is challenging for smaller LMs.
In this paper, we aim to retain the advantage of unsupervised learning while mitigating the performance penalty from model compression. We propose ConGen, an unsupervised control and generalization distillation framework that transfers knowledge from a large model to any model regardless of its architecture and size. Not only does ConGen outperform state-of-the-art unsupervised sentence representation methods, its performance is also close to that of supervised learning (Figure 1).
The crux of ConGen lies in the distillation mechanism, which handles two different data augmentation views. In particular, we employ inputs derived from two data augmentation operations, which we refer to as control and generalization. The student observes both the control and generalization inputs, whereas the teacher observes only the control input, which we refer to as the reference input. We derive a similarity distribution between each student input (control and generalization) and the instance queue (He et al., 2020; Fang et al., 2021), and we do the same for the teacher input (reference). To transfer knowledge from the teacher to the student, we minimize the discrepancy between the student and teacher distributions via the following control & generalization learning objectives. First, we constrain the control distribution to match the reference distribution. Second, to increase the model's generalizability, we enforce the generalization distribution to match the reference distribution.
To demonstrate our method's effectiveness, we compare it to other distillation methods on three tasks: semantic textual similarity (STS), text classification (transfer), and natural language inference (NLI). The experimental results on STS demonstrate that our method significantly improves the performance of compressed models and consistently outperforms competitors. In addition, when the model has fewer than 33 million parameters, ConGen outperforms or matches the supervised baseline (Figure 1). Moreover, on transfer and NLI, ConGen outperforms unsupervised learning, i.e., SimCSE, and other distillation methods in 11 of 12 cases. Additionally, we extend our method to multilingual sentence representation; experimental results on multilingual STS demonstrate that ConGen outperforms competitors in all cases.
The contributions of our work are as follows:

Related Work

Unsupervised Sentence Representation
Transformer-based language models (LMs), i.e., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), have shown outstanding performance in many downstream tasks, including sentence representation. Contrastive learning is often utilized to train an unsupervised sentence encoder based on a pretrained LM. The main idea behind the contrastive loss in unsupervised learning is to enforce similarity between the representations of anchor and positive samples. Anchor samples can be randomly sampled from the training data. In contrast, positive samples can be obtained with various techniques, e.g., generating them from another LM (Carlsson et al., 2021; Kim et al., 2021), sampling sentences from the same document or dialogue (Giorgi et al., 2021; Liu et al., 2021a), generating similar sentences with back-translation (Fang et al., 2020), or randomly dropping some features of a vector (Yan et al., 2021; Gao et al., 2021; Liu et al., 2021b). These techniques outperform previous unsupervised methods. However, these frameworks focus only on large models (BERT/RoBERTa-base and BERT/RoBERTa-large), without any consideration for smaller models. The experimental results from Wu et al. (2021) demonstrate that the current SOTA unsupervised learning technique, SimCSE (Gao et al., 2021), fails to produce meaningful sentence representations when trained on compressed LMs.

Sentence Representational Knowledge Transfer
Knowledge distillation (KD) is a technique for transferring knowledge from a source model (teacher) to a target model (student), where the learning objective is minimizing the discrepancy between the two models. In particular, prior work transfers knowledge directly from the teacher to the student by using prediction outputs (Turc et al., 2019; Sanh et al., 2019) or transformer probabilities (Jiao et al., 2020; Sun et al., 2020b; Wang et al., 2020, 2021c) to create soft labels for compressed student models.
Labeled Sentence-Pair Knowledge Distillation. Reimers and Gurevych (2020) propose an LM finetuning method that minimizes the discrepancy between English and other languages' vector representations using the L2 loss. Cheng (2021) proposes a dual-view distillation method called DvBERT, which minimizes the discrepancy of a student's NLI output with respect to the outputs of two teachers using KL divergence. Notably, a recent concurrent work, DisCo (Wu et al., 2021), also points out the importance of developing better sentence representations for compressed models. DisCo is based on contrastive distillation (Sun et al., 2020a), where the positive and negative samples in contrastive learning are obtained from a memory bank (Liu and Mukhopadhyay, 2018) produced by a supervised teacher model.
Unlabeled Sentence-Pair Knowledge Distillation. Liu et al. (2022) propose a binary cross-entropy self-distillation method called Trans-Encoder. Trans-Encoder imitates the pair-wise similarities of the teacher in the student model using a binary cross-entropy loss. However, the techniques mentioned above require sentence pairs for training and are therefore not entirely unsupervised.
Knowledge Distillation without Sentence-Pair. Previous KD literature on sentence representation focuses on weakly-supervised and supervised settings; unsupervised knowledge distillation remains unexplored. In contrast, in the computer vision community, Fang et al. (2021) propose a self-supervised knowledge distillation method (SEED) for visual representation learning. SEED is based on two components: (i) large-scale negative samples and (ii) a similarity distribution used to transfer the knowledge from large to small models without pair-wise or labeled datasets. This allows us to perform unsupervised distillation. We apply these components from computer vision to sentence representation models by designing a new training process, a new generalization technique, a new loss function, and new data augmentation methods.

Proposed Method
In this section, we describe our Control and Generalization distillation (ConGen) method. ConGen is a knowledge distillation technique comprising two objectives: (i) transferring knowledge from large to small models and (ii) improving the model's generalizability. As illustrated in Figure 2, we describe our framework's training process, including how we organize the inputs and outputs, compare the outputs, and train the model.

How We Organize the Inputs and Outputs
As shown in Figure 2, given a new batch sample $x$, we first obtain two differently augmented samples $x_1 = T(x)$ and $x_2 = T'(x)$, where $T$ and $T'$ are back-translation from English to German to English and from English to French to English (Zhang et al., 2021), respectively. Unlike SEED, which uses single-view distillation, we use two augmentation methods ($T$, $T'$) to achieve the control and generalization objectives.
Let $f^T_\theta$ and $f^S_\theta$ denote the teacher and student encoders, respectively. Sentence representations are extracted from the student model ($S$) for the different augmented views: $z^S_{con} = f^S_\theta(x_1)$ and $z^S_{gen} = f^S_\theta(x_2)$. On the other hand, the teacher model ($T$) observes only one augmentation view, producing the reference representation $z^T_{ref} = f^T_\theta(x_1)$.
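To make the data flow concrete, the following is a minimal sketch of how a batch could be organized under the description above; the helper names (`back_translate_de`, `back_translate_fr`) and the callable encoders are our own placeholders, not part of the paper's released code.

```python
# Minimal sketch of ConGen's input/output organization (placeholder names;
# the two back-translation pipelines stand in for T and T').
import torch

def organize_views(batch, back_translate_de, back_translate_fr,
                   student_encoder, teacher_encoder):
    x1 = [back_translate_de(s) for s in batch]   # control view x1 = T(x)
    x2 = [back_translate_fr(s) for s in batch]   # generalization view x2 = T'(x)

    z_con = student_encoder(x1)                  # z^S_con: student on control view
    z_gen = student_encoder(x2)                  # z^S_gen: student on generalization view
    with torch.no_grad():                        # teacher weights stay frozen
        z_ref = teacher_encoder(x1)              # z^T_ref: teacher on the reference (control) view
    return z_con, z_gen, z_ref
```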

How We Compare the Outputs
A simple method to assess the discrepancy between the teacher and student outputs is to directly compare the two vectors using a function such as L2 (Reimers and Gurevych, 2020) or cosine similarity (Sun et al., 2020a; Wu et al., 2021). In this work, however, we adopt a more robust alternative that uses a large set of negative samples (Fang et al., 2021) to compare the outputs $z^S_{con}$, $z^S_{gen}$, and $z^T_{ref}$.
In particular, we represent the teacher and student outputs as similarity distributions computed from an instance queue of negative samples used in the loss calculations.
Instance Queue of Negative Samples. Since we use negative samples to describe the teacher and student outputs, we want them to provide sufficient coverage of the entire dataset. However, we want to keep the number of samples small due to the computational cost. Consequently, we adopt the instance queue approach (He et al., 2020) to achieve these goals. Let $D = [d_1, ..., d_K]$ denote the instance queue, where $d$ is a sentence representation obtained from the teacher and $K$ is the queue length. To cover the entire dataset, our framework progressively updates the instance queue $D$ using the "first-in-first-out" (FIFO) strategy (He et al., 2020; Fang et al., 2021). At the beginning of each minibatch, we dequeue the first $m$ entries, where $m$ is the minibatch size. We then enqueue the representation $z^T_{ref}$ of each batch sample $x$, bringing the total queue length back to $K$. The queue contains reference points for distillation and keeps rotating representations of the entire dataset for coverage. This practice reduces the overhead cost of computing negative samples. There are several ways to initialize $D$, e.g., with random vectors (He et al., 2020; Fang et al., 2021). We found that initialization by randomly sampling from the training data produces acceptable results since we use a pretrained LM as the student model.
Similarity Distribution. We use discrepancies between similarity distributions to transfer the knowledge from the teacher to the student model. Equation 1 describes how we compute the similarity distribution from a given representation $z$ and an instance queue $D$:

$$P(z, D, \tau)_j = \frac{\exp(\mathrm{sim}(z, d_j)/\tau)}{\sum_{k=1}^{K} \exp(\mathrm{sim}(z, d_k)/\tau)}, \quad j = 1, ..., K, \qquad (1)$$

where $\tau$ denotes the temperature parameter and $\mathrm{sim}(\cdot)$ denotes the cosine similarity between two feature vectors. As shown in Figure 2, we create three distributions: (i) the student-control similarity distribution $P^S_{con} = P(z^S_{con}, D, \tau_S)$; (ii) the student-generalization similarity distribution $P^S_{gen} = P(z^S_{gen}, D, \tau_S)$; and (iii) the teacher-reference similarity distribution $P^T_{ref} = P(z^T_{ref}, D, \tau_T)$. We found that using different temperature scaling for the teacher and student models ($\tau_T$, $\tau_S$) yields better performance than using the same value for both models (see Appendix A.2).
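As a sketch of the two mechanics described above (queue maintenance and Equation 1), one possible implementation is shown below; tensor shapes and function names are ours, and the queue is assumed to already hold $K$ teacher embeddings.

```python
import torch
import torch.nn.functional as F

def similarity_distribution(z, queue, tau):
    """Eq. 1: softmax over cosine similarities between z and the queue entries."""
    z = F.normalize(z, dim=-1)              # (batch, dim)
    queue = F.normalize(queue, dim=-1)      # (K, dim)
    logits = z @ queue.t() / tau            # cosine similarity scaled by temperature
    return F.softmax(logits, dim=-1)        # (batch, K) distribution over the queue

def fifo_update(queue, z_ref):
    """Dequeue the oldest minibatch and enqueue the new teacher references."""
    m = z_ref.size(0)                                       # minibatch size
    return torch.cat([queue[m:], z_ref.detach()], dim=0)    # queue length stays K
```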
Distilling knowledge via the similarity distribution over the instance queue achieves three objectives: (i) the student learns to match the positive examples via the reference; (ii) the student learns to contrast the positive sample against a large number of negative samples efficiently via the instance queue; and (iii) the student learns the differences between the negative samples within each distribution.

How We Train the Model
The training objective of ConGen facilitates knowledge transfer from the teacher to the student model (Figure 2). Specifically, we use the reference distribution to compute the control and generalization discrepancies.
ConGen transfers the knowledge using our novel loss function $\mathcal{L}_{ConGen}$:

$$\mathcal{L}_{ConGen} = \alpha \, \mathrm{CE}(P^T_{ref}, P^S_{con}) + (1 - \alpha) \, \mathrm{CE}(P^T_{ref}, P^S_{gen}),$$

where $\alpha$ represents the control-generalize trade-off and $\mathrm{CE}(\cdot)$ is the cross-entropy function computed between the teacher and student distributions.
The intuition behind the objectives is as follows:
• Obj 1: Control. The first objective is to minimize the discrepancy between the student distribution $P^S_{con}$ and the teacher distribution $P^T_{ref}$ when the inputs are identical, i.e., both are computed from the same augmentation method $T(\cdot)$.
• Obj 2: Generalize. The second objective is to improve generalizability by minimizing the discrepancy between the student distribution $P^S_{gen}$ and the teacher distribution $P^T_{ref}$ when the inputs are slightly different, i.e., $T(\cdot)$ and $T'(\cdot)$.
With these two objectives, we ensure that the student behaves similarly to the teacher, with added robustness from the multiple views; a minimal sketch of the combined loss is shown below.
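Under the loss reconstruction above, the combined objective could be sketched as follows; the variable names are ours, and `p_ref`, `p_con`, `p_gen` are the three similarity distributions from Equation 1.

```python
def congen_loss(p_ref, p_con, p_gen, alpha=0.5, eps=1e-12):
    """alpha * CE(P^T_ref, P^S_con) + (1 - alpha) * CE(P^T_ref, P^S_gen)."""
    ce_control = -(p_ref * (p_con + eps).log()).sum(dim=-1).mean()   # Obj 1: control
    ce_general = -(p_ref * (p_gen + eps).log()).sum(dim=-1).mean()   # Obj 2: generalize
    return alpha * ce_control + (1.0 - alpha) * ce_general
```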

Pre-training
Teacher Model. By default, we use the current state-of-the-art unsupervised sentence representation model, SimCSE-RoBERTa-large (Gao et al., 2021) (#parameters: 356M), as the teacher model. Note that our distillation framework is compatible with any teacher method and model. We also compare SimCSE to other recent unsupervised finetuning methods, i.e., BSL (Zhang et al., 2021) and ConSERT (Yan et al., 2021). For the teacher model, we also consider BERT-large and BERT-base. See Table 4.
Student Model. We experiment with multiple pretrained language models, from small compressed models to large state-of-the-art models: BERT-Tiny, -Mini, and -Small (Turc et al., 2019), MiniLM (Wang et al., 2020), TinyBERT (Jiao et al., 2020), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019). To produce sentence representations from these LMs, we use mean pooling (Reimers and Gurevych, 2019). In addition, we add one linear layer with a TanH activation function, where the number of hidden dimensions of the linear layer is equal to the teacher's.
Training Setup. For the training data, we use unlabeled texts from two NLI datasets, SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), to make our results comparable with previous work (Li et al., 2020; Zhang et al., 2020, 2021). We train the student model with the AdamW optimizer, a linear learning rate warm-up over 10% of the training data, and a batch size of 128 for 20 epochs. For hyperparameter settings, we use grid search to find the best values for the learning rate, teacher temperature ($\tau_T$), student temperature ($\tau_S$), and instance queue size ($K$). The full hyperparameter configurations are given in Appendix A.1. Lastly, we randomly pick sentences from the training data to initialize the instance queue, which is more efficient than using random vectors.
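The student head described above (mean pooling plus a Linear + TanH projection to the teacher's hidden size) could be sketched as follows; the backbone name and the 1024-dimensional teacher size are illustrative, not prescriptive.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class StudentEncoder(nn.Module):
    def __init__(self, backbone_name="bert-base-uncased", teacher_dim=1024):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.proj = nn.Sequential(
            nn.Linear(self.backbone.config.hidden_size, teacher_dim),
            nn.Tanh(),   # project to the teacher's hidden dimension
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)  # mean pooling
        return self.proj(pooled)
```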

Competitive Methods
To show the effectiveness of our method, we compare our work to six competitors, as follows.
Finetune-based. We use the state-of-the-art sentence representation method, SimCSE (Gao et al., 2021), in the unsupervised setting as the baseline and in the supervised setting as a supervised baseline. The unsupervised setting is trained with a contrastive loss and dropout as the data augmentation method, while the supervised setting is trained with a contrastive loss on NLI labeled datasets.
Distillation-based. We also compare our work with other distillation techniques:
• L2: An L2 minimization between the teacher and student representations ($z^T_{ref}$, $z^S_{con}$) (Romero et al., 2015).
• Dual-L2: A two-term L2 minimization where the first term is L2($z^T_{ref}$, $z^S_{con}$) and the second term is L2($z^T_{ref}$, $z^S_{gen}$) (Reimers and Gurevych, 2020).
• SKD: A self-knowledge distillation method that uses the same two L2 terms as Dual-L2, with an additional term L2($z^S_{con}$, $z^S_{gen}$) (Limkonchotiwat et al., 2022).
• CKD: An adaptation of contrastive knowledge distillation, where the positive and negative samples are obtained from the teacher model (Wu et al., 2021). In this paper, however, we change the supervised teacher to an unsupervised teacher.
We retrained all models with our training data. A sketch of the L2-based variants is given below.
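The L2-based baselines could be sketched as follows; this is our reading of the one-line descriptions above, and the exact weightings in the cited papers may differ.

```python
import torch.nn.functional as F

def l2_loss(z_ref, z_con):                       # L2 (Romero et al., 2015)
    return F.mse_loss(z_con, z_ref)

def dual_l2_loss(z_ref, z_con, z_gen):           # Dual-L2 (Reimers and Gurevych, 2020)
    return F.mse_loss(z_con, z_ref) + F.mse_loss(z_gen, z_ref)

def skd_loss(z_ref, z_con, z_gen):               # SKD (Limkonchotiwat et al., 2022)
    return dual_l2_loss(z_ref, z_con, z_gen) + F.mse_loss(z_con, z_gen)
```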
Multilingual STS. We demonstrate the versatility of our approach on eight multilingual STS-2017 datasets (Cer et al., 2017), including EN-EN, AR-AR, ES-ES, EN-AR, EN-ES, EN-TR, EN-DE, and EN-FR, following previous works (Reimers and Gurevych, 2020; Zhang et al., 2021). To extend our work to a multilingual setting, we changed one of the data augmentation operations ($T'$) from back-translation to machine translation (Google NMT) from English to the languages in multilingual STS-2017, following Zhang et al. (2021). For simplicity, we changed the student LMs from monolingual to Multilingual-DistilBERT-cased (Sanh et al., 2019) and Multilingual-MiniLM-L12 (Wang et al., 2020), and we use the same teacher model (unsupervised SimCSE-RoBERTa-large). For competitors, we compare against finetune-based methods, i.e., BSL, and distillation-based methods, i.e., L2 and Dual-L2, in unsupervised multilingual settings. We report the average score over three random seeds for all experimental results.
Experimental Results

Semantic Textual Similarity
Table 1 shows the performance of the distilled models produced by our method compared to those of competitors on the STS tasks. As mentioned in Section 1, finetune-based methods do not perform well for small models. The experimental results demonstrate that distillation from a large model improves the performance of compressed models. For instance, using unsupervised SimCSE-RoBERTa-large as the teacher model for MiniLM-L3, the Spearman rank correlation of ConGen-MiniLM-L3 improves from 55.10 to 78.22. Moreover, when the number of parameters is greater than 22 million, our models perform on par with the teacher model. In addition, when the number of parameters is less than 33 million, our models perform on par with supervised SimCSE. ConGen outperforms unsupervised methods for every compressed model. For the full results on each dataset, see Appendix A.4.
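For readers reproducing these numbers, the STS score reported here is the Spearman rank correlation between model cosine similarities and gold similarity scores; a generic sketch is shown below (the actual SimCSE/SentEval evaluation toolkit is not reproduced).

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(encode, sentence_pairs, gold_scores):
    """encode: callable mapping a sentence to a 1-D embedding (numpy array)."""
    a = np.stack([encode(s1) for s1, _ in sentence_pairs])
    b = np.stack([encode(s2) for _, s2 in sentence_pairs])
    cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    return spearmanr(cos, gold_scores).correlation
```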

Transfer and NLI
This study shows how our proposed models perform on the transfer and NLI benchmarks. We use the same baselines as in the previous experiment without any modification. The settings for these tasks are described in Section 4.3. As shown in Table 2, in the transfer learning task, our distillation models improve the compressed models' performance and are comparable to supervised SimCSE. Compared to other distillation models, ConGen outperforms the competitive models in five out of six cases, the exception being the CKD-BERT-base result. Furthermore, in the NLI task, the performance of unsupervised BERT-Tiny improves from 68.52 to 78.01 with our distillation method, which is slightly better than supervised SimCSE and the other distillation models. We hypothesize that using the similarity distribution is crucial for smaller models. For the full results on each dataset, see Tables 10 and 11.

Table 2: Sentence embedding performance on transfer and NLI tasks (accuracy score). All settings follow SimCSE (Gao et al., 2021); ✝ denotes the teacher performance, and ♣ denotes the performance of supervised learning.

Multilingual STS
This study shows how well our method can be extended for multilingual sentence representation.
To extend from monolingual to multilingual settings, we use multilingual compression models, i.e., Multilingual-MiniLM-L12 and Multilingual-DistilBERT-cased, as the student model.For the full setting's detail, see Section 4.3.
The results are shown in Table 3. Our distillation method greatly outperforms the finetune-based result, BSL. In comparison to other distillation techniques, L2 has a critical performance issue, e.g., the Spearman rank correlation of L2 with Multilingual-DistilBERT on EN-TR is only 10.73. Meanwhile, our method outperforms the other distillation methods in all settings.

Ablation Studies
This subsection explores the effect of various design decisions: teacher architectures, learning methods, loss functions, data augmentation techniques, an anisotropy study, a qualitative analysis, and the instance queue.
Different Teacher Pretraining. This study shows the performance of our distillation with other teacher models and techniques. For diversity of teacher models, we use BERT base and large versions trained with BSL (Zhang et al., 2021) and ConSERT (Yan et al., 2021), respectively. As shown in Table 4, our method works well regardless of the teacher model.
Loss Function Study. In this study, we show the effectiveness of each objective in ConGen. Since our work is inspired by SEED (Fang et al., 2021), with an additional generalization loss term and a data augmentation process designed for NLP tasks, we study how much gain the generalization term provides.
In Table 5, we show the Spearman rank correlation averaged over the 7 STS corpora. We found that the performance of the original SEED (control only) is similar to ConGen when the number of parameters is less than 14 million, i.e., for BERT-Tiny and TinyBERT-L4. Nonetheless, as the number of parameters increases, the gap between SEED and ConGen widens, e.g., the gap between SEED and ConGen on TinyBERT-L4 is only 0.44, while on BERT-base the gap increases to 0.77. In addition, using only the generalize term (0%) performs slightly better than the original SEED (100%); still, combining the two learning objectives (50%, ConGen) yields the best performance. We investigate the control and generalization objectives further in the error analysis.
Effect of Data Augmentation Choice. We evaluate the effectiveness of different data augmentation methods for the generalization objective. For simplicity, we use data augmentation methods from Gao et al. (2021). The experimental results are shown in Table 6. The results show that using Google NMT for data augmentation yields the best results, in line with previous works (Zhang et al., 2021; Fang et al., 2020). On the other hand, we can also use word deletion or delete-one-word when Google NMT is unavailable, which makes little difference for larger models.
Anisotropy Study and Qualitative Analysis. The anisotropic property of contextualized representations derived from pretrained BERT has been studied in several works (Ethayarajh, 2019; Li et al., 2020). This phenomenon leads to degradation in semantic retrieval performance (Wang and Isola, 2020). Figure 3 shows the correlation between ground-truth similarity scores and model-derived cosine similarities. ConGen shows a better correlation between the gold standard and cosine similarities of sentence-pair representations on STS-B than unsupervised SimCSE, and a similar correlation to supervised SimCSE. This result confirms that our work can significantly decrease the gap between unsupervised and supervised learning. We also performed qualitative comparisons following Gao et al. (2021). Using the 150,000 captions of the Flickr30k dataset, we randomly selected sentences to retrieve similar sentences using embeddings from SimCSE, ConGen (only Obj 1), and ConGen (both objectives). Unsupervised SimCSE failed to yield good top-5 retrieval results, while ConGen (only Obj 1) started to degrade at the top-10. ConGen (both objectives) yielded the most robust results. Example retrievals are available in Table 12.
Instance Queue Study. This study shows the effect of the instance queue size on LMs. For simplicity, we select two popular models, BERT-base and RoBERTa-base, and set the queue size to 128, 1024, 16384, or 65536. As shown in Figure 4, the instance queue size affects the performance of the LMs. Unlike previous works that also used an instance queue (Fang et al., 2021; Wang et al., 2021b), we found that the best instance queue size is not always 65536 sentences; the best size for RoBERTa-base is only 1024 sentences. These results show the importance of tuning the instance queue size for distillation since each model has a different best queue size. However, the differences are relatively small. For the best queue size of each model, see Appendix A.1.

Conclusion
In this paper, we propose a novel unsupervised Control and Generalization Distillation (ConGen).
ConGen is a distillation framework that transfers knowledge from a large model to any model regardless of its architecture and size by exploiting the concepts of control and generalization. Our method outperforms competitive methods in all cases on monolingual and multilingual STS and in five out of six text classification benchmarks. Furthermore, we demonstrate that our distillation framework can reduce the gap between compressed and base LMs. Using ConGen, the performance differences between supervised and unsupervised methods are slim for smaller models.

Limitation
Out-of-domain data might pose certain difficulties to our method. We strongly advise against using our model directly on out-of-domain data, e.g., health or legal texts. For example, we measure the cosine similarity between "The risks and benefits of the procedure were discussed, and the patient consented to this procedure" and "The content of this note has been reproduced, signed by an authorized physician in the space above, and mailed to the patient's parents, the patient's home care company." The result from ConGen-BERT-base is 0.3 (indicating that the two sentences are not equivalent but share some details or are on the same topic), while the gold similarity is 0 (the two sentences are completely dissimilar). Both texts are from the MedSTS corpus (Peng et al., 2019), which is considered out-of-domain. To tackle this problem, we advise detecting out-of-domain samples or incorporating techniques that can handle out-of-domain samples (Limkonchotiwat et al., 2020, 2021; Trijakwanich et al., 2021; Wang et al., 2021a). In addition, we did not try our method on non-MLM families such as GPT, BART, or CLIP.

Figure 1: Comparison between finetuning LMs (SimCSE) vs. knowledge distillation (ConGen) on the average of 7 semantic textual similarity (STS) benchmark datasets; ∆ is the improvement of ConGen over SimCSE.

Figure 2: Illustration of the Control and Generalization Distillation (ConGen) training pipeline. For the teacher model, we freeze the weights during distillation. We train the student model by minimizing the cross-entropy of the teacher & student similarity distributions computed over an instance queue.

Figure 3: Scatter plot of the ground-truth similarity scores (x-axis) and the cosine similarities (y-axis) between sentence pairs in the STS-B (dev set).

Figure 4: Effect of queue size in BERT-base and RoBERTa-base.

Figure 5: Scatter plot of edit distance and sentence similarity. The vertical axis represents the edit distance between sentence pairs in the STS-B (dev set). The horizontal axis represents the ground-truth similarity for (a), and model-derived cosine similarities for (b-d). Red dots indicate sentence pairs with edit distance ≤ 5.

Table 1: Sentence embedding performance on STS tasks (Spearman rank correlation). The results of BERT-base, RoBERTa-base, and RoBERTa-large (the teacher model) are from SimCSE.

Table 4: Average sentence embedding performance on STS, where we change the teacher finetuning algorithm from SimCSE to others such as ConSERT and BSL.

Table 6: Comparison between data augmentation operations for the generalization objective.

Table 10: The full results of our work on transfer tasks (accuracy).

Table 11: The full results of our work on NLI tasks (accuracy).