PAUSE: Positive and Annealed Unlabeled Sentence Embedding

Sentence embedding refers to a set of effective and versatile techniques for converting raw text into numerical vector representations that can be used in a wide range of natural language processing (NLP) applications. The majority of these techniques are either supervised or unsupervised. Compared to the unsupervised methods, the supervised ones make fewer assumptions about optimization objectives and usually achieve better results. However, their training requires a large amount of labeled sentence pairs, which is not available in many industrial scenarios. To that end, we propose a generic and end-to-end approach -- PAUSE (Positive and Annealed Unlabeled Sentence Embedding), capable of learning high-quality sentence embeddings from a partially labeled dataset. We experimentally show that PAUSE achieves, and sometimes surpasses, state-of-the-art results using only a small fraction of labeled sentence pairs on various benchmark tasks. When applied to a real industrial use case where labeled samples are scarce, PAUSE encourages us to extend our dataset without the liability of extensive manual annotation work.


Introduction
A sentence embedding is a numerical representation used to describe the meaning of an entire sentence. Embeddings of this type are becoming increasingly important for many downstream tasks in the language understanding domain, such as similarity or sentiment analysis. Some earlier methods, like GloVe (Pennington et al., 2014), BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), pool directly from underlying token-level embeddings to create a sentence representation. Recently, these pooling strategies have been challenged by various parameterized policies that can be optimized on domain-specific tasks. The majority of these are either unsupervised or supervised. While unsupervised methods only utilize unlabeled sentences, supervised methods can quickly customize the embeddings by using domain-specific labels. As a consequence, supervised methods make fewer assumptions about optimization objectives and usually achieve better results. However, supervised training requires a large amount of labeled sentence pairs, which is usually unavailable. In many real scenarios, the dataset turns out to be positive-unlabeled (i.e. a PU dataset), where the majority is unlabeled and the rest of the samples are labeled as positive. The methods that enable learning binary classifiers on PU datasets are called PU learning. To bridge the gap between supervised and unsupervised approaches, we incorporate state-of-the-art PU learning with the general supervised sentence embedding approaches, proposing a novel method - PAUSE (Positive and Annealed Unlabeled Sentence Embedding).
The main highlights of PAUSE include: (1) good sentence embeddings can be learned from datasets with only a few positive labels; (2) it can be trained in an end-to-end fashion; (3) it can be directly applied to any dual-encoder model architecture; (4) it is extended to scenarios with an arbitrary number of classes; (5) polynomial annealing of the PU loss is proposed to stabilize the training; (6) our experiments show that PAUSE consistently outperforms baseline methods.

Related Work
Among unsupervised sentence embedding methods, some are capable of exploring the relations among sub-sentences, such as skip-thoughts (Kiros et al., 2015), FastSent (Hill et al., 2016), quick-thoughts (Logeswaran and Lee, 2018) and DiscSent (Jernite et al., 2017). These methods assume that adjacent sentences always have similar semantics. However, not every corpus is long enough, perfectly ordered or coherent enough to fulfill that assumption, which limits their applicable domains. As a result, other unsupervised methods merely focus on the internal structures within each sentence, such as paragraph-vectors (Le and Mikolov, 2014), Doc2VecC (Chen, 2017), Sent2Vec (Pagliardini et al., 2018; Gupta et al., 2019), WMD (Wu et al., 2018), GEM and IS-BERT (Zhang et al., 2020). In general, those unsupervised approaches optimize objectives based on assumptions, which limits their embeddings from being adapted towards different applications. Recently, several concurrently proposed methods, such as Yan et al. (2021); Kim et al. (2021); Carlsson et al. (2021); Giorgi et al. (2021), adopt contrastive objectives by constructing different views from the same sentence. Gao et al. (2021) achieved superior results by simply using dropout to create different views. The supervised approaches, on the other hand, are usually (1) trained in an end-to-end manner, (2) built on a dual encoder architecture, and (3) finetuned from a model pretrained on the SNLI (Stanford Natural Language Inference) (Bowman et al., 2015) and Multi-Genre NLI (Williams et al., 2018) datasets. NLI is the task of determining whether a hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given a premise. The recent representative methods include InferSent (Conneau et al., 2017), USE (Universal Sentence Encoder) variants (Cer et al., 2018; Chidambaram et al., 2019), SBERT (Sentence-BERT) (Reimers and Gurevych, 2019) and LaBSE (Language-agnostic BERT Sentence Embedding) (Feng et al., 2020).
Built upon pretrained models, they can effectively learn good embeddings from the labeled sentence pairs. However, this approach is not feasible in scenarios where the quantity of annotations is limited.
Rather than purely labeled or unlabeled, in many real scenarios the dataset turns out to be positive-unlabeled (PU), where a small portion of sentence pairs are labeled as positive samples and the rest are unlabeled. To address this type of problem, the gap between supervised and unsupervised methods has to be filled. Levi (2018) experimented with incorporating unsupervised regularization criteria in the supervised loss. Although better generalization capability was reported, all of the samples still have to be labeled. Jiang et al. made an early attempt to apply PU learning - particularly for matrix factorization (Yu et al., 2017) - to obtain word embeddings for low-resource languages. Existing PU learning methods can be divided into three categories based on how unlabeled data is handled. The first category, with methods like (Yang et al., 2017), tries to assign labels to unlabeled data in a heuristic-driven and iterative manner, which makes the training scattered across steps/phases and hard to implement in practice. The second category treats unlabeled data as negative with lower confidence, which can be more computationally expensive to tune. The third category, with methods such as uPU (Du Plessis et al., 2014), nnPU (Kiryo et al., 2017), PUbN (Hsieh et al., 2019) and Self-PU (Chen et al., 2020), regards each unlabeled sample as a weighted mixture of being positive and negative. This third category optimizes a so-called PU loss, and has recently become dominant due to its wide applicability and end-to-end nature. However, a major limitation is that these algorithms are only applicable to binary classification problems. In this work, we show how to adapt PU learning to effectively learn sentence embeddings from multi-class PU datasets.

The Proposed Method
We propose a generic and end-to-end approach to obtain sentence embeddings in a setup that is a generalized natural language inference (GNLI) task. This approach can have (1) any number of classes and (2) the majority of the sentence pairs unlabeled. Let $X$ be the set of sentence pairs in the entire dataset; as illustrated in Figure 1, $C$ is the total number of entailment classes, $X^{(p)(c)}$ denotes the $N_p^{(c)}$ sentence pairs labeled as the $c$-th class, and $X^{(u)}$ represents the $N_u$ unlabeled pairs. For the $N_p = \sum_{c=1}^{C} N_p^{(c)}$ labeled pairs, we use $Y \in \mathbb{R}^{N_p \times C}$ to denote their mutually exclusive and one-hot encoded labels; hence the binary label for the $c$-th entailment class has the form $y^{(c)} \in \mathbb{R}^{N_p^{(c)}}$. On the individual sample level, we use $x_i^{(c)}$, $y_i^{(c)}$ and $x_i^{(u)}$ to denote the $i$-th sentence pair that is labeled as class $c$, the binary label towards the $c$-th class for the $i$-th pair, and the $i$-th unlabeled sample, respectively.

Dual encoder model architecture
The model architecture of PAUSE follows a dual encoder schema (Figure 2) that is widely adopted in supervised sentence embedding training. Each individual sample $x_i$ contains a pair of hypothesis and premise sentences $(x_i', x_i'')$, each of which is fed into a pretrained encoder (e.g. BERT). As shown in Figure 2, the two encoders are kept identical during training by sharing their weights. We add a pooling operation to the output of both encoders to obtain the fixed-size sentence embeddings $g(x_i')$ and $g(x_i'')$; following the empirical suggestion of (Reimers and Gurevych, 2019), we apply the MEAN strategy (i.e. calculating the average of the encoder output vectors). Once the sentence embeddings are generated, three matching methods are applied to extract relations between $g(x_i')$ and $g(x_i'')$: (1) concatenation of the two vectors, (2) absolute element-wise difference $|g(x_i') - g(x_i'')|$, and (3) element-wise product $g(x_i') * g(x_i'')$. The results of the three matching methods are then concatenated into a vector, which captures information from both the premise and the hypothesis. This vector is fed into a 128-way fully-connected (FC) layer with ELU (Exponential Linear Unit) activation (Clevert et al., 2016), the output of which is transformed by a $C$-way linear FC layer, obtaining the final output $f(x_i)$.
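As a concrete illustration, the three matching operations above can be sketched in plain Python (a minimal sketch: the function name `match_features` and the list-based vector representation are ours, not from the paper, and the subsequent FC layers are omitted):

```python
def match_features(u, v):
    """Combine two sentence embeddings u and v (equal-length lists of
    floats) into one matching vector: [u; v; |u - v|; u * v]."""
    diff = [abs(a - b) for a, b in zip(u, v)]       # absolute element-wise difference
    prod = [a * b for a, b in zip(u, v)]            # element-wise product
    return u + v + diff + prod                      # concatenation of all four parts

# Two 2-dimensional embeddings yield an 8-dimensional matching vector.
features = match_features([1.0, -2.0], [0.5, 2.0])
```

In a real model the inputs would be the pooled encoder outputs, and `features` would feed the 128-way ELU layer described above.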

Supervised loss
For multi-class and mono-label problems, we calculate the cross entropy (CE) loss using the labeled samples:

$$\mathcal{L}_{CE} = -\frac{1}{N_p} \sum_{i=1}^{N_p} \sum_{c=1}^{C} y_i^{(c)} \log f_c(x_i) \quad (1)$$

For multi-class multi-label problems, the supervised loss can be the binary CE:

$$\mathcal{L}_{CE} = -\frac{1}{N_p} \sum_{i=1}^{N_p} \sum_{c=1}^{C} \left[ y_i^{(c)} \log f_c(x_i) + \left(1 - y_i^{(c)}\right) \log\left(1 - f_c(x_i)\right) \right] \quad (2)$$

When there is absolutely no negative label for binary classification problems, the supervised loss can be safely ignored.
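A minimal sketch of the mono-label cross-entropy term in (1), assuming the model outputs per-class probabilities (the helper name `cross_entropy` and the nested-list representation are ours):

```python
import math

def cross_entropy(probs, one_hot):
    """Mean categorical cross-entropy over labeled samples.
    probs: list of per-class probability lists (one per sample);
    one_hot: matching list of one-hot label lists."""
    total = 0.0
    for p, y in zip(probs, one_hot):
        # only the true class (y_c = 1) contributes to the sum
        total -= sum(yc * math.log(pc) for pc, yc in zip(p, y))
    return total / len(probs)
```

A uniform prediction over two classes against a one-hot label gives the familiar value log 2.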

Positive unlabeled loss
To leverage unlabeled data in obtaining better sentence embeddings, we turn to the state-of-the-art PU learning methods, among which we largely follow (Du Plessis et al., 2014; Kiryo et al., 2017) due to their effectiveness and simplicity. Recently, Chen et al. (2020) proposed an updated version which achieved marginal improvement, but at the cost of greatly increased training complexity. To facilitate computing the PU loss, we address each class separately as a binary classification problem, so that $y^{(c)} \in \{\pm 1\}$. For each class $c$, we define $p(x, y^{(c)})$ as the joint density of $(X, y^{(c)})$, $\pi_p^{(c)} = p(y^{(c)} = +1)$ as the positive class prior, and $\pi_n^{(c)} = 1 - \pi_p^{(c)}$ as the negative prior. Assuming that we had all samples labeled for the $c$-th class, we could easily estimate the error risk

$$R(f)_c = \pi_p^{(c)} R_p^+(f)_c + \pi_n^{(c)} R_n^-(f)_c, \quad (4)$$

where the function $f$ is approximated by the model depicted in Figure 2, and $\ell: \mathbb{R} \times \{\pm 1\} \to \mathbb{R}$ is the loss function, such that the value $\ell(a, b)$ means the loss incurred by predicting an output $a$ when the ground truth is $b$; feasible loss functions can be found in (Kiryo et al., 2017). Here $R_p^+(f)_c = \mathbb{E}_{p(x \mid y^{(c)}=+1)}[\ell(f(x), +1)]$ and $R_n^-(f)_c = \mathbb{E}_{p(x \mid y^{(c)}=-1)}[\ell(f(x), -1)]$. Noticing that the marginal density satisfies $p(x) = \pi_p^{(c)}\, p(x \mid y^{(c)}=+1) + \pi_n^{(c)}\, p(x \mid y^{(c)}=-1)$, the negative part of the risk can be expressed through the unlabeled data:

$$\pi_n^{(c)} R_n^-(f)_c = R_u^-(f)_c - \pi_p^{(c)} R_p^-(f)_c, \quad (6)$$

where $R_u^-(f)_c = \mathbb{E}_{p(x)}[\ell(f(x), -1)]$ and $R_p^-(f)_c = \mathbb{E}_{p(x \mid y^{(c)}=+1)}[\ell(f(x), -1)]$. By combining (4) and (6), we can eliminate the term that requires negative labels:

$$R(f)_c = \pi_p^{(c)} R_p^+(f)_c + R_u^-(f)_c - \pi_p^{(c)} R_p^-(f)_c, \quad (7)$$

where the first term and the remaining two terms are called the positive risk and the negative risk, respectively. Kiryo et al. (2017) argue that when the value of the negative risk becomes less than zero, it often indicates potential overfitting. In that circumstance, we empirically choose to drop the positive risk and optimize reversely with respect to the negative risk term. Hence, in implementation, the error risk for the $c$-th class has the form

$$\tilde{R}(f)_c = \pi_p^{(c)} R_p^+(f)_c + \max\left\{0,\; R_u^-(f)_c - \pi_p^{(c)} R_p^-(f)_c\right\}. \quad (8)$$

For $\ell$, we choose the sigmoid loss, i.e. $\ell(a, b) = (1 + e^{ab})^{-1}$, and we can conveniently estimate $\tilde{R}(f)_c$ by plugging in the empirical averages

$$\hat{R}_p^+(f)_c = \frac{1}{N_p^{(c)}} \sum_{i=1}^{N_p^{(c)}} \ell(f(x_i^{(c)}), +1), \quad \hat{R}_p^-(f)_c = \frac{1}{N_p^{(c)}} \sum_{i=1}^{N_p^{(c)}} \ell(f(x_i^{(c)}), -1), \quad \hat{R}_u^-(f)_c = \frac{1}{N_u} \sum_{i=1}^{N_u} \ell(f(x_i^{(u)}), -1). \quad (9)$$

The overall PU loss $\mathcal{L}_{PU}$ can be constructed using (8) and (9):

$$\mathcal{L}_{PU} = \sum_{c=1}^{C} \tilde{R}(f)_c. \quad (10)$$
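The non-negative risk estimator in (8), evaluated with the empirical averages in (9), can be sketched in plain Python (the helper names are ours, and the simple `max(0, ...)` clipping shown here omits the gradient-reversal step that the training procedure applies when the negative risk goes below zero):

```python
import math

def sigmoid_loss(a, b):
    # l(a, b) = 1 / (1 + exp(a * b)); small when sign(a) agrees with label b
    return 1.0 / (1.0 + math.exp(a * b))

def nn_pu_risk(scores_p, scores_u, prior):
    """Non-negative PU risk for one class (after nnPU, Kiryo et al. 2017).
    scores_p: model outputs for positively labeled samples;
    scores_u: model outputs for unlabeled samples;
    prior: positive class prior pi_p."""
    r_p_pos = sum(sigmoid_loss(s, +1) for s in scores_p) / len(scores_p)
    r_p_neg = sum(sigmoid_loss(s, -1) for s in scores_p) / len(scores_p)
    r_u_neg = sum(sigmoid_loss(s, -1) for s in scores_u) / len(scores_u)
    neg_risk = r_u_neg - prior * r_p_neg
    # clip the negative risk at zero: a negative value signals overfitting
    return prior * r_p_pos + max(0.0, neg_risk)
```

Summing `nn_pu_risk` over the C classes yields the PU loss in (10).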

Annealed joint optimization
During training, the model is optimized in an end-to-end manner on mini-batches. Every mini-batch is sampled from all subsets of the dataset (cf. Figure 1) according to the relative subset sizes, so that the mini-batches reflect the composition of the entire dataset. Because the initial estimations of the positive/negative risks in (7) tend to be inaccurate, simply optimizing both losses (i.e. $\mathcal{L}_{CE}$ and $\mathcal{L}_{PU}$) jointly with the same weights often leads to sub-optimal and unstable solutions. This problem is particularly prominent when the dataset is large or the model is highly flexible. As a result, we propose to apply an annealing strategy to the PU loss component when constructing the overall loss:

$$\mathcal{L} = \mathcal{L}_{CE} + \left(\frac{t}{T}\right)^{\alpha} \mathcal{L}_{PU}, \quad (11)$$

where $T$ denotes the total number of training steps and $1 \le t \le T$ is the elapsed number of steps. The hyper-parameter $\alpha \ge 2$ controls the annealing speed. We empirically discover that $\alpha = 3$ usually offers optimal and stable performance. For binary classification problems ($C = 1$), the overall loss falls back to $\mathcal{L}_{PU}$ when there are no negative labels available.

Experiments
Inspired by the previous works (Hill et al., 2016; Reimers and Gurevych, 2019; Zhang et al., 2020), we evaluate PAUSE on STS (Semantic Textual Similarity) and SentEval in both supervised and unsupervised settings. We also show the robustness of PAUSE in a real industrial use case. In our experiments, we test two versions of PAUSE: PAUSE-base (110M parameters) and PAUSE-small (4.4M parameters), which use uncased BERT-base (Devlin et al., 2019) and BERT-small (Turc et al., 2019) as their encoder model, respectively. PAUSE is trained by minimizing (11) with $\alpha = 3$ (searched over {3, 4, 5}) using the Adam optimizer with learning rate 7.5e-5 (searched over {1e-3, 1e-4, 7.5e-5, 5e-5, 2.5e-5, 1e-5, 1e-6}). The experiments are carried out on a managed virtual machine instance with four virtual CPUs (Intel Xeon 2.30GHz), 15GB RAM, and four GPUs (NVIDIA Tesla P100). Since PAUSE requires a large batch size to ensure enough labeled samples from each class in every mini-batch, we use a batch size of 128 and 1,024 for PAUSE-base and PAUSE-small respectively, fully utilizing the GPU capacity.

Unsupervised STS
We first evaluate PAUSE on STS tasks without using any STS data for training. Specifically, we choose the datasets of STS 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), STS benchmark (STSb) (Cer et al., 2017), and SICK-Relatedness (SICK-R) (Marelli et al., 2014). These datasets have labels between 0 and 5 indicating the semantic relatedness of sentence pairs. We compare PAUSE with two groups of baselines. The first group comprises unsupervised methods, including FastSent (Hill et al., 2016), IS-BERT-NLI (Zhang et al., 2020), the average of GloVe embeddings, the average of the last-layer representations of BERT, and the [CLS] embedding of BERT. The second group consists of supervised approaches: InferSent-Glove (Conneau et al., 2017), USE (Cer et al., 2018), SBERT, and SRoBERTa (Reimers and Gurevych, 2019). All models are trained on the combination of the SNLI (Bowman et al., 2015) and Multi-Genre NLI (Williams et al., 2018) datasets, which contains one million sentence pairs annotated with three labels (C = 3): entailment, contradiction and neutral. PAUSE is trained for 2 epochs with a linear learning rate warm-up over the first 10% of the training steps.
As suggested in (Reimers et al., 2016; Reimers and Gurevych, 2019; Zhang et al., 2020), we calculate the Spearman's rank correlation between the cosine-similarity of the sentence embeddings and the labels, which is presented in Table 1. The results show that most of the supervised methods achieve superior performance compared to the unsupervised ones, which has been previously evidenced by (Hill et al., 2016; Cer et al., 2018; Zhang et al., 2020). PAUSE using the BERT-base encoder (PAUSE-NLI-base) performs much better than the versions using the BERT-small encoder (PAUSE-NLI-small). PAUSE-NLI-base takes on average 220 minutes to complete one epoch of training, while PAUSE-NLI-small takes merely 9 minutes. The post-fix (i.e. 1%, 5%, ..., 100%) in the names of the PAUSE variants indicates the percentage of NLI labels that are used during the training. Although the performance monotonically drops when using fewer labels, this drop remains marginal even when only 50% of the labels are used. PAUSE-NLI-base-100% and SBERT-NLI-base are trained on the same amount of labeled samples, yet the former achieves slightly better results, probably due to differences in (1) the BERT-base versions and/or (2) the layers following the encoding step. Observing the average results of each PAUSE-NLI-base variant, the model trained on merely 10% of the labels results in a performance about 2% lower than the one relying on all labels. This demonstrates that PAUSE is a label-efficient sentence embedding approach applicable to situations where only a small number of samples are labeled.
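The evaluation protocol above (cosine similarity of each embedding pair, then Spearman's rank correlation against the gold relatedness scores) can be sketched in plain Python without tie handling (the helper names are ours; real evaluations typically use `scipy.stats.spearmanr`, which does correct for ties):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def spearman(x, y):
    """Spearman rank correlation (no tie correction) of two score lists."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

In the benchmark, `x` would be the predicted cosine similarities and `y` the human-annotated 0-5 relatedness labels.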
In an attempt to train the SBERT-NLI-base model using only 1%, 5%, 10%, and 30% of the labels for 2 epochs, we found that all trials suffered from over-fitting in varying degrees. While this problem could be addressed by hyper-parameter optimization and regularization, such alterations would compromise the fairness of the comparison. (Table 1 notes: some results are extracted from (Hill et al., 2016); † results are extracted from (Reimers and Gurevych, 2019); ‡ results are extracted from (Zhang et al., 2020).)

Supervised STS benchmark
Similar to (Reimers and Gurevych, 2019; Zhang et al., 2020), we use the STS benchmark (STSb) dataset (Cer et al., 2017) to evaluate the models' performance on the supervised STS task. STSb includes 8,628 sentence pairs from the categories of captions, news, and forums. The dataset is split into train (5,749), dev (1,500), and test (1,379) subsets. We use the training set to finetune PAUSE (pretrained using partially labeled NLI) with a regression objective function. On the test set, we compute the cosine similarity between each pair of sentences. Since PAUSE obtains better results using BERT-base than BERT-small (cf. Table 1), we only report the results for the PAUSE-NLI-STSb-base models, which are trained with five random seeds for four epochs.
In Table 2, we compare PAUSE to three categories of baselines: (1) not trained on STSb at all, (2) only trained on STSb, and (3) first trained on the fully labeled NLI, then finetuned on STSb. It is clear that finetuning on STSb greatly improves the model performance, and pretraining on NLI further uplifts the performance slightly. Using merely 10% to 70% of the NLI labels, PAUSE manages to achieve results comparable to the baselines that use all NLI labels. Another interesting finding is that when pretraining PAUSE on less than 5% of the labels, the performance becomes inferior to directly finetuning on STSb. This suggests that pretraining PAUSE with too few labeled samples may result in embeddings that are hard to finetune in downstream regression tasks. In addition, we observe a clear trend in Table 2: the standard deviation increases when fewer labels are used, which is also observed in the unsupervised experiments.

SentEval: domain specific tasks
In order to give an impression of the quality of our sentence embeddings for various domain-specific tasks, we choose to evaluate PAUSE on seven SentEval tasks (Conneau and Kiela, 2018): (1) TREC - fine-grained question type classification (Li and Roth, 2002), (2) CR - sentiment prediction of customer product reviews (Hu and Liu, 2004), (3) MRPC - Microsoft Research Paraphrase Corpus from parallel news sources (Dolan et al., 2004), (4) SUBJ - subjectivity prediction of sentences from movie reviews and plot summaries (Pang and Lee, 2004), (5) MR - sentiment prediction for movie reviews (Pang and Lee, 2005), (6) MPQA - opinion polarity classification (Wiebe et al., 2005) and (7) SST - binary sentiment analysis (Socher et al., 2013). Unlike (Devlin et al., 2019) and (Zhang et al., 2020), which finetune the encoder on these tasks, we directly use the sentence embeddings from the PAUSE-NLI-base models (cf. Section 4.1) as features for a logistic regression classifier that is trained in a 10-fold cross-validation setup, where the prediction accuracy is computed for the test fold.
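The 10-fold cross-validation setup can be sketched as a simple index splitter (a simplified sketch with our own helper name; the SentEval toolkit handles fold construction itself in practice):

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs for n samples split into k folds.
    Each fold serves as the test set once; the rest form the train set."""
    idx = list(range(n))
    fold_size = n // k
    for f in range(k):
        start = f * fold_size
        stop = start + fold_size if f < k - 1 else n  # last fold absorbs remainder
        test = idx[start:stop]
        train = idx[:start] + idx[stop:]
        yield train, test
```

For each split, a logistic regression classifier is fit on the embeddings of the train indices and scored on the test indices; the reported accuracy averages over the k test folds.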
The results can be found in Table 3, where the top-3 results on each task are presented in bold face. Largely speaking, the sentence embeddings from SBERT and PAUSE successfully capture domain-specific information, with the exception of the TREC task, where pretraining on question-answering data (e.g. USE) seems to be beneficial. In Table 1, we observed poor results from the averaged BERT embeddings, the BERT [CLS]-vector, PAUSE-NLI-base-5% and PAUSE-NLI-base-1%. However, on the selected SentEval tasks, they all achieve decent results, and the performance of PAUSE does not even degrade as we use significantly fewer labels. This inconsistency can be explained by how we measure the model performance: on the STS datasets we calculated the cosine-similarity between sentence embeddings, which treats all dimensions equally, while SentEval fits a logistic regression classifier to the embeddings, allowing different dimensions to have a different impact on the classifier's output. As a result, cosine-similarity can only be relied on when the sentence embeddings are finetuned on related datasets with a large number of labeled samples. When the sentence embeddings are directly used as input features for training discriminative models on downstream tasks, finetuning on NLI only gives approximately 1~2% performance uplift. Moreover, PAUSE appears to be unaffected by a drastic decrease in labeled samples, which is consistent with the results of IS-BERT-task.
We also notice that unsupervised PAUSE-NLI-base-100% performs better than SBERT-NLI-base in Table 1, yet this is not the case for supervised fine-tuning (Tables 2 and 3). This might be a consequence of several differences: (1) PAUSE uses a newer version of the pretrained BERT-base model compared to (Reimers and Gurevych, 2019), (2) PAUSE has an extra element-wise product term $g(x_i') * g(x_i'')$ when extracting relations between two sentences, as seen in Figure 2, and (3) PAUSE treats the NLI sentence pairs with ambiguous/conflicting labels as unlabeled samples that are utilized by the PU loss term during model optimization.

Use case: finding similar companies
In this section, we will discuss the potential of PAUSE on a real industrial use case from EQT Group. The EQT investment professionals use ...
(Table 3 notes: some results are extracted from (Hill et al., 2016) and (Reimers and Gurevych, 2019); ‡ results are extracted from (Zhang et al., 2020); IS-BERT-task is finetuned on each task-specific dataset (without labels) to produce sentence embeddings, which are then used for training downstream classifiers.)
To benchmark PAUSE in this setting, we trained models using different percentages of labeled samples (100%, 50%, 10%, 5%). Table 5 shows that it is sufficient to have 10% of the samples labeled and still reach high accuracy, precision, and recall. When all samples are labeled, the accuracy is only increased by around 3%. In practice, these results encourage us to extend our dataset without the burden of manually labeling all samples. We also speculate that increasing the size of the dataset while maintaining the balance between labeled and unlabeled samples could improve performance further. Essentially, this implies that we can achieve results close to that of a fully labeled dataset with a fraction of the manual annotation work.

Example sentence pair from the use-case dataset:

ID: 1
Anchor company: Owner and operator of data centers in UK ... distribute data in data centers and the global digital economy.
Candidate company: Independent co-location / data center provider in Slovenia.

Conclusions and Future Work
In this work, we attempt to bridge the gap between supervised and unsupervised sentence embedding techniques, proposing PAUSE - a generic and end-to-end sentence embedding approach that exploits the labels and explores the unlabeled sentence pairs simultaneously. PAUSE trained on NLI datasets achieves state-of-the-art results on unsupervised STS tasks, and also performs well on many downstream domain-specific tasks. In all of our experiments, we observe that PAUSE keeps performing well with a reduced number of labeled samples, as long as more than 5-10% of the dataset is labeled. This indicates that PAUSE is a label-efficient sentence embedding approach that can be effectively applied to datasets where only a small part is labeled while the rest remains unlabeled. We also demonstrate that PAUSE helps lower the labeling requirement for an industrial use case aimed at encoding company descriptions. In that sense, PAUSE pushes the application boundary of sentence embeddings to include many more real-world scenarios where labeled samples are scarce. Possible extensions of this work include (1) augmenting the labels with dropout, (2) experimenting with a contrastive supervised loss, and (3) exploring how PAUSE can be extended with contextual sentence embeddings.