Bootstrapped Unsupervised Sentence Representation Learning

As high-quality labeled data is scarce, unsupervised sentence representation learning has attracted much attention. In this paper, we propose a new framework with a two-branch Siamese Network which maximizes the similarity between two augmented views of each sentence. Specifically, given one augmented view of the input sentence, the online network branch is trained by predicting the representation yielded by the target network of the same sentence under another augmented view. Meanwhile, the target network branch is bootstrapped with a moving average of the online network. The proposed method significantly outperforms other state-of-the-art unsupervised methods on semantic textual similarity (STS) and classification tasks. It can be adopted as a post-training procedure to boost the performance of the supervised methods. We further extend our method for learning multilingual sentence representations and demonstrate its effectiveness on cross-lingual STS tasks. Our code is available at https://github.com/yanzhangnlp/BSL.


Introduction
Sentence representation learning aims to map sentences into vectors that capture rich semantic information. Among previous approaches, supervised methods achieve state-of-the-art performance by leveraging quality sentence labels. For example, the recently proposed model Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) fine-tunes a Siamese BERT network on natural language inference (NLI) tasks with labeled sentence pairs. It achieves state-of-the-art results on multiple semantic textual similarity (STS) tasks. However, such performance is mostly induced by high-quality supervision, while labeled data are difficult and expensive to obtain in practice. Zhang et al. (2020) showed that SBERT generalizes poorly on target tasks that differ significantly from the NLI data on which SBERT is fine-tuned.
Many unsupervised methods learn sentence representations by optimizing various self-supervised learning (SSL) objectives on a large-scale unlabeled corpus. Early works often use auto-encoders (Socher et al., 2011; Hill et al., 2016) or next-sentence prediction for sentence representation learning. Recently, more efforts have been devoted to representation learning with transformer-based networks using masked language modeling (MLM). However, transformer-based methods do not directly produce meaningful sentence representations. Instead, significant supervised fine-tuning steps with labeled data are commonly required to form good representations (Reimers and Gurevych, 2019). Recently, Giorgi et al. (2020) and Zhang et al. (2020) proposed novel transformer-based frameworks to directly learn sentence representations from an unlabeled corpus, which even exhibited performance competitive with their supervised counterparts on some tasks. However, the method of Giorgi et al. (2020) requires long texts during training, while the contrastive learning strategy employed by Zhang et al. (2020) needs a careful treatment of negative pairs. More importantly, there is still great room for improvement in the quality of the learned sentence representations.
* Equally contributed. † Corresponding author.
In this paper, we introduce Bootstrapped Sentence Representation Learning (BSL), a simple and lightweight framework that directly learns sentence representations without supervised fine-tuning. Our work is inspired by the recent success of Siamese networks (Bromley et al., 1994) for unsupervised visual representation learning (Grill et al., 2020; Caron et al., 2020; Chen and He, 2020), especially the BYOL framework (Grill et al., 2020). These models employ various kinds of unsupervised learning objectives to maximize the similarity between two augmented views of each image, yielding performance on par with supervised methods. Unlike contrastive learning-based methods, which demand a careful negative sampling process and large batch sizes, BYOL achieves strong performance without negative pairs.
The proposed BSL works as follows. Given an input sentence, we first construct two augmented views through back-translation. These two views are simultaneously fed into the two branches of the Siamese network, i.e., an online network and a target network following the terminology in (Grill et al., 2020). In particular, the online and target networks use two pre-trained transformer networks with the same structure, e.g., BERT, to encode the two views separately. During learning, the online network is trained to predict the representation of the other augmented view generated by the target network, and its parameters are updated by minimizing a predefined prediction loss. As for the target network, we apply a stop-gradient strategy (Chen and He, 2020) and update it with a weighted moving average of the online network. Hence, the outputs of the target network are iteratively bootstrapped to serve as targets, enabling enhanced representation learning of the online network while avoiding trivial solutions.
Our method is evaluated through extensive experiments. Empirical results show that BSL significantly outperforms strong unsupervised baselines on a standard suite of STS and classification tasks from the SentEval benchmark (Conneau and Kiela, 2018). We also demonstrate that BSL can serve as an effective post-training approach to boost the performance of the state-of-the-art supervised SBERT model. We further extend our method for learning multilingual sentence representations and demonstrate that it is able to outperform strong multilingual baselines on cross-lingual STS tasks under both unsupervised and supervised settings. Detailed analysis of a few factors that could affect the model performance is provided as well to motivate future research.
Related Work

Sentence Representation Learning
Sentence representation learning methods are either supervised or unsupervised, although some might leverage both. Most of the supervised methods are trained on labeled natural language inference (NLI) datasets including Stanford NLI (SNLI) (Bowman et al., 2015) and MultiNLI (Williams et al., 2018). Early methods demonstrate good performance on a wide range of tasks (Conneau et al., 2017; Cer et al., 2018). Recently, SBERT (Reimers and Gurevych, 2019) fine-tuned a pre-trained Siamese BERT network on NLI and demonstrated state-of-the-art performance. Though effective, those methods rely heavily on labeled data and can be problematic to port to new domains. Zhang et al. (2020) showed that SBERT generalizes poorly on target tasks with a data distribution significantly different from the NLI data.
There are also fruitful outcomes for unsupervised methods. Some early studies attempt to learn from the internal structures within each sentence (Socher et al., 2011; Hill et al., 2016; Le and Mikolov, 2014) or utilize a distributional hypothesis to encode contextual information with generative (Hill et al., 2016) or discriminative objectives (Jernite et al., 2017; Logeswaran and Lee, 2018). Recently, transformer-based networks have attracted more attention (Devlin et al., 2019; Liu et al., 2019). However, they do not yield meaningful sentence representations directly without supervised fine-tuning. Reimers and Gurevych (2019) show that sentence embeddings obtained from BERT without fine-tuning even underperform GloVe embeddings (Pennington et al., 2014) in terms of semantic textual similarity.
More recently, a few unsupervised methods were proposed to learn sentence representations from transformer-based networks without supervised fine-tuning. Li et al. (2020) propose to transform the representation obtained by a pre-trained language model to an isotropic Gaussian distribution. Giorgi et al. (2020) minimize the distance between different spans sampled from the same document. However, their method requires an extremely long document of 2,048 tokens as input, which limits its application in domains with only short documents. Zhang et al. (2020) proposed IS-BERT to maximize the mutual information between the global embedding and local n-gram embeddings of a given sentence. However, IS-BERT requires careful negative sampling, and the n-gram embeddings may be suboptimal in capturing sentence-level semantics.

Figure 1: The proposed framework BSL. Two augmented views $x_1$ and $x_2$ of sentence $x$ are encoded by the online network $f_\theta$ and the target network $f_\xi$, respectively. Both networks are initialised from the same pre-trained language model, but $\xi$ is an exponential moving average (EMA) of $\theta$ during training. $p$ denotes the predictor, which is a multi-layer perceptron and is only applied on the online side. A stop-gradient operation is applied on the target side. The loss $\mathcal{L}_{\theta,\xi}$ maximises the similarity between the online prediction $z_1$ and the target representation $h_2$.

Unsupervised Representation Learning with Siamese Networks
Siamese networks have been increasingly used in various models (Chen and He, 2020; Grill et al., 2020; Caron et al., 2020) for unsupervised visual representation learning. These models typically maximize the similarity between two augmented views of an image encoded by the Siamese network. The main difference among these models is how they prevent undesired trivial solutions. Most works rely on contrastive learning with negative sampling (Tian et al., 2020) to avoid collapse. Our method BSL is mainly inspired by BYOL (Grill et al., 2020), which shows that one can learn transferable visual representations by bootstrapping representations without negative sampling. We transfer this learning strategy from images to texts with different network architectures and augmentation methods.

Model Description
Given a sentence $x$ sampled from the dataset $\mathcal{D}$ without label information, our goal is to learn a meaningful representation $h \triangleq f(x)$. In our framework, we adopt the idea from BYOL for unsupervised sentence representation learning with a Siamese network. The architecture of the proposed BSL is illustrated in Figure 1. Given a sentence $x$, we first obtain two augmented views $x_1 \triangleq \mathcal{T}(x)$ and $x_2 \triangleq \mathcal{T}'(x)$, where $\mathcal{T}$ and $\mathcal{T}'$ are augmentation transformations.
The two views are fed into the Siamese network separately. The online network contains an encoder network $f_\theta(\cdot)$ and a predictor network $p_\theta(\cdot)$. The target network contains an encoder network $f_\xi(\cdot)$ without a predictor, leading to an asymmetric framework. For the first augmented view $x_1$, the online network outputs a representation $z_1 \triangleq p_\theta(f_\theta(x_1))$. For the second augmented view, the target network outputs a representation $h_2 \triangleq f_\xi(x_2)$. Afterwards, we define a mean squared loss between the two normalized representations from the online and target networks, which can be simplified as minimizing their negative cosine similarity:
$$\mathcal{L}_{\theta,\xi} \triangleq -\frac{\langle z_1, h_2 \rangle}{\|z_1\| \cdot \|h_2\|}, \tag{1}$$
where $\|\cdot\|$ denotes the $\ell_2$-norm and $\langle \cdot, \cdot \rangle$ denotes the dot product between two vectors. As the loss is asymmetric over the two views, we also feed $x_2$ to the online network and $x_1$ to the target network to get $\tilde{z}_2 \triangleq p_\theta(f_\theta(x_2))$ and $\tilde{h}_1 \triangleq f_\xi(x_1)$, leading to the final objective:
$$\mathcal{L}^{\mathrm{BSL}}_{\theta,\xi} = \mathcal{L}_{\theta,\xi} + \tilde{\mathcal{L}}_{\theta,\xi}, \qquad \tilde{\mathcal{L}}_{\theta,\xi} \triangleq -\frac{\langle \tilde{z}_2, \tilde{h}_1 \rangle}{\|\tilde{z}_2\| \cdot \|\tilde{h}_1\|}. \tag{2}$$
Though we define the loss with parameters $\{\theta, \xi\}$, we only update $\theta$ during training, as indicated by the stop-gradient operation in Figure 1. This stop-gradient operation has been empirically shown to be effective for Siamese networks (Grill et al., 2020; Chen and He, 2020). $f_\xi$ is detached from the optimization graph of $\mathcal{L}_{\theta,\xi}$ and is updated with a weighted moving average of $f_\theta$. The updating dynamics become:
$$\theta \leftarrow \mathrm{optimizer}\big(\theta, \nabla_\theta \mathcal{L}^{\mathrm{BSL}}_{\theta,\xi}\big), \tag{3}$$
$$\xi \leftarrow \delta \xi + (1 - \delta)\theta. \tag{4}$$
Here $\delta$ is the momentum. When it is set to 1, the target network is never updated. When it is set to 0, the target network is instantaneously synchronized to the online network at each training step. At the inference stage, we obtain the representation of a sentence with the online encoder $f_\theta$.
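The objective and the target update described above can be illustrated with a minimal sketch in plain Python. It assumes representations and parameters are plain lists of floats; the function names `neg_cosine`, `symmetric_loss` and `ema_update` are hypothetical and are not taken from the released code.

```python
import math

def neg_cosine(z, h):
    """Negative cosine similarity between an online prediction z and a target h."""
    dot = sum(a * b for a, b in zip(z, h))
    norm_z = math.sqrt(sum(a * a for a in z))
    norm_h = math.sqrt(sum(b * b for b in h))
    return -dot / (norm_z * norm_h)

def symmetric_loss(z1, h2, z2, h1):
    """Sum of the two asymmetric terms: each view is fed once to the online branch."""
    return neg_cosine(z1, h2) + neg_cosine(z2, h1)

def ema_update(target, online, delta):
    """Moving-average update of the target parameters with momentum delta."""
    return [delta * t + (1.0 - delta) * o for t, o in zip(target, online)]
```

With `delta = 1.0` the call `ema_update(target, online, 1.0)` returns the target unchanged, and with `delta = 0.0` it returns a copy of the online parameters, matching the two limiting cases of the momentum. In a real training loop the targets would additionally be excluded from gradient computation, which this sketch does not model.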

Architecture Details
Augmentation We use back-translation to obtain two augmented views $x_1$ and $x_2$. In this work, we only consider input sentences $x$ in English. We use an English-to-German machine translation (MT) system to translate $x$ to $y_1$, and subsequently use a German-to-English MT system to translate $y_1$ back to $x_1$ to obtain one augmented view. Similarly, we use English-to-French and French-to-English MT systems to obtain another augmented view $x_2$. Besides back-translation, we also discuss other text augmentation approaches in § 4.4.
Architecture The online network $f_\theta$ and the target network $f_\xi$ take $x_1$ and $x_2$ as inputs and output $h_1$ and $h_2$. We use pre-trained language models to initialize the weights in $f_\theta$ and $f_\xi$ so that they benefit from the knowledge obtained at the pre-training stage. We apply average-pooling over the outputs of the pre-trained language models to obtain $h_1$ and $h_2$. A multi-layer perceptron (MLP) $p_\theta$ is stacked on top of $f_\theta$ as the predictor to transform $h_1$ into a prediction $z_1$ such that $z_1$ matches the target representation $h_2$.
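As an illustration of the average-pooling step, the following sketch averages token vectors into a single sentence vector. Excluding padding positions through an attention mask is an assumption made here (the text only states that average-pooling is applied), and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padded positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]
```

For example, with three token vectors of which the last is padding, only the first two contribute to the sentence representation.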

Experiment
Design We conduct various experiments to evaluate the effectiveness of the proposed method. Following prior works (Reimers and Gurevych, 2019;Zhang et al., 2020), our major evaluations are conducted on the Semantic Textual Similarity (STS) tasks and the classification tasks with the SentEval toolkit (Conneau and Kiela, 2018). To demonstrate the flexibility of the proposed method, we further extend it for learning multilingual sentence representations and evaluate it on cross-lingual STS tasks.
Implementation The MLP contains three linear layers. Given an input vector of dimension d, the output dimensions of the three layers are kd → kd → d, where k is a hyperparameter controlling the hidden size. Batch normalization and rectified linear units (ReLU) are applied to the intermediate linear layers. We use BERT-base or RoBERTa-base to initialize the online and target networks in monolingual settings.
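To make the predictor dimensions concrete, the short sketch below lists the shape of each linear layer for a given d and k and counts the linear parameters. The helper names are hypothetical, and the value k = 2 used in the usage note is purely illustrative since k is tuned on the development set.

```python
def predictor_layer_shapes(d, k):
    """(in_features, out_features) of the three linear layers: d -> kd -> kd -> d."""
    hidden = k * d
    return [(d, hidden), (hidden, hidden), (hidden, d)]

def predictor_linear_param_count(d, k):
    """Weights plus biases of the linear layers only (batch-norm parameters excluded)."""
    return sum(n_in * n_out + n_out for n_in, n_out in predictor_layer_shapes(d, k))
```

For a 768-dimensional encoder output and an illustrative k = 2, `predictor_layer_shapes(768, 2)` gives `[(768, 1536), (1536, 1536), (1536, 768)]`.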
Hyperparameter We tune the learning rate, batch size, momentum δ, and the hyperparameter k on the development set of STS-B (Cer et al., 2017).
Baselines Under an unsupervised learning setting, we compare with the unigram-TFIDF model, the Sequential Denoising Auto-Encoder (SDAE) (Hill et al., 2016), SkipThought and FastSent (Hill et al., 2016). Those models are all trained on the Toronto book corpus with 70M sentences. We also compare with sentence representations obtained with the average of GloVe embeddings (GloVe avg.), the average of BERT embeddings (BERT avg.), and the [CLS] representation of BERT (BERT [CLS]), as those are common ways to get sentence-level representations. We compare with BERT-flow (Li et al., 2020), a recent method that transforms the representation obtained by BERT to an isotropic Gaussian distribution. In addition, we compare with two unsupervised BERT fine-tuning methods. The first fine-tunes BERT with the masked language modeling (MLM) objective (BERT-mlm) (Gururangan et al., 2020). The second is IS-BERT (Zhang et al., 2020), which employs a mutual information maximization objective for fine-tuning BERT. We denote our model initialized by BERT-base (RoBERTa-base) as BSL-BERT (BSL-RoBERTa).
Under a supervised learning setting, we compare with InferSent (Conneau et al., 2017), Universal Sentence Encoder (USE) (Cer et al., 2018), and Sentence-BERT/RoBERTa (SBERT/SRoBERTa) (Reimers and Gurevych, 2019), which are all trained on the SNLI and MultiNLI datasets. To adapt BSL to a supervised learning setting, we first train an SBERT (SRoBERTa) model, then use the learned weights to initialize the online and target networks of BSL, and finally perform BSL training. We denote this model variant as BSL-SBERT (BSL-SRoBERTa).
Results of baselines marked with † are obtained from (Hill et al., 2016) (with a different number of decimal places).
The STS datasets consist of sentence pairs with scores from 0 to 5, where a larger score indicates higher semantic relatedness of the two sentences. We use Spearman's rank correlation between the cosine similarities of the sentence pairs and the gold scores as the evaluation metric, following prior works (Reimers and Gurevych, 2019; Zhang et al., 2020).
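The evaluation metric can be sketched in a few lines of plain Python: Spearman's rank correlation is the Pearson correlation computed on ranks, with tied values sharing their average rank. This is a minimal illustration with hypothetical helper names, not the implementation used by the evaluation toolkit.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(predicted_similarities, gold_scores):
    """Spearman's rho: Pearson correlation between the two rank sequences."""
    return pearson(average_ranks(predicted_similarities), average_ranks(gold_scores))
```

Because only ranks matter, cosine similarities in [-1, 1] can be compared directly against gold scores on the 0 to 5 scale without rescaling.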
Most of the prior unsupervised methods were trained on the Toronto book corpus, while the most recent and best-performing unsupervised method, IS-BERT, was trained on unlabeled texts from the SNLI and Multi-Genre NLI (MultiNLI) datasets. To have a fair comparison with IS-BERT, we follow its setting and train BSL on unlabeled texts from the SNLI and MultiNLI datasets. The BERT-mlm baseline is also trained with the same setting for a fair comparison. We illustrate the effect of corpus choice in § 4.4. SNLI contains 570k sentence pairs and MultiNLI contains 430k sentence pairs from a wider range of genres of spoken and written texts. In both datasets, each sentence pair is labeled as contradiction, entailment, or neutral. Note that the labels are excluded when training BSL in unsupervised settings. Table 1 presents the comparison results. Models are divided into two sets: trained on unlabeled data, or trained on labeled data. For unsupervised models, Unigram-TFIDF, SDAE, SkipThought and FastSent are trained on the Toronto book corpus, while BERT-mlm, IS-BERT, BERT-flow and our proposed method are trained on NLI. In the supervised setting, BSL-SBERT and BSL-SRoBERTa only take labeled entailment pairs as the inputs to the online and target networks.
We make the following observations. First, BSL outperforms all prior unsupervised methods by large margins. On average, it outperforms IS-BERT and BERT-flow trained with the same encoder and training corpus by 5.45% and 6.65%, respectively. It even outperforms the supervised baselines InferSent and USE. Second, unsupervised BSL still underperforms SBERT since the latter was fine-tuned on labeled NLI data. We show that by using BSL as a post-training approach, BSL-SBERT (BSL-SRoBERTa) can further increase the average result by 2.6% (4.7%) over SBERT (SRoBERTa). This suggests that BSL can also be used as an effective post-training approach after supervised fine-tuning.

SentEval Classification Tasks
Following prior works (Reimers and Gurevych, 2019; Zhang et al., 2020), we evaluate sentence representations on a set of classification tasks from SentEval. The evaluation is done with the SentEval toolkit. It takes sentence representations as fixed input features to a logistic regression classifier, which is trained in a 10-fold cross-validation setup, and the prediction accuracy is computed on the test fold. The sentence encoder is not fine-tuned in the training process. This set of tasks is the common benchmark used to evaluate the transferability of sentence representations on downstream tasks. Table 2 presents the comparison results. On average, BSL outperforms all prior unsupervised baselines. It also outperforms the supervised baselines InferSent and USE, and only slightly underperforms SBERT. BSL-SBERT marginally improves the results of SBERT. BSL-SRoBERTa achieves the best performance.

Multilingual STS
In this subsection, we show that BSL can be easily extended for learning multilingual sentence representations. Following Reimers and Gurevych (2020), we conduct evaluation on the multilingual STS 2017 dataset (Cer et al., 2017), which contains annotated pairs for EN-EN, AR-AR, ES-ES, EN-AR, EN-ES, EN-TR, EN-DE, and EN-FR.
To learn multilingual representations under the unsupervised setting, we process the NLI data as follows. We translate the English NLI sentences to AR, ES, TR, DE and FR using the Google translation engine and pair the original English sentence with each of its translations. We obtain 5 pairs (EN-AR/ES/TR/DE/FR) from one sentence and treat the English sentence as one view and its translation as the other view. We concatenate all pairs as the training data. We use multilingual BERT (mBERT) to initialize $f_\theta$ and $f_\xi$, such that the token-level representations between the different languages are aligned. The remaining training procedure is the same as described in § 3. We denote our unsupervised model as BSL-uns. We compare with sentence representations obtained with mean pooling of mBERT and XLM-R (Conneau et al., 2020) embeddings under the unsupervised setting.
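The construction of the multilingual training pairs can be sketched as follows. The `translate` argument is a hypothetical callable standing in for whatever machine translation service is used; it is an assumption of this sketch and not an actual API.

```python
TARGET_LANGS = ["ar", "es", "tr", "de", "fr"]

def build_view_pairs(english_sentences, translate):
    """Pair each English sentence (view 1) with its translation into every target language (view 2).

    `translate(sentence, lang)` is a user-supplied, hypothetical translation function.
    """
    pairs = []
    for sentence in english_sentences:
        for lang in TARGET_LANGS:
            pairs.append((sentence, translate(sentence, lang)))
    return pairs
```

Each English sentence therefore contributes five training pairs, and the pairs for all languages are concatenated into one training set.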
Table 3: Spearman's rank correlation ρ between the cosine similarity of sentence representations and the gold labels. ρ × 100 is reported. Results of baselines are obtained from (Reimers and Gurevych, 2020).
For supervised learning, we compare with methods from Reimers and Gurevych (2020): mBERT-/XLM-R-nli-stsb denotes the setting where mBERT and XLM-R are fine-tuned on the English NLI data and the English training set of the STS benchmark (STS-B); mBERT-/XLM-R ← SBERT-nli-stsb is the knowledge-distillation method proposed in their paper, where mBERT and XLM-R are trained to imitate the output of the English SBERT trained on NLI and STS-B, using multilingual parallel sentence pairs. We also compare with the results of mUSE (Chidambaram et al., 2019) and LaBSE (Feng et al., 2020), which use dual-encoder transformer architectures. mUSE was trained on question-answer pairs, SNLI, translated SNLI data, and parallel corpora over 16 languages. LaBSE was trained on 6 billion translation pairs for 109 languages. For BSL, we initialize our online and target networks with the learned weights from XLM-R ← SBERT-nli-stsb and then perform BSL training in the same way as described above. We denote our model in this setting as BSL-sup. Table 3 presents the results. Under the unsupervised setting, averaging the multilingual token representations yields poor results. BSL-uns achieves promising results with scores higher than 70. For the supervised methods, we observe that directly fine-tuning multilingual pre-trained models on English NLI and STS-B datasets does not generalize well in a cross-lingual setting. Knowledge distillation-based models are strong baselines. Applying BSL as a post-training approach can boost the results of the distilled models by large margins. These observations demonstrate that BSL has the flexibility to be applied to learning multilingual sentence representations.

Analysis
In this subsection, we discuss a few factors that could affect the model performance. We use BERT-base as the encoder for analysis.
Choice of Corpus Previous works (Hill et al., 2016; Cer et al., 2018) indicated that the dataset used for learning sentence representations in a supervised setting significantly impacts their performance on STS tasks. They found that learning with NLI datasets is particularly useful and yields good results on common STS benchmarks. We have similar observations with the proposed unsupervised method. In Table 4, we show the results of training our model with a subset of 5 million sentences from the Toronto book corpus. This setting achieves an average result of 69.65 on STS tasks, still outperforming the prior best unsupervised model IS-BERT by 3.07%, which again demonstrates the effectiveness of the proposed framework.
However, we observe that the average result obtained from training with the book corpus is 2.38% lower than the result of training with the NLI datasets, even though the number of training pairs of the latter is only 1 million. Training on both of them still underperforms training on NLI alone. This finding indicates that the choice of training corpus is a key factor that affects model performance. When evaluating on the common STS benchmarks used in our experiments, the NLI datasets are better choices as they are semantically related to the STS data. We also conduct an evaluation on an Argument Facet Similarity task, which is more domain-specific and dissimilar to the NLI tasks. The results are provided in Appendix B. We find that in this scenario, training with NLI data yields poor generalization results on the target test set, while training on the target raw text yields much better performance. The results indicate that a corpus semantically related to the target task should be adopted as the training set.

Table 4: Results with 1) different training corpora; and 2) different augmentation techniques. Spearman's rank correlation ρ between the cosine similarity of sentence representations and the gold labels. ρ × 100 is reported.

Table 5: An example of augmentations generated for the same sentence.
Original: The cats used to love plopping on the newspapers.
Synonym: The cats use to have sex flump on the newspapers.
MLM: The cats used to love plucking in the newspapers.
Back-translation: Cats loved to play in the newspapers.
Entailment: oh when i had the uh cats at my place as soon as i took out the newspaper to read it they would plop right down on top of it and just not move and just stay there forever.
Augmentation Techniques It has been shown that data augmentation plays a crucial role in unsupervised visual representation learning (Grill et al., 2020). Images can be augmented easily by rotating, resizing, or cropping. However, less work has been done on augmentation techniques for texts (Fang et al., 2020; Giorgi et al., 2020). Here, we study how different augmentation techniques affect the model performance.
We present the results of two other augmentation approaches besides back-translation in Table 4. Synonym denotes the setting where we randomly replace a few words with their synonyms. MLM denotes the setting where we first randomly mask a few tokens and then use a pre-trained masked language model to generate the masked tokens. Specifically, for both methods, given a sentence $x$, we set $x_1 = x$ and obtain $x_2$ with the respective augmentation technique. We found that using one augmented view performs slightly better than using two augmented views for synonym- and MLM-based methods. One possible reason is that these methods may generate augmented sentences with semantics totally different from the original sentences, as we will show in this subsection. This kind of augmentation may bring in too much randomness and noise. Therefore, using two augmented views might instead harm the model performance.
For Synonym, we select 30% of words and substitute them with similar words according to WordNet (Miller, 1995). For MLM, we mask 20% of tokens and use RoBERTa-base for token generation. In addition, we show results of a setting where we treat the sentence pairs labeled with entailment from the NLI datasets as the two views (NLI_entail) for our model, as well as a setting using the combination of NLI unlabeled text with back-translations and the entailment pairs as the training corpus (Back-translation + NLI_entail). The purpose is to illustrate how our model would perform with high-quality augmented data.
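The masking step of the MLM augmentation can be sketched as below. The sketch only selects and masks positions; filling the masks with a pre-trained masked language model is not shown. Splitting on whitespace, rounding the number of masked positions, and masking at least one token are simplifying assumptions, and `mask_tokens` is a hypothetical helper name.

```python
import random

def mask_tokens(tokens, ratio=0.2, mask_token="<mask>", seed=None):
    """Replace round(ratio * len(tokens)) randomly chosen tokens (at least one) with mask_token."""
    if not tokens:
        return []
    rng = random.Random(seed)  # local generator so the global random state is untouched
    n_mask = max(1, round(ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
```

For a ten-token sentence and `ratio=0.2`, two positions are replaced by the mask token; the same routine with `ratio=0.3` would select the positions to be substituted in a synonym-based variant.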
The results in Table 4 show that our proposed framework can work with both Synonym and MLM, as they still outperform IS-BERT on the average result by 1.63% and 2.81%, respectively. However, they are less effective than Back-translation. We observe that training with entailment pairs yields good results: with only 300k training pairs, NLI_entail is comparable to the model trained on all data from the NLI datasets augmented with back-translation (1 million training pairs). In addition, when training on both (Back-translation + NLI_entail), a 2.91% improvement on the average result over Back-translation is observed. The results indicate that the quality of the augmented pairs directly affects the performance of the proposed framework. Table 5 presents an example of augmentations generated for the same sentence (more examples are provided in Appendix C). We observe that Synonym substitutes words without considering the context, while MLM generates words based on the context but loses the original word semantics. Back-translation yields a relatively better sentence; its drawback is that it relies on external machine translation systems. Entailment refers to the sentence in the NLI datasets with which the original sentence has an entailment relation. It can be regarded as an ideal augmentation of the original sentence. How to automatically generate such augmentations remains an open question, and we leave it to future research.
Momentum The momentum δ in Equation (4) is an important hyperparameter. When it is set to 1, the target network is never updated and remains the same as its initialization. When it is set to 0, the target network is set to the online network at each training step. Table 6 shows the results of our method with different values of momentum. We observe that our proposed method works better with a larger momentum that is close to but not equal to 1. A similar phenomenon has also been observed in BYOL (Grill et al., 2020). In addition, we find that although directly averaging the token embeddings from BERT yields poor sentence representations, as shown in Table 1, initializing the target network with BERT and keeping it unchanged (setting the momentum to 1) during the learning procedure helps the online network learn much better representations, yielding a 21.84% improvement on STS-B.
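The role of the momentum can be made explicit by unrolling the moving-average update over training steps. The derivation below is a sketch under an indexing assumption: $\theta_t$ denotes the online parameters after step $t$, $\xi_0$ the initial target parameters, and the update is $\xi_t = \delta \xi_{t-1} + (1-\delta)\theta_t$.

```latex
\begin{align*}
\xi_t &= \delta\,\xi_{t-1} + (1-\delta)\,\theta_t \\
      &= \delta^{2}\,\xi_{t-2} + (1-\delta)\bigl(\delta\,\theta_{t-1} + \theta_t\bigr) \\
      &= \delta^{t}\,\xi_0 + (1-\delta)\sum_{i=1}^{t} \delta^{\,t-i}\,\theta_i ,
\qquad
\delta^{t} + (1-\delta)\sum_{i=1}^{t}\delta^{\,t-i} = 1 .
\end{align*}
```

The target is thus a weighted average of its initialization and all past online parameters, with weights that sum to one. Setting $\delta = 1$ leaves only $\xi_0$, and setting $\delta = 0$ leaves only $\theta_t$, which recovers the two limiting cases described above; values of $\delta$ close to 1 make the target change slowly while still tracking the online network.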
Batch Size & Contrastive Learning Lastly, we analyze the effect of batch size. Table 7 shows how the proposed model performs with batch sizes in {16, 32, 64, 128}. We also compare with a setting where contrastive learning is used as the self-supervised learning objective, since it is more commonly used in visual representation learning. Specifically, in this setting, given a batch of n augmented sentence pairs (2n sentences), each pair is treated as a positive pair. For each positive pair, we treat the other 2(n − 1) augmented examples within the minibatch as negative examples.
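A contrastive objective of this kind can be sketched as follows. The exact loss is not specified in the text, so this sketch assumes a softmax cross-entropy over temperature-scaled cosine similarities, as commonly used in visual contrastive learning; the function names and the temperature value 0.1 are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def in_batch_contrastive_loss(views_a, views_b, temperature=0.1):
    """Mean loss over 2n anchors; each anchor has 1 positive and 2(n - 1) in-batch negatives."""
    n = len(views_a)
    items = list(views_a) + list(views_b)  # items[i] and items[(i + n) % 2n] form a positive pair
    total = 0.0
    for i, anchor in enumerate(items):
        positive_logit = cosine(anchor, items[(i + n) % (2 * n)]) / temperature
        # Logits against every other example: the positive plus the 2(n - 1) negatives.
        logits = [cosine(anchor, other) / temperature
                  for j, other in enumerate(items) if j != i]
        peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - positive_logit
    return total / (2 * n)
```

The number of negatives per anchor grows with the batch size n, which is one reason such objectives tend to benefit from larger batches.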
The results in Table 7 show that for BSL, setting the batch size to 64 yields the best result. Overall, BSL is less sensitive to changes in batch size, while contrastive learning tends to perform better with a larger batch size such that sufficient negative samples can be obtained. Contrastive learning may achieve better performance with an even larger batch size; we leave this for future investigation due to its large memory consumption.

Conclusion
In this paper, we propose BSL for unsupervised sentence representation learning. The experimental results demonstrate that our method significantly outperforms state-of-the-art unsupervised methods and can be further extended for learning multilingual sentence representations. In future work, we expect both theoretical advances in Siamese networks for representation learning, e.g., why stop-gradient works so well and how to further improve the updating dynamics, and ideas designed specifically for NLP, e.g., augmentation or learning objectives.

A Implementation Details
Our implementation is based on Python 3.6 and PyTorch 1.6.0. All experiments were conducted on an RTX 8000 GPU (CUDA version 10.2) configured on a standard workstation. The workstation is configured with 2 Intel Xeon Gold 6248R CPUs, 256GB RAM, and the Ubuntu 18.04 operating system. We provide the main hyperparameters of our model training on the NLI datasets in Table 8. For cross-lingual experiments, we use bert-base-multilingual-cased and the other hyperparameters are the same. The NLI and related datasets can be downloaded from https://huggingface.co/datasets. The development results of BSL on the NLI dataset are shown in Table 9.

B Argument Facet Similarity
We have demonstrated that the proposed method significantly outperforms other unsupervised baselines on a suite of STS and classification tasks that are commonly used in previous works. However, those tasks are less domain or task specific.
Here, we further investigate the effectiveness of BSL in a domain-specific scenario. Following prior works (Reimers and Gurevych, 2019; Zhang et al., 2020), we conduct evaluations on the Argument Facet Similarity (AFS) dataset (Misra et al., 2016). The dataset consists of 6k argument pairs on three controversial topics: gun control, gay marriage, and death penalty. Each pair was annotated on a scale from 0 (different) to 5 (equivalent). This dataset is more challenging than the STS benchmarks: the lexical gap between the sentences in AFS is larger and, to be considered similar, a pair of arguments must not only make similar claims but also provide similar reasoning. We compare models in a setting where task- or domain-specific labeled data is not available. In this setting, supervised methods such as SBERT and InferSent need to be trained on NLI data and perform cross-domain predictions on the AFS sentence pairs. Unsupervised methods such as BERT-mlm, IS-BERT and our proposed BSL can be directly trained on the task-specific raw texts. Table 10 shows the comparison results. We present both Pearson correlation and Spearman's rank correlation. The results show that the proposed method still outperforms the other methods. Interestingly, the two supervised methods InferSent and SBERT perform the worst in this setting. This is because the AFS data differs significantly from the NLI data. This suggests that the domain-relatedness between the training set and the target test set has a large impact on the model performance, and that models learned with supervised methods are problematic to port to distant domains.

C More Examples
More examples of augmentations generated by different approaches are provided in Table 11.