Generating Datasets with Pretrained Language Models

To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how PLMs can be leveraged to obtain high-quality sentence embeddings without the need for labeled data, finetuning or modifications to the pretraining objective: We utilize the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which we then use for finetuning much smaller and more efficient models. Our fully unsupervised approach outperforms strong baselines on several semantic textual similarity datasets.


Introduction
While pretrained language models (PLMs) achieve strong results for many NLP tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019), they do not produce good sentence embeddings out of the box (Reimers and Gurevych, 2019). Recent approaches address this by augmenting or replacing the language modeling objective with likewise unsupervised sentence-level objectives (e.g., Zhang et al., 2020), but they typically lag behind their supervised counterparts trained on human-annotated sentence pairs. Unfortunately, obtaining large amounts of high-quality training data can be both difficult and prohibitively expensive (Bowman et al., 2015; Agirre et al., 2016). Furthermore, with larger and larger model sizes (Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020; Fedus et al., 2021), it becomes increasingly challenging to finetune PLMs.¹

¹ Our code and datasets are publicly available at https://github.com/timoschick/dino.

Task: Write two sentences that mean the same thing.
Sentence 1: "A man is playing a flute." Sentence 2: "He's playing a flute."

Task: Write two sentences that are somewhat similar.
Sentence 1: "A man is playing a flute." Sentence 2: "A woman has been playing the violin."

Task: Write two sentences that are on completely different topics.
Sentence 1: "A man is playing a flute." Sentence 2: "A woman is walking down the street."

Figure 1: Continuations generated by GPT2-XL with DINO for three different task descriptions. We investigate two different unsupervised approaches to generating sentence-similarity datasets: (i) the input sentence is given and only the continuation is generated, which requires that an (unlabeled) set of sentences is available; (ii) both input sentence and continuation are generated, which does not rely on the availability of any resources.
To alleviate both problems, we explore a novel approach to obtaining high-quality sentence embeddings: We mimic the creation of NLI datasets by human crowdworkers (Bowman et al., 2015; Williams et al., 2018), but replace human annotators with large PLMs. This allows us to automatically create entire datasets from scratch that can be used for supervised training of much smaller models. Not only does this solve the problem of limited training data, it also provides a viable path to leverage big models like GPT-3 (Brown et al., 2020) without requiring any updates to their parameters. As illustrated in Figure 1, our approach is based on recent methods for providing instructions to PLMs (e.g., Radford et al., 2019; Brown et al., 2020; Schick and Schütze, 2020, 2021a). We use the self-debiasing approach of Schick et al. (2021) to ensure that each generated text pair is not only a good fit for a given similarity label, but also not a good fit for other labels. We refer to our method as Datasets from Instructions (DINO).
In summary, our contributions are as follows:
• We introduce DINO, a method for automatically generating labeled datasets of arbitrary size by providing PLMs with instructions.
• We release STS-🦕 (read as "STS-Dino"), the first textual similarity dataset generated completely automatically, without any human annotation effort.
• We show that Sentence-RoBERTa (Reimers and Gurevych, 2019) trained on STS-🦕 outperforms strong baselines on several semantic textual similarity datasets.
Closely related to our work, Efrat and Levy (2020) examine the ability of PLMs to follow natural language instructions for generating examples in place of human crowdworkers, but find that their approach performs poorly.

Task: Write two sentences that i_y.
Sentence 1: "x_1" Sentence 2: "

Figure 2: Instruction template I_y(x_1) for similarity label y and input sentence x_1; i_y is described in Section 3. See Figure 1 for three instantiations of the template.

Datasets from Instructions
Let M be a PLM with vocabulary V, X = V* the set of all token sequences, and Y a finite set of semantic similarity labels. Our aim is to generate a dataset Z ⊂ X × X × Y of text pairs (x_1, x_2) with corresponding similarity labels y. For a token x ∈ V and a sequence 𝐱 ∈ X, we denote with p_M(x | 𝐱) the probability that M assigns to x as a continuation of 𝐱.
We first assume that we already have access to a set X_1 ⊂ X of texts (e.g., a set of sentences that are typical of the domain of interest). This is a realistic setting for many real-world applications, where large amounts of unlabeled text are abundant, but it is difficult to obtain interesting and (for our task) useful text pairs and labels. DINO requires a set of instructions I = {I_y | y ∈ Y}, where each I_y ∈ I is a function that, given an input x_1 ∈ X_1, prompts its recipient to generate an appropriate second text x_2. We use the instruction template in Figure 2. Note that for all y, I_y ends with an opening quotation mark, which allows us to treat the first quotation mark generated by the PLM as a sign that it is done.
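For concreteness, the following is a minimal Python sketch of how the template I_y(x_1) from Figure 2 could be instantiated with the three task descriptions of Figure 1; the function name and the exact whitespace of the prompt are our own assumptions, not taken from the released implementation.

```python
# Label-specific phrases i_y, taken from the three task descriptions in Figure 1.
TASK_PHRASES = {
    1.0: "mean the same thing",
    0.5: "are somewhat similar",
    0.0: "are on completely different topics",
}

def build_instruction(x1: str, y: float) -> str:
    """Instantiate the template I_y(x1) from Figure 2. It ends with an opening
    quotation mark so that the first quotation mark generated by the PLM can be
    treated as the end of x2. (Exact line breaks are an assumption.)"""
    return (
        f"Task: Write two sentences that {TASK_PHRASES[y]}.\n"
        f'Sentence 1: "{x1}"\n'
        'Sentence 2: "'
    )

print(build_instruction("A man is playing a flute.", 0.5))
```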
For a given x_1 ∈ X_1 and y ∈ Y, we could directly use the instruction I_y to obtain x_2 by continuously sampling tokens x̂_1, x̂_2, ... starting from I_y(x_1) until some x̂_k is a quotation mark and setting x_2 = x̂_1, ..., x̂_{k−1}. However, we may want the PLM to generate a text x_2 that is not only a good fit for instruction I_y(x_1), but also not a good fit for some other instruction I_{y'}(x_1). We refer to y' as a counterlabel for y and denote the set of y's counterlabels as CL(y). For example, 1 ∈ CL(0.5) means that for y = 0.5, we want M to generate a sentence x_2 that is similar to (y = 0.5), but at the same time does not have the same meaning as (y = 1), the sentence x_1. We achieve this using Schick et al. (2021)'s self-debiasing algorithm: When sampling the token x̂_k, we consider not just p_y = p_M(x̂_k | I_y(x_1) x̂_1 ... x̂_{k−1}), but also p_{y'} = p_M(x̂_k | I_{y'}(x_1) x̂_1 ... x̂_{k−1}) for all y' ∈ CL(y). We penalize each token x̂_k for which p_y is lower than any p_{y'} by multiplying its probability with a factor α = exp(λ · δ_y), where δ_y = p_y − max_{y' ∈ CL(y)} p_{y'} is the difference between x̂_k's probability given I_y(x_1) and its maximum probability given I_{y'}(x_1) for any y' ∈ CL(y), and the decay constant λ is a hyperparameter.
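As a minimal sketch (not the authors' released implementation), the penalty can be applied to a next-token distribution as follows, assuming the probabilities under I_y(x_1) and under every counterlabel instruction have already been computed for the current decoding step; the function name and tensor layout are our assumptions.

```python
import torch

def self_debias(p_y: torch.Tensor, p_counter: torch.Tensor,
                decay: float = 100.0) -> torch.Tensor:
    """Reweight a next-token distribution with the self-debiasing penalty.

    p_y:        probability of every vocabulary token given I_y(x1) and the
                tokens generated so far, shape (vocab_size,)
    p_counter:  the same probabilities under each counterlabel instruction
                I_y'(x1), shape (num_counterlabels, vocab_size)
    decay:      the decay constant lambda
    """
    # delta_y: a token's probability under I_y minus its maximum probability
    # under any counterlabel instruction
    delta = p_y - p_counter.max(dim=0).values
    # Only tokens that fit a counterlabel better (delta < 0) are penalized;
    # alpha = exp(lambda * delta) < 1 shrinks their probability.
    alpha = torch.exp(decay * torch.clamp(delta, max=0.0))
    reweighted = p_y * alpha
    return reweighted / reweighted.sum()  # renormalize before sampling
```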
For settings where no set of unlabeled texts X_1 is available, a straightforward approach would be to use the phrase shown in Figure 2 up to and including the first quotation mark as an instruction to let the PLM generate both x_1 and x_2. However, this approach has at least two issues: First, generated texts may not match the required schema (e.g., the model may never produce the string "Sentence 2:"). Second, the set of texts x_1 should ideally be highly diverse, whereas we want to give the model less leeway when generating x_2, so we may want to use different sampling strategies for x_1 and x_2.
We solve both problems as follows: We first use I_y (Figure 2) up to and including the first quotation mark (the one right after "Sentence 1:") to generate x_1; we stop as soon as the model produces a quotation mark. We run this procedure repeatedly until we have a sufficient number of sentences. These sentences form the set X_1, and we then proceed exactly as in the case where X_1 is already given.
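The two-stage procedure can be sketched with the Transformers library roughly as follows; this illustrative snippet omits self-debiasing and batching, and the helper name and prompt formatting are our own assumptions.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

def generate_until_quote(prompt: str, **sampling_kwargs) -> str:
    """Sample a continuation and cut it off at the first quotation mark,
    which the template uses as the end-of-text marker."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output = model.generate(input_ids, do_sample=True, max_new_tokens=40,
                            pad_token_id=tokenizer.eos_token_id, **sampling_kwargs)
    continuation = tokenizer.decode(output[0, input_ids.shape[1]:])
    return continuation.split('"')[0]  # keep only the text before the first quote

# Stage 1: generate x1 from the template prefix, which ends after 'Sentence 1: "'
prefix = 'Task: Write two sentences that are somewhat similar.\nSentence 1: "'
x1 = generate_until_quote(prefix, top_p=0.9)

# Stage 2: use the full instruction I_y(x1) to generate x2
x2 = generate_until_quote(f'{prefix}{x1}"\nSentence 2: "', top_p=0.9, top_k=5)
```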
We use DINO to generate STS-🦕 ⊂ X × X × Y, a dataset of text pairs with semantic similarity labels. We generate two variants:
• STS-🦕-x_2, for which we make use of STSb to obtain a set of texts X_1;
• STS-🦕-x_1x_2, where the set of sentences X_1 is generated from scratch.
We use GPT2-XL as the PLM with a decay constant of λ = 100 and the set of counterlabels CL(y) = {y' ∈ Y | y' > y}. That is, we do not restrict the PLM when generating texts for y = 1, but for y = 0.5 (y = 0) we encourage it not to generate texts x_2 that mean the same thing as (are somewhat similar to) x_1. We apply top-p (Holtzman et al., 2020) and top-k (Fan et al., 2018; Holtzman et al., 2018) sampling with p = 0.9, k = 5 and generate up to 40 output tokens. For each x_1 ∈ X_1 and y ∈ Y, we generate up to two corresponding x_2's. For STS-🦕-x_1x_2, we obtain X_1 by generating 15,000 sentences using only top-p sampling (again with p = 0.9) and no top-k sampling to ensure more diversity in the generated output. We remove all examples where x_1 = x_2 (as those provide no training signal to the model) and split the datasets 90/10 into training and validation.
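The counterlabel choice and the post-processing described above could look as follows; this is an illustrative sketch, and the function name and the random seed are ours.

```python
import random

LABELS = [0.0, 0.5, 1.0]
# CL(y) = {y' in Y | y' > y}: no counterlabels for y = 1, one for y = 0.5,
# and two for y = 0.
COUNTERLABELS = {y: [yp for yp in LABELS if yp > y] for y in LABELS}

def postprocess(pairs, seed=42):
    """pairs: list of (x1, x2, y) triples produced by DINO."""
    pairs = [(x1, x2, y) for x1, x2, y in pairs if x1 != x2]  # drop trivial pairs
    random.Random(seed).shuffle(pairs)
    cut = int(0.9 * len(pairs))
    return pairs[:cut], pairs[cut:]  # 90/10 train/validation split
```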
To assess the quality of the generated datasets, we use them to train Sentence-RoBERTa (Reimers and Gurevych, 2019), a biencoder architecture based on RoBERTa (base) (Liu et al., 2019) that measures the similarity of two texts by computing the cosine similarity of their embeddings. As our datasets contain many noisy examples, we use a technique similar to label smoothing (Szegedy et al., 2016) and replace similarity scores of 0 and 1 with 0.1 and 0.9, respectively. Additionally, for each x_1, we sample two x_2's from other dataset entries and augment the dataset with (x_1, x_2, 0). We use the default parameters of Reimers and Gurevych (2019) with a batch size of 32 and train for at most one epoch; the exact number of training steps is determined based on Spearman's rank correlation on the STS-🦕 validation set.
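Training the biencoder on the generated data can be sketched with the sentence-transformers library roughly as follows; the placeholder data, the helper names, and some details (e.g., whether the augmented pairs are also label-smoothed) are our assumptions rather than the exact published recipe.

```python
import random
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Placeholder data; replace with the generated STS-🦕 training split.
train_pairs = [
    ("A man is playing a flute.", "He's playing a flute.", 1.0),
    ("A man is playing a flute.", "A woman is walking down the street.", 0.0),
]

def smooth(y: float) -> float:
    """Label smoothing for the noisy generated labels: 0 -> 0.1, 1 -> 0.9."""
    return {0.0: 0.1, 1.0: 0.9}.get(y, y)

examples = [InputExample(texts=[x1, x2], label=smooth(y)) for x1, x2, y in train_pairs]

# Augmentation: for each x1, add two x2's sampled from the dataset with label 0
# (sampling collisions with x1's own entry are ignored in this sketch).
all_x2 = [x2 for _, x2, _ in train_pairs]
for x1, _, _ in train_pairs:
    for x2 in random.sample(all_x2, 2):
        examples.append(InputExample(texts=[x1, x2], label=smooth(0.0)))

# Biencoder: RoBERTa (base) with mean pooling, trained with a cosine-similarity loss.
word_embeddings = models.Transformer("roberta-base")
pooling = models.Pooling(word_embeddings.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embeddings, pooling])

loader = DataLoader(examples, shuffle=True, batch_size=32)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))], epochs=1)
```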

Results
We compare S-RoBERTa (base) trained on datasets generated with DINO to S-BERT and S-RoBERTa finetuned on NLI data as well as Universal Sentence Encoder (USE) (Cer et al., 2018).

We investigate the importance of self-debiasing (Schick et al., 2021) in Table 2 (top); as can be seen, removing self-debiasing (λ = 0) dramatically hurts performance. Increasing the decay constant (λ = 200) leads to slightly worse performance, as the overall quality of generated sentences decreases (Schick et al., 2021). Table 2 (bottom) shows that training on STS-🦕 requires measures to limit the effect of noisy labels: removing label smoothing and performing no data augmentation (i.e., not generating additional pairs (x_1, x_2, 0) by sampling random x_2's for each x_1) clearly hurts performance.
To further assess the quality of datasets generated with DINO, we additionally perform a small-scale human evaluation. To this end, we consider the exact version of STS-🦕-x_2 used for training S-RoBERTa; that is, we perform label smoothing, augmentation with randomly sampled text pairs, and removal of trivial examples where x_1 = x_2. From the resulting dataset, we randomly select 100 text pairs (x_1, x_2) and annotate them ourselves with similarity scores y ∈ {0, 0.1, 0.5, 0.9}, where we assign a score of 0.9 when x_1 and x_2 mean (almost) the same thing and a score of 0.1 when they are on different topics, but still show a weak similarity in some aspect.
Table 3 compares these human annotations to the generated labels. Of the pairs that are supposed to be on completely different topics, many (41%) still have a certain similarity according to human judgment. In contrast, randomly sampled pairs are indeed on completely different topics in almost all cases. Moreover, we can see that GPT2-XL has particular difficulty in generating pairs of non-identical sentences that really mean the same thing: Only 47% of all examples that should have the same meaning do actually mean (almost) the same thing. However, the strong performance of S-RoBERTa trained on STS-🦕-x_2 suggests that, despite this noise, there is sufficient signal in this dataset for successful training.

We finally take a qualitative look both at positive examples where DINO is able to create high-quality text pairs and at some typical errors found in many of the generated examples. As shown in Table 4, for y = 1 the PLM sometimes comes up with decent paraphrases (e.g., "notches a victory" → "wins") or substitutes with very similar meaning ("cutting" → "slicing"), but more often it generates sentences that either omit or mix up important information, and sometimes it produces sentences with an entirely different meaning. Whereas sentences generated for y = 0.5 by and large look reasonable, for y = 0 the PLM often simply flips words ("closed" → "open", "large" → "small") instead of producing sentences on completely different topics.

Conclusion
We have introduced DINO, a method for using large PLMs to generate entire datasets of labeled sentence pairs from scratch, requiring no labeled data and no parameter updates. This is achieved by providing instructions in natural language, combined with the self-debiasing method of Schick et al. (2021). For future work, it would be interesting to see whether the noise in datasets generated with DINO can be further reduced, e.g., by using different sets of instructions (Jiang et al., 2020; Schick and Schütze, 2021a) or by supplementing our pipeline with additional filtering steps.

A Experimental Setup
Our implementation is based on the Transformers library (Wolf et al., 2020) and PyTorch (Paszke et al., 2017). All our experiments were conducted on two GPUs with 11GB RAM (NVIDIA GeForce GTX 1080 Ti). Generating STS-🦕-x_1x_2 and STS-🦕-x_2 using both GPUs took approximately 48 hours per dataset. Training a Sentence Transformer on these datasets took less than 2 hours on average.

C Additional Results
Our main results do not include scores for DeCLUTR (Giorgi et al., 2020) and CLEAR (Wu et al., 2020), two recent approaches using contrastive learning, as their evaluation setup differs from that described in Reimers and Gurevych (2019) (and used by all other baselines) in the following respects:
• Both Giorgi et al. (2020) and Wu et al. (2020) treat SICK and STSb as supervised tasks, i.e., they use the provided task-specific training sets to perform regular supervised training.
• The STS12-16 datasets each consist of several subsets. Giorgi et al. (2020) and Wu et al. (2020) compute Spearman's correlation coefficient separately for each of these subsets and report the mean score across all subsets. In contrast, for our main results we follow Reimers and Gurevych (2019) and concatenate all subsets to form one large set on which Spearman's correlation is computed just once.
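To make the difference between the two protocols concrete, here is a small illustrative sketch (function names are ours) of how Spearman's correlation would be computed under each setup, given gold scores and predicted similarities for every STS subset.

```python
import numpy as np
from scipy.stats import spearmanr

# subsets: list of (gold_scores, predicted_similarities) pairs, one per STS subset.

def eval_concatenated(subsets):
    """Reimers and Gurevych (2019): concatenate all subsets, compute one correlation."""
    gold = np.concatenate([g for g, _ in subsets])
    pred = np.concatenate([p for _, p in subsets])
    return spearmanr(gold, pred).correlation

def eval_mean_of_subsets(subsets):
    """Giorgi et al. (2020) / Wu et al. (2020): correlation per subset, then the mean."""
    return np.mean([spearmanr(g, p).correlation for g, p in subsets])
```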
As the implementations of both methods are not publicly available as of this writing, we are unable to compute scores for DeCLUTR and CLEAR using the evaluation setup of Reimers and Gurevych (2019) ourselves. Instead, we recompute scores for DINO (both with STS-🦕-x_2 and STS-🦕-x_1x_2) using the evaluation setup of Giorgi et al. (2020) and Wu et al. (2020) on STS12-16; results are shown in Table 5.

Table 5: Results for CLEAR (Wu et al., 2020), DeCLUTR (Giorgi et al., 2020), and Sentence-RoBERTa (base) trained on STS-🦕-x_1x_2 and STS-🦕-x_2 using the evaluation setup of Wu et al. (2020) and Giorgi et al. (2020): For each task, we report the mean Spearman correlation over all subtasks in a fully unsupervised setting.