SMASH: Improving SMAll Language Models’ Few-SHot Ability with Prompt-Based Distillation



Introduction
Language models at scale, such as GPT-3 (Brown et al., 2020), have shown remarkable prompt-based few-shot learning performance on a wide variety of tasks given only a natural language prompt and a few demonstrations. However, this few-shot ability usually comes with heavy computation and a huge number of parameters. Recent works (Gao et al., 2020; Li and Liang, 2021) investigated prompt-based few-shot learning on moderately sized language models such as BERT-large (Devlin et al., 2019), RoBERTa-large (Liu et al., 2019) and GPT-2 (Radford et al., 2019), but these models are still difficult to deploy on edge devices such as mobile phones.

Table 1: Fine-tuning performance (FT), prompt-based fine-tuning performance (PT), and relative improvement of PT over FT (∆) for models of different sizes on 16-shot training and validation datasets. †: results from (Gao et al., 2020). Bold results indicate the largest relative improvement.
In this paper, we investigate whether we can make small language models, such as DistilRoBERTa-base (Sanh et al., 2019), better few-shot learners. Prompt-based fine-tuning has been seen as a promising method for few-shot learning because it reuses the language modeling head instead of introducing new parameters as task-specific classification heads during fine-tuning, thus narrowing the gap between pre-training (e.g., masked language modeling for RoBERTa) and downstream tasks. However, Table 1 shows that when annotated data is insufficient, prompt-based fine-tuning on DistilRoBERTa-base does not improve over standard fine-tuning as much as it does on RoBERTa-base or RoBERTa-large for most downstream tasks. We assume this is because a model's ability to bridge gaps between tasks is proportional to its size, and the gap of transferring directly from pre-training to prompt-based fine-tuning on downstream tasks is still too wide for small language models. This suggests that additional adaptation is required when applying small language models to few-shot downstream tasks.
To tackle this problem, we propose SMASH, an approach that further trains SMAll language models on intermediate tasks before applying them to few-SHot downstream tasks. Intuitively, if we can design intermediate tasks similar to both the pre-training task and the downstream tasks, we can replace one large gap (from the pre-training task to downstream tasks) with two smaller ones (from the pre-training task to intermediate tasks, and then to downstream tasks). Noticing that the same manual prompt template (or similar templates with minor differences) is often used when prompt-based fine-tuning models on a group of similar tasks (e.g., the template <s> x_0 ? <mask> , x_1 . </s> for sentence-pair tasks in the GLUE benchmark (Wang et al., 2019), such as MNLI, QNLI, RTE, etc.), we use this prompt template together with sentences sampled from a large-scale unsupervised corpus to form the inputs of the intermediate task. To construct supervision signals, we leverage knowledge distillation (Hinton et al., 2015) by feeding the inputs to a larger pre-trained language model and using its outputs as the training objective. In this way, the intermediate task is similar both to the pre-training task (by training on similar distributions of data, e.g., a large-scale corpus from the web) and to the downstream tasks (by using similar prompt templates). From the perspective of knowledge distillation, the intermediate task can also be seen as a kind of data augmentation that uses a large-scale unsupervised corpus to transfer the knowledge of solving a group of similar tasks from larger models to smaller models, knowledge that can later be exploited by prompt-based fine-tuning on downstream tasks.
As the intermediate task depends on the input format of the downstream tasks, it is not feasible to experiment with SMASH on all NLP tasks at once. In this paper, we take sentence-pair tasks and sentiment classification tasks, two groups of tasks that are popular in NLP, as examples and design two intermediate tasks respectively. Note that practitioners can also use SMASH on other groups of downstream tasks by designing their own intermediate tasks under the training framework we propose. Experiments on the GLUE benchmark and several other tasks show that SMASH lets a 6-layer DistilRoBERTa-base achieve performance comparable with a 12-layer RoBERTa-base on few-shot datasets at a low cost. We find that SMASH provides more improvement on more complicated tasks such as natural language inference and sentence similarity than on easier tasks such as sentiment classification, and that it is robust over different templates, verbalizers, and model structures. In summary, our key contributions are:
• Conducting systematic experiments to verify the effectiveness of existing few-shot learning methods on small language models;
• Proposing SMASH, a general method to improve few-shot prompt-based fine-tuning performance of small language models on a group of downstream tasks;
• Designing intermediate tasks for sentence-pair tasks and sentiment classification tasks, and showing their effectiveness on several downstream tasks.

Related Work
Prompt-based learning. Prompt-based learning has become a new paradigm in NLP, fueled by the series of GPT models (Radford et al., 2018, 2019; Brown et al., 2020). There is a large body of work on mining sequences of tokens as discrete prompts (Jiang et al., 2020; Shin et al., 2020) or training continuous prompts (Li and Liang, 2021; Liu et al., 2021; Lester et al., 2021; Hambardzumyan et al., 2021; Zhang et al., 2022). Prompt-based learning has been seen as a promising method for few-shot learning. PET (Schick and Schütze, 2021a,b) prompt-based fine-tunes an ensemble of models to create a soft-labeled dataset from unlabeled in-domain examples and uses it to fine-tune a classifier model. LM-BFF (Gao et al., 2020) proposes methods to generate prompts and find demonstrations from few-shot datasets. The work most closely related to ours is PPT (Gu et al., 2021), which leverages unsupervised data to pre-train representations of prompt tokens.
General distillation of Pre-trained Language Models (PLMs). General distillation aims to distill student models at the pre-training stage using unsupervised data to foster their ability to solve various types of tasks. DistilBERT (Sanh et al., 2019) performs general distillation using a soft cross-entropy loss over logits and a cosine embedding loss. TinyBERT (Jiao et al., 2020) proposes a transformer distillation method that aligns the embedding layer, hidden states, attention matrices, and output logits between the teacher and student models. CoDIR (Sun et al., 2020) adopts a contrastive learning framework in which the student's representations are pushed close to the corresponding representations of the teacher and far from negative ones. MiniLM (Wang et al., 2020) proposes self-attention distillation at the pre-training stage, and MiniLMv2 (Wang et al., 2021) proposes multi-head self-attention relation distillation with no restrictions on the numbers of teacher and student attention heads. MGSKD (Liu et al., 2022) transfers the structural relations among multi-granularity representations to the student hierarchically. Although the effects of different training objectives have been studied extensively, most previous works conducted general distillation on the masked language modeling task using raw text input. To the best of our knowledge, ours is the first work that performs prompt-based general distillation using input sequences assembled with prompt templates.

Method
3.1 Preliminaries: Prompt-Based Fine-tuning

A burgeoning approach to fine-tuning pre-trained language models is prompt-based fine-tuning, where the input is formulated as a "blank-filling" task with natural language prompts. Take sentiment classification as an example: given an input sentence x ∈ V* (where V is the vocabulary of the language model M and V* denotes a sequence of tokens from V) and its label y ∈ Y, a template f : V* → V* maps x into a new token sequence f(x) containing the input sentence, several prompt tokens, and at least one <mask> token for M to predict. A verbalizer v : R^|V| → Y then maps p, the output distribution of M, to a label v(p) = y ∈ Y. For example, we can formulate a sentiment classification task t = (f, v) with the template

f(x) = <s> x It was <mask> . </s>

and let M decide whether it is more appropriate to fill in "terrible" (negative) or "great" (positive) for <mask>. The verbalizer v = [great, terrible] maps the output distribution of M to a label:

v(p) = positive if p("great") > p("terrible"), and negative otherwise.

Same as (Gao et al., 2020), we treat regression tasks in a bounded interval [l, u] as interpolation between two opposing words in the verbalizer v = [y_l, y_u]. In this way, we calculate ŷ as

ŷ = l · p(y_l | x) + u · p(y_u | x),

where p is the output probability of model M and p(y_l | x) + p(y_u | x) = 1. We train the model M to minimize the KL divergence

L = KL(q || p),

where q denotes the target distribution induced by the gold label y.

3.2 SMASH

In this subsection we propose SMASH, which leverages an unsupervised corpus to transfer the ability to solve a group of downstream tasks from a pre-trained teacher model M_tea to a pre-trained student model M_stu, further narrowing the gap between pre-training and prompt-based fine-tuning. Formally, suppose a group of downstream tasks T contains n tasks {t_1, ..., t_n}, where t_i = (f_i, v_i). We design an intermediate task t_dis = (f_dis, v_dis) for the group T. After distilling M_stu from M_tea with t_dis, we continue to train M_stu for each task t in T with prompt-based fine-tuning on task-specific data.
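To make the template-and-verbalizer notation of Section 3.1 concrete, the following is a minimal sketch of prompt-based prediction with a masked language model, using the sentiment template and verbalizer above. It is an illustration under our own assumptions (checkpoint name, tokenization details, example sentence), not the exact implementation used in our experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative sketch of prompt-based prediction with a verbalizer;
# the checkpoint name is an example, not necessarily the one used in the paper.
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

def template(x: str) -> str:
    # f(x) = <s> x It was <mask> . </s>  (<s>/</s> are added by the tokenizer)
    return f"{x} It was {tokenizer.mask_token}."

# Verbalizer v = [great, terrible]; leading spaces match RoBERTa's BPE vocabulary.
verbalizer = {"positive": " great", "negative": " terrible"}

def prompt_predict(x: str) -> str:
    enc = tokenizer(template(x), return_tensors="pt")
    mask_pos = (enc.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]  # distribution over the vocabulary
    scores = {
        label: logits[tokenizer.encode(word, add_special_tokens=False)[0]].item()
        for label, word in verbalizer.items()
    }
    return max(scores, key=scores.get)  # v(p): pick the label of the higher-scoring word

print(prompt_predict("A charming and heartfelt little film."))
```

During prompt-based fine-tuning, the same mask-position distribution restricted to the verbalizer words is trained with the objective above instead of being used only for prediction.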
In this work, we focus on two groups of downstream tasks, sentence-pair classification and sentiment classification, and provide the design of their intermediate tasks respectively. We emphasize that SMASH can be applied to any group of downstream tasks; practitioners can design the corresponding intermediate tasks themselves.
Sentence-Pair Tasks. Sentence-pair tasks such as natural language inference take two sentences (x_1, x_2) as input. Following (Gu et al., 2021), we construct a dataset from unlabeled raw text documents: two sentences from different documents receive label 0, sentences from the same document that are not adjacent receive label 1, and sentences next to each other receive label 2 (see the sketch below). We sample these sentence pairs from Wikipedia.

Sentiment Classification Tasks. Sentiment classification tasks take only one sentence x as input. To comply with the setting where annotated data is difficult to obtain, we filter out sentences with low classification probability using a pre-trained RoBERTa-large model, instead of a model fine-tuned on another sentiment analysis task as in (Gu et al., 2021). For filtering we use the template f_filter = <s> x It was <mask> . </s> and the verbalizer v_filter = [terrible, bad, okay, good, great]. After filtering, we distill M_stu using the same template and the verbalizer v_dis = [terrible, bad, okay, good, great].

Distillation Objectives. We experiment with three transformer distillation objectives: prediction distillation, hidden states distillation, and multi-head attention distillation.
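The rule-based pair construction referred to above can be sketched as follows. The document format and the uniform sampling over the three rule labels are our own simplifying assumptions, not necessarily those of the actual pipeline.

```python
import random

# Sketch of rule-labeled pair construction for the sentence-pair intermediate task.
# 'docs' is a list of documents, each given as a list of sentences.
def build_pairs(docs, n_pairs, seed=0):
    rng = random.Random(seed)
    examples = []
    for _ in range(n_pairs):
        label = rng.randint(0, 2)           # assumed uniform over the three rule labels
        if label == 0:                      # sentences from two different documents
            d1, d2 = rng.sample(range(len(docs)), 2)
            x0, x1 = rng.choice(docs[d1]), rng.choice(docs[d2])
        elif label == 1:                    # same document, not adjacent
            d = rng.choice([i for i, doc in enumerate(docs) if len(doc) >= 3])
            i = rng.randrange(len(docs[d]))
            cands = [k for k in range(len(docs[d])) if abs(k - i) > 1]
            if not cands:                   # e.g. a 3-sentence document with i == 1
                i, cands = 0, [len(docs[d]) - 1]
            x0, x1 = docs[d][i], docs[d][rng.choice(cands)]
        else:                               # adjacent sentences in the same document
            d = rng.choice([i for i, doc in enumerate(docs) if len(doc) >= 2])
            i = rng.randrange(len(docs[d]) - 1)
            x0, x1 = docs[d][i], docs[d][i + 1]
        # Assemble the input with the sentence-pair template <s> x_0 ? <mask> , x_1 . </s>
        examples.append({"text": f"{x0} ? <mask> , {x1} .", "rule_label": label})
    return examples
```

In SMASH itself, the supervision signal for these prompted inputs comes from the teacher's output distribution at the <mask> position; the rule labels are used directly only in the rule-based baseline of Section 4.4.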
For prediction distillation, we train M_stu using the output distribution of M_tea by minimizing the following objective function:

L_ce = − Σ_i t_i log s_i,   (1)

where t_i and s_i are the probabilities of token i estimated by M_tea and M_stu, respectively.
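A minimal PyTorch sketch of this objective, computed from the logits at the <mask> position (the tensor names and the optional temperature are illustrative assumptions):

```python
import torch.nn.functional as F

def prediction_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Soft cross-entropy of Eq. (1): L_ce = - sum_i t_i * log(s_i).

    teacher_logits, student_logits: [batch, vocab_size] logits at the <mask> position.
    temperature=1.0 matches the plain formulation; other values are a common extension.
    """
    t = F.softmax(teacher_logits / temperature, dim=-1)          # t_i from the teacher
    log_s = F.log_softmax(student_logits / temperature, dim=-1)  # log s_i from the student
    return -(t * log_s).sum(dim=-1).mean()
```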
For hidden states distillation, the objective is defined as

L_hidn = Σ_{m=0}^{M_stu} MSE(H_m^{stu}, H_{g(m)}^{tea}),   (2)

where MSE denotes the mean squared error loss, M_stu (by a slight abuse of notation) is the number of layers of the student model, H_m refers to the hidden states of the m-th layer, and H_0 denotes the representations after the embedding layer. g(m) denotes the layer mapping function from student to teacher, meaning that the m-th layer of M_stu learns from the g(m)-th layer of M_tea. We use the uniform strategy described in (Jiao et al., 2020), where g(m) = m × (M_tea / M_stu).
For multi-head attention distillation, the objective is defined as

L_attn = Σ_{m=1}^{M_stu} Σ_{h=1}^{H} MSE(A_{m,h}^{stu}, A_{g(m),h}^{tea}),   (3)

where H is the number of attention heads and A_{m,h} refers to the h-th attention head of the m-th layer. We use the same layer mapping function as in hidden states distillation, i.e., g(m) = m × (M_tea / M_stu).
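The two feature-level objectives in Eqs. (2) and (3) can be sketched as follows, assuming both models expose per-layer hidden states and attention matrices (as HuggingFace models do via output_hidden_states and output_attentions). The uniform mapping g(m) is implemented directly; averaging instead of summing over heads and positions is a simplification that only rescales the loss.

```python
import torch.nn.functional as F

def layer_map(m, n_student_layers, n_teacher_layers):
    # Uniform mapping g(m) = m * (M_tea / M_stu) from (Jiao et al., 2020).
    return m * (n_teacher_layers // n_student_layers)

def hidden_states_loss(student_hidden, teacher_hidden):
    # Eq. (2): MSE over layers 0..M_stu, where index 0 is the embedding output.
    n_s, n_t = len(student_hidden) - 1, len(teacher_hidden) - 1
    return sum(
        F.mse_loss(student_hidden[m], teacher_hidden[layer_map(m, n_s, n_t)])
        for m in range(n_s + 1)
    )

def attention_loss(student_attn, teacher_attn):
    # Eq. (3): MSE between attention matrices of mapped layers (all heads at once).
    n_s, n_t = len(student_attn), len(teacher_attn)
    return sum(
        F.mse_loss(student_attn[m - 1], teacher_attn[layer_map(m, n_s, n_t) - 1])
        for m in range(1, n_s + 1)
    )
```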
Note that hidden states distillation and multi-head attention distillation require M_stu and M_tea to be the same type of transformer language model with the same number of attention heads and the same hidden size, while prediction distillation places no requirements on the model structures of M_stu and M_tea.
Based on the comparison of all the above-mentioned distillation objectives, described in detail in Section 4.4, we use only prediction distillation in the remaining experiments due to its superior performance and simplicity.

Setup
Datasets. We evaluate our method on SNLI (Bowman et al., 2015), a number of tasks from the GLUE benchmark (Wang et al., 2019), and three more sentiment analysis datasets: SST-5 (Socher et al., 2013b), MR (Pang and Lee, 2005), and CR (Hu and Liu, 2004). Our few-shot datasets are the same as in (Gao et al., 2020): K = 16 examples per label are sampled from the original training set to form our training and validation sets, and the original validation set is used as our test set. For each task, we sample 5 different few-shot datasets and report the average metric (with standard deviation) over the 5 trials. For the intermediate tasks, we sample 2560k sentence pairs for the sentence-pair task and 2560k sentences for the sentiment classification task from Wikipedia, so every training example is used only once.
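The few-shot splits can be drawn along the lines of the following sketch, a simplification of the LM-BFF sampling procedure; the example format and seed values are placeholders.

```python
import random
from collections import defaultdict

def sample_few_shot(train_examples, k=16, seed=13):
    """Sample k training and k validation examples per label from the original train set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in train_examples:               # each ex is e.g. {"text": ..., "label": ...}
        by_label[ex["label"]].append(ex)
    train_split, dev_split = [], []
    for label, exs in by_label.items():
        picked = rng.sample(exs, 2 * k)     # k examples for training, k for validation
        train_split.extend(picked[:k])
        dev_split.extend(picked[k:])
    return train_split, dev_split

# Five trials with different seeds, mirroring the averaging over 5 few-shot datasets:
# splits = [sample_few_shot(train_examples, k=16, seed=s) for s in range(5)]
```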
Implementation Details. For the intermediate task, we use RoBERTa-large as the teacher model and DistilRoBERTa-base as the student model. We set the batch size to 128 and the learning rate to 1e-4. For prompt-based fine-tuning, we perform a grid search over learning rates {1e-5, 2e-5, 5e-5} and batch sizes {2, 4}. We set the maximum input length to 64 for sentiment classification tasks and 128 for sentence-pair tasks. We use the same templates and verbalizers as (Gao et al., 2020). For more implementation details please refer to Appendix A.

Main Results
Table 2 shows our main results. On most tasks, SMASH is comparable with the 2× larger RoBERTa-base model in both prompt-based fine-tuning and zero-shot performance, and it also outperforms LM-BFF and Vanilla PPT, which shows the effectiveness of SMASH.
For sentence-pair tasks, the prompt-based zero-shot performance of DistilRoBERTa-base is worse than or only slightly better than the majority baseline, but it is improved by SMASH. For sentiment classification tasks, the prompt-based zero-shot performance of DistilRoBERTa-base is already much better than the majority baseline, and SMASH does not provide further improvements. We suppose this is because DistilRoBERTa-base only sees one sentence at a time during pre-training, so transferring from pre-training to sentence-pair tasks is harder and training on the SMASH intermediate task is therefore more beneficial.
We observe that distilling directly from RoBERTa-large with prompt-based methods performs worse than prompt-based fine-tuning on few-shot datasets, as it is hard to train a strong task-specific teacher model and transfer knowledge from teacher to student with so little training data. It is counter-intuitive that the student model performs even worse when back-translation is used; we suppose this is because the teacher model is not powerful enough to resolve the noise introduced by the augmented examples.
We also observe that on most tasks, the prompt-based fine-tuning improvements of SMASH are notably larger than the zero-shot improvements (e.g., for RTE, the prompt-based fine-tuning improvement is 59.4 − 54.5 = 4.9, while the zero-shot improvement is 53.8 − 51.3 = 2.5). This supports our assumption that, rather than teaching the student model to solve downstream tasks directly, SMASH works like general distillation: it transfers the potential ability to solve a group of similar tasks to the student model, which is later exploited by prompt-based fine-tuning on downstream tasks.

Comparisons to Baselines with Stronger Data Requirements
In this subsection we also compare our method to iPET (Schick and Schütze, 2021a), which iteratively trains ensembles of PET models to label in-domain unlabeled examples. We use RoBERTa-large as the PET models and DistilRoBERTa-base as the final classifier. Note that this comparison is not entirely fair, as SMASH only requires a web-scraped corpus, while iPET requires up to 10000 unlabeled examples per label from the same downstream task, which can play a key role in few-shot learning.
Results on some representative tasks are shown in Table 3 (see Table 5 in the Appendix for more results). SMASH performs worse than iPET on some tasks, as iPET fine-tunes the final classifier on soft-labeled in-domain examples several orders of magnitude more numerous than the 16-shot dataset. However, we find that SMASH and iPET are compatible: simply changing the training method of the iPET final classifier from fine-tuning to prompt-based fine-tuning (+PT) and using models trained on SMASH intermediate tasks as initialization (+SMASH init) brings further improvements.

Comparisons of Different Distillation Settings
To verify the effectiveness of different distillation settings, we compare the following settings on sentence-pair tasks: (1) SMASH with objective L_ce (Eq. 1) and a RoBERTa-large teacher (Rl ce); (2) SMASH with objective L_ce (Eq. 1) and a RoBERTa-base teacher (Rb ce); (3) SMASH with objective L_ce + 0.01·L_hidn (Eq. 2) and a RoBERTa-base teacher (Rb ce+hidn); (4) SMASH with objective L_ce + 0.01·L_attn (Eq. 3) and a RoBERTa-base teacher (Rb ce+attn); (5) further pre-training on raw Wikipedia text with the masked language modeling task, the same training objective as DistilBERT (Sanh et al., 2019), and a RoBERTa-base teacher (Rb distilbert); and (6) using the rule-based labels described in Section 3.2 as the objective and minimizing a cross-entropy loss with prompt-based fine-tuning, using the same template as in (1) and the verbalizer v_dis = [No, Maybe, Yes] (rule). We train all these models for up to 200k steps and then prompt-based fine-tune them on MNLI.
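For reference, settings (2)-(4) differ only in how the feature-level terms are combined with L_ce. A rough sketch of the combined objective, reusing the loss sketches from Section 3, looks as follows; the 0.01 weight is the one stated above, and the dictionary keys are illustrative.

```python
# Sketch of the combined training objectives for settings (2)-(4).
def combined_loss(setting, teacher_out, student_out):
    loss = prediction_distillation_loss(teacher_out["mask_logits"],
                                        student_out["mask_logits"])
    if setting == "ce+hidn":     # setting (3): L_ce + 0.01 * L_hidn
        loss = loss + 0.01 * hidden_states_loss(student_out["hidden_states"],
                                                teacher_out["hidden_states"])
    elif setting == "ce+attn":   # setting (4): L_ce + 0.01 * L_attn
        loss = loss + 0.01 * attention_loss(student_out["attentions"],
                                            teacher_out["attentions"])
    return loss                  # settings (1)/(2): L_ce only
```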
Figure 2(a) shows the results of the comparison. The performance of all settings stabilizes at around 100k steps, and fluctuations after 100k steps are probably due to the high variance caused by the small training and validation sets of the downstream tasks. For settings (2)-(4) with a RoBERTa-base teacher, distilling with L_ce alone consistently achieves better performance than adding the other transformer distillation objectives L_hidn or L_attn. We suppose this is because during the pre-training of DistilRoBERTa-base, its self-attention heads and hidden states are not trained to imitate RoBERTa-base, so its self-attention heads (and hidden states) may capture different information (and lie in different spaces) than those of RoBERTa-base; adding these objectives therefore introduces noise into the distillation process. Setting (5) shows that further pre-training with the (Sanh et al., 2019) objective does not bring further improvements, as this setting still uses the masked language modeling task and cannot narrow the gap between the pre-training task and downstream tasks. This observation makes sense because DistilRoBERTa-base was already trained with setting (5) for millions of steps during pre-training and has converged. Based on these observations, we find that using L_ce as the training objective yields the best results despite its simplicity and compatibility, as it places no requirements on the structure of the teacher model. So in setting (1), we use RoBERTa-large as a stronger teacher with the L_ce objective, and it outperforms all other settings. Setting (6) shows that training with rule-based labels leads to inferior performance, indicating that without knowledge distillation, these rule-based labels are not appropriate optimization targets for natural language inference downstream tasks. In summary, choosing an intermediate task similar to the downstream tasks and using knowledge distillation are both essential.

Figure 2(b) shows comparisons on SST-2 using settings (1) and (5), and a new setting (7) that uses the hard label annotated during the filtering process in Section 3.2 and the verbalizer v_dis = [terrible, bad, okay, good, great] as supervision (Rl hard label). Note that the verbalizer of SST-2, v = [terrible, great], is a subset of v_dis. Though setting (7) seems similar to (6), the labels in setting (7) are produced by RoBERTa-large instead of rules, so it can be viewed as a kind of "knowledge distillation" that transfers knowledge through hard labels on a 5-class single-sentence classification task. The performance of setting (7) is comparable with setting (1), which shows that prompt-based fine-tuning as an intermediate task also works, as long as the labels are given by a teacher model.

Robustness Under Different Templates and Verbalizers
Previous works (Gao et al., 2020; Gu et al., 2021; Liu et al., 2021) mention that the choice of template and verbalizer leads to substantial differences in performance. The superior results in Table 2 are achieved using manual downstream-task templates and verbalizers; in this subsection we validate the robustness of SMASH when changing templates or verbalizers, especially when using templates that differ from the intermediate task. Figure 3 presents the results of changing the manual template to one of the 5 best templates, or the manual verbalizer to one of the 5 best verbalizers, from the generated prompts provided by (Gao et al., 2020). Note that these templates/verbalizers were generated based on RoBERTa-large and may not be the best ones for DistilRoBERTa-base. The results show that SMASH provides consistent improvements even when the template of the downstream task differs from that of the intermediate task.

SMASH on Different Language Models
To explore the impact of SMASH on language models other than RoBERTa, we use T5-large (Raffel et al., 2019) as the teacher model and T5-small as the student model. We distill for 200k steps and then prompt-based fine-tune on downstream tasks. We compare against two prompt-based fine-tuning baselines: T5-small and T5-base. Table 4 shows that SMASH improves the few-shot performance of T5-small on several sentence-pair tasks.

Conclusion
In this paper, we present SMASH, an approach that uses knowledge distillation on an unsupervised corpus to improve small language models' few-shot performance. The core idea is to distill the model on an intermediate task whose inputs, sampled from an unsupervised corpus, resemble those of the downstream tasks, thereby transferring the knowledge needed to solve these tasks and further narrowing the gap between pre-training and prompt-based fine-tuning. We design intermediate tasks for sentence-pair tasks and sentiment classification tasks. We show that our approach yields significant improvements on few-shot datasets, especially for harder tasks such as natural language inference. We analyze SMASH under different distillation objectives and verify its robustness over different templates, verbalizers, and model structures.
Possible future directions of this work include applying SMASH to more types of downstream tasks, especially those that cannot easily be formulated with prompts or are difficult to simulate with an unsupervised corpus (e.g., text-to-SQL), and exploring intermediate tasks that are more data-efficient.

Limitations
Like many other prompt-based approaches, SMASH requires expert knowledge to design templates and verbalizers for different groups of tasks, and its performance can be strongly affected by these choices. Although we demonstrate the effectiveness of SMASH on several classification tasks, it is non-trivial to apply SMASH to tasks that are difficult to simulate with an unsupervised corpus, such as machine translation, text-to-SQL, or dependency parsing. Due to computational constraints, we only experimented with a 6-layer DistilRoBERTa-base as our small model. In principle, SMASH is also applicable to larger "small models" such as RoBERTa-base or RoBERTa-large with a larger teacher, but the effects on these models remain unexplored.

Ethical Statement
The risks and potential harms of pre-trained language models are widely discussed in papers such as (Bender et al., 2021; Bender and Koller, 2020). As the models in this work are trained under a few-shot learning setting, they may exhibit biases due to the lack of diversity in the training data. The performance of these models can also be strongly influenced by prompts, and carelessly designed prompts may cause the models to exhibit unexpected behaviors.

A Experimental Details
A.1 Hyper-Parameters

Experiments in Section 4.2. During the distillation stage, we use batch size = 128, learning rate = 1e-4, and a maximum input length of 128 for the sentence-pair task and 64 for the sentiment classification task. We distill for 200k steps, which takes about 4 days for the sentence-pair task and 2 days for the sentiment classification task on 2 GTX 1080 Ti GPUs. We sampled 2560k sentences (or sentence pairs) from Wikipedia for the sentiment classification (or sentence-pair) task, so training for 200k steps takes exactly 1 epoch. During the prompt-based fine-tuning stage, we perform a grid search over learning rates {1e-5, 2e-5, 5e-5} and batch sizes {2, 4}. We prompt-based fine-tune the model for up to 1000 steps and save checkpoints every 100 steps. We take the best-performing checkpoint on the validation set to obtain test set results. For the LM-BFF baseline, we use auto templates and manual verbalizers with SBERT (Reimers and Gurevych, 2019) to select demonstrations. For the Vanilla PPT baseline, we use the same data as SMASH to pre-train soft prompts for sentence-pair tasks, and a RoBERTa-base fine-tuned on Yelp-full (Zhang et al., 2015) to filter the pre-training corpus for sentiment classification tasks. For the iPET baseline, we made minor modifications to the provided prompts to train PET models on the tasks not covered in (Schick and Schütze, 2021a). We use 3 generations in iPET; in each generation we train 3 different models for each of the 4 patterns. We follow (Schick and Schütze, 2021a) for the other hyper-parameters.
Experiments in Section 4.4. For settings (1)-(4), we use the same hyper-parameters as in Section 4.2. For setting (5), we use learning rate = 1e-5 and the same maximum input lengths.

A.2 Datasets

We take a number of tasks from the GLUE benchmark in our experiments. We omitted results on WNLI due to its adversarial dev set, and on CoLA (Warstadt et al., 2018) because its inputs may be non-grammatical sentences that fall outside the distribution of the pre-training corpus, as mentioned in (Gao et al., 2020).

A.3 Templates and Verbalizers
Table 6 shows the templates and verbalizers used for the RoBERTa models, which are the same as in (Gao et al., 2020). For the T5 models, we use the same verbalizers and similar templates, obtained by removing <s> and replacing <mask> with <extra_id_0>.

Figure 1: Overview of SMASH on sentence-pair tasks.

Figure 2: Few-shot performance with different distillation settings. Shaded areas indicate standard deviation. We omit the standard deviation of RoBERTa-base and DistilRoBERTa-base for simplicity.

Figure 3: Prompt-based fine-tuning using different templates and verbalizers. Shaded areas indicate standard deviation. We omit the standard deviation of SMASH and DistilRoBERTa-base for simplicity.


Figure 4: Comparisons of different downstream dataset sizes. Shaded areas indicate standard deviation.

Impact of Downstream Dataset Size
Figure 4 illustrates comparisons of prompt-based fine-tuning DistilRoBERTa-base (dR-b PT), RoBERTa-base (R-b PT), and SMASH on DistilRoBERTa-base as K increases. SMASH still provides improvements with training and validation sets of up to K = 256 samples per label, but the improvements shrink as K grows.
Table 2: Main results. †: results from (Gao et al., 2020). Bold results indicate the best result achieved using DistilRoBERTa-base and no extra in-domain training data.

Table 3: Results of iPET and modifications of iPET. Bold results indicate the best results.

Table 4: Few-shot performance of T5 models. Bold results indicate the best result achieved using T5-small.
Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255-269, Online. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021b. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339-2352, Online. Association for Computational Linguistics.