Diverse Distributions of Self-Supervised Tasks for Meta-Learning in NLP

Meta-learning considers the problem of learning an efficient learning process that can leverage its past experience to accurately solve new tasks. However, the efficacy of meta-learning crucially depends on the distribution of tasks available for training, and this is often assumed to be known a priori or constructed from limited supervised datasets. In this work, we aim to provide task distributions for meta-learning by considering self-supervised tasks automatically proposed from unlabeled text, to enable large-scale meta-learning in NLP. We design multiple distributions of self-supervised tasks by considering important aspects of task diversity, difficulty, type, domain, and curriculum, and investigate how they affect meta-learning performance. Our analysis shows that all these factors meaningfully alter the task distribution, some inducing significant improvements in downstream few-shot accuracy of the meta-learned models. Empirically, results on 20 downstream tasks show significant improvements in few-shot learning, adding up to +4.2% absolute accuracy (on average) over the previous unsupervised meta-learning method, and performing comparably to supervised methods on the FewRel 2.0 benchmark.


Introduction
Humans show a remarkable capability to accurately solve a wide range of problems efficiently, utilizing a limited amount of computation and experience. Deep learning models, by stark contrast, can be trained to be highly accurate on a narrow task while being highly inefficient in terms of the amount of compute and data required to reach that accuracy. Within natural language processing (NLP), recent breakthroughs in unsupervised pretraining have enabled reusable models that can be applied to many NLP tasks; however, learning of new tasks is still inefficient (Yogatama et al., 2019; Linzen, 2020). Meta-learning (Schmidhuber, 1987; Bengio et al., 1992; Thrun and Pratt, 2012) treats the learning process itself as a learning problem from data, with the goal of learning systems that can generalize to new tasks efficiently. This has the potential to produce few-shot learners that can accurately solve a wide range of new tasks. However, meta-learning requires a distribution over tasks with relevant labeled data that can be difficult to obtain, severely limiting the practical utility of meta-learning methods.

* Correspondence: tbansal@cs.umass.edu. † Part of the work was done at Microsoft Research.
In the supervised setting, in particular, the meta-learning task distribution is often defined by sub-sampling from the classes in a classification problem over a fixed dataset (Vinyals et al., 2016). This not only limits the applicability of meta-learning to the underlying classification problem, but also requires a diverse set of supervised datasets with a large number of classes to enable learning. Self-supervised meta-learning, on the other hand, seeks to propose tasks from unlabelled data (Bansal et al., 2020b), and has great potential to enable numerous important applications (Hospedales et al., 2020) such as neural architecture search, continual learning, hyper-parameter optimization, learning in low-resource settings, etc. Existing work in meta-learning for NLP, however, defaults to task distributions that tend to be overly simplistic, e.g. using existing supervised datasets (Han et al., 2018; Dou et al., 2019) or unsupervised cloze-style tasks with uniform selection of words from the vocabulary (Bansal et al., 2020b). Given the lack of exploration of this critical component, we propose to devise and evaluate various task distributions in the context of unsupervised meta-learning for NLP.
Specifically, we explore a diverse set of approaches to create task distributions that are conducive to better meta-training efficacy. We provide empirical evidence that existing definitions of task distributions are prone to producing tasks that might not be challenging enough for the underlying model to learn useful representations, which in turn translates into poor downstream task performance. We therefore propose several new approaches that instead consider important features of the task distribution, including task diversity, difficulty, resemblance to the downstream tasks, and the curriculum or order in which tasks are presented during training. When evaluated on a suite of 20 NLP classification tasks, our best unsupervised meta-learning method leads to an absolute increase of up to +4.2% in average few-shot accuracy over unsupervised baseline results; it even outperforms supervised meta-learning methods on the FewRel 2.0 benchmark (Gao et al., 2019) on 5-shot evaluation.
The paper is organized as follows. We start by providing relevant background (Sec. 2) on meta-learning and the unsupervised task generation approach in SMLMT. Next, we introduce (Sec. 3) new approaches to improve the task distribution. We then analyze (Sec. 4.2) the different unsupervised distributions and how they relate to each other. Finally, we evaluate (Sec. 4.3, 4.4) the different unsupervised methods on a wide range of NLP tasks including sentiment classification, entity typing, text classification, sentence-pair classification and relation classification.

Meta-Learning
In this work, we focus on Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017), an optimization-based meta-learning method. To efficiently adapt to a task's training data, MAML jointly optimizes the initial point of a neural network model and a gradient-descent based optimizer. This is framed as a bi-level optimization consisting of an inner loop for task-specific learning and an outer loop for fast adaptation across tasks:

θ′_i = θ − α ∇_θ L_i(θ, D^tr)

Θ ← Θ − β ∇_Θ Σ_{T_i} L_i(θ′_i, D^val)

where θ are the parameters of the model, α is the (learnable) inner loop learning rate, Θ := {θ, α}, L_i is the loss for task T_i, (D^tr, D^val) ∼ T_i are the support and validation data for task T_i, and β is the outer loop learning rate, which is a hyper-parameter. Typically, multiple steps of gradient descent are performed in the inner loop. Training proceeds in an episodic framework (Vinyals et al., 2016): in each episode a minibatch of tasks is sampled along with their support and validation sets, and the model parameters are optimized as above.
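As a concrete illustration of the bi-level loop, below is a minimal sketch (not the paper's implementation) of MAML on toy one-dimensional tasks, with the gradient through a single inner step derived by hand. The tasks, losses, and hyper-parameter values are illustrative assumptions, and the learnable inner learning rate α is held fixed for simplicity:

```python
import random

# Toy MAML sketch on 1-D tasks: each task T_i asks the model to output a
# target b_i; loss L_i(theta) = (theta - b_i)^2.
# Inner loop: one gradient step from the shared initialization theta.
# Outer loop: update theta with the gradient of the post-adaptation loss.

def inner_step(theta, b, alpha):
    grad = 2.0 * (theta - b)          # dL/dtheta on the support set
    return theta - alpha * grad       # task-adapted parameters theta_i'

def maml_train(tasks, theta=0.0, alpha=0.4, beta=0.1, epochs=200):
    for _ in range(epochs):
        batch = random.sample(tasks, k=2)       # episode: minibatch of tasks
        outer_grad = 0.0
        for b in batch:
            theta_i = inner_step(theta, b, alpha)
            # Differentiating through the inner step:
            # theta_i = theta - 2*alpha*(theta - b)  =>  dtheta_i/dtheta = 1 - 2*alpha
            outer_grad += 2.0 * (theta_i - b) * (1.0 - 2.0 * alpha)
        theta -= beta * outer_grad / len(batch)
    return theta

random.seed(0)
theta = maml_train(tasks=[-1.0, 1.0, 3.0])
# The meta-learned init sits between the task targets, so a single inner
# step adapts quickly to any one task.
```

A real implementation would replace the scalar model with a neural network and backpropagate through the inner updates automatically.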

Task Distribution for Meta-Learning
Meta-learning assumes access to a distribution P(T ) over tasks. The goal is to utilize tasks T i ∼ P(T ) sampled from this distribution to train a learning procedure that generalizes to unseen tasks T ∼ P(T ) from the distribution. Supervised meta-learning often utilizes a fixed task dataset to create P(T ) by sub-sampling from all class labels (Vinyals et al., 2016). Bansal et al. (2020b) sought to provide an unsupervised approach that proposes tasks from unlabelled data. The resulting Subset Masked Language Modeling Tasks (SMLMT) approach proposes self-supervised tasks to enable meta-learning and improves few-shot learning across a diverse set of classification tasks.
Sampling an N -way task from SMLMT requires first sampling a size-N subset of the vocabulary, which are subsequently mapped to consecutive integer ids and serve as labels for the task. Then to sample examples for each label, sentences containing that word are sampled and the occurrences of the word are masked out. Note that a task in SMLMT is a sentence classification task where each input sentence consists of exactly one word type that is masked throughout the sentence and the label for the sentence is the underlying word type that was masked. This enables sampling combinatorially many classification tasks for meta-learning.
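The SMLMT sampling procedure described above can be sketched as follows. The helper name `propose_smlmt_task`, the toy corpus format (lists of tokens), and the `[MASK]` string are illustrative assumptions; the real pipeline additionally filters words and sentences:

```python
import random

def propose_smlmt_task(corpus, n_way=3, k_examples=2, seed=0):
    """Sample one N-way SMLMT task from tokenized sentences (a sketch)."""
    rng = random.Random(seed)
    # Index: word type -> sentences containing it.
    index = {}
    for sent in corpus:
        for w in set(sent):
            index.setdefault(w, []).append(sent)
    # Labels are word types with enough supporting sentences.
    candidates = [w for w, sents in index.items() if len(sents) >= k_examples]
    words = rng.sample(candidates, n_way)
    task = []
    for label, w in enumerate(words):       # remap words to integer ids
        for sent in rng.sample(index[w], k_examples):
            # Mask every occurrence of the label word in the sentence.
            masked = ['[MASK]' if t == w else t for t in sent]
            task.append((masked, label))
    return task, words
```

Each example is a sentence with one word type masked throughout, and the classification target is the integer id of that masked word.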

Exploring Unsupervised Task Distribution for Meta-Learning
Sampling tasks in SMLMT depends on sampling of words, which serve as labels, and sampling of sentences containing those words. The original formulation used uniform sampling for both steps. This can lead to several limitations on the quality of the resulting task distribution, including limited task diversity and difficulty. The single-sentence classification tasks also lack cross-sentence reasoning, leading to a severe train-test mismatch for downstream tasks involving sentence pairs. To remedy these problems, we consider alternative distributions that are conducive to more diverse and challenging tasks for meta-training. We also describe an automatic curriculum over tasks that seeks to continuously find challenging tasks for the model during training.

Sampling labels in SMLMT
Frequency-based sampling: Word distribution in natural language is characterized by an exponential distribution with a long tail of rare words (Baayen, 2002). Uniform sampling of words in SMLMT puts a disproportionately high weight on the long tail, leading to inefficient use of the training corpora since the low frequency words occur in only a small proportion of the sentences. On the other hand, simple frequency-based sampling can be highly skewed towards a handful of high frequency words. We thus propose to simply sample words in proportion to their log-frequency instead.
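A minimal sketch of this log-frequency label sampling follows. The function name and the exact `log(1 + count)` weighting are our assumptions; the paper only specifies sampling in proportion to log-frequency:

```python
import math
import random
from collections import Counter

def log_freq_sampler(corpus, seed=0):
    """Return a sampler over word types with P(w) proportional to
    log(1 + count(w)) -- a sketch of frequency-based label sampling."""
    counts = Counter(w for sent in corpus for w in sent)
    words = list(counts)
    weights = [math.log1p(counts[w]) for w in words]
    rng = random.Random(seed)

    def sample(n):
        # random.choices samples with replacement; keep drawing until we
        # have n distinct labels (assumes the vocabulary has >= n words).
        chosen = set()
        while len(chosen) < n:
            chosen.update(rng.choices(words, weights=weights, k=n))
        return list(chosen)[:n]

    return sample
```

Compared with uniform sampling, this down-weights the long tail of rare words while avoiding the extreme skew of raw frequency sampling.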
Cluster-based sampling: Given two words randomly sampled from a large vocabulary, it is likely to be rather trivial to distinguish their corresponding contexts. This can lead to overly simple tasks in the SMLMT task distribution. To avoid this problem, we consider clustering words based on pre-trained word embeddings, grouping words into semantically-related clusters. Diverse and difficult instances of tasks in SMLMT can then be sampled by selecting all words in a task from either (1) the same cluster (intra-cluster sampling), or (2) different clusters (inter-cluster sampling). Words co-occurring in the same cluster are semantically or topically related and hence occur in similar contexts, leading to sentences that are harder to classify, as we see in our analysis (Sec. 4.2). Moreover, choosing different clusters to sample words across tasks provides natural diversity over topics in the training tasks. On the other hand, picking words from different clusters (inter-cluster sampling) can still lead to tasks where the sentences are easy to classify due to easily distinguishable contexts. Specifically, clustering pre-trained word embeddings using k-means has proven effective at generating topical clusters rivaling topic models (Sia et al., 2020). We use FastText (Joulin et al., 2017) embeddings as word representations. We choose FastText as it is fast, incorporates subword information, can generate embeddings for out-of-vocabulary words, and has been found to yield topical clusters (Sia et al., 2020).
Since cluster sizes can be imbalanced, we pick clusters proportional to the number of words in the cluster. Thus, assuming {C_1, . . . , C_m} to be the m clusters of the word vocabulary, we replace the uniform sampling over words in SMLMT with:

j ∼ Cat(p_1, . . . , p_m), where p_j = |C_j| / Σ_{k=1}^{m} |C_k|

w ∼ Uniform(C_j)

where Cat(p_1, . . . , p_m) is a categorical distribution over m categories with probabilities {p_1, . . . , p_m}.
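This two-stage sampling might be sketched as below, assuming clusters are given (e.g. from k-means over FastText vectors). The helper `intra_cluster_label_sampler` is hypothetical, not the authors' code:

```python
import random

def intra_cluster_label_sampler(clusters, seed=0):
    """Sketch of intra-cluster label sampling: pick one cluster with
    probability proportional to its size, then draw all N task labels
    uniformly (without replacement) from that cluster.

    clusters: list of word lists, e.g. from k-means over word embeddings."""
    rng = random.Random(seed)
    sizes = [len(c) for c in clusters]

    def sample(n_way):
        # Categorical over clusters with p_j = |C_j| / sum_k |C_k|.
        j = rng.choices(range(len(clusters)), weights=sizes, k=1)[0]
        return rng.sample(clusters[j], n_way)

    return sample
```

Inter-cluster sampling would instead draw each of the N labels from a different cluster, yielding easier tasks with more distinguishable contexts.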

Dynamic Curriculum over Self-Supervised Tasks
The methods discussed so far use a static task distribution for learning with tasks sampled i.i.d from this distribution for training. Curriculum learning (Bengio et al., 2009;Graves et al., 2017) instead posits that choosing the order in which instances are presented, with gradually increasing complexity, can enable faster learning and better generalization. We explore whether a curriculum in task sampling is beneficial for meta-learning by proposing a method to sample increasingly difficult tasks during training. To enable this we need a method to propose difficult tasks based on the current state of the model during training.
Since words act as labels in SMLMT, words that are closer in the representational space of the neural model will be more difficult to distinguish, leading to more difficult tasks. On the other hand, nearest neighbors can be too difficult to induce effective learning for a model. This is related to findings on negative sampling in the metric learning literature (Schroff et al., 2015; Suh et al., 2019), where using "too hard" negatives typically hurts performance.
To alleviate this problem, we cluster representations computed from the model and uniformly sample words within the same cluster to create difficult but not impossible tasks (similar to the "static" clustering approach). Secondly, we adopt an easy-to-hard curriculum by controlling the ratio between the harder tasks from the dynamic distribution D_t and the easier ones from the static distribution S, which consists of tasks sampled i.i.d. from uniform random word sampling or fixed word-clustering. At step t, let λ_t be the probability of sampling a task from D_t and 1 − λ_t from S. The dynamic curriculum is then defined by sampling tasks from the mixture distribution

T ∼ λ_t D_t + (1 − λ_t) S

with λ_t linearly annealed over the training epochs from 0 to 1. To construct D_t, we consider the following word (i.e. label) representation for clustering, obtained by averaging the model's representations of the masked sentences corresponding to a word:

ŵ_i^(t) = (1 / |S(w_i)|) Σ_{x ∈ S(w_i)} f_{θ_t}(x)

where S(w_i) is the set of all sentences containing the word w_i with the word w_i masked out (as defined in SMLMT), f_{θ_t}(·) is the representation from the neural model for instance x that is fed into the softmax classification layer, and θ_t are the model parameters at step t.
To make the computation of ŵ_i^(t) tractable, we first approximate the quantity by the average over a subset of S(w_i). Moreover, since computing the representations {ŵ_i} for all vocabulary words and clustering at every step t of training would be computationally infeasible, we instead do this after every m steps of meta-training. This also allows the model to train on the current distribution for some time before the distribution changes. Finally, while the model is being updated between time steps t and t + m, we use the model snapshot at t to create the word clusters asynchronously for the model at t + m, which allows task generation to run in parallel to model training.
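The mixture sampling with linear annealing can be sketched as follows. The function name and the representation of the two task distributions as simple lists are illustrative assumptions; the periodic re-clustering of model representations that builds D_t is omitted:

```python
import random

def sample_task_curriculum(step, total_steps, dynamic_tasks, static_tasks, rng):
    """Sketch of the easy-to-hard curriculum: with probability lambda_t
    (linearly annealed from 0 to 1 over training) draw a harder task from
    the dynamic distribution D_t, otherwise an easier task from the static
    distribution S."""
    lam = step / max(1, total_steps - 1)    # lambda_t, annealed 0 -> 1
    source = dynamic_tasks if rng.random() < lam else static_tasks
    return rng.choice(source)
```

Early in training every episode comes from S; by the final epochs almost every episode is drawn from the harder, model-dependent distribution D_t.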

Task proposal using sentence clustering
SMLMT uses a data-augmentation strategy to automatically assign labels to unlabelled sentences by consistently masking out the same word type in a set of sentences. The masked-out word then serves as the label for the sentences. This cloze-style approach to creating tasks was inspired by the success of masked language modeling (Devlin et al., 2018) in learning useful representations. While this leads to significant improvements in sentence-level classification on a range of real downstream tasks (Bansal et al., 2020b), it is unclear whether a word masking approach is the most effective for learning useful sentence representations. To probe this question further, we explore an alternative to SMLMT that directly assigns semantic labels to sentences without any augmentation.
Specifically, we consider pre-trained sentence representations for proposing tasks, which have proven useful for improving semi-supervised learning (Du et al., 2020). We use a pre-trained sentence embedding model (Du et al., 2020; Wenzek et al., 2020) to embed all sentences in a corpus and cluster them. To propose an N-way task, we first randomly sample N cluster-ids and remap them to random consecutive integers {1, . . . , N}. Then examples for each label are sampled from the corresponding cluster, creating a classification task for classifying the sentences into their underlying cluster labels. Note that the step of remapping the cluster-ids ensures that the model cannot memorize the sentence-to-cluster mapping, which would lead to meta-overfitting.
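The proposal step might look as follows, assuming cluster assignments have already been computed from pre-trained sentence embeddings; the function name and data layout are illustrative assumptions:

```python
import random

def propose_sentcluster_task(cluster_of, sentences, n_way=3, k=2, seed=0):
    """Sketch of SentCluster: group sentences by precomputed cluster id,
    sample N clusters, remap their ids to fresh labels 0..N-1, and draw
    k examples from each.

    cluster_of: cluster id for each sentence index (e.g. from k-means over
    pre-trained sentence embeddings)."""
    rng = random.Random(seed)
    by_cluster = {}
    for i, s in enumerate(sentences):
        by_cluster.setdefault(cluster_of[i], []).append(s)
    eligible = [c for c, ss in by_cluster.items() if len(ss) >= k]
    chosen = rng.sample(eligible, n_way)
    rng.shuffle(chosen)                 # remap cluster ids -> fresh labels
    task = []
    for label, c in enumerate(chosen):
        for s in rng.sample(by_cluster[c], k):
            task.append((s, label))
    return task
```

Because the label ids are re-randomized per task, the model must classify from sentence content rather than memorizing a global sentence-to-cluster map.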

Contrastive learning over sentence pairs
SMLMT proposes sentence-level tasks and thus lacks cross-sentence reasoning. This is confirmed by the poor downstream few-shot performance of models trained on SMLMT (see Sec. 4.3). Since models trained on SMLMT have never seen pairs of sentences as input, there is a train-test mismatch for sentence-pair classification tasks. To remedy this, we introduce a simple but effective contrastive learning task over sentence pairs that bridges this gap. Contrastive learning has been used to learn effective sentence representations (Logeswaran and Lee, 2018). Next sentence prediction, a sentence-pair task, was used in the training of BERT (Devlin et al., 2018) but was later found to be ineffective (Liu et al., 2019b). BERT also considered segments instead of full sentences, whereas downstream tasks often require reasoning over complete sentences. Thus, as a sentence-pair task to enable cross-sentence reasoning, we consider classifying whether two sentences come from the same document or from different documents. This simple objective was found to be quite effective in our experiments. Note that during meta-training, this can be treated as an additional task in the task distribution. Since the SMLMT task distribution consists of an exponential number of tasks, we sample the sentence-pair task in an episode with a fixed probability α, which is a hyper-parameter.
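A sketch of generating this sentence-pair task from a document-segmented corpus follows. The function name and the 50/50 positive rate are our assumptions; the paper does not specify the exact sampling ratio:

```python
import random

def sample_sentpair_task(docs, n_examples=4, seed=0):
    """Sketch of the contrastive sentence-pair task: label 1 if the two
    sentences come from the same document, 0 otherwise.

    docs: list of documents, each a list of sentences."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_examples):
        if rng.random() < 0.5:                  # positive: same document
            d = rng.choice([d for d in docs if len(d) >= 2])
            s1, s2 = rng.sample(d, 2)
            pairs.append(((s1, s2), 1))
        else:                                   # negative: different documents
            d1, d2 = rng.sample(range(len(docs)), 2)
            pairs.append(((rng.choice(docs[d1]), rng.choice(docs[d2])), 0))
    return pairs
```

During meta-training this task can simply be mixed into the episode stream with probability α alongside the SMLMT tasks.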

Experiments
We evaluate various self-supervised task distributions for their utility in meta-learning for few-shot classification. We first describe the experimental setting, then we perform evaluations to understand how the different self-supervised tasks relate to each other, and finally show performance on a large set of 20 real classification datasets. These datasets cover a wide range of tasks: sentiment classification, entity typing, text classification, sentence pair classification and relation classification.
Our proposed approach shows significant improvements over previous few-shot classification results (Bansal et al., 2020b;Gao et al., 2019).

Experimental Setup
We consider the challenging few-shot setting where models are trained on unlabelled corpora and then evaluated on target tasks with only k examples per label (k ≤ 32) to allow fine-tuning of the models on the target task. Since our focus is on unsupervised meta-learning, we closely follow the experimental setup of Bansal et al. (2020b).

Meta-learning Model
We use the same model as in Bansal et al. (2020b) so that our results are comparable. The model is a BERT transformer encoder coupled with a parameter generator, a 2-layer MLP, that generates the initial point of the classification layer for a task conditioned on the support examples. The model is meta-trained using the MAML algorithm (Finn et al., 2017), with learned per-layer learning rates, on the self-supervised task distributions. All model hyper-parameters are kept the same so that any change in performance can be attributed to differences in the task distribution. See the Supplementary for all hyper-parameters.

Methods Evaluated
We consider all the different approaches to self-supervised task distributions described in Sec. 3 along with the baseline approach of SMLMT: (1) Uniform: the SMLMT approach of Bansal et al. (2020b), which uses uniform random sampling over word types; (2) Frequency: SMLMT with sampling proportional to log-frequency (see 3.1); (3) Cluster: SMLMT where labels are picked from the same word cluster (see 3.1); (4) Dynamic: curriculum-based task sampling with Cluster as the static distribution (see 3.2); (5) Cluster-ccnet: same as Cluster but using ccnet (Wenzek et al., 2020), which consists of web-crawled data, as the corpus; (6) SentCluster: the alternative to SMLMT that proposes tasks from subsets of sentence clusters (see 3.3); (7) SentPair: the sentence-pair tasks (see 3.4). All methods, except SentCluster and Cluster-ccnet, use Wikipedia as the text corpus. The sentence embeddings for the SentCluster task distribution were obtained from Du et al. (2020), and consist of embeddings of about 1 billion sentences from ccnet (Wenzek et al., 2020). For this reason, we also report Cluster-ccnet, which uses this same set of sentences. We found it beneficial to include 25% Frequency tasks in the Cluster task distribution, and SentPair tasks are included in all other task distributions unless otherwise noted. Note that we only consider completely unsupervised meta-learning methods for fair evaluation. However, our results improve over Bansal et al. (2020b), which showed improvements over BERT and multi-task BERT baselines. As we utilize the same dataset splits released in their work, our results can be directly compared.

Analyzing task distributions
We start with a quantitative exploration of the various self-supervised task proposals, without resorting to full fine-tuning on downstream tasks. Our goal here is to understand properties of these task distributions and how they relate to each other. To do this, we consider models meta-trained on a specific type of task proposal (rows in Table 1) and evaluate their few-shot performance on tasks sampled from all of the other task proposal methods (columns therein). We use ri (or cj) below to refer to row i (or column j) of the table.
We consider the following task proposal methods: Frequency (FREQ, c1): frequency-based word sampling in SMLMT; Inter-Cluster (X-C, c2): the word-clustering approach of Sec. 3.1, but sampling all labels of a task from different clusters; Intra-Cluster (I-C, c3&4): the word-clustering approach of Sec. 3.1, which samples all labels of a task from the same cluster; Sentence Cluster (S-C, c5): the sentence clustering approach to task proposal presented in Sec. 3.3. For evaluation, we consider 4-way tasks sampled from the above methods and report average accuracy over 5000 tasks. We consider a BERT model (r1) which is not trained on the SMLMT distribution but is trained on the related masked language modeling (MLM) task. To enable evaluation of this model, we use it as a prototypical network (Snell et al., 2017). We also consider meta-trained models trained on the SMLMT distribution with uniform sampling (Bansal et al., 2020b) (r2), frequency-based sampling (r3), and intra-cluster sampling (r4). Note that all models are trained on the Wikipedia corpus.
Results are in Table 1. First, since BERT was not trained on any of the task distributions, we find low accuracy on all these tasks in r1, indicating that they contain information different from what is learned from MLM. Moreover, the highest accuracy of this model is on Sentence Cluster tasks (r1c5; the random baseline is 25%), even though the domain of this task is quite different from the training data of BERT. Next, let us consider the vanilla SMLMT model, which uses uniformly random word sampling to create the meta-learning task distribution. Interestingly, we find that it gives high accuracy on frequency-sampled tasks (r2c1). Similarly, accuracy is high on the inter-cluster tasks (r2c2), even though the model was not meta-trained directly on this distribution. More importantly, performance drops significantly (≈ 18%) on the tasks sampled using the intra-cluster approach (r2c3). Performance drops even further (≈ 10%; r2c4) when the tasks are sampled from a different domain (common crawl) than the training domain of the model (Wikipedia). Accuracy on Sentence Cluster is also very high (r2c5), without training on this distribution. Models trained on frequency-based sampling perform similarly (r3). We also show the performance of a model trained on tasks sampled using the intra-cluster approach. This model was trained on the Wikipedia corpus, and even though it was trained on intra-cluster tasks, we still see a significant performance drop on intra-cluster tasks from a different domain (r4c4 vs r4c3). Finally, consider the models trained on the sentence clustering tasks. These perform poorly on all of the tasks proposed by SMLMT (r5c1-4), indicating that this task distribution does not contain the same amount of information as SMLMT.
In summary, these results indicate that: (1) the intra-cluster tasks are more difficult than frequency-based sampling, and inter-cluster tasks are as easy as uniform sampling (r2c2); (2) sentence cluster tasks are the easiest among all task proposals (c5), and training on this task distribution leads to poor performance on the SMLMT distributions (r5c1-4; but not vice versa), indicating a lack of information in this distribution as compared to SMLMT. From this analysis we expect the intra-cluster task distribution to be richer than the other alternatives, and models meta-trained on it should improve downstream performance over the others. As we will see in the next section, the downstream performance improvements are highly correlated with these unsupervised evaluations.

Evaluation on diverse downstream classification tasks
Datasets We consider all 17 downstream tasks in Bansal et al. (2020b) and 2 additional sentence-pair tasks. We group performance on datasets by the type of the task: (1) Sentiment classification: 4 domains (Books, DVD, Kitchen, Electronics) of Amazon review binary sentiment datasets (Blitzer et al., 2007); (2) Rating classification: 4 domains of 3-way classification based on ratings of reviews from the above Amazon datasets, and 1 dataset of 3-way classification of tweets about sentiment towards airlines; (3) Entity typing: CoNLL-2003 (Sang and De Meulder, 2003) entity mention classification into 4 coarse types, and the MIT-Restaurant (Liu et al., 2013) task of classifying mentions in user queries about restaurants into 8 types; (4) Sentence-pair classification: Scitail, a scientific natural language inference dataset (Khot et al., 2018), the RTE task on textual entailment, and the MRPC task on paraphrase classification from the GLUE benchmark; (5) Other text classification: multiple social-media datasets on classifying tweets into (a) 2-way: political audience, bias, or mention of a disaster, (b) 9-way: classification based on political message, (c) 13-way: classification of emotion.
Evaluation Protocol We meta-train separate models on the self-supervised task distributions, without any access to the downstream supervised tasks. The models are then fine-tuned on the downstream task training sets, which consist of k = 8, 16, 32 examples per class. Note that tasks can have different numbers of classes. Following Bansal et al. (2020b), we use the development sets of Scitail and Amazon-Electronics to select the number of steps of fine-tuning for all models; all other hyper-parameters are kept the same as in meta-training. Since few-shot performance is sensitive to the few training examples, each model is fine-tuned on 10 sets for each task and the average test performance is reported with standard deviation.

Results. Table 2 shows the performance of all methods on different types of downstream tasks. We group datasets based on task type as described above and report the average performance over all the datasets in each group. First, note that the SentCluster approach is always inferior to any of the cloze-style approaches, except on sentiment and rating classification where it is slightly better than SMLMT with Uniform sampling but worse than the other methods proposed here. Interestingly, replacing Uniform sampling with the simple Frequency sampling already leads to significant improvements throughout. Comparing the Cluster approach, we observe that it is better than Frequency on sentence-level tasks (like sentiment, rating, others), while slightly worse or comparable on sentence-pair tasks and phrase-level classification tasks (entity typing). Overall, the word-clustering approaches to sampling labels for SMLMT are preferable, as they are often among the two highest performing on any task group, or close to the highest performance.
Table 2: Results on downstream tasks. Best performing model for each k and each task group is in bold and the second best is underlined.

Note that our unsupervised analysis in Sec. 4.2 also reflected that training on the Cluster task distribution should be better compared to the others. Finally, note that using the Dynamic curriculum over task sampling further improves performance over the cluster-based approach. This overall trend is also clearly reflected in the overall average performance across all 19 tasks in Figure 1. Figure 2 further shows that, for the Cluster tasks, constructing tasks from a more diverse domain such as CommonCrawl can improve downstream performance even when using the same amount of data for training.
Ablation over SentPair. We introduced the sentence-pair task to enable better learning of sentence-pair tasks such as natural language inference. These tasks remove the train-test mismatch in input format, as sentence-pair tasks contain pairs of sentences as input whereas SMLMT only proposes single-sentence classification tasks. To assess the efficacy of the SentPair task, we trained a word-cluster model with and without the SentPair task and evaluated it on the few-shot sentence-pair tasks Scitail and MRPC. Results are in Table 3. We can see that the unsupervised SentPair task improves performance under most settings, sometimes by large margins of up to 8% absolute.
Ablation for dynamic curriculum. The dynamic curriculum over tasks requires two crucial choices: the static distribution and the value of the mixing proportion λ_t. We ablate over these choices in Fig. 4, which reports average performance over 5 tasks, one from each of the task groups considered. We find that using the Cluster tasks, created from static pre-computed word embeddings, works better than using Frequency-based tasks as the static distribution. Moreover, annealing λ_t from 0 to 1 over the training epochs is better than using a fixed value of λ_t throughout training.

Evaluation on FewRel 2.0 benchmark
FewRel (Han et al., 2018; Gao et al., 2019) is a common benchmark for few-shot learning in NLP, which consists of many few-shot relation classification tasks created by sub-sampling from a pool of relation labels. Its resemblance to popular few-shot benchmarks like MiniImageNet (Vinyals et al., 2016) makes FewRel one of the few widely used datasets for training and evaluating NLP meta-learning methods. Before submitting to the competition site for test set results, we first used the validation set to select the best model(s), where we observed that the Cluster approaches perform better than the other task proposals (see validation results in the Supplementary).

Related Work
Meta-learning applications in NLP have yielded improvements on specific tasks (Gu et al., 2018; Han et al., 2018; Dou et al., 2019). Unsupervised meta-learning has been explored in computer vision (Khodadadeh et al., 2019), including work that clusters images using pre-trained embeddings to create tasks, and in reinforcement learning (Gupta et al., 2018). Metz et al. (2019) meta-learn an unsupervised update rule in a semi-supervised framework. Bansal et al. (2020b) developed the SMLMT approach to unsupervised meta-learning in NLP. Contemporary work (Murty et al., 2021) explored the use of clustering, though focused only on natural language inference tasks. Curriculum learning (Bengio et al., 2009) in the context of meta-learning was unexplored in NLP prior to this work. Jabri et al. (2019) found an unsupervised curriculum to be beneficial for meta-reinforcement learning. We refer to Hospedales et al. (2020) for a comprehensive review of meta-learning. Self-supervised learning has emerged as an efficient approach to representation learning in NLP (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019). Multi-task learning of pre-trained models has shown improved results on many tasks (Phang et al., 2018; Liu et al., 2019a), including in the few-shot setting. Yin et al. (2020) leveraged entailment tasks for few-shot learning. Du et al. (2020) developed self-training methods for semi-supervised few-shot learning. Recently, extremely large language models have been shown to have few-shot capabilities (Brown et al., 2020), while Schick and Schütze (2020) demonstrated few-shot capabilities for small models in the semi-supervised setting. Meanwhile, Bansal et al. (2020a,b) showed meta-learning to be effective at improving few-shot performance in multi-task and unsupervised settings, as well as improving performance for small models.

Conclusion
We explored several approaches to self-supervised task distributions for meta-learning. Our results demonstrate improvements in few-shot performance over a wide range of classification tasks. This demonstrates the utility of meta-learning from unlabeled data, opening up the possibility of large-scale meta-learning for pertinent applications in NLP such as continual learning, architecture search, learning for low-resource languages, and more.

A.1 Additional Experiment Results
Results on the FewRel 2.0 validation set using the different task distributions are shown in Figure 6. The full distribution of results on downstream tasks for the various self-supervised tasks can be seen in Fig. 5.

A.2 Fine-tuning hyper-parameters

The meta-learning methods learn the learning rate for fine-tuning, so we only tune the number of steps of fine-tuning, using development data from 2 tasks (Scitail, Amazon Electronics), following Bansal et al. (2020a,b). We found that running fine-tuning until the loss on the support set is small (≤ 0.01) is an alternative that also performs competitively and does not require tuning the number of steps. The reported results followed the previous approach, and the tuned numbers of fine-tuning steps for k = 8, 16, 32, respectively, were: (1) Uniform: 100, 75, 100; (2) Frequency: 25, 150, 75; (3) Cluster: 75, 50, 75; (4) Cluster-ccnet: 150, 200, 75; (5) SentCluster: 100, 250, 25; (6) Dynamic: 10, 100, 200. On FewRel we found 20 steps of updates on the support set to perform well on the validation data for all settings.

A.3 Other Implementation Details
Since the FewRel tasks contain an entity pair in each sentence, it is important to mark these entities, which define the relation to be classified. We used unused tokens in the BERT vocabulary to mark the positions of the entity mentions. Note that in the unsupervised models these unused tokens get a zero embedding and are only fine-tuned from the 1-shot or 5-shot support sets. Hyper-parameters for meta-training are listed in Table 7. Dataset statistics for the downstream classification tasks can be found in , and the few-shot splits can be downloaded from https://github.com/iesl/leopard.
Training Hardware: The models were trained on 32 V100 GPUs. Training takes about 42 hours.