Few-Shot Initialization of an Active Learner via Meta-Learning



Introduction
In recent years, transformer-based models such as BERT (Devlin et al., 2018) have achieved high performance on natural language processing (NLP) tasks. These results require training with a significant amount of labeled data, which is often necessary to optimize the large number of weights in such models during the fine-tuning stage. This is a major obstacle because many machine learning applications lack widely available labeled data, and labeling in high volumes can be tedious and expensive.
Two principal research areas to overcome this obstacle are Active Learning (AL) and Few-Shot Learning (FSL). Few-shot learning was initially introduced to simulate the human ability to generalize quickly from only a few labeled examples (Yip and Sussman, 1997). The goal is thus to reach the highest possible performance with a small number of labeled data points (e.g., 4, 8, 16, ...). The field has made great progress since the introduction of optimization-based few-shot learning (Ravi and Larochelle, 2016) using the idea of meta-learning (ML) (Schmidhuber et al., 1997). The basic principle of meta-learning in this context is to allow the neural network to utilize the knowledge acquired from multiple tasks, represented in the network by its weights, for adaptation to new tasks. Hence, initializing the network with weights learned from a variety of tasks can enable faster learning of similar tasks.
The active learning field approaches this problem by only partially annotating the unlabeled data while attempting to achieve the highest performance. This is done by iteratively selecting a subset of unlabeled data points to be annotated by an "oracle" such that the selected points offer the highest learning benefit according to some metric, e.g., representativeness, diversity, or uncertainty (Cohn et al., 1996).
In this work, we introduce a novel method that extends active learning with meta-learning to minimize the number of new data points that need to be annotated to achieve good performance. We do this by learning, via meta-learning, a favorable model initialization for the target task during active learning. We give a general approach on what to transfer to the active learning model and what to leave out. We demonstrate the effectiveness of our methodology by showing that it achieves higher performance than the baseline initialization on different natural language understanding tasks and datasets with the same number of annotation queries to the oracle, or eventually performs equally while requiring fewer queries to the oracle. These results also show that our approach offers an advantage during active learning when larger annotation queries are used. Importantly, we show that the performance improvement is significantly enhanced when tasks based on similar principles are available, especially in a cold-start setting.

Related Work
Active learning research focuses on developing novel acquisition functions for selecting data points to be annotated by the oracle. The primary metrics on which the acquisition functions act are uncertainty (Gal et al., 2017), diversity (Zhdanov, 2019), and representativeness (Sener and Savarese, 2017). Furthermore, acquisition functions that incorporate a multitude of metrics achieve better performance (Yuan et al., 2020; Margatina et al., 2021). Nevertheless, there is no acquisition function that consistently performs best across different datasets or query sizes (Dor et al., 2020; Citovsky et al., 2021), which is why we implement several acquisition functions for our methodology.
Active learning and meta-learning have been used in the same context before, but the role of meta-learning has been that of an acquisition function (Contardo et al., 2017; Fang et al., 2017). Several of these works show that meta-learned acquisition functions are effective in a cold-start environment (Konyushkova et al., 2017; Shao et al., 2019), mainly in computer vision. In this work, meta-learning is used to initialize the active learner, and we show that it is effective in a cold-start environment and in a low-budget setting.
The most similar work is by Barrett and White (2021), in the context of chemical peptide design. That work focuses on twelve different, but closely related, tasks and uses meta-learning to optimize the initial parameters for active learning of those twelve tasks. However, the methodology is not fully explained. In the current work, we give a clear methodology for combining meta-learning and active learning that is broadly applicable in the machine learning domain. We show this within the NLP context using a wide range of tasks.
A somewhat similar approach uses domain adaptation to improve active learning (Rai et al., 2010; Su et al., 2020). The difference with our proposed method is that meta-learning learns an initialization for fast adaptation using multiple tasks and domains, whereas domain adaptation can only learn features common to two domains of the same task and is therefore not directly comparable.

Meta-Learning
To obtain a favorable initialization, we train a single model, denoted f, over a set of tasks T using few-shot meta-learning. Each task T consists of k-shot n-way mini-datasets, meaning each mini-dataset D consists of n classes, with each class containing k samples.
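To make the episode construction concrete, the following is a minimal sketch of sampling one k-shot n-way mini-dataset from a labeled corpus. The function name and the `(text, label)` tuple format are illustrative assumptions, not part of the paper's code.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=2, k_shot=5, rng=None):
    """Sample a k-shot n-way mini-dataset D: n classes, k examples each.

    dataset: iterable of (text, label) pairs.
    Returns a list of (text, episode_label) with labels remapped to 0..n-1.
    """
    rng = rng or random.Random(0)
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append(text)
    classes = rng.sample(sorted(by_label), n_way)  # pick n classes
    episode = []
    for new_label, cls in enumerate(classes):
        for text in rng.sample(by_label[cls], k_shot):
            episode.append((text, new_label))       # relabel within episode
    return episode
```

Relabeling to 0..n-1 inside each episode is what makes the task-dependent softmax generation described below necessary: the class identities differ from episode to episode.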
The specific meta-learning method we employ is LEOPARD, introduced by Bansal et al. (2019). LEOPARD is a BERT-based application of the Model-Agnostic Meta-Learning (MAML) algorithm (Finn et al., 2017), a model-independent second-order optimization-based meta-learning method. The algorithm considers the model f_θ with parameters θ. These parameters θ are updated in an inner and an outer loop. In the inner loop, θ is updated into θ'_i by performing gradient descent with D^tr_i ∼ T_i:

θ'_i = θ − α ∇_θ L(f_θ, D^tr_i)    (1)

This update is performed on i = 1, 2, ..., t tasks; this collection of tasks is denoted as a meta-batch B. Each task T_i in B is selected using a predefined distribution T_i ∼ P(T). In the outer loop, the parameters θ are updated via the meta-objective, where the goal is to minimize the sum of the errors on D^val_i ∼ T_i for each task in the inner loop:

θ ← θ − β ∇_θ Σ_{i=1}^{t} L(f_{θ'_i}, D^val_i)    (2)

Using the inner and outer loop, MAML is able to learn a favorable initialization for few-shot adaptation. However, a non-trivial problem is how to apply different tasks within MAML, since not every task has the same number of classes. Bansal et al. (2019) overcome this issue by generating task-dependent softmax parameters. This is realised by partitioning a mini-dataset D_i from T_i with N_i classes, where each partition C^n_i contains the samples x_j of the corresponding class n ∈ [N_i]. Each class partition is fed into the text encoder (BERT) h_θ, and the encoded samples then undergo a non-linear projection g_ϕ, resulting in a representation for class n:

w^n_i, b^n_i = (1 / |C^n_i|) Σ_{x_j ∈ C^n_i} g_ϕ(h_θ(x_j))    (3)

Here, g_ϕ is a simple multi-layer perceptron (MLP) whose parameters ϕ are meta-learned according to the MAML algorithm. The softmax parameters are then constructed by concatenating the class representations from eq. (3):

W_i = [w^1_i; ...; w^{N_i}_i],  b_i = [b^1_i; ...; b^{N_i}_i]    (4)

These parameters are further adjusted during the inner adaptation loop of MAML. Another extension LEOPARD implements is meta-learning the learning rates (Li et al., 2017) in the inner loop, i.e., α in eq. (1), on a per-layer basis. For further details, we refer the reader to Bansal et al. (2019).
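The inner/outer-loop structure of eqs. (1) and (2) can be sketched on a toy linear-regression model. Note that this sketch uses the first-order approximation of MAML (the outer gradient is evaluated at the adapted parameters) rather than the second-order version used by LEOPARD; the function names and toy model are illustrative only.

```python
import numpy as np

def loss_and_grad(theta, X, y):
    # Squared-error loss and gradient of a linear model f_theta(x) = x @ theta.
    err = X @ theta - y
    return 0.5 * np.mean(err ** 2), X.T @ err / len(y)

def maml_step(theta, tasks, alpha=0.1, beta=0.01):
    """One outer-loop update over a meta-batch of tasks.

    Each task is a tuple (X_tr, y_tr, X_val, y_val).
    """
    meta_grad = np.zeros_like(theta)
    for X_tr, y_tr, X_val, y_val in tasks:
        _, g = loss_and_grad(theta, X_tr, y_tr)
        theta_i = theta - alpha * g                  # inner loop, eq. (1)
        _, g_val = loss_and_grad(theta_i, X_val, y_val)
        meta_grad += g_val                           # sum over tasks, eq. (2)
    return theta - beta * meta_grad                  # outer update
```

Repeating `maml_step` drives the validation loss of the adapted parameters down across all tasks in the meta-batch, which is exactly the meta-objective of eq. (2).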

Active Learning
The scenario we consider is pool-based sampling AL, where a large data pool is readily available for a given task T. It consists of an initial small set of labeled data L of seed size s and a large pool of unlabeled data U. In each AL iteration, a model f is trained on L; then, using some acquisition function a, a batch Q of size q is selected from U. The acquired samples are annotated by an oracle and added to L. In the next AL iteration, model f is retrained from its initial parameter initialization with the new L. Retraining is done from scratch to avoid overfitting on data samples from previous AL iterations (Hu et al., 2019).
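The loop just described can be sketched as follows. The `train`, `acquire`, and `oracle` callables are placeholders for a task-specific trainer, an acquisition function, and a human annotator; their signatures are assumptions for illustration, not the paper's implementation.

```python
def active_learning_loop(train, acquire, oracle, labeled, unlabeled,
                         query_size=50, iterations=10):
    """Pool-based AL: each iteration retrains the model from its initial
    parameters on L, selects a batch Q of size q from U with the
    acquisition function, and has the oracle annotate it."""
    for _ in range(iterations):
        model = train(labeled)                         # retrain from scratch on L
        batch = acquire(model, unlabeled, query_size)  # select Q from U
        labeled = labeled + [(x, oracle(x)) for x in batch]
        chosen = set(batch)
        unlabeled = [x for x in unlabeled if x not in chosen]
    return train(labeled), labeled
```

The key detail from the text is that `train` always starts from the same initial parameters; nothing is warm-started between iterations.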
We considered the following acquisition functions for our experiments:
• Random: q samples are selected randomly from the unlabeled pool U.
• Entropy: An uncertainty-based acquisition function that ranks all unlabeled samples by their predictive entropy (Lewis and Gale, 1994) according to the model f_{θ_L} trained on L, and then selects the q highest-ranked samples.
• BADGE: Tries to select uncertain yet diverse samples by computing a gradient embedding g_x for each unlabeled sample x and then running k-MEANS++ with q centers on the embeddings. The gradient embeddings contain information about model confidence and hidden representations, so clustering them accounts for both uncertainty and diversity (Ash et al., 2019).
• ALPS: A model-independent acquisition function that selects samples by computing a surprisal embedding s_x for each unlabeled sample x. s_x is computed by using a pretrained BERT to evaluate the masked language modeling (MLM) loss on 15% randomly chosen, unmasked tokens in x, scoring their true token labels via cross-entropy. k-MEANS clustering is then performed on the s_x with q centers, and the unlabeled sample closest to each center is selected for annotation (Yuan et al., 2020).
• CAL: Selects samples via contrastive examples by calculating the KL-divergence between an unlabeled sample x and its neighbourhood, consisting of the four nearest labeled samples. The samples x with the highest average KL-divergence are chosen for annotation (Margatina et al., 2021).
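As a minimal illustration of the simplest model-dependent criterion above, the following sketch implements entropy-based selection given a matrix of predictive probabilities. The function name is an assumption; only the selection rule comes from the text.

```python
import numpy as np

def entropy_acquire(probs, q):
    """Select the q pool indices with the highest predictive entropy.

    probs: array of shape (num_unlabeled, num_classes) with model
    probabilities for each unlabeled sample.
    """
    eps = 1e-12                                   # avoid log(0)
    H = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(-H)[:q]                     # highest entropy first
```

A near-uniform prediction (high uncertainty) is ranked above a confident one, which is exactly why entropy selection tends to query samples near the decision boundary.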

Figure 1: A block diagram of our approach. We first meta-learn on a set of tasks T and then initialize the active learner f_AL with the meta-learned parameters.

Meta-initializing the Active Learner
The first step in our methodology is to choose a set of tasks T for meta-learning that could benefit the target task T during active learning. We suggest picking several tasks for T, preferably with substantial amounts of data, based on how similar their objectives are and how close their vocabulary domains are to the target task. There are more rigorous ways to choose between different tasks for transfer learning (Poth et al., 2021).
Next is meta-learning the model f^ML_θ with the LEOPARD method as described in Section 3.1. The meta-learning model consists of three different layers:

f^ML_θ(x) = softmax{ W_i · h_{θ_P}(h_{θ_B}(x)) + b_i }    (5)

where h_{θ_B} is a pretrained BERT-base encoder (Devlin et al., 2018) and h_{θ_P} is a pre-classification layer consisting of two linear layers with a tanh activation function between them. W_i and b_i are generated by g_ϕ as in eqs. (3) and (4), where g_ϕ has the same structure as h_{θ_P}. The model is trained over M meta-iterations, i.e., M outer loops characterized by eq. (2) are performed. For each meta-iteration, t tasks are sampled for the inner loop, such that the probability of sampling a task is proportional to the square root of its size. For each task T_i, one D^gen_i is sampled to generate the softmax weights with g_ϕ, m mini-datasets D^tr_i for iteratively updating θ'_i, and one D^val_i for the meta-objective. When meta-learning is finished, we choose the model f^ML_θ with the highest average accuracy across all tasks in T. The weights of this model are denoted by θ*.
The active learning model f^AL_φ has a similar architecture to f^ML_θ:

f^AL_φ(x) = h_{φ_C}(h_{φ_P}(h_{φ_B}(x)))    (6)

The difference is that f^AL_φ does not generate its softmax parameters; instead it has a basic classification layer h_{φ_C}, which is a simple linear layer.
Here we only transfer the weights θ*_P and θ*_B from f^ML_θ* to the corresponding layers in f^AL_φ:

φ_B ← θ*_B,  φ_P ← θ*_P    (7)

Figure 1 visualizes the initialization step. The decision not to include the generated softmax layer is due to its dependence on the availability of data samples of every class in a task. Otherwise, the weights created with eq. (4) will have incorrect dimensions, since they lack representations for one or more classes. This scenario is probable when the initial size of L is small and the task has a higher number of classes. Additionally, the generated softmax weights have a bias toward the samples used to generate them, which can be detrimental to performance. Lastly, during active learning there is only one target task T, so the main advantage of learning across multiple tasks is unnecessary, unless it is a multi-task active learning scenario (Ikhwantri et al., 2018).
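The transfer step can be sketched as a simple dictionary operation: copy the encoder and pre-classifier weights, and attach a freshly initialized linear classifier in place of the generated softmax layer. The dictionary keys, dimensions, and initialization scale here are illustrative assumptions.

```python
import numpy as np

def init_active_learner(meta_weights, hidden_dim, num_classes, seed=0):
    """Build f_AL from the meta-learned weights theta*: copy the encoder
    (theta*_B) and pre-classifier (theta*_P), but use a new, randomly
    initialized linear classifier h_phi_C instead of the generated softmax."""
    rng = np.random.default_rng(seed)
    return {
        "encoder": meta_weights["encoder"],                # phi_B <- theta*_B
        "pre_classifier": meta_weights["pre_classifier"],  # phi_P <- theta*_P
        # h_phi_C: fresh linear layer, nothing transferred
        "classifier_W": rng.normal(0, 0.02, (hidden_dim, num_classes)),
        "classifier_b": np.zeros(num_classes),
    }
```

Because the classifier is re-created with the target task's true number of classes, the dimension mismatch that the generated softmax would suffer under a small seed set L simply cannot occur.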
The meta-learned learning rates are also not included, because they are trained in a strictly controlled environment with balanced batches, where each class is represented evenly, and they are optimized for fast adaptation using only a few examples. There is no guarantee that these learning rates are optimal for larger amounts of data or for more random batches. In fact, these learning rates can become negative during meta-training (Starshak, 2022). This is beneficial for meta-learning because, according to Starshak (2022), it pushes parameters with negative learning rates to learn universal features. However, during adaptation to the target task, positive learning rates are required to learn (Bernacchia, 2021), and there is no clear strategy for how to change the negative learning rates for adaptation. For these reasons, we opt for a standard learning rate α_AL for all parameters.
In the ablation study, we compare the performance with and without the generated softmax weights and the meta-learned learning rates.

Training Tasks
For our experiments, we train two different meta-initializations with different sets of tasks. The first initialization is trained with a set of tasks T consisting of the GLUE benchmark tasks MNLI (m/mm), SST-2, QNLI, QQP, MRPC, and RTE (Wang et al., 2018) and the SNLI dataset (Bowman et al., 2015). These diverse tasks are considered valuable for general language understanding and have been useful for transfer and few-shot learning across multiple domains (Poth et al., 2021; Bansal et al., 2019)1.
The tasks in T provide general linguistic knowledge; however, for topic classification it is essential to extract the keywords or phrases relevant to a specific topic. By adding topic classification tasks, the model can learn a pooling strategy that creates a vector representing these keywords or phrases for classification. Our second set of tasks, T_topic, therefore includes topic classification tasks by swapping QNLI and MRPC for Yahoo! Answers and DBPedia (Zhang et al., 2015). Yahoo! Answers is a question-and-answer dataset with topic classes, and we use the question title for classification. DBPedia is a dataset created from Wikipedia articles on 14 different topics.

Evaluation and baselines
To evaluate the meta-initialization with T, we perform active learning on SciTaiL (Khot et al., 2018), Emotion (Saravia et al., 2018), AG News, Yelp Review (Zhang et al., 2015), Amazon Kitchen Reviews2, and TREC (Li and Roth, 2002). We modify the Amazon Kitchen dataset into a sentiment analysis task by classifying the 1- and 2-star reviews as negative and the 4- and 5-star reviews as positive, and filtering out the 3-star reviews since their sentiment is ambiguous. We keep the Yelp Review dataset unchanged as a rating task. AG News and TREC are topic classification tasks, using TREC's fine-grained labels. SciTaiL is an NLI task in the scientific domain, and Emotion is a task that classifies sentences by their emotion. All training and evaluation datasets were downloaded from the HuggingFace online datasets repository3. Dataset statistics are given in Tables 2 and 3 in Appendix A.
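The star-to-sentiment relabeling described above is simple enough to state as code. The function name and the `(text, stars)` tuple format are assumptions for illustration.

```python
def to_sentiment(reviews):
    """Map star ratings to binary sentiment, as done for Amazon Kitchen:
    1-2 stars -> negative (0), 4-5 stars -> positive (1);
    ambiguous 3-star reviews are dropped."""
    out = []
    for text, stars in reviews:
        if stars in (1, 2):
            out.append((text, 0))
        elif stars in (4, 5):
            out.append((text, 1))
        # stars == 3: filtered out
    return out
```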
We use the pretrained BERT-base initialization4 θ_BERT from Devlin et al. (2018) as our baseline for all evaluation tasks, and we also compare the performance of the meta-initialization T_topic on the topic classification tasks. For each target task T and initialization, we perform active learning with the five acquisition functions mentioned in Section 3.2: random, entropy, BADGE, ALPS, and CAL.

Implementation Details
For training the two meta-based initializations θ*_T and θ*_{T_topic}, we use M = 100K meta-iterations (i.e., outer loops), t = 4 tasks per meta-iteration, and m = 7 mini-datasets D^tr_i sampled for each T_i in the inner loops. For all tasks, we always classify between every pair of labels, even for tasks with more than 2 labels (Bansal et al., 2019). Therefore, each mini-dataset D consists of n = 2 classes, with each class having k = 5 examples. The learning rate for the outer loop is β = 1e−5, and the per-layer learning rates in the inner loop are initialized with α = 1e−5. These are the best hyper-parameters given in Bansal et al. (2019). The weights are updated using the Adam optimizer (Kingma and Ba, 2015) in the outer loop and with SGD (Ruder, 2016) in the inner loop. The meta-trained models with the highest average accuracy across all training tasks are chosen for meta-initializing the active learner.
During active learning, we consider a low-budget scenario, where a maximum of 1000 additional annotated data samples are acquired. We examine several set-ups with seed data s = 20, 50, 100 and acquisition sizes q = 50, 100, constraining the number of AL iterations to 20 and 10, respectively. Any results not shown in Section 5 are shown in Appendix C. In each AL iteration, we train each model for 25 epochs on the given target task T and choose the model with the highest accuracy on the validation set to evaluate on the test set.

Performance on NLP tasks
In Table 1 we present the results on the SciTaiL, Amazon, Yelp, and Emotion datasets with q = 50, using the acquisition function a* that performed best on the base initialization. This showcases how the meta-initialized model performs in comparison with the baseline over different seed sizes and additionally acquired samples. In most scenarios, the meta-initialized model outperforms the baseline, and in the remaining ones it performs very similarly to the baseline. For the full results, see Appendix C. The θ*_T initialization consistently outperforms the baseline on the SciTaiL and Amazon datasets, as shown in Table 1. Since T contains four NLI tasks (MNLI, QNLI, RTE, and SNLI), θ*_T must have learned the semantic relationships required for the NLI task. As a result, the difference between the two initializations decreases slowly but remains significant, as can be observed in Figure 2a. Interestingly, the same is true for the Amazon dataset, while T contains only one sentiment classification task (SST-2). It seems that SST-2 in combination with general NLU tasks provides sufficient information to learn how to discriminate between negative and positive sentiment.
The effect is less pronounced for Yelp and Emotion. After about 400 to 600 annotated samples, the gap between the two initializations becomes smaller and eventually negligible, as shown in Figure 2b for Emotion. T does not contain a task in the same domain as Yelp or Emotion. However, SST-2 is somewhat related, because Yelp is a fine-grained sentiment classification task and emotions are often expressed in terms of positive and negative sentiment. Therefore, T only provides the model with the intermediate features needed for the downstream tasks of Yelp and Emotion.
The key observation is that the meta-initialization θ*_T always performs better than the baseline on average for 200 or fewer acquired samples, and almost always for 600 or fewer, especially when less seed data is available. This shows that the meta-initialization provides a competitive advantage over the baseline in low-budget and cold-start settings through its ability to learn in a few shots.
However, the difference in performance shrinks as more samples are acquired. This trend is expected, because the ability to learn rapidly is most useful when less information is available. When enough information is obtained, the same performance can be achieved by learning from a large amount of data. Crucially, this means that learning fast with θ*_T is not detrimental when a large amount of data is available. This is generally what Table 1 shows: when the base initialization performs better at 600 or more acquired samples, it is often by a small margin. Additionally, learning slows down for both initializations when the proportion q/|L| becomes smaller. Each newly acquired batch of annotated samples provides less and less new information as L grows larger, which might mask some differences in performance between the two initializations at larger L.
We also notice that the best acquisition function for the baseline θ_BERT mostly coincides with the best acquisition function for θ*_T. This is an indication that selecting the best acquisition function is mostly dataset dependent.

Topic Classification
The set of tasks T does not contain any topic classification tasks or related tasks; consequently, Figure 3 shows that θ*_T is actually detrimental for AG News and provides no significant advantage on TREC. Thus, for the proposed method, it is essential to have similar or related tasks for meta-training the initialization. This necessity is clearly portrayed in Figure 3: by adding two topic classification tasks during training, active learning with θ*_{T_topic} improves performance by a significant amount for both tasks. On TREC, it performs increasingly better than θ*_T and θ_BERT as the number of annotated samples grows. For AG News, the difference in performance is immediately noticeable at low amounts of annotated samples, because Yahoo! Answers and DBPedia contain topics similar to AG News. However, performance reaches a plateau at around 500 additional samples, most likely because new samples no longer provide a critical amount of information, as mentioned in Section 5.1.

Query size
In the previous two sections we have only considered results with a query size of q = 50. Comparing with the results for q = 100 exposes another advantage. If we compare Figure 2 and Figure 4, we see that θ*_T with q = 100 performs consistently better than θ_BERT, as opposed to q = 50, where their performance becomes indistinguishable around 300-400 additional samples. This advantage is important when the number of AL iterations is constrained (Bullard et al., 2019). The meta-initialization θ*_T works better with a larger q than θ_BERT because the initial performance gain implies a better representation of the samples in the encoded space. This directly benefits acquisition functions that depend on the encoded space, such as CAL. BADGE and entropy benefit indirectly, as their scoring is affected by how well the samples are represented by the model in use. A similar effect occurs when using a more advanced text-encoder architecture (Lu and MacNamee, 2020). In short, the model-dependent acquisition functions score unlabeled samples more accurately, and by accumulating better-chosen annotated samples, the meta-initialized active learner outperforms the baseline at larger query sizes q.

Generated Softmax and Learning Rates
To see the effects of using the generated softmax weights and the meta-learned learning rates α during active learning, we perform the active learning experiment with these elements included and initialized with θ*_T, where s = 100, q = 50, and the BADGE acquisition function is used. As mentioned in Section 3.3, α can become negative during meta-learning, and Figure 5 shows a high density of negative learning rates after meta-learning on T. For these experiments, we set any negative learning rate to α = 2e−5.
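The replacement of negative learning rates is a one-line operation; a minimal sketch follows. The function name is an assumption, and the replacement value 2e−5 is the one stated above.

```python
import numpy as np

def clamp_negative_lrs(per_layer_lrs, replacement=2e-5):
    """Replace meta-learned per-layer learning rates that went negative
    during meta-training with a small positive value."""
    lrs = np.asarray(per_layer_lrs, dtype=float)
    return np.where(lrs < 0, replacement, lrs)
```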
The results are shown in Figure 6. Here we observe that, after an initial gain in performance, the model seems to stop learning. The average performance fluctuates around an accuracy of 83%. We suspect this is caused by learning with learning rates above 1e−4 (see Figure 5): when there is a sufficient amount of annotated samples, it is probable to have consecutive batches that all consist of samples of the same class. This rapidly pushes the active learner toward biased learning, from which it cannot recover even when trained with a large number of samples. A solution could be to learn with more balanced batches at the start of training, which we leave as future work.

Additional Comparisons
To further demonstrate the effectiveness of meta-learning the initial parameters for active learning, we provide two additional baselines. In the experiments of Section 5, the baseline initialization was used off-the-shelf and had not seen any data from the tasks T used in meta-learning, meaning that the meta-learned initialization had seen 10 million more data points. To make a more even comparison, we fine-tune the baseline by performing Masked Language Model (MLM) training on the data points in the tasks T to provide equal data support. Then, we perform active learning on SciTaiL with the entropy acquisition function, s = 20, and q = 50. Figure 7 shows that the MLM-trained initialization performs moderately better than the default baseline, but significantly worse than the meta-learned initialization. This shows that the performance gain from meta-learning an initialization is not simply due to larger data support, but due to its ability to learn few-shot.
The second additional comparison is fine-tuning the baseline initialization by pre-training on the DBPedia dataset and then performing active learning on AG News, to show that our methodology is more beneficial than simple transfer learning. This experiment is performed with s = 20 and q = 50 using the entropy acquisition function.
We see that by simply pre-training on DBPedia the baseline is able to perform close to the topic meta-initialization. However, if we put more emphasis (80%) on the topic tasks in T_topic, we can outperform both initializations significantly, as shown by topic-meta80 in Figure 8. This shows that the selection of tasks during meta-learning is an important factor that can impact the learning capabilities for a given target task.

Conclusion
In this work, we have shown how to initialize the active learner with parameters meta-trained on related tasks. This method provides a significant boost in performance in a low-budget setting. The effect is strongest during the early stages of active learning, due to the fast adaptation capabilities gained from few-shot learning. Furthermore, the ability to learn few-shot provides better representations in the encoded space, making the scoring of the implemented acquisition functions more accurate, which consequently results in better active learning performance with larger acquisition batches.

Limitations
An obvious limitation is that the proposed method likely does not provide a significant performance advantage in a high-budget situation, where large amounts of data points are annotated. One way it might still have a noticeable effect is to start with a smaller acquisition/query size and increase it during the active learning process, under the assumption that a larger acquisition size is used with a high budget. This could leverage the better encoding space that few-shot meta-learning provides to improve the selection of samples to be annotated in the early stages, possibly resulting in a cumulative performance advantage.
Another limitation is that closely related tasks might be nonexistent or rare. This would be detrimental, because if there are no related tasks in the meta-learning process, our method performs worse than the BERT baseline, as shown with AG News.

B.2 Computation Time
Training with LEOPARD is computationally expensive. For every task T_i, the gradient needs to be calculated 8 times for all parameters, and therefore 32 times per meta-batch. The average time to train on one meta-batch B, or meta-iteration, with 4 tasks using LEOPARD is 3.8 seconds. Therefore, training with 100K meta-iterations takes roughly 105 hours. By parallelizing across the 4 tasks, this time should be reducible to about 30 hours.
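The figures above can be checked with simple arithmetic; the division by 4 is an idealized speedup from running the 4 tasks of a meta-batch in parallel, so real-world overhead explains the gap to the reported ~30 hours.

```python
# Back-of-the-envelope check of the training-time figures above.
seconds_per_meta_batch = 3.8
meta_iterations = 100_000

serial_hours = seconds_per_meta_batch * meta_iterations / 3600
parallel_hours = serial_hours / 4  # idealized 4-way task parallelism

print(f"serial: {serial_hours:.1f} h, parallel: {parallel_hours:.1f} h")
```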
During AL, at most |L_max| = 1100 labeled samples are available if a seed size s = 100 is used. A full epoch of training with L_max annotated samples through f^AL takes roughly 15.5 seconds. The largest dataset we perform our experiments on is SciTaiL, because other datasets above 100K are downsized to 20K as mentioned in Section 4.3. We measure the time it takes to acquire q = 50 samples for 20 iterations, up to 1000 additional annotated samples, for each acquisition function on SciTaiL. The time to update L and U and to process the samples is included. Random takes 694 seconds, entropy 1672 seconds, BADGE 1794 seconds, ALPS 1812 seconds, and CAL 1762 seconds.
The training of the meta-learned models is run on a single Nvidia Tesla V100 GPU and each active learning experiment is run on four Nvidia Tesla V100 GPUs.

C.1 General NLP tasks
The results for all acquisition functions from Table 1 for each s = 20, 50, 100 with q = 50 can be found in Tables 4, 5, and 6, respectively. Similarly, we also show the graphs for all tasks discussed in Section 5.1 with s = 20 and q = 50 in Figure 9. We observe the same trends as described in Section 5.1.

C.2 Topic Classification
In Figures 10 and 11, we see that even when starting with larger seed data, the meta-initialization θ*_{T_topic} still outperforms the baseline, as in Section 5.2. For AG News, we observe that the performance gap closes faster as the seed size becomes larger. However, the learning plateaus at around the same number of annotated samples, indicating that the model is reaching its performance limit or needs to learn with a larger L, e.g., L = 2000, 4000, to reach a significantly higher performance.
For TREC, we observe in Figure 11 that the difference in performance stays constant for all seed sizes, even with a large amount of annotated samples. This shows that the meta-initialization θ*_{T_topic} has gained knowledge on how to extract relevant keywords and phrases, giving it a consistent performance advantage.

C.3 Query Size
In Figure 12, we see that with a larger query size the meta-initialization θ*_T outperforms the baseline more consistently for SciTaiL and Emotion compared to Figure 9. However, we do not see the same trend for Amazon and Yelp. For Yelp, the reason might be that the fine-grained sentiment classification task is too complex for the model to represent in its encoded space. Random and ALPS often reach the highest performance, as shown in Tables 4, 5, and 6, showing that model-independent acquisition functions are preferable here. This is an indication that the representations in the encoded space are unreliable; therefore, the advantage of the meta-initialization at the larger query size q = 100 does not materialize. For Amazon, we do not observe a significant difference between the two scenarios; it might be that the task is too simple, so the differences in the encoded space between the meta-initialized active learner and the baseline are not noticeable when picking 50 or 100 samples.

Table 4: AL performance between the base and meta-learned initialization θ*_T across different tasks, with seed size s = 20 and acquisition size q = 50. The table shows the accuracy at 0, 200, 400, 600, 800, and 1000 additionally acquired samples, where R = random, E = entropy, B = BADGE, A = ALPS, and C = CAL.

Figure 2: Results for AL on SciTaiL and Emotion with s = 20 and q = 50 showing the mean accuracy (%). The mean is the average accuracy across all acquisition functions. The shade is twice the std-dev, calculated separately above and below the mean. The meta-initialization θ*_T consistently outperforms the baseline on SciTaiL by a large margin, whereas for Emotion, θ*_T only outperforms the baseline for the first 300 additional samples.

Figure 3: Results for AL with s = 20 and q = 50 showing the mean accuracy (%) of all acquisition functions. The baseline outperforms the meta-initialized model; however, the initialization with θ*_{T_topic} outperforms the baseline, showing the importance of task selection.

Figure 4: Results for AL on Emotion with s = 20 and q = 100 showing the mean accuracy (%) of all acquisition functions. The meta-learned initialization outperforms the baseline more consistently on Emotion.

Figure 5: Meta-learned learning rates α from LEOPARD trained on T, showing negative learning rates.

Figure 6: Results for AL on SciTaiL with s = 100 and query size q = 50 with BADGE, showing the mean accuracy (%) across four runs. Gensoft-meta is initialized with θ*_T and includes the generated softmax weights and meta-learned learning rates. The gensoft-meta performance does not improve with more annotated samples.

Figure 7: Results for AL on SciTaiL with s = 20 and query size q = 50 with entropy, showing the mean accuracy (%) across four runs. The MLM-trained BERT initialization performs better than the baseline, but not better than the meta-learned initialization.

Figure 8: Results for AL on AG News with s = 50 and query size q = 50 with entropy, showing the mean accuracy (%) across four runs.

Figure 9: AL results with s = 20 and q = 50 showing the mean accuracy (%) across all acquisition functions. The meta-initialization generally outperforms the baseline, especially up to 300-400 additional annotated samples. For 400 or more, the baseline and meta-initialization become more or less equal for Yelp and Emotion.
Table 1: AL results with the acquisition function a* that performed best on the base initialization in each specific scenario, where R = random, E = entropy, B = BADGE, A = ALPS, and C = CAL.

Table 2: Statistics for the datasets used during training. NLI stands for Natural Language Inference, SA for Sentiment Analysis, QA for Question Answering, PP for ParaPhrase, and TC for Topic Classification. The test set is excluded because it was not used during meta-learning. If there was no validation set, we used the test set as the validation set.